# 3. Missing Data and Imputation
Compiled by [Morgan Williams](mailto:morgan.williams@csiro.au) for C3DIS 2018 

The Achilles' heel of compositional data analysis is its incompatibility with 'null' or zero components, because log-ratio transformations are undefined for them. In practice, most compositional data contain such values, whether as true zeros (e.g. count data as found in surveys), below-detection values (e.g. geochemistry) or components which are simply missing (e.g. components that were not measured, but may be present in significant quantity).

The simplest but ultimately least practical solution to this problem is to use subcompositions: sets of components free of missing/zero data. Beyond this approach, missing data can sometimes be imputed using relationships between variables. For compositional data, imputation using nominal values may be marginally valid in some cases (e.g. using small values to represent values below detection, which does not greatly alter the closure operation), but overall this approach typically acts as a confounding factor (e.g. creating bimodal distributions and spurious clusters).
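The two simple approaches above can be sketched with plain NumPy. This is a minimal illustration rather than the `geochem` implementation: the `close` helper, the example compositions, the detection limit and the 0.65 replacement fraction are all assumptions made for the example.

```python
import numpy as np

def close(X, total=100.0):
    """Rescale each row so that its components sum to `total` (closure)."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True) * total

# Hypothetical three-component compositions (wt%); np.nan marks a
# below-detection value in the third component of the second sample.
X = np.array([
    [60.0, 39.5, 0.5],
    [70.0, 30.0, np.nan],
    [55.0, 44.0, 1.0],
])

# Approach 1: subcomposition. Keep only components observed in every
# sample, then re-close. The incomplete component is discarded entirely.
complete = ~np.isnan(X).any(axis=0)
subcomposition = close(X[:, complete])

# Approach 2: nominal replacement. Substitute a fixed fraction of an
# assumed detection limit, then re-close.
detection_limit = 0.1  # assumed, for illustration only
fraction = 0.65        # assumed replacement fraction
filled = close(np.where(np.isnan(X), fraction * detection_limit, X))

print(subcomposition)
print(filled)
```

Re-closing after the replacement rescales the observed components in the affected row by a small amount, which is the sense in which nominal values leave the closure operation nearly unchanged. Every replaced entry receives the same value, however, which is how this approach can introduce artificial modes into a distribution.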

Parametric imputation attempts to preserve, and ideally *restore*, the distribution of multivariate compositional data (detection limits effectively truncate distributions). Missing values are imputed using regression against other variables. However, several difficulties remain with regard to imputation:
* Omitted (not measured) and 'below detection' values carry different expectations: below-detection values have upper bounds (and also upper error bounds), whereas omitted values may be present in significant quantities but are not well constrained
* An iterative algorithm is needed (e.g. expectation-maximisation), as imputed values alter the closure operator and hence adjust other compositional components, if only slightly. Iteration continues until the imputed dataset resembles the original dataset within some specified tolerance.
* Ideally, imputed values should be tagged - such that the user can choose whether to utilise values or not (reasoning based on imputed values may not be particularly robust)
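
The points above can be illustrated with a simplified sketch of iterative regression imputation. It is not a full expectation-maximisation algorithm and not the method implemented in `geochem`: it regresses each incomplete component on the others in log space, repeats until the imputed values stop changing, and returns a boolean mask that tags the imputed entries. It does not use detection limits as upper bounds. The function name `impute_log_regression`, its parameters and the synthetic data are assumptions made for the example.

```python
import numpy as np

def close(X):
    """Rescale each row to sum to 1 (closure)."""
    return X / X.sum(axis=1, keepdims=True)

def impute_log_regression(X, tol=1e-8, max_iter=100):
    """Impute NaNs by iterated least-squares regression on log-values.

    Returns the closed, imputed array and a boolean mask that is True
    wherever a value was imputed. Hypothetical helper, for illustration.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    logX = np.log(X)  # NaNs propagate through the log
    # Initial guess: mean of the observed log-values in each column.
    filled = np.where(mask, np.nanmean(logX, axis=0), logX)
    for _ in range(max_iter):
        previous = filled.copy()
        for j in np.flatnonzero(mask.any(axis=0)):
            rows = mask[:, j]
            # Design matrix: intercept plus all other (log) components.
            A = np.column_stack([np.ones(len(filled)),
                                 np.delete(filled, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = A[rows] @ coef
        # Stop once the imputed values change by less than the tolerance.
        if np.max(np.abs(filled - previous)) < tol:
            break
    # Re-closing rescales observed values but preserves their ratios.
    return close(np.exp(filled)), mask

# Synthetic, correlated four-component compositions (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
logs = np.hstack([base + 0.1 * rng.normal(size=(50, 1)) for _ in range(4)])
X = close(np.exp(logs + np.array([3.0, 2.0, 1.0, -1.0])))

X_missing = X.copy()
X_missing[:5, 3] = np.nan    # pretend these were below detection
X_missing[5:10, 2] = np.nan  # pretend these were not measured

imputed, was_imputed = impute_log_regression(X_missing)
```

Returning `was_imputed` alongside the data lets downstream analysis include or exclude imputed values explicitly, for example with `np.where(was_imputed, np.nan, imputed)`.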


In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sys.path.insert(0, './src')
from geochem import *

Notably, when values are imputed from a low-density dataset, the output is strongly dependent on the quality and representativeness of the values that are present. For this reason, using imputed values for geological inference may be misleading, especially for rarely recorded parameters.