<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Overview</a></span></li></ul></li><li><span><a href="#Load-some-Data" data-toc-modified-id="Load-some-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load some Data</a></span></li><li><span><a href="#Iterative-PCA-(Missing-X-values)" data-toc-modified-id="Iterative-PCA-(Missing-X-values)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Iterative PCA (Missing X values)</a></span><ul class="toc-item"><li><span><a href="#Fixed-n_components" data-toc-modified-id="Fixed-n_components-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Fixed n_components</a></span></li><li><span><a href="#Unknown-n_components" data-toc-modified-id="Unknown-n_components-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Unknown n_components</a></span></li></ul></li><li><span><a href="#Iterative-PLS-(Missing-X-values)" data-toc-modified-id="Iterative-PLS-(Missing-X-values)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Iterative PLS (Missing X values)</a></span><ul class="toc-item"><li><span><a href="#Fixed-n_components" data-toc-modified-id="Fixed-n_components-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Fixed n_components</a></span></li><li><span><a href="#Unknown-n_components" data-toc-modified-id="Unknown-n_components-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Unknown n_components</a></span></li></ul></li><li><span><a href="#Below-LOD" data-toc-modified-id="Below-LOD-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Below LOD</a></span><ul class="toc-item"><li><span><a href="#Missing-values-<-LOD-only" data-toc-modified-id="Missing-values-<-LOD-only-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Missing values &lt; LOD only</a></span></li><li><span><a href="#Missing-values-and-<-LOD" data-toc-modified-id="Missing-values-and-<-LOD-4.2"><span 
class="toc-item-num">4.2&nbsp;&nbsp;</span>Missing values and &lt; LOD</a></span></li></ul></li></ul></div>

In [None]:
using_colab = 'google.colab' in str(get_ipython())
if using_colab:
    !git clone https://github.com/mahynski/pychemauth.git --depth 1
    !cd pychemauth; pip3 install .; cd ..

import pychemauth

import matplotlib.pyplot as plt
%matplotlib notebook

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [None]:
import imblearn
import sklearn
import sklearn.pipeline  # Used below as sklearn.pipeline.Pipeline

from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd

Overview
--------
These are some examples of ways to impute missing data. scikit-learn also provides a very useful [module](https://scikit-learn.org/stable/modules/impute.html#univariate-vs-multivariate-imputation) of simple imputation methods.
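
For comparison, here is a minimal sketch of one of the simple scikit-learn methods, mean imputation with `SimpleImputer`. The toy matrix `X_toy` is made up for illustration and is not part of the data used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with two missing entries (illustrative values only).
X_toy = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column.
simple = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = simple.fit_transform(X_toy)
print(X_filled)
```

Univariate methods like this ignore correlations between columns, which is what the iterative PCA and PLS approaches below try to exploit.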

In [None]:
%watermark -t -m -v --iversions

# Load some Data

In [None]:
if using_colab:
    loc = 'https://raw.githubusercontent.com/mahynski/pychemauth/main/tests/data/pls_train.csv'
else:
    loc = '../tests/data/pls_train.csv'
df = pd.read_csv(loc)

raw_X = np.array(df.values[:,3:], dtype=float) # Extract features
raw_y = np.array(df['Water'].values, dtype=float) # Take the water content as the target

# Randomly delete some entries
n_delete = 10

np.random.seed(0)
a = [np.random.randint(low=0, high=raw_X.shape[0]) 
     for i in range(n_delete)]
b = [np.random.randint(low=0, high=raw_X.shape[1]) 
     for i in range(n_delete)]

missing_X = raw_X.copy()
for i,j in zip(a,b):
    missing_X[i,j] = np.nan 
    
def compare(raw_X, reconstructed_X):
    """Print imputed vs. original values at the deleted (a, b) positions."""
    print('Reconstructed\tOriginal\tDifference\tRelative Err')
    for i,j in zip(a,b):
        print('%.3e\t'%reconstructed_X[i,j]
              +'%.3e\t'%raw_X[i,j]
              +'%.3e\t'%(reconstructed_X[i,j]-raw_X[i,j])
              +'%.3f'%(np.abs((reconstructed_X[i,j]-raw_X[i,j])/raw_X[i,j]))
             )

# Iterative PCA (Missing X values)

## Fixed n_components

If you already know how many components to use, you can fit the imputer directly.

In [None]:
from pychemauth.preprocessing.missing import PCA_IA

In [None]:
itim = PCA_IA(n_components=3, 
              scale_x=True,
              missing_values=np.nan, 
              tol=1.0e-6, 
              max_iters=5000)

In [None]:
reconstructed_X = itim.fit_transform(missing_X)
compare(raw_X, reconstructed_X)
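
To give a feel for what an iterative PCA imputer does, the sketch below alternates between fitting a low-rank PCA model and overwriting only the missing entries with the model's reconstruction. This is a generic illustration written with NumPy, not the `PCA_IA` implementation: the helper name `iterative_pca_impute`, the mean-based starting guess, and the stopping rule are all assumptions, and column scaling (`scale_x`) is left out for brevity.

```python
import numpy as np

def iterative_pca_impute(X, n_components, tol=1.0e-6, max_iters=5000):
    """Hypothetical helper: fill NaNs by alternating PCA fit and reconstruction."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    if not mask.any():
        return X.copy()

    # Initial guess: column means of the observed entries.
    filled = np.where(mask, np.nanmean(X, axis=0), X)

    for _ in range(max_iters):
        mean = filled.mean(axis=0)
        # Rank-limited reconstruction from the SVD of the centered data.
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        approx = (U[:, :k] * s[:k]) @ Vt[:k] + mean

        # Only missing entries are overwritten; observed data stay fixed.
        change = np.max(np.abs(approx[mask] - filled[mask]))
        filled[mask] = approx[mask]
        if change < tol:
            break
    return filled

# Toy usage: a rank-1 matrix with one entry removed.
X_demo = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
X_missing = X_demo.copy()
X_missing[0, 1] = np.nan
X_filled = iterative_pca_impute(X_missing, n_components=1)
```

The observed entries are never modified, so the quality of the result depends entirely on how well a model with `n_components` components describes the data.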

## Unknown n_components

Usually, we need to figure out what a good n_components value is. We can use cross-validation for this.

In [None]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pca_ia", PCA_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pca_ia__n_components': np.arange(1, 10, 2),
    'pca_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(missing_X, raw_y.reshape(-1,1))

In [None]:
gs.best_params_

In [None]:
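# Rebuild the imputer with the selected hyperparameters.
# The values below are assumed to match gs.best_params_ from the search above.
filler = PCA_IA(
    n_components=9,
    scale_x=False)
reconstructed_X = filler.fit_transform(missing_X,
                                       raw_y.reshape(-1,1))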

In [None]:
compare(raw_X, reconstructed_X)

You can then use this imputer in other pipelines. In those cases you can fix the imputer's hyperparameters so that they are not part of the search.
Below is an example of how you might do that. Of course, you can also include the imputer's hyperparameters in the cross-validation.

```
# Assumes PLSDA has been imported and x_train, y_train are defined elsewhere.
pipeline = imblearn.pipeline.Pipeline(steps=[
    # Insert other preprocessing steps here...
    ("pca_ia", PCA_IA(n_components=9, scale_x=False)),
    ("plsda", PLSDA(n_components=5, 
                    alpha=0.05,
                    scale_x=True, 
                    not_assigned='UNKNOWN',
                    style='soft', 
                   )
    )
])

# NO HYPERPARAMETERS ASSOCIATED WITH THE IMPUTER
param_grid = [{
    'plsda__n_components':np.arange(1, 10, 2),
    'plsda__alpha': [0.07, 0.05, 0.03, 0.01],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    error_score=0,
    refit=True
)

_ = gs.fit(x_train, y_train)
```

# Iterative PLS (Missing X values)

## Fixed n_components

In [None]:
from pychemauth.preprocessing.missing import PLS_IA

In [None]:
itim = PLS_IA(
    n_components=3, 
    missing_values=np.nan, 
    scale_x=True,
    tol=1.0e-6, 
    max_iters=5000)

In [None]:
reconstructed_X = itim.fit_transform(missing_X, raw_y.reshape(-1,1))

In [None]:
compare(raw_X, reconstructed_X)

## Unknown n_components

In [None]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pls_ia", PLS_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pls_ia__n_components': np.arange(1, 10, 2),
    'pls_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(missing_X, raw_y.reshape(-1,1))

In [None]:
gs.best_params_

# Below LOD

In [None]:
from pychemauth.preprocessing.missing import LOD

## Missing values < LOD only

In [None]:
X = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [np.nan, 3.0, 2.0, np.nan],
        [5.0, 1.0, np.nan, 5.0],
        [2.0, 3.0, 4.0, 5.0]
    ]
)

lod = np.array([0.15, 0.15, 0.25, 0.15])

In [None]:
imputer = LOD(lod, missing_values=np.nan, seed=0)
imputer.fit_transform(X)
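
As a rough picture of what a below-LOD imputer can do, the sketch below replaces each flagged entry with a random draw between 0 and that column's LOD. This behavior is an assumption made for illustration; the notebook does not state how `LOD` chooses its values, and the helper `fill_below_lod` and the toy arrays are hypothetical.

```python
import numpy as np

def fill_below_lod(X, lod, missing_values=np.nan, seed=0):
    """Hypothetical helper: replace flagged entries with uniform draws in [0, LOD)."""
    X = np.array(X, dtype=float)  # Work on a copy
    lod = np.asarray(lod, dtype=float)
    rng = np.random.default_rng(seed)

    # Locate the flagged entries (NaN needs a dedicated check).
    if np.isnan(missing_values):
        mask = np.isnan(X)
    else:
        mask = X == missing_values

    # One draw per entry, scaled by the LOD of its column.
    draws = rng.uniform(0.0, 1.0, size=X.shape) * lod
    X[mask] = draws[mask]
    return X

X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])
lod_demo = np.array([0.15, 0.25])
X_demo_filled = fill_below_lod(X_demo, lod_demo)
```

Fixing the seed makes the filled values reproducible, which matters when the imputer is one step in a cross-validated pipeline.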

## Missing values and < LOD

In [None]:
# Now assume -1 indicates < LOD and a corrupted data entry is
# indicated by a NaN
X = np.array(
    [
        [1.0, np.nan, 3.0, 4.0],
        [-1, 3.0, 2.0, -1],
        [5.0, 1.0, -1, 5.0],
        [2.0, 3.0, np.nan, 5.0]
    ]
)

lod = np.array([0.15, 0.15, 0.25, 0.15])

In [None]:
# If you leave "-1" in place, the imputation will treat it as a
# "real" measurement, which is (probably) not what you want.

# Step 1: Replace the values flagged as < LOD (encoded here as -1).
imputer = LOD(lod, missing_values=-1, seed=0)
X_lod = imputer.fit_transform(X)
X_lod

In [None]:
# Step 2: Remove NaNs by doing imputation
itim = PLS_IA(
    n_components=2, 
    missing_values=np.nan, 
    scale_x=True,
    tol=1.0e-6, 
    max_iters=5000)
# A placeholder target (the row index) is passed as y purely for illustration.
X_recon = itim.fit_transform(X_lod, np.arange(X.shape[0]).reshape(-1,1))
X_recon

In [None]:
# Note how some imputed values are now < 0. This may or may not
# be sensible. If you want, you can re-perform the LOD check
# because this will register as < LOD due to the sign.

imputer = LOD(lod, missing_values=-1, seed=0)
X_lod = imputer.fit_transform(X_recon)
X_lod
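
Before chaining steps like this, it can help to see exactly which entries fall below the LOD, negative values included. The sketch below does this with a plain NumPy comparison; the toy arrays are made up for illustration and the check does not depend on `LOD` or `PLS_IA`.

```python
import numpy as np

# Toy "imputed" matrix with one negative and one small positive value.
X_demo = np.array([
    [1.0, 0.1, 3.0],
    [-0.2, 3.0, 2.0],
])
lod_demo = np.array([0.15, 0.15, 0.25])

# Broadcasting compares every entry against the LOD of its column.
below = X_demo < lod_demo
rows, cols = np.nonzero(below)
for i, j in zip(rows, cols):
    print(f"row {i}, column {j}: {X_demo[i, j]} is below LOD {lod_demo[j]}")
```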

In [None]:
# Lesson: Be careful when combining preprocessing!