
In [1]:
import matplotlib.pyplot as plt
%matplotlib notebook

import imblearn
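import sklearn
import sklearn.pipeline  # needed explicitly: Pipeline is referenced below as sklearn.pipeline.Pipeline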

from sklearn.model_selection import GridSearchCV

import sys
sys.path.append('../../')
import chemometrics

import numpy as np
import pandas as pd

import watermark
%load_ext watermark



In [2]:
%load_ext autoreload
%autoreload 2

Overview
--------
These are some example ways to impute missing data. scikit-learn also provides a very useful [library](https://scikit-learn.org/stable/modules/impute.html#univariate-vs-multivariate-imputation) of simpler imputation methods.
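
For comparison, here is a minimal sketch of the simple univariate approach from scikit-learn's impute module. The toy array is made up for illustration; `SimpleImputer` with `strategy="mean"` replaces each missing entry with the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration) with two missing entries
X_toy = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Univariate imputation: each NaN is replaced by the mean of its own column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

A univariate fill like this ignores correlations between columns, which is what the iterative PCA and PLS approaches below try to exploit.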

In [3]:
%watermark -t -m -v --iversions

json      2.0.9
watermark 2.0.2
imblearn  0.5.0
sklearn   0.22.2.post1
pandas    0.25.1
numpy     1.21.4
13:24:16 

CPython 3.7.4
IPython 7.8.0

compiler   : GCC 7.3.0
system     : Linux
release    : 4.15.0-167-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


# Load some Data

In [8]:
df = pd.read_csv('../tests/data/pls_train.csv')
raw_X = np.array(df.values[:,3:], dtype=float) # Extract features
raw_y = np.array(df['Water'].values, dtype=float) # Take the water content as the target

# Randomly delete some entries
n_delete = 10

np.random.seed(0)
a = [np.random.randint(low=0, high=raw_X.shape[0]) 
     for i in range(n_delete)]
b = [np.random.randint(low=0, high=raw_X.shape[1]) 
     for i in range(n_delete)]

missing_X = raw_X.copy()
for i,j in zip(a,b):
    missing_X[i,j] = np.nan 
    
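def compare(raw_X, reconstructed_X):
    """Print imputed vs. original values at the deleted positions (uses the global index lists a and b)."""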
    print('Reconstructed\tOriginal\tDifference\tRelative Err')
    for i,j in zip(a,b):
        print('%.3e\t'%reconstructed_X[i,j]
              +'%.3e\t'%raw_X[i,j]
              +'%.3e\t'%(reconstructed_X[i,j]-raw_X[i,j])
              +'%.3f'%(np.abs((reconstructed_X[i,j]-raw_X[i,j])/raw_X[i,j]))
             )

# Iterative PCA (Missing X values)

## Fixed n_components

If you already know how many components to use, you can fit the imputer directly.

In [9]:
from chemometrics.preprocessing.missing import PCA_IA

In [10]:
itim = PCA_IA(n_components=3, 
              scale_x=True,
              missing_values=np.nan, 
              tol=1.0e-6, 
              max_iters=5000)

In [11]:
reconstructed_X = itim.fit_transform(missing_X)
compare(raw_X, reconstructed_X)

Reconstructed	Original	Difference	Relative Err
5.814e-01	5.629e-01	1.848e-02	0.033
-1.458e+00	-1.457e+00	-9.806e-04	0.001
6.187e-01	6.290e-01	-1.027e-02	0.016
6.521e-01	6.713e-01	-1.927e-02	0.029
1.000e+00	9.980e-01	2.025e-03	0.002
-1.540e+00	-1.542e+00	1.949e-03	0.001
-1.608e+00	-1.609e+00	2.426e-04	0.000
1.104e+00	1.107e+00	-3.625e-03	0.003
-5.570e-01	-5.565e-01	-5.697e-04	0.001
4.703e-01	4.465e-01	2.377e-02	0.053
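
For intuition, the general idea behind iterative PCA imputation can be sketched in plain NumPy. This is an illustrative sketch only, not necessarily how `PCA_IA` is implemented: the function name `iterative_pca_impute`, the column-mean starting guess, and the convergence test (sum of squared changes in the imputed entries) are all assumptions, and the data are synthetic.

```python
import numpy as np

def iterative_pca_impute(X, n_components, tol=1.0e-6, max_iters=5000):
    """Hypothetical helper: fill NaNs by repeatedly projecting onto a low-rank PCA model."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    # Initial guess: column means of the observed entries
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(mask, col_means, X)
    for _ in range(max_iters):
        # Center, then truncate the SVD (equivalent to a PCA reconstruction)
        mean = X_filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_filled - mean, full_matrices=False)
        X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        # Update only the missing entries; observed values are never changed
        delta = np.sum((X_hat[mask] - X_filled[mask]) ** 2)
        X_filled[mask] = X_hat[mask]
        if delta < tol:  # assumed stopping rule
            break
    return X_filled

# Synthetic rank-2 data (made up for illustration) with two deleted entries
rng = np.random.default_rng(0)
X_true = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 6))
X_miss = X_true.copy()
X_miss[3, 1] = np.nan
X_miss[17, 4] = np.nan

X_rec = iterative_pca_impute(X_miss, n_components=2, tol=1.0e-12)
print(np.abs(X_rec - X_true).max())  # largest reconstruction error
```

Because the synthetic matrix is exactly rank 2, a two-component model can reproduce it, so the imputed entries should land close to the deleted ones. Real data are noisy, which is why the choice of `n_components` matters.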


## Unknown n_components

Usually, the best value of n_components is not known in advance. Cross-validation can be used to select it.

In [12]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pca_ia", PCA_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pca_ia__n_components': np.arange(1, 10, 2),
    'pca_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    error_score=0,
    refit=True
)

_ = gs.fit(missing_X, raw_y.reshape(-1,1))

In [13]:
gs.best_params_

{'pca_ia__n_components': 9, 'pca_ia__scale_x': False}
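
To see exactly which candidates the search evaluated, the same grid can be expanded with scikit-learn's `ParameterGrid`. This short sketch only enumerates the combinations; it does not fit anything.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above
param_grid = [{
    'pca_ia__n_components': np.arange(1, 10, 2),
    'pca_ia__scale_x': [True, False],
}]

candidates = list(ParameterGrid(param_grid))
n_values = sorted({int(c['pca_ia__n_components']) for c in candidates})
print(len(candidates))  # 10 (5 values of n_components x 2 values of scale_x)
print(n_values)         # [1, 3, 5, 7, 9]
```

The selected `n_components=9` is the largest value in the grid, so the optimum may lie beyond the searched range. Extending the grid upward would show whether the score keeps improving.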

In [14]:
filler = PCA_IA(
        n_components=9, 
        scale_x=False)
reconstructed_X = filler.fit_transform(missing_X, 
                                       raw_y.reshape(-1,1))

In [16]:
compare(raw_X, reconstructed_X)

Reconstructed	Original	Difference	Relative Err
5.169e-01	5.629e-01	-4.597e-02	0.082
-1.456e+00	-1.457e+00	3.158e-04	0.000
6.290e-01	6.290e-01	-4.315e-06	0.000
6.116e-01	6.713e-01	-5.977e-02	0.089
1.012e+00	9.980e-01	1.440e-02	0.014
-1.542e+00	-1.542e+00	9.013e-05	0.000
-1.609e+00	-1.609e+00	2.404e-05	0.000
1.108e+00	1.107e+00	5.621e-05	0.000
-5.563e-01	-5.565e-01	1.776e-04	0.000
5.140e-01	4.465e-01	6.749e-02	0.151


You can then use the imputer as a step in other pipelines. In that case you can give it fixed hyperparameters (for example, the values selected above) and tune only the downstream steps.
Of course, you can also include the imputer's hyperparameters in the cross-validation grid.
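
A minimal sketch of such a pipeline follows. To keep it runnable without the `chemometrics` package, scikit-learn's `SimpleImputer` stands in for `PCA_IA`, and a `Ridge` regressor is an arbitrary choice of downstream model. The data, the step names, and the `alpha` grid are all made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data (made up) with a few entries deleted at random
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=60)
X_demo[rng.integers(0, 60, size=8), rng.integers(0, 5, size=8)] = np.nan

pipeline = Pipeline(steps=[
    # Stand-in imputer with fixed settings; an imputer such as PCA_IA(...) would go here
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", Ridge()),
])

# Only the downstream model is tuned; the imputer keeps its fixed settings
param_grid = [{"model__alpha": [0.01, 0.1, 1.0]}]

gs = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=3)
gs.fit(X_demo, y_demo)
print(gs.best_params_)
```

To tune the imputer as well, add its parameters to the grid using the same `step__parameter_name` notation, for example `"imputer__strategy": ["mean", "median"]` for this stand-in.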

# Iterative PLS (Missing X and y)

## Fixed n_components

In [14]:
from chemometrics.preprocessing.missing import PLS_IA

In [15]:
itim = PLS_IA(
    n_components=3, 
    missing_values=np.nan, 
    scale_x=True,
    tol=1.0e-6, 
    max_iters=5000)

In [16]:
_ = itim.fit(missing_X, raw_y.reshape(-1,1))

In [17]:
reconstructed_X = itim.fit_transform(missing_X, raw_y.reshape(-1,1))

In [18]:
compare(raw_X, reconstructed_X)

Reconstructed	Original	Difference	Relative Err
5.646e-01	5.629e-01	1.679e-03	0.003
-1.455e+00	-1.457e+00	1.376e-03	0.001
6.299e-01	6.290e-01	9.861e-04	0.002
6.705e-01	6.713e-01	-7.987e-04	0.001
9.934e-01	9.980e-01	-4.562e-03	0.005
-1.541e+00	-1.542e+00	1.084e-03	0.001
-1.607e+00	-1.609e+00	1.668e-03	0.001
1.106e+00	1.107e+00	-1.266e-03	0.001
-5.569e-01	-5.565e-01	-4.453e-04	0.001
4.477e-01	4.465e-01	1.213e-03	0.003


## Unknown n_components

In [25]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pls_ia", PLS_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pls_ia__n_components': np.arange(1, 10, 2),
    'pls_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(
        n_splits=3, 
        shuffle=True, 
        random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(missing_X, raw_y.reshape(-1,1))

In [26]:
gs.best_params_

{'pls_ia__n_components': 9, 'pls_ia__scale_x': False}