<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Overview</a></span></li></ul></li><li><span><a href="#Create-some-Data" data-toc-modified-id="Create-some-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create some Data</a></span></li><li><span><a href="#Iterative-PCA-(Missing-X-values)" data-toc-modified-id="Iterative-PCA-(Missing-X-values)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Iterative PCA (Missing X values)</a></span><ul class="toc-item"><li><span><a href="#Fixed-n_components" data-toc-modified-id="Fixed-n_components-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Fixed n_components</a></span></li><li><span><a href="#Unknown-n_components" data-toc-modified-id="Unknown-n_components-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Unknown n_components</a></span></li></ul></li><li><span><a href="#Iterative-PLS-(Missing-X-and-y)" data-toc-modified-id="Iterative-PLS-(Missing-X-and-y)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Iterative PLS (Missing X and y)</a></span><ul class="toc-item"><li><span><a href="#Fixed-n_components" data-toc-modified-id="Fixed-n_components-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Fixed n_components</a></span></li><li><span><a href="#Unknown-n_components" data-toc-modified-id="Unknown-n_components-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Unknown n_components</a></span></li></ul></li></ul></div>

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook

import imblearn
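import sklearn
import sklearn.model_selection
import sklearn.pipeline  # imported explicitly so that sklearn.pipeline.Pipeline resolves below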

from sklearn.model_selection import GridSearchCV

import sys
sys.path.append('../../')
import chemometrics

import numpy as np
import pandas as pd

import watermark
%load_ext watermark

In [None]:
%load_ext autoreload
%autoreload 2

Overview
--------
These are some example ways to impute missing data. scikit-learn also has a [module](https://scikit-learn.org/stable/modules/impute.html#univariate-vs-multivariate-imputation) of simpler imputation methods, which is very useful.
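
For comparison, here is a minimal sketch of one of those simple scikit-learn methods: univariate mean imputation with `SimpleImputer`. The small array `X_demo` is made up for illustration and is not part of the examples that follow.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only: one missing value in each column.
X_demo = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

Each column is treated independently here, which is what distinguishes these univariate methods from the multivariate, model-based approaches used in the rest of this notebook.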

In [None]:
%watermark -t -m -v --iversions

# Create some Data

In [None]:
X = np.array(
    [
        [1, 2, 3, 5, 9],
        [4, np.nan, 5, 6, 2],
        [7, 8, np.nan, 2, 0],
        [6, 1, 0, 3, 10],
        [5, 6, 7, np.nan, 9]
    ])
X

In [None]:
y = np.array([10,3,1,2,3]).reshape(-1,1)
y

# Iterative PCA (Missing X values)

## Fixed n_components

If you know the number of components to use, you can fit the imputer directly.

In [None]:
from chemometrics.preprocessing.missing import PCA_IA

In [None]:
itim = PCA_IA(n_components=3, 
              scale_x=True,
              missing_values=np.nan, 
              tol=1.0e-6, 
              max_iters=5000)

In [None]:
_ = itim.fit(X)

In [None]:
itim.transform(X)
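
To make the roles of `n_components`, `tol`, and `max_iters` more concrete, here is a minimal NumPy sketch of the general iterative PCA imputation idea: fill the gaps with a first guess, fit a low-rank PCA model, overwrite only the missing entries with the model's reconstruction, and repeat until the imputed values stop changing. The function name `iterative_pca_impute` is hypothetical, the sketch does no column scaling (so it has no counterpart to `scale_x`), and it is not necessarily how `PCA_IA` is implemented.

```python
import numpy as np


def iterative_pca_impute(X, n_components, tol=1.0e-6, max_iters=5000):
    """Fill NaNs by alternating a rank-limited PCA fit with re-imputation.

    Hypothetical helper for illustration; not the chemometrics API.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace each missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    for _ in range(max_iters):
        # Centre the current estimate and keep the leading components.
        mean = X_filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_filled - mean, full_matrices=False)
        X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean

        # Update only the missing entries; observed values are never changed.
        X_new = np.where(missing, X_hat, X_filled)
        change = np.sum((X_new[missing] - X_filled[missing]) ** 2)
        X_filled = X_new
        if change < tol:
            break

    return X_filled


# Same values as the X array defined earlier in this notebook.
X_demo = np.array([
    [1, 2, 3, 5, 9],
    [4, np.nan, 5, 6, 2],
    [7, 8, np.nan, 2, 0],
    [6, 1, 0, 3, 10],
    [5, 6, 7, np.nan, 9],
])
X_imputed = iterative_pca_impute(X_demo, n_components=3)
print(X_imputed)
```

A smaller `n_components` gives a smoother, lower-rank reconstruction, while a value close to the number of columns lets the model simply reproduce the initial guess, which is why the choice matters.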

## Unknown n_components

Usually, we need to figure out what a good n_components value is. We can use cross-validation for this.

In [None]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pca_ia", PCA_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pca_ia__n_components': [1,2,3],
    'pca_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(
        n_splits=2, 
        shuffle=True, 
        random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X, y)

In [None]:
gs.best_params_

You can then use the imputer as a step in other pipelines. In those cases you can specify the imputer without any hyperparameters and let the grid search set them through `step__parameter_name` entries in `param_grid`.
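
The sketch below shows the general pattern under some assumptions: scikit-learn's `SimpleImputer` stands in for `PCA_IA` so that the code runs without `chemometrics` installed, and the `Ridge` model, the step names, and the grid values are illustrative choices rather than recommendations. The imputer is created with no hyperparameters, and the grid search sets them alongside those of the downstream model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Same values as the X and y arrays defined earlier in this notebook.
X_demo = np.array([
    [1, 2, 3, 5, 9],
    [4, np.nan, 5, 6, 2],
    [7, 8, np.nan, 2, 0],
    [6, 1, 0, 3, 10],
    [5, 6, 7, np.nan, 9],
])
y_demo = np.array([10.0, 3.0, 1.0, 2.0, 3.0])

# The imputer is given no hyperparameters here; the grid below supplies them.
# SimpleImputer is a stand-in (assumption); swap in PCA_IA() where available.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", Ridge()),
])

# Imputer and model hyperparameters are searched jointly.
param_grid = [{
    "imputer__strategy": ["mean", "median"],
    "model__alpha": [0.1, 1.0, 10.0],
}]

search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=2, shuffle=True, random_state=0),
    refit=True,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With `PCA_IA` in place of the stand-in, the grid keys would follow the same convention, for example `imputer__n_components`.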

# Iterative PLS (Missing X and y)

## Fixed n_components

In [None]:
from chemometrics.preprocessing.missing import PLS_IA

In [None]:
itim = PLS_IA(
    n_components=3, 
    missing_values=np.nan, 
    scale_x=True,
    tol=1.0e-6, 
    max_iters=5000)

In [None]:
_ = itim.fit(X, y)

In [None]:
itim.transform(X)

## Unknown n_components

In [None]:
pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pls_ia", PLS_IA(
        n_components=1, 
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    'pls_ia__n_components': [1,2,3,4],
    'pls_ia__scale_x': [True, False],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(
        n_splits=2, 
        shuffle=True, 
        random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X, y)

In [None]:
gs.best_params_