<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Let's-apply-the-feature-engineering-hierarchy-to-imputing-missing-values" data-toc-modified-id="Let's-apply-the-feature-engineering-hierarchy-to-imputing-missing-values-1">Let's apply the feature engineering hierarchy to imputing missing values</a></span></li><li><span><a href="#How-to-impute-missing-values:--Ad-hoc" data-toc-modified-id="How-to-impute-missing-values:--Ad-hoc-2">How to impute missing values: <br> Ad hoc</a></span></li><li><span><a href="#How-to-impute-missing-values:--Hand-crafted-Rules" data-toc-modified-id="How-to-impute-missing-values:--Hand-crafted-Rules-3">How to impute missing values: <br> Hand-crafted Rules</a></span></li><li><span><a href="#How-to-impute-missing-values:--Learned-Rules" data-toc-modified-id="How-to-impute-missing-values:--Learned-Rules-4">How to impute missing values: <br> Learned Rules</a></span></li><li><span><a href="#How-to-impute-missing-values:--Learned-Simple-Model" data-toc-modified-id="How-to-impute-missing-values:--Learned-Simple-Model-5">How to impute missing values: <br> Learned Simple Model</a></span></li><li><span><a href="#scikit-learn's-IterativeImputer" data-toc-modified-id="scikit-learn's-IterativeImputer-6">scikit-learn's IterativeImputer</a></span></li><li><span><a href="#How-to-impute-missing-values:--Learned-Complex-Model" data-toc-modified-id="How-to-impute-missing-values:--Learned-Complex-Model-7">How to impute missing values: <br> Learned Complex Model</a></span></li><li><span><a href="#Marking-imputed-values" data-toc-modified-id="Marking-imputed-values-8">Marking imputed values</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-9">Check for understanding</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-10">Takeaways</a></span></li></ul></div>

<center><h2>Let's apply the feature engineering hierarchy to imputing missing values</h2></center>

- Ad hoc
- Hand-crafted rules
- Feature learning:
    - Rule-based models
    - Simple models
    - Complex models

<center><h2>How to impute missing values: <br> Ad hoc</h2></center>

1. Visually inspect.
1. Try to get the missing data!
1. Given domain knowledge, guess the value.
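
Before inspecting by eye, it can help to count where the gaps are. This is a minimal sketch on a made-up array (`X_demo` is hypothetical, not part of this notebook's data), assuming missing entries are encoded as `np.nan`:

```python
import numpy as np

# Hypothetical toy matrix; np.nan marks a missing entry
X_demo = np.array([
    [5.1, 3.5, np.nan],
    [4.9, np.nan, 1.4],
    [np.nan, 3.2, 1.3],
    [4.6, 3.1, np.nan],
])

missing_mask = np.isnan(X_demo)

# Number of missing entries in each feature column
missing_per_column = missing_mask.sum(axis=0)

# Row indices that contain at least one missing entry
rows_with_missing = np.flatnonzero(missing_mask.any(axis=1))

print(f"Missing per column: {missing_per_column}")
print(f"Rows to inspect:    {rows_with_missing}")
```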

<center><h2>How to impute missing values: <br> Hand-crafted Rules</h2></center>

1. Replace with a reasonable guess based on knowledge of the underlying domain (heuristic).
1. Replace with a random value sampled from the empirical distribution of the observed values.
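
Sampling from the empirical distribution could look roughly like this. It is a sketch, not a prescribed recipe: the `feature` array and the seed are made up for illustration, and missing entries are assumed to be `np.nan`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical feature column with two missing entries
feature = np.array([5.1, np.nan, 4.7, 6.3, np.nan, 5.8])

missing = np.isnan(feature)
observed = feature[~missing]

# Draw one observed value (with replacement) for each missing entry
feature_imp = feature.copy()
feature_imp[missing] = rng.choice(observed, size=missing.sum(), replace=True)

print(feature_imp)
```

Unlike filling with a single constant, this keeps the spread of the imputed column close to that of the observed values.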

<center><h2>How to impute missing values: <br> Learned Rules</h2></center>

Calculate a measure of central tendency from the observed values and use it to fill in the missing ones:

- The median works best for numeric features.
- The mode works best for categorical features.
- Another option for categorical features is to encode an explicit "missing" category.

The only reason to impute the mean instead is that the median is too costly to compute. In this situation, you can easily compute the median.

In [1]:
%reset -fs

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [3]:
import numpy as np

# Let's replace a single actual value with a missing value
original_data_point = X_train[0][0]
X_train[0][0] = np.nan

In [11]:
from sklearn.impute import SimpleImputer

# Create our imputer to replace missing values with the median
imp = SimpleImputer(missing_values=np.nan, strategy='median')

X_train_imp = imp.fit_transform(X_train)

In [5]:
# Let's compare
print(f"Original datapoint: {original_data_point:>5}")
print(f"Imputed datapoint: {X_train_imp[0][0]}")

Original datapoint:   5.5
Imputed datapoint: 5.8


In [6]:
# Apply the fitted imputer to the test dataset (it reuses the median learned from the training data)
X_test_imp = imp.transform(X_test)
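
The two categorical options listed earlier in this section can be sketched with the same `SimpleImputer` class. The `colors` column is a made-up example, and the `"missing"` label is just one possible choice of category name.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column; np.nan marks missing entries
colors = np.array([["red"], ["blue"], [np.nan], ["red"], [np.nan]], dtype=object)

# Option 1: fill with the mode (most frequent category)
mode_imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
colors_mode = mode_imp.fit_transform(colors)

# Option 2: treat "missing" as its own category
const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="missing")
colors_const = const_imp.fit_transform(colors)

print(colors_mode.ravel())
print(colors_const.ravel())
```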

<center><h2>How to impute missing values: <br> Learned Simple Model</h2></center>

Fit a model that estimates a missing value based on other features.

- [Linear Regression](https://en.wikipedia.org/wiki/Imputation_(statistics)#Regression)
- [k-nearest neighbors algorithm (k-NN)](http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)
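
For the k-NN option, scikit-learn provides `KNNImputer`. Here is a minimal sketch on the iris data, using separate variable names (`X_knn`, `knn_imp`) so it does not overwrite anything else in this notebook. The choice of `n_neighbors=5` is an assumption, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer

X_knn, _ = load_iris(return_X_y=True)

# Hide one known value so we can compare afterwards
true_value = X_knn[0, 0]
X_knn[0, 0] = np.nan

# Each missing entry is filled with the mean of that feature
# over the 5 nearest rows (distance ignores missing coordinates)
knn_imp = KNNImputer(n_neighbors=5)
X_knn_imp = knn_imp.fit_transform(X_knn)

print(f"Original datapoint: {true_value}")
print(f"Imputed datapoint:  {X_knn_imp[0, 0]}")
```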

<center><h2>scikit-learn's IterativeImputer</h2></center>

Models each feature with missing values as a function of other features, and uses that estimate for imputation.

It works in an iterated round-robin fashion:

- At each step, a feature column is designated as output y and the other feature columns are treated as inputs X. 
- A regressor is fit on (X, y) for known y. 
- Then, the regressor is used to predict the missing values of y. 

This is done for each feature in an iterative fashion.

Source: https://scikit-learn.org/stable/modules/impute.html

In [12]:
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer()
X_train_imp = imp.fit_transform(X_train)

In [8]:
print(f"Original datapoint: {original_data_point:>5}")
print(f"Imputed datapoint: {X_train_imp[0][0]}")

Original datapoint:   5.5
Imputed datapoint: 5.473978368203126
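
To make a single round-robin step concrete, here is a hand-rolled sketch. It is a simplification, not what `IterativeImputer` does internally: it uses a plain `LinearRegression` as the regressor, and only one column has missing values, so one step is enough. The variable names (`X_rr`, `reg`) are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X_rr, _ = load_iris(return_X_y=True)
X_rr[0, 0] = np.nan  # hide one value in feature column 0

# Designate column 0 as the output y; the other columns are the inputs X
target_col = 0
other_cols = [c for c in range(X_rr.shape[1]) if c != target_col]

y_col = X_rr[:, target_col]
known = ~np.isnan(y_col)

# Fit a regressor on (X, y) for the rows where y is known
reg = LinearRegression().fit(X_rr[known][:, other_cols], y_col[known])

# Use the regressor to predict the missing values of y
X_rr_imp = X_rr.copy()
X_rr_imp[~known, target_col] = reg.predict(X_rr[~known][:, other_cols])

print(f"Imputed datapoint: {X_rr_imp[0, 0]}")
```

With missing values in several columns, the same step would be repeated for each column in turn, and the whole cycle repeated several times.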


<center><h2>How to impute missing values: <br> Learned Complex Model</h2></center>

<center><img src="images/dl.png" width="70%"/></center>

Sources: 

- https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
- https://ssc.io/pdf/p2017-biessmann.pdf

<center><h2>Marking imputed values</h2></center>

The presence of missing data is a feature.

In [13]:
from sklearn.impute import MissingIndicator

X = np.array([[-1, 42, -1, 1, 3]])

indicator = MissingIndicator(missing_values=-1, features="all")

In [14]:
mask_all = indicator.fit_transform(X)
mask_all

array([[ True, False,  True, False, False]])

Source: https://scikit-learn.org/stable/modules/impute.html
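
Imputing and marking can also be done in one pass with the `add_indicator` parameter of `SimpleImputer`. This is a sketch on a made-up array (`X_flag` is hypothetical), with missing entries assumed to be `np.nan`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X_flag = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# add_indicator=True appends one 0/1 column per feature that had missing values
imp_flag = SimpleImputer(strategy="median", add_indicator=True)
X_flag_imp = imp_flag.fit_transform(X_flag)

# Columns 0-1: imputed features; columns 2-3: "was missing" flags
print(X_flag_imp)
```

A downstream model then sees both the filled-in value and the fact that it was originally missing.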

<center><h2>Check for understanding</h2></center>

What should you do if you are missing target values?

Discard that instance. One of the assumptions of Supervised Machine Learning is that each instance has a label.

Reframe the problem as a Reinforcement Learning or Unsupervised problem.
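
Discarding unlabeled instances can be sketched with a boolean mask. The arrays below are made up for illustration, and missing targets are assumed to be encoded as `np.nan`:

```python
import numpy as np

# Hypothetical features and targets; two targets are missing
X_all = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0]])
y_all = np.array([0.0, np.nan, 1.0, np.nan])

labeled = ~np.isnan(y_all)

# Keep only instances that have a label for supervised training
X_labeled, y_labeled = X_all[labeled], y_all[labeled]

# The unlabeled rows are set aside rather than imputed
X_unlabeled = X_all[~labeled]

print(f"Labeled: {len(y_labeled)}, unlabeled: {len(X_unlabeled)}")
```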

<center><h2>Takeaways</h2></center>

- Feature Engineering (FE) creates derived data that improve model fitting, thus improving performance metrics.
- All feature engineering, including imputation, can be done through:
    - Ad hoc
    - Hand-crafted rules
    - Learned rules
    - Simple models
    - Complex models

<br>