In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py



$
\newcommand{\tx}{\x^{(m')}}
\newcommand{\txi}{\hat{\x}^{(m')}}
\def\prox#1{\mathcal{prox}(#1)}
\def\proxtx#1{\prox{\tx, #1}}
$


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
MOVIE_DIR="./images"

CREATE_MOVIE = False # True if you have ffmpeg installed

import training_models_helper
%aimport training_models_helper

tmh = training_models_helper.TrainingModelsHelper()

import mnist_helper
%aimport mnist_helper

mnh = mnist_helper.MNIST_Helper()

import class_helper
%aimport class_helper

clh= class_helper.Classification_Helper()

In [4]:
from sklearn.model_selection import cross_val_score

# Missing features

What if our data is not perfect ?

What do we do about examples with missing features ?
- training examples
- test examples

We have thus far been treating this as an annoyance, but
- the problem is important
- it is far from simple

The simplest thing to do would be to drop the example
- can't do it with a test example
- reduces amount of data in training

Alternatively, we could drop the feature entirely
- but different examples may be missing different (disjoint) features, so we may end up dropping many features

Either way, losing training data is not desirable, particularly for small datasets
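
To make the cost of dropping concrete, here is a minimal sketch on a made-up toy `DataFrame` (the column names and values are illustrative assumptions, not data from this notebook). Each example is missing a *different* feature, which is the worst case for both strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every example but the first is missing a different feature
toy = pd.DataFrame({
    "Weight": [70.0, np.nan, 82.0, 55.0],
    "Height": [1.75, 1.62, np.nan, 1.58],
    "Age":    [34.0, 29.0, 41.0, np.nan],
})

rows_kept = toy.dropna(axis=0)   # drop every example that has any missing feature
cols_kept = toy.dropna(axis=1)   # drop every feature that is missing in any example

print("original shape:        ", toy.shape)
print("after dropping rows:   ", rows_kept.shape)
print("after dropping columns:", cols_kept.shape)
```

Dropping rows keeps a single example; dropping columns keeps no features at all.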

- We will motivate the problem and illustrate the issues
- We will examine naive solutions
    - almost always bad !
- We will show an interesting solution using Random Forests
- Preview of clustering methods

The term **imputation** refers to creating a substitute for the missing value of a feature
in one example.

To frame our discussion

Let
- $f$ denote the index of a feature
- $\tx$ be an example (either training or test) with missing feature $\tx_f$
- $\txi$ be the imputed version of $\tx$, with $\txi_f$ the imputed value of feature $f$

As usual let $\X, \y$ be our labeled training examples.

$$
\{ (\x^\ip, \y^\ip) | 1 \le i \le m \} 
$$

# Naive methods for imputation

## Magic numbers
Let's start with a truly awful method: set $\txi_f$ to a  "magic number"
- 0
- -999


Why is the magic number awful ?

Consider a training set representing the population of NYC, with features Weight, Height, Age, etc.

Suppose Weight, which is at index $f$ of example vector $\tx$, is missing.

Setting $\txi_f = 0$ is awful because the imputed value is not likely
- $\pr{\tx_f = 0} \approx 0$

## Mean, median, percentile

How about something more likely, like the mean or median ?

Better
- $\pr{\tx_f = \bar{\x}_f} > 0$

Still not perfect.
- What if $\pr{\x_f}$ were a bi-modal distribution ?
    - lots of examples with extreme values, few in the middle (which is where the mean falls)
    
So mean and median are better than magic numbers in many situations but not all.
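
A quick numerical illustration of the bi-modal case, using a synthetic sample (the two modes at $-5$ and $+5$ are an arbitrary assumption chosen for the sketch). It measures how many examples lie near the mean compared with how many lie near one of the modes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic bi-modal feature: half the examples near -5, half near +5
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

mean = x.mean()
near_mean = np.mean(np.abs(x - mean) < 1.0)   # fraction of examples within 1 of the mean
near_mode = np.mean(np.abs(x - 5.0) < 1.0)    # fraction of examples within 1 of the upper mode

print(f"mean = {mean:.2f}")
print(f"fraction of examples near the mean: {near_mean:.3f}")
print(f"fraction of examples near a mode:   {near_mode:.3f}")
```

The mean sits in the gap between the modes, so almost no observed example looks like the imputed value.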

Even worse:

Suppose example $\tx$ is an infant: is $\bar{\x}_f$ still reasonable ?
- $\pr{\tx_f = \bar{\x}_f | \tx_{\text{Age}} < 1} \approx 0$

So the mean, median, etc.
- provide a reasonable imputation in a univariate sense
- provide a less reasonable imputation in a multivariate sense
    - i.e., conditional on other features like Age
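
The infant example can be sketched with synthetic data. The population below (20 infants, 180 adults, and the weight distributions) is entirely made up for illustration; the point is only to compare the univariate mean with the mean conditional on Age.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population (made-up numbers): 20 infants followed by 180 adults
age = np.concatenate([rng.uniform(0, 1, 20), rng.uniform(20, 70, 180)])
weight = np.where(age < 1, rng.normal(7, 1, 200), rng.normal(75, 12, 200))
pop = pd.DataFrame({"Age": age, "Weight": weight})

# Pretend the Weight of the first example (an infant) is missing
true_weight = pop.loc[0, "Weight"]
observed = pop.drop(index=0)

uni_mean = observed["Weight"].mean()                             # ignores Age
cond_mean = observed.loc[observed["Age"] < 1, "Weight"].mean()   # mean given Age < 1

print(f"true weight:        {true_weight:6.1f}")
print(f"univariate mean:    {uni_mean:6.1f}")
print(f"mean given Age < 1: {cond_mean:6.1f}")
```

Conditioning on even one other feature moves the imputed value from an adult weight to an infant weight.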

## Imputation depends on **how** the imputed value is used

Less obvious is that $\txi_f$ will be used for training some model.

So perhaps we should consider how the model that we are going to fit uses $\txi_f$.


To illustrate, suppose the model uses
the normalized dot product (cosine similarity) to measure similarity among examples
- e.g., a variant of KNN

Knowing this, we can show that different choices of $\txi_f$ influence the similarity metric.

In [5]:
import pandas as pd
import numpy as np

S1 = pd.Series( { "a": 4,               "d": 5, "e":1 })
S2 = pd.Series( { "a": 5, "b":5, "c": 4})
S3 = pd.Series( {                       "d": 2, "e": 4, "f": 5})

df = pd.DataFrame( [S1, S2, S3], index=["A", "B", "C"])


A = df.loc["A",:]
B = df.loc["B",:]  
C = df.loc["C",:]


def sim(A, B):
    """
    Compute cosine similarity of vectors A and B
    
    Parameters
    -----------
    A, B: pandas Series sharing the same index (missing values already filled)
    """
    return (A * B).sum()/( np.sqrt( (A*A).sum() ) * np.sqrt( (B*B).sum() ) )

def compare_subs(X, Y):
    """
    Compare various ways of filling in missing values 
    
    Parameters
    ----------
    X, Y: pandas Series sharing the same index; missing values are NaN
    """
    # Substitute 0 for each missing value
    (Xp, Yp) = (X.fillna(0), Y.fillna(0))
    print("\tSubstitute 0: similarity= {s:0.2f}".format(s= sim( Xp, Yp )) )

    # Substitute each example's own mean (taken over its non-missing features)
    (Xp, Yp) = (X.fillna( X.mean() ), Y.fillna( Y.mean() ))
    print("\tSubstitute w/feature mean: similarity= {s:0.2f}".format(s= sim( Xp, Yp  )) )

    # Center data, then substitute 0
    # The centered version tells a different story (b/c of magnitudes of entries ?)
    # N.B., Cov(X,Y) = E(X*Y) - E(X)E(Y), so if X, Y are not centered, the dot product is not proportional to Cov(X,Y)
    (Xp, Yp) = ( (X - X.mean()).fillna(0), (Y - Y.mean()).fillna(0) )
    print("\tCenter, then Substitute 0: similarity= {s:0.2f}".format(s= sim( Xp, Yp  )) )

In [6]:
df

print("A vs B")
compare_subs(A,B)

print("A vs C")
compare_subs(A,C)

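|   | a | d | e | b | c | f |
|---|---|---|---|---|---|---|
| A | 4.0 | 5.0 | 1.0 | NaN | NaN | NaN |
| B | 5.0 | NaN | NaN | 5.0 | 4.0 | NaN |
| C | NaN | 2.0 | 4.0 | NaN | NaN | 5.0 |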


A vs B
	Substitute 0: similarity= 0.38
	Substitute w/feature mean: similarity= 0.94
	Center, then Substitute 0: similarity= 0.09
A vs C
	Substitute 0: similarity= 0.32
	Substitute w/feature mean: similarity= 0.87
	Center, then Substitute 0: similarity= -0.56


- A versus B
    - missing values are paired against relatively high values
        - Substituting $0$ (a low value) reduces similarity
        - Substituting the mean (a relatively high value) increases similarity
            - A is a tougher rater: A.mean() < B.mean()
                
    - in the end, A and B had only a *single* true point of comparison ("a")
        - the rest of the similarity was made up by the imputed values

- A versus C:
    - C is missing feature "a"; the only true points of comparison are "d" and "e", on which A and C disagree
    - A and B come out closer than A and C under every substitution shown, but by very different margins
    

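Cosine similarity
- is unchanged by multiplying a whole vector by a positive constant, but is *not* unchanged by shifting it
    - so centering matters, as does the relative scale of the individual features
    - Analogy
        - difference between Covariance and Correlation
            - Correlation is Covariance of normalized (centered and scaled) variables
            - cosine similarity of *centered* vectors is their Correlation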

The choice of $\txi_f = 0$ is **not** neutral if the data is not centered
- e.g., if $0$ is the minimum of $\x_f$, rather than the average
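
A small check of how cosine similarity reacts to scaling and shifting, and of its relationship to Correlation. The two vectors are arbitrary illustrative values and `cosine` is a helper defined only for this sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D arrays."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([4.0, 5.0, 1.0, 3.0])
v = np.array([5.0, 2.0, 4.0, 5.0])

base = cosine(u, v)
scaled = cosine(10 * u, v)            # multiply one whole vector by a constant
shifted = cosine(u + 100, v + 100)    # add a constant to every entry
centered = cosine(u - u.mean(), v - v.mean())
pearson = float(np.corrcoef(u, v)[0, 1])

print(f"original:               {base:.3f}")
print(f"one vector times 10:    {scaled:.3f}")
print(f"both shifted by 100:    {shifted:.3f}")
print(f"centered:               {centered:.3f}")
print(f"Pearson correlation:    {pearson:.3f}")
```

Rescaling a vector leaves the similarity unchanged, shifting pushes it toward 1, and centering reproduces the Pearson correlation. This is why filling with $0$ behaves so differently before and after centering.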

# Predictive methods for imputation

Hopefully the preceding examples illustrated some issues in imputation.

Can we do better ?

Let $\x_{\bar{f}}$ denote the vector of features *excluding* the one at index $f$.

We can frame the imputation problem as finding
$$
\pr{\tx_f | \tx_{\bar{f}}}
$$

That is: find likely values for the missing feature, *given* values for the non-missing features.

How do we do this ?

Machine Learning to the rescue !
- fit a model on the subset of training examples 
    - that *have* feature $f$ (used as target)
- use the model to predict $\txi_f$

## Simple predictive imputation
Some ideas
- Naive Bayes
    - its assumption that features are conditionally independent given the class lets a missing feature simply be left out of the likelihood
- Regression
    - $\x_f = \Theta^T \x_{\bar{f}}$
        - feature $f$ as a function of the other features
    - $\x_f = \Theta^T \z$
        - $\z$ may be features, not present in $\x$, that are believed to be correlated with $\x_f$
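
A minimal sketch of regression-based imputation, assuming synthetic data in which the feature at index `f` really is a noisy linear function of the other features (the coefficients and noise level are made up). The model is fit only on examples that *have* feature `f`, then used to predict it where it is missing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: feature 2 is (noisily) a linear function of features 0 and 1
X = rng.normal(size=(200, 2))
dependent = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X = np.column_stack([X, dependent])

f = 2
true_values = X[:10, f].copy()
X[:10, f] = np.nan                      # the first 10 examples lose feature f

missing = np.isnan(X[:, f])
other = [j for j in range(X.shape[1]) if j != f]

# Fit on examples that have feature f (used as the target) ...
model = LinearRegression()
model.fit(X[~missing][:, other], X[~missing, f])

# ... and predict it for the examples where it is missing
X_imputed = X.copy()
X_imputed[missing, f] = model.predict(X[missing][:, other])

print("max abs imputation error:", np.abs(X_imputed[:10, f] - true_values).max())
```

The quality of the imputation here depends entirely on the linear relationship built into the synthetic data; real features may need a different model.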

## Proximity based imputation

A proximity based method
- creates a proximity (opposite of distance) measure $\proxtx{\x^\ip}$ between $\tx$ and training example $\x^\ip$
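-  $\txi_f$ is the proximity weighted average of the values of the feature in the training set
$$
\txi_f = \frac{ \sum_{i=1, i \ne m'}^m { \proxtx{\x^\ip} \x^\ip_f } }{ \sum_{i=1, i \ne m'}^m { \proxtx{\x^\ip} } }
$$

where the sums run over the training examples in which feature $f$ is not missing.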

That is
- the missing value should be similar to the feature value
in training examples  "similar" to $\tx$.

The definition of proximity (similarity) will vary.

**Note** 

For categorical $\tx_f$  use the most frequent non-missing value
- where the frequency is weighted by proximities.

The method works for multiple missing features but we illustrate with a single one for simplicity.
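
One possible instance of a proximity based imputation, as a sketch. The Gaussian kernel on Euclidean distance (and its `bandwidth` parameter) is an assumption made for illustration; as noted above, the definition of proximity varies. The helper name `proximity_impute` and the toy data are hypothetical.

```python
import numpy as np

def proximity_impute(x_missing, X_train, f, bandwidth=1.0):
    """Impute feature f of x_missing as a proximity weighted average over X_train.

    Proximity is a Gaussian kernel on the distance computed from the
    non-missing features (an illustrative choice, not the only one).
    """
    other = [j for j in range(X_train.shape[1]) if j != f]
    sq_dist = ((X_train[:, other] - x_missing[other]) ** 2).sum(axis=1)
    prox = np.exp(-sq_dist / (2 * bandwidth ** 2))
    # Weighted average: normalize by the total proximity
    return float((prox * X_train[:, f]).sum() / prox.sum())

# Toy training set with two well separated groups
X_train = np.array([[1.0, 1.0, 10.0],
                    [1.1, 0.9, 11.0],
                    [5.0, 5.0, 50.0],
                    [5.2, 4.8, 52.0]])

x = np.array([1.05, 1.0, np.nan])       # feature 2 is missing
imputed = proximity_impute(x, X_train, f=2)
print(f"imputed value: {imputed:.2f}")
```

Because `x` is close to the first two training examples, the imputed value lands between their values of the feature, and the distant group contributes almost nothing.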

## Limitations

For the imputation of 
$$
\pr{\tx_f | \tx_{\bar{f}}}
$$

we are implicitly assuming that if feature vectors $\x^\ip, \x^{(i')}$ are "similar" then
so are their targets
$$\y^\ip \approx \y^{(i')}$$

With that limitation in mind there are related methods
- Clustering
    - find groups of examples with common features
        - K-means
    - PCA, Recommender systems
        - Unsupervised Machine Learning: Preview of coming lecture !

# Random Forest proximity method for missing data

There is an interesting method for using a Random Forest to impute missing data.

It is interesting because proximity is determined by both the features *and* the target
so we are modeling

$$
\pr{\x_f | \y,\x_{\bar{f}}}
$$

That is, it fits a model of $\x_f$ given the features other than $f$ **and** the target.

A test example with missing features doesn't have a target; 
we will see how this method adapts.


### Missing feature in Training example

We will use a Random Forest to define a proximity measure.

[Missing Value Imputation using Random Forest](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1)

- Initialization
    - Set $\txi_f$ to a "reasonable" guess
        - Continuous: mean, median
        - Categorical: most frequent
    - Create the initial Random Forest $F_{(0)}$

- Iteration $t$:
    - Define the proximity to example $\x^\ip$
        - $\proxtx{\x^\ip} = \text{number of trees in } F_{(t-1)} \text{ with } \tx, \x^\ip \text{ in same leaf}$
    - Update imputed value $\txi_f$ (sums taken over examples in which feature $f$ is not missing)
$$
\txi_f = \frac{ \sum_{i=1, i \ne m'}^m { \proxtx{\x^\ip} \x^\ip_f } }{ \sum_{i=1, i \ne m'}^m { \proxtx{\x^\ip} } }
$$
    - Create next Random Forest $F_{(t)}$

Iterate until convergence.

The authors suggest 4-6 iterations suffice.
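
The iteration can be sketched with `sklearn`. This is an illustrative approximation rather than the reference implementation: it handles only continuous features, uses `RandomForestClassifier.apply` to obtain leaf indices, and is demonstrated on the Iris data with artificially hidden values. The function name `rf_proximity_impute` and its parameters are assumptions of the sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def rf_proximity_impute(X, y, n_iter=5, n_trees=100, seed=0):
    """Impute NaNs in continuous features X using Random Forest proximities."""
    missing = np.isnan(X)
    X_hat = X.copy()
    # Initialization: fill each missing entry with its column median
    col_median = np.nanmedian(X, axis=0)
    X_hat[missing] = col_median[np.where(missing)[1]]

    for _ in range(n_iter):
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X_hat, y)
        leaves = forest.apply(X_hat)                 # shape (n_samples, n_trees)
        # prox[a, b] = number of trees in which examples a and b share a leaf
        prox = (leaves[:, None, :] == leaves[None, :, :]).sum(axis=2).astype(float)
        np.fill_diagonal(prox, 0.0)                  # an example is not its own donor
        for f in np.where(missing.any(axis=0))[0]:
            rows = np.where(missing[:, f])[0]        # examples missing feature f
            donors = ~missing[:, f]                  # examples that have feature f
            w = prox[rows][:, donors]
            totals = w.sum(axis=1)
            ok = totals > 0
            # Proximity weighted average of the observed values of feature f
            X_hat[rows[ok], f] = (w[ok] @ X[donors, f]) / totals[ok]
    return X_hat

X_full, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
holes = rng.choice(X_full.shape[0], size=15, replace=False)
X_miss = X_full.copy()
X_miss[holes, 2] = np.nan                            # hide petal length in 15 examples

X_rf = rf_proximity_impute(X_miss, y)
median_err = np.abs(np.nanmedian(X_miss[:, 2]) - X_full[holes, 2]).mean()
rf_err = np.abs(X_rf[holes, 2] - X_full[holes, 2]).mean()
print(f"mean abs error, median fill:  {median_err:.2f}")
print(f"mean abs error, RF proximity: {rf_err:.2f}")
```

Note that the forest is fit using the target `y`, so the proximities (and hence the imputed values) reflect the label as well as the other features.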

### Missing feature in Test example

The method is similar to that for a missing feature in a Training example, once
we deal with a crucial difference
- there is **no label** for the test example (that's what we're trying to predict)


Suppose the classification target $\y \in C$ (i.e., the possible labels)
$$C = \{ c_k | 1 \le k \le |C| \}$$

- For each $c \in C$:
    - $\txi_{f,c}$ is the imputed value obtained from the above by assuming $\y^{(m')} = c$
        - that is, run the missing feature for training algorithm assuming label of $\tx$ is $c$

We now have one imputed value $\txi_{f,c}$  and final Random Forest per class $c$.

Which one do we choose ?

Observe that the $c^{th}$ Random Forest *should* predict class $c$ given
input $\txi$, since we set $\y^{(m')} = c$ when fitting it.

So we choose the forest and imputed value $\txi_{f,c}$ for the class $c$
whose forest most often classifies the example as being in class $c$
(i.e., the class receiving the largest share of its own forest's votes).
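
A rough sketch of the per-class procedure, under several simplifying assumptions: the training set has no missing values, only one feature of the test example is missing, and `predict_proba` (the average of the trees' class probabilities in `sklearn`) stands in for counting tree votes. The helper `impute_assuming_label` is hypothetical, and a held-out Iris example plays the role of the test example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def impute_assuming_label(X_train, y_train, x_test, f, c, n_iter=3, n_trees=50, seed=0):
    """Impute x_test[f] by treating x_test as a training example with label c.

    Returns the imputed value and the final forest's support for class c.
    """
    X_aug = np.vstack([X_train, x_test])
    y_aug = np.append(y_train, c)
    X_aug[-1, f] = np.median(X_train[:, f])            # initial guess
    for _ in range(n_iter):
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X_aug, y_aug)
        leaves = forest.apply(X_aug)
        # Proximity of the test example to each training example
        prox = (leaves[:-1] == leaves[-1]).sum(axis=1).astype(float)
        if prox.sum() > 0:
            X_aug[-1, f] = prox @ X_train[:, f] / prox.sum()
    # Final forest, fit with the last imputed value
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_aug, y_aug)
    class_index = list(forest.classes_).index(c)
    support = forest.predict_proba(X_aug[-1].reshape(1, -1))[0, class_index]
    return float(X_aug[-1, f]), float(support)

X, y = load_iris(return_X_y=True)
x_test, true_label = X[100].copy(), y[100]             # hold out one example
X_train, y_train = np.delete(X, 100, axis=0), np.delete(y, 100)

f = 2
true_value = x_test[f]
x_test[f] = np.nan                                     # hide petal length

results = {c: impute_assuming_label(X_train, y_train, x_test, f, c)
           for c in np.unique(y_train)}
best = max(results, key=lambda c: results[c][1])       # class with the strongest self-support

for c, (value, support) in results.items():
    print(f"assumed class {c}: imputed={value:.2f}, support={support:.2f}")
print(f"chosen class: {best}, true class: {true_label}, true value: {true_value}")
```

Each forest is biased toward its assumed label because the test example was part of its training data; the comparison relies on the bias being weakest when the assumed label conflicts with the other features.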


# Now-casting

The field of economic forecasting encounters a problem similar to missing data
- many economic indices are combinations of sub-indices
    - sub-indices published at different frequencies
    - sub-indices published on different days
    
The "total" index can't be computed until **all** sub-indices have been released.

So with respect to an "early" publication date, some sub-index data is missing.

Now-casting (a play on "forecasting") uses techniques to make early predictions of sub-index values.

In some cases they use features $\z$ believed to be correlated with actual features $\x$:
- derive higher frequency values for low frequency data (annual GDP, monthly Manufacturing)
    - National Manufacturing employment may be highly correlated with Employment in a few states
    - state-level employment may be published
        - at higher frequency
        - at an earlier date
- many of these low frequency features are *composites* of other features
    - some elements of the composite are released before others
        - these may predict the whole

[Now-casting site](https://www.now-casting.com/home)

# Missing data imputation in `sklearn`
- [`SimpleImputer`, `IterativeImputer`](https://scikit-learn.org/stable/modules/impute.html)
- [`IterativeImputer` with different regression estimators](https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html)
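
A minimal usage sketch of the two imputers on a tiny made-up array in which the second column is exactly twice the first. Note that `IterativeImputer` is still marked experimental in `sklearn` and must be enabled by an explicit import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Made-up data: column 1 is twice column 0, with one value missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [5.0, 10.0]])

# Univariate: replace the missing value with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: model each feature with missing values as a function of the others
iter_filled = IterativeImputer(random_state=0).fit_transform(X)

print("SimpleImputer (mean):", mean_filled[3, 1])
print("IterativeImputer:    ", iter_filled[3, 1])
```

`SimpleImputer` returns the column mean (5.5), ignoring the first column; `IterativeImputer` uses the first column and should land much closer to the value 8 implied by the pattern.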

In [7]:
print("Done")

Done
