## Adding a Missing Indicator variable with Scikit-learn ==> MissingIndicator

Scikit-learn provides the **MissingIndicator** class to add a binary variable that flags NA.

The MissingIndicator can add a binary indicator for every variable in the dataset (`features='all'`), or only for those variables that show NA in the train set (`features='missing-only'`, the default).
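A minimal sketch of the two options, using a small made-up DataFrame (the `toy` frame and its column names are hypothetical and not part of the Ames data). It should show that `features='missing-only'` yields one indicator per column with NA seen during `fit`, while `features='all'` yields one per column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical toy data: only column "a" contains NA
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# indicators only for the columns that had NA when fitting
only_missing = MissingIndicator(features="missing-only").fit(toy)

# indicators for every column, whether or not it had NA
all_features = MissingIndicator(features="all").fit(toy)

flags_missing_only = only_missing.transform(toy)
flags_all = all_features.transform(toy)

print(flags_missing_only.shape)  # (3, 1)
print(flags_all.shape)           # (3, 2)
```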

### Attention!

The transformer only returns the binary variables, which need to be added to the original train data.

### More details about the transformers

- [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html#sklearn.impute.MissingIndicator)

## In this demo:

We will add a missing indicator to the variables of the Ames House Price dataset.

- To download the dataset please refer to the lecture **Datasets** in **Section 1** of this course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the objects we need to impute missing data
# with sklearn
from sklearn.impute import SimpleImputer, MissingIndicator  # key classes from the library
from sklearn.pipeline import Pipeline

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
df1 = pd.read_csv('train.csv')
df2 = pd.read_csv('test.csv')

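# DataFrame.append was removed in pandas 2.0, so we concatenate instead
data = pd.concat([df1, df2], sort=False)

# we use only the following variables for the demo;
# 5 of the features contain NA (SalePrice is also NA for the test.csv rows)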

cols_to_use = [
    'OverallQual', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'WoodDeckSF',
    'BsmtUnfSF', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'SalePrice'
]

data = data[cols_to_use]

data.head(4)


|   | OverallQual | TotalBsmtSF | 1stFlrSF | GrLivArea | WoodDeckSF | BsmtUnfSF | LotFrontage | MasVnrArea | GarageYrBlt | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 856.0 | 856 | 1710 | 0 | 150.0 | 65.0 | 196.0 | 2003.0 | 208500.0 |
| 1 | 6 | 1262.0 | 1262 | 1262 | 298 | 284.0 | 80.0 | 0.0 | 1976.0 | 181500.0 |
| 2 | 7 | 920.0 | 920 | 1786 | 0 | 434.0 | 68.0 | 162.0 | 2001.0 | 223500.0 |
| 3 | 7 | 756.0 | 961 | 1717 | 0 | 540.0 | 60.0 | 0.0 | 1998.0 | 140000.0 |


In [3]:
# let's check the null values
data.isnull().mean()

OverallQual    0.000000
TotalBsmtSF    0.000343
1stFlrSF       0.000000
GrLivArea      0.000000
WoodDeckSF     0.000000
BsmtUnfSF      0.000343
LotFrontage    0.166495
MasVnrArea     0.007879
GarageYrBlt    0.054471
SalePrice      0.499829
dtype: float64

In [4]:
# let's separate into training and testing set

# first let's remove the target from the features
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use], # just the features
                                                    data['SalePrice'], # the target
                                                    test_size=0.3, # the percentage of obs in the test set
                                                    random_state=0) # for reproducibility
X_train.shape, X_test.shape

((2043, 9), (876, 9))

In [5]:
# let's check the missing data again
X_train.isnull().mean()

OverallQual    0.000000
TotalBsmtSF    0.000489
1stFlrSF       0.000000
GrLivArea      0.000000
WoodDeckSF     0.000000
BsmtUnfSF      0.000489
LotFrontage    0.162996
MasVnrArea     0.006853
GarageYrBlt    0.058248
dtype: float64

## Add a Missing Indicator

In [6]:
indicator = MissingIndicator(error_on_new=True, features='missing-only')
indicator.fit(X_train)  

MissingIndicator(error_on_new=True, features='missing-only',
         missing_values=nan, sparse='auto')
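The `error_on_new=True` setting matters when data seen at `transform` time has NA in a column that had none during `fit`. The sketch below illustrates that behaviour on made-up data (the `train` and `new` frames are hypothetical); as far as the scikit-learn documentation describes it, the check only applies with `features='missing-only'`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical data: "a" has NA at fit time, "b" only has NA later
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
new = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, 5.0, 6.0]})

# strict: NA in a column that was complete during fit raises an error
strict = MissingIndicator(features="missing-only", error_on_new=True).fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# lenient: the new NA in "b" is ignored, only the indicator for "a" is returned
lenient = MissingIndicator(features="missing-only", error_on_new=False).fit(train)
flags = lenient.transform(new)

print(raised)       # True
print(flags.shape)  # (3, 1)
```

With `error_on_new=False` the NA in `b` is silently left unflagged, so whether to fail loudly or carry on is a design choice that depends on how the downstream model should treat unexpected missingness.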

In [7]:
# we can see the features with na:
# the result shows the index

indicator.features_

array([1, 5, 6, 7, 8], dtype=int64)

In [8]:
# we can find the feature names by passing the index to the
# list of columns

X_train.columns[indicator.features_]

Index(['TotalBsmtSF', 'BsmtUnfSF', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt'], dtype='object')

In [9]:
# the indicator returns only the additional indicators
# when we transform the dataset

tmp = indicator.transform(X_train)

tmp

array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       ...,
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])

In [14]:
# so we need to join it manually to the original X_train

# let's create a column name for each of the new MissingIndicators
indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]

# and now we concatenate
X_train = pd.concat([
    X_train.reset_index(),
    pd.DataFrame(tmp, columns = indicator_cols)],
    axis=1)

X_train.head()

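|   | index | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley | BsmtQual_NA | FireplaceQu_NA | LotFrontage_NA | MasVnrArea_NA | Alley_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | Gd | NaN | RL | 318 | NaN | 573.0 | Pave | NaN | False | True | True | False | True |
| 1 | 682 | Gd | Gd | RL | 288 | NaN | 0.0 | Pave | NaN | False | False | True | False | True |
| 2 | 960 | TA | NaN | RL | 162 | 50.0 | 0.0 | Pave | NaN | False | True | False | False | True |
| 3 | 1384 | TA | NaN | RL | 356 | 60.0 | 0.0 | Pave | NaN | False | True | False | False | True |
| 4 | 1100 | TA | NaN | RL | 0 | 60.0 | 0.0 | Pave | NaN | False | True | False | False | True |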
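Note that `reset_index()` moves the old row labels into a new `index` column, which then travels with the features. One possible alternative, sketched here on a hypothetical toy frame, is to build the indicator DataFrame with the same index as the original data so that `pd.concat` aligns the rows without adding an extra column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical toy frame with a non-contiguous index,
# similar to what train_test_split leaves behind
X = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0], "c": [7.0, 8.0, 9.0]},
    index=[64, 682, 960],
)

indicator = MissingIndicator(features="missing-only").fit(X)
indicator_cols = [c + "_NA" for c in X.columns[indicator.features_]]

# reuse the original index so the rows line up on concatenation
flags = pd.DataFrame(indicator.transform(X), columns=indicator_cols, index=X.index)
X_with_flags = pd.concat([X, flags], axis=1)

print(X_with_flags.columns.tolist())  # ['a', 'b', 'c', 'a_NA', 'b_NA']
```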


In [15]:
# now the same for the test set
tmp = indicator.transform(X_test)

X_test = pd.concat([
    X_test.reset_index(),
    pd.DataFrame(tmp, columns = indicator_cols)],
    axis=1)

X_test.head()

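|   | index | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley | BsmtQual_NA | FireplaceQu_NA | LotFrontage_NA | MasVnrArea_NA | Alley_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 529 | TA | TA | RL | 816 | NaN | NaN | Pave | NaN | False | False | True | True | True |
| 1 | 491 | TA | TA | RL | 238 | 79.0 | 0.0 | Pave | NaN | False | False | False | False | True |
| 2 | 459 | TA | TA | RL | 524 | NaN | 161.0 | Pave | NaN | False | False | True | False | True |
| 3 | 279 | Gd | TA | RL | 768 | 83.0 | 299.0 | Pave | NaN | False | False | False | False | True |
| 4 | 655 | TA | NaN | RM | 525 | 21.0 | 381.0 | Pave | NaN | False | True | False | False | True |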


### SimpleImputer on the entire dataset

In [16]:
# Now we impute the missing values with SimpleImputer

# create an instance of the simple imputer
# we indicate that we want to impute with the 
# most frequent category

imputer = SimpleImputer(strategy='most_frequent')

# we fit the imputer to the train set
# the imputer will learn the most frequent value (mode) of each variable
imputer.fit(X_train)

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='most_frequent', verbose=0)

In [17]:
# we can look at the learnt frequent values like this:
imputer.statistics_

array([0, 'TA', 'Gd', 'RL', 0, 60.0, 0.0, 'Pave', 'Pave', False, False,
       False, False, True], dtype=object)

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables.
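To make this concrete, here is a small sketch on made-up data (the `toy` frame and its column names are hypothetical) with one categorical and one numerical column. It assumes nothing beyond `SimpleImputer(strategy='most_frequent')` as used above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data: one categorical and one numerical column, each with NA
toy = pd.DataFrame({
    "quality": ["TA", "Gd", "TA", np.nan],
    "frontage": [60.0, 60.0, np.nan, 80.0],
})

imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(toy)

# the mode learnt per column: 'TA' for quality, 60.0 for frontage
print(imputer.statistics_)

# the NA entries are replaced by those modes
print(imputed[3, 0], imputed[2, 1])
```

For a continuous variable the mode is often a less natural fill value than the mean or median, so applying a separate imputer per variable type may be worth considering.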

In [18]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

array([[64, 'Gd', 'Gd', ..., True, False, True],
       [682, 'Gd', 'Gd', ..., True, False, True],
       [960, 'TA', 'Gd', ..., False, False, True],
       ...,
       [1216, 'TA', 'Gd', ..., False, False, True],
       [559, 'Gd', 'TA', ..., True, False, True],
       [684, 'Gd', 'Gd', ..., False, False, True]], dtype=object)
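Because `transform` returns a NumPy array, the column names are lost. The sketch below, on a hypothetical numeric toy frame, shows one way to wrap the result back into a DataFrame. It also uses `SimpleImputer`'s `add_indicator=True` option, which appends the missing-indicator columns to the imputed data in a single step; the `_NA` suffix is just the naming convention of this demo, not something scikit-learn produces:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data: only column "a" contains NA
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# impute with the median and append the missing indicators in one go
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(toy)

# rebuild the column names: original columns, then one flag per column with NA
flag_cols = [c + "_NA" for c in toy.columns[imputer.indicator_.features_]]
result = pd.DataFrame(out, columns=list(toy.columns) + flag_cols, index=toy.index)

print(result.columns.tolist())  # ['a', 'b', 'a_NA']
```

In this numeric case the indicator columns come back as floats (0.0 and 1.0) rather than booleans, since the whole output array shares one dtype.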