## Adding a variable to capture NA

**Mean / median and random sample imputation assume that the data are missing completely at random (MCAR)**. **Arbitrary value imputation and end-of-distribution imputation change the variable distribution dramatically**, and are therefore not well suited to linear models. If data are **missing not at random (MNAR)**, it is a good idea to **replace missing observations with the mean / median / mode** AND **flag** those observations with a **Missing Indicator**. A Missing Indicator is an **additional binary variable** that indicates whether the value was missing for an observation (1) or not (0). We can add a missing indicator to **both numerical and categorical variables**. It is easy to implement and captures the predictive importance of missingness, if there is any.
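As a minimal sketch of the idea described above, the snippet below flags and then imputes a single column. The `toy` DataFrame and its values are made up for illustration and are not part of the datasets used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two of five ages are missing.
toy = pd.DataFrame({'age': [22.0, np.nan, 35.0, np.nan, 41.0]})

# 1) Flag the missing observations BEFORE imputing; once the gaps are
#    filled, we can no longer tell which rows were missing.
toy['age_NA'] = toy['age'].isnull().astype(int)

# 2) Replace the missing values with the median of the observed values.
toy['age'] = toy['age'].fillna(toy['age'].median())

print(toy)
```

The order matters: creating the indicator after `fillna` would yield a column of zeros.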

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('titanic.csv', usecols=['age', 'fare', 'survived'])
data.head()

|   | survived | age | fare |
|---|---|---|---|
| 0 | 1 | 29.0 | 211.3375 |
| 1 | 1 | 0.9167 | 151.55 |
| 2 | 0 | 2.0 | 151.55 |
| 3 | 0 | 30.0 | 151.55 |
| 4 | 0 | 25.0 | 151.55 |


In [3]:
data.isnull().mean()

survived    0.000000
age         0.200917
fare        0.000764
dtype: float64

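To add a **binary missing indicator**, we don't need to learn anything from the training set, so in principle **we could do this on the original dataset** and **then separate into train and test**. Still, it is preferable to split first, so that every pre-processing step follows the same train / test workflow as the imputation itself, which does learn its value from the training set. Now, let's create a **binary missing indicator manually**.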
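To illustrate why the indicator needs nothing from the training set, the sketch below builds it both ways on a made-up `toy` DataFrame (an assumption for illustration, not the Titanic data) and checks that the two routes produce identical train and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration).
toy = pd.DataFrame(
    {'age': [22.0, np.nan, 35.0, np.nan, 41.0, 18.0, np.nan, 60.0]})

# Option A: add the indicator to the full dataset, then split.
full = toy.copy()
full['age_NA'] = np.where(full['age'].isnull(), 1, 0)
train_a, test_a = train_test_split(full, test_size=0.25, random_state=0)

# Option B: split first, then add the indicator to each part.
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=0)
train_b, test_b = train_b.copy(), test_b.copy()
train_b['age_NA'] = np.where(train_b['age'].isnull(), 1, 0)
test_b['age_NA'] = np.where(test_b['age'].isnull(), 1, 0)

# The indicator depends only on each row's own value, so both routes agree.
print(train_a.equals(train_b), test_a.equals(test_b))
```

This equivalence holds for the indicator only. The median used for imputation is a learned value and should still come from the training set.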

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['age', 'fare']],  # predictors
    data['survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((916, 2), (393, 2))

In [5]:
X_train.isnull().mean()

age     0.191048
fare    0.000000
dtype: float64

In [6]:
X_train['Age_NA'] = np.where(X_train['age'].isnull(), 1, 0)
X_test['Age_NA'] = np.where(X_test['age'].isnull(), 1, 0)
X_train.head()

|   | age | fare | Age_NA |
|---|---|---|---|
| 501 | 13.0 | 19.5 | 0 |
| 588 | 4.0 | 23.0 | 0 |
| 402 | 30.0 | 13.8583 | 0 |
| 1193 | NaN | 7.725 | 1 |
| 686 | 22.0 | 7.725 | 0 |


In [7]:
X_train['Age_NA'].mean()

0.19104803493449782

In [8]:
X_train.isnull().mean()

age       0.191048
fare      0.000000
Age_NA    0.000000
dtype: float64

In [9]:
median = X_train['age'].median()
X_train['age'] = X_train['age'].fillna(median)
X_test['age'] = X_test['age'].fillna(median)
X_train.isnull().mean()

age       0.0
fare      0.0
Age_NA    0.0
dtype: float64
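The two manual steps above (indicator plus median imputation) can also be done in one pass with scikit-learn's `SimpleImputer` and its `add_indicator=True` option. The following is a sketch on made-up stand-ins for `X_train` and `X_test`; the frame names, values, and the `age_NA` column label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for X_train / X_test (values made up for illustration).
train = pd.DataFrame({'age': [13.0, np.nan, 30.0, 22.0],
                      'fare': [19.5, 7.7, 13.9, 7.7]})
test = pd.DataFrame({'age': [np.nan, 40.0],
                     'fare': [8.0, 26.0]})

# add_indicator=True appends one binary column for every feature that
# had missing values when fit() was called (here: only 'age').
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(train)  # the median is learned from the training set only

# Output layout: imputed features first, then the indicator columns.
cols = ['age', 'fare', 'age_NA']
train_t = pd.DataFrame(imputer.transform(train), columns=cols, index=train.index)
test_t = pd.DataFrame(imputer.transform(test), columns=cols, index=test.index)
print(test_t)
```

Note that the transformer returns floats, so the indicator appears as `1.0` / `0.0` rather than as integers.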

### House Prices dataset

In [12]:
cols_to_use = ['LotFrontage', 'MasVnrArea', # numerical
               'BsmtQual', 'FireplaceQu', # categorical
               'SalePrice' ] # target

In [13]:
data = pd.read_csv('HousingPrices_train.csv', usecols=cols_to_use)
print(data.shape)
data.head()

(1460, 5)


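|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | SalePrice |
|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | Gd | NaN | 208500 |
| 1 | 80.0 | 0.0 | Gd | TA | 181500 |
| 2 | 68.0 | 162.0 | Gd | TA | 223500 |
| 3 | 60.0 | 0.0 | TA | Gd | 140000 |
| 4 | 84.0 | 350.0 | Gd | TA | 250000 |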


In [14]:
data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

In [15]:
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 5), (438, 5))

In [16]:
def missing_indicator(df, variable):
    # 1 where the value of `variable` is missing, 0 otherwise
    return np.where(df[variable].isnull(), 1, 0)

**Loop over all the variables and add a binary missing indicator with the function we created!**

In [17]:
# cols_to_use also contains the target, so a SalePrice_NA column is created too
for variable in cols_to_use:
    X_train[variable+'_NA'] = missing_indicator(X_train, variable)
    X_test[variable+'_NA'] = missing_indicator(X_test, variable)
X_train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[variable+'_NA'] = missing_indicator(X_train, variable)



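|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | SalePrice | LotFrontage_NA | MasVnrArea_NA | BsmtQual_NA | FireplaceQu_NA | SalePrice_NA |
|---|---|---|---|---|---|---|---|---|---|---|
| 64 | NaN | 573.0 | Gd | NaN | 219500 | 1 | 0 | 0 | 1 | 0 |
| 682 | NaN | 0.0 | Gd | Gd | 173000 | 1 | 0 | 0 | 0 | 0 |
| 960 | 50.0 | 0.0 | TA | NaN | 116500 | 0 | 0 | 0 | 1 | 0 |
| 1384 | 60.0 | 0.0 | TA | NaN | 105000 | 0 | 0 | 0 | 1 | 0 |
| 1100 | 60.0 | 0.0 | TA | NaN | 60000 | 0 | 0 | 0 | 1 | 0 |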
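The "value is trying to be set on a copy of a slice" message printed by the cell above is pandas' `SettingWithCopyWarning`. One common way to avoid it, sketched here on a made-up `toy` DataFrame (an assumption for illustration), is to take explicit copies of the split frames before adding columns. Whether the warning appears at all depends on the pandas version in use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0, np.nan, 84.0],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000]})

X_train, X_test = train_test_split(toy, test_size=0.33, random_state=0)

# Work on explicit copies so pandas treats these frames as independent
# objects rather than as possible views of `toy`.
X_train = X_train.copy()
X_test = X_test.copy()

X_train['LotFrontage_NA'] = np.where(X_train['LotFrontage'].isnull(), 1, 0)
X_test['LotFrontage_NA'] = np.where(X_test['LotFrontage'].isnull(), 1, 0)
```

The original `toy` frame is left untouched; only the copies gain the new column.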


In [18]:
missing_ind = [col for col in X_train.columns if col.endswith('_NA')]
X_train[missing_ind].mean()

LotFrontage_NA    0.184932
MasVnrArea_NA     0.004892
BsmtQual_NA       0.023483
FireplaceQu_NA    0.467710
SalePrice_NA      0.000000
dtype: float64
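The output above shows that an indicator which is always 0 (here `SalePrice_NA`) carries no information. A possible variant, sketched on made-up data, adds indicators only for columns that have missing values in the training set and builds them without a loop. The helper `add_indicators` and the toy frames are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test (values made up for illustration).
train = pd.DataFrame({'LotFrontage': [65.0, np.nan, 68.0],
                      'BsmtQual': ['Gd', 'TA', np.nan],
                      'SalePrice': [208500, 181500, 223500]})
test = pd.DataFrame({'LotFrontage': [np.nan, 60.0],
                     'BsmtQual': ['Gd', 'Gd'],
                     'SalePrice': [140000, 250000]})

# Decide on the training set which columns get an indicator.
cols_with_na = [c for c in train.columns if train[c].isnull().any()]

def add_indicators(df, columns):
    """Return a new frame with one '<col>_NA' column per listed column."""
    flags = df[columns].isnull().astype(int).add_suffix('_NA')
    return pd.concat([df, flags], axis=1)

train_ind = add_indicators(train, cols_with_na)
test_ind = add_indicators(test, cols_with_na)
print(train_ind.columns.tolist())
```

Using the same column list for both sets keeps the train and test frames aligned, even if a column happens to have no missing values in the test set.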

In [19]:
X_train.isnull().mean()

LotFrontage       0.184932
MasVnrArea        0.004892
BsmtQual          0.023483
FireplaceQu       0.467710
SalePrice         0.000000
LotFrontage_NA    0.000000
MasVnrArea_NA     0.000000
BsmtQual_NA       0.000000
FireplaceQu_NA    0.000000
SalePrice_NA      0.000000
dtype: float64

In [20]:
def impute_na(df, variable, value):
    return df[variable].fillna(value)

In [21]:
# Numerical variables: impute with the median learned from the training set
median = X_train['LotFrontage'].median()
X_train['LotFrontage'] = impute_na(X_train, 'LotFrontage', median)
X_test['LotFrontage'] = impute_na(X_test, 'LotFrontage', median)
median = X_train['MasVnrArea'].median()
X_train['MasVnrArea'] = impute_na(X_train, 'MasVnrArea', median)
X_test['MasVnrArea'] = impute_na(X_test, 'MasVnrArea', median)

# Categorical variables: impute with the mode learned from the training set
mode = X_train['BsmtQual'].mode()[0]
X_train['BsmtQual'] = impute_na(X_train, 'BsmtQual', mode)
X_test['BsmtQual'] = impute_na(X_test, 'BsmtQual', mode)
mode = X_train['FireplaceQu'].mode()[0]
X_train['FireplaceQu'] = impute_na(X_train, 'FireplaceQu', mode)
X_test['FireplaceQu'] = impute_na(X_test, 'FireplaceQu', mode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['LotFrontage'] = impute_na(X_train, 'LotFrontage', median)


In [22]:
X_train.isnull().mean()

LotFrontage       0.0
MasVnrArea        0.0
BsmtQual          0.0
FireplaceQu       0.0
SalePrice         0.0
LotFrontage_NA    0.0
MasVnrArea_NA     0.0
BsmtQual_NA       0.0
FireplaceQu_NA    0.0
SalePrice_NA      0.0
dtype: float64
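The imputation step above repeats the same pattern for every column. As a compact sketch of the same logic, the replacement values can be collected in a dictionary learned from the training set and passed to `DataFrame.fillna`. The toy frames and the name `fill_values` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test (values made up for illustration).
train = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0, 60.0],
                      'BsmtQual': ['Gd', 'TA', np.nan, 'Gd']})
test = pd.DataFrame({'LotFrontage': [np.nan, 70.0],
                     'BsmtQual': [np.nan, 'TA']})

# Learn every replacement value from the training set only.
fill_values = {
    'LotFrontage': train['LotFrontage'].median(),  # numerical -> median
    'BsmtQual': train['BsmtQual'].mode()[0],       # categorical -> mode
}

# DataFrame.fillna accepts a dict mapping column name -> fill value.
train_imp = train.fillna(fill_values)
test_imp = test.fillna(fill_values)
print(fill_values)
```

Because `fill_values` is computed once from `train`, the test set is imputed with exactly the same values, as in the cell above.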

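As you can see, we now have **twice as many features as in the original dataset. The original dataset had 4 predictor variables; the pre-processed dataset contains 8 (each predictor plus its missing indicator), plus the target.** Because the loop ran over `cols_to_use`, which includes the target, an all-zero `SalePrice_NA` column was created as well; it carries no information and can be dropped.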