# Missing data imputation

Missing data refers to the absence of values for certain observations and is an unavoidable
problem in most data sources. Most scikit-learn estimators do not accept missing values as input, so we
need to remove observations with missing data or transform them into permitted values.

**The act of replacing missing data with statistical estimates of missing values is
called imputation.**



In [2]:
import random
import pandas as pd
import numpy as np

## Removing observations with missing data

**Complete Case Analysis (CCA)**, also called list-wise deletion of cases, consists
of **discarding those observations where the values in any of the variables are missing**.
- CCA can be applied to categorical and numerical variables.
- CCA is quick and easy to implement and has the advantage that it **preserves the distribution of the variables**, provided the data is missing at random and only a **small proportion of the data is missing**.
- However, if data is missing across many variables, CCA may remove a large portion of the dataset.

In [3]:
# Load the data with the following command
data = pd.read_csv("data/boston_listings.csv")

In [4]:
data.head()

|  | id | name | summary | access | interaction | house_rules | host_id | host_since | host_location | host_response_time | ... | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | requires_license | license | instant_bookable | is_business_travel_ready | cancellation_policy | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3781 | HARBORSIDE-Walk to subway | Fully separate apartment in a two apartment bu... | Guests solely occupy the 1 floor apartment wit... | We sometimes travel. Always available via: mob... | No pets, no smoking. | 4804 | 2008-12-03 | Massachusetts | within a few hours | ... | 10.0 | 10.0 | 10.0 | 10.0 | t |  | f | f | super_strict_30 | 0.28 |
| 1 | 5506 | **$49 Special ** Private! Minutes to center! | Private guest room with private bath, You do n... | You get full access to the guest room with pri... | We give guests privacy, but we are available ... | No Smoking in the Building. | 8229 | 2009-02-19 | Boston, Massachusetts, United States | within an hour | ... | 10.0 | 10.0 | 9.0 | 10.0 | t | Exempt: This listing is a unit that has contra... | t | f | strict_14_with_grace_period | 0.79 |
| 2 | 6695 | $99 Special!! Home Away! Condo | Comfortable, Fully Equipped private apartment... | Full Private apartment. 1 bedroom, kitchen, ... |  | No Smoking in the Building. | 8229 | 2009-02-19 | Boston, Massachusetts, United States | within an hour | ... | 10.0 | 10.0 | 9.0 | 10.0 | t | STR-404620 | t | f | strict_14_with_grace_period | 0.88 |
| 3 | 8789 | Curved Glass Studio/1bd facing Park | Bright, 1 bed with curved glass windows facing... | Guests have access to the full unit | I'm available for questions and/or issues. |  | 26988 | 2009-07-22 | Boston, Massachusetts, United States | within a few hours | ... | 10.0 | 10.0 | 10.0 | 9.0 | t |  | f | f | strict_14_with_grace_period | 0.35 |
| 4 | 10730 | Bright 1bed facing Golden Dome | Bright, spacious unit, new galley kitchen, new... | Guests have access to everything in the unit. | I'm available as needed. | NO SMOKING, NO PETS. $100 move-in fee payable ... | 26988 | 2009-07-22 | Boston, Massachusetts, United States | within a few hours | ... | 10.0 | 10.0 | 10.0 | 9.0 | t |  | f | f | strict_14_with_grace_period | 0.24 |


In [5]:
# Percentage of missing values per variable, sorted from lowest to highest
data.isnull().mean().sort_values(ascending=True) * 100

id                              0.000000
longitude                       0.000000
is_location_exact               0.000000
property_type                   0.000000
room_type                       0.000000
cancellation_policy             0.000000
amenities_dict                  0.000000
price                           0.000000
availability_30                 0.000000
availability_60                 0.000000
availability_90                 0.000000
availability_365                0.000000
number_of_reviews               0.000000
requires_license                0.000000
instant_bookable                0.000000
is_business_travel_ready        0.000000
latitude                        0.000000
neighbourhood_cleansed          0.000000
accommodates                    0.000000
host_since                      0.000000
host_identity_verified          0.000000
host_verifications              0.000000
host_total_listings_count       0.000000
name                            0.000000
host_is_superhos

In [6]:
# Complete case analysis: drop every row that has at least one missing value
data_cca = data.dropna()

In [7]:
print(f'Number of total observations: {len(data)}')
print(f'Number of observations with complete cases: {len(data_cca)}')

Number of total observations: 3845
Number of observations with complete cases: 861
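
When only some variables matter for the analysis, list-wise deletion can be restricted to those columns, which usually keeps more rows. The following is a minimal sketch on a small made-up DataFrame; the column names and values are hypothetical and are not taken from the listings data.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "license": [np.nan, "A1", np.nan, "B2"],
    "rooms": [1, 2, 2, 3],
})

# Full CCA: drop a row if ANY column is missing.
cca_all = toy.dropna()

# Restricted CCA: drop a row only if 'price' is missing.
cca_price = toy.dropna(subset=["price"])

# Fraction of rows retained under each approach.
kept_all = len(cca_all) / len(toy)
kept_price = len(cca_price) / len(toy)
print(f"kept (all columns): {kept_all:.2f}")
print(f"kept (price only):  {kept_price:.2f}")
```

Comparing the two fractions before committing to CCA is a quick way to see how much data the deletion would cost.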


## Performing mean or median imputation

**Mean or median imputation consists of replacing missing values with the variable mean or
median**.
- This can only be performed on numerical variables.
- The **mean or the median is calculated using the train set**, and these values are used to impute missing data in the train and test sets, as well as in future data we intend to score with the machine learning model.
- Therefore, we need to store these mean and median values. **Scikit-learn and Feature-engine transformers learn the parameters from the train set and store these parameters for future use**.

> **Use mean imputation if variables are normally distributed** and **median
imputation otherwise**. Mean and median imputation may distort the
distribution of the original variables if there is a high percentage of
missing data.
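
To make the train/test discipline concrete, here is a minimal pandas-only sketch on made-up numbers. The median is learned from the train split and then reused for the test split; the variable names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with gaps.
train = pd.Series([1.0, 2.0, 2.0, 3.0, 50.0, np.nan])
test = pd.Series([np.nan, 4.0])

# Learn the parameters from the train split only (NaN values are skipped).
train_median = train.median()
train_mean = train.mean()

# Reuse the stored train median for both splits.
train_imputed = train.fillna(train_median)
test_imputed = test.fillna(train_median)

print(f"train median: {train_median}, train mean: {train_mean}")
```

In this toy series the single large value pulls the mean well above the median, which illustrates why the median is the safer choice for skewed variables.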

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [22]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In mean and median imputation, the **mean or median values should be
calculated using the variables in the train set**.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(data[['A2', 'A3', 'A8', 'A11', 'A15']], data['A16'], test_size=0.3, random_state=0)

> `SimpleImputer()` from scikit-learn imputes every variable in the
dataset it receives. Therefore, **if we use mean or median imputation and the dataset
contains categorical variables, we will get an error**.
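
One way to avoid that error without slicing the DataFrame by hand is to wrap the imputer in scikit-learn's `ColumnTransformer`, so that it only sees the numerical columns. This is a sketch on a made-up frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame mixing numerical and categorical columns (names are illustrative).
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "debt": [1.0, 3.0, np.nan],
    "grade": ["a", np.nan, "b"],
})

# Apply median imputation to the numerical columns only;
# the remaining columns are passed through untouched.
ct = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="median"), ["age", "debt"])],
    remainder="passthrough",
)

# The result is an array with columns ordered as: age, debt, grade.
out = ct.fit_transform(df)
print(out)
```

The categorical column still contains its missing value afterwards; it would need its own imputer, such as one of the categorical techniques in this notebook.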

In [24]:
X_train.isnull().mean() * 100

A2      2.277433
A3     14.078675
A8     14.078675
A11     0.000000
A15     0.000000
dtype: float64

In [25]:
imputer = SimpleImputer(strategy='median')
# imputer = SimpleImputer(strategy = 'mean')

In [26]:
imputer.fit(X_train)

In [27]:
# Let's inspect the learned median values:
imputer.statistics_

array([28.835,  2.75 ,  1.   ,  0.   ,  6.   ])

In [28]:
# transform() returns NumPy arrays by default, so the column names are lost
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [29]:
X_train

array([[4.608e+01, 3.000e+00, 2.375e+00, 8.000e+00, 4.159e+03],
       [1.592e+01, 2.875e+00, 8.500e-02, 0.000e+00, 0.000e+00],
       [3.633e+01, 2.125e+00, 8.500e-02, 1.000e+00, 1.187e+03],
       ...,
       [1.958e+01, 6.650e-01, 1.665e+00, 0.000e+00, 5.000e+00],
       [2.283e+01, 2.290e+00, 2.290e+00, 7.000e+00, 2.384e+03],
       [4.058e+01, 3.290e+00, 3.500e+00, 0.000e+00, 0.000e+00]])
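
Because scikit-learn returns a NumPy array here, the column names and row index are gone. If we want them back, we can wrap the array in a DataFrame again. A minimal sketch follows; the column names are borrowed from the dataset, but the values and index labels are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a non-default index, as produced by train_test_split.
X = pd.DataFrame(
    {"A2": [30.0, np.nan, 50.0], "A3": [1.0, 2.0, np.nan]},
    index=[596, 303, 204],
)

imputer = SimpleImputer(strategy="median")
arr = imputer.fit_transform(X)  # plain NumPy array

# Wrap the array back into a DataFrame, reusing the original labels.
X_imputed = pd.DataFrame(arr, columns=X.columns, index=X.index)
print(X_imputed)
```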

---

In [30]:
from feature_engine.imputation import MeanMedianImputer

X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

# To perform mean imputation, change the imputation method, as follows: MeanMedianImputer(imputation_method='mean').
median_imputer = MeanMedianImputer(imputation_method='median', variables=['A2', 'A3', 'A8', 'A11', 'A15'])

In [31]:
median_imputer.fit(X_train)

In [32]:
median_imputer.imputer_dict_

{'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A15': 6.0}

In [33]:
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

In [34]:
X_train.head()

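|     | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|
| 596 | a | 46.08 | 3.0 | u | g | c | v | 2.375 | t | t | 8 | t | g | 396.0 | 4159 |
| 303 | a | 15.92 | 2.875 | u | g | q | v | 0.085 | f | f | 0 | f | g | 120.0 | 0 |
| 204 | b | 36.33 | 2.125 | y | p | w | v | 0.085 | t | t | 1 | f | g | 50.0 | 1187 |
| 351 | b | 22.17 | 0.585 | y | p | ff | ff | 0.0 | f | f | 0 | f | g | 100.0 | 0 |
| 118 | b | 57.83 | 7.04 | u | g | m | v | 14.0 | t | t | 6 | t | g | 360.0 | 1332 |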


In [35]:
X_train[['A2','A3', 'A8', 'A11', 'A15']].isnull().mean()

A2     0.0
A3     0.0
A8     0.0
A11    0.0
A15    0.0
dtype: float64

## Implementing mode or frequent category imputation

**Mode imputation consists of replacing missing values with the mode.**
- We normally use this procedure on categorical variables, hence the name frequent category imputation.
- Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets.
- Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers.

> If the percentage of missing values is high, frequent category imputation
may distort the original distribution of categories.
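
The distortion mentioned above can be checked by comparing category shares before and after imputation. The following is a pandas-only sketch on made-up values; the series names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up categorical values with gaps.
train = pd.Series(["u", "u", "y", np.nan, "u"], dtype="object")
test = pd.Series([np.nan, "y"], dtype="object")

# mode() can return several values on ties; here we simply take the first.
frequent = train.mode()[0]

# Impute both splits with the category learned from the train split.
train_imputed = train.fillna(frequent)
test_imputed = test.fillna(frequent)

# Category shares before and after, to gauge the distortion.
before = train.value_counts(normalize=True)
after = train_imputed.value_counts(normalize=True)
print(before, after, sep="\n")
```

The share of the most frequent category grows after imputation, and the effect becomes larger as the proportion of missing values increases.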

In [36]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [38]:
X_train, X_test, y_train, y_test = train_test_split(data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, random_state=0)

In [39]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')

In [40]:
imputer.fit(X_train)

In [41]:
imputer.statistics_

array(['u', 'g', 'c', 'v'], dtype=object)

In [42]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

---

In [43]:
from feature_engine.imputation import CategoricalImputer

X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

mode_imputer = CategoricalImputer(imputation_method="frequent", variables=['A4', 'A5', 'A6','A7'])

In [44]:
mode_imputer.fit(X_train)

In [45]:
mode_imputer.imputer_dict_

{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}

In [46]:
X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)

In [47]:
X_train.head()

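|     | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|
| 596 | a | 46.08 | 3.0 | u | g | c | v | 2.375 | t | t | 8 | t | g | 396.0 | 4159 |
| 303 | a | 15.92 | 2.875 | u | g | q | v | 0.085 | f | f | 0 | f | g | 120.0 | 0 |
| 204 | b | 36.33 | 2.125 | y | p | w | v | 0.085 | t | t | 1 | f | g | 50.0 | 1187 |
| 351 | b | 22.17 | 0.585 | y | p | ff | ff | 0.0 | f | f | 0 | f | g | 100.0 | 0 |
| 118 | b | 57.83 | 7.04 | u | g | m | v | 14.0 | t | t | 6 | t | g | 360.0 | 1332 |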


## Replacing missing values with an arbitrary number

**Arbitrary number imputation consists of replacing missing values with an arbitrary value**.
- Some commonly used values include 999, 9999, or -1 for positive distributions. 
- This method is suitable for numerical variables. 
- When replacing missing values with an arbitrary number, we need to be careful **not to select a value close to the mean or the median, or any other common value of the distribution**.

> Arbitrary number imputation **can be used when data is not missing at
random, when we are building non-linear models, and when the
percentage of missing data is high**. This imputation technique distorts the
original variable distribution.
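
Before fitting the imputer, it can help to confirm that the chosen number really lies outside the observed values. A minimal sketch on made-up data is shown below; the column name `x` and the value 99 are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numerical variable with one gap.
X_train = pd.DataFrame({"x": [0.5, 2.0, np.nan, 7.5, 3.0]})

fill_value = 99  # illustrative arbitrary number

# Sanity check: the arbitrary value should sit above the observed maximum,
# so imputed observations cannot be confused with real ones.
assert fill_value > X_train["x"].max()

imputer = SimpleImputer(strategy="constant", fill_value=fill_value)
X_imputed = imputer.fit_transform(X_train)
print(X_imputed.ravel())
```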

In [48]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data[['A2', 'A3', 'A8', 'A11']], data['A16'], test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy='constant', fill_value=99)

imputer.fit(X_train)

In [49]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

## Capturing missing values in a bespoke category

Missing data in **categorical variables can be treated as a separate category, so it is common
to replace missing values with the string `Missing`**.
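
The same idea can be sketched in plain pandas on made-up values, which makes it easy to see the new category appear in the counts. The series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up categorical values with gaps.
col = pd.Series(["u", np.nan, "y", "u", np.nan], dtype="object")

# Replace every missing value with the label 'Missing'.
col_imputed = col.fillna("Missing")

# 'Missing' now shows up as a category of its own.
counts = col_imputed.value_counts()
print(counts)
```

The label should not coincide with a category that already exists in the variable, otherwise the two groups become indistinguishable.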

In [50]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputer.fit(X_train)

In [51]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [52]:
X_train[35:45]

array([['u', 'g', 'c', 'bb'],
       ['y', 'p', 'ff', 'ff'],
       ['y', 'p', 'ff', 'ff'],
       ['u', 'g', 'q', 'v'],
       ['Missing', 'Missing', 'Missing', 'Missing'],
       ['y', 'p', 'c', 'h'],
       ['u', 'g', 'd', 'v'],
       ['y', 'p', 'aa', 'v'],
       ['y', 'p', 'j', 'v'],
       ['u', 'g', 'k', 'v']], dtype=object)

## Replacing missing values with a value at the end of the distribution

**Replacing missing values with a value at the end of the variable distribution is equivalent
to replacing them with an arbitrary value**, but instead of identifying the arbitrary values
manually, these values are **automatically selected** as those at the very end of the variable
distribution. 


**End-of-tail imputation may distort the distribution of the original
variables, so it may not be suitable for linear models.**
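
One common way to pick a value at the right end of the distribution is the inter-quartile range proximity rule, Q3 + k × IQR. The sketch below computes it by hand on made-up numbers. The multiplier k = 3 is an assumption here; it is believed to match what Feature-engine uses by default for `imputation_method='iqr'`, but check the library documentation before relying on it.

```python
import numpy as np
import pandas as pd

# Made-up numerical variable with one gap.
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# Quartiles are computed on the observed values (NaN is skipped).
q1 = train.quantile(0.25)
q3 = train.quantile(0.75)
iqr = q3 - q1

fold = 3  # assumed multiplier k

# Value at the right end of the distribution, learned from the train set.
right_tail_value = q3 + fold * iqr

train_imputed = train.fillna(right_tail_value)
print(right_tail_value)
```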

In [53]:
from feature_engine.imputation import EndTailImputer

data = pd.read_csv('data/creditApprovalUCI.csv')
X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

In [54]:
imputer = EndTailImputer(imputation_method='iqr', tail='right', variables=['A2', 'A3', 'A8', 'A11', 'A15'])

In [55]:
imputer.fit(X_train)

In [56]:
imputer.imputer_dict_

{'A2': 88.18,
 'A3': 27.31,
 'A8': 11.504999999999999,
 'A11': 12.0,
 'A15': 1800.0}

In [57]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)