# Missing data imputation

Missing data refers to the absence of values for certain observations and is an unavoidable
problem in most data sources. Most scikit-learn estimators do not accept missing values as input, so we
need to either remove observations with missing data or replace the missing values with permitted ones.

**The act of replacing missing data with statistical estimates of missing values is
called imputation.**



In [3]:
import random
import pandas as pd
import numpy as np

## Removing observations with missing data

**Complete Case Analysis (CCA)**, also called list-wise deletion of cases, consists
of **discarding those observations where the values in any of the variables are missing**. 
- CCA can be applied to categorical and numerical variables. 
- CCA is quick and easy to implement and has the advantage that it **preserves the distribution of the variables**, provided the data is missing at random and only a **small proportion of the data is missing**. 
- However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
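
To make the last point concrete, here is a minimal sketch on a small synthetic DataFrame. The column names and values are invented for illustration and are not taken from the listings data. Each column is missing only one or two values, yet list-wise deletion discards most rows because the gaps fall on different observations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column has only 1-2 missing values,
# but they fall on different rows.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0, 95.0, 110.0],
    "bedrooms": [1.0, 2.0, np.nan, 3.0, 2.0, np.nan],
    "room_type": ["private", "entire", "entire", None, "shared", "private"],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Complete case analysis: keep only rows with no missing value in any column
toy_cca = toy.dropna()

# Fraction of observations that survive the deletion
retained = len(toy_cca) / len(toy)

print(missing_pct.round(1))
print(f"Rows retained: {len(toy_cca)} of {len(toy)} ({retained:.0%})")
```

If only some variables matter for the analysis, `dropna(subset=[...])` restricts the deletion to those columns and usually keeps more rows.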

In [None]:
# Load the data with the following command
data = pd.read_csv("data/boston_listings.csv")

In [None]:
data.head()

In [None]:
# Percentage of missing values per variable, sorted from least to most missing
data.isnull().mean().sort_values(ascending=True) * 100

In [None]:
# Drop every row that has a missing value in at least one variable
data_cca = data.dropna()

In [None]:
print(f'Number of total observations: {len(data)}')
print(f'Number of observations with complete cases: {len(data_cca)}')

## Performing mean or median imputation

**Mean or median imputation consists of replacing missing values with the variable mean or
median**. 
- This can only be applied to numerical variables.
- The **mean or the median is calculated using the train set**, and these values are used to impute missing data in the train and test sets, as well as in any future data we intend to score with the machine learning model.
- Therefore, we need to store these mean and median values. **Scikit-learn and Feature-engine transformers learn the parameters from the train set and store them for future use**.

> **Use mean imputation if variables are normally distributed** and **median
imputation otherwise**. Mean and median imputation may distort the
distribution of the original variables if there is a high percentage of
missing data.
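
The following sketch illustrates both points in the note on a made-up, right-skewed variable (the values are hypothetical). The mean is pulled towards the extreme value while the median is not, and filling the gaps with a single constant reduces the spread of the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: mostly small values plus one extreme value
income = pd.Series([20.0, 22.0, 25.0, 27.0, 30.0, 400.0, np.nan, np.nan])

mean_value = income.mean()      # pulled upwards by the extreme value
median_value = income.median()  # robust to the extreme value

imputed_mean = income.fillna(mean_value)
imputed_median = income.fillna(median_value)

print(f"mean: {mean_value:.1f}, median: {median_value:.1f}")

# Imputing a constant shrinks the spread of the variable
print(f"std before imputation: {income.std():.1f}")
print(f"std after median imputation: {imputed_median.std():.1f}")
```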

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In mean and median imputation, the **mean or median values should be
calculated using the variables in the train set**.
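
As a rough pandas-only sketch of this idea, the snippet below learns the medians from a hypothetical train set, stores them in a dictionary, and reuses them on a hypothetical test set. The data is invented for illustration; the transformers used in the rest of this section do the same bookkeeping for us.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets (values invented for illustration)
train = pd.DataFrame({"A2": [20.0, 30.0, np.nan, 40.0], "A3": [1.0, np.nan, 3.0, 5.0]})
test = pd.DataFrame({"A2": [np.nan, 25.0], "A3": [2.0, np.nan]})

# Learn the imputation values from the train set only and store them
medians = train.median().to_dict()

# Reuse the stored values on both sets
train_imputed = train.fillna(value=medians)
test_imputed = test.fillna(value=medians)

print(medians)
print(test_imputed)
```

The test set never contributes to the stored values, which avoids leaking information from the test data into the imputation step.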

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['A2', 'A3', 'A8', 'A11', 'A15']], data['A16'], test_size=0.3, random_state=0)

> `SimpleImputer()` from scikit-learn imputes every column in the dataset
it is given. Therefore, **if we use mean or median imputation and the dataset
contains categorical variables, we will get an error**.
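
One possible way around this, shown here as a sketch rather than as the approach used in this notebook, is to wrap the imputer in scikit-learn's `ColumnTransformer` so that it only sees the numerical columns. The toy DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type data (column names are illustrative)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "debt": [1.5, 0.5, np.nan, 2.5],
    "employer": ["a", "b", None, "a"],
})

numeric_cols = ["age", "debt"]

# Apply median imputation to the numeric columns only;
# the remaining columns are passed through unchanged.
ct = ColumnTransformer(
    transformers=[("median_imputer", SimpleImputer(strategy="median"), numeric_cols)],
    remainder="passthrough",
)

ct.fit(toy)
result = ct.transform(toy)

# Learned medians for the numeric columns
print(ct.named_transformers_["median_imputer"].statistics_)
```

In the output, the imputed numerical columns come first, followed by the passed-through columns.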

In [None]:
X_train.isnull().mean() * 100

In [None]:
imputer = SimpleImputer(strategy='median')
# imputer = SimpleImputer(strategy = 'mean')

In [None]:
imputer.fit(X_train)

In [None]:
# Let's inspect the learned median values:
imputer.statistics_

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
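
Note that `SimpleImputer.transform()` returns NumPy arrays by default, so the column names and index are lost. A minimal sketch of how to restore them, using a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with a shuffled index, as produced by a train/test split
X = pd.DataFrame(
    {"A2": [30.0, np.nan, 22.0], "A3": [1.0, 4.0, np.nan]},
    index=[7, 2, 5],
)

imputer = SimpleImputer(strategy="median")
imputer.fit(X)

# transform() returns a NumPy array; rebuild the DataFrame with the original labels
X_imputed = pd.DataFrame(imputer.transform(X), columns=X.columns, index=X.index)

print(X_imputed)
```

Recent scikit-learn versions also offer `imputer.set_output(transform="pandas")`, which has the same effect if the installed version supports it.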

---

In [None]:
from feature_engine.imputation import MeanMedianImputer

X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

# To perform mean imputation, change the imputation method, as follows: MeanMedianImputer(imputation_method='mean').
median_imputer = MeanMedianImputer(imputation_method='median', variables=['A2', 'A3', 'A8', 'A11', 'A15'])

In [None]:
median_imputer.fit(X_train)

In [None]:
median_imputer.imputer_dict_

In [None]:
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train[['A2','A3', 'A8', 'A11', 'A15']].isnull().mean()

## Implementing mode or frequent category imputation

**Mode imputation consists of replacing missing values with the mode.** 
- We normally use this procedure with categorical variables, hence the name frequent category imputation.
- Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets.
- Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers.

> If the percentage of missing values is high, frequent category imputation
may distort the original distribution of categories.
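
The distortion can be seen in a small sketch with a made-up categorical variable in which 40% of the values are missing. After imputation, the most frequent category is over-represented relative to its share among the observed values.

```python
import pandas as pd

# Hypothetical categorical variable with 40% missing values
colour = pd.Series(["red", "red", "red", "blue", "blue", "green", None, None, None, None])

# Most frequent category among the observed values
mode_value = colour.mode()[0]
imputed = colour.fillna(mode_value)

# Share of each category before (observed values only) and after imputation
before = colour.value_counts(normalize=True)
after = imputed.value_counts(normalize=True)

print(pd.DataFrame({"before": before, "after": after}))
```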

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, random_state=0)

In [None]:
imputer = SimpleImputer(strategy='most_frequent')

In [None]:
imputer.fit(X_train)

In [None]:
imputer.statistics_

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

---

In [None]:
from feature_engine.imputation import CategoricalImputer

X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

mode_imputer = CategoricalImputer(imputation_method="frequent", variables=['A4', 'A5', 'A6','A7'])

In [None]:
mode_imputer.fit(X_train)

In [None]:
mode_imputer.imputer_dict_

In [None]:
X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)

In [None]:
X_train.head()

## Replacing missing values with an arbitrary number

**Arbitrary number imputation consists of replacing missing values with an arbitrary value**.
- Commonly used values include 999, 9999, or, for variables that take only positive values, -1.
- This method is suitable for numerical variables. 
- When replacing missing values with an arbitrary number, we need to be careful **not to select a value close to the mean or the median, or any other common value of the distribution**.

> Arbitrary number imputation **can be used when data is not missing at
random, when we are building non-linear models, and when the
percentage of missing data is high**. This imputation technique distorts the
original variable distribution.
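
Before settling on an arbitrary value, it can help to check that it lies outside the observed range of the variable. The sketch below does this with pandas on a hypothetical variable; `fillna()` is used here only as a lightweight stand-in for a fitted transformer.

```python
import numpy as np
import pandas as pd

# Hypothetical positive-valued variable with missing entries
age = pd.Series([23.0, 35.0, np.nan, 41.0, 58.0, np.nan, 30.0])

arbitrary_value = 999

# Sanity check: the chosen value should lie outside the observed range
assert not (age.min() <= arbitrary_value <= age.max())

# Replace missing entries with the arbitrary value
age_imputed = age.fillna(arbitrary_value)

print(age_imputed.tolist())
```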

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data[['A2', 'A3', 'A8', 'A11']], data['A16'], test_size=0.3, random_state=0)

# Replace missing values in every column with the arbitrary value 99
imputer = SimpleImputer(strategy='constant', fill_value=99)

imputer.fit(X_train)

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

## Capturing missing values in a bespoke category

Missing data in **categorical variables can be treated as a separate category, so it is common
to replace missing values with the string `Missing`**.
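
As a quick illustration on a hypothetical variable, the same replacement can be sketched in pandas. After imputation, `Missing` is counted like any other category.

```python
import pandas as pd

# Hypothetical categorical variable with missing entries
employment = pd.Series(["salaried", None, "self-employed", "salaried", None])

# Treat missing entries as a category of their own
employment_imputed = employment.fillna("Missing")

print(employment_imputed.value_counts())
```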

In [None]:
data = pd.read_csv('data/creditApprovalUCI.csv')

X_train, X_test, y_train, y_test = train_test_split(data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputer.fit(X_train)

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
X_train[35:45]

## Replacing missing values with a value at the end of the distribution

**Replacing missing values with a value at the end of the variable distribution is equivalent
to replacing them with an arbitrary value**, but instead of identifying the arbitrary values
manually, these values are **automatically selected** as those at the very end of the variable
distribution. 


> **End-of-tail imputation may distort the distribution of the original
variables, so it may not be suitable for linear models.**
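
Two rules are commonly used to pick a value at the right end of the distribution: the IQR proximity rule and a Gaussian approximation. The sketch below computes both by hand on a hypothetical variable. The multiplier `fold = 3` is an assumption made for illustration; the exact formula and defaults applied by `EndTailImputer` should be checked against the Feature-engine documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical variable with missing entries
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan])

# Multiplier controlling how far into the tail the value lies (assumed here)
fold = 3

# IQR proximity rule, right tail: 75th percentile + fold * IQR
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr_value = q3 + fold * (q3 - q1)

# Gaussian approximation, right tail: mean + fold * standard deviation
gaussian_value = x.mean() + fold * x.std()

# Impute with the IQR-based value
x_imputed = x.fillna(iqr_value)

print(f"IQR-based value: {iqr_value}")
print(f"Gaussian-based value: {gaussian_value:.2f}")
```

Both values sit beyond the largest observed value, which is why this method behaves like arbitrary number imputation with an automatically chosen number.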

In [6]:
from feature_engine.imputation import EndTailImputer

data = pd.read_csv('data/creditApprovalUCI.csv')
X_train, X_test, y_train, y_test = train_test_split(data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

In [None]:
imputer = EndTailImputer(imputation_method='iqr', tail='right', variables=['A2', 'A3', 'A8', 'A11', 'A15'])

In [None]:
imputer.fit(X_train)

In [None]:
imputer.imputer_dict_

In [None]:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)