# Machine Learning Imputation - MissForest Imputation

What is MissForest Imputer?

MissForest Imputer is a non-parametric, iterative imputation method for filling in missing values in a dataset. It uses Random Forest models to predict each missing value from the other variables. Unlike simple methods such as mean or median imputation, MissForest can capture complex non-linear relationships within the data.
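
To make the contrast with mean imputation concrete, here is a minimal sketch on synthetic data where one column is a non-linear function of the other. The data, the 20% missing rate, and the helper name `rmse` are assumptions made for illustration, and scikit-learn's IterativeImputer with a random forest is used as a stand-in for MissForest.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y depends on x non-linearly
x = rng.uniform(-3, 3, size=200)
y = x ** 2
X_true = np.column_stack([x, y])

# Hide 20% of the y values
X_missing = X_true.copy()
hidden = rng.choice(200, size=40, replace=False)
X_missing[hidden, 1] = np.nan

# Baseline: fill with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Random-forest based iterative imputation
rf_filled = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
).fit_transform(X_missing)


def rmse(filled):
    """Root mean squared error on the hidden y values (hypothetical helper)."""
    return float(np.sqrt(np.mean((filled[hidden, 1] - X_true[hidden, 1]) ** 2)))


mean_rmse = rmse(mean_filled)
rf_rmse = rmse(rf_filled)
print(f"mean imputation RMSE:          {mean_rmse:.3f}")
print(f"random-forest imputation RMSE: {rf_rmse:.3f}")
```

On data like this, the forest can use x to recover the curved relationship, while the column mean ignores x entirely.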

How it Works:

Initialization:

The algorithm begins by imputing the missing values with an initial guess. For numerical variables, this could be the mean, and for categorical variables, it could be the mode.
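
A minimal sketch of this initial guess with pandas, assuming a small hypothetical frame with one numerical column (`income`) and one categorical column (`segment`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column
df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 61.0],
    "segment": ["a", "b", None, "b"],
})

initial = df.copy()
for col in initial.columns:
    if pd.api.types.is_numeric_dtype(initial[col]):
        fill_value = initial[col].mean()          # mean for numerical columns
    else:
        fill_value = initial[col].mode().iloc[0]  # mode for categorical columns
    initial[col] = initial[col].fillna(fill_value)

print(initial)
```

These values are only a starting point; the iterative step described next overwrites them.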

Iterative Imputation:

The core of MissForest is an iterative process:

* Variable Selection: One variable with missing values is selected as the target variable.
* Model Training: A Random Forest model is trained to predict the target variable based on the other variables in the dataset. The observations with non-missing values in the target variable are used as the training data.
* Imputation: The trained Random Forest model is then used to predict the missing values in the target variable.
* Iteration: This process is repeated for each variable with missing values. After all variables have been imputed, the entire cycle is repeated until a stopping criterion is met. The stopping criterion is usually based on how much the imputed values change between consecutive cycles, often combined with a maximum number of cycles.
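
The steps above can be sketched by hand for numerical data. This is an illustrative sketch rather than a full MissForest implementation: the function name `impute_one_pass`, the toy data, and the fixed number of cycles are assumptions, and categorical columns are not handled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_one_pass(df_filled, missing_mask):
    """Run one cycle over all columns that had missing values (numerical data only)."""
    df_new = df_filled.copy()
    # Visit columns from fewest to most missing values
    order = missing_mask.sum().sort_values().index
    for target in order:
        rows_missing = missing_mask[target]
        if not rows_missing.any():
            continue
        predictors = df_new.drop(columns=target)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        # Train on the rows where the target was actually observed
        model.fit(predictors[~rows_missing], df_new.loc[~rows_missing, target])
        # Overwrite the current guesses for the rows where it was missing
        df_new.loc[rows_missing, target] = model.predict(predictors[rows_missing])
    return df_new


# Hypothetical numerical data with one missing value per column
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [1.0, 1.5, 2.0, 2.5, np.nan, 3.5],
})
mask = df.isna()

filled = df.fillna(df.mean())  # initial guess: column means
for _ in range(3):             # fixed number of cycles, for brevity
    filled = impute_one_pass(filled, mask)

print(filled)
```

Within a cycle, each model sees the most recent imputations of the other columns, which is what lets the guesses improve from one cycle to the next.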

Final Imputation:

The algorithm terminates when the stopping criterion is satisfied, and the last set of imputed values is used as the final result.
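
One way to measure the change between cycles for numerical variables is the sum of squared differences between consecutive imputed matrices, scaled by the sum of squares of the newer one. The sketch below assumes that measure; the function name `normalized_change`, the example matrices, and the fixed `tolerance` are hypothetical. The original MissForest formulation instead stops the first time this quantity increases, so the threshold here is a simplification.

```python
import numpy as np


def normalized_change(previous, current):
    """Squared difference between two imputed matrices, scaled by the current one."""
    previous = np.asarray(previous, dtype=float)
    current = np.asarray(current, dtype=float)
    return float(np.sum((current - previous) ** 2) / np.sum(current ** 2))


# Hypothetical imputed matrices from three successive cycles
cycle_1 = np.array([[1.0, 2.0], [3.0, 4.0]])
cycle_2 = np.array([[1.0, 2.5], [3.0, 4.0]])
cycle_3 = np.array([[1.0, 2.6], [3.0, 4.0]])

change_12 = normalized_change(cycle_1, cycle_2)
change_23 = normalized_change(cycle_2, cycle_3)

tolerance = 1e-3  # hypothetical threshold
converged = change_23 < tolerance
print(change_12, change_23, converged)
```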

Example:

Imagine you have a dataset of customer information (age, income, and purchase history) with some missing values. MissForest would:

* Start by filling in the missing age, income, and purchase history values with initial guesses.
* Iteratively refine these guesses:
    * When imputing missing ages, it would train a Random Forest model to predict age from income and purchase history, then use that model to predict the missing ages.
    * This process would be repeated for income and purchase history, and the whole cycle would be repeated.
* The algorithm would stop when the imputed values stabilize, and you would have a complete dataset.
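
A single refinement step from this example can be sketched as follows. The customer records are made up for illustration, and to keep the sketch short only `age` has missing values, so no initial guess is needed for the predictor columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical customer records; the values are made up for illustration
customers = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, np.nan, 38.0, 29.0],
    "income": [30_000.0, 42_000.0, 58_000.0, 90_000.0,
               75_000.0, 35_000.0, 61_000.0, 33_000.0],
    "purchases": [3.0, 5.0, 8.0, 12.0, 10.0, 4.0, 7.0, 3.0],
})

age_missing = customers["age"].isna()
features = customers[["income", "purchases"]]

# Train on customers whose age is known
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features[~age_missing], customers.loc[~age_missing, "age"])

# Predict the age of customers whose age is missing
customers.loc[age_missing, "age"] = model.predict(features[age_missing])
print(customers)
```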

# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the utility function for MissForest imputation
4. Execution of the utility function


# 1. Import necessary dependencies

In [2]:
# Libraries

import pandas as pd
import numpy as np

# IterativeImputer is experimental: this import must run first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# 2. Create the dataset

In [3]:
# Sample DataFrame with missing values

data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})

In [4]:
print("Original Data:\n")
data

Original Data:



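| | CustomerID | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|
| 0 | 1 | 500.0 | 0.1 | 2.0 |
| 1 | 2 | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | NaN | 0.15 | 4.0 |
| 3 | 4 | 600.0 | NaN | 2.0 |
| 4 | 5 | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | 100.0 | 0.2 | 1.0 |
| 7 | 8 | 700.0 | 0.12 | NaN |
| 8 | 9 | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | NaN | 0.5 | 3.0 |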


# 3. Define the utility function for MissForest imputation

The missforest_imputation function imputes missing values in a DataFrame with a MissForest-style procedure built on scikit-learn's IterativeImputer. It iteratively predicts the missing values of each column with a Random Forest model, using the other columns as predictors, until the imputer's stopping criterion (its tolerance or its maximum number of iterations) is met, and returns the completed DataFrame. Because it relies on RandomForestRegressor, this version handles numerical columns only.

The missforest_imputation(df) function:
* Takes the input DataFrame as an argument.
* Creates a copy of the DataFrame to avoid modifying the original data.
* Initializes an IterativeImputer object from scikit-learn that uses RandomForestRegressor to estimate missing values.
* Applies the imputer to the entire DataFrame, iteratively imputing missing values in each column based on the other columns. Every column is passed in, so the CustomerID identifier also acts as a predictor.
* Returns the DataFrame with imputed values.

In [5]:
def missforest_imputation(df):
    """Imputes missing values using MissForest (IterativeImputer with RandomForestRegressor).

    Args:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: A new DataFrame with the missing values imputed.
    """
    df_imputed = df.copy()

    # Round-robin imputation: each column is modelled in turn by a
    # RandomForestRegressor trained on the other columns
    imputer = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)

    # Fit on all columns and write the imputed values back into the copy
    df_imputed[:] = imputer.fit_transform(df_imputed)

    return df_imputed
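
As a possible variation, identifier columns can be kept out of the model and the forest itself can be seeded for reproducibility. The sketch below is an assumption-laden alternative, not the function used in this notebook: `impute_without_ids`, `id_columns`, and `seed` are hypothetical names, and its imputed values will generally differ from the ones shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def impute_without_ids(df, id_columns, max_iter=10, seed=0):
    """Hypothetical variant: impute feature columns only, pass identifiers through."""
    feature_cols = [c for c in df.columns if c not in id_columns]
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
        max_iter=max_iter,
        random_state=seed,
    )
    imputed = pd.DataFrame(
        imputer.fit_transform(df[feature_cols]),
        columns=feature_cols,
        index=df.index,
    )
    # Reattach the untouched identifier columns in the original column order
    return pd.concat([df[id_columns], imputed], axis=1)[df.columns]


# Same sample data as in section 2
data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})

result = impute_without_ids(data, id_columns=['CustomerID'])
print(result)
```

Leaving CustomerID out avoids treating an arbitrary row label as if it carried information about purchases or delivery times.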

# 4. Execution of the utility function

In [6]:
# Perform MissForest imputation

data_imputed = missforest_imputation(data)  # the function copies its input, so data is left unchanged



In [7]:
print("\nData after MissForest Imputation:\n")
data_imputed


Data after MissForest Imputation:



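| | CustomerID | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|
| 0 | 1 | 500.0 | 0.1 | 2.0 |
| 1 | 2 | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | 1018.0 | 0.15 | 4.0 |
| 3 | 4 | 600.0 | 0.1177 | 2.0 |
| 4 | 5 | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | 100.0 | 0.2 | 1.0 |
| 7 | 8 | 700.0 | 0.12 | 2.77 |
| 8 | 9 | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | 837.0 | 0.5 | 3.0 |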


All NaN values in the numerical columns have been imputed by the MissForest imputer, and the originally observed values are unchanged.
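
A quick sanity check along these lines can be sketched as below. The helper `check_imputation` is a hypothetical name, and the two columns are copied from the original and imputed tables in this notebook.

```python
import numpy as np
import pandas as pd


def check_imputation(original, imputed):
    """Return (no_missing_left, observed_unchanged) for a pair of DataFrames."""
    no_missing_left = not imputed.isna().any().any()
    observed = original.notna().to_numpy()
    observed_unchanged = bool(np.allclose(
        imputed.to_numpy(dtype=float)[observed],
        original.to_numpy(dtype=float)[observed],
    ))
    return no_missing_left, observed_unchanged


# Two columns copied from the original and imputed tables
original = pd.DataFrame({
    "Purchase_Amount": [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    "Delivery_Time_Days": [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3],
})
imputed = pd.DataFrame({
    "Purchase_Amount": [500, 1200, 1018.0, 600, 1500, 2000, 100, 700, 1300, 837.0],
    "Delivery_Time_Days": [2, 3, 4, 2, 5, 6, 1, 2.77, 4, 3],
})

print(check_imputation(original, imputed))
```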