# Loan Prediction 04 - Data Imputation With Random Forest

Let us try to improve on the previous results by imputing the missing data with a Random Forest.

In [1]:
import sys

sys.path.append('utils')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'

from missingpy import MissForest
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

import dataframe_utils

ImportError: No module named 'missingpy'

### Loading original dataset

In [None]:
df_import = pd.read_csv('dataset/train_loan.csv')
df_import.drop(columns=['Loan_ID'],inplace = True)
df_import.shape

### Counting missing values by column

In [None]:
df_import.isnull().sum()

### Counting rows with at least one missing value

In [None]:
nulls = (df_import.isnull().sum(axis = 1) > 0)
df_null_rows = df_import.loc[nulls,:]
df_null_rows.shape[0]
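
The two counts above can be checked by hand on a small frame. The sketch below uses a made-up toy DataFrame (the values are illustrative and not taken from `train_loan.csv`) to show how `isnull().sum()` behaves along each axis.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only (not from the loan dataset)
toy = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Female', 'Male'],
    'LoanAmount': [120.0, 66.0, np.nan, np.nan],
    'Credit_History': [1.0, 1.0, 0.0, np.nan],
})

missing_per_column = toy.isnull().sum()       # one count per column
missing_per_row = toy.isnull().sum(axis=1)    # one count per row
rows_with_missing = int((missing_per_row > 0).sum())

print(missing_per_column.to_dict())  # {'Gender': 1, 'LoanAmount': 2, 'Credit_History': 1}
print(missing_per_row.tolist())      # [0, 1, 1, 2]
print(rows_with_missing)             # 3
```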

## Replacing missing values with MissForest

Let us prepare the dataset for the MissForest algorithm by encoding the categorical columns as integer codes while leaving the missing values in place.

In [None]:
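def encode_with_nan(df_input, categorical_columns, ordinal_encoder):
    """Replace each categorical label with its integer code, leaving NaN untouched."""
    df = df_input.copy()
    # categories_[i] holds the known labels of categorical_columns[i]
    for category, col in zip(ordinal_encoder.categories_, categorical_columns):
        for index, label in enumerate(category):
            # NaN never compares equal to a label, so missing entries are skipped
            df.loc[df[col] == label, col] = index
    return df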

In [None]:
ordinal_encoder = OrdinalEncoder()
# Reuse the saved categories instead of calling fit; their order must match categorical_columns
ordinal_encoder.categories_ = np.load('utils/variable_encoder_categories.npy', allow_pickle=True)
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
df_encoded_nans = encode_with_nan(df_import, categorical_columns, ordinal_encoder)
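
To see what this encoding does to a single column, here is a minimal sketch on a toy Series. The category list is a hypothetical stand-in for one entry of `ordinal_encoder.categories_`, and the sketch uses `Series.map` rather than the `encode_with_nan` loop, so the result has a float dtype instead of the object dtype the loop leaves behind.

```python
import numpy as np
import pandas as pd

# Hypothetical categories, standing in for one entry of ordinal_encoder.categories_
categories = np.array(['Rural', 'Semiurban', 'Urban'], dtype=object)
toy = pd.Series(['Urban', np.nan, 'Rural', 'Semiurban'], name='Property_Area')

# Map each label to its position; values missing from the mapping (NaN here) stay NaN
label_to_index = {label: index for index, label in enumerate(categories)}
encoded = toy.map(label_to_index)

print(encoded.tolist())  # [2.0, nan, 0.0, 1.0]
```

The missing entry survives the encoding, which is what lets the imputer find and fill it later.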

Although Credit_History and Loan_Amount_Term are stored as numbers, we will treat them as categorical variables because each takes only a small set of distinct values, as shown below.

In [None]:
dataframe_utils.show_column_options(df_import[['Credit_History','Loan_Amount_Term']])
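
`dataframe_utils.show_column_options` is a project-local helper whose source is not included here. As a rough idea of the kind of check it performs, the sketch below defines a hypothetical `column_options` function that lists the distinct non-null values of each column; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def column_options(df):
    """Hypothetical stand-in: map each column to its sorted distinct non-null values."""
    return {col: sorted(df[col].dropna().unique().tolist()) for col in df.columns}

# Toy frame with illustrative values only
toy = pd.DataFrame({
    'Credit_History': [1.0, 0.0, np.nan, 1.0],
    'Loan_Amount_Term': [360.0, 180.0, 360.0, np.nan],
})

options = column_options(toy)
print(options)  # {'Credit_History': [0.0, 1.0], 'Loan_Amount_Term': [180.0, 360.0]}
```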

In [None]:
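# Positions in X of the columns treated as categorical. With the usual column order of
# this dataset these are the five leading categorical columns plus Loan_Amount_Term,
# Credit_History and Property_Area.
categorical_index = [0, 1, 2, 3, 4, 8, 9, 10]
X = df_encoded_nans.drop(columns=['Loan_Status'])
y = df_encoded_nans[['Loan_Status']]
imputer = MissForest()
imputer.fit(X, y, cat_vars=categorical_index)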

In [None]:
X_filled = imputer.transform(X)
df_filled = pd.DataFrame(X_filled,columns = X.columns)
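
The import cell at the top of this notebook failed because `missingpy` was not installed. Where installing it is not an option, a related but not identical approach can be sketched with scikit-learn's experimental `IterativeImputer` driven by a `RandomForestRegressor`. This is a stand-in under stated assumptions, not MissForest itself: the data below is synthetic, and every column is treated as numeric, so categorical codes would still need rounding or separate handling.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Synthetic data (assumption): the third column depends on the first two
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
X_toy[:, 2] = X_toy[:, 0] + X_toy[:, 1]

# Hide roughly 10% of the entries
mask = rng.random(X_toy.shape) < 0.1
X_missing = X_toy.copy()
X_missing[mask] = np.nan

# Each column with gaps is predicted in turn from the other columns by a random forest
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = rf_imputer.fit_transform(X_missing)

print(np.isnan(X_missing).sum(), "missing before;", np.isnan(X_imputed).sum(), "after")
```

Observed entries are passed through unchanged; only the hidden entries are replaced by predictions.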

In [None]:
df_imputed = df_filled.join(y)
df_imputed.head(20)

In [None]:
df_encoded_nans.head(20)

In [None]:
df_imputed.to_csv('dataset/train_rf_imputed.csv')
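
The call above keeps the pandas default `index=True`, so the row index is written as an unnamed first column. Whether that is wanted depends on how the next notebook reads the file, which is not shown here. The sketch below, on a made-up two-row frame, shows the difference after a round trip through CSV.

```python
import io
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({'LoanAmount': [120.0, 66.0], 'Loan_Status': [1, 0]})

# to_csv() with no path returns the CSV text, which is read straight back
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'LoanAmount', 'Loan_Status']
print(without_index.columns.tolist())  # ['LoanAmount', 'Loan_Status']
```

If the extra column is kept, reading the file back with `index_col=0` restores the original layout.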

We were able to impute all missing data with the MissForest algorithm.

Now, let us see how the models will perform with this new dataset in the next notebook.