# Loan Prediction - 02 - Data Preprocessing

Based on the conclusions of the Exploratory Data Analysis, we fill in some of the missing data under the following assumptions:

- Fill in missing Credit_History values with 1.
- Fill in missing Loan_Amount_Term values with 360.
- Fill in missing CoapplicantIncome values with 0.

In [2]:
import sys
sys.path.append('utils')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in matplotlib 3.6 and removed in 3.8

from sklearn.preprocessing import OrdinalEncoder
import preprocess_utils

In [3]:
df_import = pd.read_csv('dataset/train_loan_new_variables.csv')
df_import.drop(columns=['Loan_ID'],inplace = True)
df_import.shape

(614, 14)

## Filling Missing Values

In [4]:
df_fill = df_import.copy()
df_fill.loc[df_fill['Credit_History'].isnull(),'Credit_History'] = 1
df_fill.loc[df_fill['Loan_Amount_Term'].isnull(),'Loan_Amount_Term'] = 360
# df_fill.loc[df_fill['Self_Employed'].isnull(),'Self_Employed'] = 'No'
df_fill.loc[df_fill['CoapplicantIncome'].isnull(),'CoapplicantIncome'] = 0
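
As an alternative to the three `.loc` assignments, the same fill rules could be written as a single `fillna` call with a per-column dictionary. The sketch below runs on a small made-up frame, not on the real dataset, and the names `fill_values` and `df_filled` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the loan data (values are made up).
df = pd.DataFrame({
    'Credit_History': [1.0, np.nan, 0.0],
    'Loan_Amount_Term': [360.0, 180.0, np.nan],
    'CoapplicantIncome': [np.nan, 1508.0, 0.0],
})

# One default per column, mirroring the assumptions listed at the top.
fill_values = {'Credit_History': 1, 'Loan_Amount_Term': 360, 'CoapplicantIncome': 0}

# fillna with a dict only touches the listed columns and only their NaNs.
df_filled = df.fillna(value=fill_values)
print(df_filled)
```

Keeping the defaults in one dictionary makes it easier to apply exactly the same rules to a test set later.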

## Recalculate Base_Loan_Installment and Remaining_Income

In [5]:
base_loan_installment = df_fill['LoanAmount'] * 1000 / df_fill['Loan_Amount_Term']

total_income = df_fill['ApplicantIncome'] + df_fill['CoapplicantIncome']
remaining_income = (total_income - base_loan_installment) / total_income

df_fill['Base_Loan_Installment'] = base_loan_installment
df_fill['Remaining_Income'] = remaining_income
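
To make the two formulas concrete, here is a worked example using the values of the first row shown in the preview at the end of this notebook. It assumes that LoanAmount is expressed in thousands (which is what the factor of 1000 suggests) and that Loan_Amount_Term is a number of months; note that the installment computed this way ignores interest.

```python
# Values taken from the first row of the preview table.
loan_amount = 128.0        # LoanAmount, assumed to be in thousands
loan_amount_term = 360.0   # Loan_Amount_Term, assumed to be in months
applicant_income = 4583
coapplicant_income = 1508.0

# Installment per term, without interest.
base_loan_installment = loan_amount * 1000 / loan_amount_term

# Share of the combined income left after paying the installment.
total_income = applicant_income + coapplicant_income
remaining_income = (total_income - base_loan_installment) / total_income

print(round(base_loan_installment, 6), round(remaining_income, 6))
# 355.555556 0.941626
```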

## Removing NaNs and Checking Valid Values Again

In [7]:
df_fill.dropna(inplace = True)
df_fill.count()

Gender                   535
Married                  535
Dependents               535
Education                535
Self_Employed            535
ApplicantIncome          535
CoapplicantIncome        535
LoanAmount               535
Loan_Amount_Term         535
Credit_History           535
Property_Area            535
Loan_Status              535
Base_Loan_Installment    535
Remaining_Income         535
dtype: int64
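
Before calling `dropna`, it can be useful to see which columns still contain missing values and how many rows the call will discard. The sketch below shows one way to check this on a small made-up frame; the column values and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up).
df = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [128.0, 66.0, np.nan, 141.0],
    'Credit_History': [1.0, 1.0, 1.0, 0.0],
})

# Number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Number of rows that dropna() removes (any row with at least one NaN).
rows_before = len(df)
df_clean = df.dropna()
rows_dropped = rows_before - len(df_clean)
print('Rows dropped:', rows_dropped)
```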

## Encoding Categorical Variables to Numeric Values

In [8]:
def encode_labels(df_input, show_encoding=True, ordinal_encoder=None):
    df = df_input.copy()
    # Fit a new encoder only when none is passed in, so an existing one can be reused
    if ordinal_encoder is None:
        ordinal_encoder = OrdinalEncoder()
        ordinal_encoder.fit(df)
    df = ordinal_encoder.transform(df)

    if show_encoding:
        # Each category is encoded as its position in the sorted list of labels
        for categories in ordinal_encoder.categories_:
            print('Labels:', categories, '| Encoding:', np.arange(categories.shape[0]))
    return df, ordinal_encoder

In [9]:
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
df_encoded = df_fill.copy()
df_encoded[categorical_columns],ordinal_encoder = encode_labels(df_fill[categorical_columns])

Labels: ['Female' 'Male'] | Encoding: [0 1]
Labels: ['No' 'Yes'] | Encoding: [0 1]
Labels: ['0' '1' '2' '3+'] | Encoding: [0 1 2 3]
Labels: ['Graduate' 'Not Graduate'] | Encoding: [0 1]
Labels: ['No' 'Yes'] | Encoding: [0 1]
Labels: ['Rural' 'Semiurban' 'Urban'] | Encoding: [0 1 2]
Labels: ['N' 'Y'] | Encoding: [0 1]
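
The mapping printed above can also be undone. As a minimal sketch on made-up rows (not the real dataset), scikit-learn's `OrdinalEncoder.inverse_transform` turns the numeric codes back into the original labels:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using two of the categorical columns (values are made up).
df = pd.DataFrame({
    'Dependents': ['0', '3+', '1', '2'],
    'Loan_Status': ['Y', 'N', 'Y', 'Y'],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(df)          # labels -> position in the sorted categories
decoded = encoder.inverse_transform(codes)  # codes -> original labels

print(codes)
print(decoded)
```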


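Let us save the categories learned by the encoder, so that the same encoding can be reused later on.

In [10]:
# categories_ is a list of arrays with different lengths, so store it as an object array
# (load it back with np.load(..., allow_pickle=True))
np.save('saves/variable_encoder_categories.npy', np.array(ordinal_encoder.categories_, dtype=object))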
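
One possible way to reuse the saved categories later is to load them and pass them to a new `OrdinalEncoder` through its `categories` parameter. The sketch below is an assumption about how that step might look, not the notebook's actual follow-up code: it uses toy data, a temporary file instead of the `saves/` folder, and illustrative variable names.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the training columns (values are made up).
train = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})
encoder = OrdinalEncoder().fit(train)

# Save the ragged list of category arrays as an object array.
path = os.path.join(tempfile.mkdtemp(), 'variable_encoder_categories.npy')
np.save(path, np.array(encoder.categories_, dtype=object))

# Later: load the categories (object arrays need allow_pickle=True).
saved = np.load(path, allow_pickle=True)
restored = OrdinalEncoder(categories=[list(c) for c in saved])

# With explicit categories, fit only validates the data against them;
# the codes still follow the saved category order.
new_data = pd.DataFrame({'Gender': ['Female'], 'Property_Area': ['Urban']})
encoded = restored.fit(new_data).transform(new_data)
print(encoded)
# [[0. 2.]]
```

Because the category order comes from the saved file, new data is encoded with the same codes as the training set even if some labels do not occur in it.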

In [11]:
df_encoded.reset_index(drop = True, inplace = True)
df_encoded.to_csv('dataset/train_loan_preprocessed.csv',index = False)
df_encoded.head(10)

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | Base_Loan_Installment | Remaining_Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0.0 | 0.0 | 355.555556 | 0.941626 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2.0 | 1.0 | 183.333333 | 0.938889 |
| 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2.0 | 1.0 | 333.333333 | 0.932537 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2.0 | 1.0 | 391.666667 | 0.934722 |
| 4 | 1.0 | 1.0 | 2.0 | 0.0 | 1.0 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 2.0 | 1.0 | 741.666667 | 0.922848 |
| 5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | 2.0 | 1.0 | 263.888889 | 0.93144 |
| 6 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | 1.0 | 0.0 | 438.888889 | 0.920778 |
| 7 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | 2.0 | 1.0 | 466.666667 | 0.915642 |
| 8 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | 1.0 | 0.0 | 969.444444 | 0.959282 |
| 9 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 3200 | 700.0 | 70.0 | 360.0 | 1.0 | 2.0 | 1.0 | 194.444444 | 0.950142 |


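By manually filling in Credit_History, Loan_Amount_Term and CoapplicantIncome, we recovered rows that would otherwise have been discarded: 535 of the original 614 rows remain after dropping the rows that still contain NaNs.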

Also, since we encoded the categorical variables, we can treat them as numerical from now on.