# Loan Prediction - 02 - Data Preprocessing

Based on the conclusions of the Exploratory Data Analysis, we fill in some of the missing data under the following assumptions:

- Fill in Credit_History with 0 for clients who have this value missing and had their loan application rejected.
- Fill in Loan_Amount_Term with 360 for clients who have this value missing.
- Fill in CoapplicantIncome with 0 where it is missing.
- Fill in Self_Employed with No where it is missing.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
# Note: from Matplotlib 3.6 onwards this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')

from sklearn.preprocessing import OrdinalEncoder

In [26]:
df_import = pd.read_csv('dataset/train_loan_new_variables.csv')
df_import.drop(columns=['Loan_ID'],inplace = True)
df_import.shape

(614, 14)

## Filling Missing Values

In [27]:
df_fill = df_import.copy()
df_fill.loc[(df_fill['Loan_Status'] == 'N') & (df_fill['Credit_History'].isnull()),'Credit_History'] = 0
df_fill.loc[df_fill['Loan_Amount_Term'].isnull(),'Loan_Amount_Term'] = 360
df_fill.loc[df_fill['Self_Employed'].isnull(),'Self_Employed'] = 'No'
df_fill.loc[df_fill['CoapplicantIncome'].isnull(),'CoapplicantIncome'] = 0
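
To see what these fill rules do and what they leave untouched, the sketch below applies them to a small toy frame and counts missing values before and after. The toy values are hypothetical and the `toy`, `missing_before` and `missing_after` names are only for illustration; the column names match the dataset. Note that a missing Credit_History on an approved loan is intentionally left as NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), using the same column names as the dataset.
toy = pd.DataFrame({
    'Loan_Status': ['N', 'Y', 'N', 'Y'],
    'Credit_History': [np.nan, np.nan, 1.0, 1.0],
    'Loan_Amount_Term': [360.0, np.nan, 180.0, 360.0],
    'Self_Employed': ['No', None, 'Yes', None],
    'CoapplicantIncome': [0.0, 1500.0, np.nan, 0.0],
})

missing_before = toy.isnull().sum()

# Credit_History is filled only for rejected applications.
rejected_no_history = (toy['Loan_Status'] == 'N') & toy['Credit_History'].isnull()
toy.loc[rejected_no_history, 'Credit_History'] = 0

# The remaining rules are unconditional, so a per-column fillna is enough.
toy = toy.fillna({'Loan_Amount_Term': 360, 'Self_Employed': 'No', 'CoapplicantIncome': 0})

missing_after = toy.isnull().sum()
print(pd.DataFrame({'before': missing_before, 'after': missing_after}))
```

Any value still missing after this step (here, one Credit_History entry) is what a later `dropna` would remove.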

## Removing NaNs and Checking Valid Values Again

In [28]:
df_fill.dropna(inplace = True)
df_fill.count()

Gender                   517
Married                  517
Dependents               517
Education                517
Self_Employed            517
ApplicantIncome          517
CoapplicantIncome        517
LoanAmount               517
Loan_Amount_Term         517
Credit_History           517
Property_Area            517
Loan_Status              517
Base_Loan_Installment    517
Remaining_Income         517
dtype: int64
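
As a quick check on how much data survives the NaN removal, the two row counts reported above (614 rows at import, 517 after `dropna`) can be compared directly. The variable names below are only for illustration.

```python
rows_imported = 614   # first element of df_import.shape
rows_kept = 517       # per-column count after dropna

rows_dropped = rows_imported - rows_kept
kept_share = rows_kept / rows_imported

print(f"dropped {rows_dropped} rows, kept {kept_share:.1%}")
# prints: dropped 97 rows, kept 84.2%
```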

## Encoding Categorical Variables to Numeric Values

In [29]:
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
df_fill[categorical_columns].head()

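| | Gender | Married | Dependents | Education | Self_Employed | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|
| 1 | Male | Yes | 1 | Graduate | No | Rural | N |
| 2 | Male | Yes | 0 | Graduate | Yes | Urban | Y |
| 3 | Male | Yes | 0 | Not Graduate | No | Urban | Y |
| 4 | Male | No | 0 | Graduate | No | Urban | Y |
| 5 | Male | Yes | 2 | Graduate | Yes | Urban | Y |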


In [30]:
def encode_labels(df_input, show_encoding=True, ordinal_encoder=None):
    df = df_input.copy()
    # Fit a new encoder only when none is passed in, so that an
    # already fitted encoder can be reused on other data.
    if ordinal_encoder is None:
        ordinal_encoder = OrdinalEncoder()
        ordinal_encoder.fit(df)
    df = ordinal_encoder.transform(df)

    if show_encoding:
        # Each label is encoded as its position in the sorted category list.
        for labels in ordinal_encoder.categories_:
            print('Labels:', labels, '| Encoding:', np.arange(labels.shape[0]))
    return df, ordinal_encoder

In [31]:
df_encoded = df_fill.copy()
df_encoded[categorical_columns], ordinal_encoder = encode_labels(df_fill[categorical_columns])

Labels: ['Female' 'Male'] | Encoding: [0 1]
Labels: ['No' 'Yes'] | Encoding: [0 1]
Labels: ['0' '1' '2' '3+'] | Encoding: [0 1 2 3]
Labels: ['Graduate' 'Not Graduate'] | Encoding: [0 1]
Labels: ['No' 'Yes'] | Encoding: [0 1]
Labels: ['Rural' 'Semiurban' 'Urban'] | Encoding: [0 1 2]
Labels: ['N' 'Y'] | Encoding: [0 1]
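
The mapping printed above can be applied to new rows and reversed with the same fitted encoder. The following is a minimal sketch on toy data (hypothetical rows, only two of the categorical columns), assuming scikit-learn's `OrdinalEncoder` with its default settings; the `train`, `new_rows`, `new_codes` and `decoded` names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training data (hypothetical rows) with two of the categorical columns.
train = pd.DataFrame({
    'Property_Area': ['Rural', 'Semiurban', 'Urban', 'Urban'],
    'Loan_Status': ['N', 'Y', 'Y', 'N'],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(train)

# Categories are sorted per column; a label's code is its position in that list.
print(encoder.categories_)

# Reuse the fitted encoder on new rows without refitting.
new_rows = pd.DataFrame({'Property_Area': ['Urban', 'Rural'],
                         'Loan_Status': ['Y', 'N']})
new_codes = encoder.transform(new_rows)

# Map the numeric codes back to the original labels.
decoded = encoder.inverse_transform(new_codes)
print(new_codes)
print(decoded)
```

With the default settings, `transform` raises an error for a label that was not seen during fitting, which is a useful safeguard when encoding a separate test set.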


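Let us save the categories learned by the encoder, so that the same encoding can be rebuilt and reused later on.

In [32]:
# The per-column category arrays have different lengths, so they are stored as an object array.
np.save('utils/variable_encoder_categories.npy', np.array(ordinal_encoder.categories_, dtype=object))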
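
One possible way to read such a file back and rebuild an equivalent encoder is sketched below. It is self-contained, so it writes to a temporary directory rather than to `utils/`; the `categories` list is a hypothetical two-column stand-in for `ordinal_encoder.categories_`, and the file name and the `restored`, `sample` and `codes` names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in for ordinal_encoder.categories_ (hypothetical two-column subset).
categories = [np.array(['Female', 'Male'], dtype=object),
              np.array(['0', '1', '2', '3+'], dtype=object)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'categories.npy')  # hypothetical file name
    # Arrays of different lengths are stored as a 1-D object array.
    np.save(path, np.array(categories, dtype=object))
    # Object arrays are pickled on disk, so loading needs allow_pickle=True.
    loaded = np.load(path, allow_pickle=True)

# Rebuild an encoder with the saved categories. fit() is still required
# before transform(), but the label-to-code mapping is fixed by `categories`.
restored = OrdinalEncoder(categories=list(loaded))
sample = pd.DataFrame({'Gender': ['Male', 'Female'], 'Dependents': ['3+', '0']})
codes = restored.fit(sample).transform(sample)
print(codes)
```

Passing `categories` explicitly keeps the codes identical to the original encoding even when the data used for fitting does not contain every label.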

In [33]:
df_encoded.reset_index(drop=True, inplace=True)
df_encoded.to_csv('dataset/train_loan_preprocessed.csv',index = False)
df_encoded.head(10)

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | Base_Loan_Installment | Remaining_Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0.0 | 0.0 | 359.111111 | 0.941042 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2.0 | 1.0 | 185.166667 | 0.938278 |
| 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2.0 | 1.0 | 336.666667 | 0.931863 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2.0 | 1.0 | 395.583333 | 0.934069 |
| 4 | 1.0 | 1.0 | 2.0 | 0.0 | 1.0 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 2.0 | 1.0 | 749.083333 | 0.922076 |
| 5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | 2.0 | 1.0 | 266.527778 | 0.930754 |
| 6 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | 1.0 | 0.0 | 443.277778 | 0.919986 |
| 7 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | 2.0 | 1.0 | 471.333333 | 0.914799 |
| 8 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | 1.0 | 0.0 | 979.138889 | 0.958875 |
| 9 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 3200 | 700.0 | 70.0 | 360.0 | 1.0 | 2.0 | 1.0 | 196.388889 | 0.949644 |


By filling in these variables manually, we recovered part of the missing data: 517 of the 614 original rows remain after dropping the rows that still contain NaNs.

Also, since the categorical variables are now encoded, we can treat them as numerical from now on.
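
A simple way to confirm that every column is numeric once the file is read back is sketched here. To stay self-contained, it inlines the first two rows of the table above (a subset of the columns) instead of reading `dataset/train_loan_preprocessed.csv`; the `csv_text`, `df_check` and `all_numeric` names are illustrative.

```python
import io

import pandas as pd

# First two rows of the preprocessed table, restricted to a few columns.
csv_text = """Gender,Married,Dependents,ApplicantIncome,Property_Area,Loan_Status
1.0,1.0,1.0,4583,0.0,0.0
1.0,1.0,0.0,3000,2.0,1.0
"""
df_check = pd.read_csv(io.StringIO(csv_text))

# True only if no column was parsed as text.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df_check.dtypes)
print(df_check.dtypes)
print('All columns numeric:', all_numeric)
```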