# Loan Prediction 07 - Preparing Final Dataset

Although these steps could have been done in the previous notebooks, let us replicate them all in one place, so that it is clearer how the data is transformed before it is fed into the model.

Variable creation:
- Creation of Base_Loan_Installment and Remaining_Income

Missing data treatment:
- Fill in Self_Employed with 'No'.
- Fill in Loan_Amount_Term with 360 for clients who have this value missing.
- Fill in CoapplicantIncome with 0.

Dataset preprocess:
- Encode categorical variables
- Scale all variables from 0 to 1
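
The two preprocessing steps listed above can be sketched as follows. This is only an illustration on a toy frame: the notebook does not show which encoder or scaler is used at this point, so one-hot encoding with `pd.get_dummies` and scaling with `preprocessing.MinMaxScaler` are assumptions, and the `toy` data is made up.

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame with a few of the dataset's columns (values are illustrative).
toy = pd.DataFrame({
    'Education': ['Graduate', 'Not Graduate', 'Graduate'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
    'ApplicantIncome': [5720, 3076, 5000],
})

# Assumed encoding choice: one-hot encode the categorical columns.
encoded = pd.get_dummies(toy, columns=['Education', 'Property_Area'], dtype=float)

# Assumed scaling choice: min-max scaling, which maps each column to [0, 1].
scaler = preprocessing.MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)
print(scaled)
```

When new data arrives, the scaler would normally be fitted on the training data only and then applied with `scaler.transform`, so that training and incoming rows share the same scale.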

In the real world, the live input service might not provide all the information the model needs. We therefore assume that, when that happens, the missing data is filled in with the most common case (the mean or mode of the training data).
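
A minimal sketch of that fallback, assuming the mean is used for numeric columns and the mode for everything else. The helper `compute_fill_values` and the small `train` and `incoming` frames are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

def compute_fill_values(df_train):
    """Return {column: fill value}: mean for numeric columns, mode otherwise."""
    fill_values = {}
    for column in df_train.columns:
        if pd.api.types.is_numeric_dtype(df_train[column]):
            fill_values[column] = df_train[column].mean()
        else:
            # mode() ignores missing values and returns the most frequent ones.
            fill_values[column] = df_train[column].mode().iloc[0]
    return fill_values

# Hypothetical training sample.
train = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'Yes', None],
    'LoanAmount': [100.0, 200.0, None, 150.0],
})
fill_values = compute_fill_values(train)

# Hypothetical incoming data where the second row lacks both fields.
incoming = pd.DataFrame({
    'Self_Employed': ['Yes', None],
    'LoanAmount': [120.0, float('nan')],
})
filled = incoming.fillna(value=fill_values)
print(filled)
```

Computing the fill values once from the training data, and reusing them for every incoming row, keeps the treatment of new clients consistent with what the model saw during training.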

In [2]:
import sys
sys.path.append('utils')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
plt.style.use('seaborn')

from sklearn import preprocessing

import metrics_utils 
import model_utils

In [3]:
df_import = pd.read_csv('dataset/test_loan.csv')
display(df_import.head(10))
print(df_import.shape)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |
| 5 | LP001054 | Male | Yes | 0 | Not Graduate | Yes | 2165 | 3422 | 152.0 | 360.0 | 1.0 | Urban |
| 6 | LP001055 | Female | No | 1 | Not Graduate | No | 2226 | 0 | 59.0 | 360.0 | 1.0 | Semiurban |
| 7 | LP001056 | Male | Yes | 2 | Not Graduate | No | 3881 | 0 | 147.0 | 360.0 | 0.0 | Rural |
| 8 | LP001059 | Male | Yes | 2 | Graduate | NaN | 13633 | 0 | 280.0 | 240.0 | 1.0 | Urban |
| 9 | LP001067 | Male | No | 0 | Not Graduate | No | 2400 | 2400 | 123.0 | 360.0 | 1.0 | Semiurban |


(367, 12)


In [15]:
#TODO:
# df_train.describe() for the numerical variables
# get mode of categorical variables
# Fill in the missing data

In [16]:
# Fill missing values with the defaults listed in the introduction.
df_import['Loan_Amount_Term'] = df_import['Loan_Amount_Term'].fillna(360)
df_import['Self_Employed'] = df_import['Self_Employed'].fillna('No')
df_import['CoapplicantIncome'] = df_import['CoapplicantIncome'].fillna(0)

In [11]:
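# Installment per term period, interest excluded.
# The * 1000 suggests LoanAmount is expressed in thousands.
base_loan_installment = df_import['LoanAmount'] * 1000 / df_import['Loan_Amount_Term']

# Share of the combined income that remains after paying the installment.
total_income = df_import['ApplicantIncome'] + df_import['CoapplicantIncome']
remaining_income = (total_income - base_loan_installment) / total_income

df_import['Base_Loan_Installment'] = base_loan_installment
df_import['Remaining_Income'] = remaining_income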
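
As a worked check of the two formulas, the sketch below wraps them in a hypothetical helper, `add_installment_features`, and applies it to the values of the first row of the table shown earlier (income 5720, no co-applicant income, loan amount 110, term 360). The guard that turns a zero total income into a missing value is an assumption added here and is not part of the original cell.

```python
import numpy as np
import pandas as pd

def add_installment_features(df):
    """Return a copy of df with Base_Loan_Installment and Remaining_Income."""
    out = df.copy()
    installment = out['LoanAmount'] * 1000 / out['Loan_Amount_Term']
    total_income = out['ApplicantIncome'] + out['CoapplicantIncome']
    # Assumption: treat a zero total income as missing, so the ratio becomes
    # NaN instead of an infinite value caused by division by zero.
    total_income = total_income.replace(0, np.nan)
    out['Base_Loan_Installment'] = installment
    out['Remaining_Income'] = (total_income - installment) / total_income
    return out

sample = pd.DataFrame({
    'ApplicantIncome': [5720],
    'CoapplicantIncome': [0],
    'LoanAmount': [110.0],
    'Loan_Amount_Term': [360.0],
})
result = add_installment_features(sample)
print(result[['Base_Loan_Installment', 'Remaining_Income']])
```

For this row the installment is 110000 / 360, roughly 305.56, and the remaining share of income is (5720 - 305.56) / 5720, roughly 0.9466.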

In [12]:
df_import.count()

Loan_ID                  367
Gender                   356
Married                  367
Dependents               357
Education                367
Self_Employed            367
ApplicantIncome          367
CoapplicantIncome        367
LoanAmount               362
Loan_Amount_Term         367
Credit_History           338
Property_Area            367
Base_Loan_Installment    362
Remaining_Income         362
dtype: int64
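
In the counts above, Base_Loan_Installment and Remaining_Income have 362 non-null values, the same as LoanAmount, because a missing input propagates through the arithmetic. The sketch below reproduces that effect on a made-up `toy` frame and shows one way to count missing values per column and the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks LoanAmount, another lacks Credit_History.
toy = pd.DataFrame({
    'LoanAmount': [110.0, np.nan, 208.0],
    'Loan_Amount_Term': [360.0, 360.0, 360.0],
    'Credit_History': [1.0, 1.0, np.nan],
})
# The derived column is NaN wherever LoanAmount is NaN.
toy['Base_Loan_Installment'] = toy['LoanAmount'] * 1000 / toy['Loan_Amount_Term']

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
# Number of rows with at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_any_missing)
```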

In [13]:
# Drop every row that still has a missing value in any column.
df_import.dropna(inplace=True)
df_import.count()

Loan_ID                  316
Gender                   316
Married                  316
Dependents               316
Education                316
Self_Employed            316
ApplicantIncome          316
CoapplicantIncome        316
LoanAmount               316
Loan_Amount_Term         316
Credit_History           316
Property_Area            316
Base_Loan_Installment    316
Remaining_Income         316
dtype: int64