# Loan Prediction - 02 - Data Preprocessing

Based on the conclusions of the Exploratory Data Analysis, we fill in some of the missing data under the following assumptions:

- Set Credit_History to 0 for clients whose value is missing and whose loan application was rejected.
- Set Loan_Amount_Term to 360 for clients whose value is missing.
- Set missing CoapplicantIncome values to 0.
- Set missing Self_Employed values to No.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'

from sklearn import preprocessing

In [2]:
df_import = pd.read_csv('dataset/train_loan.csv')
df_import.drop(columns=['Loan_ID'],inplace = True)
df_import.shape

(614, 12)

## Filling Missing Values

In [3]:
df_fill = df_import.copy()
df_fill.loc[(df_fill['Loan_Status'] == 'N') & (df_fill['Credit_History'].isnull()),'Credit_History'] = 0
df_fill.loc[df_fill['Loan_Amount_Term'].isnull(),'Loan_Amount_Term'] = 360
df_fill.loc[df_fill['Self_Employed'].isnull(),'Self_Employed'] = 'No'
df_fill.loc[df_fill['CoapplicantIncome'].isnull(),'CoapplicantIncome'] = 0
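
The three unconditional fills can also be written as a single `fillna` call with one default per column, while the Credit_History rule stays a masked assignment because it depends on Loan_Status. The sketch below is only an illustration on invented toy rows, not the notebook's data; the names `toy`, `filled`, and `rejected_no_history` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    'Loan_Status': ['N', 'Y', 'N', 'Y'],
    'Credit_History': [np.nan, np.nan, 1.0, 1.0],
    'Loan_Amount_Term': [360.0, np.nan, 180.0, np.nan],
    'Self_Employed': ['Yes', None, 'No', None],
    'CoapplicantIncome': [np.nan, 1500.0, 0.0, np.nan],
})

filled = toy.copy()

# Conditional rule: only rejected applications with a missing value get Credit_History = 0.
rejected_no_history = (filled['Loan_Status'] == 'N') & filled['Credit_History'].isnull()
filled.loc[rejected_no_history, 'Credit_History'] = 0

# Unconditional rules: one default value per column, applied only where the value is missing.
filled = filled.fillna({
    'Loan_Amount_Term': 360,
    'Self_Employed': 'No',
    'CoapplicantIncome': 0,
})

print(filled)
```

The approved application with a missing Credit_History is left as NaN on purpose, so it is still removed by the later `dropna`.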

## Removing NaNs and Checking Valid Values Again

In [4]:
df_fill.dropna(inplace = True)
df_fill.count()

Gender               530
Married              530
Dependents           530
Education            530
Self_Employed        530
ApplicantIncome      530
CoapplicantIncome    530
LoanAmount           530
Loan_Amount_Term     530
Credit_History       530
Property_Area        530
Loan_Status          530
dtype: int64

After filling, 530 complete samples remain, which is 50 more than we would keep by dropping the NaNs from the original data directly.
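
The number of recovered samples is the difference between the rows that survive `dropna` after filling and the rows that survive it without filling. A minimal sketch of that comparison on invented toy rows (the variable names are hypothetical, and only two of the fill rules are reproduced):

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    'Loan_Status': ['N', 'Y', 'Y', 'N'],
    'Credit_History': [np.nan, 1.0, np.nan, 1.0],
    'Loan_Amount_Term': [360.0, np.nan, 360.0, 180.0],
})

# Baseline: drop every row containing a NaN, without filling anything first.
rows_without_fill = len(toy.dropna())

# Apply two of the fill rules, then drop whatever is still incomplete.
filled = toy.copy()
mask = (filled['Loan_Status'] == 'N') & filled['Credit_History'].isnull()
filled.loc[mask, 'Credit_History'] = 0
filled.loc[filled['Loan_Amount_Term'].isnull(), 'Loan_Amount_Term'] = 360
rows_with_fill = len(filled.dropna())

recovered = rows_with_fill - rows_without_fill
print(rows_without_fill, rows_with_fill, recovered)
```

In the notebook itself, the same figure would presumably come from `len(df_fill) - len(df_import.dropna())` once cell 4 has run.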

## Encoding Categorical Variables to Numeric Values

We encode the variables manually, which gives us full visibility and control over the labels and the codes assigned to them.

In [5]:
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
df_fill[categorical_columns].head()

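| | Gender | Married | Dependents | Education | Self_Employed | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|
| 1 | Male | Yes | 1 | Graduate | No | Rural | N |
| 2 | Male | Yes | 0 | Graduate | Yes | Urban | Y |
| 3 | Male | Yes | 0 | Not Graduate | No | Urban | Y |
| 4 | Male | No | 0 | Graduate | No | Urban | Y |
| 5 | Male | Yes | 2 | Graduate | Yes | Urban | Y |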


In [6]:
def encode_labels(df_input,show_encoding = True):
    df = df_input.copy()
    df_encoding = pd.DataFrame(columns = df.columns,index = ['Labels','Encoding'])
    for col in df.columns:
        labels = df[col].unique()
        labels.sort()
        df_encoding.loc['Labels',col] = labels
        if show_encoding:
            print(col,' labels: ',labels)

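        # Replace each label with its position in the sorted label list.
        for ii,label in enumerate(labels):
            df.loc[df[col] == label,col] = ii
        # The column still has object dtype after the replacement; cast it to int.
        df[col] = df[col].astype(int)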
            
        encodings = list(range(labels.shape[0]))
        encodings.sort()
        df_encoding.loc['Encoding',col] = encodings
        
        if show_encoding:
            print(col,' labels encoded: ',encodings)
            print('')
        
    return df, df_encoding

In [7]:
df_encoded = df_fill.copy()
df_encoded[categorical_columns],df_encoding_labels = encode_labels(df_fill[categorical_columns])

Gender  labels:  ['Female' 'Male']
Gender  labels encoded:  [0, 1]

Married  labels:  ['No' 'Yes']
Married  labels encoded:  [0, 1]

Dependents  labels:  ['0' '1' '2' '3+']
Dependents  labels encoded:  [0, 1, 2, 3]

Education  labels:  ['Graduate' 'Not Graduate']
Education  labels encoded:  [0, 1]

Self_Employed  labels:  ['No' 'Yes']
Self_Employed  labels encoded:  [0, 1]

Property_Area  labels:  ['Rural' 'Semiurban' 'Urban']
Property_Area  labels encoded:  [0, 1, 2]

Loan_Status  labels:  ['N' 'Y']
Loan_Status  labels encoded:  [0, 1]
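
The manual encoding assigns each label its position in the sorted list of labels. As a cross-check, and not as the method used in this notebook, `pd.Categorical` should produce the same codes, since it infers its categories in sorted order when none are given. A minimal sketch on invented toy rows (`toy`, `encoded`, and `label_table` are hypothetical names):

```python
import pandas as pd

# Toy rows invented for illustration; the labels match those printed by the notebook.
toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban'],
    'Dependents': ['3+', '0', '2', '1'],
})

encoded = toy.copy()
label_table = {}
for col in toy.columns:
    # Categories are inferred from the data and kept in sorted order,
    # so code i corresponds to the i-th sorted label.
    cat = pd.Categorical(toy[col])
    encoded[col] = cat.codes
    label_table[col] = list(cat.categories)

print(encoded)
print(label_table)
```

The manual function remains the more transparent choice when the printed label-to-code listing is itself part of the goal.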



In [8]:
df_encoded.reset_index(drop = True,inplace = True)
df_encoded.to_csv('dataset/train_loan_preprocessed.csv')
df_encoded.head(10)

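| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | 1 |
| 2 | 1 | 1 | 0 | 1 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | 1 |
| 4 | 1 | 1 | 2 | 0 | 1 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 2 | 1 |
| 5 | 1 | 1 | 0 | 1 | 0 | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | 2 | 1 |
| 6 | 1 | 1 | 3 | 0 | 0 | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | 1 | 0 |
| 7 | 1 | 1 | 2 | 0 | 0 | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | 2 | 1 |
| 8 | 1 | 1 | 1 | 0 | 0 | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | 1 | 0 |
| 9 | 1 | 1 | 2 | 0 | 0 | 3200 | 700.0 | 70.0 | 360.0 | 1.0 | 2 | 1 |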
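
Cell 8 calls `to_csv` with its default arguments, which also writes the row index as an unnamed first column. The sketch below illustrates what that means when the file is read back, using an in-memory buffer and invented toy rows rather than the real file; `plain` and `restored` are hypothetical names.

```python
import io
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({'Gender': [1, 1, 0], 'Loan_Status': [0, 1, 1]})

# Write to an in-memory buffer instead of a file; to_csv includes the index by default.
buffer = io.StringIO()
toy.to_csv(buffer)

# Plain read: the saved index comes back as an extra column named 'Unnamed: 0'.
buffer.seek(0)
plain = pd.read_csv(buffer)

# index_col=0 uses the first column as the index instead of as a data column.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(plain.columns.tolist())
print(restored.columns.tolist())
```

Whoever loads `train_loan_preprocessed.csv` later would therefore either pass `index_col=0` or drop the extra column; writing with `index=False` would be another option if the index is not needed.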


In [9]:
df_encoding_labels.to_csv('dataset/encoding_labels.csv')
df_encoding_labels

| | Gender | Married | Dependents | Education | Self_Employed | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|
| Labels | [Female, Male] | [No, Yes] | [0, 1, 2, 3+] | [Graduate, Not Graduate] | [No, Yes] | [Rural, Semiurban, Urban] | [N, Y] |
| Encoding | [0, 1] | [0, 1] | [0, 1, 2, 3] | [0, 1] | [0, 1] | [0, 1, 2] | [0, 1] |
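
The encoding table makes it possible to translate codes back into labels, for example when reporting model predictions. A minimal sketch of that reverse mapping follows, with the label lists typed in by hand because list-valued cells stored in a CSV come back as plain strings; `labels`, `encoded`, and `decoded` are hypothetical names and the rows are invented.

```python
import pandas as pd

# Sorted labels per column, copied from the encoding table for two of the columns.
labels = {
    'Property_Area': ['Rural', 'Semiurban', 'Urban'],
    'Loan_Status': ['N', 'Y'],
}

# Toy encoded rows invented for illustration.
encoded = pd.DataFrame({
    'Property_Area': [0, 2, 2, 1],
    'Loan_Status': [0, 1, 1, 0],
})

decoded = encoded.copy()
for col, col_labels in labels.items():
    # Code i maps back to the i-th label in the sorted list.
    code_to_label = dict(enumerate(col_labels))
    decoded[col] = encoded[col].map(code_to_label)

print(decoded)
```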


We were able to recover some of the missing data and to encode the categorical variables.

Hence, all the variables in the dataset can now be treated as numerical.

In [10]:
#TODO: Study the necessity to remove outliers
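
One possible starting point for this study is to count how many values fall outside the usual 1.5 × IQR fences. This is a common rule of thumb and an assumption here, not a method the notebook has chosen. The sketch uses the ten ApplicantIncome values shown in the `head(10)` output above; `income`, `lower`, `upper`, and `flagged` are hypothetical names.

```python
import pandas as pd

# ApplicantIncome values from the first ten preprocessed rows.
income = pd.Series([4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841, 3200],
                   name='ApplicantIncome')

# Interquartile range and the conventional 1.5 * IQR fences (an assumed rule of thumb).
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection, not automatic removal.
flagged = income[(income < lower) | (income > upper)]
print(flagged)
```

Whether such values should actually be removed depends on the model and on how heavily skewed the income columns are, which is what the TODO leaves open.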