# **Loan Prediction Problem**

A Finance company wants to automate the loan eligibility process based on the customer details provided while filling application form. These details are Gender,Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have given a problem to identify the customers segments, those are eligible for loan amount so that they can specifically target these customers.

# Import Libraries

In [360]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
%matplotlib inline

# Read datasets

In [361]:
train = pd.read_csv('./train_u6lujuX_CVtuZ9i.csv')
test = pd.read_csv('./test_Y3wMUE5_7gLdaTN.csv')

Let's Make copy of the original datasets so we do not hamper the original datasets.

In [362]:
train_original = train.copy()
test_original = test.copy()

In [363]:
train.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [364]:
test.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban


In [365]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [366]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            367 non-null    object 
 1   Gender             356 non-null    object 
 2   Married            367 non-null    object 
 3   Dependents         357 non-null    object 
 4   Education          367 non-null    object 
 5   Self_Employed      344 non-null    object 
 6   ApplicantIncome    367 non-null    int64  
 7   CoapplicantIncome  367 non-null    int64  
 8   LoanAmount         362 non-null    float64
 9   Loan_Amount_Term   361 non-null    float64
 10  Credit_History     338 non-null    float64
 11  Property_Area      367 non-null    object 
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB


# Data Cleaning

In [375]:
train.isnull().sum()

ApplicantIncome      0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Gender               0
Married              0
Education            0
Self_Employed       32
Loan_Status          0
dtype: int64

In [368]:
test.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [369]:
train['Gender'].fillna(train['Gender'].mode()[0], inplace=True)
train['Married'].fillna(train['Married'].mode()[0], inplace=True)
train['Self_Employed'].fillna(train['Self_Employed'].mode()[0], inplace=True)
train['Credit_History'].fillna(train['Credit_History'].mode()[0], inplace=True)
train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0], inplace=True)
train['LoanAmount'].fillna(train['LoanAmount'].median(), inplace=True)

test['Gender'].fillna(test['Gender'].mode()[0], inplace=True)
test['Married'].fillna(test['Married'].mode()[0], inplace=True)
test['Self_Employed'].fillna(test['Self_Employed'].mode()[0], inplace=True)
test['Credit_History'].fillna(test['Credit_History'].mode()[0], inplace=True)
test['Loan_Amount_Term'].fillna(test['Loan_Amount_Term'].mode()[0], inplace=True)
test['LoanAmount'].fillna(test['LoanAmount'].median(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train['Gender'].fillna(train['Gender'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train['Married'].fillna(train['Married'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate o

In [370]:
train = train[['ApplicantIncome','LoanAmount', 
           'Loan_Amount_Term', 'Credit_History',
           'Gender','Married','Education','Self_Employed','Loan_Status']].copy()
test = test[['ApplicantIncome','LoanAmount', 
           'Loan_Amount_Term', 'Credit_History',
           'Gender','Married','Education','Self_Employed']].copy()

In [371]:
gender_map = {'Male': 0, 'Female': 1}
married_map = {'No':0,'Yes':1}
edu_map = {'Graduate':1,'Not Graduate':0}
employed_map = {'No':0,'Yes':1}
loan_map = {'N':0,'Y':1}

train['Gender'] = train['Gender'].map(gender_map)
train['Married'] = train['Married'].map(married_map)
train['Education'] = train['Education'].map(edu_map)
train['Self_Employed'] = train['Self_Employed'].map(employed_map)
train['Loan_Status'] = train['Loan_Status'].map(loan_map)


test['Gender'] = test['Gender'].map(gender_map)
test['Married'] = test['Married'].map(married_map)
test['Education'] = test['Education'].map(edu_map)
test['Self_Employed'] = test['Self_Employed'].map(employed_map)

# Model Building

In [372]:
train

Unnamed: 0,ApplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender,Married,Education,Self_Employed,Loan_Status
0,5849,128.0,360.0,1.0,0,0,1,0.0,1
1,4583,128.0,360.0,1.0,0,1,1,0.0,0
2,3000,66.0,360.0,1.0,0,1,1,1.0,1
3,2583,120.0,360.0,1.0,0,1,0,0.0,1
4,6000,141.0,360.0,1.0,0,0,1,0.0,1
...,...,...,...,...,...,...,...,...,...
609,2900,71.0,360.0,1.0,1,0,1,0.0,1
610,4106,40.0,180.0,1.0,0,1,1,0.0,1
611,8072,253.0,360.0,1.0,0,1,1,0.0,1
612,7583,187.0,360.0,1.0,0,1,1,0.0,1


In [374]:
train = train.astype(int)
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfmodel = RandomForestClassifier(n_estimators=500)

In [None]:
rfmodel.fit(X_train, y_train)

In [None]:
rfpredictions = rfmodel.predict(X_cv)

In [None]:
print(sklearn.metrics.accuracy_score(y_cv, rfpredictions))

0.7675675675675676


In [None]:
print(sklearn.metrics.classification_report(y_cv, rfpredictions))

              precision    recall  f1-score   support

           N       0.67      0.49      0.57        57
           Y       0.80      0.89      0.84       128

    accuracy                           0.77       185
   macro avg       0.73      0.69      0.70       185
weighted avg       0.76      0.77      0.76       185



In [None]:
import pickle
filename = 'Random_Forest.sav'
pickle.dump(rfmodel, open(filename, 'wb'))