## Dream Housing Finance company

### Problem Statement
***About Company***
* Dream Housing Finance Company deals in all kinds of home loans.
* It has a presence across urban, semi-urban, and rural areas.
* A customer first applies for a home loan; the company then validates the customer's eligibility for the loan.

### Problem
* The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out the online application form.
* These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.
* To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

**_The company has provided a partial data set._**
* ***Data set taken from: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/***


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("loan_data_set.csv")  # read the CSV file into a DataFrame

In [3]:
df.info()  # column names, non-null counts, and dtypes of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 617 entries, 0 to 616
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            617 non-null    object 
 1   Gender             604 non-null    object 
 2   Married            614 non-null    object 
 3   Dependents         602 non-null    object 
 4   Education          617 non-null    object 
 5   Self_Employed      585 non-null    object 
 6   ApplicantIncome    617 non-null    int64  
 7   CoapplicantIncome  617 non-null    float64
 8   LoanAmount         595 non-null    float64
 9   Loan_Amount_Term   603 non-null    float64
 10  Credit_History     567 non-null    float64
 11  Property_Area      617 non-null    object 
 12  Loan_Status        617 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.8+ KB


In [4]:
df.isnull().sum()  # number of null values in each column

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
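Raw counts are easier to compare as a share of the rows. In the output above, `Credit_History` has the most gaps: 50 of 617 rows, or about 8.1%. The following is a minimal sketch of that calculation on a small made-up frame; `toy` and its values are illustrative assumptions, not part of the loan data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 66.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

The same expression applied to `df` would rank the real columns by how much imputation they need.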

In [5]:
df.shape
# df_encoded.shape

(617, 13)

## Data Wrangling / Data Cleaning

In [6]:
# work on a copy so the original DataFrame stays unchanged during cleaning
df_encoded = df.copy()

In [7]:
# null values will be filled with the most frequent value in each column
# .value_counts() returns how often each distinct value occurs in a column
df_encoded['Gender'].value_counts() 
# df_encoded['Married'].value_counts()
# df_encoded['Dependents'].value_counts()
# df_encoded['Self_Employed'].value_counts()
# df_encoded['Loan_Amount_Term'].value_counts()
# df_encoded['Credit_History'].value_counts()

Male      492
Female    112
Name: Gender, dtype: int64

In [8]:
# fill nulls with the most frequent value of each column (found with value_counts() above)
df_encoded['Gender'] = df_encoded['Gender'].fillna('Male')
df_encoded['Married'] = df_encoded['Married'].fillna('Yes')
df_encoded['Dependents'] = df_encoded['Dependents'].fillna('0')  # string, to match the column's other values
df_encoded['Self_Employed'] = df_encoded['Self_Employed'].fillna('No')
df_encoded['Credit_History'] = df_encoded['Credit_History'].fillna(1.0)
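The fill values in this cell were read off `value_counts()` by hand. As a hedged alternative, the most frequent value can be looked up programmatically with `Series.mode()`. The sketch below uses a small hypothetical frame (`toy`) rather than the loan data.

```python
import pandas as pd

# Hypothetical categorical columns with a few missing entries
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "Married": ["Yes", None, "Yes", "No"],
})

for col in ["Gender", "Married"]:
    most_common = toy[col].mode()[0]          # mode() skips missing values by default
    toy[col] = toy[col].fillna(most_common)   # assign back instead of using inplace=True

print(toy)
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection, which newer pandas versions warn about.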

In [9]:
# LoanAmount is continuous, so value_counts() shows no clear most frequent value;
# fill its nulls with the column mean instead
df_encoded['LoanAmount'] = df_encoded['LoanAmount'].fillna(df_encoded['LoanAmount'].mean())

# do the same for the Loan_Amount_Term column
df_encoded['Loan_Amount_Term'] = df_encoded['Loan_Amount_Term'].fillna(df_encoded['Loan_Amount_Term'].mean())
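The mean is sensitive to a few very large values, so it can help to compare it with the median before settling on a fill value. This is only an illustrative sketch on made-up numbers (`amounts`); the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with one large value and one missing entry
amounts = pd.Series([100.0, 120.0, np.nan, 130.0, 650.0])

mean_filled = amounts.fillna(amounts.mean())      # mean of the four known values
median_filled = amounts.fillna(amounts.median())  # median of the four known values

print("mean fill:  ", mean_filled[2])
print("median fill:", median_filled[2])
```

Here the single large amount pulls the mean well above most of the observed values, while the median stays close to them.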


In [10]:
df_encoded.duplicated().sum()  # count rows that are exact duplicates of an earlier row

3

In [11]:
df_encoded['Loan_ID'].value_counts()

LP001032    3
LP001003    2
LP001194    1
LP001562    1
LP001097    1
           ..
LP002234    1
LP002110    1
LP002911    1
LP001636    1
LP002281    1
Name: Loan_ID, Length: 614, dtype: int64

In [12]:
# drop the duplicate rows, keeping the first occurrence of each
df_encoded.drop_duplicates(inplace = True)
df_encoded.shape

(614, 13)
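The counts above fit together: one `Loan_ID` appears three times and another twice, which gives three surplus rows and explains the drop from 617 to 614. A small sketch of the same pattern, on a hypothetical frame (`toy`) with the same duplication structure:

```python
import pandas as pd

# Hypothetical frame: one ID appears twice, another three times
toy = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP002", "LP003", "LP003", "LP003"],
    "LoanAmount": [100.0, 128.0, 128.0, 66.0, 66.0, 66.0],
})

n_extra = toy.duplicated().sum()               # rows that repeat an earlier row
all_copies = toy[toy.duplicated(keep=False)]   # every copy, for inspection before dropping
deduped = toy.drop_duplicates()                # keeps the first occurrence of each row

print(n_extra, len(all_copies), deduped.index.tolist())
```

`drop_duplicates()` keeps the original index labels, so gaps remain where rows were removed; `reset_index(drop=True)` is one way to renumber if contiguous labels are needed.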

In [13]:
df_encoded.head(10)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 146.309244 | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 3 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 5 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 6 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| 7 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
| 8 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
| 9 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
| 10 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |


In [14]:
# cast Dependents to str so every value (including '3+') has the same type before label encoding
df_encoded['Dependents'] = df_encoded['Dependents'].astype(str)

In [15]:
from sklearn.preprocessing import LabelEncoder  # encodes text labels as integers
le = LabelEncoder()

In [16]:
# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels
# 'Gender': 0 = Female, 1 = Male
df_encoded['Gender'] = le.fit_transform(df_encoded["Gender"].values)
# 'Married': 0 = No, 1 = Yes
df_encoded['Married'] = le.fit_transform(df_encoded["Married"].values)
# 'Dependents': 0, 1, 2, 3 (where 3 stands for '3+')
df_encoded['Dependents'] = le.fit_transform(df_encoded["Dependents"].values)
# 'Education': 0 = Graduate, 1 = Not Graduate
df_encoded['Education'] = le.fit_transform(df_encoded["Education"].values)
# 'Self_Employed': 0 = No, 1 = Yes
df_encoded['Self_Employed'] = le.fit_transform(df_encoded["Self_Employed"].values)
# 'Property_Area' (three classes): 0 = Rural, 1 = Semiurban, 2 = Urban
df_encoded['Property_Area'] = le.fit_transform(df_encoded["Property_Area"].values)
# 'Loan_Status': 0 = N, 1 = Y
df_encoded['Loan_Status'] = le.fit_transform(df_encoded["Loan_Status"].values)
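Because a single `le` object is refit for every column, only the last mapping survives in it. If the codes need to be checked or reversed later, one option is to keep a separate encoder per column and read the mapping from its `classes_` attribute. A minimal sketch, using short hypothetical value lists instead of the real columns (`columns`, `encoders`, and `mappings` are names introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for two of the columns
columns = {
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Dependents": ["0", "3+", "1", "2"],
}

encoders = {}   # one fitted encoder per column
mappings = {}   # label -> integer code, read from classes_
for name, values in columns.items():
    enc = LabelEncoder().fit(values)
    encoders[name] = enc
    # classes_ is sorted; a label's position is its integer code
    mappings[name] = {label: code for code, label in enumerate(enc.classes_)}

print(mappings)
print(encoders["Property_Area"].inverse_transform([2]))  # code back to label
```

Reading the mapping from `classes_` confirms the codes stated in the comments rather than assuming them.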

In [17]:
print(df['Property_Area'].value_counts())
df_encoded['Property_Area'].value_counts()

Semiurban    233
Urban        204
Rural        180
Name: Property_Area, dtype: int64


1    233
2    202
0    179
Name: Property_Area, dtype: int64

In [18]:
df_encoded.head()

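| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | 1 | 0 | 0 | 0 | 0 | 5849 | 0.0 | 146.309244 | 360.0 | 1.0 | 2 | 1 |
| 1 | LP001003 | 1 | 1 | 1 | 0 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 3 | LP001005 | 1 | 1 | 0 | 0 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | 1 |
| 4 | LP001006 | 1 | 1 | 0 | 1 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | 1 |
| 5 | LP001008 | 1 | 0 | 0 | 0 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | 1 |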


In [19]:
df_encoded['Loan_Status'].value_counts()

1    422
0    192
Name: Loan_Status, dtype: int64

## Splitting Data for Training and Testing

In [20]:
from sklearn.model_selection import train_test_split  # randomly splits data into train and test sets

In [21]:
df_encoded.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [22]:
# define the input features
feature_cols = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area']

X = df_encoded[feature_cols] #define the input features
y = df_encoded['Loan_Status'] #define the dependent variable


In [23]:
# use 70% of the data to train the model and 30% to test it
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)
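The target is imbalanced (422 approved against 192 rejected), so a purely random split can shift the class ratio between the train and test sets. One optional variation, not used in this notebook, is to pass `stratify=y`. The sketch below shows the effect on synthetic arrays (`X_toy`, `y_toy` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70% of class 1 and 30% of class 0
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)                   # sizes of the two parts
print(y_toy.mean(), y_tr.mean(), y_te.mean())   # share of class 1 in each part
```

With stratification, the share of class 1 in both parts stays close to the share in the full data.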

## Applying the Algorithm

In [24]:
from sklearn.tree import DecisionTreeClassifier  # import the decision tree classifier

In [25]:
# clf = DecisionTreeClassifier(random_state=1)  # unconstrained decision tree classifier
clf = DecisionTreeClassifier(criterion="entropy", random_state=1, max_depth=2)  # entropy criterion; max_depth=2 limits the tree depth (pre-pruning)
clf = clf.fit(X_train, y_train)  # fit the model on the training data
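A tree limited to depth 2 is small enough to read as a handful of if/else rules. scikit-learn's `export_text` prints those rules. The sketch below fits a tree on made-up data (`X_toy`, `y_toy`, and the choice of two feature names are assumptions for illustration), so the printed rules say nothing about the real loan model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the first feature separates the classes, the second is noise
X_toy = np.array([[1.0, 5], [1.0, 3], [1.0, 8], [0.0, 4], [0.0, 7], [0.0, 2]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=1)
toy_clf.fit(X_toy, y_toy)

# Print the learned splits as indented text rules
rules = export_text(toy_clf, feature_names=["Credit_History", "LoanAmount"])
print(rules)
```

Calling `export_text(clf, feature_names=feature_cols)` on the fitted model from this notebook would show which features its two levels of splits use.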

In [26]:
# predict the labels of the test set
y_pred = clf.predict(X_test)

from sklearn import metrics
# model accuracy on the test set, as a percentage
print("Accuracy :", metrics.accuracy_score(y_test,y_pred)*100)

Accuracy : 78.91891891891892


In [27]:
# build a confusion matrix (rows: actual class, columns: predicted class)
print(metrics.confusion_matrix(y_test,y_pred))

[[ 25  36]
 [  3 121]]
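The four entries sum to the 185 test rows, and the diagonal (25 + 121 = 146) reproduces the accuracy printed earlier. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so with 0 = N and 1 = Y the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch that derives a few further metrics directly from the numbers above:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[25, 36],
               [3, 121]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test rows predicted correctly
precision = tp / (tp + fp)        # predicted approvals that were actually approved
recall = tp / (tp + fn)           # actual approvals the model found
specificity = tn / (tn + fp)      # actual rejections the model found

print(f"accuracy    {accuracy:.3f}")
print(f"precision   {precision:.3f}")
print(f"recall      {recall:.3f}")
print(f"specificity {specificity:.3f}")
```

The model finds nearly all approved loans (121 of 124) but recognises fewer than half of the rejected ones (25 of 61), which accuracy alone does not show.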


In [29]:
df_encoded.to_csv('Encoded_Loan_data.csv',index = False)