# Loan Eligibility Test

A data science beginner's project. We are using the [Home loan dataset](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/) here.

In [None]:
# Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Exploration

In [None]:
data = pd.read_csv("Labeled_Data.csv")

In [None]:
data.head(3)

In [None]:
# Fields and data types
data.dtypes

In [None]:
# Select duplicate rows except first occurrence based on all columns
duplicates = data[data.duplicated()]
print(duplicates)

In [None]:
data.shape

In [None]:
# Checking missing values per column
data.isnull().sum()

In [None]:
# Checking summary statistics, especially 'std', for signs of outliers
data.describe()

In [None]:
# From the summary statistics, the columns ApplicantIncome, CoapplicantIncome,
# LoanAmount and Loan_Amount_Term have very high standard deviations.

data.boxplot(column=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'], figsize=(15,7))


In [None]:
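# Checking correlation between numeric columns
# (numeric_only=True skips text columns such as Gender, which pandas 2.x would otherwise reject)
corr = data.corr(numeric_only=True)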
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

In [None]:
# Checking target class distribution: Balance in dataset
data.Loan_Status.value_counts()

In [None]:
# Loan_status values (Y/N) in ratio
data.Loan_Status.value_counts(normalize=True)

In [None]:
# On plot
ax = data.Loan_Status.value_counts().plot.barh()

#### Findings:
- Small dataset containing both numerical and categorical values
- No duplicates
- Lots of missing values
- There are extreme outliers in 'ApplicantIncome' and 'CoapplicantIncome' fields.
- There is a slight positive correlation between LoanAmount and ApplicantIncome.
- The data is imbalanced: the (Y / N) distribution is approximately 2:1
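
The outlier finding in the list above can also be checked numerically, for instance with the 1.5 × IQR rule that box plots use for their whiskers. The following is a minimal sketch on made-up income values, not the real data; `count_iqr_outliers` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up incomes for illustration only (not taken from the dataset)
incomes = pd.Series([2500, 3000, 3200, 4000, 4500, 5000, 5200, 81000])
n_outliers = count_iqr_outliers(incomes)
print(n_outliers)
```

On the real data, the same helper could be applied to each numeric column, e.g. `count_iqr_outliers(data['ApplicantIncome'])`.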

**Further exploration** of some particular fields to understand their role in the final outcome


In [None]:
# Checking relation of Loan_Status (Y/N) with credit history(1 = meets guidelines / 0 = doesn't meet)

hypothesis1 = data.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())

print ('Distribution of Loan status based on Credit History:\n')
print (hypothesis1)
ax = hypothesis1.plot.barh()

Applicants with a good credit history got their loan approved most of the time, which supports our **1st Hypothesis** (assumption). Now, let's look at the relation between ApplicantIncome and Loan_Status.

In [None]:
# From a scatter plot we can analyze the role of ApplicantIncome in outcome of Loan_Status
ax1 = data.plot(kind='scatter', x='ApplicantIncome', y='Loan_Status', color='r', figsize=(15, 3))    

print(ax1)

Here, we can see that the applicant's income does not play a significant role, which suggests rejecting our **2nd Hypothesis**. Before deciding, we can try binning the applicants' incomes and check their relation with Loan_Status again.

In [None]:
# k = 1000 $
bins = [0, 5000, 10000, 15000, 25000, np.inf]
names = ['< 5k', '5-10k', '10-15k', '15-25k', '25k+']

data['IncomeRange'] = pd.cut(data['ApplicantIncome'], bins, labels=names)

In [None]:
data.head(2)

In [None]:
hypothesis2 = pd.crosstab(data['IncomeRange'], data['Loan_Status'])
hypothesis2.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)

Now we can see that applicants with low or medium income apply for loans more often, so they account for most of the applications. However, approval, i.e. Loan_Status (Y/N), does not seem to be much influenced by the applicant's income, and we also need to remember that our label is imbalanced.
Finally, we reject our **2nd Hypothesis** for this case.
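
Because the bins differ a lot in size, a stacked count plot makes approval rates hard to compare. One option is to normalise each income range so that its Y and N shares sum to 1. The sketch below uses made-up rows with the notebook's column names; the `toy` frame and its values are assumptions for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the notebook
toy = pd.DataFrame({
    'ApplicantIncome': [2000, 3000, 4000, 4500, 6000, 7000, 12000, 20000, 30000],
    'Loan_Status':     ['Y',  'N',  'Y',  'Y',  'Y',  'N',  'Y',   'Y',   'N'],
})
bins = [0, 5000, 10000, 15000, 25000, np.inf]
names = ['< 5k', '5-10k', '10-15k', '15-25k', '25k+']
toy['IncomeRange'] = pd.cut(toy['ApplicantIncome'], bins, labels=names)

# normalize='index' turns each row into proportions that sum to 1
rates = pd.crosstab(toy['IncomeRange'], toy['Loan_Status'], normalize='index')
print(rates)
```

Replacing `toy` with `data` gives the approval share per income range, which can be plotted with `rates.plot(kind='bar', stacked=True)`.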


#### Checking influence of applicant's property

In [None]:
data['Property_Area'].value_counts()

In [None]:
hypothesis3 = pd.crosstab(data['Property_Area'], data['Loan_Status'])
hypothesis3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)

The plot shows that applicants with valuable property, such as in the urban and semi-urban areas, tend to get more loans, which supports our **3rd Hypothesis**.

Since this is a binary classification problem with many categorical values, and we want to keep our base model simple, we will use the Decision Tree and Naive Bayes classification algorithms to set a base standard for future comparison.
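
With a roughly 2:1 label ratio, a model that always predicts the majority class already reaches about 67% accuracy, so that figure is a useful floor when reading the base model scores. The sketch below illustrates this with scikit-learn's `DummyClassifier` on made-up labels; `X_toy` and `y_toy` are placeholders, not the loan data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels in roughly the 2:1 ratio noted in the findings (1 = Y, 0 = N)
y_toy = np.array([1] * 20 + [0] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print("Majority-class accuracy: %0.3f" % baseline_accuracy)
```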

### Data pre-processing

There are no duplicates, so we can skip the de-duplication step.

#### Treating missing values

In [None]:
# Self_Employed column
data['Self_Employed'].value_counts()

Since nearly 86% values are “No”, we can safely impute the missing values as “No”.

In [None]:
# Self_Employed column has 32 missing values
data['Self_Employed'] = data['Self_Employed'].fillna('No')

We can treat missing values in many different ways, such as dropping the rows or filling them with zeros, the mean, the median or the mode. Since we have categorical values, we can group by categories and take the median per group to fill the missing loan amounts. So, we need a pivot table.
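
The same per-group median idea can also be written with `groupby(...).transform('median')`, which returns one median per row, aligned with the original index. This is an alternative sketch on made-up rows, not a replacement for the pivot table approach in the notebook; the `toy` frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the notebook
toy = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Education': ['Graduate', 'Graduate', 'Graduate',
                  'Not Graduate', 'Not Graduate', 'Not Graduate'],
    'LoanAmount': [100.0, 120.0, np.nan, 200.0, np.nan, 260.0],
})

# Median LoanAmount of each (Self_Employed, Education) group, repeated per row;
# NaN values are skipped when the median is computed
group_medians = toy.groupby(['Self_Employed', 'Education'])['LoanAmount'].transform('median')

# Fill only the missing entries with their group's median
toy['LoanAmount'] = toy['LoanAmount'].fillna(group_medians)
print(toy)
```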

In [None]:
# Median LoanAmount for each (Self_Employed, Education) combination
table = data.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc='median')

# function to return value of this pivot_table
def group_median(x):
    return table.loc[x['Self_Employed'],x['Education']]

In [None]:
# Replace missing values
data['LoanAmount'] = data['LoanAmount'].fillna(data[data['LoanAmount'].isnull()].apply(group_median, axis=1))

Now let's fill the missing values in the other columns with the mode (the most frequently occurring value).

In [None]:
# Assign the result back instead of using inplace=True on a column selection
for col in ['Gender', 'Married', 'Dependents', 'Loan_Amount_Term', 'Credit_History']:
    data[col] = data[col].fillna(data[col].mode()[0])

In [None]:
# Checking again for missing values
data.isnull().sum()

#### Treating extreme values with log transformation
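
A log transformation compresses a long right tail, which pulls extreme values closer to the bulk of the data. The effect can be measured with the skewness before and after the transformation. The sketch below uses synthetic log-normal values, not the loan data; the seed and distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values for illustration only
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=500))

raw_skew = values.skew()          # strongly positive for a long right tail
log_skew = np.log(values).skew()  # close to 0 after the transformation

print("Skew before: %0.2f, after: %0.2f" % (raw_skew, log_skew))
```

Note that `np.log` returns `-inf` for zero, so a column that contains zeros would need `np.log1p` or a similar adjustment.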

In [None]:
# LoanAmount
data['LoanAmount_log'] = np.log(data['LoanAmount'])
data['LoanAmount_log'].hist(bins=20)

In [None]:
# ApplicantIncome
data['ApplicantIncome_log'] = np.log(data['ApplicantIncome'])
data['ApplicantIncome_log'].hist(bins=20)

#### Type conversion: to numeric
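
`LabelEncoder` sorts the distinct values of a column and replaces each one with its position in that sorted order, so for Loan_Status the label N becomes 0 and Y becomes 1. A small sketch on made-up labels shows the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
labels = ['Y', 'N', 'Y']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(list(le_demo.classes_))  # sorted distinct values; index = encoded integer
print(encoded.tolist())
```

Since the loop in the notebook reuses one encoder for every column, `le.classes_` only reflects the last column fitted; keeping one encoder per column would be needed to decode the values later.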

In [None]:
# Label encoding for numerical processing
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])
data.dtypes

In [None]:
data.head()

#### Processed dataset for analysis

In [None]:
pro_data = data.drop('IncomeRange', axis=1)

In [None]:
pro_data.head(2)

In [None]:
# moving the loan status column to the end since it is our label
new_cols = [col for col in pro_data.columns if col != 'Loan_Status'] + ['Loan_Status']
pro_data = pro_data[new_cols]

In [None]:
pro_data.head(2)

In [None]:
# It's better to save the data (index=False avoids writing the row index as an extra column)
pro_data.to_csv('processed_data.csv', index=False)

### Base Model

In [None]:
# Required libraries (Scikit learn)
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [None]:
# dividing the dataset into x (variables) and y (label). Also Loan_ID is not required for analysis
x = pro_data.drop(['Loan_ID', 'Loan_Status'], axis = 1) 
y = pro_data["Loan_Status"] 
print(x.shape) 
print(y.shape) 

In [None]:
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split 

# Split the data into training and testing sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42) 

In [None]:
# Decision tree
dtc = DecisionTreeClassifier()

#Fit the model:
dtc.fit(x_train, y_train)

#Make predictions on the test set:
dtc_predictions = dtc.predict(x_test)

#Print accuracy
dtc_accuracy = metrics.accuracy_score(y_test, dtc_predictions)
print ("Accuracy of Decision Tree: %s" % "{0:.3%}".format(dtc_accuracy))

Read more on [Decision tree](https://www.datacamp.com/community/tutorials/decision-tree-classification-python).

In [None]:
# Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian Classifier
nbc = GaussianNB()

#Fit the model:
nbc.fit(x_train, y_train)

#Make predictions on the test set:
nbc_predictions = nbc.predict(x_test)

#Print accuracy
nbc_accuracy = metrics.accuracy_score(y_test, nbc_predictions)
print ("Accuracy of Naive Bayes: %s" % "{0:.3%}".format(nbc_accuracy))

More resources on Naive Bayes can be found [here](https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn).

#### Basic improvements with Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

We perform 10-fold cross-validation, a common rule of thumb, so that every row is used for both training and validation.
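
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The splitter can also be passed explicitly, which makes it possible to shuffle the rows with a fixed seed. The sketch below runs on synthetic data from `make_classification`; `X_toy`, `y_toy` and the seed are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data for illustration only
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.33, 0.67], random_state=42
)

# Each fold keeps roughly the same class ratio as the full dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_toy, y_toy, cv=cv, scoring='accuracy'
)
print("Mean accuracy: %0.2f (std %0.2f)" % (scores.mean(), scores.std()))
```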



In [None]:
clf1 = DecisionTreeClassifier()
dtc_CV_scores = cross_val_score(clf1, x, y, cv=10)
print("DT CV Accuracy: %0.2f (+/- %0.2f)" % (dtc_CV_scores.mean(), dtc_CV_scores.std() * 2))

In [None]:
clf2 = GaussianNB()
nbc_CV_scores = cross_val_score(clf2, x, y, cv=10)
print("NB CV Accuracy: %0.2f (+/- %0.2f)" % (nbc_CV_scores.mean(), nbc_CV_scores.std() * 2))

Next, we can do parameter optimization with **GridSearchCV** or **RandomizedSearchCV**. We can also try **ensemble** methods or a neural network. (...to be continued)
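
As a preview of that next step, a grid search over a few decision tree settings could look like the sketch below. It runs on synthetic data, and the parameter grid is an arbitrary assumption chosen for illustration, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grid: every combination is scored with 5-fold cross-validation
param_grid = {
    'max_depth': [2, 3, 5, None],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("Best CV accuracy: %0.3f" % search.best_score_)
```

On the notebook's data, `X_toy` and `y_toy` would be replaced by `x` and `y`.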