# Loan Prediction Practice Problem
A data visualization and machine learning practice project. The dataset is from **Analytics Vidhya** ([link](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#About))
<br>
<br>
***Objective: Predict Loan Eligibility for Dream Housing Finance company***

Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process (in real time) based on the details customers provide when filling in the online application form. These details are **Gender**, **Marital Status**, **Education**, **Number of Dependents**, **Income**, **Loan Amount**, **Credit History** and others. To automate this process, they have provided a dataset for identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.
<br>
<br>
***Column Description***

* *Loan_ID* -	Unique Loan ID
* *Gender* -	Male/ Female
* *Married* -	Applicant married (Y/N)
* *Dependents* -	Number of dependents
* *Education* -	Applicant Education (Graduate/ Under Graduate)
* *Self_Employed* -	Self employed (Y/N)
* *ApplicantIncome* -	Applicant income
* *CoapplicantIncome* -	Coapplicant income
* *LoanAmount* -	Loan amount in thousands
* *Loan_Amount_Term* -	Term of loan in months
* *Credit_History* -	credit history meets guidelines
* *Property_Area* -	Urban/ Semi Urban/ Rural
* *Loan_Status* -	(Target) Loan approved (Y/N)

In [None]:
# Set up
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import *
%matplotlib inline

train = pd.read_csv('./data/train.csv')
print(train.shape)
train.head()

In [None]:
train.dtypes

In [None]:
# Collecting the 'object' columns
obj_cols = list(train.columns[train.dtypes=='object'])
obj_cols.remove('Loan_ID')
obj_cols

In [None]:
# Showing unique values of each column with 'object' data types
# This will help determine how to preprocess them into number data types
for col in obj_cols:
    print(f"{col}:")
    print(f"{train[col].unique()}\n")

# Preprocessing Data

In [None]:
ntrain = train.copy()

## Finding missing values

In [None]:
rows = train.shape[0]

for col in ntrain.columns:
    missing = ntrain[col].isnull().sum()
    if missing > 0:
        print(f"{col}: {(missing/rows)*100:.2f}%")
        print(f"{missing} missing values out of {rows}\n")

Luckily, none of the columns above is missing a large share of its values, so I will fill in the missing values myself, based on patterns among similar applicants. I will start with the applicants' personal information: gender, marital status, number of dependents, employment type, and credit history.
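
The per-column loop above can also be collected into a single table, which is easier to scan and sort. This is only a minimal sketch: `missing_summary` is a hypothetical helper name, and `toy` is a made-up frame standing in for the real training data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame, not the real training data
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "No", None, None],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(ntrain)`.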

### Gender

In [None]:
ntrain[ntrain.Gender.isnull()]

In [None]:
sns.set(rc={'figure.figsize': (12,6)})

# The 'Gender' column
subplot(1,2,1)
sns.countplot(data=ntrain, x='Gender')

# The 'Married' column
subplot(1,2,2)
sns.countplot(data=ntrain, x='Married', hue='Gender')

plt.show()

One thing I can infer from the second plot is that a married applicant with a missing gender value is most likely male.
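
To put a number on what the count plot suggests, a row-normalised cross-tabulation gives the share of each gender within each marital status. The sketch below uses a made-up `toy` frame (an assumption, not the real data), so the printed shares are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Gender":  ["Male", "Male", "Male", "Female", "Male", "Female", "Female", "Male"],
    "Married": ["Yes",  "Yes",  "Yes",  "Yes",    "No",   "No",     "No",     "No"],
})

# Row-normalised crosstab: each row sums to 1, so a cell is the share of
# that gender among applicants with that marital status
gender_share = pd.crosstab(toy["Married"], toy["Gender"], normalize="index")
print(gender_share)
```

The same call on `ntrain["Married"]` and `ntrain["Gender"]` would show how strong the pattern is before any value is filled in.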

In [None]:
ntrain.loc[(ntrain.Gender.isnull()) & (ntrain.Married=='Yes'), "Gender"] = "Male"
ntrain.loc[(ntrain.Gender.isnull()) & (ntrain.Married=='Yes')]

In [None]:
ntrain[ntrain.Gender.isnull()]

All three applicants who still do not have a gender value are unmarried and have an education level of graduate (or above). Applicants no. 507 and no. 588 both have no dependents and are not self-employed, while applicant no. 592 has 3+ dependents, is self-employed, and is a high-income earner. Now I want to see which gender shows similar patterns.
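
Besides plotting, the "similar pattern" idea can be sketched as a filter: keep the rows that match an applicant's known attributes and look at how the target column is distributed among them. `value_shares_for_profile` is a hypothetical helper and `toy` is made-up data, so treat this as an illustration rather than the method used in this notebook.

```python
import pandas as pd


def value_shares_for_profile(df, target, profile):
    """Share of each `target` value among rows matching every key/value in `profile`."""
    mask = pd.Series(True, index=df.index)
    for column, value in profile.items():
        mask &= (df[column] == value)
    # Rows where the target itself is missing are dropped by value_counts
    return df.loc[mask, target].value_counts(normalize=True)


# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Gender":        ["Male", "Male", "Female", "Male", None,  "Female"],
    "Married":       ["No",   "No",   "No",     "Yes",  "No",  "No"],
    "Dependents":    ["0",    "0",    "0",      "0",    "0",   "2"],
    "Self_Employed": ["No",   "No",   "No",     "No",   "No",  "No"],
})

profile = {"Married": "No", "Dependents": "0", "Self_Employed": "No"}
shares = value_shares_for_profile(toy, "Gender", profile)
print(shares)
```

With only a handful of matching rows the shares are noisy, so it is worth checking how many rows the profile actually matches before trusting the result.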

In [None]:
# Plotting cases for applicants who have 3+ dependents, are above-average income earners, and are self employed.

sns.set(rc={'figure.figsize': (18,7)})

subplot(1,3,1)
plt.title('3+ Dependents')
sns.countplot(data=ntrain[ntrain.Dependents=='3+'], x='Gender')

subplot(1,3,2)
plt.title('Above-average income earners')
sns.swarmplot(data=ntrain[ntrain.ApplicantIncome>ntrain.ApplicantIncome.mean()], x='Gender', y='ApplicantIncome')

subplot(1,3,3)
plt.title('Self employed applicants')
sns.countplot(data=ntrain, x='Self_Employed', hue='Gender')

plt.show()

So applicant no. 592 is most likely male.

In [None]:
ntrain.loc[592,'Gender'] = 'Male'
ntrain.iloc[592, :]

In [None]:
# Plotting cases for applicants with no dependents and applied for a loan amount of about $90~100 grand
sns.set(rc={'figure.figsize': (14,6)})

subplot(1,2,1)
plt.title('No dependents')
sns.countplot(data=ntrain[ntrain.Dependents=='0'], x='Gender')

subplot(1,2,2)
plt.title('Loan amount between $90~100 grand')
sns.swarmplot(data=ntrain[(ntrain.LoanAmount<100)&(ntrain.LoanAmount>90)], x='Gender', y='LoanAmount')

plt.show()

So applicants no. 507 and no. 588 are most likely male as well.

In [None]:
ntrain.loc[[507,588],'Gender'] = 'Male'
ntrain.iloc[[507,588], :]

In [None]:
ntrain[ntrain.Gender.isnull()]

### Marriage Status

In [None]:
ntrain[ntrain.Married.isnull()]

Observations:
1. Applicants no. 104 and no. 228 are both male, have below-average income, and applied for approximately the same loan amount.
2. Applicant no. 435 is female and has a well-above-average income.

So I will look at the marital status of applicants with similar gender and income patterns.

In [None]:
# Marriage status of male applicants with below-average income and who applied for a loan amount between $150~170 grand
# I'm supposing the loan amount and marriage status would have some sort of correlation...
filtered = ntrain[(ntrain.Gender=='Male')&(ntrain.ApplicantIncome<ntrain.ApplicantIncome.mean())&(ntrain.LoanAmount>150)&(ntrain.LoanAmount<170)]
sns.countplot(data=filtered, x='Married')

It seems reasonable to set the marital status of applicants no. 104 and no. 228 to 'Yes'.
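
The hunch that loan amount and marital status are related can also be checked across the whole range instead of a single $150~170 grand window. A rough sketch, assuming arbitrary band edges and a made-up `toy` frame (neither comes from the real data):

```python
import pandas as pd

# Hypothetical toy data; LoanAmount is in thousands, as in the dataset
toy = pd.DataFrame({
    "Married":    ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "LoanAmount": [160,   95,   155,   165,   100,  120,   152,  90],
})

# Bucket loan amounts into bands; the edges here are arbitrary choices
bands = pd.cut(toy["LoanAmount"], bins=[0, 100, 150, 200])

# Share of married applicants per band (mean of a boolean column)
married_share = (toy["Married"] == "Yes").groupby(bands, observed=True).mean()
print(married_share)
```

If the share barely changes from band to band, loan amount is probably a weak basis for filling in marital status.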

In [None]:
ntrain.loc[[104,228],'Married'] = 'Yes'
ntrain.iloc[[104,228], :]

In [None]:
# Marriage status of high-income female applicants
sns.swarmplot(data=ntrain[(ntrain.Gender=='Female')&(ntrain.ApplicantIncome>8000)], x='Married', y='ApplicantIncome')

In [None]:
ntrain.loc[435, 'Married'] = 'No'
ntrain.iloc[435]

In [None]:
ntrain[ntrain.Married.isnull()]

### Dependents

In [None]:
ntrain[ntrain.Dependents.isnull()]

In [None]:
sns.countplot(data=ntrain, x='Dependents', hue='Married')

Some immediate findings: regardless of the applicant's marital status or education level, the chance of them having any dependents is relatively small, especially if the applicant is not married. To me this seems like a safe indication that it is okay to replace all null 'Dependents' values of unmarried applicants with '0'.
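
The claim can be backed with a quick number: the share of applicants with zero dependents, split by marital status. The sketch below uses a made-up `toy` frame (an assumption), then applies the same conditional fill to it.

```python
import pandas as pd

# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Married":    ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
    "Dependents": ["0",  "0",  "0",  "1",  "0",   "2",   "1",   "3+",  None],
})

# Only rows with a known Dependents value can serve as evidence
known = toy.dropna(subset=["Dependents"])
zero_share = (known["Dependents"] == "0").groupby(known["Married"]).mean()
print(zero_share)

# Fill missing Dependents with '0' for unmarried applicants only
filled = toy.copy()
mask = filled["Dependents"].isnull() & (filled["Married"] == "No")
filled.loc[mask, "Dependents"] = "0"
```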

In [None]:
ntrain.loc[ntrain.Dependents.isnull() & (ntrain.Married=='No'), 'Dependents'] = '0'
ntrain[ntrain.Dependents.isnull()]

Now, if the applicant is married, is there a definite pattern between the education level, income, loan amount of applicants and the number of dependents they have?
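
One way to start answering this is to group married applicants by number of dependents and compare typical income and loan amount per group. This is a sketch on made-up data (`toy` is hypothetical); the median is my choice here because a few very high incomes can dominate a mean.

```python
import pandas as pd

# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Married":         ["Yes", "Yes", "Yes", "Yes", "Yes", "No"],
    "Dependents":      ["0",   "0",   "1",   "2",   "2",   "0"],
    "ApplicantIncome": [3000,  5000,  4000,  6000,  8000,  2500],
    "LoanAmount":      [100,   140,   120,   150,   210,   80],
})

married = toy[toy["Married"] == "Yes"]

# Median and group size per number of dependents
profile_by_dependents = (
    married.groupby("Dependents")[["ApplicantIncome", "LoanAmount"]]
    .agg(["median", "count"])
)
print(profile_by_dependents)
```

The `count` column matters as much as the median: a group with only a few rows is not much of a pattern.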

In [None]:
females = ntrain[ntrain.Gender=='Female']
males = ntrain[ntrain.Gender=='Male']

### Marriage status and Income, by gender

In [None]:
print("Female applicants' Income, by their marriage status")
print(f"The average income for all female applicants is: {ntrain[ntrain.Gender=='Female'].ApplicantIncome.mean():.2f}")
income_by_marriage_female = pd.pivot_table(females, index='Married', values='ApplicantIncome')
income_by_marriage_female

In [None]:
print("Male applicants' Income, by their marriage status")
print(f"The average income for all male applicants is: {ntrain[ntrain.Gender=='Male'].ApplicantIncome.mean():.2f}")
income_by_marriage_male = pd.pivot_table(males, index='Married', values='ApplicantIncome')
income_by_marriage_male

In [None]:
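# Relative difference in mean income, married vs. unmarried (a negative value means married applicants earn less)
print(f"Married female applicants earn {(income_by_marriage_female.loc['Yes', 'ApplicantIncome']/income_by_marriage_female.loc['No', 'ApplicantIncome'] - 1)*100:.2f}% more than their unmarried counterparts")
print(f"Married male applicants earn {(income_by_marriage_male.loc['Yes', 'ApplicantIncome']/income_by_marriage_male.loc['No', 'ApplicantIncome'] - 1)*100:.2f}% more than their unmarried counterparts")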

In [None]:
pd.pivot_table(females, index=['Married','Education'], values=['ApplicantIncome', 'CoapplicantIncome'])

1. The highest-earning group of female applicants is married females with an education level of graduate (or higher).
2. Unmarried female applicants with a lower education level have the lowest income but the highest co-applicant income.
3. Surprisingly, among unmarried applicants, non-graduates earn more than graduates.

In [None]:
pd.pivot_table(males, index=['Married','Education'], values=['ApplicantIncome', 'CoapplicantIncome'])

In [None]:
sns.countplot(data=females, x='Married', hue='Education')

In [None]:
males.describe()

In [None]:
# How does the level of education affect an applicant's Income bracket 
# (as well as the income bracket of their coapplicant)
print(pd.pivot_table(ntrain, index='Education', values='ApplicantIncome'))
print('\n')
print(pd.pivot_table(ntrain, index='Education', values='CoapplicantIncome'))

In [None]:
# The 'Dependents' column
sns.set(rc={'figure.figsize': (14,20)})

subplot(3,1,1)
plt.title("Number of dependents & Marriage status")
f1 = sns.countplot(data=ntrain, x='Dependents', hue='Married')

subplot(3,1,2)
plt.title("Number of dependents & Education level")
f2 = sns.countplot(data=ntrain, x='Dependents', hue='Education')

subplot(3,1,3)
plt.title("Number of dependents & Applicant Income")
f3 = sns.swarmplot(data=ntrain, x='Dependents', y='ApplicantIncome')

plt.show()

In [None]:
# The 'Self_Employed' column

In [None]:
# The 'LoanAmount' and 'Loan_Amount_Term' columns

In [None]:
# The 'Credit_History' column

In [None]:
ntrain.loc[ntrain.Dependents.isnull(), 'Dependents'] = '0'
ntrain[ntrain.Dependents.isnull()]

In [None]:
# Starting over from a fresh copy of the original dataset
# (note: this discards the missing-value imputation done on `ntrain` above)
ntrain = train.copy()

# Binary Encoding the following columns:
# (Generally 0 means 'no' or implies negation, while 1 the opposite)

# Gender
ntrain.loc[ntrain.Gender == 'Male', 'Gender_encoded'] = 0
ntrain.loc[ntrain.Gender == 'Female', 'Gender_encoded'] = 1
print(ntrain.shape)

# Married
ntrain.loc[ntrain.Married == 'No', 'is_married'] = 0
ntrain.loc[ntrain.Married == 'Yes', 'is_married'] = 1
print(ntrain.shape)

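# Education
# (assumes the values are 'Graduate' / 'Not Graduate'; check the unique values printed earlier)
ntrain.loc[ntrain.Education == 'Not Graduate', 'is_graduate'] = 0
ntrain.loc[ntrain.Education == 'Graduate', 'is_graduate'] = 1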
print(ntrain.shape)

# Self_Employed
ntrain.loc[ntrain.Self_Employed == 'No', 'is_self_employed'] = 0
ntrain.loc[ntrain.Self_Employed == 'Yes', 'is_self_employed'] = 1
print(ntrain.shape)

# Loan_Status
ntrain.loc[ntrain.Loan_Status == 'N', 'loan_approved'] = 0
ntrain.loc[ntrain.Loan_Status == 'Y', 'loan_approved'] = 1
print(ntrain.shape)

ntrain.head()
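
The same binary encoding can be written more compactly with `Series.map` and one dictionary per column. A minimal sketch on made-up data (`toy` and the `encodings` dictionary are my own illustration; the target column names follow the cell above):

```python
import pandas as pd

# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Gender":      ["Male", "Female", None],
    "Married":     ["Yes", "No", "Yes"],
    "Loan_Status": ["Y", "N", "Y"],
})

# One entry per source column: (new column name, value -> code)
encodings = {
    "Gender":      ("Gender_encoded", {"Male": 0, "Female": 1}),
    "Married":     ("is_married",     {"No": 0, "Yes": 1}),
    "Loan_Status": ("loan_approved",  {"N": 0, "Y": 1}),
}

encoded = toy.copy()
for source, (target, mapping) in encodings.items():
    # Values absent from the mapping (including missing ones) become NaN
    encoded[target] = encoded[source].map(mapping)

print(encoded)
```

A side benefit is that a misspelled category shows up as NaN in the new column, which makes mistakes in the mapping easy to spot with `isnull().sum()`.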

In [None]:
from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()

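# Create a new column called 'has_child' (0 = no dependents, 1 = otherwise)
# Note: rows with a missing 'Dependents' value also end up as 1 here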
ntrain.loc[ntrain.Dependents=='0', 'has_child'] = 0
ntrain.loc[ntrain.has_child.isnull(), 'has_child'] = 1

# Dependents
ntrain.Dependents = ntrain.Dependents.astype(str)
results = binarizer.fit_transform(ntrain["Dependents"])
dependents_encoded = pd.DataFrame(results, columns=['0','1','2','3','nan'])

# Property_Area
results = binarizer.fit_transform(ntrain["Property_Area"])
property_area_encoded = pd.DataFrame(results, columns=binarizer.classes_)

# 'Education' is already encoded as 'is_graduate' above, so only the two new frames are joined
ntrain = pd.concat([ntrain, property_area_encoded, dependents_encoded], axis=1, sort=False)
ntrain.drop('Property_Area', axis=1, inplace=True)
ntrain.drop('Dependents', axis=1, inplace=True)

ntrain.head()
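
As an alternative to `LabelBinarizer`, pandas can one-hot encode several columns in one call with `pd.get_dummies`, which also names the new columns after the original one. A sketch on made-up data (`toy` is hypothetical, and the category spellings in it are assumptions):

```python
import pandas as pd

# Hypothetical toy data standing in for `ntrain`
toy = pd.DataFrame({
    "Dependents":    ["0", "1", "3+", None],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

# dummy_na=True adds an explicit indicator column for missing values,
# similar to the 'nan' class produced by casting to str above
dummies = pd.get_dummies(
    toy, columns=["Dependents", "Property_Area"], dummy_na=True, dtype=int
)
print(dummies.columns.tolist())
```

The original columns are dropped automatically, so no separate `drop` calls are needed.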

# Finding correlation between columns

In [None]:
train.head()

In [None]:
sns.

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(data=train, x='Married', hue='Gender')

In [None]:
train[train.Gender.isnull()]

In [None]:
sns.countplot(data=train, x='Self_Employed', hue='Gender')

In [None]:
sns.pairplot(train, hue='Gender', height=2.5)
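
For the numeric columns, a correlation matrix is a compact complement to the pair plot. A minimal sketch on made-up data (`toy` is hypothetical, so the values mean nothing about the real dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for `train`
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 4000, 6000, 8000],
    "LoanAmount":      [80,   120,  160,  200],
    "Credit_History":  [1.0,  0.0,  1.0,  1.0],
    "Gender":          ["Male", "Female", "Male", "Male"],
})

# Pearson correlation is only defined for numeric columns, so select those first
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting frame can be passed to `sns.heatmap(corr, annot=True)` for a visual version.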

# Preprocessing Data (2)