## Loan Prediction with Python 

This tutorial is based on the course [Loan Prediction with Python](https://www.analyticsvidhya.com/blog/2018/07/learn-and-test-your-machine-learning-skills-with-avs-new-practice-problems-and-free-courses/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29)

### 1. Problem statement

> *Dream Housing Finance company deals in all home loans. They have presence across all urban, semi urban and rural areas. Customer first apply for home loan after that company validates the customer eligibility for loan. Company wants to automate the loan eligibility process (real time) based on customer detail provided while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have given a problem to identify the customers segments, those are eligible for loan amount so that they can specifically target these customers.*
***
***From the above statement, we understand that the goal is to predict which loan applicants will be approved***

### 2. Generating Hypotheses  

A crucial step in building any model is to generate a set of hypotheses. As with [hypothesis testing](http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/) in statistics, this should be done before diving into the data. So we may ask: **How would a customer's details affect loan approval?**
Here are my own hypotheses:

- Married people would have a higher chance of being approved compared to single applicants.
- Approval probability will increase with the level of education and with a higher credit score.
- The higher the income, the higher the probability of being approved for a loan.
- Approval probability will decrease as the loan amount increases.
- There will be an age-group dependence:
    - Approval probability for people in their early 20s will be lower than for people in their 30s and 40s.
    - Old and retired people will have a lower chance of approval.
- People who possess other assets, such as businesses, may have a higher chance of approval.
- Certain people, such as government employees, may be considered to have a more stable job (income), and hence a greater chance of approval than others, such as contractors.

### 3. Data Exploration

The data contains the following features: 

| Variable | Description   |
|:---|------|
| Loan_ID | Unique Loan ID|
| Gender  | Male/Female |
| Married | Applicant married (Y/N)|
| Dependents | Number of dependents|
| Education | Applicant education (Graduate / Not Graduate)|
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome| Applicant income|
| CoapplicantIncome| Coapplicant income|
| LoanAmount| Loan amount in thousands|
| Loan_Amount_Term| Term of loan in months|
| Credit_History| Credit history meets guidelines|
| Property_Area| Urban/ Semi Urban/ Rural|
| Loan_Status | Loan approved (Y/N) |

In [1]:
import pandas as pd
import numpy as np                     
#import seaborn as sns                  
import matplotlib.pyplot as plt        
%matplotlib inline
import warnings                        
warnings.filterwarnings("ignore")

In [2]:
#parse the data into pandas dataframes
df_train = pd.read_csv('data/train_u6lujuX_CVtuZ9i.csv')
df_test = pd.read_csv('data/test_Y3wMUE5_7gLdaTN.csv') 

#make a safe copy of dataframes
df_train_original = df_train.copy()
df_test_original = df_test.copy()

In [3]:
df_train.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
print("Train set contains {} observations and {} features".format( df_train.shape[0], df_train.shape[1] )) 

Train set contains 614 observations and 13 features


### Testing our Hypotheses

###### Married people would have a higher chance of being approved compared to single applicants

In [5]:
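# Encode the target as integers: Y -> 1 (approved), N -> 0 (rejected).
# Assigning the result back avoids relying on an in-place replace on a column selection.
df_train['Loan_Status'] = df_train['Loan_Status'].map({'Y': 1, 'N': 0})
df_train.head()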

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | 1 |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | 0 |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | 1 |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | 1 |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | 1 |


In [6]:
df_train[['Married', 'Loan_Status']].groupby('Married').count()

| Married | Loan_Status |
|:---|---|
| No | 213 |
| Yes | 398 |


In [7]:
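def get_proportion(df, col_name):
    """Print the approved/rejected proportions among rows where col_name is "Yes"."""
    # Keep only the rows in the "Yes" group of the chosen column
    data = df[df[col_name] == "Yes"]
    approved = len(data[data['Loan_Status'] == 1])
    rejected = len(data[data['Loan_Status'] == 0])
    print("proportion of approved {} {:6.2f}, rejected {:6.2f}".format(col_name, approved/len(data), rejected/len(data)))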

In [8]:
get_proportion(df_train, col_name="Married")

proportion of approved Married   0.72, rejected   0.28
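
The helper above reports only the `"Yes"` group. To compare groups side by side, one option is to take the mean of the 0/1 target within each category. The following is a minimal sketch under that assumption; the function name `approval_rate_by` and the small `toy` DataFrame are made up for illustration and are not part of the loan data.

```python
import pandas as pd

# Toy data for illustration only (NOT the loan dataset).
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No"],
    "Loan_Status": [1, 1, 0, 1, 0, 1, 0, 0],
})

def approval_rate_by(df, col_name, target="Loan_Status"):
    """Return the share of approved loans for each category of col_name."""
    # With a 0/1 target, the mean within a group equals its approval proportion.
    return df.groupby(col_name)[target].mean()

rates = approval_rate_by(toy, "Married")
print(rates)
```

On the real data the call would presumably be `approval_rate_by(df_train, "Married")`. Note that `groupby` drops rows whose key is missing by default, which is consistent with the counts above summing to 611 rather than 614.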


The counts show more married applicants (398) than single applicants (213). About 72% of the married applicants were approved, and the proportion of single applicants that got approved was lower. This agrees with our initial hypothesis. A more rigorous statistical test would confirm whether the difference in proportions is indeed significant.
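
One common choice for comparing two proportions is a two-proportion z-test. The sketch below is one possible implementation using only the standard library, not necessarily the test the tutorial had in mind. The function name `two_proportion_ztest` is hypothetical, and the counts in the example call are toy numbers, not values taken from the loan data.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis that both groups
    # share the same underlying approval rate.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy counts for illustration only: 72 of 100 approved vs. 60 of 100 approved.
z, p_value = two_proportion_ztest(72, 100, 60, 100)
print("z = {:.3f}, p-value = {:.4f}".format(z, p_value))
```

For the loan data, the inputs would be the approved count and group size for married and for single applicants. A small p-value would suggest that the difference in approval rates is unlikely to be due to chance alone.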