![Loan Prediction](./imgs/header.png "Loan Prediction")

# Data Science Approach

-  Problem understanding
-  Data understanding
-  Exploratory data analysis
-  Data Pre-processing
-  Feature engineering
-  Model building
-  Model Evaluation


## Problem Statement
#### A company wants to automate its loan eligibility process in real time, based on the details customers provide when filling in an online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has posed a classification problem: identify the customers who are eligible for a loan. The challenge is to predict the approval status of each loan (Approved/Rejected).

# Problem Understanding


# Data Understanding

|Variable | Description
| :--- | :---
|Loan_ID | Unique loan ID
|Gender | Male/Female
|Married | Applicant married (Y/N)
|Dependents | Number of dependents
|Education | Applicant education (Graduate/Under Graduate)
|Self_Employed | Self employed (Y/N)
|ApplicantIncome | Applicant income
|CoapplicantIncome | Coapplicant income
|LoanAmount | Loan amount in thousands
|Loan_Amount_Term | Term of loan in months
|Credit_History | Credit history meets guidelines
|Property_Area | Urban/Semi Urban/Rural
|Loan_Status | Loan approved (Y/N)

# Load libraries

In [None]:
# import plotly.plotly as py
import plotly.graph_objs as go
import cufflinks as cf
import pandas as pd # data processing, CSV file I/O
import numpy as np # linear algebra, array-based numerical computation
import matplotlib.pyplot as plt # plotting
import os # accessing directory structure

# plotly offline mode injects the plotly source code directly into the notebook; it is usually the mode that works on Kaggle
from plotly.offline import init_notebook_mode, iplot

# matplotlib inline is a magic function that renders the figure in a notebook
%matplotlib inline 

# Import dataset

In [None]:
print(os.listdir('../input'))  # list the files available in the input directory
train_path = "../input/train.csv"
test_path = "../input/test.csv"

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

## pandas.DataFrame.info : index dtype, column dtypes, non-null values and memory usage.    

In [None]:
train_df.info()

In [None]:
test_df.info()

## pandas.DataFrame.describe : Summary of numeric features

In [None]:
train_df.describe()

In [None]:
train_df.head(10)

### Missing data
-  Loan amount: NaN values can be replaced with the **mean** loan amount
-  Loan amount term: missing values can be replaced with the **mode** of the loan term
-  Only 84% of customers have a credit history; null values can be treated as **0**, i.e. no credit history
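
Before choosing a replacement strategy, it helps to count how many values are actually missing in each column. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative assumptions, not the competition data); the same calls can be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, 141.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
})

# Number of missing values per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Share of rows that have a recorded credit history
credit_history_share = toy_df["Credit_History"].notnull().mean()
print(credit_history_share)
```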


## Exploratory data analysis
-  Numeric
 -  Scatter plots
 -  Histograms
-  Categorical
 -  Bar Chart
 -  Stacked Bar Chart

## Plotly & Matplotlib
-  `iplot` stands for interactive plot. Plotly takes Python code and produces interactive JavaScript plots. It gives you a lot of control over how the plots look, and lets you zoom, show information on hover and toggle data series.

In [None]:
genderCount = train_df['Gender'].value_counts()
genderCount.dtype

In [None]:
init_notebook_mode(connected=True)

## Gender Distribution (Categorical)
### Bar chart - A bar chart is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent.

In [None]:
genderCount = train_df['Gender'].value_counts()
genderCount

In [None]:
bar_data = [go.Bar(
    x = genderCount.index.tolist(),  # category labels taken from the counts, so labels and bars stay aligned
    y = genderCount.tolist()
)]
iplot(bar_data, filename='basic-bar')

## Applicant Income (Numeric)
### Histogram - A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of the continuous variable.

## Applicant Income
### Box plot / Box Whiskers plot - A box plot is a method for graphically depicting groups of numerical data through their quartiles. It is a standardized way of displaying the distribution of data based on the summary: minimum, first quartile, median, third quartile, and maximum.
-  Tells about the spread of the data
-  __[Box plot tutorial](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review)__
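
The summary behind a box plot can be computed directly with pandas. This is a minimal sketch on made-up income values (the numbers are an assumption for illustration); the 1.5 × IQR rule used here to flag outliers is a common plotting convention, not something specified in this notebook.

```python
import pandas as pd

# Made-up incomes with one extreme value
income = pd.Series([1500, 2500, 3000, 3800, 4500, 5800, 81000], name="ApplicantIncome")

# Five-number summary: minimum, first quartile, median, third quartile, maximum
summary = income.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Interquartile range and the upper whisker limit under the 1.5 * IQR convention
iqr = summary.loc[0.75] - summary.loc[0.25]
upper_fence = summary.loc[0.75] + 1.5 * iqr

# Points above the upper fence would be drawn as individual outliers
outliers = income[income > upper_fence]
print(summary)
print(outliers.tolist())
```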

## Relationship between ApplicantIncome and Education
### Boxplot (plotly)

## Relationship between ApplicantIncome and Education
### Boxplot (Matplotlib)

In [None]:
# one box per education level
train_df.boxplot(column='ApplicantIncome', by='Education')

## LoanAmount 
### Histogram

## LoanAmount 
### BoxPlot 

## Frequency & Probability of getting loan based on Credit History

## Applicants by credit history
### Bar chart 

## Probability of getting loan by credit history
### Bar chart
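
One way to obtain the numbers behind these two bar charts is sketched below, assuming `Credit_History` is coded 1/0 and `Loan_Status` is Y/N as in the data dictionary. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
})

# Frequency: number of applicants in each credit history group
applicants_by_history = toy_df["Credit_History"].value_counts()

# Probability: share of approved loans within each credit history group
approved = toy_df["Loan_Status"].map({"Y": 1, "N": 0})
approval_rate = approved.groupby(toy_df["Credit_History"]).mean()

print(applicants_by_history)
print(approval_rate)
```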

## Applicant credit history & probability of getting loans
### Stacked Bar chart 

## Gender & Credit history (Cross tab)
### Stacked Bar chart 
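
A cross tab of this kind can be built with `pd.crosstab`. The sketch below uses made-up rows; calling `counts.plot(kind="bar", stacked=True)` on the result is one possible way to draw the stacked bars.

```python
import pandas as pd

# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "Credit_History": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

# Counts of applicants for every Gender / Credit_History combination
counts = pd.crosstab(toy_df["Gender"], toy_df["Credit_History"])

# Same table as row-wise shares, so each gender sums to 1
shares = pd.crosstab(toy_df["Gender"], toy_df["Credit_History"], normalize="index")

print(counts)
print(shares)
```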

***

## Data Pre-processing
-  Missing value treatment

## Loan Amount 

In [None]:
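# assign the result back instead of using inplace=True on a column selection, which newer pandas versions warn about
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())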

In [None]:
train_df[['LoanAmount']].info()

## Treat extreme values in loan amount
### Log Transformation
-  Small values that are close together are spread further out.
-   Large values that are spread out are brought closer together.
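
The effect can be seen on a few made-up loan amounts. This is a sketch using the natural log via `np.log`; the values are illustrative, and the column name `LoanAmount_log` follows the heading used later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts: two small values, one medium, one extreme
loan_amount = pd.Series([10.0, 20.0, 100.0, 700.0], name="LoanAmount")

# Natural log transform
loan_amount_log = np.log(loan_amount).rename("LoanAmount_log")

# Gap between the two smallest and the two largest values, before and after
raw_gaps = (loan_amount.iloc[1] - loan_amount.iloc[0],
            loan_amount.iloc[3] - loan_amount.iloc[2])
log_gaps = (loan_amount_log.iloc[1] - loan_amount_log.iloc[0],
            loan_amount_log.iloc[3] - loan_amount_log.iloc[2])

print(raw_gaps)  # the large gap is 60 times the small one
print(log_gaps)  # after the transform the large gap is under 3 times the small one
```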

## Self_Employed

## LoanAmount_log

## Missing value for Categorical features
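
A common choice for categorical columns is to fill missing entries with the most frequent category. The sketch below assumes that strategy (the notebook only states it explicitly for the loan term) and uses a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Married": ["Yes", np.nan, "Yes", "No"],
    "Dependents": ["0", "1", "0", np.nan],
})

for column in ["Gender", "Married", "Dependents"]:
    # mode() returns a Series because ties are possible; take the first entry
    most_frequent = toy_df[column].mode()[0]
    toy_df[column] = toy_df[column].fillna(most_frequent)

print(toy_df)
```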

### GENDER

### Married

## Dependents

## Loan_Amount_Term

## Credit_History

## Feature Engineering

### Total Income Feature
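
A total income feature can be sketched as the sum of the applicant and coapplicant incomes. The column name `TotalIncome` is an assumption made for this example, and the rows are made up.

```python
import pandas as pd

# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0.0, 1500.0],
})

# Combined household income available to repay the loan
toy_df["TotalIncome"] = toy_df["ApplicantIncome"] + toy_df["CoapplicantIncome"]
print(toy_df)
```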

## Model Building
### Label Encoding the categorical features
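
Label encoding maps each category to an integer so that scikit-learn estimators can consume it. The following is a minimal sketch with `sklearn.preprocessing.LabelEncoder` on made-up rows; keeping one encoder per column (the `encoders` dict is a convenience introduced here) allows the mapping to be reversed later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

encoders = {}
for column in ["Gender", "Property_Area", "Loan_Status"]:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and numbered from 0
    toy_df[column] = encoder.fit_transform(toy_df[column])
    encoders[column] = encoder

print(toy_df)
print(list(encoders["Loan_Status"].classes_))
```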

### Check the data types

## Classification
-  Fit the model
-  Predict using the model
-  Score the model
-  Cross-validation
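
The four steps above can be sketched with scikit-learn as follows. Synthetic data from `make_classification` stands in for the encoded loan features, and the split size and fold count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 2. Predict using the model
predictions = model.predict(X_test)

# 3. Score the model (mean accuracy on the held-out split)
test_accuracy = model.score(X_test, y_test)

# 4. Cross-validation: accuracy on each of 5 folds of the full data
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(test_accuracy)
print(cv_scores.mean())
```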

## LOGISTIC REGRESSION

<br>__[LOGISTIC REGRESSION TUTORIAL](https://www.youtube.com/watch?v=H6ii7NFdDeg)__ 
<br>
<br>

## Logistic Regression Decision Boundary

![LR_Decision_Boundary](./imgs/lr_1.png "LR_Decision_Boundary")

## DECISION TREES

![DT https://www.crondose.com/2016/07/easy-way-understand-decision-trees/](./imgs/dt_2.png "DT")

## Decision Tree Decision Boundary

![DT_Decision_Boundary](./imgs/dt_1.png "DT_Decision_Boundary")

## RANDOM FOREST : Bootstrap aggregation/Bagging
-  Combining N weak learners on M samples
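
To make the idea concrete, here is a hand-rolled sketch of plain bagging: each shallow tree is trained on a bootstrap sample and the ensemble predicts by majority vote. The data is synthetic and the tree depth and ensemble size are arbitrary assumptions. A random forest additionally samples a subset of features at every split, which this sketch leaves out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with labels 0 and 1
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

n_learners = 25
learners = []
for _ in range(n_learners):
    # Bootstrap sample: draw row indices with replacement
    sample_idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # shallow, weak learner
    tree.fit(X[sample_idx], y[sample_idx])
    learners.append(tree)

# Majority vote across the learners
votes = np.stack([tree.predict(X) for tree in learners])
ensemble_prediction = (votes.mean(axis=0) >= 0.5).astype(int)

# Accuracy on the training data, shown only to confirm the ensemble works
ensemble_accuracy = (ensemble_prediction == y).mean()
print(ensemble_accuracy)
```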

![Random Forest](./imgs/rf_2.png "Random Forest")

## Why do we need weak learners?

<br>__[RANDOM FOREST TUTORIAL](https://www.youtube.com/watch?v=QHOazyP-YlM)__
<br>
<br>

## Random Forest Decision Boundary

![RF_Decision_Boundary](./imgs/rf_1.png "RF_Decision_Boundary")

## Random Forest Model Parameters

## Hyper-parameter Optimization
### Parameters that are external to model and cannot be estimated from the data
-  Grid Search
-  Random Search

## Logistic Regression
### Parameters
-  Coefficients

### Hyper-parameters 
-  Regularization penalty - ['l1', 'l2']
-  Inverse of regularization strength - C. With larger C values the model is regularized less, so it can increase its complexity and fit the data more closely.
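
A grid search over these two hyper-parameters might look like the sketch below. The data is synthetic, the candidate `C` values are arbitrary, and the `liblinear` solver is chosen here because it supports both the `l1` and `l2` penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Every combination of penalty and C is tried: 2 x 4 = 8 candidates
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(search.best_score_)
```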

## Random Forest
### Hyper-parameters
-  n_estimators : number of trees in the forest
-  min_samples_split : min number of data points placed in a node before the node is split
-  max_depth : max number of levels in each decision tree
-  max_features : max number of features considered for splitting a node
-  min_samples_leaf : min number of data points allowed in a leaf node
-  bootstrap : method for sampling data points (with or without replacement)
-  criterion : splitting criteria
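
With this many hyper-parameters a full grid becomes expensive, so a random search over the same names is a common alternative. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the candidate values, `n_iter` and `cv` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for each hyper-parameter listed above
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# Only n_iter random combinations are evaluated instead of the full grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```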

<br>__[HYPERPARAMETER OPTIMIZATION TUTORIAL](https://www.youtube.com/watch?v=ttE0F7fghfk)__
<br>
<br>

## Tuning Random Forest

# WHAT TO DO NEXT

-  Determine the creditworthiness, particularly for those without credit histories
