# <font size=6.5> <font color = darkblue> Understanding Decision Trees with a Loan Problem

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
plt.rcParams['font.size']=14
plt.rcParams['axes.grid']=True

# Can you predict which customers should not be given a loan?

<font color = darkblue > <strong>
Bunjab National Bank (BNB) is a large bank that often gives out personal and business loans. Of late the bank has been under a lot of financial stress, as many borrowers are not paying their EMIs and are hence defaulting on their loans. Some borrowers, like PJ Mallya and Mirav Modi, have even fled to the UK to avoid paying back their loans.<br>
BNB has some data about its past customers and wants to see whether Machine Learning (ML) can be used on this dataset to predict whether a customer will default on a loan. Since BNB plans to use the ML model to decide whether to sanction or reject loans to future customers, it wants to be absolutely sure of the model before deploying it.<br>
To test the ML model's performance, BNB has hidden the target column of a portion of the dataset and wants you to make predictions on this portion using your best Machine Learning model.<br>
Since precision and recall are equally important to BNB, it has decided to use F1 as the metric to evaluate your ML model.

---

* __Column Name:-------------------Description__ 
* UniqueID:------------------------------------- Identifier for customers
* disbursed_amount:------------------------- Amount of Loan disbursed
* asset_cost:----------------------------------- Cost of the Asset
* ltv:---------------------------------------------- Loan to Value of the asset
* branch_id:------------------------------------	Branch where the loan was disbursed
* Employment_Type:------------------------	Employment Type of the customer (Salaried/Self Employed)
* State_ID:------------------------------------- State of disbursement
* MobileNo_Avl_Flag:-----------------------	if Mobile no. was shared by the customer then flagged as 1
* Aadhar_flag:--------------------------------- if aadhar was shared by the customer then flagged as 1
* PERFORM_CNS_SCORE_DESCRIPTION:--------- Credit Bureau score category
* PRI_NO_OF_ACCTS:--------------------- count of total loans taken by the customer at the time of disbursement
* PRI_ACTIVE_ACCTS:-------------------- count of active loans taken by the customer at the time of disbursement
* PRI_OVERDUE_ACCTS:----------------- count of default accounts at the time of disbursement
* PRI_CURRENT_BALANCE:------------- total Principal outstanding amount of the active loans at the time of disbursement
* PRIMARY_INSTAL_AMT:-----------------	EMI Amount of the primary loan
* NEW_ACCTS_IN_LAST_SIX_MONTHS:--------------	New loans taken by the customer in the last 6 months before the disbursement
* DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS:------- Loans defaulted in the last 6 months
* NO_OF_INQUIRIES:-----------------------	Enquiries made by the customer for loans
* loan_default:----------------------------------	Payment default in the first EMI on due date (Target)

In [3]:
loan = pd.read_csv('d:/Cfiles/Datasets/Class/loan/loan_past.csv')

In [4]:
loan.head()

Unnamed: 0,UniqueID,disbursed_amount,asset_cost,ltv,branch_id,Employment_Type,State_ID,MobileNo_Avl_Flag,Aadhar_flag,PERFORM_CNS_SCORE_DESCRIPTION,PRI_NO_OF_ACCTS,PRI_ACTIVE_ACCTS,PRI_OVERDUE_ACCTS,PRI_CURRENT_BALANCE,PRIMARY_INSTAL_AMT,NEW_ACCTS_IN_LAST_SIX_MONTHS,DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS,NO_OF_INQUIRIES,loan_default
0,471668,46349,94896,50.58,103,Self employed,7,1,1,No History,0,0,0,0,0,0,0,0,0
1,616598,68369,87736,78.87,2,Self employed,4,1,1,No History,0,0,0,0,0,0,0,0,0
2,548216,58447,73842,81.93,251,Self employed,13,1,0,No History,0,0,0,0,0,0,0,0,1
3,530120,49803,69035,73.88,34,Self employed,6,1,1,No History,0,0,0,0,0,0,0,0,0
4,429247,55089,67131,87.89,42,Salaried,3,1,1,C-Medium Risk,9,3,1,1064938,17236,0,0,1,0


In [5]:
loan.shape

(100000, 19)

### Data Preparation

In [6]:
loan.isnull().sum()

UniqueID                                  0
disbursed_amount                          0
asset_cost                                0
ltv                                       0
branch_id                                 0
Employment_Type                        3312
State_ID                                  0
MobileNo_Avl_Flag                         0
Aadhar_flag                               0
PERFORM_CNS_SCORE_DESCRIPTION             0
PRI_NO_OF_ACCTS                           0
PRI_ACTIVE_ACCTS                          0
PRI_OVERDUE_ACCTS                         0
PRI_CURRENT_BALANCE                       0
PRIMARY_INSTAL_AMT                        0
NEW_ACCTS_IN_LAST_SIX_MONTHS              0
DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS       0
NO_OF_INQUIRIES                           0
loan_default                              0
dtype: int64

In [7]:
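# Assign the result back: an inplace fillna on a selected column may not update the DataFrame in recent pandas versions
loan['Employment_Type'] = loan['Employment_Type'].fillna('unknown')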

In [8]:
loan.nunique()

UniqueID                               100000
disbursed_amount                        15255
asset_cost                              32303
ltv                                      5872
branch_id                                  82
Employment_Type                             3
State_ID                                   22
MobileNo_Avl_Flag                           1
Aadhar_flag                                 2
PERFORM_CNS_SCORE_DESCRIPTION               7
PRI_NO_OF_ACCTS                            87
PRI_ACTIVE_ACCTS                           34
PRI_OVERDUE_ACCTS                          22
PRI_CURRENT_BALANCE                     33850
PRIMARY_INSTAL_AMT                      16699
NEW_ACCTS_IN_LAST_SIX_MONTHS               21
DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS        13
NO_OF_INQUIRIES                            20
loan_default                                2
dtype: int64

In [9]:
loan.drop(['MobileNo_Avl_Flag', 'UniqueID'], axis=1, inplace=True)

In [10]:
loan.dtypes

disbursed_amount                         int64
asset_cost                               int64
ltv                                    float64
branch_id                                int64
Employment_Type                         object
State_ID                                 int64
Aadhar_flag                              int64
PERFORM_CNS_SCORE_DESCRIPTION           object
PRI_NO_OF_ACCTS                          int64
PRI_ACTIVE_ACCTS                         int64
PRI_OVERDUE_ACCTS                        int64
PRI_CURRENT_BALANCE                      int64
PRIMARY_INSTAL_AMT                       int64
NEW_ACCTS_IN_LAST_SIX_MONTHS             int64
DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS      int64
NO_OF_INQUIRIES                          int64
loan_default                             int64
dtype: object

In [11]:
loan.Employment_Type.unique()

array(['Self employed', 'Salaried', 'unknown'], dtype=object)

In [15]:
# loan.groupby(['Employment_Type'])['loan_default'].mean()

In [12]:
loan.PERFORM_CNS_SCORE_DESCRIPTION.unique()

array(['No History', 'C-Medium Risk', 'D-High Risk', 'A-Very Low Risk',
       'B-Low Risk', 'Not Scored', 'E-Very High Risk'], dtype=object)

In [13]:
from sklearn.preprocessing import OrdinalEncoder

In [16]:
encoder = OrdinalEncoder(categories=[['unknown', 'Self employed', 'Salaried']])
loan.Employment_Type = encoder.fit_transform(loan[['Employment_Type']])

In [17]:
encoder = OrdinalEncoder(categories=[['No History', 'E-Very High Risk', 'D-High Risk', 'Not Scored',
                                      'C-Medium Risk', 'A-Very Low Risk', 'B-Low Risk']])
loan.PERFORM_CNS_SCORE_DESCRIPTION = encoder.fit_transform(loan[['PERFORM_CNS_SCORE_DESCRIPTION']])

In [18]:
loan.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,branch_id,Employment_Type,State_ID,Aadhar_flag,PERFORM_CNS_SCORE_DESCRIPTION,PRI_NO_OF_ACCTS,PRI_ACTIVE_ACCTS,PRI_OVERDUE_ACCTS,PRI_CURRENT_BALANCE,PRIMARY_INSTAL_AMT,NEW_ACCTS_IN_LAST_SIX_MONTHS,DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS,NO_OF_INQUIRIES,loan_default
0,46349,94896,50.58,103,1.0,7,1,0.0,0,0,0,0,0,0,0,0,0
1,68369,87736,78.87,2,1.0,4,1,0.0,0,0,0,0,0,0,0,0,0
2,58447,73842,81.93,251,1.0,13,0,0.0,0,0,0,0,0,0,0,0,1
3,49803,69035,73.88,34,1.0,6,1,0.0,0,0,0,0,0,0,0,0,0
4,55089,67131,87.89,42,2.0,3,1,4.0,9,3,1,1064938,17236,0,0,1,0


### Splitting the dataset

In [None]:
X = loan.drop('loan_default', axis=1)
y = loan.loan_default

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=1) 

---

## Understanding Decision Tree through plots

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
x_mini = x_train.iloc[:250, :4]

In [None]:
for ind, col in enumerate(x_mini.columns):
    print(ind, col)

In [None]:
dt = DecisionTreeClassifier()
# Use .iloc so the first 250 labels are selected by position, matching x_mini
dt.fit(x_mini, y_train.iloc[:250])

plt.figure(figsize=(20, 30))
plot_tree(dt, filled = True, fontsize=14)
plt.show()

In [None]:
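# Gini impurity of the root node: 210 of the 250 samples in one class, 40 in the other
p1 = 210/250
p2 = 40/250
gini = 1 - (p1**2 + p2**2)
np.round(gini, 3)

In [None]:
# Weighted Gini impurity of the two child nodes (64 and 186 samples)
0.144*64/250 + .306*186/250

In [None]:
# Decrease in Gini impurity from the split: parent impurity minus weighted child impurity
ig = 0.269 - 0.264
ig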

## \begin{equation*} Gini = 1- \sum p_i^2 \end{equation*}
* ### $p_{i}$ = Probability of occurrence of the $i$-th class
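
The Gini arithmetic in the cells above can be wrapped in two small helpers. This is a minimal sketch in plain Python, not the scikit-learn implementation. The class counts inside the two child nodes (59/5 and 151/35) are an assumption, chosen only because they reproduce the impurities 0.144 and 0.306 used above; the helper names `gini_impurity` and `weighted_gini` are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(children):
    """Average impurity of the child nodes, weighted by their sample counts."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini_impurity(child) for child in children)


# Root node of the mini example: 210 samples of class 0 and 40 of class 1.
root = [0] * 210 + [1] * 40

# Assumed class counts for the two children (64 and 186 samples).
left = [0] * 59 + [1] * 5
right = [0] * 151 + [1] * 35

root_gini = gini_impurity(root)
children_gini = weighted_gini([left, right])
gini_decrease = root_gini - children_gini

print("Root Gini:             ", round(root_gini, 3))
print("Weighted children Gini:", round(children_gini, 3))
print("Impurity decrease:     ", round(gini_decrease, 3))
```

The tree picks, at each node, the split with the largest impurity decrease, so a split that barely lowers the weighted Gini (as here) is only chosen when no better candidate exists.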

In [None]:
p1 = 210/250
p2 = 40/250
entropy = -p1*np.log2(p1) + -p2*np.log2(p2)
np.round(entropy, 3)

## \begin{equation*} Entropy = -\sum p_i \log_2 p_i \end{equation*}
* ### $p_{i}$ = Probability of occurrence of the $i$-th class
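
The entropy formula leads to information gain in the same way that Gini leads to an impurity decrease. The following is a rough sketch in plain Python; the child class counts are assumed for illustration and the helper names `entropy_of` and `information_gain` are not part of any library.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of(child) for child in children)
    return entropy_of(parent) - weighted


# Same root node as the cell above: 210 samples of class 0, 40 of class 1.
root = [0] * 210 + [1] * 40

# Assumed split into two children of 64 and 186 samples.
left = [0] * 59 + [1] * 5
right = [0] * 151 + [1] * 35

root_entropy = entropy_of(root)
gain = information_gain(root, [left, right])

print("Root entropy:    ", round(root_entropy, 3))
print("Information gain:", round(gain, 4))
```

Passing `criterion='entropy'` to `DecisionTreeClassifier` makes scikit-learn use this measure instead of the default Gini impurity.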

---

### Building the Model

In [None]:
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

print('Train Accuracy: ', dt.score(x_train, y_train))
print('Test Accuracy: ', dt.score(x_test, y_test))
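
The problem statement names F1 as the evaluation metric, and accuracy can look good on imbalanced data even when no defaulter is caught. The sketch below computes precision, recall and F1 by hand on made-up labels (84 non-defaults, 16 defaults, an assumed ratio that is not taken from the dataset). In the notebook itself, `sklearn.metrics.f1_score(y_test, dt.predict(x_test))` would give the same metric.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 84 non-defaults (0) and 16 defaults (1).
y_true = [0] * 84 + [1] * 16

# A "model" that never predicts a default.
always_zero = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_zero)

print("Accuracy:", accuracy)
print("F1:      ", f1)
```

A classifier that rejects nobody scores well on accuracy here but gets an F1 of zero, which is why the two metrics should be read together.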

In [None]:
dt = DecisionTreeClassifier(max_depth= 1)
dt.fit(x_train, y_train)

print('Train Accuracy: ', dt.score(x_train, y_train))
print('Test Accuracy: ', dt.score(x_test, y_test))

In [None]:
dt.feature_importances_

### Other Pruning measures...

In [None]:
DecisionTreeClassifier?
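
Besides `max_depth`, `DecisionTreeClassifier` exposes parameters such as `min_samples_leaf`, `min_samples_split`, `max_leaf_nodes`, `min_impurity_decrease` and `ccp_alpha` that limit how far a tree grows. The sketch below compares an unconstrained tree with a constrained one on synthetic data from `make_classification`, used here only so that the example runs on its own; the parameter values are arbitrary assumptions, not tuned settings for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fully grown tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

# Constrained tree: depth limit, minimum leaf size and cost-complexity pruning.
pruned_tree = DecisionTreeClassifier(
    max_depth=6, min_samples_leaf=20, ccp_alpha=0.001, random_state=1
).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(
        name,
        "| depth:", model.get_depth(),
        "| leaves:", model.get_n_leaves(),
        "| train acc:", round(model.score(X_tr, y_tr), 3),
        "| test acc:", round(model.score(X_te, y_te), 3),
    )
```

In practice these values would be chosen by cross-validation rather than fixed by hand.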

---

## Another way to create better models using Decision Tree: Ensembling

### Ensemble Models:
* #### Bagging: 
    * Generalises better to unseen data by reducing variance. Can underfit if the individual models are too simple
* #### Boosting: 
    * Fits the training set more closely by reducing bias. Prone to overfitting
* #### Although in practice Bagging and Boosting are more popular with decision trees, in theory they can be used with any other models as well
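
As a rough illustration of the two families, the sketch below fits one bagging ensemble and one boosting ensemble from scikit-learn on synthetic data. `BaggingClassifier` with its default base learner (a decision tree) and `GradientBoostingClassifier` are used as representatives; the data and hyperparameters are assumptions for illustration, and the scores say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so that the example is self-contained (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Bagging: independent trees on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Boosting: shallow trees added one after another, each fitting the remaining error.
boosting = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, random_state=1
).fit(X_tr, y_tr)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(
        name,
        "| train acc:", round(model.score(X_tr, y_tr), 3),
        "| test acc:", round(model.score(X_te, y_te), 3),
    )
```

Comparing the gap between train and test accuracy for each model gives a first impression of how each one trades bias against variance.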

---

### Bagging: Bootstrap AGGregation
* Bootstrap sampling -> sampling with replacement
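
Sampling with replacement means some rows appear several times in a bootstrap sample and others not at all. The rows left out are the "out-of-bag" rows that a Random Forest can use for its `oob_score_`. A small sketch with the standard library, assuming a toy dataset of 10,000 row indices:

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_rows = 10_000
row_indices = list(range(n_rows))

# Bootstrap sample: draw n_rows indices with replacement.
bootstrap_sample = rng.choices(row_indices, k=n_rows)

in_bag = set(bootstrap_sample)
out_of_bag = [i for i in row_indices if i not in in_bag]

unique_fraction = len(in_bag) / n_rows
oob_fraction = len(out_of_bag) / n_rows

print("Sample size:            ", len(bootstrap_sample))
print("Unique rows in sample:  ", round(unique_fraction, 3))
print("Out-of-bag rows (unused):", round(oob_fraction, 3))
```

For large samples the out-of-bag share approaches $1/e \approx 0.368$, so each tree in a bagged ensemble sees roughly 63% of the distinct rows.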

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
%%time
rf = RandomForestClassifier()
rf.fit(x_train, y_train)

print('Train Accuracy: ', rf.score(x_train, y_train))
print('Test Accuracy: ', rf.score(x_test, y_test))

In [None]:
RandomForestClassifier?

In [None]:
%%time
rf = RandomForestClassifier(n_estimators=10, max_features= 10, max_depth=3, oob_score=True)
rf.fit(x_train, y_train)

print('Train Accuracy: ', rf.score(x_train, y_train))
print('Test Accuracy: ', rf.score(x_test, y_test))

In [None]:
rf.oob_score_

---

![](https://media.giphy.com/media/3ohs4xsq0oEhqC4why/giphy.gif)