
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Steps involved in supervised learning
1. Loading the dataset.
2. Feature Description.
3. Verifying the missing values.
4. Preprocessing.
5. Feature Engineering.
6. Building the model.
7. Evaluate the performance of the model.
8. Verifying the feature importance.
9. Check the distribution of prediction probabilities.
10. Fine-tuning of hyperparameters.

# 1.Loading dataset 

In [None]:
data = pd.read_csv('/content/loan_dataset.csv')
data.head()

Unnamed: 0,loan_id,source,financial_institution,interest_rate,unpaid_principal_bal,loan_term,origination_date,first_payment_date,loan_to_value,number_of_borrowers,debt_to_income_ratio,borrower_credit_score,loan_purpose,insurance_percent,co-borrower_credit_score,insurance_type,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13
0,268055008619,Z,"Turner, Baldwin and Rhodes",4.25,214000,360,2012-03-01,05/2012,95,1.0,22.0,694.0,C86,30.0,0.0,0.0,0,0,0,0,0,0,1,0,0,0,0,0,1
1,672831657627,Y,"Swanson, Newton and Miller",4.875,144000,360,2012-01-01,03/2012,72,1.0,44.0,697.0,B12,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,742515242108,Z,Thornton-Davis,3.25,366000,180,2012-01-01,03/2012,49,1.0,33.0,780.0,B12,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,601385667462,X,OTHER,4.75,135000,360,2012-02-01,04/2012,46,2.0,44.0,633.0,B12,0.0,638.0,0.0,0,0,0,0,0,0,0,0,1,1,1,1,1
4,273870029961,X,OTHER,4.75,124000,360,2012-02-01,04/2012,80,1.0,43.0,681.0,C86,0.0,0.0,0.0,0,1,2,3,4,5,6,7,8,9,10,11,1


In [None]:
data.shape

(116058, 29)

# 2.Feature Description

The Loan Delinquency dataset has the following features:
1. loan_id - Unique ID
2. source - Loan origination channel
3. financial_institution - Bank name
4. interest_rate - Rate of the interest
5. unpaid_principal_bal - Unpaid principal balance of the loan
6. loan_term - Loan term in days
7. origination_date - Loan origination date (YYYY-MM-DD)
8. first_payment_date - First instalment payment date
9. loan_to_value - Loan to value ratio
10. number_of_borrowers - Number of borrowers
11. debt_to_income_ratio - Loan debt to income ratio
12. borrower_credit_score - Borrower credit score
13. loan_purpose - Purpose of the loan
14. insurance_percent - Loan amount percent covered by insurance
15. co-borrower_credit_score - Co-borrower credit score
16. insurance_type - 0: Premium paid by borrower, 1: Premium paid by lender.
17. m1 to m12 - Month-wise loan performance for months 1 to 12 (delinquency in months)
18. m13 - Loan delinquency status of the 13th month (the target). 0: non-delinquent, 1: delinquent

# 3.Missing values

In [None]:
data.isna().sum()

loan_id                     0
source                      0
financial_institution       0
interest_rate               0
unpaid_principal_bal        0
loan_term                   0
origination_date            0
first_payment_date          0
loan_to_value               0
number_of_borrowers         0
debt_to_income_ratio        0
borrower_credit_score       0
loan_purpose                0
insurance_percent           0
co-borrower_credit_score    0
insurance_type              0
m1                          0
m2                          0
m3                          0
m4                          0
m5                          0
m6                          0
m7                          0
m8                          0
m9                          0
m10                         0
m11                         0
m12                         0
m13                         0
dtype: int64

# 4.Preprocessing

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116058 entries, 0 to 116057
Data columns (total 29 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   loan_id                   116058 non-null  int64  
 1   source                    116058 non-null  object 
 2   financial_institution     116058 non-null  object 
 3   interest_rate             116058 non-null  float64
 4   unpaid_principal_bal      116058 non-null  int64  
 5   loan_term                 116058 non-null  int64  
 6   origination_date          116058 non-null  object 
 7   first_payment_date        116058 non-null  object 
 8   loan_to_value             116058 non-null  int64  
 9   number_of_borrowers       116058 non-null  float64
 10  debt_to_income_ratio      116058 non-null  float64
 11  borrower_credit_score     116058 non-null  float64
 12  loan_purpose              116058 non-null  object 
 13  insurance_percent         116058 non-null  f

In [None]:
data[['origination_date',
       'first_payment_date']]

Unnamed: 0,origination_date,first_payment_date
0,2012-03-01,05/2012
1,2012-01-01,03/2012
2,2012-01-01,03/2012
3,2012-02-01,04/2012
4,2012-02-01,04/2012
...,...,...
116053,2012-02-01,04/2012
116054,2012-01-01,03/2012
116055,2012-02-01,04/2012
116056,2012-02-01,04/2012


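Neither date column shows a useful pattern: in the rows shown, every loan originates in early 2012 and the first payment falls two months later. We can therefore drop both features.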
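
Before dropping the date columns, the "no pattern" observation could be checked numerically rather than by eye. The sketch below is only an illustration on a small hypothetical sample that mirrors the two date formats shown above (`YYYY-MM-DD` and `MM/YYYY`); the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the two date formats in the dataset.
dates = pd.DataFrame({
    "origination_date": ["2012-03-01", "2012-01-01", "2012-02-01"],
    "first_payment_date": ["05/2012", "03/2012", "04/2012"],
})

# Parse each column with its own explicit format.
orig = pd.to_datetime(dates["origination_date"], format="%Y-%m-%d")
first = pd.to_datetime(dates["first_payment_date"], format="%m/%Y")

# Whole months between origination and first payment.
gap_months = (first.dt.year - orig.dt.year) * 12 + (first.dt.month - orig.dt.month)

# If a single value dominates, the gap carries little information.
print(gap_months.value_counts())
```

Running the same three steps on the full `data` frame would show whether the gap is constant across all loans, which is the case where dropping the columns loses nothing.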

## Dropping the irrelevant features

In [None]:
data.columns

Index(['loan_id', 'source', 'financial_institution', 'interest_rate',
       'unpaid_principal_bal', 'loan_term', 'origination_date',
       'first_payment_date', 'loan_to_value', 'number_of_borrowers',
       'debt_to_income_ratio', 'borrower_credit_score', 'loan_purpose',
       'insurance_percent', 'co-borrower_credit_score', 'insurance_type', 'm1',
       'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8', 'm9', 'm10', 'm11', 'm12',
       'm13'],
      dtype='object')

In [None]:
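# Features: drop the identifier, the two date columns, the insurance columns, and the target (m13).
x = data.drop(['loan_id', 'origination_date', 'first_payment_date',
               'insurance_percent', 'insurance_type', 'm13'], axis=1)
# Target: keep m13 as a single-column DataFrame.
y = pd.DataFrame(data['m13'])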

In [None]:
y.head()

Unnamed: 0,m13
0,1
1,1
2,1
3,1
4,1


In [None]:
x.head()

Unnamed: 0,source,financial_institution,interest_rate,unpaid_principal_bal,loan_term,loan_to_value,number_of_borrowers,debt_to_income_ratio,borrower_credit_score,loan_purpose,co-borrower_credit_score,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12
0,Z,"Turner, Baldwin and Rhodes",4.25,214000,360,95,1.0,22.0,694.0,C86,0.0,0,0,0,0,0,0,1,0,0,0,0,0
1,Y,"Swanson, Newton and Miller",4.875,144000,360,72,1.0,44.0,697.0,B12,0.0,0,0,0,0,0,0,0,0,0,0,1,0
2,Z,Thornton-Davis,3.25,366000,180,49,1.0,33.0,780.0,B12,0.0,0,0,0,0,0,0,0,0,0,0,0,0
3,X,OTHER,4.75,135000,360,46,2.0,44.0,633.0,B12,638.0,0,0,0,0,0,0,0,0,1,1,1,1
4,X,OTHER,4.75,124000,360,80,1.0,43.0,681.0,C86,0.0,0,1,2,3,4,5,6,7,8,9,10,11


## Encoding

In [None]:
data['financial_institution'].value_counts()

OTHER                          49699
Browning-Hart                  31852
Swanson, Newton and Miller      6874
Edwards-Hoffman                 4857
Martinez, Duffy and Bird        4715
Miller, Mcclure and Allen       3158
Nicholson Group                 2116
Turner, Baldwin and Rhodes      1846
Suarez Inc                      1790
Cole, Brooks and Vincent        1642
Richards-Walters                1459
Taylor, Hunt and Rodriguez      1259
Sanchez-Robinson                1193
Sanchez, Hays and Wilkerson      853
Romero, Woods and Johnson        750
Thornton-Davis                   651
Anderson-Taylor                  483
Richardson Ltd                   473
Chapman-Mcmahon                  388
Name: financial_institution, dtype: int64

In [None]:
# Label-encode 'financial_institution': each bank name becomes an integer code.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

x['financial_institution'] = le.fit_transform(x['financial_institution'])
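
`LabelEncoder` assigns codes by sorted class name, so the integers (for example 8 for `OTHER` and 18 for `Turner, Baldwin and Rhodes`) carry no ordinal meaning. Tree-based models can often cope with this, but one possible alternative is frequency encoding, where each category is replaced by its share of the rows. The sketch below uses a tiny made-up series; `banks` and `banks_freq` are illustrative names only.

```python
import pandas as pd

# Tiny hypothetical sample of the 'financial_institution' column.
banks = pd.Series(
    ["OTHER", "OTHER", "Browning-Hart", "Thornton-Davis"],
    name="financial_institution",
)

# Share of rows belonging to each institution.
freq = banks.value_counts(normalize=True)

# Replace each name by its relative frequency.
banks_freq = banks.map(freq)
print(banks_freq.tolist())
```

Unlike label codes, these values have a natural interpretation (how common the lender is), although institutions with equal counts become indistinguishable.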

In [None]:
x.head()

Unnamed: 0,source,financial_institution,interest_rate,unpaid_principal_bal,loan_term,loan_to_value,number_of_borrowers,debt_to_income_ratio,borrower_credit_score,loan_purpose,co-borrower_credit_score,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12
0,Z,18,4.25,214000,360,95,1.0,22.0,694.0,C86,0.0,0,0,0,0,0,0,1,0,0,0,0,0
1,Y,15,4.875,144000,360,72,1.0,44.0,697.0,B12,0.0,0,0,0,0,0,0,0,0,0,0,1,0
2,Z,17,3.25,366000,180,49,1.0,33.0,780.0,B12,0.0,0,0,0,0,0,0,0,0,0,0,0,0
3,X,8,4.75,135000,360,46,2.0,44.0,633.0,B12,638.0,0,0,0,0,0,0,0,0,1,1,1,1
4,X,8,4.75,124000,360,80,1.0,43.0,681.0,C86,0.0,0,1,2,3,4,5,6,7,8,9,10,11


In [None]:
x['source'].value_counts()

X    63858
Y    37554
Z    14646
Name: source, dtype: int64

In [None]:
x['loan_purpose'].value_counts()

A23    58462
B12    29383
C86    28213
Name: loan_purpose, dtype: int64

In [None]:
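# One-hot encoding 'source' and 'loan_purpose'
# dtype=int keeps the 0/1 output shown below (recent pandas versions default to booleans).

source_en = pd.get_dummies(x['source'], dtype=int)
loanpur_en = pd.get_dummies(x['loan_purpose'], dtype=int)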

In [None]:
# concatenating the encoded data to the feature dataset x

x = pd.concat([x, source_en, loanpur_en], axis=1)

In [None]:
# dropping the already existing source and loan_purpose object type columns.

x = x.drop(['source', 'loan_purpose'], axis=1)
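
The three steps above (encode, concatenate, drop) can also be sketched as a single `pd.get_dummies` call with the `columns` argument. This is an alternative shown on a small hypothetical frame, not the notebook's own code; note that it prefixes the new columns with the source column name, so the resulting names differ from the bare `X`, `Y`, `Z`, `A23`, `B12`, `C86` used in the rest of the notebook.

```python
import pandas as pd

# Small hypothetical frame with the two categorical columns and one numeric column.
sample = pd.DataFrame({
    "source": ["Z", "Y", "X"],
    "loan_purpose": ["C86", "B12", "A23"],
    "loan_term": [360, 360, 180],
})

# Encode both columns in one call; the originals are removed automatically.
encoded = pd.get_dummies(sample, columns=["source", "loan_purpose"], dtype=int)
print(encoded.columns.tolist())
```

The prefixed names make it easier to tell which original feature a dummy column came from when reading the feature importances later.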

In [None]:
x.head()

Unnamed: 0,financial_institution,interest_rate,unpaid_principal_bal,loan_term,loan_to_value,number_of_borrowers,debt_to_income_ratio,borrower_credit_score,co-borrower_credit_score,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,X,Y,Z,A23,B12,C86
0,18,4.25,214000,360,95,1.0,22.0,694.0,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
1,15,4.875,144000,360,72,1.0,44.0,697.0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0
2,17,3.25,366000,180,49,1.0,33.0,780.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,8,4.75,135000,360,46,2.0,44.0,633.0,638.0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,0
4,8,4.75,124000,360,80,1.0,43.0,681.0,0.0,0,1,2,3,4,5,6,7,8,9,10,11,1,0,0,0,0,1


In [None]:
x.shape

(116058, 27)

# 5.Feature Engineering

Creating new features from the existing features.

In [None]:
x.columns

Index(['financial_institution', 'interest_rate', 'unpaid_principal_bal',
       'loan_term', 'loan_to_value', 'number_of_borrowers',
       'debt_to_income_ratio', 'borrower_credit_score',
       'co-borrower_credit_score', 'm1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7',
       'm8', 'm9', 'm10', 'm11', 'm12', 'X', 'Y', 'Z', 'A23', 'B12', 'C86'],
      dtype='object')

In [None]:
x['credit_score'] = x['borrower_credit_score']+x['co-borrower_credit_score']

In [None]:
x = x.drop(['borrower_credit_score', 'co-borrower_credit_score'], axis=1)

In [None]:
x.columns

Index(['financial_institution', 'interest_rate', 'unpaid_principal_bal',
       'loan_term', 'loan_to_value', 'number_of_borrowers',
       'debt_to_income_ratio', 'm1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12', 'X', 'Y', 'Z', 'A23', 'B12', 'C86',
       'credit_score'],
      dtype='object')

In [None]:
x['mean'] = x[['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12']].mean(axis=1)

In [None]:
x['sum'] = x[['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12']].sum(axis=1)

In [None]:
x['skew'] = x[['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12']].skew(axis=1)
# Example: a borrower who pays the EMI in the early months but stops paying in the later months produces a skewed row.

In [None]:
x['kurtosis'] = x[['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12']].kurt(axis=1)
# Example: a borrower who pays the EMI regularly at first, stops in the middle months, and resumes towards the end changes the kurtosis of the row.
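
To see what the four row-wise statistics look like, here is a minimal sketch on three made-up borrowers. The frame `toy` and its row labels are hypothetical; only the `m1` to `m12` column names come from the dataset.

```python
import pandas as pd

months = [f"m{i}" for i in range(1, 13)]

# Three hypothetical payment histories.
toy = pd.DataFrame(
    [
        [0] * 12,                # never delinquent
        [0] * 8 + [1, 1, 1, 1],  # delinquent only in the last four months
        list(range(12)),         # one more month behind every month
    ],
    columns=months,
    index=["clean", "late", "steady"],
)

# Same row-wise statistics as the engineered features.
stats = pd.DataFrame({
    "mean": toy.mean(axis=1),
    "sum": toy.sum(axis=1),
    "skew": toy.skew(axis=1),
    "kurtosis": toy.kurt(axis=1),
})
print(stats)
```

Two things are worth keeping in mind when reading these features. First, `sum` is always 12 times `mean`, so the two carry the same information. Second, skewness and kurtosis describe the distribution of the twelve values and ignore their order, so they do not by themselves distinguish early from late delinquency.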

In [None]:
x.columns

Index(['financial_institution', 'interest_rate', 'unpaid_principal_bal',
       'loan_term', 'loan_to_value', 'number_of_borrowers',
       'debt_to_income_ratio', 'm1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8',
       'm9', 'm10', 'm11', 'm12', 'X', 'Y', 'Z', 'A23', 'B12', 'C86',
       'credit_score', 'mean', 'sum', 'skew', 'kurtosis'],
      dtype='object')

# 6.Building the Model.

In [None]:
# splitting the dataset into train and test.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
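
The delinquent class is rare (the confusion matrix in section 7 shows 199 delinquent loans among 34,818 test rows), so a purely random split can leave the train and test sets with slightly different class ratios. A possible refinement is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data; `X_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 50 positives out of 1000 rows (5%).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to adding `stratify=y` to the existing call, which would change the split and therefore the scores reported below.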

In [None]:
x_train.shape

(81240, 30)

In [None]:
y_train.shape

(81240, 1)

In [None]:
x_test.shape

(34818, 30)

In [None]:
y_test.shape

(34818, 1)

In [None]:
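from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
# y_train is a single-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning.
rf.fit(x_train, y_train.values.ravel())
y_pred = rf.predict(x_test)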


# 7.Evaluating the performance of the model.

In [None]:
from sklearn.metrics import f1_score, confusion_matrix 
f1_score(y_test, y_pred)

0.48299319727891155

In [None]:
confusion_matrix(y_test, y_pred)

array([[34595,    24],
       [  128,    71]])
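
The F1 score reported above can be reproduced by hand from this confusion matrix, which also shows where the model is weak. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`; the short calculation below uses only the four numbers printed above.

```python
# Values read from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 34595, 24
fn, tp = 128, 71

precision = tp / (tp + fp)  # share of flagged loans that are truly delinquent
recall = tp / (tp + fn)     # share of delinquent loans that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.7474 0.3568 0.483
```

Precision is reasonable, but recall is low: the model misses 128 of the 199 delinquent loans, which is what motivates lowering the decision threshold in section 9.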

# 8.Feature Importance

In [None]:
pd.Series(rf.feature_importances_, index = x.columns).sort_values(ascending=False)*100

credit_score             11.784754
unpaid_principal_bal     11.655292
m12                      11.249575
debt_to_income_ratio      8.270859
loan_to_value             7.600510
interest_rate             7.018024
sum                       5.254407
m11                       4.615043
mean                      4.444065
financial_institution     3.881248
kurtosis                  3.760589
m9                        3.228914
skew                      2.761229
m10                       2.042728
loan_term                 1.783472
m8                        1.073373
m7                        0.970495
Y                         0.950610
m5                        0.950443
X                         0.927329
B12                       0.862614
number_of_borrowers       0.807951
C86                       0.802822
A23                       0.798408
m4                        0.676810
Z                         0.595286
m6                        0.384430
m1                        0.355795
m2                  
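
The values above are impurity-based importances, which are computed on the training data and can favour features with many distinct values. As a cross-check, permutation importance on held-out data is one option. The sketch below runs on synthetic data from `make_classification`; the data, model settings and feature names are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out set and measure the drop in F1.
result = permutation_importance(
    model, X_te, y_te, scoring="f1", n_repeats=5, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

With the notebook's objects, the call would take `rf`, `x_test` and `y_test` instead of the synthetic stand-ins.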

# 9.Distribution of prediction probabilities

In [None]:
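# Lower the decision threshold from the default 0.5 so that more loans are flagged as delinquent.
threshold = 0.22
# Predicted probability of the positive (delinquent) class.
y_pred_prob = rf.predict_proba(x_test)[:,1]
y_pred = (y_pred_prob > threshold).astype(int)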

In [None]:
f1_score(y_test, y_pred)

0.5221932114882506
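
The notebook does not show how the value 0.22 was chosen. One way such a threshold could be found is to scan a grid of candidates and keep the one with the highest F1, ideally on a separate validation split rather than on the test set, since tuning on the test set makes the reported score optimistic. The helper `best_threshold` and the toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, proba, grid):
    """Return (threshold, f1) for the grid value with the highest F1."""
    scores = [
        f1_score(y_true, (proba > t).astype(int), zero_division=0)
        for t in grid
    ]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# Toy validation labels and predicted probabilities of the positive class.
y_val = np.array([0, 0, 1, 1])
proba_val = np.array([0.10, 0.28, 0.37, 0.80])

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05.
grid = np.round(np.linspace(0.05, 0.95, 19), 2)

t, score = best_threshold(y_val, proba_val, grid)
print(t, score)
```

In this toy case the default 0.5 would miss the positive scored at 0.37, while a lower threshold separates the classes perfectly.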

# 10.Fine-tuning of Hyperparameters

In [None]:
rft = RandomForestClassifier(n_estimators = 500, max_depth = 10, random_state = 42, criterion = 'entropy')

In [None]:
# Flatten the single-column target to 1-D to avoid a DataConversionWarning.
rft.fit(x_train, y_train.values.ravel())


RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=500,
                       random_state=42)

In [None]:
y_pred = rft.predict(x_test)
f1_score(y_test, y_pred)

0.46689895470383275
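
The single hand-picked configuration above scores lower than the default forest, which suggests that a systematic search may be more useful than one manual guess. A minimal sketch with `GridSearchCV`, scored on F1, is shown below on synthetic data; the grid values and the data are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for x_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # same metric the notebook reports
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because the search uses cross-validation on the training data, the test set stays untouched until the final evaluation.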

# Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
# Flatten the single-column target to 1-D to avoid a DataConversionWarning.
gb.fit(x_train, y_train.values.ravel())
y_pred = gb.predict(x_test)


In [None]:
f1_score(y_test, y_pred)

0.48135593220338985
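
`GradientBoostingClassifier` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`, which is one way to make the rare delinquent class count for more during training. The sketch below is an assumption-laden illustration on synthetic data; whether it improves F1 on the loan data would need to be checked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data with roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.95, 0.05], random_state=0
)

# 'balanced' gives each class the same total weight.
weights = compute_sample_weight(class_weight="balanced", y=y_demo)

gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb_weighted.fit(X_demo, y_demo, sample_weight=weights)

print(np.unique(weights))
```

Weighting usually trades precision for recall, so it is best compared against the unweighted model using the same F1 metric.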

# Extreme Gradient Boosting

In [None]:
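# Requires the xgboost package; install it first if it is not available in your Jupyter environment.
from xgboost import XGBClassifier
xgb = XGBClassifier()
# Flatten the single-column target to 1-D to avoid a DataConversionWarning.
xgb.fit(x_train, y_train.values.ravel())
y_pred = xgb.predict(x_test)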


In [None]:
f1_score(y_test, y_pred)

0.4755244755244755