# 3 Pre-Processing and Training Data<a id='3_Pre-Processing_and_Training_Data'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Pre-Processing and Training Data](#3_Pre-Processing_and_Training_Data)
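  * [3.1 Contents](#3.1_Contents)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load Data](#3.4_Load_Data)
  * [3.5 One-Hot Encoding](#3.5_One-Hot_Encoding)
  * [3.6 XGBoost Classifier](#3.6_XGBoost_Classifier)
  * [3.7 Bernoulli Naive Bayes](#3.7_Bernoulli_Naive_Bayes)
  * [3.8 Linear Discriminant Analysis](#3.8_Linear_Discriminant_Analysis)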

## 3.3 Imports<a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.utils import resample
from xgboost import XGBClassifier
from lazypredict.Supervised import LazyClassifier
from imblearn.pipeline import Pipeline

pd.set_option('display.max_columns',50)

## 3.4 Load Data<a id='3.4_Load_Data'></a>

In [2]:
explored_data = pd.read_csv('../data/processed/explored_data.csv', index_col=0)
explored_data.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,repay_fail,annual_inc_log,revol_bal_log,years_of_credit
3,2500.0,36 months,13.98,85.42,4 years,RENT,20004.0,Not Verified,other,MI,19.86,0.0,2000-08-05,5.0,7.0,0.0,981.0,21.3,10.0,0,9.9,6.89,0
4,5000.0,36 months,15.95,175.67,4 years,RENT,59000.0,Not Verified,debt_consolidation,NY,19.57,0.0,1994-04-01,1.0,7.0,0.0,18773.0,99.9,15.0,1,10.99,9.84,6
5,7000.0,36 months,9.91,225.58,10+ years,MORTGAGE,53796.0,Not Verified,other,TX,10.8,3.0,1998-03-01,3.0,7.0,0.0,3269.0,47.2,20.0,0,10.89,8.09,2
6,2000.0,36 months,5.42,60.32,10+ years,RENT,30000.0,Not Verified,debt_consolidation,NY,3.6,0.0,1975-01-01,0.0,7.0,0.0,0.0,0.0,15.0,0,10.31,0.0,25
7,3600.0,36 months,10.25,116.59,10+ years,MORTGAGE,675048.0,Not Verified,other,AL,1.55,0.0,1998-04-01,4.0,8.0,0.0,0.0,0.0,25.0,0,13.42,0.0,2


## 3.5 One-Hot Encoding<a id='3.5_One-Hot_Encoding'></a>

In [3]:
desired_cat_feat = ['term', 'emp_length', 'home_ownership', 'verification_status', 'addr_state', 'purpose']
df_encoded = pd.get_dummies(explored_data, columns = desired_cat_feat, drop_first=True)

In [4]:
df_encoded.rename(columns={"emp_length_< 1 year": "emp_length_0_years"}, inplace=True)
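To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame. The `toy` data is invented for illustration; only the column names mirror the dataset. With two categories, one dummy column is enough, because the dropped category is implied when the remaining dummy is 0.

```python
import pandas as pd

# Hypothetical toy frame (not the real loan data)
toy = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, 7000.0],
    "home_ownership": ["RENT", "MORTGAGE", "RENT"],
})

# Categories are sorted alphabetically, so MORTGAGE is the first level and gets dropped
encoded = pd.get_dummies(toy, columns=["home_ownership"], drop_first=True)

print(list(encoded.columns))
print(encoded["home_ownership_RENT"].astype(int).tolist())
```

Dropping the first level avoids perfectly collinear dummy columns, which matters most for linear models such as LDA.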

In [5]:
X = df_encoded.drop(columns=['earliest_cr_line','repay_fail'])
y = df_encoded.repay_fail

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
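The `stratify=y` argument keeps the class ratio the same in both splits. The sketch below checks this on hypothetical labels with the same 85/15 imbalance as `repay_fail`; the sample size of 400 is an assumption chosen so the counts divide evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 340 repaid (0) and 60 failed (1), i.e. 15% positives
y_toy = np.array([0] * 340 + [1] * 60)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Both splits keep 15% positives
print(y_tr.mean(), y_te.mean())
```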

In [7]:
# clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
# models,predictions = clf.fit(X_train, X_test, y_train, y_test)
# models

In [8]:
# models.to_csv('../data/processed/lazypredict_models.csv')

In [9]:
models = pd.read_csv('../data/processed/lazypredict_models.csv')

In [10]:
models.sort_values(['F1 Score','Accuracy'], ascending=False)

Unnamed: 0,Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
5,BernoulliNB,0.84,0.52,0.52,0.79,0.14
9,LinearDiscriminantAnalysis,0.85,0.51,0.51,0.79,0.36
8,XGBClassifier,0.84,0.52,0.52,0.79,0.41
10,BaggingClassifier,0.84,0.51,0.51,0.79,6.68
12,AdaBoostClassifier,0.85,0.51,0.51,0.79,2.69
13,LGBMClassifier,0.85,0.51,0.51,0.79,0.39
14,LogisticRegression,0.85,0.51,0.51,0.78,0.31
11,KNeighborsClassifier,0.83,0.51,0.51,0.78,0.71
15,CalibratedClassifierCV,0.85,0.51,0.51,0.78,1.98
17,ExtraTreesClassifier,0.85,0.5,0.5,0.78,5.6


## 3.6 XGBoost Classifier<a id='3.6_XGBoost_Classifier'></a>

### 3.6.1 Initial Fit<a id='3.6.1_Initial_Fit'></a>

In [11]:
xgb_model = XGBClassifier(random_state=42)

xgb_model.fit(X_train, y_train)

# Predict on test data
xgb_predictions = xgb_model.predict(X_test)

# Get predicted probabilities for the positive class (needed for the AUC)
xgb_probs = xgb_model.predict_proba(X_test)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test, xgb_predictions,output_dict=True)).T)
print("XGBoost AUC: ", roc_auc_score(y_test, xgb_probs))

Unnamed: 0,precision,recall,f1-score,support
0,0.85,0.98,0.91,8152.0
1,0.39,0.06,0.1,1452.0
accuracy,0.84,0.84,0.84,0.84
macro avg,0.62,0.52,0.51,9604.0
weighted avg,0.78,0.84,0.79,9604.0


XGBoost AUC:  0.6911375835705615
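The AUC here (0.69) is well above the 0.52 shown for XGBClassifier in the comparison table. One plausible explanation, offered as an assumption about how that table was produced, is that its ROC AUC was computed from hard class predictions rather than probabilities: in the table, ROC AUC equals Balanced Accuracy in every row, which is exactly what happens for binary labels. A small sketch with invented labels and scores shows the difference.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities, invented for illustration
y_true = np.array([0, 0, 0, 0, 1, 1])
probs = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7])
hard_preds = (probs >= 0.5).astype(int)  # default 0.5 decision threshold

auc_from_probs = roc_auc_score(y_true, probs)        # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard_preds)  # collapses to one threshold
bal_acc = balanced_accuracy_score(y_true, hard_preds)

print(auc_from_probs, auc_from_labels, bal_acc)
```

Passing probabilities, as the cell above does, measures how well the model ranks loans regardless of the threshold.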


### 3.6.2 Oversampling<a id='3.6.2_Oversampling'></a>

In [12]:
df_encoded.repay_fail.value_counts()

0    32608
1     5807
Name: repay_fail, dtype: int64

In [13]:
df_encoded.repay_fail.value_counts(normalize=True)

0   0.85
1   0.15
Name: repay_fail, dtype: float64

Only 15% of the loans in our dataset are defaults (`repay_fail == 1`). This class imbalance likely explains why the initial model reaches 84% accuracy while recalling just 6% of the defaults.
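To see how much of the headline accuracy comes from the imbalance alone, the following sketch scores a hypothetical baseline that always predicts "no default". It uses only the class counts printed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class counts taken from the value_counts() output above
n_repaid, n_failed = 32608, 5807

y_true = np.concatenate([np.zeros(n_repaid, dtype=int), np.ones(n_failed, dtype=int)])
y_pred = np.zeros_like(y_true)  # baseline: always predict the majority class

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_balanced = balanced_accuracy_score(y_true, y_pred)

print(round(baseline_accuracy, 2), baseline_balanced)
```

A model with about 0.85 accuracy and about 0.5 balanced accuracy is therefore doing little better than this baseline on the minority class.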

In [14]:
# Split the data into majority (repaid) and minority (failed) classes
df_majority = df_encoded[df_encoded['repay_fail'] == 0]
df_minority = df_encoded[df_encoded['repay_fail'] == 1]

# Upsample the minority class, with replacement, to match the majority class size
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

df_upsampled = pd.concat([df_minority_upsampled, df_majority])

In [15]:
df_upsampled['repay_fail'].value_counts()

1    32608
0    32608
Name: repay_fail, dtype: int64

In [16]:
X_over = df_upsampled.drop(columns=['earliest_cr_line','repay_fail'])
y_over = df_upsampled.repay_fail

In [17]:
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(X_over, y_over, test_size=0.25, random_state=42)

In [18]:
xgb_over = XGBClassifier(random_state=42)

xgb_over.fit(X_train_o, y_train_o)

# Predict on test data
xgb_predictions_over = xgb_over.predict(X_test_o)

# Get predicted probabilities for the positive class (needed for the AUC)
xgb_probs_over = xgb_over.predict_proba(X_test_o)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test_o, xgb_predictions_over,output_dict=True)).T)
print("XGBoost AUC: ", roc_auc_score(y_test_o, xgb_probs_over))

Unnamed: 0,precision,recall,f1-score,support
0,0.83,0.75,0.79,8180.0
1,0.77,0.85,0.81,8124.0
accuracy,0.8,0.8,0.8,0.8
macro avg,0.8,0.8,0.8,16304.0
weighted avg,0.8,0.8,0.8,16304.0


XGBoost AUC:  0.8779626290661015


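On paper this is a large improvement: recall on defaults rises from 0.06 to 0.85 and the AUC from 0.69 to 0.88. These numbers should be read with caution, though. The minority class was upsampled before the train/test split, so copies of the same loan can land in both the training and the test set, and the test set itself is artificially balanced. The scores therefore likely overstate how well the model would perform on unseen loans.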
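Because `resample` draws with replacement, upsampling the whole dataset before calling `train_test_split` can place copies of the same loan in both splits. One way to avoid that, sketched here on a hypothetical toy frame rather than the real data, is to split first and upsample only the training portion, leaving the test set at its natural class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical stand-in for df_encoded: 85 repaid loans and 15 failures
toy = pd.DataFrame({
    "feature": range(100),
    "repay_fail": [0] * 85 + [1] * 15,
})

# 1) Split first, so the test rows are never seen during upsampling
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["repay_fail"]
)

# 2) Upsample the minority class inside the training split only
train_majority = train_df[train_df["repay_fail"] == 0]
train_minority = train_df[train_df["repay_fail"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up])

# 3) No original row appears in both the balanced training set and the test set
overlap = set(train_balanced.index) & set(test_df.index)
print(train_balanced["repay_fail"].value_counts().to_dict(), len(overlap))
```

Scores measured on `test_df` would then reflect the 85/15 mix the model faces in practice.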

### 3.6.3 Cross Validation<a id='3.6.3_Cross_Validation'></a>

In [19]:
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'min_child_weight': [1, 5],
    'gamma': [0, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 1.5],
}

xgb_over_2 = XGBClassifier(random_state=42)

# Initialize the GridSearchCV object; it clones and refits the estimator for every
# candidate, so no separate fit is needed beforehand
xgb_grid_search = GridSearchCV(estimator=xgb_over_2, 
                           param_grid=param_grid, 
                           scoring='roc_auc', 
                           n_jobs=-1, 
                           cv=3, 
                           verbose=1)

# Fit the grid search to the data
xgb_grid_search.fit(X_train_o, y_train_o)

# Print the best parameters found
print("Best parameters found: ", xgb_grid_search.best_params_)
print("Best AUC found: ", xgb_grid_search.best_score_)

Fitting 3 folds for each of 512 candidates, totalling 1536 fits
Best parameters found:  {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 200, 'reg_alpha': 0.1, 'reg_lambda': 1, 'subsample': 0.8}
Best AUC found:  0.8485894949894871
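The "512 candidates, totalling 1536 fits" line follows directly from the grid: nine hyperparameters with two values each give 2^9 combinations, and each is fitted once per fold. A quick way to check a grid's size before launching a long search is scikit-learn's `ParameterGrid`, as in this sketch.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'min_child_weight': [1, 5],
    'gamma': [0, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 1.5],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # cv=3 folds per candidate

print(n_candidates, n_fits)
```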


In [20]:
xgb_best_model = xgb_grid_search.best_estimator_
xgb_predictions_cv = xgb_best_model.predict(X_test_o)
xgb_probs_cv = xgb_best_model.predict_proba(X_test_o)[:,1]

display(pd.DataFrame(classification_report(y_test_o, xgb_predictions_cv,output_dict=True)).T)
print("XGBoost AUC: ", roc_auc_score(y_test_o, xgb_probs_cv))

Unnamed: 0,precision,recall,f1-score,support
0,0.8,0.73,0.76,8180.0
1,0.75,0.81,0.78,8124.0
accuracy,0.77,0.77,0.77,0.77
macro avg,0.77,0.77,0.77,16304.0
weighted avg,0.77,0.77,0.77,16304.0


XGBoost AUC:  0.8554005518377135


## 3.7 Bernoulli Naive Bayes<a id='3.7_Bernoulli_Naive_Bayes'></a>

### 3.7.1 Initial Fit<a id='3.7.1_Initial_Fit'></a>

In [21]:
bnb_model = BernoulliNB()

bnb_model.fit(X_train, y_train)

# Predict on test data
bnb_predictions = bnb_model.predict(X_test)

# Get predicted probabilities for the positive class (needed for the AUC)
bnb_probs = bnb_model.predict_proba(X_test)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test, bnb_predictions,output_dict=True)).T)
print("BernoulliNB AUC: ", roc_auc_score(y_test, bnb_probs))

Unnamed: 0,precision,recall,f1-score,support
0,0.85,1.0,0.92,8152.0
1,0.25,0.0,0.01,1452.0
accuracy,0.85,0.85,0.85,0.85
macro avg,0.55,0.5,0.46,9604.0
weighted avg,0.76,0.85,0.78,9604.0


BernoulliNB AUC:  0.6494256762693399


### 3.7.2 Oversampling<a id='3.7.2_Oversampling'></a>

In [22]:
bnb_model_over = BernoulliNB()

bnb_model_over.fit(X_train_o, y_train_o)

# Predict on test data
bnb_predictions_over = bnb_model_over.predict(X_test_o)

# Get predicted probabilities for the positive class (needed for the AUC)
bnb_probs_over = bnb_model_over.predict_proba(X_test_o)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test_o, bnb_predictions_over,output_dict=True)).T)
print("BernoulliNB AUC: ", roc_auc_score(y_test_o, bnb_probs_over))

Unnamed: 0,precision,recall,f1-score,support
0,0.61,0.63,0.62,8180.0
1,0.61,0.59,0.6,8124.0
accuracy,0.61,0.61,0.61,0.61
macro avg,0.61,0.61,0.61,16304.0
weighted avg,0.61,0.61,0.61,16304.0


BernoulliNB AUC:  0.6466807425010141


### 3.7.3 Cross Validation<a id='3.7.3_Cross_Validation'></a>

In [23]:
param_grid = {
    'alpha': [0.01, 0.1, 0.5, 1.0],
    'binarize': [0.0, 0.1, 0.2, 0.5]
}

bnb_over_2 = BernoulliNB()

# Initialize the GridSearchCV object; it clones and refits the estimator for every
# candidate, so no separate fit is needed beforehand
bnb_grid_search = GridSearchCV(estimator=bnb_over_2, 
                           param_grid=param_grid, 
                           scoring='roc_auc', 
                           n_jobs=-1, 
                           cv=3, 
                           verbose=1)

# Fit the grid search to the data
bnb_grid_search.fit(X_train_o, y_train_o)

# Print the best parameters found
print("Best parameters found: ", bnb_grid_search.best_params_)
print("Best AUC found: ", bnb_grid_search.best_score_)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters found:  {'alpha': 1.0, 'binarize': 0.5}
Best AUC found:  0.654617120944336


## 3.8 Linear Discriminant Analysis<a id='3.8_Linear_Discriminant_Analysis'></a>

### 3.8.1 Initial Fit<a id='3.8.1_Initial_Fit'></a>

In [25]:
lda_model = LinearDiscriminantAnalysis()

lda_model.fit(X_train, y_train)

# Predict on test data
lda_predictions = lda_model.predict(X_test)

# Get predicted probabilities for the positive class (needed for the AUC)
lda_probs = lda_model.predict_proba(X_test)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test, lda_predictions,output_dict=True)).T)
print("LinearDiscriminantAnalysis AUC: ", roc_auc_score(y_test, lda_probs))

Unnamed: 0,precision,recall,f1-score,support
0,0.85,0.99,0.92,8152.0
1,0.44,0.03,0.06,1452.0
accuracy,0.85,0.85,0.85,0.85
macro avg,0.65,0.51,0.49,9604.0
weighted avg,0.79,0.85,0.79,9604.0


LinearDiscriminantAnalysis AUC:  0.7135177157424906


### 3.8.2 Oversampling<a id='3.8.2_Oversampling'></a>

In [26]:
lda_over = LinearDiscriminantAnalysis()

lda_over.fit(X_train_o, y_train_o)

# Predict on test data
lda_predictions_over = lda_over.predict(X_test_o)

# Get predicted probabilities for the positive class (needed for the AUC)
lda_probs_over = lda_over.predict_proba(X_test_o)[:,1]

# Evaluate predictions
display(pd.DataFrame(classification_report(y_test_o, lda_predictions_over,output_dict=True)).T)
print("LinearDiscriminantAnalysis AUC: ", roc_auc_score(y_test_o, lda_probs_over))

Unnamed: 0,precision,recall,f1-score,support
0,0.65,0.65,0.65,8180.0
1,0.65,0.64,0.64,8124.0
accuracy,0.65,0.65,0.65,0.65
macro avg,0.65,0.65,0.65,16304.0
weighted avg,0.65,0.65,0.65,16304.0


LinearDiscriminantAnalysis AUC:  0.7047155549857406


### 3.8.3 Cross Validation<a id='3.8.3_Cross_Validation'></a>

In [27]:
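# Note: the 'svd' solver does not support shrinkage, so any 'svd' candidate with a
# non-None shrinkage fails to fit and is scored as NaN under GridSearchCV's default error_score
param_grid = {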
    'solver': ['svd', 'lsqr', 'eigen'],
    'shrinkage': [None, 'auto', 0.0, 0.5, 1.0],
    'tol': [0.0001, 0.0002, 0.0005]
}

lda_over_2 = LinearDiscriminantAnalysis()

# Initialize the GridSearchCV object; it clones and refits the estimator for every
# candidate, so no separate fit is needed beforehand
lda_grid_search = GridSearchCV(estimator=lda_over_2, 
                           param_grid=param_grid, 
                           scoring='roc_auc', 
                           n_jobs=-1, 
                           cv=3, 
                           verbose=1)

# Fit the grid search to the data
lda_grid_search.fit(X_train_o, y_train_o)

# Print the best parameters found
print("Best parameters found: ", lda_grid_search.best_params_)
print("Best AUC found: ", lda_grid_search.best_score_)

Fitting 3 folds for each of 45 candidates, totalling 135 fits
Best parameters found:  {'shrinkage': None, 'solver': 'svd', 'tol': 0.0001}
Best AUC found:  0.7097268859003437
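Some of the 45 candidates above cannot be fitted, because scikit-learn's LDA rejects shrinkage with the `svd` solver. A minimal sketch of one alternative is to pass `param_grid` as a list of dictionaries so that only valid combinations are generated. The variable names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ParameterGrid

tols = [0.0001, 0.0002, 0.0005]

# One sub-grid per solver family; 'svd' is only paired with shrinkage=None
valid_grid = [
    {"solver": ["svd"], "shrinkage": [None], "tol": tols},
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto", 0.0, 0.5, 1.0], "tol": tols},
]
n_valid = len(ParameterGrid(valid_grid))  # 3 + 30 valid candidates

# Confirm on tiny random data that svd + shrinkage is rejected at fit time
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
y_toy = np.array([0, 1] * 10)
try:
    LinearDiscriminantAnalysis(solver="svd", shrinkage="auto").fit(X_toy, y_toy)
    svd_rejects_shrinkage = False
except NotImplementedError:
    svd_rejects_shrinkage = True

print(n_valid, svd_rejects_shrinkage)
```

`valid_grid` can be passed to `GridSearchCV` in place of the single dictionary; the search then skips the combinations that would otherwise fail.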


In [28]:
lda_best_model = lda_grid_search.best_estimator_
lda_predictions_cv = lda_best_model.predict(X_test_o)
lda_probs_cv = lda_best_model.predict_proba(X_test_o)[:,1]

display(pd.DataFrame(classification_report(y_test_o, lda_predictions_cv,output_dict=True)).T)
print("LinearDiscriminantAnalysis AUC: ", roc_auc_score(y_test_o, lda_probs_cv))

Unnamed: 0,precision,recall,f1-score,support
0,0.65,0.65,0.65,8180.0
1,0.65,0.64,0.64,8124.0
accuracy,0.65,0.65,0.65,0.65
macro avg,0.65,0.65,0.65,16304.0
weighted avg,0.65,0.65,0.65,16304.0


LinearDiscriminantAnalysis AUC:  0.7047155549857406

