# 3 Pre-Processing and Training Data<a id='3_Pre-Processing_and_Training_Data'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Pre-Processing and Training Data](#3_Pre-Processing_and_Training_Data)
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Introduction](#3.2_Introduction)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load Data](#3.4_Load_Data)
  * [3.5 One-Hot Encoding](#3.5_One-Hot_Encoding)
  * [3.6 Train-Test Split](#3.6_Train-Test_Split)
  * [3.7 Superficial Modeling](#3.7_Superficial_Modeling)
  * [3.8 References](#3.8_References)

## 3.2 Introduction<a id='3.2_Introduction'></a>

We will now prepare the data for modeling by creating the necessary dummy variables and splitting the data into training and test sets.

## 3.3 Imports<a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

sns.set()
pd.set_option('display.max_columns',50)

First, we import the appropriate libraries.

## 3.4 Load Data<a id='3.4_Load_Data'></a>

In [2]:
explored_data = pd.read_csv('../data/interim/explored_data.csv', index_col=0)
explored_data.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,repay_fail,annual_inc_log,revol_bal_log,years_of_credit
3,2500.0,36,13.98,85.42,4,RENT,20004.0,Not Verified,other,MI,19.86,0.0,2000-08-05,5.0,7.0,0.0,981.0,21.3,10.0,0,9.9,6.89,0
4,5000.0,36,15.95,175.67,4,RENT,59000.0,Not Verified,debt_consolidation,NY,19.57,0.0,1994-04-01,1.0,7.0,0.0,18773.0,99.9,15.0,1,10.99,9.84,6
5,7000.0,36,9.91,225.58,10,MORTGAGE,53796.0,Not Verified,other,TX,10.8,3.0,1998-03-01,3.0,7.0,0.0,3269.0,47.2,20.0,0,10.89,8.09,2
6,2000.0,36,5.42,60.32,10,RENT,30000.0,Not Verified,debt_consolidation,NY,3.6,0.0,1975-01-01,0.0,7.0,0.0,0.0,0.0,15.0,0,10.31,0.0,25
7,3600.0,36,10.25,116.59,10,MORTGAGE,675048.0,Not Verified,other,AL,1.55,0.0,1998-04-01,4.0,8.0,0.0,0.0,0.0,25.0,0,13.42,0.0,2


Here we have loaded the data again and previewed the first rows with the `.head()` method. The data loaded successfully, including the features we created earlier, such as `years_of_credit`.

## 3.5 One-Hot Encoding<a id='3.5_One-Hot_Encoding'></a>

An important step is converting the desired categorical features into dummy variables. We will use pandas' `get_dummies()` function to achieve this.

In [3]:
desired_cat_feat = ['home_ownership', 'verification_status', 'purpose']
df_encoded = pd.get_dummies(explored_data, columns = desired_cat_feat, drop_first=True)

The desired categorical features are all of them except `earliest_cr_line` and `addr_state`. The feature `earliest_cr_line` will be removed since it is already taken into account via `years_of_credit`. The feature `addr_state` will be removed to avoid the sparse dataset that one-hot encoding the 50 states would produce. If the predictive power of the model is sufficiently high without `addr_state`, then we will continue without it.
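
To illustrate what `drop_first=True` does, here is a minimal sketch on a toy DataFrame. The values are made up for illustration; only the column names mirror the notebook. With `drop_first=True`, a feature with k categories yields k - 1 indicator columns, and the dropped category becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, 7000.0],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# drop_first=True drops the first category in sorted order (here MORTGAGE),
# so three categories become two indicator columns.
encoded = pd.get_dummies(toy, columns=["home_ownership"], drop_first=True)

print(list(encoded.columns))
# ['loan_amnt', 'home_ownership_OWN', 'home_ownership_RENT']
```

A row with zeros in both indicator columns therefore corresponds to `MORTGAGE`. The same logic shows why `addr_state` would add roughly one column per state.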

## 3.6 Train-Test Split<a id='3.6_Train-Test_Split'></a>

In [4]:
X = df_encoded.drop(columns=['addr_state','earliest_cr_line','repay_fail'])
y = df_encoded.repay_fail

When building the feature matrix `X`, we drop `addr_state` and `earliest_cr_line`, since they are still included in the encoded DataFrame, along with the target `repay_fail`, which becomes `y`.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

We stratify the split so that the training and test sets contain the same proportion of defaulted loans.
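
The effect of `stratify=y` can be checked on synthetic labels. The sketch below assumes a 15% positive rate, which is roughly the rate reported for the training set in the LightGBM log further down; the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 15% positives (an assumption for illustration).
y = np.array([1] * 300 + [0] * 1700)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Both means should sit very close to the overall positive rate.
print(y.mean(), y_train.mean(), y_test.mean())
```

Without `stratify`, the two proportions can drift apart by chance, which matters more when the positive class is small.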

## 3.7 Superficial Modeling<a id='3.7_Superficial_Modeling'></a>

A useful tool for deciding which models to explore further is LazyPredict. LazyPredict fits a range of classifiers with default settings and returns a DataFrame that summarizes their scores for comparison.

In [6]:
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None, random_state=42)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

100%|██████████| 29/29 [03:06<00:00,  6.43s/it]

[LightGBM] [Info] Number of positive: 4355, number of negative: 24456
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001724 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2537
[LightGBM] [Info] Number of data points in the train set: 28811, number of used features: 33
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.151158 -> initscore=-1.725551
[LightGBM] [Info] Start training from score -1.725551





Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
NearestCentroid,0.64,0.63,0.63,0.69,0.05
GaussianNB,0.78,0.57,0.57,0.78,0.05
QuadraticDiscriminantAnalysis,0.76,0.56,0.56,0.77,0.07
DecisionTreeClassifier,0.75,0.54,0.54,0.76,0.75
Perceptron,0.8,0.54,0.54,0.78,0.08
PassiveAggressiveClassifier,0.74,0.53,0.53,0.75,0.08
ExtraTreeClassifier,0.75,0.53,0.53,0.75,0.08
LabelSpreading,0.76,0.53,0.53,0.76,53.79
LabelPropagation,0.76,0.53,0.53,0.76,49.49
XGBClassifier,0.84,0.52,0.52,0.79,0.22


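The table above shows how long each model took to fit alongside its accuracy and other scores, which helps identify models that are both inexpensive and competitive. Note that balanced accuracy and ROC AUC stay close to 0.5 for most models, so the high raw accuracy largely reflects the roughly 85% share of loans without a repayment failure.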
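
To see why accuracy alone can look strong on imbalanced data, consider a hypothetical classifier that always predicts the majority class. The sketch below uses made-up labels with a 15% positive rate (close to the rate in the LightGBM log) and plain-Python metric helpers written for illustration; they are not LazyPredict's implementation.

```python
# Hypothetical test labels: 15 failures, 85 non-failures.
y_true = [1] * 15 + [0] * 85
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


print(accuracy(y_true, y_pred))           # 0.85
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Scores near 0.85 accuracy and 0.5 balanced accuracy are therefore reachable without learning anything about the minority class.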

In [7]:
models.sort_values(['F1 Score','Accuracy'], ascending=False)

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,0.84,0.52,0.52,0.79,0.22
BernoulliNB,0.84,0.52,0.52,0.79,0.06
LinearDiscriminantAnalysis,0.85,0.51,0.51,0.79,0.14
BaggingClassifier,0.84,0.51,0.51,0.79,5.02
KNeighborsClassifier,0.83,0.52,0.52,0.79,0.73
LGBMClassifier,0.85,0.51,0.51,0.78,0.22
CalibratedClassifierCV,0.85,0.51,0.51,0.78,0.4
AdaBoostClassifier,0.85,0.51,0.51,0.78,2.02
ExtraTreesClassifier,0.85,0.51,0.51,0.78,3.8
LogisticRegression,0.85,0.51,0.51,0.78,0.08


We have sorted by F1 score, using accuracy to break ties, to surface the strongest candidates for our classification problem.
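
A small sketch of how a two-key sort interacts with rounded display. It assumes the table shows scores rounded to two decimals while sorting uses the full-precision values; the model names and scores below are hypothetical.

```python
import pandas as pd

# Hypothetical unrounded scores; model names are placeholders.
scores = pd.DataFrame(
    {"Accuracy": [0.851, 0.842, 0.848], "F1 Score": [0.7882, 0.7913, 0.7849]},
    index=pd.Index(["model_a", "model_b", "model_c"], name="Model"),
)

# Sort by F1 first; accuracy only matters when F1 values are exactly equal.
ranked = scores.sort_values(["F1 Score", "Accuracy"], ascending=False)

# Rounding for display can make distinct F1 values look tied.
print(ranked.round(2))
```

Here `model_b` ranks above `model_a` even though both display an F1 of 0.79 and `model_a` has the higher accuracy, because the underlying F1 values differ.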

In [8]:
models.to_csv('../data/processed/lazypredict_models.csv')

We save the summary so that it can be reviewed in the modeling portion.

In [9]:
df_encoded.to_csv('../data/processed/df_encoded.csv')
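
Because `to_csv()` writes the index as the first column by default, the file should be read back with `index_col=0`, as was done when loading the data in section 3.4. The following sketch shows the round trip on a toy DataFrame, using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Toy frame with a non-default index, as in the loaded data.
df = pd.DataFrame({"loan_amnt": [2500.0, 5000.0]}, index=[3, 4])

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first, unnamed column

buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)  # index restored

buffer.seek(0)
without_index = pd.read_csv(buffer)  # index becomes a regular column

print(list(with_index.columns))     # ['loan_amnt']
print(list(without_index.columns))  # ['Unnamed: 0', 'loan_amnt']
```

Reading without `index_col=0` would leave an extra `Unnamed: 0` column that could leak into the feature matrix.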

## 3.8 References<a id='3.8_References'></a>

1. Shankar Rao Pandala - LazyPredict Documentation (https://lazypredict.readthedocs.io/en/latest/)