# 3 Pre-Processing and Training Data<a id='3_Pre-Processing_and_Training_Data'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Pre-Processing and Training Data](#3_Pre-Processing_and_Training_Data)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load Data](#3.4_Load_Data)
  * [3.5 One-Hot Encoding](#3.5_One-Hot_Encoding)
  * [3.6 Logistic Regression](#3.6_Logistic_Regression)
  * [3.7 Random Forest](#3.7_Random_Forest)

## 3.3 Imports<a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from lazypredict.Supervised import LazyClassifier
import datetime

pd.set_option('display.max_columns',50)

## 3.4 Load Data<a id='3.4_Load_Data'></a>

In [2]:
explored_data = pd.read_csv('../data/processed/explored_data.csv', index_col=0)
explored_data.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,repay_fail,annual_inc_log,revol_bal_log,years_of_credit
3,2500.0,36 months,13.98,85.42,4 years,RENT,20004.0,Not Verified,other,MI,19.86,0.0,2000-08-05,5.0,7.0,0.0,981.0,21.3,10.0,0,9.9,6.89,0
4,5000.0,36 months,15.95,175.67,4 years,RENT,59000.0,Not Verified,debt_consolidation,NY,19.57,0.0,1994-04-01,1.0,7.0,0.0,18773.0,99.9,15.0,1,10.99,9.84,6
5,7000.0,36 months,9.91,225.58,10+ years,MORTGAGE,53796.0,Not Verified,other,TX,10.8,3.0,1998-03-01,3.0,7.0,0.0,3269.0,47.2,20.0,0,10.89,8.09,2
6,2000.0,36 months,5.42,60.32,10+ years,RENT,30000.0,Not Verified,debt_consolidation,NY,3.6,0.0,1975-01-01,0.0,7.0,0.0,0.0,0.0,15.0,0,10.31,0.0,25
7,3600.0,36 months,10.25,116.59,10+ years,MORTGAGE,675048.0,Not Verified,other,AL,1.55,0.0,1998-04-01,4.0,8.0,0.0,0.0,0.0,25.0,0,13.42,0.0,2


## 3.5 One-Hot Encoding<a id='3.5_One-Hot_Encoding'></a>

In [3]:
desired_cat_feat = ['term', 'emp_length', 'home_ownership', 'verification_status']
df_encoded = pd.get_dummies(explored_data, columns = desired_cat_feat, drop_first=True)
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38415 entries, 3 to 38480
Data columns (total 35 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   loan_amnt                            38415 non-null  float64
 1   int_rate                             38415 non-null  float64
 2   installment                          38415 non-null  float64
 3   annual_inc                           38415 non-null  float64
 4   purpose                              38415 non-null  object 
 5   addr_state                           38415 non-null  object 
 6   dti                                  38415 non-null  float64
 7   delinq_2yrs                          38415 non-null  float64
 8   earliest_cr_line                     38415 non-null  object 
 9   inq_last_6mths                       38415 non-null  float64
 10  open_acc                             38415 non-null  float64
 11  pub_rec                     
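
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three rows and their values are made up for illustration; only the column names `loan_amnt`, `term` and `home_ownership` come from the data above. Each encoded column yields one indicator per category, minus the first category in sorted order, which becomes the implicit baseline.

```python
import pandas as pd

# Toy frame (illustrative values, not rows from explored_data)
toy = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, 7000.0],
    "term": ["36 months", "60 months", "36 months"],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# drop_first=True drops the first category (sorted order) of each encoded column
toy_encoded = pd.get_dummies(toy, columns=["term", "home_ownership"], drop_first=True)
encoded_columns = toy_encoded.columns.tolist()
print(encoded_columns)
```

Here `36 months` and `MORTGAGE` are the dropped baselines, so a row with all indicators set to zero represents a 36-month loan held by a mortgage holder.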

In [4]:
X = df_encoded.drop(columns=['addr_state','purpose','earliest_cr_line','repay_fail'])
y = df_encoded.repay_fail

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)
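
The `stratify=y` argument matters because `repay_fail` is imbalanced. The sketch below uses synthetic labels with a roughly 15% positive rate (an assumption chosen to resemble the data, not the data itself) to check that a stratified split keeps the positive rate nearly identical in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 15% positives (assumed, for illustration only)
rng = np.random.default_rng(47)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=47, stratify=y_demo
)

# Both partitions should keep a positive rate close to 0.15
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```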

In [6]:
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

100%|██████████| 29/29 [03:37<00:00,  7.51s/it]

[LightGBM] [Info] Number of positive: 4355, number of negative: 24456
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2521
[LightGBM] [Info] Number of data points in the train set: 28811, number of used features: 31
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.151158 -> initscore=-1.725551
[LightGBM] [Info] Start training from score -1.725551





Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
NearestCentroid,0.64,0.63,0.63,0.69,0.04
PassiveAggressiveClassifier,0.77,0.55,0.55,0.77,0.08
GaussianNB,0.81,0.55,0.55,0.79,0.05
Perceptron,0.8,0.55,0.55,0.78,0.07
QuadraticDiscriminantAnalysis,0.81,0.54,0.54,0.79,0.07
LabelSpreading,0.77,0.53,0.53,0.77,58.0
LabelPropagation,0.77,0.53,0.53,0.77,53.62
ExtraTreeClassifier,0.75,0.53,0.53,0.76,0.08
DecisionTreeClassifier,0.74,0.53,0.53,0.75,0.71
KNeighborsClassifier,0.84,0.52,0.52,0.79,0.55
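
The gap between Accuracy and Balanced Accuracy in the table follows from the class imbalance reported in the LightGBM log (4355 positives against 24456 negatives, about 15% positive). As a small sketch on made-up labels with the same approximate ratio, a model that always predicts "no failure" scores high on accuracy while recalling no failures at all, which is one plausible reason to score on recall rather than accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: 15 failures out of 100 loans (illustrative ratio)
y_true = np.array([1] * 15 + [0] * 85)
# A trivial model that never predicts a failure
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)           # share of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
rec = recall_score(y_true, y_pred)             # recall on the positive class
print(acc, bal, rec)
```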


## 3.6 Logistic Regression<a id='3.6_Logistic_Regression'></a>

In [None]:
# # Build the steps
# steps = [("scaler", StandardScaler()),
#          ("logreg", LogisticRegression())]
# pipeline = Pipeline(steps)

# # Create the parameter space
# parameters = {"logreg__C": np.linspace(0.001, 1.0, 20),
#              "logreg__solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag']}

# # Instantiate the grid search object
# cv_log = GridSearchCV(pipeline, param_grid=parameters, scoring='recall')

# # Fit to the training data
# cv_log.fit(X_train, y_train)
# print(cv_log.best_score_, "\n", cv_log.best_params_)
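
Since the search above is commented out, the following is a small runnable sketch of the same pattern (scale, fit a logistic regression, grid-search on recall). It runs on synthetic data from `make_classification`. The data, the reduced grid, `max_iter=1000` and the name `cv_log_demo` are assumptions made to keep it fast, not the notebook's settings or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 15% positives)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=47
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=47, stratify=y_syn
)

# Scaling lives inside the pipeline so it is refit on each CV training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Reduced grid (assumed) so the sketch finishes in seconds
parameters = {
    "logreg__C": np.linspace(0.001, 1.0, 5),
    "logreg__solver": ["lbfgs", "liblinear"],
}

cv_log_demo = GridSearchCV(pipeline, param_grid=parameters, scoring="recall", cv=5)
cv_log_demo.fit(X_tr, y_tr)
print(cv_log_demo.best_score_, "\n", cv_log_demo.best_params_)
```

The step name prefix (`logreg__`) is how `GridSearchCV` routes each parameter to the right pipeline step.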

In [None]:
# desired_cat_feat = ['term', 'emp_length', 'home_ownership', 'verification_status','purpose']
# df_encoded_purpose = pd.get_dummies(explored_data, columns = desired_cat_feat, drop_first=True)

In [None]:
# X = df_encoded_purpose.drop(columns=['addr_state','earliest_cr_line','repay_fail'])
# y = df_encoded_purpose.repay_fail

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)

# # Build the steps
# steps = [("scaler", StandardScaler()),
#          ("logreg", LogisticRegression())]
# pipeline = Pipeline(steps)

# # Create the parameter space
# parameters = {"logreg__C": np.linspace(0.001, 1.0, 20),
#              "logreg__solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag']}

# # Instantiate the grid search object
# cv_log = GridSearchCV(pipeline, param_grid=parameters, scoring='recall')

# # Fit to the training data
# cv_log.fit(X_train, y_train)
# print(cv_log.best_score_, "\n", cv_log.best_params_)

In [None]:
# desired_cat_feat = ['term', 'emp_length', 'home_ownership', 'verification_status','purpose','addr_state']
# df_encoded_all = pd.get_dummies(explored_data, columns = desired_cat_feat, drop_first=True)

In [None]:
# X = df_encoded_all.drop(columns=['earliest_cr_line','repay_fail'])
# y = df_encoded_all.repay_fail

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)

# # Build the steps
# steps = [("scaler", StandardScaler()),
#          ("logreg", LogisticRegression())]
# pipeline = Pipeline(steps)

# # Create the parameter space
# parameters = {"logreg__C": np.linspace(0.001, 1.0, 20),
#              "logreg__solver": ['newton-cg', 'lbfgs', 'liblinear']}

# # Instantiate the grid search object
# cv_log = GridSearchCV(pipeline, param_grid=parameters, scoring='recall')

# # Fit to the training data
# cv_log.fit(X_train, y_train)
# print(cv_log.best_score_, "\n", cv_log.best_params_)

## 3.7 Random Forest<a id='3.7_Random_Forest'></a>

In [None]:
# X = df_encoded.drop(columns=['addr_state','purpose','earliest_cr_line','repay_fail'])
# y = df_encoded.repay_fail

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)

# model = RandomForestClassifier()

# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# # Number of features to consider at every split
# max_features = ['sqrt', 'log2']
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

# # Random search of parameters, using 3 fold cross validation, 
# # search across 100 different combinations, and use all available cores
# cv_rf = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, 
#                         random_state=47, n_jobs = -1, scoring='recall')
# # Fit the random search model
# cv_rf.fit(X_train, y_train)
# print(cv_rf.best_score_, "\n", cv_rf.best_params_)
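
To show why a randomized search is used here, the sketch below first counts the combinations in the grid defined above, then runs a much smaller search on synthetic data. The synthetic data, the reduced grid `small_grid`, `n_iter=5` and the name `cv_rf_demo` are assumptions chosen for speed, not the notebook's configuration or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# The grid as defined in the commented-out cell above
max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["sqrt", "log2"],
    "max_depth": max_depth,
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}
# Size of the exhaustive grid; the randomized search samples only n_iter of these
full_grid_size = len(ParameterGrid(random_grid))
print("combinations in the full grid:", full_grid_size)

# Reduced grid and synthetic data (assumed) so the sketch runs quickly
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=47
)
small_grid = {
    "n_estimators": [20, 50],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}
cv_rf_demo = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=47),
    param_distributions=small_grid,
    n_iter=5, cv=3, random_state=47, scoring="recall",
)
cv_rf_demo.fit(X_syn, y_syn)
print(cv_rf_demo.best_score_, "\n", cv_rf_demo.best_params_)
```

With `n_iter=100` and `cv=3`, the original search fits 300 models rather than three fits for every combination in the full grid.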

In [None]:
# X = df_encoded_purpose.drop(columns=['addr_state','earliest_cr_line','repay_fail'])
# y = df_encoded_purpose.repay_fail

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)

# model = RandomForestClassifier()

# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# # Number of features to consider at every split
# max_features = ['sqrt', 'log2']
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

# # Random search of parameters, using 3 fold cross validation, 
# # search across 100 different combinations, and use all available cores
# cv_rf_2 = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, 
#                         random_state=47, n_jobs = -1, scoring='recall')
# # Fit the random search model
# cv_rf_2.fit(X_train, y_train)
# print(cv_rf_2.best_score_, "\n", cv_rf_2.best_params_)

In [None]:
# X = df_encoded_all.drop(columns=['earliest_cr_line','repay_fail'])
# y = df_encoded_all.repay_fail

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47, stratify=y)

# model = RandomForestClassifier()

# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# # Number of features to consider at every split
# max_features = ['sqrt', 'log2']
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

# # Random search of parameters, using 3 fold cross validation, 
# # search across 100 different combinations, and use all available cores
# cv_rf_3 = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, 
#                         random_state=47, n_jobs = -1, scoring='recall')
# # Fit the random search model
# cv_rf_3.fit(X_train, y_train)
# print(cv_rf_3.best_score_, "\n", cv_rf_3.best_params_)

References
1. https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/
2. https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74