
# Capstone Project 

# Author : Hamidreza Salahi

# Notebook : 4

# Models Building

After completing EDA and having a clean dataset to work with, the next step is to build and compare models. The goal of this notebook is to compare three classifiers, an *Artificial Neural Network, XGBoost and Random Forest*, in terms of their accuracy on a held-out test set.

## Contents:
* [Artificial Neural Networks (ANNs)](#Artificial-Neural-Networks-(ANNs))
* [XGBoost Classifier](#XGBoost-Classifier)
* [Random Forest Classifier](#Random-Forest-Classifier)


In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#Importing clean data
loan_df = pd.read_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\loan_sample_after_EDA.csv')

loan_df.head()

Unnamed: 0,loan_status,last_fico_avg,int_rate,term,fico_avg,acc_open_past_24mths,funded_amnt,loan_amnt,tot_hi_cred_lim,dti,...,home_improvement,house,major_purchase,medical,moving,other,renewable_energy,small_business,vacation,wedding
0,0,697.0,20.55,60,702.0,7.0,32025.0,32025.0,210073.0,39.97,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,682.0,9.99,36,687.0,4.0,11200.0,11200.0,97239.0,28.19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,692.0,15.05,36,662.0,2.0,20000.0,20000.0,32716.0,19.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,507.0,11.53,36,672.0,2.0,10000.0,10000.0,14200.0,3.13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,507.0,17.27,60,662.0,5.0,11050.0,11050.0,245250.0,8.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
loan_df.shape

(228958, 79)

### Test-Train Split

The first step in modeling is to separate the dependent variable, y = `loan_status`, from the independent variables, X.

In [4]:
# Separating the dependent variable (y) from the independent variables (X)
X = loan_df.drop(columns='loan_status')
y = loan_df['loan_status']
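
The preview above shows an `Unnamed: 0` column, which is what pandas typically produces when a CSV was saved together with its index. As written, this column stays in `X` and is treated as a feature. If that is not intended, one option is to read it back as the index. The sketch below uses a small in-memory CSV with made-up values; `csv_text`, `raw`, `clean`, `X_demo` and `y_demo` are illustrative names, not part of this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV that was written with its index (note the empty first header).
csv_text = ",loan_status,int_rate\n0,0,20.55\n1,0,9.99\n2,1,11.53\n"

# Default read: the saved index comes back as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # ['Unnamed: 0', 'loan_status', 'int_rate']

# Reading the first column as the index keeps it out of the features.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
X_demo = clean.drop(columns="loan_status")
y_demo = clean["loan_status"]
print(list(X_demo.columns))  # ['int_rate']
```

Applying the same change to `loan_df` would reduce the feature count by one, so the shapes and scores reported in this notebook would no longer match exactly.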

The next step is to split the dataset into training and test sets (80% / 20%).

In [5]:
# import train_test_split
from sklearn.model_selection import train_test_split

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                      y, 
                                                      test_size = 0.2, 
                                                      random_state = 15)

In [6]:
# check dataframes shapes
print(f"The shape of the X_train dataframe is: {X_train.shape}.")
print(f"The shape of the X_test dataframe is: {X_test.shape}.\n")
print(f"The shape of the y_train dataframe is: {y_train.shape}.")
print(f"The shape of the y_test dataframe is: {y_test.shape}.\n")

The shape of the X_train dataframe is: (183166, 78).
The shape of the X_test dataframe is: (45792, 78).

The shape of the y_train dataframe is: (183166,).
The shape of the y_test dataframe is: (45792,).
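
The split above produces two sets, and the test set is later also passed as `validation_data` when the neural network is trained. If a separate validation set is wanted, one option is to call `train_test_split` twice. This is a minimal sketch on synthetic data, assuming a 60/20/20 split; the `*_demo` names, the proportions and the use of `stratify` are illustrative choices and are not taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (1000 rows, 5 features, binary target).
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# First hold out 20% as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, stratify=y_demo
)

# Then split the remaining 80% into 75% train / 25% validation,
# which is 60% / 20% of the full dataset.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=15, stratify=y_rest
)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
# (600, 5) (200, 5) (200, 5)
```

With such a split the validation set can guide training decisions, while the test set is only used once for the final score.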



### Scaling Data

Now I am going to apply MinMaxScaler to the features. The scaler is fitted on the training set only, *after* the train-test split, and is then used to transform both sets. This avoids data leakage: the minimum and maximum of the test data must not influence the scaling parameters.

In [7]:
from sklearn.preprocessing import MinMaxScaler
# instantiate the scaler
scaler = MinMaxScaler()

# fit the scaler on the training data only
scaler = scaler.fit(X_train)

# transform both sets with the training-set statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
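
To make the leakage argument concrete, the min-max formula x' = (x - min) / (max - min) can be applied by hand. The sketch below uses made-up numbers and a hypothetical helper `min_max`; it mirrors the fit-on-train, transform-both pattern used above rather than the scikit-learn internals.

```python
# Hypothetical toy values, not taken from the loan dataset.
train = [10.0, 20.0, 30.0]
test = [5.0, 25.0, 40.0]

# "Fit": learn the minimum and maximum from the training split only.
lo, hi = min(train), max(train)


def min_max(values, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) with statistics learned from the training data."""
    return [(v - lo) / (hi - lo) for v in values]


train_scaled = min_max(train, lo, hi)
test_scaled = min_max(test, lo, hi)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [-0.25, 0.75, 1.5]
```

Test values can fall outside [0, 1] because the test set played no part in choosing the range. That is expected and is the price of keeping the test data unseen.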


In [14]:
from scipy import stats 

from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, 
    roc_auc_score, roc_curve, auc,
    plot_confusion_matrix, plot_roc_curve
)

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization 
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC



In [9]:
def evaluate_nn(true, pred, train=True):
    # label the report by split; the metrics are computed the same way for both
    split = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(true, pred, output_dict=True))
    print(f"{split} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(true, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(true, pred)}\n")

def plot_learning_evolution(r):
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 2, 1)
    plt.plot(r.history['loss'], label='Loss')
    plt.plot(r.history['val_loss'], label='val_Loss')
    plt.title('Loss evolution during training')
    plt.legend()

    plt.subplot(2, 2, 2)
    plt.plot(r.history['AUC'], label='AUC')
    plt.plot(r.history['val_AUC'], label='val_AUC')
    plt.title('AUC score evolution during training')
    plt.legend();

def nn_model(num_columns, num_labels, hidden_units, dropout_rates, learning_rate):
    inp = tf.keras.layers.Input(shape=(num_columns, ))
    x = BatchNormalization()(inp)
    x = Dropout(dropout_rates[0])(x)
    for i in range(len(hidden_units)):
        x = Dense(hidden_units[i], activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(dropout_rates[i + 1])(x)
    x = Dense(num_labels, activation='sigmoid')(x)
  
    model = Model(inputs=inp, outputs=x)
    model.compile(optimizer=Adam(learning_rate), loss='binary_crossentropy', metrics=[AUC(name='AUC')])
    return model

In [11]:
num_columns = X_train_scaled.shape[1]
num_labels = 1
hidden_units = [150, 150, 150]
dropout_rates = [0.1, 0, 0.1, 0]
learning_rate = 1e-3


model = nn_model(
    num_columns=num_columns, 
    num_labels=num_labels,
    hidden_units=hidden_units,
    dropout_rates=dropout_rates,
    learning_rate=learning_rate
)
r = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),
    epochs=20,
    batch_size=32
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
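
To get a feel for the size of the network built by `nn_model` with the settings above (78 input features, three hidden layers of 150 units, one output), the parameter count can be worked out by hand. This sketch assumes the standard Keras layouts: a `Dense` layer has `n_in * n_out` weights plus `n_out` biases, and a `BatchNormalization` layer has four values per feature, two of them trainable. The helper `count_params` is hypothetical and not part of the notebook.

```python
def count_params(num_columns, hidden_units, num_labels):
    """Return (total, trainable) parameter counts for the nn_model layer layout."""
    total = 0
    trainable = 0

    def add_batchnorm(width):
        # gamma and beta are trainable; moving mean and variance are not.
        nonlocal total, trainable
        total += 4 * width
        trainable += 2 * width

    def add_dense(n_in, n_out):
        # weight matrix plus one bias per output unit, all trainable.
        nonlocal total, trainable
        total += n_in * n_out + n_out
        trainable += n_in * n_out + n_out

    add_batchnorm(num_columns)           # BatchNormalization on the input
    n_in = num_columns
    for width in hidden_units:           # Dense -> BatchNormalization per hidden layer
        add_dense(n_in, width)
        add_batchnorm(width)
        n_in = width
    add_dense(n_in, num_labels)          # sigmoid output layer
    return total, trainable              # Dropout layers add no parameters


total, trainable = count_params(78, [150, 150, 150], 1)
print(total, trainable)  # 59413 58357
```

With roughly 58 thousand trainable parameters against about 183 thousand training rows, the network is small relative to the data.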


In [12]:
y_test_pred = model.predict(X_test_scaled)
evaluate_nn(y_test, y_test_pred.round(), train=False)

Test Result:
Accuracy Score: 88.67%
_______________________________________________
CLASSIFICATION REPORT:
                      0             1  accuracy     macro avg  weighted avg
precision      0.912080      0.862646  0.886749      0.887363      0.888109
recall         0.863357      0.911596  0.886749      0.887477      0.886749
f1-score       0.887050      0.886446  0.886749      0.886748      0.886757
support    23587.000000  22205.000000  0.886749  45792.000000  45792.000000
_______________________________________________
Confusion Matrix: 
 [[20364  3223]
 [ 1963 20242]]
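
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class 1 precision, recall and F1 from the four counts printed above, using the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 20364, 3223   # true class 0: predicted 0, predicted 1
fn, tp = 1963, 20242   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of all loans predicted as class 1, the share that is class 1
recall_1 = tp / (tp + fn)      # of all true class 1 loans, the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"{accuracy:.4f} {precision_1:.4f} {recall_1:.4f} {f1_1:.4f}")
# 0.8867 0.8626 0.9116 0.8864
```

These values agree with the `accuracy` column and the class `1` column of the report.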



In [17]:
def print_score(true, pred, train=True):
    # label the report by split; the metrics are computed the same way for both
    split = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(true, pred, output_dict=True))
    print(f"{split} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(true, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(true, pred)}\n")

In [18]:
xgb_clf = XGBClassifier(use_label_encoder=False)
# xgb_cv = RandomizedSearchCV(
#     xgb_clf, param_grid, cv=3, n_iter=60, 
#     scoring='roc_auc', n_jobs=-1, verbose=1
# )
# xgb_cv.fit(X_train, y_train)

# best_params = xgb_cv.best_params_
# best_params['tree_method'] = 'gpu_hist'
# # best_params = {'n_estimators': 50, 'tree_method': 'gpu_hist'}
# print(f"Best Parameters: {best_params}")

# xgb_clf = XGBClassifier(**best_params)
xgb_clf.fit(X_train_scaled, y_train)

y_test_pred = xgb_clf.predict(X_test_scaled)

print_score(y_test, y_test_pred, train=False)

Parameters: { use_label_encoder } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Test Result:
Accuracy Score: 88.95%
_______________________________________________
CLASSIFICATION REPORT:
                      0             1  accuracy     macro avg  weighted avg
precision      0.912859      0.866978  0.889457      0.889919      0.890611
recall         0.868275      0.911957  0.889457      0.890116      0.889457
f1-score       0.890009      0.888899  0.889457      0.889454      0.889471
support    23587.000000  22205.000000  0.889457  45792.000000  45792.000000
_______________________________________________
Confusion Matrix: 
 [[20480  3107]
 [ 1955 20250]]
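
The commented-out search in the cell above relies on `RandomizedSearchCV` and a `param_grid` that are neither imported nor defined in this notebook. The sketch below shows one way such a search could be wired up. The grid values, the synthetic data and the choice of `RandomForestClassifier` as the estimator are assumptions made for illustration; the same pattern should apply to `XGBClassifier` with its own parameter names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the search finishes quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=15)

# Hypothetical search space; real values would need tuning for the loan data.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Sample 4 of the 12 combinations and score each by 3-fold cross-validated ROC AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=15),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=15,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(f"Best Parameters: {best_params}")
print(f"Best CV ROC AUC: {search.best_score_:.3f}")
```

Because the search cross-validates within the data it is given, it should be fitted on the training set only, leaving the test set for the final evaluation.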



In [19]:
rf_clf = RandomForestClassifier(n_estimators=100)
# rf_cv = RandomizedSearchCV(
#     rf_clf, param_grid, cv=3, n_iter=60, 
#     scoring='roc_auc', n_jobs=-1, verbose=1
# )
# rf_cv.fit(X_train, y_train)
# best_params = rf_cv.best_params_
# print(f"Best Parameters: {best_params}")
# rf_clf = RandomForestClassifier(**best_params)
# fit on the scaled features so that training and prediction use the same inputs
rf_clf.fit(X_train_scaled, y_train)

y_train_pred = rf_clf.predict(X_train_scaled)
y_test_pred = rf_clf.predict(X_test_scaled)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                 0        1  accuracy  macro avg  weighted avg
precision      1.0      1.0       1.0        1.0           1.0
recall         1.0      1.0       1.0        1.0           1.0
f1-score       1.0      1.0       1.0        1.0           1.0
support    94203.0  88963.0       1.0   183166.0      183166.0
_______________________________________________
Confusion Matrix: 
 [[94203     0]
 [    0 88963]]

Test Result:
Accuracy Score: 88.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0             1  accuracy     macro avg  weighted avg
precision      0.914287      0.865245  0.889173      0.889766      0.890506
recall         0.866028      0.913758  0.889173      0.889893      0.889173
f1-score       0.889503      0.888840  0.889173      0.889172      0.889182
support    23587.000000  22205.000000  0.889173  45792.000000  45