# Model Development of Loan Predictor

## 1. Introduction

#### **Performance Evaluation**

Let's look at the confusion matrix of our predictions:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: left;">
      <th>Actual/Prediction</th>
      <th>Good Loan [0]</th>
      <th>Bad Loan [1]</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Good Loan [0]</th>
      <td>TN</td>
      <td>FP</td>
    </tr>
    <tr>
      <th>Bad Loan [1]</th>
      <td>FN</td>
      <td>TP</td>
    </tr>
  </tbody>
</table>

So, what metric shall we use for our model? Consider this:

- We want the model to detect most (if not all) of the loans that are likely to default. For example, if 10 loans are going to default, the model is reliable if all of them are tagged as `Bad Loan`. The metric for this is **sensitivity** (or *True Positive Rate*, TPR): the ratio of positive instances that are correctly detected by the model,
$$\text{sensitivity} = TPR = \frac{TP}{FN + TP}$$

- In an extreme case, **TPR** can be inflated by a model that tags most loans as `Bad Loan`, which produces a huge number of *False Positive* predictions: most `Good Loan` instances get tagged as `Bad Loan`. Therefore, another metric to watch is **FPR** (*False Positive Rate*): the ratio of negative instances that are incorrectly tagged as positive,
$$FPR=\frac{FP}{TN + FP}$$
- TPR measures the model's ability to detect potential losses, which should be avoided. On the other hand, FPR is the price of high sensitivity to positive instances: in our case, every false positive is a good loan we would reject, so we miss potential profit. We want to optimize the trade-off between the two metrics. They are commonly considered together in the **Receiver Operating Characteristic** (ROC) curve, which plots *TPR* against *FPR*.
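
As a small illustration of the two formulas above, the sketch below computes TPR and FPR from a confusion matrix. The labels are made-up toy values (not taken from the loan dataset), and the variable names are only for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = Good Loan, 1 = Bad Loan
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP,
# which matches the layout of the confusion matrix table
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

tpr = tp / (fn + tp)  # share of bad loans that were caught
fpr = fp / (tn + fp)  # share of good loans that were wrongly flagged
print(f'TPR = {tpr:.2f}, FPR = {fpr:.2f}')
```

Here 3 of the 4 bad loans are caught (TPR = 0.75), at the cost of flagging 2 of the 6 good loans (FPR = 0.33).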

## 1. Dataset Preparation

#### **The Dataset**

We have already prepared two datasets with the same structure, one exclusively for training and one exclusively for testing:

- Dataset for training: Loan Dataset of 2010 - 2015
- Dataset for testing: Loan Dataset of 2016 - 2017

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', 75)
pd.set_option('display.max_rows', 75)

# forward slashes avoid the invalid '\l' escape sequence and work on every OS
train_set = pd.read_csv('dataset/lc_2010-2015.csv', dtype={'desc': 'str', 'verification_status_joint': 'str'})
test_set = pd.read_csv('dataset/lc_2016-2017.csv', dtype={'desc': 'str', 'verification_status_joint': 'str'})

#### **Preparing the Dataset**

In this first part of dataset preparation, we will create a pipeline to transform the dataset. It consists of:

- Filtering the dataset: 1) keep only `INDIVIDUAL` loans; 2) exclude on-going loans
- Replacing missing values, as explained in the previous section.

I have already developed a class for each activity: 1) `LoanDataLabelPredictor` for filtering; 2) `LoanDataMissingHandler` for handling missing data. The output of the pipeline is the dataset split into a predictor set and a label set.
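
The classes in `modules.data_preprocess` are not shown in this notebook, so the following is only a rough sketch of how such pipeline steps could be written as custom scikit-learn transformers. The class names `StatusFilter` and `MedianFiller`, the column names, and the `'Current'` status value are assumptions for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class StatusFilter(TransformerMixin, BaseEstimator):
    """Hypothetical step: keep INDIVIDUAL loans that are no longer on-going."""
    def __init__(self, ongoing=('Current',)):
        self.ongoing = ongoing

    def fit(self, X, y=None):
        return self  # nothing to learn for a pure filter

    def transform(self, X):
        mask = (X['application_type'] == 'INDIVIDUAL') & ~X['loan_status'].isin(self.ongoing)
        return X.loc[mask].reset_index(drop=True)

class MedianFiller(TransformerMixin, BaseEstimator):
    """Hypothetical step: fill missing numeric values with the training median."""
    def fit(self, X, y=None):
        self.medians_ = X.select_dtypes('number').median()
        return self

    def transform(self, X):
        return X.fillna(self.medians_)

# Tiny made-up frame to exercise the sketch
demo = pd.DataFrame({
    'application_type': ['INDIVIDUAL', 'JOINT', 'INDIVIDUAL', 'INDIVIDUAL'],
    'loan_status': ['Fully Paid', 'Fully Paid', 'Current', 'Charged Off'],
    'annual_inc': [50_000.0, 80_000.0, 60_000.0, np.nan],
})
sketch_pipeline = Pipeline([('filter', StatusFilter()), ('fill', MedianFiller())])
cleaned = sketch_pipeline.fit_transform(demo)
print(cleaned)
```

Because the medians are learned in `fit`, calling `transform` on the test set later reuses the training statistics instead of recomputing them.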

In [2]:
from sklearn.pipeline import Pipeline
from modules.data_preprocess import LoanDataPreprocess, LoanDataMissingHandler, LoanDataLabelPredictor 
from sklearn import set_config

dataset_preprocess = Pipeline([
    ('preprocess', LoanDataPreprocess()),
    ('extract_label_predictor', LoanDataLabelPredictor(exclude=['loan_amnt', 'term', 'earliest_cr_line'])),
    ('missing_handler', LoanDataMissingHandler())
    ])

set_config(display='diagram')
dataset_preprocess

In [3]:
from modules.data_exploration import DataExploration

X_train, y_train = dataset_preprocess.fit_transform(train_set)
DataExploration(X_train).show_nans_or_zeroes('nans')

| Feature | Nans Count | Nans Percentage (%) | Data Types |
|---|---|---|---|
| emp_length | 9132 | 4.34 | category |
| home_ownership | 0 | 0.0 | category |
| annual_inc | 0 | 0.0 | float64 |
| verification_status | 0 | 0.0 | category |
| dti | 0 | 0.0 | float64 |
| delinq_2yrs | 0 | 0.0 | float64 |
| inq_last_6mths | 0 | 0.0 | float64 |
| mths_since_last_delinq | 0 | 0.0 | float64 |
| mths_since_last_record | 0 | 0.0 | float64 |
| open_acc | 0 | 0.0 | float64 |

In [4]:
X_test, y_test = dataset_preprocess.transform(test_set)

Following the confusion matrix above, we map the labels as follows: `Good Loan` becomes 0 and every other label becomes 1.

In [23]:
def map_array(x):
    if x == 'Good Loan':
        return 0
    return 1

y_train = np.vectorize(map_array)(y_train)
y_test = np.vectorize(map_array)(y_test)

## 2. Features Preparation

#### **Categorical Features**

In [13]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder(handle_unknown='ignore')

#### **Numerical Features**

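For our base model, let's first just scale the features. This is standard practice and may turn out to give good results. Strictly speaking, scaling is not required for tree-based methods, and logistic regression can be fit without it, although scikit-learn's regularized, iterative solvers usually converge faster on scaled features.

For datasets with outliers, scikit-learn provides `RobustScaler`, which centers on the median and scales by the interquartile range, so it is less affected by extreme values.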

In [14]:
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
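
To see why a robust scaler helps when outliers are present, here is a minimal comparison with `StandardScaler`. The income values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical annual incomes with one extreme value
income_demo = np.array([[30_000.], [40_000.], [50_000.], [60_000.], [1_000_000.]])

# RobustScaler uses the median and interquartile range
robust_demo = RobustScaler().fit_transform(income_demo)
# StandardScaler uses the mean and standard deviation, both pulled by the outlier
standard_demo = StandardScaler().fit_transform(income_demo)

print('robust:  ', robust_demo.ravel())
print('standard:', standard_demo.ravel())
```

With the robust scaler the four typical incomes stay evenly spread around zero, while with the standard scaler they are squeezed into a narrow band because the single outlier dominates the mean and standard deviation.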

#### **Combining Feature Preprocess**

In [18]:
from sklearn.compose import ColumnTransformer

categorical_cols = X_train.select_dtypes(include='category').columns.values
numerical_cols = X_train.select_dtypes(exclude='category').columns.values
ct = ColumnTransformer([
    ('categorical', one_hot, categorical_cols),
    ('numerical', robust_scaler, numerical_cols)
    ])
ct.fit(X_train)
X_train_transfd = ct.transform(X_train)
X_test_transfd = ct.transform(X_test)

## 3. Baseline Models

In [None]:
import tensorflow as tf
from tensorflow import keras

# creating model function, specify parameters to be cv-ed later
def mlp_model(input_shape, n_hidden=5, n_neurons=100, learning_rate=3e-3, activation='relu'):
    input_ = keras.layers.Input(shape=[input_shape,])
    dense = keras.layers.Dense(n_neurons, activation=activation)(input_)
    for layer in range(n_hidden - 1):
        dense = keras.layers.Dense(n_neurons, activation=activation)(dense)
    output_ = keras.layers.Dense(1, activation='sigmoid')(dense)
    model = keras.Model(inputs=[input_], outputs=[output_])
    # compiling model
    model.compile(
        loss='binary_crossentropy', 
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate))
    return model

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
# classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# scoring
from sklearn.metrics import roc_auc_score
# Feature Selection
from sklearn.feature_selection import RFECV

sssplit = StratifiedShuffleSplit(n_splits=10)
# each candidate is a pipeline: recursive feature elimination, then the classifier
# RFECV expects a scorer name (or scorer object), not a bare metric function
log_clf = Pipeline([
    ('feature_selection', RFECV(LogisticRegression(max_iter=1000), scoring='roc_auc')),
    ('classifier', LogisticRegression(max_iter=1000))
    ])
rf_clf = Pipeline([
    ('feature_selection', RFECV(RandomForestClassifier(), scoring='roc_auc')),
    ('classifier', RandomForestClassifier())
    ])
# RFECV needs coef_ or feature_importances_, which SVC only exposes with a linear kernel
sv_clf = Pipeline([
    ('feature_selection', RFECV(SVC(kernel='linear'), scoring='roc_auc')),
    ('classifier', SVC())
    ])
# mlp_clf =   (left unfinished in the original cell, so the MLP is excluded from
# the comparison until mlp_model is wrapped as a scikit-learn style estimator)
models = [log_clf, rf_clf, sv_clf]
model_ids = ['logisticreg', 'random forest', 'svm']
# iterate over the models
for model, model_id in zip(models, model_ids):
    # scores of this model, one per split
    model_scores = []
    # iterate over the splits (stratification needs the labels)
    for train_index, valid_index in sssplit.split(X_train_transfd, y_train):
        # ColumnTransformer returns an array or sparse matrix, so index rows directly
        X_train_s, X_valid_s = X_train_transfd[train_index], X_train_transfd[valid_index]
        y_train_s, y_valid_s = y_train[train_index], y_train[valid_index]
        # fit the model to the training data
        model.fit(X_train_s, y_train_s)
        # ROC AUC needs continuous scores rather than hard class predictions
        if hasattr(model, 'predict_proba'):
            y_score_s = model.predict_proba(X_valid_s)[:, 1]
        else:
            y_score_s = model.decision_function(X_valid_s)
        # calculate the ROC AUC score for this split
        score = roc_auc_score(y_valid_s, y_score_s)
        model_scores.append(score)
    # calculate the mean and standard deviation of the scores
    # this is model-specific
    mean_score = np.mean(model_scores)
    std_dev_score = np.std(model_scores)
    # print the mean and standard deviation of the scores
    print(f'{model_id}: Mean score = {mean_score:.2f}, Standard deviation = {std_dev_score:.2f}')

## 3. Baseline Models

Now we are going to select several candidate baseline models for our classification task!
For each candidate, we will observe how it performs and compare the results across models. We will also perform *Feature Selection* using `Recursive Feature Elimination (RFE)`, focusing on the `recall` of the `Charged Off` class, that is, the share of charged-off loans that the model classifies correctly.
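
Before applying it to the loan data, the idea behind recursive feature elimination with cross-validation can be sketched on synthetic data. The dataset below comes from `make_classification` and the estimator is an arbitrary choice for the illustration, so none of the numbers relate to the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary problem: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0)

# RFECV repeatedly drops the weakest feature (by coefficient magnitude here)
# and keeps the feature count with the best cross-validated recall of class 1
selector_demo = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(3),
    scoring='recall')
selector_demo.fit(X_demo, y_demo)

print('Selected number of features:', selector_demo.n_features_)
print('Mask of kept features:', selector_demo.support_)
X_demo_selected = selector_demo.transform(X_demo)
```

The fitted selector can be used as a transformer, which is why it fits naturally as the first step of a `Pipeline` in front of the classifier.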

### 3.1. Stochastic Gradient Descent Classifier (SGDClassifier)

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=99)


#### **Feature Selection**

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

features_evaluator = RFECV(estimator=sgd_clf, cv=StratifiedKFold(2), scoring='recall')
features_evaluator.fit(X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}))
rfecv_mean = features_evaluator.cv_results_['mean_test_score']
optimal_n_features = features_evaluator.n_features_

# plotting
def plot_feature_evaluator(rfecv_mean, optimal_n_features):
    plt.plot(
        range(1, len(rfecv_mean) + 1), rfecv_mean
    )
    plt.xlabel('Number of features selected')
    plt.ylabel('Cross validation score')
    print(f'Optimal number of features: {optimal_n_features}')

plot_feature_evaluator(rfecv_mean, optimal_n_features)

In [None]:
features_selector = RFECV(
    estimator=sgd_clf, cv=StratifiedKFold(2), scoring='recall',
    min_features_to_select=optimal_n_features)

In [None]:
from sklearn.pipeline import Pipeline

sgd_clf_model = Pipeline([
    ('features_selector', features_selector),
    ('sgd_classifier', SGDClassifier(random_state=99))
])

#### **Performance Measure**

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# cross-validated class predictions, used to build the confusion matrix
y_train_pred = \
    cross_val_predict(
        sgd_clf_model, X_train_res_transf, 
        y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), cv=3)
confusion_matrix(y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), y_train_pred)

In [None]:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
import seaborn as sns

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.lineplot(x=thresholds, y=precisions[:-1], ax=ax, label='Precision')
    sns.lineplot(x=thresholds, y=recalls[:-1], ax = ax, label='Recall')
    ax.lines[0].set_linestyle('--')
    ax.lines[1].set_linestyle('-')
    plt.xlabel('Threshold')
    plt.legend()
    plt.show()
    # plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    # plt.plot(thresholds, recalls[:-1], "g-", label="Recall")

y_scores_sgd = \
    cross_val_predict(
        sgd_clf_model, X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), cv=3,
        method='decision_function')
precisions, recalls, thresholds = \
    precision_recall_curve(y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), y_scores_sgd)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

In [None]:
def plot_precision_recalls(precisions, recalls):
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.lineplot(x=precisions, y=recalls, ax=ax)
    plt.xlabel('Precisions')
    plt.ylabel('Recalls')
    
plot_precision_recalls(precisions, recalls)

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), y_scores_sgd)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (recall)')

plot_roc_curve(fpr, tpr)
    

In [None]:
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}), y_scores_sgd)

print(f'AUC Score= {auc:.2f}')

In [None]:
from imblearn.metrics import classification_report_imbalanced

sgd_clf_model.fit(X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}))
y_pred_test = sgd_clf_model.predict(X_test_transf)

print(classification_report_imbalanced(y_test.map({'Fully Paid': 0, 'Charged Off': 1}), y_pred_test))

### 3.2. XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(verbosity=0, use_label_encoder=False)
xgb_clf.fit(X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}))
y_pred_test = xgb_clf.predict(X_test_transf)

print(classification_report_imbalanced(y_test.map({'Fully Paid': 0, 'Charged Off': 1}), y_pred_test))


In [None]:
features_evaluator = RFECV(estimator=xgb_clf, cv=StratifiedKFold(2), scoring='recall')
features_evaluator.fit(X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}))
rfecv_mean = features_evaluator.cv_results_['mean_test_score']
optimal_n_features = features_evaluator.n_features_

plot_feature_evaluator(rfecv_mean, optimal_n_features) 

In [None]:
features_selector = RFECV(
    estimator=xgb_clf, cv=StratifiedKFold(2), scoring='recall',
    min_features_to_select=optimal_n_features)

xgb_clf_model = Pipeline([
    ('features_selector', features_selector),
    ('xgb_classifier', XGBClassifier(verbosity=0))
])

xgb_clf_model.fit(X_train_res_transf, y_train_res.map({'Fully Paid': 0, 'Charged Off': 1}))
# predict with the fitted pipeline (feature selection + classifier), not the bare xgb_clf
y_pred_test = xgb_clf_model.predict(X_test_transf)

print(classification_report_imbalanced(y_test.map({'Fully Paid': 0, 'Charged Off': 1}), y_pred_test))