# Loan Default Prediction
## Step 5: Modeling (part 1)

## Table of Contents
1. Overview (current notebook)
2. Imports and Data Loading (current notebook)
3. Logistic Regression (current notebook)
4. KNN (part 2)
5. Random Forest (part 3)
6. SVM (part 4)
7. Final Model Selection and Summary (part 5)

## 5.1 Overview

### Evaluation Metrics
The data is now ready for model training. Before training, however, it is important to choose evaluation metrics that fit the problem. Without a reasonable metric, it is not possible to track training progress (is the model underfitting or overfitting?) or to select the most effective of several models. The goal of the model is to minimize losses by identifying probable defaults, so false negatives (actual defaults predicted as non-defaults) are of particular importance. Furthermore, the target class is imbalanced, so accuracy alone can be misleading. The metrics to be considered are the confusion matrix (with emphasis on recall, also called sensitivity), log loss, and AUC-ROC.
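
To make the point about accuracy concrete, here is a minimal sketch on made-up labels (not the loan data): 90 non-defaults and 10 defaults, scored against a hypothetical classifier that always predicts "no default". The arrays `y_true` and `y_pred_naive` are assumptions introduced only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# hypothetical imbalanced labels: 90 non-defaults (0) and 10 defaults (1)
y_true = np.array([0] * 90 + [1] * 10)

# a naive classifier that never predicts a default
y_pred_naive = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred_naive)
recall = recall_score(y_true, y_pred_naive)

# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_naive).ravel()

print(f"accuracy: {accuracy:.2f}")   # 0.90
print(f"recall:   {recall:.2f}")     # 0.00
print(f"false negatives: {fn}")      # 10
```

The naive classifier scores 90% accuracy while missing every default, which is why recall and the confusion matrix are tracked alongside accuracy.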

### Candidate Algorithms
A logistic regression model will establish the baseline (and it may well turn out to be the best performer). The other models to be tested are KNN, random forest, and SVM.

### Workflow
For each model, we will need to
1. set the parameters
2. fit the training data
3. predict based on the testing data
4. score using the above-mentioned metrics

Each set of parameters will be evaluated with k-fold cross-validation. Hyperparameter tuning will be explored with a grid or randomized search. Algorithm-appropriate visualizations will also be made.
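
The four steps and the cross-validation described above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's actual pipeline: the data comes from `make_classification` (a synthetic stand-in for the loan features), and the fold count, class weights, and scoring list are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# synthetic, imbalanced stand-in for the loan data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=23
)

# 1. set the parameters
model = LogisticRegression(max_iter=1000, random_state=23)

# stratified folds keep the class ratio roughly equal in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)

# 2.-4. cross_validate fits on each training fold, predicts on the
# held-out fold, and scores it with every metric listed
metric_names = ["recall", "roc_auc", "neg_log_loss"]
scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metric_names)

for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Note that scikit-learn reports log loss as `neg_log_loss` (the negated value), so that higher is better for every scorer.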

## 5.2 Imports and Data Loading

In [1]:
# packages and libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Scikit learn

In [3]:
# load training data
path = '../data/'
filename = 'train_scaled.csv'

df = pd.read_csv(path+filename)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../data/train_scaled.csv'

## 5.3 Logistic Regression


### 5.3.1 Baseline Model

In [None]:
# use the features scaled by StandardScaler()
X = df.iloc[:,7:13]
y = df['default']

In [None]:
# build a logistic regressor with default parameters
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.1, random_state=23)
clf = LogisticRegression(random_state=23)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:,1]

In [None]:
# visualize results with confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()

The baseline model performs reasonably well; we will also evaluate it on several other metrics.

In [None]:
# classification report
from sklearn.metrics import classification_report

target_names = ['0(no default)', '1(default)']
print(classification_report(y_test, y_pred, target_names=target_names))

In [None]:
# Brier score and log loss
from sklearn.metrics import brier_score_loss, log_loss

print('Brier score loss: \t' + str(brier_score_loss(y_test, y_proba)))
print('log loss: \t\t' + str(log_loss(y_test, y_proba)))
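
For reference, both scores have simple closed forms: the Brier score is the mean squared difference between the predicted probability and the outcome, and log loss is the mean negative log-likelihood of the true labels. The sketch below recomputes them by hand on four made-up predictions (`y_true_demo` and `p_demo` are hypothetical values, not model output) and checks the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# hypothetical labels and predicted probabilities of class 1
y_true_demo = np.array([0, 0, 1, 1])
p_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Brier score: mean squared error of the predicted probabilities
brier_manual = np.mean((p_demo - y_true_demo) ** 2)

# log loss: mean negative log-likelihood of the observed labels
log_loss_manual = -np.mean(
    y_true_demo * np.log(p_demo) + (1 - y_true_demo) * np.log(1 - p_demo)
)

# both comparisons should print True
print(np.isclose(brier_manual, brier_score_loss(y_true_demo, p_demo)))
print(np.isclose(log_loss_manual, log_loss(y_true_demo, p_demo)))
```

Lower is better for both scores. Log loss penalizes confident wrong predictions much more heavily than the Brier score does, because the logarithm grows without bound as the predicted probability of the true class approaches zero.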

In [None]:
# AUC-ROC score and visualization
from sklearn.metrics import roc_auc_score, RocCurveDisplay

print('ROC_AUC score: ', str(roc_auc_score(y_test, y_proba)))
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

In [None]:
# for visualizing datapoints
'''
results = pd.DataFrame({'predicted': y_proba, 'actual': y_test})  # y_proba is already 1-D
sorted_results = results.sort_values(by='predicted').reset_index()

sns.scatterplot(x=sorted_results.index, y=sorted_results['predicted'])
sns.scatterplot(x=sorted_results.index, y=sorted_results['actual'])
plt.show()
'''

### 5.3.2 Fine-Tuning and Cross Validation

The model performs well as is, but we will still experiment with the hyperparameters and perform k-fold cross-validation.

In [None]:

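from time import time

from scipy import stats
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# build a classifier; max_iter is raised from the default of 100 so that
# the fits in the searches below have room to converge
clf = LogisticRegression(random_state=23, max_iter=1000)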


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results["rank_test_score"] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print(
                "Mean validation score: {0:.3f} (std: {1:.3f})".format(
                    results["mean_test_score"][candidate],
                    results["std_test_score"][candidate],
                )
            )
            print("Parameters: {0}".format(results["params"][candidate]))
            print("")


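# specify parameters and distributions to sample from;
# 'saga' is the only solver that supports every penalty listed here,
# C is the inverse of the regularization strength, and
# l1_ratio is only used by the elastic-net penalty
param_dist = [
    {
        "solver": ["saga"],
        "penalty": ["l1", "l2"],
        "C": stats.loguniform(1e-2, 1e0),
    },
    {
        "solver": ["saga"],
        "penalty": ["elasticnet"],
        "C": stats.loguniform(1e-2, 1e0),
        "l1_ratio": stats.uniform(0, 1),
    },
    {"solver": ["saga"], "penalty": [None]},
]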

# run randomized search
n_iter_search = 15
random_search = RandomizedSearchCV(
    clf, param_distributions=param_dist, n_iter=n_iter_search
)

start = time()
random_search.fit(X, y)
print(
    "RandomizedSearchCV took %.2f seconds for %d candidates parameter settings."
    % ((time() - start), n_iter_search)
)
report(random_search.cv_results_)

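# use a full grid over all parameters;
# 'saga' is the only solver that supports every penalty listed here
param_grid = [
    {
        "solver": ["saga"],
        "penalty": ["l1", "l2"],
        "C": np.power(10, np.arange(-2, 1, dtype=float)),
    },
    {
        "solver": ["saga"],
        "penalty": ["elasticnet"],
        "C": np.power(10, np.arange(-2, 1, dtype=float)),
        "l1_ratio": np.linspace(0, 1, num=10),
    },
    {"solver": ["saga"], "penalty": [None]},
]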

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print(
    "GridSearchCV took %.2f seconds for %d candidate parameter settings."
    % (time() - start, len(grid_search.cv_results_["params"]))
)
report(grid_search.cv_results_)
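
After a search has been fitted, the selected configuration is available on the search object. The following sketch shows the attributes involved, using synthetic data and a deliberately small grid so that it runs quickly; the variable names, the grid values, and the choice of recall as the scoring metric are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the loan data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=23
)

# small illustrative grid; scoring on recall reflects the focus on
# false negatives
search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=23),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="recall",
    cv=5,
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("best mean CV recall:", round(search.best_score_, 3))

# with refit=True (the default), best_estimator_ is retrained on all of
# X_tr and can be evaluated on data the search never saw
y_pred_demo = search.best_estimator_.predict(X_te)
print("held-out recall:", round(recall_score(y_te, y_pred_demo), 3))
```

Fitting the search on the training split only, as done here, keeps the held-out rows untouched until the final evaluation.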
