In [131]:
#!pip install pandas
#!pip install numpy
#!pip install matplotlib
#!pip install scikit-learn
#!pip install xlrd

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import xlrd

from sklearn.metrics import classification_report
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Import Data

I import the data from Excel, then report a quick snapshot of the absolute number of defaults (6636) and non-defaults (23364), as well as the proportion of defaults (22%). Almost a fourth of the individuals in this dataset default. For training, I save the attribute data in a variable, `X`, after removing the irrelevant ID column, and the label data in a variable, `y_true`. The first few rows of each are printed for confirmation.

In [379]:
# import dataset
data = pd.read_excel('./default of credit card clients.xls', header=1)
num_default = data['default payment next month'].sum()
num_no_default = data.shape[0] - num_default
default_ratio = num_default / (num_no_default + num_default)
print(f"# Default: {num_default:>8}\n# Repaid:  {num_no_default:>8}\n% Defaulters: {default_ratio:>5.2f}\n")

# assign variables
X = data.iloc[:,1:-1] # removing ID column
y_true = data['default payment next month']
print(f"Attribute Data:\n\n{X.head()}\n\nLabel Data:\n\n{y_true.head()}")

# Default:     6636
# Repaid:     23364
% Defaulters:  0.22

Attribute Data:

   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0      20000    2          2         1   24      2      2     -1     -1   
1     120000    2          2         2   26     -1      2      0      0   
2      90000    2          2         2   34      0      0      0      0   
3      50000    2          2         1   37      0      0      0      0   
4      50000    1          2         1   57     -1      0     -1      0   

   PAY_5  ...  BILL_AMT3  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  \
0     -2  ...        689          0          0          0         0       689   
1      0  ...       2682       3272       3455       3261         0      1000   
2      0  ...      13559      14331      14948      15549      1518      1500   
3      0  ...      49291      28314      28959      29547      2000      2019   
4      0  ...      35835      20940      19146      19131      200

# Implementing ML Algorithms to Predict Defaults

For my analysis, I take advantage of the scikit-learn library in Python. For all methods, I split my dataset into 80% training data and 20% test data. I chose not to work with a validation set; however, I implement certain methods that utilize built-in cross-validation, which breaks up the training dataset into mini validation sets.
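
To make the cross-validation idea above concrete, here is a minimal sketch of how a 5-fold scheme partitions the 24,000 training rows. It is illustrative only: `k_fold_indices` is a hypothetical helper, it does no shuffling or stratification, and it is not scikit-learn's implementation.

```python
# Illustrative sketch: a plain (unstratified, unshuffled) k-fold partition of
# training row indices. k_fold_indices is a hypothetical helper, not a
# scikit-learn function.
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_train = 24000  # 80% of the 30,000 rows in the dataset
folds = list(k_fold_indices(n_train))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: fit on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every training row lands in exactly one validation fold, so no separate validation set has to be carved out of the data.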

# Dummy Classifier (baseline)

I start by implementing a Dummy Classifier to serve as a baseline method against which to compare other methods. The Dummy Classifier generates predictions uniformly at random, so we anticipate ~50% accuracy. I fixed the seed for the train/test split (`random_state=25`) so the same split is produced at each run; the Dummy Classifier itself is not seeded, so its metrics can shift slightly between runs. As expected, we see ~50% accuracy from the training dataset.

I output a set of metrics for the test dataset, which includes accuracy, precision, recall, and F1-score. Accuracy for the test data is ~50%, as expected. For precision on the default class, we expect `(0.5x0.22)/0.5 = 0.22`, which we see in the data. The numerator is the expected share of true positives: random labeling predicts default for 50% of individuals, and 22% of those are actual defaulters, matching the population proportion. The denominator is the share of predicted positives from our random labeling, which is 50%. Because the classes are imbalanced, macro and weighted averaging give different precision values: the macro average is the unweighted mean of the two per-class precisions (0.50), while the weighted average weights each class by its support (0.65).

For recall on the default class, we expect `(0.5x0.22)/0.22 = 0.5`, which we see in the data. F1-score is `(2*0.22*0.5)/(0.22+0.5) = 0.31`, also reflected in the data.
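
The arithmetic above can be checked with a short sketch. It assumes the random predictions are independent of the true labels and uses the class counts reported earlier; the variable names are illustrative.

```python
# Expected metrics for the default class under uniform random guessing.
# Assumption: predictions are independent of the true labels.
p = 6636 / 30000  # prevalence of defaults, from the counts reported above
q = 0.5           # probability that a uniform dummy predicts "default"

tp = q * p              # predicted default, actually default
fp = q * (1 - p)        # predicted default, actually repaid
fn = (1 - q) * p        # predicted repaid, actually default
tn = (1 - q) * (1 - p)  # predicted repaid, actually repaid

precision = tp / (tp + fp)  # reduces to the prevalence p
recall = tp / (tp + fn)     # reduces to the guess rate q
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp + tn

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Precision collapses to the prevalence and recall to the guess rate, which is why a uniform baseline looks the same on any dataset with this class balance.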

In [408]:
# split data
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X, y_true, test_size=0.2, random_state=25)

# train
baseline = DummyClassifier(strategy='uniform')
baseline.fit(X_train_d,y_train_d)
accuracy_train_d = baseline.score(X_train_d, y_train_d)
print(f"Training Accuracy: {accuracy_train_d:0.4f} \n")

# test
y_pred_d = baseline.predict(X_test_d)
report_d = classification_report(y_test_d.to_numpy(), y_pred_d, digits=4)
print(f"Test Metrics:\n {report_d}")

Training Accuracy: 0.4980 

Test Metrics:
               precision    recall  f1-score   support

           0     0.7755    0.5023    0.6097      4649
           1     0.2258    0.4996    0.3111      1351

    accuracy                         0.5017      6000
   macro avg     0.5007    0.5009    0.4604      6000
weighted avg     0.6517    0.5017    0.5424      6000
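
The macro and weighted averages in the report above can be reproduced from the per-class rows. This is a small sketch using the rounded precision and support values copied from that report, so the last digit may differ slightly.

```python
# Per-class precision and support copied (rounded) from the Dummy Classifier report.
precision = {0: 0.7755, 1: 0.2258}
support = {0: 4649, 1: 1351}

# Macro average: every class counts equally.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(f"macro avg precision:    {macro_avg:.4f}")
print(f"weighted avg precision: {weighted_avg:.4f}")
```

The weighted figure is pulled toward the majority (repaid) class, so it looks better than the macro figure even though the classifier is guessing.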



# Logistic Regression

I next try Logistic Regression to predict defaults. Data is preprocessed by standardizing features to zero mean and unit variance using `StandardScaler()` because some attributes have much larger variance than others. For the logistic regression algorithm, I used an L2 penalty because we have 30000 observations compared to 23 attributes, so I'm not looking to exclude attributes. The L2 penalty is also the regularization step that addresses concerns of over-fitting.

Since hyperparameters can have a large influence on outcomes, I used `LogisticRegressionCV` because it has built-in cross-validation to select the best value for the `C` hyperparameter, which is the inverse of the regularization strength. I felt the default value of 5 was fine for the number of cross-validation folds (i.e., each fold holds out one fifth of the training data for validation), as well as the default value of 10 for the number of `C` values to sample. I output the sampled `C` values and report the best `C` value (2.78) used in the final estimation.
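
When `Cs` is an integer, scikit-learn documents that the candidate values are spaced on a logarithmic scale between 1e-4 and 1e4. The sketch below rebuilds such a grid in plain Python for the default of 10 values; it illustrates the spacing and is not the library's own code.

```python
# Sketch: 10 values evenly spaced in log10 between 1e-4 and 1e4.
n_values = 10
low_exp, high_exp = -4, 4

c_grid = [
    10 ** (low_exp + (high_exp - low_exp) * i / (n_values - 1))
    for i in range(n_values)
]

# Same 3-significant-figure formatting as the notebook's sig_figs helper.
print(", ".join(f"{c:.3g}" for c in c_grid))
```

Neighbouring candidates differ by a factor of about 7.7, so the selected `C` is only located to within that coarse step.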

I then allow the algorithm to train and report a training accuracy of 0.81. Below I then report metrics for the test dataset, where we see similar accuracy (0.81). Precision for the default class is reasonable at 0.74; however, recall (or sensitivity, in binary classification) is poor (0.24). This implies that the current Logistic Regression approach misses roughly three out of four true defaulters, which is undesirable for banks. The model could be further tuned in an attempt to improve performance.
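
To put the recall figure in absolute terms, the sketch below backs out approximate confusion-matrix counts from the rounded precision, recall, and support values in the test report for this model. Because the report rounds to four digits, the counts are a reconstruction rather than the output of `confusion_matrix`.

```python
# Approximate reconstruction from rounded report values for the test set.
support_default, support_repaid = 1351, 4649
precision_default, recall_default = 0.7358, 0.2391

tp = round(recall_default * support_default)     # defaulters correctly flagged
fn = support_default - tp                        # defaulters missed
predicted_default = round(tp / precision_default)
fp = predicted_default - tp                      # repayers wrongly flagged
tn = support_repaid - fp                         # repayers correctly cleared

accuracy = (tp + tn) / (support_default + support_repaid)
print(f"TP={tp} FN={fn} FP={fp} TN={tn} accuracy={accuracy:.4f}")
```

Most of the errors are missed defaulters rather than false alarms, which is the costly direction for a lender.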

In [416]:
# split data
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X, y_true, test_size=0.2, random_state=25)

# preprocess data and train
pipe_l = make_pipeline(StandardScaler(), LogisticRegressionCV(random_state=25)) # default: L2 penalty, Stratified 5-Fold for cross-validation
pipe_l.fit(X_train_l, y_train_l) # fit() returns the fitted pipeline, not scaled data
accuracy_train_l = pipe_l.score(X_train_l, y_train_l)

# check C hyperparameter
grab_log_reg = pipe_l.named_steps['logisticregressioncv']
Cs_values = grab_log_reg.Cs_
coefficients = grab_log_reg.coefs_paths_
scores_C = grab_log_reg.scores_
best_C = grab_log_reg.C_

def sig_figs(x, sigs):
    return "{:.{p}g}".format(x, p=sigs)
C_vals_sampled = ", ".join(sig_figs(x, 3) for x in Cs_values)
print(f"Sampling C: [{C_vals_sampled}]")
print(f"Best C:      {best_C[0]:0.2f}\n")
print(f"Training Accuracy: {accuracy_train_l:0.4f} \n")

# test
y_pred_l = pipe_l.predict(X_test_l)
report_l = classification_report(y_test_l, y_pred_l, digits=4)
print(f"Test Metrics:\n {report_l}")

Sampling C: [0.0001, 0.000774, 0.00599, 0.0464, 0.359, 2.78, 21.5, 167, 1.29e+03, 1e+04]
Best C:      2.78

Training Accuracy: 0.8113 

Test Metrics:
               precision    recall  f1-score   support

           0     0.8151    0.9750    0.8880      4649
           1     0.7358    0.2391    0.3609      1351

    accuracy                         0.8093      6000
   macro avg     0.7755    0.6071    0.6244      6000
weighted avg     0.7973    0.8093    0.7693      6000



# Support Vector Classification (SVC)

Another classic approach is Support Vector Machines (specifically SVC for this problem). Again, I'm interested in sampling multiple hyperparameter values and using cross-validation to identify the best set, where cross-validation is built in to `GridSearchCV`. I'm opting to compare a linear kernel and an rbf kernel because linear is simple and rbf is commonly used. I'm testing different `C` values but opting to use the default for `gamma` (the rbf kernel coefficient) because it adapts to the dataset, i.e., `1 / (n_features * X.var())`. As in my Logistic Regression approach, I am also standardizing the features to zero mean and unit variance.
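
A small sketch of what the default `gamma` works out to after standardization. It uses three columns (LIMIT_BAL, AGE, PAY_0) from the first rows printed earlier as toy data and plain Python rather than scikit-learn, assuming population variance (the convention `StandardScaler` and NumPy's `var` both use by default).

```python
from statistics import fmean, pstdev

# Toy data: LIMIT_BAL, AGE, PAY_0 from the first five rows shown earlier.
rows = [
    [20000.0, 24.0, 2.0],
    [120000.0, 26.0, -1.0],
    [90000.0, 34.0, 0.0],
    [50000.0, 37.0, 0.0],
    [50000.0, 57.0, -1.0],
]
n_features = len(rows[0])

# Standardize each column to zero mean and unit (population) variance.
columns = list(zip(*rows))
means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]
scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Variance over all entries of the scaled matrix, then the 'scale' formula.
flat = [v for row in scaled for v in row]
overall_mean = fmean(flat)
overall_var = fmean([(v - overall_mean) ** 2 for v in flat])
gamma_scale = 1.0 / (n_features * overall_var)

print(f"overall variance = {overall_var:.4f}, gamma = {gamma_scale:.4f}")
```

Once every column has unit variance, the formula reduces to `1 / n_features`, so with the 23 attributes here the default would be roughly 0.043, assuming the scaler is fit on the same rows the SVC sees.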

Waiting for code to run...

In [None]:
# split data
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y_true, test_size=0.2, random_state=25)

# hyperparameter tuning and cross-validation
hyperparam_sampling = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__kernel': ['linear', 'rbf']
}
pipe_s = make_pipeline(StandardScaler(), SVC(random_state=25))
tuning_s = GridSearchCV(pipe_s, hyperparam_sampling) # default: Stratified 5-Fold for cross-validation

# train
tuning_s.fit(X_train_s, y_train_s)
best_hyperparams = tuning_s.best_params_
best_estimator = tuning_s.best_estimator_
print(f"Best parameters: {best_hyperparams}")
print(f"Best estimator: {best_estimator}")

# Test the model
y_pred_s = tuning_s.predict(X_test_s)
print("Test Metrics:")
print(classification_report(y_test_s, y_pred_s, digits=4))

# Random Forest

I went for Random Forest over a single Decision Tree because I assumed I have enough computing power for a Random Forest, which generally performs better than a Decision Tree and is less prone to over-fitting. I chose to explore three hyperparameters (`n_estimators`, `max_depth`, and `min_samples_split`) and again used cross-validation built in to `GridSearchCV`.
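
Since the grid search is the expensive part, it helps to count the model fits before launching it. The sketch below enumerates the same three-parameter grid with `itertools.product`, assuming the `GridSearchCV` defaults of 5 folds and a final refit on the full training set.

```python
from itertools import product

# Same grid as the Random Forest cell in this notebook.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
n_folds = 5  # GridSearchCV default for cv

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per fold, plus one refit of the best candidate
# on the whole training set (refit=True is the default).
n_fits = len(candidates) * n_folds + 1

print(f"{len(candidates)} candidates -> {n_fits} forest fits")
```

Adding a fourth hyperparameter with three values would triple this count, so the grid is worth keeping small.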

Waiting for code to run...

In [None]:
# split data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y_true, test_size=0.2, random_state=25)

# hyperparameter tuning and cross-validation
hyperparam_sampling = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
tuning_r = GridSearchCV(estimator=RandomForestClassifier(random_state=25), param_grid=hyperparam_sampling) # default: Stratified 5-Fold for cross-validation

# train
tuning_r.fit(X_train_r, y_train_r)
best_hyperparams = tuning_r.best_params_
print(best_hyperparams)

# test
y_pred_r = tuning_r.predict(X_test_r)
report_r = classification_report(y_test_r, y_pred_r, digits=4)
print(f"Test Metrics:\n {report_r}")

<h4>For all of these methods, I could go further by exploring collinearity of attributes, dimensionality reduction, etc. Better yet, I could explore additional machine learning algorithms.</h4>