## Python for Credit Card Default Risk: Example

Here I will use random oversampling and SMOTEENN combination sampling together with logistic regression to predict whether someone is likely to default on their credit card in a given month, using demographic information. Like fraud, defaults are more the exception than the norm. Because they're underrepresented in the dataset, it can be useful to oversample defaults and balance the classes. If that all sounds too complicated -- don't worry. It's way simpler than you think!

In [1]:
# basic imports
from pathlib import Path
import pandas as pd
from collections import Counter

In [2]:
# First we'll bring in the data using Path and Pandas
data_path = Path('data/cc_default.csv')
df = pd.read_csv(data_path)

# Here is how our data looks:
display(df.head())
display(df.shape)

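|   | ID | ln_balance_limit | sex | education | marriage | age | default_next_month |
|---|----|------------------|-----|-----------|----------|-----|--------------------|
| 0 | 1 | 9.903488 | 1 | 2 | 0 | 24 | 1 |
| 1 | 2 | 11.695247 | 1 | 2 | 1 | 26 | 1 |
| 2 | 3 | 11.407565 | 1 | 2 | 1 | 34 | 0 |
| 3 | 4 | 10.819778 | 1 | 2 | 0 | 37 | 0 |
| 4 | 5 | 10.819778 | 0 | 2 | 0 | 57 | 0 |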


(30000, 7)

We have 7 columns, although `ID` is just an identifier and `default_next_month` is the target we want to predict, which leaves 5 features. And we have 30,000 rows -- not very large, but good enough for this example.

In [3]:
# Now we need to separate our features from the target we're trying to predict.
feature_cols = [i for i in df.columns if i not in ("ID", "default_next_month")]
X = df[feature_cols]
y = df['default_next_month']

In [4]:
# Let's see what the default rate looks like. I'll make a quick function to do this so we can use it later.
def view_target_pop(target) -> None:
    """Print the count and rate of the first class encountered in the target column, i.e. the y column."""
    # Counter keeps first-seen order, so vals[0] is the count of whichever class appears first in `target`
    # (a default in row 0 of our data). Note that the denominator is hard-coded to the 30000 rows of the full dataset.
    vals = list(Counter(target).values())
    print(f"There were {vals[0]} defaults out of 30000 customers, a rate of {100*(vals[0]/30000):.2f}%")

view_target_pop(y)



There were 6636 defaults out of 30000 customers, a rate of 22.12%


That seems like a high rate! Perhaps we had already identified this set as "at risk". Either way, we still have an **imbalanced class**: there are around 3.5 times as many non-defaults as defaults (23,364 versus 6,636). So we'll try out some cool Python libraries to help us make better predictions.
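The helper above takes the first count out of a `Counter` and divides by a fixed 30,000, which is fine for `y` but not for data of a different size or ordering. Here is a minimal sketch of a more general version that counts the positive class explicitly. The name `summarize_target` and the `positive_label` parameter are my own inventions for illustration, and I'm assuming defaults are labelled 1:

```python
from collections import Counter


def summarize_target(target, positive_label=1):
    """Return (positives, total, rate) for a binary target column."""
    counts = Counter(target)
    total = sum(counts.values())
    positives = counts[positive_label]
    rate = positives / total if total else 0.0
    return positives, total, rate


# Stand-in for y, built from the counts reported above:
# 6,636 defaults among 30,000 customers
toy_target = [1] * 6636 + [0] * 23364
positives, total, rate = summarize_target(toy_target)
print(f"{positives} defaults out of {total} customers ({rate:.2%})")
print(f"non-defaults per default: {(total - positives) / positives:.2f}")
```

Because it only iterates over the values, it should also accept a pandas Series such as `y`.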

In [5]:
# First we're going to use scikit-learn to split our data into training and testing sets (by default, 75% train and 25% test).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


## Random Oversampling

In [6]:
# Next we're going to import the RandomOverSampler from the imblearn library.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

view_target_pop(y_resampled)


There were 17532 defaults out of 30000 customers, a rate of 58.44%


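Nice! Now the training classes are an exact 50-50 split: 17,532 is the size of each class after resampling (35,064 rows in total). The "out of 30000" and the 58.44% in the message come from the fixed denominator in `view_target_pop`, so read the count rather than the rate. Let's try some basic logistic regression and see how we do at predicting defaults.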
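If you're curious what random oversampling does under the hood, the idea is simply to duplicate randomly chosen minority rows until the classes are the same size. The sketch below is a plain-Python illustration of that idea on toy data. It is not imblearn's implementation, and `random_oversample` is a hypothetical name:

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows that belong to this class in the original data
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw with replacement to make up the shortfall (zero for the majority class)
        extra = rng.choices(pool, k=target_size - count)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: 8 non-defaults (0) and 2 defaults (1)
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Nothing new is learned about defaults this way. The minority rows are just repeated, which is why the model can still struggle on unseen data.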

In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

In [8]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)


array([[3744, 2088],
       [ 745,  923]], dtype=int64)

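In my experience, confusion matrices can be just that -- *confusing*! But it's pretty simple. In binary classification (read -- is it a default or not?) we want to be specific about ***how*** we're right and ***how*** we're wrong. Each row is an actual class and each column is a predicted class, so the top-left cell counts correctly predicted non-defaults and the bottom-right cell counts correctly predicted defaults. So let's go a little further.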
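To make the layout concrete, here is a small sketch that unpacks the matrix printed above into its four named cells. It relies on scikit-learn's convention of actual classes in rows and predicted classes in columns, with class 0 first. The variable names are my own:

```python
# Confusion matrix from the output above:
# rows are actual classes, columns are predicted classes
matrix = [[3744, 2088],
          [745, 923]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

print(f"true negatives  (paid, predicted to pay):          {tn}")
print(f"false positives (paid, predicted to default):      {fp}")
print(f"false negatives (defaulted, predicted to pay):     {fn}")
print(f"true positives  (defaulted, predicted to default): {tp}")
print(f"correct predictions: {tn + tp} of {total}")
```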

In [9]:
# I'm going to bring in some more functions from the sklearn and imblearn libraries to help break down our model's performance so far.
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

print(f"overall accuracy: {100*round(balanced_accuracy_score(y_test, y_pred),2)}%")
print("----------------------------------------------------------------------------------")
print(classification_report_imbalanced(y_test, y_pred))

overall accuracy: 60.0%
----------------------------------------------------------------------------------
                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.64      0.55      0.73      0.60      0.36      5832
          1       0.31      0.55      0.64      0.39      0.60      0.35      1668

avg / total       0.72      0.62      0.57      0.65      0.60      0.36      7500



So a balanced accuracy of 60% (the average of the recall on each class, which is what `balanced_accuracy_score` reports) might be good or bad -- depends on the use case. But when we break into the classification report, we can see what the model is good at and where it falls off.
- The model has *high precision* when classifying a borrower as a *non-default* (0). Precision gives us an idea about the **meaningfulness** of a prediction for a given class: for non-defaults it is equal to the number of true negatives (3744) divided by the total number of true and false negatives (3744+745), or about 0.83.
- However, the model has *low precision* when detecting defaults: only 923 of the 3,011 predicted defaults (923+2088) were real, or about 0.31.

Let's see if we can do better with SMOTEENN (SMOTE, the Synthetic Minority Oversampling Technique, followed by Edited Nearest Neighbours cleaning), which is a form of combination sampling.
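The precision figures in the bullets above, along with the recall, F1 and balanced accuracy in the report, can be reproduced by hand from the first confusion matrix. This is a minimal sketch using the standard definitions, and `class_metrics` is a hypothetical helper name:

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for the class treated as 'positive'."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Cells of the random-oversampling confusion matrix
tn, fp, fn, tp = 3744, 2088, 745, 923

# Defaults (1) as the class of interest
pre_1, rec_1, f1_1 = class_metrics(tp=tp, fp=fp, fn=fn)
# Non-defaults (0) as the class of interest: the roles of the cells swap
pre_0, rec_0, f1_0 = class_metrics(tp=tn, fp=fn, fn=fp)

accuracy = (tp + tn) / (tn + fp + fn + tp)
balanced_accuracy = (rec_0 + rec_1) / 2

print(f"class 0: pre={pre_0:.2f} rec={rec_0:.2f} f1={f1_0:.2f}")
print(f"class 1: pre={pre_1:.2f} rec={rec_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f} balanced accuracy={balanced_accuracy:.2f}")
```

Note that plain accuracy (about 0.62) and balanced accuracy (about 0.60) differ because the test set has far more non-defaults than defaults.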

In [10]:
# Like before, we're going to import the SMOTEENN sampler, fit it to the training data, then check the count of each class
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

view_target_pop(y_resampled)

There were 8179 defaults out of 30000 customers, a rate of 27.26%
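For some intuition on the SMOTE half of SMOTEENN: instead of copying minority rows, it creates new ones by picking a point somewhere on the line between a minority row and one of its nearest minority-class neighbours. Below is a simplified sketch of that single step on toy 2-D data. It uses brute-force neighbours, leaves out the ENN cleaning step entirely, and is not how imblearn implements it. `smote_sample` and its parameters are hypothetical names:

```python
import math
import random


def smote_sample(minority, k=2, seed=0):
    """Create one synthetic point between a random minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Brute-force nearest neighbours among the other minority rows
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Pick a random point on the segment between base and neighbour
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy minority class with two numeric features
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
synthetic = smote_sample(minority)
print(synthetic)
```

One thing to keep in mind with our data: interpolating makes sense for numeric features like `age`, but less so for coded categories like `education`.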


In [11]:
# Let's rerun our logistic regression and see if we do any better
model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("----------------------------------------------------------------------------------")
print(f"overall accuracy: {100*round(balanced_accuracy_score(y_test, y_pred),2)}%")
print("----------------------------------------------------------------------------------")
print(classification_report_imbalanced(y_test, y_pred))

[[4686 1146]
 [1094  574]]
----------------------------------------------------------------------------------
overall accuracy: 56.99999999999999%
----------------------------------------------------------------------------------
                   pre       rec       spe        f1       geo       iba       sup

          0       0.81      0.80      0.34      0.81      0.53      0.29      5832
          1       0.33      0.34      0.80      0.34      0.53      0.26      1668

avg / total       0.70      0.70      0.45      0.70      0.53      0.28      7500



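So we declined in balanced accuracy (57% versus 60%). However, we can now better predict non-defaults (F1 score of 0.81 compared to 0.73), and we know that when our model predicts a customer will default, there is roughly a 33% chance that they actually do (precision of 0.33, compared to 0.31 before). The price is recall on defaults, which fell from 0.55 to 0.34.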

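Next I'll do some more advanced modeling and see if we can do a better job at predicting defaults. Here I'll use a balanced random forest and compare the results to logistic regression. This classifier under-samples the majority class for each tree on its own, so we fit it on the original training data rather than the resampled set.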
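Roughly speaking, a balanced random forest handles the imbalance inside the ensemble: each tree sees a bootstrap sample in which the majority class has been cut down to the size of the minority class. Here is a plain-Python sketch of drawing one such sample on toy labels. It only illustrates the idea and is not imblearn's actual code, and `balanced_bootstrap` is a hypothetical name:

```python
import random
from collections import Counter


def balanced_bootstrap(labels, seed=1):
    """Return row indices for one bootstrap sample with equal class counts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class contributes as many rows as the smallest class has
    n_per_class = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        # Drawn with replacement, as in an ordinary bootstrap
        sample.extend(rng.choices(idxs, k=n_per_class))
    return sample


# Toy labels: 90 non-defaults (0) and 10 defaults (1)
labels = [0] * 90 + [1] * 10
sample = balanced_bootstrap(labels)
print(Counter(labels[i] for i in sample))
```

With many trees, each drawing a different subset of the majority class, most of the non-default rows still get used somewhere in the forest.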

In [12]:
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=1000, random_state=1)
brf.fit(X_train, y_train)

BalancedRandomForestClassifier(n_estimators=1000, random_state=1)

In [13]:
# Now I'll print out the imbalanced classification report for our balanced random forest classifier
y_pred_rf = brf.predict(X_test)
print(confusion_matrix(y_test, y_pred_rf))
print("----------------------------------------------------------------------------------")
print(f"overall accuracy: {100*round(balanced_accuracy_score(y_test, y_pred_rf),2)}%")
print("----------------------------------------------------------------------------------")
print(classification_report_imbalanced(y_test, y_pred_rf))

[[3130 2702]
 [ 683  985]]
----------------------------------------------------------------------------------
overall accuracy: 56.00000000000001%
----------------------------------------------------------------------------------
                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.54      0.59      0.65      0.56      0.32      5832
          1       0.27      0.59      0.54      0.37      0.56      0.32      1668

avg / total       0.70      0.55      0.58      0.59      0.56      0.32      7500



So with the balanced forest, we get the best **recall** when detecting defaults: 0.59, compared to 0.55 with random oversampling and 0.34 with SMOTEENN. In medical science, recall is sometimes called **sensitivity**, because it measures how many of the actual positives our model captures. So how would you decide which model to use? When the cost of a *false negative* is higher than the cost of a *false positive*, the model with better **recall** is usually the better choice.

From a business perspective, a false positive means we label someone as a default when they, in fact, make their payment. Perhaps this leads to over-hedging our portfolio, which could be costly. However, a false negative seems worse on the face of it. If we label someone as a non-default and they actually default, then we're susceptible to unforeseen risk.
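One way to make that trade-off concrete is to put a price on each kind of mistake and total it up for every model. The sketch below does this with the false positive and false negative counts from the three confusion matrices above. The cost ratios are made up purely for illustration, and `total_cost` and `cheapest` are hypothetical helper names:

```python
# (false positives, false negatives) from the three test-set confusion matrices above
models = {
    "logistic + random oversampling": (2088, 745),
    "logistic + SMOTEENN": (1146, 1094),
    "balanced random forest": (2702, 683),
}


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes under the assumed unit costs."""
    return fp * cost_fp + fn * cost_fn


def cheapest(cost_fp, cost_fn):
    """Name of the model with the lowest total cost."""
    return min(models, key=lambda name: total_cost(*models[name], cost_fp, cost_fn))


# Hypothetical scenarios: a missed default costs 5x or 10x a false alarm
for ratio in (5, 10):
    print(f"false negative costs {ratio}x: {cheapest(cost_fp=1, cost_fn=ratio)}")
```

With these made-up numbers, the oversampled logistic regression is cheapest at 5:1, and the balanced forest only edges ahead at 10:1 (9,532 versus 9,538), so the choice really hinges on how lopsided the real costs are.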

Hopefully this was a helpful breakdown of some basic statistical packages in Python that lenders can use to build predictive models!

-- Created by Laramie Dunlap on 9/21/2022