In [1]:
import pandas as pd
import numpy as np
from main import preprocess_df
from collections import defaultdict as dd


my_df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(my_df)

## Random Forest Classification for Predicting Loan Default Probability

A Random Forest Classifier is another model that can predict the probability of a borrower defaulting on a loan. Instead of fitting a single equation by minimising a loss function, as logistic regression does, a random forest trains many decision trees (100 in this notebook), each on a bootstrap sample of the training data. The trees are also restricted in which features they can use: at each split, a tree only considers a random subset of the features, so a dominant feature such as FICO score is sometimes unavailable and the tree is forced to find less obvious patterns in debt or income, for example. This decorrelates the trees and gives the forest a broader view of the data than any single tree has. At prediction time, every tree is shown the same borrower, each follows its own learned decision path, and the forest averages their answers. The predicted probability of default is that average; in scikit-learn it is the mean of the trees' predicted class probabilities rather than a simple count of yes votes.
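
To make the averaging step concrete, here is a minimal sketch on synthetic data. `make_classification` and the names `X_demo`, `y_demo`, and `forest` are stand-ins chosen for illustration, not the loan dataset or the model trained below. It checks that averaging the individual trees' probabilities reproduces what scikit-learn's `predict_proba` returns for the forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X_demo, y_demo)

# Probability of the positive class according to each individual tree
per_tree = np.array(
    [tree.predict_proba(X_demo)[:, 1] for tree in forest.estimators_]
)

# Averaging across trees should match the forest's own predict_proba
manual_probs = per_tree.mean(axis=0)
forest_probs = forest.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_probs, forest_probs))
```

Because the trees here are depth-limited, their leaves are usually not pure, so each tree contributes a fractional probability rather than a hard yes or no.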


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.impute import SimpleImputer

X = cleaned_df.drop(columns="did_default")
# sklearn expects the target as a 1-D array rather than a 1-column DataFrame
y = cleaned_df["did_default"].to_numpy()


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

imputer = SimpleImputer(strategy="mean")
imputed_X_train = imputer.fit_transform(X_train)
imputed_X_test = imputer.transform(X_test)
# no need to do scaling of X because random forest doesn't care about scale, it just splits
# like 'is income > 50,000' without caring how big 50,000 is
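
The scale-invariance claim can be checked on a toy example. The sketch below is an illustration with made-up numbers (the arrays `income_k` and `defaulted` are hypothetical, not taken from the loan data): it fits a one-split decision tree on incomes in thousands and again on the same incomes in dollars, then compares predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy incomes in thousands of dollars with made-up default labels
income_k = np.array([[20.0], [30.0], [60.0], [80.0]])
defaulted = np.array([1, 1, 0, 0])

# The same incomes expressed in dollars: a pure rescaling
income_usd = income_k * 1000

stump_k = DecisionTreeClassifier(max_depth=1, random_state=0).fit(income_k, defaulted)
stump_usd = DecisionTreeClassifier(max_depth=1, random_state=0).fit(income_usd, defaulted)

# New borrowers, given in each unit system
new_k = np.array([[25.0], [40.0], [50.0], [70.0]])
preds_k = stump_k.predict(new_k)
preds_usd = stump_usd.predict(new_k * 1000)
print(preds_k, preds_usd)
```

Both stumps split the borrowers into the same two groups, because a split only depends on the ordering of the values, not on their magnitude. On real floating-point data tiny rounding differences are possible, so treat this as an illustration rather than a guarantee.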

In [None]:
weights = [None, "balanced"]
for w in weights:
    rfc_model = RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=42, class_weight=w
    )
    rfc_model.fit(imputed_X_train, y_train)
    rfc_probs = rfc_model.predict_proba(imputed_X_test)[:, 1]
    rfc_preds = rfc_model.predict(imputed_X_test)
    print(f"With class weights: {w}")
    cr = classification_report(y_test, rfc_preds)
    print(cr)
    roc_auc = roc_auc_score(y_test, rfc_probs)
    print(f"ROC AUC score: {round(roc_auc, 3)}")

With class weights: None
              precision    recall  f1-score   support

       False       0.81      0.98      0.89     14028
        True       0.58      0.10      0.18      3551

    accuracy                           0.80     17579
   macro avg       0.70      0.54      0.53     17579
weighted avg       0.77      0.80      0.74     17579

ROC AUC score: 0.733
With class weights: balanced
              precision    recall  f1-score   support

       False       0.88      0.70      0.78     14028
        True       0.35      0.63      0.45      3551

    accuracy                           0.69     17579
   macro avg       0.62      0.67      0.62     17579
weighted avg       0.78      0.69      0.72     17579

ROC AUC score: 0.731


## Results

Like with logistic regression, the unweighted model learned that because defaulters are a small minority (3,551 of 17,579 test loans, about 20%), it can predict 'no default' for almost everyone and still score a high accuracy. By the model's own measure of minimising error this is a good outcome, but in the real world it is catastrophic: most defaulters would be approved and a lot of money would be lost.
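
As a quick sanity check on that point, the arithmetic below uses only the class counts from the support column of the classification reports. It computes the accuracy of a hypothetical "model" that approves everyone, which is the baseline any real model has to beat.

```python
# Test-set class counts, taken from the support column of the reports
n_non_default = 14028
n_default = 3551

# Accuracy of a rule that predicts "no default" for every applicant
baseline_accuracy = n_non_default / (n_non_default + n_default)

# Its recall on defaulters is zero, because it never flags anyone
baseline_recall_default = 0 / n_default

print(f"Always-'no default' accuracy: {baseline_accuracy:.3f}")
print(f"Recall on defaulters: {baseline_recall_default:.2f}")
```

The baseline comes out at roughly 0.80, so the unweighted forest's 0.80 accuracy is barely better than not modelling at all.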

#### Without class weighting

```
Classification report
              precision    recall  f1-score   support

       False       0.81      0.98      0.89     14028
        True       0.58      0.10      0.18      3551

    accuracy                           0.80     17579
   macro avg       0.70      0.54      0.53     17579
weighted avg       0.77      0.80      0.74     17579

ROC AUC score: 0.733
```

Just as before, the model scores well on non-defaulters (recall 0.98) but does terribly on defaulters: a recall of 0.10 means it catches only about one in ten of them.
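
To put those rates in terms of people, the sketch below backs out approximate counts from the report. The helper `approx_counts` is a hypothetical name introduced here, and since the report's metrics are rounded to two decimals the results are only rough.

```python
def approx_counts(precision, recall, n_positive):
    """Rough confusion counts for the positive class from rounded metrics."""
    true_pos = recall * n_positive       # defaulters correctly flagged
    false_neg = n_positive - true_pos    # defaulters approved by mistake
    flagged = true_pos / precision       # everyone the model flagged
    false_pos = flagged - true_pos       # good borrowers wrongly flagged
    return round(true_pos), round(false_neg), round(false_pos)


# Metrics for the True (default) class from the unweighted report
tp, fn, fp = approx_counts(precision=0.58, recall=0.10, n_positive=3551)
print(f"caught: {tp}, missed: {fn}, wrongly flagged: {fp}")
```

Under these rounded figures, the unweighted model catches about 355 of the 3,551 defaulters and lets roughly 3,200 through.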

#### With balanced class weighting

```
Classification report
              precision    recall  f1-score   support

       False       0.88      0.70      0.78     14028
        True       0.35      0.63      0.45      3551

    accuracy                           0.69     17579
   macro avg       0.62      0.67      0.62     17579
weighted avg       0.78      0.69      0.72     17579

ROC AUC score: 0.731
```

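With balanced class weights, mistakes on defaulters cost the model more, so it puts much more emphasis on catching them. The result is a more conservative model that rejects many more applicants, some of whom would have repaid in full, as shown by the drop in precision on defaulters (0.58 to 0.35) and in overall accuracy (0.80 to 0.69). In exchange, recall on defaulters jumps from a dreadful 0.10 to 0.63, meaning far fewer defaulters are approved. The ROC AUC barely moves (0.733 to 0.731), which suggests the weighting mainly shifts where the model draws the line between approve and reject rather than improving how well it ranks borrowers by risk.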
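
For reference, scikit-learn's `"balanced"` option weights each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. The sketch below computes those weights for an illustrative label array, `y_demo`, that copies the test-set class split. That is an assumption for illustration: the model above is weighted on `y_train`, whose exact counts are not shown.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with the same class split as the test-set support column
# (0 = no default, 1 = default); the real weights come from y_train
y_demo = np.array([0] * 14028 + [1] * 3551)

classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights from the formula: n_samples / (n_classes * class_count)
counts = np.bincount(y_demo)
manual = len(y_demo) / (len(classes) * counts)

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
print(np.allclose(weights, manual))
```

With this split, an error on a defaulter counts about four times as much as an error on a non-defaulter, which is why the weighted model is so much more willing to reject.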
