In [1]:
import joblib
import os
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [2]:
# Run once per kernel: move to the parent directory so that `_aux` is importable.
try:
    _ = first_run
except NameError:
    first_run = True
    os.chdir(os.path.dirname(os.getcwd()))
    from _aux import ml

# Load Data

In [3]:
X_train, y_train = joblib.load(
    "../data/train/preprocessed/train_features_labels.joblib.gz"
)

X_validation, y_validation = joblib.load(
    "../data/train/preprocessed/validation_features_labels.joblib.gz"
)

# Hold your SMOTE for a moment

SMOTE has become a ubiquitous way to handle imbalanced classes by oversampling the minority class. However, because many of our features have low variance due to a large share of zero values, generating artificial samples from them can be counterproductive. We will therefore experiment with both SMOTE and a custom undersampler that tries to capture most of the variance of the majority class. Whichever strategy yields better results for our baseline model is the one we will move forward with.
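To illustrate the concern, here is a minimal sketch of the interpolation step at the heart of SMOTE. It is not the `imbalanced-learn` implementation: the helper `smote_like_sample` and the two toy samples are hypothetical and exist only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def smote_like_sample(x, neighbor, rng):
    """Hypothetical helper: interpolate between a sample and one of its neighbours."""
    gap = rng.uniform(0.0, 1.0)  # random position on the segment between the two points
    return x + gap * (neighbor - x)


# Two made-up minority samples with zero-heavy features
a = np.array([0.0, 0.0, 3.0, 0.0])
b = np.array([0.0, 2.0, 0.0, 0.0])

synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

Features that are zero in both samples stay zero in the synthetic one, so they add no new information, while features that are zero in only one of them take in-between values that may never occur in the real data.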

### 1. Custom undersampler

In [4]:
(
    X_train_maj,
    y_train_maj,
    sample_variance,
    sample_variance_zscore,
    is_significant,
) = ml.BinaryUndersampler(n_iterations=1_000).fit(X_train, y_train)
print(sample_variance, sample_variance_zscore, is_significant)

0.5131495516149959 5.691116837902434 True
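The source of `ml.BinaryUndersampler` is not shown in this notebook. As a rough idea of what a variance-seeking undersampler could look like, the sketch below draws many random majority subsamples, keeps the one with the highest mean per-feature variance, and scores it against the distribution of all draws. The function name, the `z_threshold` parameter, the variance summary and the toy data are all assumptions, not the actual implementation.

```python
import numpy as np


def variance_seeking_undersample(X, y, n_iterations=1_000, z_threshold=1.96, seed=0):
    """Hypothetical sketch: keep the majority subsample with the highest variance."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == 0)
    n_minority = int(np.sum(y == 1))

    best_idx, best_var, variances = None, -np.inf, []
    for _ in range(n_iterations):
        # Random majority subsample of the same size as the minority class
        idx = rng.choice(majority_idx, size=n_minority, replace=False)
        # Mean per-feature variance as a single summary of spread (an assumption)
        var = X[idx].var(axis=0).mean()
        variances.append(var)
        if var > best_var:
            best_idx, best_var = idx, var

    variances = np.asarray(variances)
    std = variances.std()
    # Z-score of the best draw relative to all draws
    zscore = (best_var - variances.mean()) / std if std > 0 else 0.0
    return X[best_idx], y[best_idx], best_var, zscore, zscore > z_threshold


# Toy data: 200 majority rows, 20 minority rows, 5 zero-heavy features
rng = np.random.default_rng(42)
X_toy = rng.poisson(0.3, size=(220, 5)).astype(float)
y_toy = np.r_[np.zeros(200), np.ones(20)]

X_maj, y_maj, var, z, significant = variance_seeking_undersample(
    X_toy, y_toy, n_iterations=200
)
print(X_maj.shape, round(float(var), 3), round(float(z), 2), bool(significant))
```

The five return values mirror the tuple unpacked in the cell above, but the numbers this sketch produces are not comparable to the notebook's output.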


In [5]:
# Create new training set
X_train_undersample = np.concatenate((X_train_maj, X_train[y_train == 1]))
y_train_undersample = np.concatenate((y_train_maj, y_train[y_train == 1]))

# Save the new training set
joblib.dump(
    [X_train_undersample, y_train_undersample],
    "../data/train/preprocessed/undersampled_train_features_labels.joblib.gz",
)

# Check class balance
pd.Series(y_train_undersample).value_counts()

1.0    816
0.0    811
dtype: int64

As we can see, the classes are almost evenly balanced. Hopefully, our strategy will improve the baseline performance, since the new sample captures an unusually high amount of variance compared to the bootstrap results (a z-score of about 5.7). Let the drums roll...

In [6]:
baseline = RandomForestClassifier().fit(X_train_undersample, y_train_undersample)

In [7]:
predictions = baseline.predict_proba(X_validation)

threshold_perf = pd.DataFrame(
    [
        (
            threshold,
            *confusion_matrix(
                y_validation, (predictions[:, 1] > threshold).astype(int)
            ).ravel(),
        )
        for threshold in np.arange(0.05, 1, 0.05)
    ],
    columns=["threshold", "tn", "fp", "fn", "tp"],
).assign(
    precision=lambda df: df["tp"] / (df["tp"] + df["fp"]),
    recall=lambda df: df["tp"] / (df["tp"] + df["fn"]),
    f1=lambda df: 2
    * (df["precision"] * df["recall"])
    / (df["precision"] + df["recall"]),
)

threshold_perf.to_csv("../ml_artifacts/baseline2_model_performance.csv", index=False)

threshold_perf.query("threshold > .5")

Unnamed: 0,threshold,tn,fp,fn,tp,precision,recall,f1
10,0.55,10962,3220,55,159,0.047055,0.742991,0.088505
11,0.6,11191,2991,57,157,0.049873,0.733645,0.093397
12,0.65,11812,2370,73,141,0.056153,0.658879,0.103486
13,0.7,12431,1751,82,132,0.070101,0.616822,0.125894
14,0.75,12657,1525,93,121,0.073512,0.565421,0.130108
15,0.8,13032,1150,109,105,0.083665,0.490654,0.142954
16,0.85,13311,871,116,98,0.101135,0.457944,0.16568
17,0.9,13540,642,134,80,0.110803,0.373832,0.17094
18,0.95,13725,457,146,68,0.129524,0.317757,0.184032


Honestly, these results are better than we had expected. As we raise the threshold, we can see the precision-recall trade-off take place. Note, however, that the F1 score improves continually, which is a symptom of the fact that the trade-off is not perfectly even in this case, as it rarely is.
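To make the trade-off concrete, the metrics can be recomputed by hand from the confusion counts in the first and last rows of the table above. The helper name `precision_recall_f1` is only for illustration; the formulas are the same ones used in the `assign` call.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the table above
low = precision_recall_f1(tp=159, fp=3220, fn=55)   # threshold 0.55
high = precision_recall_f1(tp=68, fp=457, fn=146)   # threshold 0.95

for name, (p, r, f1) in {"0.55": low, "0.95": high}.items():
    print(f"threshold {name}: precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
```

Between these two thresholds precision grows by a factor of about 2.75 while recall shrinks by a factor of about 2.3. Since F1 is a harmonic mean, it is pulled towards the smaller of the two values, which here is precision, so F1 rises with it.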

Make no mistake, these are not good prediction results by any stretch of the imagination. Nevertheless, they do suggest that our strategy is successful: baseline performance improved significantly with no change to the model, only to the data. Let's compare them to the previous baseline.

In [8]:
pd.read_csv("../ml_artifacts/baseline_model_performance.csv").query("threshold > .5")

Unnamed: 0,threshold,tn,fp,fn,tp,precision,recall,f1
10,0.55,14144,38,200,14,0.269231,0.065421,0.105263
11,0.6,14157,25,201,13,0.342105,0.060748,0.103175
12,0.65,14166,16,205,9,0.36,0.042056,0.075314
13,0.7,14173,9,207,7,0.4375,0.03271,0.06087
14,0.75,14177,5,208,6,0.545455,0.028037,0.053333
15,0.8,14180,2,209,5,0.714286,0.023364,0.045249
16,0.85,14181,1,209,5,0.833333,0.023364,0.045455
17,0.9,14181,1,210,4,0.8,0.018692,0.03653


Contrary to what we observed above, the F1 score drops as we move up the threshold, which is a symptom of how the model behaves. The model flags very few cases, which is good from the perspective of customer experience but comes at the cost of losing too much money for the company. In fact, such a model does not even justify the cost of developing and maintaining it.
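How little the previous baseline flags can be quantified from the two tables. The sketch below computes the share of validation samples flagged as positive at threshold 0.55 for both models; `flag_rate` is a hypothetical helper, and the counts are copied from the outputs above.

```python
def flag_rate(tn, fp, fn, tp):
    """Share of all samples that the model flags as positive."""
    return (tp + fp) / (tn + fp + fn + tp)


# Threshold 0.55 rows from the two tables above
previous = flag_rate(tn=14144, fp=38, fn=200, tp=14)
undersampled = flag_rate(tn=10962, fp=3220, fn=55, tp=159)

print(f"previous baseline:     {previous:.2%}")
print(f"undersampled baseline: {undersampled:.2%}")
```

The previous baseline flags well under 1% of the validation set, while the undersampled baseline flags close to a quarter of it, which is the other side of its much higher recall.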

Because the undersampling strategy gives good results, and given our time constraints, we choose not to explore how SMOTE would perform at this time. Instead, we will allocate more time to hyperparameter tuning and model selection next.