# Dealing with imbalanced datasets

This notebook will tackle the problem known as classification with imbalanced classes. We will first introduce the problem and emphasize the difficulties related to both training and evaluating a predictive model under these circumstances.

Let's start by fetching a dataset from OpenML.

In [None]:
from sklearn.datasets import fetch_openml

dataset = fetch_openml(data_id=42397, as_frame=True)
X, y = dataset.data, dataset.target

The ID does not tell us much about this dataset. Let's look at the description provided by OpenML.

In [None]:
print(dataset.DESCR)

We now have a bit more information. Three points are important: (i) the dataset is a classification problem aimed at detecting credit card fraud; (ii) it is supposedly highly imbalanced; (iii) the features are numerical and result from a principal component analysis (PCA) decomposition. We do not know how many original features there were; we only know that the features `V**` are linear combinations of them. Such processing hides the original data while still letting us work with a surrogate.

Let's have a first look at the dataset then.

In [None]:
X.head()

In addition to the PCA features, we have two other features: `"Time"` and `"Amount"`, corresponding to the relative time of the transaction and the amount of the transaction, respectively. We can also check the size of the dataset:

In [None]:
X.shape

So we have almost 300,000 available samples. Let's now have a look at our target.

In [None]:
y.head()

The target is binary: `True` indicates that the transaction was a fraud while `False` indicates that it was legitimate.
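
One quick way to quantify the imbalance is to count the classes. The snippet below is a minimal sketch on a toy target (not the OpenML data); `class_balance` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd


def class_balance(target):
    """Return the count and the proportion of each class in `target`."""
    counts = target.value_counts()
    return pd.DataFrame(
        {"count": counts, "proportion": counts / counts.sum()}
    )


# Toy target for illustration only: 2 frauds out of 100 transactions.
toy_target = pd.Series(["False"] * 98 + ["True"] * 2, dtype="category")
balance = class_balance(toy_target)
print(balance)
```

In the notebook, the same helper could be called as `class_balance(y)`.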

## Problem definition

Before training a powerful predictive model, it is always good to start with a baseline. Earlier in this course, we presented two approaches to get baselines that we expect any predictive model to beat.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Create a dummy predictor that will always predict the most frequent class of the training set.</li>
        <li>Use cross-validation to get an estimate of the test score of such dummy baseline.</li>
        <li>Use the accuracy score as an evaluation metric.</li>
    </ul>
    What can you say about the statistical performance of the model?
</div>

In [None]:
# %load solutions/solution_01.py

It looks wonderful. We have a model that is highly accurate. Too accurate to be true. It might be a good idea to have a look at the confusion matrix.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Split the original data to get a training and a testing set.</li>
        <li>Train the previous dummy classifier on the training data.</li>
        <li>Plot the confusion matrix using <tt>ConfusionMatrixDisplay.from_estimator</tt>.</li>
    </ul>
    What can you conclude?
</div>

In [None]:
# %load solutions/solution_02.py

In [None]:
# %load solutions/solution_03.py

This is quite logical. We force our estimator to always predict that there is no fraud. However, since legitimate transactions vastly outnumber fraudulent ones, the accuracy score does not answer the question "How good is my predictive model at detecting credit card fraud?".

Indeed, since our "positive" outcome is detecting fraud, we should use metrics that focus on the fraud outcomes.

## Metrics to use in imbalanced classification setting

Before looking at the impact of imbalanced classes on the model, we can first define the metrics that we should use in this setting. It will help us compare models later.

As mentioned earlier, we should use metrics that focus on the "positive" outcome, or that at least give the minority class as much importance as the majority class. Thus, looking at the metrics derived from the confusion matrix, we could be interested in the following:

- recall (also called sensitivity)
- precision
- average precision (area under the curve of the precision-recall curve)
- balanced accuracy
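
As a reminder of how these metrics relate to the confusion matrix, here is a minimal sketch computing them from the four counts. The function name `confusion_metrics` and the toy counts are assumptions made for illustration. Average precision is left out because it needs the predicted scores, not a single confusion matrix.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute metrics from the four confusion-matrix counts."""
    # Share of the frauds that are detected (sensitivity).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Share of the raised alerts that are actual frauds.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall of the negative class (legitimate transactions).
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Toy counts: 1,000 transactions, 10 frauds, 6 detected, 4 false alerts.
metrics = confusion_metrics(tp=6, fp=4, tn=986, fn=4)
print(metrics)
```

On these toy counts the accuracy stays above 99% even though only 60% of the frauds are caught, which is why the other metrics are more informative here.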

Let's see what these metrics would have told us about the statistical performance of our dummy predictive model.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Repeat the cross-validation evaluation by passing the above scores as metrics to be evaluated.</li>
    </ul>
</div>

In [None]:
# %load solutions/solution_04.py

We observe that all scores reflect that our model is not good at detecting credit card fraud. Now that we have a set of sensible metrics, we can go ahead and look at the impact of training a model on an imbalanced dataset.
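
The same behaviour can be reproduced on a small synthetic problem. The sketch below uses toy data (95 negatives, 5 positives) and, for brevity, scores the dummy model on the data it was fitted on rather than by cross-validation.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
)

# Toy data for illustration only: the features are pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 95 + [1] * 5)

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)
# Constant score for the positive class.
y_score = dummy.predict_proba(X_toy)[:, 1]

toy_scores = {
    "accuracy": accuracy_score(y_toy, y_pred),
    "recall": recall_score(y_toy, y_pred),
    "precision": precision_score(y_toy, y_pred, zero_division=0),
    "balanced_accuracy": balanced_accuracy_score(y_toy, y_pred),
    "average_precision": average_precision_score(y_toy, y_score),
}
print(toy_scores)
```

The accuracy equals the share of the majority class, the recall and precision are zero, the balanced accuracy is at chance level, and the average precision falls back to the share of positive samples.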

## Impact of imbalanced classes on the training process

In the remainder of this notebook, we will look at the impact of imbalanced classes on the training process and compare a couple of strategies that alleviate this issue.

In [None]:
from collections import defaultdict

index = []
scores = defaultdict(list)

In [None]:
def update_scores(scores, cv_results):
    for key in cv_results:
        prefix = "test_"
        if prefix in key:
            scores[key.replace(prefix, "").replace("_", " ").capitalize()].append(
                cv_results[key].mean()
            )
    return scores
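
To see what this helper does, here is a small self-contained sketch. It copies `update_scores` and feeds it a hand-written dictionary that mimics the output of `cross_validate`; the numbers are made up for illustration.

```python
from collections import defaultdict

import numpy as np


def update_scores(scores, cv_results):
    # Same logic as the helper above: keep only the test scores, make the
    # metric name human readable, and store the mean over the folds.
    for key in cv_results:
        prefix = "test_"
        if prefix in key:
            name = key.replace(prefix, "").replace("_", " ").capitalize()
            scores[name].append(cv_results[key].mean())
    return scores


# Made-up results for two folds, shaped like the output of cross_validate.
fake_cv_results = {
    "fit_time": np.array([0.1, 0.2]),
    "score_time": np.array([0.01, 0.02]),
    "test_balanced_accuracy": np.array([0.5, 0.7]),
    "test_recall": np.array([0.0, 0.2]),
}
toy_scores = update_scores(defaultdict(list), fake_cv_results)
print(dict(toy_scores))
```

Timing entries are ignored, and each call appends one value per metric, so calling the helper once per model builds one row per model in the final dataframe.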

### Dummy baseline

Let's store the results of the dummy baseline that we used previously.

In [None]:
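from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# `scoring` is assumed to be the collection of metrics defined in the
# previous exercise.
classifier = DummyClassifier(strategy="most_frequent")
cv_results = cross_validate(classifier, X, y, scoring=scoring, n_jobs=-1)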

In [None]:
import pandas as pd

index.append(classifier.__class__.__name__)
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

### Linear classifier baseline

Now, we will use a linear classifier, `LogisticRegression`, with its default parameters (apart from `max_iter`). It will serve as a baseline to compare with future linear predictive models.
As already presented, we will normalize the features using a `StandardScaler`, which is good practice.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

classifier = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
)
cv_results = cross_validate(classifier, X, y, scoring=scoring, n_jobs=-1)

In [None]:
index.append(classifier[-1].__class__.__name__)
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

We observe that our model is learning something: it is much better than the baseline model. However, we will see that it is still affected by the class imbalance and that it can do even better.

Now, we will also train a `RandomForestClassifier` in order to have a powerful tree-based model for later comparison as well. `RandomForestClassifier` will not require any preprocessing since we are only dealing with numerical features.

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_jobs=-1)
cv_results = cross_validate(classifier, X, y, scoring=scoring)

In [None]:
index.append(classifier.__class__.__name__)
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

We observe that `RandomForestClassifier` is also learning something from data. The average precision is lower than for the linear model but the recall is higher.

However, we will see that both models can do better in terms of these metrics by tweaking the training procedure.

### Introduction of `sample_weight`

When we presented boosting algorithms, we used `sample_weight` to tweak the training procedure.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    Come up with a strategy that uses <tt>sample_weight</tt> at <tt>fit</tt> in order to alleviate the issue of class imbalance. Feel free to use a single split instead of cross-validation to evaluate your approach at first.
</div>

In [None]:
# %load solutions/solution_05.py

In [None]:
# %load solutions/solution_06.py

In [None]:
# %load solutions/solution_07.py

In [None]:
# %load solutions/solution_08.py

While `sample_weight` provides the flexibility to set any weight for a given sample, some scikit-learn estimators also provide a `class_weight` parameter that implements strategies to reweight the samples of the different classes.

### Use `class_weight` instead of `sample_weight`

Many classifiers in `scikit-learn` have a `class_weight` parameter. This
parameter affects the computation of the loss in linear models or the
criterion in tree-based models so that a misclassification is penalized
differently for the minority and the majority class. We can set
`class_weight="balanced"` such that the weight applied is inversely
proportional to the class frequency. We test this setting with both a
linear model and a tree-based model.
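
To make "inversely proportional to the class frequency" concrete, the sketch below computes the balanced weights on a toy target with `compute_class_weight` and compares them with the formula `n_samples / (n_classes * np.bincount(y))`. The toy target (90 negatives, 10 positives) is an assumption made for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target for illustration only: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# Weights that scikit-learn associates with class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_toy
)

# The same quantity written out by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The minority class receives a weight nine times larger than the majority class, which matches the 90/10 class ratio of the toy target.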



In [None]:
classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
cv_results = cross_validate(
    classifier, X, y, scoring=scoring, n_jobs=-1,
)

In [None]:
index.append(f"{classifier[-1].__class__.__name__} with balanced class weight")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

We see that this parameter has an impact on the overall performance. Looking at the precision and recall, we observe that our model becomes sensitive (it detects most of the fraud) at the cost of false detection. Since the balanced accuracy is an average of the recall of each class, the metric is still high.

In [None]:
classifier = RandomForestClassifier(class_weight="balanced")
cv_results = cross_validate(
    classifier, X, y, scoring=scoring, n_jobs=-1,
)

In [None]:
index.append(f"{classifier.__class__.__name__} with balanced class weight")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

We also see an impact of setting this parameter with `RandomForestClassifier`. With this model, the weights increase the sensitivity of the model (i.e. increased recall) without any trade-off on the precision.

One possible intuition for this result, and for the difference with `LogisticRegression`, is that the random forest is a non-linear model.

### Resampling instead of passing weights

We saw that the semantics of `sample_weight` are the following: a weight of 0 means that the sample is not considered, while a weight of 2 is equivalent to having the sample twice in the dataset.
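
This equivalence can be checked numerically. The sketch below, on synthetic data, fits one logistic regression with a weight of 2 on the positive samples and another one on a dataset where those samples are duplicated; the tight `tol` is an assumption chosen so that both fits converge closely enough to be compared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

# Option 1: give a weight of 2 to every positive sample.
weights = np.where(y_toy == 1, 2.0, 1.0)
weighted = LogisticRegression(tol=1e-8, max_iter=10_000)
weighted.fit(X_toy, y_toy, sample_weight=weights)

# Option 2: physically duplicate every positive sample.
X_dup = np.vstack([X_toy, X_toy[y_toy == 1]])
y_dup = np.concatenate([y_toy, y_toy[y_toy == 1]])
duplicated = LogisticRegression(tol=1e-8, max_iter=10_000)
duplicated.fit(X_dup, y_dup)

# Largest difference between the two sets of coefficients.
max_gap = np.abs(weighted.coef_ - duplicated.coef_).max()
print(max_gap)
```

Both models minimize the same objective, so the coefficients should agree up to the solver tolerance.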

When a model supports neither `sample_weight` nor `class_weight`, another library called `imbalanced-learn` allows us to use an arbitrary resampling strategy in a pipeline. We will use these strategies to show that they are pretty much equivalent to `sample_weight` or `class_weight`. In addition, they let us specify a given balancing ratio, which can even be found by grid-search.
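
To illustrate what a balancing ratio means, here is a minimal NumPy sketch of random under-sampling. It is not the `imbalanced-learn` implementation; `random_under_sample` and its arguments are hypothetical names, and the ratio is taken to be the number of minority samples divided by the number of majority samples after resampling.

```python
import numpy as np


def random_under_sample(X, y, ratio, minority_label=1, seed=0):
    """Keep all minority samples and draw majority samples without
    replacement so that n_minority / n_majority is close to `ratio`."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Number of majority samples needed to reach the requested ratio.
    n_majority = min(
        len(majority_idx), int(round(len(minority_idx) / ratio))
    )
    kept_majority = rng.choice(majority_idx, size=n_majority, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


# Toy data for illustration only: 980 negatives and 20 positives.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0] * 980 + [1] * 20)

X_res, y_res = random_under_sample(X_toy, y_toy, ratio=0.5)
print(np.bincount(y_res))
```

With a ratio of 0.5, the 20 minority samples are kept and 40 majority samples are drawn; a ratio of 1.0 would give perfectly balanced classes.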

Note that we import `make_pipeline` from `imblearn` because the `Pipeline` from `scikit-learn` does not handle samplers from `imbalanced-learn`.

In [None]:
from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

classifier = make_pipeline_with_sampler(
    StandardScaler(),
    RandomUnderSampler(random_state=42),
    LogisticRegression(max_iter=1000),
)
cv_results = cross_validate(
    classifier, X, y, scoring=scoring, n_jobs=-1,
)

In [None]:
index.append(f"{classifier[-1].__class__.__name__} with {classifier[-2].__class__.__name__}")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    Repeat the experiment above but try to fine-tune the balancing ratio by grid-search. Optimize the average precision score. The parameter to tune is called <tt>sampling_strategy</tt>. You can refer to the <a href="https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html">documentation</a>.
</div>

In [None]:
# %load solutions/solution_09.py

In [None]:
# %load solutions/solution_10.py

In [None]:
index.append("LogisticRegression with RandomUnderSampler with an optimal ratio")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

We can repeat the previous experiment for the `RandomForestClassifier` and observe the impact of the sampling.

In [None]:
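import numpy as np
from sklearn.metrics import average_precision_score, make_scorer
from sklearn.model_selection import GridSearchCV

param_grid = {
    "randomundersampler__sampling_strategy": np.logspace(-2, 0, num=15)
}
classifier = GridSearchCV(
    make_pipeline_with_sampler(
        RandomUnderSampler(random_state=42),
        RandomForestClassifier(n_jobs=-1),
    ),
    param_grid=param_grid,
    # Note: scikit-learn >= 1.4 deprecates `needs_proba=True` in favor of
    # `response_method="predict_proba"`.
    scoring=make_scorer(
        average_precision_score, needs_proba=True, pos_label="True"
    ),
)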
cv_results = cross_validate(
    classifier, X, y, scoring=scoring, return_estimator=True,
)

In [None]:
for estimator in cv_results["estimator"]:
    print(estimator.best_params_)

In [None]:
index.append("RandomForestClassifier with RandomUnderSampler with an optimal ratio")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

### Integrating sampling within ensemble methods

Some ensemble methods integrate an inner resampling step, which leads to more efficient algorithms. There are notably two such algorithms. Let's start with `BalancedRandomForestClassifier`.

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier

classifier = BalancedRandomForestClassifier(random_state=42, n_jobs=-1)
cv_results = cross_validate(
    classifier, X, y, scoring=scoring,
)

Since the resampling happens at the level of the bootstrap, each tree in the forest is built on a smaller number of samples, which lowers the computational cost. Resampling each bootstrap also potentially exposes the forest to more of the original data than a strategy that resamples the full training set beforehand.
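
The second point can be illustrated with a small simulation. The sketch below is a simplification (it only tracks which majority samples are drawn and is not the `imbalanced-learn` implementation); the sizes and the number of trees are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_majority, n_minority, n_trees = 1000, 20, 50
majority_idx = np.arange(n_majority)

# Strategy A: under-sample once before training. Every tree is then built
# from the same 20 majority samples.
single_subset = rng.choice(majority_idx, size=n_minority, replace=False)
coverage_once = len(np.unique(single_subset)) / n_majority

# Strategy B: draw a fresh balanced subset of majority samples per tree.
seen = set()
for _ in range(n_trees):
    draw = rng.choice(majority_idx, size=n_minority, replace=True)
    seen.update(draw.tolist())
coverage_per_tree = len(seen) / n_majority

print(coverage_once, coverage_per_tree)
```

Resampling once only ever uses 2% of the majority class here, whereas the per-tree resampling lets the ensemble as a whole see a much larger share of it.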

In [None]:
index.append(classifier.__class__.__name__)
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

It is also possible to fine-tune the ratio of the internal resampling, as we did previously.

In [None]:
param_grid = {
    "sampling_strategy": np.logspace(-2, 0, num=10)
}
classifier = GridSearchCV(
    BalancedRandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    scoring=make_scorer(
        average_precision_score, needs_proba=True, pos_label="True"
    ),
)
cv_results = cross_validate(
    classifier, X, y, scoring=scoring,
)

In [None]:
index.append("BalancedRandomForestClassifier with optimal ratio")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

In addition to the `BalancedRandomForestClassifier`, `imbalanced-learn` provides a `BalancedBaggingClassifier` that accepts any kind of estimator. Each estimator will be trained on a resampled bootstrap. Here, we show that we could use a strong learner like a `HistGradientBoostingClassifier`.

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

param_grid = {
    "sampling_strategy": np.logspace(-2.1, 0, num=10)
}
classifier = GridSearchCV(
    BalancedBaggingClassifier(
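        # Note: recent imbalanced-learn releases name this parameter
        # `estimator` instead of `base_estimator`.
        base_estimator=HistGradientBoostingClassifier(
            max_iter=1_000, early_stopping=True, random_state=42
        ),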
        n_estimators=5,
        random_state=42,
    ),
    param_grid=param_grid,
    scoring=make_scorer(
        average_precision_score, needs_proba=True, pos_label="True"
    ),
)
cv_results = cross_validate(
    classifier, X, y, scoring=scoring, n_jobs=-1
)

In [None]:
index.append("BalancedBaggingClassifier with optimal ratio")
scores = update_scores(scores, cv_results)

df_scores = pd.DataFrame(scores, index=index)
df_scores

The last approach is probably the most effective, but it requires a large amount of resources since it relies on powerful models.

However, keep in mind that everything we did here optimized a given metric. Is the chosen metric the right one?

## Cost-sensitive metric

Since we are dealing with a business-oriented dataset, it is worth asking ourselves whether the metrics we previously optimized are also meaningful for our application (business).

Let's define a real cost-driven metric based on the confusion matrix. We will compute the confusion matrix, which gives us the true positives and negatives and the false positives and negatives. We will then apply some (completely arbitrary) business rules to convert it into a monetary cost/benefit metric.

In short, we could have the following rules:

- not detecting a fraud will cost us the amount of the transaction
- detecting a fraud will benefit us 20 euros
- refusing a legitimate transaction will annoy our customer and cost us 20 euros
- accepting a legitimate transaction will increase customer confidence and the benefit will depend on the transaction amount
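
To make these rules concrete, here is a worked example on four made-up transactions. The 2% benefit on accepted legitimate transactions is the arbitrary rate used in this notebook's `benefit_matrix`; the amounts and predictions are invented for illustration.

```python
import numpy as np

# Four made-up transactions: amount, true label, and model decision.
amount = np.array([100.0, 50.0, 200.0, 30.0])
is_fraud = np.array([False, False, True, True])
flagged = np.array([False, True, True, False])

tn = ~is_fraud & ~flagged  # legitimate and accepted
fp = ~is_fraud & flagged   # legitimate but refused
tp = is_fraud & flagged    # fraud and detected
fn = is_fraud & ~flagged   # fraud but missed

benefits = {
    "tn_benefit": 0.02 * amount[tn].sum(),  # 2% of the accepted amount
    "fp_benefit": -20 * fp.sum(),           # 20 euros lost per refusal
    "tp_benefit": 20 * tp.sum(),            # 20 euros gained per detection
    "fn_benefit": -amount[fn].sum(),        # missed fraud costs its amount
}
total = sum(benefits.values())
print(benefits, total)
```

Here the single missed fraud of 30 euros outweighs the small gain on the accepted transaction, so the total is negative even though half of the frauds were detected.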

In [None]:
def benefit_matrix(estimator, X, y):
    y_pred = estimator.predict(X)
    tp = (y == "True") & (y == y_pred)
    tn = (y == "False") & (y == y_pred)
    fp = (y_pred == "True") & (y != y_pred)
    fn = (y_pred == "False") & (y != y_pred)
    
    # transform into benefit matrix
    # small benefit when accepting a legitimate transaction;
    # it is proportional to the amount
    tn_benefit = (X["Amount"][tn] * 0.02).sum()
    # the benefit of detecting a fraud is not trivial to
    # estimate; this value is arbitrary
    tp_benefit = tp.sum() * 20
    # blocking a legitimate transaction will annoy our
    # customer
    fp_benefit = fp.sum() * -20
    # not blocking a fraud will cost us the transaction
    # amount
    fn_benefit = -(X["Amount"][fn]).sum()
    return {
        "tp_benefit": tp_benefit,
        "tn_benefit": tn_benefit,
        "fp_benefit": fp_benefit,
        "fn_benefit": fn_benefit,
    }

Now that we have our business metric, we can evaluate it.
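
As a side note on the mechanism, `cross_validate` accepts a callable with the signature `(estimator, X, y)` that returns a dictionary of named values. The sketch below shows this on synthetic data; `alert_counts` is a hypothetical scorer introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def alert_counts(estimator, X, y):
    """Toy multi-metric scorer: one entry per returned key."""
    y_pred = estimator.predict(X)
    return {
        "n_alerts": int((y_pred == 1).sum()),
        "n_missed": int(((y == 1) & (y_pred == 0)).sum()),
    }


# Synthetic imbalanced problem for illustration only.
X_toy, y_toy = make_classification(
    n_samples=500, weights=[0.9, 0.1], random_state=0
)
toy_cv = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    scoring=alert_counts,
    cv=3,
)
print(sorted(toy_cv))
```

Each key of the returned dictionary shows up in the results with a `test_` prefix and one value per fold, which is what the cells below rely on when selecting the columns containing `"test_"`.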

In [None]:
model = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
cv_results = cross_validate(
    model, X, y, scoring=benefit_matrix, n_jobs=-1, error_score="raise"
)
cv_results = pd.DataFrame(cv_results)

In [None]:
metric_names = [name for name in cv_results.columns if "test_" in name]
cv_results[metric_names]

In [None]:
cv_results[metric_names].sum(axis=1).mean()

Now, we can try to evaluate a model for which we resample the dataset. To select the sampling ratio, we will maximize the total benefit instead of the average precision that we used earlier.

In [None]:
def total_benefit(estimator, X, y):
    return sum(benefit_matrix(estimator, X, y).values())


param_grid = {
    "randomundersampler__sampling_strategy": np.logspace(-2.1, -1, num=15)
}
classifier = GridSearchCV(
    make_pipeline_with_sampler(
        StandardScaler(),
        RandomUnderSampler(random_state=42),
        LogisticRegression(max_iter=1000),
    ),
    param_grid=param_grid,
    scoring=total_benefit,
    n_jobs=-1
)
cv_results = cross_validate(
    classifier, X, y, scoring=benefit_matrix, return_estimator=True,
)
cv_results = pd.DataFrame(cv_results)

In [None]:
cv_results[metric_names].sum(axis=1).mean()

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    Repeat the above experiment. However, optimize the balanced accuracy within the grid-search. Is it better or worse than optimising in terms of final business metric?
</div>

In [None]:
# %load solutions/solution_11.py

In [None]:
# %load solutions/solution_12.py

In [None]:
# %load solutions/solution_13.py