# Classifier calibration

For some applications, experts would like to interpret the probabilities that a classifier outputs. Let's take the example of weather forecasting, and specifically the prediction of severe rainfall. If a forecast announces an 80% probability of severe rainfall, we expect that out of 100 events with such weather conditions, about 80 end up with severe rainfall while about 20 do not.

However, classifiers do not always provide probabilities that support such an interpretation: these classifiers are not calibrated. When such an interpretation is required, one should check that the classifier is calibrated and, if it is not, calibrate it.
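To make this interpretation concrete, here is a minimal sketch on simulated data (not the weather dataset). Two hypothetical forecasters both announce an 80% probability, but severe rainfall occurs 80% of the time for the first one and only 50% of the time for the second one. The number of events and both rates are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_events = 10_000
announced_probability = 0.8

# Forecaster A: the event really occurs 80% of the time when 80% is announced.
outcomes_calibrated = rng.random(n_events) < 0.8
# Forecaster B: the event only occurs 50% of the time when 80% is announced.
outcomes_uncalibrated = rng.random(n_events) < 0.5

frequency_calibrated = outcomes_calibrated.mean()
frequency_uncalibrated = outcomes_uncalibrated.mean()
print(f"Announced probability: {announced_probability:.2f}")
print(f"Observed frequency (forecaster A): {frequency_calibrated:.2f}")
print(f"Observed frequency (forecaster B): {frequency_uncalibrated:.2f}")
```

Only forecaster A announces a probability that matches the observed frequency, which is what we mean by a calibrated classifier.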

In this notebook, we will investigate how to check whether a classifier is calibrated and how to calibrate one.

## Presentation of our dataset

Let's load the dataset that we will use to illustrate our problem.

In [None]:
import sklearn

sklearn.set_config(display="diagram")

In [None]:
import pandas as pd

data = pd.read_csv("../datasets/weather.csv", parse_dates=["Date"])
data.head()

This dataset contains weather information. We will modify this dataset such that our target is the `"Rainfall"` column. We will create a classification problem by thresholding the target to get two categories: more than 50 mm is a severe rainfall, and 50 mm or less is not.

In addition, we will drop the `"RainToday"` and `"RainTomorrow"` features, which are linked to the original classification problem.

In [None]:
import numpy as np

target_name = "Rainfall"
data = data.dropna(axis="index", subset=[target_name])
X = data.drop(columns=["RainToday", "RainTomorrow", target_name])
y = (data[target_name] > 50).astype(np.int64)

Now let's have a look at the available features and their types.

In [None]:
X.info()

So we can see that we will need to:

- encode the `"Date"` feature;
- encode the `object` columns using an `OrdinalEncoder`;
- leave the numerical features as-is;
- impute the missing values with a constant.

In addition, let's check the distribution of the target.

In [None]:
y.value_counts()

Thus, we can observe that the problem is imbalanced.
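Class imbalance matters when choosing evaluation metrics. As a rough illustration on a made-up target (95 negatives and 5 positives, not the actual class counts), a trivial model that always predicts the majority class gets a high accuracy while its balanced accuracy stays at chance level:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced target: 95 negatives and 5 positives (made-up proportions).
y_true = np.array([0] * 95 + [1] * 5)
# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Balanced accuracy: {balanced_accuracy:.2f}")
```

Balanced accuracy averages the recall obtained on each class, so ignoring the minority class is penalized.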

## Our machine learning model

We use a `BalancedRandomForestClassifier` on this problem. First, let's create a preprocessor.

### Date parser

Let's create a small function that encodes the date into three different features: the day, the month, and the year.

In [None]:
def date_parser(X):
    X = X.copy()
    X["day"] = X["Date"].dt.day
    X["month"] = X["Date"].dt.month
    X["year"] = X["Date"].dt.year
    return X.drop(columns=["Date"])
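As a quick check, the function can be applied to a tiny hand-made dataframe. The function is repeated so that the snippet runs on its own, and the `"Temp"` column and the two dates are hypothetical values used only for illustration.

```python
import pandas as pd


def date_parser(X):
    # Same function as above, repeated to keep the example self-contained.
    X = X.copy()
    X["day"] = X["Date"].dt.day
    X["month"] = X["Date"].dt.month
    X["year"] = X["Date"].dt.year
    return X.drop(columns=["Date"])


# Hypothetical two-row frame; "Temp" is a made-up column name.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-31", "2021-12-05"]),
        "Temp": [21.5, 30.2],
    }
)
toy_parsed = date_parser(toy)
print(toy_parsed)
```

The `"Date"` column is replaced by three integer columns, and the input dataframe is left untouched thanks to the copy.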

### Data preprocessor

Now, let's create a preprocessor to encode the categorical columns and leave the numerical columns as-is. We will use `make_column_selector` based on the dtype to select the right columns.

In [None]:
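from sklearn.compose import make_column_selector

# Select columns after date parsing so that "day", "month", and "year" are
# routed to the numerical branch instead of being dropped.
X_parsed = date_parser(X)
numerical_columns = make_column_selector(dtype_exclude=[object, "datetime"])(X_parsed)
categorical_columns = make_column_selector(dtype_include=object)(X_parsed)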

Now, we will use a `ColumnTransformer` to encode and impute the missing data.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

preprocessor = ColumnTransformer(
    transformers=[
        (
            "categorical",
            make_pipeline(
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                SimpleImputer(strategy="constant", fill_value=-1),
            ),
            categorical_columns,
        ),
        (
            "numerical",
            SimpleImputer(strategy="constant", fill_value=-1),
            numerical_columns,
        )
    ],
)
preprocessor

### Full model

Now that we have our preprocessor, we can create our entire model, finishing with a `BalancedRandomForestClassifier`.

In [None]:
from sklearn.preprocessing import FunctionTransformer
from imblearn.ensemble import BalancedRandomForestClassifier

model = make_pipeline(
    FunctionTransformer(date_parser),
    preprocessor,
    BalancedRandomForestClassifier(n_jobs=-1, random_state=0),
)
model

We can now evaluate our model using cross-validation. Since we deal with time series, we will use a `TimeSeriesSplit` cross-validation scheme.

In addition, we will use several metrics: balanced accuracy, average precision, and the Brier score loss.

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# Built-in scorer names; the Brier score is negated so that greater is better.
scoring = {
    "balanced_accuracy": "balanced_accuracy",
    "average_precision": "average_precision",
    "brier_score": "neg_brier_score",
}
cv = TimeSeriesSplit()
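As an aside, the way `TimeSeriesSplit` builds its folds can be inspected on a toy array. The sketch below assumes eight samples sorted in chronological order and three splits (both arbitrary choices); each test fold should only contain samples that come after all the training samples.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight samples assumed to be sorted in chronological order.
X_toy = np.arange(8).reshape(-1, 1)
toy_cv = TimeSeriesSplit(n_splits=3)

splits = list(toy_cv.split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Note that this scheme only respects the time order if the rows of the dataset are themselves sorted by date.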

We are finally ready to run our cross-validation.

In [None]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    model, X, y, cv=cv, scoring=scoring, n_jobs=-1,
)

In [None]:
cv_results = pd.DataFrame(cv_results)
cv_results

In [None]:
cv_results.mean()
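When reading these results, keep in mind that scikit-learn scorers follow a greater-is-better convention, so a loss such as the Brier score is reported with its sign flipped. The sketch below illustrates this on synthetic data with a stand-in `LogisticRegression`; both are assumptions unrelated to the weather model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data and a stand-in classifier, unrelated to the weather dataset.
X_toy, y_toy = make_classification(n_samples=500, random_state=0)
toy_results = cross_validate(
    LogisticRegression(max_iter=1_000),
    X_toy,
    y_toy,
    cv=3,
    scoring={"brier_score": "neg_brier_score"},
)

# Each metric is stored under a "test_<name>" key, one value per fold.
negated_brier = toy_results["test_brier_score"]
print(negated_brier)   # negated losses, so values are <= 0
print(-negated_brier)  # flip the sign to recover the usual Brier score
```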

### A note about the Brier score

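The Brier score (which is actually a loss) measures whether the probabilities predicted by a classifier are accurate: it is the mean squared difference between the predicted probability and the observed outcome. Other things being equal, an uncalibrated classifier will have a higher Brier score than a well-calibrated classifier.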
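For a binary problem, the Brier score is the mean of the squared differences between the predicted probability of the positive class and the 0/1 outcome. The sketch below computes it by hand on four made-up predictions and compares the result with `brier_score_loss`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Made-up outcomes and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.3])

brier_manual = np.mean((y_prob - y_true) ** 2)
brier_sklearn = brier_score_loss(y_true, y_prob)
print(brier_manual, brier_sklearn)
```

A perfect forecaster would reach 0, while always predicting 0.5 gives 0.25 whatever the outcomes are.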

## About classifier calibration

Now that we have our model, we will show how to check whether it is calibrated. Let's first make a single train-test split and train our classifier.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.5, random_state=0
)

In [None]:
model.fit(X_train, y_train)

We can use the `CalibrationDisplay`, which plots the fraction of positives against the mean predicted probability. For a calibrated classifier, we expect the fraction of positives to be aligned with the mean predicted probability, in line with our original explanation. Let's check if our classifier is calibrated.

In [None]:
import seaborn as sns
sns.set_context("poster")

In [None]:
from sklearn.calibration import CalibrationDisplay

display = CalibrationDisplay.from_estimator(
    model, X_test, y_test, strategy="quantile", n_bins=20,
    name="Original classifier", markersize=10,
)
display.ax_.legend(loc="best", bbox_to_anchor=(1, 0.5))
_ = display.ax_.set_title("Reliability of original classifier")

We observe that our classifier is not well calibrated since it does not follow the diagonal.
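The points of such a reliability diagram can also be computed numerically with `calibration_curve`. The sketch below uses simulated data rather than the weather model: `p_true` plays the role of a well-calibrated classifier, and `p_inflated` is a hypothetical classifier whose probabilities are inflated on purpose.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Simulated "true" probabilities, mostly small as in an imbalanced problem.
p_true = rng.beta(1, 8, size=20_000)
y_sim = (rng.random(p_true.size) < p_true).astype(int)
# Hypothetical classifier whose probabilities are inflated on purpose.
p_inflated = np.sqrt(p_true)

curves = {}
for name, proba in [("well calibrated", p_true), ("inflated", p_inflated)]:
    # Returns the fraction of positives and the mean prediction in each bin.
    curves[name] = calibration_curve(y_sim, proba, n_bins=5, strategy="quantile")
    frac_pos, mean_pred = curves[name]
    print(name)
    print("  mean predicted probability:", mean_pred.round(2))
    print("  fraction of positives:     ", frac_pos.round(2))
```

For the inflated classifier, each mean predicted probability should sit above the corresponding fraction of positives, which is the pattern to look for on the diagram.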

When a model is not calibrated, it can be either:

- overconfident: the predicted probability will be higher than the fraction of positives, or
- underconfident: the predicted probability will be lower than the fraction of positives.

Here, our model is clearly overconfident. A classifier can be recalibrated using `CalibratedClassifierCV`. This meta-estimator fits a calibrator, i.e. a function that maps the probabilities of the uncalibrated classifier to the observed outcomes. With `method="isotonic"`, this function is an isotonic (monotonically increasing) regression.

In [None]:
from sklearn.calibration import CalibratedClassifierCV

model_calibrated = CalibratedClassifierCV(
    model, method="isotonic",
)
model_calibrated
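To give an intuition of what an isotonic calibrator does, the sketch below fits an `IsotonicRegression` directly on simulated probabilities. It is a simplified stand-in for what happens inside `CalibratedClassifierCV`, not its actual implementation; the simulated data and the square-root inflation are assumptions made for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
p_true = rng.beta(1, 8, size=20_000)
y_sim = (rng.random(p_true.size) < p_true).astype(int)
p_raw = np.sqrt(p_true)  # deliberately inflated probabilities

# Fit the calibrator on one half and evaluate it on the other half.
half = p_true.size // 2
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(p_raw[:half], y_sim[:half])
p_calibrated = calibrator.predict(p_raw[half:])

brier_raw = brier_score_loss(y_sim[half:], p_raw[half:])
brier_calibrated = brier_score_loss(y_sim[half:], p_calibrated)
print(f"Brier score before calibration: {brier_raw:.4f}")
print(f"Brier score after calibration:  {brier_calibrated:.4f}")
```

Because the mapping is monotonic, the ranking of the samples is essentially preserved (up to ties) while the probability values are corrected.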

Now, we can evaluate our calibrated model with cross-validation.

In [None]:
cv_results = cross_validate(
    model_calibrated, X, y, cv=cv, scoring=scoring, n_jobs=-1,
)

In [None]:
cv_results = pd.DataFrame(cv_results)
cv_results

In [None]:
cv_results.mean()

We observe that while the balanced accuracy goes down, the average precision remains more or less stable. More importantly, the Brier score loss is much smaller (the negated score reported above is closer to zero), meaning that our classifier is better calibrated. We can now check the reliability diagram.

In [None]:
model_calibrated.fit(X_train, y_train)

In [None]:
display = CalibrationDisplay.from_estimator(
    model_calibrated, X_test, y_test, strategy="quantile", n_bins=20
)
display.ax_.legend(loc="best", bbox_to_anchor=(1, 0.5))
_ = display.ax_.set_title("Reliability of calibrated classifier")

We observe that our classifier now follows the diagonal. Since we are using quantile bins and most predicted probabilities are low, we don't have any data point above 10%. We could force the binning to use uniform bins instead. However, some bins might then contain very few points and thus have a lot of variance.

In [None]:
display = CalibrationDisplay.from_estimator(
    model_calibrated, X_test, y_test, strategy="uniform", n_bins=20
)
display.ax_.legend(loc="best", bbox_to_anchor=(1, 0.5))
_ = display.ax_.set_title("Reliability of calibrated classifier")
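The trade-off between the two binning strategies can be illustrated with NumPy alone. The sketch below draws simulated probabilities concentrated near zero (an assumption mimicking an imbalanced problem, not the actual model output) and counts the samples falling in each bin.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated predicted probabilities concentrated near zero.
proba = rng.beta(1, 20, size=10_000)
n_bins = 10

# Uniform bins split [0, 1] evenly; quantile bins split the samples evenly.
uniform_edges = np.linspace(0.0, 1.0, n_bins + 1)
quantile_edges = np.quantile(proba, np.linspace(0.0, 1.0, n_bins + 1))

uniform_counts, _ = np.histogram(proba, bins=uniform_edges)
quantile_counts, _ = np.histogram(proba, bins=quantile_edges)
print("Samples per uniform bin: ", uniform_counts)
print("Samples per quantile bin:", quantile_counts)
```

Quantile bins all hold about the same number of samples but cover only the low-probability range, whereas uniform bins cover the whole range but leave the upper bins nearly or completely empty, hence the noisier estimates there.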