# Metrics

**Note:** to use this notebook in Google Colab, create a new cell with
the following lines and run it:

``` shell
!git clone https://gitlab.in2p3.fr/jbarnier/ateliers_deep_learning.git
%cd ateliers_deep_learning
!pip install .
```

In [None]:
import polars as pl
import torch
from torch import nn

from adl.metrics import stratified_split

pl.Config(tbl_rows=10, float_precision=3)

The training and validation losses let us follow the progress of the
training process, but they are not necessarily good indicators of the
quality of the network predictions. For this we need specific metrics
aligned with the problem we are trying to solve.

For example, for a regression problem we could compute the $R^2$ score,
the mean absolute error or the mean absolute percentage error. For a
classification problem we could use many different metrics such as
accuracy, precision, recall, F1-score, ROC AUC, etc.
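To make the regression metrics mentioned above more concrete, here is a minimal hand-written sketch of the mean absolute error, the mean absolute percentage error and the $R^2$ score. The function names `mae`, `mape` and `r2` and the toy values are assumptions chosen for illustration; in practice a library such as `scikit-learn` would normally be used instead.

```python
def mae(y_true, y_pred):
    # Mean absolute error: average size of the errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # Mean absolute percentage error: errors relative to the true values
    # (this sketch assumes that no true value is zero)
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    # R^2: 1 minus the ratio of residual variation to total variation
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

mae_value = mae(y_true, y_pred)
mape_value = mape(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(f"MAE: {mae_value:.3f}, MAPE: {mape_value:.3f}, R2: {r2_value:.3f}")
```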

## Computing metrics during training

In this notebook we will use a dataset on [credit card fraud
detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
downloaded from Kaggle and converted to a parquet file.

In [None]:
d = pl.read_parquet("data/creditcard.parquet")
d

This tabular dataset contains 284,807 rows describing credit card
transactions made in September 2013 in Europe:

-   The `Amount` column is the transaction amount
-   The columns `V1` to `V28` are different characteristics of the
    transaction anonymized through a PCA transformation
-   The `Class` column has value 1 if the transaction is a credit card
    fraud, and 0 otherwise

The dataset is highly unbalanced, as there are only 492 fraudulent
transactions.

In [None]:
d.get_column("Class").value_counts()

We split this dataset into training and validation data using stratified
sampling to maintain the same proportion of fraudulent transactions in
both datasets. This is necessary because, with purely random sampling,
the low prevalence of fraud could leave very few fraudulent transactions
in the validation set.

In [None]:
X_train, X_valid, y_train, y_valid = stratified_split(d, valid_proportion=0.2)
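The implementation of `stratified_split` is not shown in this notebook. As a rough idea of what stratified sampling involves, the sketch below splits the indices of each class separately, so that both parts keep similar class proportions. The `stratified_indices` function, its `seed` parameter and the toy labels are hypothetical and are not the actual helper from the `adl` package.

```python
import torch


def stratified_indices(y, valid_proportion=0.2, seed=0):
    """Return (train_idx, valid_idx) with similar class proportions in both parts."""
    generator = torch.Generator().manual_seed(seed)
    train_parts, valid_parts = [], []
    for cls in torch.unique(y):
        # Indices of all rows belonging to this class, in random order
        idx = torch.nonzero(y == cls).flatten()
        idx = idx[torch.randperm(len(idx), generator=generator)]
        # Send the same proportion of each class to the validation set
        n_valid = round(len(idx) * valid_proportion)
        valid_parts.append(idx[:n_valid])
        train_parts.append(idx[n_valid:])
    return torch.cat(train_parts), torch.cat(valid_parts)


# Toy labels: 990 legitimate transactions and 10 frauds
y = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])
train_idx, valid_idx = stratified_indices(y)

n_train_fraud = int(y[train_idx].sum())
n_valid_fraud = int(y[valid_idx].sum())
print(f"Frauds in train: {n_train_fraud}, frauds in validation: {n_valid_fraud}")
```

With a purely random split, the number of frauds in the validation part would vary from one run to the next; here it is fixed by construction.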

Finally, we create a small feed-forward neural network and a training
step function as seen previously.

In [None]:
class FraudDetectionNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(29, 10), nn.ReLU(), nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2)
        )

    def forward(self, x):
        return self.model(x).squeeze()


def train_step(epoch, model, loss_fn, optimizer):
    # Run training step
    model.train()
    optimizer.zero_grad()
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train)
    loss.backward()
    optimizer.step()
    # Run validation step (no gradients are needed for evaluation)
    model.eval()
    with torch.no_grad():
        y_valid_pred = model(X_valid)
        valid_loss = loss_fn(y_valid_pred, y_valid)
    print(f"Epoch: {epoch + 1:2}, loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}")


We launch a training process using a cross entropy loss, which is more
suitable for a classification problem such as this one.

In [None]:
model = FraudDetectionNetwork()
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

epochs = 20
for epoch in range(epochs):
    train_step(epoch, model, loss_fn, optimizer)

The training process seems to go well: both the training and validation
losses decrease steadily.

Let’s take a closer look at the predictions of our model on the
validation data. When applying `model` to `X_valid`, we can see that the
result for each observation is a pair of numbers. The first one is
associated with `Class=0`, and the second one with `Class=1`. These
values are not probabilities, as they are not restricted to the range
between 0 and 1: they are called *logits*.

In [None]:
logits = model(X_valid)
logits

Logits can be converted into probabilities by applying a *softmax*
function to them. But we can also determine the class of each validation
data point by applying `torch.argmax` along the second dimension of our
logits: this will return `0` if the logit associated with `Class=0` is
higher, and `1` otherwise.

In [None]:
classes = torch.argmax(logits, dim=1)
classes
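As a small illustration of the relationship between logits, softmax probabilities and `argmax`, here is a sketch on made-up logits (the `toy_logits` values are assumptions, not model outputs). Since softmax preserves the ordering of the values within each row, taking the `argmax` of the logits or of the probabilities gives the same classes.

```python
import torch

# Made-up logits for three observations: one column per class
toy_logits = torch.tensor([[2.0, -1.0], [0.2, 0.7], [-3.0, 1.0]])

# Softmax turns each row into probabilities that sum to 1
probs = torch.softmax(toy_logits, dim=1)

# The predicted class is the same whether computed from logits or probabilities
classes_from_logits = torch.argmax(toy_logits, dim=1)
classes_from_probs = torch.argmax(probs, dim=1)

print(probs)
print(classes_from_logits)
```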


So now we can finally look at the number of fraudulent transactions
predicted by our model on our validation dataset.

In [None]:
torch.sum(classes == 1)


And this value is 0… So our network is learning, as the cross entropy
loss between our logits and the target values is going down, but for the
moment its results are useless.

So if the loss value is useful to assess the progress of our training
process, it is not necessarily a good indicator of the quality of its
results. To evaluate this we need to use other metrics, which will
depend on the problem we want to solve.
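A quick calculation, using only the class counts given earlier for the whole dataset, shows why a model that never predicts fraud can still look good on some measures. The sketch below computes the accuracy and recall of a hypothetical classifier that always predicts `Class=0`.

```python
# Class counts for the whole dataset, as stated earlier in this notebook
n_total = 284_807
n_fraud = 492

# A hypothetical classifier that always predicts Class=0
true_negatives = n_total - n_fraud  # every legitimate transaction is correct
true_positives = 0                  # no fraud is ever detected

accuracy = (true_negatives + true_positives) / n_total
recall = true_positives / n_fraud

print(f"accuracy: {accuracy:.4f}, recall: {recall:.4f}")
```

The accuracy is very close to 1 even though no fraudulent transaction is identified, which is why metrics such as precision and recall are more informative on this kind of unbalanced problem.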

**Exercise**

One very simple metric we just computed is the number of fraudulent
transactions identified by the model on the validation dataset. It could
be useful to add this metric to our training process output.

Modify the `train_step` function above to create a new
`train_step_nfraud` function which computes and displays, for each epoch,
the train loss, the validation loss, and the number of predicted
fraudulent transactions in the validation dataset.

Run this new training process for 10 epochs on a new
`FraudDetectionNetwork` model.

We can see that at the start of our training process the model predicts
some fraudulent transactions, but this number goes down to 0 rapidly.

There are many other metrics we can use to assess the results of a
classification problem, and several Python packages provide functions to
compute them more easily. For example, we could use the
`precision_score` and `recall_score` functions of the `scikit-learn`
package to compute precision and recall at each epoch.

In [None]:
from sklearn.metrics import precision_score, recall_score


def train_step_metrics(epoch, model, loss_fn, optimizer):
    # Run training step
    model.train()
    optimizer.zero_grad()
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train)
    loss.backward()
    optimizer.step()

    # Run validation step (no gradients are needed for evaluation)
    model.eval()
    with torch.no_grad():
        y_valid_pred = model(X_valid)
        valid_loss = loss_fn(y_valid_pred, y_valid)

    # Compute metrics on the predicted classes
    pred_classes = torch.argmax(y_valid_pred, dim=1)
    n_fraud = torch.sum(pred_classes == 1)
    # Skip the scikit-learn calls when no fraud is predicted (precision is undefined there)
    precision = precision_score(y_valid, pred_classes) if n_fraud > 0 else 0
    recall = recall_score(y_valid, pred_classes) if n_fraud > 0 else 0
    print(
        f"Epoch: {epoch + 1:3}, loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}, n_fraud: {n_fraud:3}, "
        f"precision: {precision:5.3f}, recall: {recall:5.3f}"
    )


model = FraudDetectionNetwork()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

torch.manual_seed(42)
epochs = 10
for epoch in range(epochs):
    train_step_metrics(epoch, model, loss_fn, optimizer)

So, the metrics are not good, but the loss is still going down. Maybe we
can look at what happens if we run the training process for longer?

In [None]:
model = FraudDetectionNetwork()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

torch.manual_seed(42)
epochs = 85
for epoch in range(epochs):
    train_step_metrics(epoch, model, loss_fn, optimizer)


Now we can see that after about 50 epochs, our model starts to predict
fraudulent transactions again, with growing values of precision and
recall. At epoch 85 we get a precision of 0.79 and a recall of 0.80.

## Computing metrics after training

Metrics are useful during training, but they are also very important
post-training, to more accurately assess the results.

For example, we can compute the *confusion matrix* of our trained model
on our validation dataset by using scikit-learn’s `confusion_matrix`
function.

In [None]:
from sklearn.metrics import confusion_matrix

# Gradients are not needed to compute predictions
with torch.no_grad():
    preds = model(X_valid)
preds = torch.argmax(preds, dim=1)


cm = confusion_matrix(y_valid, preds)
cm


Better yet, we can use `ConfusionMatrixDisplay` to generate a much more
readable plot of the confusion matrix.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(cm).plot()
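Precision, recall and F1-score can all be derived from the four cells of a binary confusion matrix. The sketch below does this by hand on made-up counts (the `cm_toy` values are hypothetical, not the results of our model), using scikit-learn’s layout where rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes
cm_toy = [
    [56850, 14],  # true class 0: true negatives, false positives
    [20, 78],     # true class 1: false negatives, true positives
]
(tn, fp), (fn, tp) = cm_toy

# Precision: share of predicted frauds that really are frauds
precision = tp / (tp + fp)
# Recall: share of real frauds that were detected
recall = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}, recall: {recall:.3f}, f1: {f1:.3f}")
```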

## Computing metrics when using mini-batches

### Mini-batches training process

Suppose that we are now using mini-batches during our training process,
as seen in the previous notebook: instead of feeding all training or
validation data at once during each training step, we’ll use smaller
subsets of data.

To do this, we first create a `FraudDataset` class and corresponding
`DataLoader` instances for training and validation data, with a batch
size of 256.

In [None]:
class FraudDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.y = y
        self.x = x

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return (self.x[index], self.y[index])


train_dataset = FraudDataset(x=X_train, y=y_train)
valid_dataset = FraudDataset(x=X_valid, y=y_valid)

batch_size = 256
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)


We then define two functions:

-   `train_step` will run a training epoch, *i.e.* apply a training step
    to each mini-batch in the training data loader
-   `eval_step` will run a validation epoch, *i.e.* apply an evaluation
    step to each mini-batch in the validation data loader

In [None]:
def train_step(model, loss_fn, optimizer):
    # Switch model into train mode
    model.train()
    loss = 0.0
    for inputs, target in train_loader:
        # Apply training step to batch
        optimizer.zero_grad()
        pred = model(inputs)
        batch_loss = loss_fn(pred, target)
        batch_loss.backward()
        optimizer.step()
        # Accumulate a plain float so each batch's computation graph can be freed
        loss += batch_loss.item()
    # Compute and return the mean loss for this epoch
    loss /= len(train_loader)
    return loss


def eval_step(model, loss_fn):
    # Switch model into eval mode
    model.eval()
    loss = 0.0
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, target in valid_loader:
            # Apply evaluation step to batch
            pred = model(inputs)
            batch_loss = loss_fn(pred, target)
            loss += batch_loss.item()
    # Compute and return the mean loss for this epoch
    loss /= len(valid_loader)
    return loss


We can now run our training process for a few epochs.

In [None]:
model = FraudDetectionNetwork()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

epochs = 5
torch.manual_seed(42)
for epoch in range(epochs):
    loss = train_step(model, loss_fn, optimizer)
    valid_loss = eval_step(model, loss_fn)
    print(f"Epoch: {epoch + 1:3}, loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}")


### Computing a single metric

To add metric computation during this training process with
mini-batches, we can calculate the metric value for each batch. However,
obtaining the overall epoch metric value for the entire validation data
from these mini-batch values can be challenging. Libraries like
`scikit-learn` do not provide methods out of the box to do this, and
manual implementation can be complicated and prone to errors.
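To see why per-batch values cannot simply be averaged, here is a small sketch on made-up batches (the `batches` data and the helper names are assumptions for illustration). It compares the mean of the per-batch F1 scores with the F1 score computed from the counts accumulated over all batches.

```python
def counts(targets, preds):
    # True positives, false positives and false negatives for one batch
    tp = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(targets, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN), set to 0 when there is nothing to count
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Three made-up batches of (targets, predictions)
batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # both frauds found
    ([0, 0, 0, 0], [0, 0, 0, 1]),  # one false alarm, no fraud in the batch
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # one missed fraud
]

# Naive approach: average the F1 score of each batch
batch_f1 = [f1_from_counts(*counts(t, p)) for t, p in batches]
mean_batch_f1 = sum(batch_f1) / len(batch_f1)

# Correct approach: accumulate the counts, then compute F1 once
total_tp = sum(counts(t, p)[0] for t, p in batches)
total_fp = sum(counts(t, p)[1] for t, p in batches)
total_fn = sum(counts(t, p)[2] for t, p in batches)
global_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"mean of batch F1: {mean_batch_f1:.3f}, global F1: {global_f1:.3f}")
```

The two values differ because F1 is a ratio of counts: batches with few or no fraudulent transactions weigh as much as the others in the naive average.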

One way to do this is to use the `torchmetrics` Python package. This
package provides a large number of metrics which can be used directly on
a whole dataset, but can also be applied to mini-batches in the
following way:

1.  first, we instantiate a metric object from one of the `torchmetrics`
    classes. For example, we can use the `BinaryF1Score` class to create
    an `f1_metric` object with `f1_metric = BinaryF1Score()`
2.  at the start of each epoch, we reset the metric with
    `f1_metric.reset()`
3.  for each mini-batch, we update the metric using the mini-batch
    predictions and targets with the `update()` method
4.  finally, at the end of the epoch, we can compute the overall epoch
    metric value using `f1_metric.compute()`
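The reset, update and compute cycle described in these steps can be pictured with a hand-written stand-in. The `RunningF1` class below is hypothetical and much simpler than a real `torchmetrics` metric; it is only meant to show the idea of accumulating state over mini-batches and computing the value once at the end.

```python
import torch


class RunningF1:
    """Hypothetical, simplified stand-in for a stateful binary F1 metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Forget everything accumulated so far (start of an epoch)
        self.tp = self.fp = self.fn = 0

    def update(self, preds, target):
        # Add the counts of one mini-batch to the running totals
        self.tp += int(((preds == 1) & (target == 1)).sum())
        self.fp += int(((preds == 1) & (target == 0)).sum())
        self.fn += int(((preds == 0) & (target == 1)).sum())

    def compute(self):
        # Compute F1 from the totals accumulated since the last reset
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom > 0 else 0.0


metric = RunningF1()

# Two made-up mini-batches of (predicted classes, targets)
batches = [
    (torch.tensor([1, 1, 0, 0]), torch.tensor([1, 0, 0, 0])),
    (torch.tensor([0, 1, 0, 0]), torch.tensor([1, 1, 0, 0])),
]

metric.reset()
for preds, target in batches:
    metric.update(preds, target)
epoch_f1 = metric.compute()
print(f"epoch F1: {epoch_f1:.3f}")
```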

Here is how we can include an F1 metric computation in our evaluation
step by creating a new `eval_step_f1` function.

In [None]:
from torchmetrics.classification import BinaryF1Score

# Instantiate F1 score metric object
f1_metric = BinaryF1Score()


def eval_step_f1(model, loss_fn):
    # Switch model into eval mode
    model.eval()
    # Reset F1 score
    f1_metric.reset()
    loss = 0.0

    # Gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, target in valid_loader:
            # Apply evaluation step to batch
            pred = model(inputs)
            batch_loss = loss_fn(pred, target)
            loss += batch_loss.item()
            # Update metric
            classes_pred = torch.argmax(pred, dim=1)
            f1_metric.update(classes_pred, target)

    # Compute overall loss and metric
    loss /= len(valid_loader)
    f1_score = f1_metric.compute()

    return loss, f1_score


In [None]:
model = FraudDetectionNetwork()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()


epochs = 5
torch.manual_seed(42)

for epoch in range(epochs):
    loss = train_step(model, loss_fn, optimizer)
    valid_loss, valid_f1 = eval_step_f1(model, loss_fn)
    print(f"Epoch: {epoch + 1:3}, loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}, f1: {valid_f1:5.3f}")


Creating a separate evaluation step function is useful because it makes
it easy to evaluate a trained model on a dataset using mini-batches.

In [None]:
eval_step_f1(model, loss_fn)

### Computing a list of metrics

In general we want to compute not just one but multiple metrics. For
instance, we might want to compute the F1-score, precision and recall
values for our classification problem.

`torchmetrics` provides a `MetricCollection` class which makes this
quite easy. By passing a list of metrics to `MetricCollection`, we can
create a collection object which has the same `reset()`, `update()` and
`compute()` methods as single metrics. When used, these methods are
called for all the metrics in the list.

**Exercise**

Create the following metrics collection object:

``` py
from torchmetrics.classification import BinaryF1Score, BinaryPrecision, BinaryRecall
from torchmetrics import MetricCollection

metrics_list = MetricCollection(
    [
        BinaryPrecision(),
        BinaryRecall(),
        BinaryF1Score(),
    ]
)
```

Create a new `eval_step_metrics` function that adds the computation of
these metrics to the evaluation step of our training process and
displays their values at the end of each epoch. Run this training
process for 10 epochs.