# Expected Calibration Error (ECE)
### [Code for Medium Article](LINK)


---

#### Overview

1. **NumPy Example**
    1. Binary classification
    2. Multi-class classification
2. **PyTorch Example**
    1. Binary classification
    2. Multi-class classification

------


## NumPy
### Definition of the ECE function:

In [None]:
import numpy as np


def expected_calibration_error(samples, true_labels, M=5):
    # uniform binning approach with M number of bins
    bin_boundaries = np.linspace(0, 1, M + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    # get max probability per sample i
    confidences = np.max(samples, axis=1)
    # get predictions from confidences (positional in this case)
    predicted_label = np.argmax(samples, axis=1).astype(float)

    # get a boolean list of correct/false predictions
    accuracies = predicted_label==true_labels

    ece = np.zeros(1)
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # determine if sample is in bin m (between bin lower & upper)
        in_bin = np.logical_and(confidences > bin_lower.item(), confidences <= bin_upper.item())
        # can calculate the empirical probability of a sample falling into bin m: (|Bm|/n)
        prop_in_bin = in_bin.astype(float).mean()

        if prop_in_bin.item() > 0:
            # get the accuracy of bin m: acc(Bm)
            accuracy_in_bin = accuracies[in_bin].astype(float).mean()
            # get the average confidence of bin m: conf(Bm)
            avg_confidence_in_bin = confidences[in_bin].mean()
            # calculate |acc(Bm) - conf(Bm)| * (|Bm|/n) for bin m and add to the total ECE
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece
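
For reference, the quantity accumulated in the loop above can be written as a formula. This is a sketch in the notation of the code comments: $B_m$ is the set of samples whose confidence falls into bin $m$ and $n$ is the total number of samples. The symbols $\hat{p}_i$, $\hat{y}_i$ and $y_i$ are introduced here only for illustration and denote the confidence, predicted label and true label of sample $i$:

```latex
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|,
\qquad
\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),
\qquad
\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i
```

Empty bins contribute nothing to the sum, which is what the `if prop_in_bin.item() > 0` check implements.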

#### **Binary Classification:**

In [None]:
# Data
samples = np.array([[0.78, 0.22],
                    [0.36, 0.64],
                    [0.08, 0.92],
                    [0.58, 0.42],
                    [0.49, 0.51],
                    [0.85, 0.15],
                    [0.30, 0.70],
                    [0.63, 0.37],
                    [0.17, 0.83]])

true_labels = np.array([0,1,0,0,0,0,1,1,1])


expected_calibration_error(samples, true_labels)

array([0.10444444])
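
To see where this number comes from, it can help to look at the individual bins. The sketch below uses a hypothetical helper, `bin_statistics` (not part of the original code), that applies the same binning rule as the function above and returns the count, accuracy and mean confidence of every non-empty bin:

```python
import numpy as np


def bin_statistics(samples, true_labels, M=5):
    """Hypothetical helper: (count, accuracy, mean confidence) per non-empty bin."""
    bin_boundaries = np.linspace(0, 1, M + 1)
    confidences = np.max(samples, axis=1)
    correct = np.argmax(samples, axis=1) == true_labels
    rows = []
    for lower, upper in zip(bin_boundaries[:-1], bin_boundaries[1:]):
        # same half-open bin rule as above: lower < confidence <= upper
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            rows.append((int(in_bin.sum()),
                         float(correct[in_bin].mean()),
                         float(confidences[in_bin].mean())))
    return rows


samples = np.array([[0.78, 0.22], [0.36, 0.64], [0.08, 0.92],
                    [0.58, 0.42], [0.49, 0.51], [0.85, 0.15],
                    [0.30, 0.70], [0.63, 0.37], [0.17, 0.83]])
true_labels = np.array([0, 1, 0, 0, 0, 0, 1, 1, 1])

rows = bin_statistics(samples, true_labels)
n = len(true_labels)
for count, acc, conf in rows:
    print(f"|Bm|={count}  acc={acc:.4f}  conf={conf:.4f}  gap={abs(acc - conf):.4f}")

# weighted sum of the per-bin gaps reproduces the ECE
ece = sum(count / n * abs(acc - conf) for count, acc, conf in rows)
print(round(ece, 8))
```

Only three of the five bins are populated, with 2, 4 and 3 samples and gaps of 0.045, 0.0625 and 0.2, so the ECE is (2 * 0.045 + 4 * 0.0625 + 3 * 0.2) / 9 ≈ 0.1044.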

#### **Multi-class Classification:**

In addition to the binary example, the same function also handles multi-class classification.
Here we use the example data from [James D. McCaffrey](https://jamesmccaffrey.wordpress.com/2021/01/22/how-to-calculate-expected-calibration-error-for-multi-class-classification/).

**_You can skip to the PyTorch code below_** if you are only interested in following the **_binary example_** from above.

In [None]:
target_classes = ["democrat", "republican", "independent", "green", "libertarian"]

In [None]:
# Data
samples_multi = np.array([[0.25,0.2,0.22,0.18,0.15],
                          [0.16,0.06,0.5,0.07,0.21],
                          [0.06,0.03,0.8,0.07,0.04],
                          [0.02,0.03,0.01,0.04,0.9],
                          [0.4,0.15,0.16,0.14,0.15],
                          [0.15,0.28,0.18,0.17,0.22],
                          [0.07,0.8,0.03,0.06,0.04],
                          [0.1,0.05,0.03,0.75,0.07],
                          [0.25,0.22,0.05,0.3,0.18],
                          [0.12,0.09,0.02,0.17,0.6]])

true_labels_multi = np.array([0,2,3,4,2,0,1,3,3,2])


expected_calibration_error(samples_multi, true_labels_multi, M=3)

array([0.192])

This outputs **_0.192_**, which differs from [McCaffrey's](https://jamesmccaffrey.wordpress.com/2021/01/22/how-to-calculate-expected-calibration-error-for-multi-class-classification/) calculation by **_0.002_** due to _differences in rounding!_

If you run the last step from McCaffrey's article, [(3 * 0.39) + (3 * 0.17) + (4 * 0.06)] / 10, through a calculator or Python, you should also end up with 0.192:

In [None]:
((3 * 0.39) + (3 * 0.17) + (4 * 0.06)) / 10

0.192
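
The same value also falls out when the per-bin gaps are kept exact instead of being rounded to two decimals. The sketch below redoes the calculation with Python's `fractions.Fraction`. The `pairs` list is a hand-transcribed assumption: it pairs the maximum probability of each row of `samples_multi` with whether the argmax matches `true_labels_multi`.

```python
from fractions import Fraction

# (confidence, prediction correct?) for the ten rows of samples_multi,
# transcribed by hand from the data above
pairs = [("0.25", True), ("0.5", True), ("0.8", False), ("0.9", True),
         ("0.4", False), ("0.28", False), ("0.8", True), ("0.75", True),
         ("0.3", True), ("0.6", False)]

M = 3
edges = [Fraction(k, M) for k in range(M + 1)]  # exact bin boundaries 0, 1/3, 2/3, 1
n = len(pairs)

ece = Fraction(0)
for lower, upper in zip(edges[:-1], edges[1:]):
    members = [(Fraction(c), ok) for c, ok in pairs if lower < Fraction(c) <= upper]
    if members:
        conf = sum(c for c, _ in members) / len(members)
        acc = Fraction(sum(ok for _, ok in members), len(members))
        # |acc(Bm) - conf(Bm)| * (|Bm|/n), kept as an exact fraction
        ece += abs(conf - acc) * Fraction(len(members), n)

print(ece, float(ece))
```

The exact gaps are 39/100, 1/6 and 1/16, and the weighted sum is 24/125, i.e. exactly 0.192.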

-----------

## PyTorch
We will now repeat the same two examples using PyTorch.

### Definition of the ECE function:
We now have to slightly adapt the function using _torch_ methods instead of _numpy_ ones:

In [None]:
import torch

def expected_calibration_error(samples, true_labels, M=5):
    # uniform binning approach with M number of bins
    bin_boundaries = torch.linspace(0, 1, M + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    # get max probability per sample i (confidences) and the final predictions from these confidences
    confidences, predicted_label = torch.max(samples, 1)


    # get a boolean list of correct/false predictions
    accuracies = predicted_label.eq(true_labels)

    ece = torch.zeros(1)
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # determine if sample is in bin m (between bin lower & upper)
        in_bin = confidences.gt(bin_lower.item()) * confidences.le(bin_upper.item())
        # can calculate the empirical probability of a sample falling into bin m: (|Bm|/n)
        prop_in_bin = in_bin.float().mean()
        if prop_in_bin.item() > 0:
            # get the accuracy of bin m: acc(Bm)
            accuracy_in_bin = accuracies[in_bin].float().mean()
            # get the average confidence of bin m: conf(Bm)
            avg_confidence_in_bin = confidences[in_bin].mean()
            # calculate |acc(Bm) - conf(Bm)| * (|Bm|/n) for bin m and add to the total ECE
            ece += torch.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin

    return ece
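
The main difference from the NumPy version is that `torch.max` with a dimension argument returns both the maximum values and their indices, so one call replaces the separate `np.max` and `np.argmax` calls. A minimal illustration on two rows of the binary data (the name `probs` is used only for this example):

```python
import torch

probs = torch.tensor([[0.78, 0.22],
                      [0.36, 0.64]])

# torch.max over dim=1 returns a (values, indices) pair
confidences, predicted_label = torch.max(probs, 1)

print(confidences)      # max probability per row
print(predicted_label)  # column index of that maximum, i.e. the predicted class
```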

-----


Instead of setting up the data with _np.array()_, we now use _torch.tensor()_.
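
If the data already exists as a NumPy array, it can also be converted rather than retyped. A small sketch, assuming the first two rows of the binary data and PyTorch's default dtype settings: `torch.from_numpy` keeps NumPy's `float64` dtype, whereas `torch.tensor` applied to Python floats gives `float32`, which is one reason the printed results show fewer digits than the NumPy output.

```python
import numpy as np
import torch

samples_np = np.array([[0.78, 0.22],
                       [0.36, 0.64]])

from_numpy = torch.from_numpy(samples_np)      # shares memory with the array, dtype float64
from_list = torch.tensor(samples_np.tolist())  # new tensor from Python floats, dtype float32

print(from_numpy.dtype, from_list.dtype)
```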


#### **Binary Classification**

In [None]:
# Data
samples = torch.tensor([[0.78, 0.22],
                        [0.36, 0.64],
                        [0.08, 0.92],
                        [0.58, 0.42],
                        [0.49, 0.51],
                        [0.85, 0.15],
                        [0.30, 0.70],
                        [0.63, 0.37],
                        [0.17, 0.83]])

true_labels = torch.tensor([0,1,0,0,0,0,1,1,1])


expected_calibration_error(samples, true_labels)

tensor([0.1044])

#### **Multi-class Classification**

In [None]:
# Data
samples_multi = torch.tensor([[0.25,0.2,0.22,0.18,0.15],
                              [0.16,0.06,0.5,0.07,0.21],
                              [0.06,0.03,0.8,0.07,0.04],
                              [0.02,0.03,0.01,0.04,0.9],
                              [0.4,0.15,0.16,0.14,0.15],
                              [0.15,0.28,0.18,0.17,0.22],
                              [0.07,0.8,0.03,0.06,0.04],
                              [0.1,0.05,0.03,0.75,0.07],
                              [0.25,0.22,0.05,0.3,0.18],
                              [0.12,0.09,0.02,0.17,0.6]])

true_labels_multi = torch.tensor([0,2,3,4,2,0,1,3,3,2])


expected_calibration_error(samples_multi, true_labels_multi, M=3)

tensor([0.1920])