This repository includes a notebook showing the steps for assessing and fixing the calibration quality of a probabilistic classifier, and other notebooks for various analysis experiments. Before going into the code, though, we will give a very brief introduction to probabilistic classifiers and how to measure their performance, explaining how it can be decomposed into discrimination plus calibration components. We also discuss our view on the purpose of calibration analysis. References, proofs, and more detailed explanations on these issues can be found in the following paper:
L. Ferrer, "Analysis and Comparison of Classification Metrics", arXiv:2209.05355
While the theory behind calibration can be a bit hard to grasp at first, fortunately, in practice, measuring and fixing calibration is extremely simple. If you just want to dive directly into the practical aspects, go straight to the first notebook.
Note that the actual toolkit is provided as two Python packages, expected_cost and psrcal, which can be installed with pip. The notebooks here are a gentle introduction to how to use those packages for assessing the performance of posteriors.
In machine learning, a classifier is a system that is designed to predict a categorical class for each input sample. While the goal of these systems is to produce class labels, most classifiers have an intermediate step where they produce numerical scores for each class. The final label for a sample is then derived from these scores, for example, by choosing the class with the maximum score.
A probabilistic classifier is a special type of classifier that produces scores that are posterior probabilities for the classes given the input features.
Most modern machine learning classification methods fall into this category. For example, any model trained with cross-entropy as the objective function produces posterior probabilities.
The advantage of (good) probabilistic classifiers is that they produce outputs that can be interpreted. Also, as explained below, posterior probabilities can be used to make decisions using Bayes decision theory. Bayes decisions are optimal in the sense that they minimize the expected value of a cost function of interest when all we know about our samples are the posteriors generated by the system.
Assume we have a certain cost function $C(y, d)$ that quantifies the cost we incur when we decide class $d$ for a sample whose true class is $y$. Given the posteriors $P(y|x)$ produced by our model for an input $x$, the Bayes decision is

$$d^* = \arg\min_d \sum_y P(y|x)\, C(y, d),$$

where the expectation is taken with respect to some reference distribution, in this case the posterior distribution given by our model. These decisions minimize the expectation of the cost with respect to that reference distribution.
Note that the Bayes decisions depend on the input features only through our model for the posterior probabilities. Hence, the quality of the decisions is entirely determined by the quality of the posteriors.
The usual argmax method used for multi-class classification, where the class with the maximum score is selected, is one instance of Bayes decision theory: it corresponds to the 0-1 cost function, which assigns a cost of 0 to correct decisions and 1 to errors.
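The Bayes decision rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the toolkit, and the cost matrices here are invented for the example:

```python
import numpy as np

def bayes_decisions(posteriors, cost_matrix):
    """Choose, for each sample, the decision with minimum expected cost.

    posteriors:  (N, C) array, one posterior distribution per sample.
    cost_matrix: (C, D) array, cost_matrix[y, d] = cost of deciding d
                 when the true class is y.
    """
    # Expected cost of each decision for each sample: (N, D)
    expected_costs = posteriors @ cost_matrix
    return expected_costs.argmin(axis=1)

posteriors = np.array([[0.7, 0.3],
                       [0.4, 0.6]])

# With the 0-1 cost, Bayes decisions reduce to the usual argmax.
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
print(bayes_decisions(posteriors, zero_one))    # [0 1]

# An asymmetric cost (missing class 2 is 5x worse) can flip the decisions.
asymmetric = np.array([[0.0, 1.0],
                       [5.0, 0.0]])
print(bayes_decisions(posteriors, asymmetric))  # [1 1]
```

Note how the same posteriors lead to different optimal decisions under different cost functions; this is why keeping the posteriors, rather than hard labels, is valuable.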
In summary, making Bayes decisions is trivial once you have the posteriors. Of course, Bayes decisions will be good if the posteriors are good and bad if the posteriors are bad. So, how do we know if the posteriors generated by our classifier are good or bad? We use proper scoring rules.
A proper scoring rule (PSR) is a function $C(y, q)$ of the true class $y$ and a posterior distribution $q$ whose expected value, taken with respect to some distribution over the classes, is minimized when $q$ coincides with that distribution.
This property gives us an intuition of why PSRs are good for assessing the performance of posterior probabilities. They encourage the scores to be the same as the distribution with respect to which we are taking the expectation. So, if we could take the expectation with respect to the true distribution of the classes given the input, the scores that minimize the expected PSR would be exactly the true posteriors.
A perhaps more direct way to understand PSRs is by looking at how they can be constructed. Given a certain cost function, the PSR for a sample is the cost incurred when making the Bayes decision implied by the system's posteriors for that sample.
Simply put, PSRs measure the quality of the posteriors by assuming we will use them to make Bayes decisions and measuring the quality of those decisions. Remember that Bayes decisions are those that optimize the expected cost. Hence, PSRs simply measure how good is the best we can do with the posteriors we have.
Note that PSRs define a cost function per sample. To create a metric that reflects the performance of the posteriors over a dataset, we simply take the expectation of this function with respect to the distribution of the data. In practice, we usually just take the average PSR over our dataset. We call the resulting value expected PSR (EPSR).
$$\mathrm{EPSR} = \frac{1}{N} \sum_{i=1}^{N} C(y_i, q_i),$$

where $y_i$ is the true class of sample $i$, $q_i$ is the posterior distribution produced by the system for that sample, and $N$ is the number of samples in the dataset.
The most common PSR is the negative logarithmic loss. The expectation of this loss is the widely used cross-entropy. Another common PSR is the Brier loss, commonly used in medical applications.
EPSRs can be normalized by dividing them by the EPSR of a system that outputs the prior distribution for every sample. This is the best possible system among those that do not have access to the input samples. Normalized EPSRs are easily interpretable: if their value is larger than 1.0 then it means the system is worse than a system that does not have access to the input samples. We would, in fact, be better off throwing away the system and replacing it by one that outputs the priors. So, if we want informative posteriors, at the very least we need normalized EPSRs lower than 1.0. If a system has a normalized cross-entropy larger than 1.0, we can immediately say that its posteriors are bad.
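As a minimal NumPy sketch of this normalization (illustrative only, not the expected_cost API), the normalized cross-entropy can be computed as follows:

```python
import numpy as np

def cross_entropy(posteriors, labels):
    """Average negative log posterior of the true class (in nats)."""
    return -np.mean(np.log(posteriors[np.arange(len(labels)), labels]))

def normalized_cross_entropy(posteriors, labels):
    """Cross-entropy divided by the cross-entropy of a system that
    outputs the class priors (estimated from the labels) for every sample."""
    priors = np.bincount(labels) / len(labels)
    prior_system = np.tile(priors, (len(labels), 1))
    return cross_entropy(posteriors, labels) / cross_entropy(prior_system, labels)

labels = np.array([0, 0, 1, 1])
good = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
bad = good[:, ::-1]   # same scores with the classes swapped

print(normalized_cross_entropy(good, labels))  # < 1: informative posteriors
print(normalized_cross_entropy(bad, labels))   # > 1: worse than outputting the priors
```

The second system is worse than one that ignores the input entirely, so its posteriors should be discarded, exactly as the text above describes.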
Note that, up to here we have not needed to talk about calibration. The quality of the posteriors is measured by expected PSRs (EPSRs) like the cross-entropy. If all we want to know is how good our posteriors are, we do not need to assess calibration quality. Yet, during development, we might wish to analyze whether our classifier is well calibrated because if it is not, we can get very cheap gains from fixing that problem.
A probabilistic classifier is said to be calibrated if the scores it generates match the frequency of the classes given those scores.
That is, a system with posteriors $q(x)$ is calibrated if

$$P(y \mid q(x)) = q_y(x),$$

i.e., among all samples that receive a given score vector, the fraction that belongs to class $y$ equals the score the system assigns to that class.
How can we check if this equality is satisfied for our scores? Unfortunately, we cannot since, again, the "true" distribution is not accessible to us in practical problems and it never will be. Hence, we have no way to compare our scores to this unknown distribution. To move forward, we need to build an empirical model of $P(y \mid q(x))$ from our data.
For binary classification, a simple way to obtain such a model is histogram binning. The scores for class 2 in our dataset are binned into a certain number of intervals, and within each bin the fraction of samples that actually belong to class 2 serves as an empirical estimate of the posterior for the scores in that bin. We can then compute some kind of distance between the score and this empirical frequency. The most widely used metric of this family is the expected calibration error (ECE),

$$\mathrm{ECE} = \sum_b \frac{N_b}{N} \left| \bar{s}_b - f_b \right|,$$

where $N_b$ is the number of samples in bin $b$, $\bar{s}_b$ is the average class-2 score in that bin, and $f_b$ is the fraction of those samples whose true class is 2.
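As an illustrative sketch (not the toolkit's implementation), the binary ECE with equal-width bins can be computed as follows:

```python
import numpy as np

def binary_ece(scores, labels, n_bins=10):
    """Expected calibration error for a binary problem.

    scores: (N,) array of posteriors for class 2.
    labels: (N,) array of 0s and 1s (1 = class 2).
    """
    # Assign each score to an equal-width bin in [0, 1].
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        avg_score = scores[mask].mean()  # mean predicted probability in the bin
        freq = labels[mask].mean()       # empirical class-2 frequency in the bin
        ece += mask.mean() * abs(avg_score - freq)
    return ece

scores = np.array([0.9, 0.9, 0.9, 0.9])
labels = np.array([0, 0, 0, 1])
print(binary_ece(scores, labels))  # ≈ 0.65: the system is badly overconfident
```

Here the system claims 90% confidence while only 25% of the samples belong to class 2, so almost all of its score mass is miscalibrated.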
The figure below shows the reliability diagram for a poorly calibrated system, where we can see that the scores from the system do not match the empirical frequencies for class 2.
The ECE metric is intuitive, but it is a pretty poor calibration metric for several reasons: it is based on histogram binning, which is a rather fragile calibration approach; it takes values that are not easily interpretable; and it does not generalize well to the multi-class case. These issues are discussed in one of the notebooks and in the paper cited above.
A much better calibration metric, in our opinion, is the so-called calibration loss, defined as

$$\mathrm{CalLoss} = \mathrm{EPSR}_{\mathrm{raw}} - \mathrm{EPSR}_{\mathrm{min}},$$

where EPSRraw is some EPSR evaluated on the raw scores produced by the system, and EPSRmin is the same EPSR evaluated on those scores after transforming them with a post-hoc calibration transform.
Note the similarity between this expression and that of the ECE: both are averages over the data of some function of the raw and calibrated scores. In the calibration loss, that function is defined based on PSRs, which, as we will see next, allows for a very useful decomposition.
Note also that the calibration transform used to obtain the calibrated scores has to be chosen and trained somehow; different choices lead to different estimates of the calibration loss, an issue we return to below.
The CalLoss metric defined above provides us with a very useful decomposition of the EPSR of the classifier:

$$\mathrm{EPSR}_{\mathrm{raw}} = \mathrm{EPSR}_{\mathrm{min}} + \mathrm{CalLoss}.$$
The CalLoss term is the part of EPSRraw that can be reduced by doing calibration, i.e., just by transforming the scores generated by the classifier. The EPSRmin term reflects the part of EPSRraw that cannot be improved by calibration, usually called the discrimination, refinement or sharpness. To improve a system's discrimination performance we need to go into the classifier and change the features, the architecture, the training data, the hyperparameters, etc. This is, in most cases, a much more costly endeavour than adding a calibration stage.
The calibration analysis tells us whether it is possible to get cheap gains in the performance of the posteriors just by adding a calibration stage to the system.
How do we decide whether adding a calibration stage is worthwhile? The absolute CalLoss does not tell us much: a CalLoss of 0.05 means different things if EPSRraw is 0.1 than if it is 0.5. In the first case, 50% of EPSRraw is due to miscalibration, while in the second case it is only 10%, and one might decide that the extra stage is not necessary. For this reason, it is useful to compute the relative calibration loss, given by:

$$\mathrm{RelCalLoss} = 100 \times \frac{\mathrm{CalLoss}}{\mathrm{EPSR}_{\mathrm{raw}}}.$$
With this metric we can then directly judge the impact that adding a calibration stage would have on the quality of the posteriors. If the percentage is small, then we do not need to bother about adding a calibration stage. If the percentage is large, then we know that we can get a large relative improvement quite cheaply by adding a calibration stage to the system.
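The arithmetic involved is trivial; as a minimal sketch, the numbers here match the example.py output shown later in this README:

```python
def calibration_losses(epsr_raw, epsr_min):
    """CalLoss = EPSR_raw - EPSR_min, plus the relative CalLoss in percent."""
    cal_loss = epsr_raw - epsr_min
    rel_cal_loss = 100.0 * cal_loss / epsr_raw
    return cal_loss, rel_cal_loss

# Log loss before and after post-hoc calibration.
cal_loss, rel = calibration_losses(epsr_raw=7.33, epsr_min=0.39)
print(f"Calibration loss = {cal_loss:.2f}")       # 6.94
print(f"Relative calibration loss = {rel:.1f}%")  # 94.7%
```

A relative calibration loss of 94.7% means that almost all of the raw EPSR is due to miscalibration, so a calibration stage would yield a large, cheap improvement.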
How do we find the post-hoc calibration transformation? One option is histogram binning, as described above for the ECE: the calibrated score for a sample is the empirical class frequency of the bin into which its raw score falls.
Another common calibration method is to transform the scores with an affine transformation:

$$\tilde{q} = \mathrm{softmax}(\alpha z + \beta),$$

where $z$ is the vector of log posteriors (or logits) produced by the system, and the parameters $\alpha$ and $\beta$ are trained to minimize an EPSR, typically the cross-entropy, on some dataset. Temperature scaling is the special case of this transform with $\beta = 0$.
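As an illustrative sketch of training such a transform for the binary case using scipy (this is not the psrcal API; the package provides proper multi-class implementations), we can fit the affine parameters by minimizing the log loss. For simplicity, the transform here is trained and evaluated on the same simulated data; the notebooks discuss which data should be used in practice:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def train_affine_calibration(scores, labels):
    """Fit calibrated = sigmoid(a * logit(score) + b) by minimizing log loss."""
    eps = 1e-12
    z = logit(np.clip(scores, eps, 1 - eps))

    def log_loss(params):
        a, b = params
        p = np.clip(expit(a * z + b), eps, 1 - eps)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    a, b = minimize(log_loss, x0=[1.0, 0.0], method="Nelder-Mead").x
    return lambda s: expit(a * logit(np.clip(s, eps, 1 - eps)) + b)

# Simulate an overconfident system: its log-odds are 3x the true log-odds.
rng = np.random.default_rng(0)
z_true = rng.normal(size=2000)
labels = (rng.random(2000) < expit(z_true)).astype(float)
raw = expit(3.0 * z_true)

calibrate = train_affine_calibration(raw, labels)

def ll(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

print(ll(raw), ll(calibrate(raw)))  # log loss drops after calibration
```

Since the simulated scores are overconfident by a factor of 3 in the log-odds domain, the fitted scale parameter ends up near 1/3, undoing the miscalibration.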
Note that the definition of calibration loss does not specify which calibration transform to use or how to train it. This can feel somewhat unsettling, since it means that we would get different estimates of the calibration loss depending on how we choose to calibrate. This problem is, in fact, inherent to the concept of calibration: it stems from the fact that, as discussed above, the posterior distribution we would like to compare against is defined by the true data distribution, which is not accessible to us.
The essential thing to note is that diagnosing and fixing calibration issues is one and the same process when using CalLoss as metric. To know whether our classifier is well-calibrated we need to go ahead and attempt to add a calibration stage to the system, following good development practices as for any other system stage. If adding a calibration stage led to improved EPSR, then we have both diagnosed and fixed the calibration problem at the very same time.
EPSRs are designed for the assessment of the quality of posterior probabilities. The widely-used cross-entropy is a general-purpose example of an EPSR. When comparing two systems, a better cross-entropy tells us that the posteriors are better for making Bayes decisions and also more useful, more informative, more interpretable. To judge whether a classifier's posteriors are good we do not need calibration metrics, all we need are EPSRs.
Calibration metrics tell us whether the posteriors can be improved by adding a calibration stage. They are useful during development, when we still have a chance to modify the system. If we find we have a calibration problem, we can add a calibration stage and get cheap improvements in the performance of the posteriors. Of course, we can also try to get better posteriors by playing with the classifier itself, but that is usually a much more costly endeavour. Hence, if we care about having good posteriors, we should always first do a calibration analysis and get the most out of the classifier we already have. If, after calibration, the cross-entropy is still not considered good enough, then we know we need to roll up our sleeves and attempt to improve the classifier.
The bottom line is: if we have two systems $S_1$ and $S_2$ such that

$$\mathrm{EPSR}(S_1) < \mathrm{EPSR}(S_2)$$

but

$$\mathrm{CalLoss}(S_1) > \mathrm{CalLoss}(S_2),$$

then $S_1$ still produces the better posteriors: the EPSR, not the calibration metric, is what measures their quality.
The script example.py contains all the steps needed for evaluating the posteriors from a classifier. It also shows how to implement the post-hoc calibration stage if it turns out that the classifier's output is poorly calibrated. It is run as follows:
```
python ./example.py
```

and generates the following output:

```
Overall performance before calibration (LogLoss) = 7.33
Overall performance after calibration (LogLoss) = 0.39
Calibration loss = 6.94
Relative calibration loss = 94.7%
```
You can use it to run the same analysis in your own data by changing the initial part where the scores and label variables are defined.
This repository includes various notebooks that explain in detail through examples different practical and theoretical issues about assessing the quality of posterior probabilities. Below we briefly describe the contents of each notebook.
You can find the notebook here.
This notebook shows how to measure the discrimination and calibration performance of a classifier and, if necessary, how to fix calibration. It includes three different scenarios, depending on which data the calibration transform is trained on:
- Training the transform on the test data (as usually done in the literature when computing calibration metrics like the ECE and many others)
- Training the transform on held-out data (a good option when you have plenty of data)
- Training the transform through cross-validation on the test data (a good option when you have a small test dataset)
You can find the notebook here.
This notebook compares the calibration loss, computed with different calibration approaches, against the ECE. It explores the issue of overfitting the calibration transform and the effect of the train-on-test approach. It shows cases in which the ECE fails to diagnose miscalibration, and that, even when the ECE works reasonably well, the calibration loss is a much more interpretable metric.
You can find the notebook here.
This notebook shows how to compute confidence intervals using the bootstrapping method for any performance metric and, in particular, for the calibration loss, which presents some additional complexity.
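The percentile bootstrap for a generic metric can be sketched as follows (a minimal illustration, assuming the metric is a plain function of scores and labels; for the calibration loss, the calibration transform must be handled on each resample, which is the extra complexity the notebook addresses):

```python
import numpy as np

def bootstrap_ci(metric, scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(scores, labels)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample samples with replacement
        values.append(metric(scores[idx], labels[idx]))
    return tuple(np.quantile(values, [alpha / 2, 1 - alpha / 2]))

# Example: 95% CI for the log loss of some simulated binary posteriors.
def log_loss(scores, labels):
    p = np.clip(np.where(labels == 1, scores, 1 - scores), 1e-12, None)
    return -np.mean(np.log(p))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=500)
scores = np.clip(np.where(labels == 1, 0.8, 0.2)
                 + rng.normal(0, 0.05, 500), 0.01, 0.99)
lo, hi = bootstrap_ci(log_loss, scores, labels)
print(lo, hi)  # interval around the point estimate of the log loss
```

The same function works for any metric that takes scores and labels, which is what makes the bootstrap such a convenient general-purpose tool for reporting uncertainty.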
You can look at the pre-run notebooks directly from this website, but if you want to play with them, change values and re-run, or run the example.py script you should follow these steps:
1. Install the two Python packages:

   ```
   pip install expected_cost
   pip install psrcal
   ```

2. Test the installation by running the example.py script.

3. Download the notebook of your choice from the notebooks directory (or clone the whole repository if you prefer), open jupyter notebook, and load the notebook. You can then play with the code and see how the results or plots change.
If you just want to use the packages within your own code, you only need to do step 1 (and 2, if you want to make sure everything works fine before importing the packages in your code).
Comments or questions?
If you find mistakes or have any comments or questions, please feel free to raise an issue on GitHub.
Also, please leave a star if you liked this repo.
Acknowledgment
This material is partly based upon work supported by the European Union’s Horizon 2020 research and innovation program under grant No 101007666/ESPERANTO/H2020-MSCA-RISE-2020 and by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0124. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).