\section{Setup and background}
\label{sec:formulation}
\newcommand{\ce}[0]{\ensuremath{\textup{CE}}}
\newcommand{\lpce}[0]{\ensuremath{\ell_p\textup{-CE}}}
\newcommand{\ltwoce}[0]{\ensuremath{\ell_2\textup{-CE}}}
\newcommand{\lsquared}[0]{\ensuremath{L^2}}
\newcommand{\squaredce}[0]{\ensuremath{L^2\textup{-CE}}}
\newcommand{\topce}[0]{\ensuremath{\textup{TCE}}}
\newcommand{\margsquaredce}[0]{\ensuremath{\textup{MCE}}}
\newcommand{\mse}[0]{\ensuremath{\textup{MSE}}}
\subsection{Binary classification}
Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the label space where $\mathcal{Y} = \{0, 1\}$ for binary classification.
Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables denoting the input and label, drawn from an unknown joint distribution $\mathbb{P}$. As usual, expectations are taken over all random variables.
Suppose we have a model $f : \mathcal{X} \to [0, 1]$ where the (possibly uncalibrated) output of the model represents the model's confidence that the label is 1. The calibration error examines the difference between the model's probability and the true probability given the model's output:
\begin{definition}[Calibration error]
The calibration error of $f : \mathcal{X} \to [0, 1]$ is given by:
\begin{align}
\ce(f) = \Big(\expect\big[ \left|f(X) - \expect[Y \mid f(X)] \right|^2 \big] \Big)^{1/2}
\end{align}
\end{definition}
If the calibration error is $0$, the model is perfectly calibrated. This notion of calibration error is the most commonly used~\cite{murphy1973vector,murphy1977reliability,degroot1983forecasters, nguyen2015posterior, hendrycks2019anomaly, kuleshov2015calibrated, hendrycks2019pretraining, brocker2012empirical}. Replacing the $2$ in the above definition by $p \geq 1$ gives the $\ell_p$ calibration error; the $\ell_1$ and $\ell_{\infty}$ calibration errors are also used in the literature~\cite{guo2017calibration, naeini2015obtaining, nixon2019calibration}.
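In practice the conditional expectation $\expect[Y \mid f(X)]$ is not observed directly, so the calibration error is typically estimated from samples by binning the model's outputs. The following Python sketch illustrates such a plug-in estimate; the equal-width binning scheme and the number of bins are illustrative choices.
\begin{verbatim}
import numpy as np

def binned_l2_calibration_error(probs, labels, num_bins=10):
    """Plug-in estimate of the binary l2 calibration error (sketch).

    Bins the predictions into equal-width bins and, within each bin,
    compares the average prediction to the empirical label frequency.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bin_ids = np.minimum((probs * num_bins).astype(int), num_bins - 1)
    squared_error = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        avg_prob = probs[mask].mean()    # average confidence in bin b
        avg_label = labels[mask].mean()  # empirical frequency of label 1
        squared_error += mask.mean() * (avg_prob - avg_label) ** 2
    return np.sqrt(squared_error)
\end{verbatim}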
Calibration alone is not sufficient: consider an image dataset containing $50\%$ dogs and $50\%$ cats.
If $f$ outputs $0.5$ on all inputs, $f$ is calibrated but not very useful.
We often also wish to minimize the mean-squared error---also known as the Brier score---as defined below.
\begin{definition}
The mean-squared error of $f : \mathcal{X} \to [0, 1]$ is given by $\mse(f) = \expect[(f(X) - Y)^2]$.
\end{definition}
We often want to minimize the MSE subject to a calibration constraint~\cite{gneiting2005weather, gneiting2007probabilistic}. Of course, these two measures are not independent: $\mse(f) = 0$ implies perfect calibration, and in fact the MSE is the sum of the squared calibration error and a ``sharpness'' term~\cite{murphy1973vector,degroot1983forecasters, kuleshov2015calibrated}.
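Concretely, this decomposition follows by conditioning on $f(X)$:
\begin{align}
\mse(f) = \underbrace{\expect\big[\big(f(X) - \expect[Y \mid f(X)]\big)^2\big]}_{\ce(f)^2} \;+\; \underbrace{\expect\big[\textup{Var}(Y \mid f(X))\big]}_{\text{``sharpness'' term}},
\end{align}
since the cross term $\expect\big[\big(f(X) - \expect[Y \mid f(X)]\big)\big(\expect[Y \mid f(X)] - Y\big)\big]$ vanishes by the tower property.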
\subsection{Multiclass classification}
While calibration in binary classification is well-studied,
the appropriate notion of calibration for multiclass classification is less clear: multiple definitions exist, differing in their strengths. In the multiclass setting, $\mathcal{Y} = [K] = \{1, \dots, K\}$ and $f : \mathcal{X} \to [0, 1]^K$ outputs a confidence score for each of the $K$ classes.
\begin{definition}[Top-label calibration error]
The top-label calibration error examines the difference between the model's probability for its top prediction and the true probability of that prediction given the model's output:
\begin{align}
\topce(f) = \Big( \expect\Big[ \Big( \prob\big(Y = \argmax_{j \in [K]} f(X)_j \mid \max_{j \in [K]} f(X)_j\big) - \max_{j \in [K]} f(X)_j \Big)^2 \Big] \Big)^{1/2}
\end{align}
\end{definition}
We would often like the model to be calibrated on less likely predictions as well---imagine that a medical diagnosis system says there is a $50\%$ chance a patient has a benign tumor, a $10\%$ chance she has an aggressive form of cancer, and a $40\%$ chance she has one of a long list of other conditions. We would like the model to be calibrated on all of these predictions, so we define the marginal calibration error, which examines, \emph{for each class}, the difference between the model's probability and the true probability of that class given the model's output.
\begin{definition}[Marginal calibration error]
\label{dfn:marginal-ce}
Let $w_k \in [0, 1]$ denote the importance of calibrating class $k$, where $w_k = 1/K$ if all classes are equally important. The marginal calibration error is:
\begin{align}
\margsquaredce(f) = \Big( \sum_{k = 1}^K w_k \expect\big[ (f(X)_k - \prob(Y = k \mid f(X)_k))^2 \big] \Big)^{1/2}
\end{align}
\end{definition}
Prior work~\cite{guo2017calibration, hendrycks2019anomaly, hendrycks2019pretraining} proposes methods for multiclass calibration but only measures top-label calibration. \cite{nixon2019calibration} and concurrent work to ours~\cite{kull2019temperature} define similar per-class calibration metrics, under which temperature scaling~\cite{guo2017calibration} performs worse than vector scaling despite having better top-label calibration.
% In the Appendix we discuss other notions of multiclass calibration.
For notational simplicity, our theory focuses on the binary classification setting. We can transform top-label calibration into a binary calibration problem: the model outputs the probability of its top prediction, and the label indicates whether that prediction is correct. Marginal calibration can be transformed into $K$ one-vs-all binary calibration problems, where for each $k \in [K]$ the model outputs the probability associated with the $k$-th class, and the label indicates whether the true class is $k$~\cite{zadrozny2002transforming}. We consider both top-label calibration and marginal calibration in our experiments.
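The following Python sketch makes these two reductions concrete (array shapes and helper names are illustrative):
\begin{verbatim}
import numpy as np

def top_label_problem(probs, labels):
    """Reduce top-label calibration to a binary problem.

    probs: (n, K) array of per-class confidences; labels: (n,) true classes.
    Returns the confidence of each top prediction and whether it is correct.
    """
    preds = probs.argmax(axis=1)
    z = probs.max(axis=1)
    y = (preds == labels).astype(float)
    return z, y

def one_vs_all_problems(probs, labels):
    """Reduce marginal calibration to K one-vs-all binary problems.

    For each class k, yields the predicted probability of class k and a
    binary label indicating whether the true class is k.
    """
    num_classes = probs.shape[1]
    for k in range(num_classes):
        yield probs[:, k], (labels == k).astype(float)
\end{verbatim}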
Other notions of multiclass calibration include joint calibration (which requires the entire probability \emph{vector} to be calibrated)~\cite{murphy1973vector, brocker2009decomposition} and event-pooled calibration~\cite{kuleshov2015calibrated}.
% Achieving joint calibration efficiently is an important direction for future research.
\subsection{Recalibration}
Since most machine learning models do not output calibrated probabilities out of the box~\cite{guo2017calibration, zadrozny2001calibrated}, recalibration methods take the output of an uncalibrated model and transform it into a calibrated probability. That is, given a trained model $f: \mathcal{X} \to [0, 1]$, let $Z = f(X)$. We are given recalibration data $T = \{ (z_i, y_i) \}_{i=1}^n$ sampled independently from $\mathbb{P}(Z, Y)$, and we wish to learn a recalibrator $g : [0, 1] \to [0, 1]$ such that $g \circ f$ is well-calibrated.
\emph{Scaling methods}, for example Platt scaling~\cite{platt1999probabilistic}, output a function $g = \argmin_{g \in \mathcal{G}} \sum_{(z, y) \in T} \ell(g(z), y)$, where $\mathcal{G}$ is a model family, $g \in \mathcal{G}$ is differentiable, and $\ell$ is a loss function, for example the log-loss or mean-squared error. The advantage of such methods is that they converge very quickly since they only fit a small number of parameters.
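As an illustration, the sketch below fits a Platt-style scaler $g(z) = \sigma(a \cdot \textup{logit}(z) + b)$, where $\sigma$ is the sigmoid, by minimizing the log-loss; the use of scikit-learn and the logit transform of the score are implementation choices, not prescribed by the method.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_scaler(z, y, eps=1e-12):
    """Fit g(z) = sigmoid(a * logit(z) + b) on recalibration data (z_i, y_i).

    A sketch of Platt-style scaling: a two-parameter sigmoid fit by
    minimizing the log-loss.
    """
    z = np.clip(np.asarray(z, dtype=float), eps, 1 - eps)
    logits = np.log(z / (1 - z)).reshape(-1, 1)
    model = LogisticRegression(C=1e6)  # large C: effectively no regularization
    model.fit(logits, y)

    def g(z_new):
        z_new = np.clip(np.asarray(z_new, dtype=float), eps, 1 - eps)
        new_logits = np.log(z_new / (1 - z_new)).reshape(-1, 1)
        return model.predict_proba(new_logits)[:, 1]

    return g
\end{verbatim}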
\emph{Histogram binning} first constructs a set of bins (intervals) that partitions $[0, 1]$, formalized below.
\begin{definition}[Binning schemes]
A binning scheme $\mathcal{B}$ of size $B$ is a set of $B$ intervals $I_1, \dots, I_B$ that partitions $[0, 1]$. Given $z \in [0, 1]$, let $\beta(z) = j$, where $j$ is the index of the interval containing $z$ ($z \in I_j$).
\end{definition}
The bins are typically chosen such that either $I_1 = [0, \frac{1}{B}], I_2 = (\frac{1}{B}, \frac{2}{B}], \dots, I_B = (\frac{B-1}{B}, 1]$ (equal width binning)~\cite{guo2017calibration} or so that each bin contains an equal number of $z_i$ values in the recalibration data (uniform mass binning)~\cite{zadrozny2001calibrated}. Histogram binning then outputs the average $y_i$ value in each bin.
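A minimal sketch of uniform-mass histogram binning appears below; the number of bins and the handling of empty bins are illustrative choices.
\begin{verbatim}
import numpy as np

def histogram_binning(z, y, num_bins=10):
    """Histogram binning with uniform-mass bins (sketch).

    Bin edges are quantiles of the recalibration scores, so each bin holds
    roughly the same number of points; the recalibrated output for a bin is
    the average label of the points that fall in it.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior edges at the (1/B, ..., (B-1)/B)-quantiles of z.
    edges = np.quantile(z, np.linspace(0, 1, num_bins + 1)[1:-1])
    bin_ids = np.searchsorted(edges, z, side='right')
    default = y.mean()  # fallback for (rare) empty bins
    bin_values = np.array([
        y[bin_ids == b].mean() if np.any(bin_ids == b) else default
        for b in range(num_bins)
    ])

    def g(z_new):
        new_ids = np.searchsorted(edges, np.asarray(z_new, dtype=float),
                                  side='right')
        return bin_values[new_ids]

    return g
\end{verbatim}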