main.tex

\documentclass{article}
\pdfoutput=1

% if you need to pass options to natbib, use, e.g.:
%     \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2019

% ready for submission
\usepackage[final, nonatbib]{neurips_2019}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xcolor}
\usepackage[utf8]{inputenc}
\usepackage{thmtools, thm-restate}
\usepackage[normalem]{ulem}
% \usepackage{xpatch}
% \usepackage{apptools}

% \makeatletter
% \xpatchcmd{\thmt@restatable}% Edit \thmt@restatable
% {\csname #2\@xa\endcsname\ifx\@nx#1\@nx\else[{#1}]\fi}% Replace this code
% {\IfAppendix{\csname #2\@xa\endcsname}{\csname #2\@xa\endcsname\ifx\@nx{Restatement}\@nx\else[{Restatement}]\fi}}% with this code
% {}{} % execute code for success/failure instances
% \makeatother

\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{example}[theorem]{Example}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}

\def\shownotes{1}  \ifnum\shownotes=0
\newcommand{\authnote}[2]{$\ll$\textsf{\footnotesize #1 notes: #2}$\gg$}
\else
\newcommand{\authnote}[2]{}
\fi
\newcommand{\tnote}[1]{{\color{blue}\authnote{Tengyu}{#1}}}


\newcommand{\pl}[1]{}
\newcommand{\tm}[1]{}
\newcommand{\ak}[1]{}
\newcommand{\pldel}[1]{}

% \newcommand{\pl}[1]{\textcolor{red}{[PL: #1]}}
% \newcommand{\pldel}[1]{\sout{\textcolor{red}{#1}}}
% \newcommand{\tm}[1]{\textcolor{blue}{[TM: #1]}}
% \newcommand{\ak}[1]{\textcolor{brown}{[AK: #1]}}

\newcommand{\expect}[0]{\ensuremath{\mathbb{E}}}
\newcommand{\prob}[0]{\ensuremath{\mathbb{P}}}
\newcommand{\ourcal}[0]{the scaling-binning calibrator}
\newcommand{\Ourcal}[0]{The scaling-binning calibrator}
\newcommand{\ourcalnothe}[0]{scaling-binning calibrator}
\newcommand{\Ourcalnothe}[0]{Scaling-binning calibrator}

\graphicspath{ {images/} }

% to compile a preprint version, e.g., for submission to arXiv, add add the
% [preprint] option:
%     \usepackage[preprint]{neurips_2019}

% to compile a camera-ready version, add the [final] option, e.g.:
    %  \usepackage[final]{neurips_2019}

% to avoid loading the natbib package, add option nonatbib:
%     \usepackage[nonatbib]{neurips_2019}

\usepackage[numbers]{natbib}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argsup}{arg\,sup}
\newcommand{\bins}{\mathcal{B}}

\title{Verified Uncertainty Calibration}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.

% \setcounter{footnote}{0}
% \renewcommand*{\thefootnote}{\fnsymbol{footnote}}

\author{%
  Ananya Kumar, Percy Liang, Tengyu Ma \\
  Department of Computer Science\\
  Stanford University\\
  \texttt{\{ananya, pliang, tengyuma\}@cs.stanford.edu} \\
  % examples of more authors
  % \And
  % Percy Liang \\
  % Department of Computer Science\\
  % Stanford University\\
  % \texttt{pliang@cs.stanford.edu} \\
  % \And
  % Tengyu Ma \\
  % Department of Computer Science\\
  % Stanford University\\
  % \texttt{tengyuma@stanford.edu} \\
}

\begin{document}

\maketitle

\begin{abstract}
  Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates---those representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient---it requires $O(B/\epsilon^2)$ samples, compared to $O(1/\epsilon^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce \emph{\ourcal{}}, which first fits a parametric function that acts like a baseline for variance reduction and then bins the function values to actually ensure calibration. This requires only $O(1/\epsilon^2 + B)$ samples. We then show that methods used to estimate calibration error are suboptimal---we prove that an alternative estimator introduced in the meteorological community requires fewer samples ($O(\sqrt{B})$ instead of $O(B)$). We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35\% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration.

% Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates---those representative of the true likelihood of a prediction.
% Scaling recalibration methods that fit a parametric model are great, because they require $O(1/\epsilon^2)$ samples to achieve calibration error $\epsilon$, but we find in this work that they are (i) less calibrated than reported and (ii) current techniques cannot estimate how miscalibrated they are.
% An alternative method, histogram binning, is sample inefficient -- it requires $O(B/\epsilon^2)$ samples where $B$ is the number of distinct probabilities the model can output.
% To solve these problems, we introduce the variance-reduced calibrator, which first fits a parametric function that acts like a baseline for variance reduction and then bins the function values to actually ensure calibration.
% This requires only $O(1/\epsilon^2 + B)$ samples.\tm{changed the order of two summands}
% We also show that the methods used in the machine learning literature to estimate calibration error are suboptimal -- we prove that an alternative estimator introduced in the meteorological community requires fewer samples -- samples proportional to $\sqrt{B}$ instead of $B$.
% We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we get a 2x lower calibration error than histogram binning and guarantees on true calibration unlike scaling methods.
% In this paper, we first show that previous calibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported and (ii) we do not know how miscalibrated they are, because prior work only gives lower bounds for their calibration error \pl{reframe this: previous work (possibly some of our reviewers) did not think they were giving lower bounds; should say instead that methods used by previous work to assess calibration underestimate the true calibration error}.
%   Other methods \pl{what other methods?} like histogram binning are sample inefficient -- they require $O(\frac{B}{\epsilon^2})$ samples to reach calibration error $\epsilon$, where $B$ is the number of model outputs \pl{be clear: which model? this looks like you're talking about number of classes; which is coincidentally what you want, I guess!}. To solve these problems, we introduce the variance-reduced calibrator.
%   The variance-reduced calibrator first fits a parametric function -- which acts like a baseline for variance reduction -- and then bins the function values to actually ensure calibration. This requires only $O(B + \frac{1}{\epsilon^2})$ samples. \pl{we can join some sentences to make it flow better}
%   \pl{compress: We also introduce a new estimator for verifying...}
%   We then turn to the question of how to verify calibration. We introduce a new estimator for the calibration error that requires fewer samples -- samples proportional to $\sqrt{B}$ instead of $B$. We prove finite sample guarantees for all our results, and validate our theory with multiclass calibration experiments on CIFAR-10 and ImageNet, where we get a 2x lower calibration error than histogram binning. \pl{and guarantees on true calibration unlike scaling methods}
\end{abstract}

%\pl{Writing tip: find ways to group statements into single sentences so that you get some hierarchical structure: each sentence should express a set of related ideas, and important ideas are at the beginning of sentences}

\input{intro}

\input{setup}

\input{platt_not_calibrated}

\input{calibrating_models}

\input{verifying_calibration}

\input{related_work}

\bibliographystyle{unsrtnat}
\bibliography{local,refdb/all}

\appendix

\newpage 
\section{Model and code details}

The VGG16 model we used for ImageNet experiments is from the Keras~\cite{chollet2015} module in the TensorFlow~\cite{tensorflow2015-whitepaper} library and we used pre-trained weights supplied by the library. The VGG16 model for CIFAR-10 was obtained from an open-source implementation on GitHub~\cite{geifman2017}, and we used the pre-trained weights there. We independently verified the accuracies of these models.

\pldel{We include our code for the experiments in the supplementary material.}

\input{appendix_platt_not_calibrated_experiments}

\input{appendix_platt_not_calibrated}

\input{appendix_calibrating_models}

\input{appendix_calibrating_models_experiments}

\input{appendix_verifying_calibration_proofs}

\input{appendix_verifying_calibration_experiments}

\end{document}