-
Notifications
You must be signed in to change notification settings - Fork 1
/
appendix_calibrating_models.tex
290 lines (229 loc) · 27.9 KB
/
appendix_calibrating_models.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
\newpage
\section{Proofs for section~\ref{sec:calibrating_models}}
\label{sec:calibrating_models_appendix}
\newcommand{\G}[0]{\ensuremath{\mathcal{G}}}
Our analysis of the sample complexity of \ourcal{} requires some assumptions on the function family $\G{}$:
\begin{enumerate}
\item (Finite\pl{-dimensional} parameters). Let $\G{} = \{ g_{\theta} : [0, 1] \to [0, 1] \; | \; \theta \in A \}$ where $A \subseteq \mathbb{R}^{d}$ and $A$ is open.
\item (Injective). For all $g_{\theta} \in \G{}$ we assume $g_{\theta}$ is injective.
\item (Consistency). Intuitively, consistency means that given infinite data, the estimated parameters should converge to the unique optimal parameters in $A$.
More formally, suppose $\theta^* = \argmin_{\theta \in A} \mse(g_{\theta})$.
Then the parameters $\hat{\theta}_n$ estimated by minimizing the empirical MSE with $n$ samples in step 1 of the algorithm, converges in distribution to $\theta^*$, that is, $\hat{\theta}_n \to_D \theta^*$ as $n \to \infty$. Note that consistency inherently assumes identifiability, that there is a unique minimizer $\theta^*$ in the open set $A$.
\item (Regularity). We assume that regularity conditions in Theorem 5.23 of~\cite{vaart98asymptotic} hold, which require the loss to be twice differentiable with symmetric, non-singular Hessian, and that $g_{\theta}(x)$ is Lipschitz in $\theta$ for all $x$. We will also require the second derivative to be continuous.
\end{enumerate}
We assume that $\G{}$ satisfies these assumptions in the rest of this section. Note that aside from injectivity, the remaining conditions are only required for the (fairly standard) analysis of step 1 of the algorithm, which says that a parametric scaling method with a small number of parameters will quickly converge to its optimal error.
% We first need to introduce the Bayes optimal recalibrator $\omega^*$, which we compare all other recalibrators to. $\omega^* \circ f$ has calibration error $0$, and also has the minimum mean-squared error among all recalibrators.
% \begin{definition}
% $\omega^* : [0, 1] \to [0, 1]$ is given by $\omega^*(x) = \expect[Y | Z = z]$.
% \end{definition}
\subsection{Calibration bound (Proof of Theorem~\ref{thm:final-calib})}
The goal is to prove the following theorem from Section~\ref{sec:calibrating_models}, which we restate:
\begin{finalCalib}
\finalCalibText{}
\end{finalCalib}
We will analyze each step of our algorithm and then combine the pieces to get Theorem~\ref{thm:final-calib}.
As we mention in the main text, step 3 is the main step, so Lemma~\ref{thm:empirical-binning} is one of the core parts of our proof.
Step 2 is where we construct a binning scheme so that each bin has an equal number of points---we show that this property holds approximately in the population (Lemma~\ref{lem:well-balanced}).
This is important as well, particularly to ensure we can estimate the calibration error.
Step 1 is basically Platt scaling, and the asymptotic analysis is fairly standard.
\textbf{Step 3}: Our proofs will require showing convergence in $\ell_2$ and $\ell_1$ norm in function space, which we define below:
\begin{definition}[Distances between functions]
Given $f, g : [0, 1] \to [0, 1]$, for the $\ell_2$ norm we define $||f - g||_2^2 = \expect[(f(Z) - g(Z))^2]$ and $||f- g||_2 = \sqrt{||f - g||_2^2}$. For the $\ell_1$ norm we define $||f - g||_1 = \expect[\lvert f(Z) - g(Z)\rvert]$
\end{definition}
Recall that we showed that in the limit of infinite data the binned version of $g$, $g_{\bins{}}$, has lower calibration error than $g$ (Proposition~\ref{prop:bin_low_bound}). However our method uses $n$ data points to empirically bin $g$, giving us $\hat{g_{\bins{}}}$. We now show the key lemma that allows us to bound the calibration error and later the mean-squared error. That is, we show that the empirically binned function $\hat{g_{\bins{}}}$ quickly converges to $g_{\bins{}}$ in both $\ell_2$ and $\ell_1$ norms.
\begin{lemma}[Empirical binning]
\label{thm:empirical-binning}
There exist constants $c_B, c_1, c_2$ such that the following is true. Given $g : [0, 1] \to [0, 1]$, binning set $T_3 = \{(z_i, y_i)\}_{i=1}^n$ and a 2-well-balanced binning scheme $\bins{}$ of size $B$. Given $0 < \delta < 0.5$, suppose that $n \geq c_B B\log{\frac{B}{\delta}}$. Then with probability at least $1 - \delta$, $||\hat{g_{\bins{}}} - g_{\bins{}}||_2 \leq \frac{c_2}{\sqrt{n}}\sqrt{\log{\frac{B}{\delta}}}$ and $||\hat{g_{\bins{}}} - g_{\bins{}}||_1 \leq \frac{c_1}{\sqrt{nB}}\sqrt{\log{\frac{B}{\delta}}}$
\end{lemma}
\begin{proof}
% Note, for the L2 bound we only need 1 side of 2-well-balanced, P(f(X) \in I_j) \geq 1/2B for all j.
% For the L1 bound, if we have only 1 side of 2-well-balanced, but b_j - b_{j-1} are equal for all j, then we get the same bound.
% One question is whether we can relax this theorem, for example if b_j - b_{j-1} are equal for all j. Maybe we don't need well-balancedness in that case.
% We first discuss high level intuition for the proof. Recall that in the naive binning approach, where we bin the label (that is, $Y$) values, the $\ell_2$ calibration error is $O(\sqrt{\frac{B}{n}})$ ignoring $\log$ factors. That is, the number of samples we need for calibration increases linearly with the number of bins. The intuition is that when we have more bins, we have fewer samples per bin. That is, we have $\frac{n}{B}$ samples of $Y$ in each bin, where $Y \in \{0, 1\}$. Then Hoeffding's bound gives us the above result, which is tight up to constants. However, this theorem shows when we bin $f$ instead of $Y$, our rates have much better dependencies on $B$. The high level idea is that for each bin $j$ we are taking the average of values bounded in $I_j = [b_{j-1}, b_j]$. This is a narrower range than in naive binning where we took the average of values in $\{0, 1\}$. Since the values are bounded in a narrower range, the variance is lower. Of course, the variance of any particular bin could be big. For example we could have $I_7 = [0.1, 0.9]$. But the sum of all the bin sizes is 1, so `most' of the bins have a lower variance. We now give the formal proof.
Recall that the intuition is in Figure~\ref{fig:variance_reduced_illustration} of the main text---the $g(z_i)$ values in each bin (gray circles in Figure~\ref{fig:var_red_binning}) are in a narrower range than the $y_i$ values (black crosses in Figure~\ref{fig:hist_binning}) so when we take the average we incur less of an estimation error. Now, there may be a small number of bins where the $g(z_i)$ values are not in a narrow range, but we will use the assumption that $\bins{}$ is 2-well-balanced to show that these effects average out and the overall estimation error is small.
Define $R_j$ to be the set of $g(z_i)$ that fall into the $j$-th bin, given by $R_j = \{g(z_i) \mid g(z_i) \in I_j \wedge (z_i, y_i) \in T_3\}$ (recall that $T_3$ is the data we use in step 3).
Let $p_j$ be the probability of landing in bin $j$, given by $p_j = \prob(g(Z) \in I_j)$.
Since $\bins{}$ is 2-well-balanced, $p_j \geq \frac{1}{2B}$.
Since $n \geq c_B B\log{\frac{B}{\delta}}$, by the multiplicative Chernoff bound, for some large enough $c_B$, with probability at least $1 - \frac{\delta}{2}$, $|R_j| \geq \frac{p_j}{2}$.
Consider each bin $j$. Let $\mu_j$ be the expected output of $g$ in bin $j$, given by $\mu_j = \expect[g(Z) \; | \; g(Z) \in I_j]$. $\mu(R_j)$, the mean of the values in $R_j$, is the empirical average of $|R_j|$ such values, each bounded between $b_{j-1}$ and $b_j$ where $I_j = [b_{j-1}, b_j]$. So $\hat{\mu}(R_j)$ is sub-Gaussian with parameter:
\[ \sigma^2 = \frac{(b_j - b_{j-1})^2}{4|R_j|} \leq \frac{(b_j - b_{j-1})^2}{2p_jn} \]
Then by the sub-Gaussian tail bound, for any $1 \leq j \leq B$, with probability at least $1 - \frac{\delta}{2B}$, we have:
\begin{align} (\mu_j - \hat{\mu}(R_j))^2 \leq \frac{(b_j - b_{j-1})^2}{p_jn} \log{\frac{4B}{\delta}}\label{eqn:1} \end{align}
So by union bound with probability at least $1 - \frac{\delta}{2}$ the above holds for all $1 \leq j \leq B$ simultaneously.
We then bound the $\ell_2$-error.
\begin{align*}
||\hat{g_{\mathcal{B}}} - g_{\mathcal{B}}||_2 &= \sqrt{\sum_{j =1}^B p_j (\mu_j - \hat{\mu}(R_j))^2} \\
&\leq \sqrt{\sum_{j =1}^B p_j \frac{(b_j - b_{j-1})^2}{p_jn} \log{\frac{4B}{\delta}}} \tag{by equation~\eqref{eqn:1}}\\
&= \sqrt{\frac{1}{n} \log{\frac{4B}{\delta}} \sum_{j =1}^B (b_j - b_{j-1})^2 } \\
&\leq \sqrt{\frac{1}{n} \log{\frac{4B}{\delta}} \sum_{j =1}^B (b_j - b_{j-1}) } \tag{because $0\le b_j - b_{j-1}\le 1$}\\
&\leq \sqrt{\frac{1}{n} \log{\frac{4B}{\delta}} } \\
&\leq c_2 \frac{1}{\sqrt{n}} \sqrt{\log{\frac{B}{\delta}}}
\end{align*}
Similarly, we can also bound the $\ell_1$-error. Here we also use the fact that $p_j \leq \frac{2}{B}$ since $\bins{}$ is 2-well-balanced.
\begin{align*}
||\hat{g_{\mathcal{B}}} - g_{\mathcal{B}}||_1 &= \sum_{j =1}^B p_j |\mu_j - \hat{\mu}(R_j)| \\
&\leq \sum_{j =1}^B p_j \sqrt{\frac{(b_j - b_{j-1})^2}{p_jn} \log{\frac{4B}{\delta}}} \\
&=\sum_{j =1}^B \sqrt{\frac{p_j(b_j - b_{j-1})^2}{n} \log{\frac{4B}{\delta}}} \\
&\leq \sum_{j =1}^B \sqrt{\frac{2(b_j - b_{j-1})^2}{Bn} \log{\frac{4B}{\delta}}} \\
&\leq \sqrt{\frac{2}{Bn} \log{\frac{4B}{\delta}}} \sum_{j =1}^B (b_j - b_{j-1}) \\
&\leq c_1 \frac{1}{\sqrt{Bn}} \sqrt{\log{\frac{B}{\delta}}}
\end{align*}
By union bound, these hold with probability at least $1 - \delta$, which completes the proof.
\end{proof}
\textbf{Step 2}: Recall that we chose our bins so that each bin had an equal proportion of points in the recalibration set. In our proofs we required that this property (approximately) holds in the population as well. The following lemma shows this.
\begin{wellBalanced}
\wellBalancedText{}
\end{wellBalanced}
\begin{proof}
Suppose we are given a bin construction set of size $n$, $T_n = \{(z_1, y_1), \dots, (z_n, y_n)\}$.
% We want to show that if an interval $I_j$ contains $\frac{n}{B}$ points $g(z_i)$ then $\frac{1}{2B} \leq \prob(g(Z) \in I_j) \leq \frac{2}{B}$.
For any interval $I$, let $\hat{P}(I)$ be the empirical estimate of $P(I) = \prob(g(Z) \in I)$ given by:
\[ \hat{P}(I) = \frac{|\{(z_i, y_i) \in T_n \mid g(z_i) \in I\}|}{n} \]
We constructed the bins so that each interval $I_j$ contains $\frac{n}{B}$ points, or in other words, $\hat{P}(I_j) = \frac{1}{B}$. We want to show that $\frac{1}{2B} \leq \prob(g(Z) \in I_j) \leq \frac{2}{B}$. Since the intervals are chosen from data, we want a uniform concentration result that holds for all such intervals $I_j$.
We will use a discretization argument. The idea is that we will cover $[0, 1]$ with $10B$ disjoint small intervals such that for each of these intervals $I_j'$, $P(g(Z) \in I_j') = \frac{1}{10B}$. We will then use Bernstein and union bound to get that with probability at least $1 - \delta$, for all $I_j'$, $|P(I_j') - \hat{P_j}(I_j')| \leq \frac{1}{100B}$ . Given an arbitrary interval $I$, we can write it as an approximate union of these small intervals, which will allow us to concentrate $|P(I) - \hat{P}(I)|$.
\textbf{Concentrating the small intervals:} Fix some interval $I_j'$. Let $w_i = \mathbb{I}(g(z_i) \in I_j')$ for $i = 1,\dots,n$. Then $w_i \sim \mbox{Bernoulli}(\frac{1}{10B})$. $\hat{P}(I_j')$ is simply the empirical average of $n$ such values and as such with probability at least $1 - \frac{\delta}{10B}$:
\[ |P(I_j') - \hat{P_j}(I_j')| \leq \sqrt{\frac{2}{10Bn} \log{\frac{10B}{\delta}}} + \frac{2}{3n} \log{\frac{10B}{\delta}} \]
If $n = cB \log{\frac{B}{\delta}}$ for a large enough constant $c$, we get:
\[ |P(I_j') - \hat{P_j}(I_j')| \leq \frac{1}{100B} \]
And this was with probability at least $1 - \frac{\delta}{10B}$. So by union bound we get that with probability at least $1 - \delta$ this holds for all $I_j'$.
\textbf{Concentrating arbitrary intervals:} Now consider arbitrary $I \subseteq [0, 1]$. We can approximately write $I$ as a union of the small $I_j'$ intervals. More concretely, we can form a lower bound for $\hat{P}(I)$ by considering all $I_j'$ contained in $I$:
\[ S_L = \{I_j' \mid I_j' \subseteq I \} \]
Similarly we can form an upper bound for $\hat{P}(I)$ by considering all $I_j'$ that have non-empty intersection with $I$:
\[ S_U = \{I_j' \mid I_j' \cap I \neq \emptyset \} \]
We can then show:
\[ \frac{9}{10} P(I) - \frac{1}{5B} \leq \hat{P}(I) \leq \frac{11}{10} P(I) + \frac{1}{5B} \]
Since in our case for all $j$, $\hat{P}(I_j) = \frac{1}{B}$, this gives us:
\[ \frac{1}{2B} \leq P(I_j) \leq \frac{2}{B} \]
\end{proof}
\textbf{Step 1}: Recall that step 1 essentially applies a scaling method---we fit a small number of parameters to the recalibration data.
We show that if $\G{}$ contains $g^* \in \G{}$ with low calibration error, then the empirical risk minimizer $g \in \G{}$ of the mean-squared loss will also quickly converge to low calibration error.
Intuitively, methods like Platt scaling fit a single parameter to the data so standard results in asymptotic statistics tell us they will converge quickly to their optimal error, at least in mean-squared error.
We can combine this with a decomposition of the mean-squared error into calibration and refinement, and the injectivity of $g \in \mathcal{G}$, to show they also converge quickly in calibration error.
\begin{lemma}[Convergence of scaling]
\label{lem:platt_scaling_bound}
Given $\delta$, there exists a constant $c$, such that for all $n$, $\squaredce(g) \leq \min_{g' \in \G{}}\squaredce(g') + \frac{c}{n}$, with probability at least $1 - \delta$.
% independent of $d, L, n$ such that with high probability over the recalibration samples $\squaredce(g) \leq \min_{g' \in \G{}}\squaredce(g') + \frac{cLd \log{R}}{\sqrt{n}}$, where recall that $g \in \G{}$ was selected as the empirical risk minimizer of the mean-squared error loss.
\end{lemma}
\begin{proof}
\textbf{From calibration error to mean-squared error}: We use the classic decomposition of the mean-squared error into calibration error (also known as reliability) and refinement\footnote{Note that the refinement term can be further decomposed into resolution (also known as sharpness) and irreducible uncertainty.}. For any $g \in \G{}$ we have:
\[ \mse(g) = \underbrace{\squaredce(g)}_{\mbox{calibration}} + \underbrace{\expect[(\expect[Y \mid g(Z)] - Y)^2]}_{\mbox{refinement}} \]
Note that the refinement term is constant for all injective $g \in \G{}$, since for injective $g$:
\[ \expect[(\expect[Y \mid g(Z)] - Y)^2] = \expect[(\expect[Y \mid Z] - Y)^2] \]
This means that the difference in calibration error between any $g$ and $g'$ is precisely the difference in the mean-squared error. So it suffices to upper bound the generalization gap $\mse(g) - \mse(g^*)$ for the mean-squared error. Our analysis is fairly standard: we will show asymptotic convergence in the parameter space, and then use a Taylor expansion to show convergence in the MSE loss.
\textbf{Parameter convergence}: Recall that $\hat{\theta}$ denotes the parameters estimated by optimizing the empirical mean-squared error objective on $n$ samples in step 1 of our algorithm, and $\theta^*$ denotes the optimal parameters that minimize the mean-squared error objective on the population. From Theorem 5.23 of~\cite{vaart98asymptotic}, on the asymptotic parameter convergence of M-estimators, we have as $n \to \infty$:
\[ \sqrt{n}(\hat{\theta} - \theta^*) \to_D N(0, \Sigma) \]
Then for each $1 \leq i \leq d$, we have:
\[ \sqrt{n}(\hat{\theta}_i - \theta^*_i) \to_D N(0, \sigma_i^2) \]
We will show that there exists $c_i$ such that for each $i$ and for all $n$, with probability at least $1 - \frac{\delta}{d}$:
\[ \lvert \hat{\theta}_i - \theta^*_i \rvert \leq \frac{c_i}{n} \]
To see this, we begin with the definition of convergence in distribution, which says that the CDFs converge pointwise at every point where the CDF is continuous, which for a Gaussian is every point. That is, letting $z_i$ be a sample from $N(0, \sigma_i^2)$, we have for all $c$:
\[ \lim_{n \to \infty} \prob( \sqrt{n}(\hat{\theta}_i - \theta^*_i) \geq c) = \prob(z_i \geq c) \]
By considering the CDF at each point and its negative, we can show the same result for the absolute value:
\[ \lim_{n \to \infty} \prob( \sqrt{n} \lvert \hat{\theta}_i - \theta^*_i \rvert \geq c) = \prob(\lvert z_i \rvert \geq c) \]
The tails of a normal distribution are bounded, so we can choose $c_i'$ such that:
\[ \prob(\lvert z_i \rvert \geq c_i') \leq \frac{\delta}{2d} \]
By definition of limit, this means that we can choose $N_i$ such that for all $n \geq N_i$, we have:
\[ \prob( \sqrt{n} \lvert \hat{\theta}_i - \theta^*_i \rvert \geq c_i') \leq \frac{\delta}{d} \]
In other words, for all $n \geq N_i$, with probability at least $1 - \frac{\delta}{d}$:
\[ \lvert \hat{\theta}_i - \theta^*_i \rvert \leq \frac{c_i'}{\sqrt{n}} \]
Since this only does not hold for finitely many values $1, \cdots, N_i - 1$, we can `absorb' these cases into the constant. That is, for each $n \in \{1, \cdots, N_i - 1 \}$, there exists $r_n$ such that if we use $n$ samples, then except with probability $\frac{\delta}{d}$, $\lvert \hat{\theta}_i - \theta^*_i \rvert \leq r_n$. So then we can choose $c_i$ such that for all $n$:
\[ \lvert \hat{\theta}_i - \theta^*_i \rvert \leq \frac{c_i' + \max_{1 \leq m < N_i}{r_m \sqrt{m}}}{\sqrt{n}} \leq \frac{c_i}{\sqrt{n}} \]
We apply union bound over the indices $i$, and can then bound the $\ell_2$-norm of the difference between the estimated and optimal parameters, so that we can choose $k$ such that for all $n$, with probability at least $1 - \delta$:
\[ ||\hat{\theta} - \theta^*||_2^2 \leq \frac{k}{n} \]
\textbf{Loss convergence}: We denote the loss by $L$, defined as:
\[ L(\theta) = \mse(g_{\theta}) = \expect[ (Y - g_{\theta}(X))^2 ] \]
We approximate the loss $L$ by the first few terms of its Taylor expansion, which we denote by $\widetilde{L}$:
\[ \widetilde{L}(\theta) = L(\theta^*) + \nabla L(\theta^*)^T (\hat{\theta} - \theta^*) + (\hat{\theta} - \theta^*)^T \nabla^2 L(\theta^*) (\hat{\theta} - \theta^*) \]
We assumed that $L$ was twice differentiable with continuous second derivative, and that $\theta^*$ minimized the loss in an open set, so $\nabla L(\theta^*) = 0$, and we also have (see e.g. Theorem 3.3.18 in~\cite{hubbard1998vector}):
\[ \lim_{||\hat{\theta} - \theta^*||_2 \to 0} \frac{L(\hat{\theta}) - \widetilde{L}(\hat{\theta})}{||\hat{\theta} - \theta^*||_2^2} = 0 \]
By the definition of a limit if we fix $\epsilon > 0$, there exists $R > 0$ such that if $||\hat{\theta} - \theta^*||_2 \leq R$ then $L(\hat{\theta}) - \widetilde{L}(\hat{\theta}) \leq \epsilon ||\hat{\theta} - \theta^*||_2^2$. For some large enough $N_0$, if $n \geq N_0$, then with probability at least $1 - \delta$, $||\hat{\theta} - \theta^*||_2 \leq R$. As before, since this only does not hold for finitely many $N$, we can fold these cases into the constant so that there exists $\epsilon'$ such that for all $n$, $L(\hat{\theta}) - \widetilde{L}(\hat{\theta}) \leq \epsilon' ||\hat{\theta} - \theta^*||_2^2$ with probability at least $1 - \delta$. Plugging in $\widetilde{L}(\hat{\theta})$, we have:
\[ L(\hat{\theta}) - L(\theta^*) \leq (\hat{\theta} - \theta^*)^T \nabla^2 L(\theta^*) (\hat{\theta} - \theta^*) + \epsilon' ||\hat{\theta} - \theta^*||_2^2 \]
We can bound this by the operator norm of the Hessian, and then use the parameter convergence result:
\[ L(\hat{\theta}) - L(\theta^*) \leq (|| \nabla^2 L(\theta^*) ||_{op} + \epsilon') ||\hat{\theta} - \theta^*||_2^2 \leq \frac{c}{n} \]
which holds with probability at least $1 - \delta$, as desired.
% From a standard $\epsilon$-cover in the parameter space, or PAC-Bayes argument, we can get a high probability generalization bound on the mean-squared error:
% \[ \mse(g) \leq \mse(g^*) + \frac{cLd \log{R}}{\sqrt{n}} \]
% This gives us the desired result.
\end{proof}
Finally, we have the tools to prove the main theorem:
\begin{proof}[Proof of Theorem~\ref{thm:final-calib}]
The proof pieces together Lemmas~\ref{lem:platt_scaling_bound},~\ref{thm:empirical-binning},~\ref{lem:well-balanced} and Proposition~\ref{prop:bin_low_bound}.
For any fixed $c_1 > 0$, there exists $c_1'$ such that if $n \geq c_1'\big(\frac{1}{\epsilon^2}\big)$, from Lemma~\ref{lem:platt_scaling_bound}, step 1 of our algorithm gives us $g$ with $\squaredce(g) \leq \min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2$, with probability at least $1 - \frac{\delta}{3}$.
Next, for universal constant $c_2$, if $n \geq c_2(B \log{\frac{B}{\delta}})$, from Lemma~\ref{lem:well-balanced}, step 2 chooses a 2-well-balanced binning scheme $\bins{}$ with probability at least $1 - \frac{\delta}{3}$.
From Proposition~\ref{prop:bin_low_bound}, $\squaredce(g_{\bins{}}) \leq \squaredce(g) \leq \min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2$. Then from Lemma~\ref{thm:empirical-binning}, for any $c_3 > 0$, there exists $c_3'$ such that if $n \geq c_3'\big(\frac{1}{\epsilon^2} \log{\frac{B}{\delta}}\big)$, step 3 gives us $\hat{g_{\bins{}}}$ with $||\hat{g_{\bins{}}} - g_{\bins{}}||_2 \leq c_3 \epsilon$ with probability at least $1 - \frac{\delta}{3}$. We want to say that since $\hat{g_{\bins{}}}$ is close to $g_{\bins{}}$ and $g_{\bins{}}$ has low calibration error, this must mean that $\hat{g_{\bins{}}}$ has low calibration error.
To do this we represent the $\ell_2$ calibration error of any $g$ as the distance between $g$ and a perfectly recalibrated version of $g$. That is, we define the perfectly recalibrated version of $g$ as:
\[ \omega(g)(z) = \expect[ Y \mid g(Z) = z ] \]
Then for any $g$, we can write $\ltwoce(g) = ||g - \omega(g)||_2$. By triangle inequality on the $\ell_2$ norm on functions, we have:
\[ ||\hat{g_{\bins{}}} - \omega(g_{\bins{}})||_2 \leq ||\hat{g_{\bins{}}} - g_{\bins{}}||_2 + ||g_{\bins{}} - \omega(g_{\bins{}})||_2 \leq c_3 \epsilon + \sqrt{\min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2} \]
Now the LHS is not quite the $\ell_2$ calibration error of $\hat{g_{\bins{}}}$, which is $||\hat{g_{\bins{}}} - \omega(\hat{g_{\bins{}}})||_2$~\footnote{This is a very technical point, so at a first pass the reader may skip the following discussion.}.
However, since $g$ is injective, $g_{\bins{}}$ takes on a different value for each interval $I_j \in \bins{}$.
If $\hat{g_{\bins{}}}$ also takes on a different value for each interval $I_j \in \bins{}$, then we can see that $\omega(g_{\bins{}}) = \omega(\hat{g_{\bins{}}})$.
If not, $\omega(\hat{g_{\bins{}}})$ can only merge some of the intervals of $\omega(g_{\bins{}})$, and by Jensen's we can show:
\[ ||\hat{g_{\bins{}}} - \omega(\hat{g_{\bins{}}})||_2 \leq ||\hat{g_{\bins{}}} - \omega(g_{\bins{}})||_2 \leq c_3 \epsilon + \sqrt{\min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2} \]
An alternative way to see this is to add infinitesimal noise to $\hat{g_{\bins{}}}$ for each interval $I_j$, in which case we get $\omega(g_{\bins{}}) = \omega(\hat{g_{\bins{}}})$.
Finally we convert back from the $\ltwoce$ to $\squaredce$:
\[ \squaredce(\hat{g_{\bins{}}}) = ||\hat{g_{\bins{}}} - \omega(\hat{g_{\bins{}}})||_2^2 \leq \min_{g' \in \G{}}\squaredce(g') + (c_3^2 + c_1) \epsilon^2 + 2\sqrt{(c_3^2 \epsilon^2) \big(\min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2\big)} \]
By the AM-GM inequality, we have:
\[ 2\sqrt{(c_3^2 \epsilon^2) \big(\min_{g' \in \G{}}\squaredce(g') + c_1 \epsilon^2\big)} \leq (c_3^2 + c_1) \epsilon^2 + \min_{g' \in \G{}}\squaredce(g') \]
Combining these, we get:
\[ \squaredce(\hat{g_{\bins{}}}) \leq 2 \min_{g' \in \G{}}\squaredce(g') + 2(c_3^2 + c_1) \epsilon^2 \]
By e.g. choosing $c_1 = 0.1$ and $c_3 = 0.1$, we have $2(c_3^2 + c_1) \leq 1$, which gives us the desired result. By union bound over each step, we have this with probability at least $1 - \delta$.
% From proposition Y, $\twoce(g_{\bins{}}) \leq \ltwoce(g)$
\end{proof}
\subsection{Bounding the mean-squared error}
We also show that if we use lots of bins, discretization has little impact on model quality as measured by the mean-squared error.
Note that recalibration itself typically \emph{reduces/improves} the mean-squared error.
However, in our method after fitting a recalibration function like Platt scaling does, we discretize the function outputs.
This reduces the calibration error and allows us to measure the calibration error, but it does increase the mean-squared error by a small amount.
Here we upper bound the increase in mean-squared error.
In other words, our method allows for the calibration error of the final model to be measured, and has little impact on the mean-squared error.
\begin{restatable}[MSE Bound]{proposition}{mseFiniteBinning}
\label{prop:mse-finite-binning}
If $\mathcal{B}$ is a 2-well-balanced binning scheme of size $B$ and $B = \widetilde{\Omega}(n)$, where $\widetilde{\Omega}$ hides $\log$ factors, then $\mse(\hat{g}_{\mathcal{B}}) \leq \mse(g) + O(\frac{1}{B})$.
\end{restatable}
To show this we begin with a lemma showing that if $f$ and $g$ are close in $\ell_1$ norm, then their mean-squared errors are close:
\begin{lemma}
\label{lem:mse-l1}
For $f, g : [0, 1] \to [0, 1]$, $\mse(f) \leq \mse(g) + 2||f - g||_1$.
\end{lemma}
\begin{proof}
\begin{align*}
\expect[(f(Z) - Y)^2 - (g(Z) - Y)^2] &= \expect[(f(Z) - g(Z))(f(Z) + g(Z) - 2Y)] \\
&\leq \expect[|f(Z) - g(Z)||f(Z) + g(Z) - 2Y|] \\
&\leq \expect[2|f(Z) - g(Z)|] \\
& =2||f-g||_1
\end{align*}
Where the third line followed because $-2 \leq f(Z) + g(Z) - 2Y \leq 2$.
\end{proof}
Next, we show that in the limit of infinite data, if we bin with a well-balanced binning scheme then the MSE cannot increase by much.
\begin{lemma}
\label{thm:bin-sharpness}
Let $\mathcal{B}$ be an $\alpha$-well-balanced binning scheme of size $B$. Then $\mse(g_{\mathcal{B}}) \leq \mse(g) + \frac{2\alpha}{B}$.
\end{lemma}
\begin{proof}
We bound $||g_{\mathcal{B}} - g||_1$ and then use Lemma~\ref{lem:mse-l1}. We use the law of total expectation, conditioning on $\beta(g(Z))$, the bin that $g(Z)$ falls into.
\begin{align*}
||g_{\mathcal{B}} - g||_1 &= \mathbb{E}[|g_{\mathcal{B}}(Z) - g(Z)|] \\
&\leq \mathop{\mathbb{E}}_{\beta(g(Z))} \Big[ \mathop{\mathbb{E}}_{Z | \beta(g(Z))} [ |g_{\mathcal{B}}(Z) - g(Z)| ]\Big]\\
&\leq \mathop{\mathbb{E}}_{\beta(g(Z))} \Big[ b_{\beta(g(Z))} - b_{\beta(g(Z))-1}\Big]
\end{align*}
We now use the fact that $\mathcal{B}$ is $\alpha$-well-balanced.
\begin{align*}
\mathop{\mathbb{E}}_{\beta(g(Z))} \Big[ (b_{\beta(g(Z))} - b_{\beta(g(Z))-1})\Big] &= \sum_{i=1}^B \prob\big(g(Z) \in [b_{\beta(g(Z))-1}, b_{\beta(g(Z))}]\big) (b_{\beta(g(Z))} - b_{\beta(g(Z))-1}) \\
&\leq \sum_{i=1}^B \frac{\alpha}{B} (b_{\beta(g(Z))} - b_{\beta(g(Z))-1}) \\
&\leq \frac{\alpha}{B}
\end{align*}
Finally, from Lemma~\ref{lem:mse-l1}, we get that $\mse(g_{\mathcal{B}}) \leq \mse(g) + \frac{2\alpha}{B}$.
\end{proof}
The above lemma bounds the increase in MSE due to binning in the infinite sample case -- next we deal with the finite sample case and prove proposition~\ref{prop:mse-finite-binning}:
\begin{proof}[Proof of Proposition~\ref{prop:mse-finite-binning}:]
Ignoring all $\log$ factors, from Theorem~\ref{thm:empirical-binning} if $n = \widetilde{\Omega}(B)$, we have $||\hat{g}_{\mathcal{B}} - g_{\mathcal{B}}||_1 = O(\frac{1}{\sqrt{nB}})$. Then from Lemma~\ref{lem:mse-l1}, $\mse(\hat{g}_{\mathcal{B}}) \leq \mse(g_{\mathcal{B}}) + O(\frac{1}{\sqrt{Bn}}) \leq \mse(g_{\mathcal{B}}) + O(\frac{1}{B})$. From Theorem~\ref{thm:bin-sharpness}, since $\mathcal{B}$ is 2-well-balanced, we have $\mse(g_{\mathcal{B}}) \leq \mse(g) + O(\frac{1}{B})$. This gives us $\mse(\hat{g}_{\mathcal{B}}) \leq \mse(g) + O(\frac{1}{B})$.
\end{proof}
\subsection{Alternative binning schemes}
\label{sec:alt_binning_schemes}
We note that there are alternative binning schemes in the literature.
For example, the $B$ bins can be chosen as $I_1 = [0, \frac{1}{B}], I_2 = (\frac{1}{B}, \frac{2}{B}], \dots, I_B = (\frac{B-1}{B}, 1]$.
The main problem with this binning scheme is that we may not be able to measure the calibration error efficiently, which is critical.
However, if we choose the bins like this, and are lucky that the binning scheme happens to be 2-well-balanced, we can improve the bounds on the MSE that we proved above.
This motivates alternative hybrid binning schemes, where we try to keep the width of the bins as close to $1/B$ as possible, while ensuring that each bin contains lots of points as well.
We think analyzing what binning schemes lead to the best bounds, and seeing if this can improve the calibration method, is a good direction for future research.