# Conformal Prediction

References:

+ Angelopoulos, A.N., S. Bates, A. Fisch, L. Lei, T. Schuster, 2022. Conformal Risk Control, https://arxiv.org/abs/2208.02814
+ Barber, R.F., E.J. Candes, A. Ramdas, and R.J. Tibshirani, 2022. Conformal prediction beyond exchangeability, https://arxiv.org/abs/2202.13415
+ Papadopoulos, H., K. Proedrou, V. Vovk, and A. Gammerman, 2002. Inductive Confidence Machines for Regression. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Machine Learning: ECML 2002. ECML 2002. Lecture Notes in Computer Science, vol 2430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36755-1_29
+ Shafer, G., and V. Vovk, 2008. A Tutorial on Conformal Prediction, _Journal of Machine Learning Research, 9_, 371-421.

Basic problem set up: we observe items sequentially. Each item has a label, $y$, which could be numerical or categorical.
There is some black-box model that makes a prediction $\hat{y}$ of the value of $y$ for the next item, 
generally using side information $x$ (e.g., covariates).
The predictor can "learn" from previous cases (also called _examples_): it does not have to stay the same over time.
After each prediction $\hat{y}$ is made, the true label $y$ is revealed.
Thus, $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ and $x_n$ are available to predict $y_n$.

Given an acceptable error probability $\epsilon$, the goal is to construct a "prediction region" $\Gamma^\epsilon$ of labels that contains the label of the next item with probability at least $1-\epsilon$.
A $1-\epsilon$ prediction region is _valid_ if it contains the truth at least $1-\epsilon$ of the time.

Conformal prediction works for any prediction method: regression, Bayesian modeling, support-vector machines, neural networks, etc. 

Recall [the definition of a _multiset_ or _bag_](./math-foundations.ipynb#Multisets): like a set, a bag is a collection of things, but unlike a set, it can contain multiple copies of the same thing. It amounts to an unordered list.
We will use $\llcorner \cdot \lrcorner$ to denote a bag.

The basic assumption of conformal prediction is that the cases are _exchangeable_, which is true if they are IID, but not quite as restrictive as IID. 
(There has been work on weakening that assumption: see, e.g., Barber et al., 2022.)

One definition of exchangeability of a finite collection of random variables $\{Z_1, \ldots, Z_n\}$ is that their joint probability distribution is invariant under permutations of the indices. That is, the distribution of $(Z_1, \ldots, Z_n)$ is the same as the distribution of $(Z_{\pi_1}, \ldots, Z_{\pi_n})$ for every permutation $\pi$ of $\{1, \ldots, n\}$.
An infinite sequence of random variables $Z_1, Z_2, \ldots$ is exchangeable if the finite collection $\{Z_1, \ldots, Z_n\}$
is exchangeable for every $n$.
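
To see that exchangeability is weaker than IID, consider drawing every item from an urn one at a time without replacement. The sketch below enumerates the joint distribution exactly and checks permutation invariance. The urn contents and the function names are illustrative assumptions, not part of the references.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def joint_distribution(urn):
    """Joint distribution of (Z_1, ..., Z_n) when the whole urn is drawn
    one item at a time, without replacement, in uniformly random order."""
    orderings = list(permutations(urn))
    counts = Counter(orderings)
    return {seq: Fraction(c, len(orderings)) for seq, c in counts.items()}

def is_exchangeable(dist):
    """True if permuting the coordinates never changes a sequence's probability."""
    return all(
        dist.get(perm, Fraction(0)) == p
        for seq, p in dist.items()
        for perm in permutations(seq)
    )

urn = ("red", "red", "blue")   # illustrative urn
dist = joint_distribution(urn)

# Marginal and joint probabilities, to check whether the draws are independent.
p_first_red = sum(p for seq, p in dist.items() if seq[0] == "red")
p_second_red = sum(p for seq, p in dist.items() if seq[1] == "red")
p_both_red = sum(p for seq, p in dist.items() if seq[0] == seq[1] == "red")

print("exchangeable:", is_exchangeable(dist))
print("P(both red) =", p_both_red, " product of marginals =", p_first_red * p_second_red)
```

The joint probability of two reds differs from the product of the marginals, so the draws are dependent even though the sequence is exchangeable.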

Shafer & Vovk also define exchangeability using a betting protocol, as follows.

### Backward-Looking Betting Protocol (Shafer & Vovk, 2008)

Two players, Alice & Bob.
+ $\mathcal{K}_N := 1$.
+ Alice announces a bag $\mathcal{B}_N$ of size $N$.
+ for $n = N, N-1, \ldots, 2, 1$:
    + Bob bets on $z_n$ at odds set by $\mathbb{P} \{z_n = a \mid \mathcal{B}_n = \llcorner a_1, \ldots, a_n \lrcorner \} = k/n$, where $k$ is the number of times $a$ occurs in $\mathcal{B}_n$.
    + Alice announces $z_n \in \mathcal{B}_n$.
    + $\mathcal{K}_{n-1} :=  \mathcal{K}_n$ plus Bob's winnings on $z_n$.
    + $\mathcal{B}_{n-1} := \mathcal{B}_n \setminus \llcorner z_n \lrcorner$.

Bob's moves are constrained to guarantee that his capital $\mathcal{K}_n$ will be nonnegative for all $n$, no matter how Alice
moves.

Shafer & Vovk define exchangeability as saying that Bob will not multiply his initial capital $\mathcal{K}_N$ by a
large factor in this game.
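
A minimal sketch of the protocol, under assumptions: Bob follows one hypothetical strategy (stake half his capital on the most common value left in the bag), and Alice's ordering is uniformly random, which is what exchangeability amounts to once the bag is fixed. Enumerating every ordering of a small bag with exact arithmetic shows that Bob's expected final capital equals his initial capital.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def final_capital(order, stake_fraction=Fraction(1, 2)):
    """Play the backward-looking protocol once.

    `order` is (z_1, ..., z_N); Alice reveals z_N first, then z_{N-1}, ...
    Bob's (hypothetical) strategy: each round, stake a fixed fraction of his
    current capital on the most common value left in the bag, at odds k/n.
    Staking no more than his capital keeps it nonnegative.
    """
    bag = Counter(order)
    capital = Fraction(1)                          # K_N := 1
    for n in range(len(order), 0, -1):
        # Bob sees only the bag, so his choice must not depend on the order.
        a = min(bag, key=lambda v: (-bag[v], v))
        p = Fraction(bag[a], n)                    # P{z_n = a | B_n} = k/n
        stake = stake_fraction * capital
        z_n = order[n - 1]                         # Alice announces z_n
        if z_n == a:
            capital += stake * (1 - p) / p         # net winnings at odds p
        else:
            capital -= stake
        bag[z_n] -= 1                              # remove z_n from the bag
        if bag[z_n] == 0:
            del bag[z_n]
    return capital

bag = (1, 1, 2, 3)                                 # illustrative bag
orders = list(permutations(bag))                   # equally likely orderings
mean_capital = sum(final_capital(o) for o in orders) / len(orders)
print("mean final capital:", mean_capital)
print("largest final capital:", max(final_capital(o) for o in orders))
```

Because the final capital is nonnegative with mean 1, Markov's inequality bounds the chance that Bob multiplies his capital by a factor $c$ or more by $1/c$.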

Shafer & Vovk consider two cases, one of which is a special case of the other:

1. Predict using old examples alone. Just before observing $z_n$,  predict it 
using the previous examples $(z_1, \ldots , z_{n−1})$.

2. Predict using features (covariates) of the new example. Each example $z_i = (x_i, y_i)$, where $x_i$ is side information
and $y_i$ is a label. The data are $x_1, y_1, \ldots , x_N, y_N$. Just before observing $y_n$, we predict it from $x_n$ and $(z_i)_{i=1}^{n-1}$. 

An essential ingredient in conformal prediction is a measure of _nonconformity_ between an _example_ $z$ and a bag $\mathcal{B}$ of other examples, a real-valued function $A(\mathcal{B}, z)$.
If there is a distance function defined on examples, $d(z, z')$, the nonconformity measure could be 
defined using the distance between a point estimator $\hat{z}$ and the example:
\begin{equation}
A(\mathcal{B}, z) := d(\hat{z}(\mathcal{B}), z).
\end{equation}
In general, the precision of prediction regions ("size") depends on the nonconformity measure, but the _validity_ of the prediction regions (the chance the regions contain the next observation) will not.
Moreover, the prediction regions generated by conformal prediction are invariant under monotone increasing
transformations of $A$.
Hence, in this approach to constructing a nonconformity measure, the particular choice of distance measures $d(\cdot, \cdot)$ 
is less important than the choice of point predictors $\hat{z}$.

If the labels are real numbers, one
could define $A(\mathcal{B}, z)$ in a number of simple ways, such as
\begin{equation}
 A(\mathcal{B}, z) := |\bar{z}_{\mathcal{B}} - z|,
\end{equation}
where $\bar{z}_{\mathcal{B}} := (\sum_{a \in {\mathcal{B}}} a)/|{\mathcal{B}}|$.
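
As a small illustration, the mean-based measure can be written in a few lines. The median-based variant and the data are assumptions added only for comparison; a Python list stands in for a bag.

```python
from statistics import mean, median

def nonconformity_mean(bag, z):
    """A(B, z) = |mean(B) - z|: distance from z to the average of the bag."""
    return abs(mean(bag) - z)

def nonconformity_median(bag, z):
    """Same distance, with the median as the point predictor (an alternative
    choice of z-hat, included here only for comparison)."""
    return abs(median(bag) - z)

bag = [3, 5, 4, 6, 5, 4, 5, 7, 4]   # illustrative data
for z in (6, 12):
    print(z, nonconformity_mean(bag, z), nonconformity_median(bag, z))
```

Both measures use the same distance but different point predictors, so they can rank examples differently.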

## Conformal prediction algorithm without covariates (Shafer & Vovk)

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \ldots, z_{n-1}$,
example $z$.

Task: Decide whether to include $z$ in $\gamma^\epsilon(z_1, \ldots, z_{n-1})$.

Algorithm:
1. Provisionally set $z_n := z$.
2. For $i = 1, \ldots, n$, set $\alpha_i := A( \llcorner z_1, \ldots, z_n \lrcorner \setminus \llcorner z_i \lrcorner, z_i)$.
3. Set $p_z := \frac{1}{n}\left | \left \{i \in \{1, \ldots, n\}: \alpha_i \ge \alpha_n \right \} \right |$.
4. Include $z$ in $\gamma^\epsilon(z_1, \ldots, z_{n-1})$ iff $p_z > \epsilon$.
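
The four steps can be sketched as follows, assuming the mean-based nonconformity measure from above and a finite grid of trial values (the algorithm itself decides one $z$ at a time). Exact rational arithmetic is used so that ties among the $\alpha_i$ are handled correctly. The data and function names are illustrative.

```python
from fractions import Fraction

def nonconformity(bag, z):
    """A(B, z) = |mean(B) - z|, computed exactly."""
    return abs(Fraction(sum(bag), len(bag)) - z)

def squared_nonconformity(bag, z):
    """A monotone increasing transformation of the (nonnegative) score above."""
    return nonconformity(bag, z) ** 2

def conformal_p_value(old, z, A=nonconformity):
    """Steps 1-3: p_z for trial value z, given old examples z_1, ..., z_{n-1}."""
    zs = list(old) + [z]                                        # step 1
    n = len(zs)
    alphas = [A(zs[:i] + zs[i + 1:], zs[i]) for i in range(n)]  # step 2
    return Fraction(sum(a >= alphas[-1] for a in alphas), n)    # step 3

def prediction_region(old, candidates, eps, A=nonconformity):
    """Step 4, applied to each trial value in a finite candidate grid."""
    return [z for z in candidates if conformal_p_value(old, z, A) > eps]

old = [3, 5, 4, 6, 5, 4, 5, 7, 4]        # illustrative data, n - 1 = 9
eps = Fraction(2, 10)
region = prediction_region(old, range(0, 11), eps)
region_sq = prediction_region(old, range(0, 11), eps, squared_nonconformity)
print(region, region == region_sq)
```

Squaring the score does not change which trial values are kept, which illustrates the invariance under monotone increasing transformations of $A$.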

## Conformal prediction with covariates (Shafer & Vovk)

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \ldots, z_{n-1}$,
object $x_n$, label $y$.

Task: Decide whether to include $y$ in $\Gamma^\epsilon(z_1, \ldots, z_{n-1}, x_n)$.

Algorithm:
1. Provisionally set $z_n := (x_n, y)$.
2. For $i = 1, \ldots, n$, set $\alpha_i := A( \llcorner z_1, \ldots, z_n \lrcorner \setminus \llcorner z_i \lrcorner, z_i)$.
3. Set $p_y := \frac{1}{n}\left | \left \{i \in \{1, \ldots, n\}: \alpha_i \ge \alpha_n \right \} \right |$.
4. Include $y$ in $\Gamma^\epsilon(z_1, \ldots, z_{n-1}, x_n)$ iff $p_y > \epsilon$.
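
A minimal sketch for classification, in which the trial example pairs the new object $x_n$ with each candidate label $y$ in turn. The nearest-neighbour nonconformity score, the one-dimensional data, and the function names are illustrative assumptions. Any other score could be substituted without affecting validity.

```python
from fractions import Fraction

def nn_nonconformity(bag, z):
    """Illustrative nearest-neighbour score for z = (x, y):
    (distance to the nearest example with the same label) divided by
    (distance to the nearest example with a different label).
    Large values mean z looks unlike the bag. Assumes a nonempty bag."""
    x, y = z
    inf = float("inf")
    d_same = min((abs(x - xi) for xi, yi in bag if yi == y), default=inf)
    d_diff = min((abs(x - xi) for xi, yi in bag if yi != y), default=inf)
    return inf if d_diff == 0 else d_same / d_diff

def label_p_value(old, x_new, y, A=nn_nonconformity):
    """Steps 1-3 with the trial example z_n := (x_new, y)."""
    zs = list(old) + [(x_new, y)]
    n = len(zs)
    alphas = [A(zs[:i] + zs[i + 1:], zs[i]) for i in range(n)]
    return Fraction(sum(a >= alphas[-1] for a in alphas), n)

def prediction_region(old, x_new, labels, eps, A=nn_nonconformity):
    """Step 4: keep every label whose p-value exceeds eps."""
    return {y for y in labels if label_p_value(old, x_new, y, A) > eps}

old = [(10, "a"), (12, "a"), (13, "a"), (15, "a"),
       (30, "b"), (32, "b"), (35, "b"), (36, "b"), (38, "b")]
labels = {"a", "b"}

clear_case = prediction_region(old, 14, labels, Fraction(1, 5))
# An object midway between the two groups, at a smaller error probability.
ambiguous_case = prediction_region(old, 22, labels, Fraction(1, 20))
print(clear_case, ambiguous_case)
```

With categorical labels the region is a set of labels, so no grid of trial values is needed: every label is tried.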

The proof that conformal prediction sets have their claimed coverage probability
is at its core the same argument we've used to show that (hits$+1$)/(reps$+1$) is a valid $P$-value in permutation tests: if the data are exchangeable, every permutation of the nonconformity scores (including the score for the final observation, which uses $z$ as a trial value) is equally likely.
If a value of $z$ would put the last point's nonconformity score too far in the tail to be plausible (that is, if the $P$-value of the hypothesis of exchangeability based on the nonconformity scores using that value of $z$ is too small), that value of $z$ is not included in the prediction set.
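
The coverage claim can be checked empirically. The sketch below is a simulation under assumptions (IID standard normal data, the mean-based nonconformity measure, a fixed seed), not a proof. It uses the fact that the true $z_n$ is in the prediction set exactly when its own $P$-value exceeds $\epsilon$, so no grid of trial values is needed.

```python
import random
from fractions import Fraction

def p_value(zs):
    """Conformal P-value of the last element of zs, with A(B, z) = |mean(B) - z|."""
    n = len(zs)
    total = sum(zs)
    # The mean of the other n - 1 values is (total - z_i) / (n - 1).
    alphas = [abs((total - z) / (n - 1) - z) for z in zs]
    return Fraction(sum(a >= alphas[-1] for a in alphas), n)

rng = random.Random(0)                  # fixed seed, for reproducibility only
n, eps, reps = 20, Fraction(1, 10), 2000
covered = 0
for _ in range(reps):
    zs = [rng.gauss(0, 1) for _ in range(n)]    # IID, hence exchangeable
    covered += p_value(zs) > eps                # z_n is in the region iff p > eps
coverage = covered / reps
print("empirical coverage:", coverage, " nominal:", float(1 - eps))
```

For continuous data the rank of $\alpha_n$ among the $n$ scores is uniform, so the empirical coverage should be close to $1-\epsilon$ whenever $n\epsilon$ is an integer, as it is here.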

## Optimality of conformal prediction (Shafer & Vovk, 2008)

Conformal prediction regions have a number of properties:

1. The predictions are invariant with respect to the ordering of previous examples: they depend on the nonconformity measure, the confidence level, and the _bag_ of previous examples.

2. The probability that the prediction includes the truth is at least the nominal confidence level. 
For every integer $n>0$ and every exchangeable probability distribution for $z_1, \ldots, z_n$, 
\begin{equation}
\mathbb{P} \{z_n \in \gamma^\epsilon(\llcorner z_1, \ldots, z_{n-1} \lrcorner) \} \ge 1-\epsilon.
\end{equation}
3. The prediction regions are nested: if $\epsilon_1 \ge \epsilon_2$, then $\gamma^{\epsilon_1}(B) \subseteq \gamma^{\epsilon_2}(B)$. 

**Lemma.** (Shafer & Vovk, Lemma 1)  
Suppose $\gamma$ is a procedure for creating prediction regions that satisfies the three conditions above, that $\llcorner a_1, \ldots , a_n \lrcorner$ is a bag of examples, and 
$0 < \epsilon \le 1$. 
Then $n\epsilon$ or fewer of the $n$ elements of the bag satisfy
\begin{equation}
    a_i \notin \gamma^\epsilon(\llcorner a_1, \ldots , a_n \lrcorner \setminus \llcorner a_i \lrcorner). 
    \tag{1} \label{lemma:1}
\end{equation}

**Proof.**  
Consider the unique exchangeable probability distribution for $(z_1, \ldots, z_n)$ that gives probability
1 to the bag $\llcorner z_1, \ldots, z_n \lrcorner = \llcorner a_1, \ldots, a_n \lrcorner$. 
For that distribution, each element of $\llcorner a_1, \ldots, a_n\lrcorner$ is equally likely to be $z_n$,
and a prediction error occurs exactly when the element $a_i$ that turns out to be $z_n$ satisfies (\ref{lemma:1}).
Since the second condition says that the chance of an error is at most $\epsilon$, the
fraction of elements of $\llcorner a_1, \ldots , a_n \lrcorner$ that satisfy (\ref{lemma:1}) cannot exceed $\epsilon$; that is, at most $n\epsilon$ of the $n$ elements satisfy it. $\Box$
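
The lemma can be checked directly for a conformal predictor. The sketch below assumes the mean-based nonconformity measure and an illustrative bag. It counts the elements $a_i$ that the region built from the rest of the bag would exclude.

```python
from fractions import Fraction

def excluded(bag, i, eps):
    """True if a_i is NOT in the conformal region built from the other
    elements of the bag, using A(B, z) = |mean(B) - z|."""
    n = len(bag)
    total = sum(bag)
    # Each score compares a_j with the mean of the other n - 1 elements.
    alphas = [abs(Fraction(total - a, n - 1) - a) for a in bag]
    p = Fraction(sum(a >= alphas[i] for a in alphas), n)
    return p <= eps

bag = [3, 5, 4, 6, 5, 4, 5, 7, 4, 12]   # illustrative bag of integers
n = len(bag)
for eps in (Fraction(1, 10), Fraction(1, 4), Fraction(1, 2)):
    errors = sum(excluded(bag, i, eps) for i in range(n))
    print(f"eps={eps}: {errors} excluded, bound n*eps={n * eps}")
```

The scores depend only on the full bag, so the same list of $\alpha_j$ serves for every choice of which element plays the role of $z_n$.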

### Tightening conformal prediction regions

The lemma provides a way to show how to alter a nonconformity measure to get tighter prediction regions.
Suppose we have a prediction region $\gamma^\epsilon$.
If $z \notin \gamma^\delta(B)$, it means that the predictor has confidence at least $1-\delta$ that $z$ is different enough from $B$ that it won't be the next value.
We can measure nonconformity using that "confidence."
The largest $1-\delta$ for which $z \notin \gamma^\delta(B)$ is a natural nonconformity measure:
\begin{equation}
A(B,z) := \sup \{1-\delta \mid z \notin \gamma^\delta(B) \}.
\end{equation}
Consider the prediction set $\gamma_A^\epsilon$ based on that nonconformity measure.
Then $\gamma_A^\epsilon(B) \subseteq \gamma^\epsilon(B)$.
To see that, suppose $z \in \gamma_A^\epsilon(\llcorner z_1, \ldots, z_{n-1} \lrcorner)$.
That means that when the algorithm sets $z_n := z$, strictly more than
$n\epsilon$ of the scores 
\begin{equation}
    \alpha_i := \sup\{1-\delta \mid z_i \notin \gamma^\delta( \llcorner z_1, \ldots, z_n \lrcorner \setminus \llcorner z_i \lrcorner ) \}, \;\; 
    i=1, \ldots, n
\end{equation}
are greater than or equal to $\alpha_n$.
Now suppose, to the contrary, that $z_n \notin \gamma^\epsilon(\llcorner z_1, \ldots, z_{n-1} \lrcorner)$.
Then $\alpha_n \ge 1-\epsilon$, so $\alpha_i \ge 1-\epsilon$ for strictly more than $n\epsilon$ values of $i$.
Since the prediction regions $\gamma$ produces are nested by confidence level, it follows (setting aside the boundary case in which the supremum is not attained) that $z_i \notin \gamma^\epsilon( \llcorner z_1, \ldots, z_n \lrcorner \setminus \llcorner z_i \lrcorner )$ for strictly more than $n\epsilon$ of the elements of $\llcorner z_i \lrcorner_{i=1}^n$, contradicting the lemma.
Hence $z \in \gamma^\epsilon(\llcorner z_1, \ldots, z_{n-1} \lrcorner)$.

## Extensions of conformal prediction

+ Weakening assumptions: assuming only exchangeability within labels (Shafer & Vovk, 2008)

+ Split conformal prediction (Papadopoulos et al., 2002)
    - It's computationally expensive to re-fit the model after each example.
    - Instead, split the sample into a training set (used to fit the model) and a calibration set (used to approximate the distribution of the nonconformity measure).
    
+ Balancing errors on subgroups 
    - Ensure, for instance, that the chance that the prediction set does not contain the true label is controlled for each possible value of the true label, or that the chance that the prediction set does not contain the true label is controlled for different subsets of the covariates.

+ Conformal _risk_ control. (Angelopoulos et al., 2022) 
    - Instead of guaranteeing $\mathbb{P} \{ z_n \in \gamma(\llcorner z_1, \ldots, z_{n-1} \lrcorner) \} \ge 1-\epsilon$, guarantee that $\mathbb{E} \ell(\gamma(\llcorner z_1, \ldots, z_{n-1} \lrcorner), z_n) \le \epsilon$ for any bounded _loss function_ $\ell$ that does not increase as $\gamma$ grows. 

+ Conformal prediction without exchangeability (Barber et al., 2022). Uses weights on examples. The coverage is approximate, not conservative, but they bound the loss of coverage in terms of the weights and the total variation distances between the distribution of the data and the distributions of the data with the current example swapped with each previous example.
    - distributional drifts: give more weight to more recent examples
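
Two of the extensions above, split conformal prediction and weighting recent examples more heavily, can be sketched together. This is a simplified illustration under assumptions: the model is a least-squares line, the nonconformity score is the absolute residual, and the weighted quantile reserves one unit of weight for the new point (treated as having an infinite score), in the spirit of the weighted approach. The data, the decay rate 0.99, and all function names are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalised weight reaches `level`.
    One extra unit of weight is reserved for the new point, whose score is
    treated as +infinity, so the result can be infinite."""
    total = sum(weights) + 1.0
    cum = 0.0
    for s, w in sorted(zip(scores, weights)):
        cum += w / total
        if cum >= level - 1e-12:       # tolerance for floating-point rounding
            return s
    return math.inf

def split_conformal_interval(train, cal, x_new, eps, weights=None):
    """Fit once on `train`, calibrate on `cal`, return an interval for x_new."""
    xs, ys = zip(*train)
    predict = fit_line(xs, ys)                        # no re-fitting per example
    scores = [abs(y - predict(x)) for x, y in cal]    # calibration scores
    if weights is None:
        weights = [1.0] * len(scores)                 # ordinary split conformal
    q = weighted_quantile(scores, weights, 1 - eps)
    return predict(x_new) - q, predict(x_new) + q

rng = random.Random(1)                                # illustrative synthetic data
data = [(x, 2 * x + rng.gauss(0, 1))
        for x in (rng.uniform(0, 10) for _ in range(100))]
train, cal = data[:50], data[50:]

lo, hi = split_conformal_interval(train, cal, x_new=5.0, eps=0.1)
# Geometric weights that favour the most recent calibration points.
decay = [0.99 ** (len(cal) - 1 - i) for i in range(len(cal))]
lo_w, hi_w = split_conformal_interval(train, cal, x_new=5.0, eps=0.1, weights=decay)
print((lo, hi), (lo_w, hi_w))
```

With equal weights and $m$ calibration points, the quantile is the $\lceil (1-\epsilon)(m+1) \rceil$-th smallest calibration score, which is the usual split conformal rule.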
