# Conformal Prediction

Core reference: Shafer and Vovk, 2008. A Tutorial on Conformal Prediction, _Journal of Machine Learning Research, 9_, 371-421.

Basic problem setup: we observe items sequentially. Each item has a label, $y$, which could be numerical or categorical.
There is some black-box model that makes a prediction $\hat{y}$ of the value of $y$ for the next item, 
generally using side information $x$ (e.g., covariates).
The predictor can "learn" from previous cases (also called _examples_): it does not have to stay the same over time.
After each prediction $\hat{y}$ is made, the true label $y$ is revealed.
Thus, $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ and $x_n$ are available to predict $y_n$.

The problem statement is, given an acceptable error probability $\epsilon$, construct a "prediction region" $\Gamma^\epsilon$ of labels that contains the label of the next item with probability at least $1-\epsilon$.
A $1-\epsilon$ prediction region is _valid_ if it contains the truth at least $1-\epsilon$ of the time.

Conformal prediction works for any prediction method: regression, Bayesian modeling, support-vector machines, neural networks, etc. 

Recall [the definition of a _multiset_ or _bag_](./math-foundations.ipynb#Multisets): like a set, a bag is a collection of things, but unlike a set, it can contain multiple copies of the same thing. It amounts to an unordered list.
We will use $\Lbag \cdot \Rbag$ to denote a bag.

The basic assumption of conformal prediction is that the cases are _exchangeable_, which is true if they are IID, but not quite as restrictive as IID. 
(There has been work on weakening that assumption: see, e.g., Barber, R.F., E.J. Candès, A. Ramdas, and R.J. Tibshirani, 2022. Conformal prediction beyond exchangeability, https://arxiv.org/abs/2202.13415.)

One definition of exchangeability of a finite collection of random variables $\{Z_1, \ldots, Z_n\}$ is that their joint probability distribution is invariant under permutations of the indices. That is, the distribution of $(Z_1, \ldots, Z_n)$ is the same as the distribution of $(Z_{\pi_1}, \ldots, Z_{\pi_n})$ for every permutation $\pi$ of $\{1, \ldots, n\}$.
An infinite sequence of random variables $Z_1, Z_2, \ldots$ is exchangeable if the finite collection $\{Z_1, \ldots, Z_n\}$
is exchangeable for every $n$.
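
To make the definition concrete, the sketch below checks permutation invariance by brute force for a small case: draws without replacement from an urn, which are exchangeable but not independent. The function names `joint_pmf_without_replacement` and `is_exchangeable` and the urn contents are illustrative assumptions, not part of Shafer & Vovk's presentation.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def joint_pmf_without_replacement(urn, n):
    """Joint pmf of the first n draws without replacement from `urn` (a list)."""
    # Every ordered selection of n distinct positions is equally likely.
    orderings = list(permutations(range(len(urn)), n))
    counts = Counter(tuple(urn[i] for i in idx) for idx in orderings)
    return {outcome: Fraction(c, len(orderings)) for outcome, c in counts.items()}


def is_exchangeable(pmf, n):
    """True if every outcome has the same probability as each of its reorderings."""
    for outcome, prob in pmf.items():
        for pi in permutations(range(n)):
            permuted = tuple(outcome[i] for i in pi)
            if pmf.get(permuted, Fraction(0)) != prob:
                return False
    return True


urn = ["red", "red", "blue"]
pmf = joint_pmf_without_replacement(urn, 2)
print(is_exchangeable(pmf, 2))  # True

# The draws are not independent: P(red, red) differs from P(red)^2.
p_red_first = sum(p for (first, _), p in pmf.items() if first == "red")
print(pmf[("red", "red")], p_red_first ** 2)  # 1/3 4/9
```
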

Shafer & Vovk also define exchangeability using a betting protocol, as follows.

### Backward-Looking Betting Protocol (Shafer & Vovk, 2008)

Two players, Alice & Bob.
+ $\mathcal{K}_N := 1$.
+ Alice announces a bag $\mathcal{B}_N$ of size $N$.
+ for $n = N, N-1, \ldots, 2, 1$:
    + Bob bets on $z_n$ at odds set by $\mathbb{P} \{z_n = a \mid \mathcal{B}_n = \Lbag a_1, \ldots, a_n \Rbag \} = k/n$, where $k$ is the number of times $a$ occurs in $\mathcal{B}_n$.
    + Alice announces $z_n \in \mathcal{B}_n$.
    + $\mathcal{K}_{n-1} := \mathcal{K}_n$ plus Bob's winnings on $z_n$.
    + $\mathcal{B}_{n-1} := \mathcal{B}_n \setminus \Lbag z_n \Rbag$.

Bob's moves are constrained to guarantee that his capital $\mathcal{K}_n$ will be nonnegative for all $n$, no matter how Alice
moves.

Shafer & Vovk define exchangeability as saying that Bob will not multiply his initial capital $\mathcal{K}_N$ by a
large factor in this game.
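
The protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Alice reveals the examples in a uniformly random order (the exchangeable case), and Bob follows one hypothetical strategy, staking a fixed fraction of his current capital on the most common element of the current bag. The names `backward_odds`, `play`, and `stake_fraction` are made up for this sketch.

```python
import random
from collections import Counter
from fractions import Fraction


def backward_odds(bag):
    """P{z_n = a | B_n} = k/n for each distinct element a of the bag (a Counter)."""
    n = sum(bag.values())
    return {a: Fraction(k, n) for a, k in bag.items()}


def play(examples, stake_fraction=Fraction(1, 2), rng=None):
    """Play the protocol once and return Bob's final capital."""
    if rng is None:
        rng = random.Random(0)
    bag = Counter(examples)              # Alice announces B_N
    capital = Fraction(1)                # K_N := 1
    while bag:
        odds = backward_odds(bag)
        guess = max(odds, key=odds.get)  # Bob's (assumed) strategy
        stake = stake_fraction * capital
        # Alice announces z_n; a uniform draw from the bag is an assumption
        # corresponding to exchangeable data.
        z = rng.choice(list(bag.elements()))
        if z == guess:
            # Fair payoff for an event of probability p: stake * (1 - p) / p.
            capital += stake * (1 - odds[guess]) / odds[guess]
        else:
            capital -= stake
        bag[z] -= 1                      # B_{n-1} := B_n minus one copy of z_n
        if bag[z] == 0:
            del bag[z]
    return capital
```

Because Bob never stakes more than he holds, his capital stays nonnegative. Every bet is fair at the stated odds, so averaging `play(...)` over many random orderings should give a value close to 1: under this assumed random ordering, Bob does not systematically multiply his capital.
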

Shafer & Vovk consider two cases, one of which is a special case of the other:

1. Predict using old examples alone. Just before observing $z_n$,  predict it 
using the previous examples $(z_1, \ldots, z_{n-1})$.

2. Predict using features (covariates) of the new example. Each example $z_i = (x_i, y_i)$, where $x_i$ is side information
and $y_i$ is a label. The data are $x_1, y_1, \ldots , x_N, y_N$. Just before observing $y_n$, we predict it from $x_n$ and $(z_i)_{i=1}^{n-1}$. 

An essential ingredient in conformal prediction is a measure of _nonconformity_ between an _example_ $z$ and a bag $\mathcal{B}$ of other examples, a real-valued function $A(\mathcal{B}, z)$.
If there is a distance function defined on examples, $d(z, z')$, the nonconformity measure could be 
defined using the distance between a point estimator $\hat{z}$ and the example:
\begin{equation}
A(\mathcal{B}, z) := d(\hat{z}(\mathcal{B}), z).
\end{equation}
In general, the precision of the prediction regions will depend on the nonconformity measure, but the validity of the prediction regions will not.
The prediction regions that conformal prediction produces are invariant under monotone increasing
transformations of $A$.
Hence, in this approach to constructing a nonconformity measure, the particular choice of distance $d(\cdot, \cdot)$
is less important than the choice of point predictor $\hat{z}$.
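
One way to see the invariance: in the algorithm given later in this section, the nonconformity scores enter only through comparisons of the form $\alpha_i \ge \alpha_n$, and a strictly increasing transformation preserves every such comparison. The sketch below illustrates this with made-up scores; the helper `p_value` and the numbers are illustrative assumptions.

```python
import math


def p_value(alphas):
    """Fraction of scores that are at least as large as the last score."""
    return sum(a >= alphas[-1] for a in alphas) / len(alphas)


alphas = [0.3, 1.2, 0.7, 2.5, 0.9]           # hypothetical nonconformity scores
transformed = [math.exp(a) for a in alphas]  # a strictly increasing transformation

print(p_value(alphas) == p_value(transformed))  # True
```
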

If the labels are real numbers, one could define $A(\mathcal{B}, z)$ in a number of simple ways, such as
\begin{equation}
 A(\mathcal{B}, z) := |\bar{z}_{\mathcal{B}} - z|,
\end{equation}
where $\bar{z}_{\mathcal{B}} := (\sum_{a \in \mathcal{B}} a)/|\mathcal{B}|$.
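
As a minimal sketch, this measure is a one-line function when a bag of real numbers is represented as a Python list. The second function swaps the mean for the median as the point predictor; that variant is an assumption added here to show how the point predictor can change, and both function names are hypothetical.

```python
from statistics import fmean, median


def nonconformity_mean(bag, z):
    """A(B, z) := |mean(B) - z| for a non-empty bag of real numbers."""
    return abs(fmean(bag) - z)


def nonconformity_median(bag, z):
    """Variant (an assumption of this sketch): use the median as the point predictor."""
    return abs(median(bag) - z)


print(nonconformity_mean([1, 2, 3], 5))         # 3.0
print(nonconformity_median([1, 2, 3, 100], 2))  # 0.5
```
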

## Conformal prediction algorithm without covariates (Shafer & Vovk)

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \ldots, z_{n-1}$,
example $z$.

Task: Decide whether to include $z$ in $\Gamma^\epsilon(z_1, \ldots, z_{n-1})$.

Algorithm:
1. Provisionally set $z_n := z$.
2. For $i = 1, \ldots, n$, set $\alpha_i := A( \Lbag z_1, \ldots, z_n \Rbag \setminus \Lbag z_i \Rbag, z_i)$.
3. Set $p_z := \frac{1}{n}\left | \left \{i \in \{1, \ldots, n\}: \alpha_i \ge \alpha_n \right \} \right |$.
4. Include $z$ in $\Gamma^\epsilon(z_1, \ldots, z_{n-1})$ iff $p_z > \epsilon$.
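
The four steps above can be sketched directly in Python. This is a minimal illustration, not a reference implementation: it assumes real-valued examples stored in a list, uses the absolute deviation from the bag mean as the nonconformity measure, and approximates the prediction region by testing a finite grid of candidate values. The function names, the data, and the grid are all hypothetical.

```python
from statistics import fmean


def abs_dev_from_mean(bag, z):
    """Nonconformity A(B, z) := |mean(B) - z|."""
    return abs(fmean(bag) - z)


def conformal_p_value(old_examples, z, nonconformity=abs_dev_from_mean):
    """Compute p_z for candidate z given z_1, ..., z_{n-1} (steps 1 to 3)."""
    examples = list(old_examples) + [z]  # step 1: provisionally set z_n := z
    n = len(examples)
    # Step 2: score each example against the bag of all the other examples.
    alphas = [
        nonconformity(examples[:i] + examples[i + 1:], examples[i])
        for i in range(n)
    ]
    # Step 3: fraction of examples at least as nonconforming as z_n.
    return sum(a >= alphas[-1] for a in alphas) / n


def in_region(old_examples, z, epsilon, nonconformity=abs_dev_from_mean):
    """Step 4: z belongs to the prediction region iff p_z > epsilon."""
    return conformal_p_value(old_examples, z, nonconformity) > epsilon


old = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # made-up past examples
candidates = [k / 2 for k in range(-20, 41)]          # grid from -10 to 20
region = [z for z in candidates if in_region(old, z, epsilon=0.2)]
print(min(region), max(region))
```

Each candidate requires recomputing all $n$ scores, so the cost grows with both the number of examples and the size of the grid. For a continuous label, the grid only approximates $\Gamma^\epsilon$.
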