# Prediction-Powered Inference (PPI)

## Introduction and Problem Setup

**Prediction-Powered Inference (PPI)** is a statistical framework that enables valid inference when we have:
- A large dataset of **unlabeled covariates** $\{X_i\}_{i=1}^N$ (size $N$, typically very large)
- A small dataset of **labeled observations** $\{(X_i, Y_i)\}_{i=1}^n$ (size $n \ll N$)
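- A predictive model $f: \mathcal{X} \to \mathcal{Y}$ fitted on data independent of both samples above (for example, a pre-trained model or one trained on a separate labeled split)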

### The Core Challenge

Given these resources, we want to estimate some population parameter $\theta^* = \mathbb{E}[g(Y, X)]$ where $g$ is a known function. Examples include:
- **Mean estimation**: $\theta^* = \mathbb{E}[Y]$ where $g(y,x) = y$
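- **Regression coefficients**: $\theta^* = \mathbb{E}[XY]$, the cross-moment that, together with $\mathbb{E}[XX^\top]$, determines the linear regression coefficients $(\mathbb{E}[XX^\top])^{-1}\mathbb{E}[XY]$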
- **Risk measures**: $\theta^* = \mathbb{E}[\mathbf{1}\{Y > t\}]$ for threshold exceedance

### Why Not Just Use ML Predictions?

The naive approach would be to:
1. Train $f$ on labeled data
2. Predict $\hat{Y}_i = f(X_i)$ for all unlabeled $X_i$
3. Estimate $\tilde{\theta}_f = \frac{1}{N}\sum_{i=1}^N g(\hat{Y}_i, X_i)$

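**This fails because, in general:**
$$\mathbb{E}[\tilde{\theta}_f] = \mathbb{E}[g(f(X), X)] \neq \mathbb{E}[g(Y, X)] = \theta^*$$

The estimator $\tilde{\theta}_f$ is **biased** whenever $\mathbb{E}[g(f(X), X)] \neq \mathbb{E}[g(Y, X)]$. Perfect prediction ($f(X) = Y$ almost surely) rules this out, but for an imperfect predictor nothing guarantees that the two expectations agree, and the bias does not shrink as $N$ grows.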

## Mathematical Framework

### The Rectifier Approach

PPI introduces a **rectifier** $\Delta$ that captures the systematic bias:
$$\Delta = \mathbb{E}[g(Y, X) - g(f(X), X)]$$

The key insight is that $\theta^*$ can be decomposed as:
$$\theta^* = \mathbb{E}[g(f(X), X)] + \Delta = \tilde{\theta}_f^{pop} + \Delta$$

where $\tilde{\theta}_f^{pop} = \mathbb{E}[g(f(X), X)]$ is the population version of our biased estimator.

### PPI Estimator Construction

**Step 1: Estimate the bias using labeled data**
$$\hat{\Delta} = \frac{1}{n}\sum_{i=1}^n [g(Y_i, X_i) - g(f(X_i), X_i)]$$

**Step 2: Estimate population predictions using unlabeled data**
$$\tilde{\theta}_f = \frac{1}{N}\sum_{i=1}^N g(f(X_i), X_i)$$

**Step 3: Construct the PPI estimator**
$$\hat{\theta}_{PPI} = \tilde{\theta}_f + \hat{\Delta}$$
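
The three steps can be sketched for the mean-estimation case, $g(y, x) = y$. This is a minimal illustration on synthetic data, not a reference implementation: the function name `ppi_mean`, the data-generating process, and the deliberately biased `predict` function are all assumptions made for the example.

```python
import numpy as np

def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """PPI point estimate of E[Y], i.e. g(y, x) = y."""
    y_labeled = np.asarray(y_labeled, dtype=float)
    f_labeled = np.asarray(f_labeled, dtype=float)
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    rectifier = np.mean(y_labeled - f_labeled)  # Step 1: Delta-hat
    naive = np.mean(f_unlabeled)                # Step 2: theta-tilde_f
    return naive + rectifier                    # Step 3: theta-hat_PPI

def predict(x):
    # Hypothetical fixed predictor that overshoots the truth by 0.5
    return 2.5 + x

# Synthetic data (assumed for illustration): the true mean of Y is 2.0
rng = np.random.default_rng(0)
x_lab = rng.normal(size=200)
y_lab = 2.0 + x_lab + rng.normal(scale=0.3, size=200)
x_unl = rng.normal(size=20_000)

theta_naive = np.mean(predict(x_unl))
theta_ppi = ppi_mean(y_lab, predict(x_lab), predict(x_unl))
print(f"naive: {theta_naive:.3f}, PPI: {theta_ppi:.3f}")
```

In this setup the naive average stays near 2.5 however large the unlabeled sample is, while the rectified estimate lands near the true mean of 2.0.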

### Theoretical Properties

**Unbiasedness**: 
$$\mathbb{E}[\hat{\theta}_{PPI}] = \mathbb{E}[\tilde{\theta}_f] + \mathbb{E}[\hat{\Delta}] = \tilde{\theta}_f^{pop} + \Delta = \theta^*$$

**Variance**: Assuming the labeled and unlabeled samples are independent of each other and $f$ is fixed (not fitted on either sample),
$$\text{Var}(\hat{\theta}_{PPI}) = \text{Var}(\tilde{\theta}_f) + \text{Var}(\hat{\Delta}) = \frac{\sigma_f^2}{N} + \frac{\sigma_{\Delta}^2}{n}$$

where:
- $\sigma_f^2 = \text{Var}(g(f(X), X))$
- $\sigma_{\Delta}^2 = \text{Var}(g(Y, X) - g(f(X), X))$
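
The variance decomposition can be checked with a small Monte Carlo experiment. The sketch below assumes a synthetic model ($X \sim \mathcal{N}(0,1)$, $Y = X + \varepsilon$) and a hypothetical fixed predictor $f(x) = 0.8x$, chosen only because the population variances are then known in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, reps = 100, 2_000, 4_000

def predict(x):
    # Hypothetical fixed predictor, independent of the samples drawn below
    return 0.8 * x

estimates = np.empty(reps)
for r in range(reps):
    x_lab = rng.normal(size=n)
    y_lab = x_lab + rng.normal(scale=0.5, size=n)
    x_unl = rng.normal(size=N)
    # PPI estimate of E[Y] for this replication
    estimates[r] = np.mean(predict(x_unl)) + np.mean(y_lab - predict(x_lab))

# Population variances for this synthetic model
sigma_f2 = 0.8**2               # Var(f(X)) with X ~ N(0, 1)
sigma_delta2 = 0.2**2 + 0.5**2  # Var(Y - f(X)) = Var(0.2 X + noise)
predicted_var = sigma_f2 / N + sigma_delta2 / n
empirical_var = estimates.var(ddof=1)
print(f"predicted: {predicted_var:.5f}, empirical: {empirical_var:.5f}")
```

With $n \ll N$, the $\sigma_{\Delta}^2/n$ term dominates, so the quality of the predictor (through $\sigma_{\Delta}^2$) largely determines the precision.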

## Confidence Intervals and Inference

### Asymptotic Distribution

Under regularity conditions, as $n, N \to \infty$ with $n/N \to \lambda \in [0,1]$:

$$\sqrt{n}(\hat{\theta}_{PPI} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \sigma_{PPI}^2)$$

where the asymptotic variance is:
$$\sigma_{PPI}^2 = \lambda\,\sigma_f^2 + \sigma_{\Delta}^2$$

### Confidence Interval Construction

The **rectifier set approach** constructs confidence intervals by accounting for uncertainty in $\hat{\Delta}$:

**Step 1**: Build a confidence set for $\Delta$
$$R_{\alpha} = \left[\hat{\Delta} - z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_{\Delta}^2}{n}}, \hat{\Delta} + z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_{\Delta}^2}{n}}\right]$$

where $\hat{\sigma}_{\Delta}^2 = \frac{1}{n-1}\sum_{i=1}^n (g(Y_i, X_i) - g(f(X_i), X_i) - \hat{\Delta})^2$

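**Step 2**: Construct the prediction-powered confidence set
$$C_{PPI} = \{\tilde{\theta}_f + \delta : \delta \in R_{\alpha}\}$$

This gives the interval
$$\left[\tilde{\theta}_f + \hat{\Delta} - z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_{\Delta}^2}{n}}, \tilde{\theta}_f + \hat{\Delta} + z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_{\Delta}^2}{n}}\right]$$

which accounts only for the uncertainty in $\hat{\Delta}$. It has approximate level $(1-\alpha)$ when $N$ is large enough that the sampling error of $\tilde{\theta}_f$ (variance $\sigma_f^2/N$) is negligible. Otherwise, to match the variance formula above, widen it to
$$\hat{\theta}_{PPI} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_{\Delta}^2}{n} + \frac{\hat{\sigma}_f^2}{N}}$$

where $\hat{\sigma}_f^2$ is the sample variance of $g(f(X_i), X_i)$ over the unlabeled data.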
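
A sketch of the interval computation for the mean, assuming a normal approximation. Unlike the rectifier-only interval, this version also includes the $\hat{\sigma}_f^2/N$ term for the sampling error of $\tilde{\theta}_f$, consistent with the variance decomposition earlier. The function name `ppi_mean_ci` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist

def ppi_mean_ci(y_labeled, f_labeled, f_unlabeled, alpha=0.1):
    """Normal-approximation PPI interval for E[Y], i.e. g(y, x) = y."""
    y_labeled = np.asarray(y_labeled, dtype=float)
    f_labeled = np.asarray(f_labeled, dtype=float)
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    n, N = len(y_labeled), len(f_unlabeled)

    diffs = y_labeled - f_labeled
    theta_hat = f_unlabeled.mean() + diffs.mean()

    var_delta = diffs.var(ddof=1)    # sigma-hat^2_Delta
    var_f = f_unlabeled.var(ddof=1)  # sigma-hat^2_f
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(var_delta / n + var_f / N)
    return theta_hat - half_width, theta_hat + half_width

# Synthetic data (assumed for illustration): the true mean of Y is 2.0
rng = np.random.default_rng(2)
x_lab = rng.normal(size=200)
y_lab = 2.0 + x_lab + rng.normal(scale=0.3, size=200)
x_unl = rng.normal(size=20_000)
f_lab, f_unl = 2.5 + x_lab, 2.5 + x_unl  # hypothetical biased predictor

lower, upper = ppi_mean_ci(y_lab, f_lab, f_unl, alpha=0.1)
print(f"90% PPI interval: [{lower:.3f}, {upper:.3f}]")
```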

### Efficiency Gains

The **efficiency gain** of PPI over classical inference depends on the prediction quality. Define the **correlation coefficient**:
$$\rho = \frac{\text{Cov}(g(Y,X), g(f(X),X))}{\sqrt{\text{Var}(g(Y,X))\text{Var}(g(f(X),X))}}$$

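The **relative efficiency** of PPI vs. the classical estimator $\frac{1}{n}\sum_{i=1}^n g(Y_i, X_i)$ is the ratio of their variances:
$$\text{RE} = \frac{\sigma_Y^2/n}{\sigma_f^2/N + \sigma_{\Delta}^2/n} = \frac{\sigma_Y^2}{\lambda\,\sigma_f^2 + \sigma_{\Delta}^2}$$

where $\sigma_Y^2 = \text{Var}(g(Y,X))$, $\lambda = n/N$, and $\sigma_{\Delta}^2 = \sigma_Y^2 + \sigma_f^2 - 2\rho\,\sigma_Y\sigma_f$. When $\sigma_f \approx \sigma_Y$, this simplifies to $\text{RE} \approx \frac{1}{\lambda + 2(1-\rho)}$. As $\rho \to 1$ (near-perfect predictions), $\text{RE} \to 1/\lambda = N/n$: the $n$ labels become worth about $N$ labeled samples. For weak predictors ($\rho < (1+\lambda)/2$ in this simplified case), $\text{RE} < 1$ and PPI is less efficient than the classical estimator.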
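
The efficiency comparison can also be computed directly from the variance decomposition $\sigma_f^2/N + \sigma_{\Delta}^2/n$ given earlier, by dividing the classical variance $\text{Var}(g(Y,X))/n$ by it. The sketch below does this with plug-in sample variances for the mean. The function name `relative_efficiency` and both synthetic predictors are assumptions for illustration.

```python
import numpy as np

def relative_efficiency(y_labeled, f_labeled, f_unlabeled):
    """Plug-in estimate of Var(classical) / Var(PPI) for mean estimation."""
    y_labeled = np.asarray(y_labeled, dtype=float)
    f_labeled = np.asarray(f_labeled, dtype=float)
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    n, N = len(y_labeled), len(f_unlabeled)
    var_classical = y_labeled.var(ddof=1) / n
    var_ppi = (f_unlabeled.var(ddof=1) / N
               + (y_labeled - f_labeled).var(ddof=1) / n)
    return var_classical / var_ppi

rng = np.random.default_rng(5)
n, N = 500, 50_000
x_lab, x_unl = rng.normal(size=n), rng.normal(size=N)
y_lab = x_lab + rng.normal(scale=0.3, size=n)

# Hypothetical good predictor f(x) = x (highly correlated with Y)
re_good = relative_efficiency(y_lab, x_lab, x_unl)
# Hypothetical useless predictor: pure noise, uncorrelated with Y
re_bad = relative_efficiency(y_lab, rng.normal(size=n), rng.normal(size=N))
print(f"RE with good predictor: {re_good:.1f}, with noise: {re_bad:.2f}")
```

A ratio above 1 means PPI is more precise than using the labeled sample alone. With the noise predictor the ratio falls below 1, which shows that an uninformative $f$ can make the basic PPI estimator worse than the classical one.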

### Key Mathematical Insights

1. **Bias-Variance Tradeoff**: PPI trades off between:
   - **Variance reduction** from using large $N$ unlabeled samples
   - **Bias correction** using small $n$ labeled samples

2. **Optimality**: When $f$ has good predictive power ($\rho$ close to 1), PPI can approach the efficiency of having $N$ labeled samples while only requiring $n$ labels.

3. **Robustness**: PPI validity requires no accuracy assumptions on $f$: it works with any measurable black-box predictor, provided $f$ is not fitted on the labeled sample used for the rectifier. A poor $f$ only widens the intervals.

## PPI Workflow Visualization

```mermaid
flowchart TD
    %% Input Data
    subgraph "Input Data"
        A["`**Large Unlabeled Data**
        X₁, X₂, ..., X_N
        Size: N (very large)`"]
        B["`**Small Labeled Data** 
        (X₁,Y₁), (X₂,Y₂), ..., (X_n,Y_n)
        Size: n << N`"]
    end
    
    %% Model Training
    subgraph "Model Training"
        C["`**Train Predictor**
        f: X → Y
        using labeled data`"]
    end
    
    %% Two Parallel Computations
    subgraph "Parallel Computations"
        D["`**Unlabeled Path**
        Compute predictions f(Xᵢ)
        for all N unlabeled points`"]
        E["`**Labeled Path**
        Compute predictions f(Xᵢ) 
        for all n labeled points`"]
    end
    
    %% Core PPI Components
    subgraph "PPI Components"
        F["`**Biased Estimator**
        θ̃_f = (1/N) Σ g(f(Xᵢ), Xᵢ)
        Uses all N predictions`"]
        G["`**Rectifier (Bias Correction)**
        Δ̂ = (1/n) Σ [g(Yᵢ,Xᵢ) - g(f(Xᵢ),Xᵢ)]
        Estimates systematic bias`"]
        H["`**Variance Estimation**
        σ̂²_Δ = Var[g(Y,X) - g(f(X),X)]
        For confidence intervals`"]
    end
    
    %% Final Results
    subgraph "Final Inference"
        I["`**PPI Estimator**
        θ̂_PPI = θ̃_f + Δ̂
        Unbiased estimate`"]
        J["`**Confidence Interval**
        θ̂_PPI ± z₁₋α/₂ √(σ̂²_Δ/n + σ̂²_f/N)
        Valid statistical inference`"]
    end
    
    %% Flow connections
    A --> C
    B --> C
    C --> D
    C --> E
    
    D --> F
    E --> G
    B --> H
    
    F --> I
    G --> I
    H --> J
    I --> J
    
    %% Styling
    classDef inputStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef modelStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef computeStyle fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef ppiStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef resultStyle fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    
    class A,B inputStyle
    class C modelStyle
    class D,E computeStyle
    class F,G,H ppiStyle
    class I,J resultStyle
```

**What estimands can PPI be used for?**
The principle of prediction-powered inference can be used to construct valid confidence intervals for any estimand that can be expressed as the minimizer of a convex objective function.
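
A sketch of how the rectifier idea might carry over to a convex objective. The loss $\ell_\theta$ and its gradient $\nabla\ell_\theta$ are notation introduced here for illustration, and differentiability is assumed for simplicity. The estimand is characterized by a first-order condition:

$$
\theta^* = \arg\min_\theta \mathbb{E}[\ell_\theta(X, Y)]
\quad\Longleftrightarrow\quad
\mathbb{E}[\nabla\ell_{\theta^*}(X, Y)] = 0
$$

The gradient on true labels then splits into a gradient on predictions plus a $\theta$-dependent rectifier:

$$
\mathbb{E}[\nabla\ell_\theta(X, Y)] = \mathbb{E}[\nabla\ell_\theta(X, f(X))] + \Delta_\theta,
\qquad
\Delta_\theta = \mathbb{E}[\nabla\ell_\theta(X, Y) - \nabla\ell_\theta(X, f(X))]
$$

For the mean, take $\ell_\theta(x, y) = \tfrac12(\theta - y)^2$, so $\nabla\ell_\theta(x, y) = \theta - y$ and $\Delta_\theta = \mathbb{E}[f(X) - Y]$. Setting the rectified gradient to zero gives

$$
\theta - \mathbb{E}[f(X)] + \mathbb{E}[f(X) - Y] = 0
\quad\Longrightarrow\quad
\theta^* = \mathbb{E}[f(X)] + \mathbb{E}[Y - f(X)],
$$

which recovers the decomposition $\theta^* = \tilde{\theta}_f^{pop} + \Delta$ from the rectifier section.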

---

**Prediction-Powered Inference For LLM Eval**

We have two signals for assessing model performance:
- **Human labels.** Typically accurate, but expensive to collect, so usually only a small sample is available. An estimate based on these samples alone will have high variance.

- **Autorater predictions.** Easy to collect at large sample sizes, but possibly systematically biased.

This suggests that one must choose between (1) a high-variance but unbiased estimate from a small human sample and (2) a lower-variance but biased autorater-based estimate.

**PPI gives us a statistically valid way of combining the autorater and human data for hybrid evaluations.**
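
A toy simulation of such a hybrid evaluation, estimating a pass rate from binary judgments. Everything here is assumed for illustration: the true pass rate of 0.6, the autorater's error rates, and the helper name `autorate`. In practice the autorater would be an actual model and the human labels would come from annotators.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 50_000, 300
true_rate = 0.6

def autorate(human, rng):
    # Hypothetical autorater: misses 10% of passes, wrongly passes 30% of fails
    flip_prob = np.where(human, 0.10, 0.30)
    flips = rng.random(human.size) < flip_prob
    return np.where(flips, ~human, human)

# Large pool scored only by the autorater (human judgments never observed)
auto_unl = autorate(rng.random(N) < true_rate, rng)

# Small sample scored by both humans and the autorater
human_lab = rng.random(n) < true_rate
auto_lab = autorate(human_lab, rng)

human_only = human_lab.mean()     # unbiased, high variance
autorater_only = auto_unl.mean()  # low variance, biased (about 0.66 here)
rectifier = np.mean(human_lab.astype(float) - auto_lab.astype(float))
ppi = autorater_only + rectifier  # bias-corrected hybrid estimate
print(f"human: {human_only:.3f}, autorater: {autorater_only:.3f}, PPI: {ppi:.3f}")
```

The autorater-only estimate concentrates tightly around the wrong value, while the PPI estimate is centered on the true rate. How much tighter it is than the human-only estimate depends on how often the autorater agrees with the humans.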


Standard PPI does not take heterogeneity into account.

>  Heterogeneity refers to dissimilarities or variations across different subgroups or conditions within your data. This implies that a single, simple model or a single set of parameters might not adequately describe the entire population or dataset. The underlying processes, relationships, or effects differ for distinct parts of the data.




**Core Idea:**
- You have a large set of unlabeled covariates X, but only a small sample of observed outcomes Y
- Train a predictive model $f(X)$ on labeled data, then use it to estimate a parameter (e.g., mean, proportion) over the entire population using unlabeled data.
- Valid inference is achieved by _correcting_ the bias of the predictor via a labeled test set.

Prediction-powered inference allows us to "impute" outcomes via ML to boost the effective sample size, then use the labeled data to "rectify" those imputations so that the resulting inferences remain valid.

**Why can’t we simply plug ML-imputed labels into a standard confidence-interval procedure and be done?**
1. **Bias and invalid coverage**: When you replace the true $Y_i$ with $f(X_i)$, your estimator $\tilde\theta_f$ will typically be **biased** (unless $f$ is perfect), and as a result your CI will not cover the true parameter $\theta^*$ at the nominal rate.
2. **The rectifier**: Prediction-Powered Inference (PPI) fixes this by introducing a **rectifier** $\Delta$ that captures the systematic error from using $f(X)$ in place of $Y$. Concretely, for the simple case of estimating a mean, one defines
$$\Delta = \mathbb{E}\bigl[Y - f(X)\bigr],$$
and then uses the labeled data to build a **confidence set** $R$ for $\Delta$. Finally, one **“rectifies”** the naive estimator $\tilde\theta_f$ by each candidate in $R$ to obtain a valid CI for $\theta^*$.

The predictions add value only **after** the rectifier ensures **validity**; without it, you would simply be averaging biased predictions under the false pretence that they are “real” data.

---

In prediction-powered inference (PPI), the “rectifier set” $R$ is **not** the single-point estimate $\hat\Delta$ but rather a **confidence set** containing all plausible values of the rectifier $\Delta$ given your labeled data. Because $\Delta$ is itself estimated with sampling noise, we construct $R$ (typically an interval) to capture that uncertainty. In the mean-estimation warm-up, $\Delta = \mathbb{E}[\,Y - f(X)\,]$ is estimated by the empirical average $\hat\Delta = \frac1n\sum_i\bigl(Y_i - f(X_i)\bigr)$; the rectifier set is then
$$
R = \bigl[\hat\Delta - z_{1-\alpha/2}\,\mathrm{SE}(\hat\Delta),\;\hat\Delta + z_{1-\alpha/2}\,\mathrm{SE}(\hat\Delta)\bigr],
$$
an **infinite** continuum of values rather than a single number.

# Conformal Prediction
**Core Idea:**
- CP wraps around _any_ predictive model (black-box or otherwise).
- Given a trained model and a calibration set, it outputs _set-valued predictions_ that contain the true label (or value) with probability $\ge 1 - \alpha$.
- It gives **finite-sample, distribution-free guarantees**, assuming the calibration and test points are exchangeable.

**Ideas:**
- **Quality assurance in imputation:** Use CP to generate _interval estimates_ when imputing missing values (e.g., income, education level).
- **Outlier detection:** Conformal methods (like conformal anomaly detection) could flag units whose predicted behavior doesn’t match the calibration distribution.

- CP computes nonconformity scores on labeled calibration data held out from training, and uses them to create prediction sets for new, unlabeled test data.
- It requires a user-specified significance level at which the algorithm should produce its prediction sets.
    - This significance level bounds the frequency of errors that the algorithm is allowed to make.
    - For example, a significance level of 0.1 means that, on average, at most 10% of the prediction sets may miss the true label.
- To meet this requirement, the output is a prediction set instead of the point prediction produced by standard supervised models.
    - For classification, this means that a prediction is not a single class such as `cat` but a set such as `{'cat', 'dog'}`.
- Depending on how good the underlying model is (how well it can discern between classes) and on the specified significance level, these sets can be smaller or larger.
- For regression tasks, the output is a prediction interval.

---
$C$ is a _set-valued_ function that maps each test point $X_{\text{test}}$ to a subset of the $K$ possible labels. Concretely, after calibration you compute
$$C(X_{\text{test}}) \;=\; \{\,y\in\{1,\dots,K\}: \hat f(X_{\text{test}})_y \;\ge\;1 - \hat q\},$$
where $\hat q$ is the calibrated threshold constructed below.

Let:
- $\hat f(x)_y$ be your model’s softmax “probability” for class $y$.
- $\alpha\in(0,1)$ be your target miscoverage level (e.g. $\alpha=0.1$ for 90% coverage).
- $\{(X_i,Y_i)\}_{i=1}^n$ be a calibration set that was not used to train $\hat f$.

To construct $C$ from $\hat f$ and the calibration data:
1. **Nonconformity scores.** Set
$$s_i \;=\; 1 - \hat f(X_i)_{Y_i} \quad\text{for }i=1,\dots,n,$$
that is, one minus the softmax output of the true class. A large $s_i$ means the model gave low probability to the true label, so the score is high when the model is badly wrong.
2. **Threshold via empirical quantile.** Order the $s_i$ from smallest to largest and let
$$\hat q \;=\; \text{the }\lceil (n+1)(1-\alpha)\rceil\text{-th smallest }s_i.$$
Equivalently, $\hat q$ is the empirical quantile of the calibration scores at level $\lceil (n+1)(1-\alpha)\rceil / n$, which is slightly above $1-\alpha$; this finite-sample correction is what delivers coverage of at least $1-\alpha$.
3. **Prediction sets.** For a new test point $X_{\text{test}}$, create the prediction set
$$C(X_{\text{test}}) \;=\; \{\,y\in\{1,\dots,K\}: \hat f(X_{\text{test}})_y \;\ge\;1 - \hat q\},$$
which includes all classes with a high enough softmax output.
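
The calibration and prediction steps can be sketched with NumPy. The softmax outputs below come from a simulated model (the `simulate` helper is purely an assumption for the example), and the function names `conformal_threshold` and `prediction_sets` are illustrative, not from any library.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """q-hat: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # s_i
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: every class is included
    return np.sort(scores)[k - 1]

def prediction_sets(test_probs, q_hat):
    """Boolean mask of shape (n_test, K): True where class y is in C(x)."""
    return test_probs >= 1.0 - q_hat

rng = np.random.default_rng(4)

def simulate(m, K=3):
    # Hypothetical model output: noisy logits with a boost on the true class
    labels = rng.integers(0, K, size=m)
    logits = rng.normal(size=(m, K))
    logits[np.arange(m), labels] += 2.0
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True), labels

cal_probs, cal_labels = simulate(1_000)
test_probs, test_labels = simulate(5_000)

q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(test_probs, q_hat)
coverage = sets[np.arange(len(test_labels)), test_labels].mean()
avg_size = sets.sum(axis=1).mean()
print(f"coverage: {coverage:.3f}, average set size: {avg_size:.2f}")
```

Empirical coverage on the test draw should sit close to the 90% target. The average set size is the quantity that improves when the underlying model gets better.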

--- 
## **Cross-Prediction-Powered Inference (CrossPPI)**
### **Core Idea**
CrossPPI addresses the overfitting and bias that arise when the **same** data are used to train $f$ and to estimate the rectifier. It partitions the labeled dataset into $K$ folds: for each fold, the model is trained on the other $K-1$ folds and used to predict on the held-out fold. This “cross-prediction” mimics an independent test set and avoids overly optimistic bias estimates.
### **Debiasing Mechanism**

1. **Fold-wise rectifiers** $\hat\Delta^{(k)}$ are computed on each held-out fold $k$.
2. These are pooled to form an overall confidence set $R$ for the bias $\Delta$.
3. The final **prediction-powered set**

    $$C_{PP} \;=\;\bigl\{\;\tilde\theta_f + d : d\in R\bigr\}$$

    retains valid coverage (asymptotically) while leveraging all labeled data without reuse bias. Here $\tilde\theta_f$ is computed from the unlabeled data, typically by averaging the predictions of the $K$ fold models.

CrossPPI needs **no consistency** or accuracy guarantee on $f$ for validity. It requires only that each fold's predictor be a function of its training folds alone, so that held-out predictions are not contaminated by the labels they are compared against. Empirically, it often tightens intervals compared to classical PPI when the model generalizes well across folds.
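
A minimal sketch of the cross-prediction point estimate for a mean, under stated assumptions: a simple least-squares learner (`fit_linear`) stands in for whatever model is actually used, the data are synthetic, and unlabeled predictions are averaged across the $K$ fold models. It produces only the point estimate; a confidence interval would additionally need a variance estimate that accounts for the dependence between folds.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on [1, x]; a stand-in for any learner."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    def predict(x_new):
        return coef[0] + coef[1] * x_new

    return predict

def cross_ppi_mean(x_lab, y_lab, x_unl, n_folds=5, seed=0):
    """Cross-prediction estimate of E[Y] using K-fold held-out residuals."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = np.array_split(rng.permutation(n), n_folds)
    residuals = np.empty(n)
    unl_preds = np.zeros(len(x_unl))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        model = fit_linear(x_lab[train], y_lab[train])
        # Rectifier terms use only predictions on points the model never saw
        residuals[held_out] = y_lab[held_out] - model(x_lab[held_out])
        # Average the fold models' predictions on the unlabeled data
        unl_preds += model(x_unl) / n_folds
    return unl_preds.mean() + residuals.mean()

# Synthetic data (assumed for illustration): E[Y] = 1.0
rng = np.random.default_rng(6)
x_lab = rng.normal(size=200)
y_lab = 1.0 + 2.0 * x_lab + rng.normal(scale=0.5, size=200)
x_unl = rng.normal(size=20_000)

theta_cross = cross_ppi_mean(x_lab, y_lab, x_unl)
print(f"cross-prediction estimate of E[Y]: {theta_cross:.3f}")
```

Every labeled point contributes both to training (in $K-1$ folds) and to the rectifier (in its held-out fold), which is how the approach avoids setting aside a separate labeled split.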
