# (Athey et al., 2024) The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely

[Paper Link](https://arxiv.org/pdf/1603.09326)

# 1. Introduction

A fundamental challenge for evaluating interventions is that the primary outcome of interest is often hard to measure. Researchers are often interested in the effect of a policy on some long-term outcome but do not observe that outcome in their study. Instead, they observe a number of short-term outcomes that are all related to this primary outcome of interest.

The researcher is faced with making recommendations regarding the future implementation of the intervention on the basis of measurements of its effect on a variety of sometimes disparate and possibly conflicting outcome measures. A key question is how to balance these different outcomes when making an overall assessment.

This paper lays out a framework for analyzing these issues.

The framework assumes two datasets. The first is typically experimental: the researcher observes the treatment and the surrogates but not the primary outcome. The second is an observational dataset where the researcher observes the surrogates and the primary outcome but does not observe treatment.

The paper makes 4 main contributions:

**First Contribution**:

Articulates the 3 key assumptions under which the ATE on the primary outcome is identified from the combination of experimental and observational samples:
- i. A standard assumption that the assignment in the experimental sample is *Unconfounded*
- ii. a *Surrogacy* assumption which requires that the causal path from the treatment to the primary outcome goes through the surrogates
- iii. a *Comparability* or external validity assumption, which requires that the observational and experimental samples are comparable in the sense that the outcome distributions conditional on surrogates and pre-treatment variables are identical.

Under these 3 assumptions, the ATE on the primary outcome can be estimated as the ATE on an aggregate of surrogates, which we label the surrogate index. This index combines the individual surrogates through their predicted value of the primary outcome.

For example, when studying the impact of class size (the treatment), the primary outcome might be high school graduation, a lagging metric, and the two surrogates might be mathematics and reading scores.

**Second Contribution**:

Derive the efficiency bound and propose various efficient estimators under various scenarios, including scenarios with a single sample or two samples, as well as with and without Surrogacy. This allows us to quantify the information content of the Surrogacy assumption.

**Third Contribution**:

Provide bounds on the biases that arise in scenarios where either or both of Surrogacy or Comparability are violated. Show that even if these assumptions fail to hold (but unconfoundedness does hold), the proposed estimators still estimate a well-defined causal effect, by providing a principled way of combining short-term outcomes in a single measure through their predicted effect on the long-term outcome.

**Fourth Contribution**:

Evaluate these methods in the context of a labor market program.

This study is related to 3 main bodies of literature:
- surrogacy
- mediation
- missing data







# 2. Setup and Notation

We define two samples, an Experimental (E) sample and an Observational (O) sample, with $N_E$ and $N_O$ units or individuals, respectively. It is convenient to view the data as consisting of a single sample of size $N=N_E+N_O$ with $P_i\in\{O,E\}$, a binary indicator denoting the sample to which unit $i$ belongs.

For each unit, there is a:
- binary treatment of interest $W_i\in\{0,1\}$
- scalar primary outcome $Y_i$. This outcome is not observed for individuals in the experimental sample.
- intermediate or secondary outcomes (surrogates), denoted by $S_i$ for each unit. Typically vector-valued to make the properties we define plausible.
- pre-treatment covariates $X_i$ for each unit, known not to be affected by the treatment.

Each unit has 2 pairs of potential outcomes:
- $(Y_i(0),Y_i(1))$
- $(S_i(0),S_i(1))$

The arguments 0 and 1 are the values of $W_i$, the treatment of interest.

Overall, the units are characterized by the values of the septuple

$$(Y_i(0), Y_i(1), S_i(0), S_i(1), X_i, W_i, P_i)$$

Of course, we do not observe the full septuple for any unit.

Instead, we observe:

For units in experimental sample: $(X_i, W_i, S_i)$ with support $(\mathbb{X},\mathbb{W},\mathbb{S})$, with $\mathbb{W}=\{0,1\}$

For units in the observational sample, we do not observe the treatment to which each of the $N_O$ individuals was assigned. We observe $(X_i,S_i,Y_i)$ with support $(\mathbb{X},\mathbb{S},\mathbb{Y})$.

To simplify the exposition, we analyze the data as if we have a random sample from a population of units for which we observe the quintuple

$$(P_i,X_i, S_i, 1_{P_i=E}W_i, 1_{P_i=O}Y_i)$$

where we treat $P_i$ as a random variable taking on the values $\{O,E\}$.

**Assumption 1:** We have a single random sample of size $N$ drawn from the joint distribution of $(P_i, X_i, S_i, W_i, Y_i)$, where we observe for each unit in the sample $(P_i,X_i, S_i, 1_{P_i=E}W_i, 1_{P_i=O}Y_i)$.
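The masking in Assumption 1 can be made concrete with a small sketch. The class and function names below (`FullUnit`, `ObservedUnit`, `observe`) are hypothetical and not from the paper, the surrogate and covariate are scalars for brevity, and missing values are represented by `None` rather than by multiplying with an indicator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FullUnit:
    """Hypothetical container for one draw of (P, X, S, W, Y)."""
    p: str    # sample indicator, "E" or "O"
    x: float  # pre-treatment covariate
    s: float  # surrogate (scalar here for brevity)
    w: int    # treatment, 0 or 1
    y: float  # primary outcome


@dataclass(frozen=True)
class ObservedUnit:
    """What the researcher actually sees for one unit."""
    p: str
    x: float
    s: float
    w: Optional[int]    # None when P = O (treatment not observed)
    y: Optional[float]  # None when P = E (primary outcome not observed)


def observe(u: FullUnit) -> ObservedUnit:
    """Apply the masking 1{P=E} W and 1{P=O} Y from Assumption 1."""
    return ObservedUnit(
        p=u.p,
        x=u.x,
        s=u.s,
        w=u.w if u.p == "E" else None,
        y=u.y if u.p == "O" else None,
    )


# One experimental and one observational unit with made-up values.
sample = [
    observe(FullUnit("E", 0.2, 1.0, 1, 3.0)),
    observe(FullUnit("O", 0.5, 0.0, 0, 1.0)),
]
```

The experimental unit keeps its treatment and loses its outcome, while the observational unit keeps its outcome and loses its treatment.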

We are interested in the ATE on the primary outcome in the population from which the experimental sample is drawn:

$$\tau\equiv\mathbb{E}[Y_i(1)-Y_i(0)|P_i=E]$$

Notice how we do not directly observe $Y_i$ in the experimental sample.

An implicit assumption in our setup is that the two variables that are common to both samples ($S_i$ and $X_i$) measure the same underlying variables in both samples.








# 3. The Critical Assumptions: Unconfoundedness, Surrogacy, and Comparability

Discuss the 3 key assumptions that together allow us to combine the observational and experimental samples and estimate the causal effect of the treatment on the primary outcome, exploiting the presence of the surrogates.

**Unconfoundedness** (Rosenbaum and Rubin, 1983):

which ensures that adjusting for pre-treatment variables leads to valid causal effects in the experimental sample.

**Surrogacy condition** (Prentice, 1989):

allows us to use the surrogate variables to proxy for the primary outcome

**Comparability**:

Formalizes the connection between the two samples.

## 3.1 Unconfoundedness

For the individuals in the experimental group, the propensity score is the conditional probability of receiving the treatment:

$$\rho (x)\equiv\Pr(W_i=1|X_i=x, P_i=E)$$

We assume that for individuals in the experimental group, treatment assignment is unconfounded.

**Assumption 2** (Unconfounded Treatment Assignment / Strong Ignorability)
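A standard way to write this assumption in the notation of Section 2 is sketched below. This is the usual Rosenbaum and Rubin (1983) form, extended here to include the surrogate potential outcomes; it is a reading aid under that assumption and should be checked against the paper's exact statement.

$$\text{(i)}\quad W_i\perp\!\!\!\perp \left(Y_i(0),Y_i(1),S_i(0),S_i(1)\right)\,|\,X_i,\ P_i=E$$

$$\text{(ii)}\quad 0<\rho (x)<1 \text{ for all } x\in\mathbb{X}$$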

## 3.2 Surrogacy

We introduce two concepts, the surrogate score (similar to the propensity score) and the surrogate index, to combine multiple surrogates.

### 3.2.1 The Prentice Criterion

Prentice (1989) defines a surrogate as a post-treatment variable such that, conditional on it, the outcome and the treatment are independent.

Imagine the causal path $W\rightarrow S\rightarrow Y$. Conditioning on $S$ blocks the relationship between $W$ and $Y$.

**Assumption 3**: (Surrogacy, Prentice Criterion)

(i) $W_i\perp\!\!\!\perp Y_i |S_i,X_i, P_i=E$

(ii) $0 < \rho (s,x) < 1$ for all $s\in\mathbb{S},x\in\mathbb{X}$ and $0< \Pr(P_i=E) < 1$

Note how if the quadruple $(Y_i, S_i, W_i, X_i)$ were observed for all units, surrogacy would be a testable condition. With $(S_i, W_i, X_i)$ observed for units in the experimental sample, and $(Y_i,S_i, X_i)$ observed for units in the observational sample, this assumption has no testable implications.

Surrogacy is often debated in empirical applications.

### 3.2.2 The Surrogate Index and the Surrogate Score

There are two scalar functions of the surrogates that play an important role in the analyses: the surrogate index and surrogate score.

**Definition 1**: (The Surrogate Index)

The surrogate index is the conditional expectation of the primary outcome given the surrogate outcomes and the pre-treatment variables, conditional on the sample:

$$\mu (s,x,p)\equiv\mathbb{E}[Y_i|S_i=s, X_i=x, P_i=p]$$

Note that the surrogate index in the observational sample $\mu (s,x,O)$ is identified because we observe the triple $(Y_i,S_i,X_i)$ in the observational sample.

**Definition 2**: (The Surrogate Score)

The surrogate score is the conditional probability of having received the treatment given the value for the surrogate outcomes and the covariates in the experimental sample:

$$\rho (s,x)\equiv\Pr(W_i=1|S_i=s,X_i=x,P_i=E)$$

The role that the surrogate score plays is similar to the role the propensity score plays in analyses under unconfoundedness. Here, if the surrogacy condition holds conditional on $(S_i,X_i)$, it also holds conditional on the surrogate score.

**Proposition 1:** (Surrogate Score) Suppose Surrogacy (Assumption 3) holds. Then:

$$W_i\perp\!\!\!\perp Y_i|\rho(S_i,X_i),P_i=E$$

### 3.2.3 The Benefits of Multiple Surrogates

Having multiple short-term variables can make a surrogacy approach more plausible.

## 3.3 Comparability

Surrogacy and Unconfoundedness by themselves are not sufficient for consistent estimation of $\tau$ because they do not place restrictions on how the relationship between $Y_i$ and $S_i$ in the observational sample compares to that in the experimental sample.

### 3.3.1 The Comparability Assumption

Let $\varphi\equiv\Pr (P_i=E)$ be the probability of a unit being part of the experimental sample.

We introduce the *Sampling Score*, the propensity to be in the experimental sample:

**Definition 3**: (Sampling Score)

The sampling score is $\varphi (s,x)\equiv\Pr (P_i=E|S_i=s, X_i=x)$

The third key assumption we make is that the conditional distribution of $Y_i$ given $(S_i, X_i)$ in the observational sample is the same as the conditional distribution of $Y_i$ given $(S_i,X_i)$ in the experimental sample, and that the support of $(S_i,X_i)$ in the experimental sample is a subset of that in the observational sample. Formally,

**Assumption 4**: (Comparability of Samples)

(i) $P_i\perp\!\!\!\perp Y_i|S_i,X_i$

(ii) $\varphi (s,x)< 1$ for all $s\in\mathbb{S}$ and $x\in\mathbb{X}$

This is a strong assumption, and section 5 discusses the biases arising from violations and improves the intuition for when this assumption may be of concern.

If the observational and experimental samples are substantially different in terms of the distribution of pre-treatment variables and surrogates, it would likely be more controversial to assume that conditional on those variables the outcome distributions are identical.
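The overlap condition in Assumption 4(ii) can be checked descriptively when $(S_i, X_i)$ is discrete. The sketch below is only an illustration: the function names `sampling_score` and `overlap_violations` are hypothetical, the data are made up, and the sampling score is estimated by the raw share of experimental units in each $(s,x)$ cell.

```python
from collections import Counter


def sampling_score(rows):
    """Estimate phi(s, x) = Pr(P = E | S = s, X = x) by cell shares.

    rows: iterable of (p, s, x) tuples with p in {"E", "O"}.
    Returns a dict mapping each observed (s, x) cell to its share of E units.
    """
    rows = list(rows)
    total = Counter((s, x) for _, s, x in rows)
    in_exp = Counter((s, x) for p, s, x in rows if p == "E")
    return {cell: in_exp[cell] / n for cell, n in total.items()}


def overlap_violations(scores):
    """Cells with estimated phi(s, x) = 1, i.e. no observational units."""
    return sorted(cell for cell, phi in scores.items() if phi >= 1.0)


# Made-up pooled sample: the cell (1, 1) appears only in the experiment.
rows = [
    ("E", 0, 0), ("O", 0, 0), ("O", 0, 0),
    ("E", 1, 0), ("O", 1, 0),
    ("E", 1, 1),
]
scores = sampling_score(rows)
bad_cells = overlap_violations(scores)
```

In this toy sample the cell $(s,x)=(1,1)$ has no observational units, so the surrogate index could not be estimated there without extrapolation.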

### 3.3.2 The Surrogate Index and the Sampling Score

We let $\mu (s,w,x,p)$ denote the conditional expectation of the primary outcome given pre-treatment variables, surrogates, treatment, and sample:

$$\mu (s,w,x,p)\equiv\mathbb{E}[Y_i|S_i=s, X_i=x, W_i=w, P_i=p]$$

Comparability and Surrogacy together allow us to impute the missing primary outcomes in the experimental sample, as shown by the following proposition.

**Proposition 2:** (Surrogate Index)

(i) Suppose Assumption 3 (Surrogacy) holds. Then,

$\mu (s,w,x,E)=\mu(s,x,E)$ for all $s\in\mathbb{S},x\in\mathbb{X}$, and $w\in\mathbb{W}$.

> What is this saying? In the experimental sample, once we condition on the surrogates (and pre-treatment variables), the average outcome is the same in every treatment arm. This makes sense given what the Surrogacy assumption does.

(ii) Suppose Assumption 4 (Comparability) holds. Then:

$\mu (s,x,E)=\mu (s,x,O)$ for all $s\in\mathbb{S}$ and $x\in\mathbb{X}$.

> What does this mean? Conditional on the surrogates (and pre-treatment covariates), if we are willing to assume that sample membership (experimental or observational) is independent of the primary outcome, then of course the average primary outcome is the same in both samples.

(iii) Suppose Assumptions 3 (Surrogacy) and 4 (Comparability) hold. Then:

$\mu (s,w,x,E)=\mu (s,x,O)$ for all $s\in\mathbb{S},x\in\mathbb{X}$, and $w\in\mathbb{W}$.

Because we can estimate $\mu (s,x,O)=\mathbb{E}[Y_i|S_i=s,X_i=x,P_i=O]$, we can impute the missing $Y_i$ in the experimental sample as $\mu (S_i,X_i,O)$.
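The imputation step can be sketched for the simplest case, where $(S_i, X_i)$ is discrete and $\mu (s,x,O)$ is estimated by the mean of $Y_i$ within each cell of the observational sample. The function names and the toy data are hypothetical; with continuous or high-dimensional surrogates one would fit a regression model instead.

```python
from collections import defaultdict


def fit_surrogate_index(obs_rows):
    """Estimate mu(s, x, O) by the mean of Y within each (s, x) cell.

    obs_rows: iterable of (s, x, y) tuples from the observational sample.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, x, y in obs_rows:
        sums[(s, x)] += y
        counts[(s, x)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


def impute_primary_outcome(exp_rows, index):
    """Attach mu(S_i, X_i, O) to each experimental unit (s, x, w).

    A KeyError means an experimental cell never appears in the
    observational sample, i.e. the support condition fails in the data.
    """
    return [(s, x, w, index[(s, x)]) for s, x, w in exp_rows]


# Made-up observational sample (s, x, y) and experimental sample (s, x, w).
obs = [(0, 0, 0.0), (0, 0, 1.0), (1, 0, 1.0), (1, 0, 1.0), (0, 1, 0.0), (1, 1, 2.0)]
exp = [(1, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0)]

index = fit_surrogate_index(obs)
imputed = impute_primary_outcome(exp, index)
```

Each experimental unit now carries an imputed primary outcome that depends only on its own surrogate and covariate values, not on its treatment status, which is what Proposition 2(iii) licenses.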

## 3.4 Surrogacy, Mediation, Instrumental Variables, Directed Acyclic Graphs, and Missing Data

### 3.4.1 Directed Acyclic Graph Representations

### 3.4.2 A Missing Data Representation








# 4. Identification and Semiparametric Efficiency Bounds

## 4.1 Three Identification Results

Presents the central identification result using 3 strategies:
1. Requires estimation of the surrogate index, but not the surrogate score
2. Requires estimation of the surrogate score, but not the surrogate index
3. Requires estimation of both, but has attractive double robustness properties

We define the following 4 objects, all functionals of distributions that are directly estimable from the data.

First, define the statistical estimand: the average difference in the surrogate index between treatment and control, adjusted for pre-treatment variables, in the experimental sample:

$$\tau^*\equiv \mathbb{E}\left[\left\{\mathbb{E}\left[\mathbb{E}[Y_i|S_i,X_i,P_i=O]\,|\,W_i=1,X_i,P_i=E\right]-\mathbb{E}\left[\mathbb{E}[Y_i|S_i,X_i,P_i=O]\,|\,W_i=0,X_i,P_i=E\right]\right\}\,\Big|\,P_i=E\right]$$

Okay, that's a lot to parse. Let's break it down.

First, we have this:

$\tau^*\equiv\mathbb{E}[\{?_1\}|P_i=E]$
> this indicates that we'll be getting the average of $?_1$ among all users in the experimental sample

$?_1=\mathbb{E}[?_2|W_i=1,X_i,P_i=E]-\mathbb{E}[?_2|W_i=0,X_i,P_i=E]$

> Here, we are taking the difference of $?_{2}$ among users who have treatment ($W_i=1$) and control ($W_i=0$), again among users in the experimental sample.

$?_2=\mathbb{E}[Y_i|S_i,X_i,P_i=O]$

> this is the average outcome among users in the observational sample, given the surrogates (and pre-treatment covariates). Note how this corresponds to the definition of the surrogate index for the observational sample from above ($\mu (s,x,O)$).

Now, let's move on to the "surrogate index representation" of $\tau$:

$$\tau^E\equiv\mathbb{E}\left[\mu(S_i,X_i,O)\cdot\frac{W_i}{\rho (X_i)}-\mu (S_i,X_i,O)\cdot\frac{1-W_i}{1-\rho (X_i)}|P_i=E\right]$$

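> Does this make sense? Remember, $\mu (S_i, X_i, O)$ is the surrogate index $\mu (s,x,O)=\mathbb{E}[Y_i|S_i=s,X_i=x,P_i=O]$ evaluated at unit $i$'s own $(S_i,X_i)$, so each term is an inverse-propensity-weighted average of imputed outcomes.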
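To see why this weighted form agrees with $\tau^*$, a short derivation using iterated expectations helps. It is a sketch that relies only on the definition of the propensity score $\rho (x)$ from Section 3.1 and on $0<\rho (x)<1$. For the treated term, conditioning on $X_i$ within the experimental sample gives:

$$\mathbb{E}\left[\mu (S_i,X_i,O)\cdot\frac{W_i}{\rho (X_i)}\,\Big|\,X_i,P_i=E\right]=\frac{\Pr (W_i=1|X_i,P_i=E)}{\rho (X_i)}\cdot\mathbb{E}\left[\mu (S_i,X_i,O)\,|\,W_i=1,X_i,P_i=E\right]=\mathbb{E}\left[\mu (S_i,X_i,O)\,|\,W_i=1,X_i,P_i=E\right]$$

The same argument with $1-W_i$ and $1-\rho (X_i)$ gives the control term:

$$\mathbb{E}\left[\mu (S_i,X_i,O)\cdot\frac{1-W_i}{1-\rho (X_i)}\,\Big|\,X_i,P_i=E\right]=\mathbb{E}\left[\mu (S_i,X_i,O)\,|\,W_i=0,X_i,P_i=E\right]$$

Taking the difference and averaging over $X_i$ given $P_i=E$ reproduces the definition of $\tau^*$, because $\mu (S_i,X_i,O)$ is exactly the inner expectation $\mathbb{E}[Y_i|S_i,X_i,P_i=O]$. Hence $\tau^E=\tau^*$ as a matter of algebra; the Surrogacy and Comparability assumptions are what link both to $\tau$.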

There's also a surrogate score representation and a third representation based on the influence function which we don't write down here.

What's important is that $\tau\equiv\mathbb{E}[Y_i(1)-Y_i(0)|P_i=E]=\tau^*=\tau^E=\tau^O=\tau^{O,E}$ under the above stated assumptions.

## 4.2 Semiparametric Efficiency Bounds








# 5. Violations of the Surrogacy and Comparability Assumptions: Biases and Bounds

The 3 critical assumptions of Unconfoundedness, Surrogacy, and Comparability are strong.

There is a large literature studying the sensitivity to unconfoundedness conditions.

Multiple studies have also raised concerns that in practice Surrogacy may not be satisfied, although we are not aware of formal sensitivity or bounds analyses.

Violations of Comparability have not been explored because this assumption has not been previously formalized.

In this section we examine the biases that arise from violations of Surrogacy and Comparability.

## 5.1 Biases



# 6. Estimation

Presents 4 estimators for the ATE. First, the surrogate index estimator, and then 3 new alternative estimators.

## 6.1 Surrogate Index

## 6.2 Surrogate Score Estimator

## 6.3 Influence Function Estimator

## 6.4 Double Matching Estimator





# 7. Application: Impacts of Job Training on Employment

In this section we apply our method to estimate the causal effect of the Greater Avenues to Independence (GAIN) job training program on long-term labor market outcomes.

Also, shows how one can validate the surrogacy assumption using intermediate outcomes and bound the degree of bias arising from potential violations of surrogacy.

## 7.1 The GAIN Program

## 7.2 Three Estimators

### 7.2.1 Surrogate Index Estimator

To construct the surrogate index, we estimate a linear regression model using least squares for the individuals in the *observational* sample

$$Y_i=\beta_0+\beta_S^\top S_i^t+\beta_X^\top X_i+\epsilon_i$$

The predicted value from this regression, which we denote by $\hat Y_i$, is our surrogate index for mean employment based on surrogates up to quarter $t$. We then compute this surrogate index for each of the individuals in the experimental sample and estimate the treatment effect based on the surrogate index as

$$\hat\tau^O=\frac{1}{N_{E,T}}\sum_{i=1}^{N_E}\hat Y_iW_i-\frac{1}{N_{E,C}}\sum_{i=1}^{N_E}\hat Y_i(1-W_i)$$

If we use all 36 quarters of employment indicators as surrogates, then the regression of $Y_i$ on the set of surrogates will fit perfectly, $\hat Y_i$ will be equal to $Y_i$, and the estimated effect will be identical to the original experimental estimate.

The question is whether using a much more limited set of surrogates will get us close to the experimental benchmark.
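The two-step procedure above can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's implementation: it uses a single scalar surrogate and no covariates $X_i$, the function names and toy data are hypothetical, and $N_{E,T}$ and $N_{E,C}$ are read as the numbers of treated and control units in the experimental sample.

```python
from statistics import mean


def fit_linear_index(s_obs, y_obs):
    """OLS of Y on a single surrogate S in the observational sample.

    Returns (intercept, slope). Simplification: one surrogate, no X.
    """
    s_bar, y_bar = mean(s_obs), mean(y_obs)
    cov = sum((s - s_bar) * (y - y_bar) for s, y in zip(s_obs, y_obs))
    var = sum((s - s_bar) ** 2 for s in s_obs)
    slope = cov / var
    return y_bar - slope * s_bar, slope


def surrogate_index_estimate(s_exp, w_exp, intercept, slope):
    """Difference in mean predicted outcome, treated minus control."""
    y_hat = [intercept + slope * s for s in s_exp]
    treated = [yh for yh, w in zip(y_hat, w_exp) if w == 1]
    control = [yh for yh, w in zip(y_hat, w_exp) if w == 0]
    return mean(treated) - mean(control)


# Made-up observational sample in which Y = 1 + 2 * S holds exactly.
s_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.0, 3.0, 5.0, 7.0]

# Made-up experimental sample: treatment shifts the surrogate up by 2 on average.
s_exp = [2.0, 3.0, 0.0, 1.0]
w_exp = [1, 1, 0, 0]

b0, b1 = fit_linear_index(s_obs, y_obs)
tau_hat = surrogate_index_estimate(s_exp, w_exp, b0, b1)
```

In this toy example the treatment raises the mean surrogate by 2 and the fitted slope is 2, so the estimated effect on the primary outcome is their product, 4. With a linear index and no covariates, the estimator reduces to the treatment effect on the surrogate scaled by the regression slope.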

### 7.2.2 Surrogate Score Estimator

### 7.2.3 Influence Function Estimator

## 7.3 Results

Here we discuss two sets of results. First, the estimates of the ATE of the intervention on the two primary outcomes under various assumptions about the surrogates. Second, we test the Surrogacy and Comparability assumptions directly.

### 7.3.1 Estimation Results








