# Topic 11: Interactive Regression Adjustment (IRA) to Improve Precision

This notebook explores the regression adjustment methods highlighted in [Chernozhukov et al.](https://causalml-book.org/assets/chapters/CausalML_book.pdf)'s book on Causal Inference, specifically Chapter 2: Causal Inference via Randomized Experiments.

Let's start with a quick history lesson, drawn from [Chiang et al., 2023](https://arxiv.org/abs/2302.00469) and a few other sources, and then outline what this notebook covers.

In the Randomized Controlled Trials (RCT) literature, it has become common to collect covariates, i.e. predetermined characteristics of the experimental subjects, and to use regression adjustment when estimating treatment effects, since adjustment can reduce the variability of the estimates. In the A/B testing literature this is often called CUPED, although my impression is that its use differs slightly between industry and academia: for example, CUPED is applied automatically to most A/B tests on an experimentation platform, and the variable being adjusted for includes the baseline level of the outcome of interest during the pre-treatment period.

Historically, there has been some pushback, including an influential paper by Freedman (2008) that discouraged regression adjustment in RCTs for various reasons. Those concerns have since been addressed by advances in the literature.

In this notebook, we compare the point estimates, and the precision of those estimates, for 3 estimators:
- Classical 2-Sample Approach (CL)
- Classical Linear Regression Adjustment (CRA)
- Interactive Regression Adjustment (IRA)

CL is the familiar `Y ~ T` approach. CRA has the form `Y ~ T + X` and can improve precision, but only under certain conditions. IRA, studied by [Lin, 2013](https://arxiv.org/pdf/1208.2301), has the form `Y ~ T + X + I(T*X)` with centered covariates; asymptotically, it is never less precise than CL or CRA.

We'll try these 3 estimators and compare them against each other for various kinds of data setups.
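
To make the three specifications concrete, here is a minimal NumPy-only sketch. The outcome model, the seed, and the helper name `ols_coefs` are illustrative assumptions, not part of the original notebook. It checks two identities: the CL coefficient is the difference in arm means, and the IRA coefficient on `T` (with `X` centered at its sample mean) equals the average gap between separate per-arm regression lines.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 2_000
X = rng.normal(0, 1, n)
T = rng.binomial(1, 0.5, n)
Y = 2 * T + X + 0.5 * T * X + rng.normal(0, 1, n)  # hypothetical outcome model


def ols_coefs(columns, y):
    """Least-squares coefficients for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


Xc = X - X.mean()  # center X so the coefficient on T in IRA targets the ATE

tau_cl = ols_coefs([T], Y)[1]               # Y ~ T
tau_cra = ols_coefs([T, Xc], Y)[1]          # Y ~ T + X
tau_ira = ols_coefs([T, Xc, T * Xc], Y)[1]  # Y ~ T + X + T*X

# CL is numerically the difference in arm means.
diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()

# IRA is numerically the same as fitting one line per arm and averaging
# the predicted treated-minus-control gap over the whole sample.
b1 = ols_coefs([X[T == 1]], Y[T == 1])
b0 = ols_coefs([X[T == 0]], Y[T == 0])
tau_arms = np.mean((b1[0] + b1[1] * X) - (b0[0] + b0[1] * X))

print(tau_cl, tau_cra, tau_ira)
```

The second identity is why centering matters: without it, the coefficient on `T` in the interacted model is the estimated effect at `X = 0` rather than at the average covariate value.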



# Data Generating Process: RCT without Heterogeneity

Let $T$ be our Binary Treatment Variable of Interest, and let $Y$ be our outcome of interest. Here, we assume the linear model:

$$E[Y|T,X] = \beta_1 T + \beta_2 X$$

Using the Potential Outcomes Notation, under the Consistency and Exchangeability Assumptions, we have that

$$E[Y(1)|X]=\beta_1 +\beta_2X$$
$$E[Y(0)|X]=\beta_2 X$$

Assuming our covariates $X$ are centered, i.e. $E[X]=0$, it follows that

$$E[Y(1)]=E[E[Y(1)|X]]=\beta_1$$
$$E[Y(0)]=E[E[Y(0)|X]]=0$$

Thus, $E[Y(1)-Y(0)]=\beta_1$.


In [2]:
import numpy as np
import pandas as pd

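# Generate Data
n = 10000
beta_1 = 2
# Potential outcomes: the treatment shifts the mean by beta_1.
Y_1 = beta_1 + np.random.normal(0, 1, n)
Y_0 = 0 + np.random.normal(0, 1, n)
T = np.random.binomial(1, 0.5, n)
# Note: X is drawn independently of the outcome, i.e. beta_2 = 0 in this simulation.
X = np.random.normal(0, 1, n)
# Observed outcome (consistency): Y(1) if treated, Y(0) otherwise.
Y = T * Y_1 + (1 - T) * Y_0
df = pd.DataFrame({'T': T, 'X': X, 'Y': Y})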

In [13]:
def collect_inference(models, model_names):
    """Tabulate the estimate, robust SE, and 95% CI of the T coefficient for each model."""
    estimates = []
    standard_errors = []
    confidence_intervals = []
    for model in models:
        # Index 1 is the coefficient on T (index 0 is the intercept).
        estimates.append(model.params[1])
        standard_errors.append(model.bse[1])
        confidence_intervals.append(np.round(model.conf_int()[1], 5))

    return pd.DataFrame({
        'Model': model_names,
        'Estimate': estimates,
        'Standard Error': standard_errors,
        '95% CI': confidence_intervals,
    })

In [14]:
import statsmodels.formula.api as smf
# Model Formulas
model_base_cl = 'Y ~ T'
model_base_cra = 'Y ~ T + X'
model_base_ira = 'Y ~ T + X + I(T*X)'

# Fit models
model_cl = smf.ols(model_base_cl, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_cra = smf.ols(model_base_cra, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_ira = smf.ols(model_base_ira, data=df).fit().get_robustcov_results(cov_type = 'HC1')

# Output inference
models = [model_cl, model_cra, model_ira]
model_names = ['Classic','Adjustment','IRA']
inference = collect_inference(models, model_names)
inference

| Model | Estimate | Standard Error | 95% CI |
|---|---|---|---|
| Classic | 1.985579 | 0.019953 | [1.94647, 2.02469] |
| Adjustment | 1.98542 | 0.019951 | [1.94631, 2.02453] |
| IRA | 1.985442 | 0.019954 | [1.94633, 2.02456] |


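The three estimators look pretty much the same. That is expected here: in the simulation above, $X$ is drawn independently of $Y$ (effectively $\beta_2 = 0$), so adjusting for it has no outcome variance to remove.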
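
To see a precision gain in this homogeneous setting, the covariate has to be related to the outcome. Below is a minimal sketch, assuming a hypothetical slope `beta_2 = 2` and an arbitrary seed (neither appears in the original simulation). It uses `fit(cov_type='HC1')`, which should give the same HC1 standard errors as the `get_robustcov_results` call above while keeping coefficient names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)  # arbitrary seed, for reproducibility only
n = 10_000
beta_1, beta_2 = 2, 2            # beta_2 = 2 is an assumed, illustrative value

X = rng.normal(0, 1, n)
T = rng.binomial(1, 0.5, n)
Y = beta_1 * T + beta_2 * X + rng.normal(0, 1, n)
df_signal = pd.DataFrame({'T': T, 'X': X, 'Y': Y})

formulas = {
    'Classic': 'Y ~ T',
    'Adjustment': 'Y ~ T + X',
    'IRA': 'Y ~ T + X + T:X',
}
# Robust (HC1) standard error of the coefficient on T for each specification.
robust_se = {
    name: smf.ols(formula, data=df_signal).fit(cov_type='HC1').bse['T']
    for name, formula in formulas.items()
}
print(pd.Series(robust_se))
```

With a covariate that explains a large share of the outcome variance, both adjusted estimators should report a clearly smaller standard error than the classical one, and CRA and IRA should be close to each other because there is no treatment-effect heterogeneity in this setup.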

# Data Generating Process: RCT with Heterogeneity

Let's assume the following linear model:

$$E[Y|T,X]=\beta_1T+\beta_2X+\beta_3TX+\beta_4X^2+\beta_5X^3$$
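
For reference, a short derivation of the target parameter under this model, assuming (as in the previous section) consistency, exchangeability, and centered covariates, $E[X]=0$:

$$E[Y(1)|X]=\beta_1+(\beta_2+\beta_3)X+\beta_4X^2+\beta_5X^3$$
$$E[Y(0)|X]=\beta_2X+\beta_4X^2+\beta_5X^3$$
$$E[Y(1)-Y(0)|X]=\beta_1+\beta_3X$$
$$E[Y(1)-Y(0)]=\beta_1+\beta_3E[X]=\beta_1$$

So the treatment effect varies with $X$ through $\beta_3$, but the average treatment effect is still $\beta_1$. The $X^2$ and $X^3$ terms cancel in the contrast and only affect how well a linear adjustment fits the outcome.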


In [33]:
import numpy as np
import pandas as pd

# Generate Data
n = 10000
beta_1 = 2
X = np.random.normal(0, 1, n)
T = np.random.binomial(1, 0.5, n)  # randomized assignment, independent of X
# Coefficients: beta_2 = 2, beta_3 = 1 (heterogeneity), beta_4 = beta_5 = 1 (nonlinearity)
Y = beta_1 * T + 2*X + T * X + X**2 + X**3 + np.random.normal(0, 1, n)
df = pd.DataFrame({'T': T, 'X': X, 'Y': Y})

In [34]:
import statsmodels.formula.api as smf
# Model Formulas
model_base_cl = 'Y ~ T'
model_base_cra = 'Y ~ T + X'
model_base_ira = 'Y ~ T + X + I(T*X)'

# Fit models
model_cl = smf.ols(model_base_cl, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_cra = smf.ols(model_base_cra, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_ira = smf.ols(model_base_ira, data=df).fit().get_robustcov_results(cov_type = 'HC1')

# Output inference
models = [model_cl, model_cra, model_ira]
model_names = ['Classic','Adjustment','IRA']
inference = collect_inference(models, model_names)
inference

| Model | Estimate | Standard Error | 95% CI |
|---|---|---|---|
| Classic | 2.05638 | 0.123259 | [1.81477, 2.29799] |
| Adjustment | 2.060681 | 0.059074 | [1.94488, 2.17648] |
| IRA | 2.061516 | 0.058124 | [1.94758, 2.17545] |


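Interestingly, regression adjustment helps immensely in this case, even without the interaction term: the standard error roughly halves for both CRA and IRA. One plausible reason is that the design is balanced ($P(T=1)=0.5$), a setting in which plain adjustment is known not to hurt asymptotic precision (see the discussion in Lin, 2013). Perhaps the text meant unbalanced designs or more severe deviations from the linear case.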
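
The conditions under which CRA can backfire are easier to see in a deliberately unbalanced design. The sketch below is an illustration under assumed parameters that are not in the original notebook: 10% of units are treated, and the slope on `X` is +5 under treatment and -5 under control. `X` is centered at its sample mean so that the IRA coefficient on `T` targets the ATE.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)  # arbitrary seed, for reproducibility only
n = 10_000
p_treat = 0.1                    # assumed: only 10% of units are treated

X = rng.normal(0, 1, n)
T = rng.binomial(1, p_treat, n)
# Assumed outcome model: opposite slopes on X in the two arms, true ATE = 2.
Y = 2 * T + np.where(T == 1, 5 * X, -5 * X) + rng.normal(0, 1, n)
df_unbal = pd.DataFrame({'T': T, 'X': X - X.mean(), 'Y': Y})

formulas = {
    'Classic': 'Y ~ T',
    'Adjustment': 'Y ~ T + X',
    'IRA': 'Y ~ T + X + T:X',
}
# Robust (HC1) standard error of the coefficient on T for each specification.
se_unbal = {
    name: smf.ols(formula, data=df_unbal).fit(cov_type='HC1').bse['T']
    for name, formula in formulas.items()
}
print(pd.Series(se_unbal))
```

In this setup the pooled slope that CRA fits is dominated by the large control arm, so it fits the small treated arm badly and the reported standard error for CRA should come out larger than for the unadjusted estimator. IRA fits each arm separately and should report the smallest standard error of the three.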

# Data Generating Process: Treatment Variable $T$ lacks Exchangeability

Crucially, the guarantees above only hold when $T$ is exchangeable, as in an RCT. Below, the probability of treatment depends on $X$, and we can see that without Exchangeability the coefficient $\hat\beta_1$ is not a valid estimate of the Treatment effect: the true average effect is still $\beta_1 = 2$, yet none of the three 95% intervals in the output contains it.



In [35]:
import numpy as np
import pandas as pd
from scipy.special import expit

# Generate Data
n = 10000
beta_1 = 2
X = np.random.normal(0, 1, n)
T = np.random.binomial(1, expit(X), n)  # treatment probability depends on X, so T is confounded
Y = beta_1 * T + 2*X + T * X + X**2 + X**3 + np.random.normal(0, 1, n)
df = pd.DataFrame({'T': T, 'X': X, 'Y': Y})

In [36]:
import statsmodels.formula.api as smf
# Model Formulas
model_base_cl = 'Y ~ T'
model_base_cra = 'Y ~ T + X'
model_base_ira = 'Y ~ T + X + I(T*X)'

# Fit models
model_cl = smf.ols(model_base_cl, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_cra = smf.ols(model_base_cra, data=df).fit().get_robustcov_results(cov_type = 'HC1')
model_ira = smf.ols(model_base_ira, data=df).fit().get_robustcov_results(cov_type = 'HC1')

# Output inference
models = [model_cl, model_cra, model_ira]
model_names = ['Classic','Adjustment','IRA']
inference = collect_inference(models, model_names)
inference

| Model | Estimate | Standard Error | 95% CI |
|---|---|---|---|
| Classic | 6.314115 | 0.120198 | [6.0785, 6.54973] |
| Adjustment | 1.727444 | 0.052183 | [1.62516, 1.82973] |
| IRA | 1.718913 | 0.05724 | [1.60671, 1.83111] |
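
One way to read these numbers: treatment here depends only on $X$, so the bias in the adjusted estimators comes from the omitted $X^2$ and $X^3$ terms, which no longer average out across arms once assignment is not randomized. The sketch below is an illustration rather than a general remedy. It regenerates the same data-generating process with an arbitrary seed and compares the interacted model with a hypothetical, correctly specified outcome model. The name `df_conf` and the seed are assumptions, and the logistic function is written out with NumPy in place of `scipy.special.expit`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)  # arbitrary seed, for reproducibility only
n = 10_000
beta_1 = 2

X = rng.normal(0, 1, n)
propensity = 1 / (1 + np.exp(-X))  # same function as scipy.special.expit(X)
T = rng.binomial(1, propensity, n)
Y = beta_1 * T + 2 * X + T * X + X**2 + X**3 + rng.normal(0, 1, n)
df_conf = pd.DataFrame({'T': T, 'X': X, 'Y': Y})

# Interacted model from the notebook: omits the nonlinear terms in X.
misspecified = smf.ols('Y ~ T + X + T:X', data=df_conf).fit(cov_type='HC1')
# Outcome model that matches the data-generating process.
full = smf.ols('Y ~ T + X + T:X + I(X**2) + I(X**3)', data=df_conf).fit(cov_type='HC1')

print('misspecified:', misspecified.params['T'])
print('full:        ', full.params['T'])
```

In the full model the coefficient on `T` is the effect at $X=0$, which coincides with the average effect $\beta_1$ only because $E[X]=0$ in this simulation. In real observational data the correct functional form is unknown, which is why the precision guarantees discussed earlier are tied to randomization.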
