# Week 9: Topics in Causal Inference and Models of Probability Distribution

# Reading List on Causal Inference

- [Stefan Wager's lecture notes on causal inference at Stanford](https://web.stanford.edu/~swager/stats361.pdf): A theoretical statistics framework for causal inference estimators.
- [Qingyuan Zhao's lecture notes at Cambridge](http://www.statslab.cam.ac.uk/~qz280/teaching/causal-2020/notes.pdf): Contains some additional topics such as causal mediation analysis (CMA).
- [An introductory paper to A/B Testing](https://www.researchgate.net/publication/316116834_Online_Controlled_Experiments_and_AB_Testing): Written by senior research scientists at Airbnb and Microsoft; touches on some industry interviews and common implementation considerations (e.g. choice of performance metric).
- [An example of using CMA to account for cannibalization](https://codeascraft.com/2020/02/24/the-causal-analysis-of-cannibalization-in-online-products/): Written by research scientists at Etsy and based on their paper submitted to KDD 2019. A good introduction to tackling cannibalization.
- [Causal inference methods in practice](https://towardsdatascience.com/causal-inference-thats-not-a-b-testing-theory-practical-guide-f3c824ac9ed2): A Medium post with a useful list of common causal inference methods in practice.
- [Selection bias in online experimentation](https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb): An Airbnb team member writes on selection bias in sequential experiments: at each stage, only the variants with a statistically significant effect are chosen, which creates an upward bias in the estimated effects (similar to the argument we saw against stepwise regression).

# Some Notes on Causal Inference

Before moving on to discuss A/B testing, we explore the theoretical framework for causal inference. Most of the content of this section is based on, and contains excerpts from, the [notes](https://web.stanford.edu/~swager/stats361.pdf) by [Stefan Wager](https://web.stanford.edu/~swager/) for his STATS 361 course at Stanford.

# Randomized Controlled Trials

Denote observations by $\{W_i, Y_i\}$ where $W_i \in \{0,1\}$ indicates control or treatment, and $Y_i$ is the outcome. The treatment effect for unit $i$ is given by $\Delta_i = Y_i(1)-Y_i(0)$. Note that only one of $\{Y_i(0), Y_i(1)\}$ can be observed, so we have to resort to group-level quantities such as the average treatment effect. We will cover various methods across data conditions.

One crucial assumption for causal inference is the <b>Stable Unit Treatment Value Assumption</b> (SUTVA). Formally, it states that (1) there is no interference or [spillover](https://en.wikipedia.org/wiki/Spillover_(experiment)) (units do not interfere with each other): treatment applied to one unit does not affect the outcome for another unit, and (2) potential outcomes are well-defined (that is, there is only a single version of each treatment level). So we can write $Y_i | \{W_j\}_j = Y_i(W_i)$. If either component of SUTVA is not satisfied, then the potential outcomes are not uniquely defined. Even worse, causal effects become hard to define at all, and estimates have limited credibility.

We first consider the ideal case: randomized controlled trials (RCTs). Formally, it is assumed that:

\begin{align*}
Y_i &= Y_i(W_i) & & \text{SUTVA}\\
\{Y_i(0), Y_i(1)\} &\text{ } {\perp\!\!\!\perp} W_i & & \text{random treatment assignment} 
\end{align*}

Our goal is to estimate the average treatment effect (ATE) $\tau \triangleq \mathbb{E}[Y_i(1)-Y_i(0)]$. In RCT, $\tau$ is identified entirely via randomization (or, by design of the experiment). We'll see later that regression adjustments may be used to decrease variance, but regression modeling plays no role in defining the average treatment effect in RCT. 

## Difference-In-Mean Estimator

A method-of-moments estimator for $\tau$ is the difference-in-mean estimator $\hat{\tau}_{DM}$:

\begin{align*}
\hat{\tau}_{DM} = \frac{1}{n_1}\sum_{i:W_i=1} Y_i - \frac{1}{n_0}\sum_{i:W_i=0} Y_i
\end{align*}

We can see that its asymptotic mean is $\tau$:

\begin{align*}
\mathbb{E}[\hat{\tau}_{DM}] &= \mathbb{E}\bigg[\frac{1}{n_1}\sum_{i:W_i=1} Y_i\bigg] - \mathbb{E}\bigg[\frac{1}{n_0}\sum_{i:W_i=0} Y_i\bigg]\ & & \\
&= \mathbb{E}[Y_i | W_i=1] - \mathbb{E}[Y_i | W_i=0] & &\text{(IID)} \\
&= \mathbb{E}[Y_i(1) | W_i=1] - \mathbb{E}[Y_i(0) | W_i=0] & &\text{(SUTVA)}\\ 
&= \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] & &\text{(random treatment assignment)}\\
&= \tau
\end{align*}

By the CLT, $\sqrt{n}(\hat{\tau}_{DM}-\tau) \rightarrow N(0, V_{DM})$. To derive the asymptotic variance $V_{DM}$, start from the finite-sample variance (given the group sizes $n_1$ and $n_0$):

\begin{align*}
\hat{V}_{DM} &= Var\bigg(\frac{1}{n_1}\sum_{i:W_i=1} Y_i - \frac{1}{n_0}\sum_{i:W_i=0} Y_i\bigg) \\
&= \frac{1}{n_1} Var\big(Y_i(1)\big) + \frac{1}{n_0} Var\big( Y_i(0)\big)
\end{align*}

Or equivalently:

\begin{align*}
n\hat{V}_{DM} &= \frac{n}{n_1} Var\big(Y_i(1)\big) + \frac{n}{n_0} Var\big( Y_i(0)\big) \\
&\rightarrow \frac{Var\big(Y_i(1)\big)}{\mathbb{P}(W_i=1)}  + \frac{Var\big(Y_i(0)\big)}{\mathbb{P}(W_i=0)} \\
&\triangleq V_{DM}
\end{align*}
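
As a minimal sketch of the estimator and its plug-in standard error, the snippet below simulates a hypothetical RCT (the sample size, the true effect of 2.0, and the noise level are all assumptions chosen for illustration) and computes $\hat{\tau}_{DM}$ together with a normal-approximation confidence interval.

```python
import math
import random
from statistics import mean, variance

def difference_in_means(w, y):
    """Return (tau_hat, standard_error) for binary treatment w and outcome y."""
    treated = [yi for wi, yi in zip(w, y) if wi == 1]
    control = [yi for wi, yi in zip(w, y) if wi == 0]
    tau_hat = mean(treated) - mean(control)
    # Plug-in version of Var(Y(1)) / n_1 + Var(Y(0)) / n_0
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return tau_hat, se

# Hypothetical simulated RCT: treatment assigned by a fair coin, true ATE = 2.0
rng = random.Random(0)
n = 4000
w = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
y = [1.0 + 2.0 * wi + rng.gauss(0.0, 1.0) for wi in w]

tau_hat, se = difference_in_means(w, y)
ci = (tau_hat - 1.96 * se, tau_hat + 1.96 * se)
print(f"tau_hat = {tau_hat:.3f}, se = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The sample variances stand in for $Var(Y_i(1))$ and $Var(Y_i(0))$, so the interval relies on the CLT approximation rather than on any model for the outcome.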


## OLS Adjustment

Suppose now that the potential outcomes follow the linear model

\begin{align*}
Y_i(w) &= c_w + X_i \beta_w + \varepsilon_{iw} \\
\mathbb{E}[\varepsilon_{iw} | X_i] &= 0 \\
Var(\varepsilon_{iw} | X_i) &= \sigma^2
\end{align*}

As the RCT assumptions are still valid, $\hat{\tau}_{DM}$ is still consistent. To derive its asymptotic variance, suppose $\mathbb{E}[X] = 0$ and $Var(X) = A$, and suppose a fraction $\rho = \frac{1}{2}$ of the units is assigned to treatment. Then,

\begin{align*}
V_{DM} &= \frac{Var(X_i \beta_0) + \sigma^2}{1-\rho} + \frac{Var(X_i \beta_1) + \sigma^2}{\rho} \\
&= \bigg(\frac{1}{1-\rho} + \frac{1}{\rho}\bigg) \sigma^2 + \frac{\Vert \beta_0 \Vert_A^2}{1-\rho} + \frac{\Vert \beta_1 \Vert_A^2}{\rho} \\
&= 4 \sigma^2 + 2 \Vert \beta_0 \Vert_A^2 + 2 \Vert \beta_1 \Vert_A^2 \\
&= 4 \sigma^2 + \Vert \beta_0 + \beta_1 \Vert_A^2 + \Vert \beta_0 - \beta_1 \Vert_A^2
\end{align*}

Here we write $\Vert v \Vert_A^2 \triangleq v'Av$. An OLS adjustment provides an efficiency improvement. Let the OLS estimator for $\tau$ be:

\begin{align*}
\hat{\tau}_{OLS} &= \hat{c}_1 - \hat{c}_0 + \bar{X} \big( \hat{\beta}_1 - \hat{\beta}_0 \big) \\
\end{align*}

Where $\hat{c}_w$ and $\hat{\beta}_w$ are OLS estimates with asymptotic distribution for $w = 0,1$:

\begin{align*}
\sqrt{n_w} \begin{pmatrix}
\begin{bmatrix}
\hat{c}_w\\
\hat{\beta}_w
\end{bmatrix} - 
\begin{bmatrix}
c_w\\
\beta_w
\end{bmatrix}
\end{pmatrix} \sim 
N\Bigg(0,
\sigma^2 \begin{pmatrix}
1 & 0\\
0 & A^{-1}
\end{pmatrix}\Bigg)
\end{align*}

So one can show that:

\begin{align*}
V_{OLS} = 4 \sigma^2 + \Vert \beta_0 - \beta_1 \Vert_A^2 \leq V_{DM}
\end{align*}
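
The variance comparison can be checked by simulation. The sketch below is illustrative only: it assumes a single covariate $X_i \sim N(0,1)$ (so $A = 1$), hypothetical coefficients $\beta_0 = 2$, $\beta_1 = 3$, $\sigma = 1$, and $\rho = \frac{1}{2}$, for which the formulas above give $V_{DM} = 30$ and $V_{OLS} = 5$. It fits a separate line in each arm and compares the sampling variance of the two estimators across repeated trials.

```python
import random
from statistics import mean, variance

def fit_line(x, y):
    """OLS fit of y = c + beta * x for a single covariate; returns (c, beta)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta

def simulate_once(rng, n=400, tau=1.0):
    """Simulate one RCT and return (tau_dm, tau_ols)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    # Assumed model: c_0 = 0, beta_0 = 2, c_1 = tau, beta_1 = 3, sigma = 1
    y = [(tau + 3.0 * xi if wi else 2.0 * xi) + rng.gauss(0.0, 1.0)
         for xi, wi in zip(x, w)]
    x1 = [xi for xi, wi in zip(x, w) if wi]
    y1 = [yi for yi, wi in zip(y, w) if wi]
    x0 = [xi for xi, wi in zip(x, w) if not wi]
    y0 = [yi for yi, wi in zip(y, w) if not wi]
    tau_dm = mean(y1) - mean(y0)
    c1, b1 = fit_line(x1, y1)
    c0, b0 = fit_line(x0, y0)
    # tau_OLS = c1_hat - c0_hat + X_bar * (beta1_hat - beta0_hat)
    tau_ols = c1 - c0 + mean(x) * (b1 - b0)
    return tau_dm, tau_ols

rng = random.Random(0)
draws = [simulate_once(rng) for _ in range(300)]
var_dm = variance([d[0] for d in draws])
var_ols = variance([d[1] for d in draws])
print(f"sampling variance: DM = {var_dm:.4f}, OLS-adjusted = {var_ols:.4f}")
```

Both estimators target the same $\tau$; the adjustment only removes the part of the outcome variation that the covariate explains.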

## OLS Adjustment without Linearity

While the result above is not very surprising, it is possible to prove a much stronger result for OLS in randomized trials: OLS is never worse than the difference-in-means estimator in terms of its asymptotic variance, and usually improves on it, even in misspecified models. Suppose the outcome $Y$ follows the model:

\begin{align*}
Y_i(w) &= \mu_w(X_i) + \varepsilon_{iw} \\
\mathbb{E}[\varepsilon_{iw}|X_i] &= 0 \\
Var(\varepsilon_{iw}|X_i) &= \sigma^2
\end{align*}

Note that, by the Huber-White analysis of OLS, the estimates $\hat{c}_w$ and $\hat{\beta}_w$ converge to the following least-squares-minimizing parameters:

\begin{align*}
(c^*_w, \beta^*_w) &= \arg \min_{c, \beta} \mathbb{E}[(Y_i(w) - c - \beta X_i)^2] 
\end{align*}

Assuming $\mathbb{E}[X] = 0$ we have, according to [Buja (2019)](https://arxiv.org/pdf/1404.1578.pdf):

\begin{align*}
\sqrt{n_w} \begin{pmatrix}
\begin{bmatrix}
\hat{c}_w\\
\hat{\beta}_w
\end{bmatrix} - 
\begin{bmatrix}
c^*_w\\
\beta^*_w
\end{bmatrix}
\end{pmatrix} &\sim 
N\Bigg(0,
\begin{pmatrix}
MSE^*_w & 0\\
0 & \cdots
\end{pmatrix}\Bigg)\\
c^*_w &= \mathbb{E}[Y_i(w)]\\
MSE^*_w &= \mathbb{E}[(Y_i(w) - c^*_w - \beta^*_w X_i)^2] 
\end{align*}

One can show that in this case $\hat{\tau}_{OLS}$ is still $\sqrt{n}$-consistent and asymptotically normal, $\sqrt{n}(\hat{\tau}_{OLS}-\tau) \rightarrow N(0, V_{OLS})$, with $V_{OLS} = V_{DM} - \Vert \beta^*_0 + \beta^*_1 \Vert_A^2$.

# Unconfoundedness and the Propensity Score

We now extend our analysis to settings in which treatment is not completely randomized but satisfies unconfoundedness once we control for a set of covariates. Qualitatively, unconfoundedness is relevant when we want to estimate the effect of a treatment that is not randomized, but is as good as random once we control for $X_i$. The assumption of unconfoundedness is also referred to as the assumption of no unobserved confounding variables. Formally, we have,

\begin{align*}
Y_i &= Y_i(W_i) & & \text{SUTVA}\\
\{Y_i(0), Y_i(1)\} &\text{ } {\perp\!\!\!\perp} W_i|X_i=x \text{ for all } x  & & \text{unconfoundedness} 
\end{align*}

## Aggregated Difference-in-means Estimator: A Motivating Example

The simplest way to move beyond one RCT is to consider two RCTs. As a concrete example, suppose that we are interested in giving teenagers cash incentives to discourage them from smoking. A random subset of ∼5% of teenagers in Palo Alto, CA, and a random subset of ∼20% of teenagers in Geneva, Switzerland are eligible for the study. Now suppose we want to find $\tau$ in the pooled dataset. Consider the naive difference in means:

\begin{align*}
\hat{\tau}_{DM} &= \frac{1}{n_1}\sum_{i: W_i=1} Y_i - \frac{1}{n_0}\sum_{i: W_i=0} Y_i \\
&= \frac{\sum_{i: W_i=1} Y^{PA}_i + \sum_{i: W_i=1} Y^{GVA}_i}{n^{PA}_1 + n^{GVA}_1} - \frac{\sum_{i: W_i=0} Y^{PA}_i + \sum_{i: W_i=0} Y^{GVA}_i}{n^{PA}_0 + n^{GVA}_0}
\end{align*}

This is an instance of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox). The pooled estimate is biased because it absorbs a confounding effect: Genevans are more likely to be treated, so the treated group over-represents Geneva relative to the control group. In essence, the random treatment assignment assumption is violated in the pooled data: when location is not controlled for, variation in location drives variation in both $W$ and $Y$, so $\{Y_i(0), Y_i(1)\} \text{ } {\perp\!\!\!\perp} W_i$ no longer holds. Note, however, that the Palo Alto and Geneva samples are each RCTs. Indexing the groups by $x$, an estimator that properly addresses this is the aggregated difference-in-means estimator:

\begin{align*}
\hat{\tau}_{AGG} &= \sum_x \frac{n_x}{n} \hat{\tau}(x) \triangleq \sum_x \hat{\pi}(x) \hat{\tau}(x) \\
\hat{\tau}(x) &= \frac{1}{n_{x1}}\sum_{i: X_i=x, W_i=1} Y_i - \frac{1}{n_{x0}}\sum_{i: X_i=x, W_i=0} Y_i\\
\end{align*}

Note that $\hat{\pi}(x) \rightarrow \pi(x)$ and $\hat{\tau}(x) \rightarrow \tau(x) = \mathbb{E}[Y_i(1)-Y_i(0)|X_i=x]$, so $\hat{\tau}_{AGG} \rightarrow \sum_x \pi(x)\tau(x) = \mathbb{E}\big[\tau(X_i)\big] = \tau$.
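
To make the contrast concrete, here is a small simulated sketch of the two-city setup. The treatment probabilities follow the example above; the site baselines, the outcome scale, and the common true effect of 1.0 are assumptions made up for illustration.

```python
import random
from statistics import mean

def diff_in_means(rows):
    """rows: list of (group, w, y) tuples; returns treated mean minus control mean."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return mean(treated) - mean(control)

def aggregated_diff_in_means(rows):
    """Weight each group's difference in means by its sample share n_x / n."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    n = len(rows)
    return sum(len(g) / n * diff_in_means(g) for g in groups.values())

rng = random.Random(1)
# Assumed design: site -> (treatment probability, baseline outcome); true effect is 1.0 in both
design = {"PA": (0.05, 0.0), "GVA": (0.20, 5.0)}
rows = []
for site, (p_treat, baseline) in design.items():
    for _ in range(10_000):
        w = 1 if rng.random() < p_treat else 0
        rows.append((site, w, baseline + 1.0 * w + rng.gauss(0.0, 1.0)))

tau_naive = diff_in_means(rows)
tau_agg = aggregated_diff_in_means(rows)
print(f"naive pooled estimate: {tau_naive:.2f}, aggregated estimate: {tau_agg:.2f}")
```

Because the site with the higher baseline is also treated more often, the pooled difference mixes the site effect into the treatment effect, while the aggregated estimator compares treated and control units only within a site.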

### Remark: Comparison with the Linear Model

It may be tempting to interpret unconfoundedness with the following linear model:

\begin{align*}
Y_i \sim \beta X_i + W_i \tau
\end{align*}

This approach is potentially acceptable if one knows the linear model to be well specified, or is willing to settle for a more heuristic analysis. However, one should note that the standard of rigor underlying such linear modeling is quite different from that of the methods discussed here. Inverse propensity weighting (IPW, discussed in the next section) is consistent under the substantively meaningful assumption of unconfoundedness, whereby treatment assignment emulates random assignment once we control for $X_i$. In contrast, the linear modeling approach depends entirely on correct specification of the equation above; under model misspecification, there's no reason to expect that its $\tau$ estimate will converge to anything that can be interpreted as a causal effect.


## The Propensity Score

We define the propensity score as $e(x) = \mathbb{P}(W_i=1|X_i=x)$. Recall that unconfoundedness requires the treatment assignment $W_i$ to be independent of the potential outcomes $Y_i(w)$ at each level of the controls $x$. If $x$ is continuous, controlling for $x$ at fine granularity is intractable. It turns out that under unconfoundedness it is sufficient to control for $e(X)$ rather than $X$ to remove biases associated with non-random treatment assignment. Check [this blogpost](http://blog.data-miners.com/2012/02/using-matched-pairs-to-test-for.html) by a Tripadvisor data scientist, in which customers with similar confounding variables are grouped together according to the Euclidean distance of $X$ between each pair to account for cannibalization. To see why controlling for $e(X)$ suffices, note that:

\begin{align*}
& \mathbb{P}\big(W_i=w|\{Y_i(0), Y_i(1)\}, e(X_i)\big)\\
=& \int_\mathcal{X} \mathbb{P}\big(W_i=w|\{Y_i(0), Y_i(1)\}, X_i=x\big) \mathbb{P}(X_i = x|e(X_i))dx\\
=& \int_\mathcal{X} \mathbb{P}\big(W_i=w|X_i=x\big) \mathbb{P}(X_i = x|e(X_i))dx\\
=& \begin{cases}
\mathbb{E}[1-e(X_i)|e(X_i)] & & w= 0\\
\mathbb{E}[e(X_i)|e(X_i)] & & w= 1\\
\end{cases}\\
=& e(X_i)I\{w=1\} + (1-e(X_i))I\{w=0\}
\end{align*}

Or simply in other words,

\begin{align*}
\{Y_i(0), Y_i(1)\} \text{ } {\perp\!\!\!\perp} W_i|X_i \Rightarrow \{Y_i(0), Y_i(1)\} \text{ } {\perp\!\!\!\perp} W_i|e(X_i)
\end{align*}

The implication is that if we can partition our observations into groups with (almost) constant values of the propensity score $e(x)$, then we can consistently estimate the average treatment effect via variants of $\hat{\tau}_{AGG}$. Note that the aggregated difference-in-means estimator $\hat{\tau}_{AGG}$ belongs to the class of "inverse propensity weighted" (IPW) estimators, with the propensity score estimated empirically as the proportion of treated units among those with $X_i = x$:

\begin{align*}
\hat{\tau}_{AGG} &= \frac{1}{n} \sum_{i=1}^n \bigg( \frac{W_iY_i}{\hat{e}(X_i)} - \frac{(1-W_i)Y_i}{1-\hat{e}(X_i)} \bigg)\\
\hat{e}(x) &= \frac{n_{x1}}{n_x}
\end{align*}

It is interesting to note that the "feasible" $\hat{\tau}_{AGG}$ (using empirical $\hat{e}$) is more efficient than the "oracle" IPW estimator (using true $e$). At a high level, the reason this phenomenon occurs is that the estimated propensity score corrects for local variability in the sampling distribution of the $W_i$ (i.e., it accounts for the number of units that were actually treated in each group). 
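
The equivalence between the aggregated estimator and IPW with an empirical propensity score can be verified numerically. The sketch below uses a tiny made-up dataset with two groups and takes $\hat{e}(x)$ to be the treated fraction within each group (the function names are illustrative, not from any library).

```python
from collections import defaultdict

def empirical_propensity(x, w):
    """e_hat(x): fraction of treated units among the units with X_i = x."""
    n_x, n_x1 = defaultdict(int), defaultdict(int)
    for xi, wi in zip(x, w):
        n_x[xi] += 1
        n_x1[xi] += wi
    return {g: n_x1[g] / n_x[g] for g in n_x}

def ipw_estimate(x, w, y, e_hat):
    """(1/n) * sum_i [ W_i Y_i / e_hat(X_i) - (1 - W_i) Y_i / (1 - e_hat(X_i)) ]."""
    total = sum(wi * yi / e_hat[xi] - (1 - wi) * yi / (1 - e_hat[xi])
                for xi, wi, yi in zip(x, w, y))
    return total / len(y)

def aggregated_estimate(x, w, y):
    """sum_x (n_x / n) * (treated mean - control mean within group x)."""
    result = 0.0
    for g in set(x):
        treated = [yi for xi, wi, yi in zip(x, w, y) if xi == g and wi == 1]
        control = [yi for xi, wi, yi in zip(x, w, y) if xi == g and wi == 0]
        share = (len(treated) + len(control)) / len(y)
        result += share * (sum(treated) / len(treated) - sum(control) / len(control))
    return result

# Toy data (made up): group "a" has 1 of 4 units treated, group "b" has 3 of 6
x = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
w = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y = [3.0, 1.0, 2.0, 1.5, 6.0, 7.0, 8.0, 4.0, 5.0, 3.0]

e_hat = empirical_propensity(x, w)
tau_ipw = ipw_estimate(x, w, y, e_hat)
tau_agg = aggregated_estimate(x, w, y)
print(tau_ipw, tau_agg)
```

Both numbers agree up to floating-point error because dividing by the within-group treated fraction turns each group's weighted sum into $\frac{n_x}{n}$ times a within-group mean. Note that every group needs at least one treated and one control unit, which is the finite-sample face of the overlap condition.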

In practice, when constructing feasible versions of $\hat{\tau}_{IPW}$, we may want to use cross-fitting to avoid bias from regularization and overfitting. Note that aside from consistency of $\hat{e}$, we also require the true propensity score to satisfy __overlap__: there exists some $\eta > 0$ such that $\eta < e(x) < 1-\eta$ for all $x$. Intuitively, overlap means that randomization in fact occurred (e.g., one can’t learn anything from a randomized trial where everyone is assigned to control). Mathematically, keeping $e(x)$ strictly away from 0 and 1 prevents the estimator from being undefined.

# Heterogeneous Treatment Effect and Double Machine Learning

We further extend our framework to the case in which the treatment effect depends on individual characteristics $X_i$. In this case, we consider how treatment effects vary with the observed covariates, i.e. the conditional average treatment effect (CATE): $\tau(x) = \mathbb{E}[Y_i(1)-Y_i(0)|X_i=x]$. Note that despite its dependence on $X_i$, the CATE is not the individual effect (i.e. $\Delta_i = Y_i(1)-Y_i(0)$) but an _average_ over a more targeted group of samples as characterized by their covariates $X_i$. Suppose treatment assignment is unconfounded, $W_i \text{ } {\perp\!\!\!\perp} \{Y_i(0), Y_i(1)\}|X_i$. We can then express the CATE as:

\begin{align*}
\tau(x) &= \mu_1(x) - \mu_0(x)\\
\mu_w(x) &= \mathbb{E}[Y_i | X_i=x, W_i=w]\\
\end{align*}

## Caveats for Using Non-parametric Estimators

A natural approach would be to form the feasible estimator $\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$ from some non-parametric estimates $\hat{\mu}_1$, $\hat{\mu}_0$ of $\mu_1(x)$ and $\mu_0(x)$. However, we should note two caveats for using non-parametric methods on finite samples. The first is __regularization bias__. In essence, if there are many more control than treated units (or vice versa) and we use generic non-parametric methods, then the two regression surfaces $\hat{\mu}_1(\cdot)$ and $\hat{\mu}_0(\cdot)$ may be differently regularized, thus creating artifacts in the learned CATE estimate $\hat{\tau}$. The following figure, reproduced from [Künzel, Sekhon, Bickel, and Yu (2019)](https://www.pnas.org/content/116/10/4156), illustrates this point. Both $\mu_1(x)$ and $\mu_0(x)$ vary with $x$ but the CATE function is constant. There are many controls so $\hat{\mu}_0(\cdot)$ is well estimated, but there are very few treated units and so $\hat{\mu}_1(\cdot)$ is heavily regularized and approximated as a linear function. Both estimates $\hat{\mu}_1(\cdot)$ and $\hat{\mu}_0(\cdot)$ are reasonable on their own; however, once we take their difference, we find strong heterogeneity in $\hat{\tau}(x)$ where there is none in $\tau(x)$.

<img src="https://github.com/kpjwong/ISYE-6501/blob/main/images/Reg_Bias.PNG?raw=true" width="70%">

Another caveat, which also stems from imbalanced samples, arises from __variation in the propensity score__. If $e(x)$ varies considerably, then our estimates of $\hat{\mu}_0(\cdot)$ will be driven by data in areas with many control units (i.e., with $e(x)$ closer to 0), and those of $\hat{\mu}_1(\cdot)$ by regions with more treated units (i.e., with $e(x)$ closer to 1). If there is a covariate shift (changes in distribution of $X$, $F(X)$) between the data used to learn $\hat{\mu}_0(\cdot)$ and $\hat{\mu}_1(\cdot)$, this may create biases for their difference, $\hat{\tau}(x)$.

## Double Machine Learning for Semi-parametric Modeling

Now suppose the treatment effect has a semi-parametric form: $\tau(x) = \psi(x) \cdot \beta$ with $\psi(x), \beta \in \mathbb{R}^k$, so $\tau$ is a linear combination of $k$ basis functions (which need not be parametric), with the combination parameterized by $\beta$. Further, treatment assignment is unconfounded upon controlling for the covariates $X_i$, and the outcome follows the semi-parametric model:

\begin{align*}
Y_i(w) &= f(X_i) + w \tau(X_i) + \varepsilon_{iw} & & \psi(X_i), \beta \in \mathbb{R}^k \\
Y_i &= f(X_i) + W_i \tau(X_i) + \varepsilon_{i} & & \text{observation} 
\end{align*}

It would be difficult to estimate the nuisance function $f(X_i)$, the baseline outcome in the absence of treatment, directly. Instead, we consider the conditional mean outcome, marginalized over $W_i$ using the propensity score $e$:

\begin{align*}
m(x) &= \mathbb{E}[Y_i | X_i = x] = f(x) + e(x)\tau(x) \\
e(x) &= \mathbb{P}(W_i = 1 | X_i = x)
\end{align*}

Subtracting $m(X_i)$ from the observation equation, we obtain the following regression model introduced in [Robinson (1988)](https://www.jstor.org/stable/1912705?seq=1):

\begin{align*}
Y_i - m(X_i) &= \big( W_i - e(X_i) \big) \tau(X_i) + \varepsilon_i \\
&= \big( W_i - e(X_i) \big) \psi(X_i) \cdot \beta + \varepsilon_i
\end{align*}

This suggests an OLS estimator for $\beta$, obtained by regressing $Y_i - m(X_i)$ on $\big( W_i - e(X_i) \big) \psi(X_i)$, which is $\sqrt{n}$-consistent. Since the true $m$ and $e$ are not known, this estimator is by nature an "oracle". A more general form of the Robinson estimator (without assuming that $\tau$ is linear in the basis functions $\psi$) can be expressed as a loss minimization:

\begin{align*}
\arg \min_{\tau'} \sum_i \bigg\{ \big(Y_i-m(X_i)\big) - \big( W_i-e(X_i) \big) \tau'(X_i)\bigg\}^2
\end{align*}

This can be derived from $\mathbb{E}[\varepsilon_i(W_i)|X_i, W_i]=0$, which holds under unconfoundedness in the semi-parametric model. [Chernozhukov et al. (2016)](https://arxiv.org/pdf/1608.00060.pdf) decompose the scaled error of the "feasible" estimator $\hat{\tau}$, which uses non-parametric estimates $\hat{m}$ and $\hat{e}$, as:

\begin{align*}
\sqrt{n}(\hat{\tau}-\tau) &= a^* + b^* + c^* & &\\
a^* &\rightarrow \mathcal{N}(0,\Sigma) & &\\
b^* &\leq \sqrt{n} n^{-(\varphi_{\hat{e}} + \varphi_{\hat{m}})} & & \varphi_{\hat{e}}, \varphi_{\hat{m}} \text{ are the rates of convergence of }\hat{e}, \hat{m} \\
c^* &= o_{\mathcal{P}}(1) & &\text{with cross-fitting} 
\end{align*}

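So for $\sqrt{n}$-consistency we need "cross-fitting" (to control $c^*$), on top of $\hat{e}$ and $\hat{m}$ converging fast enough that $\varphi_{\hat{e}} + \varphi_{\hat{m}} > \frac{1}{2}$ (so that $b^* \rightarrow 0$). Here cross-fitting refers to the following process: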

1. Partition the data into $K$ parts, denoted by $\{I_k\}_{k=1}^K$.
2. For each $k$, denote the complement of $I_k$ as $I_k^c$. Estimate $\hat{e}_k, \hat{m}_k$ with data in $I_k^c$.
3. Estimate $\hat{\tau}_k$ with data in $I_k$.
4. Get the cross-fitting estimate $\hat{\tau} = \frac{1}{K}\sum_{k=1}^K \hat{\tau}_k$.

Refer to [Chernozhukov et al. (2016)](https://arxiv.org/pdf/1608.00060.pdf) for details on the role of cross-fitting in de-biasing the empirical estimates. Check [this blogpost](http://aeturrell.com/2018/02/10/econometrics-in-python-partI-ML/) for an implementation example. At a high level, cross-fitting uses out-of-fold estimation to avoid bias due to overfitting, the same reason we resort to cross-validation for model selection. To validate heterogeneity of treatment effects, one can compare estimators that allow for heterogeneity against subgroup ATE estimators.
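
The four cross-fitting steps can be sketched in a few lines for the simplest case of a constant effect ($k=1$, $\psi(x) = 1$). Everything specific here is an assumption for illustration: the data-generating process, the true effect of 1.5, and the use of a crude histogram regression (`binned_mean`, a hypothetical helper) as a stand-in for whatever machine learning method one would actually use to estimate $\hat{m}$ and $\hat{e}$.

```python
import math
import random
from statistics import mean

def binned_mean(xs, vals, n_bins=20):
    """Histogram regression on [0, 1): returns a function x -> mean of vals in x's bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, v in zip(xs, vals):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += v
        counts[b] += 1
    overall = mean(vals)  # fallback for empty bins
    table = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: table[min(int(x * n_bins), n_bins - 1)]

def cross_fit_tau(x, w, y, n_folds=5):
    """Cross-fitted residual-on-residual estimate of a constant treatment effect."""
    n = len(y)
    fold_estimates = []
    for k in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == k]    # I_k
        train_idx = [i for i in range(n) if i % n_folds != k]   # complement of I_k
        # Step 2: nuisance estimates from the complement only
        m_hat = binned_mean([x[i] for i in train_idx], [y[i] for i in train_idx])
        e_hat = binned_mean([x[i] for i in train_idx], [w[i] for i in train_idx])
        # Step 3: regress Y - m_hat(X) on W - e_hat(X) using fold k only
        num = den = 0.0
        for i in test_idx:
            w_res = w[i] - e_hat(x[i])
            y_res = y[i] - m_hat(x[i])
            num += w_res * y_res
            den += w_res ** 2
        fold_estimates.append(num / den)
    # Step 4: average the fold-level estimates
    return mean(fold_estimates)

# Assumed data-generating process: confounded assignment, constant effect tau = 1.5
rng = random.Random(2)
n = 5000
x = [rng.random() for _ in range(n)]
w = [1 if rng.random() < 0.2 + 0.6 * xi else 0 for xi in x]
y = [math.sin(2 * math.pi * xi) + 1.5 * wi + rng.gauss(0.0, 1.0) for xi, wi in zip(x, w)]

tau_hat = cross_fit_tau(x, w, y)
print(f"cross-fitted estimate of tau: {tau_hat:.3f}")
```

The point of the fold structure is that the residuals for unit $i$ are always computed from nuisance estimates that never saw unit $i$.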

# Regression Discontinuity Design (RDD)

In applied work, there are several other __quasi-experimental__ designs that have repeatedly proven themselves in practice. One simple yet versatile approach of this type is the regression discontinuity design, which relies on __discontinuous treatment__ assignment mechanisms to identify causal effects. In this setup, we assume that there is a running variable $Z$ and a cutoff $c$ such that $W_i = I\{Z_i \geq c\}$. Examples of such running variables include a time or location indicator marking when or where a regime change took place, or a test score with a passing threshold. In these cases, treatment assignment is by definition unconfounded: $W_i \text{ } {\perp\!\!\!\perp} \{Y_i(0), Y_i(1)\}|Z_i$, as $W_i$ is a deterministic function of $Z_i$. However, the overlap requirement ($\exists \eta > 0$ such that $e(z)\in (\eta, 1-\eta)$) fails, since $e(z) = I\{z \geq c\} \in \{0, 1\}$, so we cannot apply the estimators developed under unconfoundedness in the previous sections.

Instead we make use of regression discontinuity designs. As we will see, identification of the treatment effect under these designs can stem either from (bounded) smoothness of the response functions $\mu_w(z)$, or from randomness of a latent variable that induces noise in $Z_i$.

## Identification via Continuity

Note that the target parameter is:

\begin{align*}
\tau_c = \lim_{z \searrow c} \mathbb{E}[Y_i|Z_i=z] - \lim_{z \nearrow c} \mathbb{E}[Y_i|Z_i=z]
\end{align*}

Denote $\mu_w(z) = \mathbb{E}[Y_i(w)|W_i=w, Z_i=z]$. We assume that $\mu_w(z)$ are smooth in the sense that:

\begin{align*}
\bigg| \frac{\partial^2 \mu_w(z)}{\partial z^2} \bigg| < B
\end{align*}

A Taylor expansion of $\mu_w$ around $c$, with a second-order remainder $\rho_w$, gives:

\begin{align*}
\mu_w(z) &= a_w + \beta_w (z-c) + \rho_w(z-c) \\
|\rho_w(x)| &\leq Bx^2
\end{align*}

This motivates the __local linear regression estimator__ for $\mu_w$ with a weighting (kernel) function $K$ and a bandwidth $h_n$, which depends on the sample size $n$ and ideally satisfies $h_n \rightarrow 0$ as $n$ grows. For $w=1$ (for $w=0$, the sum runs over $Z_i < c$ instead):

\begin{align*}
\hat{a}_w, \hat{\beta}_w = \arg \min_{a, \beta} \bigg\{ \sum_{i:Z_i \geq c} K\Big(\frac{|Z_i-c|}{h_n}\Big) \big( Y_i - a - \beta (Z_i-c) \big)^2 \bigg\}
\end{align*}

Common choices for the kernel $K$ include the window function $K(x) = I\{|x| \leq 1\}$ and the triangular kernel $K(x) = (1-|x|)_+$. We consider the former (window function) in the consistency analysis. WLOG, consider $w=1$. The closed form for $\hat{a}_1$ can be expressed as:

\begin{align*}
\hat{a}_1 = \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i Y_i
\end{align*}

where the weights $\{\gamma_i\}$ can be shown to satisfy (1) $\sum_{i: c\leq Z_i \leq c+h_n} \gamma_i = 1$ and (2) $\sum_{i: c\leq Z_i \leq c+h_n} \gamma_i (Z_i-c) = 0$. So we can write:

\begin{align*}
\hat{a}_1 &= \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i Y_i \\
&= \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i \big(\mu_1(Z_i) + Y_i - \mu_1(Z_i)\big)\\
&= \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i \big( a_1 + \beta_1 (Z_i-c) + \rho_1(Z_i-c) + Y_i - \mu_1(Z_i) \big)\\
&= \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i a_1 + \beta_1 \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i (Z_i-c) + \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i\rho_1(Z_i-c) +  \sum_{i: c\leq Z_i \leq c+h_n} \gamma_i \big(Y_i - \mu_1(Z_i)\big)\\
&= a_1 + \underbrace{\sum_{i: c\leq Z_i \leq c+h_n}  \gamma_i\rho_1(Z_i-c)}_\text{curvature bias}  +  \underbrace{\sum_{i: c\leq Z_i \leq c+h_n} \gamma_i \big(Y_i - \mu_1(Z_i)\big)}_\text{sampling noise} 
\end{align*}

Ultimately, we would like to estimate $\hat{\tau}_c = \hat{a}_1-\hat{a}_0$. The curvature bias is the error from using a linear function to approximate an arbitrary smooth function (smooth in the sense of the bounded second derivative assumed above). The bias-variance tradeoff comes down to tuning $h_n$: with a smaller bandwidth $h_n$, the interval used for the linear approximation narrows and the curvature bias decreases, while the sampling noise increases as fewer observations are included. To find the optimal bandwidth, note that the remainder is bounded within the window: $|\rho_1(Z_i-c)| \leq B(Z_i-c)^2 \leq Bh_n^2$, so the squared bias scales as $h_n^4$. On the other hand, the sampling variance can be shown to scale as $\frac{1}{nh_n}$. Equating the two orders, $h_n^4 \sim \frac{1}{nh_n}$, gives the bandwidth $h_n \sim n^{-\frac{1}{5}}$ that balances the bias-variance tradeoff. In the end,

\begin{align*}
\hat{a}_1-\hat{a}_0 = \hat{\tau}_c = \tau_c + \mathcal{O}_P(n^{-\frac{2}{5}})
\end{align*}

Note that this rate of convergence is slower than the parametric rate $n^{-\frac{1}{2}}$. In general, if $\mu_w$ is smooth in the sense that the $k$-th order derivative is bounded, then the $(k-1)$ degree local polynomial regression estimator can be shown to achieve a convergence rate of $n^{-\frac{k}{2k+1}}$ using a bandwidth $h_n \sim n^{-\frac{1}{2k+1}}$.
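
A minimal sketch of the local linear estimator with the window kernel follows. The simulated data (uniform running variable, quadratic baseline, a jump of 2.0 at $c = 0$) and the bandwidth constant of 1 in $h_n = n^{-1/5}$ are assumptions for illustration; in practice the constant would be tuned.

```python
import random
from statistics import mean

def boundary_intercept(points):
    """OLS intercept of y on d = z - c, i.e. the fitted value at the cutoff."""
    ds = [d for d, _ in points]
    ys = [y for _, y in points]
    d_bar, y_bar = mean(ds), mean(ys)
    beta = (sum((d - d_bar) * (y - y_bar) for d, y in points)
            / sum((d - d_bar) ** 2 for d in ds))
    return y_bar - beta * d_bar

def rdd_local_linear(z, y, c, h):
    """Window-kernel local linear estimate of the jump at cutoff c with bandwidth h."""
    right = [(zi - c, yi) for zi, yi in zip(z, y) if 0 <= zi - c <= h]   # treated side
    left = [(zi - c, yi) for zi, yi in zip(z, y) if -h <= zi - c < 0]    # control side
    return boundary_intercept(right) - boundary_intercept(left)

rng = random.Random(3)
n, c = 4000, 0.0
z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Assumed smooth baseline 0.5 z + z^2 plus a jump of 2.0 at the cutoff
y = [0.5 * zi + zi ** 2 + 2.0 * (zi >= c) + rng.gauss(0.0, 0.5) for zi in z]

h = n ** (-1 / 5)  # bandwidth of order n^(-1/5)
tau_hat = rdd_local_linear(z, y, c, h)
print(f"bandwidth = {h:.3f}, estimated jump at cutoff = {tau_hat:.3f}")
```

With the window kernel, the weighted least squares problem reduces to ordinary least squares on the observations within distance $h_n$ of the cutoff on each side, which is why no explicit weights appear.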

## Identification via Noisy Running Variable

Despite its simplicity and interpretability, the continuity-based approach to regression discontinuity inference above does not satisfy the criteria for rigorous design-based causal inference as outlined by [Rubin (2008)](https://arxiv.org/pdf/0811.1640.pdf). According to the design-based paradigm, even in observational studies, a treatment effect estimator should be justifiable based on randomness in the treatment assignment mechanism alone. In contrast, the continuity-based regression discontinuity analysis is based on the smoothness assumption on $\mu_w(z)$.

An alternative justification for identification in regression discontinuity designs starts with a form of implicit randomization in the running variable $Z_i$: there are many factors outside of the control of decision-makers that determine the running variable $Z_i$, such that if some unit barely clears the eligibility cutoff for the intervention then the same unit could also plausibly have failed to clear the cutoff with a different realization of these chance factors ([Lee and Lemieux (2010)](https://www.princeton.edu/~davidlee/wp/RDDEconomics.pdf)). For example, in an educational setting where a test is used to determine eligibility for an honors program, there may be a group of marginal students who might barely pass or fail a test due to unpredictable variation in their test score, thus resulting in an effectively exogenous treatment assignment rule.

And, if the running variable is in fact noisy, we can build an identification argument on top of it. Here we suppose that (1) there is a latent variable $U_i$ with distribution $G$ such that $Z_i|U_i \sim \mathcal{N}(U_i, \nu^2)$ for some $\nu > 0$, and (2) the noise in $Z_i$ is unconfounded: $\{Y_i(0), Y_i(1)\} \text{ } {\perp\!\!\!\perp} \text{  }  Z_i | U_i$. We have the following propensity score:

\begin{align*}
e(u) &= \mathbb{P}(W=1|U=u)\\
&= \mathbb{P}(Z \geq c|U=u)\\
&= 1-\Phi\Big(\frac{c-u}{\nu}\Big)
\end{align*}

So the treatment assignment is unconfounded given $U_i$: $\{Y_i(0), Y_i(1)\} \text{ } {\perp\!\!\!\perp} \text{  }  W_i | U_i$. If $U_i$ were observable (by an "oracle"), we could apply IPW-type estimators (with cross-fitting where necessary) to estimate $\tau$. The "feasible" version involves a deconvolution step specified in [Eckles, Ignatiadis, Wager and Wu (2020)](https://arxiv.org/pdf/2004.09458.pdf), as follows. Define a pair of weighting functions $\gamma_+(Z_i)$ and $\gamma_-(Z_i)$, applied to the right and left side of the cutoff $c$ respectively, satisfying $\mathbb{E}[I\{Z_i \geq c\} \gamma_+(Z_i)] = \mathbb{E}[I\{Z_i < c\} \gamma_-(Z_i)] = 1$. Consider the estimator:

\begin{align*}
\hat{\tau}_\gamma = \frac{1}{n}\sum_{i:Z_i\geq c} \gamma_+(Z_i) Y_i - \frac{1}{n}\sum_{i:Z_i < c} \gamma_-(Z_i)Y_i
\end{align*}

WLOG, consider the treated side ($W_i=1$, i.e. $Z_i \geq c$). We first derive the conditional expectation of its summand given $U_i$:

\begin{align*}
&\mathbb{E}[\gamma_+(Z_i) Y_i I\{Z_i \geq c\}|U_i] & &\\
=&\mathbb{E}[\gamma_+(Z_i) Y_i(1) I\{Z_i \geq c\}|U_i] & &\\
=&\mathbb{E}[Y_i(1)|U_i] \mathbb{E}[\gamma_+(Z_i) I\{Z_i \geq c\}|U_i] & & \because \text{unconfoundedness} \\
=&\mu_1(U_i) \int_c^\infty \gamma_+(z)\phi(z|U_i)dz & &\\
\triangleq&\mu_1(U_i) h_+(U_i)
\end{align*}

Therefore taking expectation over the support of $U_i$:

\begin{align*}
&\mathbb{E}_U\big[\mathbb{E}_Z[\gamma_+(Z_i) Y_i I\{Z_i \geq c\}|U_i]\big]\\
=&\mathbb{E}_U\big[ \mu_1(U_i) h_+(U_i) \big]\\
=&\int \mu_1(u)h_+(u) dG(u)
\end{align*}

Ultimately, the expectation of the estimator $\hat{\tau}_\gamma$ is given by:

\begin{align*}
&\int \mu_1(u)h_+(u)dG(u) - \int \mu_0(u)h_-(u)dG(u)\\
=&\int \big(\tau(u)+\mu_0(u)\big)h_+(u)dG(u) - \int \mu_0(u)h_-(u)dG(u)\\
=&\underbrace{\int h_+(u)\tau(u)dG(u)}_\text{weighted treatment effect} + \underbrace{\int \big(h_+(u)-h_-(u)\big) \mu_0(u)dG(u)}_\text{confounding bias}   
\end{align*}

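To understand why the second term is a confounding bias, note that $h_\pm(u)$ is the expected weight that a unit with latent value $u$ receives on the treated ($+$) or control ($-$) side of the cutoff, e.g. $h_+(u) = \mathbb{E}[\gamma_+(Z_i) I\{Z_i \geq c\}|U_i=u]$. Since both $h_+$ and $h_-$ integrate to 1 under $G$, the second term is the covariance between the imbalance in expected weights, $h_+(U_i)-h_-(U_i)$, and the baseline outcome $\mu_0(U_i)$. It vanishes when $h_+(u) = h_-(u)$, i.e. when the weights balance the distribution of $U$ across the two sides of the cutoff.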

# Statistical Tests under the A/B Testing Framework

It is useful to review the hypothesis testing framework involved in A/B testing. [This Medium post](https://towardsdatascience.com/the-art-of-a-b-testing-5a10c9bb70a4), written by a data scientist at Sephora, provides a detailed derivation of some tests with business examples on UX experiments. For further reading, check [here](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing) for a list of common hypothesis testing tools in two-sample setups.

## Example 1: t- and Z-tests on Average Time Spent

Let $\hat{\mu}_A$ and $\hat{\mu}_B$ be the sample average time spent on the webpage for the randomized groups "A" and "B". Suppose the groups are sufficiently large; then the central limit theorem (CLT) guarantees asymptotic normality, i.e.,

\begin{align*}
\hat{\mu}_A &\rightarrow \mathcal{N}\Big(\mu_A, \frac{\sigma_A^2}{n_A}\Big)\\
\hat{\mu}_B &\rightarrow \mathcal{N}\Big(\mu_B, \frac{\sigma_B^2}{n_B}\Big)\\
\end{align*}

The hypothesis to be tested in this example is: 
<br>&nbsp;$H_0$: "the average time spent is the same for the two versions"
<br>&nbsp;$H_1$: "the average time spent is higher for version B"

In other words:
<br>&nbsp;$H_0$: $\mu_A = \mu_B$
<br>&nbsp;$H_1$: $\mu_A < \mu_B$

Since the two groups are independent, and supposing the null is true, we have

\begin{align*}
\hat{\mu}_A - \hat{\mu}_B &\rightarrow \mathcal{N}\Big(\mu_A-\mu_B, \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\Big)\\
&\sim \mathcal{N}\Big(0, \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\Big)
\end{align*}

Naturally, in the "feasible" setting where the variances are estimated, the test statistic and p-value are:
\begin{align*}
Z &= \frac{\hat{\mu}_A-\hat{\mu}_B}{\frac{\hat{\sigma}_A^2}{n_A} + \frac{\hat{\sigma}_B^2}{n_B}}\\
p &= \Phi(Z)
\end{align*}

Since the alternative is $\mu_A < \mu_B$, only a one-sided (lower-tail) test is needed, which is why $p = \Phi(Z)$. Make sure to use the unbiased estimator for $\hat{\sigma}^2$ (its square root is usually denoted $s$). If the sample is small (e.g. fewer than 30 per version), then we need to invoke the two-sample $t$ test. Another justification for using a $t$ test in place of a $Z$ test is that when the population variances are not known, the $t$ test better accounts for the uncertainty in the variance estimates. Rigorously, the derivation of the $Z$ test assumes that $\sigma_A$ and $\sigma_B$ are known; in reality the test should respect the fact that the population variances are unknown and account for the variability in the test statistic due to sampling error in the estimates $\hat{\sigma}_A$ and $\hat{\sigma}_B$.
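The one-sided $Z$ test above can be sketched in a few lines using only the Python standard library. The function name `one_sided_z_test` and the two small samples are hypothetical and only serve as an illustration; in practice the groups should be much larger for the CLT approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def one_sided_z_test(sample_a, sample_b):
    """Test H0: mu_A = mu_B against H1: mu_A < mu_B.

    Returns the Z statistic and the lower-tail p-value Phi(Z).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance uses the unbiased (n - 1) estimator.
    se = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical time-on-page samples (seconds), for illustration only.
time_a = [52.0, 48.5, 50.1, 47.3, 51.8, 49.9, 50.4, 46.7]
time_b = [55.2, 53.9, 51.0, 56.4, 54.1, 52.8, 57.3, 53.5]
z, p = one_sided_z_test(time_a, time_b)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

A strongly negative $Z$ (version B has the higher average) gives a small lower-tail p-value, which is evidence against the null in favor of $\mu_A < \mu_B$.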

We will use the pooled (in contrast to paired) two-sample $t$-test. Note that the form of the test statistic and the d.f. depend on how similar the two sample standard deviations are. In the non-extreme case in which their ratio lies between 1/2 and 2, we can construct the test statistic as follows:

\begin{align*}
t &= \frac{\hat{\mu}_A-\hat{\mu}_B}{s\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}}\\
s &= \sqrt{\frac{(n_A-1)s_A^2 + (n_B-1)s_B^2}{n_A+n_B-2}}\\
\end{align*}

Note that $t \sim t_{n_A+n_B-2}$ so the d.f. is $n_A+n_B-2$. 
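As a minimal sketch, the pooled statistic and its degrees of freedom can be computed as below. The helper name `pooled_t_statistic` is hypothetical; the p-value itself requires the CDF of the $t$ distribution, which is not in the standard library (for example `scipy.stats.t.cdf(t, df)` could be used if SciPy is available).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return the pooled two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    s2_a, s2_b = variance(sample_a), variance(sample_b)  # unbiased estimators
    df = n_a + n_b - 2
    # Pooled standard deviation: weighted average of the two sample variances.
    s_pooled = sqrt(((n_a - 1) * s2_a + (n_b - 1) * s2_b) / df)
    t = (mean(sample_a) - mean(sample_b)) / (s_pooled * sqrt(1 / n_a + 1 / n_b))
    return t, df


# Hypothetical small samples, for illustration only.
t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(f"t = {t_stat:.4f} with {dof} degrees of freedom")
```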

## Example 2: Chi-square test on Conversion Rates

Suppose in an A/B test we observe $(n_A^0, n_A^1)$ and $(n_B^0, n_B^1)$, the number of converted (1) and non-converted (0) customers under each version. We would like to test the following null:
<br>&nbsp;$H_0$: "the conversion rate is the same for the two versions"
<br>&nbsp;$H_1$: "the conversion rate is higher for version B"

By defining an individual binary random variable for conversion, one can formulate the hypothesis test as a two-sample test for equal means and proceed with a $Z$ or $t$ test as described above. Alternatively, we can make use of a chi-squared test. Before we proceed, we introduce [Pearson's Theorem](https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2003/lecture-notes/lec23.pdf) on multinomial distributions.

#### Pearson's Theorem

Let $\{B_j\}_{j=1}^r$ be a partition of the outcome space for each of the $n$ i.i.d. random variables $\{X_i\}_{i=1}^n$, with probability $p_j = \mathbb{P}(X_i \in B_j)$. Define the cell counts $v_j = \sum_{i=1}^n I\{X_i \in B_j\}$. Then,

\begin{align*}
\sum_{j=1}^r \frac{(v_j-np_j)^2}{np_j} \rightarrow \chi^2_{r-1}
\end{align*}

See the [lecture notes](https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2003/lecture-notes/lec23.pdf) from MIT OpenCourseWare for a proof. The application of Pearson's Theorem will involve the following lemma under a two-sample setup.

#### Lemma (Two sample)

See [here](http://personal.psu.edu/drh20/asymp/fall2006/lectures/ANGELchpt07.pdf). Let there be another sample with size $m$ and define $w_j$ analogously to $v_j$. Define $\hat{p}_j$ the empirical probability of $B_j$ in the pooled sample, i.e. $\hat{p}_j = \frac{v_j+w_j}{n+m}$. If both samples are generated from the same multinomial distribution with probability vector $p$, we have:

\begin{align*}
\sum_{j=1}^r \bigg\{\frac{(v_j-n\hat{p}_j)^2}{n\hat{p}_j} + \frac{(w_j-m\hat{p}_j)^2}{m\hat{p}_j}\bigg\} \rightarrow \chi^2_{r-1}
\end{align*}

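By partitioning the outcome space into (converted, non-converted), so that $r=2$, the chi-square test on conversion rates under A/B testing can be implemented by directly evaluating the $p$-value of the test statistic under a $\chi_1^2$ distribution. Note that the statistic is symmetric in the two versions, so this tests against the two-sided alternative that the conversion rates differ.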
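The two-sample statistic from the lemma can be sketched as follows for the $r=2$ case. The function name and the conversion counts are hypothetical; the p-value uses the identity $\mathbb{P}(\chi^2_1 > x) = \text{erfc}(\sqrt{x/2})$, which holds because a $\chi^2_1$ variable is the square of a standard normal.

```python
from math import erfc, sqrt


def chi_square_conversion_test(conv_a, nonconv_a, conv_b, nonconv_b):
    """Two-sample chi-square test of equal conversion rates (r = 2 cells)."""
    n = conv_a + nonconv_a  # size of sample A
    m = conv_b + nonconv_b  # size of sample B
    # Pooled cell probabilities for (converted, non-converted).
    pooled = [(conv_a + conv_b) / (n + m), (nonconv_a + nonconv_b) / (n + m)]
    stat = 0.0
    for v_j, w_j, p_j in zip((conv_a, nonconv_a), (conv_b, nonconv_b), pooled):
        stat += (v_j - n * p_j) ** 2 / (n * p_j)
        stat += (w_j - m * p_j) ** 2 / (m * p_j)
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: 50/1000 conversions for A, 80/1000 for B.
stat, p_value = chi_square_conversion_test(50, 950, 80, 920)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```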

# Causal Mediation Analysis (CMA)

Based on [Imai, Keele and Yamamoto (2010)](https://arxiv.org/pdf/1011.1079.pdf) and [Imai, Keele and Tingley (2010)](https://imai.fas.harvard.edu/research/files/BaronKenny.pdf). It is common for controlled experiments to exhibit indirect treatment effects. In the end, the measured ATE will encompass both the direct effect on $Y_i$ and the treatment effects that are channeled through a list of mediators $M_i$. A mediator (also known as a mediating, intermediary, or intervening variable) is a variable that lies on the causal path between the treatment and the outcome. It must therefore be a post-treatment variable that occurs before the outcome is realized (image credits to Run4psych). In essence, the existence of a nuisance mediator breaks the assumption of unconfoundedness even if it is measured, because of its post-treatment nature.

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/Simple_Mediation_Model.png" width="40%">

The robustness of the measured ATE depends on which mediators are relevant to the research question. An example given in Imai et al. (2010) is the effect of the H1N1 vaccine on reducing the risk of developing flu in children. While a virologist may consider the development of antibodies as the relevant mediator, a social scientist may instead consider the act of parents signing a form acknowledging the risks of the vaccine (thus reducing the probability of getting the second dose) as relevant. When analyzing the ATE in (quasi-)experimental designs, we must identify which mediators are to be included in the target parameter, and which mediators are in fact nuisances for the research question and need to be excluded.

In terms of the framework we've been using, denote $Y_i$ and $W_i$ the observed outcome and treatment. Let $M_i(w)$ be the potential value of a (vector of) mediator under treatment assignment $w \in \{0,1\}$. The outcome of the experiment can be expressed as $Y_i = Y_i(W_i, M_i(W_i))$: if the support of $M$ is of cardinality $J$ (possibly through heterogeneity across $i$), then there are $2J$ potential values for the outcome, but only one is observed. We define the causal mediation effect (also known as pure/total indirect effect) for unit $i$ under treatment $w$ as:

\begin{align*}
\delta_i(w) = Y_i(w, M_i(1)) - Y_i(w, M_i(0))
\end{align*}

An implicit assumption of this representation concerns how the mediation effect takes place, i.e. that $M_i = M_i(W_i)$ only. It will be violated if $M_i$ depends on other factors, such as how the treatment is assigned (e.g. random vs natural response). Whether we are interested in the natural direct effect or the mediation effect on average, it will suffice to identify the average causal mediation effect (ACME) $\mathbb{E}[\delta_i(w)]$. We write the total treatment effect as:

\begin{align*}
\tau_i &= Y_i(1, M_i(1)) - Y_i(0, M_i(0))\\
&= \delta_i(1)+\zeta_i(0) = \delta_i(0)+\zeta_i(1)\\
\zeta_i(w) &\triangleq Y_i(1, M_i(w)) - Y_i(0, M_i(w))
\end{align*}

Here $\zeta_i(w)$ is the natural direct effect of the treatment when the mediator is held at its potential value $M_i(w)$. The authors showed that under the assumption of sequential ignorability specified below, the ACME is non-parametrically identified (as the double integral of the non-parametric $\mathbb{E}[Y]$ over $m$ and $x$):

\begin{align*}
\{Y_i(w',m), M_i(w)\} &\text{ } {\perp\!\!\!\perp} W_i | X_i=x & & \\
Y_i(w',m) &\text{ } {\perp\!\!\!\perp} M_i(w) | W_i=w, X_i=x & & \forall w,w' \in \{0,1\}, m \in \text{supp}(M_i), x \in \mathcal{X}\\
\end{align*}

We also require the regularity conditions $\mathbb{P}(W_i=w|X_i=x)\in (0,1)$ and $\mathbb{P}(M_i(w)=m|W_i=w, X_i=x)\in (0,1)$. That is, the potential outcomes and the potential values of the mediator are first assumed to be ignorable with respect to treatment assignment given the pre-treatment covariates, and then the potential outcomes are assumed to be ignorable with respect to the realized mediator given the observed treatment and the pre-treatment covariates. The authors provide an example under the linear structural equation model setting.
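The decomposition $\tau_i = \delta_i(1)+\zeta_i(0) = \delta_i(0)+\zeta_i(1)$ can be checked numerically on a toy unit. The potential mediator and potential outcome functions below are entirely hypothetical (a linear model with a treatment-mediator interaction) and are not taken from the papers; the point is only that the identity holds for any such functions.

```python
def potential_mediator(w):
    """Hypothetical potential mediator M_i(w) for a single unit."""
    return 2.0 + 1.5 * w


def potential_outcome(w, m):
    """Hypothetical potential outcome Y_i(w, m) with a treatment-mediator interaction."""
    return 1.0 + 0.5 * w + 0.8 * m + 0.3 * w * m


def mediation_effect(w):
    """delta_i(w) = Y_i(w, M_i(1)) - Y_i(w, M_i(0))."""
    return potential_outcome(w, potential_mediator(1)) - potential_outcome(w, potential_mediator(0))


def direct_effect(w):
    """zeta_i(w) = Y_i(1, M_i(w)) - Y_i(0, M_i(w))."""
    return potential_outcome(1, potential_mediator(w)) - potential_outcome(0, potential_mediator(w))


total_effect = potential_outcome(1, potential_mediator(1)) - potential_outcome(0, potential_mediator(0))
print(total_effect, mediation_effect(1) + direct_effect(0), mediation_effect(0) + direct_effect(1))
```

Because of the interaction term, $\delta_i(1) \neq \delta_i(0)$ in this toy example, yet both decompositions recover the same total effect.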

# Applications on A/B Testing

Based on [Yin and Hong (2019)](https://arxiv.org/pdf/1906.09757.pdf). The authors note that it is common for sequential ignorability (SI) to break under A/B Testing and other online experimentation. Such violations are caused by (1) mediators that are known to exist but are unmeasured in the available data, or (2) variables (either pre- or post-treatment) that are known to confound the mediator-outcome relationship but are unmeasured. In the motivating example of a recommendation module on Etsy.com, a change in the recommendations can induce users to change their behavior on many other webpages and modules. Therefore, if the number of organic search clicks is the mediator of interest, numerous other mediators might confound its relationship with sitewide conversion. However, it could be too costly to measure user behavior on every single webpage or module, leaving these mediators unmeasured. On the other hand, because of the "guest checkout" function (i.e., users can make purchases without registering at Etsy) and increasingly stringent privacy protection, it is difficult to obtain users' pre-treatment covariates, such as age, gender, and education level, to properly control for the mediator-outcome relationship.

The authors identify conditions under which, even when multiple unmeasured causally-dependent mediators exist, the generalized direct and indirect effects can still be identified and estimated from two linear regression equations. Throughout this example (and in fact in the paper), we can take $W_i$ to be exposure to the recommendation module and $M_i$ to be organic search. In the framework presented in the paper, there are upstream, intermediate, and downstream mediators $M_0, M_1, M_2$. The upstream and downstream mediators are allowed to be multivariate, while $M_1$ is assumed to be a scalar. The causal relationships (note that $T$ is equivalent to $W$ in our notation) can be seen in the following directed acyclic graph (DAG).

<img src="https://dl.acm.org/cms/asset/171c5686-e541-4584-9c45-c7b45b3bda12/3292500.3330769.key.jpg" width="25%">

Define the Generalized Average Direct Effect (GADE) and Generalized Average Causal Mediation Effect (GACME) as:

\begin{align*}
\text{GADE}(w) =& \mathbb{E}\bigg[ Y_i\Big(1, M_{i0}(1), M_{i1}\big(w, M_{i0}(w)\big), M_{i2}\Big( 1, M_{i0}(1), M_{i1}\big(w, M_{i0}(w)\big) \Big)\Big)\bigg]\\
& - \mathbb{E}\bigg[ Y_i\Big(0, M_{i0}(0), M_{i1}\big(w, M_{i0}(w)\big), M_{i2}\Big( 0, M_{i0}(0), M_{i1}\big(w, M_{i0}(w)\big) \Big)\Big)\bigg]\\
\text{GACME}(w) =& \mathbb{E}\bigg[ Y_i\Big(w, M_{i0}(w), M_{i1}\big(1, M_{i0}(1)\big), M_{i2}\Big( w, M_{i0}(w), M_{i1}\big(1, M_{i0}(1)\big) \Big)\Big)\bigg]\\
& - \mathbb{E}\bigg[ Y_i\Big(w, M_{i0}(w), M_{i1}\big(0, M_{i0}(0)\big), M_{i2}\Big( w, M_{i0}(w), M_{i1}\big(0, M_{i0}(0)\big) \Big)\Big)\bigg]\\
\text{ATE} = &\mathbb{E}\bigg[ Y_i\Big(1, M_{i0}(1), M_{i1}\big(1, M_{i0}(1)\big), M_{i2}\Big( 1, M_{i0}(1), M_{i1}\big(1, M_{i0}(1)\big) \Big)\Big)\bigg]\\
& - \mathbb{E}\bigg[ Y_i\Big(0, M_{i0}(0), M_{i1}\big(0, M_{i0}(0)\big), M_{i2}\Big( 0, M_{i0}(0), M_{i1}\big(0, M_{i0}(0)\big) \Big)\Big)\bigg]\\
=& \text{GADE}(w) + \text{GACME}(1-w)  
\end{align*}

In other words, GADE captures the causal effect of the treatment $W_i$ that goes through all the channels that do not involve $M_{i1}$, i.e. $T \rightarrow Y$, $T \rightarrow M_0 \rightarrow Y$, $T \rightarrow M_0 \rightarrow M_2 \rightarrow Y$, $T \rightarrow M_2 \rightarrow Y$, while GACME captures the causal effect of the treatment $W_i$ that goes through all channels that involve $M_{i1}$: $T \rightarrow M_1 \rightarrow Y$, $T \rightarrow M_0 \rightarrow M_1 \rightarrow Y$, $T \rightarrow M_1 \rightarrow M_2 \rightarrow Y$, and $T \rightarrow M_0 \rightarrow M_1 \rightarrow M_2 \rightarrow Y$. In the context of the Etsy.com example, GADE essentially represents the portion of the ATE that is not transmitted by the induced change in users' organic search clicks, i.e. the intermediate mediator $M_1$ is organic search clicks. In most A/B Testing applications, the upstream and downstream mediators are unknown.

The causal relationships are formulated according to the following linear structural equation model (LSEM). Note that as $M_0$ and $M_2$ are multivariate, (1) and (3) can be regarded as the reduced forms of (linear) structural equations that parametrize causal relationships among multiple upstream mediators and among multiple downstream mediators. E.g. any linear relationships between $M_{i0}^k$ and $M_{i0}^j$ are reduced to form equation (1).

\begin{align}
M_{i0} &= \alpha_0 + \beta_0 W_i + e_{i0}\\
M_{i1} &= \alpha_1 + \beta_1 W_i + \psi_1^T M_{i0} + \xi_1^T M_{i0} W_i + e_{i1}\\
M_{i2} &= \alpha_2 + \beta_2 W_i + \Psi_2 M_{i0} + \psi_3 M_{i1} + \Xi_2 M_{i0}W_i + \xi_3 M_{i1}W_i + e_{i2}\\
Y_i &= \alpha_3 + \beta_3 W_i + \gamma_0^T M_{i0} + \gamma_1 M_{i1} + \gamma_2^T M_{i2} + \kappa_0^T M_{i0} W_i + \kappa_1 M_{i1} W_i + \kappa_2^T M_{i2} W_i + e_{i3}
\end{align}

We adjust the definition of SI in this setup starting from the top level: $\{Y, M_0, M_1, M_2\} \text{ } {\perp\!\!\!\perp} W_i$, then $\{Y, M_1, M_2\} \text{ } {\perp\!\!\!\perp} M_0|W_i=w$, and so on. Sequential ignorability in the presence of stagewise mediators mimics sequential treatment assignments (ideal interventions) in randomized experiments. First, the treatment is independent of all potential outcomes and potential mediators (i.e., ignorable) and its probability is strictly between 0 and 1, which is guaranteed by random assignment. Then, for each mediator, conditional on the treatment and its upstream mediators, its potential values behave like a treatment and are ignorable with respect to the potential outcomes and the potential values of its downstream mediators. The ability to condition on upstream mediators relaxes the original SI of Imai et al. (2010), in the sense that downstream mediators only need to be conditionally independent. SI implies that the error terms $\{e_{i0}, e_{i1}, e_{i2}, e_{i3}\}$ are strictly exogenous.

Consider the following linear regression:

\begin{align}
M_{i1} &= \theta_{M_10} + \theta_{M_11} W_i + \mu_{M_1}\\
Y_i &= \theta_{Y0} + \theta_{Y1}W_i + \theta_{Y2}M_{i1} + \theta_{Y3}M_{i1}W_i+\mu_Y
\end{align}


We can show that, by comparing (4) and (6) after substituting out $M_{i0}$ and $M_{i2}$ and using strict exogeneity of error terms:

\begin{align*}
\text{GADE}(w) &= \theta_{Y1} + \theta_{Y3}(\theta_{M_10}+\theta_{M_11} w)\\
\text{GACME}(w) &= \theta_{M_11}(\theta_{Y2}+\theta_{Y3}w) 
\end{align*}

For inference, it can be shown that under the null of a zero effect the test statistics $\frac{\widehat{\text{GADE}}}{\sqrt{var(\widehat{\text{GADE}})}} \rightarrow \mathcal{N}(0,1)$ and $\frac{\widehat{\text{GACME}}}{\sqrt{var(\widehat{\text{GACME}})}} \rightarrow \mathcal{N}(0,1)$. The asymptotic variances are found by the Delta method. Since $\mu_{M_1}$ and $\mu_Y$ are correlated (one can see this by expressing each of them in terms of $\{e_{i0}, e_{i1}, e_{i2}, e_{i3}\}$), an iterative GMM with a heteroskedasticity and autocorrelation consistent (HAC) covariance matrix adjustment that estimates the system simultaneously is recommended, even though it is convenient to regress each equation separately and form the point estimates for GADE and GACME.
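Given coefficient estimates from the two regressions, the plug-in point estimates are simple to form. The sketch below uses hypothetical coefficient values (not results from the paper) and hypothetical function names; it also checks numerically that $\text{GADE}(w) + \text{GACME}(1-w)$ gives the same ATE for $w=0$ and $w=1$.

```python
def gade(w, theta_m0, theta_m1, theta_y1, theta_y3):
    """GADE(w) = theta_Y1 + theta_Y3 * (theta_M10 + theta_M11 * w)."""
    return theta_y1 + theta_y3 * (theta_m0 + theta_m1 * w)


def gacme(w, theta_m1, theta_y2, theta_y3):
    """GACME(w) = theta_M11 * (theta_Y2 + theta_Y3 * w)."""
    return theta_m1 * (theta_y2 + theta_y3 * w)


# Hypothetical regression coefficients, for illustration only.
theta = {"m0": 2.0, "m1": 0.4, "y1": 0.10, "y2": 0.05, "y3": -0.02}

ate_from_w0 = (gade(0, theta["m0"], theta["m1"], theta["y1"], theta["y3"])
               + gacme(1, theta["m1"], theta["y2"], theta["y3"]))
ate_from_w1 = (gade(1, theta["m0"], theta["m1"], theta["y1"], theta["y3"])
               + gacme(0, theta["m1"], theta["y2"], theta["y3"]))
print(ate_from_w0, ate_from_w1)
```

Standard errors are deliberately left out of this sketch, since they depend on the joint covariance of the coefficient estimates discussed above.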

# Models of Probabilistic Distributions

## Bernoulli

Perhaps it is most straightforward to start with a Bernoulli distribution for a binary random variable $Y$ with support 0 and 1, i.e. $\text{Bernoulli}(p)$. 

\begin{align*}
\text{supp}(Y) &= \{0,1\}\\
\mathbb{P}(Y=1) &= p\\
\mathbb{E}[Y] &= p\\
var(Y) &= p(1-p)
\end{align*}

## Binomial

If we have $n$ i.i.d. $\text{Bernoulli}(p)$ random variables $\{Y_i\}_{i=1}^n$, then the number of 1's in the sample, $Z$, follows a $\text{Binomial}(n,p)$ distribution.

\begin{align*}
Z &= \sum_{i=1}^n I\{Y_i=1\}\\
\text{supp}(Z) &= \{0, 1, \ldots, n\}\\
\mathbb{P}(Z=k) &= \binom{n}{k} p^k (1-p)^{n-k}\\
\mathbb{E}[Z] &= np\\
var(Z) &= np(1-p)
\end{align*}

Note that for large $n$ the $\text{Binomial}(n,p)$ distribution is well approximated by the Gaussian distribution $\mathcal{N}(np, np(1-p))$. This follows from applying the CLT to the sum of the $n$ i.i.d. Bernoulli variables.
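A quick numerical check of the binomial moments and of the Gaussian approximation, using only the standard library. The values $n=400$, $p=0.3$ and the cutoff 130 are arbitrary choices for illustration, and the continuity correction (evaluating the normal CDF at 130.5) is a standard refinement that is not discussed above.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 400, 0.3
mean_z = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var_z = sum((k - mean_z) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))

# Normal approximation of P(Z <= 130) with a continuity correction.
exact = sum(binomial_pmf(k, n, p) for k in range(131))
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(130.5)
print(f"mean = {mean_z:.4f}, var = {var_z:.4f}, exact = {exact:.4f}, approx = {approx:.4f}")
```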

## Geometric

Consider the binomial setup. Let $V$ denote the number of trials up to and including the first success (1). $V$ follows a $\text{Geometric}(p)$ distribution. Note that $p$ must be strictly positive for $V$ to be well-defined.

\begin{align*}
\text{supp}(V) &= \mathbb{N}^+\\
\mathbb{P}(V=k) &= (1-p)^{k-1}p\\
\mathbb{E}[V] &= \frac{1}{p}\\
var(V) &= \frac{1-p}{p^2}
\end{align*}

Note that the Geometric distribution is often used to model the number of periods until a fault occurs.

## Poisson

A Poisson distribution can be considered as the limit of a binomial distribution as the number of trials grows while the expected number of successes stays fixed. Let $\lambda > 0$ and split a unit time interval into $n$ short sub-intervals. Suppose that: (1) the numbers of events occurring in non-overlapping intervals are independent, (2) the probability of exactly one event in a short interval of length $h=\frac{1}{n}$ is (approximately) $\frac{\lambda}{n}$, and (3) the probability of two or more events in a short interval is essentially zero. Then the number of events over the $n$ sub-intervals, $X$, follows a $\text{Poisson}(\lambda)$ distribution as $n \rightarrow \infty$.

\begin{align*}
\mathbb{P}(X=k) &= \binom{n}{k}\Big(\frac{\lambda}{n}\Big)^k \Big(1-\frac{\lambda}{n}\Big)^{n-k}\\
&= \frac{\lambda^k}{k!} \bigg[\frac{n(n-1)\cdots(n-k+1)}{n^k}\bigg] \Big(1-\frac{\lambda}{n}\Big)^{n-k}\\
&= \frac{\lambda^k}{k!} \bigg[1 \cdot \Big(1-\frac{1}{n}\Big) \cdots \Big( 1-\frac{k-1}{n} \Big)\bigg] \Big(1-\frac{\lambda}{n}\Big)^{n} \Big(1-\frac{\lambda}{n}\Big)^{-k}\\
&\rightarrow \frac{\lambda^k}{k!} (1) (e^{-\lambda}) (1) = \frac{\lambda^k}{k!}e^{-\lambda} 
\end{align*}

Poisson distributions are commonly used to model arrival processes, such as in queueing theory. Their basic properties are:

\begin{align*}
\text{supp}(X) &= \mathbb{N}\\
\mathbb{P}(X=k) &= \frac{\lambda^k}{k!}e^{-\lambda} \\
\mathbb{E}[X] &= \lambda\\
var(X) &= \lambda
\end{align*}
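The binomial-to-Poisson limit derived above can be illustrated numerically. The sketch below compares the $\text{Binomial}(n, \lambda/n)$ probability with the Poisson probability at a fixed $k$ for increasing $n$; the values $\lambda=3$ and $k=2$ are arbitrary choices.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)


lam, k = 3.0, 2
for n in (10, 100, 1000, 10000):
    gap = abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
    print(f"n = {n:>5}: |binomial - Poisson| = {gap:.6f}")
```

The gap shrinks roughly in proportion to $1/n$, consistent with the limit taken in the derivation.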

## Exponential

Consider the Poisson setup. Note that the number of arrivals in an interval of length $t$ follows a Poisson($\lambda t$) distribution. Let $T$ be the interarrival time between two arrivals. 

\begin{align*}
&\mathbb{P}(T \leq t)\\
=&1-\mathbb{P}(T > t)\\
=&1-\mathbb{P}(X(t) = 0)\\
=&1-e^{-\lambda t}
\end{align*}

Essentially, the interarrival time $T$ is memoryless and does not depend on the number of arrivals that have taken place so far. One application is to test whether service assistants suffer from fatigue. For example, if there is no fatigue then $T$ should follow an exponential distribution with a constant parameter, whereas fatigue would show up as a rate that changes over time. If $T$ follows $\text{Exponential}(\lambda)$:

\begin{align*}
\text{supp}(T) &= [0, \infty)\\
F(t) &= 1-e^{-\lambda t}\\
\mathbb{E}[T] &= \frac{1}{\lambda}\\
var(T) &= \frac{1}{\lambda^2}
\end{align*}
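The memoryless property, $\mathbb{P}(T > s+t \,|\, T > s) = \mathbb{P}(T > t)$, follows directly from the CDF $1-e^{-\lambda t}$ derived above. A minimal numerical check, with hypothetical function names and arbitrary values of $s$, $t$ and $\lambda$:

```python
from math import exp


def exponential_cdf(t, lam):
    """F(t) = P(T <= t) = 1 - exp(-lam * t) for t >= 0."""
    return 1.0 - exp(-lam * t) if t >= 0 else 0.0


def conditional_survival(s, t, lam):
    """P(T > s + t | T > s), computed from the CDF."""
    return (1.0 - exponential_cdf(s + t, lam)) / (1.0 - exponential_cdf(s, lam))


s, t, lam = 2.0, 1.5, 0.7
print(conditional_survival(s, t, lam), 1.0 - exponential_cdf(t, lam))
```

Both printed numbers agree up to floating point error: having already waited $s$ units does not change the distribution of the remaining wait.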

## Negative Binomial

Consider the binomial setup. Let $K$ be the number of failures before the $r$-th success. If there were $k$ failures before the $r$-th success, then the first $k+r-1$ trials contain $r-1$ successes and $k$ failures, and the $(k+r)$-th trial must be a success. Therefore:

\begin{align*}
\mathbb{P}(K=k) &= \binom{k+r-1}{r-1}p^{r-1}(1-p)^k \times p\\
&= \binom{k+r-1}{r-1}p^{r}(1-p)^k
\end{align*}

We say that $K$ follows a negative binomial distribution with parameters $(r,p)$:

\begin{align*}
\text{supp}(K) &= \mathbb{N}\\
\mathbb{P}(K=k) &= \binom{k+r-1}{r-1}p^{r}(1-p)^k\\
\mathbb{E}[K] &= \frac{r(1-p)}{p}\\
var(K) &= \frac{r(1-p)}{p^2}
\end{align*}

It gets the name negative binomial because:

\begin{align*}
\binom{k+r-1}{k} &= (-1)^k \binom{-r}{k}\\
&= (-1)^k \frac{(-r)(-r-1)\cdots(-r-k+1)}{k!}
\end{align*}

One advantage of using negative binomial distributions to model arrivals is that they can accommodate overdispersion, since the variance of a negative binomial is strictly larger than its mean (recall that a Poisson($\lambda$) distribution forces both the mean and the variance to equal $\lambda$).
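The moments can be verified numerically from the probability mass function. Under the parametrization used here ($K$ counts failures before the $r$-th success, with success probability $p$), the mean is $\frac{r(1-p)}{p}$ and the variance is $\frac{r(1-p)}{p^2}$. The values of $r$ and $p$ and the truncation point of the infinite sum are arbitrary choices for this sketch.

```python
from math import comb


def negative_binomial_pmf(k, r, p):
    """P(K = k): k failures before the r-th success, success probability p."""
    return comb(k + r - 1, r - 1) * p**r * (1 - p) ** k


r, p = 4, 0.35
support = range(400)  # truncation point; the neglected tail mass is negligible here
total = sum(negative_binomial_pmf(k, r, p) for k in support)
mean_k = sum(k * negative_binomial_pmf(k, r, p) for k in support)
var_k = sum((k - mean_k) ** 2 * negative_binomial_pmf(k, r, p) for k in support)
print(f"total = {total:.6f}, mean = {mean_k:.4f}, var = {var_k:.4f}")
```

The variance exceeds the mean by a factor of $1/p$, which is the overdispersion relative to a Poisson distribution with the same mean.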

## Weibull

A Weibull distribution can be derived as a nonlinear transformation of a standard exponential, see [here](https://math.stackexchange.com/questions/2556351/deriving-the-weibull-distribution-using-the-exponential/2556377). Suppose that $X$ follows $\text{Exponential}(1)$; then $W = \lambda X^{\frac{1}{k}}$ follows a $\text{Weibull}(\lambda, k)$ distribution, with $\lambda, k > 0$.

\begin{align*}
\mathbb{P}(W \leq w) &= \mathbb{P}(\lambda X^{\frac{1}{k}} \leq w) \\
&= \mathbb{P}\bigg(X \leq \Big(\frac{w}{\lambda}\Big)^{k} \bigg)\\
&= 1-\exp\bigg( -\Big(\frac{w}{\lambda}\Big)^{k} \bigg)
\end{align*}

$\lambda$ and $k$ can be interpreted as the scale and shape parameters. The mean is $\lambda\Gamma(1+\frac{1}{k})$, and a larger $k$ gives thinner tails. The Weibull distribution is known as the Extreme Value Type III distribution (Type I and II being Gumbel and Fréchet respectively). For $k<1$ it displays heavy tails, which makes it useful for modeling extreme events in finance and in applications of survival analysis.
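The transformation also gives a simple way to simulate Weibull draws. The sketch below draws $X \sim \text{Exponential}(1)$, sets $W = \lambda X^{1/k}$ so that $\mathbb{P}(W \leq w) = 1-\exp(-(w/\lambda)^k)$, and compares the empirical CDF at one point with the closed form. The parameter values, seed, and evaluation point are arbitrary choices.

```python
import random
from math import exp


def sample_weibull(lam, k, rng):
    """Draw W = lam * X**(1/k) with X ~ Exponential(1)."""
    x = rng.expovariate(1.0)
    return lam * x ** (1.0 / k)


def weibull_cdf(w, lam, k):
    """P(W <= w) = 1 - exp(-(w / lam)**k) for w > 0."""
    return 1.0 - exp(-((w / lam) ** k)) if w > 0 else 0.0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
lam, k = 2.0, 1.5
draws = [sample_weibull(lam, k, rng) for _ in range(50_000)]
w0 = 2.5
empirical = sum(d <= w0 for d in draws) / len(draws)
print(f"empirical CDF = {empirical:.4f}, closed form = {weibull_cdf(w0, lam, k):.4f}")
```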

Despite the flexibility of the Weibull, the lognormal also has some advantages for modeling skewed, all-positive data with low means and large variances (i.e., a large standard deviation). Additionally, if taking the natural log of the data yields a normal distribution, then the lognormal is the natural fit.

## Beta

Before deriving the Beta distribution, we note that a __Conjugate Prior__ (with respect to the data generating distribution) is a distribution such that, if it is chosen as the prior, the posterior ends up in the same family as the prior. The Beta distribution is the conjugate prior for the $p$ parameter of a binomial distribution. Specifically, under the binomial setup, a $\text{Beta}(\alpha, \beta)$ can be thought of as the prior of $p$ after having observed $\alpha-1$ successes and $\beta-1$ failures. If the current trial is a success, then the updated distribution of $p$ is $\text{Beta}(\alpha+1, \beta)$. Note that the support of a Beta distribution is $(0,1)$.

Beta distributions are very flexible. PDFs can be bell-shaped, straight line, or even U shaped (see [this medium post](https://towardsdatascience.com/beta-distribution-intuition-examples-and-derivation-cf00f4db57af)). Note that the Beta distribution is the conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions (seems like those are the distributions that involve success & failure) in Bayesian inference.
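The conjugate update amounts to adding the observed successes and failures to the prior parameters. A minimal sketch, with hypothetical function names and a hypothetical conversion example:

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: uniform Beta(1, 1) prior, then 12 conversions in 100 visits.
posterior = update_beta(1, 1, successes=12, failures=88)
print(posterior, round(beta_mean(*posterior), 4))
```

Updating one trial at a time or with the whole batch at once gives the same posterior, which is what makes the Beta prior convenient for sequential experiments.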

## Gamma

Note that the Gamma function is defined as:

\begin{align*}
\Gamma(z+1) &= \int_0^\infty x^z e^{-x}dx
\end{align*}

Some nice properties of the Gamma function include $\Gamma(z+1) = z\Gamma(z)$ (proved using integration by parts) and $\Gamma(n) = (n-1)!$ if $n \in \mathbb{N}^+$. Now consider the Poisson setup. Let $\tau$ be the elapsed time until the $k$-th arrival. Then $\tau$ follows a $\Gamma(k, \frac{1}{\lambda})$ distribution. It is perhaps more intuitive to think of $\frac{1}{\lambda}$ as the expected waiting time for an arrival.

\begin{align*}
F(t) &= \mathbb{P}(\tau \leq t)\\
&= 1-\mathbb{P}(\tau > t)\\
&= 1-\sum_{x=0}^{k-1} \frac{(\lambda t)^x e^{-\lambda t}}{x!} = 1 - e^{-\lambda t} - \sum_{x=1}^{k-1} \frac{(\lambda t)^x e^{-\lambda t}}{x!}\\
f(t) &= \lambda e^{-\lambda t} + \lambda e^{-\lambda t} \sum_{x=1}^{k-1} \bigg( \frac{(\lambda t)^x}{x!} - \frac{(\lambda t)^{x-1}}{(x-1)!} \bigg)\\
&= \lambda e^{-\lambda t} + \lambda e^{-\lambda t}\bigg( \frac{(\lambda t)^{k-1}}{(k-1)!} - 1 \bigg)\\
&= \frac{\lambda e^{-\lambda t} (\lambda t)^{k-1}}{(k-1)!}\\
&= \frac{t^{k-1} e^{-\lambda t}}{\Gamma(k)\big(\frac{1}{\lambda}\big)^k}
\end{align*}

Alternatively, a Gamma random variable can be considered as the sum of $k$ i.i.d. exponential random variables with parameter $\lambda$. Naturally, if $k=1$ then the Gamma distribution is exponential. The chi-squared distribution is also a special case of the Gamma distribution.
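For integer $k$, the link between the Poisson counts and the Gamma density can be checked numerically: the CDF obtained from the Poisson sum, $1-\sum_{x=0}^{k-1} \frac{(\lambda t)^x e^{-\lambda t}}{x!}$, should match the integral of the density $\frac{\lambda e^{-\lambda t}(\lambda t)^{k-1}}{(k-1)!}$. The function names, the trapezoidal rule, and the parameter values below are choices made for this sketch.

```python
from math import exp, factorial


def erlang_cdf(t, k, lam):
    """P(tau <= t) = 1 - P(fewer than k arrivals of a Poisson(lam * t) count)."""
    return 1.0 - sum((lam * t) ** x * exp(-lam * t) / factorial(x) for x in range(k))


def gamma_pdf(t, k, lam):
    """Density of Gamma(shape k, scale 1/lam) for integer k."""
    return lam * exp(-lam * t) * (lam * t) ** (k - 1) / factorial(k - 1)


def integrate_pdf(t, k, lam, steps=20_000):
    """Trapezoidal approximation of the integral of the density on [0, t]."""
    h = t / steps
    inner = sum(gamma_pdf(i * h, k, lam) for i in range(1, steps))
    return h * (0.5 * gamma_pdf(0.0, k, lam) + inner + 0.5 * gamma_pdf(t, k, lam))


t, k, lam = 2.0, 3, 1.5
print(f"Poisson-sum CDF = {erlang_cdf(t, k, lam):.6f}, integrated pdf = {integrate_pdf(t, k, lam):.6f}")
```

With $k=1$ the same functions reduce to the exponential CDF and density, matching the special case noted above.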