# Analyzing A/A Tests and Backwards Causality for Observational Causal Inference Diagnostics

## 1. Introduction and Context

The causal inference literature makes it clear that causal inference is difficult for many reasons, most notably the inescapable reality that all observational causal inference relies on *unverifiable* assumptions.

The most common example is *conditional exchangeability* (it goes by many other names), a key identifiability assumption that many methods require. Intuitively, it requires that you understand the full causal structure around the treatment effect you are analyzing: it assumes that all confounders have been identified and included in the set of covariates to control for. Sadly, there is no way to verify that this was done correctly.
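
For reference, the assumption is usually written in potential-outcomes notation as follows. This is the standard textbook formulation rather than a quote from the paper, with $W_i$ the binary treatment, $X_i$ the covariates, and $Y_i(1), Y_i(0)$ the potential outcomes:

$$\big(Y_i(1),\,Y_i(0)\big)\ \perp\ W_i \mid X_i$$

The same condition is also commonly called ignorability, unconfoundedness, or selection on observables.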

In 2019, LinkedIn's Data Science team published a paper ([Bojinov et al., 2019](https://arxiv.org/pdf/1903.07755.pdf)) on an application of observational causal inference. They were interested in the causal effect of active engagement with the LinkedIn Feed (likes, comments, shares) on subsequent engagement. They used a popular panel data method called Fixed Effects, as well as more conventional cross-sectional methods such as matching (or weighting) with doubly robust practices. Finally, they compared those results with a naive correlation analysis to show how relying on mere correlation can lead to overconfidence in the product.

However, in Section 2.1.3 ("Step 3: Diagnostic tools and sensitivity analysis"), the paper proposes several practical diagnostics for checking whether the above assumptions are satisfied.
1. *Backwards Causality*: The paper claims that achieving a good balance (e.g., via matching) ensures that the potential outcomes do not contain any information about the treatment assignment. Hence, running a regression for $\Pr(W_i=1|Y_i,X_i)$ on the balanced set should yield a coefficient for $Y_i$ that is close to $0$ (here $W_i$ is the binary treatment).
2. *A/A testing*: Typically, the outcome of interest in observational studies is a member level metric which is measured over time. Therefore, there is usually a measure of this before the intervention occurred. Using the historical metrics as the primary outcome and applying the chosen analysis method to this outcome should yield a causal estimate of $0$.

The goal of this notebook is to explore these diagnostics and see whether they really work the way the paper claims.

## 2. Backwards Causality

After reading the above section in [Bojinov et al., 2019](https://arxiv.org/pdf/1903.07755.pdf), I was a bit perplexed because I had never heard of this diagnostic method before. Googling yielded no reliable results, so I resorted to asking on [CrossValidated](https://stats.stackexchange.com/questions/631005/is-running-pt-1y-x-a-recommended-way-to-help-diagnose-the-unconfoundedness-as). Admittedly, I was a bit conservative in framing my question: while the diagnostic didn't sound right to me, I knew that the authors of the paper had a significantly higher level of mastery of causal inference than I did. Of course, this does not mean that they will always be more correct than someone less familiar, but it does mean that they will be right *more often*. Perhaps because of this framing, the first reply (a comment) was completely irrelevant to my question, which suggested that I had not articulated it well. I added a few more words of context, and a little while later I received an answer from the user Noah (whose work I have used quite often, including the R packages `MatchIt` and `WeightIt`). Noah explained that *Backwards Causality* as described in the above paper is a "completely invalid method".

He explained that estimating $\Pr(T|Y,X)$ (here we use $T$ for the binary treatment instead of the paper's $W$) simply measures the association between $Y$ and $T$ given $X$, which is exactly the effect we are trying to estimate in the first place. This lines up with my current understanding as well: a conditional expectation is simply an associational measure, and it only takes on a causal interpretation once the effect has been identified. We can test this below.

First, we'll create our variables
- $X_i\sim\text{Normal}(0,1)$
- $T_i\sim\text{Bernoulli}(\text{logit}^{-1}(X_i))$
- $\epsilon_i\sim\text{Normal}(0,1)$
- $Y_i=2T_i+X_i+\epsilon_i$

Notice that $X_i$ is a common cause of both treatment $T_i$ and outcome $Y_i$, making it a confounder. $T_i$ is our binary treatment that is based on $X_i$ and the outcome $Y_i$ shows that the treatment effect is $2$.
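
Because this data-generating process is fully specified, the reverse conditional probability can be worked out by hand. The derivation below is a sketch under the simulation's own assumptions (treatment probability equal to the inverse logit of $X_i$, as in the code, and standard normal noise), with $\phi$ denoting the standard normal density:

$$\log\frac{\Pr(T_i=1\mid Y_i=y,X_i=x)}{\Pr(T_i=0\mid Y_i=y,X_i=x)}=\underbrace{\log\frac{\Pr(T_i=1\mid X_i=x)}{\Pr(T_i=0\mid X_i=x)}}_{=\,x}+\log\frac{\phi(y-2-x)}{\phi(y-x)}=x+\frac{(y-x)^2-(y-x-2)^2}{2}=2y-x-2$$

So in the unmatched population, a correctly specified logistic regression of $T$ on $Y$ and $X$ would have a coefficient of $2$ on $Y$, which is nonzero precisely because the treatment effect is nonzero.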

In [1]:
import numpy as np
import pandas as pd
from scipy.special import expit

def sim_data(n=2000):
  """
  Simulate Cross-Sectional Data to Match

  Parameters:
  - n: Number of Samples

  Output:
  - Pandas Dataframe
  """
  # create dataframe with confounder X
  p = pd.DataFrame(
      {
          'x': np.random.normal(0,1,size=n),
      }
  )

  # add treatment (probability = inverse logit of x) and outcome (true effect = 2)
  p['t'] = np.random.binomial(1,expit(p['x']),size=n)
  p['y'] = 2*p['t'] + p['x'] + np.random.normal(0,1,size=n)

  return p

In [2]:
# Load R
# Guide here: https://colab.research.google.com/drive/1ISG891i076enSPB-4bni_DECWlFlnasU?usp=sharing
%reload_ext rpy2.ipython
%config IPCompleter.greedy=True
%config InlineBackend.figure_format = 'retina'

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
%%R
#install.packages("MatchIt") # main matching package
#install.packages("marginaleffects") # to compute treatment effect

NULL


In [6]:
%%R
library("MatchIt")
library("marginaleffects")

In [27]:
# simulate data
df = sim_data()

In [28]:
%%R -i df
avg_comparisons(lm(y~t,df), variables = "t")


 Term Contrast Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
    t    1 - 0     2.89     0.0612 47.2   <0.001 Inf  2.77   3.01

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 
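
The gap between this naive estimate and the true effect of $2$ can be sanity-checked in Python alone. The sketch below regenerates data from the same data-generating process. The seed, the larger sample size, and the use of plain OLS adjustment (rather than matching) are assumptions made only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 200_000                     # larger than the notebook's n to reduce noise

x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))  # inverse-logit treatment probability
y = 2 * t + x + rng.normal(size=n)

# Naive estimate: difference in mean outcomes, ignoring the confounder x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Imbalance in the confounder between treated and control units.
imbalance = x[t == 1].mean() - x[t == 0].mean()

# Adjusted estimate: OLS of y on an intercept, t and x.
design = np.column_stack([np.ones(n), t, x])
adjusted = np.linalg.lstsq(design, y, rcond=None)[0][1]

print(f"naive={naive:.3f}  imbalance in x={imbalance:.3f}  adjusted={adjusted:.3f}")
```

Since $X$ enters the outcome with a coefficient of $1$, the bias of the naive estimate should be approximately equal to the imbalance in $X$ between the two groups.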



Notice how the treatment effect estimate is biased when we don't include our confounder. We can use matching to balance the treatment and control groups. We'll use `MatchIt`'s implementation of Generalized Full Matching. Of course, there are many possible matching specifications (I recommend [Greifer and Stuart, 2021](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9005055/)), but given the simplicity of our simulation, the exact specification should matter little. I simply chose one that can target the $\text{ATE}$ as an estimand. For example, the popular Nearest Neighbor Matching w/ Propensity Score (this is what people usually refer to when they say "Propensity Score Matching") can only target the $\text{ATT}$ or $\text{ATC}$.

In [45]:
%%R
# Generalized full matching on the propensity score (logistic regression), targeting the ATE
m_out <- matchit(t ~ x, data = df, method = "quick", distance = "glm", estimand='ATE')
m_out

A matchit object
 - method: Generalized full matching
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 2000 (original), 2000 (matched)
 - target estimand: ATE
 - covariates: x


In [46]:
%%R -o md
md <- match.data(m_out)

In [47]:
md.head()

|   | x | t | y | distance | weights | subclass |
|---|---|---|---|---|---|---|
| 0 | 0.222959 | 1 | 0.845909 | 0.564879 | 1.5165 | 183 |
| 1 | 0.917167 | 1 | 0.920097 | 0.728085 | 0.75825 | 240 |
| 2 | 0.644831 | 1 | 3.987496 | 0.668393 | 0.6066 | 192 |
| 3 | 0.114695 | 1 | 2.356726 | 0.536954 | 0.631875 | 1 |
| 4 | 0.576414 | 0 | 1.277647 | 0.652394 | 0.989 | 2 |


Now we can compute the treatment effect on the matched data:

In [53]:
%%R
# Weighted linear model without covariates (x is deliberately left out)
fit <- lm(y ~ t, data = md, weights = weights)

avg_comparisons(fit, variables = "t", vcov = ~subclass, wts = "weights")


 Term Contrast Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
    t    1 - 0     2.02     0.0587 34.4   <0.001 859.9  1.91   2.14

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 



We see that the treatment effect was successfully recovered in the matched dataset. Conventionally, we would include our covariate $X_i$ in the post-matching outcome regression to be doubly robust, but we omitted it here because it is not strictly necessary and, more importantly, because omitting it shows that matching itself removes the bias, rather than the adjustment being a by-product of regression.

Now, we can try computing $\Pr(T|Y,X)$ on the matched dataset. According to the paper, the coefficient estimate of $Y$ should be near $0$.



In [55]:
%%R
# Linear probability model for Pr(T | Y, X) on the matched data
fit2 <- lm(t ~ y + x, data = md, weights = weights)
fit2


Call:
lm(formula = t ~ y + x, data = md, weights = weights)

Coefficients:
(Intercept)            y            x  
     0.2415       0.2502      -0.2584  



Since matching was performed successfully and all confounders were controlled for, this is exactly the setting in which the proposed diagnostic should return a coefficient near $0$. However, we see that this is not the case. In fact, the coefficient estimate on $Y$ is larger than the one we get from the original (unmatched) dataset, as we see below.

In [57]:
%%R
lm(t~y+x, data=df)


Call:
lm(formula = t ~ y + x, data = df)

Coefficients:
(Intercept)            y            x  
     0.2711       0.2281      -0.1228  



In conclusion, we have shown that, as Noah mentioned on CrossValidated, the proposed *Backwards Causality* diagnostic does not appear to work.
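
The same conclusion can be reproduced without R. The sketch below is not the notebook's matching procedure: it swaps in inverse probability weighting with the *true* propensity score as an idealized stand-in for a well-balanced sample, and the helper `wls`, the seed, and the sample size are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
n = 200_000

x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))        # true propensity score (known here by construction)
t = rng.binomial(1, e)
y = 2 * t + x + rng.normal(size=n)

# ATE inverse probability weights: they balance x across the two groups.
w = t / e + (1 - t) / (1 - e)

def wls(design, target, weights):
    """Weighted least squares via rescaling rows by sqrt(weights)."""
    sw = np.sqrt(weights)
    return np.linalg.lstsq(design * sw[:, None], target * sw, rcond=None)[0]

ones = np.ones(n)

# Forward regression (y on t): should recover the treatment effect.
effect = wls(np.column_stack([ones, t]), y, w)[1]

# "Backwards" regression (t on y and x): the diagnostic expects ~0 on y.
backward = wls(np.column_stack([ones, y, x]), t, w)

print(f"weighted effect of t on y: {effect:.3f}")
print(f"backward coefficients (intercept, y, x): {np.round(backward, 3)}")
```

Even with balance achieved through the true propensity score, the coefficient on `y` in the backwards regression should stay well away from zero, in line with the matched-data result above.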

## 3. A/A Tests

Let's break down the explanation from the paper, sentence by sentence:
> *A/A testing*: Typically, the outcome of interest in observational studies is a member level metric which is measured over time. Therefore, there is usually a measure of this before the intervention occurred. Using the historical metrics as the primary outcome and applying the chosen analysis method to this outcome should yield a causal estimate of $0$.

The first sentence notes that the outcome in a causal analysis in a product setting is typically a member-level metric measured over time. For example, "return rate" and "subsequent engagement" can both be measured repeatedly, unlike a "happens once" event such as *mortality* in a more clinical setting.

Then, the authors remark that this means we can capture the metric before the intervention. For example, if the chosen "treatment" is "engaged with LinkedIn Feed" and the outcome is "Next Week Engagement Rate", then "Next Week Engagement Rate" can also be measured over a period before the treatment took place.

Finally, they claim that applying the chosen analysis method to this pre-intervention outcome, on the same treatment and control groups, should yield a causal estimate of $0$.

Conceptually this makes sense because when we compare Treatment vs Control post-"intervention", we're making the assumption that the only difference between the two groups is the intervention. This means that when you compare the outcome before the intervention, which should be computable, the difference between control and treatment should be near 0.

Practically speaking, I have a few questions. Since the outcome in the A/A test is effectively "engagement from the past", does this imply that we should no longer include historical engagement as one of the covariates we control for? In Section 4.1, LinkedIn's data science team does indeed use "*historical engagement*" as one of the covariates in their study. If they are matching on historical engagement, would that not cause problems when historical engagement is then compared between control and treatment as the diagnostic?

To make my question clearer, suppose that we control for only one covariate, "historical engagement". After matching, the control and treatment groups have similar historical engagement. The proposed diagnostic says that we should compare historical engagement between control and treatment. If we did this, the difference would naturally be near zero, because we had controlled for it!
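
This concern can be illustrated with a small simulation. The sketch below is not the paper's setup: the unobserved confounder `u`, the helper `stratified_effect`, and the whole data-generating process are hypothetical, and simple stratification on the pre-period outcome stands in for matching on historical engagement:

```python
import numpy as np

def stratified_effect(outcome, treated, covariate, n_bins=50):
    """Average of within-bin (treated - control) mean differences,
    with bins defined by quantiles of the covariate."""
    edges = np.quantile(covariate, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, covariate, side="right") - 1, 0, n_bins - 1)
    total, weight = 0.0, 0
    for b in range(n_bins):
        in_bin = bins == b
        t_b = in_bin & (treated == 1)
        c_b = in_bin & (treated == 0)
        if t_b.any() and c_b.any():  # skip bins lacking either group
            total += in_bin.sum() * (outcome[t_b].mean() - outcome[c_b].mean())
            weight += in_bin.sum()
    return total / weight

rng = np.random.default_rng(2)  # seed chosen only for reproducibility
n = 20_000

y_pre = rng.normal(size=n)      # historical engagement (observed)
u = rng.normal(size=n)          # unobserved confounder (hypothetical)
t = rng.binomial(1, 1 / (1 + np.exp(-(y_pre + u))))
y_post = 2 * t + y_pre + u + rng.normal(size=n)  # true effect is 2

# A/A test: historical engagement as the outcome, adjusting for itself.
aa_estimate = stratified_effect(y_pre, t, y_pre)

# Actual analysis: post-period outcome, adjusting only for y_pre.
post_estimate = stratified_effect(y_post, t, y_pre)

print(f"A/A estimate: {aa_estimate:.3f}")
print(f"post-period estimate: {post_estimate:.3f} (true effect is 2)")
```

In this hypothetical setup the A/A estimate should sit near zero by construction, while the post-period estimate remains biased by `u`, which suggests that a passing A/A test on a covariate that was itself adjusted for says little about unmeasured confounding.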

Of course, if there is any error in my reasoning, please let me know, as I am still learning and am nowhere near an expert in observational causal inference.

