$$
\newcommand{P}[1]{\mathrm{P}\left( #1 \right)}
\newcommand{Pc}[2]{\mathrm{P}\left( #1 \mid #2 \right)}
\newcommand{Ec}[2]{\mathrm{E}\left( #1 \mid #2 \right)}
\newcommand{In}[2]{ #1 \perp\!\!\!\perp #2}
\newcommand{Cin}[3]{ #1 \perp\!\!\!\perp #2 \, | \, #3}
\newcommand{do}[1]{\mathrm{do}\left( #1 \right)}
\newcommand{\ind}{\perp\!\!\!\!\perp}
$$

# Causal Inference

Day 3 Module 1

## Association and Causality

* $X$ causes $Y$ $\Rightarrow$ $X$ and $Y$ are associated

* $X$ causes $Y$ $\nLeftarrow$ $X$ and $Y$ are associated

![no_causation](./plots/causal1.jpg)

* Observational Data $\implies$ outcome comparison between treatment and control $\implies$ Association

* RCT $\implies$ outcome comparison between treatment and control $\implies$ Causation

Would Eating More Chocolate Be The Solution? 
![spurious_corr_demo](./plots/chocolate1.png)

##### [More on spurious correlation](https://www.tylervigen.com/spurious-correlations)

## Example: Simpson's Paradox

[Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) is an example of the counterintuitive properties of probability distributions.

As an example of the paradox, imagine we are trying to find out whether a certain treatment is effective for a disease. We decide to compare people who received the treatment with those who did not.

Let $A$ be a binary treatment variable (trt), with $A=1$ if the subject received treatment and $A=0$ if the subject was assigned to the control group.  
Let $Y$ be a binary outcome variable, with $Y=1$ if the subject is cured and $Y=0$ if not. We further let $X$ denote a third, confounding variable (e.g. sex), with $X=1$ for female and $X=0$ for male.

Unfortunately, this drug was not administered as part of a [randomized controlled trial](https://en.wikipedia.org/wiki/Randomized_controlled_trial). Instead, we observed the following data:

In [1]:
import numpy as np
import pandas as pd

# Load the aggregated counts: one row per (trt, y, sex) combination
df = pd.read_csv("eg.simpsons.csv")
print(df)

   count  trt  y  sex
0    131    0  1    0
1     56    0  0    0
2     19    0  1    1
3     44    0  0    1
4     50    1  1    0
5     12    1  0    0
6     75    1  1    1
7    113    1  0    1

To decide whether the treatment is effective, we compare the percentage of people who were cured in the treatment group with the percentage cured in the control group, i.e. the treatment effect can be defined as $P(Y=1 \mid A=1) - P(Y=1 \mid A=0)$.

In [25]:
def risk_difference(counts):
    """Cure rate among the treated minus cure rate among the controls.

    `counts` maps (trt, y) pairs to the number of patients.
    """
    n0 = int(counts[(0,0)] + counts[(0,1)])
    p0 = counts[(0,1)] / n0 if n0 != 0 else 0
    n1 = int(counts[(1,0)] + counts[(1,1)])
    p1 = counts[(1,1)] / n1 if n1 != 0 else 0
    return p1 - p0

# Treatment effect within each sex subgroup
for sex in sorted(df.sex.unique()):
    d = df[df.sex == sex].set_index(["trt", "y"])["count"].to_dict()
    dp = "{:.3f}".format(risk_difference(d))
    print("treatment effect for subgroup sex=", sex, " is ", dp)

# Treatment effect for the pooled cohort (counts summed over sex)
d = df.groupby(["trt", "y"])["count"].sum().to_dict()
dp = "{:.3f}".format(risk_difference(d))
print("treatment effect for the entire cohort is ", dp)

treatment effect for subgroup sex= 0  is  0.106
treatment effect for subgroup sex= 1  is  0.097
treatment effect for the entire cohort is  -0.100


When we look only at the male patients ($X=0$), the treatment shows a clear advantage, with a cure rate about 10 percentage points higher among the treated. The treatment likewise seems effective in the subgroup of female patients ($X=1$). However, for the entire cohort the treatment appears harmful, as the treatment group has a lower cure rate than the control group.

This is difficult to explain with association-type statements alone, yet it looks different from the perspective of causal inference.
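
One way to see where the reversal comes from is to look at how sex is distributed within each arm. The sketch below rebuilds the table shown above by hand (so it does not depend on `eg.simpsons.csv`) and checks that the pooled cure rate of each arm is the sex-weighted average of its subgroup cure rates. The variable names are ours and purely illustrative.

```python
import pandas as pd

# Counts copied from the table printed above.
df = pd.DataFrame({
    "count": [131, 56, 19, 44, 50, 12, 75, 113],
    "trt":   [0, 0, 0, 0, 1, 1, 1, 1],
    "y":     [1, 0, 1, 0, 1, 0, 1, 0],
    "sex":   [0, 0, 1, 1, 0, 0, 1, 1],
})

# Number of patients in each (trt, sex) cell, and the share of each sex per arm.
n = df.groupby(["trt", "sex"])["count"].sum().unstack("sex")
sex_share = n.div(n.sum(axis=1), axis=0)

# Cure rate P(Y=1 | A, X) in each (trt, sex) cell.
cured = df[df.y == 1].groupby(["trt", "sex"])["count"].sum().unstack("sex")
cure_rate = cured / n

# Pooled cure rate of an arm = sex-weighted average of its subgroup cure rates.
pooled = (cure_rate * sex_share).sum(axis=1)

print(sex_share.round(3))
print(cure_rate.round(3))
print(pooled.round(3))
```

In this data roughly three quarters of the treated patients are female, while roughly three quarters of the controls are male, and the cure rate among females is lower in both arms. The pooled comparison therefore mixes the effect of the treatment with the effect of sex.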

## Example: WWII Bombers

![survivor bias](./plots/fighter.jpg)

As [Abraham Wald](https://en.wikipedia.org/wiki/Abraham_Wald) soon pointed out, the original study only included the aircraft that had survived their missions.

# Definitions of Causality 

["Causality"](https://plato.stanford.edu/entries/causation-metaphysics/) is a vague, philosophical-sounding word.

#### What Does Causality Mean?

Change the value of $X$ $\implies$ Change the distribution of $Y$ (Change the parameter or the distribution structure) 

## Causal Models

All causal models aim to relate two types of situation:

* <font color='blue'>An observational world</font>
    * a 'natural' process assigns 'treatment'
    * Example: each patient chooses their own treatment
    
* <font color='red'>An experimental world</font>
    * 'Treatment' assigned via an 'external' process
    * Example: each patient is given the same treatment
    
* In observational world
    * <font color='red'>patient A --- takes drug X  $\rightarrow$ cured </font>
    * <font color='green'>patient B --- takes placebo  $\rightarrow$ not cured </font> <br/>
  __drug X is effective?__
    
    
* In experimental world
    * <font color='red'>patient A --- takes drug X  $\rightarrow$ cured </font>
    * <font color='green'>patient A --- takes placebo  $\rightarrow$ not cured </font> <br/>
  __drug X is effective! (at least for patient A)__
  
    
In terms of notation, let $X$ and $Y$ be two random variables. The "causal effect" of $X$ on $Y$ describes how the distribution of $Y$ changes when we force $X$ to change from one value (say, control) to another (say, treatment). This act of forcing a variable to take a certain value is called an "intervention".

In the observational world, when we make no intervention on the system, we have an observational distribution of $Y$, conditioned on the fact we observe $X$: $\color{blue}{\Pc Y X}$

In the experimental world, we could make an intervention. The distribution of $Y$ is then given by the "interventional" distribution: $\color{red}{\Pc Y {\hbox{do}(X)}}$

In general these two are not the same.


### Usage of Causal Models

* Given observational data make predictions about what would be observed in an experimental setting

* Given experimental data predict what happens in an observational context

* Combine experimental and observational data to predict the result of some experiment that was not performed

## Frameworks That Relate the Observational and Experimental Worlds


* ### Structural Equation Models

* ### Graphical Causal Models

* ### Counterfactual Outcomes

* ### ...

### Structural Equation Approach

* Econometrics: [Haavelmo (1943)](http://static.stevereads.com/papers_to_read/the_statistical_implications_of_a_system_of_simultaneous_equations.pdf), [Strotz & Wold (1960)](https://www.jstor.org/stable/pdf/1907731.pdf?casa_token=8D9vMMtDZH4AAAAA:zXsTdnxco8YKzg-pNLj8X5gf7NdR_mBTS8Wdv5wUxMpxSYjCtULPH5PyV-35w5WklN7ifHyAGfweyXiBbKylsExgI4aj9C1CeJze6C3oyFKzRZJqQLs) 

* System of equations describing the observational world

* Specifies a data-generating process – with autonomous ‘mechanisms’ – for the observational distribution

Simple example with covariate $L$, treatment $A$ and outcome $Y$:

$L = f_L(\epsilon_L)$

$A = f_A(L, \epsilon_A)$

$Y = f_Y(L, A, \epsilon_Y)$

Individual outcomes under an intervention are derived by removing the equation of the intervened variable and replacing it with the fixed value. For example, by fixing $A$ to 0, we obtain

$L = f_L(\epsilon_L)$

$A = 0$

$Y = f_Y(L, 0, \epsilon_Y)$
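
To make the difference between the two systems concrete, the following sketch simulates both. The particular functions $f_L$, $f_A$, $f_Y$ and all coefficients are hypothetical choices made only for illustration; the notes do not specify them. The sketch compares distributions, so fresh noise is drawn on each call.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_a=None):
    """Draw n samples from the structural equations; do_a fixes A if given."""
    eps_l = rng.random(n)
    eps_a = rng.random(n)
    eps_y = rng.random(n)
    L = (eps_l < 0.5).astype(int)                      # L = f_L(eps_L)
    if do_a is None:
        A = (eps_a < 0.2 + 0.6 * L).astype(int)        # A = f_A(L, eps_A)
    else:
        A = np.full(n, do_a)                           # equation for A replaced by A = do_a
    Y = (eps_y < 0.3 + 0.1 * A + 0.4 * L).astype(int)  # Y = f_Y(L, A, eps_Y)
    return L, A, Y

# Observational world: compare those who happened to receive A=1 with A=0.
L, A, Y = simulate()
observational = Y[A == 1].mean() - Y[A == 0].mean()

# Experimental world: fix A for everyone and compare the two interventions.
_, _, Y1 = simulate(do_a=1)
_, _, Y0 = simulate(do_a=0)
interventional = Y1.mean() - Y0.mean()

print(f"P(Y=1 | A=1) - P(Y=1 | A=0)         = {observational:.3f}")
print(f"P(Y=1 | do(A=1)) - P(Y=1 | do(A=0)) = {interventional:.3f}")
```

With these assumed equations the interventional difference is 0.1 by construction, whereas the observational difference is about 0.34, because $L$ raises both the chance of treatment and the chance of a good outcome.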

### Graphical Causal Models

* A graphical model is a probabilistic model whose conditional (in)dependence structure between random variables is encoded by a graph. 

* A graph that is both directed (i.e. all edges are directed) and acyclic (i.e. contains no directed cycles) is called a Directed Acyclic Graph (DAG).

We will mainly focus on DAGs, as these allow particularly simple causal interpretations. Such models are also known as Bayesian networks.
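
As a small illustration of the "directed and acyclic" requirement, a graph can be stored as a mapping from each node to its parents and checked for directed cycles. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The helper `is_dag` and the two example graphs are hypothetical and not part of the course material.

```python
from graphlib import TopologicalSorter, CycleError

def is_dag(parents):
    """Return True if the graph {node: set of parents} has no directed cycle."""
    try:
        # static_order() yields a topological order and raises CycleError on a cycle.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Hypothetical example: L -> A, L -> Y, A -> Y.
confounding = {"L": set(), "A": {"L"}, "Y": {"L", "A"}}
# Adding Y -> L creates the directed cycle L -> A -> Y -> L.
with_cycle = {"L": {"Y"}, "A": {"L"}, "Y": {"L", "A"}}

print(is_dag(confounding))
print(is_dag(with_cycle))
print(list(TopologicalSorter(confounding).static_order()))
```

A topological order lists every variable after all of its parents, which is the order in which one would generate the variables when simulating from the graph.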

<img src="./plots/ebmed2019June243109F1.jpg" alt="Drawing" style="width: 600px;"/>

Example: Causal DAG depicting a hypothetical flow of causes and effects between variables. [Source: BMJ Evid Based Med. 2019 Jun;24(3):109-112](https://ebm.bmj.com/content/24/3/109)

* Postulates a joint distribution over outcomes in experimental and observational settings

* Typically, experimental outcomes are ‘primary’, of which observational outcomes are deterministic functions

* Allows precise characterization of identification assumptions as conditional independence

### Potential Outcome Framework

* Introduce new random variables to the system: $Y^{a}$, known as the Potential Outcomes (the outcome if the patient takes treatment $a$).

* These variables can never all be observed for the same patient, but can be treated like any other random variable.

* Connect to the observed variable $Y$: 

 - $Y = Y^{a}$ when $A=a$
 
<img src="./plots/twoWORLD.png" alt="Drawing" style="width: 400px;"/>
 
* Rich language allowing many quantities of interest to be formulated, e.g. [Effect of Treatment among the treated (ETT)](https://www.nber.org/papers/t0107), [NDE](https://www.ncbi.nlm.nih.gov/pubmed/1576220)

* "Reduces" Causation to [Missing Data](https://en.wikipedia.org/wiki/Missing_data); all outcomes ‘observable’ a priori

* Allows precise characterization of identification assumptions as conditional independence

#### History of Counterfactual Outcome Framework

* Rubin Causal Model (RCM; Rubin 1978) is an approach to the statistical analysis of cause and effect based on the framework of potential outcomes (one of the most popular approaches in causal inference)
* J. Neyman first used the term 'potential outcome' in his Master's thesis (1923, in Polish)
  - Neyman, Jerzy. Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes. Master's Thesis (1923)
* In 1990, D. M. Dabrowska and T. P. Speed translated it, and the translation was printed in Statistical Science (Neyman-Rubin Model)
  - Neyman (1990 [from 1923]) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5:465-480.
* D. Rubin (independently) brought this concept into a general framework for thinking about causation in both observational and experimental studies
  - Rubin, D. (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educational Psychology 66:688-701
* According to J. Heckman, the potential outcomes model was first proposed in economics (Roy Model)
  - Roy, A. (1951): Some Thoughts on the Distribution of Earnings. Oxford Economic Papers 3(2), pp. 135-146.
* D. Rubin had a short comment on Heckman’s citation in his Fisher Lecture “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions” (Rubin 2005 JASA)
  - “In the economics literature, the use of the potential outcomes notation to define causal effects has recently (e.g., Heckman 1996) been attributed to Roy (1951) or Quandt (1958), which is puzzling because neither of these articles addresses causal inference, and the former has no mathematical notation at all. For seeds of potential outcomes in economics, the earlier references cited at the start of this paragraph are much more relevant; see the rejoinder by Angrist, Imbens, and Rubin (1996) for more on this topic.”

#### Define Counterfactuals 

* Study sample consists of patients (or subjects) indexed by $i = 1, 2, \cdots, n$.

* Define the potential outcome $Y_i^{(a_1,k_1, a_2,k_2, \cdots, a_n, k_n)}$  to be the outcome that would be observed for the $i$th subject if each subject’s treatment $A_i$ and method of administration $k_i$ were set to $a_i$ and $k_i$, respectively, for $i = 1, 2, \cdots, n$.

* How could this be practically useful -- possible simplification:
  
  Stable Unit Treatment Value Assumption (__SUTVA__): 
  
  $Y_i^{(a_1,k_1,a_2,k_2,\cdots,a_n,k_n)}=Y_i^{a_i}$.

    * No interference (units do not interfere with each other): treatment applied to one unit does not affect the outcome for another unit
    
    * There is only a single version of each treatment level (potential outcomes must be well defined)

#### What does a counterfactual model look like?

Binary treatment: $A=1$ if treated / exposed, $A=0$ if not treated / unexposed.

Let $Y^1$ denote the outcome if the patient takes the treatment ($A=1$) and $Y^0$ the outcome if the patient takes no treatment ($A=0$).

$Y$ denotes the outcome that is actually observed.

For simple comparisons of outcomes with vs. without a single treatment, we let $A_i = 1$ if the treatment is given to patient $i$ ($i=1,\cdots,n$) and $A_i = 0$ otherwise, and say the causal effect for the $i$th patient is $Y_i^1 - Y_i^0$.

* The average causal effect in the study population (ATE) is: $\text{ATE} = E[Y_i^1 - Y_i^0]$

* The average causal effect in the treated (ATT): $\text{ATT} = E[Y_i^1 - Y_i^0\,\vert\, A_i=1]$

* The average causal effect in the untreated (ATU, or ATC as 'in controls'): $\text{ATU} = E[Y_i^1 - Y_i^0\,\vert\, A_i=0]$
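
The three estimands can differ when treatment effects vary across patients and treatment is not assigned at random. The sketch below uses a simulation in which, unlike in real data, both potential outcomes are known for every patient. The data-generating process (the covariate `X`, the effect sizes and the treatment probabilities) is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical process: X raises both the chance of treatment
# and the benefit from treatment.
X = rng.integers(0, 2, size=n)
Y0 = rng.normal(loc=5.0, scale=1.0, size=n)
Y1 = Y0 + 1.0 + 2.0 * X                # individual effect is 1 if X=0, 3 if X=1
A = rng.binomial(1, 0.2 + 0.6 * X)     # P(A=1 | X) is 0.2 or 0.8

effect = Y1 - Y0
ate = effect.mean()                    # average over everyone
att = effect[A == 1].mean()            # average over the treated
atu = effect[A == 0].mean()            # average over the untreated

print(f"ATE = {ate:.2f}, ATT = {att:.2f}, ATU = {atu:.2f}")
```

Under these assumptions the population values are ATE = 2.0, ATT = 2.6 and ATU = 1.4: patients who benefit more are also more likely to be treated, so the treated group has the larger average effect.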

| index $i$ | $A_i$ | $Y_i$ | $Y_i^0$ | $Y_i^1$ |
| --- | --- | --- | --- | --- | 
| 1 | 0 | 4 | 4 | ? |
| 2 | 0 | 7 | 7 | ? |
| 3 | 0 | 2 | 2 | ? |
| 4 | 0 | 8 | 8 | ? |
| 5 | 1 | 3 | ? | 3 |
| 6 | 1 | 5 | ? | 5 |
| 7 | 1 | 8 | ? | 8 |
| 8 | 1 | 9 | ? | 9 |
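
The table can be reproduced in a few lines, which makes the missing-data view explicit. This is only a sketch: the column names `Y0` and `Y1` are ours, and `NaN` plays the role of the "?" entries.

```python
import pandas as pd

# Observed data from the table above: treatment A and observed outcome Y.
obs = pd.DataFrame(
    {"A": [0, 0, 0, 0, 1, 1, 1, 1], "Y": [4, 7, 2, 8, 3, 5, 8, 9]},
    index=pd.RangeIndex(1, 9, name="i"),
)

# Y = Y^a when A = a: fill in the potential outcome that was actually observed
# and leave the other one missing.
obs["Y0"] = obs["Y"].where(obs["A"] == 0)
obs["Y1"] = obs["Y"].where(obs["A"] == 1)

# No individual causal effect Y1 - Y0 can be computed from the observed data.
individual_effect = obs["Y1"] - obs["Y0"]

# Difference in observed means between treated and untreated patients.
naive_diff = obs["Y1"].mean() - obs["Y0"].mean()

print(obs)
print("individual effects available:", individual_effect.notna().sum())
print("difference in observed means:", naive_diff)
```

The difference in observed means can always be computed, but whether it equals the ATE depends on assumptions about how treatment was assigned.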

$\def\ci{\perp\!\!\!\perp}$
__Assumptions__ are needed to estimate the missing counterfactual outcomes:

*  Consistency: $Y_i^a = Y_i$ if $A_i=a$.

*  Positivity: $0 < \Pc {A_i=1} {X_{1,i},X_{2,i},\cdots,X_{p,i}} < 1$, for all values of $X_{1,i},X_{2,i}, \cdots, X_{p,i}$ that occur in the study population.

*  Conditional Exchangeability (No Unmeasured Confounding Assumption, NUCA): $\{Y_i^0, Y_i^1\} \ci A_i \vert \{X_{1,i},X_{2,i}, \cdots, X_{p,i}\}$

<img src="./plots/CIassumptions.png" alt="Drawing" style="width: 400px;"/>
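
When the three assumptions hold for a discrete covariate $X$, the mean potential outcome can be written in terms of observed quantities as $E[Y^a] = \sum_x E[Y \mid A=a, X=x]\, P(X=x)$, a calculation often called standardization. The sketch below applies it to the Simpson's paradox counts from earlier in this module, assuming purely for illustration that sex is the only confounder; the source data do not establish that.

```python
import pandas as pd

# Counts copied from the Simpson's paradox table earlier in this module.
df = pd.DataFrame({
    "count": [131, 56, 19, 44, 50, 12, 75, 113],
    "trt":   [0, 0, 0, 0, 1, 1, 1, 1],
    "y":     [1, 0, 1, 0, 1, 0, 1, 0],
    "sex":   [0, 0, 1, 1, 0, 0, 1, 1],
})

# P(Y=1 | A=a, X=x) for every (trt, sex) cell.
n = df.groupby(["trt", "sex"])["count"].sum().unstack("sex")
cured = df[df.y == 1].groupby(["trt", "sex"])["count"].sum().unstack("sex")
cure_rate = cured / n

# Positivity check: every sex stratum contains treated and untreated patients.
assert (n > 0).all().all()

# P(X=x) in the whole cohort.
p_sex = df.groupby("sex")["count"].sum() / df["count"].sum()

# Standardized means: E[Y^a] = sum_x P(Y=1 | A=a, X=x) * P(X=x).
ey = (cure_rate * p_sex).sum(axis=1)
ate = ey.loc[1] - ey.loc[0]

print(f"E[Y^0] = {ey.loc[0]:.3f}, E[Y^1] = {ey.loc[1]:.3f}, ATE = {ate:.3f}")
```

Under this assumption the standardized difference is positive, in line with both subgroup effects and unlike the pooled comparison, because both arms are re-weighted to the same sex distribution.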