## Week 1: Introduction to Causal Effects

__Spurious correlation:__ Causally unrelated variables might happen to be highly correlated with each other over some period of time.

__Anecdotes:__ People form beliefs about causal effects from experiences in their own lives.  
__Reverse Causality:__ Even if there is a causal relationship, sometimes the direction is unclear.

Causality framework:
- Formal definitions
- Assumptions necessary to identify causal effects from data
- Rules about what variables need to be controlled for
- Sensitivity analysis to determine the impact of violations of assumptions on conclusions

Causal inference from observational studies and natural experiments
- In observational studies, the treatment or exposure is observed as it occurs in the real world, without any direct manipulation or assignment by the investigator.
- Causal inference requires making some untestable assumptions.


### Terminology: Potential Outcomes vs Counterfactual

Potential outcomes  
- Potential outcomes are the outcomes we would see under each possible treatment option
- $Y^a$ is the outcome that would be observed if treatment were set to $A=a$. With a binary treatment, each person has two potential outcomes, $Y^0$ and $Y^1$.

Counterfactual  
- Counterfactual outcomes are ones that would have been observed, had the treatment been different.

Key difference:   
- Before the treatment decision is made, any outcome is a potential outcome: $Y^0$ and $Y^1$
- After the study, there is an observed outcome, $Y = Y^A$, and a counterfactual outcome, $Y^{1-A}$ (assuming A is a binary variable).

Counterfactual outcomes $Y^0$, $Y^1$ are typically assumed to be the same as potential outcomes $Y^0$, $Y^1$.
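
As a minimal sketch of how observed and counterfactual outcomes relate to potential outcomes for a binary treatment, the snippet below uses a made-up table in which both potential outcomes are listed for each person. The data and the function name `observed_and_counterfactual` are illustrative assumptions; in real data only one of the two potential outcomes is ever seen.

```python
# Hypothetical toy data: each tuple is (Y0, Y1, A) for one person.
# Both potential outcomes are listed only to illustrate the definitions.
people = [(0, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]


def observed_and_counterfactual(y0, y1, a):
    """Return (observed Y = Y^A, counterfactual Y^(1-A)) for binary A."""
    potential = {0: y0, 1: y1}
    return potential[a], potential[1 - a]


results = [observed_and_counterfactual(*person) for person in people]

for (y0, y1, a), (observed, counterfactual) in zip(people, results):
    print(f"A={a}: observed Y={observed}, counterfactual Y^{1 - a}={counterfactual}")
```

The first person received A=1, so their observed outcome is $Y^1$ and their counterfactual outcome is $Y^0$.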

### Terminology: Interventions

Interventions or actions
- Causal effects of variables that can be manipulated
- Common assumption that there are no hidden versions of treatment. The treatment itself is always assumed to be consistent.
- Some variables are not directly mutable: Age, Race
- Non-manipulable vs manipulable:
 - Race vs Name on resume
 - Obesity vs Bariatric surgery
 - Socioeconomic status vs Gift of money
- Focus on causal effects of hypothetical interventions since:
 - Their meaning is well defined
 - Potentially actionable


### Fundamental Problem of Causal Inference

The fundamental problem of causal inference is that we can only observe one potential outcome for each person. However, with certain assumptions, we can estimate population level (average) causal effects. 
- How can we use observed data to link observed outcomes to potential outcomes?
- What assumptions are necessary to estimate causal effects from observed data?


### Hypothetical Worlds - Average Causal Effects

<img src="./img/hypo_worlds_ACE.png" >

Average Causal Effect = E($Y^1 - Y^0$)

This is the average value of Y if everyone in the population was treated with A = 1 minus the average value of Y if everyone was treated with A = 0. 
- If Y is binary, this is a risk difference.

### “Conditioning on” versus “setting” treatment

In general, $E(Y^1 - Y^0) \ne E(Y|A=1) - E(Y | A= 0)$
- LHS is a causal relationship defined by potential outcomes
- RHS is a statistical associational relationship defined by observed data on subpopulations
- Further causal assumptions are needed before the causal quantity on the LHS can be equated with the statistical quantity on the RHS.

E(Y|A=1) reads as “expected value of Y given A=1”. This restricts to the __subpopulation__ of people who actually had A = 1. This __subpopulation might differ from the whole population__ in important ways; they might not be representative of the overall population. 
- For example, people at higher risk for flu might be more likely to choose to get a flu shot. Flu risk is then a confounding variable: it affects both treatment assignment and the outcome.

This is illustrated by the following diagram:

<img src="./img/hypo_worlds_conditioning.png" >

To summarise:
- E(Y|A=1): mean of Y among people with A = 1
- E($Y^1$): mean of Y if the whole population was treated with A = 1

__$E(Y|A=1) - E(Y|A=0)$ is generally not a causal effect__ because it is comparing two different (sub)populations of people.

__$E(Y^1 - Y^0)$ is a causal effect__ because it is comparing what would happen if the same people were treated with A = 1 versus if the same people were treated with A = 0.
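
The gap between the two quantities can be sketched on a small made-up population where, unlike in real data, both potential outcomes are known for everyone. The population, the meaning of `x` (1 = high flu risk), and the function names are assumptions chosen only for illustration.

```python
# Hypothetical population of 8 people: (x, a, y0, y1).
# x = 1 marks high flu risk; a is the treatment actually received;
# y0 and y1 are both potential outcomes (never jointly observed in practice).
population = [
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 1, 1), (1, 0, 1, 0),
    (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0),
]


def avg(values):
    values = list(values)
    return sum(values) / len(values)


def average_causal_effect(rows):
    """E(Y^1 - Y^0): everyone treated versus everyone untreated."""
    return avg(y1 - y0 for _, _, y0, y1 in rows)


def associational_difference(rows):
    """E(Y | A=1) - E(Y | A=0), using only the outcome each person actually had."""
    treated = [y1 for _, a, _, y1 in rows if a == 1]
    untreated = [y0 for _, a, y0, _ in rows if a == 0]
    return avg(treated) - avg(untreated)


ace = average_causal_effect(population)
naive = associational_difference(population)
print(f"causal: {ace}, associational: {naive}")
```

In this made-up population the treated group is mostly high-risk, so the associational difference is 0 even though treating everyone would lower the average outcome by 0.375.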


### Other Causal Effects:

1. $E(Y^1)/E(Y^0)$: __Causal relative risk__

2. $E(Y^1 - Y^0 | A=1)$: __Causal effect of treatment on the treated (subpopulation)__
 - Might be interested in how well treatment works among treated people.
 - __Some subpopulations may not be interested in the treatment (e.g. patients who are not interested in surgery)__. The question then becomes: among people who do want the surgery, how well does the treatment work?


<img src="./img/hypo_worlds_treated.png" >

3. $E(Y^1- Y^0|V=v)$: __Average causal effect in the subpopulation with covariate V=v.__
 - Also known as heterogeneous treatment effects: the effect may differ across subpopulations defined by V.
 - We are still isolating a treatment effect, but only within a particular subpopulation.
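
A minimal sketch of these three estimands on a made-up population where both potential outcomes are known. The data, the covariate `v`, and the variable names are hypothetical; the causal relative risk is computed here as the ratio of the two average potential outcomes.

```python
# Hypothetical population: (v, a, y0, y1), with both potential outcomes known.
rows = [
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 1, 1), (1, 0, 1, 0),
    (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0),
]


def avg(values):
    values = list(values)
    return sum(values) / len(values)


# 1. Causal relative risk: E(Y^1) / E(Y^0)
causal_rr = avg(y1 for _, _, _, y1 in rows) / avg(y0 for _, _, y0, _ in rows)

# 2. Effect of treatment on the treated: E(Y^1 - Y^0 | A = 1)
att = avg(y1 - y0 for _, a, y0, y1 in rows if a == 1)


# 3. Average causal effect in the subpopulation with V = v
def effect_given_v(rows, v):
    return avg(y1 - y0 for x, _, y0, y1 in rows if x == v)


print(causal_rr, att, effect_given_v(rows, 1), effect_given_v(rows, 0))
```

Here the effect in the stratum with `v = 1` (-0.5) is larger in magnitude than in the stratum with `v = 0` (-0.25), which is what heterogeneity of the treatment effect looks like.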


### Identifiability

Identifiability of causal effects requires making some __untestable assumptions called causal assumptions__: 
1. Stable Unit Treatment Value Assumption (SUTVA)
2. Consistency
3. Ignorability
4. Positivity

These assumptions will be about the observed data: Y, A, and a set of pre-treatment covariates X.


#### __SUTVA__

SUTVA involves two assumptions:
- No interference: Units do not interfere with each other. 
 - Treatment assignment of one unit does not affect the OUTCOME of another unit. 
 - Spillover or contagion are also terms for interference
- One version of treatment that is consistent

SUTVA allows us to write the potential outcome for the _ith_ person in terms of only that person’s treatment.


#### __Consistency__
Consistency assumption: The potential outcome under treatment A=a, $Y^a$, is equal to the observed outcome if the actual treatment received is A=a. 

This assumption is simply linking/relating potential outcomes with observed outcomes.

#### __Ignorability__

This is also known as the no unmeasured confounders assumption, and is probably the most critical assumption of all.

Given pre-treatment covariates X, the treatment assignment is independent from the potential outcomes. This can also be phrased as __the conditional independence of treatment assignment from potential outcomes.__

$Y^0, Y^1 \perp A | X$

Implications: Among people with the same values of X, we can think of treatment as being randomly assigned. Treatment assignment becomes ignorable once we condition on the right covariates.



Toy example:
- X is a single variable (age) that can take values “younger” or “older”
- Older people are more likely to get treatment A = 1
- Older people are also more likely to have the outcome (hip fracture) regardless of treatment
- Thus age is related to the risk of outcome, as well as related to treatment assignment.

$Y^0$ and $Y^1$ are not independent from A (marginally).  
However, within levels of X, treatment might be __randomly assigned.__
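
The toy example can be sketched with made-up counts in which $Y^0$ is known for everyone (which never happens in practice). The counts and the helper name `risk_y0` are assumptions for illustration: treated and untreated people differ in their $Y^0$ risk overall, but not within an age group.

```python
from fractions import Fraction

# Hypothetical counts: (age group, treatment a, potential outcome y0, number of people)
counts = [
    ("older", 1, 1, 6), ("older", 1, 0, 3),
    ("older", 0, 1, 2), ("older", 0, 0, 1),
    ("younger", 1, 1, 1), ("younger", 1, 0, 2),
    ("younger", 0, 1, 3), ("younger", 0, 0, 6),
]


def risk_y0(a, age=None):
    """P(Y^0 = 1 | A = a), optionally also conditioning on age group."""
    rows = [(y0, n) for g, t, y0, n in counts
            if t == a and (age is None or g == age)]
    total = sum(n for _, n in rows)
    events = sum(n for y0, n in rows if y0 == 1)
    return Fraction(events, total)


# Marginally, Y^0 depends on treatment group (older people are treated more often).
print(risk_y0(1), risk_y0(0))                        # 7/12 5/12
# Within each age group, Y^0 has the same distribution in both treatment groups.
print(risk_y0(1, "older"), risk_y0(0, "older"))      # 2/3 2/3
print(risk_y0(1, "younger"), risk_y0(0, "younger"))  # 1/3 1/3
```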


#### __Positivity__

For every set of values for X covariates, treatment assignment was not deterministic. 

$P(A=a|X=x) > 0$ for all $a$ and $x$

If this is violated, then for some value of X everybody receives the same treatment (all treated or all untreated). There is then no way to learn the causal treatment effect for that value of X.

If for some values of X treatment was deterministic, for example

- $P(A=1|X=x) = 1$
- $P(A=0|X=x) = 0$

then we would have no observed values of Y for one of the treatment groups for those values of X.

The positivity assumption states that within every level of X, there are both people who are treated and people who are not treated. Variability in treatment assignment is important for identification.
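
A rough empirical check is to look for covariate strata in which some treatment level never appears. The sketch below uses made-up records and a hypothetical helper `positivity_violations`. It only flags strata with no observed data for a treatment level; positivity itself is a statement about the underlying probabilities, so a sparse sample can show an empty cell even when the assumption holds.

```python
from collections import defaultdict


def positivity_violations(records, treatments=(0, 1)):
    """Map each stratum x to the treatment levels never observed in it.

    records is an iterable of (x, a) pairs. Strata where every treatment
    level appears at least once are left out of the result.
    """
    seen = defaultdict(set)
    for x, a in records:
        seen[x].add(a)
    missing = {}
    for x, arms in seen.items():
        absent = sorted(set(treatments) - arms)
        if absent:
            missing[x] = absent
    return missing


# Hypothetical data: every "older" person was treated.
records = [("younger", 0), ("younger", 1), ("older", 1), ("older", 1)]
violations = positivity_violations(records)
print(violations)  # {'older': [0]}
```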



__Observed Data and Potential Outcomes__

The following steps allow us to use the assumptions to identify a causal relationship using observed data:  

$E(Y | A=a,X=x)$ involves only observed data  
= $E(Y^a | A=a, X=x)$ by consistency (linking potential outcomes with observed outcomes for the given treatment)  
= $E(Y^a | X=x)$ by ignorability (drop the conditioning on treatment since it is supposed to be independent of potential outcomes)

Thus, with those causal assumptions,  
$E(Y|A=a, X=x) = E(Y^a | X=x)$


__Conditioning and Marginalizing = Standardisation__

Marginal mean potential outcome: $E(Y^a) = E_X[E(Y^a | X)] = E_X[E(Y | A=a, X)]$, obtained by averaging over the distribution of X
- For discrete X, we perform a summation: $\sum_x E(Y|A=a, X=x) P(X=x)$
- For continuous X, we perform an integration: $\int E(Y|A=a, X=x) f_X(x)\, dx$, where $f_X$ is the density of X

The expected value of the potential outcome is just the expected value of the observed outcome in these subpopulations, averaged over the distribution of the covariates.

This is known as __standardisation__, which involves conditioning/stratifying first and then marginalizing/averaging over the covariates. This gives us the __standardized mean__, which, under the causal assumptions above, equals the __average potential outcome__ (either $E(Y^1)$ or $E(Y^0)$).  


__Standardisation__ involves stratifying and then averaging.  
- Obtain a treatment effect within each stratum and then pool across strata, weighting by the probability (size) of each stratum
- From data, you could estimate a treatment effect by computing means under each treatment within each stratum, and then pooling across strata.
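
The stratify-then-average recipe can be sketched as follows for a single binary confounder. The counts are made up for illustration and are not the numbers from the example in these notes; the table layout and the function names `crude_risk` and `standardized_risk` are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical stratified counts:
# (x, a) -> (number of people with the outcome, number of people)
table = {
    (0, 1): (5, 100), (0, 0): (15, 300),
    (1, 1): (45, 300), (1, 0): (15, 100),
}


def crude_risk(table, a):
    """E(Y | A=a), ignoring the confounder."""
    events = sum(e for (_, t), (e, _) in table.items() if t == a)
    total = sum(n for (_, t), (_, n) in table.items() if t == a)
    return Fraction(events, total)


def standardized_risk(table, a):
    """sum_x E(Y | A=a, X=x) * P(X=x)."""
    population = sum(n for _, n in table.values())
    result = Fraction(0)
    for x in {x for x, _ in table}:
        stratum_size = sum(n for (s, _), (_, n) in table.items() if s == x)
        events, n = table[(x, a)]
        result += Fraction(events, n) * Fraction(stratum_size, population)
    return result


print(crude_risk(table, 1) - crude_risk(table, 0))                # 1/20
print(standardized_risk(table, 1) - standardized_risk(table, 0))  # 0
```

In these made-up counts the treated group is concentrated in the higher-risk stratum, so the crude comparison shows a risk difference of 1/20, while the standardized risks are both 1/10.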

The following example illustrates standardisation:
- Treatment is “Saxa”
- Outcome is “MACE” as a risk
- Confounding variable to be stratified is “Prior OAD use”


Without stratification, we have the naive approach that does not account for confounding:

<img src="./img/standardisation_1.png" >

Without stratification, the Saxa treatment group was observed to have a higher risk of MACE. This is due to confounding in treatment assignment: patients who were worse off tended to receive the treatment.

With __standardisation__, we obtain the following tables using __stratification__ based on the confounding variable first:

<img src="./img/standardisation_2.png" >
<img src="./img/standardisation_3.png" >

Subsequently, we perform marginalization over the confounding variable:

<img src="./img/standardisation_4.png" >
<img src="./img/standardisation_5.png" >

Comparing the standardised risk of MACE under SAXA and under SITA (non-SAXA), we see no difference between the two treatments.

Problems with Standardization:
- Typically there will be many X variables needed to achieve ignorability
- Stratification would lead to many empty cells (no data for some combinations of covariate values and treatment)
- Need alternatives to standardization 
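
A quick back-of-the-envelope sketch of the empty-cell problem, assuming for simplicity that every covariate is binary. The sample size and the helper name `cell_count` are hypothetical.

```python
def cell_count(num_binary_covariates, num_treatments=2):
    """Number of (covariate pattern, treatment) cells in a full stratification."""
    return num_treatments * 2 ** num_binary_covariates


n = 10_000  # hypothetical sample size
for k in (2, 10, 20):
    cells = cell_count(k)
    print(f"{k} covariates: {cells} cells, {n / cells:.3f} people per cell on average")
```

With 20 binary covariates there are far more cells than people, so most cells must be empty no matter how the sample is distributed.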
