## Observational Studies

We need to understand the difference between observational studies and randomized trials, and how matching can bridge the gap.

Consider the following DAG:

<img src="./img/dags_confounding.png" >

In this case, X is sufficient to control for confounding.
- Ignorability assumption holds:

$Y^0, Y^1 \perp A |X$

In a randomized trial, treatment assignment A would be determined by a coin toss.
- This effectively erases the arrow from X to A.

<img src="./img/dags_rct_vs_observational.png" >

In a randomized trial, the distribution of X will be the same in both treatment groups.

<img src="./img/rct_population.png" >

In summary:
- The distribution of pretreatment variables X that affect Y is the same in both treatment groups.
 - __Covariate balance is ensured__
- Thus, if the outcome distribution ends up differing, it will not be because of differences in X.
- X is dealt with at the design phase.

<u> Issues with Randomization:</u>
- Randomized trials are expensive
- Sometimes randomizing treatment/exposure is unethical
- Some (many) people will refuse to participate in trials
- Randomized trials take time (you have to wait for outcome data).
 - In some cases, by the time you have outcome data, the question might no longer be relevant.

<u>Observational Studies</u>

Planned, prospective, observational studies with active data collection:
- __Like trials:__ data collected on a common set of variables at planned times; outcomes are carefully measured; study protocols.
- __Unlike trials:__ regulations much weaker, since not intervening; broader population eligible for the study.

Databases, retrospective, passive data collection:
- large sample sizes; inexpensive; potential for rapid analysis
- Data quality typically lower; no uniform standard of collection

In observational studies, the distribution of X will generally differ between treatment groups (since treatment assignment is not controlled, nothing ensures covariate balance).
- For example, if older people are more likely to get A = 1, we might see distributions like this:

<img src="./img/obs_vs_rct_distribution.png" >
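The contrast between the two designs can be illustrated with a small simulation. This is only a sketch under made-up assumptions: ages are drawn uniformly between 20 and 80, and in the observational case the probability of treatment is assumed to rise linearly with age. The function name `simulate` and all parameters are hypothetical.

```python
import random
from statistics import mean

def simulate(assign, n=10_000, seed=0):
    """Return mean age among treated minus mean age among controls.

    `assign(age)` gives the probability that a subject of that age gets A = 1.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        age = rng.uniform(20, 80)  # assumed age distribution
        (treated if rng.random() < assign(age) else control).append(age)
    return mean(treated) - mean(control)

# Randomized trial: a fair coin toss, independent of age.
gap_rct = simulate(lambda age: 0.5)

# Observational study: older people are more likely to get A = 1 (assumed form).
gap_obs = simulate(lambda age: (age - 20) / 60)

print(f"Age gap, randomized:    {gap_rct:.2f}")
print(f"Age gap, observational: {gap_obs:.2f}")
```

Under these assumptions the randomized age gap should be close to zero, while the observational gap should be large, which is the imbalance that matching tries to remove.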

### Matching

Matching is a method that attempts to make an observational study more like a randomized trial.

Main idea:
- Match individuals in the treated group (A = 1) to individuals in the control group (A=0) on the covariates X.

In the example where older people are more likely to get A = 1:
- At younger ages, there are more people with A = 0
- At older ages, there are more people with A = 1

In an RCT, for any particular age, there should be about the same number of treated and untreated people.

__By matching treated people to control people of the same age, there will be about the same number of treated and controls at any age.__

<u> Advantages of matching </u>

Controlling for confounders is achieved at the design phase (without looking at the outcome)
- the difficult statistical work can be done completely blinded to the outcomes

Matching will __reveal lack of overlap__ in covariate distribution
- Positivity assumption will hold in the population that can be matched

Once the data are matched, they can essentially be treated as if they came from a randomized trial with __ensured covariate balance__.

### Single Covariate Matching


Consider the following covariate distribution of a single covariate between the treatment groups.

<img src="./img/matching_single_covariate_1.png" >

We can match each treated subject to a control subject

<img src="./img/matching_single_covariate_2.png" >

and then we eliminate the excess "blue" subjects in the Control group.

<img src="./img/matching_single_covariate_3.png" >

This ensures a balance in the covariate X.
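The steps above can be sketched in code for the simplest case, where the covariate takes values that can be matched exactly (for example, age in whole years). This is a minimal illustration, not a full matching procedure; the function name `exact_match` and the toy data are hypothetical.

```python
from collections import defaultdict

def exact_match(treated, controls):
    """Pair each treated subject with an unused control with the same covariate value.

    `treated` and `controls` are lists of (subject_id, x) tuples.
    Returns (pairs, unmatched_treated_ids). Controls left in the pool are discarded.
    """
    pool = defaultdict(list)  # covariate value -> ids of available controls
    for cid, x in controls:
        pool[x].append(cid)

    pairs, unmatched = [], []
    for tid, x in treated:
        if pool[x]:
            pairs.append((tid, pool[x].pop(0)))  # each control is used at most once
        else:
            unmatched.append(tid)
    return pairs, unmatched

treated = [("t1", 60), ("t2", 65), ("t3", 70)]
controls = [("c1", 30), ("c2", 60), ("c3", 60), ("c4", 65), ("c5", 40), ("c6", 70)]

pairs, unmatched = exact_match(treated, controls)
print(pairs)      # [('t1', 'c2'), ('t2', 'c4'), ('t3', 'c6')]
print(unmatched)  # []
```

After matching, the excess controls (`c1`, `c3`, `c5`) are dropped, so the matched treated and control groups have identical covariate values.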

### Many covariates

With many covariates, we will typically not be able to match exactly on the full set of covariates.

In a randomized trial, treated and control subjects are not perfect matches either.
- The distribution of covariates is balanced between groups (stochastic balance)

With observational data, matching closely on covariates can achieve stochastic balance.

Example with two covariates (sex, age)

<img src="./img/matching_double_covariates_1.png" >

It is easy to match on discrete type covariates (sex), but not so easy to match on continuous type covariates (age).

<img src="./img/matching_double_covariates_2.png" >

Note that we are making the __distribution of covariates in the control population look like that in the treated population__:
- Doing so means we are estimating the causal effect of treatment on the treated.

This is represented by the following population breakdown

<img src="./img/hypo_worlds_treated.png" >

There are matching methods that can be used to target a different population, but this requires more advanced techniques.

### Fine Balance

Sometimes it is difficult to find great matches. We might be willing to accept some non-ideal matches if the treated and control groups have the same distribution of covariates.
- This is known as __"fine balance"__.

For example:
- Match 1: 
 - Treated: Male, Age 40
 - Control: Female, Age 45
- Match 2:
 - Treated: Female, Age 45
 - Control: Male, Age 40
 
Average age and percent female are the same in both groups, __even though neither match is great__.
- Percentage female is 50% in both treatment groups
- Average age in both treatment groups is 42.5

__By tolerating non-ideal individual matches, we still achieve fine balance across the two groups.__
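The arithmetic for the two matches above can be checked directly. This is just a small verification of the example; the helper name `summarize` is hypothetical.

```python
from statistics import mean

# (sex, age) for each subject in the two matched pairs from the example.
treated = [("M", 40), ("F", 45)]
control = [("F", 45), ("M", 40)]

def summarize(group):
    """Return (share female, mean age) for a list of (sex, age) tuples."""
    share_female = mean(1 if sex == "F" else 0 for sex, _ in group)
    mean_age = mean(age for _, age in group)
    return share_female, mean_age

print(summarize(treated))  # (0.5, 42.5)
print(summarize(control))  # (0.5, 42.5)

# Within each pair the subjects differ by 5 years and in sex.
print([abs(t[1] - c[1]) for t, c in zip(treated, control)])  # [5, 5]
```

The group-level summaries agree exactly even though each individual pair is mismatched on both covariates.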

<u> Number of matches </u>
- __One to one (pair matching)__
 - Match exactly one control to every treated subject
 - Discard controls without a match, which may cost some efficiency
- __Many to one__
 - Match some fixed number K controls to every treated subject (e.g., 5 to 1 matching)
- __Variable__
 - Match sometimes one, sometimes more than one, control to each treated subject
  - If multiple good matches available, use them. 
  - If not, do not.
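One way to picture variable matching on a single covariate is sketched below. It assumes that a "good" match means the control lies within some maximum distance of the treated subject; that threshold (`caliper`), the cap `max_k`, and the function name are all hypothetical choices made for illustration, not something specified above.

```python
def variable_ratio_match(treated, controls, max_k=3, caliper=2.0):
    """Give each treated value up to `max_k` unused controls within `caliper`.

    `treated` and `controls` are lists of covariate values.
    Returns {treated index: [control indices]}, closest controls first.
    """
    available = dict(enumerate(controls))  # control index -> covariate value
    matches = {}
    for t_idx, t_val in enumerate(treated):
        close = sorted(
            (abs(t_val - c_val), c_idx)
            for c_idx, c_val in available.items()
            if abs(t_val - c_val) <= caliper
        )
        chosen = [c_idx for _, c_idx in close[:max_k]]
        for c_idx in chosen:
            del available[c_idx]  # match without replacement
        matches[t_idx] = chosen
    return matches

treated_ages = [50, 70]
control_ages = [49, 50, 51, 52, 69, 30]

matches = variable_ratio_match(treated_ages, control_ages)
print(matches)  # {0: [1, 0, 2], 1: [4]}
```

Here the first treated subject has several close controls and gets three of them, while the second has only one close control and gets one. Setting `max_k=1` gives pair matching.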

### How to match?

Because we typically cannot match exactly, we first need to choose some metric of closeness.

We will consider two options (for now):
- Mahalanobis distance
- Robust Mahalanobis distance

### Mahalanobis distance

Denote by $X_j$ the vector of covariates for subject $j$.

The Mahalanobis distance between covariates for subject i and subject j is:

$D(X_i, X_j) = \sqrt{(X_i - X_j)^TS^{-1}(X_i-X_j)}$

Here $S$ is the covariance matrix of the covariates. The metric is the square root of the sum of squared differences in each covariate, scaled by the covariance matrix.
- We need to scale because some covariates may be on a much larger scale than others, so "big" should be a relative notion.

<img src="./img/mahalanobis_dist_1.png" >
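The formula can be sketched in plain Python for the special case of two covariates, where $S^{-1}$ has a closed form. The restriction to two covariates, the function names, and the toy data (age and income) are assumptions made to keep the example short; with more covariates a linear algebra library would normally be used.

```python
import math
from statistics import mean

def covariance_2x2(rows):
    """Sample covariance (n - 1 denominator) for rows of (x1, x2), as (s11, s12, s22)."""
    n = len(rows)
    m1 = mean(r[0] for r in rows)
    m2 = mean(r[1] for r in rows)
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    return s11, s12, s22

def mahalanobis_2d(xi, xj, cov):
    """sqrt((xi - xj)^T S^{-1} (xi - xj)) using the closed-form 2x2 inverse."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    d1, d2 = xi[0] - xj[0], xi[1] - xj[1]
    # S^{-1} = (1 / det) * [[s22, -s12], [-s12, s11]]
    quad = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(quad)

# Hypothetical subjects: (age in years, income in dollars).
subjects = [(30, 40_000), (45, 52_000), (50, 90_000), (62, 61_000), (38, 75_000)]
S = covariance_2x2(subjects)
d_dollars = mahalanobis_2d(subjects[0], subjects[1], S)

# Re-express income in thousands of dollars: the distance should not change.
rescaled = [(age, income / 1000) for age, income in subjects]
d_thousands = mahalanobis_2d(rescaled[0], rescaled[1], covariance_2x2(rescaled))

print(d_dollars, d_thousands)
```

The two printed values should agree up to floating-point error, which illustrates why the scaling matters: the distance does not depend on the units each covariate happens to be measured in.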

### Robust Mahalanobis distance

The motivation is to deal with outliers.
- Outliers (in a specific dimension/covariate) can create large distances between subjects, even if the covariates are otherwise similar
- __Ranks__ might be more relevant
 - e.g. highest and second highest ranked values of covariates perhaps should be treated as similar, even if the values are far apart.
 

Robust Mahalanobis distance:
- Replace each covariate value with its rank
- Use a constant diagonal on the covariance matrix (since ranks should all be on the same scale)
- Calculate the usual Mahalanobis distance on the ranks
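The three steps above can be sketched for two covariates as follows. Several details are assumptions on my part: ties receive average ranks, the constant placed on the diagonal is the variance of untied ranks, $n(n+1)/12$, and the off-diagonal term is rescaled to stay consistent with it. The function names and data are hypothetical, and the sketch does not handle a covariate whose values are all tied.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def robust_mahalanobis_2d(rows):
    """Return a function d(i, j): rank-based Mahalanobis distance between rows i and j."""
    n = len(rows)
    r1 = average_ranks([r[0] for r in rows])
    r2 = average_ranks([r[1] for r in rows])
    m = (n + 1) / 2  # the mean of average ranks is always (n + 1) / 2
    s11 = sum((a - m) ** 2 for a in r1) / (n - 1)
    s22 = sum((b - m) ** 2 for b in r2) / (n - 1)
    s12 = sum((a - m) * (b - m) for a, b in zip(r1, r2)) / (n - 1)
    # Constant diagonal: rescale so each variance equals that of untied ranks.
    v = n * (n + 1) / 12
    s12 *= v / math.sqrt(s11 * s22)
    s11 = s22 = v
    det = s11 * s22 - s12 ** 2

    def distance(i, j):
        d1, d2 = r1[i] - r1[j], r2[i] - r2[j]
        return math.sqrt((s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det)

    return distance

# Hypothetical subjects: (age, income). The last income is an extreme outlier.
rows = [(30, 40_000), (45, 52_000), (50, 90_000), (62, 61_000), (38, 5_000_000)]
d = robust_mahalanobis_2d(rows)
print(d(2, 4))
```

Because only ranks enter the calculation, replacing the outlying income of 5,000,000 with 95,000 (still the highest value) would leave every distance unchanged.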

### Other distance measures
- If you want an exact match on a few important covariates, you can set the distance to infinity whenever they are not equal.
 - In other words, apply a strong penalty (weight) to mismatches on specific covariates
- Distance on propensity score 
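The exact-match idea from the first bullet can be sketched as a wrapper around any base distance. The function names, the dictionary layout of a subject, and the use of an age gap as the base distance are all hypothetical choices for illustration.

```python
import math

def distance_with_exact_match(t, c, base_distance, exact_keys):
    """Return infinity unless t and c agree on every covariate in `exact_keys`."""
    if any(t[k] != c[k] for k in exact_keys):
        return math.inf
    return base_distance(t, c)

def age_gap(t, c):
    """Base distance used for this example: absolute difference in age."""
    return abs(t["age"] - c["age"])

treated = {"sex": "F", "age": 45}
controls = [{"sex": "M", "age": 45}, {"sex": "F", "age": 49}]

dists = [distance_with_exact_match(treated, c, age_gap, ["sex"]) for c in controls]
print(dists)  # [inf, 4]
```

The first control has the same age but a different sex, so it is ruled out entirely; the second control is the only admissible match.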

Once you have a distance score, how should you select matches?
- __Greedy (nearest neighbor) matching__
 - Not as good but computationally fast
- __Optimal matching__
 - Better but computationally demanding.
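The difference between the two strategies can be seen on a tiny example with one covariate. This is a sketch only: the greedy version processes treated subjects in the order given, and the "optimal" version simply tries every assignment, which is feasible only for very small inputs. In practice a dedicated assignment solver would typically be used. All function names and data are hypothetical.

```python
from itertools import permutations

def greedy_match(treated, controls):
    """Match each treated value, in order, to its nearest still-unused control."""
    unused = list(range(len(controls)))
    pairs = []
    for t_idx, t_val in enumerate(treated):
        best = min(unused, key=lambda c: abs(t_val - controls[c]))
        unused.remove(best)
        pairs.append((t_idx, best))
    return pairs

def optimal_match(treated, controls):
    """Brute-force the assignment that minimizes total distance (tiny inputs only)."""
    best = min(
        permutations(range(len(controls)), len(treated)),
        key=lambda perm: sum(abs(t - controls[c]) for t, c in zip(treated, perm)),
    )
    return list(enumerate(best))

def total_distance(pairs, treated, controls):
    """Sum of absolute covariate differences over all matched pairs."""
    return sum(abs(treated[t] - controls[c]) for t, c in pairs)

treated_ages = [50, 54]
control_ages = [53, 46]

g = greedy_match(treated_ages, control_ages)
o = optimal_match(treated_ages, control_ages)
print(g, total_distance(g, treated_ages, control_ages))  # [(0, 0), (1, 1)] 11
print(o, total_distance(o, treated_ages, control_ages))  # [(0, 1), (1, 0)] 5
```

Greedy matching takes the closest control for the first treated subject, which leaves the second with a poor match. Optimal matching accepts a slightly worse first pair to reduce the total distance.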