## Observational Studies

We need to understand the difference between observational studies and randomised trials, and how to bridge the difference with matching.

Consider the following DAG:

<img src="./img/dags_confounding.png" >

In this case, X is sufficient to control for confounding.
- Ignorability assumption holds:

$Y^0, Y^1 \perp A |X$

In a randomized trial, treatment assignment A would be determined by a coin toss.
- This effectively erases the arrow from X to A.

<img src="./img/dags_rct_vs_observational.png" >

In a randomized trial, the distribution of X will be the same in both treatment groups.

<img src="./img/rct_population.png" >

In summary:
- The distribution of pretreatment variables X that affect Y is the same in both treatment groups.
 - __Covariate balance is ensured__
- Thus, if the outcome distribution ends up differing, it will not be because of differences in X.
- X is dealt with at the design phase.

<u> Issues with Randomization:</u>
- Randomized trials are expensive
- Sometimes randomizing treatment/exposure is unethical
- Some (many) people will refuse to participate in trials
- Randomized trials take time (you have to wait for outcome data).
 - In some cases, by the time you have outcome data, the question might no longer be relevant.

<u>Observational Studies</u>

Planned, prospective, observational studies with active data collection:
- __Like trials:__ data collected on a common set of variables at planned times; outcomes are carefully measured; study protocols.
- __Unlike trials:__ regulations much weaker, since not intervening; broader population eligible for the study.

Databases, retrospective, passive data collection:
- large sample sizes; inexpensive; potential for rapid analysis
- Data quality typically lower; no uniform standard of collection

In observational studies, the distribution of X will differ between treatment groups (since there is no control of the covariate to ensure balance).
- For example, if older people are more likely to get A = 1, we might see distributions like this:

<img src="./img/obs_vs_rct_distribution.png" >
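
The contrast can be illustrated with a small simulation. This is only a sketch: the age range, the sample size and the age-dependent assignment rule in `simulate` are made-up assumptions, not taken from any real study.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate(n, randomized):
    """Return (ages of treated, ages of controls) for n simulated subjects."""
    treated, control = [], []
    for _ in range(n):
        age = random.uniform(20, 80)
        if randomized:
            p_treat = 0.5  # coin toss: no arrow from X to A
        else:
            # Hypothetical rule: probability of treatment rises with age
            p_treat = 0.1 + 0.8 * (age - 20) / 60
        (treated if random.random() < p_treat else control).append(age)
    return treated, control

rct_treated, rct_control = simulate(20_000, randomized=True)
obs_treated, obs_control = simulate(20_000, randomized=False)

rct_gap = mean(rct_treated) - mean(rct_control)
obs_gap = mean(obs_treated) - mean(obs_control)
print(f"RCT mean age gap:           {rct_gap:.2f}")
print(f"Observational mean age gap: {obs_gap:.2f}")
```

Under these assumptions the mean age of the two groups is nearly identical in the randomized case, while in the observational case the treated group is clearly older.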

### Matching

Matching is a method that attempts to make an observational study more like a randomized trial.

Main idea:
- Match individuals in the treated group (A = 1) to individuals in the control group (A=0) on the covariates X.

In the example where older people are more likely to get A = 1:
- At younger ages, there are more people with A = 0
- At older ages, there are more people with A = 1

In a RCT, for any particular age, there should be about the same number of treated and untreated people.

__By matching treated people to control people of the same age, there will be about the same number of treated and controls at any age.__

<u> Advantages of matching </u>

Controlling for confounders is achieved at the design phase (without looking at the outcome)
- the difficult statistical work can be done completely blinded to the outcomes

Matching will __reveal lack of overlap__ in covariate distribution
- Positivity assumption will hold in the population that can be matched

Once the data are matched, they can essentially be analysed as if they came from a randomized trial with __ensured covariate balance__.

### Single Covariate Matching


Consider the following covariate distribution of a single covariate between the treatment groups.

<img src="./img/matching_single_covariate_1.png" >

We can match each treated subject to a control subject

<img src="./img/matching_single_covariate_2.png" >

and then we eliminate the excess "blue" subjects in the Control group.

<img src="./img/matching_single_covariate_3.png" >

This ensures a balance in the covariate X.
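
A minimal sketch of this idea for a single discrete covariate, assuming exact matches exist. The function name `exact_match` and the ages are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def exact_match(treated, controls):
    """Pair each treated value with one unused control that has the same value.

    Returns a list of (treated_index, control_index) pairs; subjects
    without an exact match are simply left out.
    """
    available = defaultdict(list)
    for j, x in enumerate(controls):
        available[x].append(j)
    pairs = []
    for i, x in enumerate(treated):
        if available[x]:
            pairs.append((i, available[x].pop(0)))
    return pairs

treated_age = [30, 40, 40, 50]
control_age = [30, 30, 40, 40, 40, 50, 60, 60]  # more controls than treated

pairs = exact_match(treated_age, control_age)
matched_controls = [control_age[j] for _, j in pairs]

print(pairs)
# After discarding the excess controls, the covariate distributions coincide
print(sorted(matched_controls) == sorted(treated_age))
```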

### Many covariates

We will not be able to exactly match on the full set of covariates.

In a randomized trial, treated and control subjects are not perfect matches either.
- The distribution of covariates is balanced between groups (stochastic balance)

With observational data, matching closely on covariates can achieve stochastic balance.

Example with two covariates (sex, age)

<img src="./img/matching_double_covariates_1.png" >

It is easy to match on discrete type covariates (sex), but not so easy to match on continuous type covariates (age).

<img src="./img/matching_double_covariates_2.png" >

Note that we are making the __distribution of covariates in the control population look like that in the treated population__:
- Doing so means we are estimating the causal effect of treatment on the treated.

This is represented by the following population breakdown

<img src="./img/hypo_worlds_treated.png" >

There are matching methods that can be used to target a different population, but this requires more advanced techniques.

### Fine Balance

Sometimes it is difficult to find great matches. We might be willing to accept some non-ideal matches if the treated and control groups end up with the same distribution of covariates.
- This is known as __"fine balance"__.

For example:
- Match 1: 
 - Treated: Male, Age 40
 - Control: Female, Age 45
- Match 2:
 - Treated: Female, Age 45
 - Control: Male, Age 40
 
Average age and percent female are the same in both groups, __even though neither match is great__.
- Percentage of Male is 50% which is the same in both treatment groups
- Average age in both treatment groups is 42.5

__We achieve fine balance even though the matches are not great by tolerating non-ideal matches.__
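
The arithmetic of this example can be checked with a few lines of Python. The data structure below is just one hypothetical way to write down the two matched pairs.

```python
from statistics import mean

# The two matched pairs from the example above, stored as (sex, age)
matches = [
    {"treated": ("M", 40), "control": ("F", 45)},
    {"treated": ("F", 45), "control": ("M", 40)},
]

def summarize(group):
    """Group-level summary: percent male and mean age."""
    sexes = [m[group][0] for m in matches]
    ages = [m[group][1] for m in matches]
    return {"pct_male": 100 * sexes.count("M") / len(sexes),
            "mean_age": mean(ages)}

treated_summary = summarize("treated")
control_summary = summarize("control")

# Within each pair the sexes differ and the ages are 5 years apart...
within_pair_age_gaps = [abs(m["treated"][1] - m["control"][1]) for m in matches]

# ...yet the group-level summaries are identical.
print(treated_summary, control_summary, within_pair_age_gaps)
```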

<u> Number of matches </u>
- __One to one (pair matching)__
 - Match exactly one control to every treated subject
 - Discard those without matches so you might lose some efficiency
- __Many to one__
 - Match some fixed number K controls to every treated subject (e.g., 5 to 1 matching)
- __Variable__
 - Sometimes match 1, sometimes more than 1, control to treated subjects
  - If multiple good matches available, use them. 
  - If not, do not.

### How to match?

Because we typically cannot match exactly, we first need to choose some metric of closeness.

We will consider two options (for now):
- Mahalanobis distance
- Robust Mahalanobis distance

### Mahalanobis distance

Denote by $X_j$ the vector of covariates for subject j.

The Mahalanobis distance between covariates for subject i and subject j is:

$D(X_i, X_j) = \sqrt{(X_i - X_j)^TS^{-1}(X_i-X_j)}$

Here $S$ is the covariance matrix of the covariates. The metric is the square root of the sum of squared differences between the covariates, scaled by the covariance matrix.
- We need to scale because some covariates may be on a much larger scale than others, so "big" should be a relative notion.

<img src="./img/mahalanobis_dist_1.png" >
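
A minimal NumPy sketch of the formula above. The covariate values (age and income) are made up, and `S` is estimated here as the sample covariance of all subjects, which is an assumption about how the covariance matrix is obtained.

```python
import numpy as np

def mahalanobis(x_i, x_j, S_inv):
    """Mahalanobis distance between two covariate vectors."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sqrt(diff @ S_inv @ diff))

# Made-up covariates: columns are age (years) and income (dollars)
X = np.array([
    [30, 40_000],
    [35, 42_000],
    [50, 90_000],
    [60, 65_000],
    [45, 52_000],
], dtype=float)

S = np.cov(X, rowvar=False)  # sample covariance matrix of the covariates
S_inv = np.linalg.inv(S)

d_01 = mahalanobis(X[0], X[1], S_inv)  # two similar subjects
d_02 = mahalanobis(X[0], X[2], S_inv)  # two dissimilar subjects
print(d_01, d_02)
```

Because of the scaling by $S^{-1}$, the result does not change if income is recorded in thousands of dollars instead of dollars.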

### Robust Mahalanobis distance

Motivation is to deal with outlier data.
- Outliers (in a specific dimension/covariate) can create large distances between subjects, even if the covariates are otherwise similar
- __Ranks__ might be more relevant
 - e.g. the highest and second highest ranked values of a covariate should perhaps be treated as similar, even if the values are far apart.
 

Robust Mahalanobis distance:
- Replace each covariate value with its rank
- Use a constant diagonal on the covariance matrix of the ranks (since ranks should all be on the same scale)
- Calculate the usual Mahalanobis distance on the ranks
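
One possible reading of these three steps is sketched below with NumPy. The helper names are hypothetical, the data are made up, and rescaling the rank covariance so that every diagonal entry equals the variance of untied ranks is an assumption about how the "constant diagonal" is achieved. The sketch expects at least two covariates.

```python
import numpy as np

def average_ranks(col):
    """1-based ranks; tied values share their mean rank."""
    col = np.asarray(col, dtype=float)
    order = np.argsort(col, kind="stable")
    ranks = np.empty(len(col), dtype=float)
    ranks[order] = np.arange(1, len(col) + 1)
    for value in np.unique(col):
        tied = col == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def robust_mahalanobis_matrix(X):
    """Pairwise rank-based Mahalanobis distances between all rows of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Step 1: replace each covariate value with its rank
    R = np.column_stack([average_ranks(X[:, k]) for k in range(p)])
    S = np.cov(R, rowvar=False)
    # Step 2 (assumed form): rescale so each diagonal entry equals the
    # variance of the untied ranks 1..n
    untied_var = np.var(np.arange(1, n + 1), ddof=1)
    scale = np.sqrt(untied_var / np.diag(S))
    S = S * np.outer(scale, scale)
    # Step 3: usual Mahalanobis distance, computed on the ranks
    S_inv = np.linalg.inv(S)
    diff = R[:, None, :] - R[None, :, :]
    return np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff))

# Made-up data: subject 3 has an extreme income
X = np.array([
    [30, 40_000],
    [35, 52_000],
    [50, 90_000],
    [60, 5_000_000],
    [45, 42_000],
], dtype=float)

D = robust_mahalanobis_matrix(X)
print(np.round(D, 2))
```

Since only ranks enter the calculation, replacing the extreme income with any other value above 90,000 leaves the distance matrix unchanged.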

### Other distance measures
- If you want an exact match on a few important covariates, you can essentially make the distance infinity if they are not equal. 
 - In other words, strong penalty/weightage for specific covariates dimensions
- Distance on propensity score 

Once you have a distance score, how should you select matches?
- __Greedy (nearest neighbor) matching__
 - Not as good but computationally fast
- __Optimal matching__
 - Better but computationally demanding.

### Greedy (nearest neighbor) matching

Experiment Setup:
- You have selected a set of pre-treatment covariates X that (hopefully) satisfy the ignorability assumption
- You have calculated a distance $d_{ij}$ between each treated subject and every control subject
- You have many more control subjects than treated subjects
 - This is often the case in observational studies
- Focus is on pair (one-to-one) matching 

Steps:
1. Randomly order the lists of treated subjects and control subjects.
2. Start with the first treated subject. Match to the control with the smallest distance (this is greedy).
3. Remove the matched control from the list of available matches.
4. Move on to the next treated subject. Match to the control with the smallest distance.
5. Repeat steps 3 and 4 until you have matched all treated subjects.
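
The steps above can be sketched in plain Python as follows. This is an illustrative implementation, not the one used by any particular package; the function name `greedy_match`, the ages, and the use of absolute age difference as the distance are assumptions.

```python
import random

def greedy_match(dist, seed=0):
    """Greedy one-to-one matching.

    dist[i][j] is the distance between treated subject i and control j.
    Returns a dict {treated_index: control_index}.
    """
    rng = random.Random(seed)
    treated_order = list(range(len(dist)))
    control_pool = list(range(len(dist[0])))
    rng.shuffle(treated_order)   # step 1: random order of both lists
    rng.shuffle(control_pool)
    matches = {}
    for i in treated_order:      # steps 2 and 4: next treated subject
        if not control_pool:
            break                # ran out of controls
        j = min(control_pool, key=lambda c: dist[i][c])  # closest available control
        matches[i] = j
        control_pool.remove(j)   # step 3: a control can be used only once
    return matches

treated_age = [40, 55, 62]
control_age = [38, 41, 50, 58, 63, 70]
dist = [[abs(t - c) for c in control_age] for t in treated_age]

matches = greedy_match(dist)
for i, j in sorted(matches.items()):
    print(f"treated age {treated_age[i]} <-> control age {control_age[j]}")
```

In this toy data no two treated subjects compete for the same control, so the result does not depend on the random order. In general it can.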

<u> Greedy Matching</u>
- Intuitive 
- Computationally fast
    - Involves a series of simple operations (identifying the minimum distance)
    - Fast even for large data sets
    - R package: MatchIt
- Not invariant to the initial order of the list
- Not optimal
    - Always taking the smallest distance match does not minimize total distance
    - Can lead to some bad matches

<u> Many-to-one Matching</u>
- For k:1 matching:
    - After everyone has 1 match, go through the list again and find 2nd matches from the remaining pool

<u> Tradeoffs</u>
- Pair matching
    - Closer matches
    - Faster computing time
- Many-to-one
    - Larger sample size
- Largely a bias-variance tradeoff issue
    - Pair matching has less bias because the matching is closer, but it should be less efficient because you are discarding data.
    - Many to one matching has more bias, but smaller variance.
- Note that the efficiency gain from "many-to-one" matching is smaller than the gain you would get from adding another treated subject for whom matches can be found among the controls.

<u> Caliper </u>
- We might prefer to exclude treated subjects for whom there does not exist a good match.
- A bad match can be defined using a caliper (max acceptable distance)
    - Only match a treated subject if the best control match has distance less than the caliper
    - Otherwise, get rid of that treated subject
    - Recall positivity assumption (prob of each treatment given X should be non-zero): 
        - If no matches within caliper, it is a sign that positivity assumption would be violated.
        - Excluding these subjects makes assumption more realistic
        - Drawback: population might be hard to define
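
A caliper can be added to greedy matching with a single extra check. The sketch below processes treated subjects in list order for simplicity; the function name, the ages, and the caliper of 5 years are hypothetical.

```python
def greedy_match_with_caliper(dist, caliper):
    """Greedy pair matching that drops treated subjects with no close control.

    dist[i][j] is the distance between treated subject i and control j.
    Returns (matches, dropped): a dict {treated: control} and a list of
    treated indices excluded because their best match was not within the caliper.
    """
    available = set(range(len(dist[0])))
    matches, dropped = {}, []
    for i, row in enumerate(dist):
        best = min(available, key=lambda c: row[c], default=None)
        if best is not None and row[best] < caliper:
            matches[i] = best
            available.remove(best)
        else:
            dropped.append(i)  # no acceptable match: exclude this treated subject
    return matches, dropped

treated_age = [40, 55, 85]
control_age = [38, 41, 50, 58, 63, 70]
dist = [[abs(t - c) for c in control_age] for t in treated_age]

matches, dropped = greedy_match_with_caliper(dist, caliper=5)
print(matches)  # treated index -> control index
print(dropped)  # treated subjects left unmatched
```

Here the 85-year-old treated subject has no control within 5 years, which hints that there may be no comparable untreated people at that age.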

### Optimal Matching
- Greedy matching is not typically optimal
- Optimal matching
    - Minimizes global distance measure
    - Computationally demanding
    - R packages:
        - `optmatch`
        - `rcbalance`
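
To see why greedy matching need not be optimal, the toy example below compares it with a brute-force search over all possible assignments. The brute-force search is only for illustration on tiny problems and is not how dedicated packages solve the problem; the ages and function names are made up.

```python
from itertools import permutations

def total_distance(dist, assignment):
    """Sum of distances when treated subject i is matched to assignment[i]."""
    return sum(dist[i][j] for i, j in enumerate(assignment))

def optimal_match(dist):
    """Try every one-to-one assignment and keep the one with minimum total distance."""
    n_treated, n_controls = len(dist), len(dist[0])
    return min(permutations(range(n_controls), n_treated),
               key=lambda a: total_distance(dist, a))

def greedy_in_order(dist):
    """Greedy matching, taking treated subjects in list order."""
    available = set(range(len(dist[0])))
    assignment = []
    for row in dist:
        j = min(available, key=lambda c: row[c])
        assignment.append(j)
        available.remove(j)
    return tuple(assignment)

treated_age = [50, 55]
control_age = [54, 40, 72]
dist = [[abs(t - c) for c in control_age] for t in treated_age]

greedy = greedy_in_order(dist)
optimal = optimal_match(dist)
print("greedy :", greedy, total_distance(dist, greedy))
print("optimal:", optimal, total_distance(dist, optimal))
```

Greedy gives the 50-year-old the 54-year-old control, leaving the 55-year-old with a poor match. The optimal solution accepts a slightly worse first match to reduce the total distance.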

Feasibility:
- Whether or not it is feasible to perform optimal matching depends on the size of the problem.
- Constraints can be imposed to make optimal matching computationally feasible for larger data sets.
    - For example:
        - Match within hospitals in a multi-site clinical study
        - Match within primary disease category
        - These are "blocks"
    - This is known as sparse matching
        - Mismatches can be tolerated if fine balance can still be achieved.

### Assessing Balance

<u> Did matching work? </u>
- After you have matched, you should assess whether matching worked.
    - __Covariate balance__
        - Standardized differences
            - Similar means?
    - __This can/should be done without looking at the outcome__
- Commonly, a "Table 1" is created, where pre-matching and post-matching balance is compared.

<u> Hypothesis Tests and p-values </u>
- Balance can be assessed with hypothesis tests
    - i.e., test for a difference in means between treated and controls for each covariate
        - Two sample t-tests (for continuous covariates) or chi-square test (for discrete covariates) and report p-value for each test
    - Drawback:
        - p-values are dependent on sample size
        - Small differences in means will have a small p-value if the sample size is large (which is often the case).
            - We probably do not care much if mean differences are small.
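
To make the sample-size dependence concrete, consider the simplified case (an assumption made here for illustration) of two groups with $n$ subjects each and a common standard deviation $s$. The two-sample t-statistic is then

$t = \frac{\bar{X}_1 - \bar{X}_0}{s\sqrt{2/n}} = d\sqrt{n/2}, \quad \text{where } d = \frac{\bar{X}_1 - \bar{X}_0}{s}$

For a fixed difference of $d = 0.05$ standard deviations, $n = 100$ gives $t \approx 0.35$, while $n = 10{,}000$ gives $t \approx 3.5$. The same small imbalance is "not significant" in the small sample and "highly significant" in the large one.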


<u> Standardised differences </u>

A standardised difference is the difference in means between groups, divided by the (pooled) standard deviation.

<img src="./img/standardised_diff_formula.png" >

Standardised differences:
- Do not depend on sample size
- Often, the absolute value of the SMD is reported (ignoring the sign)
- Calculate for each variable that you match on

Rules of thumb:
- Values < 0.1 indicate adequate balance
- Values 0.1 to 0.2 are not too alarming
- Values > 0.2 indicate serious imbalance
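
A small sketch of the calculation and the rules of thumb. The formula image is not reproduced in the code, so this assumes the common definition in which the denominator is the square root of the average of the two group variances. The ages are made up.

```python
from math import sqrt
from statistics import mean, variance

def smd(treated, control):
    """Standardised mean difference (assumed pooled-SD definition)."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

def balance_label(value):
    """Apply the rules of thumb to the absolute SMD."""
    value = abs(value)
    if value < 0.1:
        return "adequate balance"
    if value <= 0.2:
        return "not too alarming"
    return "serious imbalance"

treated_age = [50, 60, 70]
control_age_before = [30, 40, 50, 60, 70]  # all available controls
control_age_after = [51, 60, 69]           # controls kept after matching

smd_before = smd(treated_age, control_age_before)
smd_after = smd(treated_age, control_age_after)
print(f"before matching: {smd_before:.2f} ({balance_label(smd_before)})")
print(f"after matching:  {smd_after:.2f} ({balance_label(smd_after)})")
```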

### Table 1

<img src="./img/table1_1.png" >

### SMD Plot with Threshold = 0.1
<img src="./img/smd_plot.png" >
