# (Austin, 2011) An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies

As always - some notes I took from the paper that I found to be most important or notable. I always try to keep these notes short, otherwise it's easier to just reference the paper directly. Of course, *how* short is difficult to say as papers can vary drastically in length.

In addition, I like to add my own notes here and there where I think they help with understanding the text.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/

## Abstract

## Randomized Controlled Trials Versus Observational Studies

## The Propensity Score and Propensity Score Methods

The propensity score was defined by Rosenbaum and Rubin (1983a) as the probability of treatment assignment conditional on observed baseline covariates:
$$e_i=\Pr(Z_i=1 \mid X_i)$$

This is a balancing score: conditional on the propensity score, the distribution of measured baseline covariates is similar between treated and untreated subjects.

In a randomized experiment, the true propensity score is known and defined by the study design.

In observational studies, the true propensity score is generally unknown and must be estimated from the study data. Logistic regression is the most common choice, although the following methods have also been examined:
- bagging/boosting (2010; 2004)
- recursive partitioning or tree-based methods (2010; 2008)
- random forests (2010)
- neural networks (2008)
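
As a rough illustration of the logistic regression route, here is a minimal sketch that fits a one-covariate logistic model by plain gradient ascent. The simulated data, the helper names `sigmoid` and `fit_logistic`, and the learning rate and iteration count are all assumptions made for this example and do not come from the paper; in practice a statistics library would do the fitting.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, zs, lr=0.5, n_iter=500):
    """Fit Pr(Z=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, z in zip(xs, zs):
            resid = z - sigmoid(b0 + b1 * x)  # observed treatment minus fitted probability
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(0)
# Hypothetical data: one baseline covariate x; treatment is more likely when x is large.
xs = [random.gauss(0, 1) for _ in range(500)]
zs = [1 if random.random() < sigmoid(-0.5 + 1.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, zs)
e_hat = [sigmoid(b0 + b1 * x) for x in xs]  # estimated propensity scores
```

The values in `e_hat` are the estimated scores that the four methods below would then consume.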

There are 4 different methods that use the propensity score:
- Propensity score matching
- Stratification (on the propensity score)
- Inverse probability of treatment weighting (IPTW)
- Covariate adjustment using the propensity score

Rosenbaum and Rubin (1983a) define treatment assignment to be strongly "ignorable" if the following 2 conditions hold:
- $(Y(1),Y(0))\perp Z|X$
- $0< P(Z=1|X)< 1$

>My note: You'll notice that they combine two conditions here, conditional exchangeability and positivity, to define strong "ignorability".

In other words, they showed that when treatment assignment is strongly ignorable, you can achieve causal identification by conditioning on the propensity score built from the covariates instead of on the covariates directly. Of course, it's important to note that they mean the true propensity score here, not the estimated one.
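
Stated symbolically (a restatement in the notation above, writing $e(X)=P(Z=1|X)$ for the true propensity score), strong ignorability given $X$ implies strong ignorability given $e(X)$:
$$(Y(1),Y(0))\perp Z|X \;\text{ and }\; 0< P(Z=1|X)< 1 \;\Longrightarrow\; (Y(1),Y(0))\perp Z|e(X) \;\text{ and }\; 0< P(Z=1|e(X))< 1$$

so that the average treatment effect can be written as an average over the distribution of the propensity score:
$$E[Y(1)-Y(0)]=E_{e(X)}\big[E[Y|Z=1,e(X)]-E[Y|Z=0,e(X)]\big]$$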

### Propensity Score Matching

>My note: they describe 1:1 PSM here - I won't go into detail since you can get a better read from ([King and Nielsen, 2019](https://gking.harvard.edu/sites/scholar.harvard.edu/files/gking/files/pan1900011_rev.pdf)). I will note though - the methods described in this section aren't unique to propensity scores, so the same concepts can be applied to other distance measures such as Mahalanobis distance.
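
For orientation only, here is a minimal sketch of greedy 1:1 nearest-neighbor matching without replacement on a scalar score. The function name `greedy_match`, the optional `caliper` argument, and the toy scores are assumptions for illustration; the same loop works for any scalar distance, not just the propensity score.

```python
def greedy_match(treated, controls, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping subject id -> score (e.g. an estimated propensity score).
    Returns a list of (treated_id, control_id) pairs; treated subjects with no
    control within the caliper are left unmatched.
    """
    available = dict(controls)  # controls not yet used
    pairs = []
    for t_id, t_score in treated.items():
        if not available:
            break
        # Closest remaining control on the score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is None or abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical estimated propensity scores.
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
controls = {"c1": 0.78, "c2": 0.50, "c3": 0.52, "c4": 0.05}

pairs = greedy_match(treated, controls, caliper=0.1)
# t1 pairs with c1 and t2 with c3; t3 has no control within 0.1 and stays unmatched.
```

Because the matching is greedy, the result can depend on the order in which treated subjects are processed.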

### Stratification on the Propensity Score

Stratification involves dividing subjects into mutually exclusive subsets based on their estimated propensity score. Increasing the number of strata used should result in improved bias reduction, although the marginal reduction in bias decreases as the number of strata increases.
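
A minimal sketch of the idea, under stated assumptions: strata are equal-sized groups formed by ranking on the score, the overall estimate is the size-weighted average of within-stratum differences in mean outcome, and the data are simulated with the true propensity score standing in for an estimate. The function name `stratified_effect` and the data-generating process are hypothetical.

```python
import random
from statistics import mean

def stratified_effect(e_hat, z, y, n_strata=5):
    """Size-weighted average of within-stratum differences in mean outcome."""
    order = sorted(range(len(e_hat)), key=lambda i: e_hat[i])
    n = len(order)
    total, used = 0.0, 0
    for k in range(n_strata):
        idx = order[k * n // n_strata:(k + 1) * n // n_strata]
        y1 = [y[i] for i in idx if z[i] == 1]
        y0 = [y[i] for i in idx if z[i] == 0]
        if y1 and y0:  # skip strata that lack one of the two groups
            total += len(idx) * (mean(y1) - mean(y0))
            used += len(idx)
    return total / used

random.seed(1)
n = 5000
e = [random.uniform(0.1, 0.9) for _ in range(n)]     # true propensity scores (assumed known here)
z = [1 if random.random() < ei else 0 for ei in e]   # treatment assignment
y = [2.0 * zi + 10.0 * ei for zi, ei in zip(z, e)]   # true treatment effect is 2, confounded through e

naive = mean(yi for yi, zi in zip(y, z) if zi == 1) - mean(yi for yi, zi in zip(y, z) if zi == 0)
est_5 = stratified_effect(e, z, y, n_strata=5)
est_10 = stratified_effect(e, z, y, n_strata=10)
```

Comparing `naive`, `est_5`, and `est_10` against the true effect of 2 gives a feel for how much residual confounding remains within strata.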

### IPTW

Uses weights based on the propensity score to create a synthetic sample in which the distribution of measured baseline covariates is independent of treatment assignment.

Weights can be defined as:
$$w_i=\frac{Z_i}{e_i}+\frac{(1-Z_i)}{1-e_i}$$

where $Z_i$ is an indicator variable denoting whether or not the $i$th subject was treated.

So, if $Z_i=1$, then $w_i=\frac{1}{e_i}$, and if $Z_i=0$, then $w_i=\frac{1}{1-e_i}$.

One estimate of the ATE is:
$$\frac{1}{n}\sum_{i=1}^{n}\frac{Z_iY_i}{e_i}-\frac{1}{n}\sum_{i=1}^{n}\frac{(1-Z_i)Y_i}{1-e_i}$$

Thus, one can think of weighting as changing the importance of each subject's outcome based on the inverse of the probability of the treatment that subject actually received, so that the weighted treated and control groups each resemble the full sample.
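
The weight and ATE formulas above can be sketched directly. The toy data below are hypothetical and constructed so that the true effect is 1 while the unadjusted comparison is confounded; the function names are my own.

```python
def iptw_weights(z, e):
    """w_i = Z_i / e_i + (1 - Z_i) / (1 - e_i)."""
    return [zi / ei + (1 - zi) / (1 - ei) for zi, ei in zip(z, e)]

def iptw_ate(z, y, e):
    """IPTW estimate of the ATE: mean(Z*Y/e) - mean((1-Z)*Y/(1-e))."""
    n = len(y)
    treated_term = sum(zi * yi / ei for zi, yi, ei in zip(z, y, e)) / n
    control_term = sum((1 - zi) * yi / (1 - ei) for zi, yi, ei in zip(z, y, e)) / n
    return treated_term - control_term

# Hypothetical data: group A has propensity 0.25 and baseline outcome 0,
# group B has propensity 0.75 and baseline outcome 4; treatment adds 1 in both.
e = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
z = [1, 0, 0, 0, 1, 1, 1, 0]
y = [1, 0, 0, 0, 5, 5, 5, 4]

treated_y = [yi for yi, zi in zip(y, z) if zi == 1]
control_y = [yi for yi, zi in zip(y, z) if zi == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)  # 4 - 1 = 3

w = iptw_weights(z, e)
ate = iptw_ate(z, y, e)  # approximately 1, the true effect
```

In this example the rare cases (the treated subject in group A and the control subject in group B) each get weight 4, while the common cases get weight 4/3.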

Interestingly, regression models can be weighted by the inverse probability of treatment to estimate causal effects. When used in this context, IPTW is part of a larger family of causal methods known as marginal structural models.

Also, since the weights may be inaccurate or unstable for subjects with a very low probability of receiving the treatment they actually received, the use of stabilizing weights has been proposed (2000).
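
One common form of stabilization (my assumption about the variant meant here, since these notes do not spell it out) multiplies each weight by the marginal probability of the treatment actually received, $P(Z=1)$ for treated subjects and $P(Z=0)$ for controls. A minimal sketch with hypothetical scores:

```python
def unstabilized_weights(z, e):
    return [zi / ei + (1 - zi) / (1 - ei) for zi, ei in zip(z, e)]

def stabilized_weights(z, e):
    """sw_i = Z_i * P(Z=1) / e_i + (1 - Z_i) * P(Z=0) / (1 - e_i)."""
    p_treat = sum(z) / len(z)  # marginal probability of treatment, estimated from the sample
    return [zi * p_treat / ei + (1 - zi) * (1 - p_treat) / (1 - ei) for zi, ei in zip(z, e)]

# Hypothetical data: the first subject was treated despite a propensity of only 0.05.
z = [1, 0, 0, 0, 1]
e = [0.05, 0.05, 0.2, 0.2, 0.6]

w = unstabilized_weights(z, e)   # first weight is about 20
sw = stabilized_weights(z, e)    # first weight shrinks to about 8
```

The numerator pulls extreme weights toward the center, which reduces their variance without changing which subjects are up- or down-weighted within a treatment group.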

### Covariate Adjustment Using the Propensity Score

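The outcome is regressed on the treatment indicator and the estimated propensity score, e.g. in R: `lm(Y~Z+e)`. The coefficient on `Z` is the estimated treatment effect.

This method assumes that the relationship between the propensity score and the outcome has been correctly modeled (e.g., linear).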
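
To make the regression concrete without a linear algebra library, the sketch below recovers the coefficient on `Z` from the linear model `Y ~ Z + e` using the Frisch-Waugh-Lovell theorem: residualize both `Y` and `Z` on `e`, then regress one set of residuals on the other. This is a swapped-in computational shortcut, not something the paper describes; it gives the same coefficient as fitting the full model by OLS. The data and function names are hypothetical.

```python
from statistics import mean

def residualize(v, e):
    """Residuals from a simple OLS regression of v on e (with intercept)."""
    v_bar, e_bar = mean(v), mean(e)
    beta = (sum((ei - e_bar) * (vi - v_bar) for ei, vi in zip(e, v))
            / sum((ei - e_bar) ** 2 for ei in e))
    return [(vi - v_bar) - beta * (ei - e_bar) for ei, vi in zip(e, v)]

def adjusted_effect(y, z, e):
    """Coefficient on Z in the linear model Y ~ Z + e."""
    ry, rz = residualize(y, e), residualize(z, e)
    return sum(a * b for a, b in zip(rz, ry)) / sum(b * b for b in rz)

# Hypothetical data in which the outcome really is linear in the propensity score.
e = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8]
z = [0, 1, 0, 0, 1, 0, 1, 1]
y = [1 + 2 * zi + 3 * ei for zi, ei in zip(z, e)]  # true effect is 2

naive = mean(yi for yi, zi in zip(y, z) if zi == 1) - mean(yi for yi, zi in zip(y, z) if zi == 0)
effect = adjusted_effect(y, z, e)  # recovers 2 up to floating-point error
```

The estimate is exact here only because the toy outcome is linear in `e`; if the true relationship were curved, the same regression would be misspecified.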

### Comparison of the Different Propensity Score Methods

Although performance may differ depending on the analysis, IPTW and propensity score matching seem the most promising of the 4.






## Balance Diagnostics

Comparing the similarity of treated and untreated subjects in the matched sample should begin with a comparison of the means or medians of continuous covariates and the distributions of their categorical counterparts between treated and untreated subjects.

The *standardized* difference can be used to compare the mean of continuous and binary variables between treatment groups (multilevel categorical variables can be represented using a set of binary indicator variables).

For a continuous covariate, the standardized difference is defined as:

$$d=\frac{\bar{x}_{\text{treat}}-\bar{x}_{\text{control}}}{\sqrt{\frac{s_{\text{treat}}^2+s_{\text{control}}^2}{2}}}$$

For dichotomous variables, the standardized difference is:

$$d=\frac{\hat{p}_{\text{treat}}-\hat{p}_{\text{control}}}{\sqrt{\frac{\hat{p}_{\text{treat}}(1-\hat{p}_{\text{treat}})+\hat{p}_{\text{control}}(1-\hat{p}_{\text{control}})}{2}}}$$

While there is no universally accepted threshold for what constitutes good balance, a standardized difference less than 0.1 has been taken to indicate a negligible difference (2001).
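
A minimal sketch of both standardized differences, where the denominator is the square root of the average of the two group variances. The function names and the toy values are hypothetical.

```python
import math
from statistics import mean, variance

def std_diff_continuous(x_treat, x_control):
    """Difference in means divided by the pooled SD (average of the two sample variances)."""
    pooled_sd = math.sqrt((variance(x_treat) + variance(x_control)) / 2)
    return (mean(x_treat) - mean(x_control)) / pooled_sd

def std_diff_binary(p_treat, p_control):
    """Difference in prevalences divided by the pooled binomial SD."""
    pooled_sd = math.sqrt((p_treat * (1 - p_treat) + p_control * (1 - p_control)) / 2)
    return (p_treat - p_control) / pooled_sd

# Hypothetical matched sample: age (continuous) and prevalence of a binary covariate.
age_treat = [60, 62, 64, 66, 68]
age_control = [58, 60, 62, 64, 66]

d_age = std_diff_continuous(age_treat, age_control)  # 2 / sqrt(10), well above 0.1
d_bin = std_diff_binary(0.30, 0.28)                  # below 0.1 in absolute value
```

In this toy sample, age would be flagged as imbalanced under the 0.1 rule of thumb while the binary covariate would not.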

Note that using statistical significance to compare the balance has been heavily criticized because significance levels are confounded with sample size.
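
To see why, hold the imbalance fixed and vary only the sample size. The sketch below assumes equal group sizes and a common standard deviation, and uses a hypothetical standardized difference of 0.05; the function name is my own.

```python
import math

def t_statistic(diff, sd, n_per_group):
    """Two-sample t statistic for equal group sizes and a common SD."""
    return diff / (sd * math.sqrt(2 / n_per_group))

diff, sd = 0.05, 1.0  # the same small imbalance in every scenario
t_by_n = {n: t_statistic(diff, sd, n) for n in (100, 1_000, 10_000)}
# The t statistic grows with sqrt(n): about 0.35, 1.12, and 3.54 respectively.
```

The imbalance never changes, yet it goes from "not significant" to "significant" at the usual 1.96 cutoff purely because the sample grew, whereas the standardized difference stays at 0.05 throughout.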

## Variable Selection for the Propensity Score Model

## Propensity Score Methods Versus Regression Adjustment

Note that a measure of treatment effect is said to be *collapsible* if the conditional and marginal effects coincide.
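
A small worked example helps here. With hypothetical outcome risks in two equally sized strata of a covariate that is independent of treatment (so there is no confounding), the risk difference is the same conditionally and marginally, while the odds ratio is not:

```python
def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Hypothetical outcome risks by treatment arm within two equally sized strata.
strata = [
    {"p_treated": 0.5, "p_control": 0.2},
    {"p_treated": 0.8, "p_control": 0.5},
]

conditional_rd = [s["p_treated"] - s["p_control"] for s in strata]            # 0.3 in each stratum
conditional_or = [odds_ratio(s["p_treated"], s["p_control"]) for s in strata] # 4 in each stratum

# Marginal risks average over the strata (equal sizes, treatment independent of stratum).
marg_p1 = sum(s["p_treated"] for s in strata) / len(strata)  # 0.65
marg_p0 = sum(s["p_control"] for s in strata) / len(strata)  # 0.35

marginal_rd = marg_p1 - marg_p0            # 0.3, same as the conditional value
marginal_or = odds_ratio(marg_p1, marg_p0) # about 3.45, smaller than the conditional 4
```

So in this example the risk difference is collapsible and the odds ratio is not, even though nothing is confounded.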

>My note: Overall, this paper suggests that generally speaking regression adjustment is inferior, due to the issue of model specification in regression as well as higher risk of human-induced "cheating" in regression adjustment. However, as this paper is from 2011, it does not discuss a more recent suggestion that we should use both at the same time to be doubly robust.



## Discussion