# Introduction

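Recall the definition of the propensity score, the probability of receiving treatment given the covariates $x$:
$$ps(x) = \Pr(D=1 \mid x)$$
It is used to get around the high-dimensionality problem in matching: matching directly on a high-dimensional $x$ is impractical, whereas $ps(x)$ is a single number.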

## Using Propensity Score: PSM
For each individual $i$ in the treatment group, we find the individuals in the control group whose $ps$ is the same as, or close to, that of $i$. This procedure creates a pseudo control group of the same size as the treatment group. We can then run a regression or a non-parametric estimation on the matched data, conditioning only on $ps(x)$ instead of the high-dimensional $x$.
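
The matching step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated, the true propensity score is treated as known (in practice it would be estimated), and the helper `match_on_ps` is a hypothetical name. It does one-to-one nearest-neighbour matching with replacement, which is just one of several possible matching rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): one confounder x drives both treatment and outcome.
n = 2000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-(x - 1)))         # true propensity score, assumed known here
d = rng.binomial(1, ps)                  # treatment indicator
y = 2.0 * d + x + rng.normal(size=n)     # the true treatment effect is 2

def match_on_ps(ps, d):
    """For each treated unit, find the control unit with the closest ps."""
    treated = np.flatnonzero(d == 1)
    control = np.flatnonzero(d == 0)
    # |ps_i - ps_j| for every treated i and control j, then take the nearest
    # control for each treated unit (controls may be reused).
    dist = np.abs(ps[treated][:, None] - ps[control][None, :])
    return treated, control[dist.argmin(axis=1)]

treated, matched = match_on_ps(ps, d)

# Mean outcome difference within matched pairs vs. the raw group difference.
psm_estimate = np.mean(y[treated] - y[matched])
naive = y[d == 1].mean() - y[d == 0].mean()
print(f"naive difference: {naive:.2f}, matched estimate: {psm_estimate:.2f}")
```

Because every treated unit is paired with exactly one control, the pseudo control group has the same size as the treatment group. In this simulation the effect is constant, so the matched difference should land much closer to 2 than the naive comparison does.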

### Demerit
Creating a control group by matching has the distressing side effect of throwing away large amounts of data, because the control group is shrunk down to the size of the treatment group. The loss is especially severe when the characteristics of the two groups are very different.

## Using Propensity Score: IPW 
For each value of $x$ (which may be high-dimensional), we can compute $ps(x)$. For a given $x$, some individuals are in the treatment group and some are in the control group. Consider for example $ps(x) = 0.1$, a low probability of entering the treatment. Naturally, for this $x$, the number of individuals in the treatment group is much smaller than the number in the control group. It is as if each individual in the treatment group were more 'important' than those in the control group, so they should be given a 'higher weight', which is exactly the inverse of $ps(x)$. The individuals in the control group are given a 'lower weight', which is exactly the inverse of $(1-ps(x))$.

Once the data have been reweighted in this way, we can again use regression or non-parametric estimation. The merit of IPW is that it keeps all the information in the data. For example, with a non-parametric method, we can calculate
$$ATE = E_x\left(\frac{1(D=1)y}{ps(x)}\right)- E_x\left(\frac{1(D=0)y}{1-ps(x)}\right)$$
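
The formula above translates almost directly into code once the expectations are replaced by sample means. The following is a minimal sketch on simulated data; the data-generating process and the function name `ipw_ate` are assumptions made for illustration, and the true propensity score is used where an estimated one would normally appear.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical): x confounds treatment and outcome.
n = 20000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-0.5 * x))          # true propensity score, assumed known here
d = rng.binomial(1, ps)                   # treatment indicator
y = 2.0 * d + x + rng.normal(size=n)      # the true ATE is 2

def ipw_ate(y, d, ps):
    """Sample analogue of E[1(D=1) y / ps(x)] - E[1(D=0) y / (1 - ps(x))]."""
    treated_term = np.mean(d * y / ps)
    control_term = np.mean((1 - d) * y / (1 - ps))
    return treated_term - control_term

ate_ipw = ipw_ate(y, d, ps)
naive = y[d == 1].mean() - y[d == 0].mean()
print(f"naive difference: {naive:.2f}, IPW estimate: {ate_ipw:.2f}")
```

Every observation contributes to the estimate, which is the sense in which IPW keeps all the data.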


### Demerit
One of the criticisms of this inverse probability of treatment weighting approach is that individual observations can get very high weights and become unduly influential. Consider a lone treated observation that happens to have a very low probability of being treated. The inverse of its propensity score will be extremely large, and it tends to infinity as the propensity score approaches zero. The estimated effect will be dominated by this single observation, and any fluctuation in it will produce wildly varying results, which is an undesirable property.
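
A toy calculation makes the problem concrete. The five treated units below are made-up numbers, and the weights are normalised to sum to one only so that each unit's share is easy to read; this is an illustration, not a recommended estimator.

```python
import numpy as np

# Hypothetical treated units: outcome y and propensity score ps for each.
y = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
ps = np.array([0.5, 0.6, 0.4, 0.5, 0.001])   # the last unit was very unlikely to be treated

weights = 1 / ps
share = weights / weights.sum()               # fraction of the total weight held by each unit
print("weight share per unit:", share.round(3))

def weighted_mean(values, w):
    return np.sum(w * values) / np.sum(w)

# Shift only the last unit's outcome by 1 and see how far the weighted mean moves.
base = weighted_mean(y, weights)
y_shifted = y.copy()
y_shifted[-1] += 1.0
shifted = weighted_mean(y_shifted, weights)
print(f"change in weighted mean: {shifted - base:.3f}")
```

Almost all of the weight sits on the single low-propensity unit, so a change in that one outcome passes through to the weighted mean nearly one-for-one.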

## Common issues with PS:
The predictive quality of the propensity score does not translate into its balancing properties. Maximising the predictive power of the propensity score can even hurt the causal inference goal. The propensity score doesn't need to predict the treatment very well. It just needs to <b>include all the confounding variables</b>.

If we include variables that are very good at predicting the treatment but have no bearing on the outcome, this will actually increase the variance of the propensity score estimator. This is similar to the problem linear regression faces when we include variables correlated with the treatment but not with the outcome.
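
A small simulation can illustrate the variance point. Everything here is an assumption made for the sketch: `x` is a confounder, `z` is a binary variable that strongly shifts treatment but never enters the outcome, and both propensity scores are the true ones (the score given `x` alone is obtained by averaging over the two equally likely values of `z`).

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def ipw_ate(y, d, ps):
    # One IPW estimate per simulated dataset (row).
    return np.mean(d * y / ps - (1 - d) * y / (1 - ps), axis=-1)

reps, n = 500, 1000
x = rng.normal(size=(reps, n))             # confounder: affects treatment and outcome
z = rng.integers(0, 2, size=(reps, n))     # affects treatment only

ps_xz = sigmoid(0.5 * x + 4.0 * (z - 0.5))                        # score given x and z
ps_x = 0.5 * (sigmoid(0.5 * x - 2.0) + sigmoid(0.5 * x + 2.0))    # score given x only

d = rng.binomial(1, ps_xz)
y = 2.0 * d + x + rng.normal(size=(reps, n))   # z does not enter the outcome; true ATE is 2

est_x = ipw_ate(y, d, ps_x)
est_xz = ipw_ate(y, d, ps_xz)
print(f"x only : mean {est_x.mean():.2f}, sd {est_x.std():.3f}")
print(f"x and z: mean {est_xz.mean():.2f}, sd {est_xz.std():.3f}")
```

Both versions are centred on the true effect, but the score that also uses `z` predicts treatment better and yields a noticeably more dispersed estimate, because it pushes more propensity scores towards 0 and 1.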

## Doubly Robust 

We already know how to estimate the $ATE$ using either a non-parametric propensity score method or linear regression. Which one should we use? When in doubt, just use both! Doubly robust estimation combines the propensity score and linear regression in such a way that you don't have to rely on either of them alone: the estimate remains consistent as long as at least one of the two models is correctly specified. See [here](https://matheusfacure.github.io/python-causality-handbook/12-Doubly-Robust-Estimation.html) for a detailed expression for doubly robust estimation.
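
As a rough sketch of the idea, the code below implements the standard augmented IPW form, in which each arm's outcome-model prediction is corrected by an inverse-probability-weighted residual. The simulated data, the helper names `fit_linear` and `doubly_robust_ate`, and the deliberately misspecified models are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (hypothetical): x confounds treatment and outcome.
n = 20000
x = rng.normal(size=n)
ps_true = 1 / (1 + np.exp(-0.5 * x))
d = rng.binomial(1, ps_true)
y = 2.0 * d + x + rng.normal(size=n)      # the true ATE is 2

def fit_linear(features, target):
    """Least-squares fit; returns a function that predicts from new features."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return lambda new: new @ coef

def doubly_robust_ate(y, d, ps, features):
    mu1 = fit_linear(features[d == 1], y[d == 1])(features)   # predicted outcome if treated
    mu0 = fit_linear(features[d == 0], y[d == 0])(features)   # predicted outcome if untreated
    treated_term = np.mean(d * (y - mu1) / ps + mu1)
    control_term = np.mean((1 - d) * (y - mu0) / (1 - ps) + mu0)
    return treated_term - control_term

ones = np.ones((n, 1))
good_features = np.column_stack([ones, x])   # correctly specified outcome model
bad_features = ones                          # misspecified outcome model: ignores x
ps_wrong = np.full(n, 0.5)                   # misspecified propensity score: ignores x

ate_wrong_ps = doubly_robust_ate(y, d, ps_wrong, good_features)
ate_wrong_outcome = doubly_robust_ate(y, d, ps_true, bad_features)
ate_both_wrong = doubly_robust_ate(y, d, ps_wrong, bad_features)
print(f"wrong ps only:      {ate_wrong_ps:.2f}")
print(f"wrong outcome only: {ate_wrong_outcome:.2f}")
print(f"both wrong:         {ate_both_wrong:.2f}")
```

With only one of the two models misspecified the estimate stays close to the true effect; when both are wrong, the protection is lost and the confounding bias returns.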

## Reference:
- [Cross Validated](https://stats.stackexchange.com/questions/293960/how-does-inverse-weighted-propensity-score-regression-differ-from-propensity-sco/421200)
- [A Careful Guide 1](http://freerangestats.info/blog/2017/04/09/propensity-v-regression)
- [A Careful Guide 2](https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html)
- [A YouTube Guide](https://www.youtube.com/watch?v=VJhLaOdpUv0)