# Section 3: RCTs

Today we are going to work through RCTs somewhat interactively in R, with the hope of building familiarity with the type of analysis you might do on your final projects. As in class, we will use the _Cash or Condition_ paper by Baird et al. (2011). Recall that the paper studies a randomized controlled trial in which teenage girls in Malawi were offered a cash transfer conditional on school attendance, an unconditional cash transfer, or no cash transfer. The outcome we will mostly be looking at today is teen pregnancy (`ever_pregnant`), and how it was affected by each type of transfer (`T2a` and `T2b`).

In [2]:
# load packages we will need
pacman::p_load(tidyverse, haven, lfe, texreg, modelsummary)
# load data
df <- read_stata('CorC_Public_Data_FINAL_PubH6445.dta') 
# keep only the panel observations from round 3
df3 <- df %>% subset(panel == TRUE & round == 3)

## What is an RCT?
The main challenge of identifying the causal impact of a policy or program is constructing a valid counterfactual for what would have happened to treated individuals had they not been treated. Of course, we never observe the same individual being both treated and untreated at a given time. So the next best thing is to compare groups of people who were statistically identical before the treatment, after some of them receive the treatment and others don't. We can do this by randomizing who gets treated. If treatment is random, then as long as our sample is large enough, we can be almost positive that the _only_ systematic difference between the treatment and control groups is the effect of the treatment.

When feasible, randomization is the most rigorous approach to construct a treatment and a control group from among an eligible population (see Duflo, 2006; Banerjee and Duflo, 2009).

How do we analyze an RCT? If we want to know the **average treatment effect (ATE)** of a program, all we need to do is compare the average value of our outcome in the treatment group to its average value in the control group.

$$ ATE = \bar Y_T - \bar Y_C$$

We can also do this in regression form, which gives us standard errors as well:

$$ Y_i = \alpha + \beta Treat_i + u_i $$
where $\beta$ is the ATE. 
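
To see that the regression really does reproduce the difference in means, here is a minimal sketch on simulated data (not the Malawi data). The variable names, sample size, and the effect size of -0.05 are arbitrary values chosen for illustration.

```r
# Simulated example: a binary outcome and a randomly assigned binary treatment.
set.seed(1)
n     <- 1000
treat <- rbinom(n, 1, 0.5)                  # random assignment
y     <- rbinom(n, 1, 0.25 - 0.05 * treat)  # made-up effect of -0.05

# ATE as a simple difference in group means
diff_means <- mean(y[treat == 1]) - mean(y[treat == 0])

# ATE as the slope coefficient from an OLS regression on the treatment dummy
fit      <- lm(y ~ treat)
beta_hat <- unname(coef(fit)["treat"])

# The two numbers agree up to floating-point error
all.equal(diff_means, beta_hat)

# The regression also reports a standard error for the estimate
summary(fit)$coefficients
```

The intercept in this regression is the control-group mean, so $\hat\alpha + \hat\beta$ is the treatment-group mean.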

Of course, in our Malawi case, we have two mutually exclusive treatments, but this doesn't complicate things much. We can still estimate the ATE of each of the conditional (CCT) and unconditional (UCT) cash transfers:

$$ EverPregnant_i = \alpha + \beta_1 CCT_i + \beta_2 UCT_i + u_i $$

Here $\beta_1$ is the average treatment effect of receiving a conditional cash transfer and $\beta_2$ is the average treatment effect of receiving an unconditional cash transfer on teen pregnancy.

1. Compute $\beta_1$ and $\beta_2$ without a regression by comparing means across groups.
2. Estimate $\beta_1$ and $\beta_2$ from a regression. 
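
Before trying this on `df3`, it may help to see the same logic with two treatment arms on simulated data. This is only a sketch: the arm labels, the data frame `sim`, and the effect sizes are invented for illustration, and on the real data you would use the treatment dummies `T2a` and `T2b` and the outcome `ever_pregnant` instead.

```r
# Simulated three-arm experiment: control, CCT, and UCT (made-up effects).
set.seed(2)
n   <- 1500
arm <- sample(c("control", "cct", "uct"), n, replace = TRUE)

sim <- data.frame(
  cct = as.integer(arm == "cct"),  # dummy for the conditional arm
  uct = as.integer(arm == "uct")   # dummy for the unconditional arm
)
sim$y <- rbinom(n, 1, 0.25 - 0.03 * sim$cct - 0.08 * sim$uct)

# (1) Differences in means relative to the control group
means    <- tapply(sim$y, arm, mean)
b1_means <- unname(means["cct"] - means["control"])
b2_means <- unname(means["uct"] - means["control"])

# (2) The same quantities from a regression on both treatment dummies
fit    <- lm(y ~ cct + uct, data = sim)
b1_reg <- unname(coef(fit)["cct"])
b2_reg <- unname(coef(fit)["uct"])

c(b1_means = b1_means, b1_reg = b1_reg, b2_means = b2_means, b2_reg = b2_reg)
```

Because the two dummies are mutually exclusive and the control group is the omitted category, each coefficient is exactly that arm's mean minus the control mean.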



In [3]:
# First check the names of what we're working with
names(df3)

## A Few Practical Concerns

Most household surveys come with sampling weights. These weights give each observation an importance inversely proportional to the probability that it was included in the sample. They are needed because it is usually not feasible to survey a group of people that is exactly representative of the population of interest; applying the weights makes the sample representative of the broader population. We can implement this in R by passing the column containing our sampling weights, `wgt`, to the `weights` argument of `felm`.
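
As a rough illustration of what the weights do, the sketch below uses simulated data and base R's `lm` (which also takes a `weights` argument) rather than `felm` and the real `wgt` column. The weights and effect size are made up. It shows that a weighted regression on a treatment dummy is just a difference in weighted means.

```r
# Simulated data with arbitrary sampling weights
set.seed(3)
n     <- 1000
treat <- rbinom(n, 1, 0.5)
w     <- runif(n, 0.5, 3)              # hypothetical sampling weights
y     <- rbinom(n, 1, 0.25 - 0.05 * treat)

# Difference in weighted means between treatment and control
wdiff <- weighted.mean(y[treat == 1], w[treat == 1]) -
         weighted.mean(y[treat == 0], w[treat == 0])

# Weighted least squares on the treatment dummy
fit_w  <- lm(y ~ treat, weights = w)
beta_w <- unname(coef(fit_w)["treat"])

# Unweighted estimate for comparison
beta_u <- unname(coef(lm(y ~ treat))["treat"])

c(weighted_diff = wdiff, weighted_reg = beta_w, unweighted_reg = beta_u)
```

In this simulation the weights are unrelated to the outcome, so the weighted and unweighted estimates should be close; in real survey data they can differ more.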

We also want to make sure we're using the right standard errors for statistical inference. The default standard errors we get from `felm` or other regression functions are almost never the ones we actually want, because they assume the errors are independent and homoskedastic. For RCTs, best practice is to cluster standard errors at the level at which treatment was assigned. Clustering allows the error term to be arbitrarily correlated within clusters; for example, we might expect unobserved factors affecting pregnancy to be more highly correlated among girls from the same village. Clustered standard errors are generally larger, and so more conservative, than unadjusted ones. In this case, treatment was assigned at the community level, so we can cluster on the variable `eaid`. In `felm`, we specify this by putting the clustering variable in the fourth part of the formula, e.g. `y ~ x | 0 | 0 | eaid` (for now at least).
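
To make the idea concrete, here is a sketch that computes cluster-robust standard errors by hand in base R on simulated data. It uses the standard sandwich formula with a common small-sample correction; this is an illustration of the technique, not necessarily the exact adjustment `felm` applies. The number of clusters, cluster size, and effect size are arbitrary.

```r
# Simulated data: treatment assigned at the cluster level, with a shared
# cluster-level shock that makes errors correlated within clusters.
set.seed(42)
G       <- 50                               # number of clusters
m       <- 20                               # observations per cluster
cluster <- rep(seq_len(G), each = m)
treat_g <- sample(rep(0:1, each = G / 2))   # half of the clusters treated
treat   <- treat_g[cluster]
shock_g <- rnorm(G)                         # unobserved cluster shock
y       <- 1 + 0.5 * treat + shock_g[cluster] + rnorm(G * m)

fit <- lm(y ~ treat)
X   <- model.matrix(fit)
e   <- residuals(fit)

# Sandwich estimator: (X'X)^{-1} [ sum_g X_g' e_g e_g' X_g ] (X'X)^{-1}
bread  <- solve(crossprod(X))
scores <- rowsum(X * e, group = cluster)    # one row per cluster: X_g' e_g
meat   <- crossprod(scores)

# Common finite-sample correction
N   <- nrow(X)
K   <- ncol(X)
adj <- (G / (G - 1)) * ((N - 1) / (N - K))

vcov_cl <- adj * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_cluster   <- sqrt(diag(vcov_cl))
names(se_cluster) <- colnames(X)

rbind(classical = se_classical, clustered = se_cluster)
```

Note that clustering changes only the standard errors: the coefficient estimates come from the same `lm` fit either way.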

3. Re-run your regression applying the appropriate sampling weights. Then also cluster your standard errors at the community (`eaid`) level. How does this affect the standard errors? How does this affect the coefficients?

## Tests and Adjustments

Now that we've got the mechanics down, should we believe these estimates we've just computed? The key assumption is that the randomization did in fact result in statistically identical groups. Unfortunately, we can't test this assumption directly because there are many variables we can't observe. But as a smell test, we can at least check whether observable attributes (that are either measured before the treatment or fixed over time) differ significantly across groups.

4. Statistically test whether the baseline variables `age_R1`, `highest_grade_bl`, `asset_index_bl` and `never_had_sex_bl` vary across groups. First compare CCT to control and then UCT to control. 
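
One way to run such a balance test is sketched below on simulated data, with a made-up baseline covariate `age_bl`. It compares one treatment arm to control in two equivalent ways: a two-sample t-test with pooled variance, and a regression of the baseline variable on the treatment dummy. This ignores weights and clustering, which you would want to add on the real data.

```r
# Simulated baseline covariate that is unrelated to treatment by construction
set.seed(7)
n   <- 1200
arm <- sample(c("control", "cct", "uct"), n, replace = TRUE)
sim <- data.frame(
  arm    = arm,
  cct    = as.integer(arm == "cct"),
  age_bl = rnorm(n, mean = 15, sd = 2)   # hypothetical baseline age
)

# Compare CCT to control only, so drop the UCT arm
sub <- sim[sim$arm %in% c("control", "cct"), ]

# (a) Two-sample t-test assuming equal variances
tt <- t.test(age_bl ~ cct, data = sub, var.equal = TRUE)

# (b) Regression of the baseline covariate on the treatment dummy
fit     <- lm(age_bl ~ cct, data = sub)
reg_est <- unname(coef(fit)["cct"])
reg_p   <- summary(fit)$coefficients["cct", "Pr(>|t|)"]

# Difference in means (CCT minus control) and p-values from both approaches
c(ttest_diff = unname(diff(tt$estimate)), reg_diff = reg_est,
  ttest_p = tt$p.value, reg_p = reg_p)
```

The regression version is convenient because it extends directly to weighted estimation and clustered standard errors.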

Okay, so we see some statistically significant differences between the treatment and control groups. That's potentially a concern. (In the case of this paper, it seems that attrition, i.e. people dropping out of the study, is correlated with treatment and is causing these compositional differences. Attrition comes with its own set of issues that we won't discuss here.) Do we necessarily have to worry about these differences though? Not really. As long as we believe the randomization was done properly, we can control for these attributes in order to hold them fixed across groups.

Another advantage of adding (pre-determined) controls is that they increase the precision of our treatment effect estimates by soaking up a lot of the additional variation that was not explained by the model without controls. 
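
The precision gain is easy to see in a simulation. In the sketch below, a made-up baseline covariate `x_bl` explains much of the variation in the outcome; the names and coefficients are arbitrary and chosen only to make the effect visible.

```r
# Simulated data: outcome depends strongly on a pre-determined covariate
set.seed(4)
n     <- 2000
treat <- rbinom(n, 1, 0.5)
x_bl  <- rnorm(n)                         # hypothetical baseline covariate
y     <- 1 - 0.2 * treat + 2 * x_bl + rnorm(n)

fit_no   <- lm(y ~ treat)                 # without controls
fit_ctrl <- lm(y ~ treat + x_bl)          # with the baseline control

se_no   <- summary(fit_no)$coefficients["treat", "Std. Error"]
se_ctrl <- summary(fit_ctrl)$coefficients["treat", "Std. Error"]

# Both regressions estimate the same treatment effect, but the second has
# less unexplained variation and therefore a smaller standard error.
rbind(
  no_controls   = c(estimate = unname(coef(fit_no)["treat"]),   se = se_no),
  with_controls = c(estimate = unname(coef(fit_ctrl)["treat"]), se = se_ctrl)
)
```

Because `x_bl` is independent of treatment here, adding it should move the point estimate only slightly while shrinking the standard error.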

5. Add the covariates you used above to the regression, but use the dummy variables `_Iage_R1_14-20` instead of `age_R1` and also control for `stratum1` and `stratum2`.
   
   a. Are these additional variables statistically significant? Is this surprising or not?
   
   b. How do the treatment effects you estimate compare to the ones you estimated without controls? Does this suggest that the sample imbalances you found were affecting our results or not?
   
   c. What happens to the precision of our estimates when we add controls?
   

Note that we've only been using the third round of data so far. The nice thing about an RCT is that we don't even need pre-period data because the only difference between groups post-treatment is the effect of the treatment. But we can still use pre-period data, again both as a robustness check and to increase precision, by running a diff-in-diff. Randomization ensures that trends in pregnancy (like everything else) are the same across groups, so we get parallel trends for free.
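
As a reminder of the mechanics, here is a minimal diff-in-diff sketch on a simulated two-period panel with a single treatment. The variable names (`treat`, `post`) and effect sizes are invented; on the real data you would build `post` from `round` and use both treatment dummies.

```r
# Simulated two-period panel: everyone is observed before and after treatment
set.seed(5)
n     <- 800
treat <- rbinom(n, 1, 0.5)

panel <- data.frame(
  id    = rep(seq_len(n), times = 2),
  treat = rep(treat, times = 2),
  post  = rep(0:1, each = n)              # 0 = baseline, 1 = follow-up
)
# Common time trend of 0.2 and a made-up treatment effect of -0.3
panel$y <- 1 + 0.2 * panel$post - 0.3 * panel$treat * panel$post +
           rnorm(2 * n)

# Diff-in-diff from the four cell means
cell      <- with(panel, tapply(y, list(treat, post), mean))
did_means <- (cell["1", "1"] - cell["1", "0"]) -
             (cell["0", "1"] - cell["0", "0"])

# Diff-in-diff as the interaction coefficient in a regression
fit     <- lm(y ~ treat * post, data = panel)
did_reg <- unname(coef(fit)["treat:post"])

c(did_means = did_means, did_reg = did_reg)
```

The coefficient on `treat` alone captures any baseline difference between groups, which under randomization should be close to zero.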

6. Estimate treatment effects using the full dataset `df`. Assume the "post" period is round 3. How have the coefficients changed? Is this what you expected?