--- 
Project for the course in Microeconometrics | Summer 2020, M.Sc. Economics, Bonn University | Julia Wilhelm

# Replication of F. Barrera-Osorio, M. Bertrand, L. L. Linden, F. Perez-Calle  (2011) <a class="tocSkip">   
---

This notebook contains my replication of the results from the following paper:

> Barrera-Osorio, Felipe, Marianne Bertrand, Leigh L. Linden, and Francisco Perez-Calle. 2011. "Improving the Design of Conditional Transfer Programs: Evidence from a Randomized Education Experiment in Colombia." American Economic Journal: Applied Economics, 3 (2): 167-95. 

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels as sm
import scipy.stats as ss
import pandas.io.formats.style
import statsmodels.formula.api as smf
import statsmodels.api as sm_api

In [4]:
from auxiliary.auxiliary_tables import *

---
# 1. Introduction 
---

Barrera-Osorio et al. (2011) compare three education-based conditional cash transfer designs aimed at incentivizing academic participation. Using data from a pilot study in Bogota, Colombia, they examine the effects of a bi-monthly transfer (basic treatment), a bi-monthly transfer combined with a lump-sum payment at the time students are supposed to re-enroll in school (savings treatment), and a bi-monthly transfer combined with a large payment upon graduation (tertiary treatment). The payments are conditional on the child's school attendance and are designed to prevent dropout from secondary schools and to encourage matriculation at tertiary institutions.

To estimate and compare the causal impact of the three treatments, Barrera-Osorio et al. (2011) apply a difference model to data from Bogota. The Secretary of Education of the City implemented a one-year pilot study in which treatments were randomly allocated to children in two localities. Randomization took place at the child level, generating variation within schools and families. This allows the authors to assess the comparability of different groups. Barrera-Osorio et al. (2011) find that all designs significantly increase attendance and that the savings and tertiary treatments increase enrollment rates more strongly than the basic treatment. They conclude that the structure of the intervention can help target resources.

In this notebook, I replicate the results presented in the paper by Barrera-Osorio et al. (2011).

This notebook is structured as follows. 

---
# 2. Identification
--- 


Barrera-Osorio et al. (2011) ask how three different education-based conditional cash transfer designs perform in preventing dropout from secondary schools and encouraging matriculation at tertiary institutions. 
The treatments were implemented in two localities in Bogota, San Cristobal and Suba. Since treatment effects cannot be observed at the individual level, the authors estimate average effects using treatment and control groups. Eligible children in San Cristobal were randomly assigned to a control group, the basic treatment (bi-monthly transfer) or the savings treatment (bi-monthly transfer combined with a lump-sum payment at the time students are supposed to re-enroll in school). In Suba, eligible children were randomly assigned to the tertiary treatment (bi-monthly transfer combined with a large payment upon graduation) or a control group. For each individual $i$ we can imagine a potential outcome where they are treated, $Y_i(1)$, and one where they are not, $Y_i(0)$, but we can never observe both outcomes for the same individual. The random treatment assignment allows the authors to estimate the causal effects of the three treatments experimentally. While they can directly compare the effects of the basic and savings treatments, they cannot rely on purely random variation when comparing either of them with the tertiary treatment. This is because the tertiary treatment was implemented in a different locality, so that comparison occurs across experiments.

Since treatments were assigned randomly within the two localities, potential outcomes are independent of the treatment indicator $D$ and the selection bias is eliminated. The naive estimate, which simply compares the observed average outcomes of the treatment and control groups, then equals the true average treatment effect: 

\begin{align*}
  E[Y\mid D = 1] - E[Y\mid D = 0] & = E[Y^1\mid D = 1] - E[Y^0\mid D = 0] \\
                                  & =E[Y^1\mid D = 1]  - E[Y^0\mid D = 1] + E[Y^0\mid D = 1] - E[Y^0\mid D = 0]  \\
                                  & = \underbrace{E[Y^1 - Y^0\mid D = 1]}_{ATT} + \underbrace{E[Y^0\mid D= 1]- E[Y^0 \mid D = 0]}_{\text{Selection bias}} \\
                                  & =E[Y^1 - Y^0\mid D = 1] \\
                                  & =E[Y^1 - Y^0\mid D = 0] \\
                                  & =E[Y^1 - Y^0]
\end{align*}

The authors here rely on the following two assumptions:
\begin{align*}
E[Y^1\mid D = 1] &= E[Y^1\mid D = 0] \\
E[Y^0\mid D = 1] &= E[Y^0\mid D = 0]
\end{align*}
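
The decomposition above can be checked numerically. The following is a minimal sketch on simulated data, not on the data of the paper: all numbers (baseline outcome, effect size, the unobserved trait `u`) are hypothetical and chosen only for illustration. It compares the naive difference in means under random assignment with the same difference under self-selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical potential outcomes: an unobserved trait u shifts both of them.
u = rng.normal(size=n)
y0 = 0.70 + 0.10 * u      # Y^0, outcome without treatment
y1 = y0 + 0.05            # Y^1, outcome with treatment (true effect 0.05)
true_ate = np.mean(y1 - y0)


def naive_difference(d):
    """Difference in observed mean outcomes between treated and untreated."""
    y = np.where(d == 1, y1, y0)  # only one potential outcome is observed
    return y[d == 1].mean() - y[d == 0].mean()


# Random assignment: D is independent of (Y^0, Y^1).
d_random = rng.integers(0, 2, size=n)

# Self-selection: individuals with high u are more likely to be treated.
d_selected = (u + rng.normal(size=n) > 0).astype(int)

naive_random = naive_difference(d_random)
naive_selected = naive_difference(d_selected)

print(f"true ATE:             {true_ate:.3f}")
print(f"naive, randomized:    {naive_random:.3f}")
print(f"naive, self-selected: {naive_selected:.3f}")
```

Under random assignment the naive difference stays close to the true effect, while under self-selection it also picks up $E[Y^0\mid D=1] - E[Y^0\mid D=0]$, the selection-bias term.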

The causal graphs below illustrate the relationship between the treatments $D_B$, $D_S$, $D_T$ and outcome $Y$ in the two localities. Additionally there may be observables $W$ and unobservables $U$ also affecting $Y$. Due to random treatment assignment within the two localities, treatment is independent of $W$ and $U$ and there is no back-door path which has to be eliminated.

**San Cristobal:**
![ERROR:Here should be causal graph 1](files/CausalGraph_1.png)
$D_B$: Basic treatment  
$D_S$: Savings treatment  
$Y$: Student outcome  
$U$: Unobservables  
$W$: Observables

**Suba:**
![ERROR:Here should be causal graph 2](files/CausalGraph_2.png)
$D_T$: Tertiary treatment  
$Y$: Student outcome  
$U$: Unobservables  
$W$: Observables

The identifying assumption for the causal effects within each locality is that randomization was successful. Barrera-Osorio et al. (2011) assess this by checking whether treatment assignment created balanced treatment and control groups in terms of household- and individual-level characteristics. This information was collected prior to the randomization, so students in each group should, on average, have similar characteristics if randomization worked. The authors make 60 comparisons and find 7 differences that are statistically significant at the 10 percent level, 5 at the 5 percent level and 2 at the 1 percent level. They conclude that randomization of the treatment assignment was successful.
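
To put these counts into perspective, the sketch below computes how many significant differences one would expect by chance alone among 60 comparisons. It assumes, purely for illustration, that the 60 tests are independent, which is unlikely to hold exactly for correlated household characteristics; the helper `prob_at_least` is my own and not part of the paper or of the auxiliary module.

```python
from math import comb

n_comparisons = 60  # number of balance comparisons reported in the text


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Expected number of rejections if every null hypothesis is true.
# Assumption for illustration only: the tests are independent.
expected_by_chance = {level: n_comparisons * level for level in (0.10, 0.05, 0.01)}

for level, expected in expected_by_chance.items():
    print(f"{level:.0%} level: about {expected:.1f} significant differences expected by chance")

# Chance of at least 7 rejections at the 10 percent level under the same assumption.
print(f"P(at least 7 of 60 at the 10% level): {prob_at_least(7, n_comparisons, 0.10):.2f}")
```

Under this simplifying assumption, roughly 6 rejections at the 10 percent level would be expected even with perfect randomization, so the reported 7 is close to the chance benchmark.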

However, when the experiments in the two localities are considered together in order to compare the effects of all three treatments, the causal graph has three back-door paths which have to be eliminated. Treatment assignment is then not completely random, since it is not random whether a person lives in Suba or San Cristobal. Observable or unobservable factors may affect treatment assignment and the outcome at the same time. The causal graph then looks as follows:

![ERROR:Here should be causal graph 3](files/CausalGraph_3.png)
$D_B$: Basic treatment  
$D_S$: Savings treatment  
$D_T$: Tertiary treatment  
$Y$: Student outcome  
$U$: Unobservables  
$W$: Observables

In order to eliminate the back-door paths, Barrera-Osorio et al. (2011) control for a large set of observable demographic characteristics. Nevertheless, differences between the tertiary treatment and the other treatments could be due to unobserved heterogeneity in treatment effects. I return to this problem later.

---
# 3. Empirical Strategy
---
Barrera-Osorio et al. (2011) examine the impact of the basic, the savings and the tertiary treatment on student outcomes. They first use a simple difference model that compares different subsets of the sample without controlling for covariates. 

For the basic-savings experiment in San Cristobal the specification takes the following form:

\begin{equation}
y_{ij} = \beta_0 + \beta_B Basic_i + \beta_S Savings_i + \epsilon_{ij} 
\end{equation}

For the tertiary experiment in Suba the specification takes the following form:

\begin{equation}
y_{ij} = \beta_0 + \beta_T Tertiary_i + \epsilon_{ij} 
\end{equation}
* $y_{ij}$ denotes a particular outcome for child $i$ in school $j$,
* $Basic_i$, $Savings_i$ and $Tertiary_i$ are indicator variables for whether or not the child is in the respective treatment group,
* $\epsilon_{ij}$ is the error term, which is allowed to vary up to the school level.
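
Because the simple difference model contains only a constant and treatment indicators, its OLS coefficients are differences in group means relative to the control group. The sketch below illustrates this for the basic-savings specification on simulated data; the sample size, outcome and effect sizes are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000

# Hypothetical sample: 0 = control, 1 = basic, 2 = savings, assigned at random.
group = rng.integers(0, 3, size=n)
basic = (group == 1).astype(float)
savings = (group == 2).astype(float)

# Hypothetical binary outcome, e.g. an enrollment indicator.
p = 0.70 + 0.03 * basic + 0.04 * savings
y = (rng.random(n) < p).astype(float)

# OLS of y on a constant and the two treatment indicators.
X = np.column_stack([np.ones(n), basic, savings])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_B, beta_S = beta

# The same quantities computed directly from group means.
mean_control = y[group == 0].mean()
diff_basic = y[group == 1].mean() - mean_control
diff_savings = y[group == 2].mean() - mean_control

print(f"beta_0 = {beta_0:.4f}, control mean       = {mean_control:.4f}")
print(f"beta_B = {beta_B:.4f}, basic - control    = {diff_basic:.4f}")
print(f"beta_S = {beta_S:.4f}, savings - control  = {diff_savings:.4f}")
```

The sketch does not compute standard errors. If the error structure described above is read as clustering by school, statsmodels offers cluster-robust standard errors through `fit(cov_type="cluster", cov_kwds={"groups": ...})`.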

The authors additionally use a difference estimator that controls for socio-demographic and school characteristics.
For the basic-savings experiment the model is specified as follows:

\begin{equation}
y_{ij} = \beta_0 + \beta_B Basic_i + \beta_S Savings_i + \delta X_{ijk} + \phi_{j} + \epsilon_{ij} 
\end{equation}

For the tertiary treatment the model is specified as follows:

\begin{equation}
y_{ij} = \beta_0 + \beta_T Tertiary_i + \delta X_{ijk} + \phi_{j} + \epsilon_{ij} 
\end{equation}

The variables are defined as before. Additionally,
* $X_{ijk}$ is a vector of socio-demographic controls for child $i$ in school $j$ and family $k$,
* $\phi_{j}$ are school fixed effects.
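
The school fixed effects $\phi_j$ can be estimated either by including one dummy per school or by demeaning all variables within schools; both give the same coefficients on the remaining regressors. The following sketch shows this equivalence on simulated data with a single hypothetical treatment dummy and a single hypothetical control. It is an illustration of the technique, not the estimation code of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_schools = 2000, 20

school = rng.integers(0, n_schools, size=n)        # school index j
treat = rng.integers(0, 2, size=n).astype(float)   # hypothetical treatment dummy
x = rng.normal(size=n)                             # hypothetical control variable
phi = rng.normal(size=n_schools)[school]           # school fixed effect
y = 0.05 * treat + 0.20 * x + phi + rng.normal(scale=0.5, size=n)


def demean_by(values, groups):
    """Subtract the group-specific mean from each observation."""
    group_means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - group_means[groups]


# (a) Within estimator: demean outcome and regressors by school.
X_within = np.column_stack([demean_by(treat, school), demean_by(x, school)])
beta_within, *_ = np.linalg.lstsq(X_within, demean_by(y, school), rcond=None)

# (b) Dummy-variable estimator: one indicator per school, no separate constant.
dummies = (school[:, None] == np.arange(n_schools)).astype(float)
X_dummy = np.column_stack([treat, x, dummies])
beta_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

print("within estimator:", beta_within)
print("dummy estimator: ", beta_dummy[:2])
```

Note that with a full set of school dummies the constant $\beta_0$ is not separately identified, which is why the dummy regression in the sketch omits it.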

---
# 4. Replication of Barrera-Osorio et al. (2011)
---

## 4.1. Data & Descriptive Statistics
Barrera-Osorio et al. (2011) restrict their sample of students spread across 251 schools to the 68 schools with the largest number of registered children. In addition, they keep only those students who completed the baseline survey they conducted. They end up with a sample of 7158 children.

In [9]:
data = pd.read_stata('data/Public_Data_AEJApp_2010-0132.dta')
data.index.name = "individual"

#### Table 1- Distribution of Subjects by Research Groups

In [7]:
create_table1(data)

| Experiment | Basic-Savings | Basic-Savings | Basic-Savings | Tertiary | Tertiary | Total |
|:--|--:|--:|--:|--:|--:|--:|
| Group | Basic | Control | Savings | Control | Tertiary | |
| Grade 11 | 188 | 179 | 177 | 160 | 148 | 852 |
| Grades 6-8 | 1215 | 1189 | 1166 | 0 | 0 | 3570 |
| Grades 9-10 | 633 | 643 | 586 | 449 | 425 | 2736 |
| Female | 1022 | 1047 | 1000 | 361 | 336 | 3766 |
| Male | 1014 | 964 | 929 | 248 | 237 | 3392 |
| Total | 2036 | 2011 | 1929 | 609 | 573 | 7158 |
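
The function `create_table1` lives in the auxiliary module and is not shown here. As a rough sketch of how a table of this shape could be built, the example below uses `pd.crosstab` on a toy data frame. The column names `experiment`, `group` and `grade` are hypothetical and need not match the variable names in the actual data set.

```python
import pandas as pd

# Toy data with hypothetical column names; the real data set uses its own.
toy = pd.DataFrame(
    {
        "experiment": ["Basic-Savings"] * 4 + ["Tertiary"] * 2,
        "group": ["Basic", "Control", "Savings", "Basic", "Control", "Tertiary"],
        "grade": [
            "Grades 6-8",
            "Grades 6-8",
            "Grades 9-10",
            "Grade 11",
            "Grade 11",
            "Grades 9-10",
        ],
    }
)

# Rows: grade bracket; columns: (experiment, group); cells: number of children.
# margins=True appends a "Total" row and a "Total" column.
counts = pd.crosstab(
    toy["grade"],
    [toy["experiment"], toy["group"]],
    margins=True,
    margins_name="Total",
)
print(counts)
```

Rows for gender could be produced the same way and appended to the grade rows.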


First, Barrera-Osorio et al. (2011) check that the randomization created balanced research groups. To do so, they compare characteristics of students between research groups.

#### Table 2- Comparison of Students between Research Groups