--- 
Project for the course in Microeconometrics | Summer 2020, M.Sc. Economics, Bonn University | Julia Wilhelm

# Replication of F. Barrera-Osorio, M. Bertrand, L. L. Linden, F. Perez-Calle  (2011) <a class="tocSkip">   
---

This notebook contains my replication of the results from the following paper:

> Barrera-Osorio, Felipe, Marianne Bertrand, Leigh L. Linden, and Francisco Perez-Calle. 2011. "Improving the Design of Conditional Transfer Programs: Evidence from a Randomized Education Experiment in Colombia." American Economic Journal: Applied Economics, 3 (2): 167-95. 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels as sm
import scipy as sc
import scipy.stats as ss
import pandas.io.formats.style
import statsmodels.formula.api as smf
import statsmodels.api as sm_api

In [2]:
from auxiliary.auxiliary_tables import *

---
# 1. Introduction 
---

Barrera-Osorio et al. (2011) compare three education-based conditional cash transfer designs aimed at incentivizing academic participation. Using data from a pilot study in Bogota, Colombia, they examine the effects of a bi-monthly transfer (basic treatment), a bi-monthly transfer combined with a lump-sum payment at the time students are supposed to re-enroll in school (savings treatment) and a bi-monthly transfer combined with a large payment upon graduation (tertiary treatment). The payments are conditional on the child's school attendance and are designed to prevent dropout from secondary schools and to encourage matriculation at tertiary institutions.

To estimate and compare the causal impact of the three treatments, Barrera-Osorio et al. (2011) apply a difference model to data from Bogota. The Secretary of Education of the City implemented a one-year pilot study in which treatments were randomly allocated to children in two localities. Randomization took place at the child level, which generates variation within schools and families. This allows the authors to assess the comparability of the different groups. Barrera-Osorio et al. (2011) find that all designs significantly increase attendance and that the savings and tertiary treatments increase enrollment rates more strongly than the basic treatment. They conclude that the structure of the intervention can help target resources.

In this notebook, I replicate the results presented in the paper by Barrera-Osorio et al. (2011).

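This notebook is structured as follows. Section 2 discusses the identification strategy, Section 3 presents the empirical strategy, and Section 4 replicates the main results of the paper.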

---
# 2. Identification
--- 


Barrera-Osorio et al. (2011) aim to answer the question of how three different education-based conditional cash transfer designs perform in preventing dropout from secondary schools and encouraging matriculation at tertiary institutions. 
The different treatments were implemented in two localities in Bogota, San Cristobal and Suba. Since it is impossible to observe treatment effects at the individual level, researchers estimate average effects using treatment and control groups. Eligible children in San Cristobal were randomly assigned to a control group, the basic treatment (bi-monthly transfer) or the savings treatment (bi-monthly transfer combined with a lump-sum payment at the time students are supposed to re-enroll in school). In Suba, eligible children were randomly assigned to the tertiary treatment (bi-monthly transfer combined with a large payment upon graduation) or a control group. For each individual $i$ we can imagine a potential outcome $Y_i^1$ where they are treated and a potential outcome $Y_i^0$ where they are not, but we can never observe both outcomes for the same individual. The random treatment assignment allows the authors to experimentally estimate the causal effects of the three treatments. While they can directly compare the effects of the basic and savings treatments, they cannot rely on purely random variation when comparing those with the tertiary treatment. This is because the tertiary treatment was implemented in another locality, so the comparison with the tertiary treatment occurs across experiments.

Since treatments were assigned randomly within the two localities, potential outcomes are independent of treatment status $D$ and the selection bias is eliminated. The naive estimate, which simply compares the observed average outcomes of the treatment and control groups, then equals the true average treatment effect: 

\begin{align*}
  E[Y\mid D = 1] - E[Y\mid D = 0] & = E[Y^1\mid D = 1] - E[Y^0\mid D = 0] \\
                                  & =E[Y^1\mid D = 1]  - E[Y^0\mid D = 1] + E[Y^0\mid D = 1] - E[Y^0\mid D = 0]  \\
                                  & = \underbrace{E[Y^1 - Y^0\mid D = 1]}_{ATT} + \underbrace{E[Y^0\mid D= 1]- E[Y^0 \mid D = 0]}_{\text{Selection bias}} \\
                                  & =E[Y^1 - Y^0\mid D = 1] \\
                                  & =E[Y^1 - Y^0\mid D = 0] \\
                                  & =E[Y^1 - Y^0]
\end{align*}

The authors here rely on the following two assumptions:
\begin{align*}
E[Y^1\mid D = 1] &= E[Y^1\mid D = 0] \\
E[Y^0\mid D = 1] &= E[Y^0\mid D = 0]
\end{align*}
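
The decomposition above can be illustrated with a small simulation. This is only a sketch on synthetic data: the variable `ability`, the constant effect of 0.1 and the selection rule are assumptions made for illustration and are not taken from the paper or its data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical potential outcomes: y0 without treatment, y1 with treatment.
ability = rng.normal(size=n)
y0 = 0.5 + 0.3 * ability + rng.normal(scale=0.5, size=n)
y1 = y0 + 0.1  # constant true effect of 0.1 (assumption of this sketch)
true_ate = np.mean(y1 - y0)


def naive_difference(d):
    """Difference in observed mean outcomes between treated and untreated."""
    y_obs = np.where(d == 1, y1, y0)
    return y_obs[d == 1].mean() - y_obs[d == 0].mean()


# Case 1: random assignment, D is independent of (Y^0, Y^1).
d_random = rng.integers(0, 2, size=n)
naive_random = naive_difference(d_random)

# Case 2: self-selection, children with higher ability are treated more often.
d_selected = (ability + rng.normal(size=n) > 0).astype(int)
naive_selected = naive_difference(d_selected)

print(f"true ATE:                      {true_ate:.3f}")
print(f"naive estimate, randomized:    {naive_random:.3f}")
print(f"naive estimate, self-selected: {naive_selected:.3f}")
```

Under random assignment the naive comparison should land close to the true effect, while under self-selection the term $E[Y^0\mid D=1] - E[Y^0\mid D=0]$ is positive and inflates the estimate.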

The causal graphs below illustrate the relationship between the treatments $D_B$, $D_S$, $D_T$ and outcome $Y$ in the two localities. Additionally there may be observables $W$ and unobservables $U$ also affecting $Y$. Due to random treatment assignment within the two localities, treatment is independent of $W$ and $U$ and there is no back-door path which has to be eliminated.

**San Cristobal:**
![ERROR:Here should be causal graph 1](files/CausalGraph_1.png)
$D_B$: Basic treatment  
$D_S$: Savings treatment  
$Y$: Students outcome  
$U$: Unobservables  
$W$: Observables

**Suba:**
![ERROR:Here should be causal graph 2](files/CausalGraph_2.png)
$D_T$: Tertiary treatment  
$Y$: Students outcome  
$U$: Unobservables  
$W$: Observables

The assumption needed to identify causal effects here is that randomization within the localities was successful. Barrera-Osorio et al. (2011) verify this by checking whether treatment assignment created balanced treatment and control groups, using household- and individual-level characteristics. This information was collected prior to the randomization, so if randomization worked, students in each group should, on average, have similar characteristics. The authors make 60 comparisons and find 7 differences that are statistically significant at the 10 percent level, 5 at the 5 percent level and 2 at the 1 percent level. They conclude that randomization of the treatment assignment is successful.

However, considering the experiment in the two localities together in order to compare the effects of all three treatments, the causal graph has three back-door paths which have to be eliminated. Treatment assignment then is not completely random, since it is not random whether a person lives in Suba or San Cristobal. Observable or unobservable factors may affect treatment assignment and the outcome at the same time. The causal graph the authors use then looks as follows:

![ERROR:Here should be causal graph 3](files/CausalGraph_3.png)
$D_B$: Basic treatment  
$D_S$: Savings treatment  
$D_T$: Tertiary treatment  
$Y$: Students outcome  
$U$: Unobservables  
$W$: Observables

To eliminate the back-door paths, Barrera-Osorio et al. (2011) control for a large set of observable demographic characteristics. Nevertheless, differences between the tertiary treatment and the other treatments could still be due to unobserved heterogeneity in treatment effects. I return to this problem later.
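
To make the back-door argument concrete, the following sketch simulates a single observable $W$ that affects both treatment status and the outcome. All numbers and the logistic assignment rule are assumptions for illustration. The sketch also only covers observables: an unobserved $U$ on the same kind of path could not be adjusted for in this way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical observable W (e.g. a household wealth index) that differs
# across localities and also affects the outcome.
w = rng.normal(size=n)
# Treatment probability rises with W, which opens the path D <- W -> Y.
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-w))).astype(float)
true_effect = 0.05
y = 0.6 + true_effect * d + 0.2 * w + rng.normal(scale=0.3, size=n)


def ols(columns, outcome):
    """OLS coefficients for an intercept plus the given regressor columns."""
    X = np.column_stack([np.ones(len(outcome))] + columns)
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


naive = ols([d], y)[1]        # back-door path left open
adjusted = ols([d, w], y)[1]  # conditioning on W blocks the path

print(f"true effect: {true_effect:.3f}")
print(f"naive:       {naive:.3f}")
print(f"adjusted:    {adjusted:.3f}")
```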

---
# 3. Empirical Strategy
---
Barrera-Osorio et al. (2011) examine the impact of the basic, the savings and the tertiary treatment on student outcomes. They first use a simple difference model that compares different subsets of the sample without controlling for covariates. 

For the basic-savings experiment in San Cristobal the specification takes the following form:

\begin{equation}
y_{ij} = \beta_0 + \beta_B Basic_i + \beta_S Savings_i + \epsilon_{ij} 
\end{equation}

For the tertiary experiment in Suba the specification takes the following form:

\begin{equation}
y_{ij} = \beta_0 + \beta_T Tertiary_i + \epsilon_{ij} 
\end{equation}
* $y_{ij}$ denotes a particular outcome for child $i$ in school $j$,
* $Basic_i$, $Savings_i$ and $Tertiary_i$ are indicator variables for whether or not the child is in the respective treatment group,
* $\epsilon_{ij}$ is the error term, which is allowed to vary up to the school level.
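
In this simple specification, the OLS coefficients on the treatment indicators are just differences in group means relative to the control group. The sketch below checks this on synthetic data; the column names `group`, `basic`, `savings` and `y` and all effect sizes are hypothetical and unrelated to the replication data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 3_000

# Synthetic stand-in for the Basic-Savings experiment (not the real data).
df = pd.DataFrame({"group": rng.choice(["Control", "Basic", "Savings"], size=n)})
df["basic"] = (df["group"] == "Basic").astype(int)
df["savings"] = (df["group"] == "Savings").astype(int)
effect = df["group"].map({"Control": 0.0, "Basic": 0.03, "Savings": 0.03})
df["y"] = 0.8 + effect + rng.normal(scale=0.2, size=n)

# y_ij = beta_0 + beta_B * Basic_i + beta_S * Savings_i + e_ij
res = smf.ols("y ~ basic + savings", data=df).fit()

group_means = df.groupby("group")["y"].mean()
diff_basic = group_means["Basic"] - group_means["Control"]
diff_savings = group_means["Savings"] - group_means["Control"]

print(res.params)
print(f"Basic - Control:   {diff_basic:.4f}")
print(f"Savings - Control: {diff_savings:.4f}")
```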

The authors additionally use a difference estimator that controls for socio-demographic and school characteristics.
For the basic-savings experiment the model is specified as follows:

\begin{equation}
y_{ij} = \beta_0 + \beta_B Basic_i + \beta_S Savings_i + \delta X_{ijk} + \phi_{j} + \epsilon_{ij} 
\end{equation}

For the tertiary treatment the model is specified as follows:

\begin{equation}
y_{ij} = \beta_0 + \beta_T Tertiary_i + \delta X_{ijk} + \phi_{j} + \epsilon_{ij} 
\end{equation}

The variables are defined as before. Additionally,
* $X_{ijk}$ is a vector of socio-demographic controls for child $i$ in school $j$ and family $k$,
* $\phi_{j}$ are school fixed effects.
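
A specification of this kind could be estimated with the statsmodels formula interface as sketched below. The data are synthetic, `age` stands in for the control vector $X_{ijk}$, and clustering the standard errors at the school level is my reading of an error term that "is allowed to vary up to the school level", so it should be treated as an assumption. The helper functions in `auxiliary/auxiliary_tables.py` may be implemented differently.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n, n_schools = 4_000, 40

# Synthetic data with hypothetical column names (not the replication data).
df = pd.DataFrame({
    "school": rng.integers(0, n_schools, size=n),
    "age": rng.integers(11, 18, size=n),
    "group": rng.choice(["Control", "Basic", "Savings"], size=n),
})
df["basic"] = (df["group"] == "Basic").astype(int)
df["savings"] = (df["group"] == "Savings").astype(int)
school_effect = rng.normal(scale=0.05, size=n_schools)[df["school"].to_numpy()]
df["y"] = (0.8 + 0.03 * df["basic"] + 0.03 * df["savings"]
           - 0.01 * (df["age"] - 14) + school_effect
           + rng.normal(scale=0.2, size=n))

# C(school) adds school fixed effects phi_j; standard errors are clustered
# at the school level (assumption, see the text above).
res = smf.ols("y ~ basic + savings + age + C(school)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school"]}
)

# Test of H0: beta_B = beta_S, analogous to the "H0: Basic-Savings" rows.
test = res.f_test("basic = savings")
f_stat = float(np.squeeze(test.fvalue))
p_value = float(np.squeeze(test.pvalue))

print(res.params[["basic", "savings"]])
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```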

---
# 4. Replication of Barrera-Osorio et al. (2011)
---

## 4.1. Data & Descriptive Statistics
Barrera-Osorio et al. (2011) restrict their sample of students, who are spread across 251 schools, to the 68 schools with the largest number of registered children. In addition, they keep only those students who completed the baseline survey. For the tertiary experiment they drop students in grades 6-8, since those were not eligible for the program.

In [3]:
# Load the public data set accompanying the paper
data = pd.read_stata('data/Public_Data_AEJApp_2010-0132.dta')
data.index.name = "individual"
# Classify students into grade groups
data['grade_group'] = 'Grades 6-8'
data.loc[data['grade'] > 8, 'grade_group'] = 'Grades 9-10'
data.loc[data['grade'] > 10, 'grade_group'] = 'Grade 11'
# Label the research group of each student
data['group'] = 'Control'
data.loc[data['T1_treat'] == 1, 'group'] = 'Basic'
data.loc[data['T2_treat'] == 1, 'group'] = 'Savings'
data.loc[data['T3_treat'] == 1, 'group'] = 'Tertiary'
# Drop students in grades 6-8 in Suba (not eligible for the tertiary treatment)
sample = data.drop(data[(data.suba == 1) & (data.grade < 9)].index)
# Integer-coded versions of categorical baseline variables
sample['s_teneviv_int'] = sample['s_teneviv'].cat.codes + 1
sample['s_sexo_int'] = sample['s_sexo'].cat.codes
sample['s_estcivil_int'] = sample['s_estcivil'].cat.codes + 1
# Keep only students observed in the baseline survey
sample_baselinesurvey = sample.drop(sample[sample.bl_observed == 0].index)

Table 1 summarizes the distribution of children by grade, gender and experimental group. The authors end up with a sample of 7,158 children.

#### Table 1 - Distribution of Subjects by Research Groups

In [4]:
create_table1(sample_baselinesurvey)

| | Basic-Savings: Basic | Basic-Savings: Control | Basic-Savings: Savings | Tertiary: Control | Tertiary: Tertiary | Total |
|---|---|---|---|---|---|---|
| Grade 11 | 188 | 179 | 177 | 160 | 148 | 852 |
| Grades 6-8 | 1215 | 1189 | 1166 | 0 | 0 | 3570 |
| Grades 9-10 | 633 | 643 | 586 | 449 | 425 | 2736 |
| Female | 1022 | 1047 | 1000 | 361 | 336 | 3766 |
| Male | 1014 | 964 | 929 | 248 | 237 | 3392 |
| Total | 2036 | 2011 | 1929 | 609 | 573 | 7158 |
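
The helper `create_table1` is defined in `auxiliary/auxiliary_tables.py` and is not shown in this notebook. As a rough sketch of how a comparable count table could be built, the example below cross-tabulates a tiny made-up data frame with `pd.crosstab`. The columns `grade_group` and `group` mirror the ones created above, but the rows are invented and the actual helper may work differently.

```python
import pandas as pd

# Tiny made-up example, not the replication data.
toy = pd.DataFrame({
    "grade_group": ["Grades 6-8", "Grades 6-8", "Grades 9-10",
                    "Grade 11", "Grades 9-10", "Grades 6-8"],
    "group": ["Basic", "Control", "Savings", "Basic", "Control", "Savings"],
})

# Count students per grade group and research group, with row/column totals.
by_grade = pd.crosstab(
    toy["grade_group"], toy["group"], margins=True, margins_name="Total"
)
print(by_grade)
```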


## 4.2. Baseline Comparison
Barrera-Osorio et al. (2011) check that the randomization created balanced research groups. To do so, they compare characteristics of students between research groups. Table 2 shows the control group averages of 15 different variables in the Basic-Savings experiment (B-S) and the tertiary experiment (T). The four other columns contain the 60 comparisons between groups. The standard errors are in the row below each difference, labeled "SE". 7 differences are statistically significant at the 10 percent level, 5 at the 5 percent level, and 2 at the 1 percent level. The authors take this as evidence that the treatments were assigned randomly.
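
One way to put these counts into perspective is to compare them with the number of rejections expected by chance alone. The sketch below does this with a binomial benchmark. It assumes that the 60 tests are independent and that each count refers to all differences significant at that level; neither is stated in the text, so the numbers are only a rough guide. The function `chance_benchmark` is a hypothetical helper written for this illustration.

```python
from scipy import stats


def chance_benchmark(n_tests, alpha, n_significant):
    """Expected number of rejections under H0 and P(X >= n_significant).

    X ~ Binomial(n_tests, alpha), i.e. independent tests with all null
    hypotheses true (assumption of this sketch).
    """
    expected = n_tests * alpha
    # binom.sf(k, ...) is P(X > k), so shift by one to get P(X >= k).
    tail = stats.binom.sf(n_significant - 1, n_tests, alpha)
    return expected, tail


for alpha, n_sig in {0.10: 7, 0.05: 5, 0.01: 2}.items():
    expected, tail = chance_benchmark(60, alpha, n_sig)
    print(f"alpha={alpha:.2f}: observed {n_sig}, expected {expected:.1f}, "
          f"P(X >= {n_sig}) = {tail:.3f}")
```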

#### Table 2 - Comparison of Students between Research Groups

In [5]:
sancristobal = sample.drop(sample[sample.suba == 1].index)
suba = sample.drop(sample[sample.suba == 0].index)
create_table2(sancristobal, suba)

| Variable | Control average B-S | Basic-Control | Savings-Control | Basic-Savings | Control average T | Tertiary-Control |
|---|---|---|---|---|---|---|
| Possessions | 1.9 | 0.07 | 0.04 | 0.03 | 1.94 | -0.05 |
| Possessions SE | 1.1 | 0.02 | 0.02 | 0.02 | 1.02 | 0.04 |
| Utilities | 4.65 | -0.02 | 0.06 | -0.08 | 4.85 | 0.04 |
| Utilities SE | 1.42 | 0.03 | 0.03 | 0.03 | 1.32 | 0.04 |
| Durable Goods | 1.37 | -0.02 | 0.01 | -0.03 | 1.63 | 0.02 |
| Durable Goods SE | 0.89 | 0.02 | 0.02 | 0.02 | 0.86 | 0.03 |
| Physical Infrastructure | 11.65 | -0.05 | 0.04 | -0.09 | 12.14 | -0.05 |
| Physical Infrastructure SE | 1.75 | 0.03 | 0.03 | 0.04 | 1.49 | 0.06 |
| Age | 14.38 | 0.09 | -0.06 | 0.16 | 15.67 | -0.06 |
| Age SE | 5.3 | 0.1 | 0.14 | 0.17 | 4.23 | 0.19 |


## 4.2. Results

### 4.2.1 Attendance
First, the authors analyse the effect of the conditional cash transfers on the school attendance rate. They include only individuals who are enrolled in one of the 68 schools selected for surveying. In addition, they exclude students in grade 11 from the enrollment estimations, since these students should graduate rather than re-enroll. To be consistent, they also restrict the attendance estimates to students in grades 6-10. To replicate their results, I regress the school attendance rate on the treatment indicators without control variables, with demographic controls, and with demographic controls and school fixed effects. The first three columns of Table 3 show the results for the Basic-Savings experiment in San Cristobal, while columns 4 to 6 show the results for the tertiary experiment in Suba. The last column shows the results of a regression containing all three treatments, demographics and school fixed effects. The estimated treatment effects and their standard errors ("SE") are provided in rows 1-6, and the test statistics from comparisons of the relative treatment effects and their p-values are in rows 7-10.

#### Table 3 - Effects on monitored school attendance rates

In [5]:
sancristobal = sample.drop(sample[sample.suba == 1].index)
suba = sample.drop(sample[sample.suba == 0].index)
sancristobal = sancristobal.drop(sancristobal[(sancristobal.survey_selected == 0) | (sancristobal.grade == 11)].index)
suba = suba.drop(suba[(suba.survey_selected == 0) | (suba.grade == 11) | (suba.grade < 9)].index)
sample_survey = sample.drop(sample[(sample.survey_selected == 0) | (sample.grade == 11) | (sample.grade < 9)].index)
create_table34(sancristobal, suba, sample_survey, 'at_msamean')

| | Basic-Savings | Basic-Savings with demographics | Basic-Savings with demographics and school fixed effects | Tertiary | Tertiary with demographics | Tertiary with demographics and school fixed effects | Both |
|---|---|---|---|---|---|---|---|
| Basic treatment | 0.033 | 0.032 | 0.032 | | | | 0.024 |
| Basic treatment SE | 0.007 | 0.008 | 0.007 | | | | 0.01 |
| Savings treatment | 0.029 | 0.028 | 0.029 | | | | 0.03 |
| Savings treatment SE | 0.008 | 0.008 | 0.008 | | | | 0.011 |
| Tertiary treatment | | | | 0.052 | 0.054 | 0.054 | 0.054 |
| Tertiary treatment SE | | | | 0.018 | 0.017 | 0.017 | 0.017 |
| H0: Basic-Savings F-Stat | 0.312 | 0.333 | 0.291 | | | | 0.153 |
| p-value | 0.581 | 0.569 | 0.594 | | | | 0.698 |
| H0: Tertiary-Basic F-Stat | | | | | | | 2.274 |
| p-value | | | | | | | 0.139 |


The table shows:
- Basic treatment increases attendance by 3.3 percentage points (significant at the one percent level)
- Savings treatment increases attendance by 2.9 percentage points (significant at the one percent level)
- Tertiary treatment increases attendance by 5.2 percentage points (significant at the one percent level)
- no evidence that the treatments have different effects

My regression results are very similar to those Barrera-Osorio et al. report in their paper. The test statistics for the differences between treatments deviate from theirs, but neither in the paper nor in my replication are these differences statistically significant.

### 4.2.2 Re-enrollment
Second, the authors analyse the effect of the conditional cash transfers on re-enrollment. Table 4 has the same layout as Table 3, with the observed re-enrollment rate as the dependent variable.

In [4]:
sancristobal = sample.drop(sample[(sample.suba == 1) | (sample.grade == 11)].index)
suba = sample.drop(sample[(sample.suba == 0) | (sample.grade == 11) | (sample.grade < 9)].index)
sancristobal = sancristobal[sancristobal['m_enrolled'].notna()]
suba = suba[suba['m_enrolled'].notna()]
sample_grade = sample.drop(sample[(sample.grade == 11) | (sample.grade < 9)].index)
sample_grade = sample_grade[sample_grade['m_enrolled'].notna()]
create_table34(sancristobal, suba, sample_grade, 'm_enrolled')

  


| | Basic-Savings | Basic-Savings with demographics | Basic-Savings with demographics and school fixed effects | Tertiary | Tertiary with demographics | Tertiary with demographics and school fixed effects | Both |
|---|---|---|---|---|---|---|---|
| Basic treatment | 0.017 | 0.017 | 0.014 | | | | 0.001 |
| Basic treatment SE | 0.009 | 0.009 | 0.009 | | | | 0.016 |
| Savings treatment | 0.045 | 0.046 | 0.042 | | | | 0.021 |
| Savings treatment SE | 0.016 | 0.015 | 0.012 | | | | 0.018 |
| Tertiary treatment | | | | 0.042 | 0.04 | 0.036 | 0.039 |
| Tertiary treatment SE | | | | 0.022 | 0.021 | 0.019 | 0.019 |
| H0: Basic-Savings F-Stat | 3.99 | 3.895 | 4.091 | | | | 1.047 |
| p-value | 0.048 | 0.05 | 0.045 | | | | 0.307 |
| H0: Tertiary-Basic F-Stat | | | | | | | 1.836 |
| p-value | | | | | | | 0.177 |


The table shows:
- Basic treatment increases re-enrollment by 1.7 percentage points (significant at the 10 percent level)
- Savings treatment increases re-enrollment by 4.5 percentage points (significant at the one percent level)
- Tertiary treatment increases re-enrollment by 4.2 percentage points without controls and by 3.6 percentage points with demographics and school fixed effects (both significant at the 10 percent level)
- difference in magnitude of the basic and savings treatment effects is statistically significant at the 5 percent level
- no evidence that the tertiary and the basic treatment effects are different
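
The significance levels listed above can be checked roughly from the printed coefficients and standard errors. The sketch below uses the estimates without controls from the table and a normal approximation to the ratio of coefficient to standard error. Because the inputs are rounded to three digits and the regressions may use a t distribution, the p-values are approximate.

```python
from scipy import stats

# (coefficient, standard error) as printed in the table, columns without controls
estimates = {
    "Basic": (0.017, 0.009),
    "Savings": (0.045, 0.016),
    "Tertiary": (0.042, 0.022),
}


def two_sided_p(coef, se):
    """Two-sided p-value from a normal approximation to coef / se."""
    return 2 * stats.norm.sf(abs(coef / se))


p_values = {name: two_sided_p(c, s) for name, (c, s) in estimates.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```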

Again, my regression results are very similar to those Barrera-Osorio et al. report in their paper. For the difference tests, my result deviates for the basic-savings comparison in column 3: the F-statistic reported in the paper is 5.52 with a p-value of 0.02, while I obtain an F-statistic of 4.091 with a p-value of 0.045. The interpretation of the test remains the same. For the test statistics in the last column, my results differ more strongly from those of Barrera-Osorio et al., but neither in the paper nor in my replication are they statistically significant.

### 4.2.3 Heterogeneity

In [67]:
#h = 0.075
#xmin = 0.2
#xmax = 0.95
#st1 = (xmax-xmin)/(gsize-1)
#sample['xgrid1'] = xmin + ((sample.index-1)*st1)
#sample.loc[sample.index > gsize, 'xgrid1'] = 0
#ic = 1
#while ic <= gsize:
#sample['z'] = abs((sample['en_baseline']-sample['xgrid1'])/h)
#sample = sample.drop(sample[sample.z > 1].index)
#sample['kz'] = (3/4)*(1-sample['z']**2)/h
#sample['x_mod'] = (sample['en_baseline'] - sample['xgrid1']) * np.sqrt(sample['kz'])
#sample['const_mod'] = sample['kz']**0.5
#sample['y_mod'] = sample['m_enrolled']*(sample['kz']**0.5)
#sample.drop(sample[sample.kz == 0].index)
#sample.drop(sample[sample.const_mod == 0].index)
#reg = sm_api.OLS(sample['y_mod'], sm_api.add_constant(sample[['const_mod','x_mod']])).fit()
#sample['den_control'] = reg.params[2]
#sample['en_control'] = reg.params[1]
#reg.params
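
The commented-out draft above seems to set up a kernel-weighted local linear regression: an Epanechnikov kernel with bandwidth `h = 0.075`, evaluated on a grid between 0.2 and 0.95, with the data multiplied by the square root of the kernel weights so that OLS on the transformed data is weighted least squares. The sketch below is a self-contained version of that idea on synthetic data. It is my reading of the draft, not the authors' procedure; `x` and `y` only stand in for `en_baseline` and `m_enrolled`, and the grid size of 16 is arbitrary because `gsize` is not defined in the draft.

```python
import numpy as np


def local_linear(x, y, grid, h):
    """Local linear regression with an Epanechnikov kernel.

    Returns the fitted level and slope at each grid point.
    """
    level = np.full(len(grid), np.nan)
    slope = np.full(len(grid), np.nan)
    for i, x0 in enumerate(grid):
        z = np.abs(x - x0) / h
        inside = z < 1  # the kernel is zero outside the bandwidth
        if inside.sum() < 2:
            continue
        # Square root of the Epanechnikov weights 0.75 * (1 - z^2) / h
        sw = np.sqrt(0.75 * (1 - z[inside] ** 2) / h)
        # Weighted constant and weighted centred regressor
        X = np.column_stack([sw, (x[inside] - x0) * sw])
        coef, *_ = np.linalg.lstsq(X, y[inside] * sw, rcond=None)
        level[i], slope[i] = coef
    return level, slope


# Synthetic data with a linear relationship (assumption of this sketch).
rng = np.random.default_rng(4)
x = rng.uniform(0.2, 0.95, size=5_000)
y = 0.5 + 0.4 * x + rng.normal(scale=0.05, size=x.size)

grid = np.linspace(0.2, 0.95, 16)
level, slope = local_linear(x, y, grid, h=0.075)
print(np.round(level, 3))
```

The coefficient on the weighted constant is the fitted value at the grid point, and the coefficient on the centred regressor is the local slope.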

### 4.2.4 Survey-Based Outcomes - Graduation and Tertiary Enrollment
Barrera-Osorio et al. (2011) use data from a follow-up survey which was conducted after the treatments were implemented to analyze the effects of each treatment on self-reported graduation and tertiary enrollment for students who were in grade 11.

In [4]:
create_table5(sample)

| | Graduation Basic-Savings | Graduation Tertiary | Graduation Both | Tertiary enrollment Basic-Savings | Tertiary enrollment Tertiary | Tertiary enrollment Both |
|---|---|---|---|---|---|---|
| Basic treatment | 0.039 | | 0.035 | 0.042 | | 0.051 |
| Basic treatment SE | 0.041 | | 0.041 | 0.031 | | 0.029 |
| Savings treatment | 0.044 | | 0.042 | 0.084 | | 0.088 |
| Savings treatment SE | 0.031 | | 0.029 | 0.03 | | 0.031 |
| Tertiary treatment | | 0.023 | 0.025 | | 0.491 | 0.49 |
| Tertiary treatment SE | | 0.038 | 0.031 | | 0.043 | 0.037 |
| H0: Basic-Savings F-Stat | 0.014 | | 0.036 | 1.372 | | 1.214 |
| p-value | 0.906 | | 0.851 | 0.256 | | 0.278 |
| H0: Tertiary-Basic F-Stat | | | 0.036 | | | 105.652 |
| p-value | | | 0.85 | | | 0.0 |
