# <span style="color:darkblue"> Lecture 12a: Analyzing Experiments </span>

<font size = "5">



# <span style="color:darkblue"> I. Import Libraries </span>


In [1]:
# The "pandas" library is used for processing datasets
# The "numpy" library is for numeric operations and random numbers
# The "matplotlib.pyplot" library is for creating graphs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# We will "alias" two sublibraries in "statsmodels"
# "statsmodels.formula.api" contains functions to estimate models
# "statsmodels.api" contains general-use statistical options

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

<font size = "5">

Import data

In [2]:
dataset = pd.read_stata("data_raw/malawiexperiment.dta")

# <span style="color:darkblue"> II. Context </span>


<font size = "5">

Today we will review a paper by Rebecca Dizon-Ross published <br>
in the American Economic Review (2019).

- In this study, researchers partnered with local schools in Malawi <br>
- This study evaluated the impacts of information about children’s <br>
 academic performance on parents’ subsequent investments in their <br>
  children’s education.


https://www.povertyactionlab.org/evaluation/effects-student-performance-information-parental-decision-making-malawi?lang=fr

https://www.nber.org/papers/w24610

<font size = "5">

Intervention

- Parents in Malawi with low literacy levels had trouble interpreting <br>
school report cards. Many parents were unaware that their children were <br>
struggling in school.
- The intervention changed how schools engaged with parents and <br>
 reduced these information gaps.


<font size = "5">

Experimental Design

- Students were randomly assigned to treatment and control <br>
with 50% probability
- The random assignment was done at the household level
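
<font size = "3">

The household-level assignment described above can be sketched as follows. This is only a minimal illustration on made-up data, not the study's actual randomization procedure: the toy households, the seed, and the reuse of the column names `hhid` and `treat` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 households with 2 children each
toy = pd.DataFrame({"hhid": np.repeat(np.arange(1, 7), 2)})

# Draw ONE coin flip per household (not per child), with probability 0.5
rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
households = pd.DataFrame({"hhid": toy["hhid"].unique()})
households["treat"] = np.where(rng.random(len(households)) < 0.5,
                               "Treatment", "Control")

# Merge the household-level assignment back onto the child-level data
toy = toy.merge(households, on="hhid", how="left")

# Count how many distinct treatment statuses each household has
statuses_per_household = toy.groupby("hhid")["treat"].nunique()
print(toy)
print(statuses_per_household.max())
```

Because the draw happens once per household, siblings always share the same treatment status, which is what "assignment at the household level" means.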

<font size = "5">

Findings


<img src="figures/treatmenteffects_dizonross.png" alt="drawing" width="650"/>



<font size = "5">

At baseline (before the intervention)

- The graph on the left shows parental beliefs at baseline
- Parents of low-performing students believed their children were doing <br>
better than they actually were. If beliefs were accurate, the points <br>
would lie along the 45-degree line
- Treated and control groups look similar at baseline, as expected <br>
under randomization

At endline (after the intervention)

- Treated parents had more accurate perceptions of their children's <br>
performance
- The treatment effects varied depending on the baseline test scores


# <span style="color:darkblue"> II. Basic Descriptive Analysis </span>


<font size = "5">

Total number of children

In [3]:
len(dataset)

5268

<font size = "5">

Total number of households

- 2 children per household

In [4]:
unique_ids = pd.unique(dataset["hhid"])
len(unique_ids)

2634
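
<font size = "3">

The counts above are consistent with two children per household, since 5268 / 2634 = 2. One way to check this household by household is to count rows per identifier. The sketch below uses a small made-up table as a stand-in; with the real data the same calls would presumably be applied to `dataset`.

```python
import pandas as pd

# Hypothetical toy data standing in for the child-level dataset
toy = pd.DataFrame({"hhid": [101, 101, 102, 102, 103, 103]})

# Number of distinct households (same result as len(pd.unique(...)))
n_households = toy["hhid"].nunique()

# Number of children (rows) in each household
children_per_household = toy.groupby("hhid").size()

print(n_households)
print(children_per_household.value_counts())
```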

<font size = "5">

Calculate number of treated and control

In [5]:
table = pd.crosstab(index = dataset['treat'],columns = "count")
table

| treat | count |
|---|---|
| Control | 2654 |
| Treatment | 2614 |


# <span style="color:darkblue"> III. Testing Covariate Balance </span>


<font size = 5>


Diagnose dataset columns


In [6]:
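# Store the column names in a data frame for inspection
column_types_data = pd.DataFrame(dataset.columns)

# Inspect the name of the third column (position 2)
column_types_data.iloc[2,:]

# Keep the treatment indicator, baseline scores, and endline beliefs
data_analysis = dataset[["treat","ave","u_ave"]]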

<font size = "5">

Subset treated and control observations

In [7]:
dataset_treated = dataset.query('treat == "Treatment"')
dataset_control = dataset.query('treat == "Control"')

<font size = "5">

Socio-economic information can be collected at baseline <br>
(before the experiment) 

In [9]:
variables_scores      = ["ave"]
variables_respondent  = ["lit","primary_resp_fem","age_par1","farmer"]
variables_household   = ["tot_kids","one_par"]
variables_student     = ["std","age","female","attendance_sv"]

<font size = "5" >

Check that characteristics are similar between treated and control <br>
at baseline

In [10]:
# Compute mean and standard deviation for the treated group
display(dataset_treated[variables_respondent].describe().loc[['mean', 'std']])

# Compute mean and standard deviation for the control group
display(dataset_control[variables_respondent].describe().loc[['mean', 'std']])

Treated group

| | lit | primary_resp_fem | age_par1 | farmer |
|---|---|---|---|---|
| mean | 0.675613 | 0.75899 | 40.97408 | 0.460587 |
| std | 0.468235 | 0.427778 | 11.290328 | 0.498541 |

Control group

| | lit | primary_resp_fem | age_par1 | farmer |
|---|---|---|---|---|
| mean | 0.667426 | 0.773926 | 40.645455 | 0.465544 |
| std | 0.471225 | 0.418366 | 10.638597 | 0.498907 |
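
<font size = "3">

Comparing two separate tables by eye is tedious, so it can help to put the group means side by side and compute their difference. The sketch below does this on simulated data; the data-generating numbers and the seed are assumptions chosen only to resemble the variables above, and the standardized difference shown is one common convention rather than anything prescribed by the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a treatment indicator and two baseline covariates
rng = np.random.default_rng(seed=1)  # arbitrary seed
n = 1000
toy = pd.DataFrame({
    "treat": rng.choice(["Control", "Treatment"], size=n),
    "lit": rng.binomial(1, 0.67, size=n),
    "age_par1": rng.normal(41, 11, size=n),
})
covariates = ["lit", "age_par1"]

# Group means and standard deviations: rows are covariates, columns are groups
means = toy.groupby("treat")[covariates].mean().T
sds = toy.groupby("treat")[covariates].std().T

# Raw and standardized differences (Treatment minus Control)
balance = pd.DataFrame({
    "mean_control": means["Control"],
    "mean_treated": means["Treatment"],
    "diff": means["Treatment"] - means["Control"],
})
pooled_sd = np.sqrt((sds["Treatment"] ** 2 + sds["Control"] ** 2) / 2)
balance["std_diff"] = balance["diff"] / pooled_sd
print(balance)
```

Dividing by the pooled standard deviation puts all covariates on the same scale, so a binary variable like `lit` and a continuous one like `age_par1` can be compared directly.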


<font size = "5">



<font size = "5" >

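Conduct a formal test of whether the group means are similar

- We regress a baseline covariate on the treatment indicator
- If randomization worked, we should expect the coefficient on the <br>
treatment variable to be small and statistically insignificant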

In [11]:
reg_model = smf.ols("lit ~ treat ", dataset)
results = reg_model.fit(cov_type= "HC1")

print(summary_col(results,
                  stars = True))


                      lit   
----------------------------
Intercept          0.6674***
                   (0.0092) 
treat[T.Treatment] 0.0082   
                   (0.0130) 
R-squared          0.0001   
R-squared Adj.     -0.0001  
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
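
<font size = "3">

In the output above, the intercept (0.6674) matches the control-group mean of `lit`, and the treatment coefficient (0.0082) is the difference between the treated and control means. Its ratio to the standard error, 0.0082 / 0.0130 ≈ 0.63, is far below the usual 1.96 threshold. The sketch below illustrates the link between the regression and the difference in means on simulated data; the toy data and seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with a binary covariate and random assignment
rng = np.random.default_rng(seed=2)  # arbitrary seed
n = 2000
toy = pd.DataFrame({
    "treat": rng.choice(["Control", "Treatment"], size=n),
    "lit": rng.binomial(1, 0.67, size=n),
})

# Balance regression with heteroskedasticity-robust (HC1) standard errors
results = smf.ols("lit ~ treat", data=toy).fit(cov_type="HC1")

# Difference in group means computed by hand
group_means = toy.groupby("treat")["lit"].mean()
diff_in_means = group_means["Treatment"] - group_means["Control"]

# The regression coefficient and the hand-computed difference coincide
print(results.params["treat[T.Treatment]"], diff_in_means)
print(results.pvalues["treat[T.Treatment]"])
```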


<font size = "5">

Try it yourself!

<font size = "3">

- Obtain summary statistics for the mean and standard deviation for <br>
the other sets of baseline variables


In [17]:
# Write your own code

# Filter the dataset for numerical columns only
numerical_columns = dataset.select_dtypes(include=['float64', 'int64']).columns

# Calculate summary statistics (mean and standard deviation) for the numerical columns
summary_statistics_numerical = dataset[numerical_columns].agg(['mean', 'std']).transpose()

# Display the summary statistics for the first few variables as an example
summary_statistics_numerical.head()




| | mean | std |
|---|---|---|
| part3_33_c | -6.0 | 4.64758 |
| part5_5_13_a | 2.651376 | 4.301031 |
| part5_5_15_a | 2.453333 | 4.00098 |
| part5_5_17_a | 2.713615 | 3.316646 |
| part5_5_13_b | 2.764977 | 6.41628 |


<font size = "5">

Try it yourself!

<font size = "3">

- Write a loop that runs different regressions of baseline covariates <br>
on the treatment variable. This can help you automate the process of <br>
testing for covariate balance

In [24]:
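# Write your own code
# Note: "treat" stores the strings "Control"/"Treatment". Passing it directly
# to sm.OLS raises "ValueError: Pandas data cast to numpy dtype of object",
# so we use the formula API, which creates the treatment dummy automatically.

# Collect the baseline covariates defined earlier (assumed to be numeric)
baseline_covariates = (variables_respondent + variables_household +
                       variables_student)

# Initialize a dictionary to store regression results
regression_results = {}

# Regress each baseline covariate on the treatment variable
for covariate in baseline_covariates:
    model = smf.ols(f"{covariate} ~ treat", data = dataset).fit(cov_type = "HC1")
    regression_results[covariate] = {
        'Coefficient': model.params['treat[T.Treatment]'],
        'P-value': model.pvalues['treat[T.Treatment]']
    }

# Display the regression results as a table (one row per covariate)
pd.DataFrame(regression_results).transpose()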

# <span style="color:darkblue"> IV. Calculating Average Treatment Effect </span>


- Make sure to use robust standard errors
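
<font size = "3">

To see why robust standard errors matter, the sketch below compares classical and HC1 standard errors on simulated data where the outcome is noisier for treated units. Everything here is an assumption made for illustration: the data are made up, and the groups are deliberately unequal in size (25% treated, unlike the 50/50 design of the study) so that the gap between the two standard errors is easy to see.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: smaller, noisier treated group
rng = np.random.default_rng(seed=3)  # arbitrary seed
n = 2000
treated = (rng.random(n) < 0.25).astype(int)
noise_sd = np.where(treated == 1, 20.0, 5.0)   # unequal variances by group
toy = pd.DataFrame({
    "treat": np.where(treated == 1, "Treatment", "Control"),
    "y": 60 - 7 * treated + rng.normal(0, 1, size=n) * noise_sd,
})

model = smf.ols("y ~ treat", data=toy)
classical = model.fit()                 # assumes constant error variance
robust = model.fit(cov_type="HC1")      # allows the variance to differ

# Same point estimate, different standard errors
name = "treat[T.Treatment]"
print(classical.params[name], robust.params[name])
print(classical.bse[name], robust.bse[name])
```

The choice of covariance estimator changes only the standard errors and p-values, never the estimated coefficients.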

<font size = "5">

Effect of treatment on endline beliefs

In [28]:
reg_model = smf.ols("u_ave ~ treat ", dataset)
results = reg_model.fit(cov_type= "HC1")

print(summary_col(results,
                  stars = True))


                     u_ave   
-----------------------------
Intercept          63.5628***
                   (0.3435)  
treat[T.Treatment] -7.4218***
                   (0.4988)  
R-squared          0.0406    
R-squared Adj.     0.0404    
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01


<font size = "5">

Add baseline covariates

In [29]:
reg_model = smf.ols("u_ave ~ treat + ave ", dataset)
results = reg_model.fit(cov_type= "HC1")

print(summary_col(results,
                  stars = True))


                     u_ave   
-----------------------------
Intercept          39.6549***
                   (0.6801)  
treat[T.Treatment] -7.0564***
                   (0.4325)  
ave                0.5079*** 
                   (0.0129)  
R-squared          0.2725    
R-squared Adj.     0.2722    
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
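
<font size = "3">

Comparing the two outputs above, adding `ave` barely moves the treatment coefficient (-7.42 versus -7.06) but lowers its standard error (0.4988 to 0.4325) and raises the R-squared (0.0406 to 0.2725). The sketch below reproduces this pattern on simulated data; the data-generating process and seed are assumptions for illustration, not estimates from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: the outcome depends on treatment AND a baseline score
rng = np.random.default_rng(seed=4)  # arbitrary seed
n = 2000
treated = rng.integers(0, 2, size=n)
ave = rng.normal(50, 15, size=n)                 # baseline covariate
u_ave = 40 - 7 * treated + 0.5 * ave + rng.normal(0, 10, size=n)
toy = pd.DataFrame({
    "treat": np.where(treated == 1, "Treatment", "Control"),
    "ave": ave,
    "u_ave": u_ave,
})

without_cov = smf.ols("u_ave ~ treat", data=toy).fit(cov_type="HC1")
with_cov = smf.ols("u_ave ~ treat + ave", data=toy).fit(cov_type="HC1")

# The covariate absorbs outcome variation, so the treatment SE shrinks
name = "treat[T.Treatment]"
print(without_cov.bse[name], with_cov.bse[name])
print(without_cov.rsquared, with_cov.rsquared)
```

Under random assignment the covariate is unrelated to treatment, so it is included mainly to gain precision rather than to remove bias.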


<font size = "5">

Test for heterogeneous effects


In [30]:
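# Note: in formula syntax "treat*ave" already expands to
# "treat + ave + treat:ave", so listing the main effects is optional
reg_model = smf.ols("u_ave ~ treat + ave + treat*ave", dataset)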
results = reg_model.fit(cov_type= "HC1")

print(summary_col(results,
                  stars = True))


                          u_ave   
----------------------------------
Intercept              49.1885*** 
                       (0.8840)   
treat[T.Treatment]     -25.9979***
                       (1.2381)   
ave                    0.3054***  
                       (0.0177)   
treat[T.Treatment]:ave 0.4055***  
                       (0.0241)   
R-squared              0.3095     
R-squared Adj.         0.3091     
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
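
<font size = "3">

With an interaction term, the treatment effect is no longer a single number: it equals the coefficient on `treat` plus the interaction coefficient times `ave`. The short calculation below works this out from the coefficients printed above. It assumes, as the section titles suggest, that `u_ave` measures endline beliefs and `ave` the baseline score; the three scores used in the loop are arbitrary illustrative values.

```python
# Coefficients copied from the interaction regression output above
b_treat = -25.9979        # treat[T.Treatment]
b_ave = 0.3054            # ave
b_interaction = 0.4055    # treat[T.Treatment]:ave

def treatment_effect(ave):
    """Implied effect of treatment on beliefs at a given baseline score."""
    return b_treat + b_interaction * ave

# Slope of beliefs with respect to the baseline score, by group
slope_control = b_ave
slope_treated = b_ave + b_interaction

# Baseline score at which the implied treatment effect is zero
zero_effect_score = -b_treat / b_interaction

for score in [20, 50, 80]:   # illustrative baseline scores
    print(score, round(treatment_effect(score), 2))
print(round(slope_control, 4), round(slope_treated, 4))
print(round(zero_effect_score, 1))
```

The implied effect is negative at low baseline scores and positive at high ones, and the slope for treated parents is closer to 1 than the slope for control parents. This is consistent with treated parents' beliefs tracking actual performance more closely.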


<font size = "5">

Try it yourself!

Test for heterogeneous effects using other baseline covariates!

In [None]:
# Write your own code




