# The effect of a Universal Child Benefit - González (2013) revisited

## A Project by Antonia Entorf & Marc Lipfert

### Overview

This notebook replicates the main results presented in the following article:   

González, L. (2013): [The Effect of a Universal Child Benefit on Conceptions, Abortions, and Early Maternal Labor Supply](https://www.aeaweb.org/articles?id=10.1257/pol.5.3.160). American Economic Journal: Economic Policy 5(3): 160–188.

In that paper, the author investigates the effect of a universal child benefit on fertility, household expenditure patterns and maternal labor supply. In particular, she is able to exploit the unanticipated introduction of a child benefit that took place in Spain in 2007 by utilising a Regression Discontinuity Design.

Apart from replicating the major findings presented by González, we aim to enrich her analysis in three respects. First, we examine whether accounting for autocorrelation is necessary in the given context. Second, since the benefit was suspended in the aftermath of the financial crisis, we apply the identical research design to the abolition of the policy in order to study its effect on conceptions. Third, we investigate threats to validity using simulations as well as a placebo test.




### Introduction

On July 3, 2007, the Spanish government announced that all mothers giving birth on or after July 1, 2007 would be eligible to receive a child benefit. The benefit was a one-time cash payment of 2,500€. Libertad González, henceforth LG, analyzes the effect of this child benefit on the number of conceptions and abortions as well as on household expenditures, maternal labor supply and day care use. She uses a sharp regression discontinuity design (RDD), where the running variable is time and the treatment variable equals 1 from July 1, 2007 onwards and 0 before that date.
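As a rough illustration of the sharp design described above, the following sketch builds a synthetic monthly series with a known jump at the cutoff and recovers that jump by least squares. The data, the jump size of 0.05 and all variable names are assumptions made for illustration only; this is not LG's specification.

```python
import numpy as np

# Synthetic running variable: months relative to the cutoff (0 = July 2007).
m = np.arange(-30, 30)

# Sharp design: treatment switches from 0 to 1 exactly at the cutoff.
post = (m >= 0).astype(float)

# Hypothetical outcome: linear trend plus a jump at the cutoff (no noise).
true_jump = 0.05
y = 10.5 + 0.001 * m + true_jump * post

# Design matrix: intercept, treatment, trend, and trend interacted with treatment.
X = np.column_stack([np.ones_like(post), post, m, post * m])

# Ordinary least squares; the coefficient on `post` estimates the discontinuity.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
estimated_jump = beta[1]
print(round(float(estimated_jump), 4))
```

Because the synthetic series contains no noise, the estimate coincides with the assumed jump up to numerical precision. With real data the choice of bandwidth and polynomial order matters.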

A problem that can arise in an RDD is self-selection into treatment. In this particular setting, one could imagine that women try to postpone birth in order to give birth after the cutoff rather than before it. This scenario seems especially plausible for low-income families who need the cash payment. The mothers giving birth right before and right after the cutoff could therefore differ in personal characteristics such as income and education. This would bias the estimated average treatment effect if low-income families react differently to treatment than high-income families and we are, in addition, unable to control for these characteristics.

However, self-selection into treatment was not possible in this setting, because the introduction of the policy was unexpected and the cutoff date was two days before the announcement. Together, these two facts rule out the possibility of postponing birth in order to receive the benefit.

In the following, we will explain the research question in more detail using causal graphs.

#### Effect on Fertility

<img src="causal-graphs/causal-graph-fertility.png" height="700" width="700" />

The causal graph above represents two regression analyses. First, LG is interested in the effect of the introduction of the child benefit on the number of conceptions and, second, on the number of abortions. Thus, the treatment variable is the introduction of the child benefit and the dependent variables are number of conceptions and number of abortions. In this graph the treatment should be interpreted as having the possibility to get the benefit and not as already having the benefit.

One would expect the number of conceptions to increase after the introduction, because the promise of a cash payment increases families' incentives to have a child. The same reasoning applies to the number of abortions: treated women have a stronger incentive to carry a pregnancy to term than untreated women. Therefore, the introduction of the benefit should reduce the number of abortions.

However, 2,500€ is small relative to the costs associated with raising a child. Thus, any potential impact of the treatment on the variables of interest should be small.

*Here we could consider whether less educated families seek the cash payment not only because of their probably lower income but also because they underestimate the costs associated with a child.*

### Replication
Before we can start with the data analysis, we need to prepare the data sets. To do so, we translate the author's Stata code into Python. There are three do-files, which use different datasets.

In the following, I translate the first do-file, which can be found under "Additional Materials" on the AEJ website.

In [3]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
#import statsmodels.api as sm
#from patsy import dmatrices

In [5]:
df_births = pd.read_stata('data/data_births_20110196.dta')
df_births.head()

| | mesp | year | prem | semanas |
|---|---|---|---|---|
| 0 | 3.0 | 2000.0 | 1 | 0.0 |
| 1 | 1.0 | 2000.0 | 2 | 36.0 |
| 2 | 3.0 | 2000.0 | 1 | 37.0 |
| 3 | 3.0 | 2000.0 | 1 | 39.0 |
| 4 | 12.0 | 2000.0 | 1 | 0.0 |


#### Variables
mesp: Month of birth <br>
year: Year of birth <br>
prem: Prematurity indicator (1 if baby is not premature, 2 if it is) <br>
semanas: Number of weeks of gestation at birth

In [33]:
df_births.describe().round(2)

| | mesp | year | prem | semanas | m | mc1 | mc2 | mc3 | mc | n |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4984066.0 | 4984066.0 | 4984066.0 | 4706361.0 | 4984066.0 | 4984066.0 | 4984066.0 | 4984066.0 | 4984066.0 | 4984066.0 |
| mean | 6.57 | 2041.5 | 1.07 | 34.93 | -21.53 | -30.4 | -30.26 | -30.14 | -30.14 | 1.0 |
| std | 3.44 | 35.83 | 0.26 | 11.71 | 37.52 | 37.58 | 37.51 | 37.59 | 37.59 | 0.0 |
| min | 1.0 | 2000.0 | 1.0 | 0.0 | -90.0 | -99.0 | -99.0 | -100.0 | -100.0 | 1.0 |
| 25% | 4.0 | 2003.0 | 1.0 | 38.0 | -53.0 | -62.0 | -62.0 | -62.0 | -62.0 | 1.0 |
| 50% | 7.0 | 2005.0 | 1.0 | 39.0 | -20.0 | -29.0 | -29.0 | -29.0 | -29.0 | 1.0 |
| 75% | 10.0 | 2008.0 | 1.0 | 40.0 | 11.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 |
| max | 12.0 | 2010.0 | 2.0 | 46.0 | 41.0 | 32.0 | 33.0 | 33.0 | 33.0 | 1.0 |


Create a month-of-birth variable relative to the month of the policy intervention, July 2007. The month of the intervention is set to zero, the following month to 1, the previous month to -1, and so on.

In [34]:
df_births['m'] = df_births['mesp'] + 29
df_births.loc[df_births['year'] == 2009, 'm'] = df_births['mesp'] + 17
df_births.loc[df_births['year'] == 2008, 'm'] = df_births['mesp'] + 5
df_births.loc[df_births['year'] == 2007, 'm'] = df_births['mesp'] - 7
df_births.loc[df_births['year'] == 2006, 'm'] = df_births['mesp'] - 19
df_births.loc[df_births['year'] == 2005, 'm'] = df_births['mesp'] - 31
df_births.loc[df_births['year'] == 2004, 'm'] = df_births['mesp'] - 43
df_births.loc[df_births['year'] == 2003, 'm'] = df_births['mesp'] - 55
df_births.loc[df_births['year'] == 2002, 'm'] = df_births['mesp'] - 67
df_births.loc[df_births['year'] == 2001, 'm'] = df_births['mesp'] - 79
df_births.loc[df_births['year'] == 2000, 'm'] = df_births['mesp'] - 91

df_births['m'].describe().round(2)

count    4984066.00
mean         -21.53
std           37.52
min          -90.00
25%          -53.00
50%          -20.00
75%           11.00
max           41.00
Name: m, dtype: float64
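The year-by-year assignments above all follow one rule: twelve months per calendar year away from 2007, plus the offset of the birth month from July. A minimal sketch of that closed form, using a small hypothetical sample with the same column names as `df_births`:

```python
import pandas as pd

# Small hypothetical sample with the same column names as df_births.
sample = pd.DataFrame({
    "year": [2000, 2006, 2007, 2007, 2008, 2010],
    "mesp": [1, 12, 6, 7, 2, 12],
})

# Months elapsed since July 2007: 12 per calendar year plus the month offset.
sample["m"] = (sample["year"] - 2007) * 12 + (sample["mesp"] - 7)

print(sample["m"].tolist())
```

January 2000 maps to -90 and December 2010 to 41, which matches the minimum and maximum reported by `describe()` above.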

#### Create month of conception variable
We are not interested in the month of birth but in the month of conception. In the simplest case, we therefore subtract 9 months from the month of birth to obtain the month of conception.

1. the naive definition: 9 months before birth

In [35]:
df_births['mc1'] = df_births['m'] - 9

2. naive plus prematures

In [13]:
# share of premature babies
(df_births['prem'] == 2).mean()

0.07037145976798863

In [14]:
df_births.loc[df_births['prem'] == 2, 'semanas'].max()

36.0

In [15]:
# generating mc2
df_births['mc2'] = np.where((df_births['prem'] == 2) |
        # premature or fewer than 38 weeks of gestation: subtract only 8 months
        ((0 < df_births['semanas']) & (df_births['semanas'] < 38)), df_births['m'] - 8,
        # otherwise subtract 9 months
        df_births['m'] - 9)
# open question: why 38 weeks here and not 39 as in the sophisticated version?

3. sophisticated

In [16]:
# share of observations with semanas equal to zero
# open question: why is semanas zero for some observations?
(df_births['semanas'] == 0).mean()

0.09286253432438495

In [17]:
# generating mc3
df_births['mc3'] = np.where((df_births['prem'] == 2) |
        # premature or fewer than 39 weeks of gestation: subtract only 8 months
        ((0 < df_births['semanas']) & (df_births['semanas'] < 39)), df_births['m'] - 8,
        # more than 43 weeks of gestation: subtract 10 months
        np.where(df_births['semanas'] > 43, df_births['m'] - 10,
        # otherwise subtract 9 months
        df_births['m'] - 9))

In [20]:
df_births[['semanas','mc1', 'mc2', 'mc3']].head()

| | semanas | mc1 | mc2 | mc3 |
|---|---|---|---|---|
| 0 | 0.0 | -97.0 | -97.0 | -97.0 |
| 1 | 36.0 | -99.0 | -98.0 | -98.0 |
| 2 | 37.0 | -97.0 | -96.0 | -96.0 |
| 3 | 39.0 | -97.0 | -97.0 | -97.0 |
| 4 | 0.0 | -88.0 | -88.0 | -88.0 |
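The nested `np.where` calls used for the sophisticated definition can also be written with `np.select`, which lists the conditions and the months to subtract side by side. The sketch below is one possible alternative, not the author's code. It uses the first five rows shown above, with `m` recovered as `mc1 + 9`; the frame name `rows` and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

# First five rows from the outputs above; m is recovered as mc1 + 9.
rows = pd.DataFrame({
    "m": [-88, -90, -88, -88, -79],
    "prem": [1, 2, 1, 1, 1],
    "semanas": [0.0, 36.0, 37.0, 39.0, 0.0],
})

# Conditions are checked in order; the first match wins.
# Premature, or a recorded gestation of fewer than 39 weeks: 8 months.
short = (rows["prem"] == 2) | ((rows["semanas"] > 0) & (rows["semanas"] < 39))
# Gestation of more than 43 weeks: 10 months.
long_ = rows["semanas"] > 43

# Everything else, including semanas == 0, falls back to 9 months.
months_back = np.select([short, long_], [8, 10], default=9)
rows["mc3"] = rows["m"] - months_back

print(rows["mc3"].tolist())
```

The result reproduces the `mc3` column of the table above for these five rows.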


What about observations with semanas = 0? <br>
I would drop all observations with semanas = 0. Furthermore, I would compute the month of conception with even higher precision by applying the reasoning used for semanas < 39 once more at semanas < 35.

#### Group Data
Only the sophisticated version of month of conception (mc3) is used for the analysis. <br>
Now, I will group the data by mc3 and count the number of observations per month.

In [25]:
# use the sophisticated definition as the month of conception
df_births['mc'] = df_births['mc3']
# helper column of ones; counting it per group gives the number of conceptions per month
df_births['n'] = 1
dfb = df_births.groupby('mc', as_index = False)['n'].count()

dfb.head()

| | mc | n |
|---|---|---|
| 0 | -100.0 | 6 |
| 1 | -99.0 | 24690 |
| 2 | -98.0 | 30595 |
| 3 | -97.0 | 32547 |
| 4 | -96.0 | 32352 |


In [26]:
dfb.tail()

| | mc | n |
|---|---|---|
| 129 | 29.0 | 41709 |
| 130 | 30.0 | 42480 |
| 131 | 31.0 | 41202 |
| 132 | 32.0 | 41703 |
| 133 | 33.0 | 11322 |
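The monthly counts can also be obtained without a helper column of ones. The following sketch shows the idea on a hypothetical toy frame called `births`; it is an alternative formulation, not the code used above.

```python
import pandas as pd

# Hypothetical individual-level records, one row per birth.
births = pd.DataFrame({"mc": [-2, -2, -1, 0, 0, 0]})

# size() counts rows per group directly, so no column of ones is required.
counts = births.groupby("mc").size().reset_index(name="n")
print(counts)
```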


Generate the calendar month of conception.

The author made some mistakes here!

In [27]:
dfb['month'] = 0

#note that range starts at 0 but does not include the last number
for i in range(3):
    dfb.loc[dfb['mc'] == 0 + 12*i, 'month'] = 7
    dfb.loc[dfb['mc'] == 1 + 12*i, 'month'] = 8
    dfb.loc[dfb['mc'] == 2 + 12*i, 'month'] = 9
    dfb.loc[dfb['mc'] == 3 + 12*i, 'month'] = 10
    dfb.loc[dfb['mc'] == 4 + 12*i, 'month'] = 11
    dfb.loc[dfb['mc'] == 5 + 12*i, 'month'] = 12
    dfb.loc[dfb['mc'] == 6 + 12*i, 'month'] = 1
    dfb.loc[dfb['mc'] == 7 + 12*i, 'month'] = 2
    dfb.loc[dfb['mc'] == 8 + 12*i, 'month'] = 3
    dfb.loc[dfb['mc'] == 9 + 12*i, 'month'] = 4
    dfb.loc[dfb['mc'] == 10 + 12*i, 'month'] = 5
    dfb.loc[dfb['mc'] == 11 + 12*i, 'month'] = 6
       
for i in range(9):
    dfb.loc[dfb['mc'] == -1 - 12*i, 'month'] = 6
    dfb.loc[dfb['mc'] == -2 - 12*i, 'month'] = 5
    dfb.loc[dfb['mc'] == -3 - 12*i, 'month'] = 4
    dfb.loc[dfb['mc'] == -4 - 12*i, 'month'] = 3
    dfb.loc[dfb['mc'] == -5 - 12*i, 'month'] = 2
    dfb.loc[dfb['mc'] == -6 - 12*i, 'month'] = 1
    dfb.loc[dfb['mc'] == -7 - 12*i, 'month'] = 12
    dfb.loc[dfb['mc'] == -8 - 12*i, 'month'] = 11
    dfb.loc[dfb['mc'] == -9 - 12*i, 'month'] = 10
    dfb.loc[dfb['mc'] == -10 - 12*i, 'month'] = 9
    dfb.loc[dfb['mc'] == -11 - 12*i, 'month'] = 8
    dfb.loc[dfb['mc'] == -12 - 12*i, 'month'] = 7

# check that no zero is left
sum(dfb['month'] == 0)

0
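The two loops above encode a periodic mapping from `mc` to the calendar month, with `mc = 0` corresponding to July. Assuming that convention, the same mapping can be sketched with modular arithmetic. The array `mc` below is a stand-in for the values found in `dfb`.

```python
import numpy as np

# Months relative to July 2007, spanning the range found in dfb.
mc = np.arange(-100, 34)

# mc = 0 is July (7): shift by 6 so the modulo yields 0..11, then add 1.
# NumPy's % returns a non-negative result for a positive divisor,
# so negative values of mc are handled without a separate loop.
month = (mc + 6) % 12 + 1

# Spot checks against the explicit assignments above.
lookup = dict(zip(mc.tolist(), month.tolist()))
print(lookup[0], lookup[-1], lookup[5], lookup[6], lookup[-12])
```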

In [28]:
# generate July indicator
dfb['july'] = np.where(dfb['month'] == 7, 1, 0)

Generate the number of days in each month. <br>
Note that the years 2000 to 2010 include three leap years: 2000, 2004 and 2008. The corresponding Februaries are mc = 7 (2008), mc = 7 - 12*4 = -41 (2004) and mc = 7 - 12*8 = -89 (2000). <br>
For some reason the author only adjusted February 2008.
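As a cross-check on the three leap-year Februaries, the number of days can be derived from the calendar itself rather than hard-coded. This is a minimal sketch under the assumption that `mc = 0` denotes July 2007; the function name `days_in_conception_month` is hypothetical.

```python
import calendar

def days_in_conception_month(mc):
    """Days in the calendar month that lies mc months from July 2007."""
    # Shift so that January 2007 is offset 0, then split into years and months.
    years, month_index = divmod(int(mc) + 6, 12)
    return calendar.monthrange(2007 + years, month_index + 1)[1]

# The three leap-year Februaries named above.
leap_februaries = [days_in_conception_month(mc) for mc in (7, -41, -89)]
print(leap_februaries)
```

An ordinary February, for example `mc = -5` (February 2007), returns 28 under the same rule.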

In [29]:
# February in the leap years 2008 (mc = 7), 2004 (mc = -41) and 2000 (mc = -89)
dfb['days'] = np.where((dfb['mc'] == 7) | (dfb['mc'] == -41) |
        (dfb['mc'] == -89), 29,
        # for all other Februaries
        np.where(dfb['month'] == 2, 28,
        # for April, June, September, November
        np.where((dfb['month'] == 4) | (dfb['month'] == 6) |
                (dfb['month'] == 9) | (dfb['month'] == 11), 30, 31)))



# indicator for treatment group (post-policy conception), i.e. after June 2007
dfb['post'] = np.where(dfb['mc'] >= 0, 1, 0)


# quadratic and cubic mc
dfb['mc2'] = dfb['mc']*dfb['mc']
dfb['mc3'] = dfb['mc']*dfb['mc']*dfb['mc']

# natural log of number of obs n
dfb['ln'] = np.log(dfb['n'])

# get month dummies
dummies = pd.get_dummies(dfb['month'])
dummies.columns = ['jan','feb','mar','apr','mai','jun','jul','aug','sep','oct','nov','dec']
# bind data frames
dfb = pd.concat([dfb, dummies], axis=1)

The following reproduces part of the paper's descriptive statistics in Table 1.

In [36]:
dfb.loc[(dfb['mc']>-91) & (dfb['mc']<30), ['n','ln','post','mc','month','days']].describe().round(2)

| | n | ln | post | mc | month | days |
|---|---|---|---|---|---|---|
| count | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 |
| mean | 38020.64 | 10.54 | 0.25 | -30.5 | 6.5 | 30.44 |
| std | 3167.55 | 0.08 | 0.43 | 34.79 | 3.47 | 0.81 |
| min | 30138.0 | 10.31 | 0.0 | -90.0 | 1.0 | 28.0 |
| 25% | 35775.25 | 10.49 | 0.0 | -60.25 | 3.75 | 30.0 |
| 50% | 38505.0 | 10.56 | 0.0 | -30.5 | 6.5 | 31.0 |
| 75% | 40305.75 | 10.6 | 0.25 | -0.75 | 9.25 | 31.0 |
| max | 44375.0 | 10.7 | 1.0 | 29.0 | 12.0 | 31.0 |


### Regressions
The author uses different subsets of the data and slightly different specifications.

In [31]:
# create the necessary subsets of dfb
dfb_list = []

dfb_list.append(dfb.loc[(dfb['mc']>-91) & (dfb['mc']<30)]) # 10 years (120 months)
dfb_list.append(dfb.loc[(dfb['mc']>-31) & (dfb['mc']<30)]) # 5 years (60 months)
dfb_list.append(dfb.loc[(dfb['mc']>-13) & (dfb['mc']<12)]) # 12 months on each side of the cutoff
dfb_list.append(dfb.loc[(dfb['mc']>-10) & (dfb['mc']<9)]) # 9 months on each side of the cutoff
dfb_list.append(dfb.loc[(dfb['mc']>-4) & (dfb['mc']<3)]) # 3 months on each side of the cutoff
dfb_list.append(dfb.loc[(dfb['mc']>-67) & (dfb['mc']<30)]) # 8 years (96 months)
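To keep track of which subset is which, the windows could alternatively be stored under descriptive names. The sketch below uses a stand-in frame `dfb_demo` with one row per conception month; the dictionary keys are labels chosen here for illustration, and the bounds are the inclusive equivalents of the strict inequalities above.

```python
import pandas as pd

# Stand-in for dfb with one row per conception month in the observed range.
dfb_demo = pd.DataFrame({"mc": range(-100, 34)})

# Inclusive bounds equivalent to the strict inequalities used above.
windows = {
    "10 years": (-90, 29),
    "5 years": (-30, 29),
    "12 months": (-12, 11),
    "9 months": (-9, 8),
    "3 months": (-3, 2),
    "8 years": (-66, 29),
}

# between() includes both endpoints by default.
subsets = {
    name: dfb_demo[dfb_demo["mc"].between(lo, hi)]
    for name, (lo, hi) in windows.items()
}
sizes = {name: len(sub) for name, sub in subsets.items()}
print(sizes)
```

The first window contains 120 monthly observations, which agrees with the count in the descriptive statistics above.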