## 6. Angrist-Krueger (1991) Replication

You'll find data from a famous paper by Angrist and Krueger (1991) in the ARE212_Materials repository, along with a pdf of the paper. The paper uses information on quarter of birth as an instrument for (endogenous) education to measure returns to education. The first specification in the paper is given in their equations (1) and (2).

#### 1. What is the (implicit) identifying assumption? Comment on its plausibility.

The authors aim to estimate the causal effect of years of education on earnings (the return to education). Because education is endogenous in a wage equation, they instrument years of education with quarter of birth (season of birth). The link between the two arises from compulsory schooling laws, which mandate the age until which an individual must remain in school.

The implicit identifying assumption is that the fraction of students who want to drop out before the legal dropout age is independent of season of birth, so the observed seasonal pattern in education arises only because compulsory schooling constrains some students born later in the year to stay in school longer.

The authors try to establish the plausibility of this assumption in two ways. They do not find a similar seasonal pattern in education for college graduates, who do not face equivalent laws. They also analyze the enrolment rates for two age cohorts across states with varying age restrictions in the law and find evidence of lower enrolment of the older cohort in states with lower age restrictions. They also explore the relationship between earnings and season of birth for a smaller sample of college graduates to establish no direct link between the two.

However, there is a possibility, albeit small, that underlying unobservable factors affect both the season in which one is born and one's attitude towards staying in school, regardless of compulsory schooling laws. If parents' education, socio-economic status, or demographic characteristics influence a child's decision to stay in school or drop out, and these factors also influence family planning to the extent of determining the season in which a child is born, then the identifying assumption of exogeneity between the instrument and the structural model's error term is violated.
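
The exogeneity condition discussed above can be written out explicitly. The block below is a sketch in notation that is assumed here to follow the paper's equations (1) and (2): $E_i$ is years of education, $X_i$ the covariates, $Y_{ic}$ year-of-birth dummies, $Q_{ij}$ quarter-of-birth dummies, and $\ln W_i$ the log wage. Check the symbols against the PDF before relying on them.

```latex
\begin{aligned}
E_i     &= X_i \pi + \sum_{c} Y_{ic}\,\delta_c
           + \sum_{c}\sum_{j} Y_{ic} Q_{ij}\,\theta_{jc} + \epsilon_i
           && \text{(first stage)} \\
\ln W_i &= X_i \beta + \sum_{c} Y_{ic}\,\xi_c + \rho\, E_i + \mu_i
           && \text{(structural equation)} \\[4pt]
\text{Exclusion:} \quad & \mathbb{E}\left[\, Y_{ic} Q_{ij}\, \mu_i \,\right] = 0
           \quad \text{for all } j, c \\
\text{Relevance:} \quad & \theta_{jc} \neq 0 \quad \text{for at least some } j, c
\end{aligned}
```

The concern raised in the last paragraph is a failure of the exclusion line: if family background shifts both season of birth and $\mu_i$, the expectation is no longer zero and $\rho$ is not identified by this instrument set.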

#### 2. Using their data, estimate (2), replicating the figures in their Table 5, using the conventional two-stage least squares IV estimator (what they call TSLS).
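
For reference, the conventional two-stage least squares estimator named in the question can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' code: the function name `tsls`, the simulated variables, and the true coefficient of 0.08 are all hypothetical, and the standard errors assume homoskedasticity.

```python
import numpy as np

def tsls(y, X_endog, X_exog, Z_excl):
    """2SLS estimates with conventional (homoskedastic) standard errors.

    y       : (n,) outcome
    X_endog : (n, k1) endogenous regressors
    X_exog  : (n, k2) exogenous regressors, including the constant
    Z_excl  : (n, m) excluded instruments, with m >= k1
    """
    X = np.column_stack([X_endog, X_exog])  # structural regressors
    Z = np.column_stack([Z_excl, X_exog])   # full instrument set
    # First stage: project every column of X onto the column space of Z
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: beta = (X_hat' X)^{-1} X_hat' y
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    # Residuals must use the original X, not the fitted X_hat
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X_hat.T @ X_hat)
    return beta, np.sqrt(np.diag(cov))

# Synthetic example (hypothetical numbers): u confounds schooling and wages
rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=(n, 3))                      # excluded instruments
u = rng.normal(size=n)                           # unobserved confounder
edu = z @ np.array([0.5, 0.3, 0.2]) + u + rng.normal(size=n)
logwage = 1.0 + 0.08 * edu + u + rng.normal(size=n)

beta, se = tsls(logwage, edu[:, None], np.ones((n, 1)), z)
print("2SLS coefficient on edu:", beta[0], "std. error:", se[0])
```

The first element of `beta` is the coefficient on the endogenous regressor; with real data, `X_exog` would hold the constant, the year-of-birth dummies, and any covariates, and `Z_excl` the quarter-of-birth instruments.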

In [30]:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.linalg import inv, sqrtm
from scipy.stats import multivariate_normal

# read in dta file
file_path = '~/ARE212_Materials/angrist-krueger91.dta'
df = pd.read_stata(file_path)

# quick look at data (first and last rows)
print(df)

#print(df.dtypes)

         ageq  edu   logwage  married  state  qob  black  smsa   yob  region
0       47.00   12  6.245846        1      1    1      1     1  1933     0.0
1       46.25   12  5.847161        1     48    4      1     1  1933     0.0
2       50.00   12  6.645516        1      2    1      1     1  1930     0.0
3       47.00   16  6.706133        1     22    1      1     1  1933     0.0
4       42.25   14  6.357876        1     42    4      1     1  1937     0.0
...       ...  ...       ...      ...    ...  ...    ...   ...   ...     ...
329504  42.50   10  4.583833        1     26    3      1     1  1937     4.0
329505  42.00   12  5.784210        1     22    1      1     1  1938     4.0
329506  41.00   12  5.707302        1     48    1      1     1  1939     4.0
329507  47.25   12  5.952494        1     42    4      1     1  1932     4.0
329508  48.50   13  6.047781        1     20    3      1     1  1931     4.0

[329509 rows x 10 columns]


In [31]:
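# create the instrument variable: product of quarter and year of birth
# (for yob in 1930-1939 each (qob, yob) pair maps to a distinct value,
#  so this column identifies the quarter-by-year cell)
df['qob_yob_int'] = df['qob'] * df['yob']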

#create age_squared
df['age_sq'] = df['ageq']**2

# set up dummy variables
dummy_var_yob = pd.get_dummies(df['yob'], prefix='yob', drop_first=True)
dummy_var_region = pd.get_dummies(df['region'], prefix='region', drop_first=True)

dummy_var_yob = dummy_var_yob.astype(int)
dummy_var_region = dummy_var_region.astype(int)

# concatenate the original DataFrame with the dummy variables
df = pd.concat([df, dummy_var_yob, dummy_var_region], axis=1)

#variable_list = df.columns.tolist()
#print(variable_list)
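
The paper's instrument set is quarter-of-birth dummies interacted with year-of-birth dummies, which is a set of indicator columns and not a single numeric product. A minimal sketch of one way to build such dummies with pandas follows. The toy frame, the `yq` prefix, and the choice of quarter 4 as the omitted category are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy data standing in for the yob and qob columns (values are illustrative)
toy = pd.DataFrame({
    "yob": [1930, 1930, 1931, 1931, 1931, 1930],
    "qob": [1, 2, 1, 4, 3, 4],
})

# Label each (yob, qob) cell, then expand to one indicator column per cell
cell = toy["yob"].astype(str) + "_q" + toy["qob"].astype(str)
cell_dummies = pd.get_dummies(cell, prefix="yq", dtype=int)

# Drop one quarter per year (quarter 4 here, an assumed reference category)
# so the instruments are not collinear with the year-of-birth dummies
q4_cols = [c for c in cell_dummies.columns if c.endswith("_q4")]
instruments = cell_dummies.drop(columns=q4_cols)

print(instruments)
```

Applied to ten birth years with three retained quarters each, the same pattern would yield 30 instrument columns.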


In [32]:
# Creating variables for the OLS regressions in Table 5

y = df.logwage

yob_dummies = df.filter(regex='^yob_')
region_dummies = df.filter(regex='^region_')

X1 = pd.concat([yob_dummies, df[['edu']]], axis=1)
X3 = pd.concat([yob_dummies, df[['edu', 'ageq', 'age_sq']]], axis=1)
X5 = pd.concat([yob_dummies, region_dummies, df[['edu', 'black', 'smsa', 'married']]], axis=1)
X7 = pd.concat([yob_dummies, region_dummies, df[['edu', 'black', 'smsa', 'married', 'ageq', 'age_sq']]], axis=1)

X1 = sm.add_constant(X1)
X3 = sm.add_constant(X3)
X5 = sm.add_constant(X5)
X7 = sm.add_constant(X7)


In [38]:
# Table 5: OLS Results

# Column 1
model = sm.OLS(y, X1)
results1 = model.fit()
print(results1.summary())

# Column 3
model = sm.OLS(y, X3)
results3 = model.fit()
print(results3.summary())

# Column 5
model = sm.OLS(y, X5)
results5 = model.fit()
print(results5.summary())

# Column 7
model = sm.OLS(y, X7)
results7 = model.fit()
print(results7.summary())

                            OLS Regression Results                            
Dep. Variable:                logwage   R-squared:                       0.118
Model:                            OLS   Adj. R-squared:                  0.118
Method:                 Least Squares   F-statistic:                     4397.
Date:                Wed, 10 Apr 2024   Prob (F-statistic):               0.00
Time:                        19:09:27   Log-Likelihood:            -3.1926e+05
No. Observations:              329509   AIC:                         6.385e+05
Df Residuals:                  329498   BIC:                         6.387e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0173      0.005    917.144      0.0