## 6. Angrist-Krueger (1991) Replication

You'll find data from a famous paper by Angrist and Krueger (1991) in the ARE212_Materials repository, along with a pdf of the paper. The paper uses information on quarter of birth as an instrument for (endogenous) education to measure returns to education. The first specification in the paper is given in their equations (1) and (2).

#### 1. What is the (implicit) identifying assumption? Comment on its plausibility.

#### 2. Using their data, estimate (2), replicating the figures in their Table 5, using the conventional two-stage least squares IV estimator (what they call TSLS).

In [1]:
import pandas as pd
%matplotlib inline
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import inv, sqrtm
import numpy as np
import statsmodels.api as sm

import warnings
# Ignore FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

# read in dta file
file_path = '~/ARE212_Materials/angrist-krueger91.dta'
df = pd.read_stata(file_path)

# quick look at data
print(df.head)

#print(df.dtypes)

<bound method NDFrame.head of          ageq  edu   logwage  married  state  qob  black  smsa   yob  region
0       47.00   12  6.245846        1      1    1      1     1  1933     0.0
1       46.25   12  5.847161        1     48    4      1     1  1933     0.0
2       50.00   12  6.645516        1      2    1      1     1  1930     0.0
3       47.00   16  6.706133        1     22    1      1     1  1933     0.0
4       42.25   14  6.357876        1     42    4      1     1  1937     0.0
...       ...  ...       ...      ...    ...  ...    ...   ...   ...     ...
329504  42.50   10  4.583833        1     26    3      1     1  1937     4.0
329505  42.00   12  5.784210        1     22    1      1     1  1938     4.0
329506  41.00   12  5.707302        1     48    1      1     1  1939     4.0
329507  47.25   12  5.952494        1     42    4      1     1  1932     4.0
329508  48.50   13  6.047781        1     20    3      1     1  1931     4.0

[329509 rows x 10 columns]>


In [2]:
# Creating all required variables ( instruments and exogenous dummy variables)

# Dropping one value in each variable to create the interaction
df['yearminus_1'] = df['yob'].shift(1)
df['qobminus_1'] = df['qob'].shift(1)
df['stateminus_1'] = df['state'].shift(1)

df['year_quar'] = df['yearminus_1'].astype(str) + df['qobminus_1'].astype(str)
df['quar_state'] = df['qobminus_1'].astype(str) + df['stateminus_1'].astype(str)

# Creating age_squared variable
df['age_sq'] = df['ageq'] ** 2

# Creating dummy variables for year of birth, region, state 
dummy_var_yob = pd.get_dummies(df['yob'], prefix='yob', drop_first=True, sparse=True)
dummy_var_region = pd.get_dummies(df['region'], prefix='region', drop_first=True, sparse=True)
dummy_var_state = pd.get_dummies(df['state'], prefix='state', drop_first=True, sparse=True)

# Creating dummy variables for interaction terms used as instruments
# Since we have already dropped one category in each of the interaction terms, we don't need to drop them now
dummy_var_year_quar = pd.get_dummies(df['year_quar'], prefix='year_quar', sparse=True)
dummy_var_quar_state = pd.get_dummies(df['quar_state'], prefix='quar_state', sparse=True)

# Concatenate the original DataFrame with the dummy variables
df = pd.concat([df, dummy_var_yob, dummy_var_region, dummy_var_state, dummy_var_year_quar, dummy_var_quar_state], axis=1)

#variable_list = df.columns.tolist()
#print(variable_list)


In [3]:
# Creating variables for the OLS regressions in Table 5

y = df.logwage

X1= pd.concat([df.filter(regex='^yob_'), df[['edu']]], axis=1)
X3= pd.concat([df.filter(regex='^yob_'), df[['edu', 'ageq', 'age_sq']]], axis=1)
X5= pd.concat([df.filter(regex='^yob_'),df.filter(regex='^region_'), df[['edu','black','smsa','married']]], axis=1)
X7= pd.concat([df.filter(regex='^yob_'),df.filter(regex='^region_'), df[['edu','black','smsa','married', 'ageq', 'age_sq']]], axis=1)

X1 = sm.add_constant(X1)
X3 = sm.add_constant(X3)
X5 = sm.add_constant(X5)
X7 = sm.add_constant(X7)


In [4]:
# Column 1
model = sm.OLS(y, X1)
results1 = model.fit()
coefficient_edu1 = results1.params[10]  
std_error_edu1 = results1.bse[10]       
# Print the results
print("\033[1mColumn 1\033[0m")
print(f"Coefficient for education: {coefficient_edu1}")
print(f"Standard Error for education: {std_error_edu1}")

# Column 3
model = sm.OLS(y, X3)
results3 = model.fit()
coefficient_edu3 = results3.params[10]  
std_error_edu3 = results3.bse[10]       
# Print the results
print("\033[1mColumn 3\033[0m")
print(f"Coefficient for education: {coefficient_edu3}")
print(f"Standard Error for education: {std_error_edu3}")

# Column 5
model = sm.OLS(y, X5)
results5 = model.fit()
coefficient_edu5 = results5.params[18]  
std_error_edu5 = results5.bse[18]       
# Print the results
print("\033[1mColumn 5\033[0m")
print(f"Coefficient for education: {coefficient_edu5}")
print(f"Standard Error for education: {std_error_edu5}")


# Column 7
model = sm.OLS(y, X7)
results7 = model.fit()
coefficient_edu7 = results7.params[18]  
std_error_edu7 = results7.bse[18] 
print("\033[1mColumn 7\033[0m")
print(f"Coefficient for education: {coefficient_edu7}")
print(f"Standard Error for education: {std_error_edu7}")



[1mColumn 1[0m
Coefficient for education: 0.07108104579762815
Standard Error for education: 0.00033900670348043735
[1mColumn 3[0m
Coefficient for education: 0.07107366320153127
Standard Error for education: 0.0003390582736823968
[1mColumn 5[0m
Coefficient for education: 0.06324573304217775
Standard Error for education: 0.0003392620665417615
[1mColumn 7[0m
Coefficient for education: 0.06323780159470427
Standard Error for education: 0.00033931099501583134


In [5]:
# Creating variables for TSLS regression: Table 5

# Get the columns that match the regex
yob_columns = df.filter(regex='^yob_').columns.tolist()
qob_yob_columns = df.filter(regex='^year_quar_').columns.tolist()
region_columns = df.filter(regex='^region_').columns.tolist()

exog_vars1 = sm.add_constant(df[['edu'] + yob_columns])
exog_vars2= sm.add_constant(df[['edu','ageq', 'age_sq'] + yob_columns])
exog_vars3= sm.add_constant(df[['edu','black','smsa','married'] + yob_columns + region_columns])
exog_vars4= sm.add_constant(df[['edu','black','smsa','married', 'ageq', 'age_sq'] + yob_columns + region_columns])

instrument1 = sm.add_constant(df[qob_yob_columns + yob_columns])
instrument2 = sm.add_constant(df[['ageq', 'age_sq'] + qob_yob_columns + yob_columns])
instrument3 = sm.add_constant(df[['black','smsa','married'] + qob_yob_columns + yob_columns + region_columns])
instrument4 = sm.add_constant(df[['black','smsa','married', 'ageq', 'age_sq'] + qob_yob_columns + yob_columns + region_columns])

# Table 5: TSLS Regressions 

from statsmodels.sandbox.regression.gmm import IV2SLS
import statsmodels.api as sm

# column 2
resultsIV = IV2SLS(y,
                  exog_vars1,
                  instrument1).fit()

coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 2\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultIV.summary())

# column 4
resultsIV = IV2SLS(y,
                  exog_vars2,
                  instrument2).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 4\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultsIV.summary())

# column 6
resultsIV = IV2SLS(y,
                  exog_vars3,
                  instrument3).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 6\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultIV.summary())

# column 8
resultsIV = IV2SLS(y,
                  exog_vars4,
                  instrument4).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 8\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultIV.summary())

[1mColumn 2[0m
Coefficient for education: 0.042077503583881785
Standard Error for education: 0.02748914062415495
[1mColumn 4[0m
Coefficient for education: 0.04240286964194579
Standard Error for education: 0.027343208922230057
[1mColumn 6[0m
Coefficient for education: 0.01027030979225742
Standard Error for education: 0.03078993280706803
[1mColumn 8[0m
Coefficient for education: 0.01039149580584346
Standard Error for education: 0.03065336674834997


#### 3. Repeat (2), but for the specification reported in their Table 7 (which has many more instruments). Summarize what the above exercises tell us about returns to education.

In [6]:
# OLS Regressions: Table 7
y = df.logwage
state_columns = df.filter(regex='^state_').columns.tolist()
quar_state_columns = df.filter(regex='^quar_state_').columns.tolist()

X1 = sm.add_constant(df[['edu'] + yob_columns + state_columns])
X2 = sm.add_constant(df[['edu', 'ageq', 'age_sq'] + yob_columns + state_columns])
X3 = sm.add_constant(df[['edu', 'black', 'smsa', 'married'] + yob_columns + region_columns + state_columns])
X4 = sm.add_constant(df[['edu', 'black', 'smsa', 'married', 'ageq', 'age_sq'] + yob_columns + region_columns + state_columns])

instrument1 = sm.add_constant(df[qob_yob_columns + quar_state_columns + yob_columns + state_columns])
instrument2 = sm.add_constant(df[['ageq', 'age_sq'] + qob_yob_columns + quar_state_columns + yob_columns + state_columns])
instrument3 = sm.add_constant(df[['black', 'smsa', 'married'] + qob_yob_columns + quar_state_columns + yob_columns + region_columns + state_columns])
instrument4 = sm.add_constant(df[['black', 'smsa', 'married', 'ageq', 'age_sq'] + qob_yob_columns + quar_state_columns + yob_columns + region_columns + state_columns])

In [None]:
# Table 7: OLS and TSLS Regressions 

from statsmodels.sandbox.regression.gmm import IV2SLS
import statsmodels.api as sm

# Column 1: OLS
model = sm.OLS(y, X1)
results = model.fit()
coff_edu = results.params[10]  
sd_edu = results.bse[10] 
# Print the results
print("\033[1mColumn 1\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")

# Column 2: TSLS
resultsIV = IV2SLS(y,
                  X1,
                  instrument1).fit()

coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 2\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")

# Column 3: OLS
model = sm.OLS(y, X2)
results = model.fit()
coff_edu = results.params[10]  
sd_edu = results.bse[10]       
# Print the results
print("\033[1mColumn 3\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")

# Column 4: TSLS
resultsIV = IV2SLS(y,
                  X2,
                  instrument2).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 4\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultsIV.summary())

# Column 5: OLS
model = sm.OLS(y, X3)
results = model.fit()
coff_edu = results.params[10]  
sd_edu = results.bse[10]       
# Print the results
print("\033[1mColumn 5\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")

# Column 6: TSLS
resultsIV = IV2SLS(y,
                  X3,
                  instrument3).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 6\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultIV.summary())

# Column 7:OLS
model = sm.OLS(y, X4)
results = model.fit()
coff_edu = results.params[10]  
sd_edu = results.bse[10]       
# Print the results
print("\033[1mColumn 7\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")

# Column 8: TSLS
resultsIV = IV2SLS(y,
                  X4,
                  instrument4).fit()
coff_edu = resultsIV.params[1]  
sd_edu = resultsIV.bse[1] 
print("\033[1mColumn 8\033[0m")
print(f"Coefficient for education: {coff_edu}")
print(f"Standard Error for education: {sd_edu}")
#print(resultIV.summary())

[1mColumn 1[0m
Coefficient for education: -0.04388802659209764
Standard Error for education: 0.004819353635332475
                            OLS Regression Results                            
Dep. Variable:                logwage   R-squared:                       0.129
Model:                            OLS   Adj. R-squared:                  0.129
Method:                 Least Squares   F-statistic:                     811.7
Date:                Fri, 12 Apr 2024   Prob (F-statistic):               0.00
Time:                        17:48:24   Log-Likelihood:            -3.1719e+05
No. Observations:              329509   AIC:                         6.345e+05
Df Residuals:                  329448   BIC:                         6.352e+05
Df Model:                          60                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------

#### (4) Adapt your implementation of the Chernozhukov and C. Hansen estimator to estimate the key parameter ρ,first for the Table 5 specification, then the Table 7 specification. How does your point estimate compare?

#### (5) Same as (4), but construct 95% confidence intervals using both 2SLS and your new estimator. How do these compare? Which estimator do you prefer, and why?