# Exercise 13

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats
import linearmodels as plm

Exercise: Use the dataset 'mroz' from 'wooldridge' and estimate the following model, where education ('educ') is considered endogenous.  


\begin{equation*}
\log(wage)=\beta_0+\beta_1\,educ+\beta_2\,exper+u
\end{equation*}  

  
  
- Do you think we get a causal effect for the return to education on wages? If not, why not?
- What is the problem called and which OLS assumption is violated?
- What are potential solutions to address this problem? What are the requirements to use them in order to be able to estimate a causal effect?
- There are two potential sources of endogeneity that we've covered in class. What are they called and how do they cause endogeneity?
- If this were panel data, how could we correct for endogeneity? Which type of endogeneity could we address by taking advantage of the panel data structure?
- What are the two requirements for an instrument to work?
- Which one can be tested and which one can't?
- Think about potential instruments which could be used to address the endogeneity in this case.
- A candidate is the father's education. But does it fulfill the exogeneity requirement if we estimate the model as it is at the moment?
- Estimate OLS, IV by hand and IV using an implemented estimator and report the results.
- Is the instrument relevant (strong enough)? Please test the instrument relevance.
- What can you say about the inference when estimating IV by hand?
- Can you use IV also in non-linear models? If not, what would be an alternative?
- Estimate the model with the control function approach.


In [2]:
mroz = woo.dataWoo('mroz')
mroz.describe()

,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,...,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
count,753.0,753.0,753.0,753.0,753.0,753.0,428.0,753.0,753.0,753.0,...,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,428.0,753.0
mean,0.568393,740.576361,0.237716,1.353254,42.537849,12.286853,4.177682,1.849734,2267.270916,45.12085,...,23080.594954,0.678863,9.250996,8.808765,8.623506,0.642762,10.63081,20.128964,1.190173,178.038513
std,0.49563,871.314216,0.523959,1.319874,8.072574,2.280246,3.310282,2.419887,595.566649,8.058793,...,12190.202026,0.083496,3.367468,3.57229,3.114934,0.479504,8.06913,11.634797,0.723198,249.630849
min,0.0,0.0,0.0,0.0,30.0,5.0,0.1282,0.0,175.0,30.0,...,1500.0,0.4415,0.0,0.0,3.0,0.0,0.0,-0.029057,-2.054164,0.0
25%,0.0,0.0,0.0,0.0,36.0,12.0,2.2626,0.0,1928.0,38.0,...,15428.0,0.6215,7.0,7.0,7.5,0.0,4.0,13.02504,0.816509,16.0
50%,1.0,288.0,0.0,1.0,43.0,12.0,3.4819,0.0,2164.0,46.0,...,20880.0,0.6915,10.0,7.0,7.5,1.0,9.0,17.700001,1.247574,81.0
75%,1.0,1516.0,0.0,2.0,49.0,13.0,4.97075,3.58,2553.0,52.0,...,28200.0,0.7215,12.0,12.0,11.0,1.0,15.0,24.466,1.603571,225.0
max,1.0,4950.0,3.0,8.0,60.0,17.0,25.0,9.98,5010.0,60.0,...,96000.0,0.9415,17.0,17.0,14.0,1.0,45.0,96.0,3.218876,2025.0


In [3]:
mroz.columns

Index(['inlf', 'hours', 'kidslt6', 'kidsge6', 'age', 'educ', 'wage', 'repwage',
       'hushrs', 'husage', 'huseduc', 'huswage', 'faminc', 'mtr', 'motheduc',
       'fatheduc', 'unem', 'city', 'exper', 'nwifeinc', 'lwage', 'expersq'],
      dtype='object')

In [5]:
reg_lin = smf.ols(formula='lwage ~ educ + exper ', data=mroz)
results_lin = reg_lin.fit()
results_lin.summary()

Dep. Variable:,lwage,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.144
Method:,Least Squares,F-statistic:,37.02
Date:,"Mon, 19 Sep 2022",Prob (F-statistic):,1.51e-15
Time:,14:28:37,Log-Likelihood:,-433.74
No. Observations:,428,AIC:,873.5
Df Residuals:,425,BIC:,885.6
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.4002,0.190,-2.102,0.036,-0.774,-0.026
educ,0.1095,0.014,7.728,0.000,0.082,0.137
exper,0.0157,0.004,3.900,0.000,0.008,0.024

Omnibus:,81.122,Durbin-Watson:,1.981
Prob(Omnibus):,0.0,Jarque-Bera (JB):,296.773
Skew:,-0.807,Prob(JB):,3.60e-65
Kurtosis:,6.746,Cond. No.,113.0


**1.** Do you think we get a causal effect for the return to education on wages? If not, why not?

No. OLS only measures the partial correlation between education and wages. Education (and possibly experience) is influenced by unobserved factors such as ability, intelligence or effort, which also affect wages and therefore end up in the error term $u$.

**2.** What is the problem called and which OLS assumption is violated?

Omitted variable bias, a form of endogeneity. It violates the zero conditional mean assumption $E(u \mid x)=0$.
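
The mechanism can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `ability`, all coefficients, and the helper `ols` are assumptions for illustration and are not taken from the `mroz` dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers are invented for illustration)
ability = rng.normal(size=n)                    # unobserved by the researcher
educ = 12 + 2 * ability + rng.normal(size=n)    # ability raises education
lwage = 0.5 + 0.08 * educ + 0.3 * ability + rng.normal(scale=0.5, size=n)


def ols(y, *regressors):
    """Return OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]


b_short = ols(lwage, educ)[1]            # ability omitted: it sits in the error term
b_long = ols(lwage, educ, ability)[1]    # ability included: only possible if observed

print(f"true return: 0.08, ability omitted: {b_short:.3f}, ability included: {b_long:.3f}")
```

Because `ability` is positively correlated with `educ` and also raises `lwage` in this design, the regression that omits it attributes part of the ability effect to education.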

**3.** What are potential solutions to address this problem? What are the requirements to use them in order to be able to estimate a causal effect?

We can include the omitted variables in our model, but some of them (such as ability) may not be measurable. Alternatively, we can use instrumental variable (IV) regression, which requires an instrument that is both relevant and exogenous.

**4.** There are two potential sources of endogeneity that we've covered in class. What are they called and how do they cause endogeneity?

**5.** If this were panel data, how could we correct for endogeneity? Which type of endogeneity could we address by taking advantage of the panel data structure?
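
One standard way to exploit a panel structure is the within (fixed effects) transformation, which removes any omitted variable that is constant over time for each individual. It does not help with omitted variables that vary over time. The sketch below uses simulated data; the panel dimensions, the coefficients, and the helper `demean_by_group` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_per = 2_000, 5                      # hypothetical panel: 2000 individuals, 5 periods

ability = rng.normal(size=n_ind)             # time-invariant and unobserved
ind = np.repeat(np.arange(n_ind), n_per)     # individual index for each observation
x = ability[ind] + rng.normal(size=n_ind * n_per)
y = 0.5 * x + ability[ind] + rng.normal(size=n_ind * n_per)


def demean_by_group(v, groups, n_groups):
    """Subtract the group-specific mean from every observation."""
    sums = np.bincount(groups, weights=v, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return v - (sums / counts)[groups]


x_w = demean_by_group(x, ind, n_ind)
y_w = demean_by_group(y, ind, n_ind)

b_pooled = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # ignores the individual effect
b_within = (x_w @ y_w) / (x_w @ x_w)                # individual effect is differenced out

print(f"true slope: 0.5, pooled OLS: {b_pooled:.3f}, within estimator: {b_within:.3f}")
```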

**6.** What are the two requirements of an instrument to work?

An instrument must satisfy two requirements:
- **relevance**: it must be sufficiently correlated with the endogenous regressor (not *weak*)
- **exogeneity**: it must be uncorrelated with the error term

**7.** Which one can be tested and which one can't?

Relevance can be tested: as a rule of thumb, the first-stage F-statistic on the instrument should exceed 10. Exogeneity cannot be tested directly, because the error term is unobserved.
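
A minimal sketch of the relevance check on simulated data follows. The instrument `z`, the first-stage slope of 0.5, and the sample size are assumptions made up for illustration. With a single instrument, the first-stage F-statistic is simply the squared t-statistic on that instrument.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

z = rng.normal(size=n)                       # hypothetical instrument
x = 1.0 + 0.5 * z + rng.normal(size=n)       # endogenous regressor

# First stage: regress x on a constant and the instrument
Z = np.column_stack([np.ones(n), z])
coef = np.linalg.lstsq(Z, x, rcond=None)[0]
resid = x - Z @ coef

# Classical (homoskedastic) variance of the first-stage coefficients
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)

t_z = coef[1] / np.sqrt(cov[1, 1])
f_first_stage = t_z ** 2                     # one instrument: F equals t squared

print(f"first-stage F = {f_first_stage:.2f}, exceeds 10: {f_first_stage > 10}")
```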

**8.** Think about potential instruments which could be used to address the endogeneity in this case.

We could think about the parents' education or the family's income.

**9.** A candidate is the father's education. But does it fulfill the exogeneity requirement if we estimate the model as it is at the moment?

Not really. It could, for example, be correlated with the mother's education, which is not in the model and would therefore be part of the error term if it affects wages.

**10.** Estimate OLS, IV by hand and IV using an implemented estimator and report the results.

In [23]:
#restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# OLS
reg_lin = smf.ols(formula='lwage ~ educ + exper', data=mroz)
results_lin = reg_lin.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_lin.params, 4),
                          'se': round(results_lin.bse, 4),
                          't': round(results_lin.tvalues, 4),
                          'pval': round(results_lin.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')


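# IV by hand (two-stage least squares)
# note: exper is left out of both stages, so this is the simple IV regression of lwage on educ
## first stage: regress the endogenous regressor on the instrument
first_stage = smf.ols(formula='educ ~ fatheduc', data=mroz)
fstage_results = first_stage.fit()
fstage_pred = fstage_results.predict()
## second stage: replace educ by its first-stage prediction
second_stage = smf.ols(formula='lwage ~ fstage_pred', data=mroz)
sstage_results = second_stage.fit()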

# print regression table:
table_iv_hand = pd.DataFrame({'b': round(sstage_results.params, 4),
                          'se': round(sstage_results.bse, 4),
                          't': round(sstage_results.tvalues, 4),
                          'pval': round(sstage_results.pvalues, 4)})
print(f'table_iv_hand: \n{table_iv_hand}\n')


# IV built-in
import linearmodels.iv as iv
reg_iv = iv.IV2SLS.from_formula(formula='lwage ~ 1 + [educ ~ fatheduc]', data=mroz)
results_iv = reg_iv.fit()
iv_pred = results_iv.predict()


# print regression table:
table_iv_builtin = pd.DataFrame({'b': round(results_iv.params, 4),
                          'se': round(results_iv.std_errors, 4),
                          't': round(results_iv.tstats, 4),
                          'pval': round(results_iv.pvalues, 4)})
print(f'table_iv_builtin: \n{table_iv_builtin}\n')


table_ols: 
                b      se       t    pval
Intercept -0.4002  0.1904 -2.1021  0.0361
educ       0.1095  0.0142  7.7283  0.0000
exper      0.0157  0.0040  3.8998  0.0001

table_iv_hand: 
                  b      se       t    pval
Intercept    0.4411  0.4671  0.9443  0.3455
fstage_pred  0.0592  0.0368  1.6081  0.1086

table_iv_builtin: 
                b      se       t    pval
Intercept  0.4411  0.4643  0.9501  0.3421
educ       0.0592  0.0369  1.6017  0.1092
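
With one endogenous regressor, one instrument and no further controls, the two-stage procedure collapses to the ratio $\widehat{Cov}(z, y)/\widehat{Cov}(z, x)$. The following sketch checks this on simulated data; the instrument `z`, the error `u`, and all coefficients are assumptions for illustration and do not come from `mroz`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

z = rng.normal(size=n)                          # hypothetical instrument
u = rng.normal(size=n)                          # structural error
x = 0.8 * z + 0.6 * u + rng.normal(size=n)      # endogenous: depends on u
y = 1.0 + 0.1 * x + u

# Ratio form of the simple IV estimator
b_iv_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Two-stage least squares by hand
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]          # first-stage fitted values
X_hat = np.column_stack([np.ones(n), x_hat])
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0][1]      # second-stage slope

# OLS for comparison (biased here because x is correlated with u)
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"true: 0.1, OLS: {b_ols:.3f}, IV ratio: {b_iv_ratio:.3f}, 2SLS by hand: {b_2sls:.3f}")
```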



**11.** Is the instrument relevant (strong enough)? Please test the instrument relevance.

In [26]:
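# relevance is assessed in the first stage (educ ~ fatheduc), not in the second stage;
# with a single instrument the first-stage F-statistic equals the squared t-statistic on fatheduc
print(f"Instrument strong enough: {fstage_results.fvalue > 10} (first-stage F = {round(fstage_results.fvalue, 2)})")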


**12.** What can you say about the inference when estimating IV by hand?
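
When the second stage is run as a plain OLS regression on the first-stage prediction, the coefficient is the 2SLS estimate, but the reported standard errors are not valid: the residual variance is computed with the predicted regressor instead of the actual one. A minimal sketch of the difference on simulated data (all variables and coefficients are assumptions for illustration; classical homoskedastic standard errors are used):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000

z = rng.normal(size=n)                          # hypothetical instrument
u = rng.normal(size=n)                          # structural error
x = 0.5 * z + 0.7 * u + rng.normal(size=n)      # endogenous regressor
y = 1.0 + 0.5 * x + u

Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), x_hat])    # second-stage design matrix
X = np.column_stack([np.ones(n), x])            # design matrix with the actual regressor

b = np.linalg.lstsq(X_hat, y, rcond=None)[0]    # 2SLS coefficients
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive: residuals use x_hat (this is what the second-stage OLS output reports)
e_naive = y - X_hat @ b
se_naive = np.sqrt(e_naive @ e_naive / (n - 2) * XtX_inv[1, 1])

# Corrected: residuals use the actual x
e_iv = y - X @ b
se_iv = np.sqrt(e_iv @ e_iv / (n - 2) * XtX_inv[1, 1])

print(f"slope: {b[1]:.3f}, naive se: {se_naive:.4f}, corrected se: {se_iv:.4f}")
```

In this simulated design the naive standard error comes out larger than the corrected one. The direction of the difference is not guaranteed in general, which is why an implemented IV estimator is preferable for inference.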

**13.** Can you use IV also in non-linear models? If not, what would be an alternative?

**14.** Estimate the model with the control function approach.

In [31]:
# first stage: regress the endogenous regressor on the instrument and the exogenous regressor
fstage = smf.ols(formula='educ ~ fatheduc + exper', data=mroz)
fstage_results = fstage.fit()
fstage_resid = fstage_results.resid

# second stage: add the first-stage residuals as a control function
reg_control = smf.ols(formula='lwage ~ educ + exper + fstage_resid', data=mroz)
results_control = reg_control.fit()
results_control.summary()
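
As a cross-check of the control function idea, here is a minimal sketch on simulated data. The control function uses the residuals from regressing the endogenous regressor on the instrument and the exogenous controls. In a linear model, the resulting coefficient on the endogenous regressor coincides with the 2SLS estimate. The variables `z`, `w`, `u`, the coefficients, and the helper `ols` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

z = rng.normal(size=n)                                   # hypothetical instrument
w = rng.normal(size=n)                                   # exogenous control (role of exper)
u = rng.normal(size=n)                                   # structural error
x = 0.6 * z + 0.3 * w + 0.5 * u + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 0.1 * x + 0.2 * w + u


def ols(y, X):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# First stage: endogenous regressor on instrument and exogenous control
Z = np.column_stack([const, z, w])
x_hat = Z @ ols(x, Z)
v_hat = x - x_hat                                        # first-stage residuals

# Control function: original regressors plus the first-stage residuals
b_cf = ols(y, np.column_stack([const, x, w, v_hat]))

# 2SLS by hand for comparison: replace x by its first-stage prediction
b_2sls = ols(y, np.column_stack([const, x_hat, w]))

print(f"control function: {b_cf[1]:.4f}, 2SLS: {b_2sls[1]:.4f}, coef on residual: {b_cf[3]:.4f}")
```

A coefficient on `v_hat` that is clearly different from zero indicates that the regressor is endogenous. As with 2SLS by hand, the standard errors from the plain second-stage OLS would need a correction, which this sketch does not attempt.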
