The EAEF (Educational Attainment & Earnings Functions) data set was obtained through [this link](https://economistsview.typepad.com/economics421/files/EAEF.xls).

- Other data sets referenced in the [book](https://global.oup.com/ukhe/product/introduction-to-econometrics-9780199676828?cc=br&lang=en&) can be found [here](https://global.oup.com/uk/orc/busecon/economics/dougherty5e/student/datasets/).

In [21]:
import numpy as np
import pandas as pd
from statsmodels.api import OLS, add_constant
from statsmodels.sandbox.regression.gmm import IV2SLS

In [2]:
data = pd.read_excel('./EAEF.xls')

In [3]:
data.head()

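| Unnamed: 0 | S | ASVABC | SM | SF | SIBLINGS | LIBRARY | EXPER | EARNINGS |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 60.89985 | 8 | 8 | 11 | 0 | 22.38461 | 53.41 |
| 1 | 12 | 33.6379 | 5 | 5 | 3 | 0 | 8.903846 | 8.0 |
| 2 | 15 | 38.81767 | 11 | 12 | 3 | 1 | 13.25 | 24.0 |
| 3 | 13 | 57.08318 | 12 | 16 | 2 | 1 | 18.25 | 29.5 |
| 4 | 18 | 65.53439 | 16 | 20 | 1 | 1 | 13.76923 | 32.05 |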


**Columns**:

- `S`: subject's schooling (in years).
- `ASVABC`: scaled standardized test score.
- `SM`: schooling of the mother.
- `SF`: schooling of the father.
- `LIBRARY`: parents' ownership of a library card (boolean).
- `SIBLINGS`: number of siblings.
- `EXPER`: work experience (in years).
- `EARNINGS`: subject's earnings.

**Regression**: $\log \textrm{ earnings } \sim \textrm{ schooling }, \textrm{ experience }$, that is, $\log$ `EARNINGS` $\sim$ `S`, `EXPER`.

> "We are going to worry that $\textrm{ schooling }$ is correlated with the error and we will get an instrument for it. And we will use a couple of different kinds of instruments."

1. Base regression.
2. Instrument: `SM`.
3. Instruments: `SM`, `SF`, `SIBLINGS`, `LIBRARY`.

In [4]:
data['LOG_EARN'] = data.EARNINGS.apply(np.log)
data['CONST'] = 1.0

In [5]:
base_reg = OLS(endog=data['LOG_EARN'], exog=data[['CONST','S', 'EXPER']]).fit()

In [6]:
base_reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | LOG_EARN | R-squared: | 0.273 |
| Model: | OLS | Adj. R-squared: | 0.27 |
| Method: | Least Squares | F-statistic: | 100.9 |
| Date: | Sun, 15 Dec 2019 | Prob (F-statistic): | 6.47e-38 |
| Time: | 21:52:46 | Log-Likelihood: | -393.37 |
| No. Observations: | 540 | AIC: | 792.7 |
| Df Residuals: | 537 | BIC: | 805.6 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CONST | 0.5093 | 0.166 | 3.061 | 0.002 | 0.182 | 0.836 |
| S | 0.1236 | 0.009 | 13.583 | 0.000 | 0.106 | 0.141 |
| EXPER | 0.0351 | 0.005 | 7.010 | 0.000 | 0.025 | 0.045 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.231 | Durbin-Watson: | 1.81 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 24.612 |
| Skew: | 0.15 | Prob(JB): | 4.52e-06 |
| Kurtosis: | 4.002 | Cond. No. | 170.0 |


More on `statsmodels.regression.linear_model.OLS` [here](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html).
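Because the dependent variable is in logs, the coefficient on `S` reads as an approximate proportional change in earnings per extra year of schooling. The short sketch below, which simply copies the reported OLS coefficient by hand (it does not re-run the regression), compares the usual "100 × coefficient" approximation with the exact conversion $100\,(e^{\beta} - 1)$.

```python
import math

# OLS coefficient on S, copied by hand from the base regression summary.
coef_s = 0.1236

# Common approximation: a one-unit change in S changes earnings by ~100*beta percent.
approx_pct = 100 * coef_s

# Exact percentage change implied by a log-linear model.
exact_pct = 100 * (math.exp(coef_s) - 1)

print(f"approximate: {approx_pct:.2f}%   exact: {exact_pct:.2f}%")
```

The two numbers are close for small coefficients and drift apart as the coefficient grows, so the exact form is the safer one to quote for the larger IV estimates.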

### Two-Stage Least Squares

Original model: $y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i$.

**First stage**: regress the endogenous variable of interest on the instrument(s) and <u>include all other exogenous variables</u>.

Say $Z = \{z_i\}$ will be used as an instrument for $X_2 = \{x_{2i}\}$.

Then, run $x_{2i} = \alpha_1 + \alpha_2 z_i + \alpha_3 x_{3i} + v_i$, where $v_i$ is the random error term.

**Second stage**: use the predicted variable $\hat X_2 = \{\hat x_{2i}\}$ in place of the endogenous regressor $X_2$ in the original model.

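Run $y_i = \beta_1 + \beta_2 \hat x_{2i} + \beta_3 x_{3i} + \tilde u_i$. Note that the standard errors reported by a plain OLS fit of this second stage are not the correct 2SLS standard errors, because its residuals are computed with $\hat x_{2i}$ instead of $x_{2i}$.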
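The two stages can be sketched with plain NumPy on synthetic data. Everything in this block is an illustrative assumption rather than part of the EAEF analysis: the data-generating process, the true coefficient of 2.0 on $x_2$, and the helper `ols` are all made up for the example. The error $v$ in the first stage is built to correlate with $u$, which is what makes $x_2$ endogenous.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic (hypothetical) data, not the EAEF sample.
z = rng.normal(size=n)                       # instrument
x3 = rng.normal(size=n)                      # exogenous regressor
u = rng.normal(size=n)                       # structural error
v = 0.8 * u + 0.6 * rng.normal(size=n)       # first-stage error, correlated with u
x2 = 1.0 + 1.0 * z + 0.5 * x3 + v            # endogenous regressor
y = 0.5 + 2.0 * x2 + 1.0 * x3 + u            # true beta_2 is 2.0

const = np.ones(n)


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Naive OLS on the original model: biased because x2 is correlated with u.
beta_ols = ols(np.column_stack([const, x2, x3]), y)

# First stage: regress x2 on the instrument and the exogenous regressor.
Z = np.column_stack([const, z, x3])
x2_hat = Z @ ols(Z, x2)

# Second stage: replace x2 by its first-stage prediction.
beta_2sls = ols(np.column_stack([const, x2_hat, x3]), y)

print("OLS  beta_2:", round(beta_ols[1], 3))
print("2SLS beta_2:", round(beta_2sls[1], 3))
```

In this design the OLS slope should land noticeably above 2, while the 2SLS slope should sit close to it; the exact figures depend on the seed.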

***

More on Two-Stage Least Squares [here](https://python.quantecon.org/ols.html).

***

`SM` as an instrument for `S`:

In [24]:
ivreg1_stage1 = OLS(endog=data['S'], exog=data[['CONST', 'SM', 'EXPER']]).fit()

In [27]:
data['S_HAT'] = ivreg1_stage1.predict(data[['CONST','SM', 'EXPER']])

In [28]:
ivreg1_stage2 = OLS(endog=data['LOG_EARN'], exog=data[['CONST', 'S_HAT', 'EXPER']]).fit()

In [29]:
ivreg1_stage2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | LOG_EARN | R-squared: | 0.079 |
| Model: | OLS | Adj. R-squared: | 0.076 |
| Method: | Least Squares | F-statistic: | 23.07 |
| Date: | Sun, 15 Dec 2019 | Prob (F-statistic): | 2.44e-10 |
| Time: | 23:22:09 | Log-Likelihood: | -457.22 |
| No. Observations: | 540 | AIC: | 920.4 |
| Df Residuals: | 537 | BIC: | 933.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CONST | -0.0617 | 0.451 | -0.137 | 0.891 | -0.947 | 0.823 |
| S_HAT | 0.1600 | 0.028 | 5.705 | 0.000 | 0.105 | 0.215 |
| EXPER | 0.0394 | 0.006 | 6.122 | 0.000 | 0.027 | 0.052 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.381 | Durbin-Watson: | 1.869 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 24.453 |
| Skew: | 0.356 | Prob(JB): | 4.9e-06 |
| Kurtosis: | 3.761 | Cond. No. | 408.0 |
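When the second stage is run by hand as a separate OLS fit, the coefficients are the 2SLS estimates, but the reported standard errors come from residuals built with the fitted regressor rather than the observed one. The sketch below shows the usual correction on synthetic data: recompute the residuals with the original regressor and rescale the covariance matrix. The data-generating process and all variable names are assumptions made for illustration; homoskedastic errors are assumed as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic (hypothetical) data with an endogenous regressor x2.
z = rng.normal(size=n)
x3 = rng.normal(size=n)
u = rng.normal(size=n)                       # structural error, variance 1
v = 0.8 * u + 0.6 * rng.normal(size=n)
x2 = 1.0 + 1.0 * z + 0.5 * x3 + v
y = 0.5 + 2.0 * x2 + 1.0 * x3 + u

const = np.ones(n)
Z = np.column_stack([const, z, x3])          # instrument plus exogenous regressors
X = np.column_stack([const, x2, x3])         # original regressors

# Manual two stages.
x2_hat = Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]
X_hat = np.column_stack([const, x2_hat, x3])
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

dof = n - X.shape[1]

# Residuals a plain second-stage OLS would use (built with x2_hat).
resid_naive = y - X_hat @ beta
# Residuals 2SLS should use (built with the observed x2).
resid_2sls = y - X @ beta

sigma2_naive = resid_naive @ resid_naive / dof
sigma2_2sls = resid_2sls @ resid_2sls / dof

# Both versions share (X_hat' X_hat)^{-1}; only the error variance differs.
xtx_inv = np.linalg.inv(X_hat.T @ X_hat)
se_naive = np.sqrt(sigma2_naive * np.diag(xtx_inv))
se_2sls = np.sqrt(sigma2_2sls * np.diag(xtx_inv))

print("naive SE:    ", np.round(se_naive, 4))
print("corrected SE:", np.round(se_2sls, 4))
```

In this particular design the naive standard errors come out larger than the corrected ones, but the direction is not guaranteed in general, so the manual second-stage summary is best read for its coefficients only.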


`SM`, `SF`, `SIBLINGS`, `LIBRARY` as instruments for `S`:

In [34]:
ivreg2 = IV2SLS(endog=data['LOG_EARN'],
                exog=data[['CONST', 'S', 'EXPER']],
                instrument=data[['CONST','SM', 'SF', 'SIBLINGS', 'LIBRARY', 'EXPER']]).fit()

In [35]:
ivreg2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | LOG_EARN | R-squared: | 0.248 |
| Model: | IV2SLS | Adj. R-squared: | 0.245 |
| Method: | Two Stage | F-statistic: | 37.11 |
| | Least Squares | Prob (F-statistic): | 7.99e-16 |
| Date: | Sun, 15 Dec 2019 | | |
| Time: | 23:35:26 | | |
| No. Observations: | 540 | | |
| Df Residuals: | 537 | | |
| Df Model: | 2 | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CONST | -0.1035 | 0.347 | -0.298 | 0.766 | -0.786 | 0.579 |
| S | 0.1626 | 0.021 | 7.588 | 0.000 | 0.121 | 0.205 |
| EXPER | 0.0398 | 0.006 | 7.110 | 0.000 | 0.029 | 0.051 |

| | | | |
|---|---|---|---|
| Omnibus: | 11.797 | Durbin-Watson: | 1.818 |
| Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 21.0 |
| Skew: | 0.06 | Prob(JB): | 2.75e-05 |
| Kurtosis: | 3.959 | Cond. No. | 170.0 |
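With several instruments, the first stage projects `S` onto all of them at once, and the estimator can be written in one step as $\hat\beta = (X' P_Z X)^{-1} X' P_Z y$ with $P_Z = Z (Z'Z)^{-1} Z'$, where $Z$ holds the instruments together with the exogenous regressors (which is why `CONST` and `EXPER` also appear in the `instrument` argument above). The sketch below checks, on synthetic data, that this textbook closed form matches the manual two-step procedure. The data, the three made-up instruments, and the variable names are assumptions for illustration; the block does not call statsmodels, and it is only presumed that `IV2SLS` computes the same quantity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic (hypothetical) data: three excluded instruments for one endogenous regressor.
Z_excl = rng.normal(size=(n, 3))
x3 = rng.normal(size=n)
u = rng.normal(size=n)
v = 0.8 * u + 0.6 * rng.normal(size=n)
x2 = 1.0 + Z_excl @ np.array([0.6, 0.4, 0.3]) + 0.5 * x3 + v
y = 0.5 + 2.0 * x2 + 1.0 * x3 + u            # true beta_2 is 2.0

const = np.ones(n)
X = np.column_stack([const, x2, x3])         # regressors of the original model
Z = np.column_stack([const, Z_excl, x3])     # full instrument matrix

# Closed form: P_Z X computed without building the n-by-n projection matrix.
PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_closed = np.linalg.solve(PZ_X.T @ X, PZ_X.T @ y)

# Manual two-step version for comparison.
x2_hat = Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]
X_hat = np.column_stack([const, x2_hat, x3])
beta_two_step = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print("closed form:", np.round(beta_closed, 4))
print("two-step:   ", np.round(beta_two_step, 4))
```

The two coefficient vectors agree up to floating-point error, since the columns of $X$ that already lie in the span of $Z$ are left unchanged by the projection and only $x_2$ is replaced by its fitted values.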


More on `statsmodels.sandbox.regression.gmm.IV2SLS` [here](https://www.statsmodels.org/stable/generated/statsmodels.sandbox.regression.gmm.IV2SLS.html).