Instrumental variables (IV) regression, also called instrumental variable estimation, isolates the movements in X that are uncorrelated with the error term u. <br>
Using only that variation allows the user to identify causal effects. <br>
Some conditions must be met. First, some notation. <br>
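* Endogenous var: var correlated with the error term u (bad ones) - (X)
* Exogenous var: var uncorrelated with the error term (good ones) - (W)
* Dependent var: target var - (Y)
* Instrument var: var that is correlated with X but not with the error term, and is excluded from the equation for Y - (Z)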

[How]
* Two-stage least squares (2SLS) is divided into two stages.
* In the first stage, regress the endogenous variable on a constant, the exogenous variables, and the instrument.
* In the second stage, regress the target variable on the fitted values of the endogenous variable from the first stage, a constant, and the exogenous variables.
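
The two stages above can be sketched on simulated data. This is a minimal illustration, not the notebook's own data: the coefficients, the sample size, and the helper name `ols` are assumptions chosen for the example. X is built to depend on the error term u, so plain OLS is biased, while the two-stage procedure should land near the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
beta1 = 2.0                      # true effect of X on Y (assumed for the simulation)

w = rng.normal(size=n)           # exogenous control W
z = rng.normal(size=n)           # instrument Z, independent of u
u = rng.normal(size=n)           # error term
x = 0.8 * z + 0.5 * w + 0.7 * u + rng.normal(size=n)   # X depends on u, so it is endogenous
y = 1.0 + beta1 * x + 0.3 * w + u

def ols(y, X):
    """Least-squares coefficients of y on the columns of X (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Naive OLS of Y on a constant, X, and W
b_ols = ols(y, np.column_stack([const, x, w]))

# Stage 1: regress X on a constant, W, and Z, then keep the fitted values
first_stage = np.column_stack([const, w, z])
x_hat = first_stage @ ols(x, first_stage)

# Stage 2: regress Y on a constant, the fitted X, and W
b_2sls = ols(y, np.column_stack([const, x_hat, w]))

print("OLS  coefficient on X:", b_ols[1])
print("2SLS coefficient on X:", b_2sls[1])
```

Only the point estimate from this manual procedure is meaningful; the standard errors need a separate correction.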

[Condition1]
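* For (W) to be an effective control var, including (W) should make the instruments uncorrelated with u, so that the 2SLS estimator of the coefficient of X is consistent. In other words, W effectively controls for the omitted factor.
* In the simplest version, W is exogenous: $E(u_{i}|w_{i}) = 0$. More generally, conditional mean independence is enough: once W is controlled for, Z carries no further information about u.
* $E(u_{i}|z_{i}, w_{i}) = E(u_{i}|w_{i})$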

[Condition 2]
* (X, W, Z, Y) should be i.i.d. draws from their joint distribution

[Condition 3]
* Large outliers should be unlikely (X, W, Z, and Y have nonzero finite fourth moments)

[Important condition : instrument relevance]
* $cor(z_{i}, x_{i}) \neq 0 $
* Instruments should be correlated with the endogenous variable.
* The question to ask: is Z useful in predicting X, given W?
* The more relevant the instrument, the more information is available for use in IV regression.
* If the relevance is small, the instrument is called a weak instrument.
* A weak instrument explains little of the variation in the endogenous variable. The 2SLS estimator can then be badly biased, and its usual standard errors and confidence intervals are unreliable, even in large samples.
* With one endogenous variable and one instrument, the estimated coefficient of the endogenous variable in the second stage is $ \hat{\beta_{1}} = \frac{S_{zy}}{S_{zx}} \xrightarrow{p} \frac{cov(z,y)}{cov(z,x)}$
* So if the instrument is irrelevant, $cov(z,x) = 0$ and the ratio is undefined. If it is weak, $cov(z,x)$ is close to zero, and small sampling errors in $S_{zx}$ produce large errors in $\hat{\beta_{1}}$.
* Rule of thumb: if the first-stage F-stat for the joint significance of the instruments is bigger than 10, weak instruments are not a concern. The Stock-Yogo test formalizes this, and its null hypothesis is that the instruments are weak.
* If the model is overidentified and you have both strong and weak instruments, drop the weakest and keep the strong ones. The standard errors may increase, but the original standard errors were not meaningful anyway.
* If it's exactly identified and you only have weak instruments, you cannot drop any. Try finding other, stronger instruments, or use methods that are less sensitive to weak instruments. (2SLS is sensitive to weak instruments)
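
The ratio formula and the first-stage F rule of thumb can be illustrated with a short simulation. This is a sketch under stated assumptions: one instrument, no control variables, homoskedastic errors, and made-up coefficients. The function name `iv_ratio_and_first_stage_f` is hypothetical. With a single instrument the first-stage F-stat equals the squared t-stat on the instrument.

```python
import numpy as np

def iv_ratio_and_first_stage_f(z, x, y):
    """Return (S_zy / S_zx, first-stage F) for one instrument and no controls."""
    s_zy = np.cov(z, y)[0, 1]
    s_zx = np.cov(z, x)[0, 1]

    # First stage: x = pi0 + pi1 * z + v, estimated by least squares
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    pi = np.linalg.lstsq(Z, x, rcond=None)[0]
    v = x - Z @ pi
    sigma2 = v @ v / (n - 2)                        # homoskedasticity-only error variance
    se_pi1 = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return s_zy / s_zx, (pi[1] / se_pi1) ** 2       # F = t^2 with a single instrument

rng = np.random.default_rng(0)
n = 2_000
z = rng.normal(size=n)
u = rng.normal(size=n)
e = rng.normal(size=n)

results = {}
for label, strength in [("strong", 1.0), ("weak", 0.01)]:
    x = strength * z + 0.8 * u + e      # x is endogenous through u
    y = 1.0 + 2.0 * x + u               # true coefficient on x is 2 (assumed)
    results[label] = iv_ratio_and_first_stage_f(z, x, y)
    print(label, "beta_hat =", results[label][0], "first-stage F =", results[label][1])
```

In runs like this the weak-instrument estimate can land far from the true value even though the sample is large, because the denominator $S_{zx}$ is close to zero.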

[Important condition : instrument exogeneity]
* $cor(z_{i}, u_{i}) = 0$
* Exogeneity of the instruments means that they are uncorrelated with the error term. 
* If it's exactly identified, exogeneity cannot be tested statistically. It needs an expert's judgement.
* When overidentified, use the overidentifying restrictions test (J-test).
* First compute the 2SLS residuals, using the actual X rather than the first-stage fitted values:
* $ \hat{u_{i}} = Y_{i} - (\hat{\beta_{0}} + \hat{\beta_{1}}X_{1i} + \cdots + \hat{\beta_{k}}X_{ki} + \hat{\beta_{k+1}}W_{1i} + \cdots + \hat{\beta_{k+r}}W_{ri})$
* Then regress $\hat{u_{i}}$ on the instruments Z and the exogenous variables W. If the instruments are exogenous, the coefficients of Z should all be zero.
* Let F be the homoskedasticity-only F-stat for the joint hypothesis that all coefficients of Z are zero. The J-statistic is J = mF.
* In large samples, J follows a chi-square dist under the null, with df m-k (# of instruments - # of endogenous variables).
* If exactly identified, the degrees of freedom of the J stat are m-k = 0 and the stat itself is exactly 0, so the test cannot be used.
* Null hypothesis is that all instruments are exogenous.
* If the null is rejected, at least one instrument is not exogenous, but the test does not say which. A judgement call is needed.
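
The J-statistic steps above can be sketched with NumPy on simulated data. Everything here is an assumption made for illustration: one endogenous regressor, one control, two instruments (so m-k = 1), homoskedastic errors, made-up coefficients, and the hypothetical helper names `ols_fit`, `j_statistic`, and `simulate_j`. In the second scenario one instrument is deliberately correlated with u.

```python
import math
import numpy as np

def ols_fit(y, X):
    """Return (coefficients, residuals) of a least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

def j_statistic(y, x, w, Z):
    """J = m * F for one endogenous x, one control w, and the m instruments in Z's columns."""
    n, m = Z.shape
    exog = np.column_stack([np.ones(n), w])
    first = np.column_stack([exog, Z])

    # 2SLS point estimates
    x_hat = first @ ols_fit(x, first)[0]
    beta = ols_fit(y, np.column_stack([x_hat, exog]))[0]

    # 2SLS residuals use the actual x, not x_hat
    u_hat = y - np.column_stack([x, exog]) @ beta

    # Regress u_hat on (W, Z) and on W alone; F tests that all Z coefficients are zero
    ssr_u = np.sum(ols_fit(u_hat, first)[1] ** 2)
    ssr_r = np.sum(ols_fit(u_hat, exog)[1] ** 2)
    f_stat = ((ssr_r - ssr_u) / m) / (ssr_u / (n - first.shape[1]))
    return m * f_stat

rng = np.random.default_rng(0)
n = 5_000
w, u, z1 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
z2_ok = rng.normal(size=n)                 # exogenous second instrument
z2_bad = rng.normal(size=n) + 0.5 * u      # second instrument correlated with u

def simulate_j(z2):
    x = 0.7 * z1 + 0.7 * z2 + 0.5 * w + 0.6 * u + rng.normal(size=n)
    y = 1.0 + 2.0 * x + 0.3 * w + u
    return j_statistic(y, x, w, np.column_stack([z1, z2]))

j_ok, j_bad = simulate_j(z2_ok), simulate_j(z2_bad)
for name, j in [("both exogenous", j_ok), ("one invalid", j_bad)]:
    # With exactly 1 degree of freedom, P(chi2 > j) = erfc(sqrt(j / 2))
    print(name, "J =", j, "p-value =", math.erfc(math.sqrt(j / 2)))
```

The `erfc` shortcut for the p-value only holds for one degree of freedom; with more overidentifying restrictions a general chi-square tail function would be needed.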

### Example

A replication using data from Stock and Watson, Introduction to Econometrics <br>
Dataset from: https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/Stock-Watson-EmpiricalExercises-DataSets.htm

1980 Census data.
254654 women between the age of 21 and 35. <br><br>
morekids. =1 if mom had more than 2 children<br>
boy1st.  =1 if 1st child was a boy<br>
boy2nd.   =1 if 2nd child was a boy<br>
samesex.  =1 if 1st two children same sex<br>
agem1. age of mom at census<br>
black.  =1 if mom is black<br>
hispan.  =1 if mom is Hispanic<br>
othrace.  =1 if mom is not black, Hispanic or white<br>
weeksm1.  mom's weeks worked in 1979<br>

In [1]:
import pandas as pd
from linearmodels.iv import IV2SLS
import statsmodels.api as sm


In [3]:
df = pd.read_excel('fertility.xlsx')

In [4]:
df.head(2)

Unnamed: 0,morekids,boy1st,boy2nd,samesex,agem1,black,hispan,othrace,weeksm1
0,0,1,0,0,27,0,0,0,0
1,0,0,1,0,30,0,0,0,30


In [9]:
## add constant term to estimate the constant term beta0
df['const'] = 1

### Simple OLS

In [19]:
## try simple OLS regression
## Do women with more than two kids work less than women with two children?

reg1 = sm.OLS(endog=df['weeksm1'], exog=df[['const', 'morekids', 'agem1', 'black', 'hispan', 'othrace']], missing='drop')
print('type of regression is : ', type(reg1))
results = reg1.fit()
print(results.summary())

## It seems that having more than two kids is associated with about 6 fewer weeks of work for women.
## However, the number of kids could be correlated with unobserved factors in the error term,
## so this estimate may not be causal.

type of regression is :  <class 'statsmodels.regression.linear_model.OLS'>
                            OLS Regression Results                            
Dep. Variable:                weeksm1   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     2331.
Date:                Fri, 04 Jan 2019   Prob (F-statistic):               0.00
Time:                        15:48:54   Log-Likelihood:            -1.1412e+06
No. Observations:              254654   AIC:                         2.283e+06
Df Residuals:                  254648   BIC:                         2.283e+06
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------

### Testing endogeneity

In [39]:
## To statistically check whether 'morekids' is truly endogenous, regress 'morekids' on the instrument
## and the controls, then add the residuals from that regression to the 'weeksm1' regression

reg_test = IV2SLS(df['morekids'], df[['const', 'samesex', 'agem1', 'black', 'hispan', 'othrace']], None, None).fit()
df['resid'] = reg_test.resids
reg_test2 = IV2SLS(df['weeksm1'], df[['const', 'morekids', 'agem1', 'black', 'hispan', 'othrace', 'resid']], None, None).fit()
print(reg_test2.tstats.resid**2)

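## If the residual from the first regression is significant in predicting 'weeksm1',
## it means that the variable 'morekids' is endogenous.
## The value printed below is the squared t-stat on 'resid'. About 0.108 is far below 3.84,
## the 5% critical value of a chi-square with 1 degree of freedom, so exogeneity of 'morekids' is not rejected here.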

0.10838777346472657
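
To put the printed statistic on a p-value scale, it can be compared with a chi-square distribution with one degree of freedom. This is a small sketch that assumes the large-sample approximation applies. It uses only the standard library, relying on the identity that a chi-square(1) tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

t_squared = 0.10838777346472657      # squared t-stat on 'resid', copied from the output above

# For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(t_squared / 2))
print("p-value:", round(p_value, 3))
```

A large p-value means the data give no evidence against treating 'morekids' as exogenous. It does not prove exogeneity.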


### First stage + test instrument validity

In [14]:
## Use 'samesex' as an instrument to remove the correlation of 'morekids' with the error term
## The first stage of 2sls

results_fs = sm.OLS(df['morekids'], df[['const', 'samesex', 'agem1', 'black', 'hispan', 'othrace']], missing = 'drop').fit()
print(results_fs.summary())

## The result shows that samesex is a relevant instrument, and it is not weak.
## The t-score on samesex is 35.774. An instrument is counted as weak if the first-stage F-stat is smaller than 10.
## In this case there is only one instrument, so the F-stat is the square of the t-stat, which is far bigger than 10.
## If the instrument were weak, cov(instrument, endog) would be close to zero, and the coefficient of 'morekids'
#### in the second stage, which divides by that covariance, would be unreliable.
## The first stage says nothing about exogeneity. With a single instrument that remains a judgement call.

                            OLS Regression Results                            
Dep. Variable:               morekids   R-squared:                       0.024
Model:                            OLS   Adj. R-squared:                  0.024
Method:                 Least Squares   F-statistic:                     1262.
Date:                Fri, 04 Jan 2019   Prob (F-statistic):               0.00
Time:                        15:45:34   Log-Likelihood:            -1.7423e+05
No. Observations:              254654   AIC:                         3.485e+05
Df Residuals:                  254648   BIC:                         3.485e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1395      0.009    -16.041      0.0
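
With a single instrument, the first-stage F-stat for the instrument is the square of its t-stat. The arithmetic below uses the t-score of 35.774 quoted in the notes above. It is a homoskedasticity-only figure, since the covariance type of this regression is nonrobust.

```python
t_samesex = 35.774                   # t-stat on 'samesex' quoted in the first-stage notes
f_first_stage = t_samesex ** 2       # F = t^2 when there is exactly one instrument
print(round(f_first_stage, 1), f_first_stage > 10)
```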

### Second stage

In [18]:
## Now, retrieve morekids(hat) to use it in the second stage
df['morekidshat'] = results_fs.predict()

results_ss = sm.OLS(df['weeksm1'], df[['const', 'morekidshat', 'agem1', 'black', 'hispan', 'othrace']], missing = 'drop').fit()
print(results_ss.summary())

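## The result shows that the coefficient of morekids is -5.8211, not the OLS estimate of -6.23.
## If samesex is a valid instrument, the gap reflects bias in the OLS estimate that the instrument removes.
## Note that the standard errors printed by this manual second stage are not valid, because OLS computes
## the residuals with morekidshat instead of morekids.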

                            OLS Regression Results                            
Dep. Variable:                weeksm1   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     1310.
Date:                Fri, 04 Jan 2019   Prob (F-statistic):               0.00
Time:                        15:47:55   Log-Likelihood:            -1.1437e+06
No. Observations:              254654   AIC:                         2.287e+06
Df Residuals:                  254648   BIC:                         2.287e+06
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -4.7919      0.411    -11.673      
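
Why the manual second stage reports the wrong standard errors can be seen in a small simulation. This sketch assumes homoskedastic errors, one instrument, no controls, and made-up coefficients. The naive standard error uses residuals built from the fitted X, while the 2SLS standard error uses residuals built from the actual X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + u                        # true coefficient on x is 2 (assumed)

const = np.ones(n)
Z = np.column_stack([const, z])
X = np.column_stack([const, x])

# Stage 1 fitted values, then stage 2 coefficients
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([const, x_hat])
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

xtx_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive: residuals from the second-stage regression (uses x_hat)
e_naive = y - X_hat @ beta
se_naive = np.sqrt(e_naive @ e_naive / (n - 2) * xtx_inv[1, 1])

# 2SLS: same coefficients, but residuals computed with the actual x
e_2sls = y - X @ beta
se_2sls = np.sqrt(e_2sls @ e_2sls / (n - 2) * xtx_inv[1, 1])

print("coefficient:", beta[1])
print("naive SE:", se_naive, "2SLS SE:", se_2sls)
```

The point estimate is identical in both cases; only the residual variance, and therefore the standard error, changes. In this setup the naive figure is too large, but in general it can err in either direction.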

### 2SLS with single model

In [28]:
## This could be done with a single model.

iv = IV2SLS(dependent = df['weeksm1'], exog = df[['const', 'agem1', 'black', 'hispan', 'othrace']],
           endog = df['morekids'],
           instruments = df['samesex']).fit()
print(iv.summary)

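## The coefficient estimates are the same as in the manual two-stage version. The standard errors differ,
## because IV2SLS computes them from the correct residuals and uses a robust covariance estimator by default.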

                          IV-2SLS Estimation Summary                          
Dep. Variable:                weeksm1   R-squared:                      0.0437
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0437
No. Observations:              254654   F-statistic:                    6955.0
Date:                Fri, Jan 04 2019   P-value (F-stat)                0.0000
Time:                        18:47:28   Distribution:                  chi2(5)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const         -4.7919     0.3898    -12.294     0.0000     -5.5559     -4.0279
agem1          0.8316     0.0226     36.730     0.00