# Introduction to instrumental variables (IV)
### Iván Andrés Trujillo

The main reference for this session is Wooldridge (2015).

# Endogeneity 
A model suffers from endogeneity when one or more independent variables are correlated with the error term.

## Sources of endogeneity 


### Omitted variable bias
A confounding variable affects both $y$ and the explanatory vector $X$ but is left out of the model.
### Simultaneity ($x$ causes $y$ and $y$ causes $x$)
Think of prices and quantities.
### Measurement error
We cannot observe the variable $x_{i}$ directly, so we use $x_{i} + n_{i}$ instead, where $n_{i}$ is the noise.


All these problems lead to biased estimators.
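
As one illustration of such bias, the measurement-error case can be simulated. This is a minimal sketch on synthetic data; the true slope of 2, the unit noise variance and the helper `ols_slope` are assumptions chosen for the example, not values from the reference. With these choices the slope estimated from the noisy regressor is pulled toward zero (roughly half the true slope).

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
n = 100_000

x_true = rng.normal(size=n)             # the variable we would like to observe
noise = rng.normal(size=n)              # measurement noise n_i
x_obs = x_true + noise                  # what we actually observe
y = 2.0 * x_true + rng.normal(size=n)   # assumed true slope: 2


def ols_slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


slope_true = ols_slope(x_true, y)       # close to 2
slope_obs = ols_slope(x_obs, y)         # attenuated toward 0
print(f"slope with true x:  {slope_true:.3f}")
print(f"slope with noisy x: {slope_obs:.3f}")
```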



# Instrumental variables 
Instrumental variables are designed to tackle endogeneity.



## Example 
We can model the consumption of a product as a function of its price, but the price is in turn a function of the quantity consumed:
\begin{equation}
\begin{aligned}
q &= f(p)\\
p &= f^{-1}(q)
\end{aligned}
\end{equation}
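
The price and quantity example can be sketched with a small simulated market. The linear demand and supply curves, their coefficients and the shock variances below are all assumptions made for illustration. Because price and quantity are determined together, regressing quantity on price does not recover the demand slope of -1; with these particular choices the OLS slope is close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed, illustrative only
n = 100_000

# Assumed structural equations (hypothetical coefficients):
#   demand: q = 10 - 1.0 * p + u_d
#   supply: q =  2 + 1.0 * p + u_s
u_d = rng.normal(size=n)
u_s = rng.normal(size=n)

# Market equilibrium: solve demand = supply for p, then plug into demand.
p = (10.0 - 2.0 + u_d - u_s) / (1.0 + 1.0)
q = 10.0 - 1.0 * p + u_d

# Naive OLS slope of q on p.
ols_slope = np.cov(p, q)[0, 1] / np.var(p, ddof=1)
print(f"true demand slope: -1.0, OLS slope: {ols_slope:.3f}")
```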




# Biased $\bf{\beta}$

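If we omit confounding variables, OLS estimation can produce a biased and inconsistent $\bf{\beta}$. For example, suppose the true model is the one below, where $z$ is a confounder correlated with $x$; leaving $z$ out pushes it into the error term: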

\begin{equation}
y = \beta_{0} + \beta_{1}x + \beta_{2} z + u 
\end{equation}
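
A minimal simulation of the equation above may make the bias concrete. The coefficients ($\beta_{0}=1$, $\beta_{1}=2$, $\beta_{2}=3$), the link between $x$ and $z$, and the helper `ols` are assumptions for this sketch. The regression that includes $z$ recovers $\beta_{1}$, while the one that omits it does not.

```python
import numpy as np

rng = np.random.default_rng(2)          # fixed seed, illustrative only
n = 100_000

z = rng.normal(size=n)                  # confounder
x = 0.5 * z + rng.normal(size=n)        # x depends on z (assumed link)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)


def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


beta_long = ols(y, x, z)                # z included
beta_short = ols(y, x)                  # z omitted

print(f"beta_1 with z included: {beta_long[1]:.3f}")
print(f"beta_1 with z omitted:  {beta_short[1]:.3f}")
```

Under these assumptions the omitted-variable estimate converges to $2 + 3 \cdot cov(x,z)/var(x) = 3.2$ rather than 2.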



## Two-stage least squares (2SLS) 

In simplified form, the model is:
\begin{equation}
y = \beta x +  u 
\end{equation}

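But $x$ is endogenous, so we use $z$ as an instrumental variable and estimate the model in two stages. First we regress $x$ on $z$ and compute the fitted values $\hat{x}$; then we regress $y$ on $\hat{x}$:
\begin{equation}
\begin{aligned}
x &= \alpha z + v \\
\hat{x} &= \hat{\alpha} z \\
y &= \beta \hat{x} + u
\end{aligned}
\end{equation}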


In general terms, we use the instrumental variables to predict the endogenous regressors, and then use those predicted values in place of the endogenous regressors in the original model.
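
The two stages can be sketched by hand on simulated data. Everything in the data-generating process (the confounder `c`, the coefficients, the true $\beta = 1.5$) and the helper `ols` are assumptions for illustration. The point estimate from this manual procedure matches 2SLS, but the standard errors of a hand-run second stage are not valid, so in practice a dedicated routine such as `IV2SLS` is preferable.

```python
import numpy as np

rng = np.random.default_rng(3)          # fixed seed, illustrative only
n = 100_000

z = rng.normal(size=n)                  # instrument
c = rng.normal(size=n)                  # unobserved confounder
x = 0.8 * z + c + rng.normal(size=n)    # endogenous regressor
y = 1.5 * x + 2.0 * c + rng.normal(size=n)   # assumed true beta: 1.5


def ols(y, x):
    """Intercept and slope of a simple regression of y on x."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


beta_ols = ols(y, x)[1]                 # biased: x is correlated with c

a0, a1 = ols(x, z)                      # stage 1: regress x on z
x_hat = a0 + a1 * z                     # fitted values of x
beta_2sls = ols(y, x_hat)[1]            # stage 2: regress y on x_hat

print(f"OLS:  {beta_ols:.3f}")
print(f"2SLS: {beta_2sls:.3f}")
```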





## Can $z$ be an instrumental variable?
An instrumental variable must satisfy the following two conditions, relevance and exogeneity:
\begin{equation}
\begin{aligned}
corr(z,x) &\neq 0 \\
corr(z,u) &= 0
\end{aligned}
\end{equation}
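
A short derivation may help show why both conditions are needed. It is written with covariances rather than correlations (the conditions are equivalent as long as the variances are nonzero) and uses the simple model $y = \beta x + u$:

\begin{equation}
\begin{aligned}
cov(z,y) &= cov(z, \beta x + u) = \beta \, cov(z,x) + cov(z,u) \\
\beta &= \frac{cov(z,y)}{cov(z,x)} \quad \text{if } cov(z,u) = 0 \text{ and } cov(z,x) \neq 0
\end{aligned}
\end{equation}

The second condition removes the term $cov(z,u)$, and the first keeps the denominator away from zero.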


In the example of predicting the demand for an article, there is a clear endogeneity problem; as an instrumental variable we could use the prices of other products.



## Validity of the instrument
First we check the relevance condition, $corr(z,x) \neq 0$. To do so, we estimate the following first-stage model:

\begin{equation}
x = \alpha_{0} + \alpha_{1} z + v
\end{equation}

We then test $H_{0} : \alpha_{1} = 0$. If we fail to reject $H_{0}$, there is no evidence that $z$ is related to $x$, and the instrument is weak. The exogeneity condition $corr(z,u) = 0$ cannot be checked in this way, because $u$ is unobserved.
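
The relevance test can be sketched as follows on simulated data. The first-stage coefficient of 0.5 and the sample size are assumptions for this example. With a single instrument, the first-stage $F$ statistic is the square of the $t$ statistic on $\alpha_{1}$; a common rule of thumb treats $F$ below about 10 as a sign of a weak instrument.

```python
import numpy as np

rng = np.random.default_rng(4)          # fixed seed, illustrative only
n = 1_000

z = rng.normal(size=n)                  # instrument
x = 0.5 * z + rng.normal(size=n)        # assumed first stage, alpha_1 = 0.5

# First-stage OLS: x = alpha_0 + alpha_1 * z + v
Z = np.column_stack([np.ones(n), z])
alpha, *_ = np.linalg.lstsq(Z, x, rcond=None)

# Classical (homoskedastic) standard errors.
resid = x - Z @ alpha
sigma2 = resid @ resid / (n - Z.shape[1])
cov_alpha = sigma2 * np.linalg.inv(Z.T @ Z)

t_stat = alpha[1] / np.sqrt(cov_alpha[1, 1])
f_stat = t_stat ** 2                    # one instrument: F = t^2

print(f"alpha_1 = {alpha[1]:.3f}, t = {t_stat:.2f}, F = {f_stat:.1f}")
```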



## How can we test whether 2SLS is necessary?
The Hausman test can answer this by comparing the OLS and 2SLS estimates: if all variables are exogenous, both techniques give consistent estimates, so a large difference between them is evidence of endogeneity.
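
One common way to carry out this comparison is the regression-based version of the test: save the first-stage residuals $\hat{v}$, add them to the original equation, and test whether their coefficient is zero. The sketch below uses simulated data in which $x$ is endogenous by construction; the data-generating process and the helper `ols_with_se` are assumptions for illustration, and the standard errors are the classical homoskedastic ones.

```python
import numpy as np

rng = np.random.default_rng(5)          # fixed seed, illustrative only
n = 5_000

z = rng.normal(size=n)                  # instrument
c = rng.normal(size=n)                  # unobserved confounder
x = 0.8 * z + c + rng.normal(size=n)    # endogenous by construction
y = 1.5 * x + 2.0 * c + rng.normal(size=n)


def ols_with_se(y, X):
    """OLS coefficients, classical standard errors and residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se, resid


const = np.ones(n)

# Step 1: first stage, keep the residuals v_hat.
_, _, v_hat = ols_with_se(x, np.column_stack([const, z]))

# Step 2: add v_hat to the structural equation and test its coefficient.
coef, se, _ = ols_with_se(y, np.column_stack([const, x, v_hat]))
delta = coef[2]
t_stat = delta / se[2]

print(f"coefficient on v_hat: {delta:.3f}, t = {t_stat:.2f}")
```

A large $|t|$ on $\hat{v}$ rejects exogeneity of $x$, which suggests that 2SLS is needed; a small one gives no evidence against OLS.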


# Important things 

There is an R package called `wooldridge` that contains the data sets used in the textbook.

In [1]:
%load_ext rpy2.ipython

In [2]:
%%R
#install.packages("wooldridge")
library(wooldridge)
data("mroz")
write.csv(mroz,"mroz.csv", row.names = FALSE) # Export to CSV so that pandas can read the data.frame.

In [3]:
# Now we can load this data.frame into a pandas DataFrame

In [4]:
import pandas as pd
mroz = pd.read_csv("mroz.csv")

In [5]:
mroz.columns

Index(['inlf', 'hours', 'kidslt6', 'kidsge6', 'age', 'educ', 'wage', 'repwage',
       'hushrs', 'husage', 'huseduc', 'huswage', 'faminc', 'mtr', 'motheduc',
       'fatheduc', 'unem', 'city', 'exper', 'nwifeinc', 'lwage', 'expersq'],
      dtype='object')

# Return to education for working women
### Example 15.5
\begin{equation}
\widehat{\log(wage)} = \hat{\beta}_{0} + \hat{\beta}_{1} educ + \hat{\beta}_{2} exper + \hat{\beta}_{3} exper^{2}
\end{equation}

#### Model syntax in software
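```{python}
from linearmodels.iv import IV2SLS

# Template: exogenous regressors go outside the brackets,
# the endogenous regressor and its instruments go inside them.
formula = 'y ~ 1 + Exogenous1 + Exogenous2  + ... + [endogenous ~ instrument]'

# data is a pandas DataFrame containing the columns named in the formula
model = IV2SLS.from_formula(formula, data)

model.fit().summary
```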
The `1` in the formula adds the constant (intercept) to the model.

In [6]:
# Perform OLS
from statsmodels.regression.linear_model import OLS

In [7]:
from linearmodels.iv import IV2SLS
import numpy as np

In [8]:
mroz["exper2"] = mroz["exper"]**2

In [9]:
import statsmodels.formula.api as smf       # Fits statistical models using R-style formulas
# First stage: regress educ on the exogenous variables and the instruments
model = smf.ols(formula = "educ ~ 1 + exper + exper2 + motheduc + fatheduc", data=mroz).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | educ | R-squared: | 0.262 |
| Model: | OLS | Adj. R-squared: | 0.258 |
| Method: | Least Squares | F-statistic: | 66.52 |
| Date: | Mon, 19 Sep 2022 | Prob (F-statistic): | 3.67e-48 |
| Time: | 20:40:34 | Log-Likelihood: | -1574.1 |
| No. Observations: | 753 | AIC: | 3158.0 |
| Df Residuals: | 748 | BIC: | 3181.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 8.3667 | 0.267 | 31.370 | 0.000 | 7.843 | 8.890 |
| exper | 0.0854 | 0.026 | 3.342 | 0.001 | 0.035 | 0.136 |
| exper2 | -0.0019 | 0.001 | -2.243 | 0.025 | -0.003 | -0.000 |
| motheduc | 0.1856 | 0.026 | 7.143 | 0.000 | 0.135 | 0.237 |
| fatheduc | 0.1846 | 0.024 | 7.534 | 0.000 | 0.136 | 0.233 |

| | | | |
|---|---|---|---|
| Omnibus: | 15.108 | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 28.196 |
| Skew: | 0.005 | Prob(JB): | 7.54e-07 |
| Kurtosis: | 3.948 | Cond. No. | 1150.0 |


In [10]:
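# Keep complete cases only; this drops the women with no observed wage
mroz.dropna(inplace=True)
# exper and exper2 are exogenous; educ is instrumented by motheduc and fatheduc
formulaIV = 'lwage ~ 1 + exper + exper2 + [educ ~ motheduc + fatheduc]'
model = IV2SLS.from_formula(formulaIV, mroz)
model.fit()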

| | | | |
|---|---|---|---|
| Dep. Variable: | lwage | R-squared: | 0.1357 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.1296 |
| No. Observations: | 428 | F-statistic: | 18.611 |
| Date: | Mon, Sep 19 2022 | P-value (F-stat) | 0.0003 |
| Time: | 20:40:34 | Distribution: | chi2(3) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 0.0481 | 0.4278 | 0.1124 | 0.9105 | -0.7903 | 0.8865 |
| exper | 0.0442 | 0.0155 | 2.8546 | 0.0043 | 0.0138 | 0.0745 |
| exper2 | -0.0009 | 0.0004 | -2.1001 | 0.0357 | -0.0017 | -5.997e-05 |
| educ | 0.0614 | 0.0332 | 1.8503 | 0.0643 | -0.0036 | 0.1264 |


# Reference
Wooldridge, J. M. (2015). *Introductory Econometrics: A Modern Approach*. Cengage Learning.