# PS2: IV estimation
This notebook serves as a starting point for PS2. 

# IV estimation with 2SLS
The Two-Stage Least Squares (2SLS) method is an approach used to compute Instrumental Variable (IV) estimates. The procedure works in two steps.

#### Stage 1

In the first stage, each endogenous explanatory variable in the main equation is regressed on all the exogenous variables in the model, that is, on both the exogenous regressors of the main equation and the instruments. Collecting the regressors in $\mathbf{X}$ and all exogenous variables in $\mathbf{Z}$, the first-stage regression is
$$
\mathbf{X} = \mathbf{Z}\delta + \mathbf{e}
$$

The coefficients are estimated by OLS,

$$
\hat{\delta} = (Z'Z)^{-1}Z'X
$$

The predicted values from these regressions are then obtained.
$$
\mathbf{\hat{X}} = \mathbf{Z}\hat{\delta} = Z(Z'Z)^{-1}Z'X
$$
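
The first stage can be sketched in a few lines of NumPy. This is only an illustration on simulated data, not the Ramey dataset: the variable names (`z`, `x`, `delta_hat`, `X_hat`) and the data-generating coefficients are assumptions made for the example. It computes $\hat{\delta}$ by least squares, forms $\mathbf{\hat{X}} = \mathbf{Z}\hat{\delta}$, and checks that the first-stage residuals are orthogonal to $\mathbf{Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simulated data: Z holds a constant and one instrument z
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])

# X holds a constant (exogenous) and one regressor x that depends on z
x = 0.5 + 0.8 * z + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# First stage: delta_hat = (Z'Z)^{-1} Z'X, solved by least squares for each column of X
delta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)

# Fitted values X_hat = Z delta_hat
X_hat = Z @ delta_hat

# First-stage residuals are orthogonal to Z (up to floating-point error)
resid = X - X_hat
print(np.abs(Z.T @ resid).max())
```

Because the constant appears in both $\mathbf{X}$ and $\mathbf{Z}$, its fitted values reproduce it exactly; only the column for `x` is changed by the projection.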


#### Stage 2

This stage is the usual regression estimated by OLS, except that the endogenous regressors in $\mathbf{X}$ have been replaced by their first-stage fitted values $\mathbf{\hat{X}}$.

$$
\mathbf{y} = \mathbf{\hat{X}}\beta + \mathbf{u}
$$

which gives,

$$
b_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'y
$$
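
Both stages can be put together in a short NumPy sketch. It uses simulated data with a known slope so the result can be checked; the data-generating process, the value of `beta_true`, and all variable names are assumptions made for illustration. The sketch compares OLS (biased here, since `x` and the error share the shock `v`) with 2SLS, and confirms that the two-step estimate equals the one-step form $(\hat{X}'X)^{-1}\hat{X}'y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
beta_true = np.array([1.0, 2.0])  # hypothetical intercept and slope

z = rng.normal(size=n)                 # instrument
v = rng.normal(size=n)                 # shock shared by x and the error term
x = 0.8 * z + v + rng.normal(size=n)   # endogenous regressor
u = v + rng.normal(size=n)             # error term, correlated with x through v
y = beta_true[0] + beta_true[1] * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# Stage 1: X_hat = Z (Z'Z)^{-1} Z'X
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Stage 2: b_2sls = (X_hat'X_hat)^{-1} X_hat'y
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Plain OLS on the original regressors, for comparison
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# One-step form: (X_hat'X)^{-1} X_hat'y gives the same estimate
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Residuals for inference should use the original X, not X_hat
u_hat = y - X @ b_2sls

print("OLS :", b_ols)
print("2SLS:", b_2sls)
```

Note that the standard errors reported by a manually run second-stage OLS are not valid for 2SLS, because they are built from $y - \hat{X}b$ rather than $y - Xb$. This is one reason to cross-check results against a dedicated IV routine.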


In [None]:
# import standard libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from linearmodels.iv import IV2SLS # One of many packages for IV regression (if you want to double check results)

We will use the same dataset as Ramey (2016)*.

*Ramey, Valerie A. "Macroeconomic shocks and their propagation." Handbook of macroeconomics 2 (2016): 71-162.

In [15]:
# Short and long names of the relevant variables, combined into a dictionary below
vars_short_names = [
    'DATES',
    'LIP',
    'FFR',
    'RRSHOCK']
vars_long_names = [
    'Month',
    'Industrial Production (log)',
    'Fed Funds Rate',
    'Romer-Romer Shock']

vars_name_mapping = dict(zip(vars_short_names, vars_long_names))
vars_name_mapping

{'DATES': 'Month',
 'LIP': 'Industrial Production (log)',
 'FFR': 'Fed Funds Rate',
 'RRSHOCK': 'Romer-Romer Shock'}

In [18]:
# Load the data and keep only the relevant columns
data = pd.read_csv('data/ramey_data_clean.csv')
data = data[vars_short_names]
data.head()

,DATES,LIP,FFR,RRSHOCK
0,1969-01-01,3.676827,6.3,0.0
1,1969-02-01,3.683206,6.61,0.0
2,1969-03-01,3.691017,6.79,-0.231698
3,1969-04-01,3.687328,7.41,0.456873
4,1969-05-01,3.683543,8.67,0.210627


In [19]:
# Transform DATES to datetime and set as index
data['DATES'] = pd.to_datetime(data['DATES'])
data.set_index('DATES', inplace=True) # inplace=True modifies the data frame in place (i.e. no need to reassign)
data.head()

DATES,LIP,FFR,RRSHOCK
1969-01-01,3.676827,6.3,0.0
1969-02-01,3.683206,6.61,0.0
1969-03-01,3.691017,6.79,-0.231698
1969-04-01,3.687328,7.41,0.456873
1969-05-01,3.683543,8.67,0.210627


In [28]:
# After running an OLS regression with statsmodels, the fitted values are stored in the results object.
import statsmodels.api as sm

# Example regression of LIP on a constant and FFR
X = data[['FFR']]
X = sm.add_constant(X)  # Adds a constant term to the regressors
y = data['LIP']
model = sm.OLS(y, X).fit()
fitted_values = model.fittedvalues  # Fitted values for every observation (a Series indexed by DATES)

# When merging on the index, rows are aligned by index value rather than by a specific column.
# Note: these are fitted values of LIP, despite the column name used below.
df_with_fitted = pd.merge(data, fitted_values.rename('FFR_change_fitted'), left_index=True, right_index=True)
df_with_fitted.head()


DATES,LIP,FFR,RRSHOCK,FFR_change_fitted
1969-01-01,3.676827,6.3,0.0,4.161942
1969-02-01,3.683206,6.61,0.0,4.148534
1969-03-01,3.691017,6.79,-0.231698,4.140748
1969-04-01,3.687328,7.41,0.456873,4.113931
1969-05-01,3.683543,8.67,0.210627,4.059432
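
One detail of merging on the index is worth keeping in mind: `pd.merge` performs an inner join by default, so only index values present in both objects survive. The small sketch below uses made-up values and hypothetical names (`df`, `s`, `fitted`) to show the difference between the default, `how="left"`, and `DataFrame.join`, which is a left join on the index by default.

```python
import pandas as pd

# Toy monthly frame with four dates (values are made up for illustration)
idx = pd.date_range("1969-01-01", periods=4, freq="MS")
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# A named Series covering only the first two dates
s = pd.Series([10.0, 20.0], index=idx[:2], name="fitted")

# Default how="inner": only dates present in both objects are kept
inner = pd.merge(df, s, left_index=True, right_index=True)

# how="left": all rows of df are kept, with NaN where s has no value
left = pd.merge(df, s, left_index=True, right_index=True, how="left")

# DataFrame.join aligns on the index and defaults to a left join
joined = df.join(s)

print(len(inner), len(left), len(joined))
```

If the Series being merged covers only part of the sample (for example because lags or differences dropped some observations), the inner join silently shortens the data frame, so checking the row count after a merge is a cheap safeguard.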
