# Econometrics

# 8th Session

# Simulating Instrumental Variable Data and Performing 2SLS Estimation

# The goal of econometrics is to understand the relationships between variables accurately.

### Importing data from the Excel file

### I created this Excel file with the simulation in Appendix 1. Y was generated with an intercept of 0.7 and a coefficient of 1.2 on X.

In [38]:
import pandas as pd
import numpy as np

In [192]:
data = pd.read_excel("IV_data_session_8_example.xlsx")

Y = data["Y"]
X = data["X"]
Z = data["Z"]

### Checking the correlation between Z and X

In [194]:
np.corrcoef(Z, X)

array([[1.        , 0.67194391],
       [0.67194391, 1.        ]])

### Checking the covariance between X and Y

In [198]:
np.cov(X, Y)

array([[   65.44454431,   220.50413431],
       [  220.50413431, 18969.22434043]])

### Checking the covariance between Z and Y

In [200]:
np.cov(Z, Y)

array([[1.60768174e+01, 2.61547776e+01],
       [2.61547776e+01, 1.89692243e+04]])

### Implementing OLS of Y on X

In [202]:
import statsmodels.api as sm

X_ols = sm.add_constant(X)
ols_model = sm.OLS(Y, X_ols).fit()

print("OLS result: ")
print(ols_model.summary())

OLS result: 
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.039
Method:                 Least Squares   F-statistic:                     407.5
Date:                Fri, 06 Jun 2025   Prob (F-statistic):           7.30e-89
Time:                        00:57:39   Log-Likelihood:                -63242.
No. Observations:               10000   AIC:                         1.265e+05
Df Residuals:                    9998   BIC:                         1.265e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -28.5733      2.626    -10

### The estimated coefficient is 3.36 instead of the expected 1.2. OLS is biased here because X is correlated with the error term, that is, X is endogenous.
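
As a quick cross-check, the slope of a simple regression equals the sample covariance of X and Y divided by the sample variance of X, so the covariance matrix printed earlier already determines it. The second expression is the standard decomposition of that ratio into the true coefficient plus a bias term; \(\beta\) and \(\epsilon\) follow the notation of Appendix 1.

\[
\hat{\beta}_{OLS} = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)} = \frac{220.504}{65.445} \approx 3.369,
\qquad
\operatorname{plim}\,\hat{\beta}_{OLS} = \beta + \frac{\mathrm{Cov}(X, \epsilon)}{\mathrm{Var}(X)}.
\]

With \(\beta = 1.2\), the gap of roughly 2.17 is the bias term, which is nonzero whenever \(\mathrm{Cov}(X, \epsilon) \neq 0\).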

### First stage: using the instrumental variable Z, I regressed the endogenous variable X on Z and saved the fitted values.

In [204]:
Z_intercept = sm.add_constant(Z)
first_stage_model = sm.OLS(X, Z_intercept).fit()
X_hat = first_stage_model.fittedvalues

print("First OLS result: ")
print(first_stage_model.summary())

First OLS result: 
                            OLS Regression Results                            
Dep. Variable:                      X   R-squared:                       0.452
Model:                            OLS   Adj. R-squared:                  0.451
Method:                 Least Squares   F-statistic:                     8230.
Date:                Fri, 06 Jun 2025   Prob (F-statistic):               0.00
Time:                        01:02:32   Log-Likelihood:                -32092.
No. Observations:               10000   AIC:                         6.419e+04
Df Residuals:                    9998   BIC:                         6.420e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0691      0.161 
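
The first-stage fit can be cross-checked against the correlation between Z and X computed earlier. In a regression with a single regressor, \(R^2\) equals the squared sample correlation, and the F-statistic follows from \(R^2\) and the \(n - 2 = 9998\) residual degrees of freedom (here \(r_{ZX}\) is shorthand, introduced for this check, for the sample correlation 0.67194):

\[
R^2 = r_{ZX}^2 = 0.67194^2 \approx 0.4515,
\qquad
F = \frac{R^2}{1 - R^2}\,(n - 2) = \frac{0.4515}{0.5485} \times 9998 \approx 8230.
\]

Both values agree with the summary above. An F-statistic this far above the common rule-of-thumb threshold of 10 suggests that Z is a strong instrument for X.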

### Second stage: I then regressed Y on the fitted values of X from the first stage.

In [206]:
X_hat_intercept = sm.add_constant(X_hat)
second_stage_model = sm.OLS(Y, X_hat_intercept).fit()

print("Second OLS result: ")
print(second_stage_model.summary())

Second OLS result: 
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     22.48
Date:                Fri, 06 Jun 2025   Prob (F-statistic):           2.16e-06
Time:                        01:05:01   Log-Likelihood:                -63431.
No. Observations:               10000   AIC:                         1.269e+05
Df Residuals:                    9998   BIC:                         1.269e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.7000      3.682

### The estimated slope and intercept are now 1.2 and 0.7, matching the values used in the data simulation.
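
The two-stage procedure can be sketched end to end without the Excel file. The block below is a minimal illustration on a hypothetical data-generating process (the confounder `u`, the noise scales, and the helper `ols` are assumptions made for this example, not the simulation from the appendix). It uses plain NumPy least squares instead of statsmodels and shows that, with one instrument, the manual 2SLS slope coincides with the ratio Cov(Z, Y) / Cov(Z, X).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical data-generating process (not the author's Excel data)
z = rng.normal(0.0, 1.0, n)                 # instrument
u = rng.normal(0.0, 1.0, n)                 # unobserved confounder
x = 1.0 * z + u + rng.normal(0.0, 1.0, n)   # endogenous regressor
eps = 2.0 * u + rng.normal(0.0, 1.0, n)     # error term, correlated with x through u
y = 0.7 + 1.2 * x + eps                     # true intercept 0.7, true slope 1.2


def ols(y_vec, x_vec):
    """Return (intercept, slope) from a least-squares fit of y_vec on x_vec."""
    design = np.column_stack([np.ones_like(x_vec), x_vec])
    coef, *_ = np.linalg.lstsq(design, y_vec, rcond=None)
    return coef[0], coef[1]


# Naive OLS of y on x: biased because cov(x, eps) != 0
_, slope_ols = ols(y, x)

# Stage 1: regress x on z and keep the fitted values
a1, b1 = ols(x, z)
x_hat = a1 + b1 * z

# Stage 2: regress y on the fitted values
intercept_2sls, slope_2sls = ols(y, x_hat)

# Ratio-of-covariances form of the same estimator
slope_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS slope: ", slope_ols)
print("2SLS slope:", slope_2sls)
print("Cov ratio: ", slope_ratio)
```

One caveat on inference: the standard errors printed by a manually run second stage are computed from residuals based on the fitted values rather than the observed X, so they are not the correct 2SLS standard errors. The point estimates are unaffected, but a dedicated IV routine is usually preferable when standard errors matter.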

# Appendix 1:
### Simulating Instrumental Variable Data with One Explanatory Variable

**The process begins by generating the instrumental variable \(Z\) from a normal distribution, ensuring it has sufficient variation. Next, \(X\) is derived as a function of \(Z\) with added noise, reinforcing the correlation between them.**

**To control how strongly \(X\) depends on \(Z\), \(X\) is regressed on \(Z\) and only part of the fitted values is subtracted. A share `lambda_factor` of the fitted effect of \(Z\) is kept, so \(X\) remains correlated with \(Z\), but less strongly than the raw series. The error term \(\epsilon\) is constructed to have a non-zero covariance with \(X\), which makes \(X\) endogenous, but a zero covariance with \(Z\), which ensures that \(Z\) does not directly affect the error term. The latter is achieved by orthogonalizing \(\epsilon\) with respect to \(Z\): the raw error is regressed on \(Z\) and only the residuals are kept.**

**Finally, \(Y\) is constructed as a linear function of \(X\) and \(\epsilon\) with predetermined coefficients (intercept 0.7 and slope 1.2). The resulting covariances and correlations are then computed to check the assumptions behind IV estimation: \(Z\) is correlated with \(X\) (relevance), \(Z\) is uncorrelated with \(\epsilon\) (exogeneity), and \(X\) is correlated with \(\epsilon\) (endogeneity).**

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

n = 10000
alpha = 0.7
beta = 1.2

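# Instrument: normal with mean 10 and standard deviation 4
Z = np.random.normal(10, 4, n)

# Raw endogenous variable: depends on Z plus noise
X_raw = 4.5 * Z + np.random.normal(0, 6, n)

# Partially remove Z's effect from X using a linear regression of X_raw on Z
reg_X = LinearRegression().fit(Z.reshape(-1, 1), X_raw)
lambda_factor = 0.3  # share of Z's fitted effect kept in X (controls instrument strength)
X = X_raw - (1 - lambda_factor) * reg_X.predict(Z.reshape(-1, 1))

# Raw error term, built from X_raw so that it is correlated with X
epsilon_raw = np.random.normal(0, 5, n) + np.random.gamma(2, 2, n) * X_raw

# Remove Z's effect from epsilon so that the sample Cov(Z, epsilon) is zero
reg_epsilon = LinearRegression().fit(Z.reshape(-1, 1), epsilon_raw)
epsilon = epsilon_raw - reg_epsilon.predict(Z.reshape(-1, 1))  # residuals are orthogonal to Z

# Outcome built with the true intercept alpha and slope beta
Y = alpha + beta * X + epsilon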

cov_matrix = np.cov([X, Y, Z, epsilon])

cov_X_Y = cov_matrix[0, 1]  #cov(X, Y)
cov_Z_Y = cov_matrix[2, 1]  #cov(Z, Y)
cov_X_epsilon = cov_matrix[0, 3]  #cov(X, epsilon)
cov_Z_epsilon = cov_matrix[2, 3]  #cov(Z, epsilon)
cor_Z_X = np.corrcoef(Z, X)[0, 1]  #correlation(Z, X)

print("Cov(X, Y):", cov_X_Y)
print("Cov(Z, Y):", cov_Z_Y)  
print("Cov(X, epsilon):", cov_X_epsilon)
print("Cov(Z, epsilon):", cov_Z_epsilon)  
print("Correlation(Z, X):", cor_Z_X)  


Cov(X, Y): 234.3792105499043
Cov(Z, Y): 26.35018334073745
Cov(X, epsilon): 156.07069586581417
Cov(Z, epsilon): 2.402215728955199e-13
Correlation(Z, X): 0.6737881632562699


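**We could force \(\mathrm{Cov}(Z, Y)\) to be zero by orthogonalizing \(Y\) with respect to \(Z\), but this approach is not viable in IV estimation because it removes exactly the variation that the second-stage regression needs to identify the effect of the endogenous variable. Because \(Y\) is a linear function of \(X\) and \(\epsilon\), and \(\mathrm{Cov}(Z, \epsilon) = 0\), we have \(\mathrm{Cov}(Z, Y) = \beta\,\mathrm{Cov}(Z, X)\), where \(\beta\) is the coefficient on \(X\). A nonzero covariance between \(Z\) and \(Y\) (such as 26 in this result) is therefore expected and acceptable as long as \(Z\) does not directly affect the error term \(\epsilon\). The simulation should focus on ensuring that \(Z\) is uncorrelated with \(\epsilon\) while remaining relevant for \(X\).**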
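
The point of the paragraph above can be illustrated numerically. The following is a minimal sketch on a hypothetical data-generating process (the confounder `u`, the coefficients, and the helper `residualize` are assumptions for this example, not the appendix simulation). It compares the IV slope computed from the original outcome with the one computed after the outcome has been orthogonalized with respect to the instrument.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility
n = 5000

# Hypothetical data-generating process with true slope 1.2
z = rng.normal(0.0, 1.0, n)               # instrument
u = rng.normal(0.0, 1.0, n)               # unobserved confounder
x = z + u + rng.normal(0.0, 1.0, n)       # endogenous regressor
eps = 2.0 * u + rng.normal(0.0, 1.0, n)   # error term, correlated with x
y = 0.7 + 1.2 * x + eps


def residualize(target, regressor):
    """Residuals from a least-squares fit of target on a constant and regressor."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


cov_zx = np.cov(z, x)[0, 1]

# IV slope from the original outcome: cov(z, y) / cov(z, x)
iv_slope = np.cov(z, y)[0, 1] / cov_zx

# Orthogonalize y with respect to z, then recompute the same ratio
y_perp = residualize(y, z)
iv_slope_perp = np.cov(z, y_perp)[0, 1] / cov_zx

print("IV slope with original Y:      ", iv_slope)
print("IV slope with orthogonalized Y:", iv_slope_perp)
```

After orthogonalization the numerator Cov(Z, Y) is zero up to floating-point error, so the IV slope collapses to zero regardless of the true coefficient. The covariance between the instrument and the outcome is what carries the identifying information.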

### Simulating Instrumental Variable Data with Two Explanatory Variables
*Note: Full code to be uploaded after the student assignment is complete.*

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample size
n = 10000
alpha = 0.7
beta1 = 1.2
beta2 = -0.8

# Generate instrumental variables Z1 and Z2
Z1 = np.random.normal(10, 4, n)  # Unique variation
Z2 = np.random.normal(5, 3, n)  # Different variation

# Create endogenous variables X1 and X2 with dependence on Z1 and Z2
X1_raw = 3.5 * Z1 + 2.0 * Z2 + np.random.normal(0, 5, n)
X2_raw = 2.0 * Z1 + 4.0 * Z2 + np.random.normal(0, 5, n)

# Subtract part of the fitted effect of (Z1, Z2) from each X to control instrument strength
reg_X1 = LinearRegression().fit(np.column_stack((Z1, Z2)), X1_raw)
lambda_factor1 = 0.3  # share of the instruments' fitted effect kept in X1
X1 = X1_raw - (1 - lambda_factor1) * reg_X1.predict(np.column_stack((Z1, Z2)))

reg_X2 = LinearRegression().fit(np.column_stack((Z1, Z2)), X2_raw)
lambda_factor2 = 0.3  # share of the instruments' fitted effect kept in X2
X2 = X2_raw - (1 - lambda_factor2) * reg_X2.predict(np.column_stack((Z1, Z2)))

# Generate error term epsilon (ensuring Cov(X1, epsilon) ≠ 0 and Cov(X2, epsilon) ≠ 0)
epsilon_raw = np.random.normal(0, 5, n) + np.random.gamma(2, 2, n) * (X1_raw + X2_raw)
reg_epsilon = LinearRegression().fit(np.column_stack((Z1, Z2)), epsilon_raw)
epsilon = epsilon_raw - reg_epsilon.predict(np.column_stack((Z1, Z2)))

# Construct Y with correct coefficients
Y = alpha + beta1 * X1 + beta2 * X2 + epsilon

# Compute covariance and correlation matrices
cov_matrix = np.cov([X1, X2, Y, Z1, Z2, epsilon])

cov_X1_Y = cov_matrix[0, 2]  # cov(X1, Y)
cov_X2_Y = cov_matrix[1, 2]  # cov(X2, Y)
cov_Z1_Y = cov_matrix[3, 2]  # cov(Z1, Y)
cov_Z2_Y = cov_matrix[4, 2]  # cov(Z2, Y)
cov_X1_epsilon = cov_matrix[0, 5]  # cov(X1, epsilon)
cov_X2_epsilon = cov_matrix[1, 5]  # cov(X2, epsilon)
cov_Z1_epsilon = cov_matrix[3, 5]  # cov(Z1, epsilon)
cov_Z2_epsilon = cov_matrix[4, 5]  # cov(Z2, epsilon)
cor_Z1_X1 = np.corrcoef(Z1, X1)[0, 1]  # correlation(Z1, X1)
cor_Z2_X2 = np.corrcoef(Z2, X2)[0, 1]  # correlation(Z2, X2)

# Print results
print("Cov(X1, Y):", cov_X1_Y)
print("Cov(X2, Y):", cov_X2_Y)
print("Cov(Z1, Y):", cov_Z1_Y)  # Expected to be nonzero: Z1 affects Y through X1 and X2
print("Cov(Z2, Y):", cov_Z2_Y)  # Expected to be nonzero: Z2 affects Y through X1 and X2
print("Cov(X1, epsilon):", cov_X1_epsilon)  # Should be nonzero (X1 is endogenous)
print("Cov(X2, epsilon):", cov_X2_epsilon)  # Should be nonzero (X2 is endogenous)
print("Cov(Z1, epsilon):", cov_Z1_epsilon)  # Should be close to 0
print("Cov(Z2, epsilon):", cov_Z2_epsilon)  # Should be close to 0
print("Correlation(Z1, X1):", cor_Z1_X1)  # Should be clearly nonzero (relevance)
print("Correlation(Z2, X2):", cor_Z2_X2)  # Should be clearly nonzero (relevance)
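
For orientation, here is one way the estimation step might look with two endogenous regressors and two instruments. This is a generic sketch on a hypothetical data-generating process, not the withheld course code: the confounder `u`, the first-stage coefficients, and the variable names are assumptions for illustration. It runs 2SLS in matrix form with NumPy and compares it with the closed-form estimator that applies when the number of instruments equals the number of endogenous regressors.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed for reproducibility
n = 20000

# Hypothetical data-generating process with true coefficients (0.7, 1.2, -0.8)
z1 = rng.normal(0.0, 1.0, n)
z2 = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 1.0, n)                           # unobserved confounder
x1 = 1.0 * z1 + 0.5 * z2 + u + rng.normal(0.0, 1.0, n)
x2 = 0.5 * z1 + 1.0 * z2 - u + rng.normal(0.0, 1.0, n)
eps = 1.5 * u + rng.normal(0.0, 1.0, n)               # correlated with x1 and x2
y = 0.7 + 1.2 * x1 - 0.8 * x2 + eps

# Design matrices, each with a constant column
X = np.column_stack([np.ones(n), x1, x2])
Z = np.column_stack([np.ones(n), z1, z2])

# Stage 1: project every column of X onto the instruments
first_stage_coef = np.linalg.lstsq(Z, X, rcond=None)[0]
X_hat = Z @ first_stage_coef

# Stage 2: regress y on the projected regressors
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Closed form for the just-identified case: solve (Z'X) beta = Z'y
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print("2SLS estimates:       ", beta_2sls)
print("Closed-form estimates:", beta_iv)
```

In this just-identified setting the two sets of estimates agree up to floating-point error. With more instruments than endogenous regressors the closed form no longer applies, while the two-stage projection still does.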
