
In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from numpy.linalg import inv

## Generate data

Generate data according to the linear model $Y = 2 - 3X_{1} + 0.5X_{2} + \epsilon$

In [3]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1))
x2 = np.random.normal(-1,1,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

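Stack the data in a data matrix and create a pandas DataFrame object. The column names `Spending`, `Age`, and `Income` are simply labels for $Y$, $X_{1}$, and $X_{2}$.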

In [4]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])

In [5]:
df.head()

    Spending  Intercept       Age    Income
0   0.687042        1.0  0.848337 -0.570346
1  -1.787827        1.0  0.925578 -0.727972
2   0.499758        1.0  0.690084 -1.375119
3 -15.477492        1.0  5.355435 -2.137626
4  -8.200758        1.0  2.963194 -2.908071


## OLS estimator

$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$

In [6]:
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[[ 2.01760072]
 [-2.98040451]
 [ 0.52837441]]


In [7]:
inv((X.T)@X)@(X.T)@y

array([[ 2.01760072],
       [-2.98040451],
       [ 0.52837441]])

In [8]:
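# np.matrix overloads * as matrix multiplication and .I as the inverse;
# NumPy no longer recommends this class, so prefer regular arrays with @
X=np.asmatrix(X)
y=np.asmatrix(y)
((((X.T)*X).I)*X.T)*y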

matrix([[ 2.01760072],
        [-2.98040451],
        [ 0.52837441]])
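
The three cells above all form the explicit inverse of $X^{T}X$. That works here, but solving the normal equations directly, or calling a least-squares routine on $X$, is generally the more numerically stable choice. A minimal sketch on freshly simulated data from the same model; the seed and the names `b_inv`, `b_solve`, and `b_lstsq` are illustrative and not part of the original notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook
n = 1000
e = rng.normal(0, np.sqrt(2), (n, 1))
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

# Explicit inverse, as in the cells above
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the normal equations (X'X) b = X'y without forming the inverse
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Columns: inverse, solve, lstsq (they should agree to many decimals)
print(np.hstack([b_inv, b_solve, b_lstsq]))
```

On a well-conditioned design like this one the three approaches should agree closely; the difference tends to matter only when the regressors are nearly collinear.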

Compare with the estimates from the `statsmodels` OLS routine.

In [9]:
result = sm.ols('Spending ~ Age + Income',data=df).fit()

In [10]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.950
Method:                 Least Squares   F-statistic:                     9400.
Date:                Wed, 20 Nov 2024   Prob (F-statistic):               0.00
Time:                        07:11:54   Log-Likelihood:                -1740.9
No. Observations:                1000   AIC:                             3488.
Df Residuals:                     997   BIC:                             3502.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0176      0.078     25.945      0.0
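
The `std err` and `t` columns of a summary like this one can be reproduced by hand from the classical (nonrobust) formula $\widehat{\mathrm{Var}}(\hat{\beta}) = s^{2}(X^{T}X)^{-1}$ with $s^{2} = \hat{e}^{T}\hat{e}/(n-k)$. A sketch on freshly simulated data from the same model; the seed and variable names are illustrative, so the numbers will not match the output above exactly:

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed, not from the notebook
n = 1000
e = rng.normal(0, np.sqrt(2), (n, 1))
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

k = X.shape[1]                        # number of estimated coefficients
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                 # OLS coefficients
resid = y - X @ b                     # residuals
s2 = (resid.T @ resid).item() / (n - k)   # estimate of the error variance
se = np.sqrt(s2 * np.diag(XtX_inv))   # classical standard errors
t_stats = b.ravel() / se              # t statistics for H0: coefficient = 0

print("s^2:", s2)
print("std err:", se)
print("t:", t_stats)
```

Since the errors are generated with variance 2, `s2` should land near 2.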

# Exogeneity condition does not hold


Generate data according to the linear model $Y = 2 - 3X_{1} + 0.5X_{2} + \epsilon$, but $X_{1}$ and $\epsilon$ are correlated. Hence, $E[\epsilon|X_{1}] \neq 0$

In [74]:
n = 1000 #sample size, number of observations

#e and x1 are correlated from a multivariate normal
samples = np.random.multivariate_normal([0, -2], [[1, 0.5], [0.5, 1]], n)
e = samples[:, 0].reshape(-1,1)  # First column as 'e'
x1 = samples[:, 1].reshape(-1,1) # Second column as 'x1'
x2 = np.random.exponential(2,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

In [75]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])

In [76]:
inv((X.T)@X)@(X.T)@y

array([[ 2.90807443],
       [-2.52989792],
       [ 0.5245614 ]])

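Here $X_1$ is endogenous, and the OLS estimator of its coefficient (written $\beta_{2}$ here, counting the intercept as $\beta_{1}$) is biased: the estimate is about $-2.53$ against a true value of $-3$. When $\epsilon$ and $X_{1}$ are positively correlated, as in this sample, we overestimate $\beta_{2}$; when they are negatively correlated, we underestimate it. The intercept is also affected (about $2.91$ against a true value of $2$), while the OLS estimator for $\beta_{3}$, the coefficient on $X_{2}$, appears to be unbiased and unaffected.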
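
The size of the bias in this design can be worked out directly. A sketch, using the parameters from the cell above ($\mathrm{Cov}(X_{1},\epsilon) = 0.5$, $\mathrm{Var}(X_{1}) = 1$, $E[X_{1}] = -2$, joint normality) and the fact that $X_{2}$ is drawn independently of both:

$$E[\epsilon \mid X_{1}] = \frac{\mathrm{Cov}(X_{1},\epsilon)}{\mathrm{Var}(X_{1})}\,(X_{1} - E[X_{1}]) = 0.5\,(X_{1} + 2) = 1 + 0.5X_{1}$$

$$E[Y \mid X_{1}, X_{2}] = (2 + 1) + (-3 + 0.5)X_{1} + 0.5X_{2} = 3 - 2.5X_{1} + 0.5X_{2}$$

So the OLS estimates should settle near $(3, -2.5, 0.5)$, which is close to the array printed above.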

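Compare with the estimates from the `statsmodels` OLS routine. Note that the summary below comes from a different run of the notebook (cell 34 rather than cell 76), so its coefficients do not match the array above; its intercept near $1$ and `Age` coefficient near $-3.5$ are what one would expect when $\epsilon$ and $X_{1}$ are negatively correlated.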

In [34]:
sm.ols('Spending ~ Age + Income',data=df).fit().summary()

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     8329.
Date:                Wed, 20 Nov 2024   Prob (F-statistic):               0.00
Time:                        07:27:37   Log-Likelihood:                -1310.8
No. Observations:                1000   AIC:                             2628.
Df Residuals:                     997   BIC:                             2642.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9618      0.069     13.888      0.000       0.826       1.098
Age           -3.4959      0.028   -123.793      0.000      -3.551      -3.441
Income         0.5090      0.014     36.478      0.000       0.482       0.536
------------------------------------------------------------------------------
Omnibus:                        3.489   Durbin-Watson:                   2.044
Prob(Omnibus):                  0.175   Jarque-Bera (JB):                3.809
Skew:                          -0.050   Prob(JB):                        0.149
Kurtosis:                       3.285   Cond. No.                         8.89


# Unbiasedness of OLS

In [59]:
n = 1000  # Sample size
num_iterations = 10000  # Number of repetitions

# Store the estimated coefficients for each iteration
estimated_coeffs = []

for _ in range(num_iterations):
    # Generate data
    e = np.random.normal(0, np.sqrt(2), (n, 1))
    x1 = np.random.exponential(2, (n, 1))
    x2 = np.random.normal(-1, 1, (n, 1))
    y = 2 - 3 * x1 + 0.5 * x2 + e

    # Construct the design matrix
    X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

    # Calculate the OLS estimator
    b = inv(X.T @ X) @ X.T @ y

    # Store the estimated coefficients
    estimated_coeffs.append(b)

# Convert the list of estimated coefficients to a NumPy array
estimated_coeffs = np.array(estimated_coeffs)

# Calculate the mean of the estimated coefficients
mean_coeffs = np.mean(estimated_coeffs, axis=0)

# Print the mean coefficients
print("Mean of estimated coefficients:")
print(mean_coeffs)

Mean of estimated coefficients:
[[ 1.99906687]
 [-2.99970232]
 [ 0.49956056]]


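Calculate the bias when one of the regressors is endogenous. Here $\epsilon$ and $X_{1}$ have covariance $-0.5$, the opposite sign from the single-sample example earlier.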

In [140]:
n = 1000  # Sample size
num_iterations = 10000  # Number of repetitions

# Store the estimated coefficients for each iteration
estimated_coeffs = []

for _ in range(num_iterations):
    # Generate data
    samples = np.random.multivariate_normal([0, -2], [[1, -0.5], [-0.5, 1]], n)
    e = samples[:, 0].reshape(-1,1)  # First column as 'e'
    x1 = samples[:, 1].reshape(-1,1) # Second column as 'x1'
    x2 = np.random.exponential(2,(n,1))
    y = 2 - 3*x1 + 0.5*x2 + e

    # Construct the design matrix
    X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

    # Calculate the OLS estimator
    b = inv(X.T @ X) @ X.T @ y

    # Store the estimated coefficients
    estimated_coeffs.append(b)

# Convert the list of estimated coefficients to a NumPy array
estimated_coeffs = np.array(estimated_coeffs)

# Calculate the mean of the estimated coefficients
mean_coeffs = np.mean(estimated_coeffs, axis=0)

# Print the mean coefficients
print("Mean of estimated coefficients:")
print(mean_coeffs)

Mean of estimated coefficients:
[[ 0.99980721]
 [-3.50025736]
 [ 0.49982782]]
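
To see how the bias moves with the sign of the correlation, the simulation can be wrapped in a function and run for several covariance values. The sketch below is a scaled-down version (fewer and smaller samples, a fixed seed); `mean_ols_estimate` is a hypothetical helper, not part of the original notebook. Under this design, with $\mathrm{Var}(X_{1}) = 1$ and $E[X_{1}] = -2$, a covariance of $c$ between $\epsilon$ and $X_{1}$ should shift the slope on $X_{1}$ by about $c$ and the intercept by about $2c$.

```python
import numpy as np

def mean_ols_estimate(cov_e_x1, n=500, reps=400, seed=0):
    """Average OLS estimate over `reps` simulated samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    cov = [[1, cov_e_x1], [cov_e_x1, 1]]
    estimates = np.empty((reps, 3))
    for r in range(reps):
        # e and x1 are drawn jointly so that they are correlated
        samples = rng.multivariate_normal([0, -2], cov, n)
        e, x1 = samples[:, 0], samples[:, 1]
        x2 = rng.exponential(2, n)
        y = 2 - 3 * x1 + 0.5 * x2 + e
        X = np.column_stack((np.ones(n), x1, x2))
        estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)
    return estimates.mean(axis=0)

true_beta = np.array([2.0, -3.0, 0.5])
for c in (-0.5, 0.0, 0.5):
    bias = mean_ols_estimate(c) - true_beta
    print(f"Cov(e, x1) = {c:+.1f}: bias = {np.round(bias, 3)}")
```

The case $c = -0.5$ corresponds to the cell above, where the averages are close to $1$ for the intercept and $-3.5$ for the slope.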


# Multicollinearity

When the regressors are highly collinear, OLS estimates become highly imprecise, even though the regressors are exogenous.

In [54]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1))
x2 = 3*x1 - 2 + np.random.normal(0,0.0001,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

X = np.concatenate((np.ones((n,1)),x1,x2),axis=1) #stack data in a data matrix
b = inv((X.T)@X)@(X.T)@y
print(b)
#run this block multiple times

[[-1112.66006955]
 [ 1669.06321518]
 [ -556.84817582]]


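A common rule of thumb is that a condition number of $X$ above 30 (usually computed after scaling the columns to unit length) signals significant multicollinearity. The condition number of $X^{T}X$ is the square of that of $X$, so the corresponding threshold for $X^{T}X$ is about 900. The value computed below is far beyond either threshold.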

In [55]:
np.linalg.cond((X.T)@X)

75227219356.26456
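
Two related diagnostics can be computed by hand: the condition number of $X$ itself and the variance inflation factor (VIF) of a regressor, $1/(1 - R_{j}^{2})$, where $R_{j}^{2}$ comes from regressing that column on the others. A sketch on freshly simulated data with the same near-collinear design; the seed and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook
n = 1000
x1 = rng.exponential(2, (n, 1))
x2 = 3 * x1 - 2 + rng.normal(0, 0.0001, (n, 1))  # x2 is almost a linear function of x1
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

cond_X = np.linalg.cond(X)          # condition number of the design matrix
cond_XtX = np.linalg.cond(X.T @ X)  # roughly cond_X squared

# VIF for x2: regress x2 on the intercept and x1.
# SST/SSR equals 1/(1 - R^2) and avoids subtracting two nearly equal numbers.
Z = X[:, :2]
g, *_ = np.linalg.lstsq(Z, x2, rcond=None)
ssr = np.sum((x2 - Z @ g) ** 2)
sst = np.sum((x2 - x2.mean()) ** 2)
vif_x2 = sst / ssr

print("cond(X):   ", cond_X)
print("cond(X'X): ", cond_XtX)
print("VIF for x2:", vif_x2)
```

A VIF this large means that almost all of the variation in `x2` is explained by `x1`, which is why the individual coefficients are so imprecisely estimated.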

In [56]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])
result = sm.ols('Spending ~ Age + Income',data=df).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.816
Model:                            OLS   Adj. R-squared:                  0.816
Method:                 Least Squares   F-statistic:                     2213.
Date:                Wed, 20 Nov 2024   Prob (F-statistic):               0.00
Time:                        07:35:40   Log-Likelihood:                -1739.8
No. Observations:                1000   AIC:                             3486.
Df Residuals:                     997   BIC:                             3500.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -1112.6602    859.964     -1.294      0.1

## Omitting a variable

Suppose we omit x2 (estimating a misspecified model).

In [60]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1))
x2 = np.random.normal(-1,1,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

In [61]:
X = np.concatenate((np.ones((n,1)),x1),axis=1)

In [62]:
b = inv((X.T)@X)@(X.T)@y
print(b)

[[ 1.47410201]
 [-2.98197186]]


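Surprisingly, we still get a consistent estimate for the coefficient of x1! That is because x1 and x2 are independent, so the omitted term 0.5*x2 is uncorrelated with x1. Only the intercept shifts, from 2 to about 2 + 0.5*E[x2] = 1.5. Now assume a data-generating process where x1 and x2 are not independent.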
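
The omitted-variable formula makes the role of independence explicit. As a sketch, assuming $\epsilon$ is uncorrelated with $X_{1}$, the slope from regressing $Y$ on $X_{1}$ alone converges to

$$\frac{\mathrm{Cov}(X_{1}, Y)}{\mathrm{Var}(X_{1})} = -3 + 0.5\,\frac{\mathrm{Cov}(X_{1}, X_{2})}{\mathrm{Var}(X_{1})}$$

The second term vanishes when $X_{1}$ and $X_{2}$ are uncorrelated. Otherwise its sign is the sign of the omitted coefficient ($0.5$) times the sign of the covariance.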


In [136]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,)) #error term
sigma = np.array([[1,0.75],[0.75,1]])
x = np.random.multivariate_normal(np.array([-1,2]),sigma,(n,))
np.shape(x)
y = 2 - 3*x[:,0] + 0.5*x[:,1] + e

In [137]:
#ols estimator for the correct specification
X = np.concatenate((np.ones((n,1)),x),axis=1) #stack data in a data matrix
b = inv((X.T)@X)@(X.T)@y
print(b)

[ 1.9629114  -2.96270261  0.50607632]


In [138]:
#ols estimator omitting x2
X = np.concatenate((np.ones((n,1)),x[:,0].reshape(-1,1)),axis=1) #stack data in a data matrix
b = inv((X.T)@X)@(X.T)@y
print(b)

[ 3.32868391 -2.60312045]


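What is the direction of the bias? $X_1$ and $X_2$ are positively correlated here ($\mathrm{Cov}(X_1, X_2) = 0.75$, $\mathrm{Var}(X_1) = 1$) and the omitted coefficient is positive ($0.5$), so the coefficient on $X_1$ picks up an extra $0.5 \times 0.75 = 0.375$. The OLS estimate ($-2.60$) is larger than the true value ($-3$), hence the OLS estimate is biased upwards.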
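
The link between the two regressions also holds exactly in the sample: the coefficients from the short regression equal the long-regression coefficients on the included variables plus the long-regression coefficient on the omitted variable times the coefficients from an auxiliary regression of `x2` on the included regressors. A sketch on freshly simulated data with the same design; the seed and the names `b_long`, `b_short`, `delta`, and `b_implied` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook
n = 1000
e = rng.normal(0, np.sqrt(2), n)
sigma = np.array([[1, 0.75], [0.75, 1]])
x = rng.multivariate_normal([-1, 2], sigma, n)
y = 2 - 3 * x[:, 0] + 0.5 * x[:, 1] + e

ones = np.ones(n)
X_long = np.column_stack((ones, x))         # intercept, x1, x2
X_short = np.column_stack((ones, x[:, 0]))  # intercept, x1 only

b_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)
b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)

# Auxiliary regression of the omitted x2 on the included regressors
delta = np.linalg.solve(X_short.T @ X_short, X_short.T @ x[:, 1])

# Short coefficients implied by the long regression and the auxiliary regression
b_implied = b_long[:2] + b_long[2] * delta

print("short regression: ", b_short)
print("implied by formula:", b_implied)
```

The slope in `delta` estimates $\mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1)$, which is $0.75$ in this design.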

What if $X_1$ and $X_2$ are negatively correlated?

In [66]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,)) #error term
sigma = np.array([[1,-1.5],[-1.5,3]])
x = np.random.multivariate_normal(np.array([1,2]),sigma,(n,))
y = 2 - 3*x[:,0] + 0.5*x[:,1] + e

In [67]:
#ols estimator omitting x2
X = np.concatenate((np.ones((n,1)),x[:,0].reshape(-1,1)),axis=1) #stack data in a data matrix
b = inv((X.T)@X)@(X.T)@y
print(b)

[ 3.72540368 -3.7722727 ]


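When $X_1$ and $X_2$ are negatively correlated (here $\mathrm{Cov}(X_1, X_2) = -1.5$, $\mathrm{Var}(X_1) = 1$), the coefficient on $X_1$ shifts by $0.5 \times (-1.5) = -0.75$. The OLS estimate ($-3.77$) is smaller than the true value ($-3$), hence the OLS estimate is biased downwards.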