
In [2]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from numpy.linalg import inv

## Generate data

Generate data according to the linear model $Y = 2 - 3X_{1} + 0.5X_{2} + \epsilon$

In [7]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1))
x2 = np.random.normal(-1,1,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

Stack the data into a design matrix and create a pandas DataFrame.

In [8]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])

In [9]:
df.head()

|   | Spending | Intercept | Age | Income |
|---|---|---|---|---|
| 0 | -24.440565 | 1.0 | 8.240716 | -0.748951 |
| 1 | -1.08615 | 1.0 | 0.417885 | -1.925114 |
| 2 | -3.42374 | 1.0 | 2.532255 | 1.238727 |
| 3 | -0.385652 | 1.0 | 0.730117 | -0.971147 |
| 4 | -3.319178 | 1.0 | 1.124163 | -2.241811 |


## OLS estimator

$\hat{\beta} = (X'X)^{-1}X'y$

In [18]:
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[[ 1.91491138]
 [-2.96912964]
 [ 0.46134125]]


In [19]:
inv((X.T)@X)@(X.T)@y

array([[ 1.91491138],
       [-2.96912964],
       [ 0.46134125]])

In [20]:
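# np.matrix overloads * as matrix multiplication and .I as the inverse;
# NumPy no longer recommends this class, so plain arrays with @ (as above) are preferable
X=np.asmatrix(X)
y=np.asmatrix(y)
((((X.T)*X).I)*X.T)*y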

matrix([[ 1.91491138],
        [-2.96912964],
        [ 0.46134125]])
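The three computations above all form the explicit inverse of $X'X$. As a minimal sketch (not part of the original notebook), the same estimate can be obtained without inverting, either by solving the normal equations with `np.linalg.solve` or by calling `np.linalg.lstsq` on `X` directly. The seed and the `rng` generator are assumptions added only so the example is reproducible; the data-generating process is the one from the first section.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n = 1000
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))
e = rng.normal(0, np.sqrt(2), (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

# Explicit inverse, as in the cells above
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the normal equations (X'X) b = X'y without forming the inverse
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X directly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_inv, b_solve), np.allclose(b_inv, b_lstsq))
```

On well-conditioned data like this the three results agree to floating-point precision; the solver-based forms are generally the safer choice when $X'X$ is close to singular.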

Compare with the estimates from the statsmodels OLS routine.

In [21]:
result = sm.ols('Spending ~ Age + Income',data=df).fit()

In [22]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.943
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     8276.
Date:                Wed, 29 Nov 2023   Prob (F-statistic):               0.00
Time:                        08:43:45   Log-Likelihood:                -1791.9
No. Observations:                1000   AIC:                             3590.
Df Residuals:                     997   BIC:                             3604.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.9149      0.082     23.393      0.0

## Multicollinearity

When the regressors are highly collinear, the OLS estimates become very imprecise: each new sample can move the coefficients by large amounts.

In [24]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1))
x2 = 3*x1 - 2 + np.random.normal(0,0.0001,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1) #stack data in a data matrix
b = inv((X.T)@X)@(X.T)@y
print(b)
#run this block multiple times

[[ 216.46730624]
 [-324.85183457]
 [ 107.79066331]]
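Rather than re-running the block by hand, the spread of the estimates can be measured with a small Monte Carlo loop. The sketch below is an illustration under assumptions: the helper `draw_ols`, the seed, and the choice of 200 repetitions are hypothetical and not from the original notebook. It compares the standard deviation of the coefficient estimates under the independent design from the first section and under the collinear design used here.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n, reps = 1000, 200             # reps is an arbitrary illustrative choice

def draw_ols(collinear):
    """Draw one sample and return the OLS estimates (hypothetical helper)."""
    x1 = rng.exponential(2, (n, 1))
    if collinear:
        x2 = 3 * x1 - 2 + rng.normal(0, 0.0001, (n, 1))  # nearly exact linear function of x1
    else:
        x2 = rng.normal(-1, 1, (n, 1))                   # independent of x1
    e = rng.normal(0, np.sqrt(2), (n, 1))
    y = 2 - 3 * x1 + 0.5 * x2 + e
    X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b.ravel()

# Standard deviation of each coefficient across repeated samples
sd_indep = np.std([draw_ols(False) for _ in range(reps)], axis=0)
sd_coll = np.std([draw_ols(True) for _ in range(reps)], axis=0)

print("independent design:", sd_indep)
print("collinear design:  ", sd_coll)
```

The sampling spread of the coefficients under the collinear design should come out several orders of magnitude larger, which is what the single erratic draw above hints at.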


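A common rule of thumb treats a condition number above 30 as a sign of significant multicollinearity. That threshold is usually stated for the column-scaled regressor matrix X rather than for X'X, whose condition number is the square of that of X; the value computed below is far beyond the threshold either way.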

In [27]:
np.linalg.cond((X.T)@X)

79292753073.28932
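To put this number in context, the following sketch compares the condition numbers of `X` and of `X'X` for an independent design and for the collinear one. The variable names `X_indep` and `X_coll` and the seed are assumptions made for illustration. In the 2-norm, the condition number of `X'X` is the square of the condition number of `X`, so it matters which of the two a rule of thumb refers to.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n = 1000
ones = np.ones((n, 1))
x1 = rng.exponential(2, (n, 1))
x2_indep = rng.normal(-1, 1, (n, 1))                  # independent of x1
x2_coll = 3 * x1 - 2 + rng.normal(0, 0.0001, (n, 1))  # almost a linear function of x1

X_indep = np.concatenate((ones, x1, x2_indep), axis=1)
X_coll = np.concatenate((ones, x1, x2_coll), axis=1)

for name, X in (("independent", X_indep), ("collinear", X_coll)):
    cond_X = np.linalg.cond(X)          # ratio of largest to smallest singular value of X
    cond_XtX = np.linalg.cond(X.T @ X)  # equals cond_X squared, up to rounding
    print(f"{name}: cond(X) = {cond_X:.3g}, cond(X'X) = {cond_XtX:.3g}")
```

The columns here are not rescaled, so the values are only a rough diagnostic; they still separate the two designs clearly.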

In [26]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])
result = sm.ols('Spending ~ Age + Income',data=df).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     2225.
Date:                Wed, 29 Nov 2023   Prob (F-statistic):               0.00
Time:                        08:49:36   Log-Likelihood:                -1741.2
No. Observations:                1000   AIC:                             3488.
Df Residuals:                     997   BIC:                             3503.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    216.4654    861.401      0.251      0.8

## Omitting a variable

Suppose we omit x2 (estimating a misspecified model).

In [None]:
X = np.concatenate((np.ones((n,1)),x1),axis=1)

In [None]:
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[[ 0.93606163]
 [-1.5042855 ]]


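If x1 and x2 are independent, as in the first data-generating process, omitting x2 still gives a consistent estimate of the coefficient on x1; only the intercept changes, absorbing $0.5\,E[X_{2}]$. The output above does not show this, because x1, x2 and y still hold the collinear draws from the multicollinearity section, where $X_{2} \approx 3X_{1} - 2$. Substituting gives $Y \approx 1 - 1.5X_{1} + \epsilon$, which matches the printed estimates. Regenerate the data from the first section before running these cells to see the consistent case. Now assume a data-generating process where x1 and x2 are correlated by construction.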
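Before moving on, the independent case can be checked on its own. The sketch below redraws x1 and x2 independently, as in the first section, and fits the short regression that omits x2. The seed, the larger sample size, and the name `X_short` are assumptions chosen to make the limit easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n = 100_000                     # larger than the notebook's n, to show the limit clearly
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))  # independent of x1, with mean -1
e = rng.normal(0, np.sqrt(2), (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e

# Short regression: intercept and x1 only, x2 omitted
X_short = np.concatenate((np.ones((n, 1)), x1), axis=1)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

# The slope should be close to -3; the intercept absorbs 0.5 * E[x2] = -0.5,
# so it should be close to 1.5 rather than 2
print(b_short.ravel())
```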


In [None]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,)) #error term
sigma = np.array([[1,1.5],[1.5,3]])
x = np.random.multivariate_normal(np.array([1,2]),sigma,(n,))
np.shape(x)
y = 2 - 3*x[:,0] + 0.5*x[:,1] + e

In [None]:
#ols estimator for the correct specification
X = np.concatenate((np.ones((n,1)),x),axis=1) #stack data in a data matrix
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[ 1.94584289 -2.92398208  0.49657136]


In [None]:
#ols estimator omitting x2
X = np.concatenate((np.ones((n,1)),x[:,0].reshape(-1,1)),axis=1) #stack data in a data matrix
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[ 2.18939954 -2.19825046]
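The size of the bias can be predicted from the standard omitted-variable formula: the short-regression slope converges to $\beta_{1} + \beta_{2}\,\mathrm{Cov}(X_{1},X_{2})/\mathrm{Var}(X_{1})$. With the covariance matrix used above this is $-3 + 0.5 \cdot 1.5 = -2.25$, and the intercept converges to $2 + 0.5\,(2 - 1.5 \cdot 1) = 2.25$. The sketch below checks this numerically; the seed, the larger sample size, and the names `delta`, `slope_pred` and `intercept_pred` are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n = 100_000                     # larger than the notebook's n, to reduce sampling noise
mean = np.array([1, 2])
sigma = np.array([[1, 1.5], [1.5, 3]])
x = rng.multivariate_normal(mean, sigma, n)
e = rng.normal(0, np.sqrt(2), n)
y = 2 - 3 * x[:, 0] + 0.5 * x[:, 1] + e

# Short regression omitting x2
X_short = np.column_stack((np.ones(n), x[:, 0]))
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

# Population prediction from the omitted-variable formula
delta = sigma[0, 1] / sigma[0, 0]                    # slope of x2 on x1
slope_pred = -3 + 0.5 * delta                        # -2.25
intercept_pred = 2 + 0.5 * (mean[1] - delta * mean[0])  # 2.25

print("estimated:", b_short)
print("predicted:", intercept_pred, slope_pred)
```

The estimates printed above from the notebook's own sample of 1000 are close to these predicted limits.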
