
In [44]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from numpy.linalg import inv

## Generate data

Generate data according to the linear model $y_{i} = 2 - 3x_{i1} + 0.5x_{i2} + \epsilon_{i}$, where $\epsilon_{i}$ is normal with mean 0 and variance 2.

In [100]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1)) 
x2 = np.random.normal(-1,1,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

Stack the data into a design matrix (including a column of ones for the intercept) and create a pandas DataFrame.

In [101]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])

In [35]:
df.head()

| | Spending | Intercept | Age | Income |
|---|---|---|---|---|
| 0 | -13.086769 | 1.0 | 5.068038 | -3.541896 |
| 1 | 0.426572 | 1.0 | 0.467618 | -0.607608 |
| 2 | -1.522705 | 1.0 | 1.464857 | -0.649126 |
| 3 | -2.166404 | 1.0 | 1.104383 | -1.131389 |
| 4 | 2.872789 | 1.0 | 0.017304 | -0.671162 |


## OLS estimator

$\hat{\beta} = (X'X)^{-1}X'y$

In [58]:
b = inv(X.T @ X) @ X.T @ y  # closed-form OLS: (X'X)^{-1} X'y
print(b)

[[ 1.50029657]
 [-3.00075668]]
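
The closed-form expression above inverts $X'X$ explicitly. As a minimal sketch of an alternative, the same estimate can be obtained by solving the normal equations with `np.linalg.solve`, or by calling `np.linalg.lstsq` on $X$ directly, which tends to behave better when $X'X$ is close to singular. The seed, the `rng` generator, and the names `b_solve` and `b_lstsq` are assumptions made for this illustration and are not part of the original notebook.

```python
import numpy as np

# Seed fixed only so the sketch is reproducible (an assumption, not in the notebook)
rng = np.random.default_rng(0)

n = 1000
e = rng.normal(0, np.sqrt(2), (n, 1))   # error term with variance 2
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

# Solve the normal equations (X'X) b = X'y without forming the inverse
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X itself rather than on X'X
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_solve.ravel())
print(b_lstsq.ravel())
```

Both vectors should agree with each other up to floating-point error and lie close to the true coefficients (2, -3, 0.5).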


Compare with the estimates from the OLS routine in statsmodels.

In [51]:
result = sm.ols('Spending ~ Age + Income',data=df).fit()

In [54]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                 1.116e+04
Date:                Wed, 17 Nov 2021   Prob (F-statistic):               0.00
Time:                        08:28:28   Log-Likelihood:                -1736.9
No. Observations:                1000   AIC:                             3480.
Df Residuals:                     997   BIC:                             3494.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0308      0.073     27.763      0.0

## Multicollinearity

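When the regressors are highly collinear, the OLS estimates become very imprecise: their sampling variance is large, so they can land far from the true coefficients and change drastically from one sample to the next.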

In [133]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1)) 
x2 = 3*x1 - 2 + np.random.normal(0,0.0001,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1) #stack data in a data matrix
b = inv(X.T @ X) @ X.T @ y  # closed-form OLS: (X'X)^{-1} X'y
print(b)
# re-run this block several times to see how much the estimates vary across samples

[[ 482.81302311]
 [-724.28424139]
 [ 240.93513062]]
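
Instead of re-running the block by hand, the sampling variability can be summarized with a small simulation. The sketch below repeats the two data-generating processes used so far (independent regressors and nearly collinear regressors) and compares the standard deviation of the estimates across replications. The helper names `ols` and `draw`, the seed, and the choice of 200 replications are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
n, reps = 1000, 200             # reps is a hypothetical number of replications


def ols(X, y):
    """Least-squares coefficients, returned as a flat array."""
    return np.linalg.lstsq(X, y, rcond=None)[0].ravel()


def draw(collinear):
    """Simulate one sample and return the OLS estimates for it."""
    e = rng.normal(0, np.sqrt(2), (n, 1))
    x1 = rng.exponential(2, (n, 1))
    if collinear:
        # x2 is almost an exact linear function of x1
        x2 = 3 * x1 - 2 + rng.normal(0, 0.0001, (n, 1))
    else:
        x2 = rng.normal(-1, 1, (n, 1))
    y = 2 - 3 * x1 + 0.5 * x2 + e
    X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)
    return ols(X, y)


# Standard deviation of each coefficient estimate across replications
sd_indep = np.std([draw(False) for _ in range(reps)], axis=0)
sd_coll = np.std([draw(True) for _ in range(reps)], axis=0)

print("independent regressors:", sd_indep)
print("collinear regressors:  ", sd_coll)
```

With independent regressors the estimates should vary only slightly around the true values, while in the collinear design the spread should be larger by several orders of magnitude.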


As a rule of thumb, a condition number of X'X above 30 suggests that the regression may suffer from significant multicollinearity.

In [134]:
np.linalg.cond(np.matmul(np.transpose(X),X))

77330048965.87491
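
A related point worth keeping in mind: in the 2-norm, the condition number of $X'X$ is the square of the condition number of $X$, so the two quantities live on very different scales, and references differ on which of them a given threshold is meant for. The sketch below computes both for the collinear design, along with the sample correlation between the two regressors. The seed and the names `cond_X`, `cond_XtX`, and `corr` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
n = 1000
x1 = rng.exponential(2, (n, 1))
x2 = 3 * x1 - 2 + rng.normal(0, 0.0001, (n, 1))  # nearly collinear with x1
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

cond_X = np.linalg.cond(X)          # ratio of largest to smallest singular value of X
cond_XtX = np.linalg.cond(X.T @ X)  # approximately cond_X squared

# Sample correlation between the two regressors
corr = np.corrcoef(x1.ravel(), x2.ravel())[0, 1]

print(cond_X, cond_XtX, cond_X ** 2)
print(corr)
```

Either measure is far above 30 here, and the correlation between the regressors is essentially 1.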

## Omitting a variable

Suppose we omit x2 and estimate the resulting misspecified model, using the first dataset, in which x1 and x2 were drawn independently.

In [102]:
X = np.concatenate((np.ones((n,1)),x1),axis=1)

In [103]:
b = inv(X.T @ X) @ X.T @ y  # closed-form OLS: (X'X)^{-1} X'y
print(b)

[[ 1.35518965]
 [-2.95971221]]


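Surprisingly, we still get a consistent estimate of the coefficient on x1! This is because x1 and x2 are independent, so omitting x2 does not bias the slope. Only the intercept changes: it absorbs $0.5\,E[x_{2}] = -0.5$ and therefore converges to 1.5 rather than 2. Now assume a data-generating process in which x1 and x2 are not independent.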


In [84]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,)) #error term
sigma = np.array([[1,1.5],[1.5,3]])
x = np.random.multivariate_normal(np.array([1,2]),sigma,(n,))
np.shape(x)
y = 2 - 3*x[:,0] + 0.5*x[:,1] + e

In [94]:
#ols estimator for the correct specification
X = np.concatenate((np.ones((n,1)),x),axis=1) #stack data in a data matrix
b = inv(X.T @ X) @ X.T @ y  # closed-form OLS: (X'X)^{-1} X'y
print(b)

[ 2.02532012 -3.01368559  0.4914241 ]


In [98]:
#ols estimator omitting x2
X = np.concatenate((np.ones((n,1)),x[:,0].reshape(-1,1)),axis=1) #stack data in a data matrix
b = inv(X.T @ X) @ X.T @ y  # closed-form OLS: (X'X)^{-1} X'y
print(b)

[ 2.25342743 -2.25596839]
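
The size of this bias matches the standard omitted-variable formula: when x2 is left out, the slope on x1 converges to $\beta_{1} + \beta_{2}\,\mathrm{Cov}(x_{1},x_{2})/\mathrm{Var}(x_{1}) = -3 + 0.5 \times 1.5/1 = -2.25$, and the intercept converges to $E[y] - (-2.25)\,E[x_{1}] = 0 + 2.25 = 2.25$. The sketch below compares these population values with a simulated estimate. The seed and the names `slope_limit`, `intercept_limit`, `X_short`, and `b_short` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
n = 1000
mean = np.array([1, 2])
sigma = np.array([[1, 1.5], [1.5, 3]])  # Cov(x1, x2) = 1.5, Var(x1) = 1
x = rng.multivariate_normal(mean, sigma, n)
e = rng.normal(0, np.sqrt(2), n)
y = 2 - 3 * x[:, 0] + 0.5 * x[:, 1] + e

# Population limits implied by the omitted-variable bias formula
slope_limit = -3 + 0.5 * sigma[0, 1] / sigma[0, 0]
mean_y = 2 - 3 * mean[0] + 0.5 * mean[1]
intercept_limit = mean_y - slope_limit * mean[0]

# Short regression: intercept and x1 only, x2 omitted
X_short = np.column_stack((np.ones(n), x[:, 0]))
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print("predicted limits:", intercept_limit, slope_limit)
print("estimates:       ", b_short)
```

The same formula covers the earlier independent case: there $\mathrm{Cov}(x_{1},x_{2}) = 0$, so the slope is unaffected by the omission.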
