
In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from numpy.linalg import inv



## Generate data

Generate data according to the linear model $y_{i} = 2 - 3x_{i1} + 0.5x_{i2} + \epsilon_{i}$, where the error term $\epsilon_{i}$ is normal with mean 0 and variance 2.

In [None]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1)) 
x2 = np.random.normal(-1,1,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e

Stack the data into a design matrix (with a column of ones for the intercept) and create a pandas DataFrame.

In [None]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])

In [None]:
df.head()

| | Spending | Intercept | Age | Income |
|---|---|---|---|---|
| 0 | -2.177458 | 1.0 | 1.28278 | -0.427104 |
| 1 | -1.029388 | 1.0 | 0.815241 | -0.002973 |
| 2 | -9.342586 | 1.0 | 4.293476 | -0.50805 |
| 3 | -5.305171 | 1.0 | 2.29301 | -0.608761 |
| 4 | -3.403129 | 1.0 | 1.220889 | -1.796661 |


## OLS estimator

$\hat{\beta} = (X'X)^{-1}X'y$

In [None]:
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[[ 2.07943462]
 [-3.01277273]
 [ 0.51327929]]


In [None]:
# Same estimator written with the @ operator (NumPy no longer recommends the np.matrix class)
inv(X.T @ X) @ X.T @ y
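The formula above inverts $X'X$ explicitly. As a hedged aside, the same estimate can usually be obtained without forming the inverse, which tends to be numerically safer. The sketch below regenerates data from the same model with a fixed seed (the seed and the `b_inv`, `b_solve`, `b_lstsq` names are assumptions for illustration) and compares three routes.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for reproducibility only
n = 1000
x1 = rng.exponential(2, (n, 1))
x2 = rng.normal(-1, 1, (n, 1))
e = rng.normal(0, np.sqrt(2), (n, 1))
y = 2 - 3 * x1 + 0.5 * x2 + e
X = np.concatenate((np.ones((n, 1)), x1, x2), axis=1)

# Textbook formula with an explicit inverse
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the normal equations (X'X) b = X'y without forming the inverse
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_inv, b_solve), np.allclose(b_inv, b_lstsq))
```

On well-conditioned data like this the three agree; the difference matters mainly when the regressors are nearly collinear.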

Compare with the estimates from the statsmodels OLS routine.

In [None]:
result = sm.ols('Spending ~ Age + Income',data=df).fit()

In [None]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.950
Method:                 Least Squares   F-statistic:                     9461.
Date:                Wed, 17 Nov 2021   Prob (F-statistic):               0.00
Time:                        10:40:38   Log-Likelihood:                -1768.7
No. Observations:                1000   AIC:                             3543.
Df Residuals:                     997   BIC:                             3558.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0794      0.078     26.602      0.0

## Multicollinearity

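When the regressors are highly collinear, $X'X$ is close to singular and the OLS estimates become highly imprecise: small changes in the sample produce large swings in the coefficients.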

In [None]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,1)) #error term
x1 = np.random.exponential(2,(n,1)) 
x2 = 3*x1 - 2 + np.random.normal(0,0.0001,(n,1))
y = 2 - 3*x1 + 0.5*x2 + e
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1) #stack data in a data matrix
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)
#run this block multiple times

[[-2157.03970412]
 [ 3235.46914813]
 [-1078.99143743]]
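Instead of re-running the cell by hand, the imprecision can be summarized with a small simulation. This is a minimal sketch, not part of the original notebook: the helper `slope_estimates`, the seed, and the number of replications are assumptions, and it uses `np.linalg.lstsq` rather than the explicit inverse. It compares the spread of the estimated coefficient of x1 when x2 is independent of x1 and when x2 is nearly a linear function of x1.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_estimates(collinear, reps=200, n=1000):
    """Return the OLS estimates of the x1 coefficient over `reps` simulated samples."""
    slopes = np.empty(reps)
    for r in range(reps):
        e = rng.normal(0, np.sqrt(2), n)
        x1 = rng.exponential(2, n)
        if collinear:
            x2 = 3 * x1 - 2 + rng.normal(0, 0.0001, n)   # nearly collinear with x1
        else:
            x2 = rng.normal(-1, 1, n)                    # independent of x1
        y = 2 - 3 * x1 + 0.5 * x2 + e
        X = np.column_stack((np.ones(n), x1, x2))
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes[r] = b[1]
    return slopes

sd_independent = slope_estimates(collinear=False).std()
sd_collinear = slope_estimates(collinear=True).std()
print(sd_independent, sd_collinear)
```

The standard deviation across replications should be several orders of magnitude larger in the collinear design.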


As a rule of thumb, a condition number of $X'X$ above 30 suggests that the regression may suffer from significant multicollinearity.

In [None]:
np.linalg.cond(np.matmul(np.transpose(X),X))

68348614111.67602
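When applying a threshold, it matters which matrix it refers to: in the 2-norm, the condition number of $X'X$ is the square of the condition number of $X$. The sketch below checks this on independent regressors and also computes the condition number after scaling each column of $X$ to unit length. The scaling step is an assumption added for illustration, not something the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.exponential(2, n)
x2 = rng.normal(-1, 1, n)               # independent regressors, as in the first data set
X = np.column_stack((np.ones(n), x1, x2))

cond_X = np.linalg.cond(X)              # ratio of largest to smallest singular value of X
cond_XtX = np.linalg.cond(X.T @ X)      # equals cond_X squared, up to rounding

# Scale each column to unit Euclidean length before computing the condition number
X_scaled = X / np.linalg.norm(X, axis=0)
cond_scaled = np.linalg.cond(X_scaled)

print(cond_X, cond_XtX, cond_scaled)
```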

In [None]:
X = np.concatenate((np.ones((n,1)),x1,x2),axis=1)
df = pd.DataFrame(np.concatenate((y,X),axis=1),columns=['Spending', 'Intercept', 'Age','Income'])
result = sm.ols('Spending ~ Age + Income',data=df).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.779
Model:                            OLS   Adj. R-squared:                  0.779
Method:                 Least Squares   F-statistic:                     1760.
Date:                Wed, 17 Nov 2021   Prob (F-statistic):               0.00
Time:                        10:40:38   Log-Likelihood:                -1780.8
No. Observations:                1000   AIC:                             3568.
Df Residuals:                     997   BIC:                             3582.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2157.0402    909.904     -2.371      0.0

## Omitting a variable

Suppose we omit x2 (estimating a misspecified model).

In [None]:
X = np.concatenate((np.ones((n,1)),x1),axis=1)

In [None]:
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[[ 0.93606163]
 [-1.5042855 ]]


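These estimates do not recover the true coefficient of x1 ($-3$). The reason is that x1 and y still come from the multicollinearity example, where $x_{2} \approx 3x_{1} - 2$; substituting gives $y \approx 1 - 1.5x_{1} + \epsilon$, which is what the regression without x2 picks up. Had x1 and x2 been independent, as in the first data set, omitting x2 would still yield a consistent estimate of the coefficient of x1, with only the intercept shifting. Now assume a data-generating process where x1 and x2 are correlated but not collinear.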
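Before moving on, the independent case can be checked directly. The following is a minimal sketch under assumed settings (fixed seed, a larger sample so the estimates sit close to their limits, and `np.linalg.lstsq` in place of the explicit inverse): x2 is drawn independently of x1 and then omitted from the regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                              # large sample so estimates sit close to their limits
x1 = rng.exponential(2, n)
x2 = rng.normal(-1, 1, n)                # drawn independently of x1
e = rng.normal(0, np.sqrt(2), n)
y = 2 - 3 * x1 + 0.5 * x2 + e

# Regress y on a constant and x1 only, omitting x2
X_short = np.column_stack((np.ones(n), x1))
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(b_short)
```

The slope should be close to $-3$, while the intercept should be close to $2 + 0.5\,E[x_{2}] = 1.5$ rather than 2, because the mean of the omitted variable is absorbed into the constant.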


In [None]:
n = 1000 #sample size, number of observations
e = np.random.normal(0,np.sqrt(2),(n,)) #error term
sigma = np.array([[1,1.5],[1.5,3]])
x = np.random.multivariate_normal(np.array([1,2]),sigma,(n,))
np.shape(x)
y = 2 - 3*x[:,0] + 0.5*x[:,1] + e

In [None]:
#ols estimator for the correct specification
X = np.concatenate((np.ones((n,1)),x),axis=1) #stack data in a data matrix
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[ 1.94584289 -2.92398208  0.49657136]


In [None]:
#ols estimator omitting x2
X = np.concatenate((np.ones((n,1)),x[:,0].reshape(-1,1)),axis=1) #stack data in a data matrix
b = np.matmul(np.matmul(inv(np.matmul(np.transpose(X),X)),np.transpose(X)),y)
print(b)

[ 2.18939954 -2.19825046]
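The estimates from the regression without x2 line up with the standard omitted-variable result: the slope converges to $\beta_{1} + \beta_{2}\,\mathrm{Cov}(x_{1},x_{2})/\mathrm{Var}(x_{1})$. The sketch below evaluates that limit from the covariance matrix and means used to generate the data; the variable names are chosen here for illustration.

```python
import numpy as np

sigma = np.array([[1, 1.5], [1.5, 3]])   # covariance of (x1, x2), as in the data-generating cell
mu = np.array([1, 2])                    # means of (x1, x2)
beta0, beta1, beta2 = 2, -3, 0.5         # true coefficients

delta = sigma[0, 1] / sigma[0, 0]        # population slope from regressing x2 on x1
slope_limit = beta1 + beta2 * delta
intercept_limit = beta0 + beta2 * (mu[1] - delta * mu[0])
print(intercept_limit, slope_limit)
```

With these values the limits are 2.25 for the intercept and $-2.25$ for the slope, which is consistent with the estimates printed above.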
