This notebook refers to Homework 3, Winter 2012, which can be found [here](https://economistsview.typepad.com/economics421/).

"Using [this data set](https://economistsview.typepad.com/economics421/files/Data-for-Problem-3.xls), repeat the example from class for the first of the three cases we discussed, *i.e.* first regress the log of salary on a constant and the two variables proxying for experience, years and years squared:"

$$
\ln \textrm{salary} = \beta_0 + \beta_1 \textrm{years} + \beta_2 \textrm{years}^2 + u_t
$$

"Then, form the estimated residual squared and perform the LM test for heteroskedasticity."

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_excel('Data-for-Problem-3.xls', usecols=[1,2])

In [3]:
data.head()

| | SALARY | YEARS |
|---|---|---|
| 0 | 105.2 | 36 |
| 1 | 91.3 | 30 |
| 2 | 72.5 | 29 |
| 3 | 74.3 | 28 |
| 4 | 103.5 | 26 |


**Forming the variables of the original model**:

In [4]:
data['LOG_SALARY'] = np.log(data['SALARY'])

In [5]:
data['YEARS_SQ'] = data['YEARS'].pow(2)

**Regressing the original model**:

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
lin_reg = LinearRegression()

In [8]:
X = data[['YEARS','YEARS_SQ']]
y = data['LOG_SALARY']

In [9]:
lin_reg.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
print(lin_reg.intercept_, lin_reg.coef_)

3.8093653545744024 [ 0.04385284 -0.00062735]


**Calculating the residuals**:

In [11]:
y_hat = lin_reg.predict(X)

In [12]:
u_hat = y - y_hat

**Performing LM test for heteroskedasticity**:

Variance model: $\sigma^2_i = \alpha_1 + \alpha_2 \textrm{years} + \alpha_3 \textrm{years}^2$ (Breusch-Pagan model).
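
Because $\sigma^2_i$ is not observed, the test is usually carried out through an auxiliary regression in which the squared OLS residual stands in for it. A sketch of that regression follows; the error term $v_i$ is notation introduced here and does not appear in the homework text:

$$
\hat{u}_i^2 = \alpha_1 + \alpha_2 \textrm{years}_i + \alpha_3 \textrm{years}_i^2 + v_i
$$

Under homoskedasticity, $\alpha_2 = \alpha_3 = 0$, so the regressors should explain almost none of the variation in $\hat{u}_i^2$.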

In [13]:
u_hat_sq = u_hat.pow(2)

In [14]:
lm_reg = LinearRegression()

In [15]:
lm_reg.fit(X,u_hat_sq)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

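Calculating the test statistic $\textrm{LM} = N R^2$, where $N$ is the number of observations and $R^2$ comes from the regression of the squared residuals on years and years squared: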

In [16]:
n, p_minus_one = X.shape  # n observations; p - 1 = 2 slope terms in the variance model

In [17]:
r_squared = lm_reg.score(X, u_hat_sq)

In [18]:
lm_stat = n * r_squared

In [19]:
print(lm_stat)

16.586551368622093
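
The two-step calculation can also be collected into one function. The following is a minimal sketch using only NumPy, not the notebook's own code: the function name `breusch_pagan_lm` is hypothetical, and the data are synthetic and made up for illustration, so the printed statistic will not match the homework result.

```python
import numpy as np


def breusch_pagan_lm(X, y):
    """Return (LM, df) for the N * R^2 form of the test.

    X holds the regressors without a constant column; one is added here.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])  # design matrix with intercept

    # Step 1: OLS of y on Z, then squared residuals.
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    u_sq = (y - Z @ beta) ** 2

    # Step 2: auxiliary OLS of the squared residuals on the same regressors.
    gamma, *_ = np.linalg.lstsq(Z, u_sq, rcond=None)
    ss_res = np.sum((u_sq - Z @ gamma) ** 2)
    ss_tot = np.sum((u_sq - u_sq.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    return n * r_squared, X.shape[1]


# Synthetic data (assumption, not the homework file): error spread grows with years.
rng = np.random.default_rng(0)
years = rng.uniform(1, 40, size=500)
noise = 0.01 * years * rng.normal(size=500)
log_salary = 3.8 + 0.04 * years - 0.0006 * years**2 + noise

lm, df = breusch_pagan_lm(np.column_stack([years, years**2]), log_salary)
print(lm, df)
```

The $R^2$ here is computed as one minus the ratio of residual to total sum of squares, which is the same definition `LinearRegression.score` uses.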


Obtaining the 5% critical value of the $\chi^2$ distribution with $p-1 = 2$ degrees of freedom, where $p = 3$ is the number of coefficients in the variance model:

In [20]:
from scipy.stats import chi2

In [21]:
alpha = 0.05
df = p_minus_one

In [22]:
chi2_crit = chi2.ppf(1-alpha, df)

In [23]:
print(chi2_crit)

5.991464547107979
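
As a sanity check on this number: with exactly 2 degrees of freedom, the $\chi^2$ distribution has the closed-form upper tail $P(\chi^2_2 > x) = e^{-x/2}$, so both the critical value and a p-value can be computed by hand. The sketch below uses only the standard library; the variable names `chi2_crit_df2` and `p_value` are introduced here, and the shortcut applies only to the 2-degree-of-freedom case.

```python
import math

alpha = 0.05
lm_stat = 16.586551368622093  # LM statistic printed earlier in this notebook

# For 2 degrees of freedom only: P(chi2 > x) = exp(-x / 2).
chi2_crit_df2 = -2.0 * math.log(alpha)  # solves exp(-x / 2) = alpha
p_value = math.exp(-lm_stat / 2.0)      # upper-tail probability of the LM statistic

print(chi2_crit_df2)
print(p_value)
```

For other degrees of freedom there is no such shortcut; `chi2.sf(lm_stat, df)` from `scipy.stats` returns the same upper-tail probability in general.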


Testing the null hypothesis:

$$
\begin{aligned}
H_0 & : \alpha_2 = \alpha_3 = 0 \\
H_1 & : \alpha_2 \neq 0 \vee \alpha_3 \neq 0
\end{aligned}
$$

In [24]:
if lm_stat > chi2_crit:
    print('NULL HYP. REJECTED! -> There is heteroskedasticity.')
else:
    print('Failed to reject the null hypothesis.')

NULL HYP. REJECTED! -> There is heteroskedasticity.
