Data set available at [this link](https://economistsview.typepad.com/economics421/files/Data-for-Problem-3.xls)

Implementation inspired by the solution presented at [this link](https://economistsview.typepad.com/economics421/2012/02/solution-to-homework-3.html)

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_excel('../Lecture04/Data-for-Problem-3.xls', usecols=[1,2])

In [3]:
data.head()

Unnamed: 0,SALARY,YEARS
0,105.2,36
1,91.3,30
2,72.5,29
3,74.3,28
4,103.5,26


Linear model:

$$
\ln \textrm{ Salary} = \beta_1 + \beta_2 \textrm{ years} + \beta_3 \textrm{ years}^2 + u_i
$$

In [4]:
data['LOG_SALARY'] = np.log(data['SALARY'])

In [5]:
data['YEARS_SQ'] = data['YEARS'].pow(2)

In [6]:
data.head()

Unnamed: 0,SALARY,YEARS,LOG_SALARY,YEARS_SQ
0,105.2,36,4.655863,1296
1,91.3,30,4.514151,900
2,72.5,29,4.283587,841
3,74.3,28,4.308111,784
4,103.5,26,4.639572,676


___

1. Regress $y = \ln \textrm{ Salary}$ on a constant, $\textrm{ years}$ and $\textrm{ years}^2$:

In [7]:
from sklearn.linear_model import LinearRegression

In [8]:
linreg = LinearRegression()

In [9]:
X = data[['YEARS','YEARS_SQ']]
y = data['LOG_SALARY']

In [10]:
linreg.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [11]:
print(linreg.intercept_, linreg.coef_)

3.8093653545744024 [ 0.04385284 -0.00062735]


In [12]:
linreg.score(X,y)

0.5361786383088458

2. Calculate the residuals (estimated errors): $\hat u_i = y_i - (\hat\beta_1 + \hat\beta_2 \textrm{ years } + \hat\beta_3 \textrm{ years}^2)$ :

In [13]:
y_hat = linreg.predict(X)

In [14]:
u_hat = y - y_hat
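
The residual formula in step 2 can be checked by hand. The sketch below is only an illustration: it uses synthetic data (the arrays `years` and `log_salary` are hypothetical stand-ins, not the homework data) and solves the least-squares problem with NumPy instead of scikit-learn.

```python
import numpy as np

# Hypothetical synthetic data standing in for the YEARS / LOG_SALARY columns.
rng = np.random.default_rng(0)
years = rng.integers(1, 40, size=200).astype(float)
log_salary = 3.8 + 0.04 * years - 0.0006 * years**2 + rng.normal(0, 0.2, size=200)

# Design matrix with an explicit constant column: [1, years, years^2].
X = np.column_stack([np.ones_like(years), years, years**2])

# OLS estimates (beta_1, beta_2, beta_3) from a least-squares solve.
beta_hat, *_ = np.linalg.lstsq(X, log_salary, rcond=None)

# Residuals: every fitted term is subtracted from y.
u_hat = log_salary - (beta_hat[0] + beta_hat[1] * years + beta_hat[2] * years**2)

# Because the model contains a constant, the OLS residuals sum to (numerically) zero.
print(beta_hat, u_hat.sum())
```

This is the same quantity as `y - linreg.predict(X)`; writing it out term by term makes the signs explicit.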

Variance model:

$$
(b) \quad \sigma_i = \alpha_1 + \alpha_2 z_2 + \alpha_3 z_3 + ... + \alpha_p z_p
$$

3. Regress $|\hat u_i|$ on a constant and $z_2, ..., z_p$ (here the $z$ variables are $\textrm{ years}$ and $\textrm{ years}^2$):

In [15]:
abs_u_hat = u_hat.abs()

In [16]:
varreg = LinearRegression()

In [17]:
Xvar = X.copy()
yvar = abs_u_hat.copy()

In [18]:
varreg.fit(Xvar, yvar)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [19]:
print(varreg.intercept_, varreg.coef_)

0.031202477190820144 [ 0.01439226 -0.00029753]


4. Use the fitted values of this regression as the estimated standard deviations $\hat\sigma_i$:

In [20]:
sigma_hat = varreg.predict(Xvar)
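
Because $\hat\sigma_i$ comes from a linear regression, nothing forces the fitted values to be positive, and the next step divides by them. A minimal guard might look like the sketch below. The helper names `check_sigma_hat` and `fitted_sigma` are hypothetical, and the coefficients are the rounded values printed in cell 19.

```python
import numpy as np

def check_sigma_hat(sigma_hat, floor=1e-8):
    """Hypothetical helper: raise if any fitted standard deviation is <= floor."""
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    bad = sigma_hat <= floor
    if bad.any():
        raise ValueError(f"{bad.sum()} fitted sigma value(s) are <= {floor}")
    return sigma_hat

def fitted_sigma(years):
    """Fitted line from the |u_hat| regression, using the rounded printed coefficients."""
    years = np.asarray(years, dtype=float)
    return 0.031202 + 0.01439226 * years - 0.00029753 * years**2

# The fitted values stay positive for experience between 0 and 45 years.
sigma_ok = check_sigma_hat(fitted_sigma(np.arange(0, 46)))
print(sigma_ok.min())
```

With these coefficients the fitted quadratic turns negative a little above 50 years, so the check would raise for such an observation instead of silently producing a negative weight.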

5. Divide every term of the original model, including the constant, by $\hat\sigma_i$, then run OLS on the transformed variables without an intercept:

In [21]:
data['LOG_SALARY_STAR'] = data['LOG_SALARY'].divide(sigma_hat)
data['C_STAR'] = np.reciprocal(sigma_hat)
data['YEARS_STAR'] = data['YEARS'].divide(sigma_hat)
data['YEARS_SQ_STAR'] = data['YEARS_SQ'].divide(sigma_hat)

In [22]:
data.head()

Unnamed: 0,SALARY,YEARS,LOG_SALARY,YEARS_SQ,LOG_SALARY_STAR,C_STAR,YEARS_STAR,YEARS_SQ_STAR
0,105.2,36,4.655863,1296,28.43603,6.107574,219.872667,7915.416006
1,91.3,30,4.514151,900,23.126061,5.123015,153.690443,4610.713276
2,72.5,29,4.283587,841,21.595089,5.041357,146.199351,4239.78119
3,74.3,28,4.308111,784,21.441285,4.976957,139.354807,3901.934593
4,103.5,26,4.639572,676,22.712472,4.895381,127.279915,3309.277795


In [23]:
fgls = LinearRegression(fit_intercept=False)

In [24]:
Xstar = data[['C_STAR','YEARS_STAR','YEARS_SQ_STAR']]
ystar = data['LOG_SALARY_STAR']

In [25]:
fgls.fit(Xstar, ystar)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
         normalize=False)

In [26]:
print(fgls.coef_)

[ 3.84155286e+00  3.65552097e-02 -4.06817430e-04]
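
Dividing each variable by $\hat\sigma_i$ and running OLS without an intercept is algebraically the same as weighted least squares with weights $1/\hat\sigma_i^2$. The sketch below illustrates the equivalence on synthetic data. All values are hypothetical, and a known `sigma` stands in for the estimated $\hat\sigma_i$.

```python
import numpy as np

# Hypothetical heteroskedastic data: the error std dev grows with years.
rng = np.random.default_rng(1)
n = 300
years = rng.uniform(1, 40, size=n)
X = np.column_stack([np.ones(n), years, years**2])
sigma = 0.05 + 0.01 * years  # strictly positive stand-in for sigma_hat
y = X @ np.array([3.8, 0.04, -0.0006]) + sigma * rng.normal(size=n)

# Route 1: divide every column (constant included) by sigma, then OLS with no intercept.
X_star = X / sigma[:, None]
y_star = y / sigma
beta_divided, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Route 2: weighted least squares with weights 1/sigma^2 via the normal equations.
w = 1.0 / sigma**2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(beta_divided)
print(beta_wls)
```

Both routes give the same coefficients up to floating-point error. In scikit-learn, passing `sample_weight=1 / sigma_hat**2` to `LinearRegression.fit` on the untransformed columns should be an equivalent way to obtain them.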


In [27]:
fgls.score(Xstar,ystar)

0.9911182484066892
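
The score above is computed on the transformed variables, so it is not on the same scale as the $R^2$ of the original regression (0.536). One way to compare the two fits is to evaluate both sets of coefficients against the untransformed $y$. The sketch below does this on synthetic data. All values are hypothetical, and a known `sigma` stands in for $\hat\sigma_i$.

```python
import numpy as np

def r_squared(y, y_pred):
    """Centered R^2: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical heteroskedastic data.
rng = np.random.default_rng(2)
n = 300
years = rng.uniform(1, 40, size=n)
X = np.column_stack([np.ones(n), years, years**2])
sigma = 0.05 + 0.01 * years  # stand-in for sigma_hat
y = X @ np.array([3.8, 0.04, -0.0006]) + sigma * rng.normal(size=n)

X_star, y_star = X / sigma[:, None], y / sigma
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_fgls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

r2_ols = r_squared(y, X @ beta_ols)                         # original scale
r2_fgls_original = r_squared(y, X @ beta_fgls)              # original scale, comparable to r2_ols
r2_fgls_transformed = r_squared(y_star, X_star @ beta_fgls)  # scale used by score(Xstar, ystar)

print(r2_ols, r2_fgls_original, r2_fgls_transformed)
```

On the original scale the weighted fit can never beat OLS in-sample, since OLS minimizes the unweighted sum of squared residuals. The transformed-scale figure measures something different, because much of the variation in $y/\hat\sigma_i$ comes from $1/\hat\sigma_i$ itself.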