# Weighted Least Squares (WLS) Regression
In this exercise we will study the impact of heteroskedasticity (non-constant variance) of the errors on an Ordinary Least Squares (OLS) regression, and one possible approach to correct for it.<br>
For this exercise, the dataset is made available as 'wls_data.csv' in the same directory as this notebook. The data file contains two observed variables: 'x' and 'y'.<br>
The goal of this exercise is to estimate the regression coefficients for a single-variable linear model between y and x with independent errors: <br>
$y_i = \beta_0 + \beta_1 x_i + e_i$, where <br>
$e_i$ is independent, Normal noise with $E[e_i | x_i] = 0$.

In [1]:
# Import the necessary libraries
%matplotlib inline

from matplotlib import pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy import stats
import csv

# Path to the data file
filen = 'wls_data.csv'

# Read the columns 'x' and 'y' into lists of floats
x = []
y = []
with open(filen) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        x.append(float(row['x']))
        y.append(float(row['y']))

N = len(x)

<b>1. OLS Regression</b><br>
We will first build an OLS regression model. Note that we expect the estimated coefficients, $\hat{\beta}_0, \hat{\beta}_1$, to be unbiased estimators of $\beta_0, \beta_1$, due to the assumed properties of the errors. However, note that we have <b>not</b> assumed homoskedasticity (constant variance) of the noise terms.

(1a). Run an OLS regression (using the <b>OLS</b> class in statsmodels) with: <br>
y as the dependent variable, <br>
x as the independent variable (with intercept). <br>
You should add a constant to x using the <b>add_constant()</b> function in statsmodels.<br>
Save the model as <b>ols_model</b>.

(1b). Fit the model using <b>ols_model.fit()</b> and save the result as <b>ols_results</b>.

(1c). Print the summary by calling <b>ols_results.summary()</b>. Comment on the statistical significance of the parameters.

(Type your comments on the statistical significance of the coefficients here)

(1d). Produce the in-sample predictions, and store them as <b>ols_y_hat</b>. Use the <b>predict()</b> function with the parameters from <b>ols_results</b>. <br>
Check: http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.predict.html

(1e). Compute the residuals. Store them as <b>ols_residuals</b>.

(1f). On the same plot, produce the following:
1. Scatter plot of the data (x, y) <br>
2. Plot (or scatter plot) of (x, ols_y_hat) <br>
Label your plot and axes.

(1g). Comment on the fit produced.

(Type your comments here)

(1h). Produce a scatter plot of the residuals vs 'x'. Label your plot and axes.

(1i). What does this plot show? Is there evidence of constant variance (or otherwise)?

(Type your answer here)

(1j). Now produce a QQ-plot of the residuals vs the Normal distribution. You can use the <b>probplot()</b> function in the <b>scipy.stats</b> module with default options.
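As a rough sketch of what `probplot()` returns, the snippet below applies it to simulated Normal draws (an assumption standing in for the residuals). Without a `plot` argument it only computes the quantile pairs and a least-squares line through them; passing `plot=plt` draws the QQ-plot as well.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for a residual vector (assumption)
rng = np.random.default_rng(1)
resid_demo = rng.normal(size=300)

# osm: theoretical Normal quantiles, osr: ordered sample values
# slope, intercept, r: least-squares line through the QQ points and its correlation
(osm, osr), (slope, intercept, r) = stats.probplot(resid_demo, dist="norm")

print(r)  # values close to 1 indicate the points lie near a straight line
```

Systematic departures from the line in the tails are what to look for when judging Normality.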

(1k). Do the residuals appear Normally distributed?

(Type your answer here)

<b>2. Weighted Least Squares (WLS)</b><br>
Given the apparent heteroskedasticity, we will look to Weighted Least Squares to produce estimates of the parameters and the standard errors. Refer to the theory problem set for a detailed theoretical development of WLS estimation. In this case, we are not provided information about the variance of each data point. Therefore, we face the harder problem of first having to estimate appropriate weights to eliminate the heteroskedasticity.

(2a). Write a function <b>estW()</b> which takes as input the following: <br><br>
1. An array of OLS residuals <br>
2. An array of 'x' values (corresponding to each residual) <br>

This function should return an array of variance estimates, one for each data point $i$. The weights are computed from these estimates in a later step. <br>

Details:<br>
<b>estW(residuals, x)</b> should do the following: <br>
Regress (OLS) residuals^2 against x^2 and a constant. <br>
Using the regression parameters, produce estimates of the residuals^2. Return these estimates of the residuals^2. These estimates are the estimates of the variance (true values of which are unknown) for each data point.

In [2]:
def estW(residuals, x):
    # complete the rest of the function
    # returns estimates of variance for each data point
    raise NotImplementedError("estW is left as an exercise")

(2b). Use the <b>estW(ols_residuals, x)</b> function to produce the estimates of the variances. Save these estimates as <b>sig_est</b>

(2c). On the same plot, produce the following in different colors: <br>
1. Scatter Plot of the absolute values of ols_residuals against x. <br>
2. Plot (or scatterplot) of the square root of the variance estimates, i.e. each $\sqrt{\text{sig_est}_i}$ against $x_i$. <br>
Label the plots and axes clearly.

(2d). What does the plot above show? Do you think the variance estimates make sense?

(Type your answer here)

(2e). Now we will use the estimates of the variances to determine the weights to be used for WLS. <br>
Compute the weights for each point as the reciprocal of the <b>square-root of variance estimates</b>.<br>
Store the weights as <b>weights</b>.

(2e). Produce the following variables: <br>
i. <b>X</b> = independent variable (x) and a constant (use add_constant(x) from the statsmodels library). <br>
ii. <b>W</b> = a diagonal matrix of weights using the <b>weights</b> array.<br>
iii. <b>X_w</b> = the product of W and X. <br>
iv. <b>y_w</b> = the product of W and y. <br>

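(2f). Run an OLS regression (using the <b>OLS</b> class in statsmodels) with <b>y_w</b> as the dependent variable and <b>X_w</b> as the regressors. Do not add another constant: <b>X_w</b> already contains the weighted constant column. Save your model as <b>wls_model</b>.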

(2e). Similar to the OLS exercise above, produce the fit, in-sample predictions and residuals. <br>
Save them as <b>wls_results, wls_y_hat, wls_residuals</b>, respectively.

(2f). Print the summary of this regression. Comment on the statistical significance of the parameters and contrast them to the simple OLS model earlier.

(Type your comments here)

(2g). Show the scatter plot of these <b>wls_residuals</b> vs x. Is there continued evidence of heteroskedasticity?

(Type your answer here)

(2h). Produce the QQ-plot of the <b>wls_residuals</b> vs the Normal distribution.

(2i). Compared to the OLS residuals, is the Normal fit better? If so, how?

(Type your answer here)

(2j). Transform the WLS predictions (<b>wls_y_hat</b>) to the same scale as the original variable (<b>y</b>).<br>
Save the transformation as <b>wls_y_hat_transformed</b>. <br>
Hint: wls_y_hat_transformed $ = W^{-1}(\text{wls_y_hat})$

(2k). On the same plot, produce in different colors: <br>
1. Scatter plot of the observed data (x, y) <br>
2. Plot (or scatterplot) of (x, wls_y_hat_transformed) <br>
Label your plots and axes clearly.

(2l). Using the WLS regression summaries and the plots, contrast the regression fit for both the OLS and WLS cases:<br>
i. Are the coefficient estimates much different? Is that what you had expected? <br>
ii. Are the standard errors different? Is that what you had expected?