**PART 2 : HETEROSKEDASTICITY** 

In [38]:
#Importing libraries
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats
from scipy.stats import f, t
from statsmodels.stats.diagnostic import het_white
from statsmodels.tsa.stattools import grangercausalitytests

sns.set_style('dark')

**Question 20 : Explain the problem of heteroskedasticity with an example of the course.**

**Question 21 : Suppose that $E(\mathbf{u}\mathbf{u}') = \sigma^2 \Omega$. Show that the GLS estimator is the best linear unbiased estimator.**

The covariance matrix of the errors $\mathbf{u}$ is $\sigma^2 \Omega$, where $\sigma^2$ is a scalar variance parameter and $\Omega$ is a known positive definite matrix.
The GLS estimator is obtained as follows :

$$\hat{\beta}_{\text{GLS}} = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} \mathbf{y}$$

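1. Linearity : The GLS estimator is linear in $\mathbf{y}$, since $\hat{\beta}_{\text{GLS}} = \mathbf{A}\mathbf{y}$ with $\mathbf{A} = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1}$, a matrix that depends only on $\mathbf{X}$ and $\Omega$ (both treated as non-random).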


2. Unbiased Estimator : The GLS estimator is unbiased if $E(\hat{\beta}_{\text{GLS}}) = \beta. $

Treating $\mathbf{X}$ and $\Omega$ as non-random, we have $ E(\hat{\beta}_{\text{GLS}}) = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} E(\mathbf{y}). $ Since $\mathbf{y} = \mathbf{X} \beta + \mathbf{u}$ and $ E(\mathbf{u}) = 0 $, the expected value of $\mathbf{y}$ is $\mathbf{X} \beta$, therefore :

$ E(\hat{\beta}_{\text{GLS}}) = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} \mathbf{X} \beta. $

Because $(\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} \mathbf{X} = \mathbf{I}$, we get $ E(\hat{\beta}_{\text{GLS}}) = \beta,$ which shows that the GLS estimator is unbiased.


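3. Minimum Variance : The GLS estimator is "best" if its variance-covariance matrix is the smallest, in the positive semi-definite sense, among all linear unbiased estimators. Since $\hat{\beta}_{\text{GLS}} - \beta = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} \mathbf{u}$ and $E(\mathbf{u}\mathbf{u}') = \sigma^2 \Omega$, the variance-covariance matrix of the GLS estimator is given by:

$\text{Var}(\hat{\beta}_{\text{GLS}}) = \sigma^2 (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1}$

To see that this is the minimum, note that because $\Omega$ is positive definite there exists an invertible matrix $\mathbf{P}$ with $\mathbf{P}'\mathbf{P} = \Omega^{-1}$. Premultiplying the model by $\mathbf{P}$ gives $\mathbf{P}\mathbf{y} = \mathbf{P}\mathbf{X}\beta + \mathbf{P}\mathbf{u}$, whose errors satisfy $E(\mathbf{P}\mathbf{u}\mathbf{u}'\mathbf{P}') = \sigma^2 \mathbf{P} \Omega \mathbf{P}' = \sigma^2 \mathbf{I}$. The transformed model therefore satisfies the Gauss-Markov assumptions, so OLS applied to it is the best linear unbiased estimator, and this OLS estimator is exactly $(\mathbf{X}'\mathbf{P}'\mathbf{P}\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}'\mathbf{P}\mathbf{y} = (\mathbf{X}' \Omega^{-1} \mathbf{X})^{-1} \mathbf{X}' \Omega^{-1} \mathbf{y} = \hat{\beta}_{\text{GLS}}$. Because $\mathbf{P}$ is invertible, any estimator that is linear in $\mathbf{y}$ is also linear in $\mathbf{P}\mathbf{y}$, so the GLS estimator has the minimum variance among all linear unbiased estimators under the assumption $E(\mathbf{u}\mathbf{u}') = \sigma^2 \Omega$.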


Therefore, the GLS estimator is the Best Linear Unbiased Estimator under the given assumption.
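The three properties above can be checked numerically. The following is a minimal sketch on simulated data, not on the course dataset: the design matrix, the true coefficients and the diagonal form of $\Omega$ (error variance proportional to `x**2`) are all assumptions made for illustration, and the helper names `gls` and `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical design: a constant and one regressor
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])

# Assumed known Omega: diagonal, error variance proportional to x**2 (sigma^2 = 1)
omega_diag = x ** 2
Omega = np.diag(omega_diag)
Omega_inv = np.diag(1.0 / omega_diag)


def gls(X, y, Omega_inv):
    """GLS estimator (X' Omega^-1 X)^-1 X' Omega^-1 y."""
    XtOi = X.T @ Omega_inv
    return np.linalg.solve(XtOi @ X, XtOi @ y)


def ols(X, y):
    """OLS estimator (X'X)^-1 X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# GLS coincides with OLS on the model premultiplied by P, where P'P = Omega^-1
u = rng.normal(size=n) * np.sqrt(omega_diag)
y = X @ beta + u
P = np.diag(1.0 / np.sqrt(omega_diag))
beta_gls = gls(X, y, Omega_inv)
beta_transformed = ols(P @ X, P @ y)

# Monte Carlo check of unbiasedness
draws_gls = np.array([
    gls(X, X @ beta + rng.normal(size=n) * np.sqrt(omega_diag), Omega_inv)
    for _ in range(2000)
])

# Exact covariance matrices of both estimators under E(uu') = Omega
XtX_inv = np.linalg.inv(X.T @ X)
V_gls = np.linalg.inv(X.T @ Omega_inv @ X)
V_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv
min_eig = np.linalg.eigvalsh(V_ols - V_gls).min()

print("GLS equals OLS on transformed model:", np.allclose(beta_gls, beta_transformed))
print("Monte Carlo mean of GLS estimates:", draws_gls.mean(axis=0))
print("Smallest eigenvalue of Var(OLS) - Var(GLS):", min_eig)
```

A non-negative smallest eigenvalue means that Var(OLS) - Var(GLS) is positive semi-definite, which is the sense in which GLS is "best" in this simulated setting.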
 

**Question 22 : In the specification of question 10, test the hypothesis of no heteroskedasticity of linear
form, i.e. in the regression of ${u}^2$ on constant, crime, nox, rooms, proptax, test H0: $\delta_{crime} = \delta_{nox} = \delta_{rooms} = \delta_{proptax} = 0$,
where the coefficients $\delta_{k}$ (k = crime, nox, rooms, proptax) are associated
with the corresponding explanatory variables.**

In linear regression, heteroscedastic errors do not bias the coefficients estimated by the Ordinary Least Squares (OLS) method, but OLS is no longer efficient and the usual estimate of the coefficients' variance is unreliable, which invalidates the standard t and F tests.
Therefore, if non-constant variance is suspected (a simple plot of the residuals against the explanatory variables can reveal heteroscedasticity), it is advisable to conduct a heteroscedasticity test. Several tests have been developed, with null and alternative hypotheses as follows:

H0 : The residuals are homoscedastic, i.e. $\delta_{crime} = \delta_{nox} = \delta_{rooms} = \delta_{proptax} = 0$.

H1: The residuals are heteroscedastic.
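Before running a formal test, the informal check of residuals against an explanatory variable can be done numerically as well as graphically. The sketch below uses simulated data (the variable `x`, the coefficients and the error structure are assumptions, not the housing data) and compares the residual variance across the quartiles of the regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: the error standard deviation grows with x
x = rng.uniform(1.0, 5.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n) * x

# OLS residuals of y on a constant and x
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Residual variance within each quartile of x
edges = np.quantile(x, [0.25, 0.5, 0.75])
bin_index = np.digitize(x, edges)  # integer labels 0, 1, 2, 3
bin_variances = np.array([resid[bin_index == b].var() for b in range(4)])

print("Residual variance by quartile of x:", bin_variances)
```

Under homoscedasticity the four variances would be roughly equal; a clear trend across the quartiles, as in this simulated case, is the pattern a residual plot would show.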

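To test the hypothesis of no heteroskedasticity, we use White's test as implemented in statsmodels (`het_white`) and report its F-statistic. Its auxiliary regression of ${u}^2$ includes the squares and cross-products of the explanatory variables in addition to their levels, so it is more general than the purely linear form.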

In [27]:
# Significance level
alpha = 0.05

# OLS model for the original regression
X = df_hprices[["crime", "nox", "rooms", "proptax"]]
X = sm.add_constant(X)
y = df_hprices["lprice"]
model = sm.OLS(y, X).fit()

# White's test for heteroscedasticity
# het_white returns (LM statistic, LM p-value, F-statistic, F p-value)
white_test = sms.het_white(model.resid, model.model.exog)
print("\n========== White's Test for Heteroscedasticity ==========")
print(f'* p_value: {white_test[1]}')  # p-value of the LM statistic
print(f'* F-statistic: {white_test[2]}')
# Interpretation of the test result
if white_test[1] < alpha:
    print(f"We reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is evidence of heteroscedasticity.")
else:
    print(f"We do not reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is no strong evidence of heteroscedasticity.")



* p_value: 1.5360087876817982e-19
* F-statistic: 11.278053381113295
We reject the null hypothesis at 5.0% significance level.
There is evidence of heteroscedasticity.
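The cell above relies on `het_white`. The linear form stated in the question (regress ${u}^2$ on a constant and the regressors in levels, then test that all slopes are zero with an F-test) can also be computed by hand. The following is a minimal sketch on simulated data; the function name `linear_het_f_test` and the data-generating process are hypothetical, and the numbers it prints are not results for the housing data.

```python
import numpy as np
from scipy import stats


def linear_het_f_test(y, X):
    """F-test of H0: all slopes are zero in the regression of squared OLS
    residuals on X. X must contain a constant in its first column."""
    n, k = X.shape
    # Step 1: OLS residuals of the original regression
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta_hat) ** 2
    # Step 2: auxiliary regression of u^2 on the same regressors
    delta_hat, *_ = np.linalg.lstsq(X, u2, rcond=None)
    ssr_unrestricted = np.sum((u2 - X @ delta_hat) ** 2)
    ssr_restricted = np.sum((u2 - u2.mean()) ** 2)  # constant only
    # Step 3: F-statistic with q = k - 1 restrictions
    q = k - 1
    f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))
    p_value = stats.f.sf(f_stat, q, n - k)
    return f_stat, p_value


# Hypothetical data in which the error standard deviation is proportional to x1
rng = np.random.default_rng(42)
n = 500
x1 = rng.uniform(1.0, 5.0, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n) * x1

f_stat, p_value = linear_het_f_test(y, X)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value:.3g}")
```

On the notebook's data, the same auxiliary regression should be obtainable with `sms.het_breuschpagan(model.resid, model.model.exog)`, which, as far as I understand its return values, also reports the F-statistic of this regression.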


**Question 23 : In the specification of question 11, test the hypothesis of no heteroskedasticity of linear form.**

In [29]:
# Significance level
alpha = 0.05

# OLS model for the original regression
X = df_hprices[["crime", "lnox", "rooms", "lproptax"]]
X = sm.add_constant(X)
y = df_hprices["lprice"]
model = sm.OLS(y, X).fit()

# White's test for heteroscedasticity
# het_white returns (LM statistic, LM p-value, F-statistic, F p-value)
white_test = sms.het_white(model.resid, model.model.exog)
print("\n========== White's Test for Heteroscedasticity ==========")
print(f'* p_value: {white_test[1]}')  # p-value of the LM statistic
print(f'* F-statistic: {white_test[2]}')

# Interpretation of the test result
if white_test[1] < alpha:
    print(f"We reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is evidence of heteroscedasticity.")
else:
    print(f"We do not reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is no strong evidence of heteroscedasticity.")



* p_value: 5.094548061513088e-19
* F-statistic: 10.958866265078212
We reject the null hypothesis at 5.0% significance level.
There is evidence of heteroscedasticity.


**Question 24 : In the specification of question 9, test the hypothesis of no heteroskedasticity of linear form.**

In [30]:
# Significance level
alpha = 0.05

# OLS model for the original regression
X = df_hprices[["crime", "nox", "rooms", "proptax"]]
X = sm.add_constant(X)
y = df_hprices["price"]
model = sm.OLS(y, X).fit()

# White's test for heteroscedasticity
# het_white returns (LM statistic, LM p-value, F-statistic, F p-value)
white_test = sms.het_white(model.resid, model.model.exog)
print("\n========== White's Test for Heteroscedasticity ==========")
print(f'* p_value: {white_test[1]}')  # p-value of the LM statistic
print(f'* F-statistic: {white_test[2]}')

# Interpretation of the test result
if white_test[1] < alpha:
    print(f"We reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is evidence of heteroscedasticity.")
else:
    print(f"We do not reject the null hypothesis at {alpha * 100}% significance level.")
    print("There is no strong evidence of heteroscedasticity.")



* p_value: 3.859447946880169e-12
* F-statistic: 7.045245453251409
We reject the null hypothesis at 5.0% significance level.
There is evidence of heteroscedasticity.


**Question 25 :  Comment on the differences between your results of questions 22,23, 24.**

All three tests reject the null hypothesis of homoscedasticity at the 5% level: the F-statistic is 11.28 (reported p-value about 1.5e-19) in Question 22, 10.96 (about 5.1e-19) in Question 23, and 7.05 (about 3.9e-12) in Question 24. The two specifications with "lprice" as the dependent variable (Questions 22 and 23) give nearly identical results, so taking the logarithm of nox and proptax barely changes the picture. The specification with "price" in levels (Question 24) yields a smaller statistic, but the rejection remains overwhelming. The logarithmic transformation of the dependent variable therefore does not remove heteroscedasticity here: on these tests the evidence is, if anything, stronger in the log specifications than in the level specification.

**Question 26 : Regardless of the results of the test of question 22, identify the most significant variable
causing heteroskedasticity using the Student statistics and run a WLS regression with the
identified variable as weight.**

In [31]:
# OLS model for the original regression
X = df_hprices[["crime", "nox", "rooms", "proptax"]]
X = sm.add_constant(X)
y = df_hprices["lprice"]
model_original = sm.OLS(y, X).fit()

# Auxiliary regression of the squared residuals on the same regressors
aux_model = sm.OLS(model_original.resid ** 2, X).fit()

# Student statistics of the auxiliary coefficients, excluding the constant
t_stats = aux_model.tvalues.drop("const")

# Identify the variable with the largest absolute Student statistic
variable_name = t_stats.abs().idxmax()

print(f"Most Significant Variable Causing Heteroskedasticity: {variable_name}")

# Run WLS regression, assuming the error variance is proportional to the identified variable
weights = 1 / df_hprices[variable_name]
wls_model = sm.WLS(y, X, weights=weights).fit()

# Display WLS regression results
print(wls_model.summary())

Most Significant Variable Causing Heteroskedasticity: nox
                            WLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.635
Model:                            WLS   Adj. R-squared:                  0.632
Method:                 Least Squares   F-statistic:                     217.6
Date:                Mon, 11 Dec 2023   Prob (F-statistic):          4.62e-108
Time:                        00:04:52   Log-Likelihood:                 7.0580
No. Observations:                 506   AIC:                            -4.116
Df Residuals:                     501   BIC:                             17.02
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------