Part 2 - Heteroskedasticity

Question 18: Explain the problem of heteroskedasticity with an example of the course.




Heteroskedasticity occurs when the variance of the errors in a regression model is not constant but varies with the values of the independent variables.

A simple way to visualize this concept is the following example: imagine we plot age against weight for a large number of people. At the low end of the age axis we find very little variance in weight: although babies and children do differ in weight, the differences are rarely dramatic. As age increases, a fraction of people gain a lot of weight relative to the average, while another fraction lose weight relative to the average. The variance of weight is therefore larger at higher ages than in the first years of human development.
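The age and weight intuition can be sketched with a small simulation. The data-generating process below is an assumption made only for illustration (the error's standard deviation is set to 0.2 times age); it is not estimated from any real data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: the error in weight has a standard deviation
# proportional to age, so its variance grows with age.
ages = [random.uniform(1, 80) for _ in range(5000)]
errors = [random.gauss(0, 0.2 * age) for age in ages]

# Compare the spread of the errors for the youngest and the oldest groups.
young = [e for age, e in zip(ages, errors) if age < 10]
old = [e for age, e in zip(ages, errors) if age > 70]

var_young = statistics.variance(young)
var_old = statistics.variance(old)

print(f"error variance, age < 10: {var_young:.2f}")
print(f"error variance, age > 70: {var_old:.2f}")
```

Under homoskedasticity the two printed variances would be roughly equal; here the second one is much larger by construction.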

Now consider the example given in the course:

$$
\sigma_i^2 = x_{i1}^2 + x_{i2}^2 + \dots + x_{iK}^2
$$
Here, the variance of the errors depends on the squares of the independent variables. This means the dispersion of the errors increases with the magnitude of the independent variables.

The full model specification becomes:

$$
y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + u_i
$$

With the heteroskedasticity assumption:

$$
\text{Var}(u_i | x_{i1}, x_{i2}, \dots, x_{iK}) = x_{i1}^2 + x_{i2}^2 + \dots + x_{iK}^2
$$
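As a worked illustration with hypothetical values (chosen here, not taken from the course), take $K = 2$ and two observations $i$ and $j$:

$$
(x_{i1}, x_{i2}) = (1, 2) \Rightarrow \text{Var}(u_i | x_{i1}, x_{i2}) = 1^2 + 2^2 = 5, \qquad (x_{j1}, x_{j2}) = (3, 4) \Rightarrow \text{Var}(u_j | x_{j1}, x_{j2}) = 3^2 + 4^2 = 25
$$

Under homoskedasticity both conditional variances would equal the same constant $\sigma^2$; under this assumption the second observation's error is five times as dispersed in variance terms.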



Question 19: In the specification of question 9, test the hypothesis of no heteroskedasticity of linear form, i.e. in the regression of u2 (the squared residuals) on a constant, crime, nox, rooms, proptax, test H0: the coefficients on crime, nox, rooms, proptax are all equal to 0.

In [2]:
import pandas as pd

# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
df = pd.read_csv('HPRICE2.RAW', sep=r'\s+', decimal='.', header=None)
df = df.apply(pd.to_numeric)

df = df.rename(columns={
    0: "price",
    1: "crime",
    2: "nox",
    3: "rooms",
    4: "dist",
    5: "radial",
    6: "proptax",
    7: "stratio",
    8: "lowstat",
    9: "lprice",
    10: "lnox",
    11: "lproptax",
})


In [3]:
import statsmodels.api as sm

# this is the dataframe with the independent variables
X = df[['crime', 'nox', 'rooms', 'proptax']]
X = sm.add_constant(X)

# this is the dataframe with the dependent variable
y = df['price']

# running the regression
model = sm.OLS(y, X).fit()

# now, regressing the squared residuals u2 on the same variables
u = model.resid
u2 = u ** 2
y = u2
model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     6.799
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           2.47e-05
Time:                        14:47:50   Log-Likelihood:                -10130.
No. Observations:                 506   AIC:                         2.027e+04
Df Residuals:                     501   BIC:                         2.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -5.885e+07   6.41e+07     -0.917      0.3

Only 5.1% of the variation in the squared residuals is explained by the independent variables (R-squared).
The p-value of the F-statistic is very low (2.47e-05), which means results like these would be very unlikely under homoskedasticity, so we are probably looking at heteroskedasticity. The F-statistic alone does not tell us which variable is responsible, since it measures the joint significance of the variables.
The p-values of the individual variables are all fairly high except for proptax. This means proptax is significantly associated with the variance of the residuals: we reject the hypothesis of homoskedasticity.
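The F-statistic in the summary can be recovered from the R-squared of the auxiliary regression, which makes the link between the two numbers explicit. The sketch below uses the rounded values printed above (n = 506, k = 4, R-squared = 0.051), so it lands close to, not exactly on, the reported 6.799. It also computes the LM statistic n * R-squared, a variant of the same test that is not used in the answers above and is shown only for comparison; the helper names are hypothetical.

```python
import math

def f_stat_from_r2(r2, n, k):
    """F statistic for H0: all k slope coefficients are zero."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even df."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers an even df")
    half = x / 2.0
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)

# Values read from the printed summary (R-squared is rounded to 3 decimals)
n, k, r2 = 506, 4, 0.051

f_stat = f_stat_from_r2(r2, n, k)
lm_stat = n * r2                          # LM form: n times R-squared
lm_pvalue = chi2_sf_even_df(lm_stat, k)   # chi-square with k degrees of freedom

print(f"F  = {f_stat:.3f}")
print(f"LM = {lm_stat:.3f}, p-value = {lm_pvalue:.2e}")
```

Both forms lead to the same conclusion here: the null of no linear-form heteroskedasticity is rejected at conventional levels.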



Question 20: In the specification of question 10, test the hypothesis of no heteroskedasticity of linear form.

In [4]:
import statsmodels.api as sm

# this is the dataframe with the independent variables
X = df[['crime', 'nox', 'rooms', 'proptax']]
X = sm.add_constant(X)

# this is the dataframe with the dependent variable
y = df['lprice']

# running the regression
model = sm.OLS(y, X).fit()

# now, regressing the squared residuals u2 on the same variables
u = model.resid
u2 = u ** 2
y = u2
model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.126
Model:                            OLS   Adj. R-squared:                  0.119
Method:                 Least Squares   F-statistic:                     17.98
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           8.33e-14
Time:                        14:47:50   Log-Likelihood:                 185.10
No. Observations:                 506   AIC:                            -360.2
Df Residuals:                     501   BIC:                            -339.1
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0757      0.090      0.842      0.4

We find an even larger F-statistic (17.98), meaning the independent variables jointly explain variation in the squared residuals. This points to heteroskedasticity.
Now not only is the p-value of proptax small, the p-value of rooms is small as well. These two independent variables are the ones explaining the variation in the squared residuals.

Question 21: In the specification of question 11, test the hypothesis of no heteroskedasticity of linear form.

In [5]:
import statsmodels.api as sm

# this is the dataframe with the independent variables
X = df[['crime', 'lnox', 'rooms', 'lproptax']]
X = sm.add_constant(X)

# this is the dataframe with the dependent variable
y = df['lprice']

# running the regression
model = sm.OLS(y, X).fit()

# now, regressing the squared residuals u2 on the same variables
u = model.resid
u2 = u ** 2
y = u2
model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.109
Method:                 Least Squares   F-statistic:                     16.51
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           1.02e-12
Time:                        14:47:50   Log-Likelihood:                 184.66
No. Observations:                 506   AIC:                            -359.3
Df Residuals:                     501   BIC:                            -338.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.5496      0.167     -3.289      0.0

The F-statistic is high (16.51): the independent variables are jointly highly significant, which means they explain variation in the dependent variable of this regression (the squared residuals).

We can also see that rooms is marginally statistically significant (p-value close to 5%) and lproptax is highly significant in explaining the squared residuals. These two variables therefore account for the high F-statistic.



Question 22: Comment on the differences between your results of questions 20 and 21.

- Question 20 refers to the model built in question 10, which uses lprice as the dependent variable and a constant, crime, nox, rooms, and proptax as the independent variables. That model showed a reasonably good fit (R-squared: 0.611), and the p-values of all the coefficients are very low, meaning they are all statistically significant. As for the results of question 20: we found a large F-statistic, meaning the independent variables jointly explain variation in the squared residuals, which points to heteroskedasticity. Looking at the coefficients, the p-values of proptax and rooms are small, meaning these are the variables explaining that variation and driving the high F-statistic.

- For question 21 we used the model built in question 11, where the only difference from question 10 is that lnox replaces nox and lproptax replaces proptax. Here we also obtained a high F-statistic (16.51, versus 17.98 in question 20), and on closer inspection lproptax still explains the variation in the squared residuals, while rooms is now only marginally statistically significant (p-value close to 5%).

Question 23: Using the specification of question 9, identify the most significant variable causing heteroskedasticity using the Student statistics and run a WLS regression with the identified variable as weight. Compare the standard errors with those of question 9. Comment on your results.

In [None]:
import statsmodels.api as sm

# this is the dataframe with the independent variables
X = df[['crime', 'nox', 'rooms', 'proptax']]
X = sm.add_constant(X)

# this is the dataframe with the dependent variable
y = df['price']

# running the regression
model = sm.OLS(y, X).fit()

# now we need to test for heteroskedasticity
y = model.resid ** 2
results = sm.OLS(y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     6.799
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           2.47e-05
Time:                        14:49:25   Log-Likelihood:                -10130.
No. Observations:                 506   AIC:                         2.027e+04
Df Residuals:                     501   BIC:                         2.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -5.885e+07   6.41e+07     -0.917      0.3

proptax is the only variable with a low p-value, which suggests it is the variable most significantly associated with the heteroskedasticity.

In [7]:
import statsmodels.api as sm

# this is the dataframe with the independent variables
X = df[['crime', 'nox', 'rooms', 'proptax']]
X = sm.add_constant(X)

# this is the dataframe with the dependent variable
y = df['price']

# running the regression
model = sm.OLS(y, X).fit()

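# squared residuals from the OLS fit
u = model.resid
u2 = u ** 2
# note: y now holds the squared residuals, so the WLS below is fit on u2, not on price
y = u2
model = sm.OLS(y, X)
results = model.fit()

# weights inversely proportional to proptax
weights = 1 / df['proptax']

wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.summary())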

print("\nWLS Standard Errors:")
print(wls_model.bse)


                            WLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.056
Model:                            WLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     7.409
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           8.39e-06
Time:                        14:47:50   Log-Likelihood:                -10004.
No. Observations:                 506   AIC:                         2.002e+04
Df Residuals:                     501   BIC:                         2.004e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.263e+08    4.9e+07     -2.575      0.0