# Tests for Heteroscedasticity

Under heteroscedastic errors, OLS estimators remain unbiased and consistent, but they are inefficient and the conventional standard errors are incorrect.

Hence it is very important to detect this anomaly in your regression.


We will illustrate how to test for heteroscedasticity using Current Population Survey (CPS) data consisting of 100 observations on wages, educational level, years of experience, and unionization status of U.S. male workers. The data were borrowed from J&DN’s (1997) Econometric Methods and slightly adjusted for the purposes of this tutorial. The variables are lnwage (log wage), grade (educational level), exp (years of experience), and union (a dummy for unionization status):

In [1]:
d.d<-read.table("http://www.econ.uiuc.edu/~econ508/data/CPS.txt",header=T)
head(d.d)

lnwage,grade,exp,union
2.331172,8,22,0
1.504077,14,2,0
3.911523,16,22,0
2.197225,8,34,1
2.788093,9,47,0
2.351375,9,32,0


In [2]:
lnwage<-d.d$lnwage
grade<-d.d$grade
exp<-d.d$exp
union<-d.d$union
exp2<-exp^2

In [3]:
summary(lm(lnwage~grade+exp+exp2+union))


Call:
lm(formula = lnwage ~ grade + exp + exp2 + union)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01553 -0.28642 -0.04438  0.29378  1.45359 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.5951062  0.2834855   2.099 0.038447 *  
grade        0.0835426  0.0200928   4.158 7.04e-05 ***
exp          0.0502742  0.0141370   3.556 0.000589 ***
exp2        -0.0005617  0.0002879  -1.951 0.053954 .  
union        0.1659285  0.1244544   1.333 0.185639    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.47 on 95 degrees of freedom
Multiple R-squared:  0.3718,	Adjusted R-squared:  0.3453 
F-statistic: 14.06 on 4 and 95 DF,  p-value: 4.794e-09


## Test 1: White

Run the OLS regression (as you’ve done above, the results are omitted):

Get the residuals:

Generate the squared residuals:

Generate new explanatory variables, in the form of the squares of the explanatory variables and the cross-product of the explanatory variables:

Because union is a dummy variable, its squared values are equal to the original values, and we don’t need to add the squared dummy in the model. Also the squared experience was already in the original model (in the form of exp2), so we don’t need to add that in this auxiliary regression.

Regress the squared residuals on a constant, the original explanatory variables, and the set of auxiliary explanatory variables (squares and cross-products) you’ve just created:


Get the sample size (N) and the R-squared (R2), and construct the test statistic N*R2:


Under the null hypothesis of homoscedasticity, NR2 is asymptotically distributed as a Chi-squared with k-1 degrees of freedom, where k is the number of coefficients in the auxiliary regression. Here k=13.
The test statistic NR2 is about 10.7881, while the 5% critical value of a Chi-squared with 12 degrees of freedom is about 21.03, well above the test statistic. Hence, the null hypothesis of homoscedasticity cannot be rejected.



In [4]:
g<-lm(lnwage~grade+exp+exp2+union)

g.resid<-g$resid

g.resid2<-g.resid^2

grade2<-grade^2 
exp4<-exp2^2 
gradexp<-grade*exp 
gradexp2<-grade*exp2 
gradeuni<-grade*union 
exp3<-exp*exp2 
expunion<-exp*union 
exp2uni<-exp2*union

g.final<-lm(g.resid2~grade+exp+exp2+union+grade2+exp4+exp3+gradexp +gradexp2+gradeuni+expunion+exp2uni)

N<-(g$df)+length(g$coef)
R2<-summary(g.final)$r.squared 
N*R2


In [5]:
qchisq(.95, df=12) 
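
Instead of comparing NR2 with a critical value, one can also report a p-value. The sketch below is one compact way to do this, not the tutorial's own code. It runs on a simulated data frame `sim` (hypothetical values that only mimic the CPS column names), so its numbers will not match the results above. The formula `(grade + exp + I(exp^2) + union)^2` expands to the levels plus all pairwise cross-products, which should give the same 12 auxiliary regressors as the hand-built variables.

```r
# Simulated stand-in for the CPS data (hypothetical values, same column names)
set.seed(508)
n <- 100
sim <- data.frame(
  grade = sample(8:18, n, replace = TRUE),
  exp   = sample(1:45, n, replace = TRUE),
  union = rbinom(n, 1, 0.3)
)
sim$lnwage <- 0.6 + 0.08 * sim$grade + 0.05 * sim$exp -
  0.0006 * sim$exp^2 + 0.15 * sim$union + rnorm(n, sd = 0.45)

# Original regression and its squared residuals
g <- lm(lnwage ~ grade + exp + I(exp^2) + union, data = sim)
sim$e2 <- resid(g)^2

# Auxiliary regression: levels, pairwise cross-products, and the two
# squares that are not already in the model (grade^2 and exp^4)
aux <- lm(e2 ~ (grade + exp + I(exp^2) + union)^2 + I(grade^2) + I(exp^4),
          data = sim)

white_stat <- nobs(aux) * summary(aux)$r.squared   # N * R2
white_df   <- aux$rank - 1                         # k - 1
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)
c(statistic = white_stat, df = white_df, p.value = white_p)
```

A p-value above 0.05 leads to the same conclusion as a test statistic below the 5% critical value.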

## Test 2: Breusch-Pagan-Godfrey

The Lagrange Multiplier test proposed by BPG can be executed as follows:

Run the OLS regression (as you’ve done above, the output is omitted):

    g<-lm(lnwage~grade+exp+exp2+union)
    
Get the sum of the squared residuals:


    g.resid<-g$resid
    g.ssr<-sum((g$resid)^2)
    g.ssr 
    

Generate a disturbance correction factor in the form of sum of the squared residuals divided by the sample size:

    dcf<-g.ssr/((g$df)+length(g$coef))
    
Regress the adjusted squared errors (in the form of original squared errors divided by the correction factor) on a list of explanatory variables supposed to influence the heteroscedasticity. Following JDN, we will assume that, from the original dataset, only the main variables grade, exp, and union affect the heteroscedasticity. Hence:

    adjerr2<-(g.resid^2)/dcf    
    g.bptest<-lm(adjerr2~grade+exp+union)
    summary(g.bptest)
    
    
This auxiliary regression gives you an explained (model) sum of squares (ESS):

    ess<-sum((g.bptest$fitted-mean(adjerr2))^2)  
    
    
Under the null hypothesis of homoscedasticity, (1/2) ESS is asymptotically distributed as a Chi-squared with k-1 degrees of freedom, where k is the number of coefficients in the auxiliary regression. Here k=4, so we compare (1/2) ESS with the 5% critical value of a Chi-squared with 3 degrees of freedom. We get (1/2) ESS = 5.35, while the critical value is 7.81. The test statistic falls short of the critical value, so the null hypothesis of homoscedasticity cannot be rejected.

In [6]:
g<-lm(lnwage~grade+exp+exp2+union)
g.resid<-g$resid
g.ssr<-sum((g$resid)^2)
g.ssr 
dcf<-g.ssr/((g$df)+length(g$coef))
adjerr2<-(g.resid^2)/dcf    
g.bptest<-lm(adjerr2~grade+exp+union)
summary(g.bptest)


Call:
lm(formula = adjerr2 ~ grade + exp + union)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5484 -0.8613 -0.4512  0.2889  8.7774 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.326100   0.949202  -0.344    0.732
grade        0.098944   0.064351   1.538    0.127
exp          0.009954   0.013198   0.754    0.453
union       -0.582429   0.396332  -1.470    0.145

Residual standard error: 1.579 on 96 degrees of freedom
Multiple R-squared:  0.0428,	Adjusted R-squared:  0.01288 
F-statistic: 1.431 on 3 and 96 DF,  p-value: 0.2386


In [8]:
ess<-sum((g.bptest$fitted-mean(adjerr2))^2)  
1/2*ess

In [9]:
qchisq(.95, df=3) 
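
The BPG steps can also be wrapped in a small helper that returns the statistic together with its p-value. This is a minimal sketch: the function name `bpg_test`, its arguments, and the simulated data frame `sim` are illustrative assumptions, and the numbers it prints are not the CPS results.

```r
# Simulated stand-in for the CPS data (hypothetical values, same column names)
set.seed(508)
n <- 100
sim <- data.frame(
  grade = sample(8:18, n, replace = TRUE),
  exp   = sample(1:45, n, replace = TRUE),
  union = rbinom(n, 1, 0.3)
)
sim$lnwage <- 0.6 + 0.08 * sim$grade + 0.05 * sim$exp -
  0.0006 * sim$exp^2 + 0.15 * sim$union + rnorm(n, sd = 0.45)

# Hypothetical helper: fit is the original lm, vars names the variables
# assumed to drive the variance, data is the data frame used to fit the model
bpg_test <- function(fit, vars, data) {
  e2 <- resid(fit)^2
  data$adjerr2 <- e2 / mean(e2)             # mean(e2) equals SSR / N
  aux <- lm(reformulate(vars, response = "adjerr2"), data = data)
  ess <- sum((fitted(aux) - mean(data$adjerr2))^2)
  stat <- ess / 2                           # (1/2) ESS
  df <- aux$rank - 1                        # k - 1
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

g <- lm(lnwage ~ grade + exp + I(exp^2) + union, data = sim)
bpg <- bpg_test(g, c("grade", "exp", "union"), sim)
bpg
```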

## Test 3: Goldfeld-Quandt

Suppose now you believe a single explanatory variable is responsible for most of the heteroscedasticity in your model. For example, let’s say that experience (exp) is the “trouble-maker” variable. You can then proceed with the Goldfeld-Quandt test as follows:

Sort your data according to the variable exp. Then divide your data into, say, three parts, drop the observations of the central part, and run separate regressions for the bottom part (Regression 1) and the top part (Regression 2). After each regression, obtain the respective Residual Sum of Squares (RSS):

Then compute the ratio of the Residual Sums of Squares, $R = RSS_2/RSS_1$. Under the null hypothesis of homoscedasticity, this ratio $R$ is distributed as $F\left(\frac{n-c-2k}{2}, \frac{n-c-2k}{2}\right)$, where $n$ is the sample size, $c$ is the number of dropped observations, and $k$ is the number of regressors in the model (including the constant).

As a check on your results, you should get $R<F$, where $F$ is the 5% critical value, and consequently you cannot reject the null hypothesis of homoscedasticity.
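
The Goldfeld-Quandt steps described above can be sketched as follows. This is an illustration on a simulated data frame `sim` (hypothetical values that only mimic the CPS column names), so the numbers will differ from those obtained with the actual data. The split into thirds (33 observations at each end, 34 dropped) is one possible choice.

```r
# Simulated stand-in for the CPS data (hypothetical values, same column names)
set.seed(508)
n <- 100
sim <- data.frame(
  grade = sample(8:18, n, replace = TRUE),
  exp   = sample(1:45, n, replace = TRUE),
  union = rbinom(n, 1, 0.3)
)
sim$lnwage <- 0.6 + 0.08 * sim$grade + 0.05 * sim$exp -
  0.0006 * sim$exp^2 + 0.15 * sim$union + rnorm(n, sd = 0.45)

# Sort by the suspected variable and split into three parts
sim <- sim[order(sim$exp), ]
n1 <- floor(n / 3)                    # size of the bottom and top parts
low  <- sim[1:n1, ]                   # Regression 1: lowest exp
high <- sim[(n - n1 + 1):n, ]         # Regression 2: highest exp
c_drop <- n - 2 * n1                  # number of dropped central observations

fit1 <- lm(lnwage ~ grade + exp + I(exp^2) + union, data = low)
fit2 <- lm(lnwage ~ grade + exp + I(exp^2) + union, data = high)
rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)

R <- rss2 / rss1                      # ratio of residual sums of squares
k <- length(coef(fit1))               # regressors including the constant
df_gq <- (n - c_drop - 2 * k) / 2     # degrees of freedom in each part
crit <- qf(0.95, df_gq, df_gq)        # 5% critical value
p_gq <- pf(R, df_gq, df_gq, lower.tail = FALSE)
c(R = R, df = df_gq, critical = crit, p.value = p_gq)
```

Homoscedasticity is rejected at the 5% level only if `R` exceeds `crit`.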

### A Simpler Approach

The three heteroscedasticity tests presented here are classical ones, so there are packages that already compute them for you. One such package is `lmtest`. For example, you can run the Breusch-Pagan-Godfrey test by typing:

In [11]:
 require(lmtest)




In [12]:
bptest(lnwage~grade+exp+exp2+union, studentize=FALSE)


	Breusch-Pagan test

data:  lnwage ~ grade + exp + exp2 + union
BP = 6.1161, df = 4, p-value = 0.1906
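
The call above reports df = 4 because, by default, `bptest` uses all regressors of the main model (including exp2) as variance regressors, whereas the manual computation used only grade, exp, and union. Assuming the goal is to mirror the manual steps, the `varformula` argument can restrict the auxiliary regression. The sketch below checks this on a simulated data frame `sim` (hypothetical values, not the CPS data) by comparing the `bptest` statistic with (1/2) ESS computed by hand.

```r
library(lmtest)

# Simulated stand-in for the CPS data (hypothetical values, same column names)
set.seed(508)
n <- 100
sim <- data.frame(
  grade = sample(8:18, n, replace = TRUE),
  exp   = sample(1:45, n, replace = TRUE),
  union = rbinom(n, 1, 0.3)
)
sim$lnwage <- 0.6 + 0.08 * sim$grade + 0.05 * sim$exp -
  0.0006 * sim$exp^2 + 0.15 * sim$union + rnorm(n, sd = 0.45)

# BPG test with the variance assumed to depend on grade, exp, and union only
bp <- bptest(lnwage ~ grade + exp + I(exp^2) + union,
             varformula = ~ grade + exp + union,
             studentize = FALSE, data = sim)
bp

# Manual (1/2) ESS for comparison
g <- lm(lnwage ~ grade + exp + I(exp^2) + union, data = sim)
adjerr2 <- resid(g)^2 / mean(resid(g)^2)
aux <- lm(adjerr2 ~ grade + exp + union, data = sim)
manual <- sum((fitted(aux) - mean(adjerr2))^2) / 2
all.equal(unname(bp$statistic), manual)
```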


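You can also run a Goldfeld-Quandt test and check whether the results you obtained by following the steps above coincide with the output of `gqtest`, included in the same package. Note that, by default, `gqtest` splits the sample at its midpoint in the order given and drops no central observations, so its `order.by` and `fraction` arguments must be set to mirror the manual procedure: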

In [13]:
gqtest(lnwage~grade+exp+exp2+union)


	Goldfeld-Quandt test

data:  lnwage ~ grade + exp + exp2 + union
GQ = 1.4923, df1 = 45, df2 = 45, p-value = 0.09161
alternative hypothesis: variance increases from segment 1 to 2
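
To bring `gqtest` closer to the manual steps, one could sort on exp through `order.by` and drop a central block through `fraction`. The sketch below uses a simulated data frame `sim` (hypothetical values, not the CPS data). It assumes that dropping roughly one third of the observations is the intended split. `gqtest` chooses its own segment boundaries and compares residual variances (RSS divided by degrees of freedom), so small differences from a hand-made split are possible.

```r
library(lmtest)

# Simulated stand-in for the CPS data (hypothetical values, same column names)
set.seed(508)
n <- 100
sim <- data.frame(
  grade = sample(8:18, n, replace = TRUE),
  exp   = sample(1:45, n, replace = TRUE),
  union = rbinom(n, 1, 0.3)
)
sim$lnwage <- 0.6 + 0.08 * sim$grade + 0.05 * sim$exp -
  0.0006 * sim$exp^2 + 0.15 * sim$union + rnorm(n, sd = 0.45)

fml <- lnwage ~ grade + exp + I(exp^2) + union

# Sort by exp inside gqtest and omit about one third of the central observations
gq <- gqtest(fml, order.by = ~ exp, fraction = 1/3, data = sim)
gq

# Sorting the data beforehand should give the same statistic
sim_sorted <- sim[order(sim$exp), ]
gq_sorted <- gqtest(fml, fraction = 1/3, data = sim_sorted)
all.equal(unname(gq$statistic), unname(gq_sorted$statistic))
```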
