# Chapter 11 - Collinearity diagnostics

Joshua French

To open this information in an interactive Colab notebook, click or scan the QR code below.

<a href="https://colab.research.google.com/github/jfrench/LinearRegression/blob/master/notebooks/11-collinearity-diagnostics-notebook.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg"> </a>

------------------------------------------------------------------------

In [None]:
devtools::install_github("jfrench/api2lm")
library(api2lm)
if(!require(faraway, quietly = TRUE)) {
  install.packages("faraway",
                   repos = "https://cran.rstudio.com/")
  library(faraway)
}

We adjust some printing options for clarity.

In [None]:
options(digits = 5, scipen = 2)

# Collinearity and its effects

**Collinearity** or **multicollinearity** occurs in a fitted regression model when the model’s regressor variables are linearly dependent.

-   This rarely occurs in practice unless we improperly specify our model, e.g., including all the indicator variables associated with the levels of a categorical variable.

In practice, our fitted model suffers from collinearity when some of the regressor variables are approximately linearly dependent.

Collinearity leads to many undesirable issues with our fitted model.

-   The parameter estimates can change dramatically for small changes in the data.
-   The signs of the estimated coefficients can be wrong, leading to erroneous conclusions.
-   The standard errors of the estimates can be very large, which leads to insignificant tests for single regression coefficients.
-   The F test for a regression relationship can be significant even though t tests for individual regression coefficients are insignificant.
-   It becomes difficult to interpret the association between the regressors and the response because multiple regressors are trying to play the same role in the model.
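
The inflation of standard errors can be seen in a small simulation. The sketch below uses made-up data (not the `seatpos` data), and the object names `x2_orth`, `x2_coll`, `fit_orth`, and `fit_coll` are hypothetical. It compares the standard error of the estimated coefficient for `x1` when the second regressor is unrelated to `x1` versus when it is nearly a copy of `x1`.

```r
# Simulated data (an assumption for illustration only).
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2_orth <- rnorm(n)                    # unrelated to x1
x2_coll <- x1 + rnorm(n, sd = 0.01)    # nearly a copy of x1
y_orth <- 1 + x1 + x2_orth + rnorm(n)
y_coll <- 1 + x1 + x2_coll + rnorm(n)

fit_orth <- lm(y_orth ~ x1 + x2_orth)
fit_coll <- lm(y_coll ~ x1 + x2_coll)

# Standard error of the estimated coefficient for x1 in each fit
se_orth <- coef(summary(fit_orth))["x1", "Std. Error"]
se_coll <- coef(summary(fit_coll))["x1", "Std. Error"]
c(orthogonal = se_orth, collinear = se_coll)
```

The true coefficient for `x1` is the same in both models, so any difference in the standard errors comes from the relationship between the regressors.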

# Exact versus practical collinearity

Exact collinearity occurs when two or more regressors are perfect linear combinations of each other, which means the columns of our matrix of regressors, $\mathbf{X}$, are linearly dependent.

When regressors in our model are exactly collinear, $\mathbf{X}^T\mathbf{X}$ isn’t invertible and there isn’t a unique solution for the estimated regression coefficients that minimize the RSS.

Exact collinearity only occurs when we have poorly chosen the set of regressors to include in our model.

-   We can correct for exact collinearity by sequentially removing collinear regressors until $\mathbf{X}$ has linearly independent columns.
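
A minimal sketch of exact collinearity, using simulated data (an assumption for illustration), is shown below. The regressor `x3` is constructed as an exact linear combination of `x1` and `x2`, so the matrix of regressors loses a rank and `lm` cannot estimate all the coefficients.

```r
# Simulated data (an assumption for illustration only).
set.seed(2)
n <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x1 - x2            # exact linear combination of x1 and x2
y <- 1 + x1 + x2 + rnorm(n)

# X has 4 columns, but its rank is only 3
X <- cbind(1, x1, x2, x3)
qr(X)$rank

# lm reports NA for the coefficient of the redundant regressor x3
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)
```

Dropping `x3` from the model formula restores a matrix of regressors with linearly independent columns.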

We should be more concerned with practical collinearity, which leads to the problems mentioned above even when the columns of $\mathbf{X}$ are not exactly collinear.

# The `seatpos` data set

The `seatpos` data set in the **faraway** package provides data related to the car seat positioning of drivers and is useful for illustrating how to identify and address collinearity. The data were obtained by researchers at the HuMoSim laboratory at the University of Michigan. The data set includes 38 observations of 9 variables. The variables measured for each driver are:

-   `hipcenter`: the horizontal distance of the midpoint of the driver’s hips from a fixed location in the car in mm (`numeric`).
-   `Age`: age in years (`integer`).
-   `Weight`: weight in pounds (`integer`).
-   `HtShoes`: height when wearing shoes in cm (`integer`).
-   `Ht` : height without shoes in cm (`numeric`).
-   `Seated` : seated height in cm (`numeric`).
-   `Arm` : lower arm length in cm (`numeric`).
-   `Thigh` : thigh length in cm (`numeric`).
-   `Leg` : lower leg length in cm (`numeric`).

We start by loading the `seatpos` data set into our R session.

In [None]:
data(seatpos, package = "faraway")

# Detecting collinearity

Collinearity is often described as occurring when:

-   Regressors are highly correlated with each other.
-   Two or more regressors are approximately linear combinations of each other.

These descriptions are helpful, though incomplete, because collinearity issues can sometimes still occur even when there are no regressors that have high correlation or are close to linear dependence. However, they do suggest some approaches for detecting collinearity.

We discuss several approaches for detecting collinearity below.

## Contradictory significance results

A clear indicator of a collinearity problem in our fitted model is when the test for a regression relationship is significant and the hypothesis tests for individual regression coefficients are all insignificant. We demonstrate this issue using the `seatpos` data.

We fit a model regressing `hipcenter` on all the other variables contained in `seatpos`.

In [None]:
lmod <- lm(hipcenter ~ ., data = seatpos)

We use the `summary` function to see our results.

In [None]:
summary(lmod)

In this context, our set of regressors is $\mathbb{X} = \{\mathtt{Age}, \mathtt{Weight}, \ldots, \mathtt{Leg}\}$.

The test for a regression relationship decides between $$
\begin{aligned}
&H_0: E(Y \mid \mathbb{X}) = \beta_0 \\
&H_a: E(Y\mid \mathbb{X}) = \beta_0 + \beta_1 \mathtt{Age} + \cdots + \beta_{8} \mathtt{Leg}.
\end{aligned}
$$

The test statistic for this test is 7.94 with an associated p-value of approximately 0.000013 (based on an F distribution with 8 numerator degrees of freedom and 29 denominator degrees of freedom).

Thus, we conclude that at least one of the regression coefficients for the regressors in our model differs from zero.

In contradiction, the hypothesis tests for whether the individual coefficients differ from zero assuming the other regressors are in the model are all insignificant.

-   The p-value for the test associated with $\beta_{\mathtt{Age}}$ is 0.1843.
-   The p-value for the test associated with $\beta_{\mathtt{Weight}}$ is 0.9372.

Outside of the intercept coefficient, all of the tests for the individual regression coefficients are insignificant.

We have a contradiction in our testing results. We have concluded that:

-   At least one of the regression coefficients for our regressors differs from zero using the test for a regression relationship.
-   None of the regression coefficients for our regressors differs from zero based on the individual tests for single regression coefficients.

These contradictory results will occur when our regressors exhibit collinearity.
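
The two sets of p-values being compared can be pulled directly from a fitted model. The sketch below defines a hypothetical helper, `test_pvalues` (not part of **api2lm** or **faraway**), and applies it to simulated data. It assumes the model includes an intercept as its first coefficient. Whether the individual tests turn out insignificant for the simulated data depends on the random draw.

```r
# Hypothetical helper: overall F test p-value and individual t test p-values.
test_pvalues <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic   # named vector: value, numdf, dendf
  list(
    overall = unname(pf(f["value"], f["numdf"], f["dendf"],
                        lower.tail = FALSE)),
    # drop the intercept row (assumed to be the first row)
    individual = coef(s)[-1, "Pr(>|t|)"]
  )
}

# Simulated data with two nearly identical regressors (illustration only).
set.seed(3)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)
y <- 1 + x1 + x2 + rnorm(n)

pv <- test_pvalues(lm(y ~ x1 + x2))
pv$overall      # p-value for the test for a regression relationship
pv$individual   # p-values for the individual coefficient tests
```

The same helper could be applied to the model fit to the `seatpos` data, e.g., `test_pvalues(lmod)`.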

## Pairwise correlation

The simplest approach for identifying a potential issue with collinearity is by computing the matrix of pairwise correlations among the regressors in our model.

Dormann et al. (2013) suggest that pairs of regressors with a correlation of at least 0.7 could be problematic if included in a fitted model.

This approach can only detect potential collinearity problems for the simplest kinds of linear relationships.
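
The limitation of pairwise correlations can be illustrated with simulated data (an assumption for illustration). In the sketch below, `x5` is nearly the sum of four independent regressors. No single pair of regressors is expected to be very highly correlated, yet `x5` is almost perfectly predicted by the other four regressors together.

```r
# Simulated data (an assumption for illustration only).
set.seed(4)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- x1 + x2 + x3 + x4 + rnorm(n, sd = 0.1)   # nearly the sum of the others

# Largest pairwise correlation (in magnitude) among the regressors
r <- cor(cbind(x1, x2, x3, x4, x5))
max_abs_cor <- max(abs(r[upper.tri(r)]))
max_abs_cor

# R^2 from regressing x5 on the other regressors
r2_x5 <- summary(lm(x5 ~ x1 + x2 + x3 + x4))$r.squared
r2_x5
```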

### Correlation example

The `cor_stats` function in the **api2lm** package computes the pairwise correlations for the regressors included in a fitted `lm` object.

The function will only print the values with magnitudes of at least 0.7 (or some other threshold the user specifies), making it easy to identify potentially problematic pairs of regressors.

We use `cor_stats` to identify pairs of regressors with high correlation for the model we fit to the `seatpos` data.

In [None]:
cor_stats(lmod)

-   The `Ht` and `HtShoes` variables have a correlation of 1 (when rounded to 2 decimal places).
-   The `Ht` and `Weight` variables have a correlation of 0.83.
-   Several other pairs of variables have high correlations.

We can customize the printing behavior of the `cor_stats` function to change the number of digits shown or the threshold used to censor values.

In [None]:
print(cor_stats(lmod), digits = 3, threshold = 0.9)

## Variance inflation factor

The **variance inflation factor (VIF)** measures the relative increase in $\hat{\mathrm{var}}(\hat{\beta}_j)$ that results from the model’s regressors not being orthogonal.

VIFs are the standard tool for identifying combinations of regressors exhibiting collinearity.

Recall that practical collinearity occurs when regressors are approximate linear combinations of each other.

If one variable is an approximate linear combination of other variables, then regressing that variable on the other regressors should result in a model with a large coefficient of determination ($R^2$).

Let $R_j^2$ denote the coefficient of determination when regressing $X_j$ on $X_1, X_2, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{p-1}$.

The variance inflation factor of $X_j$ is computed as $$VIF_{j} = \frac{1}{1-R_j^2}.$$

The variance of the estimated regression coefficient can be expressed as $$
\mathrm{var}(\hat{\beta}_j)= \sigma^2 \left( \frac{1}{1-R_j^2} \right)\frac{1}{(n-1) s_j^2} = \sigma^2 (VIF_j)\frac{1}{(n-1) s_j^2},
$$ where $s_j^2=\frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2$ is the sample variance of the observed values of $X_j$ and $\bar{x}_j$ is the sample mean of the observed values of $X_j$.

Notice that as $R_j^2$ gets closer to 1 (i.e., we move closer to exact linear dependence between $X_j$ and the other regressors), then $VIF_j$ becomes larger.

If $VIF_j \geq 10$, then there is a potential collinearity problem with $X_j$.

-   A more conservative threshold is 5, since that will identify more regressors with potential collinearity problems.

A VIF of 1 indicates that a regressor is orthogonal to all the other regressors.
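
The definition of the VIF translates directly into code. The sketch below defines a hypothetical helper, `vif_by_hand` (not the **api2lm** implementation), that regresses each regressor on the others and computes $1/(1-R_j^2)$. It assumes the model has an intercept and at least two regressors, and it uses simulated data. As a cross-check, the same values are obtained from the diagonal of the inverse of the regressors' correlation matrix.

```r
# Hypothetical helper: VIF of each regressor computed from its definition.
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Simulated data (an assumption for illustration only).
set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # strongly related to x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
y <- 1 + x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

v <- vif_by_hand(fit)
v

# Cross-check: diagonal of the inverse correlation matrix of the regressors
vif_inv <- diag(solve(cor(model.matrix(fit)[, -1])))
vif_inv
```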

As a side note, the equation above for $\mathrm{var}(\hat{\beta}_j)$ tells us that:

-   If the observed values of $X_j$ do not vary much, then the variance of $\hat{\beta}_j$ will be large.
-   If $s_j$ is large (i.e., the observed values of $X_j$ do vary a lot), then $\mathrm{var}(\hat{\beta}_j)$ will be smaller.
-   This provides us with insight in the rare situation that we control the values of our predictor variables such as in an experiment.
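
The expression for the variance of an estimated coefficient can be checked numerically. The sketch below uses simulated data (an assumption for illustration) and replaces $\sigma^2$ with its estimate from the fitted model, so the result should match the corresponding diagonal element returned by `vcov`.

```r
# Simulated data (an assumption for illustration only).
set.seed(6)
n <- 60
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
y <- 2 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

sigma2_hat <- summary(fit)$sigma^2            # estimate of sigma^2
r2_1 <- summary(lm(x1 ~ x2))$r.squared        # R_1^2: x1 regressed on x2
vif_1 <- 1 / (1 - r2_1)

# sigma^2 * VIF / ((n - 1) * s^2), where var() is the sample variance
var_formula <- sigma2_hat * vif_1 / ((n - 1) * var(x1))
var_lm <- vcov(fit)["x1", "x1"]
c(formula = var_formula, lm = var_lm)
```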

### VIF example

We can use the `vif_stats` function in the **api2lm** package to identify collinear regressors.

In [None]:
vif_stats(lmod)

We see that the variances of the estimated coefficients for `HtShoes` and `Ht` are extremely inflated, with VIFs well above the threshold of 10.

There appears to be a multicollinearity problem in our data set.

VIF is not an appropriate statistic for assessing collinearity for sets of related regressors like dummy-variable regressors or polynomial regressors.

-   The **generalized VIF** should be used in these cases.
-   The generalized VIF measures how much collinearity inflates the size of the joint confidence region for the coefficients of the related regressors.
-   The `vif` function in the **car** package automatically computes the generalized VIF for related regressors.
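
The following is a minimal sketch of a generalized VIF computed by hand with a determinant-based formula, $\det(\mathbf{R}_{11})\det(\mathbf{R}_{22})/\det(\mathbf{R})$, where $\mathbf{R}$ is the correlation matrix of the regressors and the blocks correspond to the related regressors and the remaining regressors. The helper `gvif_by_hand` is hypothetical, the data are simulated, and we assume (without verifying here) that this matches the definition used by `car::vif`. For a single regressor, the formula reduces to the ordinary VIF.

```r
# Hypothetical helper: generalized VIF for the columns named in term_cols.
gvif_by_hand <- function(fit, term_cols) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  R <- cor(X)
  in_term <- colnames(X) %in% term_cols
  det(R[in_term, in_term, drop = FALSE]) *
    det(R[!in_term, !in_term, drop = FALSE]) / det(R)
}

# Simulated data with a three-level factor (illustration only).
set.seed(7)
n <- 90
g <- factor(rep(c("a", "b", "c"), each = n / 3))
x <- rnorm(n) + as.numeric(g)    # x is related to the group
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x + g)

gvif_g <- gvif_by_hand(fit, c("gb", "gc"))   # indicator columns for g
gvif_x <- gvif_by_hand(fit, "x")

# For the single regressor x, the ordinary VIF from its definition
vif_x <- 1 / (1 - summary(lm(x ~ g))$r.squared)
c(gvif_g = gvif_g, gvif_x = gvif_x, vif_x = vif_x)
```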

# Remediation

How do we address the presence of collinearity?

The most common approaches are:

-   Removing collinear regressors.
-   Making simple transformations of the regressors.
    -   Centering the regressors (subtracting their mean).
    -   Scaling the regressors (dividing by their standard deviation).
    -   Standardizing regressors (centering and scaling the regressors).
-   Combining multiple regressors into a single regressor.

## Amputation

We remove one or more regressors from our analysis because they seem to be trying to play the same role in the model.

Removing a regressor from a model when the regressor has a non-zero coefficient will result in a biased model.
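
The trade-off can be sketched with simulated data (an assumption for illustration). Both regressors below have a true coefficient of 1, and `x2` is nearly a copy of `x1`. After removing `x2`, the standard error for `x1` shrinks, but the estimated coefficient for `x1` absorbs the role of `x2` and no longer estimates the original coefficient.

```r
# Simulated data (an assumption for illustration only).
set.seed(8)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly a copy of x1
y <- 1 + x1 + x2 + rnorm(n)      # both true coefficients equal 1

fit_full <- lm(y ~ x1 + x2)
fit_reduced <- update(fit_full, . ~ . - x2)   # remove x2 from the model

# Estimate and standard error for x1 in each model
est_full <- coef(summary(fit_full))["x1", c("Estimate", "Std. Error")]
est_reduced <- coef(summary(fit_reduced))["x1", c("Estimate", "Std. Error")]
rbind(full = est_full, reduced = est_reduced)
```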

## Simple transformation

A simple transformation can sometimes correct a collinearity problem.

The intercept column of $\mathbf{X}$ becomes orthogonal to the other regressors when the other regressors are centered.

-   In that case, the interpretation of the intercept is that it is the mean response when the regressors are at their sample mean values.
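
A short check of both claims, using simulated data (an assumption for illustration): after centering, the inner product between the intercept column and the regressor is zero up to rounding, and the estimated intercept equals the sample mean of the response.

```r
# Simulated data (an assumption for illustration only).
set.seed(9)
n <- 40
x <- rnorm(n, mean = 50, sd = 2)
y <- 3 + 0.5 * x + rnorm(n)
xc <- x - mean(x)                 # centered regressor

# Inner product of the intercept column with the regressor
ip_raw <- sum(rep(1, n) * x)      # far from 0
ip_centered <- sum(rep(1, n) * xc)  # 0 up to rounding
c(raw = ip_raw, centered = ip_centered)

# The intercept of the centered model equals the mean response
fit_centered <- lm(y ~ xc)
c(intercept = unname(coef(fit_centered)[1]), mean_y = mean(y))
```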

## Polynomial correction

Centering a regressor BEFORE using it to construct polynomial terms can help mitigate problems with collinearity among the polynomial terms but will not remove all problems.

Even better, use the `poly` function to create orthogonal polynomial regressors.
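
The three options can be compared on simulated data (an assumption for illustration) by looking at the correlation between the linear and quadratic terms.

```r
# Simulated data (an assumption for illustration only).
set.seed(10)
x <- runif(100, min = 10, max = 20)
xc <- x - mean(x)                # centered version of x

cor_raw <- cor(x, x^2)           # raw terms: correlation close to 1
cor_centered <- cor(xc, xc^2)    # centered first: much weaker correlation

P <- poly(x, degree = 2)         # orthogonal polynomial regressors
cor_poly <- cor(P[, 1], P[, 2])  # 0 up to rounding

c(raw = cor_raw, centered = cor_centered, poly = cor_poly)
```

In a model formula, the orthogonal terms can be requested directly, e.g., `lm(y ~ poly(x, degree = 2))` for some response `y`.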

## Combination

Combining the collinear regressors into a single regressor means they can’t be collinear in our model.

-   This is essentially what happens in Principal Component Analysis, which results in a set of orthogonal regressors.
-   The new regressor will have a novel interpretation.
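
A minimal sketch of combining two collinear regressors with principal components, using `prcomp` from base R and simulated data (an assumption for illustration), is given below. The component scores are uncorrelated with each other, and the first component can be used as a single regressor in place of `x1` and `x2`.

```r
# Simulated data (an assumption for illustration only).
set.seed(11)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # strongly related to x1
y <- 1 + x1 + x2 + rnorm(n)

# Principal components of the standardized regressors
pca <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)
pc1 <- pca$x[, "PC1"]
pc2 <- pca$x[, "PC2"]
cor_pc <- cor(pc1, pc2)          # 0 up to rounding
cor_pc

# Use the first component as a single combined regressor
fit_pc <- lm(y ~ pc1)
coef(summary(fit_pc))
```

The coefficient for `pc1` describes the association between the response and a weighted combination of the standardized `x1` and `x2`, not either regressor individually.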

# References

Dormann, Carsten F., Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R. García Marquéz, et al. 2013. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” *Ecography* 36 (1): 27–46. <https://doi.org/10.1111/j.1600-0587.2012.07348.x>.