# Multicollinearity

Multicollinearity refers to correlation among the predictor attributes (X) in a regression. Strictly, linear regression only requires that no predictor be an exact linear combination of the others, but the estimates are most precise when the predictors are uncorrelated, and in practice predictors are almost always correlated to some degree. What impact does multicollinearity have on a linear regression? The short answer is that the estimated regression parameters (the $\beta$ terms) remain more or less unbiased; however, the same cannot be said for the standard errors, which become inflated.

Let's first use some real data to explore this, then we'll use simulated data to show the impact correlated predictor attributes have on the linear regression estimates.

In [4]:
library(tidyverse)
library(ggformula)

theme_set(theme_bw(base_size = 18))

# Read the data, mean-center ACT and cost, and keep rows where both are observed.
college <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/College-scorecard-4143.csv") %>%
  mutate(act_mean = actcmmid - mean(actcmmid, na.rm = TRUE),
         cost_mean = costt4_a - mean(costt4_a, na.rm = TRUE)) %>%
  drop_na(act_mean, cost_mean)

# Regress admission rate on the two (uncentered) predictors.
adm_mult_reg <- lm(adm_rate ~ actcmmid + costt4_a, data = college)

summary(adm_mult_reg)

[1m[1mRows: [1m[22m[34m[34m7058[34m[39m [1m[1mColumns: [1m[22m[34m[34m16[34m[39m

[36m──[39m [1m[1mColumn specification[1m[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (6): instnm, city, stabbr, preddeg, region, locale
[32mdbl[39m (10): adm_rate, actcmmid, ugds, costt4_a, costt4_p, tuitionfee_in, tuiti...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.




Call:
lm(formula = adm_rate ~ actcmmid + costt4_a, data = college)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65484 -0.12230  0.02291  0.13863  0.37054 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.135e+00  3.326e-02  34.130  < 2e-16 ***
actcmmid    -1.669e-02  1.650e-03 -10.114  < 2e-16 ***
costt4_a    -2.304e-06  4.047e-07  -5.693 1.55e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1813 on 1278 degrees of freedom
Multiple R-squared:  0.1836,	Adjusted R-squared:  0.1824 
F-statistic: 143.8 on 2 and 1278 DF,  p-value: < 2.2e-16


In [9]:
library(mosaic)

cor(actcmmid ~ costt4_a, data = college)

## Variance Inflation Factor

The variance inflation factor (VIF) is a commonly used statistic to aid in diagnosing multicollinearity. It estimates how much the variance of an estimated regression coefficient is inflated because of correlation among the predictor attributes.

The VIF can be calculated with the `vif()` function from the car package.

In [10]:
library(car)

vif(adm_mult_reg)

Fundamentally, the VIF is calculated with the following steps. 

1. Fit a regression where one of the X attributes is the outcome and the remaining X attributes are predictors.
2. Calculate the VIF using the $R^2$ from step 1: $VIF = \frac{1}{1 - R^2}$.
3. Repeat this for all X attributes.
4. Evaluate the extent to which each VIF is problematic. Common rules of thumb flag VIF statistics greater than 5 or 10.
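
The steps above can be sketched in base R. This is a minimal illustration, not the implementation used by `car::vif()`; the helper name `manual_vif()` is hypothetical, and the built-in `mtcars` data stand in for the college data only so that the block runs on its own.

```r
# Hypothetical helper: VIF for each predictor via the auxiliary-regression steps.
manual_vif <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Step 1: regress predictor p on the remaining predictors.
    aux_fit <- lm(reformulate(others, response = p), data = data)
    # Step 2: VIF = 1 / (1 - R^2) from that auxiliary regression.
    1 / (1 - summary(aux_fit)$r.squared)
  }, numeric(1))  # Step 3: vapply repeats this for every predictor.
}

# Illustration with built-in data (an assumption made for self-containment).
preds <- c("disp", "hp", "wt")
vifs <- manual_vif(mtcars, preds)
vifs

# Cross-check: the same values are the diagonal of the inverse correlation matrix.
diag(solve(cor(mtcars[preds])))

# Step 4: flag predictors that exceed a rule-of-thumb cutoff (5 here).
vifs > 5
```

The cross-check relies on a standard identity: the VIFs equal the diagonal elements of the inverse of the predictors' correlation matrix, which avoids fitting one regression per predictor.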

In terms of interpretation, the square root of the VIF indicates how many times larger a coefficient's standard error is than it would be if the predictors were uncorrelated.
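
One way to see where the square root comes from is the standard expression for the estimated sampling variance of a slope in multiple regression. The notation is introduced here for illustration and does not appear elsewhere in this section: $\hat{\sigma}^2$ is the estimated error variance, $s_j^2$ is the sample variance of predictor $X_j$, and $R_j^2$ is the $R^2$ from regressing $X_j$ on the other predictors.

$$
\widehat{\mathrm{Var}}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{(n - 1)\, s_j^2} \cdot \frac{1}{1 - R_j^2}
\qquad \Longrightarrow \qquad
SE(\hat{\beta}_j) = \sqrt{\frac{\hat{\sigma}^2}{(n - 1)\, s_j^2}} \cdot \sqrt{VIF_j}
$$

The first factor is what the variance would be if $X_j$ were uncorrelated with the other predictors ($R_j^2 = 0$); the second factor is the VIF, so the standard error scales with its square root.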

In [11]:
act_lm <- lm(actcmmid ~ costt4_a, data = college)
summary(act_lm)$r.squared

In [13]:
1 / (1 - .3088)

In [14]:
sqrt(1.446)
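
With only two predictors, the $R^2$ from step 1 is simply the squared correlation $r$ between them, so the arithmetic in the cells above can be written out as follows. The values are derived from the $R^2$ of .3088 used above; the implied correlation magnitude is computed here rather than read from the output.

$$
VIF = \frac{1}{1 - r^2} = \frac{1}{1 - 0.3088} \approx 1.45,
\qquad \sqrt{VIF} \approx 1.20,
\qquad |r| = \sqrt{0.3088} \approx 0.56
$$

Under this reading, each standard error in the admission rate model is roughly 20% larger than it would be if the two predictors were uncorrelated.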

### Simulated data exploration

Let's use some simulated data to show the extent to which multicollinearity can be problematic, particularly for more highly correlated attributes. 

First, let's start with uncorrelated (on average) predictor attributes.

In [37]:
library(simglm)

sim_args <- list(formula = y ~ 1 + act + gpa + sat, 
                 fixed = list(act = list(var_type = 'continuous',
                                         mean = 20, 
                                         sd = 4),
                              gpa = list(var_type = 'continuous',
                                         mean = 2, 
                                         sd = .5),
                              sat = list(var_type = 'continuous',
                                         mean = 500, 
                                         sd = 100)),
                 correlate = list(fixed = data.frame(x = c('act', 'act', 'gpa'), 
                                                     y = c('gpa', 'sat', 'sat'), 
                                                     corr = c(0, 0, 0))),
                 error = list(variance = 100),
                 reg_weights = c(1, 1, 1, 1),
                 sample_size = 10000)

sim_data <- simulate_fixed(data = NULL, sim_args) %>%
   simulate_error(sim_args) %>%
   correlate_variables(sim_args) %>%
   generate_response(sim_args)

head(sim_data)

| | X.Intercept. | act_old | gpa_old | sat_old | level1_id | error | act | gpa | sat | fixed_outcome | random_effects | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 25.81868 | 2.387232 | 625.0705 | 1 | 19.894009 | 23.09785 | 2.625352 | 645.467 | 672.1902 | 0 | 692.0842 |
| 2 | 1 | 20.78 | 1.851529 | 449.0296 | 2 | 12.183575 | 18.81223 | 1.745148 | 519.5 | 541.0573 | 0 | 553.2409 |
| 3 | 1 | 19.32782 | 1.987358 | 641.002 | 3 | -5.933002 | 19.89886 | 2.70501 | 483.1956 | 506.7995 | 0 | 500.8665 |
| 4 | 1 | 23.35195 | 2.040886 | 567.8754 | 4 | -12.370279 | 20.32709 | 2.339377 | 583.7989 | 607.4653 | 0 | 595.0951 |
| 5 | 1 | 25.34864 | 2.433653 | 380.9119 | 5 | 6.917448 | 23.46923 | 1.404559 | 633.716 | 659.5898 | 0 | 666.5072 |
| 6 | 1 | 23.98652 | 2.246192 | 419.8021 | 6 | -9.635669 | 21.96954 | 1.599011 | 599.6631 | 624.2316 | 0 | 614.596 |


In [25]:
cor(sim_data[c('y', 'act', 'gpa', 'sat')])

| | y | act | gpa | sat |
|---|---|---|---|---|
| y | 1.0 | 0.039278283 | 0.0047364836 | 0.9941980371 |
| act | 0.039278283 | 1.0 | 0.0074519131 | 0.000421451 |
| gpa | 0.004736484 | 0.007451913 | 1.0 | 0.0003548581 |
| sat | 0.994198037 | 0.000421451 | 0.0003548581 | 1.0 |


In [38]:
sim_lm <- lm(y ~ 1 + act + gpa + sat, data = sim_data)
summary(sim_lm)


Call:
lm(formula = y ~ 1 + act + gpa + sat, data = sim_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.056  -6.743   0.005   6.694  40.075 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -0.129302   0.823839   -0.157    0.875    
act          0.986796   0.025136   39.258  < 2e-16 ***
gpa          1.244655   0.199483    6.239 4.57e-10 ***
sat          1.001616   0.001002 1000.071  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.993 on 9996 degrees of freedom
Multiple R-squared:  0.9901,	Adjusted R-squared:  0.9901 
F-statistic: 3.341e+05 on 3 and 9996 DF,  p-value: < 2.2e-16


In [30]:
vif(sim_lm)

Let's now increase the correlation among the attributes. 

In [39]:
library(simglm)

sim_args <- list(formula = y ~ 1 + act + gpa + sat, 
                 fixed = list(act = list(var_type = 'continuous',
                                         mean = 20, 
                                         sd = 4),
                              gpa = list(var_type = 'continuous',
                                         mean = 2, 
                                         sd = .5),
                              sat = list(var_type = 'continuous',
                                         mean = 500, 
                                         sd = 100)),
                 correlate = list(fixed = data.frame(x = c('act', 'act', 'gpa'), 
                                                     y = c('gpa', 'sat', 'sat'), 
                                                     corr = c(0.5, 0.5, 0.25))),
                 error = list(variance = 100),
                 reg_weights = c(1, 1, 1, 1),
                 sample_size = 10000)

sim_data <- simulate_fixed(data = NULL, sim_args) %>%
   simulate_error(sim_args) %>%
   correlate_variables(sim_args) %>%
   generate_response(sim_args)

head(sim_data)

| | X.Intercept. | act_old | gpa_old | sat_old | level1_id | error | act | gpa | sat | fixed_outcome | random_effects | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 22.44951 | 1.143265 | 552.0649 | 1 | -2.7700701 | 12.85496 | 1.322129 | 438.8815 | 454.0586 | 0 | 451.2885 |
| 2 | 1 | 20.99447 | 2.094476 | 559.4903 | 2 | 14.6823989 | 20.17272 | 1.753849 | 475.1251 | 498.0516 | 0 | 512.734 |
| 3 | 1 | 18.91376 | 2.500787 | 619.5386 | 3 | -4.6245695 | 24.04456 | 1.738612 | 527.0863 | 553.8695 | 0 | 549.2449 |
| 4 | 1 | 18.86514 | 1.49062 | 433.3239 | 4 | 0.7870816 | 17.0223 | 2.099114 | 528.4424 | 548.5638 | 0 | 549.3509 |
| 5 | 1 | 21.60813 | 2.207381 | 567.7602 | 5 | 3.1802718 | 20.64968 | 1.748622 | 459.7679 | 483.1662 | 0 | 486.3465 |
| 6 | 1 | 20.52082 | 1.908382 | 535.7272 | 6 | 7.0095885 | 19.11459 | 1.789336 | 486.9923 | 508.8963 | 0 | 515.9058 |


In [32]:
cor(sim_data[c('y', 'act', 'gpa', 'sat')])

| | y | act | gpa | sat |
|---|---|---|---|---|
| y | 1.0 | 0.5321332 | 0.2651093 | 0.9946684 |
| act | 0.5321332 | 1.0 | 0.5096087 | 0.5027017 |
| gpa | 0.2651093 | 0.5096087 | 1.0 | 0.2475699 |
| sat | 0.9946684 | 0.5027017 | 0.2475699 | 1.0 |


In [40]:
sim_lm2 <- lm(y ~ 1 + act + gpa + sat, data = sim_data)
vif(sim_lm2)
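
As a rough check on what `vif()` should report, the VIFs implied by the requested correlations can be computed from the correlation matrix alone. This sketch assumes the simulated predictors attain the correlations given in `sim_args` (the sample values will differ slightly), and `implied_vif()` is a hypothetical helper, not part of car or simglm.

```r
# Hypothetical helper: VIFs implied by a set of pairwise predictor correlations.
implied_vif <- function(r_act_gpa, r_act_sat, r_gpa_sat) {
  corr_mat <- matrix(c(1,         r_act_gpa, r_act_sat,
                       r_act_gpa, 1,         r_gpa_sat,
                       r_act_sat, r_gpa_sat, 1),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("act", "gpa", "sat"),
                                     c("act", "gpa", "sat")))
  # The VIFs are the diagonal elements of the inverse correlation matrix.
  diag(solve(corr_mat))
}

implied_vif(0, 0, 0)         # uncorrelated predictors: every VIF is 1
implied_vif(0.5, 0.5, 0.25)  # the moderate correlations specified above
```

For correlations of 0.5, 0.5, and 0.25 this works out to 5/3 for act and 4/3 for gpa and sat, so the standard errors should grow only modestly relative to the uncorrelated case.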

In [41]:
broom::tidy(sim_lm)
broom::tidy(sim_lm2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -0.1293016 | 0.823839326 | -0.1569501 | 0.8752874 |
| act | 0.9867961 | 0.025135869 | 39.2584824 | 1.222306e-313 |
| gpa | 1.2446553 | 0.199482665 | 6.2394159 | 4.570085e-10 |
| sat | 1.0016156 | 0.001001545 | 1000.0709711 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 1.7395581 | 0.604706104 | 2.8767 | 0.004027075 |
| act | 1.0365325 | 0.031981282 | 32.410599 | 3.292363e-219 |
| gpa | 0.5840605 | 0.228712664 | 2.553687 | 0.01067362 |
| sat | 0.9990386 | 0.001154967 | 864.993101 | 0.0 |


In [42]:
library(simglm)

sim_args <- list(formula = y ~ 1 + act + gpa + sat, 
                 fixed = list(act = list(var_type = 'continuous',
                                         mean = 20, 
                                         sd = 4),
                              gpa = list(var_type = 'continuous',
                                         mean = 2, 
                                         sd = .5),
                              sat = list(var_type = 'continuous',
                                         mean = 500, 
                                         sd = 100)),
                 correlate = list(fixed = data.frame(x = c('act', 'act', 'gpa'), 
                                                     y = c('gpa', 'sat', 'sat'), 
                                                     corr = c(0.5, 0.9, 0.5))),
                 error = list(variance = 100),
                 reg_weights = c(1, 1, 1, 1),
                 sample_size = 10000)

sim_data <- simulate_fixed(data = NULL, sim_args) %>%
   simulate_error(sim_args) %>%
   correlate_variables(sim_args) %>%
   generate_response(sim_args)

head(sim_data)

| | X.Intercept. | act_old | gpa_old | sat_old | level1_id | error | act | gpa | sat | fixed_outcome | random_effects | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 13.39385 | 2.051822 | 455.2614 | 1 | 4.1461119 | 26.1211 | 2.611051 | 665.1471 | 694.8792 | 0 | 699.0254 |
| 2 | 1 | 22.6592 | 2.337876 | 443.6007 | 2 | -6.1012174 | 18.77416 | 2.116692 | 433.4771 | 455.368 | 0 | 449.2668 |
| 3 | 1 | 22.1416 | 2.077299 | 492.0787 | 3 | -12.4331747 | 18.33998 | 1.909499 | 446.4502 | 467.6997 | 0 | 455.2665 |
| 4 | 1 | 16.21051 | 1.906111 | 445.0111 | 4 | 0.9036132 | 23.07636 | 2.461179 | 594.7487 | 621.2862 | 0 | 622.1898 |
| 5 | 1 | 21.09031 | 1.505425 | 477.8051 | 5 | 4.4194871 | 17.29276 | 1.966768 | 472.8044 | 493.0639 | 0 | 497.4834 |
| 6 | 1 | 20.54478 | 1.972338 | 607.4112 | 6 | -7.1182748 | 19.42934 | 1.502119 | 486.3845 | 508.3159 | 0 | 501.1977 |


In [43]:
cor(sim_data[c('y', 'act', 'gpa', 'sat')])

| | y | act | gpa | sat |
|---|---|---|---|---|
| y | 1.0 | 0.9031246 | 0.4873983 | 0.9951724 |
| act | 0.9031246 | 1.0 | 0.4843913 | 0.8996471 |
| gpa | 0.4873983 | 0.4843913 | 1.0 | 0.4848667 |
| sat | 0.9951724 | 0.8996471 | 0.4848667 | 1.0 |


In [44]:
sim_lm3 <- lm(y ~ 1 + act + gpa + sat, data = sim_data)
vif(sim_lm3)

In [45]:
broom::tidy(sim_lm)
broom::tidy(sim_lm2)
broom::tidy(sim_lm3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -0.1293016 | 0.823839326 | -0.1569501 | 0.8752874 |
| act | 0.9867961 | 0.025135869 | 39.2584824 | 1.222306e-313 |
| gpa | 1.2446553 | 0.199482665 | 6.2394159 | 4.570085e-10 |
| sat | 1.0016156 | 0.001001545 | 1000.0709711 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 1.7395581 | 0.604706104 | 2.8767 | 0.004027075 |
| act | 1.0365325 | 0.031981282 | 32.410599 | 3.292363e-219 |
| gpa | 0.5840605 | 0.228712664 | 2.553687 | 0.01067362 |
| sat | 0.9990386 | 0.001154967 | 864.993101 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 0.7394096 | 0.549400326 | 1.345849 | 0.1783819 |
| act | 1.0390601 | 0.057773432 | 17.985086 | 3.356943e-71 |
| gpa | 0.7898908 | 0.228273886 | 3.460277 | 0.0005418735 |
| sat | 1.0001095 | 0.002327301 | 429.729395 | 0.0 |


In [46]:
library(simglm)

sim_args <- list(formula = y ~ 1 + act + gpa + sat, 
                 fixed = list(act = list(var_type = 'continuous',
                                         mean = 20, 
                                         sd = 4),
                              gpa = list(var_type = 'continuous',
                                         mean = 2, 
                                         sd = .5),
                              sat = list(var_type = 'continuous',
                                         mean = 500, 
                                         sd = 100)),
                 correlate = list(fixed = data.frame(x = c('act', 'act', 'gpa'), 
                                                     y = c('gpa', 'sat', 'sat'), 
                                                     corr = c(0.9, 0.98, 0.85))),
                 error = list(variance = 100),
                 reg_weights = c(1, 1, 1, 1),
                 sample_size = 10000)

sim_data <- simulate_fixed(data = NULL, sim_args) %>%
   simulate_error(sim_args) %>%
   correlate_variables(sim_args) %>%
   generate_response(sim_args)

head(sim_data)

| | X.Intercept. | act_old | gpa_old | sat_old | level1_id | error | act | gpa | sat | fixed_outcome | random_effects | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 20.71208 | 2.3151561 | 491.9592 | 1 | 1.782571 | 19.7988 | 2.052707 | 482.178 | 505.0295 | 0 | 506.8121 |
| 2 | 1 | 25.51451 | 2.306849 | 565.4787 | 2 | 2.605566 | 15.11093 | 1.397471 | 362.1172 | 379.6256 | 0 | 382.2312 |
| 3 | 1 | 20.18762 | 2.3870823 | 549.8151 | 3 | -14.669732 | 20.45214 | 2.022497 | 495.2845 | 518.7591 | 0 | 504.0894 |
| 4 | 1 | 15.61219 | 1.9207986 | 487.3058 | 4 | 16.615445 | 24.16911 | 2.462412 | 609.7004 | 637.3319 | 0 | 653.9474 |
| 5 | 1 | 20.64776 | 0.9965156 | 439.9581 | 5 | 14.243003 | 17.7464 | 1.687754 | 483.8705 | 504.3047 | 0 | 518.5477 |
| 6 | 1 | 18.27444 | 1.5515749 | 613.0509 | 6 | 12.457967 | 21.02867 | 1.804536 | 543.1666 | 566.9998 | 0 | 579.4578 |


In [47]:
cor(sim_data[c('y', 'act', 'gpa', 'sat')])

| | y | act | gpa | sat |
|---|---|---|---|---|
| y | 1.0 | 0.9758796 | 0.8446465 | 0.9952441 |
| act | 0.9758796 | 1.0 | 0.8978368 | 0.9790133 |
| gpa | 0.8446465 | 0.8978368 | 1.0 | 0.8454233 |
| sat | 0.9952441 | 0.9790133 | 0.8454233 | 1.0 |


In [48]:
sim_lm4 <- lm(y ~ 1 + act + gpa + sat, data = sim_data)
vif(sim_lm4)

In [49]:
broom::tidy(sim_lm)
broom::tidy(sim_lm2)
broom::tidy(sim_lm3)
broom::tidy(sim_lm4)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -0.1293016 | 0.823839326 | -0.1569501 | 0.8752874 |
| act | 0.9867961 | 0.025135869 | 39.2584824 | 1.222306e-313 |
| gpa | 1.2446553 | 0.199482665 | 6.2394159 | 4.570085e-10 |
| sat | 1.0016156 | 0.001001545 | 1000.0709711 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 1.7395581 | 0.604706104 | 2.8767 | 0.004027075 |
| act | 1.0365325 | 0.031981282 | 32.410599 | 3.292363e-219 |
| gpa | 0.5840605 | 0.228712664 | 2.553687 | 0.01067362 |
| sat | 0.9990386 | 0.001154967 | 864.993101 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 0.7394096 | 0.549400326 | 1.345849 | 0.1783819 |
| act | 1.0390601 | 0.057773432 | 17.985086 | 3.356943e-71 |
| gpa | 0.7898908 | 0.228273886 | 3.460277 | 0.0005418735 |
| sat | 1.0001095 | 0.002327301 | 429.729395 | 0.0 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 1.5671212 | 0.52860667 | 2.964626 | 0.003037663 |
| act | 0.7816337 | 0.1632586 | 4.787703 | 1.711219e-06 |
| gpa | 0.8451437 | 0.49857105 | 1.695132 | 0.09008155 |
| sat | 1.0084804 | 0.00538413 | 187.306099 | 0.0 |
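
The pattern across the four sets of estimates, coefficients that stay near their true values while the standard errors grow, can also be reproduced with a small base R sketch that does not depend on simglm. Everything here is an assumption made for illustration: the helper `fit_with_correlation()` is hypothetical, only two standardized predictors are used, and the seed is arbitrary.

```r
# Sketch: how the standard error of a slope changes with predictor correlation.
set.seed(2024)  # arbitrary seed, chosen only for reproducibility

# Hypothetical helper: simulate two predictors with correlation rho, fit lm,
# and return the estimate and standard error for x1.
fit_with_correlation <- function(rho, n = 10000) {
  x1 <- rnorm(n)
  # x2 has unit variance and population correlation rho with x1.
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y <- 1 + x1 + x2 + rnorm(n, sd = 10)  # true slopes of 1, error variance 100
  coef(summary(lm(y ~ x1 + x2)))["x1", c("Estimate", "Std. Error")]
}

rhos <- c(0, 0.5, 0.9, 0.98)
results <- t(sapply(rhos, fit_with_correlation))
cbind(rho = rhos, results)
```

Constructing `x2` as a weighted sum of `x1` and independent noise keeps the example in base R; with two predictors the VIF is $1 / (1 - \rho^2)$, so the standard error column should rise with `rho` roughly in proportion to $\sqrt{VIF}$.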
