In [2]:
import numpy as np
import pandas as pd


__Collinearity__ is the state in which two variables are highly correlated and carry similar information about the variance in a dataset. To detect collinearity between pairs of variables, compute a __correlation matrix__ (for example with numpy's `corrcoef` function) and look for off-diagonal entries with large absolute values.
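
As a minimal sketch of this check, the snippet below builds the correlation matrix with `np.corrcoef` and lists the variable pairs whose absolute correlation reaches a cutoff. The helper name `highly_correlated_pairs` and the 0.9 cutoff are illustrative assumptions, not part of any library; the data are the four small columns used in the cells of this notebook.

```python
import numpy as np


def highly_correlated_pairs(data, names, threshold=0.9):
    """Return (name_i, name_j, r) for column pairs with |r| >= threshold.

    Hypothetical helper for illustration; the 0.9 default is an assumed cutoff.
    """
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, skip the diagonal
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


data = np.column_stack([
    [1, 1, 2, 3, 4],  # a
    [2, 2, 3, 2, 1],  # b
    [4, 6, 7, 8, 9],  # c
    [4, 3, 4, 5, 4],  # d
])
pairs = highly_correlated_pairs(data, ["a", "b", "c", "d"])
print(pairs)
```

With this data only the pair `a` and `c` passes the cutoff. A lower threshold would flag more pairs, so the cutoff is a judgment call rather than a fixed rule.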

__Multicollinearity__, on the other hand, is harder to detect because it emerges when three or more highly correlated variables are included in a model. _To make matters worse, multicollinearity can emerge even when no isolated pair of variables is collinear._
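
The following sketch illustrates that last point with simulated data. Everything in it is an assumption made for illustration: four independent random predictors plus a fifth that is almost exactly their sum, a fixed seed, and a noise scale of 0.1. No single pair is strongly correlated, yet the VIFs (taken here from the diagonal of the inverse correlation matrix) come out large.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500

# Four independent predictors plus a fifth that is almost exactly their sum.
base = rng.normal(size=(n, 4))
total = base.sum(axis=1) + 0.1 * rng.normal(size=n)
X = np.column_stack([base, total])

corr = np.corrcoef(X, rowvar=False)

# Largest pairwise correlation, ignoring the diagonal of ones.
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
max_pairwise = np.abs(off_diag).max()

# Diagonal of the inverse correlation matrix gives one VIF per column.
vifs = np.diag(np.linalg.inv(corr))

print("largest |pairwise correlation|:", round(float(max_pairwise), 2))
print("VIFs:", np.round(vifs, 1))
```

A scan of the correlation matrix alone would miss this dependence, which is why a measure that looks at each variable against all the others together is needed.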

__VIF__ is a way to measure __multicollinearity__.

In statistics, the __variance inflation factor (VIF)__ is the ratio of the variance of a coefficient estimate in a model with multiple terms to the variance of that estimate in a model containing that term alone.

The VIF estimates how much the __variance of a regression coefficient__ is inflated due to multicollinearity in the model.


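$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors.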
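
The formula can be checked numerically. The sketch below takes $R^2$ to be the coefficient of determination from regressing one predictor on all the others (with an intercept) and applies $1 / (1 - R^2)$ to each column in turn. The function name `vif_from_r2` is a hypothetical helper, and the data are the four columns used in the cells of this notebook.

```python
import numpy as np


def vif_from_r2(X):
    """VIF of each column of X via an auxiliary regression on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


X = np.column_stack([
    [1, 1, 2, 3, 4],  # a
    [2, 2, 3, 2, 1],  # b
    [4, 6, 7, 8, 9],  # c
    [4, 3, 4, 5, 4],  # d
])
vifs = vif_from_r2(X)
print(np.round(vifs, 2))
```

The values should agree with the diagonal of the inverse correlation matrix and with the statsmodels result for `a`, `b`, `c` and `d` computed in this notebook.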

__Interpretation__
The square root of the variance inflation factor indicates how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.

As a common rule of thumb, VIF values of 10 or greater indicate problematic __multicollinearity__.

__Example__
If the variance inflation factor of a predictor variable were 5.27 (√5.27 ≈ 2.3), the standard error for the coefficient of that predictor variable would be about 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
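
The arithmetic in this example is a single square root. As a small sketch, with `se_inflation` as a hypothetical helper name:

```python
import math


def se_inflation(vif):
    """Factor by which a coefficient's standard error grows for a given VIF."""
    return math.sqrt(vif)


print(round(se_inflation(5.27), 2))  # about 2.3, as in the example
print(round(se_inflation(10.0), 2))  # inflation at the rule-of-thumb cutoff of 10
```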

In [3]:
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from statsmodels.tools.tools import add_constant

import scipy as sp

  


In [4]:
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

In [5]:
ck = np.column_stack([a, b, c, d])

In [6]:
ck

array([[1, 2, 4, 4],
       [1, 2, 6, 3],
       [2, 3, 7, 4],
       [3, 2, 8, 5],
       [4, 1, 9, 4]])

In [7]:
corr = np.corrcoef(ck, rowvar=False)  # rowvar=False: each column is a variable

In [8]:
corr

array([[ 1.        , -0.54232614,  0.91707006,  0.54232614],
       [-0.54232614,  1.        , -0.36760731,  0.        ],
       [ 0.91707006, -0.36760731,  1.        ,  0.36760731],
       [ 0.54232614,  0.        ,  0.36760731,  1.        ]])

In [9]:
# Invert the correlation matrix; its diagonal holds the VIF of each variable
VIF = np.linalg.inv(corr)
VIF

array([[ 22.95      ,   6.45368112, -16.30191707,  -6.45368112],
       [  6.45368112,   3.        ,  -4.08044115,  -2.        ],
       [-16.30191707,  -4.08044115,  12.95      ,   4.08044115],
       [ -6.45368112,  -2.        ,   4.08044115,   3.        ]])

In [10]:
VIF.diagonal()

array([22.95,  3.  , 12.95,  3.  ])

# Using statsmodels

In [11]:
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

In [12]:
X = add_constant(df)

In [13]:
X

   const  a  b  c  d
0    1.0  1  2  4  4
1    1.0  1  2  6  3
2    1.0  2  3  7  4
3    1.0  3  2  8  5
4    1.0  4  1  9  4


In [14]:
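# One VIF per column of X; the constant is included so the predictors' VIFs
# use the centered R^2, and the const entry itself is usually ignored.
pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)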

const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
dtype: float64

#### Why multicollinearity is an issue for regression analysis

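$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$

The ordinary least squares estimate requires inverting $X^\top X$. When predictors are nearly linearly dependent, this matrix is close to singular, so the inverse is numerically unstable and the coefficient estimates have inflated variance.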
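
One way to see the effect is to compare $X^\top X$ for a design with independent predictors against one with nearly collinear predictors. The sketch below uses simulated data; the seed, sample size, and the 0.01 noise scale are all assumptions chosen for illustration. It relies on the standard result that, under the usual OLS assumptions, the coefficient covariance is $\sigma^2 (X^\top X)^{-1}$, so the diagonal of the inverse is proportional to each coefficient's variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 200

x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)             # unrelated to x1
x2_coll = x1 + 0.01 * rng.normal(size=n)  # almost a copy of x1


def design(first, second):
    """Design matrix with an intercept column and two predictors."""
    return np.column_stack([np.ones(len(first)), first, second])


X_indep = design(x1, x2_indep)
X_coll = design(x1, x2_coll)

# Condition number of X^T X: large values mean the inverse is unstable.
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_coll = np.linalg.cond(X_coll.T @ X_coll)

# Diagonal of (X^T X)^{-1}: proportional to the variance of each coefficient.
var_indep = np.diag(np.linalg.inv(X_indep.T @ X_indep))
var_coll = np.diag(np.linalg.inv(X_coll.T @ X_coll))

print("condition numbers:", cond_indep, cond_coll)
print("variance factor for x1:", var_indep[1], var_coll[1])
```

In the collinear design both the condition number and the variance factor for `x1` are orders of magnitude larger, which is the inflation that the VIF summarizes.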