## Variable selection 2: Variance Inflation Factor

This notebook goes with the blog post: [Variable selection in Python, part I](https://mycarta.wordpress.com/2019/04/30/variable-selection-in-python-part-i/).

### Based on:
https://etav.github.io/python/vif_factor_python.html

Also a useful read, which mentions 5 or 10 as the VIF cutoff for multicollinearity:

https://pvanb.wordpress.com/2016/04/25/vif-stepwise-variable-selection/
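
For reference, the usual textbook definition of the VIF for predictor $j$ is given below, where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors. The symbols are introduced here for illustration and do not appear in the linked posts.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, a cutoff of 5 corresponds to $R_j^2 = 0.8$ and a cutoff of 10 to $R_j^2 = 0.9$.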


### Libraries

In [1]:
import pandas as pd
import numpy as np
from patsy import dmatrices, dmatrix
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Load data and rename columns

In [2]:
data = pd.read_csv('../data/Table2_Hunt_2013_edit.csv')
data.rename(columns={'Gross pay': 'Gross_pay', 'Phi-h': 'Phi_h', 'Random 1': 'Random1', 'Random 2': 'Random2',
                     'Gross pay transform': 'Gross_pay_transform'}, inplace=True)
data.head()

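| | Gross_pay | Phi_h | Position | Pressure | Random1 | Random2 | Gross_pay_transform | Production |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 0.5 | 2.1 | 19 | 5 | 379 | 3.54 | 15.1 |
| 1 | 1.0 | 4.0 | 1.1 | 16 | 13 | 269 | 5.79 | 21.3 |
| 2 | 1.9 | 19.0 | 1.0 | 14 | 12 | 245 | 8.51 | 22.75 |
| 3 | 3.1 | 21.7 | 2.1 | 17 | 6 | 273 | 11.52 | 15.72 |
| 4 | 4.1 | 24.6 | 2.9 | 11 | 10 | 237 | 10.16 | 7.71 |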


### Regression design matrices

In [3]:
outcome, predictors = dmatrices("Production ~ Gross_pay +Phi_h +Position +Pressure +Random1 +Random2 +Gross_pay_transform", 
                                data, return_type='dataframe')

### Variance Inflation Factor 

In [4]:
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])]
vif["features"] = predictors.columns

In [5]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 152.0 | Intercept |
| 1 | 29.4 | Gross_pay |
| 2 | 4.6 | Phi_h |
| 3 | 1.5 | Position |
| 4 | 1.2 | Pressure |
| 5 | 1.1 | Random1 |
| 6 | 1.7 | Random2 |
| 7 | 24.2 | Gross_pay_transform |
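
To make the numbers in the table less of a black box, here is a minimal sketch of the same calculation done by hand with NumPy: each column is regressed on an intercept plus the remaining columns, and the VIF is 1 / (1 - R²). The function name `vif_by_hand` and the synthetic columns `a`, `b`, `c` are made up for illustration; they are not the notebook's data.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        # Centered R^2 of that auxiliary regression.
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example (hypothetical data): c is almost a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)               # unrelated to a
c = a + 0.1 * rng.normal(size=200)     # nearly collinear with a
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(1))
```

In this toy case the two nearly collinear columns should both show a large VIF while the unrelated column stays close to 1, which is the same pattern as Gross_pay and Gross_pay_transform in the table above.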


__Gross pay and Gross pay transform both have a high variance inflation factor (29.4 and 24.2, well above the usual cutoffs of 5 or 10) because they "explain" the same variance within this dataset.__

__The high VIF for the intercept is to be expected; see these answers:__

- https://stackoverflow.com/a/48819434/1034648
- https://stats.stackexchange.com/a/386157/54871
- https://stats.stackexchange.com/a/7950/54871
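
One way to build intuition for the intercept's value is the sketch below. It assumes the intercept's VIF comes from regressing the constant column on the other columns with an uncentered R² (no constant on the right-hand side); that is my reading of how the calculation works, not something stated in this notebook. Under that assumption, with a single predictor, the value depends only on how far the predictor's mean sits from zero relative to its spread. The function `intercept_vif` and the data are hypothetical.

```python
import numpy as np

def intercept_vif(x):
    """VIF of the constant column when the only other column is x,
    computed from the uncentered R^2 (assumption stated in the text)."""
    x = np.asarray(x, dtype=float)
    ones = np.ones_like(x)
    # Least-squares fit of ones ~ beta * x, with no intercept term.
    beta = (x @ ones) / (x @ x)
    resid = ones - beta * x
    r2_uncentered = 1.0 - (resid @ resid) / (ones @ ones)
    return 1.0 / (1.0 - r2_uncentered)

# Hypothetical predictor, first centered on zero, then shifted far from zero.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
x_centered = x - x.mean()
near_zero = intercept_vif(x_centered)
far_from_zero = intercept_vif(x_centered + 50.0)
print(round(near_zero, 3), round(far_from_zero, 1))
```

Shifting the predictor away from zero inflates the intercept's value without changing how the predictors relate to each other, which is why the intercept row is usually ignored when screening for multicollinearity.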

### Remove Gross pay and repeat

In [6]:
data_d1 = data.drop(columns=['Gross_pay'])
data_d1.head()

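| | Phi_h | Position | Pressure | Random1 | Random2 | Gross_pay_transform | Production |
|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 2.1 | 19 | 5 | 379 | 3.54 | 15.1 |
| 1 | 4.0 | 1.1 | 16 | 13 | 269 | 5.79 | 21.3 |
| 2 | 19.0 | 1.0 | 14 | 12 | 245 | 8.51 | 22.75 |
| 3 | 21.7 | 2.1 | 17 | 6 | 273 | 11.52 | 15.72 |
| 4 | 24.6 | 2.9 | 11 | 10 | 237 | 10.16 | 7.71 |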


In [7]:
outcome, predictors = dmatrices("Production ~ Gross_pay_transform +Phi_h +Position +Pressure +Random1 +Random2", 
                                data_d1, return_type='dataframe')

In [8]:
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])]
vif["features"] = predictors.columns

In [9]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 101.7 | Intercept |
| 1 | 3.4 | Gross_pay_transform |
| 2 | 3.8 | Phi_h |
| 3 | 1.5 | Position |
| 4 | 1.0 | Pressure |
| 5 | 1.1 | Random1 |
| 6 | 1.6 | Random2 |


### This time remove Gross pay transform and repeat

In [10]:
data_d2 = data.drop(columns=['Gross_pay_transform'])
data_d2.head()

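| | Gross_pay | Phi_h | Position | Pressure | Random1 | Random2 | Production |
|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 0.5 | 2.1 | 19 | 5 | 379 | 15.1 |
| 1 | 1.0 | 4.0 | 1.1 | 16 | 13 | 269 | 21.3 |
| 2 | 1.9 | 19.0 | 1.0 | 14 | 12 | 245 | 22.75 |
| 3 | 3.1 | 21.7 | 2.1 | 17 | 6 | 273 | 15.72 |
| 4 | 4.1 | 24.6 | 2.9 | 11 | 10 | 237 | 7.71 |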


In [11]:
outcome, predictors = dmatrices("Production ~ Gross_pay +Phi_h +Position +Pressure +Random1 +Random2", 
                                data_d2, return_type='dataframe')

In [12]:
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])]
vif["features"] = predictors.columns

In [13]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 99.9 | Intercept |
| 1 | 4.1 | Gross_pay |
| 2 | 4.5 | Phi_h |
| 3 | 1.5 | Position |
| 4 | 1.1 | Pressure |
| 5 | 1.1 | Random1 |
| 6 | 1.6 | Random2 |
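
The remove-and-repeat steps above were done by hand, once for each candidate column. A stepwise version can be sketched as a loop that drops the predictor with the highest VIF until every remaining VIF is at or below a cutoff. This is only an illustration under stated assumptions: the helper names `vif_columns` and `drop_high_vif`, the cutoff of 5, and the synthetic data are all hypothetical, and the loop always removes the single worst column rather than comparing alternatives as this notebook does.

```python
import numpy as np

def vif_columns(X):
    """VIF of each predictor column (intercept handled internally)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF above the threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_columns(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break  # every remaining predictor is at or below the cutoff
        names.pop(worst)
        X = np.delete(X, worst, axis=1)
    return names

# Synthetic example (hypothetical data): c is almost a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
kept = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept)
```

Which of two nearly collinear columns gets dropped depends on small differences in their VIFs, so it is still worth checking the result against domain knowledge, as done above by trying both removals.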
