# Interpreting your coefficients with multicollinearity

**Multicollinearity** occurs when one predictor variable in your regression model can be accurately predicted from the others.


### Use `statsmodels` (`import statsmodels.api as sm`, `from statsmodels.tools.tools import add_constant`) to get a full summary of the fitted model
+ Check the `R-squared` value (for `Logit`, the summary reports `Pseudo R-squ.`).
+ Check the `coef` column for each independent variable. In a logit model, each coefficient is the expected change in the log-odds of class 1 for a one-unit increase in that feature, holding the other features fixed.
+ Check the `P>|z|` column: the p-values indicate whether each coefficient is statistically distinguishable from zero.
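
As a rough illustration of how a logit coefficient can be read, the sketch below converts a coefficient into an odds ratio. The value `0.5` is a made-up coefficient chosen for illustration, not one taken from the model fitted in this notebook.

```python
import math

# Hypothetical coefficient, for illustration only (not from a fitted model).
coef = 0.5

# exp(coef) is the multiplicative change in the odds of class 1
# for a one-unit increase in the feature, other features held fixed.
odds_ratio = math.exp(coef)
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio: {odds_ratio:.3f}")
print(f"change in odds per unit increase: {percent_change:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (the odds of class 1 go up), and a negative coefficient gives one below 1.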

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

### Data

In [7]:
iris = sns.load_dataset('iris')
iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [8]:
iris['species'].value_counts()

virginica     50
setosa        50
versicolor    50
Name: species, dtype: int64

In [9]:
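# Turn it into a binary classification problem (dropping setosa).
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
iris = iris[iris['species'] != 'setosa'].copy()

# Convert the label to 0 and 1 (versicolor -> 1, virginica -> 0)
iris['species'] = iris['species'].apply(lambda x: 1 if x=='versicolor' else 0)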

In [10]:
iris['species'].value_counts()

1    50
0    50
Name: species, dtype: int64

### Regression modelling

In [11]:
X = iris.drop('species', axis=1) 
y = iris['species']

## Use logit

https://pypi.org/project/statsmodels/

In [15]:
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

In [17]:
# add a constant (intercept) column; sm.Logit does not add one automatically
X = add_constant(X)

In [18]:
model = sm.Logit(y, X)

In [19]:
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.059493
         Iterations 12


In [20]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Wed, 10 Feb 2021   Pseudo R-squ.:                  0.9142
Time:                        13:20:12   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -69.315
Covariance Type:            nonrobust   LLR p-value:                 1.947e-26
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           42.6378     25.708      1.659      0.097      -7.748      93.024
sepal_length     2.4652      2.394      1.030      0.303      -2.228       7.158
sepal_width      6.6809      4.480      1.49

**Before interpreting those values, we need to check whether multicollinearity is present in our data.**

Multicollinearity generally doesn't hurt the quality of our model's predictions; it affects only our ability to interpret the individual coefficients and their p-values. This brings us to the `Variance Inflation Factor (VIF)`.
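
The claim above can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses ordinary least squares on synthetic data (rather than the logit model on iris) because the effect is easier to see there. The `simulate` helper and its parameters are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(rho, n=200, reps=300):
    """Refit OLS on fresh samples with predictor correlation rho.

    Returns the spread (std) of the fitted x1 coefficient across refits
    and the average RMSE on held-out data.
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs, rmses = [], []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=2 * n)
        y = X @ np.array([1.0, 1.0]) + rng.normal(size=2 * n)
        A = np.column_stack([np.ones(2 * n), X])  # intercept + two predictors
        beta, *_ = np.linalg.lstsq(A[:n], y[:n], rcond=None)  # fit on first half
        coefs.append(beta[1])
        rmses.append(np.sqrt(np.mean((A[n:] @ beta - y[n:]) ** 2)))  # test on second half
    return np.std(coefs), np.mean(rmses)

sd_low, rmse_low = simulate(rho=0.0)     # uncorrelated predictors
sd_high, rmse_high = simulate(rho=0.99)  # strongly collinear predictors

print(f"coefficient spread: rho=0.0 -> {sd_low:.3f}, rho=0.99 -> {sd_high:.3f}")
print(f"held-out RMSE:      rho=0.0 -> {rmse_low:.3f}, rho=0.99 -> {rmse_high:.3f}")
```

In this setup the fitted coefficient varies far more from sample to sample when the predictors are collinear, while the held-out error stays about the same.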

------

# Variance Inflation Factor (VIF)

+ The **Variance Inflation Factor (VIF)** tells us the extent to which each predictor is affected by multicollinearity.
+ A VIF of `5 to 10 or more is considered high` and tells us to be wary of that variable's model coefficient.
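
For intuition, the VIF of predictor $j$ is usually defined as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes that by hand on synthetic data. `vif_by_hand` is a hypothetical helper written for this illustration; it fits its own intercept, so it expects feature columns only (no constant column).

```python
import numpy as np

def vif_by_hand(X, j):
    """VIF of column j: 1 / (1 - R^2), regressing column j on the others plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

# Synthetic data (an assumption for illustration, not the iris data).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # nearly a linear combination of x1 and x2
x4 = rng.normal(size=n)                       # unrelated to the others
features = np.column_stack([x1, x2, x3, x4])

vifs = [vif_by_hand(features, j) for j in range(features.shape[1])]
for name, value in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: {value:.2f}")
```

The columns involved in the near-linear relationship (`x1`, `x2`, `x3`) get large VIFs, while the unrelated column `x4` stays close to 1.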

In [24]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [25]:
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [27]:
vif # one value per column of X, in column order

[125.17027669562,
 3.9901127895124597,
 1.7219537585608635,
 7.252446651447896,
 3.948354236712144]

In [28]:
# pair each column name with its VIF
pd.DataFrame({'feature': X.columns, 'VIF': vif})


,feature,VIF
0,const,125.170277
1,sepal_length,3.990113
2,sepal_width,1.721954
3,petal_length,7.252447
4,petal_width,3.948354


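In this case `petal_length` has a VIF above 7, which falls in the high range, so we should be cautious about interpreting the coefficient and p-value for that variable. The large value for `const` is usually ignored, since the intercept column is not a predictor of interest.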

To address multicollinearity, you can remove the variables with a high VIF.

Note that this will affect the explanatory power of your overall model.
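
One way to apply this is to drop the worst offender, recompute the VIFs, and repeat until every remaining value is below a chosen threshold. The sketch below shows that loop on synthetic data. The helpers `vif` and `drop_high_vif`, the threshold of 5, and the data are all assumptions made for illustration; they are not part of the notebook above.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), regressing column j on the others plus an intercept."""
    target = X[:, j]
    A = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    return 1.0 / (resid.var() / target.var())

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all VIFs are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)  # remove the most collinear column
        names.pop(worst)
    return X, names

# Synthetic feature matrix (no constant column): x3 is nearly x1 + x2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
x4 = rng.normal(size=n)
features = np.column_stack([x1, x2, x3, x4])

X_reduced, kept = drop_high_vif(features, ["x1", "x2", "x3", "x4"])
print("kept columns:", kept)
```

Dropping one column at a time matters because removing a single variable can bring the VIFs of the others back down, so removing every high-VIF variable at once may discard more than necessary.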