* Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.

#### Why is multicollinearity a problem?
* Multicollinearity is a problem in a regression model because it keeps us from distinguishing the individual effects of the independent variables on the dependent variable. For example, consider the following linear equation:
    * Y = W0+W1*X1+W2*X2

* Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant. But when X1 and X2 are highly correlated, a change in X1 is almost always accompanied by a change in X2, so the data cannot show their individual effects on Y.
* In regression we usually want to understand the impact of each variable on the target individually. Multicollinearity prevents that, because the coefficient estimates become unstable and their standard errors grow.
* Multicollinearity may not affect the predictive accuracy of the model as much; it mainly hurts the reliability and interpretation of the individual coefficients.
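
The effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the true weights (1, 2, 3), the sample size, the correlation levels, and the helper names `fit_once` and `summarize` are all assumptions made for illustration and are not part of the notebook's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def fit_once(rho, n=200):
    """Simulate Y = 1 + 2*X1 + 3*X2 + noise with corr(X1, X2) close to rho,
    fit ordinary least squares, and return (W1 estimate, R^2)."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    A = np.column_stack([np.ones(n), x1, x2])  # intercept, X1, X2
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ w
    r2 = 1.0 - resid.var() / y.var()
    return w[1], r2

def summarize(rho, repeats=300):
    """Spread of the W1 estimate and average R^2 over repeated samples."""
    runs = np.array([fit_once(rho) for _ in range(repeats)])
    return runs[:, 0].std(), runs[:, 1].mean()

results = {
    "uncorrelated (rho=0.0)": summarize(0.0),
    "correlated (rho=0.99)": summarize(0.99),
}
for label, (w1_spread, mean_r2) in results.items():
    print(f"{label}: std of W1 estimate = {w1_spread:.3f}, mean R^2 = {mean_r2:.3f}")
```

With the highly correlated inputs, the W1 estimate should vary several times more from sample to sample, while R^2 stays high in both settings. That is the sense in which multicollinearity hurts interpretation more than accuracy.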

In [2]:
# Import required libraries
import pandas as pd

In [3]:
data_url = "https://raw.githubusercontent.com/krishnaik06/Multicollinearity/master/data/Salary_Data.csv"
df = pd.read_csv(data_url)
df.head()

Unnamed: 0,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


### Multicollinearity can be detected via various methods, such as:
* **Correlation matrix**:
    * When the number of features is small (say 20-30), we can simply compute the correlation matrix, look for pairs of features whose correlation is above a threshold such as 0.9, and remove one feature from each such pair.
    * To decide which of the two correlated features to delete, remove the one that contributes less to (is less correlated with) the target variable. That way we preserve the feature with the higher contribution.

* **VIF (Variance Inflation Factor)**:
    * It is the most common method for detecting multicollinearity.
    * VIF measures how strongly an independent variable is correlated with the other independent variables. It is computed by taking one variable and regressing it against all the others.
    * **For example, given features X1, X2, X3, X4, X5, we first fit a regression model treating X1 as the target and the rest as independent features. We then calculate R^2 for this model and plug it into the VIF formula: VIF = 1/(1-R^2)**
    * We repeat this for every feature; for example, the second time X2 is the target and the rest are the independent features.
    * The VIF score of an independent variable represents how well that variable is explained by the other independent variables.
    * The R^2 value tells us how well an independent variable is described by the other independent variables. A high R^2 means the variable is highly correlated with the others.

    * **The closer R^2 is to 1, the higher the VIF and the stronger the multicollinearity involving that independent variable.** When R^2 is low, the denominator 1-R^2 is close to 1 and the VIF is close to 1.
    * *VIF is calculated for all the features*.
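
The procedure above can be sketched directly with NumPy. This is a minimal illustration, not the statsmodels implementation: the function name `vif_from_scratch` and the synthetic columns `x1`, `x2`, `x3` are assumptions made for the example.

```python
import numpy as np

def vif_from_scratch(X):
    """Return one VIF per column of the 2-D array X.

    Each column is regressed on all the other columns (plus an intercept),
    and VIF = 1 / (1 - R^2) is computed from that auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]                           # feature treated as target
        others = np.delete(X, j, axis=1)           # remaining features
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (assumed for illustration): x2 is nearly a copy of x1,
# while x3 is unrelated to both.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif_from_scratch(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(2))))
```

The two nearly duplicate columns should come out with very large VIFs and the unrelated column with a VIF close to 1. Note that this sketch adds an intercept to every auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on the columns exactly as they are passed in, so the two can give different numbers unless a constant column is included in the input.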

In [4]:
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(X):
    # Create a dataframe
    vif = pd.DataFrame()
    # Add features names to column called `variables`
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

# Get independent features
X = df.iloc[:, :-1]
calculate_vif(X)

Unnamed: 0,variables,VIF
0,YearsExperience,11.24047
1,Age,11.24047


* VIF starts at 1 and has no upper limit.
* VIF = 1 means there is no correlation between the independent variable and the other variables.
* A VIF exceeding 5 or 10 (both are commonly used rules of thumb) indicates high multicollinearity between this independent variable and the others.
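
To get a feel for these thresholds, it can help to invert the VIF formula and read each cutoff as an R^2 value:

$$
\mathrm{VIF} = \frac{1}{1 - R^2} \quad\Longleftrightarrow\quad R^2 = 1 - \frac{1}{\mathrm{VIF}}
$$

$$
\mathrm{VIF} = 1 \Rightarrow R^2 = 0, \qquad \mathrm{VIF} = 5 \Rightarrow R^2 = 0.8, \qquad \mathrm{VIF} = 10 \Rightarrow R^2 = 0.9
$$

So a cutoff of 5 flags a feature once the other features explain 80% of its variance, and a cutoff of 10 once they explain 90%. By the same formula, the VIF of 11.24047 reported above corresponds to an R^2 of roughly 0.911.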

We can see here that ‘Age’ and ‘YearsExperience’ both have a high VIF value (about 11.24), meaning each of them can be predicted well from the other independent variable in the dataset.

Dropping one of the correlated features will therefore help bring down the multicollinearity.
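
Which feature to drop can be decided with the correlation-matrix rule described earlier: among a highly correlated pair, drop the one that is less correlated with the target. The sketch below applies that rule to synthetic data; the helper `suggest_drops`, the 0.9 threshold, and the generated columns are assumptions for illustration, not values from the salary dataset.

```python
import numpy as np

def suggest_drops(X, y, names, threshold=0.9):
    """For every feature pair with |correlation| above the threshold, return
    (feature_a, feature_b, feature_to_drop), where the feature to drop is
    the one less correlated with the target y."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
    target_corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    drops = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                weaker = i if target_corr[i] < target_corr[j] else j
                drops.append((names[i], names[j], names[weaker]))
    return drops

# Synthetic data (assumed): age tracks experience closely, and the target
# depends directly on experience only.
rng = np.random.default_rng(1)
experience = rng.uniform(1, 10, size=200)
age = 21 + experience + rng.normal(0, 0.8, size=200)
other = rng.normal(size=200)  # unrelated feature
salary = 25000 + 9000 * experience + rng.normal(0, 5000, size=200)

X = np.column_stack([experience, age, other])
drops = suggest_drops(X, salary, ["experience", "age", "other"])
print(drops)
```

When the features live in a pandas DataFrame, `df.corr()` produces the same Pearson correlation matrix that `np.corrcoef` computes here.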

Let's repeat the same steps on a different dataset.

In [20]:
url = "https://raw.githubusercontent.com/krishnaik06/Multicollinearity/master/data/Advertising.csv"
Ad_df = pd.read_csv(url)
Ad_df

Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
...,...,...,...,...,...
195,196,38.2,3.7,13.8,7.6
196,197,94.2,4.9,8.1,9.7
197,198,177.0,9.3,6.4,12.8
198,199,283.6,42.0,66.2,25.5


In [23]:
X = Ad_df[['TV', 'radio','newspaper']]
calculate_vif(X)

Unnamed: 0,variables,VIF
0,TV,2.486772
1,radio,3.285462
2,newspaper,3.055245


Here all the VIF values are low (below 5), so there is no sign of significant multicollinearity in this dataset.