Identifying and Addressing Multicollinearity in Regression Models

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

In [6]:
# Load the salary dataset and preview the first five rows
salary_data = pd.read_csv('/content/Salary_Data.csv')
salary_data.head(5)

Unnamed: 0,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [15]:
# Independent variables, each kept as a one-column DataFrame
X_1 = salary_data[['YearsExperience']]
X_2 = salary_data[['Age']]
# Dependent variable
y = salary_data['Salary']

In [16]:
# Fit a linear regression model on X_1 (YearsExperience)
model = LinearRegression()

model.fit(X_1, y)

r_sq = model.score(X_1, y)

print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

coefficient of determination: 0.9569566641435086
intercept: 25792.200198668696
slope: [9449.96232146]


In [21]:
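# Fit a linear regression model on X_2 (Age)
model = LinearRegression()
model.fit(X_2, y)
# Recompute R^2 for this fit; otherwise the value from the X_1 model is printed again
r_sq = model.score(X_2, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")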

coefficient of determination: 0.9497077571981798
intercept: -64878.12422952533
slope: [5176.28135565]


Multicollinearity is a situation where two or more independent variables in a regression model are highly correlated with each other. It makes the individual coefficient estimates unstable and hard to interpret, because the model cannot cleanly separate the effect of one correlated predictor from that of another.
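To make the instability concrete, here is a minimal sketch on synthetic data (not the salary dataset). It assumes a made-up setup in which `x2` is almost a copy of `x1` and only `x1` drives `y`; the helper name `fit_ols` and all numbers are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def fit_ols(n=200):
    """Fit y ~ 1 + x1 + x2 on fresh synthetic data where x2 nearly duplicates x1."""
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)      # almost a copy of x1
    y = 3.0 * x1 + 0.5 * rng.normal(size=n)  # only x1 truly drives y
    design = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, b1, b2]

# Refit on five independent samples and compare the coefficients.
fits = np.array([fit_ols() for _ in range(5)])
for intercept, b1, b2 in fits:
    print(f"b1={b1:8.3f}  b2={b2:8.3f}  b1+b2={b1 + b2:6.3f}")
```

In a setup like this, the individual slopes `b1` and `b2` should swing from sample to sample while their sum stays close to 3, which is the typical signature of collinear predictors.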

In [23]:
# Correlation of each predictor with the target (Salary)
print(salary_data[['YearsExperience', 'Salary']].corr())
print(salary_data[['Age', 'Salary']].corr())

                 YearsExperience    Salary
YearsExperience         1.000000  0.978242
Salary                  0.978242  1.000000
            Age   Salary
Age     1.00000  0.97453
Salary  0.97453  1.00000
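The tables above relate each predictor to Salary. Multicollinearity is about the predictors' correlation with each other, so a direct check would compare YearsExperience with Age. The sketch below does this on only the five rows shown in the `head()` output, so the number it prints describes that small sample, not the full dataset.

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output above.
sample = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Age": [21.0, 21.5, 21.7, 22.0, 22.2],
})

# Pearson correlation between the two predictors themselves.
predictor_corr = sample["YearsExperience"].corr(sample["Age"])
print(f"corr(YearsExperience, Age) on the sample rows: {predictor_corr:.3f}")
```

On the full data the same check would be `salary_data[['YearsExperience', 'Age']].corr()`.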


Tackling the problem of multicollinearity using the Variance Inflation Factor (VIF)
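For reference, the VIF of predictor j is usually defined as follows, where R_j^2 is the coefficient of determination obtained by regressing predictor j on all the other predictors. The symbols are notation introduced here, not taken from the notebook.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A predictor that is uncorrelated with the others has R_j^2 = 0 and therefore VIF = 1; as R_j^2 approaches 1, the VIF grows without bound.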

In [26]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column, then compute the VIF of every column in the design matrix
X_1 = sm.add_constant(X_1)
vif = pd.Series([variance_inflation_factor(X_1.values, i) for i in range(X_1.shape[1])], index=X_1.columns)
print(vif)

const              4.626326
YearsExperience    1.000000
dtype: float64


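Hence the VIF for YearsExperience is 1.00. This is expected when YearsExperience is the only predictor in the design matrix, since there is no other predictor for it to be correlated with.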

In [27]:
# Same check for Age: add an intercept column, then compute the VIF of every column
X_2 = sm.add_constant(X_2)
vif_2 = pd.Series([variance_inflation_factor(X_2.values, i) for i in range(X_2.shape[1])], index=X_2.columns)
print(vif_2)

const    29.766065
Age       1.000000
dtype: float64


Hence the VIF for Age is 1.00, which is again expected when Age is the only predictor in the design matrix.

As a rule of thumb, an independent variable with a high VIF is avoided in a regression analysis. Commonly cited cutoffs are 5 or 10; a much stricter cutoff of 1.5, sometimes relaxed to 2, can also be used.


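Both independent variables show VIF = 1.00 above, but each value came from a design matrix containing only that one variable, where the VIF is 1 by construction. It therefore does not show that YearsExperience and Age are free of collinearity; to test that, both must be included in the same design matrix. If collinearity is found, we should drop the column with the higher VIF.
Finally, I recommend checking for multicollinearity before fitting a regression model.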
