# Multicollinearity In Linear Regression

### In regression, "multicollinearity" refers to predictors that are correlated with other predictors. 

### Multicollinearity occurs when your model includes multiple factors that are correlated not just to your response variable, but also to each other. 

### In other words, it results when you have factors that are a bit redundant.
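A small synthetic illustration of the definition above, using NumPy only. The data, the seed, and the helper name `ols_std_errors` are assumptions made for this sketch and are not part of the notebook. It builds a second predictor that is nearly a copy of the first and compares the standard error of the first coefficient with and without the redundant predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1 (redundant predictor)
y = 3.0 * x1 + rng.normal(size=n)        # the response depends on x1 only

def ols_std_errors(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]            # residual degrees of freedom
    sigma2 = resid @ resid / dof             # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
_, se_single = ols_std_errors(np.column_stack([ones, x1]), y)
_, se_both = ols_std_errors(np.column_stack([ones, x1, x2]), y)

print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])
print("std err of x1 alone:", se_single[1])
print("std err of x1 with x2 included:", se_both[1])
```

In this sketch the standard error of `x1` grows substantially once `x2` is added, even though `x2` carries almost no new information.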

In [1]:
import pandas as pd

In [2]:
import statsmodels.api as sm

In [3]:
df_adv = pd.read_csv("Advertising.csv")
df_adv.head()

,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [4]:
# Predictors (X) and response (y)
X = df_adv[["TV", "radio", "newspaper"]]
y = df_adv["sales"]

In [17]:
X = sm.add_constant(X)  # prepend a column of ones so OLS can fit the intercept
X

,const,TV,radio,newspaper
0,1.0,230.1,37.8,69.2
1,1.0,44.5,39.3,45.1
2,1.0,17.2,45.9,69.3
3,1.0,151.5,41.3,58.5
4,1.0,180.8,10.8,58.4
...,...,...,...,...
195,1.0,38.2,3.7,13.8
196,1.0,94.2,4.9,8.1
197,1.0,177.0,9.3,6.4
198,1.0,283.6,42.0,66.2


In [5]:
model = sm.OLS(y, X).fit()  # Ordinary Least Squares
model.summary()

Dep. Variable:,sales,R-squared (uncentered):,0.982
Model:,OLS,Adj. R-squared (uncentered):,0.982
Method:,Least Squares,F-statistic:,3566.0
Date:,"Sat, 31 Oct 2020",Prob (F-statistic):,2.43e-171
Time:,14:59:40,Log-Likelihood:,-423.54
No. Observations:,200,AIC:,853.1
Df Residuals:,197,BIC:,863.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
TV,0.0538,0.001,40.507,0.000,0.051,0.056
radio,0.2222,0.009,23.595,0.000,0.204,0.241
newspaper,0.0168,0.007,2.517,0.013,0.004,0.030

Omnibus:,5.982,Durbin-Watson:,2.038
Prob(Omnibus):,0.05,Jarque-Bera (JB):,7.039
Skew:,-0.232,Prob(JB):,0.0296
Kurtosis:,3.794,Cond. No.,12.6


### The standard errors ("std err") and the p-values ("P>|t|") are very low
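The summary above is labelled "R-squared (uncentered)" and has no `const` row, which is what statsmodels reports when the design matrix passed to `sm.OLS` contains no constant column. The sketch below shows how the two R-squared definitions differ. It is a NumPy-only illustration that uses just the `TV` and `sales` values from the five rows shown by `df_adv.head()`, and the helper name `fit_ssr` is an assumption. With so few rows the numbers only illustrate the formulas and will not match the summary.

```python
import numpy as np

# The five rows displayed by df_adv.head(); TV only, to keep the sketch small
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

def fit_ssr(X, y):
    """Least-squares fit; return the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# No constant column: R^2 is measured against sum(y^2)
ssr_no_const = fit_ssr(tv.reshape(-1, 1), sales)
r2_uncentered = 1.0 - ssr_no_const / float(sales @ sales)

# With a constant column: R^2 is measured against the spread around the mean
X_const = np.column_stack([np.ones_like(tv), tv])
ssr_const = fit_ssr(X_const, sales)
r2_centered = 1.0 - ssr_const / float(((sales - sales.mean()) ** 2).sum())

print("uncentered R^2 (no intercept):", round(r2_uncentered, 3))
print("centered R^2 (with intercept):", round(r2_centered, 3))
```

The two values use different baselines, so an uncentered R-squared should not be compared directly with the centered R-squared of a model that includes an intercept.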

In [6]:
# Correlation between the radio and newspaper predictors
X[["radio", "newspaper"]].corr()

,radio,newspaper
radio,1.0,0.354104
newspaper,0.354104,1.0


##### Here the collinearity among the independent variables is low: radio and newspaper have a correlation of only about 0.35
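Reading a correlation matrix by eye gets harder as the number of predictors grows. A minimal sketch of a helper that lists the predictor pairs whose absolute correlation reaches a cutoff follows. The function name `high_corr_pairs` and the 0.8 cutoff are assumptions chosen for illustration, not a standard from the notebook. It is applied to the radio/newspaper matrix reported above.

```python
import numpy as np

def high_corr_pairs(names, corr, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    corr = np.asarray(corr, dtype=float)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skip the diagonal
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Correlation matrix reported above for radio and newspaper
corr_adv = [[1.0, 0.354104],
            [0.354104, 1.0]]
flagged = high_corr_pairs(["radio", "newspaper"], corr_adv)
print(flagged)  # no pair reaches the 0.8 cutoff, so the list is empty
```

With a pandas DataFrame, the same helper could be called as `high_corr_pairs(list(df.columns), df.corr().to_numpy())`.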

### Let's take another example and check its collinearity

In [7]:
df_salary= pd.read_csv("Salary_Data.csv")
df_salary.head()

,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [8]:
X = df_salary[["YearsExperience", "Age"]]
y = df_salary["Salary"]

In [9]:
X = sm.add_constant(X)
X.head()

,const,YearsExperience,Age
0,1.0,1.1,21.0
1,1.0,1.3,21.5
2,1.0,1.5,21.7
3,1.0,2.0,22.0
4,1.0,2.2,22.2


In [10]:
model1 = sm.OLS(y, X).fit()
model1.summary()

Dep. Variable:,Salary,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,323.9
Date:,"Sat, 31 Oct 2020",Prob (F-statistic):,1.35e-19
Time:,15:02:24,Log-Likelihood:,-300.35
No. Observations:,30,AIC:,606.7
Df Residuals:,27,BIC:,610.9
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-6661.9872,2.28e+04,-0.292,0.773,-5.35e+04,4.02e+04
YearsExperience,6153.3533,2337.092,2.633,0.014,1358.037,1.09e+04
Age,1836.0136,1285.034,1.429,0.165,-800.659,4472.686

Omnibus:,2.695,Durbin-Watson:,1.711
Prob(Omnibus):,0.26,Jarque-Bera (JB):,1.975
Skew:,0.456,Prob(JB):,0.372
Kurtosis:,2.135,Cond. No.,626.0
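
The `Cond. No.` entry in this summary is 626. A large condition number is one sign that the columns of the design matrix are close to linearly dependent. The sketch below assumes the condition number is the ratio of the largest to the smallest singular value of the design matrix, which is what `np.linalg.cond` computes by default. It uses only the five rows shown by `df_salary.head()`, so the values will not match 626, and the result also depends on the scale of the columns.

```python
import numpy as np

# The five rows displayed by df_salary.head()
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
age = np.array([21.0, 21.5, 21.7, 22.0, 22.2])
const = np.ones_like(years)

X_full = np.column_stack([const, years, age])  # const + YearsExperience + Age
X_reduced = np.column_stack([const, years])    # const + YearsExperience only

cond_full = np.linalg.cond(X_full)
cond_reduced = np.linalg.cond(X_reduced)

print("condition number with Age:", round(float(cond_full), 1))
print("condition number without Age:", round(float(cond_reduced), 1))
```

Removing a column can never increase this ratio, so the reduced design matrix is at least as well conditioned as the full one.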


In [11]:
# Skip the const column, then correlate the predictors
X.iloc[:, 1:].corr()

,YearsExperience,Age
YearsExperience,1.0,0.987258
Age,0.987258,1.0


### YearsExperience and Age have a correlation of about 0.99 (0.987258).
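A common way to turn this correlation into a statement about the coefficients is the variance inflation factor (VIF). With exactly two predictors, the R-squared of one regressed on the other is the squared correlation, so the VIF is 1 / (1 - r^2). The sketch below plugs in the correlation reported above. The variable names are assumptions for illustration. Values above roughly 5 to 10 are often treated as a warning sign, though that cutoff is a rule of thumb rather than a strict test.

```python
r = 0.987258  # correlation between YearsExperience and Age reported above

# With exactly two predictors, each one's R^2 against the other is r**2
vif = 1.0 / (1.0 - r ** 2)

# Standard errors scale with the square root of the VIF
se_inflation = vif ** 0.5

print(f"VIF: {vif:.1f}")                          # roughly 39.5
print(f"std err inflation: {se_inflation:.1f}x")  # roughly 6.3x
```

For models with more than two predictors, statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which regresses each predictor on all the others.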

### Hence one of the two features is enough to predict the salary.

### The p-value is higher for Age (0.165) than for YearsExperience (0.014).

### So we can drop the Age feature, retrain the model, and predict the salary.
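A minimal sketch of that last step, assuming a plain least-squares fit with NumPy on only the five rows shown by `df_salary.head()`. The helper `predict_salary` and the input of 3.0 years are hypothetical. With the full CSV, the same reduced model could be fit with `sm.OLS` after `sm.add_constant`, as in the cells above.

```python
import numpy as np

# The five rows displayed by df_salary.head()
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
salary = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])

# Design matrix without Age: a constant column plus YearsExperience
X = np.column_stack([np.ones_like(years), years])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, slope = beta

def predict_salary(years_experience):
    """Predict salary from years of experience with the reduced model."""
    return intercept + slope * years_experience

new_years = 3.0  # hypothetical input
print("intercept:", round(float(intercept), 2))
print("slope per year:", round(float(slope), 2))
print("predicted salary:", round(float(predict_salary(new_years)), 2))
```

Five rows are far too few for a reliable fit, so the printed coefficients only show the mechanics of refitting without `Age`.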