## Multicollinearity and inferential linear regression

Based on a discussion of https://link.springer.com/article/10.3758/s13428-015-0624-x, I'm trying to understand how mean-centering affects multicollinearity and the stability of the coefficient estimates in linear regression.

In [1]:
from scipy import stats
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

In [2]:
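# Three correlated predictors: x2 depends on x1, and x3 depends on both.
x1 = stats.norm(10, 1).rvs(100)
x2 = 0.5*x1 + stats.norm(-15, 0.75).rvs(100)
x3 = 0.3*x1 + 0.4*x2 + stats.norm(3, 0.5).rvs(100)

# The response is linear in all three predictors, with no interaction term.
y = -0.2*x1 + 0.3*x2 + 0.1*x3 + stats.norm(3, 0.6).rvs(100)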

In [3]:
np.corrcoef(x1-x1.mean(), x2-x2.mean())

array([[1.        , 0.56764549],
       [0.56764549, 1.        ]])

In [4]:
np.corrcoef(x1, x2)

array([[1.        , 0.56764549],
       [0.56764549, 1.        ]])
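The two matrices above are identical because Pearson correlation does not change when a constant is added to either variable, so centering cannot alter the correlation between `x1` and `x2`. A minimal sketch of that check on freshly simulated data follows. The seed, the `rng` generator, and the names `a` and `b` are my assumptions rather than part of the original run, so the numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, used only for reproducibility
a = rng.normal(10, 1, 100)
b = 0.5 * a + rng.normal(-15, 0.75, 100)

# Correlation on the raw scale and after subtracting each variable's mean.
r_raw = np.corrcoef(a, b)[0, 1]
r_centered = np.corrcoef(a - a.mean(), b - b.mean())[0, 1]

# Shifting a variable leaves its correlation with another variable unchanged.
print(np.isclose(r_raw, r_centered))
```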

In [5]:
np.corrcoef(x1*x2, y)

array([[1.        , 0.25518564],
       [0.25518564, 1.        ]])

In [6]:
np.corrcoef((x1-x1.mean())*(x2-x2.mean()), y)

array([[1.        , 0.09946242],
       [0.09946242, 1.        ]])

In [7]:
df = pd.DataFrame({'x1': x1,
                   'x2': x2,
                   'x3': x3,
                   'x1x2': x1*x2,
                   'x1c': x1-x1.mean(),
                   'x2c': x2-x2.mean(),
                   'x3c': x3-x3.mean(),
                   'x1cx2c': (x1-x1.mean())*(x2-x2.mean()),
                   'y': y})


In [8]:
smf.ols('y ~ x1 + x2', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.103
Model:,OLS,Adj. R-squared:,0.085
Method:,Least Squares,F-statistic:,5.586
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.00506
Time:,16:00:15,Log-Likelihood:,-82.553
No. Observations:,100,AIC:,171.1
Df Residuals:,97,BIC:,178.9
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.8883,1.363,1.385,0.169,-0.817,4.594
x1,-0.1069,0.075,-1.421,0.159,-0.256,0.042
x2,0.2594,0.079,3.297,0.001,0.103,0.416

Omnibus:,2.325,Durbin-Watson:,1.857
Prob(Omnibus):,0.313,Jarque-Bera (JB):,1.582
Skew:,0.039,Prob(JB):,0.453
Kurtosis:,2.389,Cond. No.,345.0


In [9]:
smf.ols('y ~ x1c + x2c', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.103
Model:,OLS,Adj. R-squared:,0.085
Method:,Least Squares,F-statistic:,5.586
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.00506
Time:,16:00:15,Log-Likelihood:,-82.553
No. Observations:,100,AIC:,171.1
Df Residuals:,97,BIC:,178.9
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.7679,0.056,-31.518,0.000,-1.879,-1.657
x1c,-0.1069,0.075,-1.421,0.159,-0.256,0.042
x2c,0.2594,0.079,3.297,0.001,0.103,0.416

Omnibus:,2.325,Durbin-Watson:,1.857
Prob(Omnibus):,0.313,Jarque-Bera (JB):,1.582
Skew:,0.039,Prob(JB):,0.453
Kurtosis:,2.389,Cond. No.,1.91
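Comparing the two fits above, the condition number drops from 345 to 1.91 after centering, yet the slopes and their standard errors are identical. The sketch below reproduces that pattern with plain NumPy. It assumes the reported `Cond. No.` is the condition number of the design matrix including the intercept column, which is my understanding of statsmodels but is not confirmed by the output itself. The seed and the variable names are also assumptions, so the numbers will differ from the tables above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, used only for reproducibility
n = 100
x1 = rng.normal(10, 1, n)
x2 = 0.5 * x1 + rng.normal(-15, 0.75, n)
y = -0.2 * x1 + 0.3 * x2 + rng.normal(3, 0.6, n)

# Design matrices with an explicit intercept column.
ones = np.ones(n)
X_raw = np.column_stack([ones, x1, x2])
X_cen = np.column_stack([ones, x1 - x1.mean(), x2 - x2.mean()])

# Ordinary least squares on both parameterizations.
beta_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
beta_cen, *_ = np.linalg.lstsq(X_cen, y, rcond=None)

print("cond raw:     ", np.linalg.cond(X_raw))
print("cond centered:", np.linalg.cond(X_cen))
print("slopes equal: ", np.allclose(beta_raw[1:], beta_cen[1:]))
print("centered intercept equals mean(y):", np.isclose(beta_cen[0], y.mean()))
```

Only the intercept changes: in the centered model it becomes the mean of `y`, because the intercept column is orthogonal to the centered predictors. The large condition number of the raw model mostly reflects the predictors sitting far from zero relative to their spread, not a stronger dependence between `x1` and `x2`.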


In [10]:
smf.ols('y ~ x1 + x2 + x1x2', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.109
Model:,OLS,Adj. R-squared:,0.081
Method:,Least Squares,F-statistic:,3.924
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.0109
Time:,16:00:15,Log-Likelihood:,-82.22
No. Observations:,100,AIC:,172.4
Df Residuals:,96,BIC:,182.9
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-3.7099,7.124,-0.521,0.604,-17.851,10.431
x1,0.4440,0.692,0.641,0.523,-0.930,1.818
x2,-0.2945,0.696,-0.423,0.673,-1.677,1.088
x1x2,0.0547,0.068,0.801,0.425,-0.081,0.190

Omnibus:,3.339,Durbin-Watson:,1.87
Prob(Omnibus):,0.188,Jarque-Bera (JB):,1.949
Skew:,0.026,Prob(JB):,0.377
Kurtosis:,2.318,Cond. No.,12900.0


In [11]:
smf.ols('y ~ x1c + x2c + x1cx2c', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.109
Model:,OLS,Adj. R-squared:,0.081
Method:,Least Squares,F-statistic:,3.924
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.0109
Time:,16:00:15,Log-Likelihood:,-82.22
No. Observations:,100,AIC:,172.4
Df Residuals:,96,BIC:,182.9
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.7923,0.064,-28.047,0.000,-1.919,-1.665
x1c,-0.1014,0.076,-1.339,0.184,-0.252,0.049
x2c,0.2543,0.079,3.215,0.002,0.097,0.411
x1cx2c,0.0547,0.068,0.801,0.425,-0.081,0.190

Omnibus:,3.339,Durbin-Watson:,1.87
Prob(Omnibus):,0.188,Jarque-Bera (JB):,1.949
Skew:,0.026,Prob(JB):,0.377
Kurtosis:,2.318,Cond. No.,2.05
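The raw and centered interaction models above have the same R-squared, the same log-likelihood, and the same interaction coefficient (0.0547), which suggests they are two parameterizations of one fit. Expanding `(x1 - m1)*(x2 - m2)` shows how the main effects map onto each other: the centered `x1` slope equals the raw `x1` slope plus the interaction coefficient times `mean(x2)`, and symmetrically for `x2`. Here is a minimal sketch of that identity with NumPy. The seed and the names `a`, `b`, `m1`, `m2` are assumptions introduced for illustration, so the numbers will not match the tables above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, used only for reproducibility
n = 100
x1 = rng.normal(10, 1, n)
x2 = 0.5 * x1 + rng.normal(-15, 0.75, n)
y = -0.2 * x1 + 0.3 * x2 + rng.normal(3, 0.6, n)

m1, m2 = x1.mean(), x2.mean()
x1c, x2c = x1 - m1, x2 - m2
ones = np.ones(n)

# Columns: intercept, x1, x2, interaction (raw and centered versions).
X_raw = np.column_stack([ones, x1, x2, x1 * x2])
X_cen = np.column_stack([ones, x1c, x2c, x1c * x2c])

a, *_ = np.linalg.lstsq(X_raw, y, rcond=None)  # raw coefficients
b, *_ = np.linalg.lstsq(X_cen, y, rcond=None)  # centered coefficients

print("same interaction coef:", np.isclose(a[3], b[3]))
print("x1 slope mapping:     ", np.isclose(b[1], a[1] + a[3] * m2))
print("x2 slope mapping:     ", np.isclose(b[2], a[2] + a[3] * m1))
print("same fitted values:   ", np.allclose(X_raw @ a, X_cen @ b))
```

Read this way, the raw `x1` coefficient is the slope of `x1` when `x2 = 0`, a value far outside the simulated data, while the centered coefficient is the slope at `x2 = mean(x2)`. That would explain why the raw main effects have much larger standard errors even though both models fit the data equally well.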


In [12]:
smf.ols('y ~ x1 + x2 + x3', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.141
Model:,OLS,Adj. R-squared:,0.114
Method:,Least Squares,F-statistic:,5.233
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.00218
Time:,16:00:15,Log-Likelihood:,-80.43
No. Observations:,100,AIC:,168.9
Df Residuals:,96,BIC:,179.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.0341,1.405,0.736,0.464,-1.755,3.824
x1,-0.1526,0.077,-1.972,0.051,-0.306,0.001
x2,0.1700,0.089,1.911,0.059,-0.007,0.347
x3,0.1948,0.095,2.041,0.044,0.005,0.384

Omnibus:,1.884,Durbin-Watson:,1.935
Prob(Omnibus):,0.39,Jarque-Bera (JB):,1.648
Skew:,0.184,Prob(JB):,0.439
Kurtosis:,2.49,Cond. No.,366.0


In [13]:
smf.ols('y ~ x1c + x2c + x3c', data=df).fit().summary()

Dep. Variable:,y,R-squared:,0.141
Model:,OLS,Adj. R-squared:,0.114
Method:,Least Squares,F-statistic:,5.233
Date:,"Sun, 21 Mar 2021",Prob (F-statistic):,0.00218
Time:,16:00:15,Log-Likelihood:,-80.43
No. Observations:,100,AIC:,168.9
Df Residuals:,96,BIC:,179.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.7679,0.055,-32.028,0.000,-1.877,-1.658
x1c,-0.1526,0.077,-1.972,0.051,-0.306,0.001
x2c,0.1700,0.089,1.911,0.059,-0.007,0.347
x3c,0.1948,0.095,2.041,0.044,0.005,0.384

Omnibus:,1.884,Durbin-Watson:,1.935
Prob(Omnibus):,0.39,Jarque-Bera (JB):,1.648
Skew:,0.184,Prob(JB):,0.439
Kurtosis:,2.49,Cond. No.,2.59
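The slopes and standard errors for `x1`, `x2`, and `x3` are the same in the raw and centered fits above, even though the condition number changes from 366 to 2.59. One way to see why is the variance inflation factor (VIF), which measures how well each predictor is explained by the others and does not depend on where the predictors are centered. The helper `vif` below is a hypothetical hand-rolled function written for this sketch, not a statsmodels call. The seed is also an assumption, so the values will not correspond to the tables above.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other
    columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    # With an intercept the residuals have mean zero, so this is 1 - SSR/SST.
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)  # assumed seed, used only for reproducibility
n = 100
x1 = rng.normal(10, 1, n)
x2 = 0.5 * x1 + rng.normal(-15, 0.75, n)
x3 = 0.3 * x1 + 0.4 * x2 + rng.normal(3, 0.5, n)

X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)  # mean-centered copy

vifs_raw = [vif(X, j) for j in range(X.shape[1])]
vifs_cen = [vif(Xc, j) for j in range(Xc.shape[1])]

print("VIF raw:     ", np.round(vifs_raw, 3))
print("VIF centered:", np.round(vifs_cen, 3))
print("identical:   ", np.allclose(vifs_raw, vifs_cen))
```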
