Multicollinearity in Multiple Logistic Regression

~ An Empirical Demonstration ~


In [21]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns
sns.set()

In [22]:
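N = 20
half = N // 2

# X1: the integers 0, 1, ..., N-1
a = np.arange(N)
# X2: a copy of X1 that differs only in its first element
a1 = np.arange(N)
a1[0] = 1

# Y: 0 in the first half and 1 in the second half, with one label flipped in each half
b = np.array([0] * half + [1] * half)
b[half - 5] = 1
b[half + 5] = 0

# X3: uniform random noise, seeded for reproducibility
np.random.seed(123)
r = np.random.random(N)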

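X1 and X2 are each very strong predictors of Y.
However, X1 and X2 are also very strongly correlated with each other: they are identical except in the first row.
X3, by contrast, is pure random noise.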
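As a quick check on the claim that X1 and X2 are strongly correlated, the following minimal sketch rebuilds the three predictors and prints their Pearson correlation matrix. The names `x1`, `x2`, `x3`, `predictors`, `corr`, and `n_diff` are introduced here for illustration only and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the predictors exactly as in the cell above so this block runs on its own.
N = 20
x1 = np.arange(N)
x2 = np.arange(N)
x2[0] = 1                      # the only place where X2 differs from X1
np.random.seed(123)
x3 = np.random.random(N)       # random noise

predictors = pd.DataFrame({"X1": x1, "X2": x2, "X3": x3})

# Pairwise Pearson correlations (the pandas default).
corr = predictors.corr()
print(corr.round(4))

# Count the rows in which X1 and X2 actually differ.
n_diff = int((predictors["X1"] != predictors["X2"]).sum())
print("rows where X1 != X2:", n_diff)
```

The X1/X2 entry should be very close to 1, which is the near-collinearity that the models below are meant to expose.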

In [24]:
data = pd.DataFrame({
    'X1':a,
    'X2':a1,
    'X3':r,
    'Y':b
    })
data

,X1,X2,X3,Y
0,0,1,0.696469,0
1,1,1,0.286139,0
2,2,2,0.226851,0
3,3,3,0.551315,0
4,4,4,0.719469,0
5,5,5,0.423106,1
6,6,6,0.980764,0
7,7,7,0.68483,0
8,8,8,0.480932,0
9,9,9,0.392118,0


Y ~ X1

Y ~ X1 + X2

Y ~ X1 + X3

Y ~ X1 + X2 + X3

We fit these models in order and watch how the p-values for X1, X2, and X3 change.

In [32]:
# Model 1: Y explained by X1 alone
logistic_model = smf.glm(formula="Y ~ X1",
                         data=data,
                         family=sm.families.Binomial(link=sm.families.links.Logit()))
logistic_result = logistic_model.fit()
logistic_result.summary()

Dep. Variable:,Y,No. Observations:,20.0
Model:,GLM,Df Residuals:,18.0
Model Family:,Binomial,Df Model:,1.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-7.9965
Date:,"Sun, 20 Mar 2022",Deviance:,15.993
Time:,20:33:01,Pearson chi2:,18.4
No. Iterations:,5,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.6020,1.614,-2.232,0.026,-6.764,-0.440
X1,0.3792,0.156,2.425,0.015,0.073,0.686


In [33]:
# Model 2: add X2, which is almost identical to X1
logistic_model = smf.glm(formula="Y ~ X1 + X2",
                         data=data,
                         family=sm.families.Binomial(link=sm.families.links.Logit()))
logistic_result = logistic_model.fit()
logistic_result.summary()

Dep. Variable:,Y,No. Observations:,20.0
Model:,GLM,Df Residuals:,17.0
Model Family:,Binomial,Df Model:,2.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-7.9687
Date:,"Sun, 20 Mar 2022",Deviance:,15.937
Time:,20:33:01,Pearson chi2:,18.0
No. Iterations:,19,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.5291,1.644,-2.147,0.032,-6.751,-0.307
X1,17.4096,1.77e+04,0.001,0.999,-3.47e+04,3.48e+04
X2,-17.0370,1.77e+04,-0.001,0.999,-3.48e+04,3.47e+04


In [34]:
# Model 3: X1 plus the random variable X3
logistic_model = smf.glm(formula="Y ~ X1 + X3",
                         data=data,
                         family=sm.families.Binomial(link=sm.families.links.Logit()))
logistic_result = logistic_model.fit()
logistic_result.summary()

Dep. Variable:,Y,No. Observations:,20.0
Model:,GLM,Df Residuals:,17.0
Model Family:,Binomial,Df Model:,2.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-6.5423
Date:,"Sun, 20 Mar 2022",Deviance:,13.085
Time:,20:33:01,Pearson chi2:,11.7
No. Iterations:,6,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.6485,2.149,-0.302,0.763,-4.860,3.563
X1,0.3964,0.176,2.254,0.024,0.052,0.741
X3,-6.3379,4.506,-1.406,0.160,-15.170,2.495


In [35]:
# Model 4: all three predictors together
logistic_model = smf.glm(formula="Y ~ X1 + X2 + X3",
                         data=data,
                         family=sm.families.Binomial(link=sm.families.links.Logit()))
logistic_result = logistic_model.fit()
logistic_result.summary()

Dep. Variable:,Y,No. Observations:,20.0
Model:,GLM,Df Residuals:,16.0
Model Family:,Binomial,Df Model:,3.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-6.5358
Date:,"Sun, 20 Mar 2022",Deviance:,13.072
Time:,20:33:02,Pearson chi2:,11.7
No. Iterations:,19,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.6481,2.143,-0.302,0.762,-4.848,3.552
X1,15.9321,1.77e+04,0.001,0.999,-3.47e+04,3.48e+04
X2,-15.5380,1.77e+04,-0.001,0.999,-3.48e+04,3.47e+04
X3,-6.2888,4.520,-1.391,0.164,-15.148,2.570
