# 05.04 Analysis of Variance and Model Performance

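## 1. Printing the ANOVA attribute values

    - TSS : "result.centered_tss" ("result.uncentered_tss" is the sum of squared y values without subtracting the mean)
    - ESS : "result.ess" ("result.mse_model" is ESS divided by the model degrees of freedom, so the two coincide only with a single regressor)
    - RSS : "result.ssr"
    - R squared : "result.rsquared"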

In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

# Simulated one-feature regression data with Gaussian noise
X0, y, coef = make_regression(
    n_samples=100, n_features=1, noise=30, coef=True, random_state=0)
dfX0 = pd.DataFrame(X0, columns=["X"])
dfX = sm.add_constant(dfX0)
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

# The formula interface adds the intercept term itself
model = sm.OLS.from_formula("Y ~ X", data=df)
result = model.fit()

In [2]:
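# uncentered_tss is the sum of y**2, so it is slightly larger than ESS + RSS;
# result.centered_tss is the total sum of squares that satisfies TSS = ESS + RSS
print("TSS (uncentered) = ", result.uncentered_tss)
# result.ess is the explained sum of squares; mse_model = ess / df_model (df_model = 1 here)
print("ESS = ", result.ess)
print("RSS = ", result.ssr)
print("ESS + RSS = ", result.ess + result.ssr)
print("R squared = ", result.rsquared)

TSS (uncentered) =  291345.7578983061
ESS =  188589.61349210914
RSS =  102754.33755137536
ESS + RSS =  291343.9510434845
R squared =  0.6473091780922584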
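
The decomposition behind these attributes can be checked by hand. The sketch below is a minimal pure-Python illustration on made-up toy data (the values of `x` and `y` are hypothetical, not the `make_regression` sample). It fits a one-regressor line in closed form and compares the centered and uncentered total sums of squares, which suggests why the printed `uncentered_tss` is not exactly ESS + RSS while the centered TSS is.

```python
# Toy data: hypothetical values chosen only for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for a single regressor plus intercept
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# Sums of squares
centered_tss = sum((yi - y_bar) ** 2 for yi in y)       # total, around the mean
uncentered_tss = sum(yi ** 2 for yi in y)               # total, around zero
ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual

r_squared = 1 - rss / centered_tss

print("centered TSS   =", centered_tss)
print("ESS + RSS      =", ess + rss)
print("uncentered TSS =", uncentered_tss)
print("R squared      =", r_squared)
```

The two totals differ by `n * y_bar ** 2`, so they agree only when the mean of `y` is zero. In statsmodels the centered quantity is available as `result.centered_tss`.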


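## 2. Printing the ANOVA table

    - Shows the F-test and the ANOVA attribute values in one place
    - sm.stats.anova_lm( )
    
    - The F statistic and p-value from anova_lm( ) equal the F statistic and p-value in result.summary()
    
    - df (degrees of freedom) : K-1 for the model, N-K for the residual
    - sum_sq (sum of squares, the chi-squared distributed quantity) : ESS, RSS
    - mean_sq : sum_sq/df
    - F : F-test statistic
    - PR(>F) : p-value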

In [3]:
# Print the ANOVA table

sm.stats.anova_lm(result)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| X | 1.0 | 188589.613492 | 188589.613492 | 179.863766 | 6.601482e-24 |
| Residual | 98.0 | 102754.337551 | 1048.513648 | NaN | NaN |
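
The columns of this table are linked by simple arithmetic. As a minimal sketch, the snippet below copies `sum_sq` and `df` from the table above and recomputes `mean_sq`, `F`, and R squared; the variable names are chosen here for illustration only.

```python
# Values copied from the ANOVA table above
ess, df_model = 188589.613492, 1.0    # row "X"
rss, df_resid = 102754.337551, 98.0   # row "Residual"

# mean_sq = sum_sq / df
mean_sq_model = ess / df_model
mean_sq_resid = rss / df_resid

# F statistic = ratio of the two mean squares
f_stat = mean_sq_model / mean_sq_resid

# R squared = ESS / (ESS + RSS)
r_squared = ess / (ess + rss)

print("mean_sq (Residual) =", mean_sq_resid)
print("F                  =", f_stat)
print("R squared          =", r_squared)
```

The `PR(>F)` column is the upper-tail probability of the F distribution with (1, 98) degrees of freedom at this statistic. Assuming SciPy is installed, `scipy.stats.f.sf(f_stat, df_model, df_resid)` would compute it.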


In [6]:
# result.summary()
# The model is useful: the F-test p-value is very close to 0, so the null hypothesis of the F-test is rejected.

print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     179.9
Date:                Mon, 18 May 2020   Prob (F-statistic):           6.60e-24
Time:                        17:19:12   Log-Likelihood:                -488.64
No. Observations:                 100   AIC:                             981.3
Df Residuals:                      98   BIC:                             986.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.4425      3.244     -0.753      0.4

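## 3. Using the F-test

    - Comparing two models "sm.stats.anova_lm(model_reduced.fit(), model_full.fit())"
    
    - Checking the importance of each variable "sm.stats.anova_lm(result_boston, typ=2)"
       - The F-test p-value equals the p-value of the single-coefficient t-test in result.summary() (the two tests are equivalent)
       - anova_lm prints more decimal places, though, so you can compare which p-values are closer to 0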
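
The equivalence between the single-coefficient F-test and the t-test rests on the identity F = t². The sketch below checks it for the simplest case, one regressor plus an intercept, using pure Python and made-up toy data (the `x` and `y` values are hypothetical).

```python
import math

# Toy data: hypothetical values chosen only for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 6.3, 6.8, 9.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS fit
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual and explained sums of squares
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ess = slope ** 2 * s_xx
mean_sq_resid = rss / (n - 2)   # residual degrees of freedom: N - K with K = 2

# t statistic of the slope and F statistic of the regression
t_stat = slope / math.sqrt(mean_sq_resid / s_xx)
f_stat = ess / mean_sq_resid

print("t^2 =", t_stat ** 2)
print("F   =", f_stat)
```

Because the statistics satisfy F = t², the two tests give the same p-value, which is the equivalence noted in the list above.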

**Model comparison**

In [11]:
# load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

boston = load_boston()
dfX0_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy_boston = pd.DataFrame(boston.target, columns=["MEDV"])
dfX_boston = sm.add_constant(dfX0_boston)
df_boston = pd.concat([dfX_boston, dfy_boston], axis=1)

In [9]:
model_full = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)
# The reduced model drops INDUS and AGE
model_reduced = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)

# F-test of H0: the coefficients of INDUS and AGE are both zero
sm.stats.anova_lm(model_reduced.fit(), model_full.fit())

| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 494.0 | 11081.363952 | 0.0 | NaN | NaN | NaN |
| 1 | 492.0 | 11078.784578 | 2.0 | 2.579374 | 0.057274 | 0.944342 |
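
The F statistic in the second row can be reproduced from the other columns. As a minimal sketch with the numbers copied from the table above (variable names are illustrative), it is the drop in the residual sum of squares per dropped coefficient, divided by the residual mean square of the full model.

```python
# Values copied from the model-comparison table above
rss_reduced, df_resid_reduced = 11081.363952, 494.0   # row 0: reduced model
rss_full, df_resid_full = 11078.784578, 492.0         # row 1: full model

# Number of extra coefficients in the full model, and the RSS they remove
df_diff = df_resid_reduced - df_resid_full
ss_diff = rss_reduced - rss_full

# F = (ss_diff / df_diff) / (rss_full / df_resid_full)
f_stat = (ss_diff / df_diff) / (rss_full / df_resid_full)

print("df_diff =", df_diff)
print("ss_diff =", ss_diff)
print("F       =", f_stat)
```

An F value this small, with `Pr(>F)` of 0.944342, means the two extra variables barely reduce the residual sum of squares, so the reduced model is not rejected.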


**Checking variable importance**

In [None]:
model_full = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)
# The reduced model drops only CRIM, so the F-test measures the contribution of CRIM alone
model_reduced = sm.OLS.from_formula(
    "MEDV ~ ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)

sm.stats.anova_lm(model_reduced.fit(), model_full.fit())

In [12]:
model_boston = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)
result_boston = model_boston.fit()
# typ=2 tests each variable against the model that contains all the other variables
sm.stats.anova_lm(result_boston, typ=2)

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| CRIM | 243.219699 | 1.0 | 10.801193 | 0.00108681 |
| ZN | 257.492979 | 1.0 | 11.435058 | 0.0007781097 |
| INDUS | 2.516668 | 1.0 | 0.111763 | 0.7382881 |
| NOX | 487.155674 | 1.0 | 21.634196 | 4.245644e-06 |
| RM | 1871.324082 | 1.0 | 83.104012 | 1.979441e-18 |
| AGE | 0.061834 | 1.0 | 0.002746 | 0.9582293 |
| DIS | 1232.412493 | 1.0 | 54.730457 | 6.013491e-13 |
| RAD | 479.153926 | 1.0 | 21.278844 | 5.070529e-06 |
| TAX | 242.25744 | 1.0 | 10.75846 | 0.001111637 |
| PTRATIO | 1194.233533 | 1.0 | 53.03496 | 1.308835e-12 |
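
To compare which variables have p-values closest to 0, the printed rows can be sorted directly. The sketch below copies the `F` and `PR(>F)` values of the rows shown above into a plain dictionary (the name `anova_rows` is illustrative) and also computes the square root of each F, which corresponds to the absolute t statistic under the F = t² equivalence.

```python
import math

# (F, PR(>F)) pairs copied from the rows printed above
anova_rows = {
    "CRIM": (10.801193, 0.00108681),
    "ZN": (11.435058, 0.0007781097),
    "INDUS": (0.111763, 0.7382881),
    "NOX": (21.634196, 4.245644e-06),
    "RM": (83.104012, 1.979441e-18),
    "AGE": (0.002746, 0.9582293),
    "DIS": (54.730457, 6.013491e-13),
    "RAD": (21.278844, 5.070529e-06),
    "TAX": (10.75846, 0.001111637),
    "PTRATIO": (53.03496, 1.308835e-12),
}

# Smallest p-value first: the most significant variables come first
ranked = sorted(anova_rows, key=lambda name: anova_rows[name][1])

# |t| implied by each single-coefficient F statistic
abs_t = {name: math.sqrt(f_stat) for name, (f_stat, _) in anova_rows.items()}

for name in ranked:
    f_stat, p_value = anova_rows[name]
    print(f"{name:8s} F={f_stat:10.6f}  |t|={abs_t[name]:7.4f}  p={p_value:.6e}")
```

Among the rows shown, RM, DIS, and PTRATIO have the smallest p-values, while INDUS and AGE are far from significant, consistent with the model comparison in which dropping those two variables changed the fit very little.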
