# 변수선택법

## 최적회귀방정식의 선택 
- 모형 내 설명변수의 수가 증가할수록 데이터 관리에는 많은 노력이 요구된다. 따라서 상황에 따라 종속변수에 영향을 미치는 유의미한 독립변수들을 선택하여 최적의 회귀방정식을 도출하는 과정이 필요하다. 
- 변수를 선택할 때는 F-통계량이나 AIC와 같은 특정 기준을 근거로 변수를 제거하거나 선택한다. 
- t-통계량의 유의확률이 유의수준보다 큰 변수는 통계적으로 유의하지 않으므로 제거해야하고, AIC와 같은 벌점화 기준을 가장 낮게 만드는 변수 조합을 선택해야 한다. 


$AIC\ =\ -2\ln \left(L\right)\ +2k$

여기서  -2ln(L) 은 모형의 적합도를 의미하며, k는 모형의 추정된 파라미터의 개수이다. -2ln(L)에서 L은 Likelihood function 을 의미하며, AIC 값이 낮다는 것은 즉 모형의 적합도가 높은 것을 의미한다.

(모형의 적합도란 실제 자료와 연구자의 연구 모형이 얼마나 부합하는지 평가하는 것)

​

여기서 2k는 모형의 추정된 파라미터의 개수를 의미하며, 해당 모형에 패널티를 주기 위해 사용한다. 

실제로 어떤 모형이 2ln(L) 즉 적합도를 높이기 위해  여러 불필요한 파라미터를 사용할 수도 있다.  실제 모형 비교 시 독립변수가 많은 모형이 적합도 면에서 유리하게 되는데, 이는 즉 독립변수에 따라서 모형의 적합도에 차이가 난다는 의미이다. 따라서 이를 상쇄시키기 위하여 불필요한 파라미터, 즉 독립변수의 수가 증가할수록 2k를 증가시켜 패널티를 부여하여 모델의 품질을 평가한다. 




In [1]:
import pandas as pd
from pandas import DataFrame

# 데이터 불러오기
Cars = pd.read_csv('../data/Cars93.csv')


In [2]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols(formula = "Price ~ EngineSize + RPM + Weight+ Length", data = Cars)
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Price,R-squared:,0.563
Model:,OLS,Adj. R-squared:,0.543
Method:,Least Squares,F-statistic:,28.34
Date:,"Mon, 22 Nov 2021",Prob (F-statistic):,3.93e-15
Time:,06:17:20,Log-Likelihood:,-303.89
No. Observations:,93,AIC:,617.8
Df Residuals:,88,BIC:,630.4
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-45.4934,14.654,-3.104,0.003,-74.616,-16.371
EngineSize,4.5091,1.381,3.266,0.002,1.765,7.253
RPM,0.0070,0.001,5.139,0.000,0.004,0.010
Weight,0.0079,0.002,3.255,0.002,0.003,0.013
Length,-0.0457,0.083,-0.550,0.584,-0.211,0.120

0,1,2,3
Omnibus:,62.028,Durbin-Watson:,1.405
Prob(Omnibus):,0.0,Jarque-Bera (JB):,353.003
Skew:,2.067,Prob(JB):,2.22e-77
Kurtosis:,11.602,Cond. No.,133000.0


In [3]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols(formula = "Price ~ EngineSize + RPM + Weight", data = Cars)
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Price,R-squared:,0.561
Model:,OLS,Adj. R-squared:,0.547
Method:,Least Squares,F-statistic:,37.98
Date:,"Mon, 22 Nov 2021",Prob (F-statistic):,6.75e-16
Time:,06:17:55,Log-Likelihood:,-304.05
No. Observations:,93,AIC:,616.1
Df Residuals:,89,BIC:,626.2
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-51.7933,9.106,-5.688,0.000,-69.887,-33.699
EngineSize,4.3054,1.325,3.249,0.002,1.673,6.938
RPM,0.0071,0.001,5.208,0.000,0.004,0.010
Weight,0.0073,0.002,3.372,0.001,0.003,0.012

0,1,2,3
Omnibus:,62.441,Durbin-Watson:,1.406
Prob(Omnibus):,0.0,Jarque-Bera (JB):,361.88
Skew:,2.076,Prob(JB):,2.62e-79
Kurtosis:,11.726,Cond. No.,82700.0


In [6]:
import time
import itertools

In [7]:
def processSubset(X,y, feature_set):
            model = sm.OLS(y,X[list(feature_set)]) # Modeling
            regr = model.fit() # 모델 학습
            AIC = regr.aic # 모델의 AIC
            return {"model":regr, "AIC":AIC}
        
'''
전진선택법
'''
def forward(X, y, predictors):
    
    # 데이터 변수들이 미리정의된 predictors에 있는지 없는지 확인 및 분류
    remaining_predictors = [p for p in X.columns.difference(['Intercept']) if p not in predictors]
    results = []
    for p in remaining_predictors:
        results.append(processSubset(X=X, y= y, feature_set=predictors+[p]+['Intercept']))
        
    # 데이터프레임으로 변환
    models = pd.DataFrame(results)

    # AIC가 가장 낮은 것을 선택
    best_model = models.loc[models['AIC'].argmin()] # index

    print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in")
    print('Selected predictors:',best_model['model'].model.exog_names,' AIC:',best_model[0] )
    return best_model

'''
후진소거법
'''
def backward(X,y,predictors):
    tic = time.time()
    results = []
    
    # 데이터 변수들이 미리정의된 predictors 조합 확인
    for combo in itertools.combinations(predictors, len(predictors) - 1):
        results.append(processSubset(X=X, y= y,feature_set=list(combo)+['Intercept']))
    models = pd.DataFrame(results)
    
    # 가장 낮은 AIC를 가진 모델을 선택
    best_model = models.loc[models['AIC'].argmin()]
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors) - 1, "predictors in",
          (toc - tic))
    print('Selected predictors:',best_model['model'].model.exog_names,' AIC:',best_model[0] )
    return best_model



'''
단계적 선택법
'''

def Stepwise_model(X,y):
    Stepmodels = pd.DataFrame(columns=["AIC", "model"])
    tic = time.time()
    predictors = []
    Smodel_before = processSubset(X,y,predictors+['Intercept'])['AIC']
    # 변수 1~10개 : 0~9 -> 1~10
    for i in range(1, len(X.columns.difference(['Intercept'])) + 1):
        Forward_result = forward(X=X, y=y, predictors=predictors) # constant added
        print('forward')
        Stepmodels.loc[i] = Forward_result
        predictors = Stepmodels.loc[i]["model"].model.exog_names
        predictors = [ k for k in predictors if k != 'Intercept']
        Backward_result = backward(X=X, y=y, predictors=predictors)
        if Backward_result['AIC']< Forward_result['AIC']:
            Stepmodels.loc[i] = Backward_result
            predictors = Stepmodels.loc[i]["model"].model.exog_names
            Smodel_before = Stepmodels.loc[i]["AIC"]
            predictors = [ k for k in predictors if k != 'Intercept']
            print('backward')
        if Stepmodels.loc[i]['AIC']> Smodel_before:
            break
        else:
            Smodel_before = Stepmodels.loc[i]["AIC"]
    toc = time.time()
    print("Total elapsed time:", (toc - tic), "seconds.")
    return (Stepmodels['model'][len(Stepmodels['model'])])

In [8]:
from patsy import dmatrices

y,X = dmatrices("Price ~ EngineSize + RPM + Weight+ Length",
                data = Cars,return_type = "dataframe")

In [11]:
Stepwise_best_model = Stepwise_model(X=X, y=y)

Processed  4 models on 1 predictors in
Selected predictors: ['Weight', 'Intercept']  AIC: <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000027D57974988>
forward
Processed  1 models on 0 predictors in 0.003023862838745117
Selected predictors: ['Intercept']  AIC: <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000027D57964148>
Processed  3 models on 2 predictors in
Selected predictors: ['Weight', 'RPM', 'Intercept']  AIC: <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000027D5797CC08>
forward
Processed  2 models on 1 predictors in 0.004073619842529297
Selected predictors: ['Weight', 'Intercept']  AIC: <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000027D5797CA08>
Processed  2 models on 3 predictors in
Selected predictors: ['Weight', 'RPM', 'EngineSize', 'Intercept']  AIC: <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000027D57973E48>
forward
Proces

In [12]:
Stepwise_best_model

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x27d579d29c8>

In [13]:
Stepwise_best_model.aic

616.0976497740975

In [16]:
Stepwise_best_model.summary()

0,1,2,3
Dep. Variable:,Price,R-squared:,0.561
Model:,OLS,Adj. R-squared:,0.547
Method:,Least Squares,F-statistic:,37.98
Date:,"Mon, 22 Nov 2021",Prob (F-statistic):,6.75e-16
Time:,06:22:48,Log-Likelihood:,-304.05
No. Observations:,93,AIC:,616.1
Df Residuals:,89,BIC:,626.2
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Weight,0.0073,0.002,3.372,0.001,0.003,0.012
RPM,0.0071,0.001,5.208,0.000,0.004,0.010
EngineSize,4.3054,1.325,3.249,0.002,1.673,6.938
Intercept,-51.7933,9.106,-5.688,0.000,-69.887,-33.699

0,1,2,3
Omnibus:,62.441,Durbin-Watson:,1.406
Prob(Omnibus):,0.0,Jarque-Bera (JB):,361.88
Skew:,2.076,Prob(JB):,2.62e-79
Kurtosis:,11.726,Cond. No.,82700.0
