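Representative ensemble techniques
  - Voting
     - Combines the classifiers' predictions by majority rule
     - Exploits the diversity among the classifiers
  - Stacking
    - Each base classifier makes a prediction, and those predictions are passed as input to another classifier that makes the final prediction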
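
As a minimal sketch of the majority rule used in voting, the cell below combines binary predictions by hand. The prediction matrix is hypothetical toy data, not output from any model in this notebook.

```python
import numpy as np

# Hypothetical predictions from three classifiers on five samples (rows = models).
preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

# Binary majority vote: a sample is labelled 1 when more than half of the models say 1.
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 1 0]
```

With an odd number of binary classifiers there are no ties, which is one reason odd ensemble sizes are convenient for hard voting.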

In [1]:
from sklearn.datasets import load_breast_cancer

In [2]:
x,y = load_breast_cancer(return_X_y=True)

In [6]:
x.shape, y.shape

((569, 30), (569,))

Cross-validation helper

In [8]:
from sklearn.model_selection import cross_val_score,StratifiedKFold
def classification_model(model,x,y):  
  # 5-fold stratified cross-validation (StratifiedKFold defaults to n_splits=5)
  scores = cross_val_score(model,x,y,cv=StratifiedKFold())
  return scores.mean()
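
The helper relies on `StratifiedKFold` keeping the class proportions roughly equal in every fold. A small sketch of that behaviour follows; `y_toy` and `X_toy` are hypothetical toy arrays, not the breast cancer data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10 negatives and 20 positives.
y_toy = np.array([0] * 10 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

# Record the share of positives in each test fold.
fold_pos_ratio = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_pos_ratio.append(y_toy[test_idx].mean())

print(fold_pos_ratio)  # each fold holds 2 negatives and 4 positives
```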

In [10]:
from xgboost import XGBClassifier, XGBRFClassifier
classification_model(XGBClassifier(),x,y)

0.9771619313771154

In [11]:
# gblinear: low memory use, so fitting is fast
# It is a linear booster rather than a tree-based one, so accuracy tends to be lower
classification_model(XGBClassifier(booster='gblinear'),x,y)

0.9666356155876418

In [12]:
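# dart: Dropouts meet Multiple Additive Regression Trees
# Randomly drops some trees during boosting -> a regularization effect that helps avoid overfitting
# one_drop=True: at least one tree is always dropped during dropout
classification_model(XGBClassifier(booster='dart',one_drop=True),x,y)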

0.9683744760130415

In [18]:
from sklearn.ensemble import RandomForestClassifier,StackingClassifier
classification_model(RandomForestClassifier(random_state=5),x,y)

0.9631268436578171

In [20]:
from sklearn.linear_model import LogisticRegression
classification_model(LogisticRegression(max_iter=10000),x,y)

0.9507995652848935

In [21]:
classification_model(XGBClassifier(n_estimators=500,max_depth=2,learning_rate=0.1 ),x,y)

0.9701133364384411

Correlation between the ensemble's models

In [23]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=45)
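def y_pred(model):
  # Fit on the training split, print the test accuracy, and return the test predictions
  model.fit(x_train,y_train)
  pred = model.predict(x_test)  # local name no longer shadows the function name
  score = accuracy_score(y_test,pred)  # signature is accuracy_score(y_true, y_pred)
  print(score)
  return pred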

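- Checking diversity
  - Each model makes its own predictions, and we check the correlation between those predictions
  - Highly correlated models are likely to make similar predictions.
- Avoiding overfitting
  - An ensemble of highly correlated models may capture only similar patterns, so there is a risk of overfitting
- Improving performance
  - Choose models whose predictions have low correlation with one another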
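
The idea of measuring diversity through correlation can be sketched on toy data. The three prediction vectors below are hypothetical; the disagreement rate is shown alongside the correlation as one possible complementary measure.

```python
import numpy as np

# Hypothetical predictions from three models on eight samples.
model_a = np.array([1, 0, 1, 1, 0, 1, 0, 1])
model_b = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # identical to model_a
model_c = np.array([1, 0, 0, 1, 0, 1, 1, 1])  # differs from model_a on two samples

# Pearson correlation between every pair of prediction vectors (rows = models).
corr = np.corrcoef([model_a, model_b, model_c])

# Fraction of samples on which model_a and model_c disagree.
disagreement_ac = np.mean(model_a != model_c)

print(corr.round(3))
print(disagreement_ac)  # 0.25
```

Two models with correlation 1.0 add nothing to each other in a vote, while a model that disagrees on some samples can correct the others' errors there.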


In [31]:
classifier_lists = []
classifier_lists.append( y_pred(XGBClassifier(booster ='gbtree')) )
classifier_lists.append( y_pred(XGBClassifier(booster ='dart')) )
classifier_lists.append( y_pred(RandomForestClassifier(random_state=5)) )
classifier_lists.append( y_pred(LogisticRegression(max_iter=10000)) )
classifier_lists.append( y_pred(XGBClassifier(n_estimators=500,max_depth=2,learning_rate=0.1 )) )

0.9790209790209791
0.9790209790209791
0.965034965034965
0.951048951048951
0.972027972027972


In [34]:
import pandas as pd
classifier_lists

[array([1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
        1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
        0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
        1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]),
 array([1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
        1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
        0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
        1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]),
 array([1, 0, 0, 1, 1, 1, 

In [51]:
import numpy as np
np.c_[classifier_lists].shape  # stacks the prediction arrays top to bottom, one row per model
df = pd.DataFrame(data = np.c_[classifier_lists]
                  , index=['gbtree','dart','forest','logistic','xgb']
                  )
df = df.T
df.head()

,gbtree,dart,forest,logistic,xgb
0,1,1,1,1,1
1,0,0,0,0,0
2,0,0,0,0,0
3,1,1,1,1,1
4,1,1,1,1,1


In [52]:
df.corr()

,gbtree,dart,forest,logistic,xgb
gbtree,1.0,1.0,0.969523,0.938952,0.98481
dart,1.0,1.0,0.969523,0.938952,0.98481
forest,0.969523,0.969523,1.0,0.908191,0.954195
logistic,0.938952,0.938952,0.908191,1.0,0.95377
xgb,0.98481,0.98481,0.954195,0.95377,1.0
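
To read the matrix systematically, one option is to rank the model pairs by correlation. The sketch below re-enters the values from the table above by hand; the ranking approach itself is an illustrative assumption, not a step from the original notebook.

```python
import numpy as np
import pandas as pd

names = ['gbtree', 'dart', 'forest', 'logistic', 'xgb']
# Values copied from the df.corr() output above.
corr = pd.DataFrame(
    [[1.0, 1.0, 0.969523, 0.938952, 0.98481],
     [1.0, 1.0, 0.969523, 0.938952, 0.98481],
     [0.969523, 0.969523, 1.0, 0.908191, 0.954195],
     [0.938952, 0.938952, 0.908191, 1.0, 0.95377],
     [0.98481, 0.98481, 0.954195, 0.95377, 1.0]],
    index=names, columns=names,
)

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()

least_correlated = pairs.index[0]
print(pairs.head(3))
print(least_correlated)  # ('forest', 'logistic')
```

In this table gbtree and dart have a correlation of 1.0, meaning their test predictions were identical, so including both in a vote would add no diversity.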


Voting: vote-based ensemble

In [55]:
estimators = []
logistic_model = LogisticRegression(max_iter=10000)
estimators.append( ('logistic', logistic_model)  )
rf_model = RandomForestClassifier(random_state=5)
estimators.append( ('rf', rf_model)  )
dart_model = XGBClassifier(booster ='dart')
estimators.append( ('dart', dart_model)  )

In [54]:
from sklearn.ensemble import VotingClassifier

In [56]:
ensemble_vote_model =  VotingClassifier(estimators)

In [59]:
scores = cross_val_score(ensemble_vote_model,x,y,cv=StratifiedKFold())
scores.mean()

0.9754230709517155
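
`VotingClassifier` defaults to `voting='hard'` (majority of predicted labels); `voting='soft'` averages predicted probabilities instead. The difference can be sketched by hand on hypothetical probabilities, which are toy values and not outputs of the models above.

```python
import numpy as np

# Hypothetical P(class=1) from three classifiers on four samples (rows = models).
proba = np.array([
    [0.90, 0.40, 0.45, 0.20],
    [0.60, 0.45, 0.40, 0.30],
    [0.30, 0.95, 0.90, 0.60],
])

# Hard voting: threshold each model first, then take the majority of the labels.
hard = ((proba > 0.5).sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold once.
soft = (proba.mean(axis=0) > 0.5).astype(int)

print(hard)  # [1 0 0 0]
print(soft)  # [1 1 1 0]
```

On the second and third samples one very confident model outweighs two mildly negative ones under soft voting, which is why soft voting is usually considered only when the base models produce reasonably calibrated probabilities.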

Stacking
  - Meta-model: the model that makes the final prediction from the predictions of the individually trained base models
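
The role of the meta-model can be sketched by building a stack by hand. This is a simplified illustration under stated assumptions: the data is synthetic (`make_classification`), the base models are a logistic regression and a shallow decision tree chosen only to keep the sketch light, and it approximates rather than reproduces what `StackingClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Level 0: out-of-fold probabilities, so the meta-model is trained on
# predictions that each base model made for samples it did not see while fitting.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Refit each base model on the full training split to build the test-time inputs.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level 1: the meta-model learns how to combine the base predictions.
meta_model = LogisticRegression().fit(meta_train, y_tr)
stacked_accuracy = meta_model.score(meta_test, y_te)
print(meta_train.shape, stacked_accuracy)
```

Using out-of-fold predictions for `meta_train` matters: if the base models predicted on their own training data, the meta-model would learn from overly optimistic inputs.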

In [60]:
from sklearn.ensemble import StackingClassifier

In [61]:
estimators = []
logistic_model = LogisticRegression(max_iter=10000)
estimators.append( ('logistic', logistic_model)  )
rf_model = RandomForestClassifier(random_state=5)
estimators.append( ('rf', rf_model)  )
dart_model = XGBClassifier(booster ='dart')
estimators.append( ('dart', dart_model)  )
# Define the meta-model
meta_model = LogisticRegression()
# Stacking ensemble model
stacking_model =  StackingClassifier(estimators,final_estimator = meta_model)
scores = cross_val_score(stacking_model,x,y,cv=StratifiedKFold())
scores.mean()

0.9789318428815401