
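# Boosting
: Weak learners are trained and used for prediction sequentially; samples that were predicted incorrectly receive larger weights, so each subsequent learner focuses on correcting the earlier errors.
- Representative implementations: AdaBoost, Gradient Boosting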


# 1. AdaBoost

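- Trains and predicts sequentially, increasing the weights of the misclassified samples at each step
- At the end, combines all the weak learners, each weighted according to its error, to make the final prediction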
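
The two steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: it assumes labels in {-1, +1}, one-dimensional inputs, and decision stumps as weak learners. The helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for this example.

```python
import math


def fit_stump(xs, ys, weights):
    """Return the (threshold, polarity, weighted_error) stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best


def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x <= threshold else -polarity


def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        err = min(max(stump[2], 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)      # learner weight: lower error, larger say
        ensemble.append((alpha, stump))
        # Misclassified samples (y * prediction == -1) get heavier, correct ones lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
train_preds = [adaboost_predict(ensemble, x) for x in xs]
print(train_preds == ys)
```

No single stump gets below three training errors on this toy set, but the weighted vote of three stumps classifies every point correctly, which is the effect the reweighting is meant to produce.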

# 2. GBM (Gradient Boosting Machine)
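- Similar to AdaBoost, but the weight update is driven by gradient descent
- Error = actual value - predicted value
- h(x) = y - F(x)
- Gradient descent: iteratively updates the weights in the direction that reduces the error h(x) (for the squared-error loss (1/2)(y - F(x))^2, this residual is the negative gradient with respect to F(x))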
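
A minimal sketch of the residual-fitting loop described above, in plain Python. It assumes a regression target with squared-error loss and decision stumps as weak learners, purely to keep the arithmetic visible; `GradientBoostingClassifier` uses a log-loss variant of the same idea. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def gbm_fit(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                 # F_0(x): constant initial prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - f for y, f in zip(ys, preds)]    # h(x) = y - F(x)
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Move F(x) a small step toward the residuals.
        preds = [f + learning_rate * (left_value if x <= threshold else right_value)
                 for x, f in zip(xs, preds)]
    return base, stumps, learning_rate


def gbm_predict(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (hypothetical): a noisy three-level step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
model = gbm_fit(xs, ys)
base_mse = mse(ys, [model[0]] * len(ys))
final_mse = mse(ys, [gbm_predict(model, x) for x in xs])
print(base_mse, final_mse)
```

Each round fits a new stump to whatever error the current ensemble still makes, so the training error shrinks step by step; `learning_rate` controls how much of each correction is applied.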


In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
data_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=0)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time 
import warnings
warnings.filterwarnings('ignore')

start_time = time.time()

gb_clf = GradientBoostingClassifier(random_state=0)
gb_clf.fit(x_train, y_train)
gb_pred = gb_clf.predict(x_test)
gb_acc = accuracy_score(y_test, gb_pred)

print('GBM accuracy: {0:.4f}'.format(gb_acc))
print('GBM elapsed time: {0:.1f} sec'.format(time.time() - start_time))

GBM accuracy: 0.9649
GBM elapsed time: 0.4 sec


In [None]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators':[100, 500],
    'learning_rate':[0.05, 0.1]
}

grid_cv = GridSearchCV(gb_clf, param_grid = params, cv=2, verbose=1)
grid_cv.fit(x_train, y_train)
print('Best hyperparameters: ', grid_cv.best_params_)
print('Best cross-validation accuracy: {0:.4f}'.format(grid_cv.best_score_))

Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    2.9s finished


Best hyperparameters:  {'learning_rate': 0.1, 'n_estimators': 500}
Best cross-validation accuracy: 0.9516


In [None]:
grid_pred = grid_cv.best_estimator_.predict(x_test)
accuracy = accuracy_score(y_test, grid_pred)
print('GridCV accuracy: {0:.4f}'.format(accuracy))

GridCV accuracy: 0.9737


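# 3. XGBoost (eXtreme Gradient Boosting)
- Strong predictive performance
- Faster training than GBM
- Regularization to curb overfitting
- Tree pruning
- Built-in cross-validation
- Native handling of missing values

- Key hyperparameters of the scikit-learn XGBoost wrapper
    - n_estimators, learning_rate, max_depth
    - Early stopping: pass early_stopping_rounds to fit() together with eval_metric and eval_set
    - (Setting early_stopping_rounds too low may degrade predictive performance)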
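
The early-stopping rule, and the risk of setting it too low, can be sketched without XGBoost. The function below is a simplified stand-in for the idea, not XGBoost's internal code; `early_stopping_round` and the loss values are hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, last_round_run) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience` consecutive rounds.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(val_losses) - 1


# Hypothetical validation losses: a plateau after round 4, then a late improvement at round 8.
losses = [0.60, 0.50, 0.45, 0.46, 0.44, 0.47, 0.48, 0.49, 0.43]

print(early_stopping_round(losses, patience=2))   # → (4, 6)
print(early_stopping_round(losses, patience=5))   # → (8, 8)
```

With `patience=2` training halts at round 6 and never reaches the better loss at round 8, which is the kind of degradation the note above warns about.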


In [20]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
data_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=156)

# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    # ROC-AUC added
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    # ROC-AUC added to the printed metrics
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [22]:
from xgboost import XGBClassifier

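# The keyword is n_estimators; a misspelled n_estimator is not recognized as the tree count,
# so the output recorded below likely reflects the default number of trees.
xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
xgb_wrapper.fit(x_train, y_train)
w_preds = xgb_wrapper.predict(x_test)
w_pred_proba = xgb_wrapper.predict_proba(x_test)[:,1] # predict_proba() returns class probabilities; column 1 is the positive class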

get_clf_eval(y_test, w_preds, w_pred_proba)

Confusion matrix
[[34  3]
 [ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9947


In [24]:
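# Retrain with early_stopping_rounds set to 10
# (in XGBoost 2.0 and later, early_stopping_rounds and eval_metric are set on XGBClassifier itself rather than passed to fit())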

evals = [(x_test, y_test)]
xgb_wrapper.fit(x_train, y_train, early_stopping_rounds=10, eval_metric='logloss', eval_set=evals, verbose=True)
ws10_preds = xgb_wrapper.predict(x_test)
ws10_pred_proba = xgb_wrapper.predict_proba(x_test)[:,1]
get_clf_eval(y_test, ws10_preds, ws10_pred_proba)

# Training stopped after only 62 iterations. Prediction accuracy is the same as before early stopping

[0]	validation_0-logloss:0.61352
Will train until validation_0-logloss hasn't improved in 10 rounds.
[1]	validation_0-logloss:0.547842
[2]	validation_0-logloss:0.494247
[3]	validation_0-logloss:0.447986
[4]	validation_0-logloss:0.409109
[5]	validation_0-logloss:0.374977
[6]	validation_0-logloss:0.345714
[7]	validation_0-logloss:0.320529
[8]	validation_0-logloss:0.29721
[9]	validation_0-logloss:0.277991
[10]	validation_0-logloss:0.260302
[11]	validation_0-logloss:0.246037
[12]	validation_0-logloss:0.231556
[13]	validation_0-logloss:0.22005
[14]	validation_0-logloss:0.208572
[15]	validation_0-logloss:0.199993
[16]	validation_0-logloss:0.190118
[17]	validation_0-logloss:0.181818
[18]	validation_0-logloss:0.174729
[19]	validation_0-logloss:0.167657
[20]	validation_0-logloss:0.158202
[21]	validation_0-logloss:0.154725
[22]	validation_0-logloss:0.148947
[23]	validation_0-logloss:0.143308
[24]	validation_0-logloss:0.136344
[25]	validation_0-logloss:0.132778
[26]	validation_0-logloss:0.127912


# 4. LightGBM

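- Faster and more memory-efficient than GBM and XGBoost, with comparable predictive performance
- Drawback: prone to overfitting on small datasets
- Uses leaf-wise tree growth, which can minimize prediction loss but may overfit (most other tree-based algorithms use level-wise, balanced tree growth)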
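
The difference between leaf-wise and level-wise growth can be illustrated with a toy simulation. This is not LightGBM code: the split gains are invented, and `child_gains` encodes the assumption that one child of every split keeps most of the remaining gain. All names are hypothetical.

```python
import heapq


def child_gains(gain):
    # Hypothetical rule: after a split, one child still has a lot to gain, the other little.
    return 0.6 * gain, 0.2 * gain


def grow_leaf_wise(root_gain, n_splits):
    """Always split the leaf with the largest gain. Returns (total_gain, tree_depth)."""
    heap = [(-root_gain, 0)]                 # max-heap via negated gain; entries are (-gain, depth)
    total, max_depth = 0.0, 0
    for _ in range(n_splits):
        neg_gain, depth = heapq.heappop(heap)
        total += -neg_gain
        max_depth = max(max_depth, depth + 1)
        for g in child_gains(-neg_gain):
            heapq.heappush(heap, (-g, depth + 1))
    return total, max_depth


def grow_level_wise(root_gain, n_splits):
    """Split every leaf of a level before going deeper. Returns (total_gain, tree_depth)."""
    level = [root_gain]
    total, depth, used = 0.0, 0, 0
    while used < n_splits:
        next_level = []
        for g in level:
            if used == n_splits:
                break
            total += g
            used += 1
            next_level.extend(child_gains(g))
        depth += 1
        level = next_level
    return total, depth


leaf_gain, leaf_depth = grow_leaf_wise(1.0, n_splits=7)
level_gain, level_depth = grow_level_wise(1.0, n_splits=7)
print(leaf_gain, leaf_depth)
print(level_gain, level_depth)
```

With the same budget of seven splits, the leaf-wise tree collects more total gain but grows deeper and less balanced. That deeper, more specialized tree is why leaf-wise growth fits the training data more closely and can overfit small datasets.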

In [25]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
data_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=156)

In [27]:
from lightgbm import LGBMClassifier

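# The keyword is n_estimators; a misspelled n_estimator is not recognized as the tree count
lgbm_wrapper = LGBMClassifier(n_estimators=400)
evals = [(x_test, y_test)]
# In LightGBM 4.0 and later, early stopping and logging are passed to fit() via callbacks
# (lightgbm.early_stopping, lightgbm.log_evaluation) instead of these keyword arguments
lgbm_wrapper.fit(x_train, y_train, early_stopping_rounds=100, eval_metric = 'logloss', eval_set=evals, verbose=True)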
preds = lgbm_wrapper.predict(x_test)
pred_proba = lgbm_wrapper.predict_proba(x_test)[:,1]

get_clf_eval(y_test, preds, pred_proba)

[1]	valid_0's binary_logloss: 0.565079	valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's binary_logloss: 0.507451	valid_0's binary_logloss: 0.507451
[3]	valid_0's binary_logloss: 0.458489	valid_0's binary_logloss: 0.458489
[4]	valid_0's binary_logloss: 0.417481	valid_0's binary_logloss: 0.417481
[5]	valid_0's binary_logloss: 0.385507	valid_0's binary_logloss: 0.385507
[6]	valid_0's binary_logloss: 0.355846	valid_0's binary_logloss: 0.355846
[7]	valid_0's binary_logloss: 0.330897	valid_0's binary_logloss: 0.330897
[8]	valid_0's binary_logloss: 0.306923	valid_0's binary_logloss: 0.306923
[9]	valid_0's binary_logloss: 0.28776	valid_0's binary_logloss: 0.28776
[10]	valid_0's binary_logloss: 0.26917	valid_0's binary_logloss: 0.26917
[11]	valid_0's binary_logloss: 0.250954	valid_0's binary_logloss: 0.250954
[12]	valid_0's binary_logloss: 0.23847	valid_0's binary_logloss: 0.23847
[13]	valid_0's binary_logloss: 0.225865	valid_0's bi