<a href="https://colab.research.google.com/github/juhee3199/Machine-learning_advanced-study/blob/juhee3199-basic_code/classification/Ensemble-Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boosting
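: Weak learners are trained and evaluated sequentially; each round assigns larger weights to the samples that were mispredicted, so the ensemble corrects its errors step by step.
- Representative implementations: AdaBoost, Gradient Boosting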


# 1. AdaBoost

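- Trains and predicts sequentially, increasing the weights of the samples each weak learner gets wrong
- For the final prediction, combines all weak learners, each weighted by how well it performed on the reweighted data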
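
The two steps above can be sketched in plain Python. This is a minimal illustration of discrete AdaBoost with one-feature threshold rules (decision stumps), not scikit-learn's implementation; the helper names `stump_predict`, `fit_adaboost`, and `predict`, and the ten-point toy dataset, are assumptions made up for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def fit_adaboost(xs, ys, n_rounds=3):
    """Fit discrete AdaBoost on 1-D inputs with labels in {+1, -1}."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform sample weights
    values = sorted(set(xs))
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    thresholds = [values[0] - 1] + midpoints + [values[-1] + 1]
    ensemble = []
    for _ in range(n_rounds):
        # Pick the stump with the smallest weighted error.
        best = None
        for t in thresholds:
            for p in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, p) != y)
                if best is None or err < best[0]:
                    best = (err, t, p)
        err, t, p = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
        ensemble.append((alpha, t, p))
        # Raise the weights of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, p))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Final prediction: sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = fit_adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions == ys)
```

A single stump misclassifies at least three of these ten points, while the weighted vote of three stumps classifies all of them correctly, which is the effect of reweighting the errors between rounds.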

# 2. GBM (Gradient Boosting Machine)
- Similar to AdaBoost, but the weight update is done with gradient descent
- Error = actual value - predicted value
- h(x) = y - F(x)
- Gradient descent: iteratively updates the weights in the direction that minimizes the error h(x)
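
As a rough sketch of the idea above, the following plain-Python example fits each new weak learner to the residual h(x) = y - F(x) and adds a scaled copy of it to the model. It assumes squared-error loss (for which the residual is the negative gradient), a 1-D regression target, and at least two distinct x values; the helpers `fit_stump`, `fit_gbm`, `predict_gbm`, and `mse` are hypothetical names for this illustration, not part of scikit-learn.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump: one split, one constant per side.

    Assumes xs contains at least two distinct values.
    """
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump is fit to y - F(x)."""
    f0 = sum(ys) / len(ys)  # initial model F_0: the mean of y
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]  # h(x) = y - F(x)
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        # Move F(x) a small step toward the residuals.
        preds = [p + learning_rate * (left_mean if x < t else right_mean)
                 for x, p in zip(xs, preds)]
    return f0, learning_rate, stumps

def predict_gbm(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(l if x < t else r for t, l, r in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy step function: y jumps from 1 to 5 halfway through.
xs = list(range(10))
ys = [1.0] * 5 + [5.0] * 5
baseline = fit_gbm(xs, ys, n_estimators=0)  # F_0 only
model = fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

With `learning_rate=0.1`, each round removes about 10% of the remaining residual on this toy data, so the training error shrinks gradually rather than in one step. This is the same trade-off between `n_estimators` and `learning_rate` that the grid search later in this section explores.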


In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset (binary classification)
cancer = load_breast_cancer()
data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# Hold out 20% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=0)

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time
import warnings
warnings.filterwarnings('ignore')

# Time the fit, prediction, and scoring of a GBM with default hyperparameters
start_time = time.time()

gb_clf = GradientBoostingClassifier(random_state=0)
gb_clf.fit(x_train, y_train)
gb_pred = gb_clf.predict(x_test)
gb_acc = accuracy_score(y_test, gb_pred)

print('GBM accuracy: {0:.4f}'.format(gb_acc))
print('GBM run time: {0:.1f} sec'.format(time.time() - start_time))

GBM accuracy: 0.9649
GBM run time: 0.4 sec


In [9]:
from sklearn.model_selection import GridSearchCV

# Search 2 x 2 = 4 hyperparameter combinations
params = {
    'n_estimators':[100, 500],
    'learning_rate':[0.05, 0.1]
}

# cv=2: each combination is scored with 2-fold cross-validation on the training set
grid_cv = GridSearchCV(gb_clf, param_grid = params, cv=2, verbose=1)
grid_cv.fit(x_train, y_train)
print('Best hyperparameters: ', grid_cv.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print('Best mean CV accuracy: {0:.4f}'.format(grid_cv.best_score_))

Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    2.9s finished


Best hyperparameters:  {'learning_rate': 0.1, 'n_estimators': 500}
Best mean CV accuracy: 0.9516


In [13]:
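# best_estimator_ was refit on the full training set (GridSearchCV uses refit=True by default)
grid_pred = grid_cv.best_estimator_.predict(x_test)
accuracy = accuracy_score(y_test, grid_pred)
print('GridSearchCV test accuracy: {0:.4f}'.format(accuracy))

GridSearchCV test accuracy: 0.9737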


# 3. XGBoost (eXtreme Gradient Boosting)

