https://lsjsj92.tistory.com/522?category=853217
https://lsjsj92.tistory.com/523?category=853217


## Boosting
Boosting trains weak learners sequentially and gives more weight to the samples that the previous learner predicted incorrectly.
e.g. AdaBoost, gradient boosting

- AdaBoost

Assigns higher weights to the samples it failed to predict correctly, then combines all the learners at the end into a single predictive model.


![image of ](https://miro.medium.com/max/1700/0*paPv7vXuq4eBHZY7.png)
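A minimal sketch of one AdaBoost round in plain Python may make the reweighting concrete. It assumes the classic discrete AdaBoost update (learner weight `alpha = 0.5 * ln((1 - err) / err)`); the function name `adaboost_round` and the toy data are made up for illustration and are not from the original notes.

```python
import math


def adaboost_round(weights, is_correct):
    """One AdaBoost round: return (alpha, renormalized sample weights).

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight misclassified samples, down-weight correctly classified ones.
    new_weights = [w * math.exp(-alpha if ok else alpha)
                   for w, ok in zip(weights, is_correct)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


# Five samples with equal weight; the weak learner gets only the last one wrong.
weights = [0.2] * 5
is_correct = [True, True, True, True, False]
alpha, weights = adaboost_round(weights, is_correct)
print(alpha, weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.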

- Gradient Boosting

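Similar to AdaBoost, but the update is done with gradient descent: the model is updated in the direction that minimizes the residual (actual value - predicted value).
scikit-learn's GB implementation takes a long time to train and does not support parallel processing, which makes it even slower. Hyperparameter tuning also tends to take a long time.
It shares the same tree hyperparameters as RF (random forest).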
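The residual-driven update can be sketched in plain Python. This is only an illustration under stated assumptions: a 1-D regression problem, (half) squared-error loss, and one-split stumps as weak learners. The helper names `fit_stump`, `gradient_boost`, and `mse`, and the toy data, are hypothetical and not scikit-learn's implementation.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Return a predictor built by repeatedly fitting stumps to residuals."""
    base = sum(y) / len(y)  # initial constant prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For (half) squared error, the negative gradient is (actual - predicted).
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take a small step in the direction that shrinks the residuals.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(predict, x, y):
    """Mean squared error of a predictor on (x, y)."""
    return sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2]
mean_y = sum(y) / len(y)
baseline_mse = mse(lambda xi: mean_y, x, y)
boosted_mse = mse(gradient_boost(x, y), x, y)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

Each stump only has to explain what the ensemble so far still gets wrong, which is why the training error drops below that of the constant baseline.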

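- additional hyperparameters
 1. loss: the loss function GB optimizes. Unless there is a specific reason, keep the default, deviance (renamed log_loss in scikit-learn 1.1 and later)
 2. learning_rate: the learning rate. If it is too small, learning is slow; if it is too large, the updates can overshoot. Values between 0.05 and 0.2 are typical
 3. n_estimators: the number of weak learners. The default is 100; more learners can improve model performance but cost more time
 4. subsample: the fraction of the training data each weak learner samples. The default is 1, meaning the full training set is used; 0.5 uses only 50%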
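As a rough illustration, the hyperparameters listed above could be set explicitly like this. The synthetic dataset from `make_classification` and the specific values are assumptions for the sketch, not tuned recommendations; `loss` is left at its default because its name differs across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(
    learning_rate=0.1,  # step size applied to each tree's contribution
    n_estimators=100,   # number of weak learners (trees)
    subsample=0.5,      # each tree is fitted on a random 50% of the training rows
    random_state=0,
)
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.4f}")
```

Setting `subsample` below 1 gives stochastic gradient boosting, which usually trades a little bias for lower variance and shorter fit times.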


- XGBoost (eXtreme Gradient Boosting)

Based on GBM, but faster, and it adds regularization to address the overfitting problem.
Tree pruning removes branches that bring no gain, and a cross-validation routine is built in. Early stopping is also supported.
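How regularization and pruning interact can be sketched with the split-gain formula from the XGBoost paper and documentation. This is a simplified stand-alone function, not the library's code; `gamma` is XGBoost's minimum split loss parameter, which is not in the list below, and the gradient/Hessian sums are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients (g) and Hessians (h)."""
    def score(g, h):
        # L2 regularization (reg_lambda) shrinks the score of every leaf.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    # gamma is a fixed penalty paid for adding one more leaf.
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


useful = split_gain(-4.0, 4.0, 4.0, 4.0)              # children clearly differ
penalized = split_gain(-4.0, 4.0, 4.0, 4.0, gamma=5.0)  # same split, high leaf penalty
print(useful, penalized)
```

A split whose gain is not positive brings no benefit, so the corresponding branch is a candidate for pruning; raising `reg_lambda` or `gamma` makes the trees more conservative.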

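- hyperparameters

1. booster: gbtree (tree-based model, the default) or gblinear (linear model)
2. silent: set to 1 to suppress output messages (newer XGBoost releases use verbosity instead)
3. nthread: number of CPU threads to use; by default all available threads are used
4. n_estimators: same as in GBM
5. learning_rate
6. max_depth
7. subsample
8. reg_lambda: L2 regularization strength, default = 1
9. reg_alpha: L1 regularization strength, default = 0
10. objective (loss function): binary:logistic (binary classification), multi:softmax (multiclass classification)
11. eval_metric: the metric used for validation. The default is rmse for regression and error for classification (logloss in XGBoost 1.3 and later)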
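For reference, the parameters above might be collected as keyword arguments using the spellings of the scikit-learn wrapper. The values for `n_estimators`, `learning_rate`, and `max_depth` are the ones used later in these notes; the rest are assumed defaults. The `XGBClassifier` call is left as a comment so the snippet runs without xgboost installed, and passing `eval_metric` to the constructor is an assumption that holds only for reasonably recent releases.

```python
# Keyword arguments using the scikit-learn wrapper's parameter names.
xgb_kwargs = {
    "booster": "gbtree",             # tree-based model
    "n_estimators": 400,             # number of boosting rounds
    "learning_rate": 0.1,
    "max_depth": 3,
    "subsample": 1.0,                # note: one word, no underscore
    "reg_lambda": 1.0,               # L2 regularization
    "reg_alpha": 0.0,                # L1 regularization
    "objective": "binary:logistic",  # binary classification
    "eval_metric": "error",
}

# Hypothetical usage, assuming xgboost is installed:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_kwargs)
for name, value in xgb_kwargs.items():
    print(f"{name:>14} = {value}")
```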

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import plot_importance
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
import time

## Gradient Boosting example

In [3]:
def get_human_dataset():
    # features.txt holds "<index> <name>" pairs separated by whitespace.
    # Raw strings (r'\s+') avoid invalid-escape warnings for the regex separator.
    feature_name_df = pd.read_csv('./features.txt', sep=r'\s+', header=None, names=['column_index', 'column_name'])
    feature_name = feature_name_df.iloc[:, 1].values.tolist()

    # Feature matrices use the names read above; label files have a single column.
    X_train = pd.read_csv('./X_train.txt', sep=r'\s+', names=feature_name)
    X_test = pd.read_csv('./X_test.txt', sep=r'\s+', names=feature_name)
    y_train = pd.read_csv('./y_train.txt', sep=r'\s+', header=None, names=['action'])
    y_test = pd.read_csv('./y_test.txt', sep=r'\s+', header=None, names=['action'])
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_human_dataset()



In [4]:
start_time = time.time()

In [20]:
gb_clf = GradientBoostingClassifier(random_state = 0)
gb_clf.fit(X_train, y_train.values.ravel()) # ravel() flattens y_train into a 1-D array
gb_pred = gb_clf.predict(X_test)
acc = accuracy_score(y_test, gb_pred)
print("Accuracy : {0:.4f}".format(acc))
print("Elapsed time (s):", time.time() - start_time)

Accuracy : 0.9376
Elapsed time (s): 2094.0707857608795


Tune with Grid Search. Because it takes a long time, only n_estimators and learning_rate are searched.

In [21]:
params = {
    'n_estimators':[100,300],
    'learning_rate' : [0.05,0.1]
}

grid_cv = GridSearchCV(gb_clf, param_grid = params, cv = 2, verbose = 1)
grid_cv.fit(X_train, y_train.values.ravel())
print('Best parameters :', grid_cv.best_params_)
print('Best CV accuracy :', grid_cv.best_score_)

Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 10.5min finished


Best parameters : {'learning_rate': 0.05, 'n_estimators': 300}
Best CV accuracy : 0.9008433079434167
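The "4 candidates, totalling 8 fits" in the log above follows directly from the grid size and the number of folds. A small sketch that enumerates the same grid with the standard library (the variable names `candidates` and `total_fits` are only illustrative):

```python
from itertools import product

params = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}
cv = 2  # number of cross-validation folds

# Every combination of the listed values is one candidate model.
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv} folds = {total_fits} fits")
```

Adding a third hyperparameter with three values would triple the count, which is why the search here is limited to two parameters. With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set.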


## XGBoost example

In [76]:
def get_clf_eval(y_test, pred):
    # Flatten the single-column label DataFrame into a 1-D array.
    y_test = y_test.values.ravel()
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    # precision/recall/f1 default to average='binary', so this assumes binary labels.
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_score = roc_auc_score(y_test, pred)
    print("Confusion matrix\n")
    print(confusion)
    print("Accuracy : {0:.4f}, precision : {1:.4f}, recall : {2:.4f}, f1-score : {3:.4f}, AUC : {4:.4f}".format(accuracy, precision, recall, f1, roc_score))

In [29]:
data = load_breast_cancer()
X_features = data.data
label = data.target

cancer_df = pd.DataFrame(X_features, columns = data.feature_names)
cancer_df['target'] = label
cancer_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [77]:
# xgb = XGBClassifier(n_estimators=400, learning_rate = 0.1, max_depth = 3)
# xgb.fit(X_train, y_train.values.ravel())
# xgb_pred = xgb.predict(X_test)
get_clf_eval(y_test, xgb_pred)

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

In [85]:
y_test['action'].value_counts()

6    537
5    532
1    496
4    491
2    471
3    420
Name: action, dtype: int64
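The value counts explain the ValueError above: `y_test` still holds the six activity labels from the human-activity dataset, while `precision_score`, `recall_score`, and `f1_score` default to `average='binary'`. Two ways out seem plausible: split the breast cancer data (a binary target) with `train_test_split` and evaluate on that, or keep the multiclass labels and pass an explicit averaging mode such as `average='macro'`. The plain-Python sketch below shows what macro averaging computes; the function `macro_precision` and the toy labels are made up for illustration.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (the idea behind average='macro')."""
    per_class = []
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        # A class that is never predicted contributes a precision of 0 here.
        per_class.append(correct / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)


# Toy three-class labels (not the real predictions from the notebook).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
result = macro_precision(y_true, y_pred)
print(f"macro precision: {result:.4f}")

# The corresponding scikit-learn call would be:
#   precision_score(y_true, y_pred, average='macro')
```

Macro averaging treats every class equally regardless of its size; `average='weighted'` would instead weight each class by its support. Note that `roc_auc_score` on multiclass targets additionally needs class probabilities and a `multi_class` setting, so it cannot simply reuse the predicted labels.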