<a href="https://colab.research.google.com/github/junieberry/ML-PerfectGuide/blob/main/04.%20%EB%B6%84%EB%A5%98/06%20XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### XGBoost 개요
- GBM 기반

<br>

- 뛰어난 예측 성능
- GBM 대비 빠른 수행 시간
- Regularization
- Tree Pruning (이득이 없는 분할 가지치기)
- 교차 검증 내장

In [7]:
!pip install -q xgboost==0.80
import xgboost

### 파이썬 래퍼 XGBoost 하이퍼 파라미터

1. 일반 파라미터
    
    - booster : tree 기반 or linear 기반
    - silent : 출력 메시지 안 찍을때
    - nthread : CPU 실행 스레드 개수 조정

2. 부스터 파라미터

    - eta : GBM의 learning rate와 같다
    - num_boost_rounds : GBM의 n_estimators
    - *min_child_weight : 트리에서 추가적으로 가지 나눌지 결정하기 위한 데이터들의 weight 총합*
    - *gamma : 트리 리프 노드를 추가적으로 나눌지 결정할 최소 손실 감소 값*
    - max_depth
    - sub_sample
    - colsample_bytree
    - lambda
    - alpha
    - scale_pos_weight

3. 학습 태스크 파라미터

    - objective
    - binary:logistic
    - multi:softmax
    - multi:softprob
    - eval_metric

<br>

---

**과적합 해결**
1. eta 낮추기
2. max_depth 낮추기
3. min_child_weight 높이기
4. gamma 높이기

In [8]:
xgboost.__version__

'0.4'

### 파이썬 래퍼 XGBoost 적용 : 위스콘신 유방암 예측

In [11]:
import xgboost as xgb
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

dataset = load_breast_cancer()
x_features= dataset.data
y_label = dataset.target

cancer_df = pd.DataFrame(data=x_features, columns=dataset.feature_names)
cancer_df['target']= y_label
cancer_df.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0


In [12]:
print(dataset.target_names)
print(cancer_df['target'].value_counts())

['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64


In [15]:
x_train, x_test, y_train, y_test=train_test_split(x_features, y_label, test_size=0.2, random_state=156 )
print(x_train.shape , x_test.shape)

(455, 30) (114, 30)


In [16]:
## XGBoost 전용 데이터 세트
dtrain = xgb.DMatrix(data=x_train , label=y_train)
dtest = xgb.DMatrix(data=x_test , label=y_test)

In [17]:
params = { 'max_depth':3, # 트리 최대 깊이
           'eta': 0.1, # learning rate
           'objective':'binary:logistic', # 목적 함수
           'eval_metric':'logloss', # 오류 함수 평가 성능 지표
           'early_stoppings':100 # 조기 중단ㄴ 할 수 있는 최소 반복 횟수
          # 이때 무조건 eval_set과 eval_metric이 정의 되어야 한다
        }
num_rounds = 400

In [None]:
# train 데이터 셋은 train , evaluation 데이터 셋은 eval 
wlist = [(dtrain,'train'),(dtest,'eval') ]
# 하이퍼 파라미터와 early stopping 파라미터를 train( ) 함수의 파라미터로 전달
xgb_model = xgb.train(params = params , dtrain=dtrain , num_boost_round=num_rounds , evals=wlist )

In [24]:
pred_probs = xgb_model.predict(dtest)

print(np.round(pred_probs[:10],3))

preds = [ 1 if x>0.5 else 0 for x in pred_probs]
print(preds[:10])

[0.95  0.003 0.887 0.066 0.993 1.    1.    0.999 0.997 0.   ]
[1, 0, 1, 0, 1, 1, 1, 1, 1, 0]


In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score

def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    roc_auc = roc_auc_score(y_test, pred)
    print('오차 행렬')
    print(confusion)
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f},\
    F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [25]:
get_clf_eval(y_test, preds)

오차 행렬
[[35  2]
 [ 1 76]]
정확도: 0.9737, 정밀도: 0.9744, 재현율: 0.9870,    F1: 0.9806, AUC:0.9665


## 사이킷런 래퍼 XGBoost의 개요 및 적용



In [None]:
from xgboost import XGBClassifier

evals = [(x_test, y_test)]
xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
xgb_wrapper.fit(x_train , y_train,  early_stopping_rounds=400,eval_set=evals, eval_metric="logloss",  verbose=True)
w_preds = xgb_wrapper.predict(x_test)

In [30]:
get_clf_eval(y_test, w_preds)

오차 행렬
[[35  2]
 [ 1 76]]
정확도: 0.9737, 정밀도: 0.9744, 재현율: 0.9870,    F1: 0.9806, AUC:0.9665


In [31]:
from xgboost import XGBClassifier

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
evals = [(x_test, y_test)]
xgb_wrapper.fit(x_train, y_train, early_stopping_rounds=100, eval_metric="logloss",eval_set=evals, verbose=True)
ws100_preds = xgb_wrapper.predict(x_test)

Will train until validation_0 error hasn't decreased in 100 rounds.
[0]	validation_0-logloss:0.613520
[1]	validation_0-logloss:0.547843
[2]	validation_0-logloss:0.494248
[3]	validation_0-logloss:0.447986
[4]	validation_0-logloss:0.409109
[5]	validation_0-logloss:0.374977
[6]	validation_0-logloss:0.345714
[7]	validation_0-logloss:0.320529
[8]	validation_0-logloss:0.297210
[9]	validation_0-logloss:0.277991
[10]	validation_0-logloss:0.260302
[11]	validation_0-logloss:0.246037
[12]	validation_0-logloss:0.231556
[13]	validation_0-logloss:0.220050
[14]	validation_0-logloss:0.208572
[15]	validation_0-logloss:0.199993
[16]	validation_0-logloss:0.190118
[17]	validation_0-logloss:0.181818
[18]	validation_0-logloss:0.174729
[19]	validation_0-logloss:0.167657
[20]	validation_0-logloss:0.158202
[21]	validation_0-logloss:0.154432
[22]	validation_0-logloss:0.148656
[23]	validation_0-logloss:0.141245
[24]	validation_0-logloss:0.136042
[25]	validation_0-logloss:0.132487
[26]	validation_0-logloss:0.1276

213때부터 100 동안 성능이 나아지지 않아서 멈췄다~