# XGBoost Hyperparameter Tuning

This post draws on and translates the following article. <br/>
[Complete Guide to Parameter Tuning in XGBoost (with codes in Python)](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

Table of Contents
- Advantages of XGBoost
- Understanding XGBoost Parameters
- Tuning the Parameters


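## 1. Advantages of XGBoost

Compared with conventional GBDT models, XGBoost offers the following features.

- Regularization
- Parallel processing
- High flexibility
- Missing-value handling
- Tree pruning
- Built-in cross validation
- Continued training from an existing model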

## 2. XGBoost Hyperparameters

XGBoost has the following hyperparameters.

- Parameter types
    - General Parameters: guide the overall functioning
    - Booster Parameters: guide the individual booster at each step
    - Learning Task Parameters: guide the optimization performed
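1. General Parameters
    - booster: tree-based model or linear model
    - silent: controls message output
    - nthread: controls parallel processing (number of threads)

2. Booster Parameters
    - eta: learning rate (typically 0.01 - 0.2)
    - min_child_weight: minimum sum of instance weights required in a child; decides whether to split further (too large a value causes underfitting)
    - max_depth: maximum depth of a tree
    - max_leaf_nodes: maximum number of terminal nodes (leaves) in a tree (named max_leaves in the XGBoost API)
    - gamma: minimum loss reduction required to make a split
    - subsample: fraction of rows sampled for each tree (0.5 - 1)
    - colsample_bytree: fraction of columns sampled for each tree (0.5 - 1)
    - colsample_bylevel: fraction of columns sampled at each level
    - lambda: L2 norm regularization
    - alpha: L1 norm regularization
    - scale_pos_weight: relative weight of the positive class versus the negative class
    - and others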

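3. Learning Task Parameters
    - objective: type of objective function
        - binary:logistic (binary classification)
        - multi:softmax (multiclass classification, returns class labels)
        - multi:softprob (multiclass classification, returns class probabilities)
    - eval_metric: evaluation metric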
        - rmse – root mean square error
        - mae – mean absolute error
        - logloss – negative log-likelihood
        - error – Binary classification error rate (0.5 threshold)
        - merror – Multiclass classification error rate
        - mlogloss – Multiclass logloss
        - auc: Area under the curve
    - seed: random number seed
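
To make two of the listed metrics concrete, here is a minimal plain-Python sketch of `error` and `logloss` for binary classification. The function names and the toy labels are made up for illustration; this is not XGBoost's own implementation, only the textbook definitions.

```python
import math

def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of examples misclassified at the given probability threshold."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != bool(y))
    return wrong / len(y_true)

def logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy data (made up): two confident correct predictions, two wrong ones.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(error_rate(y_true, y_prob))
print(logloss(y_true, y_prob))
```

Note that `error` only looks at which side of the threshold a prediction falls on, while `logloss` also penalizes how far the probability is from the true label.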


## 3. Hyperparameter Tuning

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics   # Additional scikit-learn functions
from sklearn.model_selection import GridSearchCV   # Performing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

### Data Cleaning

The goal here is hyperparameter tuning, so preprocessing is kept minimal.

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
train.head()

In [None]:
test.info()

In [None]:
# Drop columns that are complicated to process
del train['Ticket']; del test['Ticket']
del train['Cabin']; del test['Cabin']
del train['Name']; del test['Name']

In [None]:
# train and test may contain different categories, so combine them before encoding
test.insert(loc=1, column='Survived', value=0)
total = pd.concat([train, test], axis=0)

In [None]:
# One hot encoding
sex = pd.get_dummies(total['Sex'])
embarked = pd.get_dummies(total['Embarked'])

In [None]:
# Drop the original columns
del total['Sex']
del total['Embarked']

In [None]:
total = pd.concat([total, sex, embarked], axis=1)
total['Family'] = total['Parch'] + total['SibSp']

In [None]:
# Split back into train and test, now with the one-hot columns
train = total[0:len(train)]
test = total[len(train):]
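
The reason for combining train and test before one-hot encoding can be sketched in plain Python. The helper `one_hot` and the toy values below are hypothetical, chosen only to show that encoding each split separately can yield mismatched columns.

```python
def one_hot(values, categories):
    """Return one row of 0/1 indicators per value, in the order of `categories`."""
    return [[int(v == c) for c in categories] for v in values]

# Toy data (made up): "Q" never appears in the train split.
train_embarked = ["S", "C", "S"]
test_embarked = ["Q", "S"]

# Encoding the train split on its own would only know two categories.
train_only = sorted(set(train_embarked))
# Encoding the combined data gives one shared column set for both splits.
combined = sorted(set(train_embarked + test_embarked))

train_encoded = one_hot(train_embarked, combined)
test_encoded = one_hot(test_embarked, combined)

print(train_only, combined)
print(train_encoded, test_encoded)
```

With the shared category list, both splits end up with the same number and order of columns, which is what the model expects at prediction time.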

In [None]:
train.head()

In [None]:
test.head()

## Building the Model-Fitting Function

In [None]:
target = 'Survived'
IDcol = 'PassengerId'

In [None]:
def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=100):

    # Use xgb.cv with early stopping to pick n_estimators for the current learning rate
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='error', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
        print(alg)

    # Fit the algorithm on the data
    # (eval_metric is omitted: no eval_set is passed, so there is nothing for it to evaluate)
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict on the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report
    print("\nModel Report")
    print("Training Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
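
The way early stopping picks the number of trees can be sketched as follows. This is a simplified plain-Python illustration with made-up error values and a hypothetical helper name, not the logic inside `xgb.cv` itself.

```python
def rounds_with_early_stopping(errors, patience):
    """Return the number of rounds to keep, given per-round validation errors.

    Training stops once `patience` consecutive rounds fail to improve on the
    best error seen so far; the rounds up to and including the best are kept.
    """
    best_round, best_error = 0, float("inf")
    for i, err in enumerate(errors):
        if err < best_error:
            best_round, best_error = i, err
        elif i - best_round >= patience:
            break
    return best_round + 1

# Made-up cross-validated error per boosting round.
cv_errors = [0.30, 0.25, 0.22, 0.21, 0.21, 0.22, 0.23, 0.20]

n_estimators = rounds_with_early_stopping(cv_errors, patience=3)
print(n_estimators)
```

A larger patience lets training ride out a temporary plateau: with `patience=5` the same error curve would continue to the final round.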

## 3. General Approach to Hyperparameter Tuning

1. Choose a relatively high learning rate (0.05 - 0.3) and determine the number of trees suited to that learning rate.
2. Tune the tree-specific parameters.
    - max_depth, min_child_weight, gamma, subsample, colsample_bytree
3. Tune the regularization parameters.
4. Lower the learning rate and repeat.
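
The staged procedure above can be sketched as a loop that tunes one group of parameters at a time while keeping earlier winners fixed. `staged_search` and `toy_score` are hypothetical names; the score function is a made-up stand-in for a cross-validated score, so no models are actually fit here.

```python
def staged_search(score, params, stages):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    best = dict(params)
    for grid in stages:                      # each stage: {name: candidate values}
        candidates = [dict()]
        for name in grid:                    # build the Cartesian product of this stage
            candidates = [dict(c, **{name: v}) for c in candidates for v in grid[name]]
        best = max((dict(best, **c) for c in candidates), key=score)
    return best

# Hypothetical stand-in for a cross-validated score; a real run would fit models.
def toy_score(p):
    return -((p["max_depth"] - 3) ** 2) - (p["gamma"] - 0.2) ** 2 - (p["reg_alpha"] - 0.01) ** 2

start = {"max_depth": 5, "gamma": 0.0, "reg_alpha": 0.0}
stages = [
    {"max_depth": [3, 6, 9]},          # tree-specific parameters first
    {"gamma": [0.0, 0.1, 0.2, 0.3]},
    {"reg_alpha": [1e-5, 1e-2, 0.1]},  # regularization last
]

best = staged_search(toy_score, start, stages)
print(best)
```

Tuning in stages evaluates far fewer combinations than one joint grid, at the cost of possibly missing interactions between parameters tuned in different stages.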

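## 3-1. Fix the learning rate and the number of estimators.

The initial values are chosen as follows.

1. max_depth = 5: a value of 4 - 6 is a common starting point.

2. min_child_weight = 1: this will be tuned later.

3. gamma = 0: starting at 0.1 - 0.2 is also fine, since it will be tuned later anyway.

4. subsample, colsample_bytree = 0.8: a typical starting range is 0.5 - 0.9.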

5. scale_pos_weight = 1: the default, which applies no class reweighting; a larger value is used when the classes are highly imbalanced.
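
For imbalanced data, a commonly suggested starting value for scale_pos_weight is the ratio of negative to positive examples. A minimal sketch, with a hypothetical helper name and made-up label lists:

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples in a 0/1 label list."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives

# Made-up label lists for illustration.
balanced = [0, 1, 0, 1]
imbalanced = [0] * 90 + [1] * 10

print(suggested_scale_pos_weight(balanced))
print(suggested_scale_pos_weight(imbalanced))
```

A ratio near 1 suggests the default of 1 is a reasonable choice; a much larger ratio is a hint that reweighting may be worth trying.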


In [None]:
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate =0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread=-1,
    scale_pos_weight=1,
    seed=2019
)
modelfit(xgb1, train, predictors)

## 3-2. Tune max_depth and min_child_weight.

In [None]:
param_test1 = {
 'max_depth':range(3,10,3),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1, 
                                                  n_estimators=1000, 
                                                  max_depth=5, 
                                                  min_child_weight=1, 
                                                  gamma=0, 
                                                  subsample=0.8, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=-1, 
                                                  scale_pos_weight=1, seed=2019),
                        param_grid = param_test1, scoring='accuracy', n_jobs=-1, cv=5, verbose=10)
gsearch1.fit(train[predictors],train[target])
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
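
To see how much work this grid implies, the combinations can be enumerated with the standard library. This sketch only counts candidates and fits; it does not train anything.

```python
from itertools import product

param_grid = {
    "max_depth": range(3, 10, 3),        # 3, 6, 9
    "min_child_weight": range(1, 6, 2),  # 1, 3, 5
}
cv_folds = 5

names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*(param_grid[n] for n in names))]
total_fits = len(combos) * cv_folds

print(len(combos), total_fits)
```

Each of the 1000-tree models is fit once per fold per combination, so even a small grid like this one can take a while.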

## 3-3. Tune gamma.

In [None]:
param_test2 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, 
                                                  n_estimators=1000, 
                                                  max_depth=3,
                                                  min_child_weight=5, 
                                                  gamma=0, 
                                                  subsample=0.8, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=-1, 
                                                  scale_pos_weight=1,
                                                  seed=2019), 
                        param_grid = param_test2, scoring='accuracy', n_jobs=-1, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
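
What gamma does to a candidate split can be sketched with the second-order gain formula from the XGBoost objective, where G and H are the sums of gradients and hessians in each child. The function names and the numbers below are made up for illustration.

```python
def leaf_score(G, H, lam):
    """Structure score of a leaf with gradient sum G and hessian sum H."""
    return G * G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction of a split minus the gamma penalty."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam) - parent) - gamma

# Toy gradient/hessian sums for the left and right children (made-up numbers).
gain_no_penalty = split_gain(GL=-4.0, HL=3.0, GR=2.0, HR=3.0, gamma=0.0)
gain_with_gamma = split_gain(GL=-4.0, HL=3.0, GR=2.0, HR=3.0, gamma=3.0)

print(gain_no_penalty, gain_with_gamma)
```

With gamma at 0 the split has positive gain and would be kept; with gamma at 3 the same split has negative gain, so a larger gamma makes the tree more conservative.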

## 3-4. Tune subsample and colsample_bytree.


In [None]:
param_test3 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, 
                                                  n_estimators=1000, 
                                                  max_depth=3,
                                                  min_child_weight=5, 
                                                  gamma=0, 
                                                  subsample=0.8, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=-1, 
                                                  scale_pos_weight=1,
                                                  seed=2019), 
                        param_grid = param_test3, scoring='accuracy', n_jobs=-1, cv=5, verbose=10)
gsearch3.fit(train[predictors],train[target])
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
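
The idea behind subsample and colsample_bytree is that each tree sees only a random part of the rows and columns. The sketch below illustrates that idea with the standard library; the helper name is hypothetical and XGBoost's own sampling differs in detail.

```python
import random

def sample_rows_and_columns(n_rows, columns, subsample, colsample_bytree, seed):
    """Pick the row indices and column names one tree would be allowed to see."""
    rng = random.Random(seed)
    n_keep_rows = max(1, round(n_rows * subsample))
    n_keep_cols = max(1, round(len(columns) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = rng.sample(columns, n_keep_cols)
    return rows, cols

columns = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
rows, cols = sample_rows_and_columns(10, columns, subsample=0.8, colsample_bytree=0.6, seed=2019)

print(rows)
print(cols)
```

Because every tree is built from a different random view of the data, lower values add randomness that can reduce overfitting, while values that are too low can starve each tree of information.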

## 3-4-2. Fine-tune subsample

In [None]:
param_test4 = {
 'subsample':[i/100.0 for i in range(40,80)],
}
gsearch4 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, 
                                                  n_estimators=1000, 
                                                  max_depth=3,
                                                  min_child_weight=5, 
                                                  gamma=0, 
                                                  subsample=0.6, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=-1, 
                                                  scale_pos_weight=1,
                                                  seed=2019), 
                        param_grid = param_test4, scoring='accuracy', n_jobs=-1, cv=5, verbose=10)
gsearch4.fit(train[predictors],train[target])
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
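
The fine grid used here (0.40 to 0.79 in steps of 0.01) can be read as a coarse-to-fine refinement around 0.6, the subsample value set in the estimator above. A small sketch of that refinement, with a hypothetical helper name:

```python
def fine_grid(center, half_width, step):
    """Evenly spaced candidates in [center - half_width, center + half_width)."""
    n = round(2 * half_width / step)
    start = center - half_width
    # Round to suppress floating-point noise such as 0.39999999999999997.
    return [round(start + i * step, 10) for i in range(n)]

grid = fine_grid(center=0.6, half_width=0.2, step=0.01)

print(len(grid), grid[0], grid[-1])
```

A grid this fine costs 40 candidates times 5 folds, so it is usually worth running only after the coarse search has narrowed the range.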

## 3-5. Tune the Regularization Parameters

In [None]:
param_test5 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, 
                                                  n_estimators=1000, 
                                                  max_depth=3,
                                                  min_child_weight=5, 
                                                  gamma=0, 
                                                  subsample=0.67, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=-1, 
                                                  scale_pos_weight=1,
                                                  seed=2019), 
                        param_grid = param_test5, scoring='accuracy', n_jobs=-1, cv=5, verbose=10)
gsearch5.fit(train[predictors],train[target])
gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_
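
The effect of the L1 (alpha) and L2 (lambda) penalties on a single leaf can be sketched with the closed-form optimal leaf weight, where G and H are the gradient and hessian sums in the leaf. The function name and the numbers are made up for illustration.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight under L1 (alpha) and L2 (lambda) penalties."""
    if G > reg_alpha:
        shrunk = G - reg_alpha
    elif G < -reg_alpha:
        shrunk = G + reg_alpha
    else:
        shrunk = 0.0          # the L1 penalty zeroes out weak leaves
    return -shrunk / (H + reg_lambda)

# Made-up gradient and hessian sums for one leaf.
G, H = -4.0, 3.0

print(leaf_weight(G, H, reg_lambda=0.0))                 # no regularization
print(leaf_weight(G, H, reg_lambda=1.0))                 # L2 shrinks the weight
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=1.0))  # L1 shrinks it further
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=100))  # a very large alpha zeroes it
```

This is why very large values in the reg_alpha grid, such as 100, tend to flatten the model: most leaves end up with a weight of zero.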

## 3-6. Reduce the Learning Rate

In [None]:
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate =0.01,
    n_estimators=5000,
    max_depth=3,
    min_child_weight=5,
    gamma=0,
    reg_alpha=1e-05,
    subsample=0.67,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread=-1,
    scale_pos_weight=1,
    seed=2019
)
modelfit(xgb1, train, predictors)
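
Why n_estimators is raised when the learning rate drops can be seen in an idealized setting: if every tree fit the current residual exactly, the remaining residual after n rounds would be (1 - eta) ** n. The sketch below works under that assumption only; real boosting does not behave this cleanly.

```python
import math

def rounds_needed(eta, tolerance):
    """Rounds until the residual (1 - eta) ** n falls below `tolerance`,
    in an idealized setting where every tree fits the current residual exactly."""
    return math.ceil(math.log(tolerance) / math.log(1.0 - eta))

fast = rounds_needed(eta=0.1, tolerance=0.01)
slow = rounds_needed(eta=0.01, tolerance=0.01)

print(fast, slow)
```

Cutting the learning rate by a factor of ten needs roughly ten times as many rounds in this toy model, which matches the spirit of pairing a lower learning_rate with a much higher n_estimators cap and letting early stopping choose the final count.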

## 4. Ensembling Across Seeds for Submission

In [None]:
sample_submission = pd.read_csv('../input/sample_submission.csv')

In [None]:
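seeds = [2015, 2016, 2017, 2018, 2019]
predictors = [x for x in train.columns if x not in [target, IDcol]]

# Start the vote count at zero so any placeholder values in the sample file are not added in
sample_submission['Survived'] = 0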

for seed in seeds:
    xgb1 = XGBClassifier(
        learning_rate =0.01,
        n_estimators=5000,
        max_depth=3,
        min_child_weight=5,
        gamma=0,
        reg_alpha=1e-05,
        subsample=0.67,
        colsample_bytree=0.8,
        objective= 'binary:logistic',
        nthread=-1,
        scale_pos_weight=1,
        seed=seed
    )
    modelfit(xgb1, train, predictors)
    sample_submission['Survived'] += xgb1.predict(test[predictors])

## 5. Submitting the Results

In [None]:
sample_submission['Survived'] = sample_submission['Survived'] > 2.5

In [None]:
sample_submission['Survived'] = sample_submission['Survived'].apply(lambda x: int(x))
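
Summing the 0/1 predictions of five models and thresholding at 2.5 is a simple majority vote. A plain-Python sketch with made-up predictions and a hypothetical helper name:

```python
def majority_vote(per_model_predictions):
    """Combine 0/1 predictions from several models by simple majority."""
    n_models = len(per_model_predictions)
    vote_sums = [sum(votes) for votes in zip(*per_model_predictions)]
    return [int(total > n_models / 2) for total in vote_sums]

# Made-up predictions: one row per model (seed), one column per passenger.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]

final = majority_vote(preds)
print(final)
```

Using an odd number of models avoids ties, which is why a threshold of 2.5 works cleanly with five seeds.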

In [None]:
sample_submission.to_csv('./my_third_submission.csv', index=False)