[Presentation](http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions) by Owen Zhang, one of the Kaggle leaders, on data science competitions, including tuning XGBoost hyperparameters.
<img src='../../img/xgboost_tuning_owen.png'>

In [1]:
import numpy as np

from xgboost.sklearn import XGBClassifier
import xgboost as xgb
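# Note: sklearn.grid_search and sklearn.cross_validation are the pre-0.18 module
# names (removed in scikit-learn 0.20); newer releases use sklearn.model_selection.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold, train_test_split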
from sklearn.metrics import log_loss
from hyperopt import fmin, hp, tpe, STATUS_OK, Trials

from scipy.stats import randint, uniform



**Generate synthetic data.**

In [2]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, 
                           n_repeated=2, random_state=42)

**We will use 10-fold stratified cross-validation.**

In [3]:
cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=42)
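To make "stratified" concrete, here is a minimal pure-Python sketch of the idea: samples are dealt into folds class by class, so each fold keeps roughly the class proportions of the whole dataset. The helper `stratified_fold_ids` and the toy labels are hypothetical illustrations, not scikit-learn's actual implementation and not the notebook's `y`.

```python
from collections import Counter, defaultdict
import random

def stratified_fold_ids(labels, n_folds, seed=42):
    """Assign each sample a fold id so every fold keeps the class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)              # shuffle within each class
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_folds  # deal samples out round-robin
    return fold_of

labels = [0] * 50 + [1] * 50              # toy balanced labels (assumption)
folds = stratified_fold_ids(labels, n_folds=10)
per_fold = Counter(zip(folds, labels))    # (fold id, class) -> sample count
print(per_fold[(0, 0)], per_fold[(0, 1)]) # class counts inside fold 0
```

With 50 samples per class and 10 folds, every fold receives 5 samples of each class.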

### Grid-Search

In [5]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

**Separately initialize a dictionary of fixed parameters.**

In [6]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': 42
}

In [7]:
xgb_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)
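For a sense of the cost, the exhaustive search can be counted by hand. The sketch below enumerates the grid defined earlier with `itertools.product`; the learning rates are written out as the three values that `np.linspace(1e-16, 1, 3)` produces, and `n_folds` mirrors the 10-fold cross-validation. It only counts model fits and does not train anything.

```python
from itertools import product

# Same grid as in the notebook, with the linspace values written out.
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [1e-16, 0.5, 1.0],
}
n_folds = 10

names = sorted(params_grid)
combinations = [dict(zip(names, values))
                for values in product(*(params_grid[name] for name in names))]
n_fits = len(combinations) * n_folds  # one fit per combination per fold
print(len(combinations), n_fits)
```

That is 36 combinations and 360 cross-validation fits, plus one final refit on the full data because `refit=True` is the default.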

In [8]:
%%time
xgb_grid.fit(X, y)

CPU times: user 700 ms, sys: 110 ms, total: 810 ms
Wall time: 3min 31s


GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[0 1 ..., 0 1], n_folds=10, shuffle=True, random_state=42),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=42, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [1, 2, 3], 'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'n_estimators': [5, 10, 25, 50]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

**With `grid_scores_` you can build validation curves.**

In [9]:
xgb_grid.grid_scores_

[mean: 0.49700, std: 0.00245, params: {'max_depth': 1, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 5},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 1, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 10},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 1, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 25},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 1, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 50},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 2, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 5},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 2, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 10},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 2, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 25},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 2, 'learning_rate': 9.9999999999999998e-17, 'n_estimators': 50},
 mean: 0.49700, std: 0.00245, params: {'max_depth': 3, 'learning_r
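As a sketch of how such a curve could be extracted, the snippet below fixes all parameters but one and collects the mean score along the remaining axis. `CVScore` is a hypothetical stand-in for the entries of `grid_scores_` (the field names are assumed to match the old scikit-learn tuples), and the data are the four entries printed above for `max_depth=1`, `learning_rate=1e-16`.

```python
from collections import namedtuple

# Stand-in for grid_scores_ entries (field names are an assumption).
CVScore = namedtuple('CVScore', ['parameters', 'mean_validation_score'])

# The four entries shown in the output for max_depth=1, learning_rate=1e-16.
grid_scores = [
    CVScore({'max_depth': 1, 'learning_rate': 1e-16, 'n_estimators': n}, 0.497)
    for n in (5, 10, 25, 50)
]

def validation_curve_points(scores, vary, fixed):
    """Return sorted (value of `vary`, mean score) pairs for entries matching `fixed`."""
    points = [
        (s.parameters[vary], s.mean_validation_score)
        for s in scores
        if all(s.parameters[k] == v for k, v in fixed.items())
    ]
    return sorted(points)

curve = validation_curve_points(
    grid_scores, vary='n_estimators',
    fixed={'max_depth': 1, 'learning_rate': 1e-16})
print(curve)
```

For this slice the curve is flat at 0.497: with a learning rate of 1e-16, adding more trees does not change the accuracy.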

**Or simply use the best parameter combination.**

In [10]:
print("Best accuracy obtained: {0}".format(xgb_grid.best_score_))
print("Parameters:")
for key, value in xgb_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.871
Parameters:
	max_depth: 3
	learning_rate: 1.0
	n_estimators: 50


### Randomized Grid-Search
**The randomized version often works well and, more importantly, runs much faster.
Now we create a dictionary with parameter distributions:**

In [11]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # discrete uniform distribution over 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': uniform(), # continuous uniform distribution on [0, 1]
    'colsample_bytree': uniform() # continuous uniform distribution on [0, 1]
}
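To illustrate what a randomized search draws from such a dictionary, here is a rough sketch that uses the standard library's `random` module as a stand-in for the scipy distributions: list entries are picked uniformly, and distribution entries are sampled. The function `sample_candidate` is hypothetical and only mimics the sampling step, not scikit-learn's internals.

```python
import random

rng = random.Random(42)

def sample_candidate(rng):
    """Draw one parameter combination, mirroring the search space above."""
    return {
        'max_depth': rng.choice([1, 2, 3, 4]),  # lists are sampled uniformly
        'gamma': rng.choice([0, 0.5, 1]),
        'n_estimators': rng.randint(1, 1000),   # integers 1..1000, like randint(1, 1001)
        'learning_rate': rng.random(),          # uniform on [0, 1), like uniform()
        'subsample': rng.random(),
        'colsample_bytree': rng.random(),
    }

# Ten candidates, matching n_iter=10 in the randomized search.
candidates = [sample_candidate(rng) for _ in range(10)]
for candidate in candidates[:3]:
    print(candidate)
```

Only these ten combinations are evaluated, regardless of how large the underlying space is, which is where the speed-up over an exhaustive grid comes from.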

**Initialize `RandomizedSearchCV` so that it randomly samples 10 parameter combinations.**

In [12]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)

In [13]:
%%time
rs_grid.fit(X, y)

KeyboardInterrupt: 

In [14]:
rs_grid.best_estimator_

AttributeError: 'RandomizedSearchCV' object has no attribute 'best_estimator_'

In [None]:
rs_grid.best_params_

In [None]:
rs_grid.best_score_

### Hyperopt
**The Hyperopt library implements many more algorithms for tuning model parameters. As an example, we will minimize log_loss on a validation set.**

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, random_state=42)

**Define the function to be minimized.**

In [5]:
def score(params):
    print("Training with params:")
    print(params)
    # Work on a copy so the dictionary passed in by hyperopt is not mutated.
    params = dict(params)
    # n_estimators is the number of boosting rounds, not a booster parameter.
    num_round = int(params.pop('n_estimators'))
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_test, label=y_test)
    model = xgb.train(params, dtrain, num_round)
    # Predicted probabilities of the positive class on the hold-out set.
    predictions = model.predict(dvalid).reshape((X_test.shape[0], 1))
    loss = log_loss(y_test, predictions)
    print("\tScore {0}\n\n".format(loss))
    return {'loss': loss, 'status': STATUS_OK}
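For reference, the value that `score` returns under the `'loss'` key is the binary log loss (cross-entropy) of the predicted probabilities. A minimal pure-Python sketch of that quantity follows; `binary_log_loss` and the clipping constant `eps` are illustrative assumptions, not the scikit-learn implementation.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

uninformative = binary_log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
confident = binary_log_loss([0, 1, 1, 0], [0.1, 0.9, 0.9, 0.1])
print(uninformative, confident)
```

Predicting 0.5 everywhere gives ln 2 (about 0.693), and confident correct predictions give a lower value, so smaller scores in the search log indicate better-calibrated models.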

In [8]:
def optimize(trials):
    space = {
             'n_estimators' : 150,
             'eta' : hp.quniform('eta', 0.025, 0.5, 0.025),
             'max_depth' : hp.choice('max_depth', np.arange(4, 10, 2, dtype=int)),
             'min_child_weight' : hp.quniform('min_child_weight', 1, 6, 1),
             'subsample' : hp.quniform('subsample', 0.5, 1, 0.25),
             'gamma' : 0,
             'colsample_bytree' : hp.quniform('colsample_bytree', 0.5, 1, 0.25),
             'eval_metric': 'merror',  # multiclass error rate; 'error' is the binary counterpart
             'objective': 'binary:logistic',
             'nthread' : 4,
             'silent' : 1
             }
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=20)

    return best
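One detail worth keeping in mind when reading the dictionary returned by `fmin`: for parameters defined with `hp.choice`, it holds the index of the selected option rather than the option itself (Hyperopt's `space_eval` can convert it back when the search space object is available). The sketch below decodes `max_depth` by hand; the `best` dictionary is a hypothetical result for illustration, not taken from a real run.

```python
# Same options as np.arange(4, 10, 2) in the search space.
max_depth_options = list(range(4, 10, 2))

# Hypothetical fmin result, for illustration only.
best = {'max_depth': 1, 'eta': 0.1}

decoded = dict(best)
decoded['max_depth'] = max_depth_options[best['max_depth']]  # index -> actual depth
print(decoded)
```

Here index 1 corresponds to a tree depth of 6, so passing the raw `best` dictionary straight to a model would silently train with the wrong depth.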

In [9]:
trials = Trials()
best_params = optimize(trials)
print(best_params)

Training with params:
{'silent': 1, 'eta': 0.25, 'gamma': 0, 'objective': 'binary:logistic', 'eval_metric': 'merror', 'colsample_bytree': 0.5, 'min_child_weight': 4.0, 'nthread': 4, 'n_estimators': 150, 'subsample': 1.0, 'max_depth': 6}
	Score 0.3239594238937328


Training with params:
{'silent': 1, 'eta': 0.1, 'gamma': 0, 'objective': 'binary:logistic', 'eval_metric': 'merror', 'colsample_bytree': 1.0, 'min_child_weight': 4.0, 'nthread': 4, 'n_estimators': 150, 'subsample': 0.75, 'max_depth': 6}
	Score 0.312898863555165


Training with params:
{'silent': 1, 'eta': 0.07500000000000001, 'gamma': 0, 'objective': 'binary:logistic', 'eval_metric': 'merror', 'colsample_bytree': 0.75, 'min_child_weight': 1.0, 'nthread': 4, 'n_estimators': 150, 'subsample': 0.5, 'max_depth': 8}
	Score 0.30555439836888887


Training with params:
{'silent': 1, 'eta': 0.1, 'gamma': 0, 'objective': 'binary:logistic', 'eval_metric': 'merror', 'colsample_bytree': 0.75, 'min_child_weight': 4.0, 'nthread': 4, 'n_esti