<img style="width:100%" src="../images/practical_xgboost_in_python_notebook_header.png" />

# Hyper-parameter tuning

As you know, there are plenty of tunable parameters, and each setting can change the model's output. The question is which combination gives the best result.

This notebook shows how to use Scikit-learn modules to find the best parameters for your models.

**What's included:**
- <a href="#data">data preparation</a>,
- <a href="#grid">finding best hyper-parameters using grid-search</a>,
- <a href="#rgrid">finding best hyper-parameters using randomized grid-search</a>

### Prepare data<a name='data' />
Let's begin by loading all required libraries in one place and setting a seed for reproducibility.

In [25]:
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

from scipy.stats import randint, uniform

# reproducibility
seed = 342
np.random.seed(seed)

Generate an artificial dataset:

In [3]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define the cross-validation strategy for testing. Let's use `StratifiedKFold`, which keeps the proportion of each target label roughly the same in every fold:

In [4]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
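
To see what stratification does in practice, the following sketch regenerates the same artificial dataset and compares the share of positive labels in each test fold with the share in the whole dataset. The names `overall_rate` and `fold_rates` are introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

seed = 342
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=seed)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

# fraction of samples labelled 1 in the full dataset
overall_rate = y.mean()

# fraction of samples labelled 1 in every test fold
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

print("overall positive rate: {:.3f}".format(overall_rate))
for i, rate in enumerate(fold_rates):
    print("fold {}: {:.3f}".format(i, rate))
```

The per-fold rates should stay very close to the overall rate, which is the property that makes accuracy comparable across folds.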

### Grid-Search<a name='grid' />
In grid-search we start by defining a dictionary holding the parameter values we want to test. **All** combinations will be evaluated.

In [5]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

Add a dictionary for fixed parameters.

In [6]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

Create a `GridSearchCV` estimator. We will be looking for the combination that gives the best accuracy.

In [20]:
bst_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=seed),
    scoring='accuracy',
    return_train_score=True
)

Before running the calculations, notice that $3 \times 4 \times 3 \times 10 = 360$ models will be trained to test all combinations (36 parameter combinations, each evaluated on 10 folds). You should always have a rough estimate of how much work you are about to launch.
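
The same estimate can be computed rather than worked out by hand. This is a minimal sketch using only the standard library; the learning-rate values are written out as plain floats that approximate what `np.linspace(1e-16, 1, 3)` produces, and `candidates` and `n_fits` are illustrative names.

```python
from itertools import product

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [1e-16, 0.5, 1.0],  # approximately np.linspace(1e-16, 1, 3)
}
n_splits = 10  # number of cross-validation folds

# every combination of one value per parameter
candidates = [dict(zip(params_grid, values))
              for values in product(*params_grid.values())]

# each candidate is trained once per fold
n_fits = len(candidates) * n_splits

print("parameter combinations:", len(candidates))
print("models trained during cross-validation:", n_fits)
```

With `refit=True` the search additionally trains one final model on the whole dataset using the best combination.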

In [21]:
bst_grid.fit(X, y)

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=342, shuffle=True),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=342, silent=1,
       subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'n_estimators': [5, 10, 25, 50], 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

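Also keep in mind that XGBoost's native `train()` function returns the model from the last boosting iteration, not the best one. `GridSearchCV`, by contrast, refits the best parameter combination on the whole dataset when `refit=True`.

Now we can look at all obtained scores and try to see manually what matters and what does not. At a quick glance, the larger `n_estimators` is, the higher the accuracy.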

In [22]:
print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters:")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.877
Parameters:
	learning_rate: 0.5
	n_estimators: 50
	max_depth: 3


If there are many results, we can filter them manually to find the best combination:

In [29]:
pd.set_option('display.max_columns', 500)
pd.DataFrame(bst_grid.cv_results_)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_learning_rate,param_max_depth,param_n_estimators,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,split5_test_score,split5_train_score,split6_test_score,split6_train_score,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.005248,0.000492,0.504,0.504,1e-16,1,5,"{'learning_rate': 1e-16, 'n_estimators': 5, 'm...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.001018,0.000135,0.002,0.000222
1,0.006362,0.00039,0.504,0.504,1e-16,1,10,"{'learning_rate': 1e-16, 'n_estimators': 10, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000646,0.000124,0.002,0.000222
2,0.012106,0.000375,0.504,0.504,1e-16,1,25,"{'learning_rate': 1e-16, 'n_estimators': 25, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000371,5.1e-05,0.002,0.000222
3,0.022464,0.000432,0.504,0.504,1e-16,1,50,"{'learning_rate': 1e-16, 'n_estimators': 50, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000557,9.4e-05,0.002,0.000222
4,0.005491,0.000329,0.504,0.504,1e-16,2,5,"{'learning_rate': 1e-16, 'n_estimators': 5, 'm...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000142,5.5e-05,0.002,0.000222
5,0.009224,0.000325,0.504,0.504,1e-16,2,10,"{'learning_rate': 1e-16, 'n_estimators': 10, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000135,1.3e-05,0.002,0.000222
6,0.020551,0.000392,0.504,0.504,1e-16,2,25,"{'learning_rate': 1e-16, 'n_estimators': 25, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000376,2.9e-05,0.002,0.000222
7,0.039951,0.000431,0.504,0.504,1e-16,2,50,"{'learning_rate': 1e-16, 'n_estimators': 50, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.001047,1.8e-05,0.002,0.000222
8,0.00759,0.000327,0.504,0.504,1e-16,3,5,"{'learning_rate': 1e-16, 'n_estimators': 5, 'm...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.00079,3.3e-05,0.002,0.000222
9,0.012943,0.000372,0.504,0.504,1e-16,3,10,"{'learning_rate': 1e-16, 'n_estimators': 10, '...",25,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.50495,0.503893,0.5,0.504444,0.5,0.504444,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.505051,0.503885,0.000298,4.4e-05,0.002,0.000222
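
Scanning a wide table by eye gets tedious, so one option is to sort by `rank_test_score` and keep only a few columns. The sketch below is self-contained: it uses a small `DecisionTreeClassifier` search as a stand-in for `bst_grid` (an assumption made only so the example runs without XGBoost), and the same two `pandas` lines would apply to `bst_grid.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# stand-in search; replace with the fitted bst_grid in the notebook
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 2, 3], 'min_samples_leaf': [1, 5]},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# keep the columns that matter for comparison and show the best three rows
columns = ['rank_test_score', 'mean_test_score', 'std_test_score', 'params']
top = results.sort_values('rank_test_score')[columns].head(3)
print(top)
```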


Looking for the best parameters is an iterative process. You should start with a coarse grid and then move to more finely spaced values around the most promising region.
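
One possible way to build the next, finer grid is to centre it on the best value found so far and shrink the spacing. The helper `refine_grid` below is hypothetical (it is not part of Scikit-learn or XGBoost) and only sketches the idea for a single numeric parameter.

```python
def refine_grid(best, step, n_points=5, lower=0.0, upper=1.0):
    """Return n_points evenly spaced values covering half a coarse step
    on each side of `best`, clipped to the range [lower, upper]."""
    lo = max(lower, best - step / 2)
    hi = min(upper, best + step / 2)
    width = (hi - lo) / (n_points - 1)
    return [round(lo + i * width, 6) for i in range(n_points)]

# the coarse grid used a spacing of 0.5 and selected learning_rate=0.5
fine_learning_rates = refine_grid(best=0.5, step=0.5)
print(fine_learning_rates)
```

The resulting list can be plugged into a new `params_grid` for a second, narrower search.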

### Randomized Grid-Search<a name='rgrid' />
When the number of parameters and their candidate values grows, the traditional grid-search approach quickly becomes impractical. A possible solution is to randomly sample parameter values from given distributions. While this is not an exhaustive search, it is worth a shot.

Create a parameters distribution dictionary:

In [30]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # discrete uniform distribution over 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': uniform(), # continuous uniform distribution on [0, 1]
    'colsample_bytree': uniform() # continuous uniform distribution on [0, 1]
}
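
It can help to draw a few samples from these distributions before starting a search. The sketch below checks the ranges of `randint(1, 1001)` and `uniform()`, and also shows how `uniform(loc, scale)` covers `[loc, loc + scale]`. The narrower `narrow_dist` is an illustrative assumption, not something this notebook uses; it might be worth considering because values of `subsample` close to 0 leave very few rows for each tree.

```python
from scipy.stats import randint, uniform

seed = 342

n_estimators_dist = randint(1, 1001)        # integers 1..1000, upper bound excluded
unit_dist = uniform()                       # floats in [0, 1]
narrow_dist = uniform(loc=0.5, scale=0.5)   # floats in [0.5, 1.0] (hypothetical choice)

n_estimators_samples = n_estimators_dist.rvs(size=1000, random_state=seed)
unit_samples = unit_dist.rvs(size=1000, random_state=seed)
narrow_samples = narrow_dist.rvs(size=1000, random_state=seed)

print("n_estimators range:", n_estimators_samples.min(), n_estimators_samples.max())
print("uniform() range:", unit_samples.min(), unit_samples.max())
print("uniform(0.5, 0.5) range:", narrow_samples.min(), narrow_samples.max())
```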

Initialize `RandomizedSearchCV` to **randomly pick 100 combinations of parameters** (`n_iter=100`). With this approach you can easily control the number of tested models.

In [37]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_distributions=params_dist_grid,
    n_iter=100,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=seed),
    scoring='accuracy',
    random_state=seed
)
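
To preview which candidates a randomized search would try, and how many models it will train, Scikit-learn's `ParameterSampler` can be used on its own. This is a sketch under the assumption that the same distributions and seed are passed; `candidates` and `n_fits` are illustrative names.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001),
    'learning_rate': uniform(),
    'subsample': uniform(),
    'colsample_bytree': uniform(),
}

n_iter = 100    # number of sampled parameter combinations
n_splits = 10   # number of cross-validation folds

# draw n_iter random parameter combinations
candidates = list(ParameterSampler(params_dist_grid, n_iter=n_iter, random_state=342))

# unlike grid-search, the cost depends only on n_iter and the number of folds
n_fits = len(candidates) * n_splits

print("first candidate:", candidates[0])
print("models trained during cross-validation:", n_fits)
```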

Fit the classifier:

In [38]:
rs_grid.fit(X, y)

RandomizedSearchCV(cv=StratifiedKFold(n_splits=10, random_state=342, shuffle=True),
          error_score='raise',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=342, silent=1,
       subsample=1),
          fit_params=None, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x119e8ed68>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x119e8e588>, 'gamma': [0, 0.5, 1], 'colsample_bytree': <scipy.stats._distn_infrastructure.rv_frozen object at 0x119fe8320>, 'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x119fe8e80>, 'max_depth': [1, 2, 3, 4]}

One more time, take a look at the chosen parameters and their accuracy scores:

In [39]:
rs_grid.grid_scores_



[mean: 0.62000, std: 0.11225, params: {'learning_rate': 0.24756150723102166, 'n_estimators': 5, 'gamma': 0, 'colsample_bytree': 0.065034396841929132, 'subsample': 0.11848249237448605, 'max_depth': 3},
 mean: 0.82200, std: 0.03194, params: {'learning_rate': 0.4325346125891868, 'n_estimators': 492, 'gamma': 0, 'colsample_bytree': 0.13214054942810016, 'subsample': 0.61087022642994204, 'max_depth': 1},
 mean: 0.87800, std: 0.03714, params: {'learning_rate': 0.20992824607318106, 'n_estimators': 719, 'gamma': 0, 'colsample_bytree': 0.37042173856789762, 'subsample': 0.50413311801798655, 'max_depth': 3},
 mean: 0.86400, std: 0.02845, params: {'learning_rate': 0.38076975648982458, 'n_estimators': 625, 'gamma': 1, 'colsample_bytree': 0.19015760885089361, 'subsample': 0.80580143163765727, 'max_depth': 3},
 mean: 0.59200, std: 0.06095, params: {'learning_rate': 0.76526283302535481, 'n_estimators': 188, 'gamma': 0, 'colsample_bytree': 0.46363095388213049, 'subsample': 0.0056355243866283988, 'max_de
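
Note that `grid_scores_` was removed in scikit-learn 0.20; newer versions expose the same information through `cv_results_`. The sketch below rebuilds a similar listing from `cv_results_`, using a small `DecisionTreeClassifier` search as a stand-in for `rs_grid` so that it runs without XGBoost.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# stand-in search; replace with the fitted rs_grid in the notebook
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={'max_depth': randint(1, 6),
                         'min_samples_leaf': randint(1, 20)},
    n_iter=5,
    cv=5,
    scoring='accuracy',
    random_state=0,
)
search.fit(X, y)

results = search.cv_results_

# one line per sampled candidate: mean and std of the test score, then parameters
lines = [
    "mean: {:.5f}, std: {:.5f}, params: {}".format(mean, std, params)
    for mean, std, params in zip(results['mean_test_score'],
                                 results['std_test_score'],
                                 results['params'])
]
print("\n".join(lines))
```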

There are also some handy attributes for quickly inspecting the best estimator, its parameters and the obtained score:

In [40]:
print("Best accuracy obtained: {0}".format(rs_grid.best_score_))
print("Parameters:")
for key, value in rs_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.89
Parameters:
	learning_rate: 0.22673587222022618
	n_estimators: 819
	gamma: 0.5
	colsample_bytree: 0.2963651190082076
	subsample: 0.7946945369030941
	max_depth: 3


In [41]:
rs_grid.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.29636511900820761, gamma=0.5,
       learning_rate=0.22673587222022618, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=819, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=342, silent=1,
       subsample=0.79469453690309411)

In [42]:
rs_grid.scorer_

make_scorer(accuracy_score)
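
The `scorer_` output above shows that the string `'accuracy'` is turned into a scorer wrapping `accuracy_score`. As a small check of that equivalence, the sketch below scores a stand-in `DecisionTreeClassifier` (used only to keep the example free of XGBoost) in three ways and compares the results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, get_scorer, make_scorer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

by_name = get_scorer('accuracy')        # what scoring='accuracy' resolves to
explicit = make_scorer(accuracy_score)  # the scorer built by hand

# scorers are called as scorer(estimator, X, y)
score_by_name = by_name(clf, X, y)
score_explicit = explicit(clf, X, y)
score_manual = accuracy_score(y, clf.predict(X))

# these are training-set scores, shown only to compare the three call styles
print(score_by_name, score_explicit, score_manual)
```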