# Tuning the hyper-parameters of an estimator

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.

It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use:

`estimator.get_params()`
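As a small illustration of the call above, the following sketch lists the parameters of an `SVC` and then changes one of them with `set_params`. The chosen values (`C=10`, `C=100`) are arbitrary and only serve the example.

```python
from sklearn.svm import SVC

estimator = SVC(C=10, kernel="rbf")

# get_params() returns a dict mapping constructor argument names to their current values.
params = estimator.get_params()
for name in sorted(params):
    print(name, "=", params[name])

# set_params() is the counterpart: it updates parameters by name on the estimator.
estimator.set_params(C=100)
print(estimator.get_params()["C"])
```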

A search consists of:
* an estimator (regressor or classifier such as sklearn.svm.SVC());
* a parameter space;
* a method for searching or sampling candidates;
* a cross-validation scheme; and
* a score function.
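The sketch below shows one way these five ingredients can map onto a `GridSearchCV` call. The iris dataset, the small grid and the names `X_demo`, `y_demo` and `search` are assumptions made for the illustration and are not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

search = GridSearchCV(                      # the search method: exhaustive grid search
    estimator=SVC(),                        # the estimator
    param_grid={"C": [0.1, 1, 10],          # the parameter space
                "gamma": [0.01, 0.1]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # the CV scheme
    scoring="accuracy",                     # the score function
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```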

Some models allow for specialized, efficient parameter search strategies, outlined below. Two generic approaches to sampling search candidates are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution. After describing these tools we detail best practice applicable to both approaches.

Note that commonly a small subset of those parameters has a large impact on the predictive or computational performance of the model, while the others can be left at their default values. It is recommended to read the docstring of the estimator class to get a finer understanding of their expected behavior, possibly by reading the enclosed reference to the literature.

## Grid Search

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:

In [1]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second one with an RBF kernel, and the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].
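To see which candidates such a grid expands to without fitting anything, one option is `ParameterGrid` from `sklearn.model_selection`, which enumerates the combinations. This is only a sketch for inspection; the name `candidates` is chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Each sub-grid is expanded independently: 4 linear candidates + 4 * 2 rbf candidates.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 12
for candidate in candidates:
    print(candidate)
```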

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.

In [2]:
from sklearn.model_selection import GridSearchCV

GridSearchCV?

In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

digits = datasets.load_digits()

n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]



# Exhaustive search over both grids, using 5-fold CV and the macro-averaged F1 score
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                   scoring='f1_macro')
clf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}, {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1_macro', verbose=0)

In [4]:
clf.best_params_


{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

In [5]:
clf.cv_results_

{'mean_fit_time': array([ 0.05294981,  0.04619932,  0.05392065,  0.02693901,  0.05347157,
         0.02645354,  0.05356345,  0.02492204,  0.02082181,  0.02132759,
         0.02084498,  0.0199594 ]),
 'mean_score_time': array([ 0.01215878,  0.01334348,  0.01125422,  0.00844197,  0.01104765,
         0.00820484,  0.01102276,  0.00877604,  0.00709801,  0.00607562,
         0.00648918,  0.00706677]),
 'mean_test_score': array([ 0.98558453,  0.95701735,  0.98693226,  0.98097324,  0.98693226,
         0.98115042,  0.98693226,  0.98115042,  0.97273826,  0.97273826,
         0.97273826,  0.97273826]),
 'mean_train_score': array([ 0.99887682,  0.96731111,  1.        ,  0.99800229,  1.        ,
         1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ]),
 'param_C': masked_array(data = [1 1 10 10 100 100 1000 1000 1 10 100 1000],
              mask = [False False False False False False False False False False False False],
        fill_value = ?),

In [7]:
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899



In [6]:
clf.cv_results_.keys()

['std_train_score',
 'rank_test_score',
 'split4_test_score',
 'param_gamma',
 'param_C',
 'split2_train_score',
 'std_score_time',
 'split4_train_score',
 'split2_test_score',
 'mean_score_time',
 'mean_fit_time',
 'split3_train_score',
 'split0_train_score',
 'std_test_score',
 'split1_train_score',
 'split0_test_score',
 'mean_test_score',
 'param_kernel',
 'params',
 'std_fit_time',
 'split3_test_score',
 'mean_train_score',
 'split1_test_score']

In [8]:
for param, score in zip(clf.cv_results_['params'], clf.cv_results_['mean_test_score']):
    print(param, score)

{'kernel': 'rbf', 'C': 1, 'gamma': 0.001} 0.985584530844
{'kernel': 'rbf', 'C': 1, 'gamma': 0.0001} 0.957017352561
{'kernel': 'rbf', 'C': 10, 'gamma': 0.001} 0.986932256371
{'kernel': 'rbf', 'C': 10, 'gamma': 0.0001} 0.980973238881
{'kernel': 'rbf', 'C': 100, 'gamma': 0.001} 0.986932256371
{'kernel': 'rbf', 'C': 100, 'gamma': 0.0001} 0.981150421585
{'kernel': 'rbf', 'C': 1000, 'gamma': 0.001} 0.986932256371
{'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001} 0.981150421585
{'kernel': 'linear', 'C': 1} 0.972738260762
{'kernel': 'linear', 'C': 10} 0.972738260762
{'kernel': 'linear', 'C': 100} 0.972738260762
{'kernel': 'linear', 'C': 1000} 0.972738260762
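A listing like this is easier to read when the candidates are sorted by rank and shown with their standard deviation across folds. The sketch below does this on a small stand-in search; the iris dataset, the grid and the names `demo_search`, `results` and `order` are assumptions for the illustration, not the search fitted above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
demo_search = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.01, 0.1]},
                           cv=5, scoring="f1_macro")
demo_search.fit(X_demo, y_demo)

results = demo_search.cv_results_
# rank_test_score is 1 for the best mean test score, so sorting by it puts the best first.
order = np.argsort(results["rank_test_score"])
for i in order:
    print("rank %d: %.3f (+/- %.3f) for %r" % (
        results["rank_test_score"][i],
        results["mean_test_score"][i],
        results["std_test_score"][i],
        results["params"][i],
    ))
```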


## Randomized Search

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

* A budget can be chosen independent of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified:

In [9]:
import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not load it

params = {'C': scipy.stats.expon(scale=100),
          'gamma': scipy.stats.expon(scale=.1),
          'kernel': ['rbf'],
          'class_weight': ['balanced', None]}

This example uses the scipy.stats module, which contains many useful distributions for sampling parameters, such as expon, gamma, uniform or randint. In principle, any object can be passed that provides an rvs (random variate sample) method to sample a value. Consecutive calls to the rvs method should return independent random samples from the possible parameter values.
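As a sketch of the "anything with an rvs method" idea, the hypothetical `LogUniform` class below draws values uniformly on a log scale. The class name, its bounds and the `random_state` keyword handling are assumptions for this illustration; `ParameterSampler` is used here only to draw a few candidates without fitting a model.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.utils import check_random_state


class LogUniform:
    """Hypothetical helper: samples 10**u with u uniform in [log10(low), log10(high)]."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, random_state=None):
        # Accept a seed or RandomState so the sampling is reproducible.
        rng = check_random_state(random_state)
        exponent = rng.uniform(np.log10(self.low), np.log10(self.high))
        return float(10 ** exponent)


sampler = ParameterSampler({'C': LogUniform(1e-2, 1e3), 'kernel': ['rbf']},
                           n_iter=5, random_state=0)
log_uniform_candidates = list(sampler)
for candidate in log_uniform_candidates:
    print(candidate)
```

The same dictionary could be passed as the parameter distributions of a `RandomizedSearchCV`.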

For continuous parameters, such as C above, it is important to specify a continuous distribution to take full advantage of the randomization. This way, increasing n_iter will always lead to a finer search.
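The effect can be checked without fitting any model by drawing candidates with `ParameterSampler`. In this sketch the budgets of 4 and 20 draws and the name `distinct_counts` are arbitrary choices for the illustration.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

continuous = {'C': scipy.stats.expon(scale=100)}

distinct_counts = {}
for n_iter in (4, 20):
    drawn = list(ParameterSampler(continuous, n_iter=n_iter, random_state=0))
    # With a continuous distribution every draw is a new value of C,
    # so a larger budget explores the range more finely.
    distinct_counts[n_iter] = len({candidate['C'] for candidate in drawn})
    print(n_iter, "draws ->", distinct_counts[n_iter], "distinct values of C")
```

With a short list of discrete values instead, the number of distinct settings is capped by the length of the list, however large the budget.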

In [12]:
from sklearn.model_selection import RandomizedSearchCV

RandomizedSearchCV?

In [11]:
clf = RandomizedSearchCV(SVC(), params, cv=5,
                         scoring='f1_macro')
clf.fit(X_train, y_train)


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'kernel': ['rbf'], 'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10631af10>, 'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1095a90d0>, 'class_weight': ['balanced', None]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='f1_macro', verbose=0)

In [13]:
clf.best_params_


{'C': 77.66272238664412,
 'class_weight': 'balanced',
 'gamma': 0.005068411787780897,
 'kernel': 'rbf'}

In [14]:
clf.cv_results_

{'mean_fit_time': array([ 0.16069083,  0.14710426,  0.13715243,  0.14030814,  0.15247364,
         0.13619647,  0.14480162,  0.162533  ,  0.15331717,  0.1470336 ]),
 'mean_score_time': array([ 0.0197228 ,  0.01762419,  0.01764722,  0.01734538,  0.0189888 ,
         0.01650276,  0.01705256,  0.01957798,  0.01750913,  0.01967664]),
 'mean_test_score': array([ 0.02093678,  0.02093678,  0.40259223,  0.07258202,  0.02093678,
         0.95238736,  0.02516369,  0.02516369,  0.02093678,  0.02093678]),
 'mean_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'param_C': masked_array(data = [197.43804316699695 66.457135213559567 177.34863184477223
  246.38577363335213 106.14275113094149 77.66272238664412 544.70767720523997
  17.176008353499572 97.301699633849807 5.0170210293950577],
              mask = [False False False False False False False False False False],
        fill_value = ?),
 'param_class_weight': masked_array(data = [None None None None 'balanced' 'balanced

In [15]:
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       1.00      0.98      0.99        89
          1       1.00      0.98      0.99        90
          2       1.00      0.95      0.97        92
          3       0.99      0.95      0.97        93
          4       0.99      1.00      0.99        76
          5       1.00      0.94      0.97       108
          6       1.00      0.94      0.97        89
          7       1.00      0.99      0.99        78
          8       0.80      0.98      0.88        92
          9       0.94      0.98      0.96        92

avg / total       0.97      0.97      0.97       899



In [16]:
for param, score in zip(clf.cv_results_['params'], clf.cv_results_['mean_test_score']):
    print(param, score)

{'kernel': 'rbf', 'C': 197.43804316699695, 'gamma': 0.06685665237888809, 'class_weight': None} 0.0209367845031
{'kernel': 'rbf', 'C': 66.457135213559567, 'gamma': 0.052735834620258396, 'class_weight': None} 0.0209367845031
{'kernel': 'rbf', 'C': 177.34863184477223, 'gamma': 0.012859019553229155, 'class_weight': None} 0.402592233575
{'kernel': 'rbf', 'C': 246.38577363335213, 'gamma': 0.024434327856707762, 'class_weight': None} 0.0725820179991
{'kernel': 'rbf', 'C': 106.14275113094149, 'gamma': 0.055239367053976862, 'class_weight': 'balanced'} 0.0209367845031
{'kernel': 'rbf', 'C': 77.66272238664412, 'gamma': 0.005068411787780897, 'class_weight': 'balanced'} 0.952387358855
{'kernel': 'rbf', 'C': 544.70767720523997, 'gamma': 0.047295948666685128, 'class_weight': 'balanced'} 0.0251636863296
{'kernel': 'rbf', 'C': 17.176008353499572, 'gamma': 0.047877537276124066, 'class_weight': 'balanced'} 0.0251636863296
{'kernel': 'rbf', 'C': 97.301699633849807, 'gamma': 0.065282190141343288, 'class_wei

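Don't forget the estimator-specific `*CV` classes (for example LassoCV or RidgeCV), which are often faster than a generic grid search! Also keep in mind that the out-of-bag (OOB) error of bagging-based ensembles such as random forests can be a great proxy for the cross-validation score.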
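Both shortcuts can be sketched in a few lines. The choice of `LassoCV` and `RandomForestClassifier`, the synthetic regression data and all variable names below are assumptions made for the illustration; other estimators with a built-in cross-validated variant or an `oob_score` option could be used in the same way.

```python
from sklearn.datasets import load_digits, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

# 1) An estimator-specific CV class: LassoCV selects alpha by cross-validation
#    along a regularization path, without an external grid search.
X_reg, y_reg = make_regression(n_samples=200, n_features=20, noise=10.0,
                               random_state=0)
lasso = LassoCV(cv=5).fit(X_reg, y_reg)
print("alpha selected by LassoCV:", lasso.alpha_)

# 2) Out-of-bag estimate: each tree is scored on the samples left out of its
#    bootstrap, giving a generalization estimate from a single fit.
X_digits, y_digits = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_digits, y_digits)
print("OOB accuracy:", forest.oob_score_)
```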