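# Hyperparameter Optimization Methods
A learner typically has two kinds of parameters:
- Model parameters: learned (estimated) from the data during training.
- Hyperparameters: cannot be estimated from the data during training and must be specified beforehand, usually based on experience. Examples include C, kernel, and $\gamma$ in an SVM, and $\alpha$ in naive Bayes.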

An estimator's parameters and their current values can be retrieved with ```get_params()```.
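
As a minimal sketch of this call, the snippet below inspects an SVM classifier. The choice of `SVC` and the specific values passed to it are illustrative assumptions, not part of the original notebook.

```python
from sklearn.svm import SVC

# SVC is used only as an example estimator; any scikit-learn estimator works.
clf = SVC(C=1.0, kernel="rbf", gamma="scale")

# get_params() returns a dict mapping parameter names to their current values.
params = clf.get_params()
for name in ("C", "kernel", "gamma"):
    print(name, "=", params[name])
```

The keys of this dict are the names that a parameter grid or distribution would refer to when searching.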

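A search over the parameter space consists of:
- an estimator (the learner)
- a parameter space
- a search or sampling method for obtaining candidate parameter combinations
- a cross-validation scheme
- a scoring function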
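
The five components can be mapped onto a single search object. The sketch below is one possible arrangement, assuming the iris dataset, an `SVC` estimator, and a small hand-picked grid; none of these choices come from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. The estimator (learner).
estimator = SVC()
# 2. The parameter space (illustrative values).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
# 4. The cross-validation scheme.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 3. GridSearchCV is the search method: it tries every combination in param_grid.
# 5. scoring="accuracy" selects the scoring function.
search = GridSearchCV(estimator, param_grid=param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Swapping `GridSearchCV` for `RandomizedSearchCV` changes only the third component; the other four stay the same.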

In [1]:
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn import model_selection, datasets, ensemble

def report(results, n_top=3):
    """Print the n_top best-ranked candidates from a cv_results_ dict."""
    for i in range(1, n_top + 1):
        # Several candidates can share a rank when their scores tie.
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print('Model with rank: {0}'.format(i))
            print('Mean validation score: {0:.3f} (std: {1:.3f})'.format(results['mean_test_score'][candidate],
                                                                         results['std_test_score'][candidate]))
            print('Parameters:{0}'.format(results['params'][candidate]))
            print("")

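# Load the handwritten digits dataset and build a random forest classifier.
digits = datasets.load_digits()
x, y = digits.data, digits.target
clf = ensemble.RandomForestClassifier(n_estimators=20)
print('===============RandomizedSearchCV results==============================')
# Distributions and lists from which candidate parameter values are sampled.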
param_dist = {'max_depth':[3, None],
              'max_features':sp_randint(1,11),
              'min_samples_split':sp_randint(2,11),
              'min_samples_leaf':sp_randint(1,11),
              'bootstrap':[True, False],
              'criterion':['gini', 'entropy']}
n_iter_search = 20
random_search = model_selection.RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter_search)
start = time()
random_search.fit(x, y)
print('RandomizedSearchCV took %.2f seconds for %d candidates parameter settings.' % ((time() - start), n_iter_search))
report(random_search.cv_results_)

print('===============GridSearchCV results==============================')
# Every combination of the values listed below is evaluated.
param_grid = {'max_depth':[3, None],
              'max_features':[1,3,10],
              'min_samples_split':[2,3,10],
              'min_samples_leaf':[1,3,10],
              'bootstrap':[True, False],
              'criterion':['gini', 'entropy']}

grid_search = model_selection.GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(x, y)
print('GridSearchCV took %.2f seconds for %d candidates parameter settings.' % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)



RandomizedSearchCV took 9.86 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.936 (std: 0.009)
Parameters:{'max_features': 9, 'min_samples_split': 2, 'criterion': 'gini', 'bootstrap': False, 'min_samples_leaf': 3, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.931 (std: 0.015)
Parameters:{'max_features': 7, 'min_samples_split': 10, 'criterion': 'gini', 'bootstrap': False, 'min_samples_leaf': 1, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.923 (std: 0.001)
Parameters:{'max_features': 3, 'min_samples_split': 2, 'criterion': 'entropy', 'bootstrap': False, 'min_samples_leaf': 2, 'max_depth': None}



GridSearchCV took 86.05 seconds for 216 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.940 (std: 0.008)
Parameters:{'max_features': 10, 'min_samples_split': 2, 'criterion': 'gini', 'bootstrap': False, 'min_samples_leaf': 1, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.934 (std: 0.005)
Parameters:{'max_features': 3, 'min_samples_split': 3, 'criterion': 'entropy', 'bootstrap': False, 'min_samples_leaf': 1, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.934 (std: 0.008)
Parameters:{'max_features': 10, 'min_samples_split': 3, 'criterion': 'entropy', 'bootstrap': False, 'min_samples_leaf': 1, 'max_depth': None}



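Randomized search ran in far less time than grid search (9.86 s for 20 candidates versus 86.05 s for 216). The best hyperparameter combination found by randomized search scores slightly lower (0.936 versus 0.940), but this difference is largely due to noise.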
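
One rough way to see why the difference is attributed to noise is to compare the gap between the two best scores with the standard deviations reported in the output above. The snippet below is an informal sanity check using those recorded numbers, not a significance test.

```python
# Rank-1 mean scores and standard deviations copied from the output above.
random_mean, random_std = 0.936, 0.009
grid_mean, grid_std = 0.940, 0.008

gap = grid_mean - random_mean
print("gap between best scores: %.3f" % gap)
print("gap smaller than either std:", gap < min(random_std, grid_std))
```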

Tips for optimizing the search:
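- Specify an appropriate objective metric for evaluating the model.
- Use scikit-learn's `Pipeline` to combine estimators with their parameter spaces.
- Split the dataset sensibly, for example with `model_selection.train_test_split()`.
- Candidate parameter settings can be evaluated in parallel; set the degree of parallelism with the `n_jobs` parameter.
- Make the search robust to failures at individual parameter settings: with `error_score=0`, a failing candidate only triggers a warning instead of aborting the search.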
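
The tips above can be combined in one search. The following is a minimal sketch, assuming a `StandardScaler` plus `SVC` pipeline, the step names `scale` and `svc`, and a small illustrative grid; these are choices made for the example rather than anything taken from the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [1, 10], "svc__gamma": ["scale", 0.001]}

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="accuracy",  # the evaluation metric
    cv=3,
    n_jobs=2,            # evaluate candidates in parallel (-1 would use all cores)
    error_score=0,       # a failing fit scores 0 and warns instead of raising
)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy: %.3f" % search.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into preprocessing.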