<a href="https://colab.research.google.com/github/narutaku0914/KIKAGAKU/blob/master/kikagaku_ML3hyper_param.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning

Parameters: values obtained by training the model; also called weights<br>
Hyperparameters: values attached to each algorithm that control how the algorithm behaves
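
A minimal sketch of the distinction. The choice of `LogisticRegression` and the toy data are assumptions made only for illustration and do not appear elsewhere in these notes: `C` is a hyperparameter chosen before training, while `coef_` and `intercept_` are parameters obtained by training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, label is 1 when the feature is positive
rng = np.random.RandomState(0)
x_toy = rng.randn(100, 1)
t_toy = (x_toy[:, 0] > 0).astype(int)

# Hyperparameter: chosen by us before training (here the regularization strength C)
model = LogisticRegression(C=1.0)

# Parameters: obtained by training (the weights and the bias)
model.fit(x_toy, t_toy)
print('hyperparameter C:', model.get_params()['C'])
print('learned weights :', model.coef_)
print('learned bias    :', model.intercept_)
```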

### K-fold cross-validation

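Hold-out method: the split into training and test sets used so far<br>
→ In real development, to evaluate model performance more properly, it is common to split the data three ways (training, **validation**, test)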

The train and validation sets are both used during the training phase

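When not enough data is available<br>
→ a three-way split can produce biased subsets, so training and validation may not be carried out properly<br>
K-fold cross-validation is one way to avoid this bias in the data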

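Split the data into K folds → use 1 fold for validation and the remaining K-1 folds for training<br>
Instead of training once, train K times in total<br>
Each time, the fold that was just used for validation goes back into the training set, and a new fold is chosen for validation<br>
The final validation result is the average of the K validation results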
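
The procedure above can be sketched by hand as follows. This is only an illustration under assumptions: `KFold` with `shuffle=True`, the decision tree, and the variable names are choices made here, and the search classes used later in these notes do the splitting internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_all, t_all = data.data, data.target

k = 5  # number of folds (K)
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(x_all):
    # K-1 folds are used for training, the remaining fold for validation
    model = DecisionTreeClassifier(random_state=0)
    model.fit(x_all[train_idx], t_all[train_idx])
    fold_scores.append(model.score(x_all[val_idx], t_all[val_idx]))

# The final validation result is the mean over the K folds
print('scores per fold:', fold_scores)
print('mean score     :', np.mean(fold_scores))
```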

### Hyperparameter tuning methods

Manual tuning

Grid search

Random search

Bayesian optimization

1. Manual tuning

In [None]:
import numpy as np
import pandas as pd

# Load the breast cancer dataset
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

x = dataset.data
t = dataset.target

x.shape, t.shape

((569, 30), (569,))

In [None]:
# test : rest = 20 : 80
from sklearn.model_selection import train_test_split
x_train_val, x_test, t_train_val, t_test = train_test_split(x, t, test_size=0.2, random_state=1)

# validation : training = 30 : 70
x_train, x_val, t_train, t_val = train_test_split(x_train_val, t_train_val, test_size=0.3, random_state=1)

x_train.shape, x_val.shape, x_test.shape

((318, 30), (137, 30), (114, 30))

In [None]:
# Decision tree implementation
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=0)

dtree.fit(x_train, t_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [None]:
print('train score ', dtree.score(x_train, t_train))
print('validation score ', dtree.score(x_val, t_val))  # validation during training uses the val split

train score  1.0
validation score  0.927007299270073


In [None]:
# The model overfits slightly, so set hyperparameters and redefine it
dtree = DecisionTreeClassifier(max_depth=10, min_samples_split=30, random_state=0)

dtree.fit(x_train, t_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=30,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [None]:
print('train score ', dtree.score(x_train, t_train))
print('validation score ', dtree.score(x_val, t_val))

train score  0.9308176100628931
validation score  0.9562043795620438


In [None]:
print('test score ', dtree.score(x_test, t_test))

test score  0.9298245614035088
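
Manual tuning can also be organized as a small loop that compares train and validation scores for a few candidate values. This is a sketch, not part of the original cells: the candidate `max_depth` values and the variable names are hypothetical, and the splits are rebuilt with the same settings as above so the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the same splits as in the cells above
data = load_breast_cancer()
x_tv, x_te, t_tv, t_te = train_test_split(data.data, data.target, test_size=0.2, random_state=1)
x_tr, x_va, t_tr, t_va = train_test_split(x_tv, t_tv, test_size=0.3, random_state=1)

# Candidate values are hypothetical, chosen only for illustration
results = []
for max_depth in [2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(x_tr, t_tr)
    results.append((max_depth, model.score(x_tr, t_tr), model.score(x_va, t_va)))

# A large gap between train and validation scores suggests overfitting
for max_depth, train_score, val_score in results:
    print(f'max_depth={max_depth:2d}  train={train_score:.3f}  validation={val_score:.3f}')
```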


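2. Grid search

The hyperparameter values above were picked by hand more or less arbitrarily, and arbitrarily chosen values are unlikely to be optimal<br>
Finding good hyperparameters requires a certain amount of search (trial and error)<br>
→ There are several ways to search efficiently, and grid search is one of them

1. Decide the range over which to search each hyperparameter<br>
With five candidate values each for max_depth and min_samples_split, all 25 combinations are trained and validated
2. Adopt the combination with the best accuracy

Advantage: nothing within the specified grid is missed<br>
Disadvantage: the number of combinations can become huge, so the search can take a long time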
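
To make the count concrete, the combinations of a grid can be enumerated with scikit-learn's `ParameterGrid`. The candidate values below are hypothetical and only illustrate the five-by-five case mentioned above.

```python
from sklearn.model_selection import ParameterGrid

# Five hypothetical candidate values per hyperparameter
grid = ParameterGrid({
    'max_depth': [3, 5, 10, 20, 50],
    'min_samples_split': [2, 5, 10, 20, 30],
})

# Number of combinations: 5 * 5
print(len(grid))

# Show the first few combinations that a grid search would try
for params in list(grid)[:3]:
    print(params)
```

Adding a third hyperparameter with five candidates would multiply the count by five again, which is why the search time grows quickly.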

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the algorithm used for training
estimator = DecisionTreeClassifier(random_state=0)

# Define the hyperparameters to search and their candidate values
param_grid = [{
    'max_depth': [3, 20, 50],
    'min_samples_split': [3, 20, 30]
}]

# Number of folds (the value of K)
cv = 5

In [None]:
# Define the model with the GridSearchCV class
tuned_model = GridSearchCV(estimator=estimator,
                           param_grid=param_grid,
                           cv=cv, return_train_score=False)

# Train and validate
tuned_model.fit(x_train_val, t_train_val)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=0, splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'max_depth': [3, 20, 50],
                          'min_samples_split': [3, 20, 30]}],
             

The results are stored in cv_results_<br>
→ It is a dict, so converting it to a pandas.DataFrame makes it easier to read

In [None]:
# Check the validation results
pd.DataFrame(tuned_model.cv_results_).T  # transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8
mean_fit_time,0.00538058,0.00372853,0.00377016,0.00451956,0.00435839,0.00448179,0.00518718,0.00481591,0.00447125
std_fit_time,0.000407422,0.000109212,0.000145345,0.00022555,0.000231186,0.000293879,0.000224799,0.000256493,0.000391513
mean_score_time,0.00051136,0.000320292,0.00031414,0.000328207,0.000318861,0.00032835,0.000376225,0.000377846,0.000313282
std_score_time,7.82553e-05,2.06947e-05,1.74085e-05,2.78852e-05,5.69935e-06,3.76092e-05,4.66543e-05,2.21939e-05,1.63834e-05
param_max_depth,3,3,3,20,20,20,50,50,50
param_min_samples_split,3,20,30,3,20,30,3,20,30
params,"{'max_depth': 3, 'min_samples_split': 3}","{'max_depth': 3, 'min_samples_split': 20}","{'max_depth': 3, 'min_samples_split': 30}","{'max_depth': 20, 'min_samples_split': 3}","{'max_depth': 20, 'min_samples_split': 20}","{'max_depth': 20, 'min_samples_split': 30}","{'max_depth': 50, 'min_samples_split': 3}","{'max_depth': 50, 'min_samples_split': 20}","{'max_depth': 50, 'min_samples_split': 30}"
split0_test_score,0.923077,0.912088,0.912088,0.956044,0.912088,0.912088,0.956044,0.912088,0.912088
split1_test_score,0.901099,0.901099,0.901099,0.912088,0.901099,0.901099,0.912088,0.901099,0.901099
split2_test_score,0.934066,0.934066,0.934066,0.923077,0.934066,0.934066,0.923077,0.934066,0.934066


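Using these results, tune the hyperparameters again over a narrower range than before<br>
↑ Repeating this several times gradually moves toward hyperparameters with higher prediction accuracy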
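
When deciding how to narrow the range, a compact view of `cv_results_` can be easier to read than the full table. A sketch under assumptions: the search is re-run here with the same grid so the block is self-contained, and the names `search` and `summary` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_tv, x_te, t_tv, t_te = train_test_split(data.data, data.target, test_size=0.2, random_state=1)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [3, 20, 50], 'min_samples_split': [3, 20, 30]},
    cv=5,
)
search.fit(x_tv, t_tv)

# Keep only the columns needed to compare candidates, best candidate first
columns = ['param_max_depth', 'param_min_samples_split',
           'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(summary)
```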

In [None]:
estimator = DecisionTreeClassifier(random_state=0)

param_grid = [{
    'max_depth': [5, 10, 15],
    'min_samples_split': [10, 12, 15]
}]

cv = 5

In [None]:
# Define
tuned_model = GridSearchCV(estimator=estimator,
                           param_grid=param_grid,
                           cv=cv, return_train_score=False)

# Train
tuned_model.fit(x_train_val, t_train_val)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=0, splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'max_depth': [5, 10, 15],
                          'min_samples_split': [10, 12, 15]}],
            

In [None]:
# Check
pd.DataFrame(tuned_model.cv_results_).T

Unnamed: 0,0,1,2,3,4,5,6,7,8
mean_fit_time,0.0050252,0.00455704,0.00439839,0.00449286,0.0044167,0.00437136,0.00438671,0.00468855,0.00459666
std_fit_time,0.000632319,0.00026257,0.000221195,0.000207764,0.00017486,0.000224028,0.000189805,0.00022735,0.00023642
mean_score_time,0.000400496,0.000358343,0.000330973,0.000355291,0.000326967,0.000296974,0.000316763,0.000363398,0.000350666
std_score_time,7.8139e-05,7.24827e-05,1.77582e-05,8.19323e-05,2.09698e-05,1.04313e-05,3.65618e-05,4.11854e-05,7.63455e-05
param_max_depth,5,5,5,10,10,10,15,15,15
param_min_samples_split,10,12,15,10,12,15,10,12,15
params,"{'max_depth': 5, 'min_samples_split': 10}","{'max_depth': 5, 'min_samples_split': 12}","{'max_depth': 5, 'min_samples_split': 15}","{'max_depth': 10, 'min_samples_split': 10}","{'max_depth': 10, 'min_samples_split': 12}","{'max_depth': 10, 'min_samples_split': 15}","{'max_depth': 15, 'min_samples_split': 10}","{'max_depth': 15, 'min_samples_split': 12}","{'max_depth': 15, 'min_samples_split': 15}"
split0_test_score,0.967033,0.923077,0.912088,0.967033,0.923077,0.912088,0.967033,0.923077,0.912088
split1_test_score,0.912088,0.901099,0.901099,0.912088,0.901099,0.901099,0.912088,0.901099,0.901099
split2_test_score,0.923077,0.934066,0.934066,0.923077,0.934066,0.934066,0.923077,0.934066,0.934066


Finally, use the test data to check the prediction accuracy of the model trained with grid search

In [None]:
# Check the hyperparameters with the highest prediction accuracy
tuned_model.best_params_

{'max_depth': 5, 'min_samples_split': 10}

In [None]:
# Take over the model with the highest prediction accuracy
best_model = tuned_model.best_estimator_

# Evaluate
print(best_model.score(x_train_val, t_train_val))
print(best_model.score(x_test, t_test))

0.9934065934065934
0.956140350877193


The test accuracy is higher than that of the manually tuned model (0.956 vs. 0.930)

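3. Random search

One drawback of grid search is that it can only search points on the grid<br>
→ Random search samples hyperparameters at random from the specified range, then trains and validates

Drawback: not every hyperparameter value is searched, so it is hard to judge whether the result is truly optimal

One option is to narrow the range with random search first, then search locally with grid search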
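
The two-stage idea can be sketched as follows. This is an illustration under assumptions rather than an established recipe: the `scipy.stats.randint` distributions, `n_iter=30`, the window of ±2 around the first-stage result, and the names `coarse` and `fine` are all choices made here.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_tv, x_te, t_tv, t_te = train_test_split(data.data, data.target, test_size=0.2, random_state=1)

# Stage 1: coarse random search over a wide range
coarse = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={'max_depth': randint(2, 100), 'min_samples_split': randint(2, 50)},
    n_iter=30, cv=5, random_state=0,
)
coarse.fit(x_tv, t_tv)
d0 = int(coarse.best_params_['max_depth'])
s0 = int(coarse.best_params_['min_samples_split'])

# Stage 2: fine grid around the stage-1 best (the +/-2 window is an arbitrary choice)
fine = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={
        'max_depth': [d for d in range(d0 - 2, d0 + 3) if d >= 1],
        'min_samples_split': [s for s in range(s0 - 2, s0 + 3) if s >= 2],
    },
    cv=5,
)
fine.fit(x_tv, t_tv)

print('coarse:', coarse.best_params_, coarse.best_score_)
print('fine  :', fine.best_params_, fine.best_score_)
```

Because the fine grid contains the stage-1 best point and both searches use the same folds and the same `random_state`, the second stage cannot score worse on validation than the first.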

In [None]:
# Import the class
from sklearn.model_selection import RandomizedSearchCV

# Algorithm used for training
estimator = DecisionTreeClassifier(random_state=0)

# Specify the range of hyperparameters to search
param_distributions = {
    'max_depth': list(range(5, 100, 2)),
    'min_samples_split': list(range(2, 50, 1))
}

In [None]:
# Number of trials
n_iter = 100

cv=5

In [None]:
# Define the model
tuned_model = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=param_distributions,
    n_iter=n_iter, cv=cv,
    random_state=0, return_train_score=False
)

# Train
tuned_model.fit(x_train_val, t_train_val)

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort='deprecated',
                                                    random_state=0,
             

In [None]:
# Check the results
pd.DataFrame(tuned_model.cv_results_).sort_values('rank_test_score').T

Unnamed: 0,47,77,82,90,42,19,28,12,11,62,69,39,70,3,96,29,6,68,43,34,9,48,45,33,91,32,25,37,44,46,36,52,54,57,59,61,63,66,76,75,...,97,78,83,92,79,89,88,80,81,87,86,85,93,0,49,71,2,5,7,16,17,20,21,22,23,26,72,30,35,38,40,41,98,50,55,58,60,67,31,99
mean_fit_time,0.00445838,0.00464659,0.0046514,0.00451841,0.00453186,0.00468645,0.00467234,0.00457497,0.0044528,0.00450087,0.00448284,0.00451918,0.00455518,0.00448589,0.00447445,0.00462341,0.00472898,0.00463834,0.00449481,0.00446253,0.00448613,0.0045135,0.00457406,0.00448818,0.00439792,0.00444193,0.00468016,0.00455141,0.00442305,0.00441236,0.00441704,0.00442314,0.00435185,0.00440474,0.00455632,0.00515432,0.0044208,0.0043776,0.00443525,0.00444531,...,0.00436234,0.00444055,0.00449553,0.00432367,0.00458736,0.00428576,0.00436463,0.00453935,0.00450506,0.00438032,0.00443544,0.00435729,0.00432405,0.00548983,0.00439672,0.004459,0.00431781,0.00451088,0.00439315,0.00431719,0.00439587,0.00453525,0.00452576,0.00536437,0.00461078,0.00455875,0.00443826,0.00449376,0.00438313,0.00433655,0.0043354,0.00440416,0.00433483,0.00436707,0.00429335,0.00428762,0.00435429,0.00433421,0.00442853,0.00513306
std_fit_time,0.000204358,0.000254905,0.000362382,0.000227998,0.000180702,0.000307203,0.00031108,0.0002982,0.000226895,0.00018839,0.000198673,0.000231008,0.000225031,0.000179938,0.000219211,0.000261818,0.000169405,0.000198043,0.000213739,0.000221133,0.000210379,0.000236005,0.000155485,0.000182134,0.000233776,0.000232693,0.000187839,0.000202387,0.000217124,0.000189218,0.000215952,0.000249602,0.00024437,0.00022616,0.000245348,0.000855518,0.000252411,0.000249988,0.000220618,0.000215792,...,0.00023754,0.000238753,0.000322059,0.000255144,0.000266598,0.000227522,0.000253373,0.000282285,0.000282547,0.000258218,0.000238636,0.000246714,0.000243556,0.00144303,0.00024477,0.000251185,0.000245155,0.000176763,0.000253467,0.000229856,0.000307517,0.000333186,0.00019616,0.000968233,0.000329289,0.000343912,0.00024478,0.000231919,0.000230898,0.000219953,0.000235685,0.000280814,0.000258117,0.000224429,0.000224774,0.000248542,0.000213183,0.000236166,0.000290886,0.0010407
mean_score_time,0.000339222,0.000392389,0.000405645,0.000342178,0.000351381,0.000372982,0.000376749,0.000362921,0.000327682,0.000350189,0.000347853,0.000347757,0.000347519,0.000335312,0.000332022,0.000365639,0.00036273,0.000408506,0.000354576,0.000330782,0.000348854,0.000333691,0.0003479,0.000360489,0.000328875,0.000344706,0.000395298,0.00038929,0.000339794,0.000341606,0.000346422,0.000348234,0.000314856,0.000328779,0.000376034,0.00041728,0.000342751,0.000345421,0.000339317,0.000363684,...,0.000333166,0.00036335,0.000359583,0.000324059,0.000379658,0.000328875,0.000339556,0.000392771,0.000385857,0.000334358,0.000348282,0.00033164,0.000319242,0.000437164,0.000342417,0.000362349,0.000316715,0.000379181,0.000331926,0.000333452,0.000320339,0.000375652,0.000399065,0.000520706,0.00039506,0.000387526,0.000353432,0.00036149,0.000327969,0.000332737,0.000323248,0.000361586,0.000331497,0.00034976,0.000327539,0.000319099,0.000345421,0.000335026,0.000353432,0.000488043
std_score_time,1.57337e-05,2.10519e-05,6.34462e-05,2.57074e-05,1.57417e-05,1.41071e-05,2.78431e-05,1.9116e-05,1.41811e-05,2.89061e-05,3.79387e-05,1.82614e-05,1.6956e-05,1.98902e-05,2.24e-05,3.26964e-05,3.40425e-05,8.79343e-05,1.19603e-05,2.06188e-05,5.59629e-06,2.13677e-05,2.64734e-05,1.93487e-05,1.97065e-05,2.38664e-05,2.02277e-05,2.34725e-05,1.98134e-05,1.079e-05,3.67109e-05,2.78228e-05,8.15182e-06,1.63278e-05,3.89005e-05,8.41662e-05,1.72056e-05,1.94807e-05,2.33111e-05,6.00453e-05,...,2.02965e-05,2.97081e-05,1.67066e-05,1.50626e-05,2.03228e-05,1.54422e-05,1.87327e-05,1.51823e-05,2.87555e-05,1.92027e-05,1.86415e-05,2.99015e-05,1.01601e-05,0.00014283,2.70074e-05,2.5232e-05,1.3538e-05,1.96759e-05,1.69927e-05,1.81784e-05,1.9128e-05,3.51227e-05,1.78934e-05,0.000157722,1.89329e-05,5.21246e-05,2.26714e-05,2.71347e-05,7.97131e-06,1.84707e-05,9.41485e-06,2.17061e-05,1.75199e-05,2.16874e-05,4.85438e-06,8.2406e-06,1.63496e-05,1.05664e-05,1.78324e-05,0.000206083
param_min_samples_split,10,10,4,4,7,9,11,2,8,7,4,2,2,2,4,6,8,4,9,5,5,5,5,13,12,12,12,13,14,16,14,14,24,14,20,16,23,23,15,16,...,29,39,44,36,27,35,36,31,48,43,31,39,42,30,38,27,37,40,36,40,39,27,27,43,41,27,30,42,27,43,49,31,45,27,43,36,36,47,44,39
param_max_depth,23,65,95,39,15,37,7,87,29,7,9,21,97,89,41,65,25,47,35,59,87,29,13,73,5,31,55,35,11,77,15,49,7,53,91,45,91,95,69,61,...,89,27,61,39,81,89,17,73,15,67,27,37,71,9,9,45,63,95,59,11,25,27,37,73,55,19,79,93,35,49,87,23,19,99,27,27,47,75,95,87
params,"{'min_samples_split': 10, 'max_depth': 23}","{'min_samples_split': 10, 'max_depth': 65}","{'min_samples_split': 4, 'max_depth': 95}","{'min_samples_split': 4, 'max_depth': 39}","{'min_samples_split': 7, 'max_depth': 15}","{'min_samples_split': 9, 'max_depth': 37}","{'min_samples_split': 11, 'max_depth': 7}","{'min_samples_split': 2, 'max_depth': 87}","{'min_samples_split': 8, 'max_depth': 29}","{'min_samples_split': 7, 'max_depth': 7}","{'min_samples_split': 4, 'max_depth': 9}","{'min_samples_split': 2, 'max_depth': 21}","{'min_samples_split': 2, 'max_depth': 97}","{'min_samples_split': 2, 'max_depth': 89}","{'min_samples_split': 4, 'max_depth': 41}","{'min_samples_split': 6, 'max_depth': 65}","{'min_samples_split': 8, 'max_depth': 25}","{'min_samples_split': 4, 'max_depth': 47}","{'min_samples_split': 9, 'max_depth': 35}","{'min_samples_split': 5, 'max_depth': 59}","{'min_samples_split': 5, 'max_depth': 87}","{'min_samples_split': 5, 'max_depth': 29}","{'min_samples_split': 5, 'max_depth': 13}","{'min_samples_split': 13, 'max_depth': 73}","{'min_samples_split': 12, 'max_depth': 5}","{'min_samples_split': 12, 'max_depth': 31}","{'min_samples_split': 12, 'max_depth': 55}","{'min_samples_split': 13, 'max_depth': 35}","{'min_samples_split': 14, 'max_depth': 11}","{'min_samples_split': 16, 'max_depth': 77}","{'min_samples_split': 14, 'max_depth': 15}","{'min_samples_split': 14, 'max_depth': 49}","{'min_samples_split': 24, 'max_depth': 7}","{'min_samples_split': 14, 'max_depth': 53}","{'min_samples_split': 20, 'max_depth': 91}","{'min_samples_split': 16, 'max_depth': 45}","{'min_samples_split': 23, 'max_depth': 91}","{'min_samples_split': 23, 'max_depth': 95}","{'min_samples_split': 15, 'max_depth': 69}","{'min_samples_split': 16, 'max_depth': 61}",...,"{'min_samples_split': 29, 'max_depth': 89}","{'min_samples_split': 39, 'max_depth': 27}","{'min_samples_split': 44, 'max_depth': 61}","{'min_samples_split': 36, 'max_depth': 39}","{'min_samples_split': 27, 
'max_depth': 81}","{'min_samples_split': 35, 'max_depth': 89}","{'min_samples_split': 36, 'max_depth': 17}","{'min_samples_split': 31, 'max_depth': 73}","{'min_samples_split': 48, 'max_depth': 15}","{'min_samples_split': 43, 'max_depth': 67}","{'min_samples_split': 31, 'max_depth': 27}","{'min_samples_split': 39, 'max_depth': 37}","{'min_samples_split': 42, 'max_depth': 71}","{'min_samples_split': 30, 'max_depth': 9}","{'min_samples_split': 38, 'max_depth': 9}","{'min_samples_split': 27, 'max_depth': 45}","{'min_samples_split': 37, 'max_depth': 63}","{'min_samples_split': 40, 'max_depth': 95}","{'min_samples_split': 36, 'max_depth': 59}","{'min_samples_split': 40, 'max_depth': 11}","{'min_samples_split': 39, 'max_depth': 25}","{'min_samples_split': 27, 'max_depth': 27}","{'min_samples_split': 27, 'max_depth': 37}","{'min_samples_split': 43, 'max_depth': 73}","{'min_samples_split': 41, 'max_depth': 55}","{'min_samples_split': 27, 'max_depth': 19}","{'min_samples_split': 30, 'max_depth': 79}","{'min_samples_split': 42, 'max_depth': 93}","{'min_samples_split': 27, 'max_depth': 35}","{'min_samples_split': 43, 'max_depth': 49}","{'min_samples_split': 49, 'max_depth': 87}","{'min_samples_split': 31, 'max_depth': 23}","{'min_samples_split': 45, 'max_depth': 19}","{'min_samples_split': 27, 'max_depth': 99}","{'min_samples_split': 43, 'max_depth': 27}","{'min_samples_split': 36, 'max_depth': 27}","{'min_samples_split': 36, 'max_depth': 47}","{'min_samples_split': 47, 'max_depth': 75}","{'min_samples_split': 44, 'max_depth': 95}","{'min_samples_split': 39, 'max_depth': 87}"
split0_test_score,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.956044,0.967033,0.967033,0.967033,0.956044,0.956044,0.956044,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.967033,0.923077,0.923077,0.923077,0.923077,0.923077,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,...,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088
split1_test_score,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.901099,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,...,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099,0.901099
split2_test_score,0.923077,0.923077,0.912088,0.912088,0.912088,0.912088,0.923077,0.923077,0.912088,0.912088,0.912088,0.923077,0.923077,0.923077,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.912088,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,0.934066,...,0.934066,0.945055,0.945055,0.945055,0.934066,0.945055,0.945055,0.934066,0.945055,0.945055,0.934066,0.945055,0.945055,0.934066,0.945055,0.934066,0.945055,0.945055,0.945055,0.945055,0.945055,0.934066,0.934066,0.945055,0.945055,0.934066,0.934066,0.945055,0.934066,0.945055,0.945055,0.934066,0.945055,0.934066,0.945055,0.945055,0.945055,0.945055,0.945055,0.945055


In [None]:
# Hyperparameters with the highest accuracy
tuned_model.best_params_

{'max_depth': 23, 'min_samples_split': 10}

In [None]:
# Take over the best model
best_model = tuned_model.best_estimator_

# Evaluate
print(best_model.score(x_train_val, t_train_val))
print(best_model.score(x_test, t_test))

0.9934065934065934
0.956140350877193


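Because it does not cover every specified hyperparameter value, random search cannot be called exhaustive, but it is very useful for getting a rough idea of where hyperparameters with high prediction accuracy lie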
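
One way to get that rough idea is to average the validation score over ranges of a hyperparameter. A sketch under assumptions: the search is re-run with the same settings so the block stands alone, and the bin edges and the names `search`, `results`, and `by_range` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_tv, x_te, t_tv, t_te = train_test_split(data.data, data.target, test_size=0.2, random_state=1)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={
        'max_depth': list(range(5, 100, 2)),
        'min_samples_split': list(range(2, 50, 1)),
    },
    n_iter=100, cv=5, random_state=0,
)
search.fit(x_tv, t_tv)

results = pd.DataFrame(search.cv_results_)
results['min_samples_split'] = results['param_min_samples_split'].astype(int)

# Group the trials by range of min_samples_split (bin edges are arbitrary)
results['range'] = pd.cut(results['min_samples_split'], bins=[1, 10, 20, 30, 50])
by_range = results.groupby('range', observed=True)['mean_test_score'].agg(['mean', 'max', 'count'])
print(by_range)
```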

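4. Bayesian optimization

Searches for hyperparameters using ideas from probability and statistics known as the prior and posterior distributions<br>
The search alternates between two kinds of trial and error called exploration and exploitation<br>
Intuitively, it resembles the way a person proceeds by trial and error

Exploration: try hyperparameter values in ranges not yet tested, to learn how the prediction accuracy changes

Exploitation: using the information gained from exploration, move the hyperparameters toward ranges where the prediction accuracy is likely to be high

This tutorial implements Bayesian optimization with Optuna, a framework developed by the Japanese company Preferred Networks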

In [None]:
!pip install optuna

Collecting optuna
  Downloading https://files.pythonhosted.org/packages/85/ee/2688cce5ced0597e12832d1ec4f4383a468f6bddff768eeaa3b5bf4f6500/optuna-1.3.0.tar.gz (163kB)
     |████████████████████████████████| 163kB 3.3MB/s
Collecting alembic
  Downloading https://files.pythonhosted.org/packages/60/1e/cabc75a189de0fbb2841d0975243e59bde8b7822bacbb95008ac6fe9ad47/alembic-1.4.2.tar.gz (1.1MB)
     |████████████████████████████████| 1.1MB 10.4MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting cliff
  Downloading https://files.pythonhosted.org/packages/b9/17/57187872842bf9f65815b6969b515528ec7fd754137d2d3f49e3bc016175/cliff-3.1.0-py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 6.3MB/s
Collecting cmaes
  Downloading https://files.pythonhosted.org/packages/56/00/16b9a086181cd227178d5279a40f7dce4fd16d7b3

In [None]:
import optuna

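In Optuna, you first define a function objective and, inside it, define the following steps in order.

1. Specify the search range for each hyperparameter

2. Specify the algorithm used for training

3. Run training and display the validation result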

In [None]:
from sklearn.model_selection import cross_val_score

def objective(trial, x, t, cv):
  # ① Specify the search range for each hyperparameter
  max_depth = trial.suggest_int('max_depth', 2, 100)
  min_samples_split = trial.suggest_int('min_samples_split', 2, 100)

  # ② Specify the algorithm used for training
  estimator = DecisionTreeClassifier(
      max_depth = max_depth,
      min_samples_split = min_samples_split
  )

  # ③ Run training and display the validation result
  print('Current_params: ', trial.params)
  accuracy = cross_val_score(estimator, x, t, cv=cv).mean()
  return accuracy

Run the hyperparameter tuning<br>
By default Optuna minimizes the objective, but here the goal is to maximize accuracy

In [None]:
# Create the study object (maximize)
study = optuna.create_study(direction='maximize')

In [None]:
# K-fold cross-validation
cv = 5

# Optimize the objective function
study.optimize(lambda trial: objective(trial, x_train_val, t_train_val, cv), n_trials=10)

print(study.best_trial)

[I 2020-04-19 06:28:28,480] Finished trial#10 with value: 0.9384615384615385 with parameters: {'max_depth': 100, 'min_samples_split': 2}. Best is trial#9 with value: 0.9428571428571428.


Current_params:  {'max_depth': 100, 'min_samples_split': 2}
Current_params:  {'max_depth': 97, 'min_samples_split': 2}


[I 2020-04-19 06:28:28,622] Finished trial#11 with value: 0.9406593406593406 with parameters: {'max_depth': 97, 'min_samples_split': 2}. Best is trial#9 with value: 0.9428571428571428.
[I 2020-04-19 06:28:28,762] Finished trial#12 with value: 0.9252747252747253 with parameters: {'max_depth': 100, 'min_samples_split': 19}. Best is trial#9 with value: 0.9428571428571428.


Current_params:  {'max_depth': 100, 'min_samples_split': 19}
Current_params:  {'max_depth': 68, 'min_samples_split': 3}


[I 2020-04-19 06:28:28,905] Finished trial#13 with value: 0.9472527472527472 with parameters: {'max_depth': 68, 'min_samples_split': 3}. Best is trial#13 with value: 0.9472527472527472.
[I 2020-04-19 06:28:29,048] Finished trial#14 with value: 0.9186813186813187 with parameters: {'max_depth': 66, 'min_samples_split': 36}. Best is trial#13 with value: 0.9472527472527472.


Current_params:  {'max_depth': 66, 'min_samples_split': 36}
Current_params:  {'max_depth': 61, 'min_samples_split': 17}


[I 2020-04-19 06:28:29,195] Finished trial#15 with value: 0.9252747252747253 with parameters: {'max_depth': 61, 'min_samples_split': 17}. Best is trial#13 with value: 0.9472527472527472.
[I 2020-04-19 06:28:29,337] Finished trial#16 with value: 0.9230769230769231 with parameters: {'max_depth': 82, 'min_samples_split': 33}. Best is trial#13 with value: 0.9472527472527472.


Current_params:  {'max_depth': 82, 'min_samples_split': 33}
Current_params:  {'max_depth': 76, 'min_samples_split': 2}


[I 2020-04-19 06:28:29,483] Finished trial#17 with value: 0.9428571428571428 with parameters: {'max_depth': 76, 'min_samples_split': 2}. Best is trial#13 with value: 0.9472527472527472.
[I 2020-04-19 06:28:29,625] Finished trial#18 with value: 0.9208791208791209 with parameters: {'max_depth': 52, 'min_samples_split': 63}. Best is trial#13 with value: 0.9472527472527472.


Current_params:  {'max_depth': 52, 'min_samples_split': 63}
Current_params:  {'max_depth': 89, 'min_samples_split': 28}


[I 2020-04-19 06:28:29,772] Finished trial#19 with value: 0.9186813186813187 with parameters: {'max_depth': 89, 'min_samples_split': 28}. Best is trial#13 with value: 0.9472527472527472.


FrozenTrial(number=13, value=0.9472527472527472, datetime_start=datetime.datetime(2020, 4, 19, 6, 28, 28, 764476), datetime_complete=datetime.datetime(2020, 4, 19, 6, 28, 28, 904913), params={'max_depth': 68, 'min_samples_split': 3}, distributions={'max_depth': IntUniformDistribution(high=100, low=2, step=1), 'min_samples_split': IntUniformDistribution(high=100, low=2, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=13, state=TrialState.COMPLETE)


In [None]:
study.best_params

{'max_depth': 68, 'min_samples_split': 3}

In [None]:
# Define the model with the best hyperparameters
best_model = DecisionTreeClassifier(**study.best_params)

# Train the model
best_model.fit(x_train_val, t_train_val)

# Evaluate
print(best_model.score(x_train_val, t_train_val))
print(best_model.score(x_test, t_test))

1.0
0.9473684210526315
