# Automated hyperparameter tuning for a decision tree model
This notebook uses **HyperDrive**, the automated hyperparameter tuning feature of Azure Machine Learning service, to tune the hyperparameters of a Scikit-learn decision tree.

## Connect to the Azure ML Workspace
Connect to the Azure Machine Learning service workspace.

In [1]:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

experiment = Experiment(workspace = ws, name = "simple-hyperdrive")

Workspace name: azureml
Azure region: eastus
Subscription id: 9c0f91b8-eb2f-484c-979c-15848c098a6b
Resource group: mlservice


In [2]:
import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.55


## Upload data to the cloud
Upload the training data from the local (on-premises) machine to the cloud. This notebook uses the default [Datastore](https://docs.microsoft.com/ja-JP/azure/machine-learning/service/how-to-access-data), "workspaceblobstore".

In [3]:
# List the Datastores
#ws.datastores

In [4]:
# Get the default Datastore
ds = ws.get_default_datastore()
ds.name

'workspaceblobstore'

In [5]:
# Upload to the data folder
ds.upload(src_dir='./data', target_path='data', overwrite=True, show_progress=True)

Uploading an estimated of 1 files
Uploading ./data/Factory.csv
Uploaded ./data/Factory.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_53ac5041af354195abdd688ab7d4c07b

## Prepare the training code

In [6]:
import os
project_folder = "./script"
os.makedirs(project_folder, exist_ok=True)

In [7]:
%%writefile {project_folder}/myenv.yml
name: project_environment
channels:
- conda-forge
dependencies:
- python=3.6.2
- scikit-learn=0.20.3
- python-graphviz
- pip:
  - azureml-defaults
  - pydotplus

Overwriting ./script/myenv.yml


In [14]:
%%writefile {project_folder}/DecisionTree_hyperdrive.py

import argparse
import os
import pickle

import numpy as np
import pandas as pd
import pydotplus
from azureml.core import Run
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

run = Run.get_context()

# Hyperparameters and the data location arrive as command-line arguments
parser = argparse.ArgumentParser(description='Decision Tree Hyperparameter')
parser.add_argument('--max_depth', '-m', type=int, default=5, help='max depth of Decision Tree')
parser.add_argument('--criterion', '-c', type=str, default="gini", help='Criterion policy')
parser.add_argument('--min_samples_split', '-s', type=int, default=2, help='Min Sample Split')
parser.add_argument('--dataset', '-d', dest='data_folder', help='The datastore')

args = parser.parse_args()
np.random.seed(12345)

# Prepare the data
df = pd.read_csv(os.path.join(args.data_folder, "data", "Factory.csv"))

X = df.drop(["Quality", "ID"], axis=1)
y = df["Quality"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=100, stratify=y)

# Algorithm
max_depth = args.max_depth
criterion = args.criterion
min_samples_split = args.min_samples_split

clf = tree.DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, min_samples_split=min_samples_split)
clf.fit(X_train, y_train)

# Evaluate accuracy on the held-out split
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print("Accuracy", accuracy)

run.log("Accuracy", accuracy)
run.log("Max Depth", max_depth)
run.log("criterion", criterion)
run.log("Min Samples Split", min_samples_split)

# Save the model and upload it to the run's outputs folder
filename = 'DT-model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)
run.upload_file(name="outputs/" + filename, path_or_stream=filename)

print(pred)
print(clf.predict_proba(X_test))

# Render the fitted tree as an image (make sure the outputs folder exists first)
os.makedirs("outputs", exist_ok=True)
with run:
    dot_data = tree.export_graphviz(clf, out_file=None, feature_names=X.columns, proportion=True, filled=True, rounded=True, special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png("outputs/tree.png")

Overwriting ./script/DecisionTree_hyperdrive.py


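## Configure Machine Learning Compute

Configure Machine Learning Compute. Existing Machine Learning Compute environments can be checked in the Compute tab of the Azure Portal, or looked up by name with the command below (here, the existing cluster `cpucluster`).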

In [15]:
from azureml.core.compute import ComputeTarget, AmlCompute
compute_target = ComputeTarget(ws,"cpucluster")

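## Configure model training

Configure the Estimator. Because the model is built with Scikit-learn, the dedicated [Scikit-learn Estimator](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py) is one option; the cell below uses the generic `Estimator` together with the conda environment defined in `myenv.yml`. The first run uses fixed hyperparameter values.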

In [16]:
from azureml.train.estimator import Estimator

script_params = {
    '--dataset': ds.as_mount(),
    # '--max_dept' is accepted because argparse matches unambiguous prefixes of '--max_depth'
    '--max_dept': 5,
    '--criterion': 'gini',
    '--min_samples_split': 2
}

estimator = Estimator(source_directory=project_folder,
                      compute_target=compute_target,
                      entry_script='DecisionTree_hyperdrive.py',
                      script_params=script_params,
                      conda_dependencies_file="myenv.yml")

### Start the run

Build the training environment according to the Estimator settings defined above, and start training the model.

In [17]:
run = experiment.submit(estimator)
print(run)

Run(Experiment: simple-hyperdrive,
Id: simple-hyperdrive_1566738739_57033de4,
Type: azureml.scriptrun,
Status: Queued)


In [19]:
# This only fetches the run status, so it is safe to execute any number of times
from azureml.widgets import RunDetails
RunDetails(run).show() 

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

Confirm that the training run has completed successfully, then move on to the next step.
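The completion check can also be done in code instead of by watching the widget. The sketch below polls `get_status()` until the run reaches a terminal state. `wait_until_finished` and `FakeRun` are hypothetical names introduced only for this illustration, and the set of terminal states is an assumption based on the status strings the SDK reports; with a real run object, the SDK's own `run.wait_for_completion()` is the simpler choice.

```python
import time

# Assumed terminal status strings for an Azure ML run
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_finished(run, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll run.get_status() until a terminal state is reached; return that state."""
    waited = 0
    status = run.get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError("run still in state %r after %s seconds" % (status, waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = run.get_status()
    return status


class FakeRun:
    """Hypothetical stand-in for a run object, used only to exercise the helper."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self._last = None

    def get_status(self):
        # Return the next scripted status, repeating the last one when exhausted
        self._last = next(self._statuses, self._last)
        return self._last


demo_status = wait_until_finished(
    FakeRun(["Queued", "Running", "Completed"]),
    poll_seconds=0,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(demo_status)
```

In the notebook, the helper would be called as `wait_until_finished(run)`, and the model registration step would proceed only when the returned state is `"Completed"`.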

## Register the model

In [26]:
# Execute this after the run has completed
run.get_file_names()

In [27]:
model = run.register_model(model_name='DT-hyperdrive',
                           model_path='outputs/DT-model.pkl',
                           tags={'area': "decision tree model for quality prediction with hyperdrive",
                                 'type': "scikit-learn DT"})
print(model.name, model.id, model.version, sep = '\t')

DT-hyperdrive	DT-hyperdrive:7	7


In [28]:
#run.get_details()

# Hyperparameter tuning with HyperDrive

Use Machine Learning Compute to run hyperparameter tuning in parallel across multiple nodes. This notebook uses random search.
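As a rough illustration of what random search does, the sketch below draws configurations uniformly at random from the same discrete ranges the notebook uses and keeps the one with the highest metric. It is plain Python, not the HyperDrive implementation: `sample_configs`, `best_trial`, and `toy_accuracy` are hypothetical names, and `toy_accuracy` is a made-up scoring function that stands in for a real training run.

```python
import random

# Same ranges as the notebook's search space, written as plain Python lists
SEARCH_SPACE = {
    "max_depth": list(range(1, 100)),
    "criterion": ["gini", "entropy"],
    "min_samples_split": list(range(2, 5)),
}


def sample_configs(space, n_trials, seed=0):
    """Draw n_trials configurations uniformly at random from a discrete space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_trials)
    ]


def toy_accuracy(config):
    """Hypothetical stand-in for a training run; NOT a real model score."""
    depth_penalty = abs(config["max_depth"] - 10) / 100
    return round(0.9 - depth_penalty, 4)


def best_trial(trials):
    """Return the (config, metric) pair with the highest metric."""
    return max(trials, key=lambda trial: trial[1])


configs = sample_configs(SEARCH_SPACE, n_trials=20)
trials = [(config, toy_accuracy(config)) for config in configs]
best_config, best_metric = best_trial(trials)
print(best_config, best_metric)
```

Each sampled configuration is independent of the others, which is why the trials can be spread across several compute nodes at once.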

In [29]:
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice

# Hyperparameter search space
# ("--max_dept" reaches the script as "--max_depth" through argparse prefix matching)
param_sampling = RandomParameterSampling({
    "--max_dept": choice(range(1, 100)),
    "--criterion": choice("gini", "entropy"),
    "--min_samples_split": choice(range(2, 5))
})

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                            hyperparameter_sampling=param_sampling, 
                                            primary_metric_name='Accuracy',
                                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                            max_total_runs=20,
                                            max_concurrent_runs=10)

## Start the run

In [30]:
hyperdrive_run = experiment.submit(hyperdrive_run_config)

The same input parameter(s) are specified in estimator/run_config script params and HyperDrive parameter space. HyperDrive parameter space definition will override these duplicate entries. ['--max_dept', '--criterion', '--min_samples_split'] is the list of overridden parameter(s).
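The warning above says that values sampled by HyperDrive replace the fixed values given in `script_params`. The sketch below mimics that override with a plain dictionary merge (an illustration only, not HyperDrive's actual code; the sampled values are made up) and then shows how the training script's parser still receives `--max_dept`: by default, argparse accepts any unambiguous prefix of a long option such as `--max_depth`.

```python
import argparse

# Parser with the same hyperparameter options as the training script
parser = argparse.ArgumentParser()
parser.add_argument("--max_depth", type=int, default=5)
parser.add_argument("--criterion", type=str, default="gini")
parser.add_argument("--min_samples_split", type=int, default=2)

# Fixed values from script_params and one hypothetical sampled configuration
fixed = {"--max_dept": 5, "--criterion": "gini", "--min_samples_split": 2}
sampled = {"--max_dept": 12, "--criterion": "entropy", "--min_samples_split": 3}

# Later entries win, so the sampled values override the fixed ones
merged = {**fixed, **sampled}

# Flatten to a command line: ["--max_dept", "12", "--criterion", "entropy", ...]
argv = [str(token) for pair in merged.items() for token in pair]
args = parser.parse_args(argv)
print(args)
```

Spelling the option out in full as `--max_depth` avoids relying on prefix matching, which would stop working if another option starting with `--max_dept` were ever added to the script.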


In [31]:
from azureml.widgets import RunDetails
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Download the champion model

In [33]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()

In [34]:
# Register the model produced by the best HyperDrive run, then download it
model = best_run.register_model(model_name='hyperdrive-model', model_path='outputs/DT-model.pkl')
path = model.download(exist_ok=True)

In [35]:
import pickle

# The training script saved the model with pickle, so load it the same way
with open(path, 'rb') as f:
    fitted_model = pickle.load(f)

## Get the test data

In [37]:
import pandas as pd
df = pd.read_csv('./data/Factory.csv')

from sklearn.model_selection import train_test_split
X = df.drop(["Quality","ID"],axis=1)
y = df["Quality"].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=100,stratify=y)

## Model interpretation

In [38]:
from azureml.explain.model.tabular_explainer import TabularExplainer
classes = ["false","true"]
tabular_explainer = TabularExplainer(fitted_model, X_train, features=X_train.columns, classes=classes)

In [39]:
global_explanation = tabular_explainer.explain_global(X_test[:100])

In [40]:
from azureml.contrib.explain.model.visualize import ExplanationDashboard
ExplanationDashboard(global_explanation, fitted_model, X_test[:100])

ExplanationWidget(value={'predictedY': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1…

<azureml.contrib.explain.model.visualize.ExplanationDashboard.ExplanationDashboard at 0x125499cf8>