# Hyperparameter Tuning of Dilated CNN Models with AML SDK and HyperDrive

This notebook tunes the hyperparameters of Dilated CNN models with the Azure Machine Learning (AML) SDK and HyperDrive. It selects the best model by cross validation on the training data of the first forecast round. Specifically, it splits the training data into sub-training data and validation data, trains Dilated CNN models with different sets of hyperparameters on the sub-training data, and evaluates the accuracy of each model on the validation data. The set of hyperparameters that yields the best validation accuracy (the lowest MAPE) is then used to train models and forecast sales across all 12 forecast rounds.
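
The split into sub-training and validation data can be sketched as follows. This is a minimal illustration, assuming the training data is ordered by week and that the most recent weeks are held out for validation. The function name `split_train_validation` and the week numbers are hypothetical and are not taken from the training script used in this notebook.

```python
def split_train_validation(weeks, n_valid_weeks):
    """Split an ordered list of week indices into sub-training and validation parts.

    The last `n_valid_weeks` entries are held out, so the validation period
    always follows the sub-training period in time.
    """
    if not 0 < n_valid_weeks < len(weeks):
        raise ValueError("n_valid_weeks must be between 1 and len(weeks) - 1")
    return weeks[:-n_valid_weeks], weeks[-n_valid_weeks:]


# Hypothetical example: 10 weeks of training data, with the last 2 held out
train_weeks = list(range(40, 50))
sub_train, valid = split_train_validation(train_weeks, n_valid_weeks=2)
print("Sub-training weeks:", sub_train)
print("Validation weeks:", valid)
```

Holding out the latest weeks, rather than a random subset, keeps the validation setup close to the forecasting task, where the model is always evaluated on data that comes after its training period.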

## Prerequisites
To run this notebook, install the AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running them, activate your environment with `activate <your env>` on Windows or `source activate <your env>` on a Linux VM.  
`pip3 install --upgrade azureml-sdk[notebooks,automl]`  
`jupyter nbextension install --py --user azureml.train.widgets`  
`jupyter nbextension enable --py --user azureml.train.widgets`  

You also need to create an Azure ML workspace and its configuration file (`config.json`) by following the [00.configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/00.configuration.ipynb) notebook.
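
Before going further, it can help to confirm that the packages installed above are importable from the active environment. The helper below is a small sketch that uses only the standard library; the name `is_installed` is made up for this example, and the module names checked are the ones imported later in this notebook.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `azureml`) is itself missing
        return False


for name in ("azureml.core", "azureml.train.widgets"):
    print(name, "is installed" if is_installed(name) else "is MISSING")
```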

In [None]:
# %matplotlib inline
# import os
# import numpy as np
# import matplotlib
# import matplotlib.pyplot as plt

In [1]:
import azureml
from azureml.core import Workspace, Run

# Check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  0.1.74


In [None]:
from azureml.telemetry import set_diagnostics_collection

# Opt-in diagnostics for better experience of future releases
set_diagnostics_collection(send_diagnostics=True)

## Initialize Workspace & Create an Azure ML Experiment

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json`.
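
For orientation, `config.json` is a small JSON file that identifies the workspace. The sketch below shows the layout it usually has (three keys: `subscription_id`, `resource_group`, `workspace_name`). The values are placeholders, and the layout is stated here as an assumption; the file generated by the configuration notebook is the authoritative version.

```python
import json

# Placeholder values; replace them with your own workspace details
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

config_text = json.dumps(example_config, indent=4)
print(config_text)

# Check that the three expected keys are present after a round trip
loaded = json.loads(config_text)
required_keys = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in required_keys if key not in loaded]
print("Missing keys:", missing)
```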

In [2]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Found the config file in: /home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/config.json
Workspace name: chhws
Azure region: southcentralus
Subscription id: ff18d7a8-962a-406c-858f-49acd23d6c01
Resource group: tsperf


In [3]:
from azureml.core import Experiment

exp = Experiment(workspace=ws, name='tune_dcnn')

## Validate Script Locally

In [4]:
from azureml.core.runconfig import RunConfiguration

# Configure local, user managed environment
run_config_user_managed = RunConfiguration()
run_config_user_managed.environment.python.user_managed_dependencies = True
run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'

In [5]:
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory='./', 
                      script='train_validate.py', 
                      arguments=['--data-folder', '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', '--dropout-rate', '0.3'],
                      run_config=run_config_user_managed)
run_local = exp.submit(src)
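
The `arguments` list above is passed to `train_validate.py` on the command line. That script is not shown in this notebook, so the following is only a sketch of how such a script might parse its arguments with `argparse`. The flag names are the ones used in this notebook; the default values and the `parse_args` helper are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used in this notebook (defaults are made up)."""
    parser = argparse.ArgumentParser(description="Train and validate a Dilated CNN model")
    parser.add_argument("--data-folder", dest="data_folder", type=str, required=True)
    parser.add_argument("--dropout-rate", dest="dropout_rate", type=float, default=0.2)
    parser.add_argument("--learning-rate", dest="learning_rate", type=float, default=0.01)
    parser.add_argument("--seq-len", dest="seq_len", type=int, default=12)
    parser.add_argument("--batch-size", dest="batch_size", type=int, default=32)
    parser.add_argument("--epochs", dest="epochs", type=int, default=5)
    return parser.parse_args(argv)


# Same flags as the local run above, with a placeholder data path
args = parse_args(["--data-folder", "/tmp/data", "--dropout-rate", "0.3"])
print(args.data_folder, args.dropout_rate, args.learning_rate)
```

Because every flag other than `--data-folder` has a default, the same script can be run with a fixed configuration or with any subset of flags supplied by a tuning service.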

In [6]:
# Check job status
run_local.get_status()

'Running'

In [7]:
# Check results
run_local.get_details()
run_local.get_metrics()

{'MAPE': 53.67131952139047}
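
The metric reported above is the mean absolute percentage error (MAPE) on the validation data. As a reference, a plain-Python version of the standard definition is sketched below. It assumes all actual values are nonzero, and it is not necessarily identical to the computation inside `train_validate.py`, which is not shown here.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, expressed in percent."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    if not actuals:
        raise ValueError("at least one observation is required")
    # Relative absolute error of each prediction (actuals must be nonzero)
    errors = [abs((a - p) / a) for a, p in zip(actuals, predictions)]
    return 100.0 * sum(errors) / len(errors)


# Toy example: both predictions are off by 10% of the actual value
example_mape = mape([100.0, 200.0], [110.0, 180.0])
print(example_mape)
```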

## Run Script on Batch AI

### Create Batch AI cluster as compute target

In [8]:
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your cluster
cluster_name = "gpucluster"

try:
    # Look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    if isinstance(compute_target, BatchAiCompute):
        print('Found existing compute target {}.'.format(cluster_name))
    else:
        print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = BatchAiCompute.provisioning_configuration(vm_size="STANDARD_NC6", # GPU-based VM
                                                                #vm_priority='lowpriority', # optional
                                                                autoscale_enabled=True,
                                                                cluster_min_nodes=0, 
                                                                cluster_max_nodes=4)

    # Create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

Found existing compute target gpucluster.


In [9]:
# If you have created the compute target, you should see one entry named 'gpucluster' of type BatchAI 
# in the workspace's compute_targets property.
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

gpucluster BatchAI Succeeded


### Configure Docker environment

In [10]:
from azureml.core.runconfig import EnvironmentDefinition
from azureml.core.conda_dependencies import CondaDependencies

env = EnvironmentDefinition()

env.python.user_managed_dependencies = False
env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'tensorflow-gpu', 'keras', 'joblib'],
                                                         python_version='3.6.2')
env.python.conda_dependencies.add_channel('conda-forge')
env.docker.enabled = True

### Upload data to default datastore

Upload the training and test sets of the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the Batch AI cluster for training.

In [11]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

AzureFile chhws9475974549 azureml-filestore-87c07697-2faa-418a-a879-bbaddc5c7cac


In [12]:
path_on_datastore = 'data'
ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)

$AZUREML_DATAREFERENCE_752662519aaf4f4894f19ee0e72cebcf

In [17]:
# Get data reference object for the data path
ds_data = ds.path(path_on_datastore)
print(ds_data)

$AZUREML_DATAREFERENCE_912122ef541340f2962f3b1f82e1cb3d


### Create estimator

In [20]:
from azureml.core.runconfig import EnvironmentDefinition
from azureml.train.estimator import Estimator

script_folder = './'

script_params = {
    '--data-folder': ds_data.as_mount(),
    '--dropout-rate': 0.3,
    '--learning-rate': 0.01
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                use_docker=True,
                entry_script='train_validate.py',
                environment_definition=env)

### Submit job

In [21]:
# Submit job to Batch AI cluster
run_batchai = exp.submit(config=est)

### Check job status

In [22]:
from azureml.train.widgets import RunDetails

RunDetails(run_batchai).show()

_UserRun(widget_settings={'childWidgetDisplay': 'popup'})

In [23]:
run_batchai.get_details()

{'endTimeUtc': '2018-11-29T19:08:42.856806Z',
 'logFiles': {'azureml-logs/55_batchai_execution.txt': 'https://chhws9475974549.blob.core.windows.net/azureml/ExperimentRun/tune_dcnn_1543516606666/azureml-logs/55_batchai_execution.txt?sv=2017-04-17&sr=b&sig=%2FHDbsKX1j3jPtIqbYv1WW%2Fsl8vA%2FSAYc4eU9cj4OEPw%3D&st=2018-11-29T19%3A01%3A09Z&se=2018-11-30T03%3A11%3A09Z&sp=r',
  'azureml-logs/56_batchai_stderr.txt': 'https://chhws9475974549.blob.core.windows.net/azureml/ExperimentRun/tune_dcnn_1543516606666/azureml-logs/56_batchai_stderr.txt?sv=2017-04-17&sr=b&sig=2gJ0v%2Ffs7CXZH7rP7MyrJo8hd2JjS1N%2B5tqoCB3oqK0%3D&st=2018-11-29T19%3A01%3A09Z&se=2018-11-30T03%3A11%3A09Z&sp=r',
  'azureml-logs/60_control_log.txt': 'https://chhws9475974549.blob.core.windows.net/azureml/ExperimentRun/tune_dcnn_1543516606666/azureml-logs/60_control_log.txt?sv=2017-04-17&sr=b&sig=gIT07ajS0f%2Furau5Wk8vh9tA%2FPL%2B9nMBw6EHyds8EB0%3D&st=2018-11-29T19%3A01%3A09Z&se=2018-11-30T03%3A11%3A09Z&sp=r',
  'azureml-logs/80_driv

In [24]:
run_batchai.get_metrics()

{'MAPE': 51.45909094199156}

## Tune Hyperparameters using HyperDrive

In [26]:
from azureml.train.hyperdrive import (HyperDriveRunConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice)

script_folder = './'
# Only the data folder is fixed; the remaining arguments are sampled by HyperDrive
script_params = {
    '--data-folder': ds_data.as_mount()
}
est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                use_docker=True,
                entry_script='train_validate.py',
                environment_definition=env)

# Randomly sample hyperparameter combinations from the discrete choices below
ps = RandomParameterSampling({
    '--seq-len': choice(6, 8, 10, 12, 14, 16, 18, 20),
    '--batch-size': choice(16, 32, 64),
    '--learning-rate': choice(0.01, 0.015, 0.02, 0.025),
    '--epochs': choice(3, 4, 5, 6, 8)
})

# Minimize the validation MAPE over 20 runs, with at most 4 running at a time
htc = HyperDriveRunConfig(estimator=est,
                          hyperparameter_sampling=ps,
                          primary_metric_name='MAPE',
                          primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
                          max_total_runs=20,
                          max_concurrent_runs=4)
htr = exp.submit(config=htc)
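
To give a feel for what random sampling over these discrete choices means, the sketch below reproduces the idea locally with the standard `random` module. It is an illustration only and does not use or imitate HyperDrive's actual sampler; the helper name `sample_configuration` and the fixed seed are choices made for this example.

```python
import random

# Same discrete choices as in the HyperDrive configuration above
search_space = {
    "--seq-len": [6, 8, 10, 12, 14, 16, 18, 20],
    "--batch-size": [16, 32, 64],
    "--learning-rate": [0.01, 0.015, 0.02, 0.025],
    "--epochs": [3, 4, 5, 6, 8],
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


# Size of the full grid, for comparison with the number of sampled runs
n_combinations = 1
for values in search_space.values():
    n_combinations *= len(values)

rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = [sample_configuration(search_space, rng) for _ in range(20)]

print("Full grid size:", n_combinations)
print("Sampled runs:", len(samples))
print("First sample:", samples[0])
```

The full grid has 8 x 3 x 4 x 5 = 480 combinations, so the 20 runs configured above cover only a small fraction of it. That is the usual trade-off of random search: far less compute than an exhaustive grid, at the cost of possibly missing the best combination.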

In [27]:
RunDetails(htr).show()

_HyperDrive(widget_settings={'childWidgetDisplay': 'popup'})

In [28]:
htr.get_metrics()

{'tune_dcnn_1543519296309_0': {'MAPE': [52.406915348521984]},
 'tune_dcnn_1543519296309_1': {'MAPE': [48.22620680559373]},
 'tune_dcnn_1543519296309_10': {'MAPE': [45.50651538777104]},
 'tune_dcnn_1543519296309_11': {'MAPE': [53.624342050763744]},
 'tune_dcnn_1543519296309_12': {'MAPE': [50.111951512463456]},
 'tune_dcnn_1543519296309_13': {'MAPE': [47.39905703800912]},
 'tune_dcnn_1543519296309_14': {'MAPE': [45.82307980941858]},
 'tune_dcnn_1543519296309_15': {'MAPE': [55.16572770960648]},
 'tune_dcnn_1543519296309_16': {'MAPE': [52.154747659311084]},
 'tune_dcnn_1543519296309_17': {'MAPE': [47.836595551418156]},
 'tune_dcnn_1543519296309_18': {'MAPE': [47.548769835232136]},
 'tune_dcnn_1543519296309_19': {'MAPE': [51.587706457682664]},
 'tune_dcnn_1543519296309_2': {'MAPE': [47.51181064242932]},
 'tune_dcnn_1543519296309_3': {'MAPE': [55.49454015575376]},
 'tune_dcnn_1543519296309_4': {'MAPE': [49.35784863332553]},
 'tune_dcnn_1543519296309_5': {'MAPE': [45.99110019145163]},
 'tune_
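
The dictionary returned by `get_metrics()` maps each child run ID to its logged metrics, so runs can also be ranked by hand. The sketch below does this for a handful of entries copied from the output above. Because that output is truncated, the lowest value among these entries is not necessarily the overall best run.

```python
# A few entries copied from the (truncated) output above
metrics = {
    "tune_dcnn_1543519296309_0": {"MAPE": [52.406915348521984]},
    "tune_dcnn_1543519296309_10": {"MAPE": [45.50651538777104]},
    "tune_dcnn_1543519296309_14": {"MAPE": [45.82307980941858]},
    "tune_dcnn_1543519296309_5": {"MAPE": [45.99110019145163]},
}

# Rank runs by their lowest logged MAPE (each run logs a list of values)
ranked = sorted(metrics.items(), key=lambda item: min(item[1]["MAPE"]))
for run_id, run_metrics in ranked:
    print("{}: {:.2f}".format(run_id, min(run_metrics["MAPE"])))

best_run_id = ranked[0][0]
print("Lowest MAPE among these entries:", best_run_id)
```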

In [29]:
best_run = htr.get_best_run_by_primary_metric()
parameter_values = best_run.get_details()['runDefinition']['Arguments']
print(parameter_values)

['--data-folder', '$AZUREML_DATAREFERENCE_b70efd6708e94a6d927566fa15263a0d', '--batch-size', '16', '--epochs', '8', '--learning-rate', '0.01', '--seq-len', '16']
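
The argument list printed above alternates flag names and values. To reuse the best hyperparameters in later training runs, it may be convenient to turn the list into a dictionary with typed values. The following sketch does that in plain Python; the helper name `arguments_to_dict` and the type mapping are assumptions made for this example.

```python
# Argument list copied from the output above
arguments = ['--data-folder', '$AZUREML_DATAREFERENCE_b70efd6708e94a6d927566fa15263a0d',
             '--batch-size', '16', '--epochs', '8',
             '--learning-rate', '0.01', '--seq-len', '16']


def arguments_to_dict(args):
    """Pair up a flat [flag, value, flag, value, ...] list into a dictionary."""
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of items (flag/value pairs)")
    return dict(zip(args[0::2], args[1::2]))


best_params = arguments_to_dict(arguments)
best_params.pop('--data-folder')  # the data reference is not a hyperparameter

# Convert the string values to numbers (assumed types for each flag)
types = {'--batch-size': int, '--epochs': int, '--learning-rate': float, '--seq-len': int}
best_params = {flag: types[flag](value) for flag, value in best_params.items()}
print(best_params)
```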
