Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Hyper-parameter tuning using Azure ML service

The hyper-parameters of the methods presented in this tutorial are tuned using Hyperdrive, a feature of the Azure Machine Learning (Azure ML) service. Before running this notebook, please follow the instructions in the [README](./README.md) file to set up the environment and in the [configuration notebook](./configuration.ipynb) to provision an Azure ML workspace.

The running time depends on the size of your Azure ML cluster and the method being tuned.

This notebook is used to tune several approaches:
- Feed-forward network multi-step multivariate - [ff_multistep_config.json](ff_multistep_config.json)
- RNN multi-step - [rnn_multistep_config.json](rnn_multistep_config.json)
- RNN teacher forcing - [rnn_teacher_forcing_config.json](rnn_teacher_forcing_config.json)
- RNN encoder decoder - [rnn_encoder_decoder_config.json](rnn_encoder_decoder_config.json)
- CNN - [cnn_config.json](cnn_config.json)

Each of these approaches is defined in the JSON configuration file linked next to it above. To run a specific approach, specify the appropriate configuration file in the next section.

## Specify configuration file

In [1]:
config_file = "cnn_config.json"

## Import required libraries

In [2]:
import json
import azureml
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.dnn import TensorFlow
from azureml.train.hyperdrive import (
    RandomParameterSampling,
    BayesianParameterSampling,
    HyperDriveConfig,
    PrimaryMetricGoal,
    choice,
    BanditPolicy,
)


## Import configuration file

The configuration file specified above is loaded here. It is a JSON file containing the following parameters, which are used throughout the notebook:
- EXPERIMENT_NAME
- CLUSTER_NAME
- SCRIPT_DIR
- SCRIPT_NAME
- SCRIPT_PARAMS
- HYPER_PARAMS
- MODEL_NAME
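
A quick sanity check before loading can catch a misspelled or missing key early. The sketch below is illustrative only: the helper name `missing_config_keys` is hypothetical and not part of this notebook or the Azure ML SDK, and the key list is copied from the bullets above.

```python
import json

# Keys listed above; every configuration file is expected to define all of them.
REQUIRED_KEYS = [
    "EXPERIMENT_NAME",
    "CLUSTER_NAME",
    "SCRIPT_DIR",
    "SCRIPT_NAME",
    "SCRIPT_PARAMS",
    "HYPER_PARAMS",
    "MODEL_NAME",
]


def missing_config_keys(config):
    """Return the required keys that are absent from a parsed configuration dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Example with an in-memory JSON string standing in for a configuration file.
sample = json.loads('{"EXPERIMENT_NAME": "cnn", "CLUSTER_NAME": "gpucluster"}')
print(missing_config_keys(sample))
```

An empty list means the configuration defines every parameter the notebook reads.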

In [3]:
try:
    with open(config_file) as f:
        params = json.load(f)
except IOError:
    # Nothing below can run without the configuration, so re-raise after the hint.
    print("Config file not found.")
    raise

In [4]:
print("Specified notebook parameters:\n\n", params)

Specified notebook parameters:

 {'EXPERIMENT_NAME': 'cnn', 'CLUSTER_NAME': 'gpucluster', 'SCRIPT_DIR': './CNN', 'SCRIPT_NAME': 'CNN.py', 'SCRIPT_PARAMS': {'--latent-dim-1': 5, '--latent-dim-2': 5, '--kernel-size': 3, '--batch-size': 32, '--T': 72, '--learning-rate': 0.01, '--alpha': 0}, 'HYPER_PARAMS': {'--latent-dim-1': [5, 10, 15], '--latent-dim-2': [0, 5, 10], '--kernel-size': 3, '--batch-size': [16, 32], '--T': [72, 168, 336], '--learning-rate': [0.01, 0.001, 0.0001], '--alpha': [0.1, 0.001, 0]}, 'MODEL_NAME': 'cnn-best'}


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the configuration notebook. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [5]:
# check core SDK version number
print("You are using Azure ML SDK Version: ", azureml.core.VERSION)

You are using Azure ML SDK Version:  1.0.76


In [6]:
try:
    ws = Workspace.from_config()
    print(
        "Workspace name: " + ws.name,
        "Azure region: " + ws.location,
        "Subscription id: " + ws.subscription_id,
        "Resource group: " + ws.resource_group,
        sep="\n",
    )
except Exception:
    print(
        "Workspace not accessible. Change your parameters or create a new workspace using configuration.ipynb notebook."
    )

Workspace name: AML2
Azure region: southcentralus
Subscription id: 6fa1b60b-c4be-4966-a446-261a3ad62d42
Resource group: AML2


## Create an Azure ML experiment
Let's create an Azure ML experiment. All executed runs will be recorded under this experiment in Azure.

In [7]:
exp = Experiment(workspace=ws, name=params["EXPERIMENT_NAME"])

## Upload data and scripts to default datastore 
A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored and then made accessible to a Run, either by mounting or by copying the data to the compute target. A datastore can be backed by either Azure Blob Storage or an Azure File Share (ADLS will be supported in the future). For simple data handling, each workspace provides a default datastore that can be used in case the data is not already in Blob Storage or a File Share.

In [8]:
ds = ws.get_default_datastore()

In this next step, we upload the training and test sets to the workspace's default datastore, which we will later mount on an AmlCompute cluster for training.

In [9]:
ds.upload_files(
    files=["../data/GEFCom2014.zip"],
    target_path="energy",
    overwrite=True,
    show_progress=True,
)
ds.upload_files(
    files=["../common/extract_data.py", "../common/utils.py"],
    target_path="common",
    overwrite=True,
    show_progress=True,
)

Uploading an estimated of 1 files
Uploading ../data/GEFCom2014.zip
Uploaded ../data/GEFCom2014.zip, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 2 files
Uploading ../common/extract_data.py
Uploading ../common/utils.py
Uploaded ../common/extract_data.py, 1 files out of an estimated total of 2
Uploaded ../common/utils.py, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_217c331b42f746e786db0ad7ae35d8b3

## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.

If no cluster with the given name is found, we create a new one here: an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and provides only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell.

In [10]:
# choose a name for your cluster
cluster_name = params["CLUSTER_NAME"]

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", min_nodes=0, max_nodes=4
    )
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20
    )

# use get_status() to get a detailed status for the current cluster.
print(compute_target.get_status().serialize())

Found existing compute target
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-12-11T03:25:24.904000+00:00', 'errors': None, 'creationTime': '2019-12-11T02:13:19.119388+00:00', 'modifiedTime': '2019-12-11T02:13:39.916420+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


### Azure ML concepts  
Please note the following three things about the training script:
1. The script accepts arguments using the argparse package. For example, the `--datadir` argument specifies the file system folder in which the script can find the energy data
```
    parser = argparse.ArgumentParser()
    parser.add_argument('--datadir')
```
2. The script accesses the Azure ML `Run` object by executing `run = Run.get_context()`. Further down, the script uses `run` to report the loss at the end of each epoch via a callback.
```
    run.log('Loss', log['loss'])
```
3. When running the script on Azure ML, you can write files out to a folder `./outputs` that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record.
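
The three points above can be sketched as a stripped-down set of helpers. This is not the actual training script; the function names, the `final_loss.txt` file name, and the `/mnt/energy` path are made up for illustration. The `azureml` import is guarded so that the sketch also runs where the SDK is not installed.

```python
import argparse
import os


def parse_args(argv=None):
    """Point 1: read the data folder from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--datadir")
    return parser.parse_args(argv)


def get_run():
    """Point 2: obtain the Azure ML Run object, or None if the SDK is missing."""
    try:
        from azureml.core import Run
    except ImportError:
        return None
    return Run.get_context()


def write_artifact(text, out_dir="outputs", name="final_loss.txt"):
    """Point 3: files written under ./outputs are picked up by Run History."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        f.write(text)
    return path


# Parse an explicit argument list, as the remote job would receive it.
args = parse_args(["--datadir", "/mnt/energy"])
print(args.datadir)
```

In a real script, the object returned by `get_run()` would be used for calls such as `run.log('Loss', value)`, and `write_artifact` would be called with its default `outputs` folder.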

## Create TensorFlow estimator and add Keras
Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the `gpucluster` as compute target, and pass the mount-point of the datastore to the training code as a parameter.
The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target. It automatically provides a docker image that has TensorFlow installed. In this case, we add the `keras` package (for the Keras framework) and additional packages required for running the training script on the compute target.

We also specify `entry_script` as one of the parameters. This is the script that is going to be executed on the compute target. Please examine the script for better understanding of the training process, and the metrics that are being logged. 

In [11]:
script_params = params["SCRIPT_PARAMS"]
script_params.update(
    {
        "--datadir": ds.path("energy").as_mount(),
        "--scriptdir": ds.path("common").as_mount(),
    }
)

est = TensorFlow(
    source_directory=params["SCRIPT_DIR"],
    script_params=script_params,
    compute_target=compute_target,
    conda_packages=["pandas", "numpy"],
    pip_packages=["keras", "matplotlib", "scikit-learn", "xlrd", "azureml-sdk"],
    entry_script=params["SCRIPT_NAME"],
    use_gpu=True,
)



## Submit job to run
Submit the estimator to the Azure ML experiment to kick off the execution.

In [12]:
run = exp.submit(est)

In [13]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'NOTSET',…

In [14]:
run.get_metrics()

{'Loss': [0.013505496894741916,
  0.005491749510720438,
  0.005226260294792852,
  0.005162392597918985,
  0.005060522292578749,
  0.0051179951146085945,
  0.005134332971165339,
  0.005085030978962463,
  0.005099362424143114,
  0.005027222022359642,
  0.005071478483926507,
  0.005055981670719877,
  0.005059191733506301,
  0.005070991393479208,
  0.005058864032981121],
 'validationLoss': 0.0025542569243245656,
 'testMAPE': 0.05312445242839181}

## Hyperparameter tuning
We have trained the model with one set of hyperparameters. Now let's see how we can do hyperparameter tuning by launching multiple runs on the cluster. First, let's define the parameter space using random sampling.
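
As a rough intuition for random sampling over discrete choices, the sketch below draws configurations from the `HYPER_PARAMS` values printed earlier, using only the standard library. It is not how Hyperdrive is implemented; `as_options`, `grid_size`, and `sample_config` are hypothetical helpers, and treating a scalar entry such as `--kernel-size` as a single-option choice is an assumption.

```python
import random

# Search space copied from the HYPER_PARAMS output shown earlier in this notebook.
space = {
    "--latent-dim-1": [5, 10, 15],
    "--latent-dim-2": [0, 5, 10],
    "--kernel-size": 3,
    "--batch-size": [16, 32],
    "--T": [72, 168, 336],
    "--learning-rate": [0.01, 0.001, 0.0001],
    "--alpha": [0.1, 0.001, 0],
}


def as_options(value):
    """Wrap a scalar in a list so every entry can be sampled the same way."""
    return value if isinstance(value, list) else [value]


def grid_size(space):
    """Number of distinct configurations in the full grid."""
    size = 1
    for value in space.values():
        size *= len(as_options(value))
    return size


def sample_config(space, rng):
    """Draw one configuration by picking each value independently and uniformly."""
    return {key: rng.choice(as_options(value)) for key, value in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(grid_size(space))  # 486
print(sample_config(space, rng))
```

The full grid has 486 combinations, so random sampling with a limited run budget explores a subset of the space rather than every combination.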

In [15]:
hparams = dict()

for key, value in params["HYPER_PARAMS"].items():
    hparams[key] = choice(value)

ps = RandomParameterSampling(hparams)


Next, we will create a new estimator without the above parameters, since they will be passed in later by the Hyperdrive configuration. Note that we still need to keep the `--datadir` and `--scriptdir` parameters, since they are not hyperparameters we will sweep.

In [16]:
est = TensorFlow(
    source_directory=params["SCRIPT_DIR"],
    script_params={
        "--datadir": ds.path("energy").as_mount(),
        "--scriptdir": ds.path("common").as_mount(),
    },
    compute_target=compute_target,
    conda_packages=["pandas", "numpy"],
    pip_packages=["keras", "matplotlib", "scikit-learn", "xlrd", "azureml-sdk"],
    entry_script=params["SCRIPT_NAME"],
    use_gpu=True,
)



Now we will define an early termination policy. The `BanditPolicy` is evaluated at every interval in which metrics are reported (`evaluation_interval=1`), starting at interval 5 (`delay_evaluation=5`). With `slack_factor=0.1`, any run whose best metric falls outside the allowed slack of the best performing run is terminated; for a metric being maximized, that means less than 1/(1+0.1), or about 91%, of the best run's value. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.
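
The slack-factor rule can be sketched in plain Python. The maximize branch follows the 1/(1+0.1) rule described above; the mirrored rule for a minimized metric and the helper name `should_terminate` are assumptions made for illustration, not the service's actual implementation.

```python
def should_terminate(run_best, overall_best, slack_factor, maximize=True):
    """Return True if a run falls outside the allowed slack of the best run.

    maximize=True: terminate when the run's best metric is below
    overall_best / (1 + slack_factor).
    maximize=False is an ASSUMED mirror image for metrics such as a loss:
    terminate when the run's best metric exceeds
    overall_best * (1 + slack_factor).
    """
    if maximize:
        return run_best < overall_best / (1 + slack_factor)
    return run_best > overall_best * (1 + slack_factor)


# With slack_factor=0.1 the cut-off is 1 / 1.1, roughly 91% of the best value.
print(should_terminate(0.90, 1.0, 0.1))  # True: 0.90 is below ~0.909
print(should_terminate(0.95, 1.0, 0.1))  # False: 0.95 is within the slack
```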

In [17]:
early_termination_policy = BanditPolicy(
    slack_factor=0.1, evaluation_interval=1, delay_evaluation=5
)

Now we are ready to configure a `HyperDriveConfig` object and specify the primary metric `Loss` that is recorded in the training runs. We also tell the service that we want to minimize this value. Finally, we set the total number of runs, the maximum duration, and the maximum number of concurrent runs, which should not exceed the number of nodes in our compute cluster.
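
A back-of-the-envelope calculation can help when choosing these limits. The sketch below assumes the values used in this notebook (300 total runs, 4 concurrent runs, a 180-minute limit) and, as a simplification, that runs execute in full waves of equal length; `average_minutes_per_run` is a hypothetical helper.

```python
import math


def average_minutes_per_run(max_total_runs, max_concurrent_runs, max_duration_minutes):
    """Average time each run may take if all runs are to fit in the time limit.

    Simplifying assumption: runs execute in full waves of max_concurrent_runs,
    and every run in a wave takes the same time.
    """
    waves = math.ceil(max_total_runs / max_concurrent_runs)
    return max_duration_minutes / waves


print(average_minutes_per_run(300, 4, 180))  # 2.4
```

Under this simplified model, runs that average more than about 2.4 minutes would exhaust the time limit before all 300 runs complete; early termination shortens some runs, so the real budget is somewhat looser.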

In [18]:
hdc = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=ps,
    policy=early_termination_policy,
    primary_metric_name="Loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=300,
    max_concurrent_runs=4,
    max_duration_minutes=180,
)

Finally, let's launch the hyperparameter tuning job.

In [19]:
hdr = exp.submit(config=hdc)

We can use a run history widget to show the progress. Be patient as this might take a while to complete.

In [20]:
RunDetails(hdr).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'NOTSE…

## Find and register the best model
When all the jobs finish, we can find the one with the best value of the primary metric (`Loss`).

In [21]:
best_run = hdr.get_best_run_by_primary_metric()
print(best_run.get_details()["runDefinition"]["arguments"])

['--datadir', '$AZUREML_DATAREFERENCE_bc05cd0424b44823ae2861964d34c82e', '--scriptdir', '$AZUREML_DATAREFERENCE_8f1bb2deb4dd4fe9a157ea3319691d2b', '--T', '336', '--alpha', '0', '--batch-size', '32', '--kernel-size', '3', '--latent-dim-1', '15', '--latent-dim-2', '0', '--learning-rate', '0.0001']


In [22]:
best_run.get_metrics()

{'Loss': [0.00828489969507693,
  0.004136542220777458,
  0.0034588238296976675,
  0.0031144267807767605,
  0.0028496974581269006,
  0.0026915336654038365,
  0.002564426893895588,
  0.0024713229856310154,
  0.002391322827202215,
  0.002278872424158897,
  0.002219806186929776,
  0.002150637478210458,
  0.0020954079746401447,
  0.002052621989421208,
  0.001995195306968414,
  0.001964645971722809,
  0.0019075027098691877,
  0.0018852346998479538,
  0.0018389220900815843,
  0.0018096223448300364,
  0.0017915446650011754,
  0.0017693798549019055,
  0.0017443429152237675,
  0.0017261701835313817,
  0.0017015754092010817,
  0.0016880552035095416,
  0.0016621246432036668,
  0.0016454236919215969,
  0.001628912583543553,
  0.0016216333113423323,
  0.00160095536015865,
  0.001578251617845031,
  0.0015633399822259724,
  0.0015564324238287039,
  0.0015432167530753582,
  0.0015338566808518227,
  0.0015159097882403644,
  0.0015120560548694288,
  0.0015056484845395416,
  0.0014921563051374907],
 'vali

Now let's list the model files uploaded during the run.

In [23]:
print(best_run.get_file_names())

['azureml-logs/55_azureml-execution-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/65_job_prep-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json']


We can then register the folder (and all files in it) as a model under the workspace model registry for future deployment.
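
Registration expects `model_path` to match a file or folder among the files uploaded to the run, so it can be worth checking this before calling `register_model`. The helper below is a hypothetical sketch, not part of the Azure ML SDK; it only inspects a list of names such as the one returned by `get_file_names()`, and the file names in the example are shortened or invented for illustration.

```python
def has_model_path(file_names, model_path):
    """Return True if model_path is an uploaded file or a folder containing one."""
    prefix = model_path.rstrip("/") + "/"
    return any(name == model_path or name.startswith(prefix) for name in file_names)


# Illustrative names only; "saved_model.h5" is a made-up file name.
uploaded = ["azureml-logs/70_driver_log.txt", "azureml-logs/process_info.json"]
print(has_model_path(uploaded, "outputs/model"))  # False
print(has_model_path(uploaded + ["outputs/model/saved_model.h5"], "outputs/model"))  # True
```

When the check returns `False`, the training script most likely did not write anything under `./outputs/model`, and the registration call can be expected to fail.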

In [24]:
model = best_run.register_model(model_name=params["MODEL_NAME"], model_path='outputs/model')

ModelPathNotFoundException: ModelPathNotFoundException:
	Message: Could not locate the provided model_path outputs/model in the set of files uploaded to the run: ['azureml-logs/55_azureml-execution-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/65_job_prep-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json']
                See https://aka.ms/run-logging for more details.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Could not locate the provided model_path outputs/model in the set of files uploaded to the run: ['azureml-logs/55_azureml-execution-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/65_job_prep-tvmps_f73dd0bad9aec836a2a78ef5afa21354100de2fea34deac509033bbc16248f65_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json']\n                See https://aka.ms/run-logging for more details."
    }
}