# HyperDrive demo notebook

This notebook demonstrates HyperDrive, the AzureML tool for distributed hyperparameter tuning.

Hyperparameters control how a machine learning model is trained, and hyperparameter tuning is the problem of choosing the best set of hyperparameters for a learning algorithm in a specific application. This requires compiling and training a model repeatedly with different combinations of hyperparameters from a defined hyperparameter space, so the process can be very time-intensive. Running these training runs in parallel can significantly reduce the total wall-clock time.
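As a rough illustration of why parallelism helps, the sketch below evaluates every combination in a small search space concurrently using only the standard library. The function `toy_validation_loss` is a hypothetical stand-in for a full training run and is not part of this project; HyperDrive distributes real training jobs across cluster nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_validation_loss(batch_size, learning_rate):
    """Hypothetical stand-in for one training run; lower is better."""
    return (learning_rate - 0.001) ** 2 * 1e4 + abs(batch_size - 64) / 640


search_space = {
    "batch_size": [32, 64, 128],
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
}

# Every combination of values is one independent trial.
trials = [dict(zip(search_space, values)) for values in product(*search_space.values())]

# Independent trials can be evaluated concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(lambda trial: toy_validation_loss(**trial), trials))

best_loss, best_trial = min(zip(losses, trials), key=lambda pair: pair[0])
print(len(trials), best_trial)
```

With a real model, each call would take minutes or hours, which is where spreading trials over several workers pays off.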

The environment that this notebook works best with is: <b>Python 3.8 - AzureML </b>

In [1]:
%matplotlib inline

In [39]:
import tempfile

In [2]:
import pandas as pd
import numpy as np
import pathlib
import matplotlib.pyplot as plt

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
import prd_pipeline

## Set up Azure experiment


In [5]:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset, Environment
from azureml.core import Experiment, ComputeTarget, ScriptRunConfig

In [6]:
prd_ws = Workspace.from_config()

In [7]:
azure_experiment_name='prd_fraction_models'
azure_env_name = 'prd_ml_cluster'
cluster_name = 'prd-ml-fractions-cluster'

In [8]:
prd_model_name = 'azml_fractions_cluster_hpt'

In [9]:
target_parameter = [
    'radar_fraction_in_band_instant_0.0',
    'radar_fraction_in_band_instant_0.25',
    'radar_fraction_in_band_instant_2.5',
    'radar_fraction_in_band_instant_7.0',
    'radar_fraction_in_band_instant_10.0'
]
# target_parameter = 'radar_mean_rain_instant'
profile_features = ['air_temperature', 'relative_humidity']
single_lvl_features = [] #'air_pressure_at_sea_level'

In [10]:
feature_dict = {
    'profile': profile_features,
    'single_level': single_lvl_features,
    'target': target_parameter,
} 

In [11]:
prd_exp = Experiment(workspace=prd_ws, name=azure_experiment_name)
prd_exp

Name,Workspace,Report Page,Docs Page
prd_fraction_models,precip_rediagnosis,Link to Azure Machine Learning studio,Link to Documentation


Get the AzML environment (basically a conda environment) from the workspace.

In [12]:
# test_run = azureml.core.get_run(prd_exp, 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa')

In [13]:
# test_run.get_best_run_by_primary_metric()

In [14]:
prd_env = Environment.get(workspace=prd_ws, name=azure_env_name)
prd_env

{
    "assetId": "azureml://locations/uksouth/workspaces/57546dc9-9763-4025-831d-c19991c81540/environments/prd_ml_cluster/versions/4",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04",
        "baseImageRegistry": {
            "address": "mcr.microsoft.com",
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {},
    "inferencingStackVersion": null,
    "name": "prd_ml_cluster",
    "python": {
        "baseCondaEnvironment": null

### Load data

Use prd_pipeline to load and preprocess data

In [15]:
load_all = False

In [16]:
if load_all:
    prd_azml_dataset_name = 'prd_merged_all_events_files'
else:
    prd_azml_dataset_name = 'prd_merged_202110_nswws_amber_oct_files'
prd_azml_dataset = azureml.core.Dataset.get_by_name(prd_ws, name=prd_azml_dataset_name)

In [17]:
prd_prefix = 'prd'
merged_prefix = prd_prefix + '_merged'
csv_file_suffix = 'csv'

In [18]:
import datetime
log_dir = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

### Execute our training run on a cluster with HyperDrive for parallelised hyperparameter tuning

In [19]:
from azureml.train.hyperdrive import RandomParameterSampling, GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

In [20]:
prd_demo_compute_target = ComputeTarget(workspace=prd_ws, name=cluster_name)
prd_demo_compute_target

AmlCompute(workspace=Workspace.create(name='precip_rediagnosis', subscription_id='07efdc52-cd27-48ed-9443-3aad2b6b777b', resource_group='precip_rediagnosis'), name=prd-ml-fractions-cluster, id=/subscriptions/07efdc52-cd27-48ed-9443-3aad2b6b777b/resourceGroups/precip_rediagnosis/providers/Microsoft.MachineLearningServices/workspaces/precip_rediagnosis/computes/prd-ml-fractions-cluster, type=AmlCompute, provisioning_state=Succeeded, location=uksouth, tags={})

Hyperparameters that we want to vary with HyperDrive must be input arguments of the prd_cluster_train_demo.py script, which is called through ScriptRunConfig. Any values set for these hyperparameters in prd_demo_args will be overwritten by HyperDrive.
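The training script itself is not shown in this notebook, so the following is only a minimal sketch of how such a script might expose its hyperparameters with `argparse`. The parser below is hypothetical: the flag names are copied from the argument list used in this notebook, but the types and defaults are assumptions. It also shows two standard `argparse` behaviours that matter here: a flag declared with `nargs="*"` accepts an empty list, and when a flag is repeated the last value wins.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a training script; types and defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Toy stand-in for a training script")
    # Each tunable hyperparameter needs its own flag so that a tuner can set it.
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=1)
    # List-valued flags; nargs="*" also accepts zero values.
    parser.add_argument("--target-parameter", nargs="+", default=[])
    parser.add_argument("--profile-features", nargs="*", default=[])
    parser.add_argument("--single-level_features", nargs="*", default=[])
    return parser


args = build_parser().parse_args([
    "--profile-features", "air_temperature", "relative_humidity",
    "--single-level_features",
    "--batch-size", "128",
    "--batch-size", "64",  # repeated flag: argparse keeps the last value
])
print(args.batch_size, args.profile_features, args.single_level_features)
```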

In [21]:
nepochs = 1

In [22]:
prd_demo_args = ['--dataset-name', prd_azml_dataset_name,
                 '--model-name', prd_model_name, 
                 '--test-fraction', 0.2,
                ]
prd_demo_args += ['--target-parameter']
prd_demo_args += target_parameter
prd_demo_args += ['--profile-features']
prd_demo_args += profile_features
prd_demo_args += ['--single-level_features']
prd_demo_args += single_lvl_features
prd_demo_args += ['--epochs', nepochs]
prd_demo_args += ['--batch-size', 128]
prd_demo_args += ['--learning-rate', 0.01]
prd_demo_args += ['--test-fraction', 0.2]
prd_demo_args += ['--log-dir', './logs']

prd_demo_args

['--dataset-name',
 'prd_merged_202110_nswws_amber_oct_files',
 '--model-name',
 'azml_fractions_cluster_hpt',
 '--test-fraction',
 0.2,
 '--target-parameter',
 'radar_fraction_in_band_instant_0.0',
 'radar_fraction_in_band_instant_0.25',
 'radar_fraction_in_band_instant_2.5',
 'radar_fraction_in_band_instant_7.0',
 'radar_fraction_in_band_instant_10.0',
 '--profile-features',
 'air_temperature',
 'relative_humidity',
 '--single-level_features',
 '--epochs',
 1,
 '--batch-size',
 128,
 '--learning-rate',
 0.01,
 '--test-fraction',
 0.2,
 '--log-dir',
 './logs']

In [23]:
import os  # needed for os.getcwd() below

prd_run_src = ScriptRunConfig(
    source_directory=os.getcwd(),
    script='prd_cluster_train_demo.py',
    arguments=prd_demo_args,
    compute_target=prd_demo_compute_target,
    environment=prd_env
)

### HyperDrive configuration

The next step is to configure our HyperDrive run. The run config defined above is passed to our HyperDriveConfig as the <code>run_config</code>. We must provide a <code>hyperparameter_sampling</code>, explained in more detail below. We must also provide the <code>primary_metric_name</code> (either the model's loss function or a metric defined when compiling the model) and the <code>primary_metric_goal</code>, which states whether to minimize or maximize this metric. Finally, we must define <code>max_total_runs</code> (an upper bound on the number of runs; the actual number may be smaller depending on the defined hyperparameter space and sampling strategy) and <code>max_concurrent_runs</code>. If <code>max_concurrent_runs</code> is set to None, all runs are launched in parallel.
There is also the option to define an early stopping policy with the <code>policy</code> argument.

#### Hyperparameter sampling

There are three different classes for sampling the hyperparameter space:
- <code>GridParameterSampling</code>: defines the search space as a grid built from the given hyperparameter values, then evaluates every position in the grid in order (note that if max_total_runs is smaller than the total number of combinations, only a subsample of the grid is run).
- <code>RandomParameterSampling</code>: randomly samples hyperparameter combinations from the hyperparameter space.
- <code>BayesianParameterSampling</code>: defines Bayesian sampling over the hyperparameter space, trying to pick the next sample of hyperparameters intelligently based on how the previous samples performed.
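The difference between the first two strategies can be sketched in plain Python. This is an illustration of the ideas only, not HyperDrive's implementation: `grid_samples` and `random_samples` are hypothetical helpers, and taking the first `max_total_runs` grid points is an assumption about how a truncated grid might be chosen.

```python
import itertools
import random

space = {
    "--batch-size": [32, 64, 128],
    "--learning-rate": [0.1, 0.01, 0.001, 0.0001],
    "--epochs": [10, 20, 50, 100],
}


def grid_samples(space, max_total_runs):
    """Walk the full grid in order, stopping once the run budget is used up."""
    combos = itertools.product(*space.values())
    return [dict(zip(space, combo)) for combo in itertools.islice(combos, max_total_runs)]


def random_samples(space, max_total_runs, seed=0):
    """Draw each hyperparameter independently and uniformly from its choices."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(max_total_runs)
    ]


print(len(grid_samples(space, 100)))  # the grid is exhausted before the budget
print(random_samples(space, 2))
```

Bayesian sampling is not sketched here because it needs a surrogate model of the metric, which goes beyond a few lines.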

We then define the hyperparameter space from which combinations of hyperparameters are selected for assessment. The hyperparameters that we want to tune must be input arguments to the script run by ScriptRunConfig, and any values already supplied for those arguments will be overwritten by HyperDrive.

In AzureML, there are different ways to define the set from which each hyperparameter is sampled (<code>choice</code>, <code>lognormal</code>, <code>loguniform</code>, <code>normal</code>, <code>qlognormal</code>, <code>qloguniform</code>, <code>qnormal</code>, <code>quniform</code>, <code>randint</code> and <code>uniform</code>).
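Of these, <code>loguniform</code> is the easiest to misread: its bounds are given in log space, and the value drawn is exp(uniform(min_value, max_value)). The plain-Python sketch below imitates that definition to show how a learning-rate range of 1e-4 to 1e-1 would be expressed. The helper `loguniform_sample` is a hypothetical name used for illustration and is not the AzureML function.

```python
import math
import random


def loguniform_sample(min_value, max_value, rng):
    """Draw exp(uniform(min_value, max_value)); the bounds are in log space."""
    return math.exp(rng.uniform(min_value, max_value))


rng = random.Random(42)

# To cover learning rates from 1e-4 to 1e-1, pass the logs of the bounds.
low, high = math.log(1e-4), math.log(1e-1)
samples = [loguniform_sample(low, high, rng) for _ in range(1000)]

# Roughly equal numbers of samples fall in each decade of the range.
per_decade = [sum(10.0 ** k <= s < 10.0 ** (k + 1) for s in samples) for k in (-4, -3, -2)]
print(per_decade)
```

Sampling uniformly in log space is a common choice for learning rates, because values such as 0.0001 and 0.001 are then as likely to be explored as 0.01 and 0.1.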

In [24]:
ps = RandomParameterSampling(
    {
        '--batch-size': choice(32, 64, 128),
        '--learning-rate': choice(0.1, 0.01, 0.001, 0.0001),
        '--epochs': choice(10, 20, 50, 100)
    }
)

#### Early stopping policy

We can define an early stopping policy, which means that poorly performing experiment runs are cancelled and new ones are started. Here we use BanditPolicy with a slack factor of 0.1 (the ratio of slack allowed with respect to the best performing training run) and an evaluation interval of 2 (the frequency for applying the policy, here every second time the primary metric is reported).

In [25]:
early_stop_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
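To make the slack factor concrete, the sketch below applies a bandit-style rule to a single run. The function `bandit_should_cancel` is hypothetical. The maximize branch follows the rule described in the AzureML documentation (a run is cancelled if its metric scaled by 1 + slack_factor is still below the best metric); the minimize branch is an assumption that simply mirrors it, and both assume positive metric values.

```python
def bandit_should_cancel(run_metric, best_metric, slack_factor, minimize=True):
    """Return True if a run falls outside the allowed slack around the best run.

    Assumes positive metrics. The minimize rule is an assumed mirror image of
    the documented maximize rule, not a confirmed implementation detail.
    """
    if minimize:
        return run_metric > best_metric * (1 + slack_factor)
    return run_metric < best_metric / (1 + slack_factor)


# Illustrative values only: best validation loss so far is 0.35, slack is 0.1,
# so the cut-off is 0.35 * 1.1 = 0.385.
print(bandit_should_cancel(0.37, 0.35, 0.1))  # inside the slack, keeps running
print(bandit_should_cancel(0.42, 0.35, 0.1))  # outside the slack, cancelled
```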

In [26]:
htc = HyperDriveConfig(run_config=prd_run_src, 
                       hyperparameter_sampling=ps, 
                       policy=early_stop_policy, 
                       primary_metric_name='val_loss', 
                       primary_metric_goal=PrimaryMetricGoal.MINIMIZE, 
                       max_total_runs=18,
                       max_concurrent_runs=12)
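A quick back-of-the-envelope check puts these settings in context. The sketch below counts the combinations in the search space defined above and compares that with the run budget. The idea of "waves" is a simplification introduced here for illustration; a scheduler can start a new run as soon as a slot frees up.

```python
import math

# Number of choices per hyperparameter in the sampling space defined above.
choices_per_hyperparameter = {"--batch-size": 3, "--learning-rate": 4, "--epochs": 4}
max_total_runs = 18
max_concurrent_runs = 12

n_combinations = math.prod(choices_per_hyperparameter.values())
coverage = max_total_runs / n_combinations
min_waves = math.ceil(max_total_runs / max_concurrent_runs)

print(f"{n_combinations} combinations, budget covers at most {coverage:.1%}")
print(f"at least {min_waves} waves of up to {max_concurrent_runs} concurrent runs")
```

With these numbers the budget can cover at most 18 of the 48 possible combinations, so the search is a sample of the space rather than an exhaustive sweep.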

### HyperDrive running

In [27]:
prd_run_src

<azureml.core.script_run_config.ScriptRunConfig at 0x7efebd770b50>

In [28]:
prd_run = prd_exp.submit(htc)
prd_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,HD_137bff59-1a21-4445-9605-1f326007c23e,hyperdrive,Running,Link to Azure Machine Learning studio,Link to Documentation


In [29]:
import azureml.tensorboard

In [30]:
tb = azureml.tensorboard.Tensorboard([prd_run])

# If successful, start() returns a string with the URI of the instance.
tb.start()

https://prd-ml-pipeline-6006.uksouth.instances.azureml.ms


'https://prd-ml-pipeline-6006.uksouth.instances.azureml.ms'

In [31]:
prd_run.wait_for_completion()
assert(prd_run.get_status() == "Completed")

### HyperDrive results

We can then select the model which performs best against the selected primary metric, as defined within the HyperDriveConfig.

Get the best trained model from the HyperDrive run and load the trained weights ready for inference.

In [32]:
prd_best_run = prd_run.get_best_run_by_primary_metric()
prd_best_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,HD_137bff59-1a21-4445-9605-1f326007c23e_17,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [42]:
import tempfile

In [45]:
import tensorflow as tf

In [46]:
with tempfile.TemporaryDirectory() as td1:
    td_path = pathlib.Path(td1)
    print(td_path)
    prd_best_run.download_files(prefix=prd_model_name, output_directory=td1)
    model_path = td_path / prd_model_name
    print(model_path)
    list(model_path.iterdir())
    trained_model = tf.keras.models.load_model(model_path)
trained_model

/tmp/tmpllspgjnb
/tmp/tmpllspgjnb/azml_fractions_cluster_hpt


2022-09-29 11:01:44.329663: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<keras.engine.functional.Functional at 0x7efebc0fa460>

In [51]:
import importlib


In [57]:
importlib.reload(prd_pipeline)

<module 'prd_pipeline' from '/mnt/batch/tasks/shared/LS_root/mounts/clusters/prd-ml-pipeline/code/Users/stephen.haddad/prd_fractional/fractions_model_pipeline/prd_pipeline.py'>

In [58]:
%%time
input_data = prd_pipeline.load_data(
    prd_ws,
    dataset_name=prd_azml_dataset_name
)

Volume mount is not enabled. 
Falling back to dataflow mount.
loading all event data
CPU times: user 6.28 s, sys: 915 ms, total: 7.2 s
Wall time: 14.1 s


In [59]:
do_save_split = False

In [60]:
%%time
if do_save_split:
    data_splits, data_dims = prd_pipeline.preprocess_data(
        input_data,
        test_fraction=0.2,
        feature_dict={'profile': profile_features, 'single_level': single_lvl_features,'target': target_parameter,},
        test_savefn='tmp.csv',
    )
else:
    data_splits, data_dims = prd_pipeline.preprocess_data(
        input_data,
        test_fraction=0.2,
        feature_dict={'profile': profile_features, 'single_level': single_lvl_features,'target': target_parameter,},
    )
    

target has dims: 5
dropping smallest bin: radar_fraction_in_band_instant_0.0
getting profile columns
{'nprof_features': 2, 'nheights': 33, 'nsinglvl_features': 0, 'nbands': 5}
CPU times: user 1.06 s, sys: 930 ms, total: 1.99 s
Wall time: 2.21 s


In [61]:
data_splits['X_train']

array([[[ 1.24593207,  0.94569035],
        [ 1.24593207,  0.96817307],
        [ 1.24563983,  0.99333412],
        ...,
        [ 1.48791667,  1.94644632],
        [ 1.50880768,  2.00976935],
        [ 1.54712621,  2.08105149]],

       [[ 1.14448783,  0.99757769],
        [ 1.14448783,  1.00261662],
        [ 1.14420951,  1.02742265],
        ...,
        [ 1.52299947,  1.98117967],
        [ 1.53759727,  2.03270693],
        [ 1.51815972,  2.12713304]],

       [[ 1.26042389,  1.06676082],
        [ 1.26042389,  1.07150371],
        [ 1.26012966,  1.09559971],
        ...,
        [ 1.41137246,  1.9580241 ],
        [ 1.41604101,  1.9638942 ],
        [ 1.41516774,  2.01192916]],

       ...,

       [[-1.80463548, -1.11250763],
        [-1.80463548, -1.15010504],
        [-1.80450905, -1.22242037],
        ...,
        [-1.3123258 , -1.93211163],
        [-1.45332854, -1.7749307 ],
        [-1.42998588, -1.60547275]],

       [[-1.99303169, -1.28546545],
        [-1.99303169, -1.32

In [62]:
trained_model.predict(data_splits['X_val'])

array([[0.07449621, 0.17084315, 0.62800384, 0.12113631, 0.00552045],
       [0.16235153, 0.24355742, 0.42113695, 0.14735422, 0.0255999 ],
       [0.05635332, 0.26082215, 0.5709541 , 0.09801023, 0.01386014],
       ...,
       [0.6834262 , 0.14691615, 0.15181117, 0.01580195, 0.00204462],
       [0.6414617 , 0.10684927, 0.21061496, 0.03631389, 0.00476021],
       [0.6207515 , 0.12125029, 0.22463657, 0.03013871, 0.00322298]],
      dtype=float32)

We can also return information from each of the HyperDrive child runs.

In [33]:
prd_run.get_children_sorted_by_primary_metric()

[{'run_id': 'HD_137bff59-1a21-4445-9605-1f326007c23e_17',
  'hyperparameters': '{"--batch-size": 128, "--epochs": 50, "--learning-rate": 0.001}',
  'best_primary_metric': 0.3477984666824341,
  'status': 'Completed'},
 {'run_id': 'HD_137bff59-1a21-4445-9605-1f326007c23e_2',
  'hyperparameters': '{"--batch-size": 64, "--epochs": 100, "--learning-rate": 0.0001}',
  'best_primary_metric': 0.37217044830322266,
  'status': 'Completed'},
 {'run_id': 'HD_137bff59-1a21-4445-9605-1f326007c23e_11',
  'hyperparameters': '{"--batch-size": 32, "--epochs": 50, "--learning-rate": 0.0001}',
  'best_primary_metric': 0.3729400932788849,
  'status': 'Completed'},
 {'run_id': 'HD_137bff59-1a21-4445-9605-1f326007c23e_7',
  'hyperparameters': '{"--batch-size": 64, "--epochs": 20, "--learning-rate": 0.001}',
  'best_primary_metric': 0.374253511428833,
  'status': 'Completed'},
 {'run_id': 'HD_137bff59-1a21-4445-9605-1f326007c23e_6',
  'hyperparameters': '{"--batch-size": 64, "--epochs": 20, "--learning-rate":
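Note that the `hyperparameters` field in each entry is a JSON-encoded string rather than a dictionary. A minimal sketch of decoding it into flat records is given below. The two sample entries are copied from the output above, while `flatten_child` and the choice of `val_loss` as the column name are hypothetical.

```python
import json

# Two entries copied from the get_children_sorted_by_primary_metric() output above.
children = [
    {"run_id": "HD_137bff59-1a21-4445-9605-1f326007c23e_17",
     "hyperparameters": '{"--batch-size": 128, "--epochs": 50, "--learning-rate": 0.001}',
     "best_primary_metric": 0.3477984666824341,
     "status": "Completed"},
    {"run_id": "HD_137bff59-1a21-4445-9605-1f326007c23e_2",
     "hyperparameters": '{"--batch-size": 64, "--epochs": 100, "--learning-rate": 0.0001}',
     "best_primary_metric": 0.37217044830322266,
     "status": "Completed"},
]


def flatten_child(child):
    """Decode the JSON hyperparameter string and tidy the flag names."""
    params = json.loads(child["hyperparameters"])
    row = {name.lstrip("-").replace("-", "_"): value for name, value in params.items()}
    row["run_id"] = child["run_id"]
    row["val_loss"] = child["best_primary_metric"]
    return row


rows = sorted(
    (flatten_child(child) for child in children if child["status"] == "Completed"),
    key=lambda row: row["val_loss"],
)
best = rows[0]
print(best["run_id"], best["batch_size"], best["epochs"], best["learning_rate"])
```

Records in this shape can be passed straight to `pd.DataFrame(rows)` for sorting or plotting.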

In [34]:
prd_run.get_metrics()

{'HD_137bff59-1a21-4445-9605-1f326007c23e_17': {'loss': 0.23863039910793304,
  'accuracy': 0.7795819044113159,
  'val_loss': 0.3477984666824341,
  'val_accuracy': 0.7306534051895142},
 'HD_137bff59-1a21-4445-9605-1f326007c23e_16': {'loss': 7.417075157165527,
  'accuracy': 0.5864090919494629,
  'val_loss': 7.932819843292236,
  'val_accuracy': 0.5465838313102722},
 'HD_137bff59-1a21-4445-9605-1f326007c23e_15': {'loss': 7.589364528656006,
  'accuracy': 0.5704078674316406,
  'val_loss': 7.2281084060668945,
  'val_accuracy': 0.5965869426727295},
 'HD_137bff59-1a21-4445-9605-1f326007c23e_14': {'loss': 0.39491021633148193,
  'accuracy': 0.6904413104057312,
  'val_loss': 0.424917608499527,
  'val_accuracy': 0.6774212121963501},
 'HD_137bff59-1a21-4445-9605-1f326007c23e_13': {'loss': 0.35052040219306946,
  'accuracy': 0.7225698828697205,
  'val_loss': 0.4094114601612091,
  'val_accuracy': 0.6886098384857178},
 'HD_137bff59-1a21-4445-9605-1f326007c23e_12': {'loss': 0.38068142533302307,
  'accura


In [36]:
prd_run.get_hyperparameters()

{'HD_137bff59-1a21-4445-9605-1f326007c23e_0': '{"--batch-size": 64, "--epochs": 50, "--learning-rate": 0.0001}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_1': '{"--batch-size": 32, "--epochs": 20, "--learning-rate": 0.01}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_2': '{"--batch-size": 64, "--epochs": 100, "--learning-rate": 0.0001}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_3': '{"--batch-size": 64, "--epochs": 50, "--learning-rate": 0.1}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_4': '{"--batch-size": 128, "--epochs": 20, "--learning-rate": 0.0001}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_5': '{"--batch-size": 64, "--epochs": 20, "--learning-rate": 0.1}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_6': '{"--batch-size": 64, "--epochs": 20, "--learning-rate": 0.0001}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_7': '{"--batch-size": 64, "--epochs": 20, "--learning-rate": 0.001}',
 'HD_137bff59-1a21-4445-9605-1f326007c23e_8': '{"--batch-size": 128, "--epochs": 50, "--learning-rate":
