# HyperDrive demo notebook

This notebook demonstrates HyperDrive, the AzureML tool for distributed hyperparameter tuning.

Hyperparameters control how a machine learning model is trained, and hyperparameter tuning is the problem of choosing an optimal set of hyperparameters for a learning algorithm in a specific application. Tuning requires compiling and training a model repeatedly with different combinations of hyperparameters drawn from a defined hyperparameter space, so it can be very time-intensive. Running these training jobs in parallel can significantly reduce the overall time taken.

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import pathlib
import matplotlib.pyplot as plt

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
import prd_pipeline

## Set up Azure experiment


In [5]:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset, Environment
from azureml.core import Experiment, ComputeTarget, ScriptRunConfig

In [6]:
prd_ws = Workspace.from_config()

In [7]:
import tensorflow

In [8]:
use_full_dataset = True
if use_full_dataset:
    azure_dataset_name = 'prd_merged_all_events_files'
else:
    # use a subset for development
    azure_dataset_name = 'prd_merged_202110_nswws_amber_oct_files'

In [9]:
if use_full_dataset:
    cluster_name = 'prd-ml-fractions-cluster'
    
else:
    cluster_name = 'mlops-test'

In [10]:
azure_experiment_name = 'prd_mlops_test'
azure_env_name = 'prd_ml_cluster'

In [11]:
prd_model_name = 'azml_cluster_demo_20220414'

In [12]:
target_parameter = 'radar_mean_rain_instant'
profile_features = ['air_temperature', 'relative_humidity']
single_lvl_features = ['air_pressure_at_sea_level'] 

In [13]:
prd_exp = Experiment(workspace=prd_ws, name=azure_experiment_name)
prd_exp

Name,Workspace,Report Page,Docs Page
prd_mlops_test,precip_rediagnosis,Link to Azure Machine Learning studio,Link to Documentation


Retrieve an earlier HyperDrive run (used again in the results section), then get the AzureML environment (basically a conda environment) from the workspace.

In [14]:
test_run = azureml.core.get_run(prd_exp, 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa')

In [15]:
test_run.get_best_run_by_primary_metric()

Experiment,Id,Type,Status,Details Page,Docs Page
prd_mlops_test,HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_0,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [16]:
prd_env = Environment.get(workspace=prd_ws, name=azure_env_name)
prd_env

{
    "assetId": "azureml://locations/uksouth/workspaces/57546dc9-9763-4025-831d-c19991c81540/environments/prd_ml_cluster/versions/4",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04",
        "baseImageRegistry": {
            "address": "mcr.microsoft.com",
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {},
    "inferencingStackVersion": null,
    "name": "prd_ml_cluster",
    "python": {
        "baseCondaEnvironment": null

### Load data

Use prd_pipeline to load and preprocess the data.

In [17]:
# import importlib 
# importlib.reload(prd_cluster_train_demo)

In [18]:
%%time
input_data = prd_pipeline.load_data(
    prd_ws,
    dataset_name=azure_dataset_name
)



Volume mount is not enabled. 
Falling back to dataflow mount.
loading all event data
CPU times: user 1min 39s, sys: 19.4 s, total: 1min 59s
Wall time: 2min 29s


In [19]:
[c1 for c1 in input_data.columns if 'radar' in c1]

['radar_max_rain_aggregate_3hr',
 'radar_mean_rain_aggregate_3hr',
 'radar_max_rain_instant',
 'radar_mean_rain_instant',
 'radar_fraction_in_band_aggregate_3hr_0.0',
 'radar_fraction_in_band_aggregate_3hr_0.25',
 'radar_fraction_in_band_aggregate_3hr_2.5',
 'radar_fraction_in_band_aggregate_3hr_7.0',
 'radar_fraction_in_band_aggregate_3hr_10.0',
 'radar_fraction_in_band_instant_0.0',
 'radar_fraction_in_band_instant_0.25',
 'radar_fraction_in_band_instant_2.5',
 'radar_fraction_in_band_instant_7.0',
 'radar_fraction_in_band_instant_10.0']

In [20]:
do_save_split = False

In [21]:
%%time
# Arguments shared by both cases; test_savefn is passed only when do_save_split is set.
preprocess_kwargs = dict(
    test_fraction=0.2,
    feature_dict={'profile': profile_features, 'single_level': single_lvl_features, 'target': target_parameter},
)
if do_save_split:
    preprocess_kwargs['test_savefn'] = 'tmp.csv'
data_splits, data_dims = prd_pipeline.preprocess_data(input_data, **preprocess_kwargs)

target has dims: 23
dropping zeros
getting profile columns
['relative_humidity_5.0', 'relative_humidity_10.0', 'relative_humidity_20.0', 'relative_humidity_30.0', 'relative_humidity_50.0', 'relative_humidity_75.0', 'relative_humidity_100.0', 'relative_humidity_150.0', 'relative_humidity_200.0', 'relative_humidity_250.0', 'relative_humidity_300.0', 'relative_humidity_400.0', 'relative_humidity_500.0', 'relative_humidity_600.0', 'relative_humidity_700.0', 'relative_humidity_800.0', 'relative_humidity_1000.0', 'relative_humidity_1250.0', 'relative_humidity_1500.0', 'relative_humidity_1750.0', 'relative_humidity_2000.0', 'relative_humidity_2250.0', 'relative_humidity_2500.0', 'relative_humidity_2750.0', 'relative_humidity_3000.0', 'relative_humidity_3250.0', 'relative_humidity_3500.0', 'relative_humidity_3750.0', 'relative_humidity_4000.0', 'relative_humidity_4500.0', 'relative_humidity_5000.0', 'relative_humidity_5500.0', 'relative_humidity_6000.0', 'air_temperature_5.0', 'air_temperature

In [22]:
# these are example calls to the code for easier debugging than running on a separate cluster
# model = prd_cluster_train_demo.build_model(**data_dims)
# model = prd_cluster_train_demo.train_model(model, data_splits)

In [23]:
import datetime
log_dir = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

### Execute our training run on a cluster with HyperDrive for parallelised hyperparameter tuning

In [24]:
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling
from azureml.train.hyperdrive import choice, loguniform

In [25]:
prd_demo_compute_target = ComputeTarget(workspace=prd_ws, name=cluster_name)
prd_demo_compute_target

AmlCompute(workspace=Workspace.create(name='precip_rediagnosis', subscription_id='07efdc52-cd27-48ed-9443-3aad2b6b777b', resource_group='precip_rediagnosis'), name=prd-ml-fractions-cluster, id=/subscriptions/07efdc52-cd27-48ed-9443-3aad2b6b777b/resourceGroups/precip_rediagnosis/providers/Microsoft.MachineLearningServices/workspaces/precip_rediagnosis/computes/prd-ml-fractions-cluster, type=AmlCompute, provisioning_state=Succeeded, location=uksouth, tags={})

Hyperparameters that we want to vary using HyperDrive must be input arguments of the prd_cluster_train_demo.py script, which is called through ScriptRunConfig. Any values for those hyperparameters set in prd_demo_args will be overridden by HyperDrive.

In [26]:
nepochs = 1

In [27]:
prd_demo_args = ['--dataset-name', azure_dataset_name,
                 '--target-parameter', target_parameter,
                 '--model-name', prd_model_name,
                ]

prd_demo_args += ['--profile-features']
prd_demo_args += profile_features
prd_demo_args += ['--single-level_features']
prd_demo_args += single_lvl_features
prd_demo_args += ['--epochs', nepochs]
prd_demo_args += ['--batch-size', 128]
prd_demo_args += ['--learning-rate', 0.01]
prd_demo_args += ['--test-fraction', 0.2]
prd_demo_args += ['--log-dir', log_dir]
prd_demo_args

['--dataset-name',
 'prd_merged_all_events_files',
 '--target-parameter',
 'radar_mean_rain_instant',
 '--model-name',
 'azml_cluster_demo_20220414',
 '--profile-features',
 'air_temperature',
 'relative_humidity',
 '--single-level_features',
 'air_pressure_at_sea_level',
 '--epochs',
 1,
 '--batch-size',
 128,
 '--learning-rate',
 0.01,
 '--test-fraction',
 0.2,
 '--log-dir',
 'logs/fit/20220928-222609']
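
For HyperDrive to override a value, the training script has to accept it as a command-line argument. The real parser in prd_cluster_train_demo.py is not shown in this notebook, so the following is only a minimal sketch of what such a parser might look like; `build_parser`, the defaults and the `example_args` list are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser; the real prd_cluster_train_demo.py may differ."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset-name', type=str)
    parser.add_argument('--target-parameter', type=str)
    parser.add_argument('--model-name', type=str)
    # nargs='+' collects every value up to the next flag into a list.
    parser.add_argument('--profile-features', nargs='+', default=[])
    parser.add_argument('--single-level_features', nargs='+', default=[])
    parser.add_argument('--epochs', type=int, default=1)
    # The two arguments below are the ones HyperDrive samples in this notebook.
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--test-fraction', type=float, default=0.2)
    parser.add_argument('--log-dir', type=str, default='logs')
    return parser


# Arguments arrive as strings on the command line; argparse converts them.
example_args = [
    '--dataset-name', 'prd_merged_all_events_files',
    '--profile-features', 'air_temperature', 'relative_humidity',
    '--single-level_features', 'air_pressure_at_sea_level',
    '--batch-size', '64',
    '--learning-rate', '0.1',
]
parsed = build_parser().parse_args(example_args)
print(parsed.batch_size, parsed.learning_rate, parsed.profile_features)
```

argparse turns the dashes in a flag name into underscores, so `--batch-size` is read back as `parsed.batch_size` and `--single-level_features` as `parsed.single_level_features`.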

In [28]:
import os
os.getcwd()

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/prd-ml-pipeline/code/Users/stephen.haddad/precip_rediagnosis/model_pipeline'

In [29]:
prd_run_src = ScriptRunConfig(source_directory=os.getcwd(),
                      script='prd_cluster_train_demo.py',
                      arguments=prd_demo_args,
                      compute_target=prd_demo_compute_target,
                      environment=prd_env)

### HyperDrive configuration

The next step is to configure our HyperDrive run. The run config defined above is passed to HyperDriveConfig as <code>run_config</code>. We must provide a <code>hyperparameter_sampling</code> object, explained in more detail below. We must also provide the <code>primary_metric_name</code> (either the model's loss function or a metric defined when compiling the model; the name must match a metric logged by the training script) and the <code>primary_metric_goal</code>, which states whether to minimize or maximize this metric. Finally, we define <code>max_total_runs</code> (an upper bound on the number of runs; the actual number may be smaller depending on the hyperparameter space and sampling strategy) and <code>max_concurrent_runs</code>; if the latter is set to None, all runs are launched in parallel.
There is also the option to define an early stopping policy with the <code>policy</code> argument.

#### Hyperparameter sampling

There are three different classes for sampling the hyperparameter space: 
- <code>GridParameterSampling</code>: defines the search space as a grid built from the given hyperparameter values, then evaluates every position in the grid in order (note that if max_total_runs is smaller than the total number of combinations, only a subsample of the grid is run).
- <code>RandomParameterSampling</code>: randomly samples hyperparameter combinations from the hyperparameter space.
- <code>BayesianParameterSampling</code>: defines Bayesian sampling over the hyperparameter space, picking each new sample of hyperparameters based on how the previous samples performed.
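
To get a feel for how quickly a grid grows, the combinations can be enumerated in plain Python. This is only a sketch of the idea behind grid sampling, not AzureML code; the `space` dictionary holds illustrative values chosen for this example.

```python
import itertools

# Hypothetical search space, chosen for illustration only.
space = {
    '--batch-size': [32, 64],
    '--learning-rate': [0.001, 0.01],
}

# Every grid position is one combination of one value per hyperparameter.
names = list(space)
grid = [dict(zip(names, combo)) for combo in itertools.product(*space.values())]

print(len(grid))
for point in grid:
    print(point)
```

The grid size is the product of the number of values per hyperparameter, so comparing `len(grid)` with max_total_runs shows whether the whole grid would be covered.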

We then define the hyperparameter space from which combinations of hyperparameters are selected for assessment. The hyperparameters that we want to tune must be input arguments to the script run by ScriptRunConfig, and any values passed for those arguments will be overridden by HyperDrive.

In AzureML, there are different ways to define the set from which each hyperparameter is sampled (<code>choice</code>, <code>lognormal</code>, <code>loguniform</code>, <code>normal</code>, <code>qlognormal</code>, <code>qloguniform</code>, <code>qnormal</code>, <code>quniform</code>, <code>randint</code> and <code>uniform</code>).

In [30]:
ps = RandomParameterSampling(
    {
        '--batch-size': choice(32, 64, 128),
        '--learning-rate': loguniform(-6, -1)
    }
)
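
The arguments of <code>loguniform</code> are exponents rather than learning rates: as far as the AzureML documentation describes it, the sampled value is exp(uniform(min, max)). The sketch below reproduces that behaviour with the standard library to show the range implied by loguniform(-6, -1); `sample_loguniform` is a hypothetical helper, not part of AzureML.

```python
import math
import random


def sample_loguniform(min_value, max_value, rng):
    """Draw exp(u) with u uniform on [min_value, max_value]."""
    return math.exp(rng.uniform(min_value, max_value))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(-6, -1, rng) for _ in range(1000)]

# Bounds of the learning rates that loguniform(-6, -1) can produce.
lower, upper = math.exp(-6), math.exp(-1)
print(f'range: {lower:.5f} to {upper:.5f}')
print(f'min sample: {min(samples):.5f}, max sample: {max(samples):.5f}')
```

The learning rates therefore fall between roughly 0.0025 and 0.37, with equal probability mass per unit of log learning rate.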

#### Early stopping policy

We can define an early stopping policy, which means that any poorly performing experiment runs are cancelled and new ones started. Here we use BanditPolicy with a slack factor of 0.1 (the ratio of slack allowed with respect to the best performing training run) and an evaluation interval of 2 (the frequency for applying the policy, here every second time the training script logs the primary metric).

In [31]:
early_stop_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
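
The slack factor can be read as a tolerance band around the best run. The function below is a rough sketch of that rule, not the AzureML implementation: the maximisation branch follows the worked example in the AzureML documentation, and the minimisation branch is assumed to mirror it. `bandit_would_cancel` and the metric values are hypothetical.

```python
def bandit_would_cancel(run_best, overall_best, slack_factor, minimize=True):
    """Return True if a run falls outside the slack band around the best run.

    Assumption: when minimising, the threshold is overall_best * (1 + slack_factor);
    when maximising, it is overall_best / (1 + slack_factor).
    """
    if minimize:
        return run_best > overall_best * (1 + slack_factor)
    return run_best < overall_best / (1 + slack_factor)


# Minimising an error metric with slack_factor=0.1 and a best value of 14.0:
# the threshold is about 15.4, so 15.0 survives and 16.0 is cancelled.
print(bandit_would_cancel(15.0, 14.0, 0.1))
print(bandit_would_cancel(16.0, 14.0, 0.1))
```

Under this reading, a larger slack factor keeps more runs alive, while a value close to zero cancels almost everything except the current best run.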

In [32]:
htc = HyperDriveConfig(run_config=prd_run_src, 
                       hyperparameter_sampling=ps, 
                       policy=early_stop_policy, 
                       primary_metric_name='mean_absolute_error', 
                       primary_metric_goal=PrimaryMetricGoal.MINIMIZE, 
                       max_total_runs=4,
                       max_concurrent_runs=4)

### HyperDrive running

In [33]:
prd_run = prd_exp.submit(htc)
prd_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_mlops_test,HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c,hyperdrive,Running,Link to Azure Machine Learning studio,Link to Documentation


In [41]:
prd_run.wait_for_completion()
assert(prd_run.get_status() == "Completed")

### HyperDrive results

We can then select the model which performs best against the selected primary metric, as defined within the HyperDriveConfig.

In [ ]:
prd_best_run = prd_run.get_best_run_by_primary_metric()
prd_best_run

We can also return information from each of the different HyperDrive child runs.

In [54]:
prd_best_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_mlops_test,HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_2,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [49]:
import tempfile

In [50]:
with tempfile.TemporaryDirectory() as td1:
    td_path = pathlib.Path(td1)
    print(td_path)
    # Download the saved model files from the best run into the temporary directory.
    prd_best_run.download_files(prefix=prd_model_name, output_directory=td1)
    model_path = td_path / prd_model_name
    print(model_path)
    list(model_path.iterdir())
    # Load the model before the temporary directory is cleaned up.
    trained_model = tensorflow.keras.models.load_model(model_path)
trained_model

/tmp/tmpwx2wjjzp
/tmp/tmpwx2wjjzp/azml_cluster_demo_20220414


2022-09-29 09:07:38.543726: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<keras.engine.functional.Functional at 0x7f156b4ddd90>

In [51]:
trained_model

<keras.engine.functional.Functional at 0x7f156b4ddd90>

In [53]:
trained_model.predict(data_splits['X_val'])

array([[0.15438266],
       [0.15438266],
       [0.15438266],
       ...,
       [0.15438266],
       [0.15438266],
       [0.15438266]], dtype=float32)

In [43]:
prd_run.get_children_sorted_by_primary_metric()

[{'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_2',
  'hyperparameters': '{"--batch-size": 64, "--learning-rate": 0.1076598586738495}',
  'best_primary_metric': 1.9538762494397507e+30,
  'status': 'Completed'},
 {'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_0',
  'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.12068074221367489}',
  'best_primary_metric': 2.0789955060618905e+30,
  'status': 'Completed'},
 {'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_1',
  'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.09173612513011196}',
  'best_primary_metric': 2.159375022937697e+30,
  'status': 'Completed'},
 {'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_3',
  'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.003330776141701076}',
  'best_primary_metric': 2.8497554945720076e+30,
  'status': 'Completed'}]
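
In the list above, the `hyperparameters` field of each child run is a JSON string rather than a dictionary. The sketch below shows one way to unpack it into flat rows; the `children` list is copied from the first two entries of the output above so that the example runs without a workspace, and `rows` is a name introduced here for illustration.

```python
import json

# First two entries copied from the output of get_children_sorted_by_primary_metric().
children = [
    {'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_2',
     'hyperparameters': '{"--batch-size": 64, "--learning-rate": 0.1076598586738495}',
     'best_primary_metric': 1.9538762494397507e+30,
     'status': 'Completed'},
    {'run_id': 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_0',
     'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.12068074221367489}',
     'best_primary_metric': 2.0789955060618905e+30,
     'status': 'Completed'},
]

rows = []
for child in children:
    # Decode the JSON string so each hyperparameter becomes its own column.
    params = json.loads(child['hyperparameters'])
    rows.append({'run_id': child['run_id'],
                 'best_primary_metric': child['best_primary_metric'],
                 **params})

for row in rows:
    print(row)
```

Rows in this shape can be passed directly to `pd.DataFrame(rows)` for sorting or plotting the metric against each hyperparameter.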

In [44]:
prd_run.get_metrics()

{'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_1': {'mean_absolute_error': 2.159375022937697e+30,
  'R-squared score': -5.3476372861016586e-05},
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_3': {'mean_absolute_error': 2.8497554945720076e+30,
  'R-squared score': -6.0732208369662644e-05},
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_2': {'mean_absolute_error': 1.9538762494397507e+30,
  'R-squared score': -4.675980544988079e-05},
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_0': {'mean_absolute_error': 2.0789955060618905e+30,
  'R-squared score': -5.448359881343734e-05}}

In [45]:
test_run.get_metrics()

{'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_0': {'mean_absolute_error': 14.027489237225254,
  'R-squared score': 0.586384162867827},
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_1': {'mean_absolute_error': 14.818366171589505,
  'R-squared score': 0.49815163454629796},
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_3': {'mean_absolute_error': 14.964776840180555,
  'R-squared score': 0.5118016772403515},
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_2': {'mean_absolute_error': 14.473514401967618,
  'R-squared score': 0.5935323290592469}}

In [46]:
prd_run.get_hyperparameters()

{'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_0': '{"--batch-size": 32, "--learning-rate": 0.12068074221367489}',
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_1': '{"--batch-size": 32, "--learning-rate": 0.09173612513011196}',
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_2': '{"--batch-size": 64, "--learning-rate": 0.1076598586738495}',
 'HD_a7adf1dd-cfb8-4d35-b1df-4a719c4c499c_3': '{"--batch-size": 32, "--learning-rate": 0.003330776141701076}'}

In [47]:
test_run.get_hyperparameters()

{'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_0': '{"--batch-size": 32, "--learning-rate": 0.001}',
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_1': '{"--batch-size": 32, "--learning-rate": 0.01}',
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_2': '{"--batch-size": 64, "--learning-rate": 0.001}',
 'HD_18fd3ffb-6021-4845-88d0-e706c643a9fa_3': '{"--batch-size": 64, "--learning-rate": 0.01}'}