# HyperDrive demo notebook

This notebook demonstrates HyperDrive, the AzureML tool for distributed hyperparameter tuning.

Hyperparameters control how a machine learning model is trained. Hyperparameter tuning is the problem of choosing the set of hyperparameters that works best for a learning algorithm in a specific application. It requires compiling and training a model repeatedly with different combinations of hyperparameters drawn from a defined hyperparameter space, so it can be very time-intensive. Running these training runs in parallel can significantly reduce the total wall-clock time.

This notebook works best with the <b>`prd_model_dev_azml_001`</b> or `Python 3.8 - AzureML` environment.

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import pathlib
import matplotlib.pyplot as plt

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
import prd_pipeline

## Set up Azure experiment


In [5]:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset, Environment
from azureml.core import Experiment, ComputeTarget, ScriptRunConfig

In [6]:
prd_ws = Workspace.from_config()

In [7]:
# azure_dataset_name = 'prd_merged_202008_storm_francis' #'sd3'
azure_experiment_name='prd_fraction_models'
azure_env_name = 'prd_ml_cluster'
cluster_name = 'prd-ml-fractions-cluster'

In [8]:
prd_model_name = 'azml_cluster_demo_202208'

In [9]:
target_parameter = [
    'radar_fraction_in_band_instant_0.0',
    'radar_fraction_in_band_instant_0.25',
    'radar_fraction_in_band_instant_2.5',
    'radar_fraction_in_band_instant_7.0',
    'radar_fraction_in_band_instant_10.0'
]
# target_parameter = 'radar_mean_rain_instant'
profile_features = ['air_temperature', 'relative_humidity']
single_lvl_features = [] #'air_pressure_at_sea_level'

In [10]:
feature_dict = {
    'profile': profile_features,
    'single_level': single_lvl_features,
    'target': target_parameter,
} 

In [11]:
prd_exp = Experiment(workspace=prd_ws, name=azure_experiment_name)
prd_exp

Name,Workspace,Report Page,Docs Page
prd_fraction_models,precip_rediagnosis,Link to Azure Machine Learning studio,Link to Documentation


Get the AzML environment (basically a conda environment) from the workspace.

In [14]:
prd_env = Environment.get(workspace=prd_ws, name=azure_env_name)
prd_env

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04",
        "baseImageRegistry": {
            "address": "mcr.microsoft.com",
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {},
    "inferencingStackVersion": null,
    "name": "prd_ml_cluster",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "conda-forge"
            ],
            "dependencies": [
                "python=3.8",

### Load data

Use prd_pipeline to load and preprocess data

In [16]:
# azure_dataset_name = 'prd_merged_all_events_files'
azure_dataset_name = 'prd_merged_202112_storm_barra_files'
train202208_dataset_all = azureml.core.Dataset.get_by_name(prd_ws, name=azure_dataset_name)
prd_prefix = 'prd'
merged_prefix = prd_prefix + '_merged'
csv_file_suffix = 'csv'

In [23]:
import datetime
log_dir = 'log/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

### Execute our training run on a cluster with HyperDrive for parallelised hyperparameter tuning

In [30]:
prd_demo_compute_target = ComputeTarget(workspace=prd_ws, name=cluster_name)
prd_demo_compute_target

AmlCompute(workspace=Workspace.create(name='precip_rediagnosis', subscription_id='07efdc52-cd27-48ed-9443-3aad2b6b777b', resource_group='precip_rediagnosis'), name=prd-ml-fractions-cluster, id=/subscriptions/07efdc52-cd27-48ed-9443-3aad2b6b777b/resourceGroups/precip_rediagnosis/providers/Microsoft.MachineLearningServices/workspaces/precip_rediagnosis/computes/prd-ml-fractions-cluster, type=AmlCompute, provisioning_state=Succeeded, location=uksouth, tags={})

Hyperparameters that we want to vary with HyperDrive must be input arguments of the `prd_cluster_train_demo.py` script, which is called through `ScriptRunConfig`. Any values set for them in `prd_demo_args` are overridden by HyperDrive.

In [31]:
nepochs = 1

In [32]:
prd_demo_args = ['--dataset-name', azure_dataset_name,
                 '--model-name', prd_model_name, 
                 '--test-fraction', 0.2,
                ]
prd_demo_args += ['--target-parameter']
prd_demo_args += target_parameter
prd_demo_args += ['--profile-features']
prd_demo_args += profile_features
prd_demo_args += ['--single-level_features']
prd_demo_args += single_lvl_features
prd_demo_args += ['--epochs', nepochs]
prd_demo_args += ['--batch-size', 128]
prd_demo_args += ['--learning-rate', 0.01]

prd_demo_args

['--dataset-name',
 'prd_merged_202112_storm_barra_files',
 '--model-name',
 'azml_cluster_demo_202208',
 '--test-fraction',
 0.2,
 '--target-parameter',
 'radar_fraction_in_band_instant_0.0',
 'radar_fraction_in_band_instant_0.25',
 'radar_fraction_in_band_instant_2.5',
 'radar_fraction_in_band_instant_7.0',
 'radar_fraction_in_band_instant_10.0',
 '--profile-features',
 'air_temperature',
 'relative_humidity',
 '--single-level_features',
 '--epochs',
 1,
 '--batch-size',
 128,
 '--learning-rate',
 0.01]
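
For these arguments to be tunable, the training script has to accept them on the command line. The training script itself is not shown in this notebook, so the following is only a minimal sketch of what its argument parsing might look like, assuming it uses `argparse`. The function name `build_parser` and the defaults are hypothetical; the flag names are taken from the list above.

```python
import argparse


def build_parser():
    """Hypothetical parser; the real prd_cluster_train_demo.py may differ."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset-name', type=str)
    parser.add_argument('--model-name', type=str)
    parser.add_argument('--test-fraction', type=float, default=0.2)
    # nargs='*' accepts zero or more values, so an empty feature list is valid
    parser.add_argument('--target-parameter', nargs='*', default=[])
    parser.add_argument('--profile-features', nargs='*', default=[])
    parser.add_argument('--single-level_features', nargs='*', default=[])
    # These are the arguments a tuner would vary between runs
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    return parser


# Shortened argument list for illustration. Command-line values arrive as
# strings, so numbers are converted to str before parsing here.
demo_args = ['--dataset-name', 'prd_merged_202112_storm_barra_files',
             '--single-level_features',
             '--epochs', 1, '--batch-size', 32, '--learning-rate', 0.01]
args = build_parser().parse_args([str(a) for a in demo_args])
print(args.batch_size, args.learning_rate, args.single_level_features)
```

Note that `argparse` maps `--batch-size` to `args.batch_size`, and that the `type=` converters turn the string values back into numbers inside the script.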

In [33]:
import os

prd_run_src = ScriptRunConfig(
    source_directory=os.getcwd(),  # directory containing the training script
    script='prd_cluster_train_demo.py',
    arguments=prd_demo_args,
    compute_target=prd_demo_compute_target,
    environment=prd_env
)

### Run on cluster

In [34]:
prd_run = prd_exp.submit(prd_run_src)
prd_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,prd_fraction_models_1664272611_176642a9,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


In [35]:
prd_run.wait_for_completion()
assert prd_run.get_status() == "Completed"

Loading the model back in from the AzureML experiment run in this way <b>only works with the `Python 3.8 - Pytorch and Tensorflow`</b> environment. This is fine for a single run on a cluster. However, in the following section, where we use HyperDrive to automate hyperparameter tuning, we cannot reload the model from the run: reloading requires TensorFlow >= 2.7, but we only seem to be able to get TensorFlow 2.2 in an environment with `azureml.train`.

In [36]:
# import tempfile
# import tensorflow.keras

In [38]:
# with tempfile.TemporaryDirectory() as td1:
#     td_path = pathlib.Path(td1)
#     print(td_path)
#     prd_run.download_files(prefix=prd_model_name, output_directory=td1)
#     model_path = td_path / prd_model_name
#     print(model_path)
#     list(model_path.iterdir())
#     trained_model = tensorflow.keras.models.load_model(model_path)
# trained_model

/tmp/tmpuxbo9ml7
/tmp/tmpuxbo9ml7/azml_cluster_demo_202208


2022-09-27 10:02:14.922243: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-27 10:02:14.922317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (prd-ml-fractions): /proc/driver/nvidia/version does not exist
2022-09-27 10:02:14.929858: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<keras.engine.functional.Functional at 0x7f9ea63de1f0>

### HyperDrive configuration

The next step is to configure our HyperDrive run. The run config defined above is passed to <code>HyperDriveConfig</code> as <code>run_config</code>. We must provide a <code>hyperparameter_sampling</code> strategy, explained in more detail below. We must also provide the <code>primary_metric_name</code> (either the model's loss function or a metric defined when compiling the model) and the <code>primary_metric_goal</code>, which states whether to minimize or maximize this metric. We also set <code>max_total_runs</code> (an upper bound on the number of runs; the actual number may be smaller depending on the defined hyperparameter space and sampling strategy) and <code>max_concurrent_runs</code>; if this is set to None, all runs are launched in parallel.
There is also the option to define an early stopping policy with the <code>policy</code> argument.

In [29]:
from azureml.train.hyperdrive import RandomParameterSampling, GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

#### Hyperparameter sampling

There are three different classes for sampling the hyperparameter space: 
- <code>GridParameterSampling</code>: defines the search space as a grid over the given hyperparameter space, then evaluates every position in the grid in order (note that if max_total_runs is less than the total number of combinations, only a subsample of the grid is run).
- <code>RandomParameterSampling</code>: randomly samples hyperparameter combinations from the hyperparameter space.
- <code>BayesianParameterSampling</code>: defines Bayesian sampling over the hyperparameter space, trying to pick the next sample of hyperparameters intelligently based on how the previous samples performed.
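
To make the relationship between the grid size and max_total_runs concrete, here is a small sketch in plain Python. It is not HyperDrive's own code; it only enumerates the combinations of a discrete space with `itertools.product`. The space uses the same values as the `choice(...)` expressions in this notebook, and the variable names are illustrative.

```python
import itertools

# Discrete search space, using the same values as this notebook
space = {
    '--batch-size': [32],
    '--learning-rate': [0.1, 0.01],
}

# Every grid position is one combination of one value per hyperparameter
names = list(space)
grid = [dict(zip(names, values))
        for values in itertools.product(*(space[n] for n in names))]

# The number of distinct configurations is capped by the size of the grid,
# even when the run budget is larger
max_total_runs = 18
n_runs = min(len(grid), max_total_runs)
print(len(grid), n_runs)
```

With one batch size and two learning rates there are only two distinct combinations, so a budget of 18 runs cannot be filled with distinct grid positions.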

We then define the hyperparameter space from which combinations of hyperparameters are selected for assessment. The hyperparameters that we want to tune must be input arguments of the script run by ScriptRunConfig; any values already supplied for those arguments are overridden by HyperDrive.

In AzureML, there are different ways to define the set from which each hyperparameter is sampled (<code>choice</code>, <code>lognormal</code>, <code>loguniform</code>, <code>normal</code>, <code>qlognormal</code>, <code>qloguniform</code>, <code>qnormal</code>, <code>quniform</code>, <code>randint</code> and <code>uniform</code>).
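
The sketch below illustrates what random sampling from such a space amounts to, using plain-Python stand-ins for three of these expressions. These stand-ins are not the `azureml.train.hyperdrive` functions; they only mimic their meaning as I understand it, in particular that `loguniform(low, high)` draws `exp(uniform(low, high))`. The batch sizes 64 and 128 and the learning-rate range are hypothetical values chosen for illustration.

```python
import math
import random


def choice(*options):
    """Stand-in: pick one of the listed values with equal probability."""
    return lambda rng: rng.choice(options)


def uniform(low, high):
    """Stand-in: draw a float uniformly between low and high."""
    return lambda rng: rng.uniform(low, high)


def loguniform(low, high):
    """Stand-in: exp(uniform(low, high)), so the log of the result is uniform."""
    return lambda rng: math.exp(rng.uniform(low, high))


# Hypothetical search space: a discrete batch size and a learning rate
# spread evenly across orders of magnitude between 1e-4 and 1e-1
space = {
    '--batch-size': choice(32, 64, 128),
    '--learning-rate': loguniform(math.log(1e-4), math.log(1e-1)),
}

rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [{name: draw(rng) for name, draw in space.items()} for _ in range(5)]
for sample in samples:
    print(sample)
```

A log-uniform range is a common choice for learning rates because it gives each order of magnitude the same chance of being sampled.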

In [28]:
ps = RandomParameterSampling(
    {
        '--batch-size': choice(32),
        '--learning-rate': choice(0.1, 0.01),
        # '--epochs': choice(10, 20, 50, 100)
    }
)

#### Early stopping policy

We can define an early stopping policy, which means that poorly performing experiment runs are canceled and new ones started. Here we use <code>BanditPolicy</code> with a slack factor of 0.1 (the ratio of slack allowed with respect to the best performing training run) and an evaluation interval of 2 (the frequency for applying the policy, here every second time the primary metric is reported).

In [29]:
early_stop_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
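
As a rough illustration of what the slack factor means, the sketch below computes a stopping cutoff from the best metric seen so far. This is not HyperDrive's implementation. The maximize case follows the example in the Azure documentation (cutoff = best / (1 + slack_factor)); the minimize case, best * (1 + slack_factor), is an assumption made here by symmetry. The function names and metric values are hypothetical.

```python
def bandit_cutoff(best_metric, slack_factor, goal='minimize'):
    """Cutoff beyond which a run would be stopped (assumed form, see text)."""
    if goal == 'maximize':
        return best_metric / (1 + slack_factor)
    # Assumption: for a minimized metric the slack is applied upwards
    return best_metric * (1 + slack_factor)


def should_stop(run_metric, best_metric, slack_factor, goal='minimize'):
    """True if the run falls outside the allowed slack of the best run."""
    cutoff = bandit_cutoff(best_metric, slack_factor, goal)
    if goal == 'maximize':
        return run_metric < cutoff
    return run_metric > cutoff


# Hypothetical validation losses with the slack factor used in this notebook
best_val_loss = 0.40
print(bandit_cutoff(best_val_loss, 0.1))      # roughly 0.44
print(should_stop(0.50, best_val_loss, 0.1))  # outside the slack
print(should_stop(0.42, best_val_loss, 0.1))  # within the slack
```

Under these assumptions, a run with a validation loss of 0.50 would be stopped while one at 0.42 would be allowed to continue.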

In [30]:
htc = HyperDriveConfig(run_config=prd_run_src, 
                       hyperparameter_sampling=ps, 
                       policy=early_stop_policy, 
                       primary_metric_name='val_loss', 
                       primary_metric_goal=PrimaryMetricGoal.MINIMIZE, 
                       max_total_runs=18,
                       max_concurrent_runs=12)

### HyperDrive running

In [31]:
prd_run = prd_exp.submit(htc)
prd_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,HD_06511807-b851-42c5-921f-de4d85748b7e,hyperdrive,Running,Link to Azure Machine Learning studio,Link to Documentation


In [32]:
prd_run.wait_for_completion()
assert prd_run.get_status() == "Completed"

### HyperDrive results

We can then select the model which performs best against the selected primary metric, as defined within the HyperDriveConfig.

In [33]:
prd_best_run = prd_run.get_best_run_by_primary_metric()
prd_best_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,HD_06511807-b851-42c5-921f-de4d85748b7e_0,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


We can also return information from each of the HyperDrive child runs.

In [40]:
prd_best_run.get_metrics()

{'loss': 0.3990348279476166,
 'accuracy': 0.6889052987098694,
 'val_loss': 0.4088818430900574,
 'val_accuracy': 0.6764727830886841}

In [34]:
prd_run.get_children_sorted_by_primary_metric()

[{'run_id': 'HD_06511807-b851-42c5-921f-de4d85748b7e_0',
  'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.01}',
  'best_primary_metric': 0.4088818430900574,
  'status': 'Completed'},
 {'run_id': 'HD_06511807-b851-42c5-921f-de4d85748b7e_1',
  'hyperparameters': '{"--batch-size": 32, "--learning-rate": 0.1}',
  'best_primary_metric': 7.4540114402771,
  'status': 'Completed'}]

In [35]:
prd_run.get_metrics()

{'HD_06511807-b851-42c5-921f-de4d85748b7e_0': {'loss': 0.3990348279476166,
  'accuracy': 0.6889052987098694,
  'val_loss': 0.4088818430900574,
  'val_accuracy': 0.6764727830886841},
 'HD_06511807-b851-42c5-921f-de4d85748b7e_1': {'loss': 7.303744792938232,
  'accuracy': 0.5967587232589722,
  'val_loss': 7.4540114402771,
  'val_accuracy': 0.5869709849357605}}

In [37]:
prd_run.get_hyperparameters()

{'HD_06511807-b851-42c5-921f-de4d85748b7e_0': '{"--batch-size": 32, "--learning-rate": 0.01}',
 'HD_06511807-b851-42c5-921f-de4d85748b7e_1': '{"--batch-size": 32, "--learning-rate": 0.1}'}
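
Because the hyperparameters come back as JSON strings keyed by run id, it can be convenient to merge them with the metrics into one record per child run. The following is a minimal sketch using the values shown in the outputs above, copied in as literals so that it runs without a workspace; in the notebook the two dictionaries would come from `prd_run.get_metrics()` and `prd_run.get_hyperparameters()`.

```python
import json

# Values copied from the get_metrics() and get_hyperparameters() outputs above
metrics = {
    'HD_06511807-b851-42c5-921f-de4d85748b7e_0': {
        'loss': 0.3990348279476166, 'accuracy': 0.6889052987098694,
        'val_loss': 0.4088818430900574, 'val_accuracy': 0.6764727830886841},
    'HD_06511807-b851-42c5-921f-de4d85748b7e_1': {
        'loss': 7.303744792938232, 'accuracy': 0.5967587232589722,
        'val_loss': 7.4540114402771, 'val_accuracy': 0.5869709849357605},
}
hyperparameters = {
    'HD_06511807-b851-42c5-921f-de4d85748b7e_0':
        '{"--batch-size": 32, "--learning-rate": 0.01}',
    'HD_06511807-b851-42c5-921f-de4d85748b7e_1':
        '{"--batch-size": 32, "--learning-rate": 0.1}',
}

rows = []
for run_id, params_json in hyperparameters.items():
    row = {'run_id': run_id}
    row.update(json.loads(params_json))  # decode the JSON string into a dict
    row.update(metrics[run_id])          # add the metrics for the same run
    rows.append(row)

# Sort so that the run with the lowest validation loss comes first
rows.sort(key=lambda r: r['val_loss'])
for row in rows:
    print(row['run_id'][-2:], row['--learning-rate'], round(row['val_loss'], 4))
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` for plotting or further comparison.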

### Load model and evaluate - <b>not working</b> 

The following approach to reloading the model directly from a HyperDrive run does not work. It requires TensorFlow >= 2.7, but the most recent version of TensorFlow that appears to work with `azureml.train` is 2.2.

In [39]:
# import tempfile
# import tensorflow.keras

In [40]:
# with tempfile.TemporaryDirectory() as td1:
#     td_path = pathlib.Path(td1)
#     prd_best_run.download_files(prefix=prd_model_name, output_directory=td1)
#     model_path = td_path / prd_model_name
#     list(model_path.iterdir())
#     trained_model = tensorflow.keras.models.load_model(model_path)
# trained_model