# HyperDrive demo notebook

This notebook demonstrates the use of the AzureML tool HyperDrive, which allows for distributed hyperparameter tuning.

Hyperparameters control the training of a machine learning model, and hyperparameter tuning is the problem of choosing an optimal set of hyperparameters for a learning algorithm in a specific application. This requires compiling and training a model repeatedly with different combinations of hyperparameters drawn from a defined hyperparameter space, so the process can be very time intensive. Running these training runs in parallel can significantly reduce the wall-clock time the search takes.
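
To make the idea concrete, here is a minimal sketch of a hyperparameter search in plain Python. It does not use AzureML: <code>toy_validation_loss</code> is a hypothetical stand-in for "train the model and report its validation loss", and the search space values are illustrative only.

```python
import random


def toy_validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


# Illustrative search space: each hyperparameter maps to its candidate values.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
    "batch_size": [32, 64, 128],
}


def random_search(space, n_trials, seed=0):
    """Evaluate n_trials random combinations and return (best_loss, best_params)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        trials.append((toy_validation_loss(**params), params))
    # Compare on the loss only, so the parameter dicts are never compared.
    return min(trials, key=lambda trial: trial[0])


best_loss, best_params = random_search(search_space, n_trials=8)
print(best_loss, best_params)
```

Each trial is independent of the others, which is why this kind of search parallelises well across the nodes of a cluster.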

The environment that this notebook works best with is: <b>Python 3.8 - AzureML </b>

In [1]:
%matplotlib inline

In [2]:
import tempfile

In [3]:
import pandas as pd
import numpy as np
import pathlib
import matplotlib.pyplot as plt

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [5]:
import prd_pipeline

## Set up Azure experiment


In [6]:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset, Environment
from azureml.core import Experiment, ComputeTarget, ScriptRunConfig

In [7]:
prd_ws = Workspace.from_config()

In [8]:
azure_experiment_name='prd_fraction_models'
azure_env_name = 'prd_ml_cluster'

In [9]:
prd_model_name = 'azml_fractions_cluster_hpt'

In [10]:
use_full_dataset = True
if use_full_dataset:
    azure_dataset_name ='prd_merged_all_events_files'
    prd_blob_rel_path = 'prd/*/prd_merged*csv'    
else:
    #  use subset for development.
    azure_dataset_name ='prd_merged_202110_nswws_amber_oct_files'
    prd_blob_rel_path = 'prd/202110_nswws_amber_oct/prd_merged*csv'    


In [11]:
use_gpu = True
if use_gpu:
    do_download_data = True  # if False, use the AzML dataset; otherwise download from the datastore or blob store
    use_blob_store = True  # only applies if do_download_data is True: download from the blob store rather than the datastore
else:
    do_download_data = False  # if False, use the AzML dataset; otherwise download from the datastore or blob store
    use_blob_store = False  # only applies if do_download_data is True: download from the blob store rather than the datastore

In [12]:
# this env has fsspec and related to facilitate loading data on GPU instance
azure_env_name = 'prd_ml_gpu_cluster'

In [13]:
if use_gpu:
    if use_full_dataset:
        cluster_name = 'prd-ml-fractions-cluster-gpu'
    else:
        cluster_name = 'mlops-gpu-test'
else:
    if use_full_dataset:
        cluster_name = 'prd-ml-fractions-cluster'
    else:
        cluster_name = 'mlops-test'
cluster_name

'prd-ml-fractions-cluster-gpu'

In [14]:
target_parameter = [
    'radar_fraction_in_band_instant_0.0',
    'radar_fraction_in_band_instant_0.25',
    'radar_fraction_in_band_instant_2.5',
    'radar_fraction_in_band_instant_7.0',
    'radar_fraction_in_band_instant_10.0'
]
profile_features = ['air_temperature', 'relative_humidity']
single_lvl_features = []

In [15]:
feature_dict = {
    'profile': profile_features,
    'single_level': single_lvl_features,
    'target': target_parameter,
} 

In [16]:
prd_exp = Experiment(workspace=prd_ws, name=azure_experiment_name)
prd_exp

Name,Workspace,Report Page,Docs Page
prd_fraction_models,precip_rediagnosis,Link to Azure Machine Learning studio,Link to Documentation


Get the AzML environment (basically a conda environment) from the workspace.

In [17]:
prd_env = Environment.get(workspace=prd_ws, name=azure_env_name)
prd_env

{
    "assetId": "azureml://locations/uksouth/workspaces/57546dc9-9763-4025-831d-c19991c81540/environments/prd_ml_gpu_cluster/versions/1",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04",
        "baseImageRegistry": null,
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {},
    "inferencingStackVersion": null,
    "name": "prd_ml_gpu_cluster",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "conda-forge"
            ],
            "depen

### Execute our training run on a cluster with hyperdrive for parallelised hyperparameter tuning

In [18]:
import datetime
log_dir = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

In [19]:
from azureml.train.hyperdrive import RandomParameterSampling, GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

In [20]:
prd_demo_compute_target = ComputeTarget(workspace=prd_ws, name=cluster_name)
prd_demo_compute_target

AmlCompute(workspace=Workspace.create(name='precip_rediagnosis', subscription_id='07efdc52-cd27-48ed-9443-3aad2b6b777b', resource_group='precip_rediagnosis'), name=prd-ml-fractions-cluster-gpu, id=/subscriptions/07efdc52-cd27-48ed-9443-3aad2b6b777b/resourceGroups/precip_rediagnosis/providers/Microsoft.MachineLearningServices/workspaces/precip_rediagnosis/computes/prd-ml-fractions-cluster-gpu, type=AmlCompute, provisioning_state=Succeeded, location=uksouth, tags={})

Hyperparameters that we want to vary using HyperDrive need to be input arguments to the prd_cluster_train_demo.py script, which is called through ScriptRunConfig. Any values for those hyperparameters set in prd_demo_args will be overwritten by HyperDrive.
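
The training script itself is not shown in this notebook. As a rough sketch of what its argument handling might look like, assuming it uses <code>argparse</code> (the function name <code>parse_training_args</code> and the defaults are hypothetical), the tunable hyperparameters are exposed as ordinary command-line options:

```python
import argparse


def parse_training_args(argv=None):
    """Hypothetical argument parser for a training script tuned by HyperDrive."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Every hyperparameter to be tuned must be a command-line argument.
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Values as they would arrive from the base argument list.
base = parse_training_args(["--epochs", "1", "--batch-size", "128"])
# Values a tuning run might supply instead for one child run.
child = parse_training_args(
    ["--epochs", "20", "--batch-size", "32", "--learning-rate", "0.001"]
)
print(base.batch_size, child.batch_size)
```

Because the script only ever sees command-line values, it does not need to know whether they came from the base arguments or from the tuner.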

In [21]:
nepochs = 1

In [22]:
prd_demo_args = ['--dataset-name', azure_dataset_name,
                 '--model-name', prd_model_name, 
                 '--test-fraction', 0.2,
                ]
prd_demo_args += ['--target-parameter']
prd_demo_args += target_parameter
prd_demo_args += ['--profile-features']
prd_demo_args += profile_features
prd_demo_args += ['--single-level_features']
prd_demo_args += single_lvl_features
prd_demo_args += ['--epochs', nepochs]
prd_demo_args += ['--batch-size', 128]
prd_demo_args += ['--learning-rate', 0.01]
prd_demo_args += ['--test-fraction', 0.2]
prd_demo_args += ['--log-dir', './logs']


In [23]:
if do_download_data:
    if use_blob_store:
        prd_demo_args += ['--data-path', prd_blob_rel_path]
        prd_demo_args += ['--blob']    
    else:
        prd_demo_args += ['--data-path', azureml.core.Dataset.get_by_name(prd_ws, azure_dataset_name).as_download()]

In [24]:
prd_demo_args

['--dataset-name',
 'prd_merged_all_events_files',
 '--model-name',
 'azml_fractions_cluster_hpt',
 '--test-fraction',
 0.2,
 '--target-parameter',
 'radar_fraction_in_band_instant_0.0',
 'radar_fraction_in_band_instant_0.25',
 'radar_fraction_in_band_instant_2.5',
 'radar_fraction_in_band_instant_7.0',
 'radar_fraction_in_band_instant_10.0',
 '--profile-features',
 'air_temperature',
 'relative_humidity',
 '--single-level_features',
 '--epochs',
 1,
 '--batch-size',
 128,
 '--learning-rate',
 0.01,
 '--test-fraction',
 0.2,
 '--log-dir',
 './logs',
 '--data-path',
 'prd/*/prd_merged*csv',
 '--blob']

In [43]:
import os

prd_run_src = ScriptRunConfig(
    source_directory=os.getcwd(),
    script='prd_cluster_train_demo.py',
    arguments=prd_demo_args,
    compute_target=prd_demo_compute_target,
    environment=prd_env
)

### HyperDrive configuration

The next step is to configure our HyperDrive run. The run config defined above is passed to <code>HyperDriveConfig</code> as the <code>run_config</code>. We must provide a <code>hyperparameter_sampling</code> strategy, explained in more detail below. We must also provide the <code>primary_metric_name</code> (either the model's loss function or a metric defined when compiling the model; it must match the name logged by the training script) and the <code>primary_metric_goal</code>, which states whether to minimize or maximize this metric. Finally, we must define <code>max_total_runs</code> (an upper bound on the number of runs; the actual number may be smaller depending on the defined hyperparameter space and sampling strategy) and <code>max_concurrent_runs</code>; if this is set to None, all runs are launched in parallel.
There is also the option to define an early stopping policy with the <code>policy</code> argument.

#### Hyperparameter sampling

There are three classes for sampling the hyperparameter space:
- <code>GridParameterSampling</code>: defines the search space as a grid built from the given hyperparameter values, then evaluates every position in the grid in order (note that if <code>max_total_runs</code> is smaller than the total number of combinations, only a subsample of the grid is run).
- <code>RandomParameterSampling</code>: randomly samples hyperparameter combinations from the hyperparameter space.
- <code>BayesianParameterSampling</code>: defines Bayesian sampling over the hyperparameter space, which tries to pick the next sample of hyperparameters intelligently based on how the previous samples performed.
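
The difference between the first two strategies can be sketched in plain Python. This is an illustration of the idea only, not how HyperDrive implements it; the function names and the small search space are hypothetical, and Bayesian sampling is left out because it needs a model of earlier results.

```python
import itertools
import random

# Small illustrative space: 2 x 3 = 6 combinations.
space = {"batch_size": [32, 64], "learning_rate": [0.1, 0.01, 0.001]}


def grid_samples(space):
    """Yield every combination of the candidate values, in order."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def random_samples(space, n, seed=0):
    """Yield n combinations, each value drawn independently at random."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {name: rng.choice(values) for name, values in space.items()}


grid = list(grid_samples(space))
rand = list(random_samples(space, n=4))
print(len(grid), len(rand))
```

Grid sampling grows multiplicatively with each added hyperparameter, whereas random sampling lets the number of trials be fixed independently of the size of the space.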

We then define the hyperparameter space from which combinations of hyperparameters are selected for assessment. The hyperparameters that we want to tune must be input arguments to the script run by ScriptRunConfig; any values supplied for them in the script arguments will be overwritten by HyperDrive.

In AzureML, there are different ways to define the set from which each hyperparameter is sampled (<code>choice</code>, <code>lognormal</code>, <code>loguniform</code>, <code>normal</code>, <code>qlognormal</code>, <code>qloguniform</code>, <code>qnormal</code>, <code>quniform</code>, <code>randint</code> and <code>uniform</code>).
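
To illustrate what a few of these expressions mean, the sketch below mimics <code>choice</code>, <code>uniform</code> and <code>loguniform</code> with the standard library. The <code>sample_*</code> helpers are hypothetical and are not the AzureML functions. The one assumption worth checking against the AzureML documentation is that <code>loguniform(min_value, max_value)</code> draws exp(uniform(min_value, max_value)), so that its bounds are given in log space.

```python
import math
import random

rng = random.Random(42)


def sample_choice(*options):
    """Pick one of a discrete set of options."""
    return rng.choice(options)


def sample_uniform(min_value, max_value):
    """Draw a value uniformly between the two bounds."""
    return rng.uniform(min_value, max_value)


def sample_loguniform(min_value, max_value):
    """Draw exp(uniform(min_value, max_value)); the bounds are in log space."""
    return math.exp(rng.uniform(min_value, max_value))


batch_size = sample_choice(32, 64, 128)
dropout = sample_uniform(0.0, 0.5)
# To cover learning rates between 1e-4 and 1e-1, pass the logs of the bounds.
learning_rate = sample_loguniform(math.log(1e-4), math.log(1e-1))
print(batch_size, dropout, learning_rate)
```

A log-uniform draw spreads samples evenly across orders of magnitude, which often suits quantities such as the learning rate better than a plain uniform draw.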

In [44]:
ps = RandomParameterSampling(
    {
        '--batch-size': choice(32, 64, 128),
        '--learning-rate': choice(0.1, 0.01, 0.001, 0.0001),
        '--epochs': choice(10, 20, 50, 100)
    }
)

#### Early stopping policy

We can define an early stopping policy, which means that poorly performing experiment runs are cancelled and new ones started. Here we use <code>BanditPolicy</code> with a slack factor of 0.1 (the ratio of slack allowed with respect to the best performing training run) and an evaluation interval of 2 (the frequency for applying the policy, here every second time the training script logs the primary metric).

In [45]:
early_stop_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
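
The slack check can be sketched as a small function. For a metric that is maximized, the AzureML documentation describes cancelling runs whose best metric falls below best / (1 + slack_factor). The mirrored rule used here for a minimized metric such as a loss is an assumption, as is the helper name <code>within_slack</code>. Metrics are assumed to be positive.

```python
def within_slack(run_best, overall_best, slack_factor, maximize=True):
    """Return True if a run is close enough to the best run to keep going.

    Assumes positive metric values. The minimize branch is an assumed mirror
    image of the documented maximize rule.
    """
    if maximize:
        return run_best >= overall_best / (1 + slack_factor)
    return run_best <= overall_best * (1 + slack_factor)


# Maximizing an accuracy-like metric: best run at 0.8, slack factor 0.2,
# so the cut-off is 0.8 / 1.2, roughly 0.667.
keep_high = within_slack(0.7, 0.8, slack_factor=0.2)
keep_low = within_slack(0.6, 0.8, slack_factor=0.2)

# Minimizing a loss: best run at 0.1, slack factor 0.1, cut-off roughly 0.11.
keep_loss = within_slack(0.105, 0.1, slack_factor=0.1, maximize=False)
drop_loss = within_slack(0.12, 0.1, slack_factor=0.1, maximize=False)
print(keep_high, keep_low, keep_loss, drop_loss)
```

A smaller slack factor cancels runs more aggressively, which saves compute but risks stopping runs that would have improved later.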

In [46]:
htc = HyperDriveConfig(run_config=prd_run_src, 
                       hyperparameter_sampling=ps, 
                       policy=early_stop_policy, 
                       primary_metric_name='val_loss', 
                       primary_metric_goal=PrimaryMetricGoal.MINIMIZE, 
                       max_total_runs=18,
                       max_concurrent_runs=12)
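
As a quick sanity check on the configuration above, the size of the search space can be compared with the run budget. The arithmetic below uses only values already set in this notebook; the "waves" figure is a rough lower bound that assumes all runs take equally long and ignores early stopping.

```python
import math

# The discrete search space passed to RandomParameterSampling above.
space = {
    "--batch-size": [32, 64, 128],
    "--learning-rate": [0.1, 0.01, 0.001, 0.0001],
    "--epochs": [10, 20, 50, 100],
}
max_total_runs = 18
max_concurrent_runs = 12

# 3 * 4 * 4 possible combinations in total.
n_combinations = math.prod(len(values) for values in space.values())
# Upper bound on the fraction of distinct combinations that can be tried.
max_coverage = max_total_runs / n_combinations
# Rough number of sequential batches needed to fit the run budget.
n_waves = math.ceil(max_total_runs / max_concurrent_runs)
print(n_combinations, max_coverage, n_waves)
```

With this budget the search evaluates at most 18 of the 48 combinations, so the best run found is the best of a sample and not necessarily the best point in the space.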

### HyperDrive running

In [47]:
prd_run_src

<azureml.core.script_run_config.ScriptRunConfig at 0x7fe223daa9a0>

In [48]:
prd_run = prd_exp.submit(htc)
prd_run

Experiment,Id,Type,Status,Details Page,Docs Page
prd_fraction_models,HD_5e1a1915-f4c7-4191-b1f0-bbf447a4ec75,hyperdrive,Running,Link to Azure Machine Learning studio,Link to Documentation


In [49]:
import azureml.tensorboard

In [50]:
tb = azureml.tensorboard.Tensorboard([prd_run])

# If successful, start() returns a string with the URI of the instance.
tb.start()

https://prd-ml-large-6006.uksouth.instances.azureml.ms


'https://prd-ml-large-6006.uksouth.instances.azureml.ms'

In [51]:
prd_run.wait_for_completion()
assert(prd_run.get_status() == "Completed")

### HyperDrive results

We can then select the model which performs best against the selected primary metric, as defined within the HyperDriveConfig.

Get the best trained model from the HyperDrive run and load the trained weights, ready for inference.

In [None]:
prd_best_run = prd_run.get_best_run_by_primary_metric()
prd_best_run

In [None]:
import tempfile

In [None]:
import tensorflow as tf

In [None]:
with tempfile.TemporaryDirectory() as td1:
    td_path = pathlib.Path(td1)
    print(td_path)
    prd_best_run.download_files(prefix=prd_model_name, output_directory=td1)
    model_path = td_path / prd_model_name
    print(model_path)
    print(list(model_path.iterdir()))  # show the downloaded model files
    trained_model = tf.keras.models.load_model(model_path)
trained_model

### Load the data for inference

Load the data using the same loading functions as in the training script, for consistency.

In [None]:
import json

In [None]:
with open('credentials_file.json') as credentials_file:
    az_blob_cred = json.load(credentials_file)
    
az_blob_cred.keys()

In [None]:
%%time
input_data = prd_pipeline.load_data(
    prd_ws,
    dataset_name=azure_dataset_name
)

In [None]:
do_save_split = False

In [None]:
%%time
# Reuse feature_dict defined earlier; only pass test_savefn when saving the split.
preprocess_kwargs = {'test_fraction': 0.2, 'feature_dict': feature_dict}
if do_save_split:
    preprocess_kwargs['test_savefn'] = 'tmp.csv'
data_splits, data_dims = prd_pipeline.preprocess_data(input_data, **preprocess_kwargs)

In [None]:
data_splits['X_train']

In [None]:
y_pred = trained_model.predict(data_splits['X_val'])

We can also retrieve information about each of the HyperDrive child runs.

In [None]:
prd_run.get_children_sorted_by_primary_metric()
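
The returned list can then be filtered and inspected with ordinary Python. The sketch below works on mock data: it assumes each entry is a dict with <code>run_id</code>, <code>hyperparameters</code> (a JSON string), <code>best_primary_metric</code> and <code>status</code> keys, which should be checked against the actual output. The run IDs and metric values are invented for illustration.

```python
import json

# Mock child-run summaries (hypothetical values, assumed structure).
children = [
    {
        "run_id": "HD_example_0",
        "hyperparameters": '{"--batch-size": 64, "--learning-rate": 0.01, "--epochs": 20}',
        "best_primary_metric": 0.42,
        "status": "Completed",
    },
    {
        "run_id": "HD_example_1",
        "hyperparameters": '{"--batch-size": 32, "--learning-rate": 0.001, "--epochs": 50}',
        "best_primary_metric": 0.37,
        "status": "Completed",
    },
    {
        "run_id": "HD_example_2",
        "hyperparameters": '{"--batch-size": 128, "--learning-rate": 0.1, "--epochs": 10}',
        "best_primary_metric": 0.95,
        "status": "Canceled",
    },
]

# Keep only runs that finished, then rank by the minimized metric (val_loss).
completed = [child for child in children if child["status"] == "Completed"]
ranked = sorted(completed, key=lambda child: child["best_primary_metric"])
best = ranked[0]
best_params = json.loads(best["hyperparameters"])
print(best["run_id"], best_params)
```

Filtering on status before ranking keeps runs that were cancelled by the early stopping policy out of the comparison.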

In [None]:
prd_run.get_metrics()

In [None]:
prd_run.get_hyperparameters()
