<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

TODO (make changes also in image classification notebook to stay in sync):
- Try setting env via the "conda_dependencies_file" parameter [link](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py)
- Why neg ap's and not identical with print-out
- Could upgrade to cuda10

TODO image classification only:
- Change to only upload the relevant data, and replace hard-coded foldername in training code.
- Fix display of _STANDARD_DS6_ etc machines
- Change output dir from "outputs" to: output_folder = os.path.join(current_directory, 'hyperdrive_outputs')
- Rename DATA to DATA_PATH
- Rename script_folder to hyperdrive from hyperparameters
- Add "use_gpu=True"

# Testing different Hyperparameters and Benchmarking

In this notebook, we'll cover how to test different hyperparameters for a particular dataset and how to benchmark different parameters across a group of datasets using AzureML. We assume familiarity with the basic concepts and parameters, which are discussed in the [01_training_introduction.ipynb](01_training_introduction.ipynb), [02_mask_rcnn.ipynb](02_mask_rcnn.ipynb) and [03_training_accuracy_vs_speed.ipynb](03_training_accuracy_vs_speed.ipynb) notebooks. 

Similar to the image classification notebook [11_exploring_hyperparameters.ipynb](../../classification/notebooks/11_exploring_hyperparameters.ipynb), we will learn more about how different learning rates and different image sizes affect our model's accuracy when restricted to 16 epochs, and we want to build an AzureML experiment to test out these hyperparameters. 

We will be using a Faster R-CNN model with ResNet-50 backbone to find all objects in an image belonging to 4 categories: 'can', 'carton', 'milk_bottle', 'water_bottle'. We will then conduct hyperparameter tuning to find the best set of parameters for this model. For this, we present an overall process for using AzureML, specifically its [Hyperdrive](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py) component, to run this tuning in parallel (and not successively). We demonstrate the following key steps:  
* Configure AzureML Workspace
* Create Remote Compute Target (GPU cluster)
* Prepare Data
* Prepare Training Script
* Setup and Run Hyperdrive Experiment
* Model Import, Re-train and Test

This notebook is very similar to the [24_exploring_hyperparameters_on_azureml.ipynb](../../classification/notebooks/24_exploring_hyperparameters_on_azureml.ipynb) hyperdrive notebook used for image classification.

In [1]:
import os
import sys

from distutils.dir_util import copy_tree
import numpy as np

import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import azureml.data
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import (
    RandomParameterSampling, GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice, uniform
)
import azureml.widgets as widgets

sys.path.append("../../")
from utils_cv.common.data import unzip_url
from utils_cv.detection.data import Urls

Ensure edits to libraries are loaded and plotting is shown in the notebook.

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

We now define some parameters which will be used in this notebook:

In [85]:
# Azure resources
# subscription_id = "YOUR_SUBSCRIPTION_ID"
# resource_group = "YOUR_RESOURCE_GROUP_NAME"  
# workspace_name = "YOUR_WORKSPACE_NAME"  
# workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2, etc.
subscription_id = "2ad17db4-e26d-4c9e-999e-adae9182530c"  #Sharat 
resource_group = "pabuehle_delme2_hyperdrive"  
workspace_name = "pabuehle_ws"  
workspace_region = "eastus" #Possible values eastus, eastus2, etc.

# Choose a size for our cluster and the maximum number of nodes
VM_SIZE = "STANDARD_NC6"  # options: "STANDARD_NC6", "STANDARD_NC6S_V3"
MAX_NODES = 5 #12

# Hyperparameter search space
IM_MAX_SIZES = [50] #[150, 300]
LEARNING_RATE_MAX = 1e-2
LEARNING_RATE_MIN = 1e-6
MAX_TOTAL_RUNS = 10 #Set to higher value to test more parameter combinations.

# Image data
#DATA = unzip_url(Urls.fridge_objects_path, exist_ok=True)
DATA_PATH = "C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny"

### 1. Configure AzureML workspace
Below we set up (or load an existing) AzureML workspace and print its details. Note that the resource group and workspace will be created if they do not yet exist. For more information regarding the AzureML workspace, see also the [20_azure_workspace_setup.ipynb](../../classification/notebooks/20_azure_workspace_setup.ipynb) notebook in the image classification folder.

To simplify clean-up (see end of this notebook), we recommend creating a new resource group to run this notebook.

In [4]:
from utils_cv.common.azureml import get_or_create_workspace

ws = get_or_create_workspace(
        subscription_id,
        resource_group,
        workspace_name,
        workspace_region)

# Print the workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

ERROR - get_workspace error using subscription_id=2ad17db4-e26d-4c9e-999e-adae9182530c, resource_group_name=pabuehle_delme2_hyperdrive, workspace_name=pabuehle_ws


Creating new workspace




Deploying AppInsights with name pabuehleinsights13d0d903.
Deployed AppInsights with name pabuehleinsights13d0d903. Took 20.23 seconds.
Deploying KeyVault with name pabuehlekeyvaulta50e31a0.
Deploying StorageAccount with name pabuehlestorage33ceb94ce.
Deployed KeyVault with name pabuehlekeyvaulta50e31a0. Took 37.19 seconds.
Deployed StorageAccount with name pabuehlestorage33ceb94ce. Took 38.13 seconds.
Deploying Workspace with name pabuehle_ws.
Deployed Workspace with name pabuehle_ws. Took 60.59 seconds.
Workspace name: pabuehle_ws
Workspace region: eastus
Subscription id: 2ad17db4-e26d-4c9e-999e-adae9182530c
Resource group: pabuehle_delme2_hyperdrive


### 2. Create Remote Target
We create a GPU cluster as our remote compute target. If a cluster with the same name already exists in our workspace, the script will load it instead. This [link](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#compute-targets-for-training) provides more information about how to set up a compute target on different locations.

By default, the VM size is set to use STANDARD\_NC6 machines. However, if quota is available, our recommendation is to use STANDARD\_NC6S\_V3 machines which come with the much faster V100 GPU.

In [5]:
CLUSTER_NAME = "gpu-cluster"

try:
    # Retrieve if a compute target with the same cluster name already exists
    compute_target = ComputeTarget(workspace=ws, name=CLUSTER_NAME)
    print('Found existing compute target.')
    
except ComputeTargetException:
    # If it doesn't already exist, we create a new one with the name provided
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size=VM_SIZE,
                                                           min_nodes=0,
                                                           max_nodes=MAX_NODES)

    # create the cluster
    compute_target = ComputeTarget.create(ws, CLUSTER_NAME, compute_config)
    compute_target.wait_for_completion(show_output=True)

# we can use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-30T00:13:31.543000+00:00', 'errors': None, 'creationTime': '2019-08-30T00:13:28.454854+00:00', 'modifiedTime': '2019-08-30T00:13:44.726668+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 5, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


### 3. Prepare data
In this notebook, we'll use the Fridge Objects dataset, which is already stored in the correct format. We then upload our data to the AzureML workspace.
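
Before uploading, it can be worth checking that every image has a matching annotation file. The sketch below assumes the layout suggested by the upload log further down, namely an `images` folder with `.jpg` files and an `annotations` folder with `.xml` files that share the same file stem. The helper name `find_unpaired_files` is hypothetical and not part of `utils_cv`.

```python
import tempfile
from pathlib import Path


def find_unpaired_files(root):
    """Return (image stems without annotation, annotation stems without image)."""
    root = Path(root)
    image_stems = {p.stem for p in (root / "images").glob("*.jpg")}
    annotation_stems = {p.stem for p in (root / "annotations").glob("*.xml")}
    return sorted(image_stems - annotation_stems), sorted(annotation_stems - image_stems)


# Build a tiny throwaway dataset to demonstrate the check
with tempfile.TemporaryDirectory() as tmp:
    for sub, ext in (("images", ".jpg"), ("annotations", ".xml")):
        (Path(tmp) / sub).mkdir()
        for stem in ("35", "40"):
            (Path(tmp) / sub / (stem + ext)).touch()
    (Path(tmp) / "images" / "45.jpg").touch()  # an image without an annotation
    missing_annotations, missing_images = find_unpaired_files(tmp)

print(missing_annotations, missing_images)  # ['45'] []
```

On a real dataset, the same function could be called with `DATA_PATH`, and both returned lists should be empty.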


In [6]:
# Retrieving default datastore that got automatically created when we setup a workspace
ds = ws.get_default_datastore()

# We now upload the data to the 'data' folder on the Azure portal
ds.upload(
    src_dir=DATA_PATH,
    target_path='data',
    overwrite=True, # with "overwrite=True", if this data already exists on the Azure blob storage, it will be overwritten
    show_progress=True
)

Uploading an estimated of 8 files
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\35.xml
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\40.xml
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\45.xml
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\50.xml
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\images\35.jpg
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\images\40.jpg
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\images\45.jpg
Uploading C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\images\50.jpg
Uploaded C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\45.xml, 1 files out of an estimated total of 8
Uploaded C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny\annotations\40.xm

$AZUREML_DATAREFERENCE_97dd0cd1aec6415f8c6f7f2587601881


Here's where you can see the data in your portal: 
<img src="media/datastore.jpg" width="800" alt="Datastore screenshot for Hyperdrive notebook run">

### 4. Prepare training script

The next step is to prepare the scripts that AzureML Hyperdrive will use to train and evaluate models with the selected hyperparameters.

In [73]:
# Create a folder for the training script and the utils_cv library
script_folder = os.path.join(os.getcwd(), "hyperdrive")
os.makedirs(script_folder, exist_ok=True)

In [74]:
# Copy utils_cv library to script folder
_ = copy_tree(os.path.join('..', '..', 'utils_cv'), os.path.join(script_folder, 'utils_cv'))

In [81]:
%%writefile $script_folder/train.py

# Use different matplotlib backend to avoid error during remote execution
import matplotlib 
matplotlib.use("Agg") 
import matplotlib.pyplot as plt

import os
import sys
import argparse
import numpy as np
from pathlib import Path
from azureml.core import Run
from utils_cv.detection.dataset import DetectionDataset
from utils_cv.detection.model import DetectionLearner 
from utils_cv.common.gpu import which_processor
which_processor()


# Parse arguments passed by Hyperdrive
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=Path, dest='data_dir')
parser.add_argument('--epochs', type=int, dest='epochs', default=10)
parser.add_argument('--batch_size', type=int, dest='batch_size', default=1)
parser.add_argument('--learning_rate', type=float, dest='learning_rate', default=1e-4)
parser.add_argument('--min_size', type=int, dest='min_size', default=800)
parser.add_argument('--max_size', type=int, dest='max_size', default=1333)
parser.add_argument('--rpn_pre_nms_top_n_train', type=int, dest='rpn_pre_nms_top_n_train', default=2000)
parser.add_argument('--rpn_pre_nms_top_n_test', type=int, dest='rpn_pre_nms_top_n_test', default=1000)
parser.add_argument('--rpn_post_nms_top_n_train', type=int, dest='rpn_post_nms_top_n_train', default=2000)
parser.add_argument('--rpn_post_nms_top_n_test', type=int, dest='rpn_post_nms_top_n_test', default=1000)
parser.add_argument('--rpn_nms_thresh', type=float, dest='rpn_nms_thresh', default=0.7)
parser.add_argument('--box_score_thresh', type=float, dest='box_score_thresh', default=0.05)
parser.add_argument('--box_nms_thresh', type=float, dest='box_nms_thresh', default=0.5)
parser.add_argument('--box_detections_per_img', type=int, dest='box_detections_per_img', default=100)
parser.add_argument('-f', dest='dummy', default="dummy")
args = parser.parse_args()
params = vars(args)
print(f"params = {params}")


#path = "C:/Users/pabuehle/Desktop/ComputerVision/data/odFridgeObjectsTiny/"
#params['epochs'] = 1


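# Getting training and validation data
# data_dir is parsed as a pathlib.Path, so build the sub-path with os.path.join
# rather than string concatenation (Path + str raises a TypeError)
path = os.path.join(str(params['data_dir']), "data")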
data = DetectionDataset(path, train_pct=0.5, batch_size = params["batch_size"])
print(
    f"Training dataset: {len(data.train_ds)} | Training DataLoader: {data.train_dl} \n \
    Testing dataset: {len(data.test_ds)} | Testing DataLoader: {data.test_dl}"
)

# Get model
detector = DetectionLearner(data,
    min_size = params["min_size"],
    max_size = params["max_size"],
    rpn_pre_nms_top_n_train = params["rpn_pre_nms_top_n_train"],
    rpn_pre_nms_top_n_test = params["rpn_pre_nms_top_n_test"],
    rpn_post_nms_top_n_train = params["rpn_post_nms_top_n_train"], 
    rpn_post_nms_top_n_test = params["rpn_post_nms_top_n_test"],
    rpn_nms_thresh = params["rpn_nms_thresh"],
    box_score_thresh = params["box_score_thresh"], 
    box_nms_thresh = params["box_nms_thresh"],
    box_detections_per_img = params["box_detections_per_img"]
)

# Run Training
detector.fit(params["epochs"], lr=params["learning_rate"], print_freq=30)
print(f"Average precision after each epoch: {detector.ap}")

# Add log entries
run = Run.get_context()
run.log("data_dir", str(params["data_dir"]))  # convert the Path to a string before logging
run.log("min_size", params["min_size"])
run.log("learning_rate", params["learning_rate"])
run.log("accuracy", float(detector.ap[-1]))  # Logging our primary metric 'accuracy'

Overwriting C:\Users\pabuehle\Desktop\ComputerVision\detection\notebooks\hyperdrive/train.py
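
Hyperdrive hands each sampled hyperparameter to the training script as a command-line flag, so the flag names in the search space have to match those declared with `argparse`. The following is a minimal, stand-alone sketch of that mechanism with only two of the flags from `train.py`. The argument values and the undeclared flag `--im_size` are made up for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--max_size", type=int, default=1333)

# Simulated command line, as one child run might receive it (values are made up)
args = parser.parse_args(["--learning_rate", "0.005", "--max_size", "50"])
params = vars(args)
print(params)  # {'learning_rate': 0.005, 'max_size': 50}

# A flag the parser does not declare is not silently ignored:
# parse_known_args() returns it separately, whereas parse_args() would exit with an error.
_, unknown = parser.parse_known_args(["--im_size", "50"])
print(unknown)  # ['--im_size', '50']
```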


### 5. Setup and run Hyperdrive experiment

#### 5.1 Create Experiment  
An Experiment is the main entry point for experimenting with AzureML. To create a new Experiment or get an existing one, we pass our experiment name 'hyperparameter-tuning'.


In [82]:
experiment_name = 'hyperparameter-tuning'
exp = Experiment(workspace=ws, name=experiment_name)

#### 5.2 Define search space

Now we define the search space of hyperparameters. As shown below, to test discrete parameter values use 'choice()', and for uniform sampling use 'uniform()'. For more options, see [Hyperdrive parameter expressions](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.parameter_expressions?view=azure-ml-py).

Hyperdrive provides three different parameter sampling methods: 'RandomParameterSampling', 'GridParameterSampling', and 'BayesianParameterSampling'. Details about each method can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters). Here, we use 'GridParameterSampling'; a 'RandomParameterSampling' configuration is included as commented-out code for reference.
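
To make the difference between grid and random sampling concrete, here is a small sketch in plain Python that does not use the AzureML SDK. It only illustrates the idea and is not how Hyperdrive is implemented. The value lists and the number of random draws are hypothetical.

```python
import itertools
import random

# Hypothetical search space in the style used in this notebook
learning_rates = [1e-6, 5e-3, 1e-2]
max_sizes = [150, 300]

# Grid sampling: one configuration per element of the Cartesian product
grid_configs = [
    {"learning_rate": lr, "max_size": size}
    for lr, size in itertools.product(learning_rates, max_sizes)
]
print(len(grid_configs))  # 3 * 2 = 6 combinations

# Random sampling: draw a fixed number of configurations independently
rng = random.Random(0)  # seeded only to make the sketch repeatable
random_configs = [
    {"learning_rate": rng.uniform(1e-6, 1e-2), "max_size": rng.choice(max_sizes)}
    for _ in range(4)
]
print(len(random_configs))  # 4
```

Grid sampling needs discrete `choice()` values for every parameter and its cost grows with the product of the list lengths, while random sampling lets the number of runs be chosen independently of the size of the search space.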

In [109]:
lrs = [float(f) for f in np.linspace(LEARNING_RATE_MIN, LEARNING_RATE_MAX, 3)]
lrs
#lrs = list(np.linspace(LEARNING_RATE_MIN, LEARNING_RATE_MAX, 3).astype(float))
#type(lrs)

[1e-06, 0.005000500000000001, 0.01]

In [110]:
# Hyperparameter search space
# param_sampling = RandomParameterSampling( {
#         '--learning_rate': uniform(LEARNING_RATE_MIN, LEARNING_RATE_MAX),
#         '--max_size': choice(IM_MAX_SIZES)
#     }
# )
#early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=20)



# Grid-search
param_sampling = GridParameterSampling( {
        '--learning_rate': choice(lrs),
        '--max_size': choice(IM_MAX_SIZES)  # flag name must match an argument declared in train.py
    }
)
early_termination_policy = None
MAX_TOTAL_RUNS = None # Set to zero to run all possible grid parameter combinations

<b>AzureML Estimator</b> is the building block for training. An Estimator encapsulates the training code and parameters, the compute resources and runtime environment for a particular training scenario.
We create one for our experimentation with the dependencies our model requires as follows:

In [111]:
script_params = {
    '--data-folder': ds.as_mount()
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                use_gpu=True,
                pip_packages=['nvidia-ml-py3','fastai'],
                conda_packages=['scikit-learn', 'pycocotools>=2.0','torchvision==0.3','cudatoolkit==9.0'])

We now create a HyperDriveConfig object which includes information about parameter space sampling, termination policy, primary metric, estimator and the compute target to execute the experiment runs on. We feed the following parameters to it:

- our estimator object that we created in the above cell
- hyperparameter sampling method, in this case Grid Parameter Sampling
- early termination policy, in this case none (`early_termination_policy = None`); a [Bandit Policy](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#bandit-policy) is included as commented-out code in the search space cell as an alternative
- primary metric name reported by our runs, in this case 'accuracy', which the training script sets to the average precision after the last epoch
- the goal, which determines whether the primary metric has to be maximized or minimized, in this case to maximize our accuracy
- number of total child-runs
- maximum number of concurrent child-runs, here set to the maximum number of cluster nodes

The bigger the search space, the more child-runs are needed to cover it well.
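
For readers who do enable an early termination policy, the sketch below illustrates the slack-factor rule of the Bandit Policy as described in the linked Azure documentation: when maximizing, a run is stopped if its metric falls below the best metric divided by `1 + slack_factor`. This is a simplified stand-alone illustration, not the SDK's implementation. It ignores `evaluation_interval` and `delay_evaluation`, and the function name and metric values are hypothetical.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Slack-factor rule for a primary metric that is being maximized."""
    return run_metric < best_metric / (1 + slack_factor)


best = 0.8  # best metric reported so far (made-up value)
print(bandit_should_terminate(0.60, best, slack_factor=0.2))  # True: 0.60 < 0.8 / 1.2
print(bandit_should_terminate(0.70, best, slack_factor=0.2))  # False: 0.70 >= 0.8 / 1.2
```

A smaller slack factor terminates runs more aggressively, which saves compute but risks stopping runs that would have improved later.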

In [112]:
hyperdrive_run_config = HyperDriveConfig(estimator=est,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name='accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=MAX_TOTAL_RUNS,
                                         max_concurrent_runs=MAX_NODES)

#### 5.3 Run Experiment

In [113]:
# Now we submit the Run to our experiment. 
hyperdrive_run = exp.submit(config=hyperdrive_run_config)

# We can see the experiment progress from this notebook by using 
widgets.RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [None]:
hyperdrive_run.wait_for_completion()

Or we can check from the Azure portal with the URL we get by running 
```python
hyperdrive_run.get_portal_url()
```

To load an existing Hyperdrive Run instead of starting a new one, we can use 
```python
hyperdrive_run = azureml.train.hyperdrive.HyperDriveRun(exp, <your-run-id>, hyperdrive_run_config=hyperdrive_run_config)
```
We can also cancel the Run with 
```python
hyperdrive_run.cancel()
```

Once all the child-runs are finished, we can get the best run and the metrics.

In [None]:
# Get best run and print out metrics
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))

print(f"* Best Run Id:{best_run.id}")
print(best_run)
print("\n* Best hyperparameters:")
print(best_parameters)
print(f"Accuracy = {best_run_metrics['accuracy']}")
#print("Learning Rate =", best_run_metrics['learning_rate'])
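
The cell above turns the flat argument list of the best run into a dictionary by pairing every flag with the value that follows it. Here is a stand-alone sketch of that step. The argument list is hypothetical, and the approach assumes that every flag is followed by exactly one value.

```python
# Hypothetical flat argument list, alternating flag names and values
parameter_values = [
    "--data-folder", "$AZUREML_DATAREFERENCE_example",
    "--learning_rate", "0.005",
    "--max_size", "50",
]

# Even positions hold the flag names, odd positions hold their values
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))
print(best_parameters)

# In this sketch the values are strings, so convert them before numeric use
learning_rate = float(best_parameters["--learning_rate"])
print(learning_rate)  # 0.005
```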

### 6. Clean up

To avoid unnecessary expenses, all resources created in this notebook should be deleted once the parameter search has concluded. To simplify this clean-up step, we recommend creating a new resource group to run this notebook. This resource group can then be deleted, e.g. using the Azure Portal, which will remove all created resources.