# Train Detection Model - Hyper Parameter Tuning

This notebook builds on the basic train_model.py but shows how to set up model parameter parsing, set an early stopping policy and run the experiment using AML hypertune.

Before executing the code, follow the setup in the wiki and make sure you have the following:
- AML Workspace
- Blob or fileshare containing images and label version files
- Pretrained model from TF model zoo in the same storage
- Built docker image registered to ACR
- Local conda environment with the requirements and packages installed

In [None]:
import os
import shutil

from azureml.train.hyperdrive import PrimaryMetricGoal, BanditPolicy
from azureml.train.hyperdrive import RandomParameterSampling, choice

from azure_utils.azure import load_config
from azure_utils.experiment import AMLExperiment

## 1. Define Run Parameters

The cell below sets the run parameters, including the Docker image, base model and datasets. Note that if your datasets follow the same date naming convention, you can use the `latest` keyword to automatically retrieve the latest version.

In [None]:
# Run params    
env_config_file = "dev_config.json"

# Train with TensorFlow 1 use docker built from tf_1 - "csaddevamlacr.azurecr.io/tfod_tf1:test"
# Train with TensorFlow 2 use docker built from tf_2 - "csaddevamlacr.azurecr.io/tfod_tf2:test"
docker_image = "csaddevamlacr.azurecr.io/tfod_tf2:test"

# Train with TF1 use - "train_hypertune.py"
# Train with TF2 use - "train_tf2_hypertune.py"
training_script = "train_tf2_hypertune.py"

# Description
desc = "Testing TF2 with hypertune"
# Experiment name
experiment_name = "pothole"
    
# Training and test data selection
store_name = "test_data"
img_type = "pothole"
train_csv = "latest"
test_csv = "latest"

# Base model Selection
# Train with TF1 use - "faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28"
# Train with TF2 use - "faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu-8"
base_model = "faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu-8"

# Model Params
steps = 1000
eval_conf = 0.5

# Model hyperparameters - specific to the model architecture
# Set the base values here; the HPT sweep values are set in Step 7
fs_nms_iou = 0.5 # first stage NMS IOU threshold
fs_nms_score = 0.0 # first stage NMS score threshold
fs_max_prop = 200 # first stage max proposals
fs_loc_loss = 2.0 # first stage localisation loss weight
fs_obj_loss = 1.0 # first stage objectness loss weight

# Compute Params
cluster_name = "train-dev-2"
vm_type = "STANDARD_NC6"
nodes = 3
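The `latest` keyword used for `train_csv` and `test_csv` is resolved inside the training package, which this notebook does not show. As a rough sketch only, assuming the label version files carry a `YYYY-MM-DD` date in their names (the file names and the `resolve_version` helper are hypothetical, and the real naming convention may differ), picking the newest version could look like this:

```python
from datetime import datetime


def resolve_version(requested, available):
    """Return the file name to use for a requested version.

    Hypothetical helper: assumes names such as 'train_2021-03-01.csv',
    i.e. '<prefix>_<YYYY-MM-DD>.csv'.
    """
    if requested != "latest":
        # An explicit file name is passed through unchanged.
        return requested

    def file_date(name):
        # Take the text between the last underscore and the extension.
        stamp = name.rsplit("_", 1)[1].rsplit(".", 1)[0]
        return datetime.strptime(stamp, "%Y-%m-%d")

    # The newest date wins.
    return max(available, key=file_date)


files = ["train_2021-01-15.csv", "train_2021-03-01.csv", "train_2020-12-31.csv"]
print(resolve_version("latest", files))
```

Passing an explicit file name instead of `latest` pins the run to that version, which can help when reproducing an earlier experiment.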

## 2. Initialise Experiment Class

The cell below creates an instance of the experiment class in the Azure utils package. It takes a config file that points to a specific AML workspace, and the experiment name used to group runs by use case.

In [None]:
aml_exp = AMLExperiment(experiment_name, config_file=env_config_file)

## 3. Set AML Datastore reference

This package makes use of the old approach of mounting the entire datastore in order to access images, dataset files and base models. Below sets up the defined datastore to mount on execution.

In [None]:
aml_exp.set_datastore(store_name)
aml_exp.set_data_reference()

## 4. Create/Set Compute

The cell below checks whether a compute target with the provided name exists in the AML workspace and, if not, creates one from the given spec. It takes arguments for node count and VM type, with the base compute set to "STANDARD_NC" in order to provide GPU support.

In [None]:
aml_exp.set_compute(cluster_name, vm_type=vm_type, node_count=nodes)

## 5. Set script params and path

In [None]:
script_params = [
    '--desc', desc,
    '--data_dir', str(aml_exp.data_ref),
    '--image_type', img_type,
    '--train_csv', train_csv,
    '--test_csv', test_csv,
    '--base_model', base_model,
    '--steps', steps,
    '--fs_nms_iou', fs_nms_iou,
    '--fs_nms_score', fs_nms_score,
    '--fs_max_prop', fs_max_prop,
    '--fs_loc_loss', fs_loc_loss,
    '--fs_obj_loss', fs_obj_loss]

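# Copy the training script into the current (notebooks) directory.
# Building the path with os.path.join keeps it portable across Windows and Linux.
shutil.copy(os.path.join('..', 'src', 'training', 'scripts', training_script), '.')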
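The flags in `script_params` have to be parsed by the training script. The script itself is not shown in this notebook, so the following is only a minimal sketch of what that parsing might look like with `argparse`; the `parse_args` function, the defaults and the example `/mnt/data` path are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed in script_params (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--desc", type=str, default="")
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--image_type", type=str, required=True)
    parser.add_argument("--train_csv", type=str, default="latest")
    parser.add_argument("--test_csv", type=str, default="latest")
    parser.add_argument("--base_model", type=str, required=True)
    parser.add_argument("--steps", type=int, default=1000)
    # First-stage hyperparameters; the sweep in Step 7 supplies
    # per-run values for some of these.
    parser.add_argument("--fs_nms_iou", type=float, default=0.5)
    parser.add_argument("--fs_nms_score", type=float, default=0.0)
    parser.add_argument("--fs_max_prop", type=int, default=200)
    parser.add_argument("--fs_loc_loss", type=float, default=2.0)
    parser.add_argument("--fs_obj_loss", type=float, default=1.0)
    return parser.parse_args(argv)


# Example invocation with an explicit argument list instead of sys.argv.
args = parse_args([
    "--data_dir", "/mnt/data",
    "--image_type", "pothole",
    "--base_model", "faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu-8",
    "--fs_nms_iou", "0.6",
    "--fs_max_prop", "300",
])
print(args.fs_nms_iou, args.fs_max_prop, args.steps)
```

Declaring a `type` for each flag means the values arrive as floats and ints rather than strings, whichever way they were passed on the command line.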

## 6. Create Run Config

The run config brings together the compute, script, parameters and Docker image to form a script run configuration.

In [None]:
scripts = os.path.join('.')
aml_exp.set_runconfig(scripts,
                      training_script,
                      script_params,
                      docker_image=docker_image)

## 7. Define Hypertune policy and parameter sweeps

Below we set a sweep for two of our hyperparameters and an early stopping policy that is evaluated every 100 logging steps.

This means that any run (node) whose selection of parameters gives a loss that is not within the slack factor of the current best loss will be stopped early.

In [None]:
ps = RandomParameterSampling({
    '--fs_nms_iou': choice(0.5, 0.6, 0.7),
    '--fs_max_prop': choice(100, 200, 300)})
policy = BanditPolicy(evaluation_interval=100, slack_factor=0.25)
# No trailing comma here: it would turn the metric name into a one-element tuple
metric_name = 'Train - Total Training Loss'
metric_goal = PrimaryMetricGoal.MINIMIZE
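To make the slack factor concrete, here is a simplified sketch of the comparison a bandit-style policy performs at each evaluation interval. Azure ML documents the rule for a maximised metric (a run is stopped when its best metric falls below `best / (1 + slack_factor)`). The minimise branch below is an assumption that mirrors that rule, and `should_stop` is a hypothetical helper, not part of the SDK.

```python
def should_stop(run_best, overall_best, slack_factor, minimize=True):
    """Return True if a run falls outside the allowed slack.

    Simplified illustration only; the minimize branch is an assumed
    mirror of the documented maximise rule.
    """
    if minimize:
        # Assumed: stop runs whose loss exceeds the best loss by more
        # than the slack ratio.
        return run_best > overall_best * (1 + slack_factor)
    # Documented rule for a maximised metric.
    return run_best < overall_best / (1 + slack_factor)


# Example losses reported by three runs at the same evaluation interval.
losses = {"run_a": 0.80, "run_b": 0.95, "run_c": 1.10}
best = min(losses.values())
for name, loss in losses.items():
    print(name, should_stop(loss, best, slack_factor=0.25))
```

A larger slack factor keeps more runs alive for longer, while a smaller one frees up compute sooner at the risk of stopping runs that would have improved later.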

## 8. Submit

Finally, submit the configuration defined above to AML. The execution can then be monitored from the AML studio.

In [None]:
aml_exp.submit_hypertune(ps, policy, metric_name, metric_goal)