# Train and tune hyperparameters with PyTorch

Azure Machine Learning allows you to automate hyperparameter exploration in an efficient manner, saving you significant time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then automatically launches multiple simultaneous runs with different parameter configurations and finds the configuration that results in the best performance, measured by the metric you choose. Poorly performing training runs are automatically terminated early, reducing the usage of compute resources. These resources are instead used to explore other hyperparameter configurations. For more information, please refer to [how to tune hyperparameters with Azure ML](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy).

In [None]:
import os
import shutil

import azureml
from azureml.core import Workspace, Experiment
from azureml.widgets import RunDetails
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import PyTorch
from azureml.core.container_registry import ContainerRegistry
from azureml.train.hyperdrive import (
    RandomParameterSampling,
    BanditPolicy,
    uniform,
    choice,
    HyperDriveConfig,
    PrimaryMetricGoal,
)

from dotenv import set_key, get_key, find_dotenv
from utilities import get_auth

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

In [None]:
env_path = find_dotenv(raise_error_if_not_found=True)

Let's first choose a name for the AmlCompute cluster, the number of training epochs, and the maximum total number of runs for HyperDrive. The epoch and run counts deliberately default to low values so that the notebook runs quickly. In a real application, set them to higher values (e.g. num_epochs = 10, max_total_runs = 16).

In [None]:
# choose a name for your AMLCompute cluster
cluster_name = "YOUR_CLUSTER_NAME"

# number of epochs
num_epochs = 1

# max total runs for hyperdrive
max_total_runs = 1

In [None]:
set_key(env_path, "cluster_name", cluster_name)

In [None]:
ws = Workspace.from_config(auth=get_auth(env_path))
print(ws.name, ws.resource_group, ws.location, sep="\n")

In [None]:
experiment = Experiment(workspace=ws, name="torchvision")

Let's copy the training script and its dependencies to a script folder.

In [None]:
script_folder = "./torchdetect"
os.makedirs(script_folder, exist_ok=True)

shutil.copy("./scripts/coco_eval.py", script_folder)
shutil.copy("./scripts/coco_utils.py", script_folder)
shutil.copy("./scripts/engine.py", script_folder)
shutil.copy("./scripts/transforms.py", script_folder)
shutil.copy("./scripts/utils.py", script_folder)
shutil.copy("./scripts/maskrcnn_model.py", script_folder)
shutil.copy("./scripts/XMLDataset.py", script_folder)
shutil.copy("./scripts/train.py", script_folder)
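The same copy step can also be written as a loop over file names, which makes it easier to fail early when a file is missing. The following is a minimal sketch; the helper name `copy_scripts` and the constant `SCRIPT_FILES` are hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path

# File names taken from the copy cell above.
SCRIPT_FILES = [
    "coco_eval.py",
    "coco_utils.py",
    "engine.py",
    "transforms.py",
    "utils.py",
    "maskrcnn_model.py",
    "XMLDataset.py",
    "train.py",
]


def copy_scripts(src_dir, dst_dir, file_names=SCRIPT_FILES):
    """Copy each named file from src_dir into dst_dir (hypothetical helper)."""
    src, dst = Path(src_dir), Path(dst_dir)
    missing = [name for name in file_names if not (src / name).is_file()]
    if missing:
        # Fail before copying anything so the script folder is never half-filled.
        raise FileNotFoundError(f"Missing in {src}: {', '.join(missing)}")
    dst.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of each copied file.
    return [shutil.copy(src / name, dst) for name in file_names]
```

Calling `copy_scripts("./scripts", script_folder)` would then populate the script folder with the same eight files.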

## Upload dataset to default datastore

In [None]:
ds = ws.get_default_datastore()
ds.container_name

In [None]:
ds.upload(
    src_dir="./scripts/JPEGImages",
    target_path="JPEGImages",
    overwrite=True,
    show_progress=True,
)
ds.upload(
    src_dir="./scripts/Annotations",
    target_path="Annotations",
    overwrite=True,
    show_progress=True,
)

## Create AmlCompute

We need a compute target for training the model. Here, we create an [AmlCompute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) cluster as the training compute resource. The hyperparameter tuning runs later in this notebook use the same cluster.

In [None]:
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=8
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster.
print(compute_target.get_status().serialize())

## Create a PyTorch Estimator

We first create a PyTorch estimator and submit a single run to make sure everything works before moving on to the hyperparameter search. This run can take several hours, depending on the value of num_epochs.

In [None]:
script_folder = "./torchdetect"
image_name = get_key(env_path, "image_name")

In [None]:
# point to an image in private ACR
image_registry_details = ContainerRegistry()
image_registry_details.address = get_key(env_path, "acr_server_name")
image_registry_details.username = get_key(env_path, "acr_username")
image_registry_details.password = get_key(env_path, "acr_password")

In [None]:
script_params = {
    "--data_path": ds.as_mount(),
    "--workers": 8,
    "--learning_rate": 0.005,
    "--epochs": num_epochs,
    "--anchor_sizes": "16,32,64,128,256,512",
    "--anchor_aspect_ratios": "0.25,0.5,1.0,2.0",
    "--rpn_nms_thresh": 0.5,
    "--box_nms_thresh": 0.3,
    "--box_score_thresh": 0.10,
}

estimator = PyTorch(
    source_directory=script_folder,
    script_params=script_params,
    compute_target=compute_target,
    entry_script="train.py",
    use_docker=True,
    custom_docker_image=image_name,
    image_registry_details=image_registry_details,
    user_managed=True,
    use_gpu=True,
)

estimator.run_config.environment.environment_variables["PYTHONPATH"] = "$PYTHONPATH:/cocoapi/PythonAPI/"
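The estimator passes each entry of `script_params` to `train.py` as a command-line argument, so list-valued settings such as `--anchor_sizes` arrive as comma-separated strings. `train.py` itself is not shown in this notebook; the sketch below is one way such arguments could be parsed, not the script's actual code. The helper names (`int_list`, `float_list`, `build_parser`) and the defaults are assumptions chosen to mirror the values above.

```python
import argparse


def int_list(text):
    """Parse a comma-separated string such as "16,32,64" into a tuple of ints."""
    return tuple(int(part) for part in text.split(","))


def float_list(text):
    """Parse a comma-separated string such as "0.25,0.5" into a tuple of floats."""
    return tuple(float(part) for part in text.split(","))


def build_parser():
    # Argument names mirror the keys of script_params; defaults are illustrative.
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--data_path", type=str, default=".")
    parser.add_argument("--workers", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=0.005)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument(
        "--anchor_sizes", type=int_list, default=(16, 32, 64, 128, 256, 512)
    )
    parser.add_argument(
        "--anchor_aspect_ratios", type=float_list, default=(0.25, 0.5, 1.0, 2.0)
    )
    parser.add_argument("--rpn_nms_thresh", type=float, default=0.5)
    parser.add_argument("--box_nms_thresh", type=float, default=0.3)
    parser.add_argument("--box_score_thresh", type=float, default=0.10)
    return parser
```

Passing the anchor settings as strings keeps them compatible with a sampler that can only choose among scalar or string options.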

### Submit job

In [None]:
run = experiment.submit(estimator)
print(run)

In [None]:
RunDetails(run).show()

In [None]:
# to get more details of your run
print(run.get_details())

In [None]:
run.wait_for_completion(show_output=True)

## Tune Model Hyperparameters

In this section, we automatically tune hyperparameters by exploring the range of values defined for each one. The following run can take several hours, depending on the number of epochs and the total number of runs selected.

In [None]:
script_params = {
    "--data_path": ds.as_mount(),
    "--workers": 8,
    "--epochs": num_epochs,
    "--box_nms_thresh": 0.3,
    "--box_score_thresh": 0.10,
}

estimator = PyTorch(
    source_directory=script_folder,
    script_params=script_params,
    compute_target=compute_target,
    entry_script="train.py",
    use_docker=True,
    custom_docker_image=image_name,
    image_registry_details=image_registry_details,
    user_managed=True,
    use_gpu=True,
)

estimator.run_config.environment.environment_variables["PYTHONPATH"] = "$PYTHONPATH:/cocoapi/PythonAPI/"

### Sampling hyperparameters

Here we use random sampling, which randomly selects hyperparameter values from the defined search space. Random sampling supports both discrete and continuous hyperparameters.

In [None]:
param_sampling = RandomParameterSampling(
    {
        "learning_rate": uniform(0.0005, 0.005),
        "rpn_nms_thresh": uniform(0.3, 0.7),
        "anchor_sizes": choice(
            "16",
            "16,32",
            "16,32,64",
            "16,32,64,128",
            "16,32,64,128,256",
            "16,32,64,128,256,512",
        ),
        "anchor_aspect_ratios": choice(
            "0.25", "0.25,0.5", "0.25,0.5,1.0", "0.25,0.5,1.0,2.0"
        ),
    }
)
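To make the idea of random sampling concrete, the following plain-Python sketch draws configurations from the same search space without using the Azure ML SDK. It only illustrates the concept and is not how HyperDrive is implemented; `uniform_range`, `one_of`, and `sample_configuration` are hypothetical names, chosen so they do not shadow the `uniform` and `choice` functions imported earlier.

```python
import random


def uniform_range(low, high):
    """Continuous hyperparameter: draw a float between low and high."""
    return lambda rng: rng.uniform(low, high)


def one_of(*options):
    """Discrete hyperparameter: pick one of the listed options."""
    return lambda rng: rng.choice(options)


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    return {name: draw(rng) for name, draw in space.items()}


search_space = {
    "learning_rate": uniform_range(0.0005, 0.005),
    "rpn_nms_thresh": uniform_range(0.3, 0.7),
    "anchor_sizes": one_of(
        "16",
        "16,32",
        "16,32,64",
        "16,32,64,128",
        "16,32,64,128,256",
        "16,32,64,128,256,512",
    ),
    "anchor_aspect_ratios": one_of(
        "0.25", "0.25,0.5", "0.25,0.5,1.0", "0.25,0.5,1.0,2.0"
    ),
}

rng = random.Random(0)  # seeded only so the illustration is repeatable
for _ in range(3):
    print(sample_configuration(search_space, rng))
```

Each printed dictionary corresponds to the set of arguments one child run would receive.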

We terminate poorly performing runs automatically with the Bandit early termination policy, which is based on a slack factor and an evaluation interval. The policy terminates any run whose primary metric is not within the specified slack factor of the best-performing training run.

In [None]:
early_termination_policy = BanditPolicy(
    slack_factor=0.15, evaluation_interval=2, delay_evaluation=2
)

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="mAP@IoU=0.50",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=max_total_runs,
    max_concurrent_runs=4,
)
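As a rough illustration of the slack-factor rule for a metric that is maximized, a run survives an evaluation only if its metric is at least `best / (1 + slack_factor)`. The sketch below applies that rule to made-up mAP values. The function names and the numbers are hypothetical, and the sketch leaves out `evaluation_interval` and `delay_evaluation`, which control when the check is applied rather than how it is computed.

```python
def bandit_threshold(best_metric, slack_factor):
    """Lowest metric value that survives when the goal is to maximize."""
    return best_metric / (1.0 + slack_factor)


def runs_to_terminate(metrics_by_run, slack_factor):
    """Return the ids of runs whose metric falls below the bandit cutoff."""
    best = max(metrics_by_run.values())
    cutoff = bandit_threshold(best, slack_factor)
    return sorted(
        run_id for run_id, metric in metrics_by_run.items() if metric < cutoff
    )


# Hypothetical mAP values reported by four runs at one evaluation interval.
reported = {"run_a": 0.60, "run_b": 0.55, "run_c": 0.50, "run_d": 0.30}

print(bandit_threshold(0.60, 0.15))  # 0.60 / 1.15, roughly 0.52
print(runs_to_terminate(reported, slack_factor=0.15))  # ['run_c', 'run_d']
```

With `slack_factor=0.15`, a run at 0.55 stays within the slack of the best run at 0.60, while runs at 0.50 and 0.30 would be stopped.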

In [None]:
hyperdrive_run = experiment.submit(hyperdrive_config)

In [None]:
RunDetails(hyperdrive_run).show()

In [None]:
hyperdrive_run.wait_for_completion(show_output=True, wait_post_processing=True)

### Find and register the best model

Once all the runs complete, we can find the run that produced the model with the highest value of the primary metric (mAP@IoU=0.50).

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
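Conceptually, picking the best run amounts to comparing the primary metric across the child runs. The sketch below shows that selection on hypothetical run records in plain Python; the record layout, the helper name `best_by_metric`, and the metric values are assumptions for illustration and do not reflect the SDK's internals.

```python
def best_by_metric(runs, metric_name, maximize=True):
    """Return the record with the best metric value, skipping runs that never logged it."""
    scored = [run for run in runs if metric_name in run["metrics"]]
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda run: run["metrics"][metric_name])


# Hypothetical child-run records; run_c stands for a run that stopped before logging.
records = [
    {"run_id": "run_a", "metrics": {"mAP@IoU=0.50": 0.60}},
    {"run_id": "run_b", "metrics": {"mAP@IoU=0.50": 0.55}},
    {"run_id": "run_c", "metrics": {}},
]

winner = best_by_metric(records, "mAP@IoU=0.50")
print(winner["run_id"])  # run_a
```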

In [None]:
best_run.get_details()['runDefinition']['arguments']

In [None]:
model = best_run.register_model(
    model_name="torchvision_best_model", model_path="/outputs/model_latest.pth"
)
print(model.name, model.id, model.version, sep="\t")

You can deploy this model by following the [deploy deep learning models using Azure ML](https://github.com/microsoft/AKSDeploymentTutorialAML) tutorial.

You can now move on to the next notebook to [create an Azure Machine Learning pipeline to run the steps of tuning the hyperparameters and registering the model](05_TrainWithAMLPipeline.ipynb).