Copyright (c) Microsoft Corporation. All rights reserved.  

Licensed under the MIT License.


# Train, hyperparameter tune and convert a PyTorch model into ONNX

In this tutorial, you will train and hyperparameter tune a PyTorch model, convert it to ONNX, and register it using the Azure Machine Learning (Azure ML) Python SDK. You will also learn how to use ML pipelines to create a reproducible training process.

This tutorial trains an image classification model using transfer learning, based on PyTorch's [Transfer Learning tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). The model learns to classify chickens and turkeys, starting from a ResNet18 model pretrained on the [ImageNet](http://image-net.org/index) dataset.

## Prerequisites
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [Configuration](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`.


In [None]:
import os
import shutil
import urllib

import numpy as np
import matplotlib.pyplot as plt

import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.datastore import Datastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter, PipelineRun
from azureml.pipeline.steps import HyperDriveStep, HyperDriveStepRun, PythonScriptStep
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.
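
For orientation, `config.json` is a small JSON file that holds three identifiers: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below uses only the standard library to write a placeholder file to a temporary directory and read it back. The values are dummy placeholders, and the helper `load_workspace_config` is hypothetical; `Workspace.from_config()` performs the real lookup for you.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a config.json-style file and check that the expected keys exist."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError("config is missing keys: " + ", ".join(missing))
    return cfg


# Placeholder values only; a real file carries your own identifiers.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

tmp_dir = tempfile.mkdtemp()
config_path = os.path.join(tmp_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(example, f, indent=2)

cfg = load_workspace_config(config_path)
print(sorted(cfg))
```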

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

### Download training data
The dataset we will use (available as a zip file on a public blob [here](https://msdocsdatasets.blob.core.windows.net/pytorchfowl/fowl_data.zip)) consists of about 120 training images each for turkeys and chickens, with 100 validation images for each class. The images are a subset of the [Open Images v5 Dataset](https://storage.googleapis.com/openimages/web/index.html). The unzipped files are provided in the repository. The cell below shows how to upload your data into a datastore for traceability.
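
Before uploading, it can help to confirm that the local copy has the expected number of images per class. The following is a minimal sketch that assumes a `<split>/<class>/` folder layout (for example `train/chickens`); that layout and the helper name `count_images` are assumptions, not something this tutorial specifies. The demo builds a tiny synthetic tree so the snippet runs anywhere.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images(root):
    """Return {split: {class_name: image_count}} for a split/class folder tree."""
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        counts[split] = {}
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            counts[split][class_name] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


# Build a small synthetic dataset tree (empty files stand in for images).
demo_root = tempfile.mkdtemp()
for split, n_images in (("train", 3), ("val", 2)):
    for class_name in ("chickens", "turkeys"):
        class_dir = os.path.join(demo_root, split, class_name)
        os.makedirs(class_dir)
        for i in range(n_images):
            open(os.path.join(class_dir, f"img_{i}.jpg"), "w").close()

demo_counts = count_images(demo_root)
print(demo_counts)
```

Pointing `count_images` at the local dataset folder should report roughly the per-class counts quoted above if the layout assumption holds.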

In [None]:
# get the workspace's default datastore
data_dir = ws.get_default_datastore()

# upload data
data_dir.upload('data', overwrite=True, show_progress=False)

## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you will create a CPU cluster or select an existing one.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "cpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2', 
                                                           max_nodes=4)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
# print(compute_target.get_status().serialize())

## Train model on the remote compute
Now that you have your data and training script prepared, you are ready to train on your remote compute cluster.

### Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [None]:
import os
#create the project directory 
script_folder = './pytorch-birds'
os.makedirs(script_folder, exist_ok=True)


### Prepare training script
Now you will need to create your training script. In this tutorial, the training script is already provided for you at `pytorch_train.py`. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

However, if you would like to use Azure ML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of Azure ML code inside your training script. 

In `pytorch_train.py`, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML `Run` object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `pytorch_train.py`, we log the learning rate and momentum parameters, and the best validation accuracy the model achieves:
```Python
run.log('lr', float(learning_rate))
run.log('momentum', float(momentum))

run.log('best_val_acc', float(best_acc))
```
These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Create a hyperparameter step in an ML pipeline" section.

Once your script is ready, copy the training script `pytorch_train.py` into your project directory.

In [None]:
# copy the training script into the project directory
shutil.copy('pytorch_train.py', script_folder)

### Create a PyTorch estimator
The Azure ML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). The following code will define a single-node PyTorch job.

In [None]:
from azureml.train.dnn import PyTorch

# create a PyTorch estimator
est = PyTorch(source_directory=script_folder,
              compute_target=compute_target,
              entry_script='pytorch_train.py',
              pip_packages=['pillow==5.4.1'])

In [None]:
from azureml.pipeline.core import PipelineData

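# define the input: the uploaded dataset folder on the default datastore
input_data = DataReference(
    datastore=data_dir,
    data_reference_name="input_data",
    path_on_datastore="fowl_data")

# define a pipeline output location (not wired into the step in this tutorial)
output = PipelineData("output", datastore=data_dir)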

The `estimator_entry_script_arguments` parameter of `HyperDriveStep`, used when the step is created later in this tutorial, contains the arguments passed to your training script `entry_script`. Please note the following:
- We pass our training data reference `input_data` to our script's `--data_dir` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the training data `fowl_data` on our datastore.
- The `outputs` directory is treated specially by Azure ML: all content in this directory is uploaded to your workspace as part of your run history. The files written to this directory therefore remain accessible after your remote run is over. In this tutorial, the training script saves the trained model to this output directory.
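
To make the argument flow concrete, here is a minimal sketch of how a training script might receive these values. It is not the contents of `pytorch_train.py`: the defaults, the `--output_dir` flag, and the mount path in the demo call are assumptions for illustration. The `--learning_rate` and `--momentum` flags mirror the hyperparameter names swept later in this tutorial, since HyperDrive passes each sampled value to the script as a command-line argument.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a hypothetical training entry script would accept."""
    parser = argparse.ArgumentParser(description="hypothetical training entry point")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="path where the datastore is mounted on the compute")
    parser.add_argument("--num_epochs", type=int, default=25,
                        help="number of training epochs (default is an assumption)")
    parser.add_argument("--learning_rate", type=float, default=0.001,
                        help="initial learning rate (default is an assumption)")
    parser.add_argument("--momentum", type=float, default=0.9,
                        help="SGD momentum (default is an assumption)")
    parser.add_argument("--output_dir", type=str, default="./outputs",
                        help="folder whose contents Azure ML uploads to run history")
    return parser.parse_args(argv)


# Simulate the command line built from estimator_entry_script_arguments.
args = parse_args(["--data_dir", "/mnt/fowl_data", "--num_epochs", "5"])

# The trained model would be saved here so it ends up under outputs/ in run history.
model_path = os.path.join(args.output_dir, "model.pt")
print(args.data_dir, args.num_epochs, model_path)
```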


## Create a hyperparameter step in an ML pipeline
Now that we've created an estimator, let's find the most accurate model by using Azure Machine Learning's hyperparameter tuning capabilities. You will also learn how to create a pipeline step for an ML pipeline.

### Start a hyperparameter sweep
First, we will define the hyperparameter space to sweep over. Since our training script uses a learning rate schedule to decay the learning rate every several epochs, let's tune the initial learning rate and the momentum parameters. In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (`best_val_acc`).
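
As a rough picture of what random sampling does, the sketch below draws independent uniform samples for each hyperparameter using only the standard library. The ranges match the ones used in this tutorial's sweep configuration. The helper `sample_configs` and the fixed seed are illustrative assumptions and do not reflect how HyperDrive is implemented.

```python
import random

# (low, high) bounds for each hyperparameter, sampled uniformly.
SEARCH_SPACE = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}


def sample_configs(space, n_runs, seed=0):
    """Draw n_runs independent configurations from uniform ranges."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        for _ in range(n_runs)
    ]


# Eight configurations, one per run in the sweep.
configs = sample_configs(SEARCH_SPACE, n_runs=8)
for config in configs:
    print(config)
```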

Then, we specify an early termination policy to stop poorly performing runs. Here we use the `BanditPolicy`, which terminates any run whose primary metric does not fall within the slack factor of the best-performing run. In this tutorial, we apply this policy every epoch (since we report our `best_val_acc` metric every epoch and `evaluation_interval=1`). Notice that we delay the first policy evaluation until after the first `10` epochs (`delay_evaluation=10`), so a run that trains for fewer than 10 epochs is never terminated early.
Refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy) for more information on the BanditPolicy and other policies available.
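
The slack-factor rule can be worked through by hand. For a metric being maximized, a run is cancelled when its metric falls below `best / (1 + slack_factor)`. The sketch below is a simplified model of that rule in plain Python; the helper names and the accuracy values are made up for illustration, and the real policy runs inside the service.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Bandit rule for a maximized metric: cancel if below best / (1 + slack)."""
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


def apply_bandit(metrics_by_run, slack_factor, interval,
                 evaluation_interval=1, delay_evaluation=0):
    """Return the sorted run ids that would be cancelled at this interval."""
    # The policy is skipped before the delay and between evaluation intervals.
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return []
    best = max(metrics_by_run.values())
    return sorted(
        run_id for run_id, metric in metrics_by_run.items()
        if bandit_should_terminate(metric, best, slack_factor)
    )


# Invented accuracies; with slack 0.15 the cutoff is about 0.92 / 1.15 = 0.80.
example_metrics = {"run_a": 0.92, "run_b": 0.85, "run_c": 0.75}

before_delay = apply_bandit(example_metrics, 0.15, interval=5, delay_evaluation=10)
after_delay = apply_bandit(example_metrics, 0.15, interval=10, delay_evaluation=10)
print(before_delay, after_delay)
```

With these numbers nothing is cancelled at interval 5 because the delay has not elapsed, while at interval 10 only `run_c` falls below the cutoff.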

In [None]:
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, PrimaryMetricGoal

param_sampling = RandomParameterSampling({
    'learning_rate': uniform(0.0005, 0.005),
    'momentum': uniform(0.9, 0.99)
})

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)

hd_config = HyperDriveConfig(estimator=est,
                             hyperparameter_sampling=param_sampling, 
                             policy=early_termination_policy,
                             primary_metric_name='best_val_acc',
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=8,
                             max_concurrent_runs=4)

### Create a HyperDrive step for an ML pipeline

In [None]:
# create the hyperdrive step; `input_data` is the DataReference defined earlier

metrics_output_name = 'metrics_output'
metrics_data = PipelineData(name='metrics_data',
                            datastore=data_dir,
                            pipeline_output_name=metrics_output_name)

hd_step_name = 'Hyperdrive_step'
hd_step = HyperDriveStep(
    name=hd_step_name,
    hyperdrive_config=hd_config,
    estimator_entry_script_arguments=["--data_dir", input_data, "--num_epochs", 5],
    inputs=[input_data],
    metrics_output=metrics_data)

### Run an ML pipeline
Finally, launch the hyperparameter tuning job as part of an ML pipeline. To learn more about how to create an ML pipeline for end-to-end training jobs with additional steps such as data preparation, check out these [notebook samples](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines).

In [None]:
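exp = Experiment(workspace=ws, name='Hyperdrive_sample')
pipeline = Pipeline(workspace=ws, steps=[hd_step])
pipeline_run = exp.submit(pipeline)

# block until the pipeline finishes so the later cells can query the completed runs
pipeline_run.wait_for_completion(show_output=True)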

In [None]:
# the same pipeline can also be published and then triggered through a REST endpoint
mypipeline = Pipeline(workspace=ws, steps=[hd_step])
print("Pipeline is built")

### Find the best model
Once all the runs complete, we can find the run that produced the model with the highest accuracy.
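
Conceptually, this selection is a maximum over the primary metric that each run reported. A minimal sketch of that idea in plain Python follows. The run ids and accuracy values are invented for illustration, and the helper `best_run_by_metric` is hypothetical rather than part of the SDK.

```python
def best_run_by_metric(metrics_by_run, metric_name, maximize=True):
    """Return (run_id, value) for the run with the best value of metric_name."""
    scored = {}
    for run_id, metrics in metrics_by_run.items():
        values = metrics.get(metric_name)
        if values is None:
            continue  # this run never reported the metric
        if not isinstance(values, (list, tuple)):
            values = [values]
        # A metric logged every epoch is a series; keep its best entry.
        scored[run_id] = max(values) if maximize else min(values)
    if not scored:
        raise ValueError("no run reported metric " + repr(metric_name))
    pick = max if maximize else min
    best_id = pick(scored, key=scored.get)
    return best_id, scored[best_id]


# Invented example data: per-epoch validation accuracy for three runs.
example_runs = {
    "run_0": {"best_val_acc": [0.81, 0.88, 0.90]},
    "run_1": {"best_val_acc": [0.85, 0.93, 0.95]},
    "run_2": {"lr": 0.001},
}

best_id, best_value = best_run_by_metric(example_runs, "best_val_acc")
print(best_id, best_value)
```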

In [None]:
hd_step_run = HyperDriveStepRun(step_run=pipeline_run.find_step_run(hd_step_name)[0])
best_run = hd_step_run.get_best_run_by_primary_metric()
best_run

### Convert to ONNX
Now that we have the best model, let's convert it to ONNX to optimize it for scoring. For more information about ONNX, see [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-onnx).

In [None]:
import torch

#download the model 
best_run.download_file('outputs/model.pt', output_file_path='model.pt')

best_run_model = torch.load('model.pt')

In [None]:
# export the model as onnx 
import torch.onnx

# Standard ImageNet input - 3 channels, 224x224,
# values don't matter as we care about network structure.
# But they can also be real inputs.
dummy_input = torch.randn(1, 3, 224, 224)

# Invoke export
torch.onnx.export(best_run_model, dummy_input, "model.onnx")
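
Because the export traces the model with a `1x3x224x224` input, images sent to the exported model for scoring need the same layout. The sketch below shows one way to build such a batch with NumPy. It assumes the standard ImageNet mean and standard deviation and an image that is already resized to 224x224; whether the training script uses exactly this normalization is an assumption, and the helper `to_model_input` is hypothetical.

```python
import numpy as np

# Commonly used ImageNet channel statistics (assumed, not taken from the script).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(image_hwc_uint8):
    """Convert a 224x224x3 uint8 RGB image into a 1x3x224x224 float32 batch."""
    if image_hwc_uint8.shape != (224, 224, 3):
        raise ValueError("expected a 224x224x3 image, got %r" % (image_hwc_uint8.shape,))
    x = image_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # normalize per channel
    x = np.transpose(x, (2, 0, 1))                  # HWC -> CHW
    return x[np.newaxis, ...]                       # add the batch dimension


# A black image stands in for a real photo.
fake_image = np.zeros((224, 224, 3), dtype=np.uint8)
batch = to_model_input(fake_image)
print(batch.shape, batch.dtype)
```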

In [None]:
import onnx 

# confirm successful export
onnx_model = onnx.load('model.onnx')

# Check that the IR is well formed
onnx.checker.check_model(onnx_model)

In [None]:
# registering the model lets you trigger a DevOps release pipeline that packages and deploys it
from azureml.core.model import Model

model = Model.register(workspace=ws,
                       model_path="model.onnx",
                       model_name="onnx-birds")

print(model.name, model.id, model.version, sep='\t')

## Next Steps
Create a DevOps pipeline that triggers a release every time a new model is registered.