# Train a PyTorch Classification Model with Azure ML and AML Compute

This driver notebook uses the Azure ML Python SDK to create a Datastore (an object that links to data in Blob Storage), an AmlCompute target (a compute cluster for training), and a PyTorch estimator that tells Azure ML where to find the training script and how to run it.

Note:
* Use the "Python 3.6 - PyTorch 1.1" kernel for this notebook, or install the appropriate library versions with the cell below.

## Imports

In [1]:
# Install a pinned version of the Azure ML Python SDK into this kernel's environment
import sys
! {sys.prefix}/bin/pip install azureml-sdk[automl]==1.0.74
! {sys.prefix}/bin/pip install matplotlib







In [26]:
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace, Experiment, Datastore
from azureml.exceptions import ProjectSystemException
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import PyTorch
import shutil
import os
import json
import time

# Set the following values for your own environment
my_nickname = 'silinskiy'

STORAGE_CONTAINER_NAME_TRAINDATA = 'tardis3'
STORAGE_ACCOUNT_NAME = 'aicampsilinskiy'
# Never commit a real account key to a notebook; paste it at run time or load it from a secret store
STORAGE_ACCOUNT_KEY = '<your-storage-account-key>'

In [23]:
# Check core SDK version number
import azureml.core
import torch

print("SDK version: ", azureml.core.VERSION)
print("PyTorch version: ", torch.__version__)

SDK version:  1.0.74
PyTorch version:  1.1.0


## Diagnostics
Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [24]:
from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [25]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config(path='config.json')
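
For reference, `Workspace.from_config()` expects `config.json` to carry three keys: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below writes such a file with placeholder values (assumptions, not this workshop's real settings) and checks that nothing is missing before the workspace is loaded. The helper name `missing_config_keys` is hypothetical and not part of the SDK.

```python
import json
import os
import tempfile

# Keys that Workspace.from_config() reads from config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required workspace keys that are absent or empty in the file."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values for illustration only; substitute your own workspace details
example_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "example-resource-group",
    "workspace_name": "example-workspace",
}

# Written to a temporary directory so that a real config.json is not overwritten
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example_config, f, indent=2)

print(missing_config_keys(config_path))
```

An empty list means the file has everything `Workspace.from_config(path=...)` needs.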

## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation step and reuses it.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [10]:
# choose a name for your cluster - under 16 characters
cluster_name = "tardis"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    # AmlCompute config: the cluster is a persistent resource that autoscales between min_nodes and max_nodes
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                        min_nodes=0,
                                                        max_nodes=3)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned


Check the provisioning status of the cluster.

In [11]:
# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-11-20T18:47:47.144000+00:00', 'errors': None, 'creationTime': '2019-11-20T18:47:45.128578+00:00', 'modifiedTime': '2019-11-20T18:48:00.544463+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 3, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}
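
The serialized status is a plain dictionary, so the fields of interest can be pulled out directly. The sketch below is a minimal, assumption-laden helper (the names `iso_duration_seconds` and `summarize_cluster` are hypothetical) that condenses the status and converts the ISO 8601 idle timeout `PT120S` into seconds. It runs on a trimmed copy of the output above rather than on a live cluster.

```python
import re


def iso_duration_seconds(duration):
    """Convert a simple ISO 8601 time duration such as 'PT120S' or 'PT2M' to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % duration)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def summarize_cluster(status):
    """Condense the dictionary returned by get_status().serialize()."""
    scale = status["scaleSettings"]
    return {
        "ready": status["provisioningState"] == "Succeeded" and not status["errors"],
        "nodes": status["currentNodeCount"],
        "max_nodes": scale["maxNodeCount"],
        "idle_seconds_before_scale_down": iso_duration_seconds(
            scale["nodeIdleTimeBeforeScaleDown"]
        ),
    }


# Trimmed copy of the status printed above
status = {
    "currentNodeCount": 0,
    "errors": None,
    "provisioningState": "Succeeded",
    "scaleSettings": {
        "minNodeCount": 0,
        "maxNodeCount": 3,
        "nodeIdleTimeBeforeScaleDown": "PT120S",
    },
}

print(summarize_cluster(status))
```

In the notebook, the same helper could be called as `summarize_cluster(compute_target.get_status().serialize())`. A node count of 0 is expected here because `min_nodes=0` lets the cluster scale down to nothing while idle.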


In [12]:
# Create a project directory and copy training script to it
project_folder = os.path.join(os.getcwd(), 'project')
os.makedirs(project_folder, exist_ok=True)
shutil.copy(os.path.join(os.getcwd(), 'pytorch_train_transfer.py'), project_folder)

'/data/home/thor/notebooks/aicamp/project/pytorch_train_transfer.py'

## Create an experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this transfer learning PyTorch tutorial.

Think of an experiment as a scenario, such as "finding images of people fighting in CCTV feeds". An experiment usually has many runs, each of which may involve changes to the data, the hyperparameters, the training code itself, or other optimizations.

In [13]:
# Create an experiment
experiment_name = 'suspicious-behavior-' + my_nickname
experiment = Experiment(ws, name=experiment_name)

## Use an Azure Blob Container as Datastore

In [27]:
# Use an Azure ML Data Store for training data
ds = Datastore.register_azure_blob_container(workspace=ws, 
    datastore_name='aaaaa', 
    container_name=STORAGE_CONTAINER_NAME_TRAINDATA,
    account_name=STORAGE_ACCOUNT_NAME, 
    account_key=STORAGE_ACCOUNT_KEY,
    create_if_not_exists=True)
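
Because the account key is a credential, one option is to read the storage settings from environment variables instead of typing them into the notebook. The following is a minimal sketch under that assumption. The helper `load_storage_settings` is hypothetical, and the demonstration uses a plain dictionary with placeholder values in place of the real environment.

```python
import os

SETTING_NAMES = (
    "STORAGE_CONTAINER_NAME_TRAINDATA",
    "STORAGE_ACCOUNT_NAME",
    "STORAGE_ACCOUNT_KEY",
)


def load_storage_settings(env=os.environ):
    """Collect the storage settings from a mapping, failing loudly if any is missing."""
    missing = [name for name in SETTING_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in SETTING_NAMES}


# Placeholder values standing in for os.environ; these are not real credentials
demo_env = {
    "STORAGE_CONTAINER_NAME_TRAINDATA": "example-container",
    "STORAGE_ACCOUNT_NAME": "examplestorageaccount",
    "STORAGE_ACCOUNT_KEY": "<your-storage-account-key>",
}

settings = load_storage_settings(demo_env)
print(sorted(settings))
```

With real environment variables set, calling `load_storage_settings()` with no argument returns values that can be passed as `container_name`, `account_name`, and `account_key` to `Datastore.register_azure_blob_container`.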

## Train

To train the PyTorch model, we use an Azure ML Estimator specific to PyTorch. See [Train models with Azure Machine Learning using estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-ml-models) for more on Estimators. The Datastore registered earlier is passed to the training script as a mount, so the Blob Storage container appears as a local path on the remote compute target.

To learn more about where to read and write files on local or remote compute, see [Where to save and write files for Azure Machine Learning experiments](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-save-write-experiment-files).

In [28]:
# Set up for training. The "--trans" flag tells the script to use transfer
# learning, which downloads a pretrained model onto the compute target.
# Files written to ./outputs on the compute target are uploaded to the
# run history when the run finishes.
script_params = {
    '--data_dir': ds.as_mount(),
    '--num_epochs': 30,
    '--learning_rate': 0.01,
    '--output_dir': './outputs',
    '--trans': 'True'
}
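
Each entry in `script_params` reaches the training script as a command-line argument, and every value arrives as text. The training script itself is not shown in this notebook, so the following is only a sketch of how `pytorch_train_transfer.py` might parse these flags with `argparse`. The defaults and the helper `str_to_bool` are assumptions. The explicit boolean conversion matters because `bool('False')` evaluates to `True` in Python.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False' style strings, since script_params passes text."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def parse_args(argv=None):
    """Parse the flags that the estimator forwards to the training script."""
    parser = argparse.ArgumentParser(description="Transfer-learning training script")
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--num_epochs", type=int, default=30)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--output_dir", type=str, default="./outputs")
    parser.add_argument("--trans", type=str_to_bool, default=True)
    return parser.parse_args(argv)


# Example argument list; '/mnt/data' is a placeholder for the mounted datastore path
args = parse_args([
    "--data_dir", "/mnt/data",
    "--num_epochs", "30",
    "--learning_rate", "0.01",
    "--output_dir", "./outputs",
    "--trans", "True",
])
print(args.num_epochs, args.learning_rate, args.trans)
```

On the compute target, `--data_dir` resolves to the path where the datastore is mounted, so the script can treat it as an ordinary directory.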

In [29]:
# Instantiate the PyTorch estimator: it packages the project folder, installs
# the listed pip packages, and runs the entry script on the compute target.
# Note: use_gpu=True requests a GPU-enabled environment, but STANDARD_D2_V2 VMs
# have no GPU; choose a GPU VM size for the cluster to actually train on GPU.
estimator = PyTorch(source_directory=project_folder,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='pytorch_train_transfer.py',
                    use_gpu=True,
                    pip_packages=['torch==1.1.0',
                                  'torchvision==0.3.0',
                                  'matplotlib==3.1.1',
                                  'opencv-python==4.1.1.26',
                                  'Pillow'],
                    framework_version='1.1')

run = experiment.submit(estimator)



Check run status.

In [30]:
print(run.get_details())

{'runId': 'suspicious-behavior-silinskiy_1574278440_b6b7b480', 'target': 'tardis', 'status': 'Preparing', 'properties': {'_azureml.ComputeTargetType': 'batchai', 'ContentSnapshotId': '6f111e79-1594-4a98-a5e1-c605acde6f14'}, 'inputDatasets': [], 'runDefinition': {'script': 'pytorch_train_transfer.py', 'arguments': ['--data_dir', '$AZUREML_DATAREFERENCE_aaaaa', '--num_epochs', '30', '--learning_rate', '0.01', '--output_dir', './outputs', '--trans', 'True'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'tardis', 'dataReferences': {'aaaaa': {'dataStoreName': 'aaaaa', 'mode': 'Mount', 'pathOnDataStore': None, 'pathOnCompute': None, 'overwrite': False}}, 'data': {}, 'jobName': None, 'maxRunDurationSeconds': None, 'nodeCount': 1, 'environment': {'name': 'Experiment suspicious-behavior-silinskiy Environment', 'version': 'Autosave_2019-11-20T19:34:01Z_00c09fa5', 'python': {'interpreterPath': 'python', 'userManagedDependencies': False, 'condaDependen
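
A run passes through states such as `Preparing`, `Queued`, and `Running` before it ends as `Completed`, `Failed`, or `Canceled`. The SDK's `run.wait_for_completion(show_output=True)` blocks until then. The sketch below shows the same idea as an explicit polling loop. The function `wait_for_terminal_status` and its parameters are hypothetical, and a scripted list of statuses stands in for a live run so the example runs without a workspace.

```python
import time

# Statuses after which a run will not change again
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600,
                             sleep=time.sleep):
    """Call get_status() repeatedly until it reports a terminal state."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("run still %r after %d seconds" % (status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Scripted statuses standing in for a live run; no real waiting takes place
fake_statuses = iter(["Preparing", "Queued", "Running", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(fake_statuses),
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(final_status)
```

Against a real run, the call would be `wait_for_terminal_status(run.get_status)`.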

## Register model to workspace

Registering the model makes it accessible through the SDK from other runs or experiments.

The registration call lives in the training script, which has access to the run object:

```python
model = run.register_model(model_name='pt-dnn', model_path='outputs/')
```