Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Train GenSen Model with Distributed PyTorch on AML
In this tutorial, you will train a GenSen model with PyTorch on Azure Machine Learning (AML) using distributed training across a GPU cluster. The same steps can serve as a general guideline for training other models on a GPU cluster.

Regarding **AzureML**, please refer to:
* [Quickstart notebook](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python)
* [Hyperdrive](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters)

## 0. Global Settings

In [1]:
# set the environment path to find NLP
import sys
sys.path.append("../../../")
import time
import os
import papermill as pm
import pandas as pd
import shutil

import azureml as aml
import azureml.train.hyperdrive as hd

from azureml.telemetry import set_diagnostics_collection
from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens
from utils_nlp.dataset import snli
from models.gensen.amlcode.data_utils import gensen_preprocess

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", aml.core.VERSION)
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Azure ML SDK Version: 1.0.33
Pandas version: 0.24.2


In [2]:
BASE_DATA_PATH = '../../../data'


Opt-in diagnostics for better experience, quality, and security of future releases.

In [3]:
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


We assume that an AzureML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) has already been created. For instructions on how to do this, see [here](README.md).

## 1 Initialize workspace

We create a workspace object using the configurations specified in `aml_config/config.json` that you created in the prerequisite step.

In [4]:
# AzureML workspace info. Workspace.from_config() looks up "./aml_config/config.json" first; the values below are only used as a fallback
SUBSCRIPTION_ID = '<subscription-id>'
RESOURCE_GROUP  = '<resource-group>'
WORKSPACE_NAME  = '<workspace-name>'

# Connect to a workspace
try:
    ws = aml.core.Workspace.from_config()
except aml.exceptions.UserErrorException:
    try:
        ws = aml.core.Workspace(
            subscription_id=SUBSCRIPTION_ID,
            resource_group=RESOURCE_GROUP,
            workspace_name=WORKSPACE_NAME
        )
        ws.write_config()
    except aml.exceptions.AuthenticationException:
        ws = None

if ws is None:
    raise ValueError(
        """Cannot access the AzureML workspace w/ the config info provided.
        Please check if you entered the correct id, group name and workspace name"""
    )
else:
    print("AzureML workspace name: ", ws.name)

AzureML workspace name:  MAIDAPNLP
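
The config file read by `Workspace.from_config()` holds the subscription ID, resource group, and workspace name. As a minimal sketch, the check below validates such a file before you try to connect. The helper name `load_workspace_config` and the placeholder test (values starting with `<`) are illustrative assumptions and are not part of the Azure ML SDK.

```python
import json
import os
import tempfile

# Keys that an Azure ML workspace config file is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Load a workspace config file and reject missing or placeholder values.

    Hypothetical helper for illustration; it is not part of the Azure ML SDK.
    """
    with open(path) as f:
        config = json.load(f)
    bad = [
        key for key in REQUIRED_KEYS
        if not config.get(key) or str(config[key]).startswith("<")
    ]
    if bad:
        raise ValueError("Missing or placeholder values for: " + ", ".join(bad))
    return config


# Usage: write an example config to a temporary directory and load it back.
example = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example, f)

loaded = load_workspace_config(config_path)
print(loaded["workspace_name"])
```

Running a check like this first turns a forgotten `<subscription-id>` placeholder into an explicit error message rather than an authentication failure.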


Once you've created your workspace and set up your development environment, training a model in Azure Machine Learning involves the following steps:
1. Create a remote compute target (note that you can also use a local computer as the compute target)
2. Prepare your training data and upload it to datastore
3. Create your training script
4. Create an Estimator object
5. Submit the estimator to an experiment object under the workspace

## 2 Create or load an existing compute target

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource. Specifically, the code below creates a `STANDARD_NC6` GPU cluster that autoscales from `0` to `4` nodes.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process and loads the cluster directly.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

VM_SIZE = 'STANDARD_NC6'
VM_PRIORITY = 'lowpriority'
MAX_NODES = 4

# choose a name for your cluster
cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    # pass VM_PRIORITY so the setting defined above actually takes effect
    compute_config = AmlCompute.provisioning_configuration(vm_size=VM_SIZE,
                                                           vm_priority=VM_PRIORITY,
                                                           max_nodes=MAX_NODES)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute. 
print(compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-05-02T21:01:04.496000+00:00', 'errors': None, 'creationTime': '2019-04-17T17:21:26.968570+00:00', 'modifiedTime': '2019-04-17T17:27:28.740980+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT7200S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}
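
The serialized status is a plain dictionary, so the fields that matter for training can be pulled out with a few lines of standard Python. The sketch below is one way to do that; `summarize_cluster_status` is a hypothetical helper, the sample dictionary is a trimmed copy of the output above, and the duration parsing only handles the `PT<seconds>S` form seen there.

```python
import re


def summarize_cluster_status(status):
    """Condense a serialized AmlCompute status dict into a few key fields.

    Hypothetical helper for illustration; only the keys shown in the
    notebook output above are assumed to exist.
    """
    scale = status["scaleSettings"]
    # Only the "PT<seconds>S" duration form is handled in this sketch.
    match = re.fullmatch(r"PT(\d+)S", scale["nodeIdleTimeBeforeScaleDown"])
    idle_seconds = int(match.group(1)) if match else None
    return {
        "vm_size": status["vmSize"],
        "priority": status["vmPriority"],
        "nodes": "{}/{}".format(status["currentNodeCount"], scale["maxNodeCount"]),
        "min_nodes": scale["minNodeCount"],
        "idle_seconds_before_scaledown": idle_seconds,
        "ready": status["provisioningState"] == "Succeeded" and not status["errors"],
    }


# Trimmed copy of the status printed above.
status = {
    "currentNodeCount": 0,
    "targetNodeCount": 0,
    "errors": None,
    "provisioningState": "Succeeded",
    "scaleSettings": {
        "minNodeCount": 0,
        "maxNodeCount": 4,
        "nodeIdleTimeBeforeScaleDown": "PT7200S",
    },
    "vmPriority": "Dedicated",
    "vmSize": "STANDARD_NC6",
}

summary = summarize_cluster_status(status)
print(summary)
```

A current node count of `0` with a minimum of `0` means the cluster is scaled down, so the first job will wait for nodes to be allocated.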


## 3. Prepare dataset.

In this section, we will
1. Download and load the dataset.
2. Tokenize and reshape the dataset for GenSen.
3. Upload the training set to the default blob storage of the workspace.

We use the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset in this example. For a more detailed walkthrough of the data processing, see [SNLI Data Prep](../01-prep-data/snli.ipynb).

### 3.1 Load the dataset

In [6]:
# file_type defaults to txt
train = snli.load_pandas_df(BASE_DATA_PATH, file_split="train")

# explicitly request the txt file format
dev = snli.load_pandas_df(BASE_DATA_PATH, file_split="dev", file_type="txt")

# explicitly request the txt file format
test = snli.load_pandas_df(BASE_DATA_PATH, file_split="test", file_type="txt")

train.head()

Unnamed: 0,gold_label,sentence1_binary_parse,sentence2_binary_parse,sentence1_parse,sentence2_parse,sentence1,sentence2,captionID,pairID,label1,label2,label3,label4,label5
0,neutral,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.,3416050480.jpg#4,3416050480.jpg#4r1n,neutral,,,,
1,contradiction,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette.",3416050480.jpg#4,3416050480.jpg#4r1c,contradiction,,,,
2,entailment,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,"( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse.",3416050480.jpg#4,3416050480.jpg#4r1e,entailment,,,,
3,neutral,( Children ( ( ( smiling and ) waving ) ( at c...,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...,Children smiling and waving at camera,They are smiling at their parents,2267923837.jpg#2,2267923837.jpg#2r1n,neutral,,,,
4,entailment,( Children ( ( ( smiling and ) waving ) ( at c...,( There ( ( are children ) present ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...,Children smiling and waving at camera,There are children present,2267923837.jpg#2,2267923837.jpg#2r1e,entailment,,,,


### 3.2 Tokenize

In [8]:
def clean(df, file_split):
    # Note: `df` is not used; the raw file for `file_split` is re-read from disk and cleaned.
    src_file_path = os.path.join(BASE_DATA_PATH, "raw/snli_1.0/snli_1.0_{}.txt".format(file_split))
    dest_dir = os.path.join(BASE_DATA_PATH, "clean/snli_1.0")
    os.makedirs(dest_dir, exist_ok=True)  # create the output folder if it does not exist
    dest_file_path = os.path.join(dest_dir, "snli_1.0_{}.txt".format(file_split))
    clean_df = snli.clean_snli(src_file_path).dropna()  # drop rows with any NaN values
    clean_df.to_csv(dest_file_path)
    return clean_df

train = clean(train, 'train')
dev = clean(dev, 'dev')
test = clean(test, 'test')

In [9]:
train_tok = to_nltk_tokens(to_lowercase(train))
dev_tok = to_nltk_tokens(to_lowercase(dev))
test_tok = to_nltk_tokens(to_lowercase(test))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abeswara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abeswara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abeswara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
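
To give a feel for what this step produces, here is a minimal sketch of lowercasing followed by tokenization on a single sentence. It uses a simple regular expression as a stand-in so that it runs without NLTK; it is not the implementation behind `to_lowercase` or `to_nltk_tokens`, and NLTK will split some inputs (contractions, for example) differently.

```python
import re


def lowercase_and_tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens.

    Regex stand-in for illustration only; the notebook itself uses NLTK.
    """
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


# Usage on one of the SNLI sentences shown earlier.
tokens = lowercase_and_tokenize("A person is outdoors, on a horse.")
print(tokens)
```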


GenSen also needs some model-specific preprocessing, which we run below.

In [10]:
gensen_preprocess(train_tok, dev_tok, test_tok, os.path.abspath(BASE_DATA_PATH))
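
The upload log later in this notebook lists files ending in `.s1.tok`, `.s2.tok`, and `.lab` for each split. One plausible reading is that they hold the first sentence, the second sentence, and the label, with one example per line. The sketch below writes files in that layout. This is an assumption inferred from the file suffixes only; `write_parallel_files` is a hypothetical helper and does not reproduce `gensen_preprocess`.

```python
import os
import tempfile


def write_parallel_files(prefix, examples):
    """Write tokenized sentence pairs and labels to line-aligned files.

    `examples` is a list of (sentence1_tokens, sentence2_tokens, label).
    The layout is an assumption inferred from the file suffixes; it is
    not taken from the gensen_preprocess source.
    """
    paths = {suffix: prefix + suffix for suffix in (".s1.tok", ".s2.tok", ".lab")}
    with open(paths[".s1.tok"], "w") as s1, \
            open(paths[".s2.tok"], "w") as s2, \
            open(paths[".lab"], "w") as lab:
        for tokens1, tokens2, label in examples:
            s1.write(" ".join(tokens1) + "\n")
            s2.write(" ".join(tokens2) + "\n")
            lab.write(label + "\n")
    return paths


# Usage with two toy examples written to a temporary directory.
examples = [
    (["children", "smiling"], ["there", "are", "children", "present"], "entailment"),
    (["children", "smiling"], ["they", "are", "frowning"], "contradiction"),
]
prefix = os.path.join(tempfile.mkdtemp(), "snli_1.0_dev.txt")
paths = write_parallel_files(prefix, examples)
print(sorted(os.path.basename(p) for p in paths.values()))
```

Keeping the three files line-aligned means that line `i` of each file describes the same example, which makes them easy to read back in parallel.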

To make the data available for remote training, we upload it from your local machine to Azure. The datastore is a convenient construct associated with your workspace for uploading and downloading data. You can also interact with it from your remote compute targets. It is backed by an Azure Blob storage account.

In [21]:
data_folder = os.path.join(BASE_DATA_PATH, "clean/snli_1.0/")
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

ds.upload(src_dir=data_folder, target_path='data', overwrite=True, show_progress=True)

AzureBlob maidapnlp0056795534 azureml-blobstore-09b72610-7938-4ed2-86a2-5004896b12d9
Uploading ../../../data\clean/snli_1.0/snli_1.0_dev.txt
Uploading ../../../data\clean/snli_1.0/snli_1.0_dev.txt.clean
Uploading ../../../data\clean/snli_1.0/snli_1.0_dev.txt.lab
Uploading ../../../data\clean/snli_1.0/snli_1.0_dev.txt.s1.tok
Uploading ../../../data\clean/snli_1.0/snli_1.0_dev.txt.s2.tok
Uploading ../../../data\clean/snli_1.0/snli_1.0_test.txt
Uploading ../../../data\clean/snli_1.0/snli_1.0_test.txt.clean
Uploading ../../../data\clean/snli_1.0/snli_1.0_test.txt.lab
Uploading ../../../data\clean/snli_1.0/snli_1.0_test.txt.s1.tok
Uploading ../../../data\clean/snli_1.0/snli_1.0_test.txt.s2.tok
Uploading ../../../data\clean/snli_1.0/snli_1.0_train.txt
Uploading ../../../data\clean/snli_1.0/snli_1.0_train.txt.clean
Uploading ../../../data\clean/snli_1.0/snli_1.0_train.txt.lab
Uploading ../../../data\clean/snli_1.0/snli_1.0_train.txt.s1.tok
Uploading ../../../data\clean/snli_1.0/snli_1.0_train

[Output truncated: the upload log continues with repeated `--- Logging error ---` tracebacks. Each reports `azure.common.AzureHttpError: Forbidden` (`HTTP Error 403. The request URL is forbidden.`) raised from `create_blob_from_path`, followed by `TypeError: not all arguments converted during string formatting` raised inside the logging module while formatting the `'Task Exception'` message.]

$AZUREML_DATAREFERENCE_974ecbc6b7d74bb190868e70fd13026f
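
Before uploading, it can help to preview which local files will be sent and where they are expected to land in the datastore. The sketch below walks a source folder and pairs each file with a blob path. It assumes that `upload` keeps each file's path relative to `src_dir` underneath `target_path`; `plan_upload` is a hypothetical helper and not an Azure ML API.

```python
import os
import posixpath
import tempfile


def plan_upload(src_dir, target_path):
    """Return (local_path, blob_path) pairs for every file under src_dir.

    Hypothetical helper: assumes blob paths are target_path plus the file's
    path relative to src_dir, always written with forward slashes.
    """
    pairs = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, src_dir)
            blob_path = posixpath.join(target_path, *relative.split(os.sep))
            pairs.append((local_path, blob_path))
    return sorted(pairs, key=lambda pair: pair[1])


# Usage with a temporary folder standing in for the cleaned SNLI data.
src_dir = tempfile.mkdtemp()
for name in ("snli_1.0_train.txt", "snli_1.0_dev.txt"):
    with open(os.path.join(src_dir, name), "w") as f:
        f.write("placeholder\n")

plan = plan_upload(src_dir, "data")
for local_path, blob_path in plan:
    print(local_path, "->", blob_path)
```

Comparing such a plan against the upload log is a quick way to spot files that failed to transfer.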

### 2.3 Create a training script

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In this notebook we load the files in ``./models/advanced/gensen/amlcode``. The directory contains all the code you want to submit to AmlCompute, including the training script and the helper utilities used by the GenSen model.

In [None]:
source_directory = 'models/advanced/gensen/amlcode'

Now you will need to create your training script. In this tutorial, the script for distributed training of GenSen is already provided for you at `train.py`. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.

The training script also uses Azure ML's [metric logging](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-track-experiments) capabilities. To use them, you add a small amount of Azure ML logic inside your training script. In this example, at each logging interval, we log the loss for that minibatch to our Azure ML run.

In [None]:
entry_script = 'train.py'
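
The following is a minimal sketch of the two pieces of Azure ML logic a training script typically needs: parsing the arguments it is launched with, and logging the loss at a fixed interval. It is not the provided `train.py`. The `--config` and `--data_folder` names mirror the `script_params` used later in this notebook, while `--log_interval`, the toy loss values, and the list-based logger are assumptions made so the sketch runs without the SDK.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments passed to the training script."""
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    parser.add_argument("--config", type=str, required=True,
                        help="Path to the JSON config file")
    parser.add_argument("--data_folder", type=str, required=True,
                        help="Folder that holds the training data")
    parser.add_argument("--log_interval", type=int, default=2,
                        help="Log every N minibatches (hypothetical flag)")
    return parser.parse_args(argv)


def train(minibatch_losses, log_fn, log_interval):
    """Log the loss of every `log_interval`-th minibatch via `log_fn`."""
    for step, loss in enumerate(minibatch_losses, start=1):
        if step % log_interval == 0:
            log_fn("loss", loss)


# Inside an Azure ML run, the logger would typically be obtained with:
#     from azureml.core import Run
#     run = Run.get_context()
#     log_fn = run.log
# Here a plain list records the calls so the sketch runs anywhere.
logged = []
args = parse_args(["--config", "example_config.json", "--data_folder", "/tmp/data"])
train([0.9, 0.7, 0.6, 0.5], lambda name, value: logged.append((name, value)),
      args.log_interval)
print(logged)
```

Passing the logger in as a function keeps the training loop independent of Azure ML, so the same loop can be exercised locally.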

## 3. Train model on the remote compute
Now that we have the setup ready, we can start training our model. 

### 3.1 Create an experiment
We start by creating an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed PyTorch tutorial.

In [None]:
from azureml.core import Experiment, get_run

experiment_name = 'pytorch-gensen'
experiment = Experiment(ws, name=experiment_name)

### 3.2 Create a PyTorch estimator

In this section, we create a PyTorch estimator that enables you to train your models at scale across CPU and GPU clusters of Azure VMs. You can run distributed PyTorch training with a few API calls, while Azure Machine Learning manages the infrastructure and orchestration behind the scenes.

The Azure ML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see the [documentation](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch).

In this example, we run distributed PyTorch using the Horovod framework on AzureML.

In [None]:
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator

script_params = {
    '--config': 'example_config.json',
    '--data_folder': ds.as_mount()}

estimator = PyTorch(source_directory=source_directory,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script=entry_script,
                    node_count=4,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True,
                    conda_packages=['scikit-learn=0.20.3']
                   )

The above code specifies that we will run our training script on `4` nodes, with one worker per node. To execute a distributed run on GPUs, you must provide the argument `use_gpu=True`. To execute a distributed run using MPI/Horovod, you must provide the argument `distributed_backend='mpi'`. With these settings, PyTorch, Horovod, and their dependencies are installed for you. The first time you run an experiment, it may take longer because the conda environment described in `.azureml/conda_dependencies.yml` has to be set up; later runs reuse the existing environment and run the code directly. If your script uses other packages, make sure to install them via the `PyTorch` constructor's `pip_packages` or `conda_packages` parameters. The additional required packages are recorded in the `.azureml/conda_dependencies.yml` file.
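
With `node_count=4` and `process_count_per_node=1`, the job runs four workers in total. A common data-parallel pattern is for each worker to process a disjoint slice of the examples, chosen by its rank. The sketch below illustrates that arithmetic in plain Python. It is not taken from `train.py`; in a Horovod script the rank and worker count would usually come from `hvd.rank()` and `hvd.size()`, and the strided split is just one possible scheme.

```python
def world_size(node_count, process_count_per_node):
    """Total number of workers launched for the job."""
    return node_count * process_count_per_node


def shard_indices(num_examples, rank, size):
    """Indices of the examples handled by worker `rank` out of `size` workers.

    Uses a strided split (rank, rank + size, rank + 2 * size, ...).
    """
    return list(range(rank, num_examples, size))


# Usage with the estimator settings above and a toy dataset of 10 examples.
size = world_size(node_count=4, process_count_per_node=1)
shards = [shard_indices(10, rank, size) for rank in range(size)]
for rank, shard in enumerate(shards):
    print("worker", rank, "->", shard)
```

Every example is assigned to exactly one worker, although the shard sizes can differ by one when the dataset size is not a multiple of the worker count.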

### 3.3 Submit job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [None]:
run = experiment.submit(estimator)
print(run)

### 3.4 Monitor your run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

Alternatively, you can block until the script has completed training before running more code.

In [None]:
run.wait_for_completion(show_output=True)  # this provides a verbose log
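
If you want to bound how long the notebook waits, a polling loop with a timeout is one alternative. The sketch below polls a status function until it reports a terminal state. `wait_until_done` is a hypothetical helper; with Azure ML you could pass `run.get_status` as the status function, and the set of terminal state names used here is an assumption based on the documented run statuses.

```python
import time

# Run states after which no further progress is expected (assumed set).
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_done(get_status, poll_seconds=15.0, timeout_seconds=3600.0):
    """Poll `get_status()` until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Run still in state {!r} after {} seconds".format(status, timeout_seconds)
            )
        time.sleep(poll_seconds)


# Usage with a fake status source; with Azure ML this would be `run.get_status`.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_until_done(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```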


### 3.5 Cancel the job.
It is better to cancel the job manually to make sure you do not waste resources.

In [None]:
# Cancel the job with id.
# job_id = "pytorch-gensen_1555533596_d9cc75fe"
# run = get_run(experiment, job_id)

# Cancel jobs.
run.cancel()
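
When cancelling from a script rather than by hand, it can be useful to check the run's state first so that finished runs are left alone. The sketch below does this for any object that exposes `get_status()` and `cancel()`, as an Azure ML run does. `cancel_if_active` and `FakeRun` are hypothetical names for illustration, and the list of active states is an assumption based on the documented run statuses.

```python
# States in which a run is still waiting for or consuming compute (assumed set).
ACTIVE_STATES = {"NotStarted", "Starting", "Provisioning", "Preparing", "Queued", "Running"}


def cancel_if_active(run):
    """Cancel `run` only if it has not finished; return True if cancel was called."""
    if run.get_status() in ACTIVE_STATES:
        run.cancel()
        return True
    return False


class FakeRun:
    """Stand-in exposing the two methods the helper needs."""

    def __init__(self, status):
        self.status = status
        self.cancel_calls = 0

    def get_status(self):
        return self.status

    def cancel(self):
        self.cancel_calls += 1
        self.status = "Canceled"


# Usage: an active run is cancelled, a completed run is left untouched.
running = FakeRun("Running")
finished = FakeRun("Completed")
print(cancel_if_active(running), running.get_status())
print(cancel_if_active(finished), finished.get_status())
```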