# PyTorch Distributed GPU Using Gloo


## Introduction

This example demonstrates how to run distributed GPU training for PyTorch using the Gloo backend in Batch AI.

## Details

- The Gloo backend uses the Batch AI shared job temporary directory, which is visible to all GPU nodes in the job.
- The Batch AI generated `AZ_BATCHAI_PYTORCH_INIT_METHOD` is used for shared file-system initialization.
- The Batch AI generated `AZ_BATCHAI_TASK_INDEX` is used as the rank of each worker process.
- Standard output of the job is stored on an Azure File Share.
- The PyTorch training script [mnist_trainer.py](./mnist_trainer.py) is attached; it trains a CNN on the MNIST dataset.
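
As a rough illustration of the two environment variables above, the sketch below shows one way a worker could turn them into arguments for `torch.distributed.init_process_group`. The helper name `build_dist_kwargs` and the sample `file://` path are hypothetical. In this recipe the values actually reach `mnist_trainer.py` through command-line arguments rather than being read from the environment inside the script.

```python
import os


def build_dist_kwargs(world_size, env=None):
    """Collect keyword arguments for torch.distributed.init_process_group.

    Assumption: the process runs inside a Batch AI job, which sets both
    AZ_BATCHAI_PYTORCH_INIT_METHOD and AZ_BATCHAI_TASK_INDEX.
    """
    env = os.environ if env is None else env
    return {
        'backend': 'gloo',
        # Shared file-system initialization location generated by Batch AI.
        'init_method': env['AZ_BATCHAI_PYTORCH_INIT_METHOD'],
        # The task index doubles as the rank of this worker process.
        'rank': int(env['AZ_BATCHAI_TASK_INDEX']),
        'world_size': world_size,
    }


# Hypothetical values, for illustration only.
sample_env = {
    'AZ_BATCHAI_PYTORCH_INIT_METHOD': 'file:///hypothetical/shared/tmp/init',
    'AZ_BATCHAI_TASK_INDEX': '1',
}
kwargs = build_dist_kwargs(world_size=2, env=sample_env)
print(kwargs['rank'], kwargs['world_size'])

# Inside a worker with PyTorch installed, the call would then be:
#   import torch.distributed as dist
#   dist.init_process_group(**kwargs)
```

The `torch` call is left as a comment so the sketch runs on a machine without PyTorch or GPUs.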

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

import time
from datetime import datetime
import os
import sys

from azure.storage.file import FileService, FilePermissions
import azure.mgmt.batchai.models as models

# The BatchAI/utilities folder contains helper functions used by different notebooks
sys.path.append('../../../')
import utilities as utils

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)

Create the resource group and Batch AI workspace if they do not exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. It will be used to share the *training script file* and *output file*.

**Note** You don't need to create a new file share for every cluster. We do it in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)

### Deploy Sample Script
For each job we will create a folder containing a copy of the script [mnist_trainer.py](./mnist_trainer.py).

In [None]:
pyTorchSamples = "PyTorchSamples"
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, pyTorchSamples, fail_on_exist=False)
service.create_file_from_path(
    azure_file_share_name, pyTorchSamples, 'mnist_trainer.py', 'mnist_trainer.py')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster
- For this example we will use a GPU cluster of 2 `STANDARD_NC6` nodes. You can increase the number of nodes by changing the `nodes_count` variable.
- We will call the cluster `nc6`.

So, the cluster will have the following parameters:

In [None]:
nodes_count = 2
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Get the newly created cluster. The `utilities` module contains a helper function that prints all of the cluster's node counts.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

## 3. Run Azure Batch AI Training Job

### Configure Job
- The job will use the `pytorch/pytorch:0.4_cuda9_cudnn7` container.
- It will use the previously configured file share for its input script and its output.
- It will mount the file share at a folder named `afs`. The full path of this folder on a compute node will be `$AZ_BATCHAI_JOB_MOUNT_ROOT/afs`.
- It will run the modified `mnist_trainer.py` from the `PyTorchSamples` folder on the mounted file share.
- It will write the standard output and error streams to the file share.
- It will use `gloo` as the PyTorch distributed backend, with the Batch AI generated `AZ_BATCHAI_PYTORCH_INIT_METHOD` for shared file-system initialization.
- It will use the Batch AI generated `AZ_BATCHAI_TASK_INDEX` as the rank of each worker process.
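
The path and argument conventions in the list above can be sketched with two small helpers. The names `script_path` and `trainer_args` are hypothetical and not part of the Batch AI SDK; they only assemble the same strings that the job definition in this section passes to `PyTorchSettings`.

```python
def script_path(mount_name, folder, script='mnist_trainer.py'):
    """Path of the training script as referenced in the job definition."""
    # $AZ_BATCHAI_JOB_MOUNT_ROOT is kept as a literal placeholder here,
    # exactly as it appears in the job parameters.
    return '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(mount_name, folder, script)


def trainer_args(world_size, epochs=10, backend='gloo'):
    """Command-line arguments for the trainer, using the flags from this recipe."""
    return ' '.join([
        '--epochs {0}'.format(epochs),
        '--world-size {0}'.format(world_size),
        '--dist-backend {0}'.format(backend),
        '--dist-url $AZ_BATCHAI_PYTORCH_INIT_METHOD',
        '--rank $AZ_BATCHAI_TASK_INDEX',
    ])


print(script_path('afs', 'PyTorchSamples'))
print(trainer_args(world_size=2))
```

Deriving `--world-size` from a single value keeps it in step with the number of nodes requested for the job.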


In [None]:
azure_file_share = 'afs'
parameters = models.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(id=cluster.id),
    # Number of nodes; also passed to the script as --world-size
    node_count=nodes_count,
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share),
    mount_volumes=models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='pytorch/pytorch:0.4_cuda9_cudnn7')),
    py_torch_settings=models.PyTorchSettings(
        python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/mnist_trainer.py'.format(
            azure_file_share, pyTorchSamples),
        command_line_args='--epochs 10 --world-size {0} --dist-backend gloo '
                          '--dist-url $AZ_BATCHAI_PYTORCH_INIT_METHOD '
                          '--rank $AZ_BATCHAI_TASK_INDEX'.format(nodes_count),
        communication_backend='gloo'))
 

### Create a Training Job


In [None]:
experiment_name = 'pytorch_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime('pytorch_%m_%d_%Y_%H%M%S')
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job {0} in Experiment {1}'.format(job.name, experiment.name))

### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start running while printing the cluster state. While the job runs, the code prints the current content of `stdout-0.txt`.

**Note** Execution may take several minutes to complete. Due to a known bug in the PyTorch Gloo backend, the job may fail with the following error, as [reported](https://github.com/pytorch/pytorch/issues/2530):
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /pytorch/torch/lib/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /pytorch/torch/lib/gloo/gloo/cuda.cu:249: driver shutting down
```
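
If you want to tell this known failure apart from other errors when reading the job's error output, a simple text check is one option. The sketch below is only an illustration; the helper name `is_known_gloo_shutdown_error` and the marker tuple are hypothetical and are not part of the `utilities` module.

```python
# Substrings taken from the error message quoted above.
KNOWN_GLOO_MARKERS = ('gloo::EnforceNotMet', 'driver shutting down')


def is_known_gloo_shutdown_error(stderr_text):
    """Return True if the text contains every marker of the error quoted above."""
    return all(marker in stderr_text for marker in KNOWN_GLOO_MARKERS)


sample = (
    "terminate called after throwing an instance of 'gloo::EnforceNotMet'\n"
    "  what():  [enforce fail at /pytorch/torch/lib/gloo/gloo/cuda.cu:249] "
    "error == cudaSuccess. 29 vs 0. Error at: "
    "/pytorch/torch/lib/gloo/gloo/cuda.cu:249: driver shutting down\n"
)
known = is_known_gloo_shutdown_error(sample)
other = is_known_gloo_shutdown_error('RuntimeError: CUDA out of memory')
print(known, other)
```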



In [None]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace,
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout-0.txt')
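
The helper used above lives in the `utilities` module and its implementation is not shown here. As a minimal sketch of the general idea, a polling loop might look like the following. The function `wait_for_terminal_state`, its parameters, and the state names are assumptions for illustration; with the Batch AI client, `get_state` could wrap a call that fetches the job and returns its execution state.

```python
import time


def wait_for_terminal_state(get_state, terminal_states=('succeeded', 'failed'),
                            poll_interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or the timeout expires.

    get_state is a hypothetical zero-argument callable returning a state string.
    """
    waited = 0.0
    while True:
        state = get_state()
        print('Job state: {0}'.format(state))
        if state in terminal_states:
            return state
        if waited >= timeout:
            raise RuntimeError('Timed out waiting for the job to finish')
        sleep(poll_interval)
        waited += poll_interval


# Simulated state sequence, so the sketch runs without a cluster.
states = iter(['queued', 'running', 'running', 'succeeded'])
final_state = wait_for_terminal_state(lambda: next(states), sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.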

### List stdout.txt and stderr.txt files for the Job

In [None]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='stdouterr')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')

## 4. Clean Up (Optional)

### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, cfg.workspace, experiment_name, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)