# Horovod-Infiniband-Benchmark


## Introduction

This recipe shows how to reproduce [Horovod distributed training benchmarks](https://github.com/uber/horovod/blob/master/docs/benchmarks.md) using Azure Batch AI.

The job is configured through the `horovod_settings` of a Batch AI job, while the Horovod framework itself and its dependencies are installed on the nodes by a job preparation command line.


## Details

- The official Horovod benchmark [scripts](https://github.com/alsrgv/benchmarks/tree/master/scripts/tf_cnn_benchmarks) are used.
- The job runs in the standard TensorFlow container ```tensorflow/tensorflow:1.8.0-gpu```.
- Horovod and Intel MPI are installed in the container by the job preparation command line. Alternatively, you can build your own Docker image containing TensorFlow and Horovod.
- The benchmark scripts are also downloaded to the GPU nodes by the job preparation command line.
- This sample needs at least two `STANDARD_NC24r` nodes, so make sure you have enough quota.
- The job's standard output is stored on an Azure File Share.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

from datetime import datetime
import sys

from azure.storage.file import FileService
import azure.mgmt.batchai.models as models

# The BatchAI/utilities folder contains helper functions used by different notebooks
sys.path.append('../../../')
import utilities as utils

# The module is imported as `utils`, so the bare name `utilities` is undefined here
cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)
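
For orientation, the sketch below shows one way a configuration object like `cfg` could be built. It is not the recipe's `utilities` implementation: the function names `parse_config` and `load_config` and the flat JSON layout are assumptions made for illustration, and only the attribute names are taken from how `cfg` is used in this notebook.

```python
import json
from types import SimpleNamespace

# Attribute names are taken from how `cfg` is used in this recipe. The flat
# JSON layout is an assumption, not the real configuration.json schema.
REQUIRED_KEYS = ('resource_group', 'workspace', 'location',
                 'storage_account_name', 'storage_account_key', 'admin')


def parse_config(data):
    """Validate a dict of settings and expose its keys as attributes."""
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError('Missing configuration values: ' + ', '.join(missing))
    values = dict(data)
    # Optional credentials default to None, matching `cfg.admin_password or None`.
    values.setdefault('admin_password', None)
    values.setdefault('admin_ssh_key', None)
    return SimpleNamespace(**values)


def load_config(path):
    """Read a JSON file (hypothetical flat layout) and validate it."""
    with open(path) as f:
        return parse_config(json.load(f))
```

Failing early on a missing value is usually easier to debug than an authentication or deployment error raised later by the service.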

Create the resource group and Batch AI workspace if they do not exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create File Share

For this example we will create a new File Share named `batchaisample` under your storage account. This share will be populated with sample scripts and will contain the job's output.

**Note** You don't need to create a new file share for every cluster. We do it in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')
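
If you pick a different share name, it has to follow the Azure file share naming rules: 3 to 63 characters, lowercase letters, digits and hyphens only, starting and ending with a letter or digit, and no consecutive hyphens. The helper below is a small sketch for checking a name locally before calling `create_share`; `is_valid_share_name` is a hypothetical name, not part of the Azure SDK.

```python
import re

# Lowercase letters, digits and hyphens, starting and ending with a letter or digit.
_SHARE_NAME_PATTERN = re.compile(r'[a-z0-9](?:[a-z0-9-]*[a-z0-9])?')


def is_valid_share_name(name):
    """Return True if `name` satisfies the Azure file share naming rules."""
    if not 3 <= len(name) <= 63:
        return False
    if '--' in name:
        # Consecutive hyphens are not allowed.
        return False
    return _SHARE_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_share_name('batchaisample')` is true, while a name containing uppercase letters is rejected.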

### Deploy Job Preparation Script to Azure File Share

Create a folder in the file share and upload the sample script to it.

In [None]:
samples_dir = 'horovod_samples'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, samples_dir, fail_on_exist=False)
print('Done')

Upload the job preparation script, which performs the following tasks:
- Installs the essential packages for InfiniBand support
- Downloads the benchmark sample
- Installs the Intel MPI binary
- Installs the Horovod framework

In [None]:
service.create_file_from_path(
    azure_file_share_name, samples_dir, 'jobprep_benchmark.sh', 'jobprep_benchmark.sh')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC24r` nodes, which are equipped with an InfiniBand device. The number of nodes in the cluster is set with the `nodes_count` variable; 2 nodes are used by default.
- Make sure you have enough core quota to create at least two `STANDARD_NC24r` nodes.
- We need to use the latest `UbuntuServer 16.04-LTS` as the host image, which is compatible with InfiniBand.
- We will call the cluster `nc24r`.
- If you would like to compare performance against a TCP network, you can create the cluster with the VM size `STANDARD_NC24`, which does not support InfiniBand.

So, the cluster will have the following parameters:

In [None]:
nodes_count = 2
cluster_name = 'nc24r'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC24r',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

The `utilities` module contains a helper function that prints the cluster status. The cluster is available once all nodes are allocated and have finished preparation.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)
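
The cell above prints the status once. If you prefer to block until the cluster is ready, a polling loop along the following lines could be used. This is a sketch, not part of the `utilities` module: `wait_for_cluster` is a hypothetical helper, and it assumes the cluster object exposes `allocation_state` and `node_state_counts.idle_node_count`, which you should verify against the SDK version you have installed.

```python
import time


def wait_for_cluster(get_cluster, expected_nodes, timeout_sec=1800,
                     poll_sec=15, sleep=time.sleep):
    """Poll `get_cluster()` until the cluster is steady with enough idle nodes.

    `get_cluster` is any zero-argument callable returning the current cluster
    object; attribute names used below are assumptions about that object.
    """
    waited = 0
    while True:
        cluster = get_cluster()
        # str() covers both a plain string and a str-based enum value.
        steady = str(cluster.allocation_state).endswith('steady')
        idle = cluster.node_state_counts.idle_node_count
        if steady and idle >= expected_nodes:
            return cluster
        if waited >= timeout_sec:
            raise RuntimeError(
                'Cluster not ready after {0} seconds'.format(waited))
        sleep(poll_sec)
        waited += poll_sec
```

With the names from this notebook it could be called as `wait_for_cluster(lambda: client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name), nodes_count)`.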

## 3. Run Azure Batch AI Training Job

### Configure Job

- We will use the input and output directories configured previously.
- We will use the job's `horovod_settings` to run `tf_cnn_benchmarks.py` on multiple nodes (use the `node_count` parameter to specify the number of nodes). Note that Batch AI creates a host list for the job, which is available in the ```$AZ_BATCH_HOST_LIST``` environment variable.
- Horovod, Intel MPI and the benchmark scripts are installed by the job preparation command line.
- Standard output and error streams are written to the file share.
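
If a script running inside the job needs the host list, it can read the environment variable mentioned above. The sketch below assumes the value is a comma-separated list of host addresses; `read_host_list` is a hypothetical helper, and the exact format should be checked on a real node.

```python
import os


def read_host_list(env=os.environ):
    """Return the hosts listed in AZ_BATCH_HOST_LIST (assumed comma-separated)."""
    raw = env.get('AZ_BATCH_HOST_LIST', '')
    # Drop empty entries so a missing or empty variable yields an empty list.
    return [host.strip() for host in raw.split(',') if host.strip()]
```

Passing the environment as a parameter keeps the function easy to try outside a Batch AI job, for example `read_host_list({'AZ_BATCH_HOST_LIST': '10.0.0.4,10.0.0.5'})`.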

In [None]:
azure_file_share = 'afs'
parameters = models.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(id=cluster.id),
    node_count=2,
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share),
    mount_volumes=models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='tensorflow/tensorflow:1.8.0-gpu')),
    job_preparation=models.JobPreparation(
        command_line='bash $AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/jobprep_benchmark.sh'.format(azure_file_share, samples_dir)),
    horovod_settings=models.HorovodSettings(
        python_script_file_path='$AZ_BATCHAI_JOB_TEMP/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py',
        command_line_args='--model resnet101 --batch_size 64 --variable_update horovod',
        process_count=8))
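
The value `process_count=8` in the job parameters above corresponds to one Horovod process per GPU: each `STANDARD_NC24r` node has 4 GPUs, and the job uses 2 nodes. The sketch below makes that arithmetic explicit so the value can be recomputed if you change the node count; `GPUS_PER_NODE` and `horovod_process_count` are illustrative names, not part of the Batch AI SDK.

```python
# GPU counts per VM size for the two sizes mentioned in this recipe.
GPUS_PER_NODE = {
    'STANDARD_NC24r': 4,
    'STANDARD_NC24': 4,
}


def horovod_process_count(vm_size, node_count):
    """Return one process per GPU across all nodes of the job."""
    return GPUS_PER_NODE[vm_size] * node_count


# Two NC24r nodes with 4 GPUs each give the 8 processes used in this recipe.
print(horovod_process_count('STANDARD_NC24r', 2))
```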


### Create a training Job and wait for Job completion

- Wait for the job to complete while streaming the stdout log.
- When the job completes, the number of images processed per second appears at the end of the log.

In [None]:
experiment_name = 'horovod_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime('hvdbenchmark_%m_%d_%Y_%H%M%S')
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job {0} in Experiment {1}'.format(job.name, experiment.name))

### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start, printing the cluster state in the meantime. While the job runs, the code prints the current content of stdout.txt.

**Note** Execution may take several minutes to complete.

In [None]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace, 
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout.txt')
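
Once the log is available, the throughput figures can be pulled out of it programmatically. The sketch below assumes the benchmark script reports throughput on lines containing `total images/sec: <number>`; that format and the helper name `parse_total_images_per_sec` are assumptions to check against your actual stdout.txt. With Horovod, several processes may each print such a line, so the function returns all matches.

```python
import re

# Assumed summary line format: "total images/sec: 123.45"
_TOTAL_PATTERN = re.compile(r'total images/sec:\s*([0-9]+(?:\.[0-9]+)?)')


def parse_total_images_per_sec(log_text):
    """Return every 'total images/sec' value found in the log, in order."""
    return [float(value) for value in _TOTAL_PATTERN.findall(log_text)]
```

How the per-process values should be combined (for example summed) depends on how the script reports them, so inspect the log before drawing conclusions from the numbers.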

### List stdout.txt and stderr.txt files for the Job and job preparation command

In [None]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='stdouterr')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')

## 4. Clean Up (Optional)

### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, cfg.workspace, experiment_name, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)