# Chainer GPU Distributed-Infiniband


## Introduction

This example demonstrates how to run the standard ChainerMN [train_mnist.py](https://github.com/chainer/chainermn/blob/master/examples/mnist/train_mnist.py) script as a distributed training job on Batch AI with Infiniband enabled.

## Details

- The standard Chainer sample script [train_mnist.py](https://github.com/chainer/chainermn/blob/master/examples/mnist/train_mnist.py) is used;
- Chainer downloads the standard MNIST database on its own and distributes it across workers;
- Standard output of the job and the model will be stored on an Azure File Share;
- IntelMPI (non-CUDA-aware) will be used to launch ChainerMN jobs across nodes.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [87]:
from __future__ import print_function

from datetime import datetime
import sys

from azure.storage.file import FileService
import azure.mgmt.batchai.models as models

# The BatchAI/utilities folder contains helper functions used by different notebooks
sys.path.append('../../../')
import utilities as utils

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)
utils.config.create_resource_group(cfg)



Create the Resource Group and Batch AI workspace if they do not exist:

In [88]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()



## 1. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC24r` nodes, which are equipped with an Infiniband device. The number of nodes in the cluster is configured with the `nodes_count` variable; 2 nodes are used by default.
- Please make sure you have enough core quota to create at least two `STANDARD_NC24r` nodes.
- We will call the cluster `nc24r`.
- If you would like to compare performance against a TCP network, you can create the cluster with the VM size `STANDARD_NC24`, which does not support Infiniband.

So, the cluster will have the following parameters:

In [89]:
nodes_count = 2
cluster_name = 'nc24r'

parameters = models.ClusterCreateParameters(
    vm_size='STANDARD_NC24r',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None
    )
)

### Create Compute Cluster

In [90]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [91]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

Cluster state: steady Target: 2; Allocated: 2; Idle: 2; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
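The status line above is a single snapshot. Outside a notebook it can be handy to poll until every node is idle before submitting a job. The following is a minimal sketch, not part of the `utilities` module: `wait_for_idle_nodes` and the `get_cluster` callable are hypothetical names, and the attribute names `allocation_state` and `node_state_counts.idle_node_count` are assumed to be the fields behind the printed status.

```python
import time
from types import SimpleNamespace


def wait_for_idle_nodes(get_cluster, target, poll_seconds=30, max_polls=60):
    """Poll get_cluster() until the cluster is steady with `target` idle nodes.

    get_cluster is any zero-argument callable returning an object with
    `allocation_state` and `node_state_counts.idle_node_count` attributes
    (assumed field names, not confirmed by this notebook).
    Returns the last snapshot, or raises RuntimeError after max_polls tries.
    """
    for attempt in range(max_polls):
        cluster = get_cluster()
        idle = cluster.node_state_counts.idle_node_count
        if cluster.allocation_state == 'steady' and idle >= target:
            return cluster
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    raise RuntimeError('cluster did not reach {0} idle nodes'.format(target))


# Stand-in snapshots so the sketch runs without an Azure subscription.
_snapshots = iter([
    SimpleNamespace(allocation_state='resizing',
                    node_state_counts=SimpleNamespace(idle_node_count=0)),
    SimpleNamespace(allocation_state='steady',
                    node_state_counts=SimpleNamespace(idle_node_count=2)),
])
ready = wait_for_idle_nodes(lambda: next(_snapshots), target=2, poll_seconds=0)
print(ready.allocation_state, ready.node_state_counts.idle_node_count)
```

With the real client, `get_cluster` could be `lambda: client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)`.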


## 2. Prepare Training Script in Azure Storage

### Create File Share

For this example we will create a new File Share with name `batchaisample` under your storage account.

**Note** You don't need to create a new file share for every cluster. We do so in this sample only to simplify resource management for you.

In [92]:
azure_file_share_name = 'batchaisample'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')

Done


### Deploy Sample Script

Download original sample script

In [93]:
script_to_deploy = 'train_mnist.py'
utils.dataset.download_file('https://raw.githubusercontent.com/chainer/chainermn/v1.3.0/examples/mnist/train_mnist.py', script_to_deploy)

Downloading https://raw.githubusercontent.com/chainer/chainermn/v1.3.0/examples/mnist/train_mnist.py ...Done


We will create a folder on Azure File Share containing a copy of the sample script

In [94]:
samples_dir = 'chainer_samples'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, samples_dir, fail_on_exist=False)
service.create_file_from_path(
    azure_file_share_name, samples_dir, script_to_deploy, script_to_deploy)

Upload the job preparation script [jobprep.sh](./jobprep.sh), which installs the IntelMPI binary:

In [None]:
service.create_file_from_path(
    azure_file_share_name, samples_dir, 'jobprep.sh', 'jobprep.sh')

### Configure Job
- The job will use the `batchaitraining/chainermn:IntelMPI` container. The Dockerfile can be found [here](./dockerfile);
- We will mount the file share at a folder named `afs`. The full path of this folder on a compute node will be `$AZ_BATCHAI_JOB_MOUNT_ROOT/afs`;
- The job will run `train_mnist.py` from the `chainer_samples` folder on the mounted file share;
- Standard output and error streams will be written to the file share;
- The IntelMPI binary will be installed via the Job Preparation task;
- Model files will be written to the MODEL output directory.
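The paths used in the job configuration are all built the same way: the mount root variable, then the relative mount path, then the location inside the share. The sketch below only illustrates that composition; `mounted_path` is a hypothetical helper, and the literal folder and file names are the ones used earlier in this notebook.

```python
import posixpath

# Environment variable referenced by the job configuration; it is resolved
# on the compute node, so it is kept here as a literal string.
MOUNT_ROOT = '$AZ_BATCHAI_JOB_MOUNT_ROOT'


def mounted_path(relative_mount_path, *parts):
    """Join path components below the file share mount point."""
    return posixpath.join(MOUNT_ROOT, relative_mount_path, *parts)


script_path = mounted_path('afs', 'chainer_samples', 'train_mnist.py')
prep_command = 'bash ' + mounted_path('afs', 'chainer_samples', 'jobprep.sh')
print(script_path)
print(prep_command)
```

`posixpath` is used rather than `os.path` so the result uses forward slashes even when the notebook itself runs on Windows.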


In [82]:
azure_file_share = 'afs'
parameters = models.JobCreateParameters(
     cluster=models.ResourceId(id=cluster.id),
     node_count=2,
     mount_volumes=models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share)
        ]
     ),  
     std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share),
     output_directories=[
        models.OutputDirectory(
            id='MODEL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share),
            path_suffix='Models')],
     job_preparation=models.JobPreparation(
         command_line='bash $AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/jobprep.sh'.format(azure_file_share, samples_dir)),
     container_settings=models.ContainerSettings(
         image_source_registry=models.ImageSourceRegistry(image='batchaitraining/chainermn:IntelMPI')),
     chainer_settings=models.ChainerSettings(
         python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(azure_file_share, samples_dir, script_to_deploy),
         command_line_args='-g --communicator non_cuda_aware -o $AZ_BATCHAI_OUTPUT_MODEL',
         process_count=8
     ))
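The value `process_count=8` is consistent with ChainerMN's one-process-per-GPU model: 2 nodes with 4 GPUs each. The sketch below shows how that number could be derived if the node count changes. The GPU counts in the table are assumptions that should be checked against the Azure VM size documentation, and `mpi_process_count` is a hypothetical helper.

```python
# Assumed GPU counts per VM size; verify against the Azure VM size docs.
GPUS_PER_NODE = {'STANDARD_NC24r': 4, 'STANDARD_NC24': 4}


def mpi_process_count(vm_size, node_count):
    """Return one MPI process per GPU across all nodes."""
    return GPUS_PER_NODE[vm_size] * node_count


print(mpi_process_count('STANDARD_NC24r', 2))  # 8
```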

### Create a Training Job


In [83]:
experiment_name = 'chainermn_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime('chainer_%m_%d_%Y_%H%M%S')
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job {0} in Experiment {1}'.format(job.name, experiment.name))

Created Job chainer_07_25_2018_071026 in Experiment chainermn_experiment
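The job name encodes the UTC submission time, which keeps names unique across runs as long as two jobs are not submitted within the same second. As a small illustration, the same format string can be used to recover the submission time from a job name; `make_job_name` and `parse_job_time` are hypothetical helpers, not part of the `utilities` module.

```python
from datetime import datetime

JOB_NAME_FORMAT = 'chainer_%m_%d_%Y_%H%M%S'


def make_job_name(now):
    """Format a datetime into the job naming scheme used in this notebook."""
    return now.strftime(JOB_NAME_FORMAT)


def parse_job_time(job_name):
    """Recover the submission time encoded in a job name."""
    return datetime.strptime(job_name, JOB_NAME_FORMAT)


name = make_job_name(datetime(2018, 7, 25, 7, 10, 26))
print(name)
print(parse_job_time(name).isoformat())
```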


### Wait for Job to Finish
The job will start running once the cluster has enough idle nodes. The following code waits for the job to start while printing the cluster state. While the job runs, the code prints the current content of stdout.txt.

**Note** Execution may take several minutes to complete.

In [84]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace, 
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout.txt')

Cluster state: steady Target: 2; Allocated: 2; Idle: 0; Unusable: 0; Running: 2; Preparing: 0; Leaving: 0
Job state: running ExitCode: None
Waiting for job output to become available...
Num process (COMM_WORLD): 8
Using GPUs
Using non_cuda_aware communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.394735    0.153702              0.880933       0.954712                  9.935
     total [###...............................................]  6.67%
this epoch [################..................................] 33.33%
       100 iter, 1 epoch / 20 epochs
       inf iters/sec. Estimated time to finish: 0:00:00.
2           0.12691     0.102421              0.963467       0.967019                  11.5765
     total [######............................................] 13.33%
this epoch [#################################................

### List stdout.txt and stderr.txt files for the Job

In [18]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='stdouterr')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')

execution-tvm-913932285_1-20180725t045847z.log https://stgacc07062018192335.file.core.windows.net/batchaisample/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchaitestrgnortheuro/workspaces/pgunda/experiments/chainermn_experiment/jobs/chainer_07_25_2018_052122/358c8acf-1348-43ea-97c2-eb3389885a48/stdouterr/execution-tvm-913932285_1-20180725t045847z.log?sv=2016-05-31&sr=f&sig=XXaVGOLgBTr%2FApV7j2tmBq3qpUKhw86T%2F%2BR6aDKXTUg%3D&se=2018-07-25T06%3A28%3A46Z&sp=rl
execution-tvm-913932285_2-20180725t045847z.log https://stgacc07062018192335.file.core.windows.net/batchaisample/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchaitestrgnortheuro/workspaces/pgunda/experiments/chainermn_experiment/jobs/chainer_07_25_2018_052122/358c8acf-1348-43ea-97c2-eb3389885a48/stdouterr/execution-tvm-913932285_2-20180725t045847z.log?sv=2016-05-31&sr=f&sig=XxIrNPcBDk3vJfwNw%2BOGu%2FT0sTHIm4oi6wrlb6jRHwE%3D&se=2018-07-25T06%3A28%3A46Z&sp=rl
stderr-job_prep-tvm-913932285_1-20180725t045847z.txt https://stgacc07062018192335.file
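Each listed file carries a time-limited download URL, while directories have none (hence the `or 'directory'` fallback above). The following sketch shows one way to save the listed files locally. `download_outputs` is a hypothetical helper, and the stand-in listing uses a local `file://` URL so the example runs offline; with a real listing you would pass `list(files)` instead.

```python
import os
import shutil
import tempfile
from pathlib import Path
from types import SimpleNamespace
from urllib.request import urlopen


def download_outputs(files, dest_dir):
    """Save every entry that has a download_url into dest_dir.

    Entries without a download_url (directories) are skipped.
    Returns the list of local paths that were written.
    """
    saved = []
    os.makedirs(dest_dir, exist_ok=True)
    for f in files:
        if not f.download_url:
            continue
        target = os.path.join(dest_dir, f.name)
        with urlopen(f.download_url) as response, open(target, 'wb') as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved


# Stand-in listing backed by a local file so the sketch needs no network.
src = Path(tempfile.mkdtemp()) / 'stdout.txt'
src.write_text('Num process (COMM_WORLD): 8\n')
listing = [
    SimpleNamespace(name='stdout.txt', download_url=src.as_uri()),
    SimpleNamespace(name='subdir', download_url=None),
]
saved = download_outputs(listing, tempfile.mkdtemp())
print([os.path.basename(p) for p in saved])
```

Because the URLs expire, it is safer to list the files again shortly before downloading than to reuse URLs from an earlier listing.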

### Enumerate Model Output
Earlier we configured the job with an output directory whose `id` is `MODEL` for model output. The following code enumerates its contents.

In [86]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='MODEL')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')



cg.dot https://stgacc07062018192335.file.core.windows.net/batchaisample/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchaitestrgnortheuro/workspaces/pgunda/experiments/chainermn_experiment/jobs/chainer_07_25_2018_071026/888531e9-f8af-47e0-8fe6-fa1714147bc9/outputs/Models/cg.dot?sv=2016-05-31&sr=f&sig=INP00Jl%2Bef%2FnBSeMhL0a7vcjITh4qa7t8%2FtKhFenkAU%3D&se=2018-07-25T21%3A02%3A12Z&sp=rl
log https://stgacc07062018192335.file.core.windows.net/batchaisample/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchaitestrgnortheuro/workspaces/pgunda/experiments/chainermn_experiment/jobs/chainer_07_25_2018_071026/888531e9-f8af-47e0-8fe6-fa1714147bc9/outputs/Models/log?sv=2016-05-31&sr=f&sig=3A3%2FY3ozgorKV62SeUQvIj0ChpblE%2Fcn4GNWoC9xMAk%3D&se=2018-07-25T21%3A02%3A12Z&sp=rl
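The `log` file is produced by the training script's reporting extension. Assuming it follows Chainer's `LogReport` layout (a JSON list with one dictionary per reporting interval, keyed by the column names shown in the training output), the best epoch could be picked out as sketched below. `best_epoch` is a hypothetical helper, and the sample entries reuse the first two rows printed during the run above.

```python
import json

# Sample content in the assumed LogReport layout, using the first two
# epochs printed in the training output of this notebook.
sample_log = json.dumps([
    {'epoch': 1, 'main/loss': 0.394735, 'validation/main/loss': 0.153702,
     'main/accuracy': 0.880933, 'validation/main/accuracy': 0.954712,
     'elapsed_time': 9.935},
    {'epoch': 2, 'main/loss': 0.12691, 'validation/main/loss': 0.102421,
     'main/accuracy': 0.963467, 'validation/main/accuracy': 0.967019,
     'elapsed_time': 11.5765},
])


def best_epoch(log_text, key='validation/main/accuracy'):
    """Return the log entry with the highest value for `key`."""
    entries = json.loads(log_text)
    return max(entries, key=lambda entry: entry[key])


best = best_epoch(sample_log)
print(best['epoch'], best['validation/main/accuracy'])
```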


### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, cfg.workspace, experiment_name, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs you can delete the file share completely with all files using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)