# Distributed TensorFlow GPU


## Introduction

This example demonstrates how to run the standard TensorFlow sample (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py) on an Azure Batch AI cluster of 2 nodes.

## Details

- For demonstration purposes, the MNIST dataset and mnist_replica.py will be deployed to an Azure File Share;
- Standard output of the job will be stored on the Azure File Share;
- MNIST dataset (http://yann.lecun.com/exdb/mnist/) is archived and uploaded into the blob https://batchaisamples.blob.core.windows.net/samples/mnist_dataset_original.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=b&sig=Qc1RA3zsXIP4oeioXutkL1PXIrHJO0pHJlppS2rID3I%3D.
- The recipe uses the official [mnist_replica.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py) script.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [2]:
from __future__ import print_function

import time
from datetime import datetime
import os
import sys
import zipfile

from azure.storage.file import FileService, FilePermissions
import azure.mgmt.batchai.models as models

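# utilities.py contains helper functions used by different notebooks.
# Forward slashes work on both Windows and Linux and avoid the
# invalid '\.' escape sequence in the string literal.
sys.path.append('../..')
import utilities

cfg = utilities.Configuration('../../configuration.json')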
client = utilities.create_batchai_client(cfg)
utilities.create_resource_group(cfg)

### Create File Share

For this example we will create a new File Share with name `batchaisample` under your storage account.

**Note** You don't need to create a new file share for every cluster. We do it in this sample to simplify resource management for you.

In [3]:
azure_file_share_name = 'batchaisample'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)

False

### Configure Compute Cluster

- For this example we will use a GPU cluster of 2 `STANDARD_NC6` nodes. You can increase the number of nodes by changing the `nodes_count` variable;
- We will mount the file share at a folder named `external`. The full path of this folder on a compute node will be `$AZ_BATCHAI_MOUNT_ROOT/external`;
- We will call the cluster `nc6`.

So, the cluster will have the following parameters:

In [4]:
azure_file_share = 'external'
nodes_count = 2
cluster_name = 'nc6'

volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url = 'https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share)
    ]
)

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size="STANDARD_NC6",
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="microsoft-ads",
            offer="linux-data-science-vm-ubuntu",
            sku="linuxdsvmubuntu",
            version="latest")),
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    node_setup=models.NodeSetup(
        mount_volumes=volumes,
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password,
        admin_user_ssh_public_key=cfg.admin_ssh_key
    )
)

### Create Compute Cluster

In [5]:
_ = client.clusters.create(cfg.resource_group, cluster_name, parameters)

### Monitor Cluster Creation

Monitor the newly created cluster. utilities.py contains a helper function that prints the detailed status of the cluster.

In [6]:
cluster = client.clusters.get(cfg.resource_group, cluster_name)
utilities.print_cluster_status(cluster)

Cluster state: AllocationState.steady Target: 2; Allocated: 2; Idle: 2; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
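
The cell above prints the status once, so it may need to be re-run until the nodes are allocated. As a rough sketch, the check could be wrapped in a polling loop. The helper name `wait_for_cluster` is hypothetical, and the attribute names `allocation_state` and `current_node_count` are assumptions about the cluster object returned by `client.clusters.get`; verify them against the SDK version you have installed.

```python
import time


def wait_for_cluster(get_cluster, target_nodes, poll_seconds=10,
                     timeout_seconds=1800):
    """Poll until the cluster is steady and has at least target_nodes nodes.

    get_cluster is any zero-argument callable returning the current cluster
    object, for example:
        lambda: client.clusters.get(cfg.resource_group, cluster_name)
    """
    deadline = time.time() + timeout_seconds
    while True:
        cluster = get_cluster()
        # Assumed attributes: allocation_state and current_node_count.
        state = str(cluster.allocation_state).lower()
        if state.endswith('steady') and \
                cluster.current_node_count >= target_nodes:
            return cluster
        if time.time() >= deadline:
            raise RuntimeError(
                'cluster not ready after {0}s'.format(timeout_seconds))
        time.sleep(poll_seconds)
```

Passing a callable instead of the client keeps the loop independent of the SDK, so it can be tried out with a stub before pointing it at a real cluster.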


### Deploy MNIST Dataset

For demonstration purposes, we will download the preprocessed MNIST dataset to the current directory and upload it to a file share directory named `mnist_dataset`.

#### Download and Extract MNIST Dataset

In [7]:
mnist_dataset_url = 'https://batchaisamples.blob.core.windows.net/samples/mnist_dataset_original.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=b&sig=Qc1RA3zsXIP4oeioXutkL1PXIrHJO0pHJlppS2rID3I%3D'
mnist_files = ['t10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz',
              'train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz']
if any(not os.path.exists(f) for f in mnist_files):
    utilities.download_file(mnist_dataset_url, 'mnist_dataset_original.zip')
    print('Extracting MNIST dataset...', end='')
    with zipfile.ZipFile('mnist_dataset_original.zip', 'r') as z:
        z.extractall('.')
    print('Done')

Downloading https://batchaisamples.blob.core.windows.net/samples/mnist_dataset_original.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=b&sig=Qc1RA3zsXIP4oeioXutkL1PXIrHJO0pHJlppS2rID3I%3D ...Done
Extracting MNIST dataset...Done
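
The implementation of `utilities.download_file` is not shown in this notebook. For readers without that helper, a minimal stand-in using only the Python 3 standard library might look like the sketch below. The name `download_to_path` is hypothetical, and the real helper may behave differently (for example, it may add retries or progress reporting).

```python
import shutil
from urllib.request import urlopen


def download_to_path(url, destination):
    """Stream the content at url into the local file destination."""
    print('Downloading {0} ...'.format(url), end='')
    # copyfileobj copies in chunks, so the archive is never held
    # fully in memory.
    with urlopen(url) as response, open(destination, 'wb') as out:
        shutil.copyfileobj(response, out)
    print('Done')
```

It would be called the same way as the helper in the cell above, e.g. `download_to_path(mnist_dataset_url, 'mnist_dataset_original.zip')`.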


#### Create File Share Directory and Upload MNIST Dataset

In [8]:
mnist_dataset_directory = 'mnist_dataset'

There are multiple ways to create folders and upload files into an Azure File Share: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI 2.0](/azure-cli-extension) or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into the file share.

In [9]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, mnist_dataset_directory,
    fail_on_exist=False)
# Since uploading can take significant time, let's check first if the
# file has been uploaded already.
for f in mnist_files:
    if service.exists(azure_file_share_name, mnist_dataset_directory, f):
        continue
    service.create_file_from_path(
        azure_file_share_name, mnist_dataset_directory, f, f)

### Deploy Sample Script and Configure the Input Directories


In [10]:
mnist_script_directory = 'tensorflow_samples'

- For each job we will create a folder containing a copy of the sample script. This makes it possible to run the same job with different scripts.

In [11]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, mnist_script_directory, fail_on_exist=False)
service.create_file_from_path(
    azure_file_share_name, mnist_script_directory, 'mnist_replica.py', 'mnist_replica.py')

The job needs to know where to find mnist_replica.py and the input MNIST dataset. We will create two input directories for this:

In [12]:
input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, mnist_script_directory)),
    models.InputDirectory(
        id='DATASET',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, mnist_dataset_directory))]

The job will be able to reference those directories using ```$AZ_BATCHAI_INPUT_SCRIPT``` and ```$AZ_BATCHAI_INPUT_DATASET``` environment variables.
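
As a small illustration of that naming rule, a script running inside the job could resolve an input directory from its `id` as sketched below. The helper `input_directory` is hypothetical; only the `AZ_BATCHAI_INPUT_<id>` variable pattern comes from the text above.

```python
import os


def input_directory(directory_id, environ=os.environ):
    """Return the path exposed for the InputDirectory with the given id."""
    return environ['AZ_BATCHAI_INPUT_{0}'.format(directory_id)]
```

For example, `os.path.join(input_directory('DATASET'), 'train-images-idx3-ubyte.gz')` would point at one of the uploaded dataset files. Outside a job the variables are not set, so the lookup raises `KeyError`.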

### Configure Output Directories
We will store standard and error output of the job in File Share:

In [13]:
std_output_path_prefix = "$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share)

### Configure Job

- The job will use the `tensorflow/tensorflow:1.1.0-gpu` container.
- It will use the input and output directories configured previously.
- It will use the Batch AI reserved environment variable `AZ_BATCHAI_TASK_INDEX` to identify the local task.
- If you are using DSVM, removing `container_settings` makes the job run directly on the host VMs.

**Note** You must agree to the following license before using this container:
- [TensorFlow License](https://github.com/tensorflow/tensorflow/blob/master/LICENSE)

In [15]:
args_fmt = '--job_name={0} --num_gpus=1 --train_steps 1000 --ps_hosts=$AZ_BATCHAI_PS_HOSTS --worker_hosts=$AZ_BATCHAI_WORKER_HOSTS --task_index=$AZ_BATCHAI_TASK_INDEX --data_dir=$AZ_BATCHAI_INPUT_DATASET'
job_name = datetime.utcnow().strftime("tf_%m_%d_%Y_%H%M%S")
parameters = models.job_create_parameters.JobCreateParameters(
     location=cfg.location,
     cluster=models.ResourceId(cluster.id),
     node_count=nodes_count,
     input_directories=input_directories,
     std_out_err_path_prefix=std_output_path_prefix,
     container_settings=models.ContainerSettings(
         models.ImageSourceRegistry(image='tensorflow/tensorflow:1.1.0-gpu')),
     tensor_flow_settings=models.TensorFlowSettings(
         parameter_server_count=1,
         worker_count=nodes_count,
         python_script_file_path='$AZ_BATCHAI_INPUT_SCRIPT/mnist_replica.py',
         master_command_line_args=args_fmt.format('worker'),
         worker_command_line_args=args_fmt.format('worker'),
         parameter_server_command_line_args=args_fmt.format('ps'),
     )
)
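
To make the `--ps_hosts` and `--worker_hosts` arguments more concrete, the sketch below shows how a training script can turn two comma-separated host lists into a cluster description. It uses only the standard library; in a TensorFlow 1.x script a dictionary of this shape is what would typically be passed to `tf.train.ClusterSpec`. The comma-separated `host:port` format and the addresses are assumptions for illustration; the real values are supplied by Batch AI at run time.

```python
def build_cluster_spec(ps_hosts, worker_hosts):
    """Split comma-separated host lists into a job-name -> hosts mapping."""
    return {
        'ps': [h for h in ps_hosts.split(',') if h],
        'worker': [h for h in worker_hosts.split(',') if h],
    }


# Hypothetical addresses: one parameter server and two workers.
spec = build_cluster_spec('10.0.0.4:2222', '10.0.0.4:2223,10.0.0.5:2222')
```

Each process then identifies itself through its job name (`ps` or `worker`) and its task index, which is why the command line above passes `--job_name` and `--task_index` alongside the host lists.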

### Create a Training Job


In [16]:
_ = client.jobs.create(cfg.resource_group, job_name, parameters)   
print('Created Job: {}'.format(job_name))

Created Job: tf_10_08_2017_074801


### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start running, printing the cluster state. While the job runs, the code prints the current content of stdout-wk-0.txt (the output of the worker running on the first node).

In [17]:
utilities.wait_for_job_completion(client, cfg.resource_group, job_name, cluster_name, 'stdouterr', 'stdout-wk-0.txt')

Cluster state: AllocationState.steady Target: 2; Allocated: 2; Idle: 0; Unusable: 0; Running: 2; Preparing: 0; Leaving: 0
Job state: running ExitCode: None
Waiting for job output to become available...
Extracting /mnt/batch/tasks/shared/LS_root/mounts/external/mnist_dataset/train-images-idx3-ubyte.gz
Extracting /mnt/batch/tasks/shared/LS_root/mounts/external/mnist_dataset/train-labels-idx1-ubyte.gz
Extracting /mnt/batch/tasks/shared/LS_root/mounts/external/mnist_dataset/t10k-images-idx3-ubyte.gz
Extracting /mnt/batch/tasks/shared/LS_root/mounts/external/mnist_dataset/t10k-labels-idx1-ubyte.gz
job name = worker
task index = 0
Worker 0: Initializing session...
Worker 0: Session initialization complete.
Training begins @ 1507448915.535585
1507448915.730590: Worker 0: training step 1 done (global step: 0)
1507448915.735376: Worker 0: training step 2 done (global step: 1)
1507448915.738424: Worker 0: training step 3 done (global step: 2)
1507448915.741825: Worker 0: training step 4 done (gl
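
The training-step lines in this output have a regular shape, so they can be parsed to estimate throughput. The following is a minimal sketch, assuming the line format stays exactly as printed above; `parse_steps` and `steps_per_second` are hypothetical helper names, and lines that do not match (including truncated ones) are skipped.

```python
import re

# Matches e.g. "1507448915.730590: Worker 0: training step 1 done (global step: 0)"
STEP_RE = re.compile(
    r'^(?P<time>\d+\.\d+): Worker (?P<worker>\d+): '
    r'training step (?P<step>\d+) done \(global step: (?P<global_step>\d+)\)$')


def parse_steps(lines):
    """Return (timestamp, worker, step, global_step) for each matching line."""
    steps = []
    for line in lines:
        match = STEP_RE.match(line.strip())
        if match:
            steps.append((float(match.group('time')),
                          int(match.group('worker')),
                          int(match.group('step')),
                          int(match.group('global_step'))))
    return steps


def steps_per_second(steps):
    """Average step rate between the first and last parsed step."""
    if len(steps) < 2:
        return None
    elapsed = steps[-1][0] - steps[0][0]
    return (len(steps) - 1) / elapsed if elapsed > 0 else None
```

After the output files have been downloaded, the functions could be applied to a worker log with `steps_per_second(parse_steps(open('stdout-wk-0.txt')))`.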

### Download stdout and stderr Files for the Job

In [18]:
files = client.jobs.list_output_files(cfg.resource_group, job_name, models.JobsListOutputFilesOptions("stdOuterr")) 
for file in list(files):
    utilities.download_file(file.download_url, file.name)
print("All files Downloaded")

Downloading https://batchaisamples.file.core.windows.net/batchaisample/62254d4a-9a33-42b7-b57b-c3187d4d282e/batchaitests/jobs/tf_10_08_2017_074801/cbf33bd6-594f-4181-b2f9-06a06288f924/stderr-ps-0.txt?sv=2016-05-31&sr=f&sig=GvBgyrvVJ6pyrqLDIdvXBaVhl%2F2%2Fhyug58knTOMtu4U%3D&se=2017-10-08T08%3A49%3A56Z&sp=rl ...Done
Downloading https://batchaisamples.file.core.windows.net/batchaisample/62254d4a-9a33-42b7-b57b-c3187d4d282e/batchaitests/jobs/tf_10_08_2017_074801/cbf33bd6-594f-4181-b2f9-06a06288f924/stderr-wk-0.txt?sv=2016-05-31&sr=f&sig=1pMqHHci2CjHZQpJyfh%2B1wQ2aSN%2BBymW%2F2Sh0rrXDzU%3D&se=2017-10-08T08%3A49%3A56Z&sp=rl ...Done
Downloading https://batchaisamples.file.core.windows.net/batchaisample/62254d4a-9a33-42b7-b57b-c3187d4d282e/batchaitests/jobs/tf_10_08_2017_074801/cbf33bd6-594f-4181-b2f9-06a06288f924/stderr-wk-1.txt?sv=2016-05-31&sr=f&sig=HVmkIq3BQQTCzkV6MjMo9s%2FDXIo57FPKbOUzsiwotlw%3D&se=2017-10-08T08%3A49%3A56Z&sp=rl ...Done
Downloading https://batchaisamples.file.core.windows

In [19]:
for n in range(nodes_count):
    # The label now matches the file that is actually opened.
    print('stderr-wk-{0}.txt content:'.format(n))
    with open('stderr-wk-{0}.txt'.format(n)) as f:
        print(f.read())

stderr-wk-0.txt content:
2017-10-08 07:48:34.523318: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-08 07:48:34.528879: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-08 07:48:34.536706: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-08 07:48:34.545402: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-08 07:48:34.554013: W tensorflow/core/platform/cpu_feature_guard.cc:45] The 

### Delete the Job

In [None]:
client.jobs.delete(cfg.resource_group, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
client.clusters.delete(cfg.resource_group, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share together with all its files using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)