# Distributed Batch Scoring in TensorFlow with GPU


## Introduction

This example demonstrates how to run a distributed batch scoring job in TensorFlow on an Azure Batch AI cluster of 2 nodes. It uses the [Inception-V3](https://arxiv.org/abs/1512.00567) model and unlabeled images from the [ImageNet](http://image-net.org/) dataset.

## Details

- For demonstration purposes, a pretrained [Inception-V3](https://arxiv.org/abs/1512.00567) model and approximately 900 evaluation images from the [ImageNet](http://image-net.org/) dataset will be deployed to an Azure Blob Container.
- Standard output of the job will be stored on an Azure File Share.
- The Azure Blob Container and the Azure File Share will be mounted on the Batch AI GPU cluster.
- The recipe uses [batch_image_label.py](./batch_image_label.py) script to perform Distributed Batch Scoring with the given model and image datasets. The input images for evaluation will be partitioned by the MPI rank, so that each MPI worker will evaluate part of the whole image set independently. 
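
The rank-based partitioning described in the last item can be sketched as follows. This is a minimal illustration, not the code from `batch_image_label.py`: the helper name `partition_by_rank` and the round-robin split are assumptions, and the rank and world size are read from the `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` environment variables that Open MPI sets for each process it launches.

```python
import os


def partition_by_rank(items, rank, size):
    """Return the part of `items` assigned to worker `rank` out of `size` workers.

    Hypothetical helper: items are sorted first so that every worker sees the
    same ordering, then dealt out round-robin.
    """
    if size < 1 or not 0 <= rank < size:
        raise ValueError('rank must satisfy 0 <= rank < size')
    return sorted(items)[rank::size]


# Open MPI exports these variables for processes started by mpirun;
# fall back to a single worker when running outside MPI.
rank = int(os.environ.get('OMPI_COMM_WORLD_RANK', 0))
size = int(os.environ.get('OMPI_COMM_WORLD_SIZE', 1))

# Illustrative file names only.
images = ['img_{:03d}.jpg'.format(i) for i in range(10)]
print('worker {} of {} scores {}'.format(rank, size, partition_by_rank(images, rank, size)))
```

Because the slices are disjoint and together cover the whole list, each image is scored exactly once and the workers do not need to communicate with each other.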

## Instructions

### Install Dependencies and Create Configuration File
Follow [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

from datetime import datetime
import os
import sys
import zipfile, tarfile

from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
import azure.mgmt.batchai.models as models

# The BatchAI/utilities folder contains helper functions used by different notebooks
sys.path.append('../../..')
import utilities as utils

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)

Create the resource group and the Batch AI workspace if they do not exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Model, Images and Script in Azure Storage

### Create Azure Blob Container

We will create a new Blob Container named `batchaisample` under your storage account. It will be used to store the *pretrained model* and the *input images*.

**Note** You don't need to create a new Blob Container for every cluster. We are doing this in this sample to simplify resource management for you.

In [None]:
azure_blob_container_name = 'batchaisample'
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

### Upload Model and Images to Azure Blob Container

For demonstration purposes, we will download the pretrained `Inception-V3` model and a set of ImageNet evaluation images to the current directory and upload them to the Azure Blob Container.

The following code downloads these resources to the current local directory.

In [None]:
model_url = 'http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz'
utils.dataset.download_file(model_url, 'inception_v3.tar.gz')
with tarfile.open('inception_v3.tar.gz', "r:gz") as tar:
    tar.extractall()

images_url = 'https://batchaisamples.blob.core.windows.net/samples/imagenet_samples.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D'
utils.dataset.download_file(images_url, 'imagenet_samples.zip')
with zipfile.ZipFile('imagenet_samples.zip', 'r') as z:
    z.extractall('.')
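
Before uploading, it can help to confirm that the archives produced the files the later cells expect. The sketch below assumes the archives extract to `inception_v3.ckpt`, `imagenet_slim_labels.txt` and a `samples` folder in the current directory, which are the names the upload cells in this recipe use. `missing_paths` is a hypothetical helper, not part of the `utilities` module.

```python
import os


def missing_paths(paths, base_dir='.'):
    """Return the entries of `paths` that do not exist under `base_dir`."""
    return [p for p in paths if not os.path.exists(os.path.join(base_dir, p))]


# Names taken from the upload cells of this recipe.
expected = ['inception_v3.ckpt', 'imagenet_slim_labels.txt', 'samples']
missing = missing_paths(expected)
if missing:
    print('Missing after extraction: {}'.format(', '.join(missing)))
else:
    print('All expected files are present.')
```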

There are multiple ways to create folders and upload files into an Azure Blob Container: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI 2.0](/azure-cli-extension) or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into the Blob Container.

Upload the pretrained model and output labels to the Azure Blob Container directory named `pretained_models`.

In [None]:
print('Uploading pretrained model and output labels...')
model_directory = 'pretained_models'
blob_service.create_blob_from_path(azure_blob_container_name, 
                                   model_directory + '/' + 'inception_v3.ckpt', 'inception_v3.ckpt')
blob_service.create_blob_from_path(azure_blob_container_name, 
                                   model_directory + '/' + 'imagenet_slim_labels.txt', 'imagenet_slim_labels.txt')


Upload the ImageNet image samples to the Azure Blob Container directory named `unlabeled_images`. This step may take a few minutes to complete.

In [None]:
print('Uploading sample images to evaluate...')
image_directory = 'unlabeled_images'
for f in os.listdir('samples'):
    if os.path.isfile(os.path.join('samples', f)):
        blob_service.create_blob_from_path(azure_blob_container_name,
                                           image_directory + '/' + f, os.path.join('samples', f))
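
Blob containers have a flat namespace, so the `unlabeled_images` "directory" is only a prefix in each blob name. The sketch below pairs every local file with the blob name the loop above would produce, without making any network calls. `plan_uploads` is a hypothetical helper introduced for illustration.

```python
import os


def plan_uploads(local_dir, blob_prefix):
    """Return (blob_name, local_path) pairs for the regular files in `local_dir`."""
    pairs = []
    for name in sorted(os.listdir(local_dir)):
        local_path = os.path.join(local_dir, name)
        if os.path.isfile(local_path):  # skip sub-directories
            # Blob names always use '/', regardless of the local OS separator.
            pairs.append((blob_prefix + '/' + name, local_path))
    return pairs


# Preview the upload plan if the sample images have been extracted.
if os.path.isdir('samples'):
    for blob_name, local_path in plan_uploads('samples', 'unlabeled_images'):
        print('{} -> {}'.format(local_path, blob_name))
```

Printing the plan first is a cheap way to check the blob names before starting an upload that may take several minutes.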

### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. It will be used to share the *scoring script file* and the *output files*.

**Note** You don't need to create a new file share for every cluster. We are doing this in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

### Deploy Sample Script to Azure File Share
For each job we will create a folder containing a copy of the sample script. This makes it possible to run the same job with different scripts.


In [None]:
script_directory = 'classification_samples'
script_to_deploy = 'batch_image_label.py'
file_service.create_directory(
    azure_file_share_name, script_directory, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, script_directory, script_to_deploy, script_to_deploy)

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster
- For this example we will use a GPU cluster of 2 `STANDARD_NC6` nodes. You can increase the number of nodes by changing the `nodes_count` variable.
- We will call the cluster `nc6`.

So, the cluster will have the following parameters:

In [None]:
nodes_count = 2
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [None]:
cluster = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)
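
The cell above reports the status once. To block until the nodes are ready, the check can be wrapped in a polling loop. The sketch below is a generic helper and not part of the `utilities` module: `wait_until` is a hypothetical name, and the commented usage assumes that the cluster object exposes a `current_node_count` attribute.

```python
import time


def wait_until(check, timeout_sec=600, interval_sec=10, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout_sec` has been waited.

    Returns True if the condition was met and False on timeout.
    """
    waited = 0
    while True:
        if check():
            return True
        if waited >= timeout_sec:
            return False
        sleep(interval_sec)
        waited += interval_sec


# Usage sketch (assumption: `current_node_count` is available on the cluster):
# ready = wait_until(
#     lambda: client.clusters.get(cfg.resource_group, cfg.workspace,
#                                 cluster_name).current_node_count == nodes_count)
```

The `sleep` parameter is injectable so that the loop can be exercised without real delays.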

## 3. Run Azure Batch AI Scoring Job

### Configure Job

- The job will use the `tensorflow/tensorflow:1.7.0-gpu` container.
- We will mount the file share at a folder named `afs`. The full path of this folder on a compute node will be `$AZ_BATCHAI_JOB_MOUNT_ROOT/afs`.
- We will mount the Azure Blob Container at a folder named `bfs`. The full path of this folder on a compute node will be `$AZ_BATCHAI_JOB_MOUNT_ROOT/bfs`.
- We will use a job preparation task to install the OpenMPI binaries.
- We will use the custom toolkit to launch the MPI processes.
- In [`batch_image_label.py`](./batch_image_label.py), the input images for evaluation will be partitioned by the MPI rank, so that each MPI worker will evaluate part of the whole image set independently.

**Note** You must agree to the following licences before using this container:
- [TensorFlow License](https://github.com/tensorflow/tensorflow/blob/master/LICENSE)

In [None]:
azure_file_share = 'afs'
azure_blob = 'bfs'
parameters = models.JobCreateParameters(
     location=cfg.location,
     cluster=models.ResourceId(id=cluster.id),
     node_count=2,
     mount_volumes = models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_container_name,
                relative_mount_path=azure_blob)
        ]
     ),
     input_directories=[
        models.InputDirectory(
            id='SCRIPT',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, script_directory)),
        models.InputDirectory(
            id='IMAGES',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob, image_directory)),
        models.InputDirectory(
            id='MODEL',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob, model_directory))],
     std_out_err_path_prefix="$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}".format(azure_file_share),
     output_directories=[
        models.OutputDirectory(
            id='LABEL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share))],
     container_settings=models.ContainerSettings(
         image_source_registry=models.ImageSourceRegistry(image='tensorflow/tensorflow:1.7.0-gpu')),
     job_preparation=models.JobPreparation(
         command_line="apt update && apt install -y mpi-default-dev mpi-default-bin"),
     custom_mpi_settings = models.CustomMpiSettings(
         command_line='python -u $AZ_BATCHAI_INPUT_SCRIPT/batch_image_label.py --dataset_path $AZ_BATCHAI_INPUT_IMAGES --model_path $AZ_BATCHAI_INPUT_MODEL/inception_v3.ckpt --label_path $AZ_BATCHAI_INPUT_MODEL/imagenet_slim_labels.txt --output_dir $AZ_BATCHAI_OUTPUT_LABEL --batch_size 64'))
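
The command line above refers to the mounted folders through environment variables derived from the directory ids: the input directory with id `SCRIPT` appears as `$AZ_BATCHAI_INPUT_SCRIPT`, and the output directory with id `LABEL` as `$AZ_BATCHAI_OUTPUT_LABEL`. As a sketch of that naming pattern, the same string can be assembled from its parts. The helpers `input_var` and `output_var` are hypothetical and only mirror the pattern visible in this recipe.

```python
def input_var(directory_id):
    """Environment variable that points at the input directory with this id."""
    return '$AZ_BATCHAI_INPUT_' + directory_id


def output_var(directory_id):
    """Environment variable that points at the output directory with this id."""
    return '$AZ_BATCHAI_OUTPUT_' + directory_id


# Rebuild the command line used in the job parameters of this recipe.
command_line = ' '.join([
    'python -u {}/batch_image_label.py'.format(input_var('SCRIPT')),
    '--dataset_path {}'.format(input_var('IMAGES')),
    '--model_path {}/inception_v3.ckpt'.format(input_var('MODEL')),
    '--label_path {}/imagenet_slim_labels.txt'.format(input_var('MODEL')),
    '--output_dir {}'.format(output_var('LABEL')),
    '--batch_size 64',
])
print(command_line)
```

Building the string from the ids keeps the command line in step with the `InputDirectory` and `OutputDirectory` definitions if an id is renamed.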


### Create an Experiment and Submit the Job


In [None]:
experiment_name = 'batch_scoring_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime("classification_%m_%d_%Y_%H%M%S")
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job: {}'.format(job_name))

### Wait for Job to Finish
The job starts running once the cluster has enough idle nodes. The following code waits for the job to start, printing the cluster state while it waits. While the job runs, the code prints the current content of the job's standard output file (`stdout.txt` in the call below).

In [None]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace, 
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout.txt')

### Download Image Output Label file 

In [None]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='LABEL')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')
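
The listing above prints a download URL for each file and `directory` for entries without one. A minimal sketch of how the files could be picked out for download follows. `downloadable` is a hypothetical helper, and the `Entry` tuples are stand-ins with illustrative values for the objects returned by `list_output_files`.

```python
from collections import namedtuple


def downloadable(entries):
    """Return (name, download_url) pairs for the entries that are files."""
    return [(e.name, e.download_url) for e in entries if e.download_url]


# Stand-ins for the listed entries; the names and the URL are made up.
Entry = namedtuple('Entry', ['name', 'download_url'])
sample = [Entry('labels', None),
          Entry('result.txt', 'https://example.invalid/result.txt')]

for name, url in downloadable(sample):
    print('would download {} from {}'.format(name, url))
    # In the notebook this could be: utils.dataset.download_file(url, name)
```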

## 4. Clean Up (Optional)

### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, cfg.workspace, experiment_name, job_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs you can delete the file share completely with all files using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)