# LAB 1 - Batch AI with TensorFlow MultiGPU


## Introduction

This example uses the MNIST dataset to demonstrate how to train a convolutional neural network (CNN) on a GPU cluster. This recipe runs on a single node.

## Details

- For demonstration purposes, the MNIST dataset and the training script (ConvNet_MNIST.py for CNTK, convolutional.py for TensorFlow) will be deployed to an Azure File Share;
- Standard output of the job and the model will be stored on the Azure File Share;
- The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been preprocessed using install_mnist.py and is available [here](https://batchaisamples.blob.core.windows.net/samples/mnist_dataset.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D);
- The original CNTK example (https://github.com/Microsoft/CNTK/blob/master/Examples/Image/Classification/ConvNet/Python/ConvNet_MNIST.py) has been modified to accept the dataset and model locations via command line arguments and is available here: [ConvNet_MNIST.py](/recipes/CNTK/Python/CNTK-GPU-Python/ConvNet_MNIST.py).

## Instructions

### Create the Required Azure Resource Group and Storage Account
```
> ssh sshuser@YOUR.VM.IP.ADDRESS
> az login
> az group create --name batchai_rg  --location eastus
> az storage account create --location eastus --name batchaipablo --resource-group batchai_rg --sku Standard_LRS
> az storage account keys list --account-name batchaipablo --resource-group batchai_rg -o table
> az ad sp create-for-rbac --name MyAppSvcPppl --password Passw0rd
> az storage account keys list --account-name batchaipablo --resource-group batchai_rg
```

### Read Configuration and Create Batch AI client

In [95]:
from __future__ import print_function

from datetime import datetime
import os
import sys
import zipfile

from azure.storage.file import FileService
import azure.mgmt.batchai.models as models

# utilities.py contains helper functions
import utilities

# Resource Group
location = 'eastus'
resource_group = 'batchai_rg'

# credentials used for authentication
client_id = 'ec0640c7-61fa-4662-bce4-8a3e931939ac'
secret = 'Passw0rd'
token_uri = 'https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token'
subscription_id = 'b1395605-1fe9-4af4-b3ff-82a4725a3791'

# credentials used for storage
storage_account_name = 'batchaipablo'
storage_account_key = 'y59heteYEbw5nTLBB/b7rj3jUphvs2Iwslg4AsXFSb4G7ZLgJUep4AuccSmST7I3E8Zw4BaUloebK+VyKmGpog=='

# specify the credentials used to remote login your GPU node
admin_user_name = 'sshuser'
admin_user_password = 'Passw0rd.1!!'
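
The cell above keeps its configuration values inline. As an alternative, here is a minimal sketch that reads the same values from environment variables. The variable names (`BATCHAI_CLIENT_ID` and so on) and the `load_settings` helper are assumptions made for this illustration; they are not defined by Azure or by this lab. The token URI follows the same pattern as the one used above.

```python
import os

# Hypothetical variable names chosen for this sketch; they are not defined
# by Azure or by this lab.
REQUIRED_SETTINGS = (
    'BATCHAI_CLIENT_ID',
    'BATCHAI_SECRET',
    'BATCHAI_TENANT_ID',
    'BATCHAI_SUBSCRIPTION_ID',
    'BATCHAI_STORAGE_ACCOUNT_NAME',
    'BATCHAI_STORAGE_ACCOUNT_KEY',
)


def load_settings(env=None):
    """Return the required settings as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError('Missing settings: ' + ', '.join(missing))
    settings = {name: env[name] for name in REQUIRED_SETTINGS}
    # Same URI shape as the token_uri value in the cell above.
    settings['BATCHAI_TOKEN_URI'] = (
        'https://login.microsoftonline.com/{0}/oauth2/token'.format(
            settings['BATCHAI_TENANT_ID']))
    return settings
```

With this in place, `client_id = load_settings()['BATCHAI_CLIENT_ID']` could replace the literal assignment, and likewise for the other values.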

In [96]:
from azure.common.credentials import ServicePrincipalCredentials
import azure.mgmt.batchai as batchai
import azure.mgmt.batchai.models as models

creds = ServicePrincipalCredentials(client_id=client_id, secret=secret, token_uri=token_uri)

client = batchai.BatchAIManagementClient(credentials=creds,subscription_id=subscription_id)

### Create File Share

For this example we will create a new file share named `batchailab1` under your storage account.

**Note** You don't need to create a new file share for every cluster. We do so in this sample to simplify resource management for you.

In [97]:
azure_file_share_name = 'batchailab1'
service = FileService(storage_account_name, storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')

Done


### Configure Compute Cluster

- For this example we will use a GPU cluster of `Standard_NC12` nodes. The number of nodes in the cluster is configured with the `nodes_count` variable;
- We will mount the file share at a folder named `external`. The full path of this folder on a compute node will be `$AZ_BATCHAI_MOUNT_ROOT/external`;
- We will call the cluster `nc12`;

So, the cluster will have the following parameters:

In [98]:
azure_file_share = 'external'
nodes_count = 1
cluster_name = 'nc12'
vmsize = "Standard_NC12" 

volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=storage_account_key),
            azure_file_url = 'https://{0}.file.core.windows.net/{1}'.format(
                storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share)
    ]
)

parameters = models.ClusterCreateParameters(
    location=location,
    vm_size=vmsize,
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="microsoft-ads",
            offer="linux-data-science-vm-ubuntu",
            sku="linuxdsvmubuntu",
            version="latest")),    
    user_account_settings=models.UserAccountSettings(
        admin_user_name=admin_user_name,
        admin_user_password=admin_user_password),
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    node_setup=models.NodeSetup(
        mount_volumes=volumes,
    )
)
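
The file share URL and the mount path used in the cell above can be checked without touching Azure. The following is a small sketch; `file_share_url` and `node_mount_path` are hypothetical helper names introduced here, and the formats are copied from the cell above and from the mount path described earlier.

```python
def file_share_url(account_name, share_name):
    """Build the Azure Files URL in the same format as the cell above."""
    return 'https://{0}.file.core.windows.net/{1}'.format(
        account_name, share_name)


def node_mount_path(relative_mount_path, mount_root='$AZ_BATCHAI_MOUNT_ROOT'):
    """Path at which the mounted share appears on a compute node."""
    return '{0}/{1}'.format(mount_root, relative_mount_path)


print(file_share_url('batchaipablo', 'batchailab1'))
print(node_mount_path('external'))
```

The mount root is left as an unexpanded variable on purpose, since Batch AI resolves `$AZ_BATCHAI_MOUNT_ROOT` on the node.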

### Create Compute Cluster

In [99]:
cluster = client.clusters.create(resource_group, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. utilities.py contains a helper function that prints the detailed status of the cluster.

In [100]:
cluster = client.clusters.get(resource_group, cluster_name)
utilities.print_cluster_status(cluster)

Cluster state: AllocationState.resizing Target: 1; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
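
The status above shows the cluster still resizing, so the status cell has to be re-run until a node becomes idle. A generic polling loop can automate this. The sketch below is not part of utilities.py; `wait_until` is a hypothetical helper, and the state getter and readiness check are passed in so the loop itself makes no assumptions about the SDK.

```python
import time


def wait_until(get_state, is_ready, timeout_s=600.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until is_ready(state) is true or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if is_ready(state):
            return state
        if clock() >= deadline:
            raise RuntimeError(
                'condition not met within {0} s'.format(timeout_s))
        sleep(interval_s)


# Possible use with the client from this lab (assumes the cluster object
# exposes node_state_counts.idle_node_count in your SDK version):
# cluster = wait_until(
#     lambda: client.clusters.get(resource_group, cluster_name),
#     lambda c: c.node_state_counts.idle_node_count >= nodes_count)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise in tests without real delays.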


### Deploy MNIST Dataset

For demonstration purposes, we will download the preprocessed MNIST dataset to the current directory and upload it to a file share directory named `mnist_dataset`.

#### Download and Extract MNIST Dataset

In [101]:
mnist_dataset_url = 'https://batchaisamples.blob.core.windows.net/samples/mnist_dataset.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D'
if not os.path.exists('Train-28x28_cntk_text.txt') or not os.path.exists('Test-28x28_cntk_text.txt'):
    utilities.download_file(mnist_dataset_url, 'mnist_dataset.zip')
    print('Extracting MNIST dataset...')
    with zipfile.ZipFile('mnist_dataset.zip', 'r') as z:
        z.extractall('.')
print('Done')

Done


#### Create File Share and Upload MNIST Dataset

In [102]:
mnist_dataset_directory = 'mnist_dataset'

There are multiple ways to create folders and upload files into an Azure File Share: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI2](/azure-cli-extension) or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into the file share.

For large datasets it is usually better to use AzCopy, for example:

> AzCopy /Source:https://myaccount1.blob.core.windows.net/mycontainer/ /Dest:https://myaccount2.file.core.windows.net/myfileshare/ /SourceKey:key1 /DestKey:key2 /S

In [103]:
service = FileService(storage_account_name, storage_account_key)
service.create_directory(
    azure_file_share_name, mnist_dataset_directory,
    fail_on_exist=False)
# Since uploading can take significant time, let's check first if the
# file has been uploaded already.
for f in ['Train-28x28_cntk_text.txt', 'Test-28x28_cntk_text.txt']:
    if service.exists(azure_file_share_name, mnist_dataset_directory, f):
        continue
    service.create_file_from_path(
        azure_file_share_name, mnist_dataset_directory, f, f)
print('Done')

Done


### Deploy Sample Script and Configure the Input Directories


- CNTK: For each job we will create a folder containing a copy of ConvNet_MNIST.py
- TensorFlow: For each job we will create a folder containing a copy of convolutional.py

In [104]:
# Uncomment this if you want to use CNTK
#cntk_script_path = "cntk_samples"
#service = FileService(storage_account_name, storage_account_key)
#service.create_directory(
#    azure_file_share_name, cntk_script_path, fail_on_exist=False)
#service.create_file_from_path(
#    azure_file_share_name, cntk_script_path, 'ConvNet_MNIST.py', 'ConvNet_MNIST.py')

mnist_script_directory = 'tensorflow_samples'
service = FileService(storage_account_name, storage_account_key)
service.create_directory(
    azure_file_share_name, mnist_script_directory, fail_on_exist=False)
service.create_file_from_path(
    azure_file_share_name, mnist_script_directory, 'convolutional.py', 'convolutional.py')
print('Done')

Done


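The job needs to know where to find the training script. The TensorFlow job configured here needs only a `SCRIPT` input directory, because convolutional.py downloads MNIST itself. The commented-out CNTK variant defines two input directories: `SCRIPT` for ConvNet_MNIST.py and `DATASET` for the uploaded MNIST files.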

In [105]:
#input_directories = [
#    models.InputDirectory(
#        id='SCRIPT',
#        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, cntk_script_path)),
#    models.InputDirectory(
#        id='DATASET',
#        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, mnist_dataset_directory))]


input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, mnist_script_directory))]

The job will be able to reference the script directory using the `$AZ_BATCHAI_INPUT_SCRIPT` environment variable. With the CNTK variant, the dataset directory is also available as `$AZ_BATCHAI_INPUT_DATASET`.

### Configure Output Directories
We will store the standard output and standard error of the job on the file share:

In [106]:
std_output_path_prefix = "$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share)

We will also define an output directory with the id `MODEL`. The job will be able to reference it as `$AZ_BATCHAI_OUTPUT_MODEL`, and we will be able to enumerate the files in it using the `MODEL` id.

In [110]:
output_directories = [
    models.OutputDirectory(
        id='MODEL',
        path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share),
        path_suffix="Models")]
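
The ids chosen above map to environment variables by a simple naming pattern: `SCRIPT` becomes `$AZ_BATCHAI_INPUT_SCRIPT` and `MODEL` becomes `$AZ_BATCHAI_OUTPUT_MODEL`. The sketch below shows how a training script might look up such a directory with a fallback for local runs. `batchai_env_name` and `resolve_directory` are hypothetical helpers, and the pattern is inferred only from the variable names mentioned in this lab.

```python
import os


def batchai_env_name(kind, directory_id):
    """Name of the variable set for an input or output directory id."""
    if kind not in ('INPUT', 'OUTPUT'):
        raise ValueError("kind must be 'INPUT' or 'OUTPUT'")
    return 'AZ_BATCHAI_{0}_{1}'.format(kind, directory_id)


def resolve_directory(kind, directory_id, default='.', env=None):
    """Look the directory up in the environment; fall back to default."""
    env = os.environ if env is None else env
    return env.get(batchai_env_name(kind, directory_id), default)


print(batchai_env_name('INPUT', 'SCRIPT'))
print(batchai_env_name('OUTPUT', 'MODEL'))
```

The fallback lets the same script run on a workstation, where these variables are not set.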

### Configure Job
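- The job will use the `tensorflow/tensorflow:1.1.0-gpu` container (the CNTK alternative, `microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0`, is left commented out);
- It will use the previously configured input and output directories;
- It will run convolutional.py from the `SCRIPT` input directory, passing `-p` as the master command line argument. The commented-out CNTK variant would instead run the modified ConvNet_MNIST.py with the MNIST dataset path as the first parameter and the desired model output path as the second;
- If you are using DSVM and remove `container_settings`, the job will run directly on the host VMs.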


In [107]:
job_name = datetime.utcnow().strftime("cntk_%m_%d_%Y_%H%M%S")
parameters = models.job_create_parameters.JobCreateParameters(
     location=location,
     cluster=models.ResourceId(cluster.id),
     node_count=nodes_count,
     input_directories=input_directories,
     std_out_err_path_prefix=std_output_path_prefix,
     output_directories=output_directories,
     #container_settings=models.ContainerSettings(
     #    models.ImageSourceRegistry(image='microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0')),
     container_settings=models.ContainerSettings(
         models.ImageSourceRegistry(image='tensorflow/tensorflow:1.1.0-gpu')),
     #cntk_settings = models.CNTKsettings(
     #    python_script_file_path='$AZ_BATCHAI_INPUT_SCRIPT/ConvNet_MNIST.py',
     #    command_line_args='$AZ_BATCHAI_INPUT_DATASET $AZ_BATCHAI_OUTPUT_MODEL')
     tensor_flow_settings=models.TensorFlowSettings(
         python_script_file_path='$AZ_BATCHAI_INPUT_SCRIPT/convolutional.py',
         master_command_line_args="-p")
)
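
The job name in the cell above is built from a fixed `cntk_` prefix plus a UTC timestamp, even though this run uses TensorFlow. The following sketch makes the prefix a parameter so the name can reflect the framework. `make_job_name` is a hypothetical helper; the timestamp format is the one used above.

```python
from datetime import datetime, timezone


def make_job_name(prefix, now=None):
    """Return '<prefix>_MM_DD_YYYY_HHMMSS', using UTC when now is omitted."""
    now = datetime.now(timezone.utc) if now is None else now
    return prefix + now.strftime('_%m_%d_%Y_%H%M%S')


print(make_job_name('tf', datetime(2018, 2, 13, 23, 16, 20)))
```

Calling `make_job_name('cntk')` reproduces the naming scheme used in this lab.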

### Create a Training Job


In [108]:
job = client.jobs.create(resource_group, job_name, parameters).result()
print('Created Job: {}'.format(job_name))

Created Job: cntk_02_13_2018_231620


### Wait for Job to Finish
The job will start running once the cluster has enough idle nodes. The following code waits for the job to start, printing the cluster state in the meantime. While the job runs, the code prints the current content of the selected output file.

**Note** Execution may take several minutes to complete.

In [113]:
#utilities.wait_for_job_completion(client, resource_group, job_name, cluster_name, 'stdOuterr', 'stderr.txt')
utilities.wait_for_job_completion(client, resource_group, job_name, cluster_name, 'stdOuterr', 'stderr-wk-0.txt')

Cluster state: AllocationState.steady Target: 1; Allocated: 1; Idle: 1; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
Job state: succeeded ExitCode: 0
Waiting for job output to become available...
2018-02-13 23:20:56.437213: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-02-13 23:20:56.441945: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-02-13 23:20:56.447544: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-02-13 23:20:56.453561: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2

In [114]:
utilities.wait_for_job_completion(client, resource_group, job_name, cluster_name, 'stdOuterr', 'stdout-wk-0.txt')

Cluster state: AllocationState.steady Target: 1; Allocated: 1; Idle: 1; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
Job state: succeeded ExitCode: 0
Waiting for job output to become available...
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 27.4 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 17.0 ms
Minibatch loss: 3.266, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 7.4%
Step 200 (epoch 0.23), 12.5 ms
Minibatch loss: 3.363, learning rate: 0.010000
Minibatch error: 10.9%
Validation e

Job state: succeeded ExitCode: 0
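
The training log above follows a regular pattern: a `Step N (epoch E)` line, followed a few lines later by a `Validation error` line. The sketch below extracts those values so they can be plotted or compared between runs. `parse_validation_errors` is a hypothetical helper, and the regular expressions assume the log format shown above.

```python
import re

STEP_RE = re.compile(r'^Step (\d+) \(epoch ([\d.]+)\)')
VALIDATION_RE = re.compile(r'^Validation error: ([\d.]+)%')


def parse_validation_errors(lines):
    """Return (step, epoch, validation_error_percent) tuples from the log."""
    results = []
    current = None
    for line in lines:
        step_match = STEP_RE.match(line)
        if step_match:
            current = (int(step_match.group(1)), float(step_match.group(2)))
            continue
        val_match = VALIDATION_RE.match(line)
        if val_match and current is not None:
            results.append(current + (float(val_match.group(1)),))
            current = None
    return results


sample = """Step 0 (epoch 0.00), 27.4 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 17.0 ms
Minibatch loss: 3.266, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 7.4%
""".splitlines()

print(parse_validation_errors(sample))
# prints [(0, 0.0, 84.6), (100, 0.12, 7.4)]
```

A step whose validation line is cut off, as at the end of the output above, is skipped.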


### Download the stdout and stderr Files for the Job

In [115]:
files = client.jobs.list_output_files(resource_group, job_name, models.JobsListOutputFilesOptions("stdOuterr")) 
for f in list(files):
    utilities.download_file(f.download_url, f.name)
print("All files downloaded")

Downloading https://batchaipablo.file.core.windows.net/batchailab1/b1395605-1fe9-4af4-b3ff-82a4725a3791/batchai_rg/jobs/cntk_02_13_2018_231620/6dc40e3b-61a3-402d-9a16-32667c669a70/stderr-wk-0.txt?sv=2016-05-31&sr=f&sig=VWloNYk7WTt2V3EAAyhTAd9677wzoXLkXCtGgltZYEo%3D&se=2018-02-14T00%3A24%3A31Z&sp=rl ...Done
Downloading https://batchaipablo.file.core.windows.net/batchailab1/b1395605-1fe9-4af4-b3ff-82a4725a3791/batchai_rg/jobs/cntk_02_13_2018_231620/6dc40e3b-61a3-402d-9a16-32667c669a70/stdout-wk-0.txt?sv=2016-05-31&sr=f&sig=rVvrLbHwuHI7smfJ%2FP9hRhMPA0gQFUEg7sfiE4eyyrI%3D&se=2018-02-14T00%3A24%3A31Z&sp=rl ...Done
All files downloaded


### Delete the Job

In [116]:
_ = client.jobs.delete(resource_group, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [117]:
_ = client.clusters.delete(resource_group, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [118]:
service = FileService(storage_account_name, storage_account_key)
service.delete_share(azure_file_share_name)

True