# TensorFlow GPU


# Introduction

This example demonstrates how to run the standard TensorFlow sample (https://github.com/tensorflow/models/blob/master/tutorials/image/mnist/convolutional.py) on a single-node Azure Batch AI cluster.

## Details

- For demonstration purposes, the official `convolutional.py` script will be deployed to an Azure File Share.
- The standard output of the job will be stored on the same Azure File Share.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

from datetime import datetime
import sys

from azure.storage.file import FileService
import azure.mgmt.batchai.models as models

# The BatchAI/utilities folder contains helper functions used by different notebooks
sys.path.append('../../..')
import utilities as utils

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)
utils.config.create_resource_group(cfg)
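
The cell above reads several values from `cfg` (for example `resource_group`, `workspace`, `location`, `storage_account_name`, `storage_account_key` and `admin`). As a rough illustration of checking such values before any Azure call is made, here is a minimal sketch. The flat JSON layout and the `load_flat_config` helper are assumptions made for this example; the real `configuration.json` produced by the recipe instructions may be structured differently.

```python
import json
import os
import tempfile

# Attribute names this notebook reads from `cfg`. The flat JSON layout is an
# assumption for illustration, not the real configuration file format.
REQUIRED_KEYS = ['resource_group', 'workspace', 'location',
                 'storage_account_name', 'storage_account_key', 'admin']


def load_flat_config(path):
    """Load a JSON file and fail early if a required value is missing or empty."""
    with open(path) as f:
        data = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError('Missing configuration values: ' + ', '.join(missing))
    return data


# Usage with a throwaway file that holds placeholder values only.
sample = {key: 'placeholder' for key in REQUIRED_KEYS}
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump(sample, tmp)
loaded = load_flat_config(tmp.name)
os.remove(tmp.name)
print(sorted(loaded))
```

Failing early with a list of missing keys tends to be easier to diagnose than an authentication error raised later by the service.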

Create the resource group and the Batch AI workspace if they do not exist yet:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Script in Azure Storage

### Create Azure File Share

For this example we will create a new File Share with name `batchaisample` under your storage account. This will be used to share the *training script file* and *output file*.

**Note** You don't need to create a new file share for every cluster. We do so in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')
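
Azure file share names follow DNS-style naming rules: 3 to 63 characters, lowercase letters, digits and hyphens only, starting and ending with a letter or digit, with no consecutive hyphens. If you change `azure_file_share_name`, a small local check such as the sketch below can catch an invalid name before the request is sent. The `is_valid_share_name` helper is hypothetical and not part of the `utilities` module or the Azure SDK.

```python
import re

# One or more alphanumeric groups separated by single hyphens.
_SHARE_NAME_PATTERN = re.compile(r'[a-z0-9]+(?:-[a-z0-9]+)*')


def is_valid_share_name(name):
    """Return True if `name` satisfies the share naming rules listed above."""
    return 3 <= len(name) <= 63 and _SHARE_NAME_PATTERN.fullmatch(name) is not None


for candidate in ['batchaisample', 'BatchAISample', 'batch--ai', 'ab']:
    print(candidate, is_valid_share_name(candidate))
```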

### Deploy Sample Script and Configure the Input Directories


Download the original sample script:

In [None]:
script_to_deploy = 'convolutional.py'
utils.dataset.download_file('https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py', script_to_deploy)
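
For readers who want to see what a download step like this involves, the following is a minimal sketch using only the Python 3 standard library. `download_if_missing` is a hypothetical stand-in written for this example; it is not the implementation of `utils.dataset.download_file`, and skipping existing files is a design choice of the sketch. The usage part runs offline by pointing at a `file://` URL instead of the GitHub raw URL.

```python
import os
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_if_missing(url, destination):
    """Fetch `url` into `destination` unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(destination):
        return False
    with urlopen(url) as response, open(destination, 'wb') as out:
        shutil.copyfileobj(response, out)
    return True


# Offline usage: a local file:// URL stands in for the remote script.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / 'source.py'
source.write_text('print("hello")\n')
target = os.path.join(work_dir, 'copy.py')

first = download_if_missing(source.as_uri(), target)
second = download_if_missing(source.as_uri(), target)
print(first, second)
```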

For each job we will create a folder containing a copy of the sample script. This allows you to run the same job with different scripts.

In [None]:
mnist_script_directory = 'tensorflow_samples'
file_service.create_directory(
    azure_file_share_name, mnist_script_directory, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, mnist_script_directory, script_to_deploy, script_to_deploy)
print('Done')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster
- For this example we will use a GPU cluster of one `STANDARD_NC6` node. You can increase the number of nodes by changing the `nodes_count` variable.
- We will call the cluster `nc6`.

The cluster will therefore have the following parameters:

In [None]:
nodes_count = 1
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)
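
The `or None` idiom in the cell above turns an empty string from the configuration into `None`, so that an unset password or SSH key is omitted rather than sent as an empty value. The sketch below shows the same normalization as a standalone function. The `pick_credentials` helper is hypothetical, and the rule that at least one credential must be present is an assumption of this example rather than something the notebook states.

```python
def pick_credentials(password, ssh_key):
    """Normalize empty strings to None and require at least one credential."""
    password = password or None
    ssh_key = ssh_key or None
    if password is None and ssh_key is None:
        # Assumption: an admin account needs a password or an SSH public key.
        raise ValueError('Provide an admin password or an SSH public key')
    return password, ssh_key


# Placeholder values only; never hard-code real credentials.
print(pick_credentials('', 'ssh-rsa AAAA-placeholder'))
print(pick_credentials('placeholder-password', ''))
```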

### Create Compute Cluster

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the number of nodes in each state in the cluster.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)
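
The cell above prints the status once. To wait until the cluster is ready, the status can be polled in a loop. The following is a generic sketch: `wait_until`, its parameters and the state strings are made up for this example, and the sleep function is injectable so the loop can be exercised without real delays. In practice `fetch` might wrap `client.clusters.get(...)` and `is_ready` might inspect the returned cluster object, but which attribute to check is left as an assumption here.

```python
import time


def wait_until(fetch, is_ready, timeout_sec=600, poll_interval_sec=5,
               sleep=time.sleep):
    """Call `fetch` repeatedly until `is_ready(state)` is true or time runs out."""
    waited = 0
    while True:
        state = fetch()
        if is_ready(state):
            return state
        if waited >= timeout_sec:
            raise TimeoutError('Gave up after {0} seconds'.format(waited))
        sleep(poll_interval_sec)
        waited += poll_interval_sec


# Simulated usage: a fake sequence of states replaces real service calls.
fake_states = iter(['resizing', 'resizing', 'steady'])
sleeps = []
final = wait_until(lambda: next(fake_states),
                   lambda state: state == 'steady',
                   sleep=sleeps.append)
print(final, sleeps)
```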

## 3. Run Azure Batch AI Training Job

### Configure Job

- The job will use the `tensorflow/tensorflow:1.8.0-gpu` container.
- The job will read the script from, and write its output to, the file share prepared above.
- The file share will be mounted at a folder named `afs`. The full path of this folder on a compute node will be `$AZ_BATCHAI_JOB_MOUNT_ROOT/afs`.
- The job needs to know where to find `convolutional.py`. We will pass its full path under the mounted file share via `python_script_file_path`.
- We will store the standard and error output of the job on the file share.
- If you remove `container_settings`, the job will run directly on the host VMs, which is an option if you are using a DSVM.

**Note** You must agree to the following license before using this container:
- [TensorFlow License](https://github.com/tensorflow/tensorflow/blob/master/LICENSE)

In [None]:
azure_file_share = 'afs'
parameters = models.JobCreateParameters(
     cluster=models.ResourceId(id=cluster.id),
     node_count=nodes_count,
     std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share),
     mount_volumes=models.MountVolumes(
            azure_file_shares=[
                models.AzureFileShareReference(
                    account_name=cfg.storage_account_name,
                    credentials=models.AzureStorageCredentialsInfo(
                        account_key=cfg.storage_account_key),
                    azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                        cfg.storage_account_name, azure_file_share_name),
                    relative_mount_path=azure_file_share)
            ]
        ),
     container_settings=models.ContainerSettings(
         image_source_registry=models.ImageSourceRegistry(image='tensorflow/tensorflow:1.8.0-gpu')),
     tensor_flow_settings=models.TensorFlowSettings(
         python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/convolutional.py'.format(azure_file_share, mnist_script_directory)
     )
)
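
The job parameters above build two strings by formatting: the URL of the file share and the path of the script as seen from a compute node. The sketch below reproduces that composition in isolation so the resulting values are easy to inspect. The helper names and the `mystorageaccount` account name are placeholders invented for this example.

```python
import posixpath

# Left unexpanded on purpose: Batch AI resolves this variable on the node.
MOUNT_ROOT = '$AZ_BATCHAI_JOB_MOUNT_ROOT'


def share_url(account_name, share_name):
    """URL of an Azure file share, in the format used by the job parameters."""
    return 'https://{0}.file.core.windows.net/{1}'.format(account_name, share_name)


def script_path_on_node(relative_mount_path, directory, script):
    """Path of the uploaded script below the job mount root."""
    return posixpath.join(MOUNT_ROOT, relative_mount_path, directory, script)


url = share_url('mystorageaccount', 'batchaisample')
path = script_path_on_node('afs', 'tensorflow_samples', 'convolutional.py')
print(url)   # https://mystorageaccount.file.core.windows.net/batchaisample
print(path)  # $AZ_BATCHAI_JOB_MOUNT_ROOT/afs/tensorflow_samples/convolutional.py
```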

### Create a Training Job


In [None]:
experiment_name = 'tensorflow_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime('tf_%m_%d_%Y_%H%M%S')
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job {0} in Experiment {1}'.format(job.name, experiment.name))
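
The job name above is derived from the current UTC time, which makes repeated runs of the notebook produce distinct names as long as they start at least a second apart. The sketch below applies the same `strftime` pattern to a fixed timestamp so the resulting format is visible. The `make_job_name` helper is introduced only for this illustration.

```python
from datetime import datetime


def make_job_name(now, prefix='tf'):
    """Build a job name such as tf_MM_DD_YYYY_HHMMSS from a datetime."""
    return now.strftime(prefix + '_%m_%d_%Y_%H%M%S')


example_name = make_job_name(datetime(2018, 6, 1, 12, 30, 45))
print(example_name)  # tf_06_01_2018_123045
```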

### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start, printing the cluster state while it waits. While the job runs, the code prints the current content of `stdout-wk-0.txt` (the output of the worker running on the first node).

In [None]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace, 
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout-wk-0.txt')

### List Standard Output and Error Files for the Job

In [None]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='stdouterr')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')
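
The loop above prints `'directory'` for entries that have no `download_url`. Building on that convention, a listing can be split into downloadable files and directories, as in the minimal sketch below. It uses `SimpleNamespace` objects with invented URLs and a made-up `subdir` entry in place of real service results; only the `name` and `download_url` attributes are taken from the cell above.

```python
from types import SimpleNamespace


def split_listing(entries):
    """Separate entries into {file name: URL} and a list of directory names."""
    files, directories = {}, []
    for entry in entries:
        if entry.download_url:
            files[entry.name] = entry.download_url
        else:
            directories.append(entry.name)
    return files, directories


# Stand-in listing with placeholder URLs; not real service output.
listing = [
    SimpleNamespace(name='stdout-wk-0.txt',
                    download_url='https://example.invalid/stdout-wk-0.txt'),
    SimpleNamespace(name='subdir', download_url=None),
]
found_files, found_dirs = split_listing(listing)
print(found_files)
print(found_dirs)
```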

## 4. Clean Up (Optional)

### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, cfg.workspace, experiment_name, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs you can delete the file share completely with all files using the following code.

In [None]:
# The share was created with FileService, so it is deleted the same way
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)