# LAB 4 - Batch AI - Random Search CNTK GPU (Multi-GPU, Multi-node)


## Introduction

This example shows how to perform random search hyperparameter tuning with CNTK, training a convolutional neural network (CNN) on the MNIST dataset on a GPU cluster.

## Details

- We provide a CNTK example, ConvMNIST.py, that accepts command line arguments for the CNTK dataset location, model location, model file suffix, and two hyperparameters for tuning: (1) the hidden layer dimension and (2) the feedforward constant.
- For demonstration purposes, the MNIST dataset will be deployed to an Azure Blob Container and the CNTK training script to an Azure File Share.
- Standard output of the job and the model will be stored on the Azure File Share.
- The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been preprocessed using install_mnist.py.
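
The command line interface described above could look roughly like the following sketch. It is not the actual ConvMNIST.py: the flag names are taken from the command line used later in this lab, while the defaults and help texts are assumptions made for illustration (the model file suffix argument is omitted because its flag name is not shown in this lab).

```python
import argparse


def build_parser():
    """Build a parser for the flags that this lab passes to ConvMNIST.py."""
    parser = argparse.ArgumentParser(description="Train a CNN on MNIST (sketch)")
    parser.add_argument("--datadir", required=True, help="directory with the MNIST files")
    parser.add_argument("--outputdir", required=True, help="directory for the trained model")
    parser.add_argument("--logdir", required=True, help="directory for progress logs")
    parser.add_argument("--epochs", type=int, default=16, help="number of training epochs")
    # The two tuned hyperparameters. These defaults are placeholders, not the script's.
    parser.add_argument("--feedforward_const", type=float, default=0.01)
    parser.add_argument("--hidden_layers_dim", type=int, default=200)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--datadir", "/example/data",
    "--outputdir", "/example/out",
    "--logdir", "/example/out",
    "--feedforward_const", "0.05",
    "--hidden_layers_dim", "300",
])
print(args.epochs, args.feedforward_const, args.hidden_layers_dim)
```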

## Instructions

### Create the Azure Resource Group and Storage Account
```
> ssh sshuser@YOUR.VM.IP.ADDRESS
> sudo pip install azure
> sudo pip3 install azure-mgmt-batchai --upgrade
> az login
> az group create --name batchai_rg  --location eastus
> az storage account create --location eastus --name batchaipablo --resource-group batchai_rg --sku Standard_LRS
> az storage account keys list --account-name batchaipablo --resource-group batchai_rg -o table
> az ad sp create-for-rbac --name MyAppSvcPppl --password Passw0rd
> az storage account keys list --account-name batchaipablo --resource-group batchai_rg
```

### Read Configuration and Create Batch AI client

In [8]:
from __future__ import print_function

from datetime import datetime
import os
import sys
import zipfile
import numpy
import queue
import threading
import requests

from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
import azure.mgmt.batchai.models as models

# utilities.py contains helper functions
import utilities
import hyperparam_utilities
from hyperparam_utilities import Hyperparameter, MetricExtractor, run_then_return_metric

# Resource Group
location = 'eastus'
resource_group = 'batchai_rg'

# credentials used for authentication
client_id = 'ec0640c7-61fa-4662-bce4-8a3e931939ac'
secret = 'Passw0rd'
token_uri = 'https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token'
subscription_id = 'b1395605-1fe9-4af4-b3ff-82a4725a3791'

# credentials used for storage
storage_account_name = 'batchaipablo'
storage_account_key = 'y59heteYEbw5nTLBB/b7rj3jUphvs2Iwslg4AsXFSb4G7ZLgJUep4AuccSmST7I3E8Zw4BaUloebK+VyKmGpog=='

# credentials used to log in remotely to the GPU nodes
admin_user_name = 'sshuser'
admin_user_password = 'Passw0rd.1!!'
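
The cell above hard-codes credentials for brevity. As one possible alternative, the sketch below reads the same settings from environment variables so that secrets stay out of the notebook. The environment variable names and the `load_settings` helper are hypothetical and are not part of this lab's `utilities.py`.

```python
import os

# Hypothetical mapping from the variable names used in this lab
# to environment variable names chosen for this sketch.
ENV_NAMES = {
    "client_id": "BATCHAI_CLIENT_ID",
    "secret": "BATCHAI_SECRET",
    "token_uri": "BATCHAI_TOKEN_URI",
    "subscription_id": "BATCHAI_SUBSCRIPTION_ID",
    "storage_account_name": "BATCHAI_STORAGE_ACCOUNT",
    "storage_account_key": "BATCHAI_STORAGE_KEY",
}


def load_settings(environ=os.environ):
    """Return a dict of settings, failing early if any variable is missing."""
    missing = sorted(var for var in ENV_NAMES.values() if var not in environ)
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in ENV_NAMES.items()}


# Demonstration with placeholder values instead of the real environment.
example_env = {var: "placeholder-" + key for key, var in ENV_NAMES.items()}
settings = load_settings(example_env)
print(sorted(settings))
```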

In [2]:
from azure.common.credentials import ServicePrincipalCredentials
import azure.mgmt.batchai as batchai
import azure.mgmt.batchai.models as models

creds = ServicePrincipalCredentials(client_id=client_id, secret=secret, token_uri=token_uri)

client = batchai.BatchAIManagementClient(credentials=creds,subscription_id=subscription_id)


### Create Azure Blob Container

We will create a new Blob Container named `batchailab4` under your storage account. It will be used to store the input training dataset.

**Note** You don't need to create a new blob container for every cluster. We are doing this in this sample to simplify resource management for you.

In [3]:
azure_blob_container_name = 'batchailab4'
blob_service = BlockBlobService(storage_account_name, storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

False

### Upload MNIST Dataset to Azure Blob Container

For demonstration purposes, we will download the preprocessed MNIST dataset to the current directory and upload it to an Azure Blob Container directory named mnist_dataset.

There are multiple ways to create folders and upload files into an Azure Blob Container: you can use the Azure Portal, Storage Explorer, Azure CLI 2.0, or the Azure SDK for your preferred programming language. In this example we will use the Azure SDK for Python to copy files into the blob container.

In [4]:
mnist_dataset_directory = 'mnist_dataset'
utilities.download_and_upload_mnist_dataset_to_blob(
    blob_service, azure_blob_container_name, mnist_dataset_directory)

Uploading MNIST dataset...
Done


### Create File Share

For this example we will create a new File Share named `batchailab4` under your storage account. It will be used to share the training script file and the output files.

**Note** You don't need to create a new file share for every cluster. We are doing this in this sample to simplify resource management for you.

In [5]:
azure_file_share_name = 'batchailab4'
file_service = FileService(storage_account_name, storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

False

Upload the training script ConvMNIST.py to the file share directory named hyperparam_samples.

In [6]:
cntk_script_path = "hyperparam_samples"
file_service.create_directory(
    azure_file_share_name, cntk_script_path, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, cntk_script_path, 'ConvMNIST.py', 'ConvMNIST.py')

### Configure Compute Cluster

- For this example we will use a GPU cluster of Standard_NC6 nodes. The number of nodes in the cluster is set by the nodes_count variable.
- We will mount the blob container at a folder named external_ABFS. The full path of this folder on a compute node will be $AZ_BATCHAI_MOUNT_ROOT/external_ABFS.
- We will mount the file share at a folder named external_AFS. The full path of this folder on a compute node will be $AZ_BATCHAI_MOUNT_ROOT/external_AFS.
- We will call the cluster nc6.
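
To make the mount layout concrete, the sketch below composes the paths a job would see on a compute node. The `mounted_path` helper is hypothetical, and the mount root value is a placeholder: on a real node it comes from the `AZ_BATCHAI_MOUNT_ROOT` environment variable.

```python
import posixpath


def mounted_path(mount_root, relative_mount_path, *parts):
    """Join the mount root, a relative mount path, and optional sub-paths.

    posixpath is used because the compute nodes in this lab run Linux.
    """
    return posixpath.join(mount_root, relative_mount_path, *parts)


# Placeholder root for illustration only.
example_root = "/example/mount_root"

dataset_dir = mounted_path(example_root, "external_ABFS", "mnist_dataset")
script_file = mounted_path(example_root, "external_AFS", "hyperparam_samples", "ConvMNIST.py")
print(dataset_dir)
print(script_file)
```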

So, the cluster will have the following parameters:

In [7]:
azure_file_share = 'external_AFS'
azure_blob = 'external_ABFS'
nodes_count = 4
cluster_name = 'nc6'
vmsize = "Standard_NC6"

volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=storage_account_key),
            azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share)
    ],
    azure_blob_file_systems=[
        models.AzureBlobFileSystemReference(
            account_name=storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=storage_account_key),
            container_name=azure_blob_container_name,
            relative_mount_path=azure_blob)
    ]
)

parameters = models.ClusterCreateParameters(
    location=location,
    vm_size=vmsize,
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="microsoft-ads",
            offer="linux-data-science-vm-ubuntu",
            sku="linuxdsvmubuntu",
            version="latest")),    
    user_account_settings=models.UserAccountSettings(
        admin_user_name=admin_user_name,
        admin_user_password=admin_user_password),
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    node_setup=models.NodeSetup(
        mount_volumes=volumes,
    )
)

### Create Compute Cluster

In [9]:
cluster = client.clusters.create(resource_group, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. utilities.py contains a helper function that prints the detailed status of the cluster.

In [17]:
cluster = client.clusters.get(resource_group, cluster_name)
utilities.print_cluster_status(cluster)


Cluster state: steady Target: 4; Allocated: 4; Idle: 4; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
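
Rather than re-running the status cell by hand, one could poll until the cluster reports the `steady` state shown above. The following is a minimal sketch of such a loop. `wait_until_steady` is a hypothetical helper, and `get_state` is a caller-supplied function, because the attribute that holds the state on the object returned by `client.clusters.get` is not shown in this lab.

```python
import time


def wait_until_steady(get_state, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_state() until it returns 'steady'; return the seconds waited."""
    waited = 0
    while True:
        if get_state() == "steady":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError("cluster did not reach the steady state in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake sequence of states and a no-op sleep.
fake_states = iter(["resizing", "resizing", "steady"])
seconds = wait_until_steady(lambda: next(fake_states), poll_seconds=30, sleep=lambda s: None)
print("waited", seconds, "seconds")
```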


## Parametric Sweeping using Random Search
Define the hyperparameter search space.

In [11]:
import hyperparam_utilities
from hyperparam_utilities import Hyperparameter, MetricExtractor, run_then_return_metric

space = {Hyperparameter('feedforward constant', 'feedforward_const', 'log', [0.0001, 10]),
         Hyperparameter('hidden layers dimension', 'hidden_layers_dim', 'choice', [100, 200, 300])}
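
The space above combines a log-scaled range with a discrete choice. The implementation in `hyperparam_utilities` is not shown in this lab, so the sketch below only illustrates one common way such sampling can be done: the `log` parameter is drawn uniformly in log space, and the `choice` parameter is picked uniformly from its options. All function names here are hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_choice(options, rng=random):
    """Pick one of the listed options with equal probability."""
    return rng.choice(options)


def sample_configuration(rng=random):
    """Sample one configuration from the search space used in this lab."""
    return {
        "feedforward_const": sample_log_uniform(0.0001, 10, rng),
        "hidden_layers_dim": sample_choice([100, 200, 300], rng),
    }


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for i in range(3):
    print(i, sample_configuration(rng))
```

Sampling in log space spreads the draws evenly across orders of magnitude, which suits a range as wide as 0.0001 to 10.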

Define the total number of hyperparameter configurations we want to try

In [12]:
num_configs = 16

Generate num_configs random hyperparameter configurations, each keyed by its index.

In [13]:
job_configs = {}
for i in range(num_configs):
    job_configs[i] = Hyperparameter.get_random_hyperparameter_configuration(space)
    print(str(i) + ' : ' + str(job_configs[i]))

0 : {'hidden_layers_dim': 300, 'feedforward_const': 0.043951574155499704}
1 : {'hidden_layers_dim': 300, 'feedforward_const': 0.00013025752255328113}
2 : {'hidden_layers_dim': 200, 'feedforward_const': 5.414203512473023}
3 : {'hidden_layers_dim': 300, 'feedforward_const': 5.582105156163857}
4 : {'hidden_layers_dim': 200, 'feedforward_const': 0.003246553415665837}
5 : {'hidden_layers_dim': 200, 'feedforward_const': 0.0019433278974297264}
6 : {'hidden_layers_dim': 300, 'feedforward_const': 0.006299208962397572}
7 : {'hidden_layers_dim': 100, 'feedforward_const': 0.00037441504751268363}
8 : {'hidden_layers_dim': 300, 'feedforward_const': 0.9893207844029632}
9 : {'hidden_layers_dim': 100, 'feedforward_const': 0.7318551250004928}
10 : {'hidden_layers_dim': 100, 'feedforward_const': 0.716883342271889}
11 : {'hidden_layers_dim': 100, 'feedforward_const': 0.0025973455297094625}
12 : {'hidden_layers_dim': 100, 'feedforward_const': 0.05601180437954487}
13 : {'hidden_layers_dim': 200, 'feedforwar

The following helper function constructs the job creation parameters for a given hyperparameter configuration.

In [14]:
def generate_job_create_parameters(configs):
    """Build the job creation parameters for one hyperparameter configuration."""
    # Pass each hyperparameter to the job as a HYPERPARAM_<name> environment variable.
    environment_variables = []
    for config in configs:
        environment_variables.append(models.EnvironmentVariable(
            name='HYPERPARAM_' + config,
            value=str(configs[config])))

    parameter = models.JobCreateParameters(
        location=location,
        cluster=models.ResourceId(id=cluster.id),
        node_count=1,
        # Standard output, logs, and models all go to the mounted file share.
        std_out_err_path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share),
        environment_variables=environment_variables,
        output_directories=[
            models.OutputDirectory(
                id='ALL',
                path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share))
        ],
        cntk_settings=models.CNTKsettings(
            python_script_file_path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}/ConvMNIST.py'.format(azure_file_share, cntk_script_path),
            # The dataset is read from the mounted blob container.
            command_line_args='--datadir {0} --outputdir $AZ_BATCHAI_OUTPUT_ALL --logdir $AZ_BATCHAI_OUTPUT_ALL --epochs 16 --feedforward_const $HYPERPARAM_feedforward_const --hidden_layers_dim $HYPERPARAM_hidden_layers_dim'.format(
                '$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_blob, mnist_dataset_directory))
        )
    )
    return parameter
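
The helper above hands each hyperparameter to the job as a `HYPERPARAM_<name>` environment variable, which the command line then references with `$HYPERPARAM_<name>`. The sketch below imitates that substitution locally with `string.Template`, which accepts the same `$NAME` syntax for simple names. It only illustrates the naming scheme and is not how Batch AI itself expands variables. The helper names are hypothetical.

```python
import string


def hyperparam_environment(config):
    """Map each hyperparameter to a HYPERPARAM_<name> variable, as the lab does."""
    return {"HYPERPARAM_" + name: str(value) for name, value in config.items()}


# The hyperparameter part of the command line used in this lab.
COMMAND_TEMPLATE = (
    "--epochs 16 "
    "--feedforward_const $HYPERPARAM_feedforward_const "
    "--hidden_layers_dim $HYPERPARAM_hidden_layers_dim"
)


def expand(command_line, environment):
    """Substitute $NAME references; raises KeyError if a variable is undefined."""
    return string.Template(command_line).substitute(environment)


env = hyperparam_environment({"feedforward_const": 0.05, "hidden_layers_dim": 300})
print(expand(COMMAND_TEMPLATE, env))
```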

We define the following metric extractor to extract the desired metric from the training log file.
- In this example, we extract the number between "metric =" and "%".

In [15]:
metric_extractor = MetricExtractor(
                        list_option='ALL',
                        logfile='progress.log',
                        regex=r'metric =(.*?)\%')
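
To see what this pattern captures, the sketch below applies it to a small log with Python's `re` module. The sample log lines are made up for illustration, and taking the last match as the final metric is an assumption about how `MetricExtractor` behaves, not something this lab confirms.

```python
import re

# Same pattern as above: capture the text between "metric =" and "%".
METRIC_PATTERN = re.compile(r"metric =(.*?)%")


def extract_last_metric(log_text):
    """Return the last captured metric as a float, or None if nothing matches."""
    matches = METRIC_PATTERN.findall(log_text)
    if not matches:
        return None
    # float() tolerates the leading space captured after "metric =".
    return float(matches[-1])


# Made-up log lines that only mimic the "metric = N%" shape.
sample_log = (
    "Finished Epoch[1 of 2]: loss = 0.12, metric = 3.41%\n"
    "Finished Epoch[2 of 2]: loss = 0.05, metric = 1.27%\n"
)
print(extract_last_metric(sample_log))
```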

For each configuration, we generate job creation parameters from that configuration (the number of epochs is fixed in the command line arguments).

A new thread is started for each job; it submits and monitors the job. Once the job completes, the final metric is extracted from the log file and returned.

In [18]:
print("Submitting {0} jobs with {1} configurations ".format(str(num_configs), str(num_configs)))
val_metric = queue.PriorityQueue()
threads = []
for index in job_configs:
    parameter = generate_job_create_parameters(job_configs[index])
    t = threading.Thread(
        target=run_then_return_metric, 
        args = (index, resource_group, parameter, client, metric_extractor, val_metric))
    threads.append(t)
    t.daemon = True
    t.start()

for t in threads:
    t.join()
print("All {0} job(s) completed".format(str(num_configs)))

while not val_metric.empty():
    metric, index = val_metric.get()
    print("Config {0} produced metric {1} with params: {2}".format(index, metric, job_configs[index]))

Submitting 16 jobs with 16 configurations 
Job 46d4cd95 has completed for config 7
Job 0b075c76 has completed for config 14
Job 3781764d has completed for config 9
Job 53ca24aa has completed for config 10
Job d915518a has completed for config 4
Job e426fcb2 has completed for config 13
Job b2c8f344 has completed for config 8
Job 97988df3 has completed for config 6
Job 40c31d55 has completed for config 11
Job 2d981cc1 has completed for config 15
Job 010d4405 has completed for config 5
Job 2721e218 has completed for config 12
Job 01511755 has completed for config 2
Job f3d8724e has completed for config 1
Job 60964bcb has completed for config 3
Job 63c1fcc8 has completed for config 0
All 16 job(s) completed
Config 0 produced metric 0.22 with params: {'hidden_layers_dim': 300, 'feedforward_const': 0.043951574155499704}
Config 14 produced metric 0.31 with params: {'hidden_layers_dim': 300, 'feedforward_const': 0.05848771920419732}
Config 6 produced metric 0.36 with params: {'hidden_layers_di
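
The submit-and-collect pattern used above can be tried locally without a cluster. In the sketch below, `fake_job` is a hypothetical stand-in for `run_then_return_metric`: it computes a made-up metric instead of training. Each thread puts a `(metric, index)` tuple on a `queue.PriorityQueue`, so results come back ordered from the lowest metric to the highest, which matches the ordering of the output above.

```python
import queue
import threading


def fake_job(index, config, results):
    """Stand-in for a training job: report a made-up metric for this config."""
    metric = round(abs(config["feedforward_const"] - 0.05) * 100, 2)
    results.put((metric, index))


def run_all(job_configs):
    """Run one thread per configuration and return results sorted by metric."""
    results = queue.PriorityQueue()
    threads = []
    for index, config in job_configs.items():
        t = threading.Thread(target=fake_job, args=(index, config, results))
        t.daemon = True
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every job before reading the queue
    ranked = []
    while not results.empty():
        ranked.append(results.get())
    return ranked


example_configs = {
    0: {"feedforward_const": 0.05},
    1: {"feedforward_const": 0.5},
    2: {"feedforward_const": 0.01},
}
for metric, index in run_all(example_configs):
    print("Config {0} produced metric {1}".format(index, metric))
```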

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(resource_group, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [25]:
service = FileService(storage_account_name, storage_account_key)
service.delete_share(azure_file_share_name)

True