# HyperBand


## Introduction

This example shows how to perform a HyperBand hyperparameter sweep using CNTK with the MNIST dataset to train a convolutional neural network (CNN) on a GPU cluster.

## Details

- We provide a CNTK example [ConvMNIST.py](../ConvMNIST.py) that accepts command line arguments for the CNTK dataset location, model location, model file suffix and two hyperparameters for tuning: (1) hidden layer dimension and (2) feedforward constant;
- The implementation of the HyperBand algorithm is adapted from the article [*Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization*](https://people.eecs.berkeley.edu/~kjamieson/hyperband.html);
- For demonstration purposes, the MNIST dataset will be stored in an Azure Blob Container and the CNTK training script will be deployed to an Azure File Share;
- Standard output of the job and the model will be stored on Azure File Share;
- The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been preprocessed using install_mnist.py, available [here](https://batchaisamples.blob.core.windows.net/samples/mnist_dataset.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D).

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

import sys
import logging
import numpy as np

import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService

sys.path.append('../../..')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)

Create the Resource Group and Batch AI workspace if they do not exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create Azure Blob Container

We will create a new Blob Container named `batchaisample` under your storage account. It will be used to store the *input training dataset*.

**Note** You don't need to create a new Blob Container for every cluster. We do so in this sample to simplify resource management for you.

In [None]:
azure_blob_container_name = 'batchaisample'
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

### Upload MNIST Dataset to Azure Blob Container

For demonstration purposes, we will download the preprocessed MNIST dataset to the current directory and upload it to an Azure Blob Container directory named `mnist_dataset`.

There are multiple ways to create folders and upload files into an Azure Blob Container: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI 2.0](/azure-cli-extension) or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into the Blob Container.

In [None]:
mnist_dataset_directory = 'mnist_dataset'
utils.dataset.download_and_upload_mnist_dataset_to_blob(
    blob_service, azure_blob_container_name, mnist_dataset_directory)

### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. It will be used to share the *training script file* and the *output files*.

**Note** You don't need to create a new File Share for every cluster. We do so in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

Upload the training script [ConvMNIST.py](../ConvMNIST.py) to the file share directory named `hyperparam_samples`.

In [None]:
cntk_script_path = "hyperparam_samples"
file_service.create_directory(
    azure_file_share_name, cntk_script_path, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, cntk_script_path, 'ConvMNIST.py', '../ConvMNIST.py')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC6` nodes. The number of nodes in the cluster is configured with the `nodes_count` variable;
- We will call the cluster `nc6`.

So, the cluster will have the following parameters:

In [None]:
nodes_count = 4
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)
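
Cluster allocation can take several minutes, so it may be convenient to poll the status instead of re-running the cell by hand. The helper below is a minimal, generic sketch: `wait_until` is a hypothetical name that is not part of the `utilities` module, and the usage shown in the trailing comment assumes the cluster object exposes an `allocation_state` attribute.

```python
import time


def wait_until(predicate, timeout_sec=600, interval_sec=15, sleep=time.sleep):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Returns True if the predicate succeeded, False if the timeout was reached.
    The sleep function is injectable so the helper can be tested without waiting.
    """
    waited = 0
    while True:
        if predicate():
            return True
        if waited >= timeout_sec:
            return False
        sleep(interval_sec)
        waited += interval_sec


# Possible usage with the objects defined in this notebook (assumption:
# the cluster object has an `allocation_state` attribute):
#
# wait_until(lambda: client.clusters.get(
#     cfg.resource_group, cfg.workspace, cluster_name
# ).allocation_state == models.AllocationState.steady)
```

Keeping the polling logic separate from the Batch AI calls makes it easy to reuse the same helper for other long-running operations.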

## 3. Hyperparameter tuning using HyperBand

Define the specifications for the hyperparameters:

In [None]:
param_specs = [
    NumericParameter(
        parameter_name="FEEDFORWARD_CONSTANT",
        data_type="REAL",
        start=0.001,
        end=10,
        scale="LOG"
    ),
    DiscreteParameter(
        parameter_name="HIDDEN_LAYERS_DIMENSION",
        values=[100, 200, 300]
    )
]
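
To give an intuition for what a `"LOG"` scale usually means for a numeric range, the sketch below draws values whose base-10 logarithm is uniformly distributed between the two bounds. This is only an illustration of the common interpretation; `sample_log_uniform` is a hypothetical helper, and the `utilities` module may sample differently.

```python
import math
import random


def sample_log_uniform(start, end, rng=random):
    """Draw a value whose base-10 logarithm is uniform on [log10(start), log10(end)]."""
    low, high = math.log10(start), math.log10(end)
    return 10 ** rng.uniform(low, high)


# Seeded generator so the illustration is reproducible
rng = random.Random(0)

# Five illustrative draws using the same bounds and choices as param_specs
samples = [sample_log_uniform(0.001, 10, rng) for _ in range(5)]
hidden_dims = [rng.choice([100, 200, 300]) for _ in range(5)]

for value, dim in zip(samples, hidden_dims):
    print("feedforward_const={0:.4g}, hidden_layers_dim={1}".format(value, dim))
```

Sampling on a log scale spreads the draws evenly across orders of magnitude, which is why it is a common choice for parameters that span a range such as 0.001 to 10.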

Create a parameter substitution object.

In [None]:
parameters = ParameterSweep(param_specs)

The random hyperparameter configurations themselves are generated in a later step; first we specify where the sampled values should go.

We use the parameter substitution object to mark where the parameters are substituted. Here the values for the feedforward constant and hidden layers dimension go into `models.JobCreateParameters.cntk_settings.command_line_args`. Note that the `parameters` variable is used like a dict, with the `parameter_name` serving as the key that selects which parameter to substitute. When a job-generation method such as `parameters.generate_jobs_random_search` (used below) is called, the `parameters[name]` placeholders are replaced with actual values.

In [None]:
azure_file_share_mount_path = 'afs'
azure_blob_mount_path = 'bfs'
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path),
    input_directories = [
        models.InputDirectory(
            id='SCRIPT',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, mnist_dataset_directory))
    ],
    output_directories = [
        models.OutputDirectory(
            id='ALL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))],
    mount_volumes = models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share_mount_path)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_container_name,
                relative_mount_path=azure_blob_mount_path)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='microsoft/cntk:2.5.1-gpu-python2.7-cuda9.0-cudnn7.0')
    ),
    cntk_settings=models.CNTKsettings(
        python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/ConvMNIST.py'.format(azure_file_share_mount_path, cntk_script_path),
        command_line_args='--feedforward_const {0} --hidden_layers_dim {1} --epochs $PARAM_EPOCHS --datadir $AZ_BATCHAI_INPUT_SCRIPT --outputdir $AZ_BATCHAI_OUTPUT_ALL --logdir $AZ_BATCHAI_OUTPUT_ALL'
            .format(parameters['FEEDFORWARD_CONSTANT'],
                    parameters['HIDDEN_LAYERS_DIMENSION'])  # Substitute hyperparameters
    )
)
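
The placeholder idea described above can be illustrated with a small stand-alone sketch. `PlaceholderSweep`, its `<<NAME>>` marker format and its `substitute` method are all hypothetical and written only for this illustration; they are not the implementation of `ParameterSweep` in the `utilities` module.

```python
class PlaceholderSweep(object):
    """Hypothetical stand-in that hands out a unique marker per parameter name."""

    def __init__(self, names):
        self._names = list(names)

    def __getitem__(self, name):
        # Dict-like access returns a marker string, not a value
        if name not in self._names:
            raise KeyError(name)
        return '<<{0}>>'.format(name)

    def substitute(self, template, values):
        """Replace every marker in the template with the concrete value."""
        result = template
        for name in self._names:
            result = result.replace(self[name], str(values[name]))
        return result


sweep = PlaceholderSweep(['FEEDFORWARD_CONSTANT', 'HIDDEN_LAYERS_DIMENSION'])

# The template is built once, with markers where the values will go
template = '--feedforward_const {0} --hidden_layers_dim {1}'.format(
    sweep['FEEDFORWARD_CONSTANT'], sweep['HIDDEN_LAYERS_DIMENSION'])

# Each generated job then gets its own concrete values
command = sweep.substitute(
    template, {'FEEDFORWARD_CONSTANT': 0.05, 'HIDDEN_LAYERS_DIMENSION': 200})
print(command)
```

The benefit of this pattern is that the job definition is written once and reused for every sampled configuration.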

Create a new experiment.

In [None]:
experiment_name = 'hyperband_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)

We define the following metric extractor to extract the desired metric from the training log file.
- In this example, we extract the number between "metric =" and "%".

In [None]:
metric_extractor = utils.job.MetricExtractor(
                        output_dir_id='ALL',
                        logfile='progress.log',
                        regex=r'metric =(.*?)\%')
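
To see what this regular expression captures, the sketch below applies it with Python's `re` module to two made-up log lines. The log text is illustrative only and not copied from a real `progress.log`; taking the last match as the final metric is also an assumption about how the extractor behaves.

```python
import re

# Illustrative log lines (made up for this example)
sample_log = "\n".join([
    "Finished Epoch[1 of 2]: loss = 0.250 * 60000, metric = 7.10% * 60000",
    "Finished Epoch[2 of 2]: loss = 0.080 * 60000, metric = 2.45% * 60000",
])

# Same pattern as the metric extractor: capture the text between "metric =" and "%"
pattern = re.compile(r'metric =(.*?)\%')
matches = pattern.findall(sample_log)

# float() ignores the leading space captured after the equals sign
last_metric = float(matches[-1])
print("Captured groups:", matches)
print("Last metric:", last_metric)
```

The non-greedy `(.*?)` stops at the first `%` after `metric =`, so each line contributes exactly one captured value.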

Define the number of configurations and generate these jobs.

In [None]:
num_configs = 16
jobs_to_submit, param_combinations = parameters.generate_jobs_random_search(jcp, num_configs)

# Print the parameter combinations generated
for idx, comb in enumerate(param_combinations):
    print("Parameters {0}: {1}".format(idx + 1, comb))

# Add environment variable for changing number of epochs per iteration
for job in jobs_to_submit:
    job.environment_variables.append(models.EnvironmentVariable(
        name='PARAM_EPOCHS',
        value=None
    ))

Before proceeding to the following steps, please make sure you have read the article [*Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization*](https://people.eecs.berkeley.edu/~kjamieson/hyperband.html).

We use the following notation for the HyperBand parameters:
- ***max_iter***: maximum iterations/epochs per configuration
- ***eta***: downsampling rate
- ***s_max***: number of unique executions of Successive Halving (minus one)
- ***B***: total number of iterations (without reuse) per execution of Successive Halving (n,r)
- ***n***: initial number of configurations
- ***r***: initial number of iterations to run configurations for

In [None]:
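max_iter = num_configs  # maximum iterations/epochs per configuration
eta = 4  # downsampling rate
logeta = lambda x: np.log(x)/np.log(eta)
s_max = int(logeta(max_iter))  # number of unique executions of Successive Halving (minus one)
B = (s_max+1)*max_iter  # total number of iterations per execution of Successive Halving
n = int(np.ceil(B/max_iter/(s_max+1)*eta**s_max))  # initial number of configurations
r = max_iter*eta**(-s_max)  # initial number of iterations to run configurations for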
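
As a quick sanity check of these formulas, the sketch below previews the number of configurations and epochs per round without submitting any jobs. `successive_halving_schedule` is a hypothetical helper written for this illustration; it finds `s_max` with an integer loop rather than the logarithm ratio used above, which avoids floating-point rounding but is otherwise meant to yield the same schedule.

```python
def successive_halving_schedule(max_iter, eta):
    """Return (round, n_i, r_i) tuples for one execution of Successive Halving."""
    # Largest s with eta**s <= max_iter, found with integer arithmetic
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1

    n = eta ** s_max                       # initial number of configurations
    r = max_iter / float(eta ** s_max)     # initial epochs per configuration

    schedule = []
    for i in range(s_max + 1):
        n_i = n // eta ** i                # configurations still running
        r_i = int(r * eta ** i)            # epochs to run in this round
        schedule.append((i + 1, n_i, r_i))
    return schedule


schedule = successive_halving_schedule(max_iter=16, eta=4)
for round_no, n_i, r_i in schedule:
    print("Round {0}: {1} configurations, {2} epochs each".format(round_no, n_i, r_i))
```

With `max_iter = 16` and `eta = 4` the schedule has three rounds: 16 configurations for 1 epoch, 4 configurations for 4 epochs, and 1 configuration for 16 epochs.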

- The following loop implements the early-stopping procedure: it evaluates multiple configurations in parallel and terminates poorly performing ones, leaving more resources for the more promising configurations.
- Note that, for illustration purposes, the implementation below is a simplified version of the HyperBand algorithm in which the outer loop used for hedging is omitted. A full implementation of HyperBand will be provided soon.
- ***n_i*** and ***r_i*** denote the number of remaining configurations and the number of epochs to run in a given round.
- For each configuration, we generate specific job creation parameters with the given configuration and number of epochs. A new thread is started per job to submit and monitor it. Once the job completes, the final *metric* is extracted from the log file and returned.

In [None]:
for i in range(s_max+1):
    n_i = int(n*eta**(-i))
    r_i = int(r*eta**(i))
    print("******** Round #{0} ******** ".format(str(i+1)))

    # Add number of epochs to JobCreateParameters
    for job in jobs_to_submit:
        for ev in job.environment_variables:
            if ev.name == 'PARAM_EPOCHS':
                ev.value = str(r_i)

    # Submit the jobs to the experiment
    jobs = experiment_utils.submit_jobs(jobs_to_submit, 'mnist_hyperband').result()

    # Wait for the jobs to finish running
    experiment_utils.wait_all_jobs()

    # Get the results and sort by metric value
    results = experiment_utils.get_metrics_for_jobs(jobs, metric_extractor)
    results.sort(key=lambda res: res['metric_value'])
    for result in results:
        print("Job {0} completed with metric value {1}".format(result['job_name'], result['metric_value']))

    # Keep the n_i/eta best jobs and submit them again in the next round
    num_jobs_to_submit = int(n_i/eta)
    jobs_to_submit = [utils.job.convert_job_to_jcp(result['job'], client) for result in results[0:num_jobs_to_submit]]
    #### End Finite Horizon Successive Halving with (n,r)
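
The selection logic of the loop above can also be tried locally, without a cluster. The sketch below runs the same keep-the-best-`1/eta` procedure against a made-up error function; `toy_error` and `successive_halving` are hypothetical names, and the toy metric has no relation to real CNTK training results.

```python
import math
import random


def toy_error(config, epochs):
    """Made-up error: lowest near feedforward_const = 0.1, shrinking with more epochs."""
    distance = abs(math.log10(config['feedforward_const']) - math.log10(0.1))
    return distance + 1.0 / epochs


def successive_halving(configs, eta, r, rounds, evaluate):
    """Rank survivors each round (lower is better) and keep the best 1/eta of them."""
    survivors = list(configs)
    history = []
    for i in range(rounds):
        epochs = int(r * eta ** i)
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))
        history.append((len(survivors), epochs, ranked[0]))
        keep = len(survivors) // eta
        survivors = ranked[:keep]
    return history


# 16 random configurations, sampled on a log scale between 0.001 and 10
rng = random.Random(0)
configs = [{'feedforward_const': 10 ** rng.uniform(-3, 1)} for _ in range(16)]

history = successive_halving(configs, eta=4, r=1, rounds=3, evaluate=toy_error)
for count, epochs, best in history:
    print("{0} configs x {1} epochs, best feedforward_const={2:.4g}".format(
        count, epochs, best['feedforward_const']))
```

As in the notebook loop, results are sorted in ascending order, so the metric is treated as an error where smaller values are better.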

## 4. Clean Up (Optional)

### Delete the Experiment
Delete the experiment and the jobs inside it.

In [None]:
_ = client.experiments.delete(cfg.resource_group, cfg.workspace, experiment_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name).result()

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)