# Random Search


## Introduction

This example shows how to perform random search hyperparameter tuning with CNTK, training a convolutional neural network (CNN) on the MNIST dataset on a GPU cluster. It uses the Batch AI Extensions: the JobFactory module generates values for the hyperparameters, and the ExperimentUtils module handles bulk job submission.

## Details

- We provide a CNTK example, [ConvMNIST.py](../ConvMNIST.py), that accepts command line arguments for the dataset and model locations, the model file suffix, and two hyperparameters to tune: (1) hidden layer dimension and (2) feedforward constant;
- For demonstration purposes, the MNIST dataset will be deployed to an Azure Blob Container and the CNTK training script to an Azure File Share;
- Standard output of the job and the model will be stored on the Azure File Share;
- The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been preprocessed using install_mnist.py and is available [here](https://batchaisamples.blob.core.windows.net/samples/mnist_dataset.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D).

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [15]:
from __future__ import print_function

import sys

import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService

sys.path.append('../../..')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)

Create the resource group and Batch AI workspace if they do not exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create Azure Blob Container

We will create a new Blob Container named `batchaisample` under your storage account. It will be used to store the *input training dataset*.

**Note** You don't need to create a new Blob Container for every cluster. We do so in this sample to simplify resource management for you.

In [16]:
azure_blob_container_name = 'batchaisample'
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

False

### Upload MNIST Dataset to Azure Blob Container

For demonstration purposes, we will download the preprocessed MNIST dataset to the current directory and upload it to an Azure Blob Container directory named `mnist_dataset`.

There are multiple ways to create folders and upload files into an Azure Blob Container: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI2](/azure-cli-extension), or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into the Blob Container.

In [17]:
mnist_dataset_directory = 'mnist_dataset'
utils.dataset.download_and_upload_mnist_dataset_to_blob(
    blob_service, azure_blob_container_name, mnist_dataset_directory)

Uploading MNIST dataset...
Done


### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. It will be used to share the *training script file* and the *output files*.

**Note** You don't need to create a new File Share for every cluster. We do so in this sample to simplify resource management for you.

In [18]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

False

Upload the training script [ConvMNIST.py](../ConvMNIST.py) to a File Share directory named `hyperparam_samples`.

In [19]:
cntk_script_path = "hyperparam_samples"
file_service.create_directory(
    azure_file_share_name, cntk_script_path, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, cntk_script_path, 'ConvMNIST.py', '../ConvMNIST.py')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC6` nodes. The number of nodes in the cluster is set with the `nodes_count` variable;
- We will call the cluster `nc6`.

The cluster will therefore have the following parameters:

In [20]:
nodes_count = 4
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [21]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [22]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

Cluster state: steady Target: 4; Allocated: 4; Idle: 4; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0


## 3. Parametric Sweeping using Random Search


Create a new experiment called `random_search_experiment`:

In [23]:
experiment_name = 'random_search_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()

Define specifications for the hyperparameters

In [24]:
param_specs = [
    NumericParameter(
        parameter_name="FEEDFORWARD_CONSTANT",
        data_type="REAL",
        start=0.001,
        end=10,
        scale="LOG"
    ),
    DiscreteParameter(
        parameter_name="HIDDEN_LAYERS_DIMENSION",
        values=[100, 200, 300]
    )
]
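
To make the specification above more concrete, here is a minimal, standalone sketch of what random search over these two hyperparameters could look like. It assumes that `scale="LOG"` means the value is drawn uniformly in log space between `start` and `end`; that is the usual meaning of a log scale, but whether the JobFactory module samples exactly this way is an assumption. The helper names `sample_log_uniform` and `sample_configs` are hypothetical and not part of the `utilities` package.

```python
import math
import random


def sample_log_uniform(start, end, rng=random):
    """Draw one value whose logarithm is uniform on [log(start), log(end)]."""
    if start <= 0 or end <= start:
        raise ValueError("a log scale needs 0 < start < end")
    return math.exp(rng.uniform(math.log(start), math.log(end)))


def sample_configs(num_configs, seed=None):
    """Return num_configs independent random hyperparameter combinations."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return [
        {
            # Continuous parameter, sampled in log space (assumed behaviour).
            "FEEDFORWARD_CONSTANT": sample_log_uniform(0.001, 10, rng),
            # Discrete parameter, each listed value equally likely.
            "HIDDEN_LAYERS_DIMENSION": rng.choice([100, 200, 300]),
        }
        for _ in range(num_configs)
    ]


for config in sample_configs(3, seed=0):
    print(config)
```

Sampling in log space spends roughly equal effort on each order of magnitude (0.001 to 0.01, 0.01 to 0.1, and so on), which is typically what one wants for scale-like hyperparameters.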

Create a parameter substitution object.

In [25]:
parameters = ParameterSweep(param_specs)

We will use the parameter substitution object to specify where we would like to substitute the parameters. We substitute the values for feedforward constant and hidden layers dimension into `models.JobCreateParameters.cntk_settings.command_line_args`. Note that the `parameters` variable is used like a dict, with the `parameter_name` as the key that selects which parameter to substitute. When `parameters.generate_jobs_random_search` is called, the `parameters[name]` placeholders are replaced with actual values.

In [26]:
azure_file_share_mount_path = 'afs'
azure_blob_mount_path = 'bfs'
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path),
    input_directories = [
        models.InputDirectory(
            id='SCRIPT',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, mnist_dataset_directory))
    ],
    output_directories = [
        models.OutputDirectory(
            id='ALL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))],
    mount_volumes = models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share_mount_path)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_container_name,
                relative_mount_path=azure_blob_mount_path)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='microsoft/cntk:2.5.1-gpu-python2.7-cuda9.0-cudnn7.0')
    ),
    cntk_settings=models.CNTKsettings(
        python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/ConvMNIST.py'.format(azure_file_share_mount_path, cntk_script_path),
        command_line_args='--epochs 16 --feedforward_const {0} --hidden_layers_dim {1} --datadir $AZ_BATCHAI_INPUT_SCRIPT --outputdir $AZ_BATCHAI_OUTPUT_ALL --logdir $AZ_BATCHAI_OUTPUT_ALL'
            .format(parameters['FEEDFORWARD_CONSTANT'], parameters['HIDDEN_LAYERS_DIMENSION'])  # Substitute hyperparameters
    )
)
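
The dict-like substitution used above can be pictured with the following toy sketch. It is not the `ParameterSweep` implementation: the class `PlaceholderSweep`, its `env_for` method, and the idea of returning a `$PARAM_<name>` reference are all assumptions, guessed from the `PARAM_`-prefixed names that appear in the job output and from the fact that the final cell reads the winning values from the job's environment variables.

```python
class PlaceholderSweep(object):
    """Hypothetical stand-in for a dict-like parameter substitution object."""

    def __init__(self, names):
        self._names = list(names)

    def __getitem__(self, name):
        if name not in self._names:
            raise KeyError(name)
        # Return a shell-style reference rather than a concrete value, so the
        # same command line template can be reused for every generated job.
        return '$PARAM_' + name

    def env_for(self, values):
        """Map concrete values to the variables the placeholders refer to."""
        return {'PARAM_' + name: str(values[name]) for name in self._names}


sweep = PlaceholderSweep(['FEEDFORWARD_CONSTANT', 'HIDDEN_LAYERS_DIMENSION'])

# The template is formatted once; only the environment differs per job.
args = '--feedforward_const {0} --hidden_layers_dim {1}'.format(
    sweep['FEEDFORWARD_CONSTANT'], sweep['HIDDEN_LAYERS_DIMENSION'])
print(args)
print(sweep.env_for({'FEEDFORWARD_CONSTANT': 0.5, 'HIDDEN_LAYERS_DIMENSION': 200}))
```

Under this design, one job template serves every trial and each job carries its own hyperparameter values separately from the command line.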

Next, we generate a list of jobs to submit and then submit the jobs to an experiment.

In [27]:
# Generate Jobs
num_configs = 16
jobs_to_submit, param_combinations = parameters.generate_jobs_random_search(jcp, num_configs)

# Print the parameter combinations generated
for idx, comb in enumerate(param_combinations):
    print("Parameters {0}: {1}".format(idx + 1, comb))

# Submit Jobs
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)
jobs = experiment_utils.submit_jobs(jobs_to_submit, 'mnist_hyperparam_job').result()

Parameters 1: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTANT': 1.7858414194061631}
Parameters 2: {'PARAM_HIDDEN_LAYERS_DIMENSION': 100, 'PARAM_FEEDFORWARD_CONSTANT': 3.529620296029193}
Parameters 3: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTANT': 0.076836488580194}
Parameters 4: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTANT': 2.3612612183004096}
Parameters 5: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTANT': 3.866318922076553}
Parameters 6: {'PARAM_HIDDEN_LAYERS_DIMENSION': 100, 'PARAM_FEEDFORWARD_CONSTANT': 4.766523112243888}
Parameters 7: {'PARAM_HIDDEN_LAYERS_DIMENSION': 300, 'PARAM_FEEDFORWARD_CONSTANT': 3.48412799040511}
Parameters 8: {'PARAM_HIDDEN_LAYERS_DIMENSION': 300, 'PARAM_FEEDFORWARD_CONSTANT': 2.7225566670292016}
Parameters 9: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTANT': 1.3910036577234637}
Parameters 10: {'PARAM_HIDDEN_LAYERS_DIMENSION': 200, 'PARAM_FEEDFORWARD_CONSTA

We define the following metric extractor to pull the desired metric out of the training log file.
- In this example, we extract the number between "metric =" and "%".

In [28]:
metric_extractor = utils.job.MetricExtractor(
                        output_dir_id='ALL',
                        logfile='progress.log',
                        regex=r'metric =(.*?)\%')  # raw string avoids an invalid '\%' escape
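
As a rough illustration of what this pattern captures, the sketch below applies the same regular expression with Python's standard `re` module. The log lines are hypothetical, only shaped like CNTK progress output, and taking the last match in the file is an assumption; the actual `MetricExtractor` may select a match differently. The helper `extract_last_metric` is not part of the `utilities` package.

```python
import re

# Same pattern as above: lazily capture everything between "metric =" and "%".
METRIC_PATTERN = re.compile(r'metric =(.*?)%')


def extract_last_metric(log_text):
    """Return the last 'metric = N%' value in log_text as a float, or None."""
    matches = METRIC_PATTERN.findall(log_text)
    # float() tolerates the leading space left in the captured group.
    return float(matches[-1]) if matches else None


# Hypothetical log content, made up for illustration.
sample_log = (
    "Finished Epoch[1 of 16]: loss = 0.215 * 60000, metric = 6.25% * 60000\n"
    "Finished Epoch[2 of 16]: loss = 0.071 * 60000, metric = 2.10% * 60000\n"
)
print(extract_last_metric(sample_log))
```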

We wait for the jobs to finish, then get the metric value from the log files of the finished jobs.

In [29]:
# Wait for all jobs to complete
experiment_utils.wait_all_jobs()

# Get the metrics from the jobs
results = experiment_utils.get_metrics_for_jobs(jobs, metric_extractor)
results.sort(key=lambda r: r['metric_value'])

# Print results
for result in results:
    print("Job {0} completed with metric value {1}".format(result['job_name'], result['metric_value']))
print("Best job: {0} with parameters {1}".format(
    results[0]['job_name'], 
    {ev.name:ev.value for ev in results[0]['job'].environment_variables}
))

0/16 jobs completed (0 succeeded, 0 failed)...............
0/16 jobs completed (0 succeeded, 0 failed)...............
3/16 jobs completed (3 succeeded, 0 failed)...............
3/16 jobs completed (3 succeeded, 0 failed)...............
6/16 jobs completed (6 succeeded, 0 failed)...............
6/16 jobs completed (6 succeeded, 0 failed)...............
9/16 jobs completed (9 succeeded, 0 failed)...............
9/16 jobs completed (9 succeeded, 0 failed)...............
12/16 jobs completed (12 succeeded, 0 failed)...............
12/16 jobs completed (12 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 failed)...............
15/16 jobs completed (15 succeeded, 0 fa
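
The ranking step sorts in ascending order and treats the first entry as the best job, which is appropriate when the metric is an error rate. The sketch below shows the same ranking on made-up data and additionally skips entries whose metric is missing. It assumes a missing metric is reported as `None`, which is not confirmed by the source; the helper `rank_by_metric`, the job names, and the values are hypothetical.

```python
def rank_by_metric(results):
    """Sort results by ascending metric, dropping entries with no metric."""
    scored = [r for r in results if r.get('metric_value') is not None]
    return sorted(scored, key=lambda r: r['metric_value'])


# Hypothetical results; job names and values are invented for illustration.
example_results = [
    {'job_name': 'job_a', 'metric_value': 1.8},
    {'job_name': 'job_b', 'metric_value': None},  # e.g. no metric line found
    {'job_name': 'job_c', 'metric_value': 0.9},
]

ranked = rank_by_metric(example_results)
print([r['job_name'] for r in ranked])
```

Filtering first matters on Python 3, where comparing `None` with a number during sorting raises a `TypeError`.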

## 4. Clean Up (Optional)

### Delete the Experiment
Delete the experiment and the jobs inside it.

In [None]:
_ = client.experiments.delete(cfg.resource_group, cfg.workspace, experiment_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name).result()

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the File Share, including all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)