# Random Search

## Introduction

This example of random search hyperparameter tuning uses the Batch AI extensions: the JobFactory module generates values for the hyperparameters, and the ExperimentUtils module handles bulk job submission.

## Imports

In [1]:
from __future__ import print_function
import sys
import glob
import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService
sys.path.append('.')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter

### Read Configuration and Create Batch AI client

In [2]:
configuration_path = 'configuration.json'
cfg = utils.config.Configuration(configuration_path)
client = utils.config.create_batchai_client(cfg)

Create the resource group and Batch AI workspace if they do not exist.

In [3]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create Azure Blob Container

We will create a new Blob Container named `batchaisample` under your storage account. It will be used to store the training datasets.

**Note** You don't need to create a new Blob Container for every cluster. We do so in this sample to simplify resource management for you.

In [4]:
azure_blob_container_name = 'batchaisample'
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

False

### Upload the datasets to Azure Blob Container

We will upload the TSVs created by the [data prep notebook](00_Data_Prep.ipynb) to an Azure blob container directory named `dataset` using the Azure SDK for Python.

In [5]:
dataset_directory = 'dataset'
dataset_files = glob.glob('*.tsv')
for file in dataset_files:
    print(file)
    blob_service.create_blob_from_path(azure_blob_container_name, 
                                       dataset_directory + '/' + file,
                                       file)

questions.tsv
balanced_pairs_train.tsv
dupes_test.tsv
dupes_train.tsv
balanced_pairs_test.tsv


### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. It will be used to share the training script file created in the [create model notebook](01_Create_Model.ipynb), as well as the output files created by the script.

**Note** You don't need to create a new File Share for every cluster. We do so in this sample to simplify resource management for you.

In [6]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

False

Upload the training script to a file share directory named `hyperparam_samples`.

In [7]:
script_path = 'hyperparam_samples'
script_name = 'TrainTestClassifier.py'
file_service.create_directory(
    azure_file_share_name, script_path, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, script_path, script_name, script_name)

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

For this example we will use a cluster of `Standard_D4_v2` nodes. The number of nodes in the cluster is set by the `nodes_count` variable. We will call the cluster `d4`.

In [8]:
cluster_name = 'd4'
nodes_count = 4
vm_size = 'Standard_D4_v2'

parameters = models.ClusterCreateParameters(
    vm_size=vm_size,
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [9]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [10]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

Cluster state: steady Target: 4; Allocated: 3; Idle: 3; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
Cluster error: ClusterCoreQuotaReached: Operation results in exceeding quota limits of Standard Dv2 Family Cluster Dedicated vCPUs. Maximum allowed: 24, Current in use: 0, Additional requested: 32. Please contact support to increase the quota for resource type Standard Dv2 Family Cluster Dedicated vCPUs
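
The quota error in the output above can be checked with a little arithmetic. The sketch below uses only the numbers printed in the error message and the cluster status; the variable names are illustrative and are not part of the `utilities` module.

```python
# Numbers copied from the cluster status and error message above.
requested_nodes = 4     # "Target: 4"
vcpus_requested = 32    # "Additional requested: 32"
vcpu_quota = 24         # "Maximum allowed: 24"

# Each node therefore accounts for 32 / 4 vCPUs.
vcpus_per_node = vcpus_requested // requested_nodes

# The quota only leaves room for this many whole nodes.
max_nodes = vcpu_quota // vcpus_per_node

print("vCPUs per node: {0}, nodes that fit in the quota: {1}".format(
    vcpus_per_node, max_nodes))
```

This is consistent with the status line reporting three allocated nodes against a target of four. Lowering `nodes_count` or requesting a larger quota would presumably clear the error.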


## 3. Parameter Sweeping using Random Search

Create a new experiment called `random_search_experiment`.

In [21]:
experiment_name = 'random_search_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()

Define the specifications for the hyperparameters.

In [22]:
param_specs = [
    DiscreteParameter(
        parameter_name="ESTIMATORS",
        values=[1, 2]
    ),
    DiscreteParameter(
        parameter_name="NGRAMS",
        values=[1, 2]
    ),
]

Create a parameter substitution object.

In [23]:
parameters = ParameterSweep(param_specs)

We will use the parameter substitution object to specify where the hyperparameter values should be substituted. Here we substitute them into `models.JobCreateParameters.cntk_settings.command_line_args`. Note that the `parameters` variable is used like a dict, with the `parameter_name` serving as the key that selects which parameter to substitute. When `parameters.generate_jobs_random_search` is called, each `parameters[name]` placeholder is replaced with an actual value.

In [24]:
azure_file_share_mount_path = 'afs'
azure_blob_mount_path = 'bfs'
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path),
    input_directories = [
        models.InputDirectory(
            id='SCRIPT',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, dataset_directory))
    ],
    output_directories = [
        models.OutputDirectory(
            id='ALL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))],
    mount_volumes = models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share_mount_path)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_container_name,
                relative_mount_path=azure_blob_mount_path)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='microsoft/cntk:2.5.1-gpu-python2.7-cuda9.0-cudnn7.0')
    ),
    cntk_settings=models.CNTKsettings(
        python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
            azure_file_share_mount_path, script_path, script_name),
        command_line_args='--estimators {0} --ngrams {1} --inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL'
            .format(parameters['ESTIMATORS'], parameters['NGRAMS'])  # Substitute hyperparameters
    )
)
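
The placeholder mechanism described above can be pictured with a toy stand-in. This is only a sketch and not the actual `ParameterSweep` implementation: `ToySweep` and `resolve` are hypothetical names, and the `$PARAM_<name>` placeholder form is an assumption suggested by the `PARAM_` keys that appear when the generated combinations are printed.

```python
class ToySweep(object):
    """Hypothetical stand-in for ParameterSweep (illustration only)."""

    def __init__(self, names):
        self.names = list(names)

    def __getitem__(self, name):
        # Indexing returns a placeholder string, not a concrete value.
        if name not in self.names:
            raise KeyError(name)
        return '$PARAM_{0}'.format(name)


def resolve(template, values):
    """Replace each placeholder in the template with a concrete value."""
    for name, value in values.items():
        template = template.replace('$PARAM_{0}'.format(name), str(value))
    return template


sweep = ToySweep(['ESTIMATORS', 'NGRAMS'])

# The template is built once, before any values are chosen.
args_template = '--estimators {0} --ngrams {1}'.format(
    sweep['ESTIMATORS'], sweep['NGRAMS'])

# Each generated job then gets its own concrete values.
resolved = resolve(args_template, {'ESTIMATORS': 2, 'NGRAMS': 1})

print(args_template)
print(resolved)
```

The point of the design is that one job template can be reused for every configuration; only the values bound to the placeholders change from job to job.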

Next, we generate a list of jobs to submit.

In [25]:
# Generate Jobs
num_configs = 2
jobs_to_submit, param_combinations = parameters.generate_jobs_random_search(jcp, num_configs)

# Print the parameter combinations generated
for idx, comb in enumerate(param_combinations):
    print("Parameters {0}: {1}".format(idx + 1, comb))

Parameters 1: {'PARAM_ESTIMATORS': 2, 'PARAM_NGRAMS': 2}
Parameters 2: {'PARAM_ESTIMATORS': 1, 'PARAM_NGRAMS': 2}
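
For intuition, random search over discrete parameters amounts to drawing one value per parameter, independently, for each configuration. The following is a minimal sketch under that assumption; `sample_configs` is a hypothetical helper, not part of the `utilities` package, and whether the real helper avoids duplicate combinations is not stated in this notebook.

```python
import random


def sample_configs(search_space, num_configs, seed=None):
    """Draw num_configs random combinations from a discrete search space."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    configs = []
    for _ in range(num_configs):
        # Pick one value per parameter, independently of the others.
        configs.append({name: rng.choice(values)
                        for name, values in search_space.items()})
    return configs


space = {'ESTIMATORS': [1, 2], 'NGRAMS': [1, 2]}
configs = sample_configs(space, num_configs=2, seed=0)

for idx, config in enumerate(configs):
    print("Parameters {0}: {1}".format(idx + 1, config))
```

Because each draw is independent, this sketch can return the same combination twice. With a search space as small as the one above (four combinations), a grid search would be just as cheap.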


And we submit the jobs to the experiment.

In [None]:
# Submit Jobs
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)
jobs = experiment_utils.submit_jobs(jobs_to_submit, 'mnist_hyperparam_job').result()

We define the following metric extractor to extract the desired metric from the job's log file.
- In this example, we extract the number between "metric =" and "%".

In [None]:
metric_extractor = utils.job.MetricExtractor(
                        output_dir_id='ALL',
                        logfile='progress.log',
                        regex=r'metric =(.*?)\%')  # raw string avoids an invalid '\%' escape
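
To see what this pattern captures, it can be tried on a made-up log with Python's `re` module. The log lines below are hypothetical, and taking the last match is an assumption; the notebook does not say which match `MetricExtractor` uses.

```python
import re

# Same pattern as above: capture whatever sits between "metric =" and "%".
pattern = re.compile(r'metric =(.*?)\%')


def extract_metric(log_text):
    """Return the last 'metric = N%' value in log_text as a float, or None."""
    matches = pattern.findall(log_text)
    if not matches:
        return None
    # The captured group includes the leading space, so strip it first.
    return float(matches[-1].strip())


# Hypothetical log content, for illustration only.
sample_log = "epoch 1: metric = 12.5%\nepoch 2: metric = 9.75%\n"
print(extract_metric(sample_log))
```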

We wait for the jobs to finish, then read the metric value from the log files of the finished jobs.

In [None]:
# Wait for all jobs to complete
experiment_utils.wait_all_jobs()

# Get the metrics from the jobs
results = experiment_utils.get_metrics_for_jobs(jobs, metric_extractor)
results.sort(key=lambda r: r['metric_value'])

# Print results
for result in results:
    print("Job {0} completed with metric value {1}".format(result['job_name'], result['metric_value']))
print("Best job: {0} with parameters {1}".format(
    results[0]['job_name'], 
    {ev.name:ev.value for ev in results[0]['job'].environment_variables}
))
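
The cell above sorts in ascending order, so `results[0]` is the job with the smallest metric value. Whether smallest means best depends on the metric (an error rate versus an accuracy, say). A small sketch that makes the direction explicit, using hypothetical result dictionaries with the same keys as above:

```python
def best_result(results, higher_is_better=False):
    """Pick the best entry by 'metric_value' in the stated direction."""
    def metric(result):
        return result['metric_value']

    if higher_is_better:
        return max(results, key=metric)
    return min(results, key=metric)


# Hypothetical results, for illustration only.
sample_results = [
    {'job_name': 'job_a', 'metric_value': 12.5},
    {'job_name': 'job_b', 'metric_value': 9.75},
]

lowest = best_result(sample_results)
highest = best_result(sample_results, higher_is_better=True)
print("Lowest: {0}, highest: {1}".format(lowest['job_name'], highest['job_name']))
```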

## 4. Clean Up (Optional)

### Delete the Experiment
Delete the experiment and the jobs inside it.

In [None]:
_ = client.experiments.delete(cfg.resource_group, cfg.workspace, experiment_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [12]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name).result()

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [13]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)

True