# Create Cluster

In this notebook we create the Batch AI cluster, then use the Batch AI extensions to generate values for hyperparameters and run a random search over them.

## Imports

In [None]:
from __future__ import print_function
import os
import sys
import glob
import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService
sys.path.append('.')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter
%load_ext dotenv

Import the contents of the `.env` file into the environment.

In [None]:
%dotenv -o
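As a rough illustration of what the `%dotenv -o` magic does, the sketch below reads `KEY=value` lines and copies them into an environment mapping, overriding existing entries when `override` is true (which is what the `-o` flag requests). This is a simplified stand-in rather than the python-dotenv implementation; the `load_env_text` helper and the sample values are hypothetical.

```python
import os


def load_env_text(text, override=True, environ=None):
    """Copy KEY=value pairs from text into environ (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        key = key.strip()
        value = value.strip().strip('"\'')  # drop surrounding quotes, if any
        if override or key not in environ:
            environ[key] = value
    return environ


# Hypothetical .env contents; a plain dict is used instead of os.environ
sample = "# registry settings\ndocker_login=myuser\nimage_repo=/myrepo\n"
env = load_env_text(sample, environ={})
print(env['docker_login'], env['image_repo'])
```

Passing a plain dict as `environ` keeps the example from touching the real process environment.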

The next cell defines the names of the files and services used or created in this notebook.

In [None]:
azure_blob_container_name = 'batchaisample'   # The Azure blob container created for the datasets
dataset_directory = 'dataset'                 # The Azure blob container directory containing the datasets
azure_file_share_name = 'batchaisample'       # The Azure file share created for the scripts and outputs
script_path = 'hyperparam_samples'            # The Azure file share directory containing the Python scripts
script_name = 'TrainTestClassifier.py'        # The script to be run
cluster_name = 'd4'                           # The Batch AI cluster
experiment_name = 'random_search_experiment'  # The Batch AI experiment
azure_file_share_mount_path = 'afs'           # The mount point of the Azure file share in the Docker container
azure_blob_mount_path = 'bfs'                 # The mount point of the Azure blob container in the Docker container
image_name = ':'.join([os.getenv('docker_login') + os.getenv('image_repo'), 'latest']) # The image used to create the Docker container

In [None]:
image_name
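If either `docker_login` or `image_repo` is missing from the environment, `os.getenv` returns `None` and the string concatenation fails with a `TypeError`. The sketch below shows one way to build the same name with a clearer error. The `build_image_name` helper and the sample values are hypothetical, and it assumes `image_repo` begins with a `/`, as the concatenation in the cell above suggests.

```python
def build_image_name(environ, tag='latest'):
    """Build '<docker_login><image_repo>:<tag>' from an environment mapping."""
    required = ('docker_login', 'image_repo')
    missing = [key for key in required if not environ.get(key)]
    if missing:
        raise KeyError('Missing settings in .env: ' + ', '.join(missing))
    return ':'.join([environ['docker_login'] + environ['image_repo'], tag])


# Hypothetical values; assumes image_repo starts with '/'
example_env = {'docker_login': 'myuser', 'image_repo': '/myrepo'}
example_image_name = build_image_name(example_env)
print(example_image_name)
```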

## Create a Batch AI client
Read the configuration, and use it to create a Batch AI client.

In [None]:
configuration_path = os.getenv('configuration_path')
cfg = utils.config.Configuration(configuration_path)
client = utils.config.create_batchai_client(cfg)

Create the resource group and Batch AI workspace if they do not exist.

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## Copy training datasets and script to Azure storage

### Azure blob container

We create a blob container named `batchaisample` in your storage account for storing the training and testing datasets.

**Note** You don't need to create a new blob container for every cluster. We are doing this here to simplify resource management.

In [None]:
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

We upload the TSVs created by the [data prep notebook](00_Data_Prep.ipynb) to an Azure blob container directory named `dataset` using the Azure SDK for Python.

In [None]:
dataset_files = glob.glob('*.tsv')
for file in dataset_files:
    print(file)
    blob_service.create_blob_from_path(azure_blob_container_name, 
                                       dataset_directory + '/' + file,
                                       file)
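A blob container has a flat namespace, so the `dataset` "directory" is simply a prefix in each blob's name. The sketch below shows the local-file-to-blob-name mapping used by the upload loop, without contacting Azure. The `blob_names` helper and the file names are hypothetical.

```python
import posixpath


def blob_names(dataset_directory, filenames):
    """Map each local file name to its blob name under the directory prefix."""
    # Blob names always use forward slashes, regardless of the local OS
    return {name: posixpath.join(dataset_directory, name) for name in filenames}


# Hypothetical TSV files
mapping = blob_names('dataset', ['train.tsv', 'test.tsv'])
for local_name, blob_name in sorted(mapping.items()):
    print(local_name, '->', blob_name)
```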

### Azure file share

We create a file share named `batchaisample` in your storage account to hold the training script file created in the [create model notebook](01_Create_Model.ipynb). This will also contain the output files created by the running script.

**Note** You don't need to create a new file share for every cluster. We are doing this here to simplify resource management.

In [None]:
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

Upload the training script to a file share directory named `hyperparam_samples`.

In [None]:
file_service.create_directory(
    azure_file_share_name, script_path, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, script_path, script_name, script_name)

## Create the Azure Batch AI compute cluster

We will create a compute cluster named `d4` with `nodes_count` nodes of the VM size given by `vm_size`.

In [None]:
nodes_count = 4
vm_size = 'STANDARD_NC6'

Create the cluster configuration parameters.

In [None]:
cluster_parameters = models.ClusterCreateParameters(
    vm_size=vm_size,
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

Create the cluster.

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, cluster_parameters).result()

Monitor the newly created cluster. The `utilities` module contains a helper function that prints a detailed status of the cluster.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

## Parameter Sweeping using Random Search
We specify the Docker image that will be used to create the containers run in the experiment.

In [None]:
container_settings = models.ContainerSettings(
    image_source_registry=models.ImageSourceRegistry(image=image_name)
)

We define the mount points to be created in each container. These will give the container access to the datasets and scripts.

In [None]:
mount_volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share_mount_path)
    ],
    azure_blob_file_systems=[
        models.AzureBlobFileSystemReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            container_name=azure_blob_container_name,
            relative_mount_path=azure_blob_mount_path)
    ]
)

Define the locations in a container's file system for
- storing the job's standard output and error,
- obtaining the datasets, and
- storing the job's outputs.

In [None]:
std_out_err_path_prefix = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path)
input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, dataset_directory))
]
output_directories = [
    models.OutputDirectory(
        id='ALL',
        path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))
]

We define the path to the training script.

In [None]:
python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
    azure_file_share_mount_path, script_path, script_name)
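The `$AZ_BATCHAI_JOB_MOUNT_ROOT` prefix is left unexpanded in the notebook and is resolved on the compute node when the job runs. To make the resulting layout concrete, the sketch below substitutes a hypothetical mount root; the value `/mnt/batch/jobroot` is an assumption chosen for illustration, not the actual path used by Batch AI.

```python
from string import Template

azure_file_share_mount_path = 'afs'
script_path = 'hyperparam_samples'
script_name = 'TrainTestClassifier.py'

# Same template as in the notebook cell above
template = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
    azure_file_share_mount_path, script_path, script_name)

# Hypothetical mount root, for illustration only
resolved = Template(template).substitute(
    AZ_BATCHAI_JOB_MOUNT_ROOT='/mnt/batch/jobroot')
print(resolved)
```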

We define specifications for the hyperparameters, and use them to create a parameter substitution object.

In [None]:
param_specs = [
    DiscreteParameter(
        parameter_name="ESTIMATORS",
        values=[2]
    ),
]

parameters = ParameterSweep(param_specs)

We define the command line arguments that will be passed to the training script. We will use the parameter substitution object to specify where we would like to substitute the values of the parameters in the command line. Note that `parameters` is used like a dict, with the `parameter_name` being used as the key to specify which parameter to substitute. When `parameters.generate_jobs_random_search` is called below, the `parameters[name]` placeholders will be replaced with actual values.

In [None]:
command_line_args = '--inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL --estimators {0}'.format(
    parameters['ESTIMATORS'])  # Substitute hyperparameters

We put the script path and command line arguments together in a custom toolkit settings structure.

In [None]:
custom_toolkit_settings = models.CustomToolkitSettings(
    command_line=' '.join(['bash', '-c', '"', 'python', python_script_file_path, command_line_args, '"'])
)

In [None]:
custom_toolkit_settings.command_line
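To see the shape of the assembled command, the sketch below builds the same string with the `ESTIMATORS` placeholder already replaced by a hypothetical value of `2`, then splits it with `shlex` to show that a POSIX-style shell parser treats everything between the double quotes as a single argument to `bash -c`. How Batch AI itself tokenizes the command line is not shown in this notebook, so treat the `shlex` step as an approximation.

```python
import shlex

python_script_file_path = (
    '$AZ_BATCHAI_JOB_MOUNT_ROOT/afs/hyperparam_samples/TrainTestClassifier.py')
# Hypothetical: the ESTIMATORS placeholder has already been replaced by 2
command_line_args = ('--inputs $AZ_BATCHAI_INPUT_SCRIPT '
                     '--outputs $AZ_BATCHAI_OUTPUT_ALL --estimators 2')

command_line = ' '.join(
    ['bash', '-c', '"', 'python', python_script_file_path, command_line_args, '"'])
print(command_line)

# POSIX-style splitting: the quoted part becomes one argument for bash -c
parts = shlex.split(command_line)
print(len(parts))
```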

We put together the information we just created into a set of job control parameters that will be used by `parameters.generate_jobs_random_search` to create the definitions of the jobs to execute on the cluster.

In [None]:
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix=std_out_err_path_prefix,
    input_directories=input_directories,
    output_directories=output_directories,
    mount_volumes=mount_volumes,
    container_settings=container_settings,
    custom_toolkit_settings=custom_toolkit_settings
)

Next, we generate a list of jobs to submit using randomly selected combinations of parameters.

In [None]:
num_configs = 1
jobs_to_submit, param_combinations = parameters.generate_jobs_random_search(jcp, num_configs)
for idx, comb in enumerate(param_combinations, 1):
    print("Parameters {0}: {1}".format(idx, comb))
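The internals of `ParameterSweep` are not shown in this notebook, but the idea behind random search can be sketched independently: each configuration draws one value per hyperparameter at random. The `sample_configs` helper and the search space below are hypothetical (the notebook itself sweeps over the single value `2`).

```python
import random


def sample_configs(discrete_space, num_configs, seed=None):
    """Draw num_configs random combinations from a dict of candidate values."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return [
        {name: rng.choice(values) for name, values in discrete_space.items()}
        for _ in range(num_configs)
    ]


# Hypothetical search space
space = {'ESTIMATORS': [2, 4, 8]}
configs = sample_configs(space, num_configs=3, seed=0)
for idx, config in enumerate(configs, 1):
    print("Parameters {0}: {1}".format(idx, config))
```

Because the draws are independent, the same combination can be selected more than once when the space is small.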

In [None]:
jobs_to_submit[0].custom_toolkit_settings.command_line

Create a new experiment called `random_search_experiment`.

In [None]:
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()

And submit the jobs to the experiment.

In [None]:
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)
jobs = experiment_utils.submit_jobs(jobs_to_submit, 'hyperparam_job').result()

We wait for all the jobs to complete.

In [None]:
experiment_utils.wait_all_jobs()

We define an extractor that pulls the desired metric from the training log file.
- In this example, we extract the number between "`INFO:root:Accuracy @3 =`" and "`%`".

In [None]:
metric_extractor = utils.job.MetricExtractor(
                        output_dir_id='ALL',
                        logfile='TrainTestClassifier.log',
                        regex=r'INFO:root:Accuracy @3 = (.*?)%')
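The pattern can be tried on its own with Python's `re` module. The sketch below applies it to a hypothetical log excerpt; the sample lines are an assumption about what `TrainTestClassifier.log` contains, and the way `MetricExtractor` applies the pattern internally is not shown in this notebook.

```python
import re

pattern = re.compile(r'INFO:root:Accuracy @3 = (.*?)%')

# Hypothetical log excerpt
sample_log = (
    "INFO:root:Accuracy @1 = 61.20%\n"
    "INFO:root:Accuracy @3 = 87.50%\n"
)

match = pattern.search(sample_log)
# The lazy group captures everything up to the first '%' after the prefix
metric_value = float(match.group(1))
print(metric_value)
```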

We wait for the jobs to finish, then get the metric value from the log files of the finished jobs.

In [None]:
# Wait for all jobs to complete
experiment_utils.wait_all_jobs()

# Get the metrics from the jobs
results = experiment_utils.get_metrics_for_jobs(jobs, metric_extractor)
results.sort(key=lambda r: r['metric_value'], reverse=True)  # Accuracy: higher is better

# Print results
for result in results:
    print("Job {0} completed with metric value {1}".format(result['job_name'], result['metric_value']))
print("Best job: {0} with parameters {1}".format(
    results[0]['job_name'], 
    {ev.name:ev.value for ev in results[0]['job'].environment_variables}
))
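Since the metric here is an accuracy, the best job is the one with the highest value. The sketch below ranks a hypothetical result list in that order. The job names and values are made up, and skipping entries whose metric is `None` is an assumption about how jobs without a matching log line might be reported.

```python
# Hypothetical results in the same shape as used above
example_results = [
    {'job_name': 'hyperparam_job_0', 'metric_value': 81.2},
    {'job_name': 'hyperparam_job_1', 'metric_value': 87.5},
    {'job_name': 'hyperparam_job_2', 'metric_value': None},  # no metric found
]

# Keep only jobs that produced a metric, then sort from best to worst
ranked = sorted(
    (r for r in example_results if r['metric_value'] is not None),
    key=lambda r: r['metric_value'],
    reverse=True)

best = ranked[0]
print("Best job: {0} ({1}%)".format(best['job_name'], best['metric_value']))
```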

## Clean Up (Optional)

### Delete the Experiment
Delete the experiment and the jobs inside it.

In [None]:
_ = client.experiments.delete(cfg.resource_group, cfg.workspace, experiment_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name).result()

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, along with all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)