# Hyperparameter Search

In this notebook we:
- create the Batch AI job configuration parameters,
- generate combinations of hyperparameter values,
- generate a job for each combination,
- submit the jobs to Batch AI,
- extract the performance of each combination,
- identify the combination that had the best performance, and
- use this combination to generate a model that we save.

## Imports

In [1]:
%load_ext dotenv
from __future__ import print_function
import os
import sys
import glob
import pandas as pd
import dotenv
import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService
sys.path.append('.')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter

In the next cell are the names of various files and services used or created in this notebook.

In [2]:
# The location of the dotenv file
dotenv_path = dotenv.find_dotenv()
# The mount point of the Azure file share in the Docker container
dotenv.set_key(dotenv_path, 'azure_file_share_mount_path', 'afs')
# The mount point of the Azure blob container in the Docker container
dotenv.set_key(dotenv_path, 'azure_blob_mount_path', 'bfs')
# The Batch AI experiment
dotenv.set_key(dotenv_path, 'experiment_name', 'hyperparameter_search_experiment')

(True, 'experiment_name', 'hyperparameter_search_experiment')

Import the contents of the `.env` file into the environment.

In [3]:
%dotenv -o

Define Python variables used in this notebook.

In [4]:
configuration_path = os.getenv('configuration_path')
image_name = os.getenv('docker_login') + os.getenv('image_repo') + ':latest'
azure_blob_container_name = os.getenv('azure_blob_container_name')
dataset_path = os.getenv('dataset_path')
azure_file_share_name = os.getenv('azure_file_share_name')
script_path = os.getenv('script_path')
script_name = os.getenv('script_name')
cluster_name = os.getenv('cluster_name')
azure_blob_mount_path = os.getenv('azure_blob_mount_path')
azure_file_share_mount_path = os.getenv('azure_file_share_mount_path')
experiment_name = os.getenv('experiment_name')
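
`os.getenv` returns `None` for a variable that is not set, so a missing `docker_login` or `image_repo` would only surface above as a `TypeError` from the string concatenation. A minimal sketch of an up-front check follows. The helper `missing_settings` is hypothetical and not part of the `utilities` package, and the example uses a stand-in mapping rather than the real environment.

```python
import os


def missing_settings(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Stand-in environment with one empty and one absent setting.
example_env = {'cluster_name': 'my-cluster', 'script_name': ''}
missing = missing_settings(
    ['cluster_name', 'script_name', 'dataset_path'], example_env)
print(missing)
```

Calling `missing_settings` with the list of names read above, and no second argument, would check the real environment before any of the values are used.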

## Create a Batch AI client
Read the configuration, and use it to create a Batch AI client.

In [5]:
cfg = utils.config.Configuration(configuration_path)
client = utils.config.create_batchai_client(cfg)

## Define the parameters common to all jobs
Specify the Docker image used to create the Docker containers that run the experiment's jobs.

In [6]:
container_settings = models.ContainerSettings(
    image_source_registry=models.ImageSourceRegistry(image=image_name)
)

Define the volumes to be mounted and their mount points in each Docker container's file system. These will give the containers access to the datasets and script, and a location to store results.

In [7]:
mount_volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share_mount_path)
    ],
    azure_blob_file_systems=[
        models.AzureBlobFileSystemReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            container_name=azure_blob_container_name,
            relative_mount_path=azure_blob_mount_path)
    ]
)

Define the locations in a container's file system for
- storing the job's standard output and error,
- obtaining the datasets, and
- storing the job's outputs.

In [8]:
std_out_err_path_prefix = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path)

input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, dataset_path))
]

output_directories = [
    models.OutputDirectory(
        id='ALL',
        path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))
]

Define the path to the training script.

In [9]:
python_script_file_path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
    azure_file_share_mount_path, script_path, script_name)
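
Batch AI resolves `$AZ_BATCHAI_JOB_MOUNT_ROOT` on the node at run time, so the strings above are templates rather than final paths. The sketch below shows what the script path would expand to under an assumed mount root. The root `/mnt/job` is made up for illustration, and the `scripts` directory and script name are taken from the command line printed later in this notebook.

```python
from string import Template

# Example values; in the notebook these come from the .env file.
example_mount = 'afs'
example_script_dir = 'scripts'
example_script = 'TrainTestClassifier.py'

path_template = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
    example_mount, example_script_dir, example_script)

# Hypothetical mount root, chosen only to make the expansion visible.
expanded = Template(path_template).substitute(
    AZ_BATCHAI_JOB_MOUNT_ROOT='/mnt/job')
print(expanded)
```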

## Generate the combinations of hyperparameters
Define specifications for the hyperparameters, and use them to create a parameter substitution object. We choose a single small value for the number of estimators that is enough to let us reliably identify the best configuration of the other parameters. Once we have the best combination, we will build a model using a large number of estimators to boost the performance.

In [76]:
param_specs = [
    DiscreteParameter(
        parameter_name="ESTIMATORS",
        values=[100]
    ),
    DiscreteParameter(
        parameter_name="NGRAMS",
        values=[1, 2, 3, 4]
    ),
    DiscreteParameter(
        parameter_name="MATCH",
        values=[10, 20, 30, 40]
    ),
    DiscreteParameter(
        parameter_name="MIN_CHILD_SAMPLES",
        values=[5, 10, 20]
    ),
    DiscreteParameter(
        parameter_name="WEIGHT",
        values=["", "--unweighted"]
    ),
]

parameters = ParameterSweep(param_specs)
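
The internals of `ParameterSweep` are not shown in this notebook. Assuming it enumerates the full grid of the discrete values, the number of combinations is the product of the list lengths, 1 × 4 × 4 × 3 × 2 = 96. A minimal sketch of that enumeration with `itertools.product`, using the same value lists:

```python
from itertools import product

# The same names and values as in the specifications above.
grid = [
    ('ESTIMATORS', [100]),
    ('NGRAMS', [1, 2, 3, 4]),
    ('MATCH', [10, 20, 30, 40]),
    ('MIN_CHILD_SAMPLES', [5, 10, 20]),
    ('WEIGHT', ['', '--unweighted']),
]

names = [name for name, _ in grid]
value_lists = [values for _, values in grid]

# One dict per combination, mapping parameter name to value.
combinations = [dict(zip(names, combo)) for combo in product(*value_lists)]

print(len(combinations))
print(combinations[0])
```

The order in which the real sweep emits its combinations may differ from the order `product` uses here.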

Next, define the command-line arguments passed to the training script. The parameter substitution object marks where each hyperparameter value goes: `parameters` is indexed like a dict, with the `parameter_name` as the key, and each lookup yields a placeholder. When `parameters.generate_jobs` is called below, those placeholders are replaced with actual values.

In [11]:
command_line_args = '--inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL'\
    ' --estimators {estimators}'\
    ' --ngrams {ngrams}'\
    ' --match {match}'\
    ' --min_child_samples {min_child_samples}'\
    ' {weight}'.format(
    estimators=parameters['ESTIMATORS'],
    ngrams=parameters['NGRAMS'],
    match=parameters['MATCH'],
    min_child_samples=parameters['MIN_CHILD_SAMPLES'],
    weight=parameters['WEIGHT'])

Combine the script path and command-line arguments in a custom toolkit settings structure, and print the resulting command line with its placeholders.

In [12]:
custom_toolkit_settings = models.CustomToolkitSettings(
        command_line=' '.join(['python', python_script_file_path, command_line_args]),
    )
print(custom_toolkit_settings.command_line)

python $AZ_BATCHAI_JOB_MOUNT_ROOT/afs/scripts/TrainTestClassifier.py --inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL --estimators PARAM_ESTIMATORS --ngrams PARAM_NGRAMS --match PARAM_MATCH --min_child_samples PARAM_MIN_CHILD_SAMPLES PARAM_WEIGHT
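
The printed command line still contains `PARAM_<NAME>` placeholders. How `parameters.generate_jobs` replaces them is not shown in this notebook; the sketch below is one plausible way to do it with a regular expression, and the function `substitute` is hypothetical.

```python
import re


def substitute(command_line, values):
    """Replace each PARAM_<NAME> token with str(values[NAME]).

    Illustrative only; the real substitution happens inside
    parameters.generate_jobs, whose implementation is not shown here.
    """
    def lookup(match):
        return str(values[match.group(1)])
    return re.sub(r'PARAM_([A-Z_]+)', lookup, command_line)


template = ('--estimators PARAM_ESTIMATORS --ngrams PARAM_NGRAMS'
            ' --match PARAM_MATCH --min_child_samples PARAM_MIN_CHILD_SAMPLES'
            ' PARAM_WEIGHT')
example_values = {'ESTIMATORS': 100, 'NGRAMS': 1, 'MATCH': 10,
                  'MIN_CHILD_SAMPLES': 5, 'WEIGHT': ''}
print(repr(substitute(template, example_values)))
```

An empty `WEIGHT` value leaves only a trailing space, so the training script simply does not receive the `--unweighted` flag.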


## Generate the jobs for each combination
Retrieve the cluster information.

In [13]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)

Use the settings defined above to create a job creation parameters structure.

In [14]:
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix=std_out_err_path_prefix,
    input_directories=input_directories,
    output_directories=output_directories,
    mount_volumes=mount_volumes,
    container_settings=container_settings,
    custom_toolkit_settings=custom_toolkit_settings
    )

Generate a list of jobs to submit, one for each combination of the parameters.

In [79]:
jobs_to_submit, param_combinations = parameters.generate_jobs(jcp)
print('{:,} configurations.'.format(len(param_combinations)))
print('The command line of the first job is\n{}'.format(jobs_to_submit[0].custom_toolkit_settings.command_line))

96 configurations.
The command line of the first job is
python $AZ_BATCHAI_JOB_MOUNT_ROOT/afs/scripts/TrainTestClassifier.py --inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL --estimators 100 --ngrams 1 --match 10 --min_child_samples 5 


## Run the jobs in an experiment
Create a new experiment called `hyperparameter_search_experiment`.

In [17]:
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()

Submit the jobs to the experiment.

In [18]:
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)
jobs = experiment_utils.submit_jobs(jobs_to_submit, 'hyperparam_job2').result()

Initialized JobSubmitter in resource group: maboumlbaiht | workspace: maboumlbaiht | experiment: hyperparameter_search_experiment
Created job "hyperparam_job2_5a6c4d52fead4a66" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "40", "PARAM_MIN_CHILD_SAMPLES": "20", "PARAM_NGRAMS": "1", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_7a76e9f504abf486" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "2", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_bb7192ec74eb9352" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "30", "PARAM_MIN_CHILD_SAMPLES": "20", "PARAM_NGRAMS": "1", "PARAM_WEIGHT": "--unweighted"}
Created job "hyperparam_job2_6f2b89da3093ba5f" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "10", "PARAM_NGRAMS": "2", "PARAM_WEIGHT": "--unweighted"}
Created job "hyperparam_job2_61aec25a28248cd0" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH

Created job "hyperparam_job2_9c90bc1854ddd3c4" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "40", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "2", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_d762764c5d870388" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "3", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_94970e52354dcc63" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "10", "PARAM_NGRAMS": "3", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_43c2207cff050623" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "10", "PARAM_NGRAMS": "3", "PARAM_WEIGHT": "--unweighted"}
Created job "hyperparam_job2_9cfc3bead1e3dab3" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "40", "PARAM_MIN_CHILD_SAMPLES": "20", "PARAM_NGRAMS": "2", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_236741c3f6f2eb92" with paramete

Created job "hyperparam_job2_371866c6122d8b9a" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "20", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "4", "PARAM_WEIGHT": "--unweighted"}
Created job "hyperparam_job2_6215082a18ac72ed" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "40", "PARAM_MIN_CHILD_SAMPLES": "20", "PARAM_NGRAMS": "4", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_ba17662995c3be80" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "10", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "4", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_3f2193c7b3d07087" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "30", "PARAM_MIN_CHILD_SAMPLES": "5", "PARAM_NGRAMS": "4", "PARAM_WEIGHT": ""}
Created job "hyperparam_job2_bb2ed9533357f067" with parameters {"PARAM_ESTIMATORS": "100", "PARAM_MATCH": "40", "PARAM_MIN_CHILD_SAMPLES": "10", "PARAM_NGRAMS": "4", "PARAM_WEIGHT": "--unweighted"}
Created job "hyperparam_job2_9617fbd2071dc607" wi

Wait for the jobs to complete. This should take about two hours. You can interrupt and restart this cell as needed.

In [28]:
%%time
_ = experiment_utils.wait_all_jobs()

All jobs completed.
CPU times: user 143 ms, sys: 7.66 ms, total: 151 ms
Wall time: 1.21 s
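
`wait_all_jobs` belongs to this project's `utilities` package, and its implementation is not shown here. The general pattern behind such a call is a polling loop. A minimal sketch, assuming a hypothetical `get_states` callable that returns the current state of every job, with plain strings standing in for the execution states:

```python
import time


def wait_until_done(get_states, poll_seconds=30, sleep=time.sleep):
    """Poll until no job is queued or running; return the final states.

    `get_states` is a hypothetical callable returning {job_name: state}.
    """
    while True:
        states = get_states()
        pending = [name for name, state in states.items()
                   if state in ('queued', 'running')]
        if not pending:
            return states
        sleep(poll_seconds)


# Stand-in for a real status query: each call returns the next snapshot.
snapshots = iter([
    {'job_a': 'running', 'job_b': 'queued'},
    {'job_a': 'succeeded', 'job_b': 'running'},
    {'job_a': 'succeeded', 'job_b': 'failed'},
])
final_states = wait_until_done(lambda: next(snapshots), sleep=lambda s: None)
print(final_states)
```

A loop like this keeps no state between iterations, which is why interrupting and rerunning such a cell is safe: the next run simply queries the current states again.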


If you need to interrupt the experiment before it is complete, you can delete all the queued and running jobs on the cluster.

In [None]:
experiment_utils.delete_jobs_in_experiment(execution_state=models.ExecutionState.queued)
experiment_utils.delete_jobs_in_experiment(execution_state=models.ExecutionState.running)

Define an extractor that pulls the desired metric from each job's log file. In this example, we extract the number between "`INFO:root:Accuracy @3 =`" and "`%`".

In [20]:
metric_extractor = utils.job.MetricExtractor(
                        output_dir_id='ALL',
                        logfile='TrainTestClassifier.log',
                        # Raw string; '%' needs no escaping in a regular expression
                        regex=r'INFO:root:Accuracy @3 = (.*?)%')
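
To see what this pattern captures, the sketch below applies it with the standard `re` module to a few made-up log lines. The log contents are invented for illustration, and converting the captured text to `float` is an assumption about what `MetricExtractor` does.

```python
import re

pattern = re.compile(r'INFO:root:Accuracy @3 = (.*?)%')

# Made-up log lines in the format the extractor expects.
log_text = '\n'.join([
    'INFO:root:Fitting the model.',
    'INFO:root:Accuracy @1 = 41.20%',
    'INFO:root:Accuracy @3 = 64.49%',
])

match = pattern.search(log_text)
# The lazy group (.*?) stops at the first '%' after the prefix.
accuracy_at_3 = float(match.group(1)) if match else None
print(accuracy_at_3)
```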

Get the metric values from the log files of the finished jobs.

In [22]:
%%time
results = experiment_utils.get_metrics_for_jobs(jobs, metric_extractor)

All jobs completed.


Sort the metrics in decreasing order, and print a summary description of the values.

In [60]:
results.sort(key=lambda r: r['metric_value'], reverse=True)

metric_values = pd.Series({result['job_name']: result['metric_value']
                           for result in results},
                          name='Accuracy @3')
metric_values.describe().to_frame().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| metric_value | 96.0 | 54.861354 | 6.04839 | 47.81 | 48.935 | 53.08 | 61.08 | 64.49 |
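
If a job fails, or its log lacks the expected line, its metric may be missing. Whether `get_metrics_for_jobs` reports that as `None` is an assumption here, but if it does, sorting would raise a `TypeError` in Python 3 because `None` cannot be compared with a float. A minimal sketch that filters such entries out before ranking, using made-up results:

```python
# Made-up results in the shape used above: one dict per job.
example_results = [
    {'job_name': 'job_a', 'metric_value': 53.08},
    {'job_name': 'job_b', 'metric_value': None},  # metric not found in the log
    {'job_name': 'job_c', 'metric_value': 64.49},
]

# Keep only the jobs that produced a metric, then rank them.
scored = [r for r in example_results if r['metric_value'] is not None]
scored.sort(key=lambda r: r['metric_value'], reverse=True)

best = scored[0]
print(best['job_name'], best['metric_value'])
```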


Print the best configuration.

In [70]:
best_params = {ev.name[len('PARAM_'):]:ev.value for ev in results[0]['job'].environment_variables}
print("Best job: {0} with parameters:".format(results[0]['job_name']))
pd.Series(best_params, name='Value').to_frame()

Best job: hyperparam_job2_c0eeed662e953b2e with parameters:


| | Value |
|---|---|
| ESTIMATORS | 100 |
| MATCH | 40 |
| MIN_CHILD_SAMPLES | 20 |
| NGRAMS | 4 |
| WEIGHT | --unweighted |


## Build the best model
Define variables that hold the best combination of parameters, and the number of estimators to use. Typically, increasing the number of estimators will increase the performance. 

In [80]:
best_estimators = 10 * int(best_params['ESTIMATORS'])
best_min_child_samples = best_params['MIN_CHILD_SAMPLES']
best_match = best_params['MATCH']
best_ngrams = best_params['NGRAMS']
best_weight = best_params['WEIGHT']

Run the training script with the best parameters, and save the model. This will likely take more than half an hour.

In [None]:
%run -t TrainTestClassifier.py\
    --save\
    --estimators $best_estimators\
    --match $best_match\
    --ngrams $best_ngrams\
    --min_child_samples $best_min_child_samples\
    $best_weight

Prepare the training data.
Reading ./balanced_pairs_train.tsv
train: 267,320 rows with 2.50% matches
No sample weights.
Define the model pipeline.
Estimators=8,000
Ngram range=(1, 4)
Min child samples=20
Fitting the model.


To tear down the experiment and all related resources, go to [the last notebook](06_Tear_Down.ipynb).