# Hyperparameter Search
In this notebook, we create a Batch AI cluster, and use it to search for the best set of hyperparameters for the model.
## Imports and definitions

In [None]:
import os
import shutil
import json
import time
import pandas as pd
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.estimator import Estimator
from azureml.train.widgets import RunDetails
from azureml.train.hyperdrive import RandomParameterSampling, choice, PrimaryMetricGoal, HyperDriveRunConfig
import azureml.core
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

## Read in the Azure ML workspace
Read in the workspace created in a previous notebook.

In [None]:
ws = Workspace.from_config()
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
     .format(ws_details['name'],
            ws_details['location']))

## Create a Batch AI cluster
Define the properties of the cluster needed.

In [None]:
batchai_cluster_name = 'mabouhype'
provisioning_config = BatchAiCompute.provisioning_configuration(
        vm_size='Standard_D4_v2',
        # vm_priority = 'lowpriority', # optional
        cluster_min_nodes=0,
        cluster_max_nodes=16,
        autoscale_enabled=True)

Create a configured Batch AI cluster, if it doesn't already exist.

In [None]:
if batchai_cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[batchai_cluster_name]
    if type(compute_target) is not BatchAiCompute:
        raise Exception('Compute target {} is not a Batch AI cluster.'
                        .format(batchai_cluster_name))
    print('Using pre-existing Batch AI cluster {}'
         .format(batchai_cluster_name))
else:
    # Create the cluster
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)

    # You can poll for a minimum number of nodes and set a specific timeout.
    # If no minimum node count is provided, provisioning will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Print a detailed view of the Batch AI cluster's status.

In [None]:
pd.Series(compute_target.get_status().serialize(), name='Value').to_frame()

## Upload the data to the cloud
We put the data in a particular directory on the workspace's default data store. This will show up in the same location for every job running on the Batch AI cluster. We use `overwrite=False` to avoid taking the time to re-upload the data if files with the same names are already present. If you change the data and want to refresh what's uploaded, use `overwrite=True`.

In [None]:
ds = ws.get_default_datastore()
ds.upload(src_dir=os.path.join('.', 'data'), target_path='data', overwrite=False, show_progress=True)

## Create a hyperparameter search configuration
Define the hyperparameter space for a random search.  We choose a single value for the number of estimators that is enough to let us reliably identify the best of the parameter configurations. Once we have the best combination, we will build a model using a larger number of estimators to boost the performance.

In [None]:
hyperparameter_sampling = RandomParameterSampling({
    'estimators': choice(1000),
    'ngrams': choice(range(1, 5)),
    'match': choice(range(2, 41)),
    'min_child_samples': choice(range(1, 31)),
    'unweighted': choice('Yes', 'No')
})
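
To get a feel for how large this space is compared with the number of runs, the sketch below counts the configurations and draws one at random using only the standard library. It is a local illustration, not how the tuning service samples internally: the names `search_space`, `space_size`, and `sample_configuration` are hypothetical, and it assumes each value is picked uniformly and independently.

```python
import random

# Discrete search space mirroring the ranges passed to RandomParameterSampling.
search_space = {
    'estimators': [1000],
    'ngrams': list(range(1, 5)),
    'match': list(range(2, 41)),
    'min_child_samples': list(range(1, 31)),
    'unweighted': ['Yes', 'No'],
}


def space_size(space):
    """Count the distinct configurations in a discrete search space."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


def sample_configuration(space, rng):
    """Draw one configuration by picking each value uniformly at random (assumption)."""
    return {name: rng.choice(values) for name, values in space.items()}


total = space_size(search_space)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
example = sample_configuration(search_space, rng)
print('{} possible configurations'.format(total))
print(example)
```

Under these assumptions, a random search of a few dozen runs visits only a small fraction of the space, which is why it is paired with a metric that lets us rank the sampled configurations.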

Specify the primary metric to be optimized as accuracy, and that it should be maximized. This is the metric that is logged by the training script.

In [None]:
primary_metric_name = 'accuracy'
primary_metric_goal = PrimaryMetricGoal.MAXIMIZE

The training script only logs a single accuracy at the end of training, so we specify no early termination policy. If no policy is specified, the hyperparameter tuning service will let all training runs run to completion.

In [None]:
policy = None

Control the resources used by the search by specifying a maximum number of runs. It is also possible to specify a maximum duration for the tuning experiment by setting `max_duration_minutes`. If both parameters are specified, all remaining runs are terminated once the first resource limit is reached.

In [None]:
max_total_runs = 96

Create an estimator that specifies the location of the script, its fixed parameters (including the location of the data), the compute target, and the packages needed to run the script. Preparing the run environment may take a while the first time an estimator is used, but that environment is reused until the list of packages changes.

In [None]:
estimator = Estimator(source_directory=os.path.join('.', 'scripts'),
                      entry_script='TrainTestClassifier.py',
                      script_params={'--data-folder': ds.as_mount()},
                      compute_target=compute_target,
                      conda_packages=['pandas==0.23.4',
                                      'scikit-learn==0.20.0'],
                      pip_packages=['lightgbm==2.1.2'])

Put the information together into an experiment run configuration.

In [None]:
hyperdrive_run_config = HyperDriveRunConfig(
    estimator=estimator,
    hyperparameter_sampling=hyperparameter_sampling,
    policy=policy,
    primary_metric_name=primary_metric_name,
    primary_metric_goal=primary_metric_goal,
    max_total_runs=max_total_runs)

## Run the search
Get an experiment to run the search; create it if it doesn't already exist.

In [None]:
exp = Experiment(workspace=ws, name='mabouhypelocal')

Submit the configuration to be run. This should return almost immediately, and the value will be a run object.

In [None]:
run = exp.submit(hyperdrive_run_config)
run

The experiment returns a run that, when printed, shows a table with a link to the `Details Page` in the Azure Portal. That page lets you monitor the status of this run and of its child runs. By clicking on a particular child run, you can see its details, the files output by the script for that configuration, and the logs of the run, including the `driver.log` with the script's printed output.

If you want to cancel this trial, run the code in the cell below.

In [None]:
# run.cancel()

You can use the cell below to get the list of child runs submitted to the cluster. As shown in the following cell, you can poll the list to monitor how your experiment is progressing. Here, we can see the number of child runs that are `Queued`, `Running`, `Failed`, or `Completed`.

In [None]:
run_children = list(run.get_children())
while len(run_children) == 0:
    print('Waiting for the children runs to be submitted.')
    time.sleep(60)
    run_children = list(run.get_children())
print('{} children runs'.format(len(run_children)))
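
The loop above waits indefinitely. If you would rather give up after a fixed time, a polling helper with a timeout is one option. The sketch below is a generic pattern, not part of the Azure ML SDK: `poll_until` and its parameters are hypothetical names, and a stand-in iterator plays the role of `list(run.get_children())` so the example runs without a workspace.

```python
import time


def poll_until(fetch, is_ready, interval_seconds=60, timeout_seconds=1800,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() repeatedly until is_ready(result) is true or time runs out.

    Returns the first result that satisfies is_ready; raises TimeoutError
    if the deadline passes first.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(
                'Condition not met within {} seconds'.format(timeout_seconds))
        sleep(interval_seconds)


# Stand-in for list(run.get_children()): empty twice, then two fake child runs.
responses = iter([[], [], ['child_0', 'child_1']])
children = poll_until(lambda: next(responses),
                      is_ready=lambda runs: len(runs) > 0,
                      interval_seconds=0)
print('{} children runs'.format(len(children)))
```

With the notebook's objects, the equivalent call would be `poll_until(lambda: list(run.get_children()), lambda runs: len(runs) > 0)`.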

In [None]:
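# Build the list of statuses explicitly, so that pandas receives a list rather than an iterator.
statuses = [child.get_status() for child in run_children]
pd.Series(statuses).value_counts().rename('Count').to_frame().transpose()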
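
The same tally can be computed without pandas, which is handy in a script or a log line. The sketch below is illustrative only: `count_statuses` and `all_finished` are hypothetical helpers, the status list is made up, and treating `Completed` and `Failed` as the only terminal states is an assumption based on the statuses named in this notebook.

```python
from collections import Counter


def count_statuses(statuses):
    """Return a dict mapping each status string to how many times it occurs."""
    return dict(Counter(statuses))


def all_finished(statuses, terminal=('Completed', 'Failed')):
    """True when every status is one of the assumed terminal states."""
    return all(status in terminal for status in statuses)


# Made-up statuses standing in for the get_status() value of each child run.
example_statuses = ['Completed', 'Running', 'Queued', 'Completed',
                    'Failed', 'Running', 'Completed']
status_counts = count_statuses(example_statuses)
print(status_counts)
print('All finished:', all_finished(example_statuses))
```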

The RunDetails widget is a great way to monitor your hyperparameter tuning run.

In [None]:
run_details = RunDetails(run)
run_details.show()

The parent run's status will not be `Completed` until all child runs have either failed or completed.

In [None]:
run.get_status()

Wait for the runs to complete. This returns a `dict` with detailed information about the run. Here, we see that the run has `Completed`.

In [None]:
run_status = run.wait_for_completion()
run_status['status']

## Select the best model
We can automatically select the best run, and show its parameters.

In [None]:
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))
pd.Series(best_parameters, name='Value').to_frame()
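
The slicing in the cell above pairs each argument name with the value that follows it in a flat list. The sketch below shows the same pairing on a made-up argument list; `arguments_to_dict` is a hypothetical helper and the values are illustrative, not results from a real run. The values are written as strings here, which is why they are converted with `int(...)` before any arithmetic.

```python
def arguments_to_dict(arguments):
    """Pair up a flat [name, value, name, value, ...] list into a dict."""
    if len(arguments) % 2 != 0:
        raise ValueError(
            'Expected an even number of entries, got {}'.format(len(arguments)))
    # Even positions hold the names, odd positions hold the values.
    return dict(zip(arguments[::2], arguments[1::2]))


# Made-up argument list; the values are illustrative only.
example_arguments = ['--data-folder', '/mnt/data', '--estimators', '1000',
                     '--ngrams', '2', '--match', '20',
                     '--min_child_samples', '10', '--unweighted', 'Yes']
example_parameters = arguments_to_dict(example_arguments)

# The values are strings in this example, so convert before multiplying.
scaled_estimators = 8 * int(example_parameters['--estimators'])
print(example_parameters)
print(scaled_estimators)
```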

We can use these parameters to train and save the best model.

In [None]:
best_parameters['--data-folder'] = ds.as_mount()
best_parameters['--save'] = 'FAQ-ranker'
best_parameters['--estimators'] = 8 * int(best_parameters['--estimators'])
pd.Series(best_parameters, name='Value').to_frame()

In [None]:
est = Estimator(source_directory=os.path.join('.', 'scripts'), 
                entry_script='TrainTestClassifier.py',
                script_params=best_parameters,
                compute_target=compute_target,
                conda_packages=['pandas==0.23.4',
                                'scikit-learn==0.20.0'],
                pip_packages=['lightgbm==2.1.2'])
run = exp.submit(est)
run

Wait for the model to be created and saved.

In [None]:
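# Capture the status of this training run; otherwise run_status would still
# refer to the earlier hyperparameter search run.
run_status = run.wait_for_completion()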
print(run_status['status'])
if run_status['status'] != 'Completed':
    raise Exception('The run did not successfully complete.')

Register the best model.

In [None]:
model = run.register_model(model_name='FAQ_ranker', model_path=os.path.join('outputs', 'FAQ-ranker.pkl'))
print(model.name, model.id, model.version, sep = '\t')