# Hyperparameter Search
In this notebook, we create an AML cluster and use it to search for the best set of hyperparameters for the model.

The steps in this notebook are
- [import libraries](#import),
- [read in the Azure ML workspace](#workspace),
- [create an AML cluster](#cluster),
- [upload the data to the cloud](#upload),
- [define a hyperparameter search configuration](#configuration),
- [create an estimator](#estimator),
- [submit the estimator](#submit), and
- [get the results](#results).

## Imports  <a id='import'></a>

In [None]:
import os
import pandas as pd
import time
from azureml.core import Workspace, Experiment
from azureml.core.authentication import AzureCliAuthentication
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import RandomParameterSampling, choice, PrimaryMetricGoal, HyperDriveRunConfig
from azureml.widgets import RunDetails
import azureml.core
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

## Read in the Azure ML workspace <a id='workspace'></a>
Read in the workspace created in a previous notebook.

In [None]:
ws = Workspace.from_config(auth=AzureCliAuthentication())
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
      .format(ws_details['name'],
              ws_details['location']))

## Create an AML cluster <a id='cluster'></a>
Define the properties of the cluster needed.

In [None]:
cluster_name = 'hypetuning'
provisioning_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_D4_v2',
        # vm_priority = 'lowpriority', # optional
        max_nodes=16)

Create the configured cluster if it doesn't already exist, or retrieve it if it does. Creation can take about a minute.

In [None]:
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if not isinstance(compute_target, AmlCompute):
        raise Exception('Compute target {} is not an AML cluster.'
                        .format(cluster_name))
    print('Using pre-existing AML cluster {}'.format(cluster_name))
else:
    # Create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)

    # You can poll for a minimum number of nodes and set a specific timeout. 
    # If min node count is provided, provisioning will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Print a detailed view of the cluster.

In [None]:
pd.Series(compute_target.get_status().serialize(), name='Value').to_frame()

## Upload the data to the cloud <a id='upload'></a>
We put the data in a particular directory on the workspace's default data store. This directory will show up in the same location in the file system of every job running on the AML cluster.

Get a handle to the workspace's default data store.

In [None]:
ds = ws.get_default_datastore()

Upload the data. We use `overwrite=False` to avoid taking the time to re-upload the data if files with the same names are already present. If you change the data and want to refresh what's uploaded, use `overwrite=True`.

In [None]:
ds.upload(src_dir=os.path.join('.', 'data'), target_path='data', overwrite=False, show_progress=True)

## Define a hyperparameter search configuration <a id='configuration'></a>
Define the hyperparameter space for a random search.  We will use a constant value for the number of estimators that is enough to let us reliably identify the best of the parameter configurations. Once we have the best combination, we will build a model using a larger number of estimators to boost the performance. The table below should give you an idea of the trade-off between the number of estimators and the modeling run time, model size, and model accuracy.

| Estimators | Run time (s) | Size (MB) | Accuracy@1 | Accuracy@2 | Accuracy@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |

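To read the trade-off off the table more directly, the sketch below computes how much extra run time each step up in estimators costs and how much Accuracy@1 it buys. The rows are copied from the table; `marginal_gains` is a hypothetical helper written for this illustration, not part of the training script.

```python
# Rows copied from the table above:
# (estimators, run time in seconds, Accuracy@1 in percent).
rows = [
    (100, 40, 25.02),
    (1000, 177, 46.79),
    (2000, 359, 51.38),
    (4000, 628, 53.39),
    (8000, 904, 54.62),
]


def marginal_gains(rows):
    """Return (estimators, extra seconds, extra Accuracy@1 points) per step."""
    gains = []
    # Pair each row with the next one to get the step-to-step differences.
    for (_, t0, a0), (n1, t1, a1) in zip(rows, rows[1:]):
        gains.append((n1, t1 - t0, round(a1 - a0, 2)))
    return gains


for n, extra_s, extra_acc in marginal_gains(rows):
    print('{:>5} estimators: +{} s for +{:.2f} points of Accuracy@1'
          .format(n, extra_s, extra_acc))
```

The gain per added second shrinks at every step, which is why a moderate number of estimators is enough for comparing configurations during the search.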

In [None]:
hyperparameter_sampling = RandomParameterSampling({
    'ngrams': choice(range(1, 5)),
    'match': choice(range(2, 41)),
    'min_child_samples': choice(range(1, 31)),
    'unweighted': choice('Yes', 'No')
})
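
To make the sampling concrete, here is a minimal sketch in plain Python of what drawing configurations from this space amounts to. It does not call the HyperDrive service. `sample_configuration` and `count_configurations` are hypothetical helpers, and the sketch assumes each `choice` is drawn uniformly and independently of the others.

```python
import random

# The same discrete search space as in the cell above,
# written with plain Python sequences.
search_space = {
    'ngrams': list(range(1, 5)),
    'match': list(range(2, 41)),
    'min_child_samples': list(range(1, 31)),
    'unweighted': ['Yes', 'No'],
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, uniformly and independently (assumed)."""
    return {name: rng.choice(values) for name, values in space.items()}


def count_configurations(space):
    """Number of distinct configurations in the full grid."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total


rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [sample_configuration(search_space, rng) for _ in range(3)]
for sample in samples:
    print(sample)
print('{:,} possible configurations'.format(count_configurations(search_space)))
```

The full grid is far larger than any practical run budget, which is the usual argument for random rather than exhaustive search.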

Specify the primary metric to be optimized as accuracy, and that it should be maximized. This is the metric that is logged by the training script.

In [None]:
primary_metric_name = 'accuracy'
primary_metric_goal = PrimaryMetricGoal.MAXIMIZE

The training script only logs a single accuracy at the end of training, so we specify no early termination policy. If no policy is specified, the hyperparameter tuning service will let all training runs run to completion.

In [None]:
policy = None

Control the resources used by the search by specifying a maximum number of runs. It is also possible to specify a maximum duration for the tuning experiment by setting `max_duration_minutes`. If both parameters are specified, all remaining runs are terminated once the first resource limit is reached.

In [None]:
max_total_runs = 96
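
As a back-of-the-envelope check on this budget, the sketch below estimates the training time of the whole search. It assumes one run per node on the 16-node cluster, the run time of about 177 s per run listed in the table for 1,000 estimators, and no queueing, node start-up, or environment preparation overhead, so the real wall-clock time will be longer. `estimate_search_seconds` is a hypothetical helper.

```python
import math


def estimate_search_seconds(total_runs, concurrent_runs, seconds_per_run):
    """Rough estimate: runs execute in waves of `concurrent_runs` at a time."""
    waves = math.ceil(total_runs / concurrent_runs)
    return waves * seconds_per_run


# 96 runs, 16 concurrent runs (assumed one per node), about 177 s per run.
estimate = estimate_search_seconds(total_runs=96,
                                   concurrent_runs=16,
                                   seconds_per_run=177)
print('About {:.0f} minutes of training time'.format(estimate / 60))
```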

## Create an estimator <a id='estimator'></a>
Create an estimator that specifies the location of the script, its fixed parameters (including the location of the data), the compute target, and the packages needed to run the script. It may take a while to prepare the run environment the first time an estimator is used, but that environment will be reused until the list of packages is changed.

In [None]:
estimator = Estimator(source_directory=os.path.join('.', 'scripts'),
                      entry_script='TrainTestClassifier.py',
                      script_params={'--data-folder': ds.as_mount(),
                                     '--estimators': 1000},
                      compute_target=compute_target,
                      conda_packages=['pandas==0.23.4',
                                      'scikit-learn==0.20.0'],
                      pip_packages=['lightgbm==2.1.2'])

Put the estimator and the configuration information together into a HyperDrive run configuration object.

In [None]:
hyperdrive_run_config = HyperDriveRunConfig(
    estimator=estimator,
    hyperparameter_sampling=hyperparameter_sampling,
    policy=policy,
    primary_metric_name=primary_metric_name,
    primary_metric_goal=primary_metric_goal,
    max_total_runs=max_total_runs)

## Run the search <a id='submit'></a>
Get an experiment to run the search; create it if it doesn't already exist.

In [None]:
exp = Experiment(workspace=ws, name='hypetuning')

Submit the configuration to be run. This should return almost immediately, and the value will be a run object.

In [None]:
run = exp.submit(hyperdrive_run_config)
run

The experiment returns a run that, when printed, shows a table with a link to the `Details Page` in the Azure Portal. That page lets you monitor the status of this run and that of its child runs. By clicking on a particular child run, you can see its details, the files output by the script for that configuration, and the logs of the run, including the `driver.log` with the script's printouts.

If you want to cancel this trial, run the code in the cell below.

In [None]:
# run.cancel()

Until all child runs have either failed or completed, the parent run's status will not be `Completed`. Other possible run statuses include `Preparing`, `Running`, `Finalizing`, and `Failed`.

In [None]:
run.get_status()

You can programmatically monitor the progress of the run. You need to first obtain the list of its child runs. Poll every 60 seconds until all of the child runs are available.

In [None]:
run_children = list(run.get_children())
while len(run_children) < max_total_runs:
    time.sleep(60)
    run_children = list(run.get_children())
print('{:,} child runs'.format(len(run_children)))
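
The loop above keeps polling for as long as fewer than `max_total_runs` children exist, so it never returns if the search is canceled or stops early. A possible variant with an upper bound on the waiting time is sketched below. `wait_for_children` and `FakeRun` are hypothetical names used only for this illustration; with a real run you would pass `run.get_children` as the first argument.

```python
import time


def wait_for_children(get_children, expected, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until `expected` children exist or the timeout elapses."""
    waited = 0
    children = list(get_children())
    while len(children) < expected and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        children = list(get_children())
    return children


class FakeRun:
    """Stand-in for a run object: reports two more children on every call."""

    def __init__(self):
        self.calls = 0

    def get_children(self):
        self.calls += 1
        return ['child-{}'.format(i) for i in range(2 * self.calls)]


fake_run = FakeRun()
# The no-op sleep keeps the demonstration instantaneous.
children = wait_for_children(fake_run.get_children, expected=6,
                             sleep=lambda seconds: None)
print('{:,} child runs'.format(len(children)))
```

When the timeout is reached, the function returns whatever children exist at that point, so the caller should compare the length of the result with the expected count.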

You can now report on the status of the child runs.

In [None]:
run_children_status = pd.Series({child.id : child.get_status() for child in run_children}, name="Count")
run_children_status.value_counts().to_frame().transpose()

Wait for the runs to complete. This returns a `dict` with detailed information about the run. Here, we see that the run has `Completed`.

In [None]:
%%time
run_status = run.wait_for_completion()
print(run_status['status'])
if run_status['status'] != 'Completed':
    raise Exception('The run did not successfully complete.')

## Select the best model <a id='results'></a>
We can automatically select the best run.

In [None]:
best_run = run.get_best_run_by_primary_metric()

Here is the best run's hyperparameter set.

In [None]:
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))
pd.Series(best_parameters, name='Value').to_frame()
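
The slicing in the cell above relies on the argument list alternating between flags and values. The sketch below shows the same idiom on a made-up list; the flags and values in `arguments` are hypothetical and are not taken from an actual run.

```python
# Hypothetical argument list in the alternating flag/value layout
# that the cell above assumes.
arguments = ['--data-folder', '/mnt/data',
             '--estimators', '1000',
             '--ngrams', '2',
             '--match', '10']

# Even positions hold the flags, odd positions hold their values.
flags = arguments[::2]
values = arguments[1::2]
parameters = dict(zip(flags, values))
print(parameters)
```

In this sketch every value is a string, so a numeric parameter such as `--estimators` needs an explicit `int` conversion before any arithmetic.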

We can use these parameters to train and save the best model. We will train with a boosted number of estimators.

In [None]:
model_parameters = best_parameters.copy()
model_parameters['--estimators'] = 8 * int(best_parameters['--estimators'])
model_parameters['--save'] = 'FAQ-ranker'
pd.Series(model_parameters, name='Value').to_frame()

Train and save the best model.

In [None]:
model_est = Estimator(source_directory=os.path.join('.', 'scripts'),
                      entry_script='TrainTestClassifier.py',
                      script_params=model_parameters,
                      compute_target=compute_target,
                      conda_packages=['pandas==0.23.4',
                                      'scikit-learn==0.20.0'],
                      pip_packages=['lightgbm==2.1.2'])
model_run = exp.submit(model_est)
model_run

Wait for the model to be created and saved.

In [None]:
%%time
model_run_status = model_run.wait_for_completion()
print(model_run_status['status'])
if model_run_status['status'] not in ['Completed', 'Finalizing']:
    raise Exception('The run did not successfully complete.')

Register the best model.

In [None]:
model = model_run.register_model(model_name='FAQ_ranker', model_path=os.path.join('outputs', 'FAQ-ranker.pkl'))
print(model.name, model.id, model.version, sep = '\t')

The [next notebook](04_HyperDrive_Run_Recovery.ipynb) shows how to recover information about the HyperDrive run in Python.