# Hyperparameter Tuning using HyperDrive

Import dependencies.

In [1]:
from azureml.core import Workspace, Experiment, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.core import Environment
from azureml.core import ScriptRunConfig
import os

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import joblib

## Dataset

The dataset used for this project is the "Titanic - Machine Learning from Disaster" dataset from Kaggle. The dataset is cleaned first: redundant columns are removed and categorical data is transformed.


In [2]:
ws = Workspace.from_config()
experiment_name = 'titanic-survival'
project_folder = './capstone-project'

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
experiment=Experiment(ws, experiment_name)

key = "titanic-survival-data"
description_text = "Kaggle dataset for Titanic disaster."

if key in ws.datasets.keys():
    # Reuse the dataset if it is already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Create a tabular dataset from the CSV file
    my_dataset = 'https://raw.githubusercontent.com/j0h4nnesk/Capstone_project_Titanic_Survival/main/train.csv'
    dataset = Dataset.Tabular.from_delimited_files(my_dataset)
    # Register the dataset in the workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


run = experiment.start_logging()

capstone-project
capstone-project
francecentral
2d4b3a3e-de2a-45bb-9ac0-29caf8f98da4


In [3]:
# Name for the cluster
cpu_cluster_name = "compute-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    print('Creating a new compute cluster...')
    # Provision a cluster that scales between 1 and 6 Standard_DS3_v2 nodes.
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_DS3_v2', min_nodes=1, max_nodes=6)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 1, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-03-03T18:40:03.431000+00:00', 'errors': None, 'creationTime': '2021-03-03T17:57:01.678759+00:00', 'modifiedTime': '2021-03-03T17:57:17.222758+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 6, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


## Hyperdrive Configuration

In this experiment I use a custom Scikit-Learn logistic regression model whose hyperparameters are tuned using HyperDrive.

#### Early stopping policy
An early stopping policy is used here to automatically terminate poorly performing runs and to improve computational efficiency. The **Bandit policy** was used, which defines early stopping based on a slack criterion and on a frequency and delay interval for evaluation. The Bandit policy was specified as shown below:

```
policy = BanditPolicy(evaluation_interval=2,slack_factor=0.1,delay_evaluation=5)
```
*evaluation_interval:* The frequency for applying the policy.

*slack_factor:* The ratio used to calculate the allowed distance from the best performing experiment run.

*delay_evaluation:* The number of intervals for which to delay the first policy evaluation. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation.
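
The slack rule can be made concrete with a small sketch. It assumes the rule as the Azure ML documentation describes it for a maximized metric: a run is cancelled when its metric falls below the best metric divided by (1 + slack_factor). The helper names `policy_applies` and `should_terminate` are hypothetical and are not part of the Azure ML SDK; this is an illustration, not the service's implementation.

```python
def policy_applies(interval, evaluation_interval=2, delay_evaluation=5):
    """Hypothetical helper: is the policy evaluated at this interval?

    The policy applies at every multiple of evaluation_interval that is
    greater than or equal to delay_evaluation.
    """
    return interval >= delay_evaluation and interval % evaluation_interval == 0


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Hypothetical helper for the maximization case.

    A run is cancelled when its metric is below best_metric / (1 + slack_factor).
    """
    return run_metric < best_metric / (1.0 + slack_factor)


# Intervals 1..10 at which the policy would be applied with the settings above
checked = [i for i in range(1, 11) if policy_applies(i)]
print(checked)

# With a best accuracy of 0.76 the cut-off is 0.76 / 1.1, roughly 0.69
print(should_terminate(0.65, best_metric=0.76))
print(should_terminate(0.70, best_metric=0.76))
```

With `evaluation_interval=2` and `delay_evaluation=5`, the first evaluation falls on interval 6, because 5 is not a multiple of 2.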

#### Parameter sampler

**RandomParameterSampling**, where hyperparameter values are randomly selected from the defined search space, was used as the sampler. It is a good choice because it is [more efficient, though less exhaustive](https://www.sciencedirect.com/science/article/pii/S1674862X19300047), than grid search at exploring the search space.

The parameter sampler was specified using the following parameters:

*C:* Inverse of regularization strength; must be a positive float, where smaller values specify stronger regularization.

*max_iter:* Maximum number of iterations taken for the solvers to converge.
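
To illustrate the difference from grid search, the following sketch draws random configurations from a discrete search space using only the Python standard library. The values mirror the `choice` lists used in this notebook's sampler. The helper `sample_configuration` is hypothetical, and the sketch does not claim to reproduce how HyperDrive samples internally.

```python
import itertools
import random

# Discrete search space, mirroring the choice() lists used in this notebook
search_space = {
    "--max_iter": [20, 40, 80, 100, 150, 200],
    "--C": [0.001, 0.01, 0.1, 0.5, 1, 1.5, 10, 20, 50, 100],
}


def sample_configuration(space, rng):
    """Hypothetical helper: pick one value per hyperparameter at random."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Grid search would have to evaluate every combination
grid = list(itertools.product(*search_space.values()))

# Random sampling evaluates only as many configurations as the run budget allows
samples = [sample_configuration(search_space, rng) for _ in range(20)]

print(len(grid), "grid combinations versus", len(samples), "random samples")
```

Because each draw is independent, the same configuration can be sampled more than once in this sketch.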

#### Hyperdrive config
Configuration was specified using the following parameters:

*hyperparameter_sampling:* The hyperparameter sampling space, specified using RandomParameterSampling

*primary_metric_name:* The name of the primary metric reported by the experiment runs. In our case, it is Accuracy.

*primary_metric_goal:* I chose PrimaryMetricGoal.MAXIMIZE. This parameter determines that the primary metric is to be maximized when evaluating runs.


*policy:* Refers to the early termination policy that is specified above.


*run_config:* The configuration for the script runs. Here it is a `ScriptRunConfig` that runs `train.py`, which also does the data preparation.

*max_total_runs=20:* The maximum total number of runs to create. This is the upper bound; there may be fewer runs when the sample space is smaller than this value. If both max_total_runs and max_duration_minutes are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached.

*max_concurrent_runs=4:* The maximum number of runs to execute concurrently. If None, all runs are launched in parallel. The number of concurrent runs is gated on the resources available in the specified compute target. Hence, you need to ensure that the compute target has the available resources for the desired concurrency.
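
HyperDrive passes each sampled value to the training script as a command-line argument, so `train.py` has to parse `--C` and `--max_iter`. The script itself is not shown in this notebook, so the following is only a sketch of how such parsing might look. The `parse_hyperparameters` helper and the default values are assumptions made for illustration.

```python
import argparse


def parse_hyperparameters(argv):
    """Hypothetical helper: parse the arguments that HyperDrive appends."""
    parser = argparse.ArgumentParser()
    # Defaults are assumed here; they are not taken from the real train.py
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Example argument list in the form HyperDrive passes to the script
args = parse_hyperparameters(["--C", "50", "--max_iter", "150"])
print(args.C, args.max_iter)
```

In a real training script the parsed values would then be handed to the model constructor, and the resulting accuracy would be logged under the name given in `primary_metric_name`.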

In [4]:
sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='./conda_dependencies.yml')
compute_target = ws.compute_targets['compute-cluster']

# Specify the early termination policy
policy = BanditPolicy(evaluation_interval=2,slack_factor=0.1,delay_evaluation=5)

import shutil

# Copy the training script into a dedicated source directory
script_folder = './training'
os.makedirs(script_folder, exist_ok=True)
shutil.copy('./train.py', script_folder)

src = ScriptRunConfig(source_directory=script_folder,
                        script='train.py',
                        #arguments=['--input_data', dataset.as_named_input('titanic')],
                        compute_target=compute_target,
                        environment=sklearn_env)


# Specify parameter sampler
ps = RandomParameterSampling({
    '--max_iter' : choice(20,40,80,100,150,200),
    '--C' : choice(0.001,0.01,0.1, 0.5, 1,1.5,10,20,50,100)
}) 


# Create a HyperDriveConfig using the run configuration, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(hyperparameter_sampling = ps, 
                                    primary_metric_name = 'Accuracy',
                                    primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                                    policy=policy,
                                    run_config=src,
                                    max_total_runs=20,
                                    max_concurrent_runs=4)

## Run Details

Here the `RunDetails` widget is used to monitor the progress of the individual HyperDrive runs.

In [None]:
# Submit hyperdrive run to the experiment and show run details with the widget.

hyperdrive_run = experiment.submit(config=hyperdrive_config)
RunDetails(hyperdrive_run).show()

hyperdrive_run.wait_for_completion(show_output=True)
assert hyperdrive_run.get_status() == 'Completed'

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_b99566bd-5020-4d94-84e6-7e0a5fe8a729
Web View: https://ml.azure.com/experiments/titanic-survival/runs/HD_b99566bd-5020-4d94-84e6-7e0a5fe8a729?wsid=/subscriptions/2d4b3a3e-de2a-45bb-9ac0-29caf8f98da4/resourcegroups/capstone-project/workspaces/capstone-project

Streaming azureml-logs/hyperdrive.txt

<START>[2021-03-03T18:49:55.159138][API][INFO]Experiment created<END>
<START>[2021-03-03T18:49:55.7384629Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>
<START>[2021-03-03T18:49:58.236415][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2021-03-03T18:49:58.413302][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>


## Best Model

Here the best model is fetched from the HyperDrive experiment and all the properties of the model are displayed.

In [8]:
# Identify the best-performing run and the hyperparameter values it used.
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print(parameter_values[0],":", parameter_values[1],"and",parameter_values[2],":",parameter_values[3])
print('Accuracy:', best_run_metrics['Accuracy'])

Best Run Id:  HD_b99566bd-5020-4d94-84e6-7e0a5fe8a729_1
--C : 50 and --max_iter : 150
Accuracy: 0.7636363636363637
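
Indexing `parameter_values` by position works only while the argument order stays fixed. As an alternative sketch, the flat argument list can be turned into a dictionary. The list below is copied from the best run's `arguments` entry in the output, and the variable names are illustrative.

```python
# Argument list as reported for the best run: flag, value, flag, value, ...
arguments = ["--C", "50", "--max_iter", "150"]

# Pair every flag (even positions) with the value that follows it (odd positions)
best_params = dict(zip(arguments[::2], arguments[1::2]))

for name, value in best_params.items():
    print(name, ":", value)
```

The values stay strings in this form, so they need to be converted (for example with `float` or `int`) before being reused as hyperparameters.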


In [9]:
# get_metrics(): returns the metrics that were logged by this run.
print("Best run metrics :",best_run.get_metrics())
print('***************************************************')

# get_details(): returns a dictionary with the details for the run
print("Best run details :",best_run.get_details())
print('***************************************************')

# get_file_names(): returns a list of the files that are stored in association with the run.
print("Best run file names :",best_run.get_file_names())
print('***************************************************')

Best run metrics : {'Regularization Strength:': 50.0, 'Max iterations:': 150, 'Accuracy': 0.7636363636363637}
***************************************************
Best run details : {'runId': 'HD_b99566bd-5020-4d94-84e6-7e0a5fe8a729_1', 'target': 'compute-cluster', 'status': 'Completed', 'startTimeUtc': '2021-03-03T19:09:31.995987Z', 'endTimeUtc': '2021-03-03T19:23:31.930848Z', 'properties': {'_azureml.ComputeTargetType': 'amlcompute', 'ContentSnapshotId': '4716d362-fe7e-4d44-8579-03cf347aea58', 'ProcessInfoFile': 'azureml-logs/process_info.json', 'ProcessStatusFile': 'azureml-logs/process_status.json'}, 'inputDatasets': [], 'outputDatasets': [], 'runDefinition': {'script': 'train.py', 'command': '', 'useAbsolutePath': False, 'arguments': ['--C', '50', '--max_iter', '150'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'compute-cluster', 'dataReferences': {}, 'data': {}, 'outputData': {}, 'jobName': None, 'maxRunDurationSeconds': 2592000, 'no

In [10]:
# Register the best model in the workspace

best_run.register_model(model_name = "best_run_hyperdrive.pkl", model_path = './outputs/')
print(best_run)

Run(Experiment: titanic-survival,
Id: HD_b99566bd-5020-4d94-84e6-7e0a5fe8a729_1,
Type: azureml.scriptrun,
Status: Completed)


In [11]:
# Download the model file
best_run.download_file('outputs/model.pkl', 'hyperdrive_model.pkl')

In [12]:
# Clean up and shut down compute cluster
compute_target.delete()

Current provisioning state of AmlCompute is "Deleting"


