# Hyperparameter Tuning using HyperDrive

## Create an experiment in workspace

In [1]:
from azureml.core import Workspace, Experiment

#Define a workspace
ws = Workspace.from_config()

#Create an experiment
exp = Experiment(workspace=ws, name="Hyperparameter_Tuning")
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
run = exp.start_logging()

Workspace name: quick-starts-ws-135681
Azure region: southcentralus
Subscription id: 81cefad3-d2c9-4f77-a466-99a7f541c7bb
Resource group: aml-quickstarts-135681


## Create a compute cluster

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

#Reuse the cluster if it already exists; otherwise create one with vm_size STANDARD_D2_V2 and max_nodes of 4.
cpu_cluster_name = "cpucluster"
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found an existing cluster.\n")
except ComputeTargetException:
    print("Creating a new cluster.\n")
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion()

#Print the cluster's detailed status using get_status().
print("Cluster details: ", cpu_cluster.get_status().serialize())

Found an existing cluster.

Cluster details:  {'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-22T08:48:05.034000+00:00', 'errors': None, 'creationTime': '2021-01-22T08:47:59.411726+00:00', 'modifiedTime': '2021-01-22T08:48:14.900351+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


## Dataset overview

This project demonstrates the use of different machine learning algorithms on Kaggle's Titanic dataset. The task is binary classification: each row holds features describing a passenger, and the response `Survived` indicates whether that passenger survived the Titanic disaster (1) or not (0).

In [3]:
import pandas as pd

#Loading the titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/parvatijay2901/Machine-Learning-with-the-Titanic-dataset-on-Azure/main/Training_data.csv')

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `train.py` script then analyses and transforms the dataset and splits it into train and test sets.
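The contents of `train.py` are not shown in this notebook, so the following is only a minimal sketch of the splitting step using the standard library. The helper name `train_test_split_rows`, the 80/20 ratio, and the seed are assumptions for illustration; the actual script may use a different ratio or a library routine such as scikit-learn's `train_test_split`.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in row indices; a real script would split the transformed dataset instead.
train_rows, test_rows = train_test_split_rows(range(10), test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

Fixing the seed means every HyperDrive child run would be evaluated on the same held-out rows, which keeps their accuracy values comparable.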

## Hyperdrive Configuration

Hyperparameters are adjustable parameters that control the model training process, and hyperparameter tuning is the process of finding the hyperparameter configuration that gives the best performance. The HyperDrive package (`HyperDriveConfig`) automates this search. Here it tunes two hyperparameters of logistic regression: the maximum number of iterations, `choice(100, 150, 200, 250, 300)`, and the regularization parameter `C`, `uniform(0.01, 1)` (in scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength). `BayesianParameterSampling` is used to pick the configurations, and accuracy is the metric being maximized.

Logistic regression is a binary classification algorithm (the outcome is 0 or 1). It uses the logistic function, also called the sigmoid function, to turn a linear score into a predicted probability.
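To make the role of the sigmoid function concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up for illustration and are not taken from the trained model; the 0.5 threshold is the conventional default, assumed here.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z, this form avoids overflow in exp
    return e / (1.0 + e)


def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)


# Hypothetical two-feature model, for illustration only.
probability, label = predict(weights=[1.0, -2.0], bias=0.5, features=[1.0, 0.5])
print(round(probability, 3), label)
```

Training adjusts the weights and bias; the hyperparameters tuned in this notebook only control how that fitting is carried out.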

In [7]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.sampling import BayesianParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform
from azureml.train.hyperdrive.parameter_expressions import choice
import os

#Specify a parameter sampler (Bayesian sampling)
ps = BayesianParameterSampling({'--C': uniform(0.01,1),'--max_iter': choice(100, 150, 200, 250, 300)})

#Create the directory 'training' if it does not already exist
os.makedirs("./training", exist_ok=True)

#Create a SKLearn estimator for use with train.py
est = SKLearn(source_directory='./',
              compute_target=cpu_cluster,
              entry_script='train.py')

#Create a HyperDriveConfig using the estimator and hyperparameter sampler.
hyperdrive_config = HyperDriveConfig(
                                   hyperparameter_sampling = ps,
                                   primary_metric_name = 'accuracy',
                                   primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                                   max_total_runs = 20,
                                   max_concurrent_runs = 4,
                                   policy = None,
                                   estimator = est)

For best results with Bayesian Sampling we recommend using a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned. Recommendend value:40.
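The recommended value in the warning above follows directly from the stated rule. A small sketch of the arithmetic, where the helper name `recommended_bayesian_runs` is hypothetical and the factor of 20 is taken from the warning text:

```python
def recommended_bayesian_runs(num_hyperparameters, runs_per_hyperparameter=20):
    """Apply the rule from the warning: 20 runs per tuned hyperparameter."""
    return runs_per_hyperparameter * num_hyperparameters


search_space = ["--C", "--max_iter"]  # the two hyperparameters in the sampler
recommended = recommended_bayesian_runs(len(search_space))
configured = 20                       # max_total_runs in the HyperDriveConfig above

print(recommended, configured >= recommended)
```

With `max_total_runs = 20`, this configuration uses half the recommended budget; raising it to 40 would follow the recommendation at the cost of more compute time.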


## Run Details

We train a logistic regression model with different hyperparameter values chosen by Bayesian parameter sampling. Once the runs complete, the best run and its model can be identified from the logged accuracy.
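Each child run receives its sampled values as command-line arguments to the entry script, named after the keys given to the sampler (`--C` and `--max_iter` here). Since `train.py` is not shown in this notebook, the following is only a sketch of how such a script might read them, assuming it uses `argparse`; the helper name `parse_hyperparameters` and the default values are hypothetical.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Read the tuned hyperparameters from the command line."""
    parser = argparse.ArgumentParser(description="Logistic regression training (sketch)")
    parser.add_argument("--C", type=float, default=1.0,
                        help="Regularization parameter; smaller values regularize more")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    # parse_known_args ignores any extra arguments the launcher may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments one child run might receive.
args = parse_hyperparameters(["--C", "0.25", "--max_iter", "150"])
print(args.C, args.max_iter)
```

For HyperDrive to compare runs, the script also has to log the primary metric under the same name used in the configuration (`accuracy` here).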

In [8]:
#Submit the hyperdrive run
hd_run = exp.submit(hyperdrive_config)

#Launch the widget to view the progress and results
RunDetails(hd_run).show()




## Retrieve and save the best model

In [9]:
import joblib
from azureml.core.model import Model

#Get the best run.
best_run_hd = hd_run.get_best_run_by_primary_metric()
best_run_metrics_hd = best_run_hd.get_metrics()
print("Best Run Id: ", best_run_hd.id)
print("Accuracy: ", best_run_metrics_hd['accuracy'])

Best Run Id:  HD_2147dc72-0f35-4918-ba95-ba7c924b1f6c_6
Accuracy:  0.8715083798882681


In [10]:
#Register the best model in the workspace.
model_hd = best_run_hd.register_model(model_name='hyperdrive_best_model',
                                      model_path='./outputs/model.pkl',
                                      model_framework=Model.Framework.SCIKITLEARN,
                                      model_framework_version='0.19.1')
print("Model successfully saved.")

Model successfully saved.
