# Hyperparameter Tuning using HyperDrive

Import all the dependencies needed to complete the project.

In [1]:
from azureml.core import Workspace, Environment, Experiment, ScriptRunConfig
from azureml.core.dataset import Dataset
from azureml.core.model import Model
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.sklearn import SKLearn
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

## Dataset

### Overview
The dataset chosen for this project is the one from [Kaggle Titanic Challenge](https://www.kaggle.com/c/titanic). 
In the famous Titanic shipwreck, some passengers were more likely to survive than others. The dataset presents information about 891 passengers and a column that states whether each of them survived. The model that we will create using HyperDrive predicts which passengers survived the Titanic shipwreck.

In [3]:
ws = Workspace.from_config()
experiment_name = 'titanic-hyperdrive-experiment'

experiment=Experiment(ws, experiment_name)


# Get the data of Kaggle Titanic Dataset
key = "titanic-modified"
description_text = "Kaggle Titanic Challenge dataset with some changes made by myself"
found = False

if key in ws.datasets.keys(): 
    found = True
    dataset = ws.datasets[key] 

if not found:
    # Create AML Dataset and register it into Workspace
    example_data = 'https://raw.githubusercontent.com/clasimoes/nd00333-capstone/master/titanic_data/full_capstone.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    #Register Dataset in Workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Q | S | male |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.066409 | 0.523008 | 0.381594 | 32.204208 | 0.08642 | 0.725028 | 0.647587 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.244532 | 1.102743 | 0.806057 | 49.693429 | 0.281141 | 0.446751 | 0.47799 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 | 0.0 | 0.0 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 26.0 | 0.0 | 0.0 | 14.4542 | 0.0 | 1.0 | 1.0 |
| 75% | 668.5 | 1.0 | 3.0 | 37.0 | 1.0 | 0.0 | 31.0 | 0.0 | 1.0 | 1.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 | 1.0 | 1.0 |


### Create a compute cluster

Here we create a compute cluster to run the experiment. The cluster uses machines with the "STANDARD_DS12_V2" configuration and can scale up to 10 nodes; the notebook waits until at least 2 nodes are provisioned.

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "cluster-1"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',# for GPU, use "STANDARD_NC6"
                                                           #vm_priority = 'lowpriority', # optional
                                                           max_nodes=10)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 2, timeout_in_minutes = 10)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Hyperdrive Configuration

Here we use a Logistic Regression model from the scikit-learn (SKLearn) framework to classify whether a passenger would survive the Titanic shipwreck.

Hyperdrive is used to sample different values for two algorithm hyperparameters:
* "C": Inverse of regularization strength
* "max_iter": Maximum number of iterations taken for the solvers to converge

Different values and combinations of these hyperparameters lead to different models and directly affect model performance. Here we aim to build the model with the best possible **Accuracy**.

I chose to sample the values using Random Sampling, in which hyperparameter values are randomly selected from the defined search space. "C" is drawn from a uniform distribution between **0.001** and **1.0**, while "max_iter" is sampled from one of three values: **1000, 10000 and 100000**.
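
To make the search space concrete, the sketch below imitates the sampling with Python's standard `random` module. It is only an illustration of how the draws behave, not HyperDrive's internal sampler; the helper `sample_config` and the fixed seed are assumptions introduced for this example.

```python
import random

# Stand-in for the search space used in this notebook:
# C ~ uniform(0.001, 1.0), max_iter chosen from three fixed values.
MAX_ITER_CHOICES = (1000, 10000, 100000)


def sample_config(rng):
    """Draw one hyperparameter combination (hypothetical helper)."""
    return {
        "--C": rng.uniform(0.001, 1.0),
        "--max_iter": rng.choice(MAX_ITER_CHOICES),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_config(rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Each printed dictionary corresponds to the arguments one child run would receive.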

Here we also specify early termination of low-performing runs. The Bandit Policy was chosen with a slack factor of 0.1, which means that any run whose evaluation metric (in our case, "Accuracy") does not fall within the slack factor of the best-performing run is terminated. This saves both time and resources.
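
As a rough worked example of the slack factor: for a metric that is maximized, the Azure ML documentation describes the cutoff as the best metric divided by `1 + slack_factor`. The sketch below applies that rule under this assumption; the function names and the accuracy values (0.85, 0.80, 0.70) are illustrative and not taken from the experiment.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value still tolerated for a maximized metric (assumed rule)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if the run falls outside the slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_accuracy = 0.85  # illustrative value for the best run so far
cutoff = bandit_cutoff(best_accuracy, 0.1)
print(f"cutoff: {cutoff:.4f}")
print(should_terminate(0.80, best_accuracy))  # within the slack
print(should_terminate(0.70, best_accuracy))  # outside the slack
```

With a slack factor of 0.1, a run has to stay within roughly 91% of the best accuracy to keep running.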

Last but not least, the Hyperdrive configuration takes several settings that are worth mentioning: a training script used to import the data and train the Logistic Regression model using SKLearn; a compute target (the cluster defined above); an environment specification (see the `conda_dependencies.yml` file); a parameter sampling object with the Random Sampling configuration; an early termination policy; the metric that we aim to improve and in which direction; and the total number of runs to be performed by Hyperdrive.
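
The training script itself is not shown in this notebook. As a minimal sketch of how a script like `train_sklearn.py` could receive the sampled values, the example below parses `--C` and `--max_iter` with `argparse`. The function name `parse_args` and the default values are assumptions for illustration only; the actual script may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed on the command line (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Sketch of argument handling for a training script"
    )
    # Defaults are placeholders chosen for this example.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Simulate the command line of a single child run.
args = parse_args(["--C", "0.5", "--max_iter", "1000"])
print(args.C, args.max_iter)
```

The argument names match the keys used in the sampling dictionary (`'--C'` and `'--max_iter'`), which is what lets Hyperdrive pass each sampled combination to the script.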

In [5]:
# Copy the training script and workspace config into the project folder, then build the environment
import os
import shutil

project_folder = './sklearn-titanic'
os.makedirs(project_folder, exist_ok=True)
shutil.copy('train_sklearn.py', project_folder)
shutil.copy('config.json', project_folder)
            
sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')

In [6]:
# Create an early termination policy. 
early_termination_policy = BanditPolicy(slack_factor = 0.1)

# Create the different params that you will be using during training
param_sampling = RandomParameterSampling( {
    '--C': uniform(0.001, 1.0),
    '--max_iter': choice(1000, 10000, 100000)
} )

# Create the script run config and the hyperdrive config
src = ScriptRunConfig(source_directory=project_folder,
                      script='train_sklearn.py',
                      compute_target=compute_target,
                      environment=sklearn_env)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=100)

In [7]:
# Submit the experiment
hyperdrive_run = experiment.submit(hyperdrive_config)

## Run Details

Here we use the `RunDetails` widget to monitor the different runs of the experiment.

In [8]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

Here we get the best model from the Hyperdrive runs and display its run id, its arguments and its metrics.

In [11]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()

In [12]:
print(best_run.get_details()['runId'])

HD_0a8fb55d-f1ee-4fc6-9b70-e3c6f393df42_62


In [13]:
print(best_run.get_details()['runDefinition']['arguments'])

['--C', '0.8893892118773127', '--max_iter', '1000']


In [14]:
# Model Metrics
print(best_run.get_metrics())

{'Regularization Strength': 0.8893892118773127, 'Max iterations': 1000, 'Accuracy': 0.852017937219731}


In [15]:
best_run.get_details()

{'runId': 'HD_0a8fb55d-f1ee-4fc6-9b70-e3c6f393df42_62',
 'target': 'cluster-1',
 'status': 'Completed',
 'startTimeUtc': '2021-01-13T15:37:52.39963Z',
 'endTimeUtc': '2021-01-13T15:38:31.855764Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'd5193886-778d-43c2-a305-29f0094fbac7',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train_sklearn.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--C', '0.8893892118773127', '--max_iter', '1000'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'cluster-1',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'credentialPassthrough': False,
  'environment': {'name': 'sklearn-env',
   'version': 'Autos

In [16]:
print(best_run.get_file_names())

['azureml-logs/55_azureml-execution-tvmps_ebb5e258fe1f8a4d3edeeb9fa9d4246b46b1dc6d411ebb90ca15f82bc47d2ed3_d.txt', 'azureml-logs/65_job_prep-tvmps_ebb5e258fe1f8a4d3edeeb9fa9d4246b46b1dc6d411ebb90ca15f82bc47d2ed3_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_ebb5e258fe1f8a4d3edeeb9fa9d4246b46b1dc6d411ebb90ca15f82bc47d2ed3_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/102_azureml.log', 'logs/azureml/dataprep/backgroundProcess.log', 'logs/azureml/dataprep/backgroundProcess_Telemetry.log', 'logs/azureml/dataprep/engine_spans_l_d3b31b8d-1bbc-4ff3-a029-414feebd991a.jsonl', 'logs/azureml/dataprep/python_span_l_d3b31b8d-1bbc-4ff3-a029-414feebd991a.jsonl', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/model.joblib']


In [17]:
# Save the best model 
best_run.download_file('outputs/model.joblib', 'sklearn-titanic/model.joblib')

In [18]:
# Register the best model
model = best_run.register_model(model_name='sklearn-titanic',
                                model_path='outputs/model.joblib',
                                model_framework=Model.Framework.SCIKITLEARN)

## Model Deployment

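Deploy the registered model as a web service. Note that the call below passes only the workspace, the service name and the model to `Model.deploy`; no inference config is supplied.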

In [20]:
service_name = 'hyperdrive-service'
service = Model.deploy(ws, service_name, [model])
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [21]:
print(service.state)

Healthy


Send a request to the web service you deployed to test it.

In [22]:
import json

data = {"data":
        [
          {
            "PassengerId": 812,
            "Pclass": 2,
            "Age": 23.0,
            "SibSp": 0,
            "Parch": 0, 
            "Fare": 13.0,
            "Q": 0,
            "S": 1,
            "male": 1
          },
          {
            "PassengerId": 813,
            "Pclass": 1,
            "Age": 35.0,
            "SibSp": 0,
            "Parch": 0, 
            "Fare": 512.3292,
            "Q": 0,
            "S": 0,
            "male": 1
          }
      ]
    }

# Convert to JSON string
input_data = json.dumps(data)

In [25]:
output = service.run(input_data)

Print the response returned by the web service.

In [26]:
print(output)

{'predict_proba': [[0.6554829708839002, 0.3445170291160998], [0.07152712155339236, 0.9284728784466076]]}


In the cell above, the logistic regression model returns one entry per observation, each containing the prediction probabilities for the two classes.

The first item refers to passenger 812. The probability of this passenger belonging to the non-survived class (class 0) is 65.55%, while the probability of belonging to the survived class (class 1) is 34.45%. Thus we conclude that passenger 812 was classified as a non-survivor.

Similarly, passenger 813 has been classified as "Survived" (class 1) with a probability of 92.85%.
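
The reading above can be reproduced programmatically. The sketch below takes the response printed earlier and picks the most probable class for each passenger; the helper `summarize` and the label texts are assumptions added for this illustration.

```python
# Response copied from the web service output shown above.
output = {"predict_proba": [[0.6554829708839002, 0.3445170291160998],
                            [0.07152712155339236, 0.9284728784466076]]}
passenger_ids = [812, 813]
LABELS = {0: "did not survive", 1: "survived"}


def summarize(probabilities):
    """Return (predicted class, probability of that class)."""
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, probabilities[predicted]


results = [summarize(p) for p in output["predict_proba"]]
for pid, (cls, prob) in zip(passenger_ids, results):
    print(f"Passenger {pid}: class {cls} ({LABELS[cls]}), probability {prob:.2%}")
```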

In [27]:
service.delete()

In [28]:
print(service.state)

Deleting
