# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import ames # The module for loading external data - Ames Housing dataset
import os
import pandas as pd
import numpy as np
import json
import ast
import pickle

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core import Workspace, Dataset, Experiment, Model, Environment, ScriptRunConfig
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.widgets import RunDetails

from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, loguniform, choice

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EL9VGKYY3 to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
quick-starts-ws-154226
aml-quickstarts-154226
southcentralus
510b94ba-e453-4417-988b-fbdc37b55ca7


In [3]:
# Create compute cluster
# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [27]:
# # Try to load the dataset from the workspace. Otherwise, load it from Kaggle
# found = False
# ds_key = 'Ames-housing-dataset'
# ds_desc = 'Ames Housing training data.'

# if ds_key in ws.datasets.keys():
#     found = True
#     dataset = ws.datasets[ds_key]
#     print(f'Found registered {ds_key}, use it.')
    
# if not found:
#     train, test = ames.load_data_clean()
#     print(f"train.shape = {train.shape}, test.shape = {test.shape}")
#     # Register the train dataset
#     blob = ws.get_default_datastore()
#     dataset = TabularDatasetFactory.register_pandas_dataframe(train, blob, name=ds_key, description=ds_desc)

In [31]:
! python train_xgb.py

X_train.shape = (1095, 79), X_test.shape = (365, 79)
Attempted to log scalar metric Learning rate:
0.1
Attempted to log scalar metric Gamma:
2.0
Attempted to log scalar metric Maximum depth:
3.0
Attempted to log scalar metric r2_score:
0.8504777748523604
Writting r2 score = 0.8504777748523604 into a log.
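
The usage message printed further down in the deployment logs shows that `train_xgb.py` accepts `--learning_rate`, `--gamma` and `--max_depth`. HyperDrive hands each sampled value to the script as a command-line argument with the same name. The following is a minimal sketch of such an interface, not the actual script: the defaults mirror the values logged in the local run above, and the `int` type for `--max_depth` is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the three hyperparameters that the sweep passes on the command line."""
    parser = argparse.ArgumentParser(
        description="Hypothetical stand-in for the train_xgb.py command line"
    )
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--gamma", type=float, default=2.0)
    # Assumption: the tree depth is parsed as an integer.
    parser.add_argument("--max_depth", type=int, default=3)
    return parser.parse_args(argv)


# Simulate the arguments one sampled configuration could produce.
args = parse_args(["--learning_rate", "0.05", "--gamma", "5", "--max_depth", "7"])
print(args.learning_rate, args.gamma, args.max_depth)
```

Calling `parse_args([])` falls back to the defaults, which corresponds to running the script without arguments as in the cell above.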


## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

In [8]:
# Choose a name for an experiment
experiment_name = 'Ames-housing-hdr'

experiment=Experiment(ws, experiment_name)

In [9]:
%%writefile conda_env.yml

dependencies:
- python=3.6.2
- pip:
  - azureml-defaults==1.32.0
- scikit-learn
- xgboost

Overwriting conda_env.yml


In [34]:
# Define an Azure ML environment
# Dependencies are the same as for the AutoML experiment
env = Environment.from_conda_specification(name='env', file_path='conda_env.yml')

# Configure the training job
src = ScriptRunConfig(source_directory=".",
                     script='train_xgb.py',
                     #arguments=['--learning_rate', 0.01, '--gamma', 5, '--max_depth', 5], # Just for testing
                     compute_target=cpu_cluster,
                     environment=env)

In [33]:
# Test the script
# run = experiment.submit(src)

In [35]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
# Specify a Policy
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

#TODO: Create the different params that you will be using during training
# Specify parameter sampler
ps = RandomParameterSampling(
    {
        '--learning_rate': loguniform(-4.6, -1.6), # results in [0.01, 0.2]
        '--gamma': uniform(0, 9), 
        '--max_depth': choice(3, 5, 7)
    }
)

#TODO: Create your estimator and hyperdrive config
# src - see above

# Create a HyperDriveConfig using the ScriptRunConfig (src), hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(run_config=src,
                                    hyperparameter_sampling=ps,
                                    policy=policy,
                                    primary_metric_name='r2_score',
                                    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                    max_total_runs=20,
                                    max_concurrent_runs=4,
                                    max_duration_minutes=30)
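
The comment next to `loguniform(-4.6, -1.6)` can be checked by hand: the expression samples a value uniformly between the two bounds and then exponentiates it, so the learning rate lands between roughly exp(-4.6) and exp(-1.6), i.e. about 0.01 and 0.2. The sketch below reproduces that behaviour with the standard library only; `sample_loguniform` is a hypothetical helper, not part of the Azure ML SDK.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u uniform in [low, high]."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_loguniform(-4.6, -1.6, rng) for _ in range(1000)]

lower_bound = math.exp(-4.6)  # close to 0.01
upper_bound = math.exp(-1.6)  # close to 0.2
print(lower_bound, upper_bound, min(samples), max(samples))
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, so small learning rates near 0.01 are explored as often as larger ones near 0.2.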

In [36]:
#TODO: Submit your experiment
hdr = experiment.submit(config=hyperdrive_config)
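
While the sweep runs, the `BanditPolicy` defined above compares each run with the best run so far at every evaluation interval. For a metric that is maximized, a run is cancelled when its best metric falls below the best metric divided by `1 + slack_factor`. The sketch below works through that rule with plain Python; the function names are hypothetical and the metric values are made up for illustration.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric a run may have and still continue (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """True when the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_r2 = 0.91  # illustrative value only
cutoff = bandit_cutoff(best_r2, 0.1)
for candidate in (0.80, 0.85):
    print(candidate, should_terminate(candidate, best_r2))
```

With `slack_factor=0.1` the cutoff sits at about 91% of the best score. If the training script reports `r2_score` only once per run, as the single log line in the local test suggests, the policy may have few opportunities to stop a run early.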

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [37]:
# Show run details with the widget.
RunDetails(hdr).show()
hdr.wait_for_completion(show_output=True)


RunId: HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c
Web View: https://ml.azure.com/runs/HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c?wsid=/subscriptions/510b94ba-e453-4417-988b-fbdc37b55ca7/resourcegroups/aml-quickstarts-154226/workspaces/quick-starts-ws-154226&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-08-11T08:04:21.164730][API][INFO]Experiment created<END>\n""<START>[2021-08-11T08:04:21.701704][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-08-11T08:04:21.911203][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"

Execution Summary
RunId: HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c
Web View: https://ml.azure.com/runs/HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c?wsid=/subscriptions/510b94ba-e453-4417-988b-fbdc37b55ca7/resourcegroups/aml-quickstarts-154226/workspaces/quick-starts-ws-154226&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254



{'runId': 'HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-08-11T08:04:20.920048Z',
 'endTimeUtc': '2021-08-11T08:13:29.839392Z',
 'properties': {'primary_metric_config': '{"name": "r2_score", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': 'c27b5318-bc5f-4c9f-8a2c-20d6de110122',
  'user_agent': 'python/3.6.9 (Linux-5.4.0-1055-azure-x86_64-with-debian-buster-sid) msrest/0.6.21 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.32.0',
  'score': '0.9124076643556375',
  'best_child_run_id': 'HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c_15',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg154226.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c/azureml-logs/hyperdrive.txt?sv=2019-07-07&sr=b&

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [38]:
# Get your best run and save the model from that run.
best_run = hdr.get_best_run_by_primary_metric()
print(best_run)

Run(Experiment: Ames-housing-hdr,
Id: HD_de3caa4d-e04f-4365-95c9-52373d8c1a7c_15,
Type: azureml.scriptrun,
Status: Completed)


In [39]:
best_run_metrics = best_run.get_metrics()
best_run_metrics

{'Learning rate': 0.07287096533079379,
 'Gamma': 8.484787185178305,
 'Maximum depth': 5.0,
 'r2_score': 0.9124076643556375}

In [40]:
details = best_run.get_details()

# Save metrics and details for ex-post examination
with open('best_hdr_metrics.json', 'w') as file:
    json.dump(best_run_metrics, file)
with open('best_hdr_details.txt', 'w') as file:
    file.write(str(details))

In [41]:
best_run.get_file_names()[-1]

'outputs/model.joblib'

In [42]:
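#TODO: Save the best model
os.makedirs('./outputs', exist_ok=True)
# Download the model artifact by its explicit name rather than by its position in the file list
best_run.download_file('outputs/model.joblib', output_file_path='./outputs/model.joblib')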

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [43]:
# Register the best model
model = Model.register(ws, model_path='outputs/model.joblib', model_name='Ames-Housing-XGB-Model', tags=best_run_metrics)
print(model.name, model.id, model.version, sep='\t')

Registering model Ames-Housing-XGB-Model
Ames-Housing-XGB-Model	Ames-Housing-XGB-Model:2	2


In [48]:
from azureml.core.webservice import AciWebservice
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,
                                              memory_gb=1,
                                              tags={"data" : "Kaggle", "method" : "XGB"},
                                              description="Predict Ames Housing Prices")

In [49]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment("project-env")
cd = CondaDependencies('conda_env.yml')
env.python.conda_dependencies = cd
# Register environment to re-use later
env.register(workspace=ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210615.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "project-env",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "dependencies": [
                "python=3.6.2",
                {
      

In [50]:
%%time
import uuid
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model(ws, 'Ames-Housing-XGB-Model')

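myenv = Environment.get(workspace=ws, name="project-env", version="1")
# NOTE: entry_script is expected to be a scoring script that defines init() and run().
# train_xgb.py is the training script, which is the likely reason the deployment below ends up Unhealthy.
inference_config = InferenceConfig(entry_script="train_xgb.py", environment=myenv)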

service_name = 'ames-housing-xgb-' + str(uuid.uuid4())[:4]
service = Model.deploy(workspace=ws,
                      name=service_name,
                      models=[model],
                      inference_config=inference_config,
                      deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-08-11 08:33:55+00:00 Creating Container Registry if not exists.
2021-08-11 08:33:55+00:00 Registering the environment.
2021-08-11 08:33:58+00:00 Use the existing image.
2021-08-11 08:33:58+00:00 Generating deployment configuration.
2021-08-11 08:34:00+00:00 Submitting deployment to compute.
2021-08-11 08:34:02+00:00 Checking the status of deployment ames-housing-xgb-0fd5..
2021-08-11 08:36:23+00:00 Checking the status of inference endpoint ames-housing-xgb-0fd5.
Failed


Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 69fbb6f3-f9ff-4780-b923-117ff94f9176
More information can be found using '.get_logs()'
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 400,
  "message": "Aci Deployment failed with exception: Your scoring file's init() function restarts frequently. You can address the error by increasing the value of memory_gb in deployment_config.",
  "details": [
    {
      "code": "ScoreInitRestart",
      "message": "Your scoring file's init() function restarts frequently. You can address the error by increasing the value of memory_gb in deployment_config."
    }
  ]
}



WebserviceException: WebserviceException:
	Message: Service deployment polling reached non-successful terminal state, current service state: Unhealthy

In [47]:
print(service.get_logs())

2021-08-11T08:18:28,040751700+00:00 - gunicorn/run 
File not found: /var/azureml-app/.
Starting HTTP server
2021-08-11T08:18:28,048302900+00:00 - rsyslog/run 
2021-08-11T08:18:28,043016800+00:00 - iot-server/run 
2021-08-11T08:18:28,091305300+00:00 - nginx/run 
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2021-08-11T08:18:28,558826700+00:00 - iot-server/finish 1 0
2021-08-11T08:18:28,561409800+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (66)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 91
SPARK_HOME not set. Skipping PySpark Initialization.
X_train.shape = (1095, 79), X_test.shape = (365, 79)
Could not load the run context. Logging offline
Worker exiting (pid: 91)
usage: gunicorn [-h] [--learning_rate LEARNING_RATE] [--gamma GAMMA]
                [--max_depth MAX_DEPTH]
gunicorn: error: unrecognized arguments: -c /var/azureml-server/synchronous/gunicorn

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
service.scoring_uri

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
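
Once a service is healthy, a test request is an HTTP POST of a JSON body to `service.scoring_uri`. The sketch below only builds such a request with the standard library and leaves the sending step commented out. The `{"data": [...]}` payload layout, the placeholder URI and the helper name `build_scoring_request` are assumptions; the real payload must match whatever the scoring script expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, extra_headers=None):
    """Build a POST request carrying the rows as a JSON document."""
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})  # e.g. an authorization header, if required
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Placeholder URI; in the notebook this would be service.scoring_uri.
request = build_scoring_request("http://localhost:8000/score", [[0.0, 1.0, 2.0]])

# To send it against a running service:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```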

TODO: In the cell below, print the logs of the web service and delete the service

In [40]:
# Delete the web service, then deprovision and delete the AmlCompute target.
service.delete()
cpu_cluster.delete()