Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Train, hyperparameter tune, and deploy with Keras

## Introduction
This tutorial shows how to train a simple deep neural network using the MNIST dataset and Keras on Azure Machine Learning. MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of `28x28` pixels, representing a number from 0 to 9. The goal is to create a multi-class classifier to identify the digit each image represents, and deploy it as a web service in Azure.

For more information about the MNIST dataset, please visit [Yann LeCun's website](http://yann.lecun.com/exdb/mnist/).

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)
* For the local scoring test, you will also need to have `tensorflow` and `keras` installed in the current Jupyter kernel.

Let's get started. First let's import some Python libraries.

In [1]:
%matplotlib inline
import numpy as np
import os
import matplotlib.pyplot as plt

In [36]:
from sklearn.preprocessing import OneHotEncoder

In [2]:
import azureml
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.43


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [3]:
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: azure-ml-experiments
Azure region: northeurope
Subscription id: 79ec9c01-599f-4707-82f9-31b2d938f2e5
Resource group: pedro-test
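
For reference, `Workspace.from_config()` reads a small JSON file. The sketch below builds one with placeholder values, assuming the three keys the SDK conventionally uses (`subscription_id`, `resource_group`, `workspace_name`); the values are illustrative and must be replaced with your own.

```python
import json

# Placeholder values; substitute your own subscription, resource group, and workspace.
config = {
    'subscription_id': '<your-subscription-id>',
    'resource_group': '<your-resource-group>',
    'workspace_name': '<your-workspace-name>',
}

# Serialize the way a config.json file would typically be laid out.
config_text = json.dumps(config, indent=4)
print(config_text)
```

Writing `config_text` to a file named `config.json` next to the notebook should be enough for `Workspace.from_config()` to pick it up.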


## Create an Azure ML experiment
Let's create an experiment named "keras-mnist" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [4]:
from azureml.core import Experiment

script_folder = './keras-mnist'
os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='keras-mnist')

### Check that the datastore is there

In [5]:
ds = ws.get_default_datastore()

## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.

If no cluster with the given name is found, a new one is created here: an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and provides only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell

In [16]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing compute target
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-06-24T15:54:07.616000+00:00', 'errors': None, 'creationTime': '2019-06-24T15:53:20.147297+00:00', 'modifiedTime': '2019-06-24T15:54:12.544682+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 2, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named "gpu-cluster" of type `AmlCompute`.

In [17]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

cpu-cluster AmlCompute Succeeded
gpu-cluster AmlCompute Succeeded
azure-ml-cpu AmlCompute Succeeded


## Copy the training files into the script folder
The Keras training script is already created for you. You can simply copy it into the script folder, together with the utility library used to load the compressed data files into numpy arrays.

In [18]:
import shutil

# the training logic is in the keras_mnist.py file.
shutil.copy('./keras_mnist.py', script_folder)

# the utils.py just helps loading data from the downloaded MNIST dataset into numpy arrays.
shutil.copy('./utils.py', script_folder)

'./keras-mnist/utils.py'

## Construct neural network in Keras
The training script `keras_mnist.py` creates a very simple DNN (deep neural network) with just 2 hidden layers. The input layer has 28 * 28 = 784 neurons, each representing a pixel in an image. The first hidden layer has 300 neurons, and the second hidden layer has 100 neurons. The output layer has 10 neurons, each representing a targeted label from 0 to 9.

![DNN](nn.png)
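
To get a feel for the size of this network, the sketch below counts its trainable parameters from the layer sizes given above. It assumes plain fully connected layers with one bias per neuron (the Keras `Dense` default); the helper name `dense_param_count` is made up for illustration.

```python
# Layer sizes from the description above: input, two hidden layers, output.
layer_sizes = [784, 300, 100, 10]

def dense_param_count(sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output neuron
    return total

total_params = dense_param_count(layer_sizes)
print(total_params)  # 266610
```

Almost all of the parameters (235,500 of them) sit between the input and the first hidden layer.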

### Azure ML concepts  
Please note the following three things in the code below:
1. The script accepts arguments using the `argparse` package. One of them, `--data-folder`, specifies the file system folder in which the script can find the MNIST data
```
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
```
2. The script accesses the Azure ML `Run` object by executing `run = Run.get_context()`. Further down, the script uses `run` to report the loss and accuracy at the end of each epoch via a callback.
```
    run.log('Loss', log['loss'])
    run.log('Accuracy', log['acc'])
```
3. When running the script on Azure ML, you can write files out to a folder `./outputs` that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record.
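
The argument handling from item 1 can be tried locally without Azure ML. This is a minimal sketch that mirrors two of the script's arguments; the path `/tmp/mnist` is a made-up value, and the list passed to `parse_args` stands in for the command line that Azure ML would build.

```python
import argparse

parser = argparse.ArgumentParser()
# dest maps the hyphenated flag to a valid Python attribute name.
parser.add_argument('--data-folder', type=str, dest='data_folder')
parser.add_argument('--batch-size', type=int, dest='batch_size', default=50)

# Parse an explicit list instead of sys.argv so the snippet runs anywhere.
args = parser.parse_args(['--data-folder', '/tmp/mnist', '--batch-size', '25'])
print(args.data_folder, args.batch_size)
```

Note that `type=int` converts the string `'25'` into an integer, which is why numeric values can be passed as plain strings on the command line.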

The next cell will print out the training code for you to inspect.

In [19]:
with open(os.path.join(script_folder, './keras_mnist.py'), 'r') as f:
    print(f.read())

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import numpy as np
import argparse
import os

import matplotlib.pyplot as plt

import keras
from keras.models import Sequential, model_from_json
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.callbacks import Callback

import tensorflow as tf

from azureml.core import Run
from utils import load_data, one_hot_encode

print("Keras version:", keras.__version__)
print("Tensorflow version:", tf.__version__)

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--batch-size', type=int, dest='batch_size', default=50, help='mini batch size for training')
parser.add_argument('--first-layer-neurons', type=int, dest='n_hidden_1', default=100,
                    help='# of neurons in the first layer')
parser.add_argument('--second-layer-neurons', type=int, dest='n_hidd

## Create TensorFlow estimator & add Keras
Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the `gpu-cluster` as compute target, and pass the mount-point of the datastore to the training code as a parameter.
The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target. It will automatically provide a docker image that has TensorFlow installed. In this case, we add the `keras` package (for the Keras framework), and the `matplotlib` package for plotting a "Loss vs. Accuracy" chart and recording it in run history.

In [20]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--data-folder': ds.path('mnist').as_mount(),
    '--batch-size': 50,
    '--first-layer-neurons': 300,
    '--second-layer-neurons': 100,
    '--learning-rate': 0.001
}

est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target, 
                 pip_packages=['keras', 'matplotlib'],
                 entry_script='keras_mnist.py', 
                 use_gpu=True)



And if you are curious, this is what the mounting point looks like:

In [21]:
print(ds.path('mnist').as_mount())

$AZUREML_DATAREFERENCE_f8f3e69ca84544d39b2d30263523ce37


## Submit job to run
Submit the estimator to the Azure ML experiment to kick off the execution.

In [22]:
run = exp.submit(est)

### Monitor the Run
As the Run is executed, it will go through the following stages:
1. Preparing: A docker image is created matching the Python environment specified by the TensorFlow estimator and it will be uploaded to the workspace's Azure Container Registry. This step will only happen once for each Python environment -- the container will then be cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.

2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than are currently available), the cluster will attempt to scale up in order to make the required number of nodes available. Scaling typically takes about **5 minutes**.

3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted/copied and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.

4. Post-Processing: The `./outputs` folder of the run is copied over to the run history

There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. 

**Note: The widget will automatically update every 10-15 seconds, always showing you the most up-to-date information about the run**

**The widget isn't working**

We can also periodically check the status of the run object, and navigate to Azure portal to monitor the run.
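
Periodic checking can be sketched as a small polling loop. The helper `poll_until_done` and the set of terminal states are assumptions made for illustration, and a stand-in status source is used so the snippet runs without a workspace.

```python
import time

# Terminal states assumed for illustration; check the SDK documentation for the full list.
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}

def poll_until_done(get_status, interval_seconds=30, max_polls=120):
    """Call get_status() repeatedly until it returns a terminal state."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        print('Run status:', status)
        if status in TERMINAL_STATES:
            break
        time.sleep(interval_seconds)
    return status

# Stand-in for run.get_status so the sketch runs without a workspace.
fake_statuses = iter(['Preparing', 'Running', 'Completed'])
final_status = poll_until_done(lambda: next(fake_statuses), interval_seconds=0)
```

With a real run, passing `run.get_status` as the first argument should behave the same way.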

In [24]:
run

Experiment,Id,Type,Status,Details Page,Docs Page
keras-mnist,keras-mnist_1561473949_451ed77e,azureml.scriptrun,Preparing,Link to Azure Portal,Link to Documentation


In [25]:
run.wait_for_completion(show_output=True)

RunId: keras-mnist_1561473949_451ed77e
Web View: https://mlworkspace.azure.ai/portal/subscriptions/79ec9c01-599f-4707-82f9-31b2d938f2e5/resourceGroups/pedro-test/providers/Microsoft.MachineLearningServices/workspaces/azure-ml-experiments/experiments/keras-mnist/runs/keras-mnist_1561473949_451ed77e

Streaming azureml-logs/20_image_build_log.txt

2019/06/25 14:45:58 Downloading source code...
2019/06/25 14:45:59 Finished downloading source code
2019/06/25 14:46:00 Using acb_vol_083f6e5b-0b61-4566-b482-c805f4ef33f1 as the home volume
2019/06/25 14:46:00 Creating Docker network: acb_default_network, driver: 'bridge'
2019/06/25 14:46:00 Successfully set up Docker network: acb_default_network
2019/06/25 14:46:00 Setting up Docker configuration...
2019/06/25 14:46:01 Successfully set up Docker configuration
2019/06/25 14:46:01 Logging in to registry: azuremlexperd4996c19.azurecr.io
2019/06/25 14:46:02 Successfully logged into azuremlexperd4996c19.azurecr.io
2019/06/25 14:46:02 Executing step 

  Downloading https://files.pythonhosted.org/packages/2c/f4/c34cb32d3c789e5205d9bc1a9b7f191f3dfe45a9fab5655d7441c198f63b/azureml_defaults-1.0.43-py2.py3-none-any.whl
Collecting tensorflow-gpu==1.13.1 (from -r /azureml-setup/condaenv.79tk76ed.requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/7b/b1/0ad4ae02e17ddd62109cd54c291e311c4b5fd09b4d0678d3d6ce4159b0f0/tensorflow_gpu-1.13.1-cp36-cp36m-manylinux1_x86_64.whl (345.2MB)
Collecting horovod==0.16.1 (from -r /azureml-setup/condaenv.79tk76ed.requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/89/70/327e1ce9bee0fb8a879b98f8265fb7a41ae6d04a3ee019b2bafba8b66333/horovod-0.16.1.tar.gz (2.6MB)
Collecting pyyaml (from keras->-r /azureml-setup/condaenv.79tk76ed.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/a3/65/837fefac7475963d1eccf4aa684c23b95aa6c1d033a2c5965ccb11e22623/PyYAML-5.1.1.tar.gz (274kB)
Collecting keras-preprocessing>=1.0.5 (from keras->-r /

7fd0e7b966d8: Pushed
950abdfbc65e: Pushed
11210dd81681: Pushed
91d341ced524: Pushed
cacd9b90c818: Pushed
221e6befd0c5: Pushed
478462fe5e15: Pushed
297fd071ca2f: Pushed
2f0d1e8214b2: Pushed
7dd604ffa87f: Pushed
5d6cca69c100: Pushed
aa54c2bc1229: Pushed
ef3304f85894: Pushed
31b946ea17bb: Pushed
beac226414c1: Pushed
latest: digest: sha256:5578a588a32e1fa3656937bdc469e292e2be64c8ab5b8af568cd7df7e7ff74b3 size: 4728
2019/06/25 14:54:47 Successfully pushed image: azuremlexperd4996c19.azurecr.io/azureml/azureml_e639d7869712879dfdb1d61005d0d6e0:latest
2019/06/25 14:54:47 Step ID: acb_step_0 marked as successful (elapsed time in seconds: 286.238073)
2019/06/25 14:54:47 Populating digests for step ID: acb_step_0...
2019/06/25 14:54:49 Successfully populated digests for step ID: acb_step_0
2019/06/25 14:54:49 Step ID: acb_step_1 marked as successful (elapsed time in seconds: 238.939073)
2019/06/25 14:54:49 The following dependencies were found:
2019/06/25 14:54:49 
- image:
    registry: azuremlex

Exception in thread Thread-22:
Traceback (most recent call last):
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/multiprocessing/pool.py", line 479, in _handle_results
    cache[job]._set(i, obj)
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/multiprocessing/pool.py", line 649, in _set
    self._callback(self._value)
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/azureml/widgets/_userrun/_run_details.py", line 503, in _update_metrics
    self.widget_instance.run_metrics = result
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/traitlets/traitlets.py", line 585, in __set__
    self.set(obj, value)
  File "/Users/psfriso/anaconda3/envs/azure_ml/lib/pyt


Execution Summary
RunId: keras-mnist_1561473949_451ed77e
Web View: https://mlworkspace.azure.ai/portal/subscriptions/79ec9c01-599f-4707-82f9-31b2d938f2e5/resourceGroups/pedro-test/providers/Microsoft.MachineLearningServices/workspaces/azure-ml-experiments/experiments/keras-mnist/runs/keras-mnist_1561473949_451ed77e



{'runId': 'keras-mnist_1561473949_451ed77e',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2019-06-25T14:59:01.717853Z',
 'endTimeUtc': '2019-06-25T15:02:23.08799Z',
 'properties': {'azureml.runsource': 'experiment',
  'AzureML.DerivedImageName': 'azureml/azureml_e639d7869712879dfdb1d61005d0d6e0',
  'ContentSnapshotId': 'bcde896a-e0a8-4257-8aa0-1f3f89f4555e'},
 'runDefinition': {'script': 'keras_mnist.py',
  'arguments': ['--data-folder',
   '$AZUREML_DATAREFERENCE_c236cf1297ba41378d4ce4d49deba0cc',
   '--batch-size',
   '50',
   '--first-layer-neurons',
   '300',
   '--second-layer-neurons',
   '100',
   '--learning-rate',
   '0.001'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'gpu-cluster',
  'dataReferences': {'c236cf1297ba41378d4ce4d49deba0cc': {'dataStoreName': 'workspaceblobstore',
    'mode': 'Mount',
    'pathOnDataStore': 'mnist',
    'pathOnCompute': None,
    'overwrite': False}},
  'jobName': Non

In the outputs of the training script, it prints out the Keras version number. Please make a note of it.

### The Run object
The Run object provides the interface to the run history -- both to the job and to the control plane (this notebook), and both while the job is running and after it has completed. It provides a number of useful features, for instance:
* `run.get_details()`: Provides a rich set of properties of the run
* `run.get_metrics()`: Provides a dictionary with all the metrics that were reported for the Run
* `run.get_file_names()`: Lists all the files that were uploaded to the run history for this Run. This will include the `outputs` and `logs` folders, azureml-logs and other logs, as well as files that were explicitly uploaded to the run using `run.upload_file()`

Below are some examples -- please run through them and inspect their output. 

In [27]:
run.get_details()

{'runId': 'keras-mnist_1561473949_451ed77e',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2019-06-25T14:59:01.717853Z',
 'endTimeUtc': '2019-06-25T15:02:23.08799Z',
 'properties': {'azureml.runsource': 'experiment',
  'AzureML.DerivedImageName': 'azureml/azureml_e639d7869712879dfdb1d61005d0d6e0',
  'ContentSnapshotId': 'bcde896a-e0a8-4257-8aa0-1f3f89f4555e'},
 'runDefinition': {'script': 'keras_mnist.py',
  'arguments': ['--data-folder',
   '$AZUREML_DATAREFERENCE_c236cf1297ba41378d4ce4d49deba0cc',
   '--batch-size',
   '50',
   '--first-layer-neurons',
   '300',
   '--second-layer-neurons',
   '100',
   '--learning-rate',
   '0.001'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'gpu-cluster',
  'dataReferences': {'c236cf1297ba41378d4ce4d49deba0cc': {'dataStoreName': 'workspaceblobstore',
    'mode': 'Mount',
    'pathOnDataStore': 'mnist',
    'pathOnCompute': None,
    'overwrite': False}},
  'jobName': Non

In [28]:
run.get_metrics()

{'Loss': [0.21193540368539593,
  0.08979386562573685,
  0.0640791903809683,
  0.04969040511659841,
  0.03851924479506427,
  0.034629579259246084,
  0.026691853923315422,
  0.022164999651270893,
  0.019180537446129336,
  0.016137909693583575,
  0.014995283454052431,
  0.01181835241087434,
  0.01221864976328655,
  0.01097686702258443,
  0.009526684323577537,
  0.008651913791873628,
  0.008176027018272787,
  0.006864741751473722,
  0.006263888735279886,
  0.005459982245078159],
 'Accuracy': [0.9361833322482804,
  0.9733833349247774,
  0.9812166696290175,
  0.9855166707932949,
  0.9889000045756499,
  0.9902500042319298,
  0.9924000041683515,
  0.9935333373149237,
  0.9943833367029826,
  0.995666669656833,
  0.9959333363175392,
  0.9969333358108997,
  0.9967500025530657,
  0.9969833355148633,
  0.9974500020345052,
  0.9974333353837331,
  0.9980000015099844,
  0.9982166680693626,
  0.9985166678329309,
  0.9985666677852472],
 'Final test loss': 0.1386978552624188,
 'Final test accuracy': 0.98

In [29]:
run.get_file_names()

['Accuracy vs Loss_1561474928.png',
 'azureml-logs/20_image_build_log.txt',
 'azureml-logs/55_batchai_execution.txt',
 'azureml-logs/55_batchai_stdout-job_post.txt',
 'azureml-logs/55_batchai_stdout-job_prep.txt',
 'azureml-logs/55_batchai_stdout.txt',
 'azureml-logs/56_batchai_stderr.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/driver_log.txt',
 'logs/azureml/137_azureml.log',
 'logs/azureml/azureml.log',
 'outputs/model/model.h5',
 'outputs/model/model.json']
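
The dictionary returned by `run.get_metrics()` is plain Python, so it can be inspected directly. As a small sketch, the snippet below finds the best epoch in a per-epoch list; it uses only the first three values from the output above, and `best_epoch` is a hypothetical helper.

```python
# First three per-epoch values copied from the run.get_metrics() output above.
metrics = {
    'Loss': [0.21193540368539593, 0.08979386562573685, 0.0640791903809683],
    'Accuracy': [0.9361833322482804, 0.9733833349247774, 0.9812166696290175],
}

def best_epoch(values, maximize=True):
    """Return (1-based epoch, value) for the best entry in a per-epoch list."""
    pick = max if maximize else min
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]

print(best_epoch(metrics['Accuracy']))
print(best_epoch(metrics['Loss'], maximize=False))
```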

## Download the saved model

In the training script, the Keras model is saved into two files, `model.json` and `model.h5`, in the `outputs/model` folder on the gpu-cluster AmlCompute node. Azure ML automatically uploads anything written to the `./outputs` folder into the run history file store. Subsequently, we can use the `run` object to download the model files. They are under the `outputs/model` folder in the run history file store, and are downloaded into a local folder named `model`.

In [30]:
# create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)

Downloading from outputs/model/model.h5 to ./model/model.h5 ...
Downloading from outputs/model/model.json to ./model/model.json ...


## Predict on the test set
Let's check the version of the local Keras installation. Make sure it matches the version number printed out by the training script. Otherwise you might not be able to load the model properly.

In [33]:
import keras
import tensorflow as tf

print("Keras version:", keras.__version__)
print("Tensorflow version:", tf.__version__)

Using TensorFlow backend.


Keras version: 2.2.4
Tensorflow version: 1.14.0


Now let's load the downloaded model.

In [34]:
from keras.models import model_from_json

# load json and create model
with open('model/model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model/model.h5")
print("Model loaded from disk.")

W0625 16:27:02.328063 4565059008 deprecation_wrapper.py:119] From /Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0625 16:27:02.357424 4565059008 deprecation_wrapper.py:119] From /Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0625 16:27:02.425621 4565059008 deprecation_wrapper.py:119] From /Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

W0625 16:27:02.426526 4565059008 deprecation_wrapper.py:119] From /Users/psfriso/anaconda3/envs/azure_ml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecate

Model loaded from disk.


Feed test dataset to the persisted model to get predictions.

In [39]:
from utils import load_data, one_hot_encode

In [48]:
X_test = load_data('../data/test-images.gz', False) / 255.0
y_test = load_data('../data/test-labels.gz', True).reshape(-1)

In [49]:
# evaluate loaded model on test data
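# the labels are one-hot encoded over 10 classes, so categorical cross-entropy is the matching loss;
# predict() below does not depend on the compile settings
loaded_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])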
y_test_ohe = one_hot_encode(y_test, 10)
y_hat = np.argmax(loaded_model.predict(X_test), axis=1)

# print the first 30 labels and predictions
print('labels:  \t', y_test[:30])
print('predictions:\t', y_hat[:30])

labels:  	 [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1]
predictions:	 [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1]


Calculate the overall accuracy by comparing the predicted value against the test set.

In [50]:
print("Accuracy on the test set:", np.average(y_hat == y_test))

Accuracy on the test set: 0.9824
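
The prediction and accuracy steps above boil down to an argmax over class scores followed by an element-wise comparison. Here is a dependency-free sketch with made-up scores; `one_hot` is an assumption about what a helper like `one_hot_encode` usually does, since `utils.py` is not shown here.

```python
def one_hot(labels, n_classes):
    """Turn integer labels into one-hot rows, e.g. 2 -> [0, 0, 1] for 3 classes."""
    return [[1 if c == label else 0 for c in range(n_classes)] for label in labels]

def argmax(row):
    """Index of the largest entry in a row of class scores."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up scores for three samples over three classes (not real model output).
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
labels = [1, 0, 1]

predictions = [argmax(row) for row in scores]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

The third sample is misclassified (predicted 2, labeled 1), so two of the three predictions match.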


## Intelligent hyperparameter tuning
We have trained the model with one set of hyperparameters. Now let's see how we can do hyperparameter tuning by launching multiple runs on the cluster. First let's define the parameter space using random sampling.

In [51]:
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

ps = RandomParameterSampling(
    {
        '--batch-size': choice(25, 50, 100),
        '--first-layer-neurons': choice(10, 50, 200, 300, 500),
        '--second-layer-neurons': choice(10, 50, 200, 500),
        '--learning-rate': loguniform(-6, -1)
    }
)
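
To make the search space concrete, here is a local sketch of what random sampling from it could look like. It assumes `loguniform(-6, -1)` means "exponentiate a uniform draw from [-6, -1]", which puts the learning rate between roughly 0.0025 and 0.37; `sample_config` is a hypothetical helper and not part of the SDK.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from a space shaped like the one above."""
    return {
        '--batch-size': rng.choice([25, 50, 100]),
        '--first-layer-neurons': rng.choice([10, 50, 200, 300, 500]),
        '--second-layer-neurons': rng.choice([10, 50, 200, 500]),
        # loguniform(-6, -1): exponentiate a uniform draw from [-6, -1]
        '--learning-rate': math.exp(rng.uniform(-6, -1)),
    }

rng = random.Random(0)  # fixed seed only so repeated runs match
samples = [sample_config(rng) for _ in range(5)]
for s in samples:
    print(s)
print('learning-rate bounds:', math.exp(-6), math.exp(-1))
```

Sampling on a log scale spreads the draws evenly across orders of magnitude, which is usually what you want for a learning rate.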

Next, we will create a new estimator without the above parameters since they will be passed in later by the HyperDrive configuration. Note we still need to keep the `data-folder` parameter since that's not a hyperparameter we will sweep.

In [52]:
est = TensorFlow(source_directory=script_folder,
                 script_params={'--data-folder': ds.path('mnist').as_mount()},
                 compute_target=compute_target,
                 conda_packages=['keras', 'matplotlib'],
                 entry_script='keras_mnist.py', 
                 use_gpu=True)

W0625 16:40:16.795565 4565059008 estimator.py:1048] framework_version is not specified, defaulting to version 1.13.


Now we will define an early termination policy. The `BanditPolicy` below checks the jobs every 2 iterations. If the primary metric (defined later) falls outside of the top 10% range, Azure ML terminates the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.

In [53]:
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
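
As a rough illustration of the slack rule, the sketch below assumes the cutoff for a maximized metric is `best / (1 + slack_factor)`, which is how the Azure ML documentation describes the Bandit policy. The function names and the accuracy values are hypothetical.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value that survives when the goal is to maximize."""
    return best_metric / (1 + slack_factor)

def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """True if a run falls below the cutoff derived from the best run so far."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)

best_so_far = 0.99  # hypothetical best accuracy at an evaluation interval
print(round(bandit_cutoff(best_so_far, 0.1), 4))
print(should_terminate(0.85, best_so_far))
print(should_terminate(0.95, best_so_far))
```

With a best accuracy of 0.99 the cutoff is about 0.9, so a run at 0.85 would be stopped while a run at 0.95 would continue.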

Now we are ready to configure a run configuration object, and specify the primary metric `Accuracy` that's recorded in your training runs. If you go back to visit the training script, you will notice that this value is being logged after every epoch (a full batch set). We also want to tell the service that we are looking to maximize this value. We also set the number of samples to 20, and the maximum number of concurrent jobs to 4, which is the same as the number of nodes in our compute cluster.

In [59]:
print(PrimaryMetricGoal.MAXIMIZE)

PrimaryMetricGoal.MAXIMIZE


In [60]:
hdc = HyperDriveConfig(estimator=est, 
                       hyperparameter_sampling=ps, 
                       policy=policy, 
                       primary_metric_name='Accuracy', 
                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                       max_total_runs=20,
                       max_concurrent_runs=4)

Finally, let's launch the hyperparameter tuning job.

In [61]:
hdr = exp.submit(config=hdc)

We can use a run history widget to show the progress. Be patient as this might take a while to complete.

In [62]:
from azureml.widgets import RunDetails

RunDetails(hdr).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [63]:
hdr.wait_for_completion(show_output=True)

RunId: keras-mnist_1561477492652
Web View: https://mlworkspace.azure.ai/portal/subscriptions/79ec9c01-599f-4707-82f9-31b2d938f2e5/resourceGroups/pedro-test/providers/Microsoft.MachineLearningServices/workspaces/azure-ml-experiments/experiments/keras-mnist/runs/keras-mnist_1561477492652

Streaming azureml-logs/hyperdrive.txt

"<START>[2019-06-25T15:44:53.111169][API][INFO]Experiment created<END>\n""<START>[2019-06-25T15:44:53.371738][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2019-06-25T15:44:53.466013][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2019-06-25T15:44:56.0296609Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>


W0625 17:08:11.550705 4565059008 connectionpool.py:665] Retrying (Retry(total=2, connect=3, read=2, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', OSError("(60, 'ETIMEDOUT')",))': /azureml/ExperimentRun/dcid.keras-mnist_1561477492652/azureml-logs/hyperdrive.txt?sv=2018-03-28&sr=b&sig=tGfF8YF7%2FaRCAMRudLqfr3v3zh3Rp6BxJyAwgzjxAgY%3D&st=2019-06-25T15%3A57%3A52Z&se=2019-06-26T00%3A07%3A52Z&sp=r



Execution Summary
RunId: keras-mnist_1561477492652
Web View: https://mlworkspace.azure.ai/portal/subscriptions/79ec9c01-599f-4707-82f9-31b2d938f2e5/resourceGroups/pedro-test/providers/Microsoft.MachineLearningServices/workspaces/azure-ml-experiments/experiments/keras-mnist/runs/keras-mnist_1561477492652



{'runId': 'keras-mnist_1561477492652',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2019-06-25T15:44:52.983649Z',
 'endTimeUtc': '2019-06-25T16:16:23.000Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'ContentSnapshotId': 'bcde896a-e0a8-4257-8aa0-1f3f89f4555e'},
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://azuremlestorage9451e43f9.blob.core.windows.net/azureml/ExperimentRun/dcid.keras-mnist_1561477492652/azureml-logs/hyperdrive.txt?sv=2018-03-28&sr=b&sig=siqFFY8gA4rz%2F4USI9sW%2FgoFEE1E0PG019%2FHUv8RZ70%3D&st=2019-06-25T16%3A06%3A24Z&se=2019-06-26T00%3A16%3A24Z&sp=r'}}

## Find and register best model
When all the jobs finish, we can find the run that has the highest accuracy.

In [64]:
best_run = hdr.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

['--data-folder', '$AZUREML_DATAREFERENCE_bee0b4637f1f47cda6b8c9f7be91c1b8', '--batch-size', '50', '--first-layer-neurons', '300', '--learning-rate', '0.00268786311136608', '--second-layer-neurons', '10']
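
The flat argument list is easier to read as a dictionary. A small sketch, using the list printed above and pairing each flag with the value that follows it:

```python
# Argument list copied from the output above.
arguments = ['--data-folder', '$AZUREML_DATAREFERENCE_bee0b4637f1f47cda6b8c9f7be91c1b8',
             '--batch-size', '50', '--first-layer-neurons', '300',
             '--learning-rate', '0.00268786311136608', '--second-layer-neurons', '10']

# Even positions hold flags, odd positions hold their values.
best_params = dict(zip(arguments[0::2], arguments[1::2]))
best_params.pop('--data-folder')  # not a hyperparameter

for flag, value in best_params.items():
    print(flag, value)
```

This relies on every flag being followed by exactly one value, which holds for the arguments used in this notebook.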


Now let's list the model files uploaded during the run.

In [65]:
print(best_run.get_file_names())

['Accuracy vs Loss_1561479261.png', 'azureml-logs/55_batchai_execution.txt', 'azureml-logs/55_batchai_stdout-job_post.txt', 'azureml-logs/55_batchai_stdout-job_prep.txt', 'azureml-logs/55_batchai_stdout.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/driver_log.txt', 'logs/azureml/138_azureml.log', 'logs/azureml/azureml.log', 'outputs/model/model.h5', 'outputs/model/model.json']


We can then register the folder (and all files in it) as a model named `keras-mlp-mnist` under the workspace for deployment.

In [66]:
model = best_run.register_model(model_name='keras-mlp-mnist', model_path='outputs/model')

## Deploy the model in ACI
Now we are ready to deploy the model as a web service running in Azure Container Instances ([ACI](https://azure.microsoft.com/en-us/services/container-instances/)). Azure Machine Learning accomplishes this by constructing a Docker image with the scoring logic and model baked in.
### Create score.py
First, we will create a scoring script that will be invoked by the web service call. 

* Note that the scoring script must define two functions, `init()` and `run(input_data)`. 
  * In the `init()` function, you typically load the model into a global object. This function is executed only once, when the Docker container is started. 
  * In the `run(input_data)` function, the model is used to predict a value based on the input data. The input and output of `run` typically use JSON for serialization and deserialization, but you are not limited to that.
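
To make this contract concrete, here is a minimal sketch of the same `init()`/`run()` shape with a stand-in model in place of Keras. `DummyModel` is hypothetical and exists only so the example runs without any ML framework; the `{"data": [...]}` payload layout is the one used by the scoring script in this section.

```python
import json

class DummyModel:
    """Hypothetical stand-in: 'predicts' the index of the largest value in each row."""
    def predict(self, rows):
        return [max(range(len(row)), key=row.__getitem__) for row in rows]

model = None

def init():
    # Called once when the container starts: load the model into a global
    global model
    model = DummyModel()

def run(raw_data):
    # Called for every request: decode JSON, predict, return a JSON-serializable value
    data = json.loads(raw_data)['data']
    return model.predict(data)

# Simulate the container lifecycle locally: one init(), then a request
init()
response = run(json.dumps({'data': [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]}))
print(response)
```

Exercising the two functions locally like this is a cheap way to catch serialization mistakes before building an image.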

In [67]:
%%writefile score.py
import json
import numpy as np
import os
from keras.models import model_from_json

from azureml.core.model import Model

def init():
    global model
    
    model_root = Model.get_model_path('keras-mlp-mnist')
    # load the model architecture from JSON
    with open(os.path.join(model_root, 'model.json'), 'r') as json_file:
        model_json = json_file.read()
    model = model_from_json(model_json)
    # load weights into the new model
    model.load_weights(os.path.join(model_root, 'model.h5'))
    # MNIST is a 10-class problem, so use categorical (not binary) cross-entropy
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    
def run(raw_data):
    # expects a JSON payload of the form {"data": [[...pixel values...], ...]}
    data = np.array(json.loads(raw_data)['data'])
    # make prediction: pick the class with the highest predicted probability
    y_hat = np.argmax(model.predict(data), axis=1)
    return y_hat.tolist()

Writing score.py


### Create myenv.yml
We also need to create an environment file so that Azure Machine Learning can install the packages your scoring script requires in the Docker image. In this case, we specify the conda packages `tensorflow` and `keras`; the cell below also adds `scikit-learn`.

In [68]:
from azureml.core.runconfig import CondaDependencies

cd = CondaDependencies.create()
cd.add_conda_package('tensorflow')
cd.add_conda_package('keras')
cd.add_conda_package('scikit-learn')
cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')

print(cd.serialize_to_string())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - azureml-defaults
- tensorflow
- keras
- scikit-learn
channels:
- conda-forge



### Deploy to ACI
We are almost ready to deploy. Create a deployment configuration and specify the number of CPUs and gigabytes of RAM needed for your ACI container. 

In [69]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               auth_enabled=True, # this flag generates API keys to secure access
                                               memory_gb=1, 
                                               tags={'name':'mnist', 'framework': 'Keras'},
                                               description='Keras MLP on MNIST')

#### Deployment Process
Now we can deploy. **This cell will run for about 7-8 minutes**. Behind the scenes, it will do the following:
1. **Build Docker image**  
Build a Docker image using the scoring file (`score.py`), the environment file (`myenv.yml`), and the `model` object. 
2. **Register image**    
Register that image under the workspace. 
3. **Ship to ACI**    
And finally ship the image to the ACI infrastructure, start up a container in ACI using that image, and expose an HTTP endpoint to accept REST client calls.

In [70]:
from azureml.core.image import ContainerImage

imgconfig = ContainerImage.image_configuration(execution_script="score.py", 
                                               runtime="python", 
                                               conda_file="myenv.yml")

In [71]:
%%time
from azureml.core.webservice import Webservice

service = Webservice.deploy_from_model(workspace=ws,
                                       name='keras-mnist-svc',
                                       deployment_config=aciconfig,
                                       models=[model],
                                       image_config=imgconfig)

service.wait_for_deployment(show_output=True)

Creating image
Running......................................................
Succeeded
Image creation operation finished for image keras-mnist-svc:1, operation "Succeeded"
Creating service
Running.
Failed

E0625 17:48:24.968159 4565059008 _azureml_exception.py:175] Service deployment polling reached non-successful terminal state, current service state: Transitioning
Error:
{
  "code": "RequestDisallowedByPolicy",
  "statusCode": 403,
  "message": "ACI Service request failed. Reason: Resource 'keras-mnist-svc' was disallowed by policy. Policy identifiers: '[{\"policyAssignment\":{\"name\":\"Mandatory\",\"id\":\"/providers/Microsoft.Management/managementGroups/f55b1f7d-7a7f-49e4-9b90-55218aad89f8/providers/Microsoft.Authorization/policyAssignments/9978978c349a4e128d18fe76\"},\"policyDefinition\":{\"name\":\"Allowed locations\",\"id\":\"/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c\"},\"policySetDefinition\":{\"name\":\"Mandatory\",\"id\":\"/providers/Microsoft.Management/managementgroups/f55b1f7d-7a7f-49e4-9b90-55218aad89f8/providers/Microsoft.Authorization/policySetDefinitions/Important\"}}]'.."
}


**Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:**

In [72]:
print(service.get_logs())

E0625 17:48:25.284075 4565059008 _azureml_exception.py:175] Received bad response from Model Management Service:
Response Code: 404
Headers: {'Date': 'Tue, 25 Jun 2019 16:48:25 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Request-Context': 'appId=cid-v1:6a27ce65-5555-41a3-85f7-b7a1ce31fd6b', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'x-ms-client-request-id': '5c9e90e8896c4fe5b8e167e4954fff7a', 'x-ms-client-session-id': '', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload', 'Content-Encoding': 'gzip'}
Content: b'{"code":"NotFound","statusCode":404,"message":"The specified resource was not found","details":[{"code":"ResourceNotFound","message":"The Resource \'Microsoft.ContainerInstance/containerGroups/keras-mnist-svc\' under resource group \'pedro-test\' was not found."}]}'




This is the scoring web service endpoint:

In [None]:
print(service.scoring_uri)

### Test the deployed model
Let's test the deployed model. Pick 30 random samples from the test set and send them to the web service hosted in ACI. Note that here we use the `run` API in the SDK to invoke the service. You can also make raw HTTP calls using any HTTP tool, such as curl.

After the invocation, we print the returned predictions and plot them along with the input images. Misclassified samples are highlighted with a red font and an inverted image (white on black). Because the model accuracy is fairly high, you might have to run the cell below a few times before you see a misclassified sample.

In [None]:
import json

# find 30 random samples from test set
n = 30
sample_indices = np.random.permutation(X_test.shape[0])[0:n]

test_samples = json.dumps({"data": X_test[sample_indices].tolist()})
test_samples = bytes(test_samples, encoding='utf8')

# predict using the deployed model
result = service.run(input_data=test_samples)

# compare actual value vs. the predicted values:
i = 0
plt.figure(figsize = (20, 1))

for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline('')
    plt.axvline('')
    
    # use different color for misclassified sample
    font_color = 'red' if y_test[s] != result[i] else 'black'
    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys
    
    # label each image with the prediction returned by the service
    plt.text(x=10, y=-10, s=result[i], fontsize=18, color=font_color)
    plt.imshow(X_test[s].reshape(28, 28), cmap=clr_map)
    
    i = i + 1
plt.show()
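
Besides the plot, the same comparison can be summarized numerically. A minimal sketch with made-up values: `labels` and `predictions` are illustrative placeholders standing in for `y_test[sample_indices]` and `result`, not real service output.

```python
# Placeholder data standing in for the true labels and the service predictions
labels = [7, 2, 1, 0, 4, 1, 4, 9]
predictions = [7, 2, 1, 0, 4, 1, 9, 9]

# Positions (within the sample) where the prediction disagrees with the label
misclassified = [i for i, (y, p) in enumerate(zip(labels, predictions)) if y != p]
accuracy = 1 - len(misclassified) / len(labels)

print("misclassified positions:", misclassified)
print("accuracy on this sample: {:.3f}".format(accuracy))
```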

We can retrieve the API keys used for accessing the HTTP endpoint.

In [None]:
# retrieve the API keys. two keys were generated.
key1, key2 = service.get_keys()
print(key1)

We can now construct a raw HTTP request and send it to the service. Don't forget to add the key to the HTTP header.

In [None]:
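import json
import requests

# send a random row from the test set to score
# (np.random.randint excludes the upper bound, so every index can be drawn)
random_index = np.random.randint(0, len(X_test))
# serialize with json.dumps rather than string concatenation so the payload is valid JSON
input_data = json.dumps({"data": [X_test[random_index].tolist()]})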

headers = {'Content-Type':'application/json', 'Authorization': 'Bearer ' + key1}

resp = requests.post(service.scoring_uri, input_data, headers=headers)

print("POST to url", service.scoring_uri)
#print("input data:", input_data)
print("label:", y_test[random_index])
print("prediction:", resp.text)
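
`resp.text` is a JSON string. Since `run()` in `score.py` returns `y_hat.tolist()`, the body should decode to a list with one predicted digit per input row. A minimal sketch, using a hard-coded string in place of a real response (the helper `parse_prediction` is hypothetical, and the value `'[7]'` is an illustrative placeholder, not actual service output):

```python
import json

def parse_prediction(response_text):
    """Decode the response body; expects a JSON list of predicted digits."""
    predictions = json.loads(response_text)
    if not isinstance(predictions, list):
        # Anything else (for example an error object) is surfaced to the caller
        raise ValueError("unexpected response: {!r}".format(predictions))
    return predictions

sample_response = '[7]'  # placeholder standing in for resp.text
predicted = parse_prediction(sample_response)
print("prediction:", predicted[0])
```

Checking the decoded type before indexing avoids a confusing error when the service returns an error payload instead of predictions.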

Let's look at the workspace after the web service was deployed. You should see 
* a registered model named 'keras-mlp-mnist' with an ID of the form 'keras-mlp-mnist:1' (name and version)
* an image called 'keras-mnist-svc' and with a docker image location pointing to your workspace's Azure Container Registry (ACR)  
* a webservice called 'keras-mnist-svc' with some scoring URL

In [None]:
models = ws.models
for name, model in models.items():
    print("Model: {}, ID: {}".format(name, model.id))
    
images = ws.images
for name, image in images.items():
    print("Image: {}, location: {}".format(name, image.image_location))
    
webservices = ws.webservices
for name, webservice in webservices.items():
    print("Webservice: {}, scoring URI: {}".format(name, webservice.scoring_uri))

## Clean up
You can delete the ACI deployment with a simple delete API call.

In [None]:
service.delete()