# 2 - Training a Model on Remote Compute

In this notebook, you will train a model on remote compute powered by Azure Machine Learning. While we will be working primarily in this Jupyter notebook, the actual training will be done on a separate compute cluster.

## Imports

We first need to import several modules from the Azure Machine Learning Python SDK:

In [None]:
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.dnn import TensorFlow
from azureml.widgets import RunDetails
from azureml.opendatasets import MNIST
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Create an Experiment

The first step toward training our model is to create an experiment. The documentation defines an experiment as follows:

> Azure Machine Learning experiment represent the collection of trials used to validate a user's hypothesis. [Experiment Class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py)

We first need a reference to our workspace. We can then create an experiment named `keras-mnist`.

In [None]:
# Get reference to the workspace
ws = Workspace.from_config()
print("Azure ML Workspace")
print(f'Name: {ws.name}')
print(f'Location: {ws.location}')
print(f'Resource Group: {ws.resource_group}')

# Create an experiment, or get a reference to the experiment if it already exists
experiment_name = 'keras-mnist'
exp = Experiment(workspace=ws, name=experiment_name)
print("Azure ML Experiment")
print(f'ID: {exp.id}')
print(f'Name: {exp.name}')
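
`Workspace.from_config()` reads the workspace details from a `config.json` file, which is typically downloaded from the Azure portal and placed next to the notebook or in an `.azureml` folder. As a rough sketch of what that file is expected to contain, the helper below writes a sample file and checks that the three expected keys are present. The name `check_workspace_config` and the placeholder values are illustrative and not part of the SDK:

```python
import json
import os
import tempfile

# Keys that a workspace config.json is expected to contain
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Return the list of required keys missing from the config file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values for illustration only; substitute your own workspace details
sample_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the sample to a temporary directory so no real config file is overwritten
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(sample_config, f, indent=4)

missing = check_workspace_config(config_path)
print("Missing keys:", missing)
```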

Next, we will create a directory to hold the script that trains our model. The contents of this directory will later be sent to our remote compute cluster:

In [None]:
import os
script_folder = os.path.join(os.getcwd(), "keras-mnist")
os.makedirs(script_folder, exist_ok=True)

Next, we need to write out the script that we will use to train our model. It will be saved as `train.py` in our `keras-mnist` directory.

Within an experiment in Azure Machine Learning, each execution of the training script is considered a `Run`. The documentation defines this as:

> A run represents a single trial of an experiment. Runs are used to monitor the asynchronous execution of a trial, log metrics and store output of the trial, and to analyze results and access artifacts generated by the trial. [Run Class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py)

We need to get the context of the current `Run` from within our training script. We can use it to log metrics from training so that they are associated with that specific `Run`.

In [None]:
%%writefile $script_folder/train.py

import argparse
import os
import numpy as np
from azureml.core import Run
from utils import load_data
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical


print(f'TensorFlow Version: {tf.__version__}')
print(f'Keras Version: {keras.__version__}')

# ARGUMENT HANDLING
# Read in the two arguments that are passed into the run
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--epochs', type=int, dest='epochs', default=10, help='number of epochs')
args = parser.parse_args()

epochs = args.epochs
data_folder = args.data_folder
print(f'Data folder: {data_folder}')
print(f'Epochs: {epochs}')

# LOAD DATA
# We will leverage a slightly modified version of the code that we used to load the data in the
# previous notebook
num_classes = 10

training_images = load_data(data_folder, "train-images-idx3-ubyte.gz", False) / 255.0
training_images = np.reshape(training_images, (-1, 28,28)).astype('float32')
test_images = load_data(data_folder, "t10k-images-idx3-ubyte.gz", False) / 255.0
test_images = np.reshape(test_images, (-1, 28,28)).astype('float32')

training_labels = load_data(data_folder, "train-labels-idx1-ubyte.gz", True).reshape(-1)
training_labels = to_categorical(training_labels, num_classes)
test_labels = load_data(data_folder, "t10k-labels-idx1-ubyte.gz", True).reshape(-1)
test_labels = to_categorical(test_labels, num_classes)

print(f'Training Images: {training_images.shape}')
print(f'Training Labels: {training_labels.shape}')
print(f'Test Images: {test_images.shape}')
print(f'Test Labels: {test_labels.shape}')

# RUN CONTEXT
# We need to get a reference to the current run context
run = Run.get_context()

# MODEL CREATION & EVALUATION
# We next need to create, compile, fit, and evaluate our model
model = keras.Sequential([
    Flatten(input_shape=training_images[0].shape),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['accuracy'])

model.fit(training_images, training_labels,
          batch_size=128,
          epochs=epochs,
          verbose=1,
          validation_data=(test_images, test_labels))

score = model.evaluate(test_images, test_labels, verbose=0)

# METRICS
# We need to associate our accuracy metric with the Run context
print('Test accuracy:', score[1])
run.log('accuracy', score[1])

# MODEL EXPORT
# Save the model under outputs/, which is uploaded with the run, so that we can register it in our workspace
os.makedirs('outputs/model', exist_ok=True)
model.save('outputs/model/mnist.h5')
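
To get a feel for the size of the network defined in `train.py`, the trainable parameters can be counted by hand. The sketch below assumes 28x28 input images, as in MNIST; the helper `dense_params` is illustrative. Calling `model.summary()` inside the script should report matching totals:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units


# Flatten turns each 28x28 image into 784 values and has no trainable parameters
flatten_outputs = 28 * 28

hidden_params = dense_params(flatten_outputs, 128)  # Dense(128, activation='relu')
output_params = dense_params(128, 10)               # Dense(10, activation='softmax')
total_params = hidden_params + output_params

print(f'Hidden layer parameters: {hidden_params}')    # 100480
print(f'Output layer parameters: {output_params}')    # 1290
print(f'Total trainable parameters: {total_params}')  # 101770
```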

Since our training script will also need to access our `load_data` function, we need to be sure to copy the `utils.py` file into this directory:

In [None]:
import shutil
shutil.copy('utils.py', script_folder)

## Working with Estimators

To submit our training script to our compute cluster, we need to create an `Estimator` instance.  The documentation defines an `Estimator` as:

> To facilitate deep learning model training, the Azure Machine Learning Python SDK provides an alternative higher-level abstraction, the estimator class, which allows users to easily construct run configurations. You can create and use a generic Estimator to submit training script using any learning framework you choose (such as scikit-learn) on any compute target you choose, whether it's your local machine, a single VM in Azure, or a GPU cluster in Azure. For PyTorch, TensorFlow and Chainer tasks, Azure Machine Learning also provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks. [Train models with Azure Machine Learning using estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-ml-models)

We will specifically use the `TensorFlow` estimator class ([see documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py)), since we are using Keras on top of TensorFlow.

To create our estimator, we will need the following:

1. Environment Configuration
1. Script Parameters
1. Script directory and entry point
1. Compute cluster and configuration

### Script Parameters

We need two script parameters: the data folder and the number of training epochs. We can include any parameters here, and then vary them across runs to gauge how different settings affect the model:

In [None]:
mnist_file_dataset = MNIST.get_file_dataset()

script_params = {
    # to mount files referenced by mnist dataset
    '--data-folder': mnist_file_dataset.as_named_input('mnist_opendataset').as_mount(),
    '--epochs': 10
}
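
On the compute cluster, the entries in `script_params` are handed to `train.py` as command-line arguments, which the script reads with `argparse`. The sketch below mimics that hand-off locally. The helper `to_argv` and the path `/tmp/mnist` are hypothetical stand-ins; the real value for `--data-folder` is the mount point that Azure Machine Learning supplies at run time:

```python
import argparse


def to_argv(params):
    """Flatten a {flag: value} dictionary into a command-line argument list."""
    argv = []
    for flag, value in params.items():
        argv.extend([flag, str(value)])
    return argv


# Same parser definition as in train.py
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--epochs', type=int, dest='epochs', default=10, help='number of epochs')

# '/tmp/mnist' is a placeholder for the mounted dataset path
local_params = {'--data-folder': '/tmp/mnist', '--epochs': 5}
args = parser.parse_args(to_argv(local_params))

print(f'Data folder: {args.data_folder}')
print(f'Epochs: {args.epochs}')
```

Because `--epochs` has a default of 10 in the parser, omitting it from the dictionary still produces a valid run.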

### Remaining Estimator Configuration

We pass the compute cluster, the script directory, and the entry point, along with the framework version and extra packages, when we create our estimator instance:

In [None]:
# Get a reference to the existing compute cluster in the workspace
compute_cluster = AmlCompute(workspace=ws, name='tdsp-cluster')

est = TensorFlow(source_directory=script_folder,
                 entry_script='train.py',
                 script_params=script_params,
                 compute_target=compute_cluster,
                 framework_version='2.0',
                 pip_packages=['azureml-dataprep[fuse,pandas]', 'keras==2.2.4'],
                 # Request a GPU-enabled environment for the run
                 use_gpu=True)
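
The cell above assumes that a cluster named `tdsp-cluster` already exists in the workspace. If it might not, a common pattern is to look the cluster up and create it only when the lookup fails. The following is a minimal sketch of that pattern, not necessarily how this cluster was provisioned. The helper name `get_or_create_cluster` and the defaults for `vm_size` and `max_nodes` are assumptions, so choose a VM size that is available in your subscription and region:

```python
def get_or_create_cluster(ws, name, vm_size='STANDARD_NC6', max_nodes=1):
    """Return the named compute target, provisioning it first if it does not exist."""
    # Imported inside the function so the helper can be defined
    # even in an environment where the SDK is not yet installed
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Look up an existing compute target by name
        cluster = ComputeTarget(workspace=ws, name=name)
        print(f'Found existing cluster: {name}')
    except ComputeTargetException:
        # Not found: provision a new cluster (vm_size and max_nodes are assumed defaults)
        print(f'Creating new cluster: {name}')
        config = AmlCompute.provisioning_configuration(vm_size=vm_size, max_nodes=max_nodes)
        cluster = ComputeTarget.create(ws, name, config)
        cluster.wait_for_completion(show_output=True)
    return cluster
```

The object returned by `get_or_create_cluster(ws, 'tdsp-cluster')` could then be passed to the estimator as `compute_target`.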

### Submitting our First Run

We can then submit the estimator to our experiment, which creates an instance of the `Run` class:

In [None]:
run = exp.submit(config=est)
run

Since this will take some time to complete, we can use the `RunDetails` widget to watch its progress within our Jupyter notebook:

In [None]:
RunDetails(run).show()

Once it completes, we can extract the metrics from this run:

In [None]:
print(run.get_metrics())
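
`run.get_metrics()` returns a dictionary keyed by metric name, so results from several runs (for example, runs submitted with different `--epochs` values) can be compared with ordinary Python. A minimal sketch follows; the helper `best_run` is hypothetical, and the metric values are made up purely for illustration and are not results from this notebook:

```python
def best_run(metrics_by_run, metric='accuracy'):
    """Return the (run_name, value) pair with the highest value for `metric`."""
    scored = {name: m[metric] for name, m in metrics_by_run.items() if metric in m}
    if not scored:
        raise ValueError(f'No run has logged the metric {metric!r}')
    name = max(scored, key=scored.get)
    return name, scored[name]


# Illustrative values only; in practice each dictionary would come from run.get_metrics()
metrics_by_run = {
    'run-a': {'accuracy': 0.90},
    'run-b': {'accuracy': 0.95},
    'run-c': {'accuracy': 0.93},
}

name, value = best_run(metrics_by_run)
print(f'Best run: {name} (accuracy={value})')
```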

## Registering the Model

Finally, we can register our model in our workspace so that it can be used elsewhere. In the next clip, we will deploy this model as a web service.

In [None]:
# Register the model saved by the run
from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='keras-mnist',
                           model_path='outputs/model/mnist.h5',
                           model_framework=Model.Framework.TENSORFLOW,
                           model_framework_version='2.0',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))