Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Part 1: Training a TensorFlow 2.0 Model on Azure Machine Learning Service

## Overview of Part 1
This notebook is Part 1 (Preparing Data and Model Training) of a four-part workshop that demonstrates an end-to-end workflow using TensorFlow 2.0 on Azure Machine Learning service. The workshop consists of the following parts:

- Part 1: [Preparing Data and Model Training](https://github.com/johnwu0604/azure-service-classifier)
- Part 2: [Inferencing and Deploying a Model](https://github.com/johnwu0604/azure-service-classifier)
- Part 3: [Setting Up a Pipeline Using MLOps](https://github.com/johnwu0604/azure-service-classifier)
- Part 4: [Explaining Your Model Interpretability](https://github.com/johnwu0604/azure-service-classifier)

**This notebook will cover the following topics:**

- StackOverflow question tagging problem
- Introduction to Azure Machine Learning service
- Preparing training data and uploading it to a central Blob storage
- Registering datastore and datasets to a workspace
- Creating a remote compute target and training a model on it
- Registering the trained model for future deployment

## Prerequisites
This notebook is designed to be run in an Azure ML Notebook VM. See the [readme](https://github.com/microsoft/bert-stack-overflow/blob/master/README.md) file for instructions on how to create a Notebook VM and open this notebook in it.

### Check Azure Machine Learning Python SDK version

This tutorial requires version 1.0.69 or higher. Let's check the version of the SDK:

In [None]:
import azureml.core

print("Azure Machine Learning Python SDK version:", azureml.core.VERSION)
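To turn the printed version into an explicit check, one option is to compare version tuples rather than strings. The sketch below is only an illustration: `parse_version` and `meets_minimum` are hypothetical helpers (not part of the SDK), and they assume a plain dotted numeric version such as `1.0.69` with no pre-release suffix.

```python
def parse_version(version):
    """Convert a dotted numeric version string such as '1.0.69' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, required="1.0.69"):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# In the notebook, pass azureml.core.VERSION as `installed`.
print(meets_minimum("1.0.72"))
print(meets_minimum("1.0.9"))
```

Comparing tuples of integers avoids the trap of comparing the raw strings, where `'1.0.9'` would sort after `'1.0.69'`.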

## Stackoverflow Question Tagging Problem 
In this workshop we will use a powerful language understanding model to automatically route Stackoverflow questions to the appropriate support team, using Azure services as the example.

One of the key tasks in ensuring the long-term success of any Azure service is actively responding to related posts in online forums such as Stackoverflow. To keep track of these posts, Microsoft relies on the associated tags to direct questions to the appropriate support team. While Stackoverflow has different tags for each Azure service (azure-web-app-service, azure-virtual-machine-service, etc.), people often use the generic **azure** tag. This makes it hard for specific teams to track down issues related to their product and, as a result, many questions are left unanswered.

**In order to solve this problem, we will build a model to classify posts on Stackoverflow with the appropriate Azure service tag.**

We will be using a BERT (Bidirectional Encoder Representations from Transformers) model, which was published by researchers at Google AI Research. Unlike prior language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of natural language processing (NLP) tasks without substantial architecture modifications.

For more information about BERT, please read this [paper](https://arxiv.org/pdf/1810.04805.pdf).

## What is Azure Machine Learning Service?
Azure Machine Learning service is a cloud service that you can use to develop and deploy machine learning models. Using Azure Machine Learning service, you can track your models as you build, train, deploy, and manage them, all at the broad scale that the cloud provides.
![](./images/aml-overview.png)


#### How can we use it for training machine learning models?
Training machine learning models, particularly deep neural networks, is often a time- and compute-intensive task. Once you've finished writing your training script and running on a small subset of data on your local machine, you will likely want to scale up your workload.

To facilitate training, the Azure Machine Learning Python SDK provides a high-level abstraction, the estimator class, which allows users to easily train their models in the Azure ecosystem. You can create and use an Estimator object to submit any training code you want to run on remote compute, whether it's a single-node run or distributed training across a GPU cluster.

## Process Data Using Databricks

TODO:
- Explain how we went from raw stackoverflow data to a processed form
- Include spark/pandas source code that is reproducible

## Connect To Workspace

The [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class)?view=azure-ml-py) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace holds all your experiments, compute targets, models, datastores, etc.

You can [click here](https://ml.azure.com) to access your workspace resources through a graphical user interface.

![](./images/aml-workspace.png)

In [None]:
from azureml.core import Workspace

workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name, 
      'Azure region: ' + workspace.location, 
      'Subscription id: ' + workspace.subscription_id, 
      'Resource group: ' + workspace.resource_group, sep = '\n')

## Create Compute Target

A [compute target](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py) is a designated compute resource/environment where you run your training script or host your service deployment. This location may be your local machine or a cloud-based compute resource. Compute targets can be reused across the workspace for different runs and experiments. 

For this tutorial, we will create an auto-scaling [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute?view=azure-ml-py) cluster, which is a managed-compute infrastructure that allows the user to easily create a single or multi-node compute. To create the cluster, we need to specify the following parameters:

- `vm_size`: This is the type of GPU machine that we want to use in our cluster. For this tutorial, we will use **Standard_NC12s_v3 (NVIDIA V100) GPU machines**.
- `idle_seconds_before_scaledown`: This is the number of seconds a node can sit idle before it is scaled down in our auto-scaling cluster. We will set this to **6000** seconds.
- `min_nodes`: This is the minimum number of nodes that the cluster will have. To avoid paying for compute while the nodes are not being used, we will set this to **0** nodes.
- `max_nodes`: This is the maximum number of nodes that the cluster will scale up to. We will set this to **2** nodes.

**Creation of the cluster takes approximately 5 minutes** 

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = 'v100cluster'
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC12s_v3', 
                                                       idle_seconds_before_scaledown=6000,
                                                       min_nodes=0, 
                                                       max_nodes=2)

compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)

To ensure our compute target was created successfully, we can check its status.

In [None]:
compute_target.get_status().serialize()

#### If the compute target has already been created, then you (and other users in your workspace) can directly run this cell.

In [None]:
compute_target = workspace.compute_targets['v100cluster']

## Register Datastore

A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) is used to store connection information to a central data storage. This allows you to access your storage without having to hard code this (potentially confidential) information into your scripts. 

In this tutorial, the data has been previously prepped and uploaded into a central [Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) container. We will register this container into our workspace as a datastore using a [shared access signature (SAS) token](https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview). 

In [None]:
from azureml.core import Datastore, Dataset

datastore_name = 'tfworld'
container_name = 'azureml-blobstore-7c6bdd88-21fa-453a-9c80-16998f02935f'
account_name = 'tfworld6818510241'
sas_token = '?sv=2019-02-02&ss=bfqt&srt=sco&sp=rl&se=2019-11-08T05:12:15Z&st=2019-10-23T20:12:15Z&spr=https&sig=eDqnc51TkqiIklpQfloT5vcU70pgzDuKb5PAGTvCdx4%3D'

datastore = Datastore.register_azure_blob_container(workspace=workspace, 
                                                    datastore_name=datastore_name, 
                                                    container_name=container_name,
                                                    account_name=account_name, 
                                                    sas_token=sas_token)
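A SAS token is a URL query string, so its fields can be inspected before registering the datastore. The sketch below is an illustration using only the Python standard library: `describe_sas_token` is a hypothetical helper, and `example_token` is a made-up token with the same field layout as the one above and a placeholder signature. It relies on `sp` holding the signed permissions and `se` holding the expiry time in UTC.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Hypothetical token with the same fields as the one above; the signature is a placeholder.
example_token = ('?sv=2019-02-02&ss=bfqt&srt=sco&sp=rl'
                 '&se=2019-11-08T05:12:15Z&st=2019-10-23T20:12:15Z'
                 '&spr=https&sig=PLACEHOLDER')


def describe_sas_token(token, now=None):
    """Return the permissions, expiry time, and expired flag encoded in a SAS token."""
    fields = parse_qs(token.lstrip('?'))
    expiry = datetime.strptime(fields['se'][0], '%Y-%m-%dT%H:%M:%SZ')
    expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return {
        'permissions': fields['sp'][0],  # 'rl' stands for read and list
        'expires': expiry,
        'expired': now > expiry,
    }


info = describe_sas_token(example_token)
print(info['permissions'], info['expires'].isoformat(), info['expired'])
```

If the token has expired, registering the datastore may still succeed, but reading data through it is likely to fail, so a check like this can save some debugging time.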

#### If the datastore has already been registered, then you (and other users in your workspace) can directly run this cell.

In [None]:
datastore = workspace.datastores['tfworld']

#### What if my data wasn't already hosted remotely?
All workspaces also come with a blob container which is registered as a default datastore. This allows you to easily upload your own data to a remote storage location. You can access this datastore and upload files as follows:
```
datastore = workspace.get_default_datastore()
datastore.upload(src_dir='<LOCAL-PATH>', target_path='<REMOTE-PATH>')
```


## Register Dataset

A [Dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a reference to data in a datastore. We can register specific folders in our datastore that contain our data files as a Dataset. This gives you direct access to your data directory within a datastore.

There is a folder within our datastore called **azure-service-classifier/data** that contains all our training and testing data. We will register this as a dataset.

In [None]:
azure_dataset = Dataset.File.from_files(path=(datastore, 'azure-service-classifier/data'))

azure_dataset = azure_dataset.register(workspace=workspace,
                                       name='Azure Services Dataset',
                                       description='Dataset containing azure related posts on Stackoverflow')

#### If the dataset has already been registered, then you (and other users in your workspace) can directly run this cell.

In [None]:
azure_dataset = workspace.datasets['Azure Services Dataset']

## Prepare Source Code

It is good practice to keep your training scripts separate from your notebook. We have prepared a script (stored in the same folder) in advance that trains a BERT model on the dataset using TensorFlow 2.0 and the open source [huggingface/transformers](https://github.com/huggingface/transformers) library. Let's start by taking a look at the *train.py* script.

In [None]:
%pycat train.py

## Test Locally

Let's try running the script locally to make sure it works before scaling up to use our compute cluster. To do so, you will need to install the transformers library.

In [None]:
%pip install transformers==2.0.0

We have taken a small partition of our dataset and included it in this repository. Let's take a quick look at the format of the data.

In [None]:
data_dir = './data'

In [None]:
import os 
import pandas as pd
data = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)
data.head(5)

Now that we know what the data looks like, let's test out our script!

In [None]:
%run train.py --data_dir $data_dir --max_seq_length 128 --batch_size 16 --learning_rate 3e-5 --steps_per_epoch 5 --num_epochs 1 --export_dir ../outputs/model

## Perform Experiment

Now that we have our compute target, dataset, and training script working locally, it is time to scale up so that the script can run faster. We will start by creating an [experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py). An experiment is a grouping of many runs from a specified script. All runs in this tutorial will be performed under the same experiment. 

In [None]:
from azureml.core import Experiment

experiment_name = 'azure-service-classifier' 
experiment = Experiment(workspace, name=experiment_name)

#### Create TensorFlow Estimator

The Azure Machine Learning Python SDK Estimator classes allow you to easily construct run configurations for your experiments. They allow you to define parameters such as the training script to run, the compute target to run it on, framework versions, additional package requirements, etc. 

You can also use a generic [Estimator](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) to submit training scripts that use any learning framework you choose.

For popular libraries like PyTorch and TensorFlow, you can use their framework-specific estimators. We will use the [TensorFlow Estimator](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py) for our experiment.

In [None]:
from azureml.train.dnn import TensorFlow

estimator1 = TensorFlow(source_directory='.',
                        entry_script='train.py',
                        compute_target=compute_target,
                        script_params = {
                              '--data_dir': azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length': 128,
                              '--batch_size': 32,
                              '--learning_rate': 3e-5,
                              '--steps_per_epoch': 150,
                              '--num_epochs': 3,
                              '--export_dir':'./outputs/model'
                        },
                        framework_version='2.0',
                        use_gpu=True,
                        pip_packages=['transformers==2.0.0', 'azureml-dataprep[fuse,pandas]==1.1.22'])

A quick description for each of the parameters we have just defined:

- `source_directory`: This specifies the root directory of our source code. 
- `entry_script`: This specifies the training script to run. It should be relative to the source_directory.
- `compute_target`: This specifies the compute target to run the job on. We will use the one created earlier.
- `script_params`: This specifies the input parameters to the training script. Please note:

    1) *azure_dataset.as_named_input('azureservicedata').as_mount()* mounts the dataset to the remote compute and provides the path to the dataset on our datastore. 
    
    2) All outputs from the training script must be written to the './outputs' directory, as this is the only directory that will be saved to the run. 
    
    
- `framework_version`: This specifies the version of TensorFlow to use. Use `TensorFlow.get_supported_versions()` to see all supported versions.
- `use_gpu`: This will use the GPU on the compute target for training if set to True.
- `pip_packages`: This allows you to define any additional libraries to install before training.

#### 1) Submit First Run 

We can now train our model by submitting the estimator object as a [run](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py).

In [None]:
run1 = experiment.submit(estimator1)

We can view the current status of the run and stream the logs from within the notebook.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run1).show()

You can cancel a run at any time, which will stop the run and scale down the nodes in the compute target.

In [None]:
run1.cancel()

#### 2) Add Metrics Logging

So far, we were able to take a TensorFlow 2.0 project and run it without any changes. For larger-scale projects, however, we would want to log some metrics so that the performance of our model is easier to monitor. 

We can do this by adding a few lines of code into our training script:

```python
# 1) Import SDK Run object
from azureml.core.run import Run

# 2) Get current service context
run = Run.get_context()

# 3) Log the metrics that we want
run.log('val_accuracy', float(logs.get('val_accuracy')))
run.log('accuracy', float(logs.get('accuracy')))
```
We've created a *train_logging.py* script that includes logging metrics as shown above. Let's see what the updated script looks like.

In [None]:
%pycat train_logging.py
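The `logs` dictionary in the snippet above is the one Keras passes to callback hooks such as `on_epoch_end`. The sketch below shows one way that wiring could look; it is not necessarily how *train_logging.py* is written. To keep it runnable without TensorFlow or the Azure ML SDK, `MetricsLogger` is a plain class and `FakeRun` is a hypothetical stand-in for the object returned by `Run.get_context()`. In a real training script the logger would subclass `tf.keras.callbacks.Callback` and be passed to `model.fit(..., callbacks=[...])`.

```python
class FakeRun:
    """Hypothetical stand-in for azureml.core.run.Run; records logged values in memory."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        self.metrics.setdefault(name, []).append(value)


class MetricsLogger:
    """Forwards per-epoch metrics to a run object (same hook name as a Keras callback)."""

    def __init__(self, run, names=('accuracy', 'val_accuracy')):
        self.run = run
        self.names = names

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for name in self.names:
            if name in logs:  # skip metrics that were not computed this epoch
                self.run.log(name, float(logs[name]))


run = FakeRun()
logger = MetricsLogger(run)
# Made-up metric values, standing in for what Keras would report after each epoch.
logger.on_epoch_end(0, {'accuracy': 0.71, 'val_accuracy': 0.68, 'loss': 0.9})
logger.on_epoch_end(1, {'accuracy': 0.83, 'val_accuracy': 0.79, 'loss': 0.5})
print(run.metrics)
```

Logging the same metric name once per epoch is what lets the run details view plot it as a curve over time.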

We can submit this run in the same way that we did before. 

*Since our cluster can scale automatically to two nodes, we can run this job simultaneously with the previous one.*

In [None]:
estimator2 = TensorFlow(source_directory='.',
                        entry_script='train_logging.py',
                        compute_target=compute_target, 
                        script_params = {
                              '--data_dir': azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length': 128,
                              '--batch_size': 32,
                              '--learning_rate': 3e-5,
                              '--steps_per_epoch': 150,
                              '--num_epochs': 3,
                              '--export_dir':'./outputs/model'
                        },
                        framework_version='2.0',
                        use_gpu=True,
                        pip_packages=['transformers==2.0.0', 'azureml-dataprep[fuse,pandas]==1.1.22'])

run2 = experiment.submit(estimator2)

Now if we view the details of the run, we will see the logged metrics plotted as graphs.

In [None]:
RunDetails(run2).show()

While we wait for our two runs to complete, let's go over how a Run is executed in Azure Machine Learning.

![](./images/aml-run.png)

#### 3) Distributed Training Across Multiple GPUs

Distributed training allows us to train across multiple nodes if your cluster allows it. Azure Machine Learning service helps manage the infrastructure for training distributed jobs. All we have to do is add the following parameters to our estimator object in order to enable this:

- `node_count`: The number of nodes to run this job across. We will set this to 1.
- `process_count_per_node`: The number of processes to run on each node. We will set this to 2, one for each of the two V100 GPUs on a Standard_NC12s_v3 node.
- `distributed_training`: The backend to use for our distributed job. We will be using an MPI (Message Passing Interface) backend which is a standardized design for message passing.
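The number of training processes is the product of `node_count` and `process_count_per_node`. The short sketch below makes that arithmetic explicit. `total_workers` and `global_batch_size` are illustrative helpers, not SDK functions, and the batch-size calculation assumes each process consumes its own batch of `--batch_size` examples per step, which is the usual data-parallel arrangement.

```python
def total_workers(node_count, process_count_per_node):
    """Number of training processes launched for the job."""
    return node_count * process_count_per_node


def global_batch_size(per_process_batch_size, node_count, process_count_per_node):
    """Examples consumed per training step across all processes (data-parallel assumption)."""
    return per_process_batch_size * total_workers(node_count, process_count_per_node)


# One node with two processes and two nodes with one process each both give two workers.
print(total_workers(1, 2), total_workers(2, 1))
print(global_batch_size(32, 1, 2))
```

Because the effective batch size grows with the number of workers, hyperparameters such as the learning rate may need to be revisited when moving from a single process to a distributed run.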

We use [horovod](https://github.com/horovod/horovod), which is a library that allows us to easily modify our existing training script to run across multiple nodes. The distributed training script is saved as *train_horovod.py*. Let's see what the updated script looks like:

In [None]:
%pycat train_horovod.py

We can submit this run in the same way that we did with the others, but with the additional parameters.

In [None]:
from azureml.train.dnn import Mpi

estimator3 = TensorFlow(source_directory='./',
                        entry_script='train_horovod.py',
                        compute_target=compute_target,
                        script_params = {
                              '--data_dir': azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length': 128,
                              '--batch_size': 32,
                              '--learning_rate': 3e-5,
                              '--steps_per_epoch': 150,
                              '--num_epochs': 3,
                              '--export_dir':'./outputs/model'
                        },
                        framework_version='2.0',
                        node_count=1,
                        distributed_training=Mpi(process_count_per_node=2),
                        use_gpu=True,
                        pip_packages=['transformers==2.0.0', 'azureml-dataprep[fuse,pandas]==1.1.22'])

run3 = experiment.submit(estimator3)

Once again, we can view the current details of the run. 

In [None]:
RunDetails(run3).show()

In [None]:
from azureml.tensorboard import Tensorboard

# The Tensorboard constructor takes an array of runs, so be sure and pass it in as a single-element array here
tb = Tensorboard([run3])

# If successful, start() returns a string with the URI of the instance.
tb.start()

In [None]:
tb.stop()

#### 4) Tune Hyperparameters Using Hyperdrive

So far we have been putting in default hyperparameter values, but in practice we would need to tune these values to optimize the performance. Azure Machine Learning service provides many methods for tuning hyperparameters using different strategies.

The first step is to choose the parameter space that we want to search. We have a few choices to make here:

- **Parameter Sampling Method**: This is how we select the combinations of parameters to sample. Azure Machine Learning service offers [RandomParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.randomparametersampling?view=azure-ml-py), [GridParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.gridparametersampling?view=azure-ml-py), and [BayesianParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.bayesianparametersampling?view=azure-ml-py). We will use the `GridParameterSampling` method.
- **Parameters To Search**: We will be searching for optimal combinations of `learning_rate` and `num_epochs`.
- **Parameter Expressions**: This defines the [functions that can be used to describe a hyperparameter search space](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.parameter_expressions?view=azure-ml-py), which can be discrete or continuous. We will be using a `discrete set of choices`.

The following code allows us to define these options.

In [None]:
from azureml.train.hyperdrive import GridParameterSampling
from azureml.train.hyperdrive.parameter_expressions import choice


param_sampling = GridParameterSampling( {
        '--learning_rate': choice(3e-5, 3e-4),
        '--num_epochs': choice(3, 4)
    }
)
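Grid sampling tries every combination of the listed values. To see which combinations that implies for the search space above, here is a plain-Python illustration with `itertools.product`. It does not use the HyperDrive API; `search_space` is just a dictionary mirroring the `choice(...)` values.

```python
from itertools import product

# Mirrors the choice(...) values passed to GridParameterSampling above.
search_space = {
    '--learning_rate': [3e-5, 3e-4],
    '--num_epochs': [3, 4],
}

names = list(search_space)
# Cartesian product of all value lists, one dict per combination.
combinations = [dict(zip(names, values)) for values in product(*search_space.values())]

for combo in combinations:
    print(combo)
print(len(combinations))
```

With two values for each of two parameters there are four combinations in total, so the number of child runs is bounded by the size of the grid rather than by a larger `max_total_runs` setting.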

The next step is to define how we want to measure our performance. We do so by specifying two classes:

- **[PrimaryMetricGoal](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.primarymetricgoal?view=azure-ml-py)**: We want to `MAXIMIZE` the `val_accuracy` that is logged in our training script.
- **[BanditPolicy](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?view=azure-ml-py)**: A policy for early termination so that jobs which don't show promising results will stop automatically.

In [None]:
from azureml.train.hyperdrive import BanditPolicy
from azureml.train.hyperdrive import PrimaryMetricGoal

primary_metric_name='val_accuracy'
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE

early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=2)
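To build some intuition for `slack_factor`, the sketch below restates the comparison the bandit policy makes when the primary metric is being maximized: a run is cancelled when its metric, inflated by the slack factor, is still below the best run's metric. `should_terminate` is a hypothetical helper written for this explanation, not an SDK function, and the metric values are made up. It ignores `evaluation_interval` and `delay_evaluation`, which control when the comparison is made.

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Bandit rule for a maximized metric: cancel when run_metric * (1 + slack) < best_metric."""
    return run_metric * (1 + slack_factor) < best_metric


best_val_accuracy = 0.90  # made-up value for the best run so far

# 0.80 * 1.1 = 0.88, which is below 0.90, so this run would be cancelled.
print(should_terminate(0.80, best_val_accuracy))

# 0.85 * 1.1 = 0.935, which is above 0.90, so this run keeps going.
print(should_terminate(0.85, best_val_accuracy))
```

A larger `slack_factor` makes the policy more tolerant of runs that lag behind the leader, at the cost of spending more compute on them.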

We define an estimator as usual, but this time without the script parameters that we are planning to search.

In [None]:
estimator4 = TensorFlow(source_directory='./',
                        entry_script='train_logging.py',
                        compute_target=compute_target,
                        script_params = {
                              '--data_dir': azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length': 128,
                              '--batch_size': 32,
                              '--steps_per_epoch': 150,
                              '--export_dir':'./outputs/model',
                        },
                        framework_version='2.0',
                        use_gpu=True,
                        pip_packages=['transformers==2.0.0', 'azureml-dataprep[fuse,pandas]==1.1.22'])

Finally, we add all our parameters in a [HyperDriveConfig](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py) class and submit it as a run. 

In [None]:
from azureml.train.hyperdrive import HyperDriveConfig

hyperdrive_run_config = HyperDriveConfig(estimator=estimator4,
                                         hyperparameter_sampling=param_sampling, 
                                         policy=early_termination_policy,
                                         primary_metric_name=primary_metric_name, 
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=10,
                                         max_concurrent_runs=2)

run4 = experiment.submit(hyperdrive_run_config)

When we view the details of our run this time, we will see information and metrics for every run in our hyperparameter tuning.

In [None]:
RunDetails(run4).show()

We can retrieve the best run based on our defined metric.

In [None]:
best_run = run4.get_best_run_by_primary_metric()

## Register Model

A registered [model](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py) is a reference to the directory or files that make up your model. After registering a model, you and other people in your workspace can easily access and deploy your model without having to run the training script again. 

We need to define the following parameters to register a model:

- `model_name`: The name for your model. If the model name already exists in the workspace, it will create a new version for the model.
- `model_path`: The path to where the model is stored. In our case, this was the *export_dir* defined in our estimators.
- `description`: A description for the model.

Let's register the best run from our hyperparameter tuning.

In [None]:
model = best_run.register_model(model_name='azure-service-classifier', model_path='./outputs/model', description='BERT model for classifying azure services on stackoverflow posts.')

In [None]:
model_dir = model.download(target_dir='.', exist_ok=True)

In [None]:
from model import TFBertForMultiClassification
from transformers import BertTokenizer
import tensorflow as tf
def encode_example(text, max_seq_length):
    # Encode inputs using tokenizer
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_seq_length
        )
    input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
    # The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
    attention_mask = [1] * len(input_ids)
    # Zero-pad up to the sequence length.
    padding_length = max_seq_length - len(input_ids)
    input_ids = input_ids + ([0] * padding_length)
    attention_mask = attention_mask + ([0] * padding_length)
    token_type_ids = token_type_ids + ([0] * padding_length)
    
    return input_ids, attention_mask, token_type_ids
    
labels = ['azure-web-app-service', 'azure-storage', 'azure-devops', 'azure-virtual-machine', 'azure-functions']
# Load model and tokenizer
loaded_model = TFBertForMultiClassification.from_pretrained(model_dir, num_labels=len(labels))
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print("Model loaded from disk.")

In [None]:
# Encode example
# question = "I'm having trouble with my blob container"  # alternative example; uncomment to try it
question = "How to trigger git pipeline automatically"
input_ids, attention_mask, token_type_ids = encode_example(question, 128)
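The padding logic in `encode_example` can be checked on its own, without loading a tokenizer. The sketch below isolates that step in a hypothetical helper, `pad_encoding`, and feeds it a short list of made-up token ids so the effect on each of the three outputs is easy to see.

```python
def pad_encoding(input_ids, token_type_ids, max_seq_length, pad_id=0):
    """Zero-pad token ids and segment ids, and build the matching attention mask."""
    # 1 marks a real token, 0 marks padding; only real tokens are attended to.
    attention_mask = [1] * len(input_ids)
    padding_length = max_seq_length - len(input_ids)
    return (
        input_ids + [pad_id] * padding_length,
        attention_mask + [0] * padding_length,
        token_type_ids + [0] * padding_length,
    )


# Made-up token ids standing in for the output of tokenizer.encode_plus.
ids, mask, types = pad_encoding([101, 2129, 2000, 102], [0, 0, 0, 0], max_seq_length=8)
print(ids)
print(mask)
print(types)
```

All three lists end up with length `max_seq_length`, which is what allows them to be stacked into fixed-shape tensors for prediction.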

In [None]:
# Make prediction
predictions = loaded_model.predict({
    'input_ids': tf.convert_to_tensor([input_ids], dtype=tf.int32),
    'attention_mask': tf.convert_to_tensor([attention_mask], dtype=tf.int32),
    'token_type_ids': tf.convert_to_tensor([token_type_ids], dtype=tf.int32)
})
prediction = labels[predictions[0].argmax().item()]
probability = predictions[0].max()
result = {
    'prediction': str(labels[predictions[0].argmax().item()]),
    'probability': str(predictions[0].max())
}
print('Prediction: {}'.format(prediction))
print('Probability: {}'.format(probability))
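Whether `predictions[0]` already holds probabilities depends on how `TFBertForMultiClassification` is defined, which is not shown in this notebook. If the model returns raw logits, a softmax turns them into values that sum to 1. The sketch below uses only the standard library and a made-up list of logits; `softmax` is a helper written for this illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['azure-web-app-service', 'azure-storage', 'azure-devops',
          'azure-virtual-machine', 'azure-functions']
example_logits = [0.2, -1.3, 3.1, 0.4, -0.7]  # made-up scores, one per label

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

The predicted label is the same with or without the softmax, since it preserves the ordering of the scores; only the reported confidence value changes.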

In the [next tutorial](), we will perform inferencing on this model and deploy it to a web service.