Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Part 1: Training a TensorFlow 2.0 Model on Azure Machine Learning Service

## Overview of Part 1
This notebook is Part 1 (Preparing Data and Model Training) of a four-part workshop that demonstrates an end-to-end workflow using TensorFlow 2.0 on Azure Machine Learning service. The workshop consists of the following parts:

- Part 1: [Preparing Data and Model Training](https://github.com/microsoft/bert-stack-overflow/blob/master/1-Training/AzureServiceClassifier_Training.ipynb)
- Part 2: [Inferencing and Deploying a Model](https://github.com/microsoft/bert-stack-overflow/blob/master/2-Inferencing/AzureServiceClassifier_Inferencing.ipynb)
- Part 3: [Setting Up a Pipeline Using MLOps](https://github.com/microsoft/bert-stack-overflow/tree/master/3-ML-Ops)
- Part 4: [Explaining Your Model Interpretability](https://github.com/microsoft/bert-stack-overflow/blob/master/4-Interpretibility/IBMEmployeeAttritionClassifier_Interpretability.ipynb)

**This notebook will cover the following topics:**

- Stackoverflow question tagging problem
- Introduction to Transformer and BERT deep learning models
- Introduction to Azure Machine Learning service
- Preparing raw data for training using Apache Spark
- Registering cleaned up training data as a Dataset
- Debugging the model in TensorFlow 2.0 Eager Mode
- Training the model on a GPU cluster
- Monitoring training progress with the built-in TensorBoard dashboard
- Automated search for the best hyperparameters of the model
- Registering the trained model for future deployment

## Prerequisites
This notebook is designed to be run in an Azure ML Notebook VM. See the [readme](https://github.com/microsoft/bert-stack-overflow/blob/master/README.md) file for instructions on how to create a Notebook VM and open this notebook in it.

### Check Azure Machine Learning Python SDK version

This tutorial requires version 1.27.0 or higher of the Azure ML Python SDK. Let's check the version of the SDK:

In [None]:
import azureml.core

print("Azure Machine Learning Python SDK version:", azureml.core.VERSION)
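Printing the version leaves the comparison against the minimum to the reader. A minimal sketch of an explicit check follows. `meets_minimum` is a hypothetical helper (it is not part of the SDK), and it assumes the version string begins with dot-separated integers such as `1.27.0`.

```python
import re

def meets_minimum(version, minimum=(1, 27, 0)):
    """Return True if a dotted version string is at least `minimum`.

    Hypothetical helper, not part of the Azure ML SDK. Only the leading
    numeric components are compared; suffixes such as "post1" are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions ("1.27" -> (1, 27, 0)) so tuples compare cleanly.
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)

# Comparing integer tuples avoids the string-comparison trap
# where "1.9.0" sorts after "1.27.0".
print(meets_minimum("1.27.0"))
print(meets_minimum("1.9.0"))
```

In the notebook this could be called as `meets_minimum(azureml.core.VERSION)`.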

## Stackoverflow Question Tagging Problem 
In this workshop we will use a powerful language understanding model to automatically route Stackoverflow questions about Azure services to the appropriate Azure support team.

One of the key tasks in ensuring the long-term success of any Azure service is actively responding to related posts in online forums such as Stackoverflow. In order to keep track of these posts, Microsoft relies on the associated tags to direct questions to the appropriate support team. While Stackoverflow has different tags for each Azure service (azure-web-app-service, azure-virtual-machine-service, etc.), people often use the generic **azure** tag. Questions without appropriate tags are less likely to be found by the teams that can help, and so many untagged and generically tagged questions go unanswered.

**In order to solve this problem, we will build a model to classify posts on Stackoverflow with the appropriate Azure service tag.**

We will be using a BERT (Bidirectional Encoder Representations from Transformers) model, which was published by researchers at Google AI Research. Unlike prior language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of natural language processing (NLP) tasks without substantial architecture modifications.

## Why use BERT model?
The [introduction of the BERT model](https://arxiv.org/pdf/1810.04805.pdf) changed the world of NLP. Many NLP problems that previously relied on specialized models to achieve state-of-the-art performance are now solved better with BERT, using a more generic approach.

If we look at the leaderboards for popular NLP benchmarks such as GLUE and SQuAD, most of the top models are based on BERT:
* [GLUE Benchmark Leaderboard](https://gluebenchmark.com/leaderboard/)
* [SQuAD Benchmark Leaderboard](https://rajpurkar.github.io/SQuAD-explorer/)

Recently, the Allen Institute for AI announced a new language understanding system called [Aristo](https://allenai.org/aristo/). The system had been in development for 20 years, but its performance on an 8th grade science test was stuck at 60%. The result jumped to 90% once researchers adopted BERT as the core language understanding component. With BERT, Aristo now passes the test with an A grade.

## Quick Overview of How BERT model works

The foundation of the BERT model is the Transformer architecture, which was introduced in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Before the introduction of Transformers, the dominant deep learning architecture for language models was the Recurrent Neural Network (RNN). For context, let's briefly review RNNs.

## RNNs

In many languages, what you say next can be partially predicted based on what you've just said. RNNs were designed to capture what's already been said or written, and perform sophisticated inference on what text will follow based on that.

<img src="https://miro.medium.com/max/400/1*L38xfe59H5tAgvuIjKoWPg.png" alt="Drawing" style="width: 100px;"/>

_Taken from [1](https://towardsdatascience.com/transformers-141e32e69591)_

Applied to language translation tasks, the processing dynamics looked like this.

![](https://miro.medium.com/max/1200/1*8GcdjBU5TAP36itWBcZ6iA.gif)
_Taken from [2](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)_
    
But RNNs suffered from two disadvantages:
1. Sequential computation put a limit on parallelization, which limited the effectiveness of larger models.
2. Long term relationships between words were harder to detect.

## Transformers

Transformers were designed to address these two limitations of RNNs.

<img src="https://miro.medium.com/max/2436/1*V2435M1u0tiSOz4nRBfl4g.png" alt="Drawing" style="width: 500px;"/>

_Taken from [3](http://jalammar.github.io/illustrated-transformer/)_

In each Encoder layer, Transformers perform a Self-Attention operation which detects relationships between all word embeddings in one matrix multiplication operation. Notably, the relationships are not constrained to just those words that occurred before a particular word. 

<img src="https://miro.medium.com/max/2176/1*fL8arkEFVKA3_A7VBgapKA.gif" alt="Drawing" style="width: 500px;"/>

_Taken from [4](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1)_
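To make the Self-Attention operation concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions: the toy embeddings are made-up numbers, and they are used directly as queries, keys, and values, whereas a real Transformer first applies learned projection matrices and computes everything as batched matrix multiplications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors (one per token). Returns the
    attention weights and the attended output vectors.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        # Score this token against every token, both before and after it.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights.append(softmax(scores))
    width = len(values[0])
    outputs = [[sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
               for row in weights]
    return weights, outputs

# Toy 3-token sequence with 2-dimensional embeddings (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, including those that come after it.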


## BERT Model

BERT is a very large network with multiple layers of Transformers (12 for BERT-base, and 24 for BERT-large). The model is first pre-trained on a large collection, or corpus, of text data (Wikipedia + books) using self-supervised training, predicting masked words in a sentence. During pre-training the model absorbs a significant level of language understanding and produces a more generally useful representation of the language.

<img src="http://jalammar.github.io/images/bert-output-vector.png" alt="Drawing" style="width: 700px;"/>

_Taken from [5](http://jalammar.github.io/illustrated-bert/)_
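The masked-word objective can be illustrated with a short sketch that hides a random subset of tokens and keeps the originals as training labels. This is a simplification: `mask_tokens` is a hypothetical helper, the example sentence is made up, and the procedure in the BERT paper additionally replaces some selected tokens with random words or leaves them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_input, labels).

    labels[i] holds the original token at masked positions and None
    elsewhere, so a loss would be computed only on the hidden words.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

sentence = "how do i deploy a web app to azure app service".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the labels come from the text itself, no human annotation is needed, which is what makes pre-training on a very large corpus practical.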

The pre-trained BERT model can then be fine-tuned to solve specific language tasks, like answering questions, or categorizing spam emails.

<img src="http://jalammar.github.io/images/bert-classifier.png" alt="Drawing" style="width: 700px;"/>

_Taken from [5](http://jalammar.github.io/illustrated-bert/)_
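As a rough sketch of what fine-tuning adds on top of the pre-trained network, the classifier can be thought of as one dense layer followed by a softmax over the candidate tags. The vector, weights, and biases below are made-up numbers chosen for illustration; in practice the sentence vector comes from BERT and the weights are learned during fine-tuning.

```python
import math

def classify(pooled, weights, biases, labels):
    """Apply one dense layer plus softmax to a pooled sentence vector.

    pooled:  list of floats (stand-in for BERT's sentence representation)
    weights: one weight vector per label
    biases:  one bias per label
    """
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Made-up 3-dimensional "sentence vector" and hand-picked weights.
labels = ["azure-web-app-service", "azure-virtual-machine-service"]
pooled = [0.9, -0.2, 0.1]
weights = [[1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]
biases = [0.0, 0.0]
tag, probs = classify(pooled, weights, biases, labels)
print(tag, [round(p, 3) for p in probs])
```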

The end-to-end training process for a model that tags Stackoverflow questions looks like this:

![](images/model-training-e2e.png)


## What is Azure Machine Learning Service?
Azure Machine Learning service is a cloud service that you can use to develop and deploy machine learning models. Using Azure Machine Learning service, you can track your models as you build, train, deploy, and manage them, all at the broad scale that the cloud provides.
![](./images/aml-overview.png)


#### How can we use it for training machine learning models?
Training machine learning models, particularly deep neural networks, is often a time- and compute-intensive task. Once you've finished writing your training script and running on a small subset of data on your local machine, you will likely want to scale up your workload.

To facilitate large-scale training, the Azure Machine Learning Python SDK provides high-level abstractions like the ScriptRunConfig that enable users to easily train their models in the Azure ecosystem. You will learn to submit any training code you want to run on remote compute, whether it's a single-node run or distributed training across a GPU cluster.

## Connect To Workspace

The [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class)?view=azure-ml-py) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace holds all your experiments, compute targets, models, datastores, etc.

You can [open ml.azure.com](https://ml.azure.com) to access your workspace resources through a graphical user interface of **Azure Machine Learning studio**.

![](./images/aml-workspace.png)

**You will be asked to log in during the next step. Use your Microsoft AAD credentials.**

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

## Register Datastore

A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) is used to store connection information to a central data storage. This allows you to access your storage without having to hard code this (potentially confidential) information into your scripts. 

In this tutorial, the data has already been prepared and uploaded into a central [Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) container. We will register this container into our workspace as a datastore using a [shared access signature (SAS) token](https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview). 


Here are the files in the container:
![](https://github.com/xlegend1024/bert-stack-overflow/raw/master/images/datastore_folder_files.png)

In [None]:
from azureml.core import Datastore, Dataset

datastore_name = 'mtcseattle'
container_name = 'azure-service-classifier'
account_name = 'mtcseattle'
sas_token = '?sv=2020-04-08&st=2021-05-26T04%3A39%3A46Z&se=2022-05-27T04%3A39%3A00Z&sr=c&sp=rl&sig=CTFMEu24bo2X06G%2B%2F2aKiiPZBzvlWHELe15rNFqULUk%3D'

datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                    datastore_name=datastore_name, 
                                                    container_name=container_name,
                                                    account_name=account_name, 
                                                    sas_token=sas_token,
                                                    overwrite=True)
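For your own storage accounts, a SAS token is a credential and is better kept out of notebook source. One possible approach, sketched below, is to read it from an environment variable. The variable name `AZURE_STORAGE_SAS_TOKEN` and the helper `load_sas_token` are assumptions made for this example, and the token value shown is a placeholder.

```python
import os

def load_sas_token(env_var="AZURE_STORAGE_SAS_TOKEN"):
    """Read a SAS token from the environment instead of the notebook source.

    The variable name is an assumption chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    # SAS tokens are query strings; normalise to a leading "?" to match
    # the format used in the registration cell above.
    return token if token.startswith("?") else "?" + token

# Demonstration with a placeholder value (not a real token).
os.environ["AZURE_STORAGE_SAS_TOKEN"] = "sv=2020-04-08&sp=rl&sig=PLACEHOLDER"
print(load_sas_token())
```

The returned string could then be passed as the `sas_token` argument when registering the datastore.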

#### If the datastore has already been registered, then you (and other users in your workspace) can directly run this cell.

In [None]:
from azureml.core import Datastore

datastore = Datastore.get(ws, 'mtcseattle')
datastore

## Dataset

An Azure Machine Learning [Dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a resource for exploring, transforming, and managing data in Azure Machine Learning. The following Dataset types are supported:

* [TabularDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format created by parsing the provided file or list of files.

* [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) references a single file or multiple files in Datastores or from public URLs.

### Register Dataset using SDK

In this workshop, we will register Datasets using the SDK. In particular, we will register a File Dataset. File Datasets allow for the data files in a specific folder in our Datastore to be registered as a Dataset.

There is a folder within our datastore called **data** that contains all our training and testing data. We will register this as a dataset.

In [None]:
azure_dataset = Dataset.File.from_files(path=(datastore, 'data'))

azure_dataset = azure_dataset.register(workspace=ws,
                                       name='Azure Services Dataset',
                                       description='Dataset containing azure related posts on Stackoverflow',
                                       create_new_version=True)

#### If the dataset has already been registered, then you (and other users in your workspace) can directly run this cell.

In [None]:
azure_dataset = Dataset.get_by_name(ws, 'Azure Services Dataset')
azure_dataset 

## Perform Experiment Using the Azure Machine Learning SDK

Now that we have a Compute Instance, a Dataset, and a training script working locally, it is time to train a model by running the script. We will start by creating an [Experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py). An Experiment is useful for grouping many runs of a specified script, or many runs of competing approaches to a single task. All runs in this tutorial will be performed under the same Experiment. 

In [None]:
from azureml.core import Experiment

experiment_name = 'azure-service-classifier' 
experiment = Experiment(ws, name=experiment_name)

So that we can replicate the exact set of packages our script relies on, we establish an Environment and specify packages, some with specific versions. We will pass the Environment specification along with our script when we submit the run.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies 

env = Environment.get(ws, name='AzureML-TensorFlow-2.0-GPU')
env.python.conda_dependencies.add_conda_package("pip")
env.python.conda_dependencies.add_pip_package("transformers==2.0.0")
env.python.conda_dependencies.add_pip_package("absl-py")
env.python.conda_dependencies.add_pip_package("azureml-dataprep")
env.python.conda_dependencies.add_pip_package("h5py<3.0.0")
env.python.conda_dependencies.add_pip_package("pandas")

env.name = "Bert_training"
env

### Create ScriptRunConfig

The RunConfiguration object encapsulates the information necessary to submit a training run in an experiment. Typically, you will not create a RunConfiguration object directly but get one from a method that returns it, such as the submit method of the Experiment class.

RunConfiguration is a base environment configuration that is also used in other types of configuration steps that depend on what kind of run you are triggering. For example, when setting up a PythonScriptStep, you can access the step's RunConfiguration object and configure Conda dependencies or access the environment properties for the run.

Let's go over how a Run is executed in Azure Machine Learning.

![](./images/aml-run.png)

A quick description of each ScriptRunConfig parameter we will use:

- `source_directory`: This specifies the root directory of our source code. 
- `script`: This specifies the training script to run. The path should be relative to the `source_directory`.
- `compute_target`: This specifies the compute target to run the job on, either the string `'local'` or a remote compute target.
- `environment`: This specifies the Environment, with its package dependencies, in which the script runs.
- `arguments`: This specifies the command-line arguments passed to the training script. Please note:

    1) *azure_dataset.as_named_input('azureservicedata').as_mount()* mounts the dataset on the compute target and provides the script with the path to the mounted data. 

    2) All outputs from the training script must be written to the './outputs' directory, as this is the only directory that will be saved to the run. 
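On the script side, the `arguments` list arrives as ordinary command-line flags. The sketch below shows one way a training script might read them using `argparse`. It is an assumption for illustration only: the actual training script is not shown here and may use a different flag library (absl-py is among the installed packages), and the mount path is made up.

```python
import argparse

def parse_args(argv):
    """Parse the flags passed through ScriptRunConfig's `arguments` list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--steps_per_epoch", type=int, default=None)
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--export_dir", type=str, default="./outputs/model")
    return parser.parse_args(argv)

# On the compute target the mounted dataset arrives as a plain path string.
# "/mnt/azureservicedata" is a made-up path for illustration.
args = parse_args(["--data_dir", "/mnt/azureservicedata",
                   "--batch_size", "16",
                   "--learning_rate", "3e-5"])
print(args.data_dir, args.batch_size, args.learning_rate, args.export_dir)
```

Keeping the default `--export_dir` under `./outputs` means the saved model is uploaded with the run without further configuration.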

#### Add Metrics Logging

So far, we were able to clone a TensorFlow 2.0 project and run it without any changes. However, with larger-scale projects we would want to log some metrics to make it easier to monitor the performance of our model and compare it to that of other models. 

We can do this by adding a few lines of code into our training script:

```python
# 1) Import SDK Run object
from azureml.core.run import Run

# 2) Get current service context
run = Run.get_context()

# 3) Log the metrics that we want
run.log('val_accuracy', float(logs.get('val_accuracy')))
run.log('accuracy', float(logs.get('accuracy')))
```
We've created a *train_logging.py* script that includes logging metrics as shown above. 
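The `logs` dictionary in the snippet above is what Keras passes to a callback at the end of each epoch. A minimal sketch of such a callback is shown here. `MetricsLogger` and `FakeRun` are hypothetical names for this example: in a real script the logger would subclass `tf.keras.callbacks.Callback` and receive the object returned by `Run.get_context()`, while `FakeRun` is only a local stand-in so the sketch runs without the SDK.

```python
class MetricsLogger:
    """Forward Keras-style epoch metrics to an Azure ML run.

    The method name and signature mirror Keras's Callback.on_epoch_end.
    """

    def __init__(self, run, names=("accuracy", "val_accuracy")):
        self.run = run
        self.names = names

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for name in self.names:
            value = logs.get(name)
            # Skip metrics that were not reported (e.g. no validation data),
            # since float(None) would raise a TypeError.
            if value is not None:
                self.run.log(name, float(value))


class FakeRun:
    """Local stand-in for Run.get_context() so the sketch runs offline."""

    def __init__(self):
        self.logged = []

    def log(self, name, value):
        self.logged.append((name, value))


run = FakeRun()
MetricsLogger(run).on_epoch_end(0, {"accuracy": 0.81})
print(run.logged)
```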

In [None]:
%pycat train_logging.py

In [None]:
from azureml.core import ScriptRun, ScriptRunConfig

scriptrun = ScriptRunConfig(source_directory='./',
                           script='train_logging.py',
                           arguments=['--data_dir', azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length', 128,
                              '--batch_size', 16,
                              '--learning_rate', 3e-5,
                              '--steps_per_epoch', 5, # to reduce time for workshop
                              '--num_epochs', 1, # to reduce time for workshop
                              '--export_dir','./outputs/model'],
                           compute_target='local',  # run on the current machine
                           environment=env)

run = experiment.submit(scriptrun)

Now if we view the current details of the run, we will see that the metrics are logged and plotted as graphs.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

## Check the model performance

The last training run produced a model of decent accuracy. Let's test it out and see what it does. First, let's check what files our latest training run produced and download the model files.

#### Download model files

In [None]:
run.get_file_names()

In [None]:
modelpath = 'outputs/model'
run.download_files(prefix=modelpath)

# # If you haven't finished training the model then just download pre-made model from datastore
# datastore.download('./outputs/model',prefix='model')

## Register Model

A registered [model](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py) is a reference to the directory or file or files that make up your model. After registering a model, you and other people in your workspace can easily gain access to and deploy your model without having to run the training script again. 

We need to define the following parameters to register a model:

- `model_name`: The name for your model. If the model name already exists in the workspace, it will create a new version for the model.
- `model_path`: The path to where the model is stored. In our case, this was the *export_dir* passed in the arguments of our ScriptRunConfig.
- `description`: A description for the model.

Let's register the model produced by our training run.

In [None]:
model = run.register_model(model_name='azure-service-classifier',
                           model_path=modelpath,
                           datasets=[('train, test, validation data', azure_dataset)],
                           description='BERT model for classifying azure services on stackoverflow posts.')

---

## Appendix

__The following is for reference, but not to be used for this workshop.__

## Distributed Training Across Multiple GPUs

Distributed training allows us to train across multiple GPUs and multiple nodes of a cluster. Azure Machine Learning service helps manage the infrastructure for training distributed jobs. All we have to do is pass a `distributed_job_config` to our ScriptRunConfig object. For an MPI job this is an `MpiConfiguration` with the following parameters:

- `node_count`: The number of nodes to run this job across. This can be at most the `max_nodes` of the cluster.
- `process_count_per_node`: The number of processes to run per node, typically one per GPU. Using multi-GPU nodes is beneficial because communication bandwidth within a machine is higher than between nodes. 

MPI (Message Passing Interface) is the backend used by the Horovod framework.

We use [Horovod](https://github.com/horovod/horovod), a framework that allows us to easily modify our existing training script to run across multiple nodes/GPUs. The distributed training script is saved as *train_horovod.py*.

*  **ACTION**: Explore _train_horovod.py_ using [Azure ML studio > Notebooks tab](images/azuremlstudio-notebooks-explore.png)
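In data-parallel training, each process works on its own batch, so the number of workers changes the effective batch size. The sketch below works through that arithmetic. `distributed_plan` is a hypothetical helper, and the linear learning-rate scaling it shows is a common heuristic in Horovod examples; whether *train_horovod.py* applies it is not shown here.

```python
def distributed_plan(node_count, process_count_per_node,
                     per_worker_batch_size, base_learning_rate):
    """Summarise how worker count affects batch size and learning rate."""
    workers = node_count * process_count_per_node
    return {
        "workers": workers,
        # Each worker processes its own batch every step.
        "global_batch_size": workers * per_worker_batch_size,
        # Linear scaling heuristic; treat it as a starting point to tune.
        "scaled_learning_rate": base_learning_rate * workers,
    }

plan = distributed_plan(node_count=2, process_count_per_node=2,
                        per_worker_batch_size=32, base_learning_rate=3e-5)
print(plan)
```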

In [None]:
%pycat train_horovod.py

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = 'train-gpu-nv6'
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NV6', 
                                                       idle_seconds_before_scaledown=6000,
                                                       min_nodes=0, 
                                                       max_nodes=10)

compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)

We can submit this run in the same way that we did with the others, but with the additional parameters.

In [None]:
from azureml.core import ScriptRun, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

scriptrun3 = ScriptRunConfig(source_directory='./',
                           script='train_horovod.py',
                           arguments=['--data_dir', azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length', 128,
                              '--batch_size', 32,
                              '--learning_rate', 3e-5,
                              '--steps_per_epoch', 150,
                              '--num_epochs', 3,
                              '--export_dir','./outputs/model'],
                           compute_target=compute_target,
                           distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=1),
                           environment=env)

run3 = experiment.submit(scriptrun3)

Once again, we can view the current details of the run. 

In [None]:
from azureml.widgets import RunDetails
RunDetails(run3).show()

Once the run completes, note the time it took. It should be around 5 minutes. As you can see, by moving to cloud GPUs and using distributed training we reduced model training time from more than an hour to 5 minutes, which greatly improves our speed of experimentation and potential for innovation.

## Tune Hyperparameters Using Hyperdrive

So far we have been using default hyperparameter values, but in practice we would need to tune these values to optimize performance. Azure Machine Learning service provides several methods for tuning hyperparameters using different strategies.

The first step is to choose the parameter space that we want to search. We have a few choices to make here:

- **Parameter Sampling Method**: This is how we select the combinations of parameters to sample. Azure Machine Learning service offers [RandomParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.randomparametersampling?view=azure-ml-py), [GridParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.gridparametersampling?view=azure-ml-py), and [BayesianParameterSampling](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.bayesianparametersampling?view=azure-ml-py). We will use the `GridParameterSampling` method.
- **Parameters To Search**: We will be searching for optimal combinations of `learning_rate` and `num_epochs`.
- **Parameter Expressions**: This defines the [functions that can be used to describe a hyperparameter search space](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.parameter_expressions?view=azure-ml-py), which can be discrete or continuous. We will be using a discrete set of choices, expressed with `choice`.

The following code allows us to define these options.

In [None]:
from azureml.train.hyperdrive import GridParameterSampling
from azureml.train.hyperdrive.parameter_expressions import choice


param_sampling = GridParameterSampling( {
        '--learning_rate': choice(3e-5, 3e-4),
        '--num_epochs': choice(3, 4)
    }
)
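Grid sampling tries every combination of the listed choices, so the number of runs is the product of the number of choices per parameter. The following sketch enumerates the combinations locally with `itertools.product`. `grid_combinations` is a hypothetical helper for illustration and does not use the HyperDrive API.

```python
from itertools import product

def grid_combinations(space):
    """Expand {flag: [choices]} into every combination, one dict per run."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in product(*(space[name] for name in names))]

space = {"--learning_rate": [3e-5, 3e-4], "--num_epochs": [3, 4]}
combos = grid_combinations(space)
for combo in combos:
    print(combo)
print(len(combos), "runs")
```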

The next step is to define how we want to measure our performance. We do so by specifying two classes:

- **[PrimaryMetricGoal](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.primarymetricgoal?view=azure-ml-py)**: We want to `MAXIMIZE` the `val_accuracy` that is logged in our training script.
- **[BanditPolicy](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?view=azure-ml-py)**: A policy for early termination so that jobs which don't show promising results will stop automatically.

In [None]:
from azureml.train.hyperdrive import BanditPolicy
from azureml.train.hyperdrive import PrimaryMetricGoal

primary_metric_name='val_accuracy'
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE

early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=2)
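As described in the Azure ML documentation, a Bandit policy with a `slack_factor` compares each run against the best run so far: when maximizing, a run is stopped if its metric falls below `best / (1 + slack_factor)`. The sketch below reproduces only that threshold arithmetic. `should_terminate` is a hypothetical helper, the metric values are made up, and the sketch ignores `evaluation_interval` and `delay_evaluation`.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Bandit-style check for a metric that is being maximized."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold

best = 0.80
for metric in (0.78, 0.74, 0.70):
    print(metric, should_terminate(metric, best, slack_factor=0.1))
```

With a best `val_accuracy` of 0.80 and a slack factor of 0.1, the cut-off is roughly 0.727, so only runs noticeably behind the leader are stopped early.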

We define a ScriptRunConfig as usual, but this time without the script parameters that we are planning to search.

In [None]:
from azureml.core import ScriptRun, ScriptRunConfig

scriptrun4 = ScriptRunConfig(source_directory='./',
                           script='train_logging.py',
                           arguments=['--data_dir', azure_dataset.as_named_input('azureservicedata').as_mount(),
                              '--max_seq_length', 128,
                              '--batch_size', 32,
                              # --learning_rate and --num_epochs are supplied by HyperDrive
                              '--steps_per_epoch', 150,
                              '--export_dir','./outputs/model'],
                           compute_target=compute_target,
                           environment=env)


Finally, we add all our parameters in a [HyperDriveConfig](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py) class and submit it as a run. 

In [None]:
from azureml.train.hyperdrive import HyperDriveConfig

hyperdrive_run_config = HyperDriveConfig(run_config=scriptrun4, 
                       hyperparameter_sampling=param_sampling, 
                       policy=early_termination_policy, 
                       primary_metric_name=primary_metric_name, 
                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                       max_total_runs=10,
                       max_concurrent_runs=2)

run4 = experiment.submit(hyperdrive_run_config)

When we view the details of our run this time, we will see information and metrics for every run in our hyperparameter tuning.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run4).show()

We can retrieve the best run based on our defined metric.

In [None]:
best_run = run4.get_best_run_by_primary_metric()

We registered the model with a Dataset reference. 
* **ACTION**: Check the dataset-to-model link in **Azure ML studio > Datasets tab > Azure Services Dataset**.

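In the [next tutorial](https://github.com/microsoft/bert-stack-overflow/blob/master/2-Inferencing/AzureServiceClassifier_Inferencing.ipynb), we will perform inferencing on this model and deploy it to a web service.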