Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# PyTorch Pretrained BERT on AzureML with SQuAD Dataset
This notebook contains an end-to-end walkthrough of using Azure Machine Learning service to run Hugging Face's [PyTorch reimplementation](https://github.com/huggingface/pytorch-pretrained-BERT) of [Google's TensorFlow repository for the BERT model](https://github.com/google-research/bert).

The notebook covers the following steps:
- Download the SQuAD dataset and store it in Azure storage
- Speed up fine-tuning BERT on the SQuAD dataset with AzureML GPU clusters
- Further fine-tune BERT with the AzureML hyperparameter optimizer

## Prerequisites
- Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning (AML)

- Install the Python SDK, making sure to include the `notebooks` and `contrib` extras:
```
conda create -n azureml -y Python=3.6
source activate azureml
pip install --upgrade azureml-sdk[notebooks,contrib] 
conda install ipywidgets
jupyter nbextension install --py --user azureml.widgets
jupyter nbextension enable azureml.widgets --user --py
```

You will need to restart Jupyter after this step.

Detailed instructions are available at https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize workspace

To create or access an Azure ML Workspace, you will need to import the AML library and provide the following information:
* A name for your workspace
* Your subscription id
* The resource group name

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step or create a new one. 

In [2]:
from azureml.core.workspace import Workspace

workspace_name = ''
subscription_id = ''
resource_group_name = ''
location = ''

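# Use the public API: with exist_ok=True, create() returns the existing workspace if there is one
ws = Workspace.create(name=workspace_name,
                      subscription_id=subscription_id,
                      resource_group=resource_group_name,
                      location=location,
                      exist_ok=True)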

Wrote the config file BERTtrainconfig.json to: F:\AzureML-BERT\PyTorch\aml_config\BERTtrainconfig.json
Workspace configuration succeeded. Skip the workspace creation steps below


### Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [None]:
project_folder = './pytorch-pretrained-BERT'

Make a local clone of the original [PyTorch reimplementation](https://github.com/huggingface/pytorch-pretrained-BERT) repository.

In [None]:
!git clone -b v0.4.0 https://github.com/huggingface/pytorch-pretrained-BERT.git

### Define AzureML datastore to collect training dataset

AML makes data accessible for remote training through a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload data to and download data from Azure Storage, and to interact with it from your remote compute targets.

Each workspace is associated with a default Azure Blob datastore named `'workspaceblobstore'`. In this notebook, we use this default datastore to hold the SQuAD training data.

In [None]:
from azureml.core import Datastore
ds = Datastore(ws, 'workspaceblobstore')

The SQuAD data can be downloaded from the following links and should be saved in blob storage.
- [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
- [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
- [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

In [None]:
import os
import urllib.request  # importing urllib alone does not guarantee urllib.request is loaded

os.makedirs('./squad', exist_ok=True)
# Build paths with os.path.join so the cell works on Windows, Linux, and macOS
urllib.request.urlretrieve('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json', filename=os.path.join('./squad', 'train-v1.1.json'))
urllib.request.urlretrieve('https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json', filename=os.path.join('./squad', 'dev-v1.1.json'))
urllib.request.urlretrieve('https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py', filename=os.path.join(project_folder, 'evaluate_squad.py'))

The following code will upload the training data to the path ./squad on the default datastore.

In [None]:
ds.upload(src_dir='./squad', target_path='./squad')
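
As a quick sanity check on the downloaded files, the sketch below counts articles, paragraphs, and questions in a SQuAD v1.1-style JSON document. The helper name `count_squad_examples` and the tiny inline sample are illustrative assumptions, not part of the original notebook or of the SQuAD tooling.

```python
import json

# Tiny hand-written sample in the SQuAD v1.1 layout (illustrative, not real data).
SAMPLE_JSON = """
{
  "version": "1.1",
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "BERT was released by Google in 2018.",
      "qas": [{
        "id": "q1",
        "question": "Who released BERT?",
        "answers": [{"text": "Google", "answer_start": 21}]
      }]
    }]
  }]
}
"""

def count_squad_examples(squad):
    """Return (articles, paragraphs, questions) for a SQuAD v1.1-style dict."""
    articles = squad["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    return len(articles), len(paragraphs), len(questions)

sample = json.loads(SAMPLE_JSON)
print(count_squad_examples(sample))
```

To check a real file, load it with `json.load(open('./squad/train-v1.1.json'))` and pass the result to the same helper.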

### Create an experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed PyTorch tutorial. 

In [None]:
from azureml.core import Experiment

experiment_name = 'BERT-SQuAD'
experiment = Experiment(ws, name=experiment_name)

## Fine-tuning BERT with Distributed Training
Now that our `SQuAD` dataset is ready in Azure storage, we can start fine-tuning the model by exploiting the power of distributed training.

### Create a GPU remote compute target

We need to create a GPU [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to perform the fine-tuning. In this example, we create an AmlCompute cluster as our training compute resource. The table below lists the relevant Azure VM sizes.

    
|    VM Size    	| CPU 	|   GPU   	| Storage (SSD) 	| GPU memory 	| InfiniBand  	|
|:-------------:	|:---:	|:-------:	|:-------------:	|:----------:	|:----------:	|
|  Standard_NC6 	|  6  	| 1 x K80 	|    340 GiB    	|   12 GiB   	|      No   	|
| Standard_NC12 	|  12 	| 2 x K80 	|    680 GiB    	|   24 GiB   	|      No   	|
| Standard_NC24 	|  24 	| 4 x K80 	|    1440 GiB   	|   48 GiB   	|      No   	|
| Standard_NC24r 	|  24 	| 4 x K80 	|    1440 GiB   	|   48 GiB   	|      Yes   	|
| Standard_NC6s_v3 	|  6  	| 1 x V100 	|    736 GiB    	|   16 GiB   	|      No   	|
| Standard_NC12s_v3 |  12 	| 2 x V100 	|    1474 GiB   	|   32 GiB   	|      No   	|
| Standard_NC24s_v3 |  24 	| 4 x V100 	|    2948 GiB   	|   64 GiB   	|      No   	|
| Standard_NC24rs_v3|  24 	| 4 x V100 	|    2948 GiB   	|   64 GiB   	|      Yes   	|

The code below creates a cluster for you if it does not already exist in your workspace.

***We strongly recommend using the NCv3-series (NVIDIA Tesla V100) to fine-tune on the SQuAD dataset. You will need to request NCv3-series quota for your Azure subscription.***

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
gpu_cluster_name = "nc24rsv3Cluster"

try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC24rs_v3', max_nodes=4)

    # create the cluster
    gpu_compute_target = AmlCompute.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current cluster. 
print(gpu_compute_target.status.serialize())

### Create a PyTorch estimator for fine-tuning
Let us create a new PyTorch estimator to run the fine-tuning script `run_squad.py`, which is already provided in [the original repository](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_squad.py). Please refer to [the repository README](https://github.com/huggingface/pytorch-pretrained-BERT#fine-tuning-with-bert-running-the-examples) for more detail about the script.

The original `run_squad.py` script uses the PyTorch distributed launch utility to launch multiple processes across nodes and GPUs. We prepared a modified version, [run_squad_azureml.py](./run_squad_azureml.py), so that we can launch it with AzureML's built-in MPI backend.
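
To illustrate what launching under MPI means for a training script, here is a minimal sketch of how a process could discover its rank. It assumes the job is started by Open MPI, which exports the `OMPI_COMM_WORLD_*` environment variables. The helper `get_mpi_world` is hypothetical and is not taken from `run_squad_azureml.py`, which may obtain this information differently.

```python
import os

def get_mpi_world(env=None):
    """Read rank information from Open MPI environment variables.

    Falls back to a single-process layout when the variables are absent,
    for example when the script is run locally without mpirun.
    """
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    return rank, world_size, local_rank

# Simulated environment: rank 5 of 8 processes (2 nodes x 4 processes per node).
fake_env = {
    "OMPI_COMM_WORLD_RANK": "5",
    "OMPI_COMM_WORLD_SIZE": "8",
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
}
rank, world_size, local_rank = get_mpi_world(fake_env)
print(f"rank={rank} world_size={world_size} local_rank={local_rank}")
```

The local rank is typically what a script would use to pick a GPU on its node, while the global rank and world size would be passed to the distributed process group.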

To use AML's tracking and metrics capabilities, we need to add a small amount of AzureML code inside the training script.

In `run_squad_azureml.py`, we will log some metrics to our AML run. To do so, we will access the AML run object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `run_squad_azureml.py`, we log the learning rate, the training loss, and the prediction scores the model achieves:
```Python
run.log('lr', np.float(args.learning_rate))
...

for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 
    ...
    run.log('train_loss', np.float(loss))

...
```
These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Fine-Tuning BERT with Hyperparameter Tuning" section.

Let's first copy the training script `run_squad_azureml.py` and the helper utility script `azureml_bert_util.py` into our project directory.

In [None]:
import shutil
shutil.copy('run_squad_azureml.py', project_folder)
shutil.copy('azureml_bert_util.py', project_folder)

The AzureML PyTorch estimator can then be defined as below. We use `azuremlsamples/bert:torch-1.0.0-apex-cuda9` as the base Docker image; see the [dockerfile](./dockerfile).

In [None]:
from azureml.train.dnn import PyTorch

estimator = PyTorch(source_directory=project_folder,
                    compute_target=gpu_compute_target,
                    script_params = {
                          '--bert_model':'bert-large-uncased',
                          '--do_train' : '',
                          '--do_predict': '',
                          '--train_file': ds.path('squad/train-v1.1.json').as_mount(),
                          '--predict_file': ds.path('squad/dev-v1.1.json').as_mount(),
                          '--max_seq_length': 384,
                          '--train_batch_size': 8,
                          '--learning_rate': 6.8e-5,
                          '--num_train_epochs': 2.0,
                          '--doc_stride': 128,
                          '--seed': 32,
                          '--init_gradient_accumulation_steps':2,
                          '--target_gradient_accumulation_steps':16,
                          '--accumulation_warmup_proportion':0.25,
                          '--output_dir': './outputs',
                          '--fp16':'',
                          '--loss_scale':256,
                    },
                    custom_docker_base_image='azuremlsamples/bert:torch-1.0.0-apex-cuda9',
                    entry_script='run_squad_azureml.py',
                    node_count=2,
                    process_count_per_node=4,
                    distributed_backend='mpi',
                    use_gpu=True)

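# Use the Python environment already present in the custom Docker image instead of having AzureML build one
estimator._estimator_config.environment.python.user_managed_dependencies=True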

### Submit and Monitor your run

In [None]:
run = experiment.submit(estimator)
from azureml.widgets import RunDetails
RunDetails(run).show()

To achieve over **90.5 F1 score** and **83.5 Exact-Match** on the `SQuAD v1.1` dataset, **2** epochs of fine-tuning are required with the `BERT large` model. The table below lists the elapsed time for different Azure GPU VMs and configurations.

The default configuration in this notebook uses 2 `STANDARD_NC24rs_v3` nodes (8 x V100 in total) with `fp16` enabled. The training phase should take about **22 minutes** to complete 2 epochs.

|  GPU counts 	|    1 GPU    	|         2 GPU 	| 4 GPU      	| 8 GPU      	|
|------------:	|:-----------:	|--------------:	|------------	|------------	|
| NCv3-series 	|     340 mins  |    180 mins 	    |    80 mins 	|   48 mins 	|
| NCv3 with fp16|     140 mins  |    79 mins 	    |    38 mins 	|   22 mins 	|
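
The timings in the table can be turned into speedup and scaling-efficiency figures. The short sketch below does that arithmetic; the numbers are copied from the table, while the helper name `speedup_and_efficiency` is only illustrative.

```python
# Elapsed minutes copied from the table above, keyed by GPU count.
fp32_minutes = {1: 340, 2: 180, 4: 80, 8: 48}
fp16_minutes = {1: 140, 2: 79, 4: 38, 8: 22}

def speedup_and_efficiency(minutes):
    """Map GPU count -> (speedup over 1 GPU, speedup divided by GPU count)."""
    baseline = minutes[1]
    return {gpus: (baseline / t, baseline / t / gpus) for gpus, t in minutes.items()}

for name, minutes in (("fp32", fp32_minutes), ("fp16", fp16_minutes)):
    for gpus, (speedup, efficiency) in speedup_and_efficiency(minutes).items():
        print(f"{name} {gpus} GPU: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

# Gain from mixed precision (fp16) at each GPU count.
fp16_gain = {gpus: fp32_minutes[gpus] / fp16_minutes[gpus] for gpus in fp32_minutes}
print({gpus: round(gain, 2) for gpus, gain in fp16_gain.items()})
```

On 8 GPUs this works out to roughly a 7.1x speedup without `fp16` and 6.4x with it, and enabling `fp16` on 8 GPUs cuts the elapsed time by about 2.2x.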

## Fine-Tuning BERT with Hyperparameter Tuning

We would also like to optimize our hyperparameter, `learning rate`, using Azure Machine Learning's hyperparameter tuning capabilities.

### Start a hyperparameter sweep
First, we will define the hyperparameter space to sweep over. In this example we will use random sampling to try different hyperparameter configurations and maximize our primary metric, the F1 score (`f1`).

In [None]:
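from azureml.train.hyperdrive import (HyperDriveRunConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, uniform)

# Random sampling supports continuous ranges such as uniform();
# grid sampling only accepts discrete choice() values.
param_sampling = RandomParameterSampling({
    'learning_rate': uniform(5e-5, 9e-5),
})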

hyperdrive_run_config = HyperDriveRunConfig(estimator=estimator,
                                            hyperparameter_sampling=param_sampling, 
                                            primary_metric_name='f1',
                                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                            max_total_runs=5,
                                            max_concurrent_runs=5)
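
To give a feel for what random sampling over a uniform range produces, the following standalone sketch draws five candidate learning rates from the same interval using only the standard library. It is an illustration of the idea, not HyperDrive's implementation; the helper `sample_learning_rates` and the fixed seed are assumptions made for reproducibility.

```python
import random

def sample_learning_rates(low, high, n, seed=0):
    """Draw n learning rates uniformly at random from [low, high]."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    return [rng.uniform(low, high) for _ in range(n)]

# Same range and run count as the sweep configuration above.
candidates = sample_learning_rates(5e-5, 9e-5, n=5)
for lr in candidates:
    print(f"{lr:.2e}")
```

Each of the five runs in the sweep would receive one such value as its `learning_rate` argument.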

Finally, launch the hyperparameter tuning job.

In [None]:
hyperdrive_run = experiment.submit(hyperdrive_run_config)

### Monitor HyperDrive runs
We can monitor the progress of the runs with the following Jupyter widget. 

In [None]:
from azureml.widgets import RunDetails

RunDetails(hyperdrive_run).show()

### Find and register the best model
Once all the runs complete, we can find the run that produced the model with the highest F1 score.

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('Best Run is:\n  F1 score: {0:.5f} \n  Learning rate: {1:.8f}'.format(
        best_run_metrics['f1'],
        best_run_metrics['lr']
     ))
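
One caveat when formatting metrics: assuming that `get_metrics()` returns a list rather than a scalar for a metric name that was logged more than once during a run, the format call above would fail for such a metric. The sketch below normalizes both cases; `latest_value` and the sample dictionary are hypothetical and only illustrate the pattern.

```python
def latest_value(metric):
    """Return the most recent value of a metric that may be a scalar or a list."""
    if isinstance(metric, (list, tuple)):
        if not metric:
            raise ValueError("metric has no logged values")
        return metric[-1]
    return metric

# Hypothetical metrics dictionary: 'f1' logged twice, 'lr' logged once.
metrics = {"f1": [88.1, 90.6], "lr": 6.8e-05}

print("F1 score: {0:.5f}".format(latest_value(metrics["f1"])))
print("Learning rate: {0:.8f}".format(latest_value(metrics["lr"])))
```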