# Accelerate pretraining of BERT model using ONNX Runtime
This notebook walks through using ONNX Runtime in Azure Machine Learning to pretrain [BERT: Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805) models. It shows how ONNX Runtime training can accelerate the PyTorch BERT pretraining implementation maintained at https://github.com/NVIDIA/DeepLearningExamples.

Steps:
- Initialize an AzureML workspace
- Register a datastore to use preprocessed data for training
- Create an AzureML experiment
- Provision a compute target
- Create an Estimator
- Configure and Run

### Prerequisites
If you are using an Azure Machine Learning [Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), you are all set. Otherwise, set up your environment by installing the AzureML Python SDK before running this notebook. If you haven't already established a connection to your AzureML Workspace, work through the [How to use Estimator in Azure ML](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/how-to-use-estimator/how-to-use-estimator.ipynb) notebook first.

Refer to the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/nvidia-bert/README.md before running the steps below.

### Check SDK installation

In [None]:
import os
import requests
import sys

# AzureML libraries
import azureml.core
from azureml.core import Experiment, Workspace, Datastore, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.container_registry import ContainerRegistry
from azureml.core.runconfig import MpiConfiguration, RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

### AzureML Workspace setup

In [None]:
# Create or retrieve Azure machine learning workspace
# see https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py
ws = Workspace.get(name="myworkspace", subscription_id='<azure-subscription-id>', resource_group='myresourcegroup')

# Print workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

### Register Datastore
Before running the step below, data prepared using the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/nvidia-bert/README.md should be transferred to an Azure Blob container referenced in the `Datastore` registration step. Refer to the documentation at https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data for details on using data in Azure ML experiments.
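
The registration call in the next cell needs the storage account name, key, and container name. Rather than pasting the key into the notebook, one option is to read these values from environment variables. The sketch below is only an illustration: the helper `datastore_settings_from_env` and the `BERT_*` variable names are hypothetical and not part of the AzureML SDK.

```python
import os


def datastore_settings_from_env(env=None):
    """Collect Datastore registration settings from environment variables.

    The BERT_* variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = {
        "datastore_name": "BERT_DATASTORE_NAME",
        "account_name": "BERT_STORAGE_ACCOUNT",
        "account_key": "BERT_STORAGE_KEY",
        "container_name": "BERT_STORAGE_CONTAINER",
    }
    missing = sorted(var for var in names.values() if not env.get(var))
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in names.items()}


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {
    "BERT_DATASTORE_NAME": "bertdata",
    "BERT_STORAGE_ACCOUNT": "mystorageaccount",
    "BERT_STORAGE_KEY": "not-a-real-key",
    "BERT_STORAGE_CONTAINER": "bert-data",
}
settings = datastore_settings_from_env(example_env)
print(sorted(settings))
```

The resulting dictionary uses the same keyword names as the registration call below, so it could be passed as `Datastore.register_azure_blob_container(workspace=ws, **settings)`.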

In [None]:
# Create a datastore from blob storage containing training data.
# Consult README.md for instructions on downloading and uploading training data.
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='<datastore-name>',
                                             account_name='<storage-account-name>', 
                                             account_key='<storage-account-key>',
                                             container_name='<storage-container-name>')

In [None]:
# Print datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

### Create AzureML Compute Cluster
This recipe is supported on Azure Machine Learning using 16 x Standard_NC24rs_v3 or 8 x Standard_ND40rs_v2 VMs. The next step creates an AzureML Compute cluster of Standard_ND40rs_v2 GPU VMs with the specified name if one doesn't already exist in your workspace.
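
Both supported layouts provide 64 GPUs in total. The sketch below derives the per-node process count and per-GPU memory limit from the VM size, using the GPU counts and memory sizes listed in the parameter tables later in this notebook. `SKU_SPECS` and `cluster_layout` are hypothetical names introduced for illustration only.

```python
# GPUs per node and memory per GPU, taken from the parameter tables in this notebook.
SKU_SPECS = {
    "Standard_ND40rs_v2": {"gpus_per_node": 8, "gpu_memory_gb": 32},
    "Standard_NC24rs_v3": {"gpus_per_node": 4, "gpu_memory_gb": 16},
}


def cluster_layout(vm_size, node_count):
    """Return settings that depend only on the VM size and the node count."""
    spec = SKU_SPECS[vm_size]
    return {
        # One MPI process per GPU on each node.
        "process_count_per_node": spec["gpus_per_node"],
        # Value for the --gpu_memory_limit_gb script parameter.
        "gpu_memory_limit_gb": spec["gpu_memory_gb"],
        "total_gpus": spec["gpus_per_node"] * node_count,
    }


nd_layout = cluster_layout("Standard_ND40rs_v2", node_count=8)
nc_layout = cluster_layout("Standard_NC24rs_v3", node_count=16)
print(nd_layout)
print(nc_layout)
```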

In [None]:
# Create GPU cluster
gpu_cluster_name = "ortbertpretrain" 
try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_ND40rs_v2', min_nodes=0, max_nodes=8)
    gpu_compute_target = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

In [None]:
# Create experiment for phase 1
experiment_name = 'nvbert-ort-pretraining-phase1'
experiment = Experiment(ws, name=experiment_name)

### Create Estimator
Notes before running the following step:
* Update the following step to replace both occurrences of `<blob-path-to-phase1-training-data>` with the actual path in the datastore that contains the training files.
* If you followed the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/nvidia-bert/README.md to prepare data, make sure that the data and other files that are not code or config are moved out of the `workspace` directory. Data files should already have been moved to a `Datastore` for use in training.

To fully utilize capacity, we suggest the phase 1 parameters in the table below.

| VM SKU             | node_count         | gpu_memory_limit_gb         | train_batch_size | gradient_accumulation_steps |
| ------------------ |:------------------:|-----------------:|-----------------:| ---------------------------:|
| Standard_ND40rs_v2 | 1 (8 GPUs total)   | 32  | 8192  | 64  |
| Standard_ND40rs_v2 | 2 (16 GPUs total)  | 32  | 4096  | 32  |
| Standard_ND40rs_v2 | 4 (32 GPUs total)  | 32  | 2048  | 16  |
| Standard_ND40rs_v2 | 8 (64 GPUs total)  | 32  | 1024  | 8   |
| Standard_NC24rs_v3 | 1 (4 GPUs total)   | 16  | 16320 | 340 |
| Standard_NC24rs_v3 | 2 (8 GPUs total)   | 16  | 8160  | 170 |
| Standard_NC24rs_v3 | 4 (16 GPUs total)  | 16  | 4080  | 85  |
| Standard_NC24rs_v3 | 8 (32 GPUs total)  | 16  | 2016  | 42  |
| Standard_NC24rs_v3 | 16 (64 GPUs total) | 16  | 1008  | 21  |

Refer to [README.md](../README.md) for an in-depth explanation of batch sizes and gradient accumulation steps.
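
As a rough illustration of how the rows in the table above relate, the sketch below assumes that `train_batch_size` is the per-GPU batch for one optimizer step and that each forward/backward pass processes `train_batch_size / gradient_accumulation_steps` sequences. That reading is an assumption made here; README.md remains the authoritative explanation. `batch_breakdown` is a hypothetical helper.

```python
def batch_breakdown(train_batch_size, gradient_accumulation_steps, total_gpus):
    """Split a per-GPU batch into micro-batches and compute the global batch.

    Assumes train_batch_size is per GPU per optimizer step (see lead-in).
    """
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by gradient_accumulation_steps")
    micro_batch = train_batch_size // gradient_accumulation_steps
    global_batch = train_batch_size * total_gpus
    return micro_batch, global_batch


# (VM SKU, total GPUs, train_batch_size, gradient_accumulation_steps) from the phase 1 table
rows = [
    ("Standard_ND40rs_v2", 8, 8192, 64),
    ("Standard_ND40rs_v2", 32, 2048, 16),
    ("Standard_NC24rs_v3", 4, 16320, 340),
    ("Standard_NC24rs_v3", 64, 1008, 21),
]
results = {}
for sku, gpus, batch, accum in rows:
    results[(sku, gpus)] = batch_breakdown(batch, accum, gpus)
    micro, global_batch = results[(sku, gpus)]
    print(f"{sku} with {gpus} GPUs: micro-batch {micro}, global batch {global_batch}")
```

Under this reading, the micro-batch stays fixed per SKU (128 on the 32 GB GPUs, 48 on the 16 GB GPUs) while the global batch stays close to 64K sequences as nodes are added.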

In [None]:
# This directory should contain run_pretraining_ort.py, the ort_supplement directory, and other files copied over
# based on the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/nvidia-bert/README.md
project_folder = '../../workspace/BERT'

image_name = 'mcr.microsoft.com/azureml/onnxruntime-training:0.1-rc2-openmpi4.0-cuda10.2-cudnn7.6-nccl2.7.6-for-bert'

# set MPI configuration
# set processes per node to be equal to GPU count on SKU.
mpi = MpiConfiguration()
mpi.process_count_per_node = 8

import uuid
output_id = uuid.uuid1().hex

# Define training estimator for phase 1
# Consult https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-ml-models
# Fill in blob path to phase 1 training data in argument below
estimator_ph1 = Estimator(source_directory=project_folder,

                    # Compute configuration
                    compute_target = gpu_compute_target,
                    node_count=4,
                    process_count_per_node=1,  # separate MPI jobs
                    distributed_training = mpi,
                    use_gpu = True,
                    
                    # supply Docker image
                    use_docker = True,
                    custom_docker_image = image_name,
                    user_managed = True,
                    
                    # Training script parameters
                    script_params = {
                        "--config_file": "bert_config.json",
                        '--input_dir' : ds.path('<blob-path-to-phase1-training-data>').as_mount(), 
                        '--output_dir': ds.path(f'output/{experiment_name}/{output_id}/').as_mount(),
                        '--bert_model' : 'bert-large-uncased',
                        '--train_batch_size' : 2048,
                        '--max_seq_length': 128,
                        '--max_predictions_per_seq': 20,
                        '--max_steps' : 7038,
                        '--warmup_proportion' : '0.2843',
                        '--num_steps_per_checkpoint' : 200,
                        '--learning_rate' : '6e-3',
                        '--seed': 42,
                        '--fp16' : '',
                        '--gradient_accumulation_steps' : 16,
                        '--allreduce_post_accumulation' : '',
                        '--allreduce_post_accumulation_fp16' : '',
                        '--do_train' : '',
                        '--use_ib' : '', # pass if infiniband available on SKU
                        '--gpu_memory_limit_gb' : 32 # set to per GPU memory in GB (check SKU)
                    },
                    
                    entry_script = 'run_pretraining_ort.py',
                    inputs = [ds.path('<blob-path-to-phase1-training-data>').as_mount()]
                   )
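
Several entries in `script_params` above, such as `--fp16` and `--do_train`, use an empty string as their value to mark them as bare flags. The sketch below models that convention with a hypothetical `to_argv` helper and a stand-in `argparse` parser. It is not the SDK's implementation and not the training script's real parser; it only illustrates how such a dictionary could map onto a command line.

```python
import argparse


def to_argv(script_params):
    """Flatten a script_params-style dict into a list of command-line arguments."""
    argv = []
    for name, value in script_params.items():
        argv.append(name)
        if value != "":  # an empty string marks a bare flag with no value
            argv.append(str(value))
    return argv


params = {
    "--train_batch_size": 2048,
    "--learning_rate": "6e-3",
    "--fp16": "",
    "--do_train": "",
}
argv = to_argv(params)

# Stand-in parser covering only the arguments used in this example.
parser = argparse.ArgumentParser()
parser.add_argument("--train_batch_size", type=int)
parser.add_argument("--learning_rate", type=float)
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--use_ib", action="store_true")
args = parser.parse_args(argv)
print(argv)
print(args.fp16, args.use_ib)
```

Flags that are left out of the dictionary, like `--use_ib` here, simply stay at their default of `False`.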

### Run AzureML experiment - Phase 1 of pretraining

In [None]:
# Submit phase 1 (check logs from Outputs + logs tab of corresponding link)
run = experiment.submit(estimator_ph1)
RunDetails(run).show()
print(run.get_portal_url())

In [None]:
# Create experiment for phase 2
experiment_name = 'nvbert-ort-pretraining-phase2'
experiment = Experiment(ws, name=experiment_name)

### Create Estimator - Phase 2
Notes before running the following step:
* Update the following step to replace both occurrences of `<blob-path-to-phase2-training-data>` with the actual path in the datastore that contains the training files, and `<path-to-checkpoint-from-phase-1>` with the path to the phase 1 checkpoint.
* If you followed the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/nvidia-bert/README.md to prepare data, make sure that the data and other files that are not code or config are moved out of the `workspace` directory. Data files should already have been moved to a `Datastore` for use in training.

To fully utilize capacity, we suggest the phase 2 parameters in the table below.

| VM SKU             | node_count         | gpu_memory_limit_gb         | train_batch_size | gradient_accumulation_steps |
| ------------------ |:------------------:|-----------------:|-----------------:| ---------------------------:|
| Standard_ND40rs_v2 | 1 (8 GPUs total)   | 32  | 4096 | 256  |
| Standard_ND40rs_v2 | 2 (16 GPUs total)  | 32  | 2048 | 128  |
| Standard_ND40rs_v2 | 4 (32 GPUs total)  | 32  | 1024 | 64   |
| Standard_ND40rs_v2 | 8 (64 GPUs total)  | 32  | 512  | 32   |
| Standard_NC24rs_v3 | 1 (4 GPUs total)   | 16  | 8192 | 1024 |
| Standard_NC24rs_v3 | 2 (8 GPUs total)   | 16  | 4096 | 512  |
| Standard_NC24rs_v3 | 4 (16 GPUs total)  | 16  | 2048 | 256  |
| Standard_NC24rs_v3 | 8 (32 GPUs total)  | 16  | 1024 | 128  |
| Standard_NC24rs_v3 | 16 (64 GPUs total) | 16  | 512  | 64   |
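
To avoid copying values by hand, the phase 2 table above can be kept as a small lookup. The sketch below is one possible shape for that; `PHASE2_SETTINGS` and `phase2_overrides` are hypothetical names, and the numbers are copied directly from the table.

```python
# (VM SKU, node_count) -> (gpu_memory_limit_gb, train_batch_size, gradient_accumulation_steps)
PHASE2_SETTINGS = {
    ("Standard_ND40rs_v2", 1): (32, 4096, 256),
    ("Standard_ND40rs_v2", 2): (32, 2048, 128),
    ("Standard_ND40rs_v2", 4): (32, 1024, 64),
    ("Standard_ND40rs_v2", 8): (32, 512, 32),
    ("Standard_NC24rs_v3", 1): (16, 8192, 1024),
    ("Standard_NC24rs_v3", 2): (16, 4096, 512),
    ("Standard_NC24rs_v3", 4): (16, 2048, 256),
    ("Standard_NC24rs_v3", 8): (16, 1024, 128),
    ("Standard_NC24rs_v3", 16): (16, 512, 64),
}


def phase2_overrides(vm_size, node_count):
    """Return the suggested phase 2 script parameters for a cluster layout."""
    try:
        memory_gb, batch, accum = PHASE2_SETTINGS[(vm_size, node_count)]
    except KeyError:
        raise ValueError(f"No suggested phase 2 settings for {node_count} x {vm_size}") from None
    return {
        "--gpu_memory_limit_gb": memory_gb,
        "--train_batch_size": batch,
        "--gradient_accumulation_steps": accum,
    }


overrides = phase2_overrides("Standard_ND40rs_v2", 4)
print(overrides)
```

The returned dictionary could be merged into the `script_params` of the estimator defined in the next cell, for example with `script_params.update(overrides)`.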

In [None]:

# Define training estimator for phase 2
# Fill in blob path to phase 2 training data as well as phase 1 checkpoint in arguments below
estimator_ph2 = Estimator(source_directory=project_folder,

                    # Compute configuration
                    compute_target = gpu_compute_target,
                    node_count=4, 
                    process_count_per_node=1, # separate MPI jobs
                    distributed_training = mpi,
                    use_gpu = True,
                    
                    # supply Docker image
                    use_docker = True,
                    custom_docker_image = image_name,
                    user_managed = True,
                    
                    # Training script parameters
                    script_params = {
                        # Required Params
                        "--config_file": "bert_config.json",
                        '--input_dir' : ds.path('<blob-path-to-phase2-training-data>').as_mount(), 
                        '--output_dir': ds.path(f'output/{experiment_name}/{output_id}/').as_mount(),
                        '--bert_model' : 'bert-large-uncased',
                        '--train_batch_size' : 1024,
                        '--max_seq_length': 512,
                        '--max_predictions_per_seq': 80,
                        '--max_steps' : 1563,
                        '--warmup_proportion' : '0.128',
                        '--num_steps_per_checkpoint' : 200,
                        '--learning_rate' : '4e-3',
                        '--seed': 42,
                        '--fp16' : '',
                        '--gradient_accumulation_steps' : 64,
                        '--allreduce_post_accumulation' : '',
                        '--allreduce_post_accumulation_fp16' : '',
                        '--do_train' : '',
                        '--phase2' : '',
                        '--resume_from_checkpoint' : '',
                        '--phase1_end_step' : '7038',
                        '--init_checkpoint' : ds.path('<path-to-checkpoint-from-phase-1>'),
                        '--use_ib' : '', # pass if infiniband available on SKU
                        '--gpu_memory_limit_gb' : 32 # set to per GPU memory in GB (check SKU)
                    },
                    
                    entry_script='run_pretraining_ort.py',
                    inputs=[ds.path('<blob-path-to-phase2-training-data>').as_mount()])
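
The `--warmup_proportion` values used in the two phases are easier to interpret as step counts. The sketch below assumes the learning rate ramps up over the first `warmup_proportion * max_steps` optimizer steps and truncates the result to a whole step; both the assumption and the `warmup_steps` helper are illustrative rather than taken from the training script.

```python
def warmup_steps(max_steps, warmup_proportion):
    """Number of warmup steps implied by a proportion of max_steps (truncated)."""
    return int(max_steps * warmup_proportion)


# Values from the phase 1 and phase 2 estimators in this notebook
phase1_warmup = warmup_steps(7038, 0.2843)
phase2_warmup = warmup_steps(1563, 0.128)
print("phase 1 warmup steps:", phase1_warmup)
print("phase 2 warmup steps:", phase2_warmup)
```

Under this assumption, phase 1 warms up for about 2000 of its 7038 steps and phase 2 for about 200 of its 1563 steps.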

### Run AzureML experiment - Phase 2 of pretraining

In [None]:
# Submit phase 2 run (check logs from Outputs + logs tab of corresponding link)
run = experiment.submit(estimator_ph2)
RunDetails(run).show()
print(run.get_portal_url())