# Pretraining of the BERT model

This notebook contains an end-to-end walkthrough of pretraining [BERT: Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805) models using Azure Machine Learning service.

Methodology:
- Initialize an AzureML workspace
- Register a datastore
- Create an experiment
- Provision a compute target
- Create an Estimator
- Configure and Run

## Prerequisites
If you are using an [Azure Machine Learning Notebook VM](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-run-cloud-notebook), you are all set. Otherwise, if you have not already done so, go through the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/56e0ebc5acb9614fac51d8b98ede5acee8003820/configuration.ipynb) first to establish your connection to the AzureML Workspace. The prerequisites are:
* Azure subscription
* Azure Machine Learning Workspace
* Azure Machine Learning SDK

## Library import

In [None]:
# Regular python libraries
import os
import requests
import sys

# AzureML libraries
import azureml.core
from azureml.core import Experiment, Workspace, Datastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.container_registry import ContainerRegistry
from azureml.core.runconfig import MpiConfiguration, RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.tensorboard import Tensorboard
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Workspace setup

Initialize a Workspace object from the existing workspace you created in the Prerequisites step or create a new one.

In [None]:
# Retrieve the workspace
ws = Workspace.setup()

# Print the workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

## Datastore registration

The [BERT paper](https://arxiv.org/pdf/1810.04805) uses the `Wikipedia` and `BookCorpus` datasets for pretraining. This notebook is configured to use the Wikipedia dataset only, but it can be used with other datasets as well, including custom datasets. The preprocessed data should be available in a `Datastore` in the AzureML `Workspace`.

The Wikipedia corpus used for BERT pretraining is preprocessed following the [data prep instructions](https://github.com/microsoft/AzureML-BERT/blob/master/docs/dataprep.md) and uploaded to https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz (70 GB). You need to extract the files, copy them to another Azure blob container, and register that container as a datastore in your workspace to use it in the pretraining job. Additional details on the tar.gz file are available at [artifacts.md](https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md). Detailed instructions for this data transfer will be added later.
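
The extraction step could be sketched as follows, assuming the archive has already been downloaded to local disk with enough free space for both the archive and its contents. The helper name `extract_archive` and any paths passed to it are hypothetical and not part of the AzureML-BERT repository; copying the extracted files to your own blob container is not shown.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list the extracted files.

    Hypothetical helper for illustration only. Extract only archives that
    come from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    # Return paths relative to dest_dir so they are easy to compare
    # against the expected directory layout.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

For example, `extract_archive("bert_data.tar.gz", "extracted")` would unpack the archive into a local `extracted` folder and return the list of files it now contains.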

Alternatively, you can preprocess the raw data from scratch (instructions available at the [data prep notes](https://github.com/microsoft/AzureML-BERT/blob/master/pretrain/pytorch/dataprep/README.md)), upload that to an Azure blob container and use it as the datastore for the job. 

Note: It is also possible to use datasets other than the Wikipedia corpus with this implementation.

The following code assumes that the data has already been copied to an Azure blob container with the directory structure shown below. Retaining this directory structure lets you run this notebook without code changes. If your directory structure is different, modify the `PyTorch` estimator constructor where the datastore is mounted (the `--path` and `inputs` arguments).

       
```
bert_data
│   bert-base.json
│   bert-large.json
│   bert-base-single-node.json
│   bert-large-single-node.json
│
└───512
    │
    └───wiki_pretrain
        │   wikipedia_segmented_part_0.bin
        │   wikipedia_segmented_part_1.bin
        │   ...
        │   wikipedia_segmented_part_98.bin
```
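
Before uploading, a local copy of the data can be checked against this layout. The following is a minimal sketch, assuming the files sit on local disk under a `bert_data` folder; `find_layout_problems` is a hypothetical helper, and it only checks that at least one shard exists rather than assuming a specific shard count.

```python
from pathlib import Path

# Config files expected at the top level of bert_data (taken from the layout above).
EXPECTED_CONFIGS = [
    "bert-base.json",
    "bert-large.json",
    "bert-base-single-node.json",
    "bert-large-single-node.json",
]


def find_layout_problems(root):
    """Return a list of problems found; an empty list means the layout looks right.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    problems = []
    for name in EXPECTED_CONFIGS:
        if not (root / name).is_file():
            problems.append(f"missing config file: {name}")
    shard_dir = root / "512" / "wiki_pretrain"
    if not shard_dir.is_dir():
        problems.append("missing directory: 512/wiki_pretrain")
    elif not any(shard_dir.glob("wikipedia_segmented_part_*.bin")):
        problems.append("no wikipedia_segmented_part_*.bin shards found")
    return problems
```

Calling `find_layout_problems("bert_data")` and printing the result gives a quick list of anything that would force changes to the estimator definition later in the notebook.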

In [None]:
# Register the datastore with the workspace
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='BERT_Preprocessed_Data',
                                             container_name='data',
                                             account_name='<name goes here>', 
                                             account_key='<key goes here>'
                                            )

# Help from: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data
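
To avoid pasting the storage account key into the notebook, the credentials could be read from environment variables instead. This is a minimal sketch: the variable names `BERT_STORAGE_ACCOUNT` and `BERT_STORAGE_KEY` and the helper `load_storage_credentials` are hypothetical, so choose names that fit your setup.

```python
import os


def load_storage_credentials(environ=os.environ):
    """Read the storage account name and key from environment variables.

    The variable names are hypothetical; they are not defined by AzureML.
    """
    try:
        return {
            "account_name": environ["BERT_STORAGE_ACCOUNT"],
            "account_key": environ["BERT_STORAGE_KEY"],
        }
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None
```

The returned dictionary uses the same keyword names as the registration call above, so it could be unpacked into it with `**load_storage_credentials()` in place of the `account_name` and `account_key` arguments.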

In [None]:
# Print the datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

## Create an Experiment

An Experiment is a logical container in an Azure ML Workspace. It hosts run records, which can include run metrics and output artifacts from your experiments.

In [None]:
# Create an experiment
experiment_name = 'BERT-pretraining'
experiment = Experiment(ws, name=experiment_name)

## Provision a cluster

### Introduction to AmlCompute

Azure Machine Learning Compute is managed compute infrastructure that lets you easily create single-node or multi-node compute of the appropriate VM family. It is created within your workspace region and can be shared with other users in your workspace. When a job is submitted, it autoscales by default up to `max_nodes`, and it executes the job in a containerized environment that packages the dependencies you specify.

Since it is managed compute, job scheduling and cluster management are handled internally by Azure Machine Learning service.

For more information on Azure Machine Learning Compute, see the [compute target documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute).

Note: As with other Azure services, there are limits on certain resources (e.g., AmlCompute quota) associated with the Azure Machine Learning service. See the [quota documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) for the default limits and how to request more quota.

### Create a compute target
BERT pretraining on Azure Machine Learning service is supported on 16 x `Standard_NC24s_v3` or 8 x `Standard_ND40_v2` VMs. In the next step, you will create a 16-node (i.e., 64-GPU) AmlCompute cluster of `Standard_NC24s_v3` GPU VMs, if it doesn't already exist in your workspace. The code to create a cluster with 8 `Standard_ND40_v2` VMs is commented out in the cell below.

* vm_size: VM family of the nodes provisioned by AmlCompute. Choose from the sizes returned by `AmlCompute.supported_vmsizes(workspace=ws)`
* max_nodes: Maximum number of nodes to autoscale to while running a job on AmlCompute
* min_nodes: Minimum number of nodes to keep provisioned on AmlCompute; with `min_nodes=0`, the cluster scales down to zero nodes when idle
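
Both supported configurations provide the same total number of GPUs, which a quick calculation makes explicit. This sketch takes the GPU counts per VM from the values this notebook assigns to `mpi.process_count_per_node` (4 and 8); the `total_gpus` helper is hypothetical.

```python
# GPUs per VM as used in this notebook (one MPI process per GPU).
GPUS_PER_VM = {
    "Standard_NC24s_v3": 4,
    "Standard_ND40_v2": 8,
}


def total_gpus(vm_size, node_count):
    """Total GPUs in a cluster of node_count VMs of the given size."""
    return GPUS_PER_VM[vm_size] * node_count


for vm_size, node_count in [("Standard_NC24s_v3", 16), ("Standard_ND40_v2", 8)]:
    print(f"{node_count} x {vm_size}: {total_gpus(vm_size, node_count)} GPUs")
```

If you choose a different VM size, `max_nodes` here and `node_count` and `process_count_per_node` in the estimator need to be adjusted together.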

In [None]:
# Create the compute cluster
gpu_cluster_name = "pretraincluster" 

# Verify that the cluster doesn't exist already
try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC24s_v3', min_nodes=0, max_nodes=16)
    # compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_ND40_v2', min_nodes=0, max_nodes=8)
    
    # create the cluster
    gpu_compute_target = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current cluster. 
#print(gpu_compute_target.status.serialize())

## Estimator definition and run submission

The estimator uses a custom Docker image and `train.py` as the entry script for execution.

For more information on the PyTorch estimator, refer to the [PyTorch training documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

In [None]:
# Define the project folder
project_folder = '..' # This is to allow the libraries stored under pytorch/ to be loaded

# Use a public image published on Azure.
image_name = 'mcr.microsoft.com/azureml/bert:pretrain-openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04'

# Using MPI to execute a distributed run
mpi = MpiConfiguration()
# A Standard_NC24s_v3 VM has 4 GPUs. !!!! Update this appropriately if you use a different VM size !!!!
mpi.process_count_per_node = 4 
# !!!! Use the following for a Standard_ND40_v2 VM !!!!
# mpi.process_count_per_node = 8

# Define the Pytorch estimator
estimator = PyTorch(source_directory=project_folder,
                    # Compute configuration
                    compute_target=gpu_compute_target,
                    node_count=16, 
                    distributed_training=mpi,
                    use_gpu=True,
                    
                    #Docker image
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    
                    # Training script parameters
                    script_params = {
                        # Required Params
                        "--config_file": "bert-large.json",
                        # bert_data is where pre-processed training data are
                        "--path": ds.path('bert_data/').as_mount(),
                        # Optional Params
                        "--max_seq_length": 512,
                        "--max_predictions_per_seq": 80,
                        "--masked_lm_prob": 0.15,
                        "--train_batch_size": 64,
                        "--seed": 42,
                        "--accumulate_gradients": "True",
                        "--gradient_accumulation_steps": 16,
                        "--fp16": "True",
                        "--loss_scale": 0
                    },
                    entry_script='train.py',
                    inputs=[ds.path('bert_data/').as_mount()]
                   )
# path to the Python environment in the custom Docker image
estimator._estimator_config.environment.python.interpreter_path = '/opt/miniconda/envs/amlbert/bin/python'
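
The batch-related parameters above interact with the cluster size. The sketch below works through the arithmetic under an assumption that is not confirmed by this notebook: that `--train_batch_size` is the per-process batch for one optimizer step and that `train.py` splits it evenly across `--gradient_accumulation_steps`. Check `train.py` before relying on these numbers; `batch_sizes` is a hypothetical helper.

```python
def batch_sizes(train_batch_size, gradient_accumulation_steps,
                node_count, processes_per_node):
    """Derive batch sizes for a distributed run.

    ASSUMPTION: train_batch_size is per process and per optimizer step,
    and is divided evenly across the gradient accumulation steps.
    """
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by the accumulation steps")
    world_size = node_count * processes_per_node  # one process per GPU
    return {
        "world_size": world_size,
        "micro_batch_per_process": train_batch_size // gradient_accumulation_steps,
        "global_batch_per_step": train_batch_size * world_size,
    }


# Values from the estimator definition above: 16 nodes with 4 processes each.
print(batch_sizes(train_batch_size=64, gradient_accumulation_steps=16,
                  node_count=16, processes_per_node=4))
```

Under that assumption, each GPU processes small micro-batches and accumulates gradients, which is what lets a large model with a sequence length of 512 fit in GPU memory.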

For single-node (1 `NC24s_v3` VM), multi-GPU runs for debugging purposes, use the following configuration:
- For bert-base: `--config_file`: `bert-base-single-node.json`, `--gradient_accumulation_steps`: `16`, `--train_batch_size`: `1024`
- For bert-large: `--config_file`: `bert-large-single-node.json`, `--gradient_accumulation_steps`: `256`, `--train_batch_size`: `1024`
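
One way to switch between the multi-node and single-node settings without editing the estimator cell by hand is to keep the overrides in a dictionary and merge them into the default parameters. This is a sketch only: `single_node_params` and the two dictionaries are hypothetical names, and only the three parameters listed above are included.

```python
# Multi-node defaults copied from the estimator definition in this notebook.
MULTI_NODE_PARAMS = {
    "--config_file": "bert-large.json",
    "--train_batch_size": 64,
    "--gradient_accumulation_steps": 16,
}

# Single-node debugging overrides listed above, keyed by model size.
SINGLE_NODE_OVERRIDES = {
    "bert-base": {
        "--config_file": "bert-base-single-node.json",
        "--gradient_accumulation_steps": 16,
        "--train_batch_size": 1024,
    },
    "bert-large": {
        "--config_file": "bert-large-single-node.json",
        "--gradient_accumulation_steps": 256,
        "--train_batch_size": 1024,
    },
}


def single_node_params(model, base=MULTI_NODE_PARAMS):
    """Return a copy of base with the single-node overrides for model applied."""
    params = dict(base)  # copy so the defaults are left untouched
    params.update(SINGLE_NODE_OVERRIDES[model])
    return params


print(single_node_params("bert-large"))
```

The merged dictionary would be combined with the remaining entries of `script_params`, presumably together with `node_count=1` in the estimator since the run uses a single VM.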

To resume from the latest checkpoint, use the `load_training_checkpoint` parameter. It loads the latest checkpoint from the current experiment.

In [None]:
# Submit the run
run = experiment.submit(estimator)
RunDetails(run).show()

In [None]:
# The Tensorboard constructor takes a list of runs, so be sure to pass the run in as a single-element list here
tb = Tensorboard([run])
tb.start()

In [None]:
#tb.stop()

In [None]:
#run.cancel()