Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference Bert Model for High Performance with ONNX Runtime on AzureML #

This tutorial shows how to pretrain and finetune a Bert model using AzureML, convert it to ONNX, and then deploy the ONNX model with ONNX Runtime through AzureML. The following sections use a Bert model trained on the Stanford Question Answering Dataset (SQuAD) as an example. A Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

## Roadmap

0. **Prerequisites** to set up your Azure ML work environments.
1. **Pre-train, finetune and export Bert model** from another framework using Azure ML.
2. **Deploy Bert model using ONNX Runtime and AzureML**

## Step 0 - Prerequisites
If you are using an [Azure Machine Learning Notebook VM](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-run-cloud-notebook), you are all set. Otherwise, refer to the [configuration Notebook](https://github.com/Azure/MachineLearningNotebooks/blob/56e0ebc5acb9614fac51d8b98ede5acee8003820/configuration.ipynb) first if you haven't already to establish your connection to the AzureML Workspace. Prerequisites are:
* Azure subscription
* Azure Machine Learning Workspace
* Azure Machine Learning SDK

Also, to make the best use of your time, make sure you have done the following:
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* Know that the [Azure Portal](https://portal.azure.com) allows you to track the status of your deployments.

## Step 1 - Pretrain, Finetune and Export Bert Model (PyTorch)

If you'd like to pre-train and finetune a Bert model from scratch, follow the instructions in [Pretraining of the BERT model](https://github.com/microsoft/AzureML-BERT/blob/master/pretrain/PyTorch/notebooks/BERT_Pretrain.ipynb) to pretrain a Bert model in PyTorch using AzureML. Once you have the pretrained model, refer to [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb) or [AzureML Bert Eval GLUE](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_GLUE.ipynb) to finetune your model with your desired dataset. Follow the tutorials all the way through **Create a PyTorch estimator for fine-tuning**. Before creating a PyTorch estimator, we need to prepare an entry file that both trains the PyTorch model and exports it. Make sure the entry file has the following code to create an ONNX file:

In [None]:
output_model_path = "bert_azureml_large_uncased.onnx"

# set the model to inference mode
# It is important to call torch_model.eval() or torch_model.train(False) before exporting the model, 
# to turn the model to inference mode. This is required since operators like dropout or batchnorm 
# behave differently in inference and training mode.
model.eval()

# Generate dummy inputs to the model. Adjust if necessary
inputs = {
        'input_ids':   torch.randint(32, [2, 32], dtype=torch.long).to(device), # list of numerical ids for the tokenised text
        'attention_mask': torch.ones([2, 32], dtype=torch.long).to(device),        # dummy list of ones
        'token_type_ids':  torch.ones([2, 32], dtype=torch.long).to(device),        # dummy list of ones
    }

symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
torch.onnx.export(model,                                        # model being run
                  (inputs['input_ids'], 
                   inputs['attention_mask'], 
                   inputs['token_type_ids']),                   # model input (or a tuple for multiple inputs)
                  output_model_path,                            # where to save the model (can be a file or file-like object)
                  opset_version=11,                             # the ONNX opset version to export the model to
                  do_constant_folding=True,                     # whether to execute constant folding for optimization
                  input_names=['input_ids', 
                               'input_mask', 
                               'segment_ids'],                   # the model's input names
                  output_names=['start', "end"],                 # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,              
                                'input_mask' : symbolic_names,
                                'segment_ids' : symbolic_names,
                                'start' : symbolic_names, 
                                'end': symbolic_names})     # variable length axes

In this directory, a `run_squad_azureml.py` containing the above code is available for use. Copy the training script `run_squad_azureml.py` to your `project_root` (defined at an earlier step in [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb))

In [None]:
import shutil

shutil.copy('run_squad_azureml.py', project_root)

Now you may continue to follow the **Create a PyTorch estimator for fine-tuning** section in [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb). When creating the estimator, change the `entry_script` parameter to point to the `run_squad_azureml.py` we just copied, as noted in the following code.

In [None]:
estimator = PyTorch(source_directory=project_root,
                    # extend this dict with the other script arguments from the fine-tuning notebook
                    script_params={'--output-dir': './outputs'},
                    compute_target=gpu_compute_target,
                    use_docker=True,
                    custom_docker_image=image_name,
                    entry_script='run_squad_azureml.py', # change here
                    node_count=1,
                    process_count_per_node=4,
                    distributed_backend='mpi',
                    use_gpu=True)

Follow the rest of the [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb) to run and export your model. 

## Step 2 - Deploy Bert model with ONNX Runtime through AzureML

In Step 1, we prepared an optimized ONNX Bert model. We can now deploy this model as a web service using Azure Machine Learning services and the ONNX Runtime.

We're now going to deploy our ONNX model on Azure ML using the following steps.

1. **Register our model** in our Azure Machine Learning workspace
2. **Write a scoring file** to evaluate our model with ONNX Runtime
3. **Write an environment file** for our Docker container image.
4. **Deploy to the cloud** using an Azure Container Instances VM and use it to make predictions using ONNX Runtime Python APIs
5. **Answer sample questions** so we can explore inference with our deployed service.

![End-to-end pipeline with ONNX Runtime](https://raw.githubusercontent.com/vinitra/models/gtc-demo/gtc-demo/E2EPicture.png)

## Step 2.0 - Check your AzureML environment

In [None]:
# Check core SDK version number
import azureml.core
import json
import numpy as np

print("SDK version:", azureml.core.VERSION)

### Load your Azure ML workspace

We begin by instantiating a workspace object from the existing workspace created earlier in the configuration notebook.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep = '\n')

## Step 2.1 - Register your model with Azure ML

Now we upload the model and register it in the workspace. In this tutorial, we use the Bert SQuAD model produced in Step 1 as an example.

You can register the model from your run to your workspace. The `model_path` parameter takes the relative path on the remote VM to the model file in your outputs directory. You can then deploy this registered model as a web service through the AzureML SDK.

In [None]:
from azureml.core.model import Model

model = run.register_model(model_path = "./bert_azureml_large_uncased.onnx", # Path to the ONNX model file to register
                           model_name = "bert-squad-large-uncased", # Name of the registered model in your workspace.
                           model_framework=Model.Framework.ONNX , # Framework used to create the model.
                           model_framework_version='1.6', # Version of ONNX used to create the model.
                           tags = {"onnx": "demo"},
                           description = "Bert-large-uncased squad model exported from PyTorch",
                           workspace = ws)

Alternatively, if you're working on a local model and want to deploy it to AzureML, upload your model to the same directory as this notebook and register it with `Model.register()`

In [None]:
from azureml.core.model import Model

model = Model.register(model_path = "./bert_azureml_large_uncased.onnx", # Local ONNX model to upload and register as a model
                       model_name = "bert-squad-large-uncased", # Name of the registered model in your workspace.
                       model_framework=Model.Framework.ONNX , # Framework used to create the model.
                       model_framework_version='1.6', # Version of ONNX used to create the model.
                       tags = {"onnx": "demo"},
                       description = "Bert-large-uncased squad model exported from PyTorch",
                       workspace = ws)

#### Displaying your registered models

You can optionally list out all the models that you have registered in this workspace.

In [None]:
models = ws.models
for name, m in models.items():
    print("Name:", name,"\tVersion:", m.version, "\tDescription:", m.description, m.tags)
    
#     # If you'd like to delete the models from workspace
#     model_to_delete = Model(ws, name)
#     model_to_delete.delete()

## Step 2.2 - Write scoring file

We are now going to deploy our ONNX model on Azure ML using the ONNX Runtime. We begin by writing a score.py file that will be invoked by the web service call. The `init()` function is called once when the container is started, so we use it to load the model with ONNX Runtime into a global session object. The `run()` function is called each time we run the model through the Azure ML web service. Add any necessary `preprocess()` and `postprocess()` steps. The following score.py file uses `bert-squad` as an example and assumes the inputs will be in the following format.

In [None]:
inputs_json = {
  "version": "1.4",
  "data": [
    {
      "paragraphs": [
        {
          "context": "In its early years, the new convention center failed to meet attendance and revenue expectations.[12] By 2002, many Silicon Valley businesses were choosing the much larger Moscone Center in San Francisco over the San Jose Convention Center due to the latter's limited space. A ballot measure to finance an expansion via a hotel tax failed to reach the required two-thirds majority to pass. In June 2005, Team San Jose built the South Hall, a $6.77 million, blue and white tent, adding 80,000 square feet (7,400 m2) of exhibit space",
          "qas": [
            {
              "question": "where is the businesses choosing to go?",
              "id": "1"
            },
            {
              "question": "how may votes did the ballot measure need?",
              "id": "2"
            },
            {
              "question": "When did businesses choose Moscone Center?",
              "id": "3"
            }
          ]
        }
      ],
      "title": "Conference Center"
    }
  ]
}
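
The nesting above follows the SQuAD layout: `data`, then `paragraphs`, then `qas`. As a minimal sketch (the helper name `iter_questions` is hypothetical and not part of the tutorial's scripts), the questions in such a payload can be flattened for inspection before they are sent to the service:

```python
def iter_questions(payload):
    """Yield (question_id, question, context) for every question in a SQuAD-style payload."""
    for article in payload["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield qa["id"], qa["question"], context


# Small stand-in payload with the same nesting as inputs_json
sample = {
    "version": "1.4",
    "data": [
        {
            "title": "Conference Center",
            "paragraphs": [
                {
                    "context": "Many businesses chose the Moscone Center in San Francisco.",
                    "qas": [
                        {"question": "Which center did businesses choose?", "id": "1"},
                        {"question": "Where is the Moscone Center?", "id": "2"},
                    ],
                }
            ],
        }
    ],
}

for qid, question, _context in iter_questions(sample):
    print(qid, question)
```

A missing `context`, `qas`, or `id` key raises a `KeyError` here, which makes malformed payloads easy to spot locally.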

In [None]:
%%writefile score.py
import os
import collections
import json
import time
from azureml.core.model import Model
import numpy as np    # we're going to use numpy to process input and output data
import onnxruntime    # to inference ONNX models, we use the ONNX Runtime
import wget
from pytorch_pretrained_bert.tokenization import whitespace_tokenize, BasicTokenizer, BertTokenizer

def init():
    global session, tokenizer
    # Use AZUREML_MODEL_DIR to locate your deployed model(s). If multiple models are deployed, the path has the form:
    # model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), '$MODEL_NAME/$VERSION/$MODEL_FILE_NAME')
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'bert_azureml_large_uncased.onnx')
    sess_options = onnxruntime.SessionOptions()
    
    # You may need to set environment variables like OMP_NUM_THREADS for OpenMP to get the best performance.
    # See https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/bert/notebooks/PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb
    sess_options.intra_op_num_threads = 1
    
    session = onnxruntime.InferenceSession(model_path, sess_options)
    
    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased", do_lower_case=True)
    
    # download run_onnx_squad.py and tokenization.py from 
    # https://github.com/onnx/models/tree/master/text/machine_comprehension/bert-squad to 
    # help with preprocessing and post-processing. 
    if not os.path.exists('./run_onnx_squad.py'):
        url = "https://raw.githubusercontent.com/onnx/models/master/text/machine_comprehension/bert-squad/dependencies/run_onnx_squad.py"
        wget.download(url, './run_onnx_squad.py')

    if not os.path.exists('./tokenization.py'):
        url = "https://raw.githubusercontent.com/onnx/models/master/text/machine_comprehension/bert-squad/dependencies/tokenization.py"
        wget.download(url, './tokenization.py')

def preprocess(input_data_json):
    
    global all_examples, extra_data
    
    # Model configs. Adjust as needed.
    max_seq_length = 128
    doc_stride = 128
    max_query_length = 64

    # Write the input json to file to be used by read_squad_examples()
    input_data_file = "input.json"
    with open(input_data_file, 'w') as outfile:
        json.dump(json.loads(input_data_json), outfile)
    
    from run_onnx_squad import read_squad_examples, convert_examples_to_features
    # Use read_squad_examples method from run_onnx_squad to read the input file
    all_examples = read_squad_examples(input_file=input_data_file)
    
    

    # Use convert_examples_to_features method from run_onnx_squad to get parameters from the input 
    input_ids, input_mask, segment_ids, extra_data = convert_examples_to_features(all_examples, tokenizer,
                                                                              max_seq_length, doc_stride, max_query_length)
    return input_ids, input_mask, segment_ids

def postprocess(all_results):
    # postprocess results
    from run_onnx_squad import write_predictions

    n_best_size = 20
    max_answer_length = 30
    output_dir = 'predictions'
    os.makedirs(output_dir, exist_ok=True)
    output_prediction_file = os.path.join(output_dir, "predictions.json")
    output_nbest_file = os.path.join(output_dir, "nbest_predictions.json")
    # Write the predictions (answers to the questions) in a file.
    write_predictions(all_examples, extra_data, all_results,
                    n_best_size, max_answer_length,
                    True, output_prediction_file, output_nbest_file)
    # Retrieve best results from file.
    result = {}
    with open(output_prediction_file, "r") as f:
        result = json.load(f)
    return result

def run(input_data_json):
    try:
        # load in our data
        input_ids, input_mask, segment_ids = preprocess(input_data_json)
        RawResult = collections.namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])
        
        n = len(input_ids)
        bs = 1
        all_results = []
        start = time.time()
        for idx in range(0, n):
            # this is using batch_size=1
            # feed the input data as int64
            data = {
                    "segment_ids": segment_ids[idx:idx+bs],
                    "input_ids": input_ids[idx:idx+bs],
                    "input_mask": input_mask[idx:idx+bs]
                    }
            # outputs are returned in the order requested: [start, end]
            result = session.run(["start", "end"], data)
            in_batch = result[0].shape[0]
            start_logits = [float(x) for x in result[0][0].flat]
            end_logits = [float(x) for x in result[1][0].flat]
            for i in range(0, in_batch):
                unique_id = len(all_results)
                all_results.append(RawResult(unique_id=unique_id, start_logits=start_logits, end_logits=end_logits))
                
        end = time.time()
        print("total time: {}sec, {}sec per item".format(end - start, (end - start) / len(all_results)))
        return {"result": postprocess(all_results),
                "total_time": end - start, 
               "time_per_item": (end - start) / len(all_results)}
    except Exception as e:
        result = str(e)
        return {"error": result}
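
`preprocess()` passes `max_seq_length`, `doc_stride`, and `max_query_length` to `convert_examples_to_features`. A context that does not fit into one sequence is typically split into overlapping windows. The sketch below illustrates that sliding-window idea in isolation. It is modeled on the reference BERT SQuAD preprocessing and is an assumption about, not a copy of, what `run_onnx_squad.py` does; the function name `doc_spans` is hypothetical.

```python
def doc_spans(num_doc_tokens, num_query_tokens, max_seq_length=128, doc_stride=128):
    """Return (start, length) windows that together cover a tokenized context."""
    # Reserve room for the query plus one [CLS] and two [SEP] tokens
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("query leaves no room for context tokens")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break
        # Advance by at most doc_stride; a smaller stride gives more overlap
        start += min(length, doc_stride)
    return spans


print(doc_spans(num_doc_tokens=300, num_query_tokens=10))
print(doc_spans(num_doc_tokens=300, num_query_tokens=10, doc_stride=64))
```

Under this scheme each window becomes one model input, so a long context can produce more than one feature per question. That is why `run()` iterates over the number of features rather than the number of questions.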

## Step 2.3 - Write Environment File

We create a YAML file that specifies which dependencies we would like to see in our container.

In [None]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies.create(pip_packages=["numpy","onnxruntime","azureml-core", "azureml-defaults", "tensorflow", "wget", "pytorch_pretrained_bert"])

with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

We're all set! Let's get our model chugging.

## Step 2.4 - Deploy Model as Webservice on Azure Container Instance

In [None]:
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment

myenv = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 4, 
                                               tags = {'demo': 'onnx'}, 
                                               description = 'web service for Bert-squad-large-uncased ONNX model')

The following cell will likely take a few minutes to run.

In [None]:
from azureml.core.webservice import Webservice
from random import randint

aci_service_name = 'onnx-bert-squad-large-uncased-'+str(randint(0,100))
print("Service", aci_service_name)

aci_service = Model.deploy(ws, 
                           aci_service_name, 
                           [model], 
                           inference_config, 
                           aciconfig)

aci_service.wait_for_deployment(True)
print(aci_service.state)

In case the deployment fails, you can check the logs. Make sure to delete your aci_service before trying again.

In [None]:
if aci_service.state != 'Healthy':
    # run this command for debugging.
    print(aci_service.get_logs())
    aci_service.delete()
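
Many deployment failures originate in the scoring script rather than in the service itself. As a minimal sketch of the `init()`/`run()` contract described in Step 2.2, the snippet below uses a stand-in `FakeSession` class (hypothetical, in place of an ONNX Runtime session) so the control flow can be exercised locally without a model file:

```python
import json


class FakeSession:
    """Stand-in for an inference session; it only counts the questions it receives."""

    def run(self, payload):
        count = sum(len(p["qas"]) for a in payload["data"] for p in a["paragraphs"])
        return {"num_questions": count}


session = None


def init():
    # Called once at container start: load expensive state into a global
    global session
    session = FakeSession()


def run(input_data_json):
    # Called per request with the raw JSON string; errors are returned, not raised
    try:
        payload = json.loads(input_data_json)
        return {"result": session.run(payload)}
    except Exception as e:
        return {"error": str(e)}


init()
request = {"data": [{"paragraphs": [{"context": "c", "qas": [{"question": "q", "id": "1"}]}]}]}
print(run(json.dumps(request)))
print(run("not json"))
```

Running the real `score.py` the same way (call `init()`, then `run()` with a JSON string) in an environment that has its dependencies installed can surface import or path errors before a deployment is attempted.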

## Success!

If you've made it this far, you've deployed a working web service that does question answering using an ONNX model. You can get the URL for the webservice with the code below.

In [None]:
print(aci_service.scoring_uri)
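
Any HTTP client can, in principle, call this URI. The sketch below only builds the request with the standard library and does not send it. It assumes the service accepts a JSON body via POST and has no authentication enabled; the URI shown is a placeholder and `build_scoring_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, payload):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URI; replace it with the value of aci_service.scoring_uri
req = build_scoring_request("http://example.invalid/score", {"version": "1.4", "data": []})
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req).read()
```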

## Step 2.5 - Inference Bert Model using our WebService

**Input**: Context paragraph and questions, formatted as in `inputs_json` from Step 2.2

**Task**: For each question about the context paragraph, the model predicts the start and end tokens in the paragraph that most likely delimit the answer.

**Output**: The best answer for each question.
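
To make the task concrete: given per-token start and end scores, the answer is the span with the best combined score, subject to the start not coming after the end and a maximum answer length. The sketch below is a simplified illustration with made-up scores. The function name `best_span` is hypothetical, and the actual post-processing in `write_predictions` does more (for example n-best lists and mapping tokens back to the original text).

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return (start, end, score) of the highest-scoring valid span."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_length)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up tokens and scores for illustration only
tokens = ["the", "moscone", "center", "in", "san", "francisco"]
start_logits = [0.1, 2.5, 0.3, 0.0, 0.2, 0.1]
end_logits = [0.0, 0.4, 3.0, 0.1, 0.2, 0.5]

s, e, _ = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))
```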

In [None]:
# Use the inputs from step 2.2
print("========= INPUT DATA =========")
print(json.dumps(inputs_json, indent=2))
azure_result = aci_service.run(json.dumps(inputs_json))
print("\n")
print("========= RESULT =========")
print(json.dumps(azure_result, indent=2))

In [None]:
res = azure_result['result']
inference_time = np.round(azure_result['total_time'] * 1000, 2)
time_per_item = np.round(azure_result['time_per_item'] * 1000, 2)

print('========================================')
print('Final predictions are: ')
for key in res:
    print("Question: ", inputs_json['data'][0]['paragraphs'][0]['qas'][int(key) - 1]['question'])
    print("Best Answer: ", res[key])
    print()

print('========================================')
print('Inference time: ' + str(inference_time) + " ms")
print('Average inference time for each question: ' + str(time_per_item) + " ms")
print('========================================')

When you are eventually done using the web service, remember to delete it.

In [None]:
aci_service.delete()