Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference Bert Model for High Performance with ONNX Runtime on AzureML #

This tutorial takes a pre-trained BERT model, converts it to ONNX, and deploys the ONNX model with ONNX Runtime through AzureML.
In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

## Contents

**Prerequisites** to set up your Azure ML work environments

**Obtain model and convert to ONNX**

**Deploy Bert model using ONNX Runtime and AzureML**

## Prerequisites

To run on AzureML, you need:
* Azure subscription
* Azure Machine Learning Workspace
* the Azure Machine Learning SDK

You might also find the following resources useful:
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* The [Azure Portal](https://portal.azure.com) allows you to track the status of your deployments.

In [1]:
# To install dependencies directly run the following
!pip install torch
!pip install transformers
!pip install azureml azureml.core
!pip install onnxruntime
!pip install matplotlib

# To create a a Jupter kernel from your conda environment, run the following. replacing <kernel name> with your own name
#   conda install -c anaconda ipykernel
#   python -m ipykernel install --user --name=<kernel name>

Collecting azureml
  Using cached azureml-0.2.7-py2.py3-none-any.whl (23 kB)
Collecting azureml.core
  Using cached azureml_core-1.40.0-py3-none-any.whl (2.7 MB)
Collecting azure-mgmt-containerregistry<9.0.0,>=8.2.0
  Using cached azure_mgmt_containerregistry-8.2.0-py2.py3-none-any.whl (928 kB)
Collecting msal-extensions<0.4,>=0.3.0
  Using cached msal_extensions-0.3.1-py2.py3-none-any.whl (18 kB)
Collecting knack~=0.9.0
  Using cached knack-0.9.0-py3-none-any.whl (59 kB)
Collecting msal<2.0.0,>=1.15.0
  Using cached msal-1.17.0-py2.py3-none-any.whl (79 kB)
Collecting azure-mgmt-resource<21.0.0,>=15.0.0
  Using cached azure_mgmt_resource-20.1.0-py3-none-any.whl (2.3 MB)
Collecting azure-mgmt-storage<20.0.0,>=16.0.0
  Using cached azure_mgmt_storage-19.1.0-py3-none-any.whl (1.8 MB)
[31mERROR: azureml-widgets 1.34.0 has requirement azureml-core~=1.34.0, but you'll have azureml-core 1.40.0 which is incompatible.[0m
[31mERROR: azureml-train-core 1.34.0 has requirement azureml-core~=1.34

## Obtain and convert PyTorch model to ONNX format

In the code below, we obtain a BERT model fine-tuned for question answering with the SQUAD dataset from HuggingFace.

If you'd like to pre-train a BERT model from scratch, follow the instructions in
[Pretraining of the BERT model](https://github.com/microsoft/AzureML-BERT/blob/master/pretrain/PyTorch/notebooks/BERT_Pretrain.ipynb). 
And if you would like to fine-tune the model with your own dataset, refer to  [AzureML Bert Eval Squad](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb)
or [AzureML Bert Eval GLUE](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/BERT_Eval_GLUE.ipynb).


### Define the tokenizer and model

In [2]:
from transformers import BertTokenizer, BertForQuestionAnswering

model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

  from .autonotebook import tqdm as notebook_tqdm


### Sample input and question

In [18]:
question = "What is a major importance of Southern California in relation to California and the United States?"
context = "Southern California, often abbreviated SoCal, is a geographic and cultural region that generally comprises California's southernmost 10 counties. The region is traditionally described as \"eight counties\", based on demographics and economic ties: Imperial, Los Angeles, Orange, Riverside, San Bernardino, San Diego, Santa Barbara, and Ventura. The more extensive 10-county definition, including Kern and San Luis Obispo counties, is also used based on historical political divisions. Southern California is a major economic center for the state of California and the United States."

### Run the PyTorch model

In [3]:
def preprocess(question, context):
    input_ids = tokenizer.encode(question, context)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # First occurance of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    len_question = sep_idx+1
    len_context = len(input_ids) - len_question
    segment_ids = [0]*len_question + [1]*len_context
    return (input_ids, segment_ids, tokens)
    

In [12]:
def postprocess(tokens, output):
    results = {}
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
        results['question'] = question.capitalize()
        results['answer'] = answer.capitalize()
    else:
        results['error'] = "I am unable to find the answer to this question. Can you please ask another question?"
    return results

In [19]:
import sys 
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

input_ids, segment_ids, tokens = preprocess(question, context)

output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids])) 

results = postprocess(tokens, output)
 

In [20]:
results

{'question': 'What is a major importance of southern california in relation to california and the united states?',
 'answer': 'Economic center'}

### Export the model

In [21]:
import torch

output_model_path = "./" + model_name + ".onnx"

device = 'cpu'

# set the model to inference mode
# It is important to call torch_model.eval() or torch_model.train(False) before exporting the model, 
# to turn the model to inference mode. This is required since operators like dropout or batchnorm 
# behave differently in inference and training mode.
model.eval()

# Generate dummy inputs to the model. Adjust if neccessary
inputs = {
        'input_ids':   torch.randint(32, [1, 32], dtype=torch.long).to(device), # list of numerical ids for the tokenised text
        'token_type_ids':  torch.ones([1, 32], dtype=torch.long).to(device),    # dummy list of ones
    }

symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
torch.onnx.export(model,                                        # model being run
                  (inputs['input_ids'], 
                   inputs['token_type_ids']),                   # model input (or a tuple for multiple inputs)
                  output_model_path,                            # where to save the model (can be a file or file-like object)
                  opset_version=11,                             # the ONNX version to export the model to
                  do_constant_folding=True,                     # whether to execute constant folding for optimization
                  input_names=['input_ids', 
                               'segment_ids'],                   # the model's input names
                  output_names=['start', "end"],                 # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,              
                                'segment_ids' : symbolic_names,
                                'start' : symbolic_names, 
                                'end': symbolic_names})          # variable length axes

## Run the ONNX model with ONNX Runtime



In [None]:
import onnxruntime
import torch
from transformers import BertTokenizer, BertForQuestionAnswering




# Run the torch model as a test
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids])) 

# Create an ONNX Runtime session to run the ONNX model
session = onnxruntime.InferenceSession(output_model_path)  

inputs = {
        'input_ids':   [input_ids], 
        'segment_ids': [segment_ids]
        }
                    
result = session.run(["start", "end"], inputs)

#tokens with highest start and end scores
answer_start = torch.argmax(torch.from_numpy(result[0]))
answer_end = torch.argmax(torch.from_numpy(result[1]))
if answer_end >= answer_start:
    answer = tokens[answer_start]
    for i in range(answer_start+1, answer_end+1):
        if tokens[i][0:2] == "##":
            answer += tokens[i][2:]
        else:
            answer += " " + tokens[i]
    print("\nQuestion:\n{}".format(question.capitalize()))
    print("\nAnswer:\n{}.".format(answer.capitalize()))
else:
    print("I am unable to find the answer to this question. Can you please ask another question?")

## Deploy model with ONNX Runtime through AzureML

Now that we have prepared ONNX Bert model, we can deploy it using Azure ML and the ONNX Runtime.

1. **Register our model** in our Azure Machine Learning workspace
2. **Write a scoring file** to evaluate our model with ONNX Runtime
3. **Write environment file** for our Docker container image.
4. **Deploy to the web** using an AzureML endpoint 
5. **Classify sample text input** so we can explore inference with our endpoint.


## Check your environment

In [39]:
# Check core SDK version number
import azureml.core
import onnxruntime
import torch
import transformers

print("Transformers version: ", transformers.__version__)
torch_version = torch.__version__
print("Torch (ONNX exporter) version: ", torch_version)
print("Azure SDK version:", azureml.core.VERSION)
print("ONNX Runtime version: ", onnxruntime.__version__)


Transformers version:  4.17.0
Torch (ONNX exporter) version:  1.10.0
Azure SDK version: 1.40.0
ONNX Runtime version:  1.11.0


### Load your Azure ML workspace

We begin by instantiating a workspace object from the existing workspace created earlier in the configuration notebook.

In [45]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep = '\n')

ort_training_dev
australiaeast
onnx_training


## Register your model with Azure ML

Now we upload the model and register it in the workspace.


In [15]:
from azureml.core.model import Model

model = Model.register(model_path = output_model_path,          # Name of the registered model in your workspace.
                       model_name = model_name,                 # Local ONNX model to upload and register as a model
                       model_framework=Model.Framework.ONNX ,   # Framework used to create the model.
                       model_framework_version=torch_version,   # Version of ONNX used to create the model.
                       tags = {"onnx": "demo"},
                       description = "HuggingFace Bert model fine-tuned with SQuAd and exported from PyTorch",
                       workspace = ws)

Registering model bert-large-uncased-whole-word-masking-finetuned-squad


#### Displaying your registered models

You can list out all the models that you have registered in this workspace.

In [16]:
models = ws.models
for name, m in models.items():
    print("Name:", name,"\tVersion:", m.version, "\tDescription:", m.description, m.tags)
    
#     # If you'd like to delete the models from workspace
#     model_to_delete = Model(ws, name)
#     model_to_delete.delete()

Name: hf-gpt2.onnx 	Version: 1 	Description: ONNX version of base HuggingFace GPT-2 {}
Name: hf-gpt2.pt 	Version: 1 	Description: GPT-2 model saved from pre-trained HuggingFace {}
Name: pytorch-hf-gpt-onnx-int8 	Version: 1 	Description: None {}
Name: pytorch-hf-gpt2-wikitext103 	Version: 1 	Description: None {}
Name: pt-ort-hf-gpt2-wt103-full 	Version: 1 	Description: HuggingFace GPT-2 fine-tuned with PyTorch ORT using Wikitext103 {}
Name: sample-densenet-onnx-model 	Version: 1 	Description: None {}
Name: bert-large-uncased-whole-word-masking-finetuned-squad 	Version: 1 	Description: HuggingFace Bert model fine-tuned with SQuAd and exported from PyTorch {'onnx': 'demo'}


## Deploy the model 

We are now going to deploy our ONNX model on Azure ML using ONNX Runtime.

Firstly we will test the deployment using an Azure Container Instance, then deploy the model for production using an Azure ML endpoint.



### Scoring (prediction) code

We begin by writing a `score.py` file that performs the prediction.

The `init()` function is called at startup, performing the one-off operations such as creating the tokenizer and the ONNX Runtime session.

The `run()` function is called when we run the model using the Azure ML web service.
Add neccessary `preprocess()` and `postprocess()` steps.

The following score.py file assumes the inputs will be in the format of the example above. 

In [84]:
%%writefile score.py
import os
import collections
import json
import time
import numpy as np    # we're going to use numpy to process input and output data
import onnxruntime    # to inference ONNX models, we use the ONNX Runtime
from transformers import BasicTokenizer, BertTokenizer
from azureml.core.model import Model

def init():
    global session, tokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased", do_lower_case=True)

    # use AZUREML_MODEL_DIR to get your deployed model(s). If multiple models are deployed, 
    # model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), '$MODEL_NAME/$VERSION/$MODEL_FILE_NAME')
    # Use the local directory if the environment is not set (for local testing)
    model_dir = os.getenv('AZUREML_MODEL_DIR')
    if model_dir is None:
        model_dir = "."
    model_path = os.path.join(model_dir, output_model_path)
    sess_options = onnxruntime.SessionOptions()
    
    # Set environment variables like OMP_NUM_THREADS for OpenMP to get best performance.
    # See https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/bert/notebooks/PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb
    sess_options.intra_op_num_threads = 1
    
    session = onnxruntime.InferenceSession(model_path, sess_options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    
    
def preprocess(input_data_json):
    
    global all_examples, extra_data
    
    # Model configs. Adjust as needed.
    max_seq_length = 128
    doc_stride = 128
    max_query_length = 64
    is_training = False
        
    #input_ids, input_mask, segment_ids, extra_data = transformers.squad_convert_examples_to_features(input_data_json, tokenizer,
    #                                                                            max_seq_length, doc_stride, max_query_length, is_training)

    question = input_data_json["question"]
    context = input_data_json["context"]
    encoding = tokenizer.encode([question, context])
    return encoding.input_ids, encoding.attention_mask, encoding.token_type_ids

def postprocess(all_results):
    # postprocess results
    # from run_onnx_squad import write_predictions

    n_best_size = 20
    max_answer_length = 30
    output_dir = 'predictions'
    os.makedirs(output_dir, exist_ok=True)
    output_prediction_file = os.path.join(output_dir, "predictions.json")
    #output_nbest_file = os.path.join(output_dir, "nbest_predictions.json")
    # Write the predictions (answers to the questions) in a file.
    #write_predictions(all_examples, extra_data, all_results,
    #                n_best_size, max_answer_length,
    #                True, output_prediction_file, output_nbest_file)
    # Retrieve best results from file.
    result = {}
    with open(output_prediction_file, "r") as f:
        result = json.load(f)
    return result

def run(input_data_json):
    try:
        # load in our data
        input_ids, input_mask, segment_ids = preprocess(input_data_json)
        print(input_ids, input_mask, segment_ids)
        RawResult = collections.namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])
        print("RawResults: ", RawResult)
        
        n = len(input_ids)
        bs = 1
        all_results = []
        start = time.time()
        for idx in range(0, n):
            # this is using batch_size=1
            # feed the input data as int64
            data = {
                    "segment_ids": segment_ids[idx:idx+bs],
                    "input_ids": input_ids[idx:idx+bs],
                    "input_mask": input_mask[idx:idx+bs]
                    }
            result = session.run(["start", "end"], data)
            print("result: ", result)
            in_batch = result[0].shape[0]
            start_logits = [float(x) for x in result[1][0].flat]
            end_logits = [float(x) for x in result[0][0].flat]
            for i in range(0, in_batch):
                unique_id = len(all_results)
                all_results.append(RawResult(unique_id=unique_id, start_logits=start_logits, end_logits=end_logits))
                
        end = time.time()
        print("total time: {}sec, {}sec per item".format(end - start, (end - start) / len(all_results)))
        return {"result": postprocess(all_results),
                "total_time": end - start, 
               "time_per_item": (end - start) / len(all_results)}
    except Exception as e:
        result = str(e)
        return {"error": result}

def main():
    print("Hello World!")

if __name__ == "__main__":
    init()
    outputs = run()
    print(outputs)


Overwriting score.py


In [85]:
# Test the score.py locally
%run -i score.py 


loading file https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt from cache at /home/azureuser/.cache/huggingface/transformers/e12f02d630da91a0982ce6db1ad595231d155a2b725ab106971898276d842ecc.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bert-large-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-large-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-large-uncased/resolve/main/tokenizer_config.json from cache at /home/azureuser/.cache/huggingface/transformers/300ecd79785b4602752c0085f8a89c3f0232ef367eda291c79a5600f3778b677.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at /home/azureuser/.cache/huggingface/transformers/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2

### Dependencies

We create a YAML file that specifies the dependencies of the inference application

In [30]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies.create(pip_packages=["numpy","onnxruntime","transformers", "torch", "azureml-core", "azureml-defaults"])

with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

We're all set! Let's get our model chugging.

## Deploy Model as Webservice on Azure Container Instance

The following cell will likely take a few minutes to run as well.

In [32]:
from random import randint

from azureml.core.webservice import Webservice
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment

myenv = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 4, 
                                               tags = {'demo': 'onnx'}, 
                                               description = 'Web service for Bert-squad-large-uncased ONNX model')

# ACI deployment names must be 32 characters or less
aci_service_name = model_name[:28] + '-' + str(randint(0,100))
print("ACI service name: ", aci_service_name)

aci_service = Model.deploy(ws, 
                           aci_service_name, 
                           [model], 
                           inference_config, 
                           aciconfig)

aci_service.wait_for_deployment(True)
print("ACI service state: ", aci_service.state)

ACI service name:  bert-large-uncased-whole-wor-51
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-04-04 21:15:56+00:00 Creating Container Registry if not exists.
2022-04-04 21:15:56+00:00 Registering the environment.
2022-04-04 21:15:58+00:00 Building image..
2022-04-04 21:21:57+00:00 Generating deployment configuration.
2022-04-04 21:22:00+00:00 Submitting deployment to compute..
2022-04-04 21:22:15+00:00 Checking the status of deployment bert-large-uncased-whole-wor-51..
2022-04-04 21:25:06+00:00 Checking the status of inference endpoint bert-large-uncased-whole-wor-51.

Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: bda42140-b82e-4509-bc35-e83c04abd226
More information can be found using '.get_logs()'
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 400,
  "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.
	1. Please check the logs for your container instance: bert-large-uncased-whole-wor-51. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.
	2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
	3. You can also try to run image orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838 locally. Please refer to https://aka.ms/debugimage#service-launc

WebserviceException: WebserviceException:
	Message: Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: bda42140-b82e-4509-bc35-e83c04abd226
More information can be found using '.get_logs()'
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 400,
  "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.
	1. Please check the logs for your container instance: bert-large-uncased-whole-wor-51. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.
	2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
	3. You can also try to run image orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.
	1. Please check the logs for your container instance: bert-large-uncased-whole-wor-51. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.
	2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
	3. You can also try to run image orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information."
    },
    {
      "code": "AciDeploymentFailed",
      "message": "Your container application crashed. Please follow the steps to debug:
	1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.
	2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.
	3. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
	4. View the diagnostic events to check status of container, it may help you to debug the issue.
"RestartCount": 3
"CurrentState": {"state":"Waiting","startTime":null,"exitCode":null,"finishTime":null,"detailStatus":"CrashLoopBackOff: Back-off restarting failed"}
"PreviousState": {"state":"Terminated","startTime":"2022-04-04T21:26:36.546Z","exitCode":111,"finishTime":"2022-04-04T21:26:46.957Z","detailStatus":"Error"}
"Events":
{"count":1,"firstTimestamp":"2022-04-04T21:22:19Z","lastTimestamp":"2022-04-04T21:22:19Z","name":"Pulling","message":"pulling image "orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838@sha256:626ea2a178599e0dbcdddc5626dc2a532b6666adcda17a5b10e95f35bf5971ab"","type":"Normal"}
{"count":1,"firstTimestamp":"2022-04-04T21:24:00Z","lastTimestamp":"2022-04-04T21:24:00Z","name":"Pulled","message":"Successfully pulled image "orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838@sha256:626ea2a178599e0dbcdddc5626dc2a532b6666adcda17a5b10e95f35bf5971ab"","type":"Normal"}
{"count":4,"firstTimestamp":"2022-04-04T21:25:00Z","lastTimestamp":"2022-04-04T21:26:36Z","name":"Started","message":"Started container","type":"Normal"}
{"count":3,"firstTimestamp":"2022-04-04T21:25:10Z","lastTimestamp":"2022-04-04T21:26:04Z","name":"Killing","message":"Killing container with id 7eb73b62601f1ce560c3785b23ca7b1e5cbfd6ed6f24d2c75370a994a388c0d4.","type":"Normal"}
"
    }
  ]
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Service deployment polling reached non-successful terminal state, current service state: Failed\nOperation ID: bda42140-b82e-4509-bc35-e83c04abd226\nMore information can be found using '.get_logs()'\nError:\n{\n  \"code\": \"AciDeploymentFailed\",\n  \"statusCode\": 400,\n  \"message\": \"Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: bert-large-uncased-whole-wor-51. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\",\n  \"details\": [\n    {\n      \"code\": \"CrashLoopBackOff\",\n      \"message\": \"Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: bert-large-uncased-whole-wor-51. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\"\n    },\n    {\n      \"code\": \"AciDeploymentFailed\",\n      \"message\": \"Your container application crashed. Please follow the steps to debug:\n\t1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\n\t2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.\n\t3. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t4. View the diagnostic events to check status of container, it may help you to debug the issue.\n\"RestartCount\": 3\n\"CurrentState\": {\"state\":\"Waiting\",\"startTime\":null,\"exitCode\":null,\"finishTime\":null,\"detailStatus\":\"CrashLoopBackOff: Back-off restarting failed\"}\n\"PreviousState\": {\"state\":\"Terminated\",\"startTime\":\"2022-04-04T21:26:36.546Z\",\"exitCode\":111,\"finishTime\":\"2022-04-04T21:26:46.957Z\",\"detailStatus\":\"Error\"}\n\"Events\":\n{\"count\":1,\"firstTimestamp\":\"2022-04-04T21:22:19Z\",\"lastTimestamp\":\"2022-04-04T21:22:19Z\",\"name\":\"Pulling\",\"message\":\"pulling image \"orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838@sha256:626ea2a178599e0dbcdddc5626dc2a532b6666adcda17a5b10e95f35bf5971ab\"\",\"type\":\"Normal\"}\n{\"count\":1,\"firstTimestamp\":\"2022-04-04T21:24:00Z\",\"lastTimestamp\":\"2022-04-04T21:24:00Z\",\"name\":\"Pulled\",\"message\":\"Successfully pulled image \"orttrainingdf7604408.azurecr.io/azureml/azureml_c24ea65edf5165e43fa82547f30f3838@sha256:626ea2a178599e0dbcdddc5626dc2a532b6666adcda17a5b10e95f35bf5971ab\"\",\"type\":\"Normal\"}\n{\"count\":4,\"firstTimestamp\":\"2022-04-04T21:25:00Z\",\"lastTimestamp\":\"2022-04-04T21:26:36Z\",\"name\":\"Started\",\"message\":\"Started container\",\"type\":\"Normal\"}\n{\"count\":3,\"firstTimestamp\":\"2022-04-04T21:25:10Z\",\"lastTimestamp\":\"2022-04-04T21:26:04Z\",\"name\":\"Killing\",\"message\":\"Killing container with id 7eb73b62601f1ce560c3785b23ca7b1e5cbfd6ed6f24d2c75370a994a388c0d4.\",\"type\":\"Normal\"}\n\"\n    }\n  ]\n}"
    }
}


Failed


In case the deployment fails, you can check the logs. Make sure to delete your aci_service before trying again.

In [28]:
if aci_service.state != 'Healthy':
    # run this command for debugging.
    print(aci_service.get_logs())
    aci_service.delete()

None


## Success!

If you've made it this far, you've deployed a working web service that does image classification using an ONNX model. You can get the URL for the webservice with the code below.

In [26]:
print(aci_service.scoring_uri)

None


## Step 2.5 - Inference Bert Model using our WebService

**Input**: Context paragraph and questions as formatted in `inputs.json`

**Task**: For each question about the context paragraph, the model predicts a start and an end token from the paragraph that most likely answers the questions.

**Output**: The best answer for each question.

In [None]:
# Use the inputs from step 2.2
print("========= INPUT DATA =========")
print(json.dumps(inputs_json, indent=2))
azure_result = aci_service.run(json.dumps(inputs_json))
print("\n")
print("========= RESULT =========")
print(json.dumps(azure_result, indent=2))

In [None]:
res = azure_result['result']
inference_time = np.round(azure_result['total_time'] * 1000, 2)
time_per_item = np.round(azure_result['time_per_item'] * 1000, 2)

print('========================================')
print('Final predictions are: ')
for key in res:
    print("Question: ", inputs_json['data'][0]['paragraphs'][0]['qas'][int(key) - 1]['question'])
    print("Best Answer: ", res[key])
    print()

print('========================================')
print('Inference time: ' + str(inference_time) + " ms")
print('Average inference time for each question: ' + str(time_per_item) + " ms")
print('========================================')

When you are eventually done using the web service, remember to delete it.

In [None]:
aci_service.delete()