## Introduction

The goal of this notebook is to demonstrate how to perform inference using the DistilBERT model on OpenVINO. DistilBERT is a popular pre-trained transformer-based model for natural language processing. The `distilbert-base-cased-distilled-squad` variant is trained to answer questions. We will first convert the PyTorch model to ONNX, and then convert the ONNX model to an intermediate representation for optimization and deployment on a CPU using OpenVINO.

## Prerequisites
To follow this tutorial, you need to have Python 3.6 or later installed, along with the following libraries:

- `numpy`
- `openvino`
- `torch`
- `transformers`

You can install these libraries using pip:

`pip install numpy onnxruntime openvino torch transformers`

However, if you have already installed the dependencies from the `requirements.txt` file, you don't need to run the above command.

## Step 1: Load the Model and Tokenizer
First, we need to load the DistilBERT model and tokenizer from the Hugging Face transformers library. The DistilBERT model is pre-trained on a large corpus of text and can be used for various NLP tasks, including question answering.

In [1]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
from pathlib import Path
import openvino.runtime as ov

# Load the pre-trained tokenizer and model
model_ckpt = "distilbert-base-cased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)
model = DistilBertForQuestionAnswering.from_pretrained(model_ckpt)

In [2]:
model_dir = Path("model")
model_dir.mkdir(exist_ok=True)
# String form of the same directory, used by later cells
MODEL_DIR = "model/"

## Step 2: Define the Question and Context
Next, we need to define the question and the context in which to look for the answer. Both are plain strings. Because this is extractive question answering, the model does not generate new text: the answer is always a span of tokens taken directly from the context.

In [3]:
# Define the question and context
question = "What is ONNX?"
context = "ONNX (Open Neural Network Exchange) is an open standard format for representing machine learning models."

## Step 3: Tokenize the Input
Before we can pass the input to the model, we need to tokenize it using the DistilBERT tokenizer. The tokenizer converts the input into a sequence of tokens that can be fed into the model.

In [4]:
# Tokenize the input
inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
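To make the tokenizer output easier to picture, here is a minimal sketch of how a question/context pair is laid out for a BERT-style model. It assumes a toy whitespace split in place of the real WordPiece tokenization, and `build_pair` is a hypothetical helper, not part of the `transformers` API.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question/context pair as [CLS] question [SEP] context [SEP]."""
    return ["[CLS]", *question_tokens, "[SEP]", *context_tokens, "[SEP]"]


# Toy whitespace split; the real tokenizer produces WordPiece subwords instead.
pair = build_pair("What is ONNX ?".split(), "ONNX is an open format .".split())
print(pair)
```

The real tokenizer additionally maps each token to an integer ID and returns an attention mask alongside `input_ids`.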

## Step 4: Perform Question Answering
Now, we can pass the tokenized input to the DistilBERT model to perform question answering. The model returns two sets of logits: one for the start index of the answer span and one for the end index of the answer span.

In [5]:
# Perform the question answering task
outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits
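Logits are unnormalized scores, so they are hard to read as confidences. As a rough illustration, a softmax turns them into probabilities over token positions. The sketch below uses made-up logit values and plain Python rather than the tensors returned by the model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_start_logits = [0.1, 2.5, -1.0, 0.3]  # made-up values, one per token position
probs = softmax(toy_start_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```

The position with the highest probability is the same one `argmax` picks on the raw logits, since softmax preserves their ordering.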

## Step 5: Find the Answer Span
We need to find the start and end indices of the answer span in the context. We do this by finding the indices with the highest scores in the start and end logits.

In [6]:
# Find the start and end indices of the answer span
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores) + 1
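Taking the two argmax values independently works for this example, but nothing guarantees that the end index comes after the start index. One common alternative, sketched here under the assumption that scores are plain Python lists, is to search for the pair with the highest combined score subject to `start <= end` and a length cap. The helper name `best_span` and the `max_answer_len` default are illustrative choices, not part of the original notebook.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start + end score with start <= end.

    `end` is inclusive, and the span is capped at max_answer_len tokens.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Independent argmax would give start=2, end=0 here, which is not a valid span.
toy_start = [0.1, 0.2, 3.0, 0.5]
toy_end = [4.0, 0.1, 0.3, 2.0]
start, end = best_span(toy_start, toy_end)
print(start, end)
```

Because `end` is inclusive in this sketch, the answer tokens would be sliced with `start:end + 1`.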

## Step 6: Get the Answer Tokens and Decode Them
Finally, we can retrieve the answer tokens from the input and decode them using the DistilBERT tokenizer.

In [7]:
# Get the answer tokens and decode them
answer_tokens = inputs["input_ids"][0][start_index:end_index]
answer = tokenizer.decode(answer_tokens)

In [8]:
print(answer)

Open Neural Network Exchange


## Step 7: Convert the Model to ONNX Format

We can convert the PyTorch model to the ONNX format using the `torch.onnx.export` function. This function takes the PyTorch model, a tuple of example input tensors, the output path for the ONNX model, and the names of the input and output tensors. We also specify the dynamic axes of the input and output tensors. Dynamic axes let the exported model accept a variable batch size and sequence length, so we can pass inputs of any length up to DistilBERT's maximum input size (512 tokens) and process multiple inputs at once.

In [9]:
# Define the input and output names for the ONNX model
input_names = ["input_ids", "attention_mask"]
output_names = ["start_logits", "end_logits"]
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
onnx_path = Path("model/distillbert_qa.onnx")

# Export the model to the ONNX format
torch.onnx.export(
    model,  # model to export
    (input_ids, attention_mask),  # input as tuple
    onnx_path,  # output ONNX model name
    input_names=input_names,  # names for input tensor
    output_names=output_names,  # names for output tensor
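    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        # The logits have shape (batch_size, sequence_length), so both axes vary
        "start_logits": {0: "batch_size", 1: "sequence_length"},
        "end_logits": {0: "batch_size", 1: "sequence_length"},
    },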
)
print("Model saved in ONNX format.")

Model saved in ONNX format.


## Step 8: Convert the Model to OpenVINO Format

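Now that we have our model in the ONNX format, we can use the OpenVINO toolkit to optimize it for deployment on Intel hardware. We first use the OpenVINO Model Optimizer (`mo`) to convert the ONNX model to the OpenVINO Intermediate Representation (IR) format, fixing the input shape to a batch of 1 and a sequence length of 512. We then load the resulting IR with the OpenVINO runtime and compile it.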

In [10]:
# Construct the command for Model Optimizer.
optimizer_command = f'mo \
                     --input_model {onnx_path} \
                     --output_dir {MODEL_DIR} \
                     --model_name {model_ckpt} \
                     --input input_ids,attention_mask \
                     --input_shape "[1,512],[1,512]"'

! $optimizer_command

core = ov.Core()
ir_model_xml = str((Path(MODEL_DIR) / model_ckpt).with_suffix(".xml"))
compiled_model = core.compile_model(ir_model_xml)

[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /workspaces/codespaces-blank/openvino_notebooks/notebooks/236-distillbert_question_answering/model/distilbert-base-cased-distilled-squad.xml
[ SUCCESS ] BIN file: /workspaces/codespaces-blank/openvino_notebooks/notebooks/236-distillbert_question_answering/model/distilbert-base-cased-distilled-squad.bin
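The `!` shell escape only works inside a notebook. As a hedged sketch for plain Python scripts, the same Model Optimizer call can be assembled as an argument list, which avoids quoting problems with the `--input_shape` value. The helper `build_mo_command` is a hypothetical name; the flags are the ones used in the cell above.

```python
import shlex


def build_mo_command(onnx_path, output_dir, model_name, max_len=512):
    """Assemble the Model Optimizer call as an argument list."""
    shape = f"[1,{max_len}],[1,{max_len}]"
    return [
        "mo",
        "--input_model", str(onnx_path),
        "--output_dir", str(output_dir),
        "--model_name", model_name,
        "--input", "input_ids,attention_mask",
        "--input_shape", shape,
    ]


cmd = build_mo_command(
    "model/distillbert_qa.onnx",
    "model/",
    "distilbert-base-cased-distilled-squad",
)
print(shlex.join(cmd))
# To actually run it (requires `mo` on PATH): subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` skips the shell entirely, so the brackets and commas in the shape need no extra quoting.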


## Step 9: Create a function to answer questions
Now that we have our compiled OpenVINO model and our tokenizer, we can create a function that takes a question and a context as inputs and returns an answer.

First, we need to tokenize the input using the tokenizer. Because the IR was converted with a fixed input shape, we use the padding and truncation options so that every input is exactly 512 tokens long: longer inputs are truncated, and shorter ones are padded with zeros.
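To illustrate what padding and truncation to a fixed length mean, here is a simplified sketch in plain Python. It assumes a padding ID of 0 and made-up token IDs, and it ignores details the real tokenizer handles, such as keeping the special tokens in place when truncating. `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])  # drop anything past the limit
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding           # pad_id=0 is an assumption here
    mask += [0] * padding               # 0 marks a padding position
    return ids, mask


# Made-up token IDs, padded to a short length so the result is easy to read.
ids, mask = pad_or_truncate([101, 1327, 1110, 102], max_length=8)
print(ids)
print(mask)
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the attention computation.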

Once the input is tokenized, we run inference with the compiled OpenVINO model to get the start and end logits, and take the argmax of each to obtain the start and end indices of the answer span. We then decode the answer tokens using the tokenizer and return the resulting answer.

The function below puts these steps together:

In [11]:
def answer_question(compiled_model, tokenizer, question, context):
    input_attention_ids = tokenizer(
        question,
        context,
        padding="max_length",
        max_length=512,
        truncation=True,
        return_tensors="np",
    )

    inputs = dict(input_attention_ids)

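    # Run inference on the compiled OpenVINO model (not the PyTorch `model`)
    result = compiled_model(inputs)

    # The two outputs are the start and end logits, in that order
    start_logits, end_logits = result.values()

    # Find the start and end indices of the answer span
    start_index = start_logits.argmax()
    end_index = end_logits.argmax() + 1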

    # Get the answer tokens and decode them
    answer_tokens = inputs["input_ids"][0][start_index:end_index]
    answer = tokenizer.decode(answer_tokens)

    return answer

In [12]:
question = "What is ONNX?"
context = "ONNX (Open Neural Network Exchange) is an open standard format for representing machine learning models."
answer = answer_question(compiled_model, tokenizer, question, context)
print(answer)

Open Neural Network Exchange
