Before running the notebook, prepare the OpenVINO runtime in one of two ways:

(1) Download the OpenVINO toolkit, uncompress it, and install it by following the [Linux installation guide](https://docs.openvinotoolkit.org/latest/openvino_docs_install_guides_installing_openvino_linux.html).

(2) Run the prebuilt OpenVINO container.

```sh
docker pull openvino/ubuntu20_runtime:latest
docker run -it --rm -p 8888:8888 -v $PWD:/tmp openvino/ubuntu20_runtime:latest bash
```
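Whichever option is used, it can help to confirm that the Python bindings are visible before running the first cell. The following is a minimal sketch that only checks whether a top-level `openvino` package can be found; `openvino_available` is a hypothetical helper name, and the check does not verify the toolkit version.

```python
import importlib.util

def openvino_available():
    """Return True when a top-level `openvino` package can be located."""
    return importlib.util.find_spec("openvino") is not None

print("OpenVINO importable:", openvino_available())
```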

In [1]:
import logging
import time
import os
import numpy as np
import cv2
import codecs
from openvino.inference_engine import IECore

# Device information

The following commands list the CPU and USB resources available on the machine.

In [2]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           6
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Stepping:            10
CPU MHz:             2600.000
BogoMIPS:            5199.98
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht pbe syscall nx pdpe1gb lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq dtes64 ds_cpl ssse3 sdbg fma cx16 xtpr pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 avx2 bmi2 erms xsaveopt arat


In [3]:
!lsusb

/bin/sh: 1: lsusb: not found
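
The output above shows that `lsusb` is not installed in this environment. As a sketch, assuming only the Python standard library, a tool can be checked for before it is invoked; `run_if_available` is a hypothetical helper name and not part of the notebook.

```python
import shutil
import subprocess

def run_if_available(tool, *args):
    """Run `tool` and return its stdout, or None when it is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run([path, *args], capture_output=True, text=True)
    return result.stdout

usb_info = run_if_available("lsusb")
print(usb_info if usb_info is not None else "lsusb is not installed")
```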


# Bert-small

In this text example, we demonstrate question-answering inference.
The model is a BERT-like model. The pretrained weights can be downloaded from the link below.

https://download.01.org/opencv/2021/openvinotoolkit/2021.2/open_model_zoo/models_bin/3/bert-small-uncased-whole-word-masking-squad-int8-0002/FP16-INT8

The original source code is available at the following link.

https://github.com/openvinotoolkit/open_model_zoo/blob/develop/demos/bert_question_answering_demo/python/bert_question_answering_demo.py



## Helper function

In [4]:
import unicodedata
import string

# load vocabulary file for encoding
def load_vocab_file(vocab_file_name):
    with open(vocab_file_name, "r", encoding="utf-8") as r:
        return {t.rstrip("\n"): i for i, t in enumerate(r.readlines())}


# split word by vocab items and get tok codes
# iteratively return codes
def encode_by_voc(w, vocab):
    # remove mark and control chars
    def clean_word(w):
        wo = ""  # accumulator for output word
        for c in unicodedata.normalize("NFD", w):
            c_cat = unicodedata.category(c)
            # remove mark nonspacing code and controls
            if c_cat != "Mn" and c_cat[0] != "C":
                wo += c
        return wo

    w = clean_word(w)

    res = []
    for s0, e0 in split_to_words(w):
        s, e = s0, e0
        tokens = []
        while e > s:
            subword = w[s:e] if s == s0 else "##" + w[s:e]
            if subword in vocab:
                tokens.append(vocab[subword])
                s, e = e, e0
            else:
                e -= 1
        if s < e0:
            tokens = [vocab['[UNK]']]
        res.extend(tokens)
    return res

#split big text into words by spaces
#iteratively return words
def split_to_words(text):
    prev_is_sep = True # mark initial prev as space to start word from 0 char
    for i, c in enumerate(text + " "):
        is_punc = (c in string.punctuation or unicodedata.category(c)[0] == "P")
        cur_is_sep = (c.isspace() or is_punc)
        if prev_is_sep != cur_is_sep:
            if prev_is_sep:
                start = i
            else:
                yield start, i
                del start
        if is_punc:
            yield i, i+1
        prev_is_sep = cur_is_sep

# get big text and return list of token id and start-end positions for each id in original texts
def text_to_tokens(text, vocab):
    tokens_id = []
    tokens_se = []
    for s, e in split_to_words(text):
        for tok in encode_by_voc(text[s:e], vocab):
            tokens_id.append(tok)
            tokens_se.append((s, e))

    return tokens_id, tokens_se
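
The core of `encode_by_voc` is a greedy, longest-match-first split of each word into vocabulary pieces, where every piece after the first carries a `##` prefix. The following is a simplified, standalone sketch of that idea; `wordpiece_encode` and `toy_vocab` are hypothetical names with made-up ids, and the sketch skips the Unicode cleaning and punctuation handling done by the helpers above.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    ids, start = [], 0
    while start < len(word):
        end = len(word)
        piece_id = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                piece_id = vocab[piece]
                break
            end -= 1
        if piece_id is None:
            # No piece matches at this position: the whole word becomes [UNK].
            return [vocab[unk]]
        ids.append(piece_id)
        start = end
    return ids

# Toy vocabulary with invented ids, for illustration only.
toy_vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "sing": 4}
print(wordpiece_encode("playing", toy_vocab))  # "play" + "##ing"
print(wordpiece_encode("xyz", toy_vocab))      # falls back to [UNK]
```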

## Dataset Preparation

In [5]:
import json

COLOR_RED = "\033[91m"
COLOR_RESET = "\033[0m"

In [6]:
rawText = ""
with open("./bert-small-uncased-whole-word-masking-squad-int8-0002/SquAD1.1/train-v2.0.json", "r") as fin:
  for line in fin:
    rawText += line.strip()

dataset = json.loads(rawText)
print("Total: {}".format(len(dataset['data'])))

Total: 442


Let's take a look at the first paragraph and its questions.

In [7]:
dataset['data'][0]['paragraphs'][0]

{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q

In [8]:
text2id = {}
id2text = {}
count = 0
with codecs.open("./bert-small-uncased-whole-word-masking-squad-int8-0002/2021.2_fp16-int8/vocab.txt", "r", "utf-8") as fin:
  for line in fin:
    word = line.strip()
    text2id[word] = count
    id2text[count] = word
    count += 1
print("There are {} words.".format(len(text2id)))

There are 30522 words.


In [9]:
firstData = dataset['data'][0]
context = firstData['paragraphs'][0]['context']
print("Context origin:", context)

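# Note: the vocabulary is uncased, but the text is passed in without lowercasing,
# so capitalized words are likely to fall back to [UNK]; consider context.lower().
c_tokens_id, c_tokens_se = text_to_tokens(context, text2id)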
print("Context: ", c_tokens_id)

qus = firstData['paragraphs'][0]['qas'][0]['question']
q_tokens_id, _ = text_to_tokens(qus, text2id)
print("Question: ", q_tokens_id)

Context origin: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Context:  [100, 100, 100, 1011, 100, 1006, 1013, 12170, 23432, 29715, 3501, 29678, 12325, 29685, 1013, 10506, 1011, 100, 1011, 2360, 1007, 1006, 2141, 100, 1018, 1010, 3261, 1007, 2003, 2019, 100, 3220, 1010, 6009, 1010, 2501, 3135, 1998, 3883, 1012, 100, 1998, 2992, 1999, 100, 1010, 100,
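
Many ids in the output above are 100, at positions that correspond to capitalized words such as "Houston" and "Texas". Assuming the standard BERT uncased vocabulary, where id 100 is `[UNK]`, this suggests the lookup fails because the text is not lowercased first. The sketch below illustrates the effect with a toy vocabulary; `toy_vocab` and `lookup` are hypothetical and the ids other than 100 are invented.

```python
# Toy uncased vocabulary; only the [UNK] id mirrors the assumed BERT convention.
toy_vocab = {"[UNK]": 100, "houston": 5, "texas": 6, ",": 7}

def lookup(words, vocab):
    """Map each word to its id, falling back to [UNK] when it is missing."""
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

words = ["Houston", ",", "Texas"]
raw_ids = lookup(words, toy_vocab)                         # capitalized words miss
lower_ids = lookup([w.lower() for w in words], toy_vocab)  # lowercased words hit
print(raw_ids, lower_ids)
```

Lowercasing keeps character offsets unchanged for most text, so the start and end positions returned by `text_to_tokens` would still index into the original context in the common case.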

## OpenVINO Runtime

In [10]:
# plugin initialization
ie = IECore()

# read the intermediate representations
net = ie.read_network("./bert-small-uncased-whole-word-masking-squad-int8-0002/2021.2_fp16-int8/bert-small-uncased-whole-word-masking-squad-int8-0002.xml", 
                      "./bert-small-uncased-whole-word-masking-squad-int8-0002/2021.2_fp16-int8/631814_Intel-DL-Boost-Enabling-Kit-UG_Rev0p5.bin")

net.batch_size = 1
net_inputs = net.inputs
net_outputs = net.outputs
inputNames = list(net_inputs.keys())
outputNames = list(net_outputs.keys())
print("Network input:  {}".format(inputNames))
print("Network output: {}".format(outputNames))

print("Inputs:")
for key in inputNames:
  print("  {}: {}".format(key, net_inputs[key].shape))
print("Outputs:")
for key in outputNames:
  print("  {}: {}".format(key, net_outputs[key].shape))

# loading the model to the plugin
model = ie.load_network(network=net, device_name="CPU")

Network input:  ['attention_mask', 'input_ids', 'position_ids', 'token_type_ids']
Network output: ['output_e', 'output_s']
Inputs:
  attention_mask: [1, 384]
  input_ids: [1, 384]
  position_ids: [1, 384]
  token_type_ids: [1, 384]
Outputs:
  output_e: [1, 384]
  output_s: [1, 384]
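
Each of the four inputs listed above has a fixed shape of `[1, 384]`, so every request has to be padded to that length. The sketch below shows one way to lay out a request as `[CLS] question [SEP] context [SEP]` followed by padding, using plain lists. `build_request` is a hypothetical helper, and the default ids 101, 102, and 0 assume the standard BERT uncased vocabulary; the notebook itself looks these ids up in `text2id`.

```python
def build_request(q_ids, c_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Build padded input lists for one question/context window."""
    input_ids = [cls_id] + q_ids + [sep_id] + c_ids + [sep_id]
    if len(input_ids) > max_length:
        raise ValueError("question and context window exceed max_length")
    # Segment 0 covers [CLS], the question, and both [SEP]s (as in the notebook);
    # segment 1 covers the context tokens.
    token_type_ids = [0] * (len(q_ids) + 2) + [1] * len(c_ids) + [0]
    attention_mask = [1] * len(input_ids)

    pad_len = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad_len,
        "token_type_ids": token_type_ids + [0] * pad_len,
        "attention_mask": attention_mask + [0] * pad_len,
        "position_ids": list(range(max_length)),
    }

request = build_request([5, 6], [7, 8, 9], max_length=10)
print(request["input_ids"])
```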




In [11]:
# maximum number of tokens that can be processed by network at once
max_length = net_inputs['input_ids'].shape[1]

# calculate number of tokens for context in each inference request.
# reserve 3 positions for special tokens
# [CLS] q_tokens [SEP] c_tokens [SEP]
c_wnd_len = max_length - (len(q_tokens_id) + 3)

# token num between two neighbour context windows
# 1/2 means that context windows are overlapped by half
c_stride = c_wnd_len // 2

# init a window to iterate over context
c_s, c_e = 0, min(c_wnd_len, len(c_tokens_id))
print("Start:", c_s, ", End:", c_e)

Start: 0 , End: 162
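
The first context here fits in a single window, so the sliding-window logic never advances. To illustrate how the windows would move for a longer context, here is a standalone sketch of the same arithmetic; `context_windows` is a hypothetical helper and the token counts in the example are made up.

```python
def context_windows(n_context, n_question, max_length=384):
    """List the (start, end) token windows the loop would visit."""
    wnd_len = max_length - (n_question + 3)  # room left after [CLS] and two [SEP]s
    stride = wnd_len // 2                    # neighbouring windows overlap by half
    start, end = 0, min(wnd_len, n_context)
    windows = []
    while end > start:
        windows.append((start, end))
        if end == n_context:
            break
        start = min(start + stride, n_context)
        end = min(start + wnd_len, n_context)
    return windows

# A hypothetical 500-token context with a 10-token question.
print(context_windows(500, 10))
```

With these numbers the window length is 371 and the stride is 185, so the second window starts at token 185 and runs to the end of the context.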


In [12]:
def get_score(res, name, max_length=max_length):
  out = np.exp(res[name].reshape((max_length,)))
  return out / out.sum(axis=-1)

# return entire sentence as start-end positions for a given answer (within the sentence).
def find_sentence_range(context, s, e):
  # find start of sentence
  for c_s in range(s, max(-1, s - 200), -1):
    if context[c_s] in "\n.":
      c_s += 1
      break

  # find end of sentence
  for c_e in range(max(0, e - 1), min(len(context), e + 200), +1):
    if context[c_e] in "\n.":
      break

  return c_s, c_e
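
`get_score` applies a softmax by exponentiating the raw outputs directly. That works for the value ranges seen here, but exponentials can overflow for large logits. The following is a sketch of a numerically safer variant that subtracts the maximum first, which gives the same result mathematically; `stable_softmax` is a hypothetical name and the sketch uses plain Python lists rather than NumPy arrays.

```python
import math

def stable_softmax(logits):
    """Softmax that subtracts the maximum logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) alone would raise OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)
```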

In [13]:
ttl_time_cost = 0.0
number_requests = 0
answers = []

while c_e > c_s:
  # prepare the inputs
  token_cls = text2id['[CLS]']
  token_sep = text2id['[SEP]']
  # slice the context to the current window so the request fits max_length
  input_ids = [token_cls] + q_tokens_id + [token_sep] + c_tokens_id[c_s:c_e] + [token_sep]
  token_type_ids = [0] + [0] * len(q_tokens_id) + [0] + [1] * (c_e - c_s) + [0]
  attention_mask = [1] * len(input_ids)
  
  # pad the rest of the request
  pad_len = max_length - len(input_ids)
  input_ids += [0] * pad_len
  token_type_ids += [0] * pad_len
  attention_mask += [0] * pad_len
  
  # create the inputs
  inputs = {
    inputNames[0]: np.array([attention_mask], dtype=np.int32),
    inputNames[1]: np.array([input_ids], dtype=np.int32),
    inputNames[2]: np.arange(len(input_ids), dtype=np.int32)[None, :],
    inputNames[3]: np.array([token_type_ids], dtype=np.int32)
  }
  
  _startTime = time.perf_counter()
  res = model.infer(inputs=inputs)
  timeCost = time.perf_counter() - _startTime  
  ttl_time_cost += timeCost
  number_requests += 1
  
  score_e = get_score(res, outputNames[0], max_length)
  score_s = get_score(res, outputNames[1], max_length)
  
  # get 'no-answer' score (not valid if model has been fine-tuned on squad1.x)
  score_na = 0
  # for squad2.0
  # score_na = score_s[0] * score_e[0]

  # find product of all start-end combinations to find the best one
  c_s_idx = len(q_tokens_id) + 2
  c_e_idx = max_length - (1 + pad_len)
  score_matrix = np.matmul(
    score_s[c_s_idx:c_e_idx].reshape((c_e - c_s, 1)), 
    score_e[c_s_idx:c_e_idx].reshape((1, c_e - c_s))
  )

  # reset candidates with end before start
  score_matrix = np.triu(score_matrix)
  # reset long candidates (> max_answer_token_num)
  max_answer_token_num = 3
  score_matrix = np.tril(score_matrix, max_answer_token_num - 1)
  # find the best pair
  max_s, max_e = divmod(score_matrix.flatten().argmax(), score_matrix.shape[1])
  max_score = score_matrix[max_s, max_e] * (1 - score_na)
  print("Max score: {}".format(max_score))

  # convert to context text start-end index
  max_s = c_tokens_se[c_s + max_s][0]
  max_e = c_tokens_se[c_s + max_e][1]

  # check whether the answer exists or not
  same = [i for i, a in enumerate(answers) if a[1] == max_s and a[2] == max_e]
  if same:
    assert len(same) == 1
    a = answers[same[0]]
    answers[same[0]] = (max(max_score, a[0]), max_s, max_e)
  else:
    # add new record
    answers.append((max_score, max_s, max_e))

  # check that context window reached the end
  if c_e == len(c_tokens_id):
    break  

  # move to next window position
  c_s = min(c_s + c_stride, len(c_tokens_id))
  c_e = min(c_s + c_wnd_len, len(c_tokens_id))
  
answers = sorted(answers, key=lambda x: -x[0])
for score, s, e in answers[:3]:
  print("--answer: {:0.2f} {}".format(score, context[s:e]))
  c_s, c_e = find_sentence_range(context, s, e)
  print("  " + context[c_s:s] + COLOR_RED + context[s:e] + COLOR_RESET + context[e:c_e])

print("Total time cost: {:.4f} sec".format(ttl_time_cost))
print("Number of requests: {}".format(number_requests))
print("Average inference time per request: {:.4f} sec".format(ttl_time_cost / number_requests))
print("Requests per second: {:.2f}".format(number_requests / ttl_time_cost))

Max score: 0.028991861268877983
--answer: 0.03 Child
   Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child
Total time cost: 0.1447 sec
Number of requests: 1
Average inference time per request: 0.1447 sec
Requests per second: 6.91
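
The span selection in the loop multiplies every start probability with every end probability, then uses `np.triu` and `np.tril` to discard spans that end before they start or exceed `max_answer_token_num`. The same rule can be written as an explicit search, sketched here with plain lists; `best_span` is a hypothetical helper and the probabilities are made up.

```python
def best_span(start_probs, end_probs, max_answer_tokens=3):
    """Return (score, start, end) for the best span; `end` is inclusive."""
    best = (0.0, 0, 0)
    for s, p_s in enumerate(start_probs):
        # Only ends at or after the start, and at most max_answer_tokens long.
        last = min(len(end_probs), s + max_answer_tokens)
        for e in range(s, last):
            score = p_s * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    return best

score, s, e = best_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3])
print(s, e, round(score, 2))
```

The matrix form in the notebook computes the same candidates in one vectorized step, which is usually preferable for a 384-token window; the loop form is only meant to make the masking rule explicit.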


## Bundle all operations

In [14]:
ttl_time_cost = 0.0
number_requests = 0

for data_idx in range(0, int(len(dataset['data']) * 0.1)):
  firstData = dataset['data'][data_idx]
  context = firstData['paragraphs'][0]['context']
  c_tokens_id, c_tokens_se = text_to_tokens(context, text2id)
  qus = firstData['paragraphs'][0]['qas'][0]['question']
  q_tokens_id, _ = text_to_tokens(qus, text2id)

  # maximum number of tokens that can be processed by network at once
  max_length = net_inputs['input_ids'].shape[1]

  # calculate number of tokens for context in each inference request.
  # reserve 3 positions for special tokens
  # [CLS] q_tokens [SEP] c_tokens [SEP]
  c_wnd_len = max_length - (len(q_tokens_id) + 3)

  # token num between two neighbour context windows
  # 1/2 means that context windows are overlapped by half
  c_stride = c_wnd_len // 2

  # init a window to iterate over context
  c_s, c_e = 0, min(c_wnd_len, len(c_tokens_id))
  print("Start:", c_s, ", End:", c_e)

  answers = []

  while c_e > c_s:
    # prepare the inputs
    token_cls = text2id['[CLS]']
    token_sep = text2id['[SEP]']
    # slice the context to the current window so the request fits max_length
    input_ids = [token_cls] + q_tokens_id + [token_sep] + c_tokens_id[c_s:c_e] + [token_sep]
    token_type_ids = [0] + [0] * len(q_tokens_id) + [0] + [1] * (c_e - c_s) + [0]
    attention_mask = [1] * len(input_ids)

    # pad the rest of the request
    pad_len = max_length - len(input_ids)
    input_ids += [0] * pad_len
    token_type_ids += [0] * pad_len
    attention_mask += [0] * pad_len

    # create the inputs
    inputs = {
      inputNames[0]: np.array([attention_mask], dtype=np.int32),
      inputNames[1]: np.array([input_ids], dtype=np.int32),
      inputNames[2]: np.arange(len(input_ids), dtype=np.int32)[None, :],
      inputNames[3]: np.array([token_type_ids], dtype=np.int32)
    }

    _startTime = time.perf_counter()
    res = model.infer(inputs=inputs)
    timeCost = time.perf_counter() - _startTime  
    ttl_time_cost += timeCost
    number_requests += 1

    score_e = get_score(res, outputNames[0], max_length)
    score_s = get_score(res, outputNames[1], max_length)

    # get 'no-answer' score (not valid if model has been fine-tuned on squad1.x)
    score_na = 0
    # for squad2.0
    # score_na = score_s[0] * score_e[0]

    # find product of all start-end combinations to find the best one
    c_s_idx = len(q_tokens_id) + 2
    c_e_idx = max_length - (1 + pad_len)
    score_matrix = np.matmul(
      score_s[c_s_idx:c_e_idx].reshape((c_e - c_s, 1)), 
      score_e[c_s_idx:c_e_idx].reshape((1, c_e - c_s))
    )

    # reset candidates with end before start
    score_matrix = np.triu(score_matrix)
    # reset long candidates (> max_answer_token_num)
    max_answer_token_num = 3
    score_matrix = np.tril(score_matrix, max_answer_token_num - 1)
    # find the best pair
    max_s, max_e = divmod(score_matrix.flatten().argmax(), score_matrix.shape[1])
    max_score = score_matrix[max_s, max_e] * (1 - score_na)
    print("Max score: {}".format(max_score))

    # convert to context text start-end index
    max_s = c_tokens_se[c_s + max_s][0]
    max_e = c_tokens_se[c_s + max_e][1]

    # check whether the answer exists or not
    same = [i for i, a in enumerate(answers) if a[1] == max_s and a[2] == max_e]
    if same:
      assert len(same) == 1
      a = answers[same[0]]
      answers[same[0]] = (max(max_score, a[0]), max_s, max_e)
    else:
      # add new record
      answers.append((max_score, max_s, max_e))

    # check that context window reached the end
    if c_e == len(c_tokens_id):
      break  

    # move to next window position
    c_s = min(c_s + c_stride, len(c_tokens_id))
    c_e = min(c_s + c_wnd_len, len(c_tokens_id))

  answers = sorted(answers, key=lambda x: -x[0])
  for score, s, e in answers[:3]:
    print("--answer: {:0.2f} {}".format(score, context[s:e]))
    c_s, c_e = find_sentence_range(context, s, e)
    print("  " + context[c_s:s] + COLOR_RED + context[s:e] + COLOR_RESET + context[e:c_e])

print("Total time cost: {:.4f} sec".format(ttl_time_cost))
print("Number of requests: {}".format(number_requests))
print("Average inference time per request: {:.4f} sec".format(ttl_time_cost / number_requests))
print("Requests per second: {:.2f}".format(number_requests / ttl_time_cost))  

Start: 0 , End: 162
Max score: 0.028991861268877983
--answer: 0.03 Child
   Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child
Start: 0 , End: 196
Max score: 0.26153603196144104
--answer: 0.26 Polish and French
  pɛ̃]; 22 February or 1 March 1810 – 17 October 1849), born Fryderyk Franciszek Chopin,[n 1] was a [91mPolish and French[0m (by citizenship and birth of father) composer and a virtuoso pianist of the Romantic era, who wrote primarily for the solo piano
Start: 0 , End: 220
Max score: 0.21010905504226685
--answer: 0.21 Some Mainland Chinese
   Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Ti
Start: 0 , End: 87
Max score: 0.12847229838371277
--answer: 0.13 