# BERT Question Answering Embedding Demo

This notebook demonstrates a question answering application that uses a SQuAD-tuned BERT model from [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/) to calculate embedding vectors and find the right context for each question. See `bert_embedding_demo.py`, in the same folder as this notebook, for the source code that loads the model and performs inference.

The primary difference from the [bert_question_answering_demo](../../../bert_question_answering_demo/README.md) is that this demo shows how inference can be accelerated by pre-computing the embeddings for the contexts.

## How It Works

The models are loaded into the OpenVINO Inference Engine. Data is fetched from the user-provided URL to populate the list of "contexts" with text. Before any questions are answered, an embedding vector is pre-calculated (via inference) for each context in the list. This is done using the first ("embeddings-only") BERT model. After that, when the user types a question, the same "embeddings" network is used to calculate an embedding vector for that question. Using the L2 distance between the embedding vector of the question and the embedding vectors of the contexts, the best (closest) contexts are selected as candidates in which to search for the final answer to the question.
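The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `bert_embedding_demo.py`: the function name `closest_contexts`, the `top_k` parameter, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import math


def closest_contexts(question_emb, context_embs, top_k=2):
    """Return indices of the top_k context embeddings closest to the question.

    Closeness is measured with the L2 (Euclidean) distance.
    Hypothetical helper for illustration only.
    """
    distances = [math.dist(question_emb, emb) for emb in context_embs]
    ranked = sorted(range(len(context_embs)), key=lambda i: distances[i])
    return ranked[:top_k]


# Toy embeddings, made up for illustration; real BERT embeddings are much longer.
context_embs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.8, 0.3],
]
question_emb = [0.0, 0.9, 0.25]

best = closest_contexts(question_emb, context_embs, top_k=2)
print(best)
```

Only the contexts at the returned indices would then be passed on to the second stage, so the cost of ranking grows with the number of contexts but involves no model inference.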

The question is usually much shorter than the contexts, so calculating its embedding is fast. Calculating the L2 distance between a context and the question is also almost free compared to the actual inference. Together, this substantially reduces the inference work during question answering: the embedding model runs ONLY on the question (the context embeddings are pre-calculated), whereas the conventional approach has to concatenate each context with the question and run inference on this large input, once per context.

A second (conventional SQuAD-tuned) BERT model is used to further search for the exact answer in the best contexts found in the first step.
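To make the saving concrete, the sketch below simply counts model inferences for both approaches. It is a rough illustration under stated assumptions: the numbers are made up, `top_k` (how many candidate contexts go to the second model) is a hypothetical parameter, and every inference is counted as one unit even though the question embedding is cheaper than a full context-plus-question inference.

```python
def count_inferences(num_contexts, num_questions, top_k):
    """Count model inferences for both approaches (illustrative only)."""
    # Conventional approach: every question is concatenated with every context.
    conventional = num_contexts * num_questions
    # Embedding approach: each context is embedded once, before any question.
    precompute = num_contexts
    # Per question: one embedding inference plus top_k QA inferences.
    answering = num_questions * (1 + top_k)
    return conventional, precompute, answering


# Made-up sizes: 50 contexts, 5 questions, 3 candidate contexts per question.
conventional, precompute, answering = count_inferences(
    num_contexts=50, num_questions=5, top_k=3
)
print(f"Conventional: {conventional} inferences while answering")
print(f"Embedding approach: {precompute} one-time + {answering} while answering")
```

The one-time cost is paid before the user asks anything, so the more questions are asked against the same contexts, the more the pre-computation pays off.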

## Settings and Imports

Change `input_url` to use a different web page as the source of the contexts.

In [None]:
import os
import subprocess
import time

from bert_embedding_demo import BERT
from bert_notebook_utils import BERT as ORIGINAL_BERT

In [None]:
input_url = "https://en.wikipedia.org/wiki/Bert_(Sesame_Street)"

# Other settings (only change this if you know what you are doing!)
model_name_emb = "bert-small-uncased-whole-word-masking-squad-emb-int8-0001"
model_name_qa = "bert-small-uncased-whole-word-masking-squad-0001"
device = "CPU"
reshape = False  # Try to reshape the sequence length to the input context + max question len (to improve the speed)
model_squad_ver = "1.2"  # SQuAD version used for model fine tuning

In [None]:
# Set file and directory paths for model. By default, this demo notebook downloads all BERT models from the Open Model Zoo and stores them in the current directory.
# Adjust these settings if you want to change this.
open_model_zoo_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(os.curdir)))))
vocab_file = os.path.join(open_model_zoo_path, "models", "intel", model_name_emb, "vocab.txt")
base_model_dir = os.curdir
omz_cache_dir = os.path.expanduser("~/open_model_zoo_cache")

## Download BERT models

Download BERT models from [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/) with the Model Downloader. Models are downloaded to `base_model_dir`, which is set to the current directory by default. Open Model Zoo caches the downloaded models in `omz_cache_dir`. By default this is set to `open_model_zoo_cache` in the home directory (`/home/username` on Linux, `c:\Users\username` on Windows). If you want to modify these defaults, change the settings in the previous cell. See the [Model Downloader documentation](https://github.com/openvinotoolkit/open_model_zoo/blob/master/tools/downloader/README.md) for more information.
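After downloading, it can be useful to check that a model file is where the demo expects it. The sketch below builds the expected path of a model's `.xml` file and reports whether it exists. The `intel/<model_name>/<precision>/` layout under the output directory is an assumption about how Model Downloader arranges Intel models, and `expected_model_xml` is a hypothetical helper, not part of Open Model Zoo.

```python
import os


def expected_model_xml(base_model_dir, model_name, precision):
    """Build the path where the model's IR .xml file is assumed to be stored.

    Assumed layout: <base_model_dir>/intel/<model_name>/<precision>/<model_name>.xml
    """
    return os.path.join(
        base_model_dir, "intel", model_name, precision, f"{model_name}.xml"
    )


xml_path = expected_model_xml(
    os.curdir, "bert-small-uncased-whole-word-masking-squad-0001", "FP16"
)
print(xml_path, "found" if os.path.isfile(xml_path) else "missing")
```

If the file is reported missing after the download cell has run, compare the printed path with the actual directory contents before changing any settings.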

In [None]:
downloader_model_name = "bert*"  # Name (with optional wildcards) of the model or models to download. `bert*` downloads all BERT models. To only download the smaller models, set `downloader_model_name` to `bert-small*`
precision = "FP16,FP16-INT8"  # If Model Downloader is run with the precision argument, only models with the specified precisions are downloaded. On CPU, FP16 and FP32 give the same result.

downloader_command = os.path.join(open_model_zoo_path, "tools", "downloader", "downloader.py")
subprocess.run(
    [
        "python",
        downloader_command,
        "--output_dir",
        base_model_dir,
        "--jobs",
        "4",
        "--cache_dir",
        omz_cache_dir,
        "--precision",
        precision,
        "--name",
        downloader_model_name,
    ],
    shell=False,
    check=False,
    capture_output=False,
)

## Setup BERT

In [None]:
bert = BERT(
    input_url=input_url,
    vocab_file=vocab_file,
    model_name_emb=model_name_emb,
    model_name_qa=model_name_qa,
    base_model_dir=base_model_dir,
    reshape=reshape,
    device=device,
    model_squad_ver=model_squad_ver,
)

## Ask questions!

In [None]:
bert.ask("What is BERT?")

## Compare Embedding model with original model

Check the speed and result of the `ask` function on different models. Call `bert.ask` with `show_embeddings=False` and `show_context=False` for more concise output. Optionally set `show_answers` to `False` to disable all output. Note that the reported times are only a rough indication, not a benchmark. See the [OpenVINO documentation](https://docs.openvinotoolkit.org/latest/_docs_IE_DG_Intro_to_Performance.html) for tips on how to improve performance and the [Benchmark C++ Sample](https://docs.openvinotoolkit.org/latest/_inference_engine_samples_benchmark_app_README.html) for actual performance measurements.

Supported embedding models (`model_name_emb`):
* bert-large-uncased-whole-word-masking-squad-emb-0001
* bert-small-uncased-whole-word-masking-squad-emb-int8-0001

Supported QA models (`model_name_qa`):
* bert-large-uncased-whole-word-masking-squad-0001
* bert-large-uncased-whole-word-masking-squad-int8-0001
* bert-small-uncased-whole-word-masking-squad-0001
* bert-small-uncased-whole-word-masking-squad-0002
* bert-small-uncased-whole-word-masking-squad-int8-000
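
The timing pattern used for the comparison can also be wrapped in a small helper. This is a sketch only: `timed` is a hypothetical function, and it uses `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`. The built-in `sum` stands in for a call such as `bert.ask`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in this notebook the call would be an `ask` method instead.
result, seconds = timed(sum, range(1000))
print(f"Result: {result}, time: {seconds:.6f} seconds")
```

A single run like this still includes warm-up effects, so it only gives a coarse comparison between the two models.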

In [None]:
# Default: compare the small models.
model_name_qa = "bert-small-uncased-whole-word-masking-squad-0001"
model_name_emb = "bert-small-uncased-whole-word-masking-squad-emb-int8-0001"
# To compare the large models instead, uncomment the two lines below.
# model_name_qa = "bert-large-uncased-whole-word-masking-squad-0001"
# model_name_emb = "bert-large-uncased-whole-word-masking-squad-emb-0001"

input_url = "https://en.wikipedia.org/wiki/Sesame_Street"
questions = [
    "Who created Sesame Street?",
    "What characters are in Sesame Street?",
    "Where is Sesame Street?",
    "When did Sesame Street start?",
    "What is the goal of Sesame Street?",
]

In [None]:
original_bert = ORIGINAL_BERT(
    input_url,
    vocab_file,
    model_name_qa,
    base_model_dir,
    reshape,
    device,
    model_squad_ver,
)
start_time_orig = time.time()
original_bert.ask(questions, show_answers=True, show_context=False)
end_time_orig = time.time()
del original_bert
print(f"Original model time: {end_time_orig - start_time_orig:.2f} seconds")

In [None]:
embedding_bert = BERT(
    input_url,
    vocab_file,
    model_name_emb,
    model_name_qa,
    base_model_dir,
    reshape=reshape,
    device=device,
    model_squad_ver=model_squad_ver,
)
start_time_emb = time.time()
embedding_bert.ask(questions, show_answers=True, show_embeddings=False, show_context=False)
end_time_emb = time.time()
del embedding_bert
print(f"Embedding model time: {end_time_emb - start_time_emb:.2f} seconds")