# Quantize a Hugging Face Question-Answering Model with OpenVINO

This notebook shows how to quantize a question answering model with OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF). 

With quantization, we reduce the precision of the model's weights and activations from 32-bit floating point (FP32) to 8-bit integer (INT8). The result is a smaller model that runs inference faster with OpenVINO Runtime.
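
To make the idea concrete, here is a minimal pure-Python sketch of symmetric quantization with a single scale factor. It is only an illustration under simplifying assumptions: the helper names `quantize_int8` and `dequantize` are hypothetical, and NNCF's actual algorithm is more sophisticated (for example, it calibrates activation ranges on real data).

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical weight values, chosen only for illustration
weights = [0.5, -1.27, 0.003, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print(q_weights)
# The rounding error per value is at most half a quantization step
print([round(abs(w - r), 4) for w, r in zip(weights, restored)])
```

Each value is stored in one byte instead of four, at the cost of a small rounding error that is bounded by half the scale.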

The notebook demonstrates post-training quantization, which does not require specific hardware. A laptop or desktop with a recent Intel Core processor is recommended for best results. To install the requirements for this notebook, run `pip install -r requirements.txt`, or uncomment the cell below to install them in your current Python environment.

In [None]:
# %pip install "optimum-intel[openvino, nncf]" datasets evaluate[evaluator] ipywidgets

In [1]:
import time
import warnings
from pathlib import Path

import datasets
import evaluate
import numpy as np
import pandas as pd
import transformers
from evaluate import evaluator
from openvino.runtime import Core
from optimum.intel.openvino import OVModelForQuestionAnswering, OVQuantizer
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

## Settings

We define `MODEL_ID` and `DATASET_NAME`, and the paths for the quantized model files. Set `VERSION_2_WITH_NEGATIVE` to `True` if you use a version of the SQuAD v2 dataset, which includes questions that do not have an answer.

For this tutorial, we use the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad), a reading comprehension dataset consisting of questions on a set of Wikipedia articles, where the answer to every question is a segment of text from a given context. The notebook was tested with the [csarron/bert-base-uncased-squad-v1](https://huggingface.co/csarron/bert-base-uncased-squad-v1) model. Other [question-answering models](https://huggingface.co/models?dataset=dataset:squad&pipeline_tag=question-answering&sort=downloads) should also work.

In [2]:
MODEL_ID = "csarron/bert-base-uncased-squad-v1"
DATASET_NAME = "squad"
VERSION_2_WITH_NEGATIVE = False

base_model_path = Path(f"models/{MODEL_ID}")
fp32_model_path = base_model_path.with_name(base_model_path.name + "_FP32")
int8_ptq_model_path = base_model_path.with_name(base_model_path.name + "_INT8_PTQ")
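
Because `MODEL_ID` contains a slash, `with_name` only changes the last path component. The short sketch below shows what the model paths resolve to; it uses `PurePosixPath` (an assumption made here so that the printed separators are the same on every platform) instead of `Path`.

```python
from pathlib import PurePosixPath

MODEL_ID = "csarron/bert-base-uncased-squad-v1"
base_model_path = PurePosixPath(f"models/{MODEL_ID}")

# with_name replaces only the final path component ("bert-base-uncased-squad-v1")
fp32_model_path = base_model_path.with_name(base_model_path.name + "_FP32")
int8_ptq_model_path = base_model_path.with_name(base_model_path.name + "_INT8_PTQ")

print(fp32_model_path)
print(int8_ptq_model_path)
```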

## Load Model and Tokenizer

We load the model from the Hugging Face Hub. The model will be automatically downloaded if it has not been downloaded before, or loaded from the cache otherwise.

We also load the tokenizer, which converts the questions and contexts from the dataset into tokens in the format the model expects.

In [3]:
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# See how the tokenizer for the given model converts input text to model input values
print(tokenizer("hello world!"))

{'input_ids': [101, 7592, 2088, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}


## Preview the Dataset

The `datasets` library makes it easy to load datasets. Common datasets can be loaded from the Hugging Face Hub by providing the name of the dataset. See https://github.com/huggingface/datasets. We load the SQuAD dataset with `load_dataset` and show an arbitrary dataset item. Every item in the SQuAD dataset has a unique id, a title that denotes the category, a context, a question, and answers. Each answer is a span of the context; both the answer text and its start position in the context (`answer_start`) are returned.

In [4]:
dataset = datasets.load_dataset(DATASET_NAME)
dataset["train"][31415]

{'id': '570e53690b85d914000d7e3c',
 'title': 'Melbourne',
 'context': "Melbourne is experiencing high population growth, generating high demand for housing. This housing boom has increased house prices and rents, as well as the availability of all types of housing. Subdivision regularly occurs in the outer areas of Melbourne, with numerous developers offering house and land packages. However, after 10 years[when?] of planning policies to encourage medium-density and high-density development in existing areas with greater access to public transport and other services, Melbourne's middle and outer-ring suburbs have seen significant brownfields redevelopment.",
 'question': 'What effect has the housing boom had on house prices and rents?',
 'answers': {'text': ['increased'], 'answer_start': [108]}}
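
The relationship between `answer_start` and the answer text can be checked directly: `answer_start` is a character offset into the context. The sketch below uses only the first two sentences of the context shown above, so that it runs without downloading the dataset.

```python
# First part of the context from the dataset item above
context = (
    "Melbourne is experiencing high population growth, generating high demand "
    "for housing. This housing boom has increased house prices and rents"
)
answers = {"text": ["increased"], "answer_start": [108]}

for text, start in zip(answers["text"], answers["answer_start"]):
    # Slice the context at the character offset to recover the answer span
    span = context[start : start + len(text)]
    print(repr(span), span == text)
```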

## Post Training Quantization

For post-training quantization (PTQ), we first load the model using the `AutoModelForQuestionAnswering` class. After instantiating an `OVQuantizer`, we provide a dataset for the calibration step and apply quantization by calling the `quantize` method. That's all!

### Prepare the Dataset

We need a representative calibration dataset to quantize the model. The model was trained on SQuAD, a large dataset with a wide variety of questions and answers, and it generalizes pretty well to questions and contexts it has never seen before. For production use, you would fine-tune the model with questions and contexts specific to your domain. In this notebook, we use a subset of the SQuAD dataset for demonstration purposes. We chose the _Super Bowl 50_ category from the validation subset of SQuAD because it has a large number of questions.

Post-training quantization does not need a training and validation dataset, because we will not train the model, but we define these splits here to use a training split for calibration, and a validation split for validation.

In [5]:
def preprocess_fn(examples, tokenizer):
    """convert the text from the dataset into tokens in the format that the model expects"""
    return tokenizer(
        examples["question"],
        examples["context"],
        padding=True,
        truncation=True,
        max_length=384,
    )
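
In `preprocess_fn`, `padding=True` pads every sequence in a batch to the length of the longest one, and `truncation=True` with `max_length=384` caps the sequence length. The pure-Python sketch below illustrates the effect on `input_ids` and `attention_mask`. It is a simplification: the helper `pad_and_truncate` is hypothetical, the padding id 0 is an assumption based on the bert-base-uncased vocabulary, and the real tokenizer truncates question/context pairs more carefully (it keeps the special tokens in place).

```python
PAD_TOKEN_ID = 0  # assumption: [PAD] has id 0, as in the bert-base-uncased vocabulary


def pad_and_truncate(batch_input_ids, max_length):
    """Truncate each sequence to max_length, then pad the batch to its longest sequence."""
    truncated = [ids[:max_length] for ids in batch_input_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_TOKEN_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The first sequence is the tokenizer output for "hello world!" shown earlier;
# the second is a shorter, illustrative sequence
batch = [[101, 7592, 2088, 999, 102], [101, 7592, 102]]
encoded = pad_and_truncate(batch, max_length=384)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```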

In [6]:
NUM_TRAIN_ITEMS = 600
filtered_examples = dataset["validation"].filter(lambda x: x["title"].startswith("Super_Bowl_50"))
train_examples = filtered_examples.select(range(0, NUM_TRAIN_ITEMS))
train_dataset = train_examples.map(lambda x: preprocess_fn(x, tokenizer), batched=True)

validation_examples = filtered_examples.select(range(NUM_TRAIN_ITEMS, len(filtered_examples)))
validation_dataset = validation_examples.map(lambda x: preprocess_fn(x, tokenizer), batched=True)

### Quantize the Model with Post Training Quantization

In [7]:
# Hide PyTorch warnings about missing shape inference
warnings.simplefilter("ignore")

# Quantize the model
quantizer = OVQuantizer.from_pretrained(model)
quantizer.quantize(calibration_dataset=train_dataset, save_directory=int8_ptq_model_path)

## Compare INT8 and FP32 models

We compare the accuracy, model size, inference results and latency of the FP32 and INT8 models.

### Inference Pipeline

Transformers [Pipelines](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial) simplify model inference. A `Pipeline` is created by adding a task, model and tokenizer to the `pipeline` function. Inference is then as simple as `qa_pipeline({"question": question, "context": context})`.

We create two pipelines: `hf_qa_pipeline` and `ov_qa_pipeline_ptq` to compare the FP32 PyTorch model with the OpenVINO INT8 model. These pipelines will also be used for showing the accuracy difference and for benchmarking later in this notebook.

In [8]:
quantized_model_ptq = OVModelForQuestionAnswering.from_pretrained(int8_ptq_model_path)
original_model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID)
ov_qa_pipeline_ptq = pipeline("question-answering", model=quantized_model_ptq, tokenizer=tokenizer)
hf_qa_pipeline = pipeline("question-answering", model=original_model, tokenizer=tokenizer)

In [9]:
context = validation_examples[200]["context"]
question = "Who won the game?"
print(context)

Super Bowl 50 featured numerous records from individuals and teams. Denver won despite being massively outgained in total yards (315 to 194) and first downs (21 to 11). Their 194 yards and 11 first downs were both the lowest totals ever by a Super Bowl winning team. The previous record was 244 yards by the Baltimore Ravens in Super Bowl XXXV. Only seven other teams had ever gained less than 200 yards in a Super Bowl, and all of them had lost. The Broncos' seven sacks tied a Super Bowl record set by the Chicago Bears in Super Bowl XX. Kony Ealy tied a Super Bowl record with three sacks. Jordan Norwood's 61-yard punt return set a new record, surpassing the old record of 45 yards set by John Taylor in Super Bowl XXIII. Denver was just 1-of-14 on third down, while Carolina was barely better at 3-of-15. The two teams' combined third down conversion percentage of 13.8 was a Super Bowl low. Manning and Newton had quarterback passer ratings of 56.6 and 55.4, respectively, and their added total

In [10]:
hf_qa_pipeline({"question": question, "context": context})["answer"]

'Denver'

In [11]:
ov_qa_pipeline_ptq({"question": question, "context": context})["answer"]

'Denver'

### Accuracy

We load the quantized model and the original FP32 model, and compare the metrics on both models. The [evaluate](https://github.com/huggingface/evaluate) library makes it very easy to evaluate models on a given dataset, with a given metric. For the SQuAD dataset, the F1 score and Exact Match metrics are returned.
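
For intuition about the two metrics, the following is a minimal sketch of how SQuAD-style Exact Match and F1 can be computed for a single prediction. It follows the general recipe of the SQuAD v1 evaluation (normalize the text, compare tokens, take the best score over all reference answers), but the function names are hypothetical and this is not the code that the `evaluate` library runs. The example uses one of the INT8 predictions reported in the Inference Results section of this notebook.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(score_fn, prediction, references):
    """Take the maximum score over all reference answers."""
    return max(score_fn(prediction, ref) for ref in references)


references = ["BBC Radio 5", "Radio 5 Live", "BBC Radio 5 Live"]
prediction = "BBC Radio 5 Live and 5 Live Sports Extra"
print(round(100 * best_score(f1, prediction, references), 2))  # 61.54
print(100 * best_score(exact_match, prediction, references))  # 0.0
```

The prediction contains all four tokens of the best reference (recall 1.0) but has nine tokens in total (precision 4/9), which gives an F1 of 8/13, or 61.54 on the 0 to 100 scale.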

To load the quantized model with OpenVINO, we use the `OVModelForQuestionAnswering` class. It can be used in the same way as [`AutoModelForQuestionAnswering`](https://huggingface.co/docs/transformers/main/model_doc/auto).

The pipelines we created in the previous section are used to perform evaluation.

In [12]:
squad_eval = evaluator("question-answering")

ov_eval_results = squad_eval.compute(
    model_or_pipeline=ov_qa_pipeline_ptq,
    data=validation_examples,
    metric="squad",
    squad_v2_format=VERSION_2_WITH_NEGATIVE,
)

hf_eval_results = squad_eval.compute(
    model_or_pipeline=hf_qa_pipeline,
    data=validation_examples,
    metric="squad",
    squad_v2_format=VERSION_2_WITH_NEGATIVE,
)
pd.DataFrame.from_records(
    [hf_eval_results, ov_eval_results],
    columns=["exact_match", "f1"],
    index=["FP32", "INT8 PTQ"],
).round(2)

| | exact_match | f1 |
|---|---|---|
| FP32 | 82.86 | 86.33 |
| INT8 PTQ | 82.86 | 87.42 |


### Inference Results

To fully understand the quality of a model, it is useful to look beyond metrics like Exact Match and F1 score and examine model predictions directly. This can give a more complete impression of the model's performance and help identify areas for improvement.

In the next cell, we go over the items in the validation set and display those where the FP32 prediction score differs from the INT8 prediction score.

The results show that the FP32 model is better for some predictions and the INT8 model is better for others.

In [13]:
results = []
metric = evaluate.load("squad_v2" if VERSION_2_WITH_NEGATIVE else "squad")

for item in validation_examples:
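    # Access fields by key instead of relying on column order; avoid shadowing the built-in `id`
    example_id, context, question = item["id"], item["context"], item["question"]
    answers = item["answers"]
    fp32_answer = hf_qa_pipeline(question=question, context=context)["answer"]
    int8_answer = ov_qa_pipeline_ptq(question=question, context=context)["answer"]

    references = [{"id": example_id, "answers": answers}]
    fp32_predictions = [{"id": example_id, "prediction_text": fp32_answer}]
    int8_predictions = [{"id": example_id, "prediction_text": int8_answer}]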

    fp32_score = round(metric.compute(references=references, predictions=fp32_predictions)["f1"], 2)
    int8_score = round(metric.compute(references=references, predictions=int8_predictions)["f1"], 2)

    if int8_score != fp32_score:
        results.append((question, answers["text"], fp32_answer, fp32_score, int8_answer, int8_score))

pd.set_option("display.max_colwidth", None)
pd.DataFrame(
    results,
    columns=["Question", "Answer", "FP32 answer", "FP32 F1", "INT8 answer", "INT8 F1"],
)

| | Question | Answer | FP32 answer | FP32 F1 | INT8 answer | INT8 F1 |
|---|---|---|---|---|---|---|
| 0 | What company paid for a Super Bowl 50 ad to show a trailer of X-Men: Apocalypse? | [Fox, Fox, Disney] | 20th Century Fox, Lionsgate | 40.0 | 20th Century Fox, Lionsgate, Paramount Pictures, Universal Studios | 22.22 |
| 1 | What BBC radio station will carry the game in the United Kingdom? | [BBC Radio 5, Radio 5 Live, BBC Radio 5 Live] | BBC Radio 5 Live | 100.0 | BBC Radio 5 Live and 5 Live Sports Extra | 61.54 |
| 2 | Aside from BBC Radio 5, what radio station will broadcast the game? | [5 Live Sports Extra, 5 Live Sports Extra, 5 Live Sports Extra] | BBC Radio 5 Live | 50.0 | 5 Live Sports Extra | 100.0 |
| 3 | How many players have been awarded the Most Valuable Player distinction for the Super Bowl? | [43, 43, 43] | 43 | 100.0 | 39 of the 43 | 50.0 |
| 4 | How many former MVP honorees were present for a pregame ceremony? | [39, 39, 39] | 43 | 0.0 | 39 of the 43 | 50.0 |
| 5 | How many yards was the missed field goal? | [44, 44, 44] | 33 | 0.0 | 44 | 100.0 |
| 6 | Who picked off Cam Newton and subsequently fumbled the ball? | [T. J. Ward, T. J. Ward, Ward] | Trevathan | 0.0 | T. J. Ward | 100.0 |
| 7 | What yard line was the Broncos on when Manning lost the ball in the fourth quarter? | [50-yard line., 41, 50] | 41 | 100.0 | 41-yard line | 50.0 |
| 8 | Which player was criticized for not jumping into the pile to recover the ball? | [Newton, Newton, Newton] | Ward | 0.0 | Newton | 100.0 |
| 9 | How many plays was Denver kept out of the end zone after getting the ball from Newton? | [three, three, three] | three | 100.0 | three plays | 66.67 |


### Model Size

We save the FP32 PyTorch model and define a function to show the model size for the PyTorch and OpenVINO models.

In [14]:
def get_model_size(model_folder, framework):
    """
    Return OpenVINO or PyTorch model size in MB.
    Arguments:
        model_folder:
            Directory containing a pytorch_model.bin for a PyTorch model, and an openvino_model.xml/.bin for an OpenVINO model.
        framework:
            Define whether the model is a PyTorch or an OpenVINO model.
    """
    if framework.lower() == "openvino":
        model_path = Path(model_folder) / "openvino_model.xml"
        model_size = model_path.stat().st_size + model_path.with_suffix(".bin").stat().st_size
    elif framework.lower() == "pytorch":
        model_path = Path(model_folder) / "pytorch_model.bin"
        model_size = model_path.stat().st_size
    else:
        raise ValueError(f"Unsupported framework: {framework!r}. Use 'openvino' or 'pytorch'.")
    # Convert bytes to megabytes (1 MB = 1000 * 1000 bytes)
    model_size /= 1000 * 1000
    return model_size


model.save_pretrained(fp32_model_path)

fp32_model_size = get_model_size(fp32_model_path, "pytorch")
int8_model_size = get_model_size(int8_ptq_model_path, "openvino")
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")

FP32 model size: 436.07 MB
INT8 model size: 182.41 MB
INT8 size decrease: 2.39x


### Benchmarks

Compare the inference speed of the quantized OpenVINO model with that of the original PyTorch model.

This benchmark provides an estimate of performance, but keep in mind that other programs running on the computer, as well as power management settings, can affect performance.

In [15]:
def benchmark(qa_pipeline, dataset, num_items=100):
    """
    Benchmark PyTorch or OpenVINO model. This function does inference on `num_items`
    dataset items and returns the median latency in milliseconds
    """
    latencies = []
    for item in dataset.select(range(num_items)):
        start_time = time.perf_counter()
        # Only the pipeline call is timed; the result itself is not needed
        qa_pipeline({"question": item["question"], "context": item["context"]})
        end_time = time.perf_counter()
        latencies.append(end_time - start_time)

    return np.median(latencies) * 1000


original_latency = benchmark(hf_qa_pipeline, validation_dataset)
quantized_latency = benchmark(ov_qa_pipeline_ptq, validation_dataset)
cpu_device_name = Core().get_property("CPU", "FULL_DEVICE_NAME")

print(cpu_device_name)
print(f"Latency of original FP32 model: {original_latency:.2f} ms")
print(f"Latency of quantized model: {quantized_latency:.2f} ms")
print(f"Speedup: {(original_latency/quantized_latency):.2f}x")

11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Latency of original FP32 model: 113.28 ms
Latency of quantized model: 41.29 ms
Speedup: 2.74x
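
The `benchmark` function reports the median latency rather than the mean. The sketch below, with made-up latency values, illustrates why: a single slow run (for example, caused by another program) shifts the mean considerably but barely affects the median. The helper `measure_latency_ms` is hypothetical and adds untimed warm-up calls, which the notebook's `benchmark` function does not do; treat it as one possible variation, not as the method used for the numbers above.

```python
import statistics
import time


def measure_latency_ms(fn, inputs, warmup=3):
    """Return (median, mean) latency in milliseconds of fn over inputs, after warm-up calls."""
    for item in inputs[:warmup]:
        fn(item)  # warm-up runs are not timed
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append((time.perf_counter() - start) * 1000)
    return statistics.median(latencies), statistics.mean(latencies)


# Made-up latencies in ms: four typical runs and one outlier
samples = [40, 41, 42, 41, 400]
print(statistics.median(samples))  # 41
print(statistics.mean(samples))  # 112.8

# Usage with a stand-in workload; a real call would pass a pipeline and dataset items
median_ms, mean_ms = measure_latency_ms(len, ["some text"] * 10)
print(f"median: {median_ms:.4f} ms, mean: {mean_ms:.4f} ms")
```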
