# Joint Pruning Quantization and Distillation with OpenVINO and NNCF

With quantization, we reduce the precision of the model's weights and activations from floating point (FP32) to integer (INT8). This results in a smaller model with faster inference times with OpenVINO Runtime. 
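To make the FP32 to INT8 mapping concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only illustrates the general idea; the function names are hypothetical, and it does not reproduce the quantization scheme that NNCF actually applies to this model.

```python
def quantize_symmetric(values):
    """Map floats to integers in [-127, 127] using one scale for the whole list."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)

# The rounding error per value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, f"max error: {max_error:.5f}")
```

Each value is stored in 8 bits instead of 32, at the cost of a small rounding error bounded by half of `scale`.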

Please see the [Optimum OpenVINO model compression documentation](https://huggingface.co/docs/optimum/intel/optimization_ov#optimization) for more information about compressing models with NNCF and JPQD.

JPQD is applied during training or finetuning of the model. Long training runs are not well suited to a notebook, so we recommend running the [question-answering example](https://github.com/huggingface/optimum-intel/tree/main/examples/openvino/question-answering) in a terminal if you want to compress the model yourself.

To follow this notebook, you do not need to compress the model yourself. You can use the already compressed model that we uploaded to the Hugging Face Hub.

A laptop or desktop with a recent Intel Core processor is recommended for best results. To install the requirements for this notebook, run `pip install "optimum[openvino]" "evaluate[evaluator]" ipywidgets datasets` in a terminal, or uncomment the cell below to install the requirements in your current Python environment.

In [1]:
# %pip install "optimum-intel[openvino]" "evaluate[evaluator]" ipywidgets datasets

In [2]:
import random
import tempfile
from pathlib import Path

import datasets
import evaluate
import pandas as pd
import transformers
from evaluate import evaluator
from optimum.intel.openvino import OVModelForQuestionAnswering
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

from openvino.runtime import Core

transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino




## Settings

We will compare the accuracy and performance of the quantized and pruned model with that of an FP32 bert-base-uncased model which was also finetuned on the SQuAD dataset, following the [Transformers question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering#fine-tuning-bert-on-squad10). 

We give the model_ids for the FP32 model and the INT8 model and define the dataset name. If you trained the models yourself, set FP32_MODEL_ID and INT8_MODEL_ID to the directory containing the model and tokenizer files.

The models were finetuned on the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad), a reading comprehension dataset consisting of questions on a set of Wikipedia articles, where the answer to every question is a segment of text from a given context. The models were finetuned on version 1 of the SQuAD dataset, so VERSION_2_WITH_NEGATIVE should be set to False. 

In [3]:
FP32_MODEL_ID = "helenai/bert-base-uncased-squad-v1"
INT8_MODEL_ID = "helenai/bert-base-uncased-squad-v1-jpqd-ov-int8"
DATASET_NAME = "squad"
VERSION_2_WITH_NEGATIVE = False

#### Intel GPU support

At the moment, quantized embeddings are not supported for inference on GPU. To show inference on an integrated GPU (iGPU), we added `"{re}.*Embeddings.*"` to the `ignored_scopes` in the quantization sections of the [NNCF config](https://github.com/huggingface/optimum-intel/blob/main/examples/openvino/question-answering/configs/bert-base-jpqd.json) and compressed the model again with that config, so that the embeddings are not quantized. This does not affect performance, but it increases the file size of the quantized model from 75 MB to 146 MB.
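The config change can be sketched as a small helper that adds the pattern to every quantization section of a config dictionary. The helper name `skip_embeddings` and the simplified `example_config` are hypothetical; the layout (a `compression` list whose entries have an `algorithm` key) is an assumption about the config structure, so check it against the linked JSON file before relying on it.

```python
import copy


def skip_embeddings(nncf_config, pattern="{re}.*Embeddings.*"):
    """Return a copy of the config with `pattern` in each quantization section's ignored_scopes."""
    config = copy.deepcopy(nncf_config)
    sections = config.get("compression", [])
    if isinstance(sections, dict):  # a single algorithm may be given as a dict
        sections = [sections]
    for section in sections:
        if section.get("algorithm") == "quantization":
            scopes = section.setdefault("ignored_scopes", [])
            if pattern not in scopes:
                scopes.append(pattern)
    return config


# Hypothetical, heavily simplified config used only for illustration.
example_config = {
    "compression": [
        {"algorithm": "movement_sparsity"},
        {"algorithm": "quantization"},
    ]
}

gpu_config = skip_embeddings(example_config)
print(gpu_config["compression"][1]["ignored_scopes"])
```

The helper works on a copy, so the original config is left untouched and both variants can be kept side by side.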

The code in the cell below checks if a GPU is available for OpenVINO inference, and if so it sets INT8_MODEL_ID to the GPU-enabled version of the model.

In [4]:
gpu_available = "GPU" in Core().available_devices
if gpu_available:
    INT8_MODEL_ID = "helenai/bert-base-uncased-squad-v1-jpqd-ov-int8@gpu"

## Load the Dataset

The `datasets` library makes it easy to load datasets. Common datasets can be loaded from the Hugging Face Hub by providing the name of the dataset. See https://github.com/huggingface/datasets. We load the SQuAD dataset with `load_dataset`, show a random dataset item, and the list of categories in the dataset.

Every item in the SQuAD dataset has a unique id, a title that denotes the category, a context, a question, and a set of answers. Each answer is a span of the context, and both the text of the answer and its start position in the context (`answer_start`) are returned.
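The relation between `answer_start` and the context can be checked with a few lines of Python. The item below is a hand-written, hypothetical example that only mimics the structure described above; it is not taken from the dataset.

```python
# Hypothetical item with the same fields as a SQuAD record.
item = {
    "id": "example-0",
    "title": "Super_Bowl_50",
    "context": "The Denver Broncos defeated the Carolina Panthers to earn their third Super Bowl title.",
    "question": "Who won the game?",
    "answers": {"text": ["Denver Broncos"], "answer_start": [4]},
}


def answer_spans(item):
    """Slice every answer out of the context using its start position and length."""
    answers = item["answers"]
    return [
        item["context"][start : start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


# Each slice should reproduce the answer text exactly.
print(answer_spans(item))  # ['Denver Broncos']
```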



In [5]:
examples = datasets.load_dataset(DATASET_NAME, split="validation")
random.choice(examples)

{'id': '56e77a8700c9c71400d7718b',
 'title': 'Teacher',
 'context': "In the past, teachers have been paid relatively low salaries. However, average teacher salaries have improved rapidly in recent years. US teachers are generally paid on graduated scales, with income depending on experience. Teachers with more experience and higher education earn more than those with a standard bachelor's degree and certificate. Salaries vary greatly depending on state, relative cost of living, and grade taught. Salaries also vary within states where wealthy suburban school districts generally have higher salary schedules than other districts. The median salary for all primary and secondary teachers was $46,000 in 2004, with the average entry salary for a teacher with a bachelor's degree being an estimated $32,000. Median salaries for preschool teachers, however, were less than half the national median for secondary teachers, clock in at an estimated $21,000 in 2004. For high school teachers, median sa

In [6]:
print(set([item["title"] for item in examples]))

{'Scottish_Parliament', 'Oxygen', 'United_Methodist_Church', 'European_Union_law', 'Construction', 'French_and_Indian_War', 'Martin_Luther', 'Super_Bowl_50', 'Genghis_Khan', 'Prime_number', 'Rhine', 'Steam_engine', 'Economic_inequality', 'Yuan_dynasty', '1973_oil_crisis', 'American_Broadcasting_Company', 'Computational_complexity_theory', 'Packet_switching', 'Civil_disobedience', 'Warsaw', 'Teacher', 'Southern_California', 'Normans', 'Newcastle_upon_Tyne', 'Black_Death', 'Chloroplast', 'Jacksonville,_Florida', 'Imperialism', 'Apollo_program', 'Huguenot', 'Pharmacy', 'Ctenophora', 'Victoria_and_Albert_Museum', 'Kenya', 'Immune_system', 'Intergovernmental_Panel_on_Climate_Change', 'Doctor_Who', 'Force', 'University_of_Chicago', 'Amazon_rainforest', 'Fresno,_California', 'Geology', 'Islamism', 'Victoria_(Australia)', 'Private_school', 'Nikola_Tesla', 'Sky_(United_Kingdom)', 'Harvard_University'}


## Load Model and Tokenizer

We load the PyTorch FP32 model and the OpenVINO INT8 model from the Hugging Face Hub. The models are downloaded automatically if they have not been downloaded before; otherwise they are loaded from the cache. To load the quantized model with OpenVINO, we use the `OVModelForQuestionAnswering` class. It can be used in the same way as [`AutoModelForQuestionAnswering`](https://huggingface.co/docs/transformers/main/model_doc/auto).


We also load the tokenizer, which converts the questions and contexts from the dataset to tokens, in the input format that the model expects.

In [7]:
fp32_model = AutoModelForQuestionAnswering.from_pretrained(FP32_MODEL_ID)
int8_model = OVModelForQuestionAnswering.from_pretrained(INT8_MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(FP32_MODEL_ID)

# See how the tokenizer for the given model converts input text to model input values
tokenizer("hello world!")

{'input_ids': [101, 7592, 2088, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

## Compare INT8 and FP32 models

We compare the accuracy, model size, inference results and latency of the FP32 and INT8 models.

### Inference Pipeline

Transformers [Pipelines](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial) simplify model inference. A `Pipeline` is created by passing a task, model and tokenizer to the `pipeline` function. Inference is then as simple as `qa_pipeline({"question": question, "context": context})`.

We create two pipelines: `hf_qa_pipeline` and `ov_qa_pipeline` to compare the FP32 PyTorch model with the OpenVINO INT8 model. These pipelines will also be used for showing the accuracy difference and for benchmarking later in this notebook.

For some Intel processors, it can be beneficial to reshape the OpenVINO model to a static shape of (1,384) for faster inference. This requires padding or truncating inputs to the specified sequence length. This can be done by adding `padding`, `max_seq_len` and `truncation` arguments to the `pipeline` function. See Hugging Face's [padding and truncation documentation](https://huggingface.co/docs/transformers/pad_truncation) for more information on the possible values.

Setting a shorter sequence length in the cell below will speed up inference further, with the possibility of a drop in accuracy, since larger model inputs will be truncated.
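As a rough illustration of what padding and truncating to a fixed sequence length means, the sketch below pads or cuts a list of token IDs in plain Python. The helper `pad_or_truncate` is hypothetical, and `pad_token_id=0` is an assumption matching the BERT vocabulary. The real tokenizer does more: among other things, it keeps the special tokens in place when truncating.

```python
def pad_or_truncate(input_ids, seq_length, pad_token_id=0):
    """Return (input_ids, attention_mask), both exactly seq_length long."""
    ids = list(input_ids[:seq_length])  # truncate inputs that are too long
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = seq_length - len(ids)
    return ids + [pad_token_id] * padding, attention_mask + [0] * padding


# Token IDs for "hello world!" as shown in the tokenizer output earlier.
hello_ids = [101, 7592, 2088, 999, 102]

print(pad_or_truncate(hello_ids, 8))
# ([101, 7592, 2088, 999, 102, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0])
print(pad_or_truncate(hello_ids, 3))
# ([101, 7592, 2088], [1, 1, 1])
```

The second call shows why a shorter sequence length can cost accuracy: everything after the cut is no longer visible to the model.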

In [8]:
USE_DYNAMIC_SHAPES = False

if USE_DYNAMIC_SHAPES:
    ov_qa_pipeline = pipeline("question-answering", model=int8_model, tokenizer=tokenizer)
else:
    seq_length = 384
    int8_model.reshape(1, seq_length)
    int8_model.compile()
    ov_qa_pipeline = pipeline(
        "question-answering", model=int8_model, tokenizer=tokenizer, max_seq_len=seq_length, padding="max_length", truncation=True
    )

hf_qa_pipeline = pipeline("question-answering", model=fp32_model, tokenizer=tokenizer)

Show a dataset item and inference results on both pipelines.

In [9]:
context = examples[0]["context"]
question = "Who won the game?"
print(context)

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.


In [10]:
hf_qa_pipeline({"question": question, "context": context})["answer"]

'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'

In [11]:
ov_qa_pipeline({"question": question, "context": context})["answer"]

'Denver Broncos'

### Accuracy

We compare the metrics of the quantized model and the original FP32 model. The [evaluate](https://github.com/huggingface/evaluate) library makes it easy to evaluate models on a given dataset with a given metric. For the SQuAD dataset, the F1 score and Exact Match metrics are returned.

The SQuAD validation set is fairly large, and evaluating on all of it can take some time. For demonstration purposes, we evaluate the metrics on a random subset of 500 items. The metrics on the full validation dataset are:

```
FP32 exact match 81.5, F1 88.7
INT8 exact match 82.5, F1 89.5
```
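For readers unfamiliar with the two metrics, the sketch below shows a simplified version of how Exact Match and token-level F1 are commonly computed for SQuAD-style answers: the answers are normalized (lowercased, with punctuation and articles removed), and the best score over all reference answers is kept. The function names are hypothetical, and this is an approximation for illustration, not the code of the `squad` metric in `evaluate`.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(metric_fn, prediction, references):
    """Score against every reference answer and keep the best result."""
    return max(metric_fn(prediction, ref) for ref in references)


references = ["Bruno Mars"]
print(f"{100 * best_score(f1_score, 'Beyoncé and Bruno Mars', references):.2f}")  # 66.67
print(f"{100 * best_score(exact_match, 'Bruno Mars', references):.2f}")  # 100.00
```

A prediction that contains the reference plus extra words keeps full recall but loses precision, which is why it scores below 100 on F1 and 0 on Exact Match.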

The evaluate function also keeps track of the time it takes to run. This provides an estimate of performance, but keep in mind that other programs running on the computer (including Jupyter), as well as power management settings, can affect performance.
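If you want to time a function yourself, a minimal sketch could look like the following. The helper `mean_latency_ms` is hypothetical and does not reproduce how `evaluate` measures time; it only shows the usual pattern of running a few warmup calls before starting the clock.

```python
import time


def mean_latency_ms(fn, inputs, warmup=3):
    """Call fn on every input and return the mean time per call in milliseconds."""
    for item in inputs[:warmup]:  # warmup calls are not timed
        fn(item)
    start = time.perf_counter()
    for item in inputs:
        fn(item)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / len(inputs)


# Stand-in workload; replace it with a call to a pipeline to time real inference.
latency = mean_latency_ms(lambda text: text.upper(), ["hello world"] * 100)
print(f"{latency:.4f} ms per call")
```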

If you have a processor with an Intel integrated GPU, or a dedicated Intel GPU, you can run inference on the GPU for even faster performance. An 11th generation or later Intel Core processor with Xe graphics is recommended for iGPU inference. See the [OpenVINO documentation](https://docs.openvino.ai/latest/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing GPU drivers if you are on Linux or macOS (on Windows, iGPU inference should work out of the box).

Currently, dynamic shapes are supported on GPU only with limitations. The code below enables GPU inference if a GPU is available to OpenVINO and the model was reshaped to static shapes in the previous section. Note that minor variations in accuracy between CPU and GPU are expected.

In [12]:
random.seed(2023)
num_items = 500
# Set num_items to len(examples) to validate on the entire dataset. That may take a long time!
# num_items = len(examples)
indices = sorted(random.sample(range(len(examples)), k=num_items))
filtered_examples = examples.select(indices)

In [13]:
squad_eval = evaluator("question-answering")

hf_eval_results = squad_eval.compute(
    model_or_pipeline=hf_qa_pipeline,
    data=filtered_examples,
    metric="squad",
    squad_v2_format=VERSION_2_WITH_NEGATIVE,
)

devices = ("CPU", "GPU") if ("GPU" in Core().available_devices and not int8_model.is_dynamic) else ("CPU",)
ov_eval_results = {}
for device in devices:
    int8_model.to(device)
    int8_model.compile()

    # run a few warmup inferences
    for item in examples.select(range(10)):
        ov_qa_pipeline(item["question"], item["context"])

    ov_eval_results[device] = squad_eval.compute(
        model_or_pipeline=ov_qa_pipeline,
        data=filtered_examples,
        metric="squad",
        squad_v2_format=VERSION_2_WITH_NEGATIVE,
    )

In [14]:
summary = (
    pd.DataFrame.from_records(
        [hf_eval_results, *ov_eval_results.values()],
        columns=["exact_match", "f1", "latency_in_seconds"],
        index=["FP32", *(f"INT8 {device}" for device in devices)],
    )
    .round(4)
    .dropna()
)
# Convert latency from seconds to milliseconds
summary["latency_in_seconds"] *= 1000
summary.columns = ["exact_match", "f1", "latency"]
summary

| | exact_match | f1 | latency |
|---|---|---|---|
| FP32 | 80.8 | 88.4116 | 143.7 |
| INT8 CPU | 82.0 | 88.7953 | 64.1 |
| INT8 GPU | 82.8 | 89.3397 | 34.1 |


In [15]:
for device in devices:
    int8_speedup = summary.loc["FP32"]["latency"] / summary.loc[f"INT8 {device}"]["latency"]
    print(f"INT8 speedup on {device}: {int8_speedup:.2f}X")
print(Core().get_property("CPU", "FULL_DEVICE_NAME"))

INT8 speedup on CPU: 2.24X
INT8 speedup on GPU: 4.21X
11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz


### Inference Results

To fully understand the quality of a model, it is useful to look beyond metrics like Exact Match and F1 score and examine model predictions directly. This can give a more complete impression of the model's performance and help identify areas for improvement.

In the next cell, we go over a selection of items in the filtered validation set and display the items where the F1 score of the FP32 prediction differs from that of the INT8 prediction.

The table displays the question and the set of correct answers from the dataset, the FP32 prediction and F1 score and the INT8 prediction and F1 score. The results show that for some predictions, the FP32 model is better, and for others, the INT8 model is, and that for the large majority of dataset items both models are equally accurate.

In [16]:
results = []
int8_better = 0
num_items = 100
metric = evaluate.load("squad_v2" if VERSION_2_WITH_NEGATIVE else "squad")

for item in filtered_examples.select(range(num_items)):
    # Access fields by name instead of relying on the order of item.values(),
    # and avoid shadowing the built-in `id`
    item_id, context, question, answers = item["id"], item["context"], item["question"], item["answers"]
    fp32_answer = hf_qa_pipeline(question, context)["answer"]
    int8_answer = ov_qa_pipeline(question, context)["answer"]

    references = [{"id": item_id, "answers": answers}]
    fp32_predictions = [{"id": item_id, "prediction_text": fp32_answer}]
    int8_predictions = [{"id": item_id, "prediction_text": int8_answer}]

    fp32_score = round(metric.compute(references=references, predictions=fp32_predictions)["f1"], 2)
    int8_score = round(metric.compute(references=references, predictions=int8_predictions)["f1"], 2)

    if int8_score != fp32_score:
        results.append((question, answers["text"], fp32_answer, fp32_score, int8_answer, int8_score))
        if int8_score > fp32_score:
            int8_better += 1

In [17]:
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(
    results,
    columns=["Question", "Answer", "FP32 prediction", "FP32 F1", "INT8 prediction", "INT8 F1"],
)
df

| | Question | Answer | FP32 prediction | FP32 F1 | INT8 prediction | INT8 F1 |
|---|---|---|---|---|---|---|
| 0 | Who was the male singer who performed as a special guest during Super Bowl 50? | [Bruno Mars, Bruno Mars, Bruno Mars,] | Beyoncé and Bruno Mars | 66.67 | Bruno Mars | 100.0 |
| 1 | What position does Demaryius Thomas play? | [receiver, receiver, Thomas] | Veteran receiver | 66.67 | receiver | 100.0 |
| 2 | Which smartphone customers were the only people who could stream the game on their phones? | [Verizon Wireless customers, Verizon, Verizon] | Verizon Wireless | 80.0 | Verizon | 100.0 |
| 3 | Who stripped the ball from Cam Newton while sacking him on this drive? | [Von Miller, Von Miller, Miller] | Von Miller | 100.0 | linebacker Von Miller | 80.0 |
| 4 | What were Tesla's mother's special abilities? | [making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems, making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems, making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems] | memorize Serbian epic poems | 47.06 | craft tools, mechanical appliances, and the ability to memorize Serbian epic poems | 91.67 |
| 5 | What was Tesla's AC system used for in Pittsburgh? | [to power the city's streetcars., the city's streetcars, street cars] | create an alternating current system to power the city's streetcars | 66.67 | helping to create an alternating current system to power the city's streetcars | 57.14 |
| 6 | Where can Tesla's theories as to what caused the skin damage be found? | [In his many notes, In his many notes] | Roentgen rays | 0.0 | ozone generated in contact with the skin | 20.0 |
| 7 | How far did he claim the mechanical energy could be transmitted? | [over any terrestrial distance, any terrestrial distance, any terrestrial distance] | over any terrestrial distance | 100.0 | terrestrial distance | 80.0 |
| 8 | What was the occasion when he claimed he'd made the death ray? | [at a luncheon in his honor, a luncheon in his honor, a luncheon in his honor] | luncheon | 40.0 | 1937, at a luncheon | 50.0 |
| 9 | A non-deterministic Turing machine has the ability to capture what facet of useful analysis? | [mathematical models, mathematical models, branching] | mathematical models we want to analyze | 50.0 | mathematical models | 100.0 |


### Model Size

We save the FP32 and INT8 models to a temporary directory and define a function to show the model size for the PyTorch and OpenVINO models.

In [18]:
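def get_model_size(model_folder, framework):
    """
    Return OpenVINO or PyTorch model size in MB.
    Arguments:
        model_folder:
            Directory containing a pytorch_model.bin (or model.safetensors) for a PyTorch model,
            and an openvino_model.xml/.bin for an OpenVINO model.
        framework:
            Define whether the model is a PyTorch or an OpenVINO model.
    """
    if framework.lower() == "openvino":
        model_path = Path(model_folder) / "openvino_model.xml"
        model_size = model_path.stat().st_size + model_path.with_suffix(".bin").stat().st_size
    elif framework.lower() == "pytorch":
        model_path = Path(model_folder) / "pytorch_model.bin"
        if not model_path.exists():
            # Recent transformers releases save weights as model.safetensors by default
            model_path = Path(model_folder) / "model.safetensors"
        model_size = model_path.stat().st_size
    else:
        raise ValueError(f"Unsupported framework: {framework!r}. Expected 'openvino' or 'pytorch'.")
    # Convert bytes to megabytes (1 MB = 1,000,000 bytes)
    model_size /= 1000 * 1000
    return model_size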


with tempfile.TemporaryDirectory() as fp32_model_dir:
    fp32_model.save_pretrained(fp32_model_dir)
    fp32_model_size = get_model_size(fp32_model_dir, "pytorch")

with tempfile.TemporaryDirectory() as int8_model_dir:
    int8_model.save_pretrained(int8_model_dir)
    int8_model_size = get_model_size(int8_model_dir, "openvino")

print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")

FP32 model size: 435.64 MB
INT8 model size: 147.57 MB
INT8 size decrease: 2.95x
