# Post-Training Quantization of OpenAI Whisper model with NNCF

The goal of this tutorial is to demonstrate how to speed up the model by applying 8-bit post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) (Neural Network Compression Framework) and to run inference on the quantized model with OpenVINO™ Toolkit. The optimization process contains the following steps:

1. Quantize the converted OpenVINO model from the [227-whisper-convert](227-whisper-convert.ipynb) notebook with NNCF.
2. Check the model result using the same input data as in the [227-whisper-convert](227-whisper-convert.ipynb) notebook.
3. Compare the model size of the converted and quantized models.
4. Compare the performance and accuracy of the converted and quantized models.

> **NOTE**: Run the [227-whisper-convert](227-whisper-convert.ipynb) notebook first to generate the OpenVINO IR model that is used for quantization.

<a id="0"></a>
Table of contents:
- [Prerequisites](#1)
- [Create and initialize quantization](#2)
    - [Prepare calibration datasets](#3)
- [Run quantized OpenVINO model](#4)
    - [Compare File Size](#5)
    - [Compare inference time and accuracy of the FP32 IR and quantized models](#6)


<a id="1"></a>
## Prerequisites [&#8657;](#0)


Install dependencies.

In [37]:
#!pip install -q "git+https://github.com/openvinotoolkit/nncf.git"
#!pip install -q datasets
#!pip install -q evaluate

Select a device from the dropdown list for running inference with OpenVINO.

In [39]:
import ipywidgets as widgets

from openvino import Core
core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')

Select the task for the model:

* **transcribe** - generate audio transcription in the source language (automatically detected).
* **translate** - generate audio transcription with translation to English language.

In [40]:
task = widgets.Select(
    options=["transcribe", "translate"],
    value="translate",
    description="Select task:",
    disabled=False
)
task

Select(description='Select task:', index=1, options=('transcribe', 'translate'), value='translate')

<a id="2"></a>
## Create and initialize quantization [&#8657;](#0)

[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional layers. The framework is designed so that modifications to your original training code are minor. Quantization is the simplest scenario and requires only a few modifications.
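
To make the idea of "initializing quantization parameters from calibration data" more concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization. It is only an illustration under simplifying assumptions (one scale per tensor, chosen from the largest absolute calibration value); the helper names `calibrate_scale`, `quantize` and `dequantize` are hypothetical and are not part of the NNCF API, which uses more elaborate statistics and per-channel parameters.

```python
def calibrate_scale(samples, num_bits=8):
    """Pick one symmetric scale from calibration samples (assumed scheme)."""
    max_abs = max(abs(v) for sample in samples for v in sample)
    qmax = 2 ** (num_bits - 1) - 1  # 127 for signed 8-bit
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers in [-128, 127], clamping out-of-range values."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in q_values]


# Toy "calibration dataset": the largest absolute value seen is 1.27.
calibration_samples = [[-0.5, 0.25, 1.27], [0.9, -1.0, 0.1]]
scale = calibrate_scale(calibration_samples)

q = quantize([0.5, -1.27, 2.0], scale)
restored = dequantize(q, scale)
print(q)         # 2.0 lies outside the calibrated range and is clamped
print(restored)
```

Values that fall outside the range observed during calibration are clamped, which is one reason the calibration data should resemble the data seen at inference time.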

The optimization process contains the following steps:

1. Create a calibration dataset for quantization.
2. Run `nncf.quantize` to get a quantized model.
3. Serialize the `INT8` model using the `openvino.runtime.serialize` function.

Set the paths to the models converted in the [227-whisper-convert](227-whisper-convert.ipynb) notebook and the paths where the quantized models will be saved.

In [None]:
from pathlib import Path

WHISPER_ENCODER_OV = Path("whisper_encoder.xml")
WHISPER_DECODER_OV = Path("whisper_decoder.xml")

WHISPER_ENCODER_OV_INT8 = Path("whisper_encoder_int8.xml")
WHISPER_DECODER_OV_INT8 = Path("whisper_decoder_int8.xml")

Load the `FP32` model IR.

In [41]:
import whisper
from utils import patch_whisper_for_ov_inference, OpenVINOAudioEncoder, OpenVINOTextDecoder

model_fp32 = whisper.load_model("base").to("cpu").eval()
patch_whisper_for_ov_inference(model_fp32)

model_fp32.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV, device=device.value)
model_fp32.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV, device=device.value)

<a id="3"></a>
### Prepare calibration datasets [&#8657;](#0)

Whisper consists of an encoder model and a decoder model. We need to collect calibration data for both of them.

Below we override the encoder and decoder `forward` methods so that they record calibration samples while collection is enabled.

In [42]:
from contextlib import contextmanager
from functools import partial
from openvino import Tensor
from typing import Optional
import torch

COLLECT_CALIBRATION_DATA = False
encoder_calibration_data = []
decoder_calibration_data = []

@contextmanager
def calibration_data_collection():
    global COLLECT_CALIBRATION_DATA
    try:
        COLLECT_CALIBRATION_DATA = True
        yield
    finally:
        COLLECT_CALIBRATION_DATA = False


def encoder_forward(self, mel: torch.Tensor):
    if COLLECT_CALIBRATION_DATA:
        encoder_calibration_data.append(mel)
    return torch.from_numpy(self.compiled_model(mel)[self.output_blob])

def decoder_forward(self, x: torch.Tensor, xa: torch.Tensor, kv_cache: Optional[dict] = None):
    feed_dict = {'x': Tensor(x.numpy()), 'xa': Tensor(xa.numpy())}
    feed_dict = (self.preprocess_kv_cache_inputs(feed_dict, kv_cache))
    if COLLECT_CALIBRATION_DATA:
        decoder_calibration_data.append(feed_dict)
    res = self.compiled_model(feed_dict)
    return self.postprocess_outputs(res)

model_fp32.encoder.forward = partial(encoder_forward, model_fp32.encoder)
model_fp32.decoder.forward = partial(decoder_forward, model_fp32.decoder)
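
The collection pattern used above (a module-level flag toggled by a context manager, plus a replaced `forward` bound with `functools.partial`) can be tried in isolation. The following sketch uses a hypothetical `ToyModel` in place of the OpenVINO wrappers, so all names in it are illustrative assumptions rather than parts of the notebook's `utils` module.

```python
from contextlib import contextmanager
from functools import partial

COLLECT = False          # collection is off by default
collected_inputs = []    # recorded inputs end up here


@contextmanager
def collecting():
    """Enable input recording only inside the `with` block."""
    global COLLECT
    try:
        COLLECT = True
        yield
    finally:
        COLLECT = False  # always switched off, even if an error occurs


class ToyModel:
    """Hypothetical stand-in for an encoder or decoder wrapper."""

    def forward(self, x):
        return x * 2


def recording_forward(self, x):
    """Replacement forward: record the input, then compute as before."""
    if COLLECT:
        collected_inputs.append(x)
    return x * 2


toy = ToyModel()
# Bind the replacement to this instance, as done for the Whisper wrappers.
toy.forward = partial(recording_forward, toy)

toy.forward(1)           # not recorded
with collecting():
    toy.forward(2)       # recorded
    toy.forward(3)       # recorded
toy.forward(4)           # not recorded
print(collected_inputs)  # [2, 3]
```

Because the flag is reset in the `finally` clause, later inference calls (for example during benchmarking) do not keep growing the calibration lists.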

We use a portion of the [librispeech_asr_dummy](https://huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy) dataset from Hugging Face as calibration data.

In [43]:
from datasets import load_dataset
from tqdm import tqdm

CALIBRATION_DATASET_SIZE = 15

# calibration_dataset = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True).take(CALIBRATION_DATASET_SIZE)
calibration_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split=f"validation[:{CALIBRATION_DATASET_SIZE}]")

with calibration_data_collection():
    for data_item in tqdm(calibration_dataset, desc="Collecting calibration data", total=CALIBRATION_DATASET_SIZE):
        model_fp32.transcribe(data_item["audio"]["array"].astype("float32"), beam_size=5, best_of=5, task=task.value)

Collecting calibration data: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:24<00:00,  1.62s/it]


Quantize both the encoder and decoder models using the `nncf.quantize()` API, then save the quantized IRs.

In [44]:
import nncf
from nncf.quantization.advanced_parameters import OverflowFix
from openvino.runtime import serialize

print("Quantizing encoder...")
quantized_encoder = nncf.quantize(
    model=model_fp32.encoder.model,
    calibration_dataset=nncf.Dataset(encoder_calibration_data),
    subset_size=len(encoder_calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        overflow_fix=OverflowFix.DISABLE,
        smooth_quant_alpha=0.5
    )
)
serialize(quantized_encoder, WHISPER_ENCODER_OV_INT8)
print(f"Saved quantized encoder at ./{WHISPER_ENCODER_OV_INT8}")

print("Quantizing decoder...")
quantized_decoder = nncf.quantize(
    model=model_fp32.decoder.model,
    calibration_dataset=nncf.Dataset(decoder_calibration_data),
    subset_size=len(decoder_calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        overflow_fix=OverflowFix.DISABLE,
        smooth_quant_alpha=0.95
    )
)
serialize(quantized_decoder, WHISPER_DECODER_OV_INT8)
print(f"Saved quantized decoder at ./{WHISPER_DECODER_OV_INT8}")

Quantizing encoder...


Statistics collection: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:02<00:00, 12.07it/s]
Applying Smooth Quant: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 70.13it/s]


INFO:nncf:18 ignored nodes was found by name in the NNCFGraph


Statistics collection: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:06<00:00,  4.91it/s]
Applying Fast Bias correction: 100%|████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:06<00:00,  5.09it/s]


Saved quantized encoder at ./whisper_encoder_int8.xml
Quantizing decoder...


Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 398/398 [00:19<00:00, 20.58it/s]
Applying Smooth Quant: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 42.11it/s]


INFO:nncf:36 ignored nodes was found by name in the NNCFGraph


Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 398/398 [00:48<00:00,  8.16it/s]
Applying Fast Bias correction: 100%|████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:07<00:00,  6.20it/s]


Saved quantized decoder at ./whisper_decoder_int8.xml
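
The two `nncf.quantize` calls above differ only in `smooth_quant_alpha` (0.5 for the encoder, 0.95 for the decoder). As a rough illustration of what such a parameter trades off, the sketch below applies the per-channel smoothing rule from the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, to a toy linear layer. This assumes that the NNCF parameter plays the same role as the paper's alpha; it is not NNCF's implementation, and `smooth` is a hypothetical helper.

```python
def smooth(x, w, alpha):
    """Rescale activations and weights per input channel (SmoothQuant-style).

    The product x[j] * w[j] is unchanged, so the layer output is preserved.
    """
    scales = [
        (abs(xj) ** alpha) / (abs(wj) ** (1.0 - alpha))
        for xj, wj in zip(x, w)
    ]
    x_smoothed = [xj / s for xj, s in zip(x, scales)]
    w_smoothed = [wj * s for wj, s in zip(w, scales)]
    return x_smoothed, w_smoothed


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Toy layer y = sum_j x[j] * w[j]; channel 0 carries an activation outlier.
x = [8.0, 0.5]
w = [0.25, 1.0]

for alpha in (0.5, 0.95):
    x_s, w_s = smooth(x, w, alpha)
    print(
        f"alpha={alpha}: "
        f"activations={[round(v, 3) for v in x_s]}, "
        f"weights={[round(v, 3) for v in w_s]}, "
        f"output={dot(x_s, w_s):.3f}"
    )
```

In this toy case a larger alpha flattens the activation range more and pushes more of the dynamic range into the weights, while the layer output stays the same.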


<a id="4"></a>
## Run quantized OpenVINO model [&#8657;](#0)

Load the previously saved `INT8` models into a new instance of the Whisper model.

In [None]:
model_int8 = whisper.load_model("base").to("cpu").eval()
patch_whisper_for_ov_inference(model_int8)

model_int8.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV_INT8, device=device.value)
model_int8.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV_INT8, device=device.value)

Select a video for transcription, as in the [227-whisper-convert](227-whisper-convert.ipynb) notebook.

In [46]:
VIDEO_LINK = "https://youtu.be/kgL5LBM-hFI"
link = widgets.Text(
    value=VIDEO_LINK,
    placeholder="Type link for video",
    description="Video:",
    disabled=False
)
link

Text(value='https://youtu.be/kgL5LBM-hFI', description='Video:', placeholder='Type link for video')

In [47]:
from pytube import YouTube

print(f"Downloading video {link.value} started")

output_file = Path("downloaded_video.mp4")
yt = YouTube(link.value)
yt.streams.get_highest_resolution().download(filename=output_file)
print(f"Video saved to {output_file}")

Downloading video https://youtu.be/kgL5LBM-hFI started
Video saved to downloaded_video.mp4


In [48]:
from utils import get_audio

audio = get_audio(output_file)

Run transcription with the quantized model.

In [49]:
transcription = model_int8.transcribe(audio, beam_size=5, best_of=5, task=task.value)

In [50]:
from utils import prepare_srt

srt_lines = prepare_srt(transcription)
# save transcription
with output_file.with_suffix(".srt").open("w") as f:
    f.writelines(srt_lines)

Now let us see the results.

In [51]:
# widgets.Video.from_file(output_file, loop=False, width=800, height=800)

In [52]:
print("".join(srt_lines))

1
00:00:00,000 --> 00:00:05,000
 What's that?

2
00:00:05,000 --> 00:00:07,000
 Oh wow.

3
00:00:09,000 --> 00:00:11,000
 Hello humans

4
00:00:14,000 --> 00:00:15,000
 focus on me.

5
00:00:15,000 --> 00:00:25,000
 Focus on the guard.

6
00:00:25,000 --> 00:00:32,000
 This is where it all changes.


<a id="5"></a>
### Compare File Size [&#8657;](#0)

In [53]:
def calculate_compression_rate(model_path_ov, model_path_ov_int8):
    model_size_fp32 = model_path_ov.with_suffix(".bin").stat().st_size / 1024
    model_size_int8 = model_path_ov_int8.with_suffix(".bin").stat().st_size / 1024
    print(f"Model: {model_path_ov.stem}")
    print(f"    * FP32 IR model size: {model_size_fp32:.2f} KB")
    print(f"    * INT8 IR model size: {model_size_int8:.2f} KB")
    print(f"    * Model compression rate: {model_size_fp32 / model_size_int8:.3f}")

calculate_compression_rate(WHISPER_ENCODER_OV, WHISPER_ENCODER_OV_INT8)
calculate_compression_rate(WHISPER_DECODER_OV, WHISPER_DECODER_OV_INT8)

Model: whisper_encoder
    * FP32 IR model size: 40216.07 KB
    * INT8 IR model size: 21092.37 KB
    * Model compression rate: 1.907
Model: whisper_decoder
    * FP32 IR model size: 101961.09 KB
    * INT8 IR model size: 78058.77 KB
    * Model compression rate: 1.306
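
Since Whisper needs both files at inference time, it can also be useful to look at the combined compression rate. The short sketch below only redoes the arithmetic on the sizes printed above; the `sizes_kb` dictionary is a hand-copied assumption, not something produced by the notebook code.

```python
# (FP32 size, INT8 size) in KB, copied from the output above.
sizes_kb = {
    "whisper_encoder": (40216.07, 21092.37),
    "whisper_decoder": (101961.09, 78058.77),
}

total_fp32 = sum(fp32 for fp32, _ in sizes_kb.values())
total_int8 = sum(int8 for _, int8 in sizes_kb.values())
overall_rate = total_fp32 / total_int8

print(f"Total FP32 size: {total_fp32:.2f} KB")
print(f"Total INT8 size: {total_int8:.2f} KB")
print(f"Overall compression rate: {overall_rate:.3f}")
```

The combined rate is roughly 1.43, which is closer to the decoder's rate because the decoder accounts for most of the total size.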


<a id="6"></a>
### Compare inference time and accuracy of the FP32 IR and quantized models [&#8657;](#0)
To measure the inference performance of the `FP32` and `INT8` encoder/decoder models, we use the median inference time on the calibration dataset.
This gives an approximate estimate of the speed-up of the quantized models.

> **NOTE**: For the most accurate performance estimation, it is recommended to run `benchmark_app` with static shapes in a terminal/command prompt after closing other applications.
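
The measurement idea (time each call with `time.perf_counter` and take the median so that occasional slow calls do not dominate) can be sketched without any model. In the example below, `make_workload` and `median_call_time` are hypothetical helpers, and the two workloads are simple stand-ins for a slower and a faster compiled model; the warm-up call is an extra assumption that is not part of the measurement code in this notebook.

```python
import statistics
import time


def make_workload(steps):
    """Return a callable whose cost grows with `steps` (model stand-in)."""
    def run(item):
        return sum((item + i) % 7 for i in range(steps))
    return run


def median_call_time(fn, inputs, warmup=1):
    """Median wall-clock time of fn(item) over all inputs, in seconds."""
    for item in inputs[:warmup]:
        fn(item)  # warm-up calls are not measured
    timings = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


inputs = list(range(25))
baseline_time = median_call_time(make_workload(20_000), inputs)
faster_time = median_call_time(make_workload(5_000), inputs)
print(f"Speedup: {baseline_time / faster_time:.3f}")
```

The printed ratio depends on the machine and its current load, which is why such in-notebook numbers should be read as estimates only.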

In [54]:
import time
import numpy as np

def calculate_call_inference_time(model, dataset):
    inference_time = []
    for data_item in tqdm(dataset[:100], desc="Measuring performance"):
        start = time.perf_counter()
        model(data_item)
        end = time.perf_counter()
        delta = end - start
        inference_time.append(delta)
    return np.median(inference_time)


encoder_time_fp32 = calculate_call_inference_time(model_fp32.encoder.compiled_model, encoder_calibration_data)
encoder_time_int8 = calculate_call_inference_time(model_int8.encoder.compiled_model, encoder_calibration_data)
print(f"Encoder performance speedup: {encoder_time_fp32 / encoder_time_int8:.3f}")

decoder_time_fp32 = calculate_call_inference_time(model_fp32.decoder.compiled_model, decoder_calibration_data)
decoder_time_int8 = calculate_call_inference_time(model_int8.decoder.compiled_model, decoder_calibration_data)
print(f"Decoder performance speedup: {decoder_time_fp32 / decoder_time_int8:.3f}")

Measuring performance: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:02<00:00, 12.94it/s]
Measuring performance: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:01<00:00, 17.36it/s]


Encoder performance speedup: 1.324


Measuring performance: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:04<00:00, 22.39it/s]
Measuring performance: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 32.33it/s]


Decoder performance speedup: 1.438


We measure the whole transcription performance separately, because a single Whisper `transcribe()` call triggers multiple encoder and decoder inference calls. The number of these calls is dynamic and depends on the model accuracy.
This time we use the mean time instead of the median because the model transcription time is less uniform.

We also compare accuracy values of the `FP32` and `INT8` models on a small subset of the [librispeech_asr](https://huggingface.co/datasets/librispeech_asr) dataset. Accuracy is reported as word accuracy, that is, `(1 - WER) * 100`.

In [None]:
from evaluate import load
from transformers import WhisperProcessor

wer = load("wer")

TEST_DATASET_SIZE = 30
test_dataset = load_dataset("librispeech_asr", "clean", split="test", streaming=True).take(TEST_DATASET_SIZE)

def calculate_transcription_time_and_accuracy(model, dataset):
    processor = WhisperProcessor.from_pretrained("openai/whisper-large")

    ground_truths = []
    predictions = []
    inference_time = []
    for data_item in tqdm(dataset, desc="Measuring performance and accuracy", total=TEST_DATASET_SIZE):
        audio = data_item["audio"]["array"].astype("float32")

        start_time = time.perf_counter()
        transcription = model.transcribe(audio, task=task.value)
        end_time = time.perf_counter()
        delta_time = end_time - start_time

        reference = processor.tokenizer._normalize(data_item["text"])
        prediction = processor.tokenizer._normalize(transcription["text"])
        ground_truths.append(reference)
        predictions.append(prediction)
        inference_time.append(delta_time)

    word_accuracy = (1 - wer.compute(references=ground_truths, predictions=predictions)) * 100
    mean_inference_time = np.mean(inference_time)
    return mean_inference_time, word_accuracy

transcription_time_fp32, accuracy_fp32 = calculate_transcription_time_and_accuracy(model_fp32, test_dataset)
transcription_time_int8, accuracy_int8 = calculate_transcription_time_and_accuracy(model_int8, test_dataset)
print(f"Whisper transcription performance speedup: {transcription_time_fp32 / transcription_time_int8:.3f}")
print(f"Whisper transcription word accuracy. FP32: {accuracy_fp32:.2f}. INT8: {accuracy_int8:.2f}")

Measuring performance and accuracy: 100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [00:26<00:00,  1.12it/s]
Measuring performance and accuracy: 100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [00:18<00:00,  1.63it/s]

Whisper transcription performance speedup: 1.452
Whisper transcription word accuracy. FP32: 95.57. INT8: 95.40
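
For readers who want to see what the word accuracy figure means, here is a small pure-Python sketch of the `(1 - WER) * 100` computation. It assumes that WER is the total word-level edit distance divided by the total number of reference words across all samples; the helpers `word_edit_distance` and `word_accuracy` are hypothetical and only approximate what the `wer` metric from `evaluate` does, without any text normalization.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two sentences, counted in words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_accuracy(references, predictions):
    """Word accuracy in percent: (1 - WER) * 100 over the whole corpus."""
    errors = sum(
        word_edit_distance(ref, pred)
        for ref, pred in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return (1 - errors / total_words) * 100


refs = ["hello humans focus on me", "this is where it all changes"]
preds = ["hello humans focus on me", "this is were it all change"]
print(f"Word accuracy: {word_accuracy(refs, preds):.2f}")  # 81.82
```

Here two of the eleven reference words are wrong, so the word accuracy is `(1 - 2 / 11) * 100`, or about 81.82.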