# Quantize Speech Recognition Models using NNCF PTQ API
This tutorial demonstrates how to apply `INT8` quantization to the [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) speech recognition model, using NNCF (Neural Network Compression Framework) 8-bit quantization in post-training mode (without a fine-tuning pipeline). This notebook uses a fine-tuned [Wav2Vec2-Base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) [PyTorch](https://pytorch.org/) model trained on the [LibriSpeech ASR corpus](https://www.openslr.org/12). The tutorial is designed to be extendable to custom models and datasets. It consists of the following steps:

- Download and prepare the Wav2Vec2 model and LibriSpeech dataset.
- Define data loading and accuracy validation functionality.
- Quantize the model.
- Compare the accuracy of the original PyTorch model and the OpenVINO FP16 and INT8 models.
- Compare performance of the original and quantized models.
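
As a rough intuition for what `INT8` quantization does to a tensor, the following minimal sketch maps a few floating-point values onto a uniform 8-bit grid and back. It is an illustration only: the helper name `quantize_dequantize` and the toy `weights` are assumptions made for this example, and NNCF's actual algorithm is considerably more involved.

```python
def quantize_dequantize(values, num_bits=8):
    """Map floats to signed integers on a uniform grid and back (illustrative only)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127 for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for a constant input
    zero_point = round(qmin - lo / scale)     # integer that represents the value 0.0
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    restored = [(q - zero_point) * scale for q in quantized]
    return quantized, restored, scale


# Hypothetical values standing in for a slice of model weights.
weights = [-0.62, -0.10, 0.0, 0.25, 0.91]
quantized, restored, scale = quantize_dequantize(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"max round-trip error: {max_error:.5f} (step size {scale:.5f})")
```

The round-trip error is bounded by the step size, which in turn depends on the value range chosen for the tensor. This is why the calibration dataset and the range estimation settings used later in the tutorial matter.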

## Imports

In [None]:
# Install nncf==2.5.0 from GitHub, because it is not yet published on PyPI. This version includes the advanced quantization parameters feature.
!pip install -q git+https://github.com/openvinotoolkit/nncf@release_v250#egg=nncf

In [None]:
import os
import sys
import re
import numpy as np
import tarfile
from itertools import groupby
import soundfile as sf
import IPython.display as ipd

import torch  # needed by the ONNX export cell below
from transformers import Wav2Vec2ForCTC

sys.path.append("../utils")
from notebook_utils import download_file

## Settings

In [None]:
from pathlib import Path

# Set the data and model directories.
MODEL_DIR = Path("model")
DATA_DIR = Path("../data/datasets/librispeech")
MODEL_DIR.mkdir(exist_ok=True)
# DATA_DIR is nested, so create missing parent directories as well.
DATA_DIR.mkdir(parents=True, exist_ok=True)

## Prepare the Model
Perform the following:
- Download a pre-trained Wav2Vec2 model.
- Convert the model to ONNX.
- Run Model Optimizer to convert the model from the ONNX representation to the OpenVINO Intermediate Representation (OpenVINO IR).

In [None]:
download_file("https://huggingface.co/facebook/wav2vec2-base-960h/resolve/main/pytorch_model.bin", directory=Path(MODEL_DIR) / 'pytorch', show_progress=True)
download_file("https://huggingface.co/facebook/wav2vec2-base-960h/resolve/main/config.json", directory=Path(MODEL_DIR) / 'pytorch', show_progress=False)

Load the original PyTorch model and convert it to the ONNX representation.

In [None]:
BATCH_SIZE = 1
MAX_SEQ_LENGTH = 30480


def export_model_to_onnx(model, path):
    with torch.no_grad():
        default_input = torch.zeros([BATCH_SIZE, MAX_SEQ_LENGTH], dtype=torch.float)
        inputs = {
            "inputs": default_input
        }
        symbolic_names = {0: "batch_size", 1: "sequence_len"}
        torch.onnx.export(
            model,
            (inputs["inputs"]),
            path,
            opset_version=11,
            input_names=["inputs"],
            output_names=["logits"],
            dynamic_axes={
                "inputs": symbolic_names,
                "logits": symbolic_names,
            },
        )
        print("ONNX model saved to {}".format(path))


torch_model = Wav2Vec2ForCTC.from_pretrained(Path(MODEL_DIR) / 'pytorch')
onnx_model_path = Path(MODEL_DIR) / "wav2vec2_base.onnx"
if not onnx_model_path.exists():
    export_model_to_onnx(torch_model, onnx_model_path)

In [None]:
from openvino.tools import mo
from openvino.runtime import Core, serialize
import torch


ov_model = mo.convert_model(onnx_model_path, compress_to_fp16=True)

ir_model_path = MODEL_DIR / "wav2vec2_base.xml"
serialize(ov_model, str(ir_model_path))

## Prepare LibriSpeech Dataset

Use the code below to download and unpack the archives with 'dev-clean' and 'test-clean' subsets of LibriSpeech Dataset.

In [None]:
download_file("http://openslr.elda.org/resources/12/dev-clean.tar.gz", directory=DATA_DIR, show_progress=True)
download_file("http://openslr.elda.org/resources/12/test-clean.tar.gz", directory=DATA_DIR, show_progress=True)

if not os.path.exists(f'{DATA_DIR}/LibriSpeech/dev-clean'):
    with tarfile.open(f"{DATA_DIR}/dev-clean.tar.gz") as tar:
        tar.extractall(path=DATA_DIR)
if not os.path.exists(f'{DATA_DIR}/LibriSpeech/test-clean'):
    with tarfile.open(f"{DATA_DIR}/test-clean.tar.gz") as tar:
        tar.extractall(path=DATA_DIR)

## Define DataLoader
The Wav2Vec2 model accepts a raw waveform of the speech signal as input and produces vocabulary class estimations as output. Since the dataset contains audio files in FLAC format, use the `soundfile` package to read them as waveforms.

> **NOTE**: Consider increasing `samples_limit` to get more precise results. A value of `300` or more is suggested, although processing will take longer.

In [None]:
class LibriSpeechDataLoader:

    @staticmethod
    def read_flac(file_name):
        speech, samplerate = sf.read(file_name)
        assert samplerate == 16000, "read_flac: only 16kHz supported!"
        return speech

    # Required methods
    def __init__(self, config, samples_limit=300):
        """Constructor
        :param config: data loader specific config
        """
        self.samples_limit = samples_limit
        self._data_dir = config["data_source"]
        self._ds = []
        self._prepare_dataset()

    def __len__(self):
        """Returns size of the dataset"""
        return len(self._ds)

    def __getitem__(self, index):
        """
        Returns annotation, data and metadata at the specified index.
        Possible formats:
        (index, annotation), data
        (index, annotation), data, metadata
        """
        label = self._ds[index][0]
        inputs = {'inputs': np.expand_dims(self._ds[index][1], axis=0)}
        return label, inputs

    # Methods specific to the current implementation
    def _prepare_dataset(self):
        pattern = re.compile(r'([0-9\-]+)\s+(.+)')
        data_folder = Path(self._data_dir)
        txts = list(data_folder.glob('**/*.txt'))
        counter = 0
        for txt in txts:
            content = txt.open().readlines()
            for line in content:
                res = pattern.search(line)
                if not res:
                    continue
                name = res.group(1)
                transcript = res.group(2)
                fname = txt.parent / name
                fname = fname.with_suffix('.flac')
                identifier = str(fname.relative_to(data_folder))
                self._ds.append(((counter, transcript.upper()), LibriSpeechDataLoader.read_flac(os.path.join(self._data_dir, identifier))))
                counter += 1
                if counter >= self.samples_limit:
                    # Limit exceeded
                    return

## Run Quantization
[NNCF](https://github.com/openvinotoolkit/nncf) provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop.
> **NOTE**: NNCF Post-training Quantization is available as a preview feature in the OpenVINO 2022.3 release. Fully functional support will be provided in the next releases.

Create a quantized model from the pre-trained `FP16` model and the calibration dataset. The optimization process contains the following steps:

1. Create a Dataset for quantization.
2. Run `nncf.quantize` to get an optimized model. The `nncf.quantize` function provides an interface for model quantization. It requires an instance of the OpenVINO Model and a quantization dataset. Optionally, additional parameters that configure the quantization process (number of samples for quantization, preset, ignored scope, etc.) can be provided. To preserve accuracy, some operations can be kept in floating-point precision by listing them in the `ignored_scope` parameter; this tutorial excludes two feature-extractor convolutions and one encoder `MatMul`. `advanced_parameters` can be used to specify advanced quantization parameters for fine-tuning the quantization algorithm. In this tutorial, range estimator parameters for activations are passed. For more information, see [Tune quantization parameters](https://docs.openvino.ai/2023.0/basic_quantization_flow.html#tune-quantization-parameters).
3. Serialize the OpenVINO IR model using the `openvino.runtime.serialize` function.

In [None]:
import nncf
from nncf.quantization.advanced_parameters import AdvancedQuantizationParameters, RangeEstimatorParameters
from nncf.quantization.range_estimator import StatisticsCollectorParameters, StatisticsType, AggregatorType
from nncf.parameters import ModelType


def transform_fn(data_item):
    """
    Extract the model's input from the data item.
    The data item here is the data item that is returned from the data source per iteration.
    This function should be passed when the data item cannot be used as model's input.
    """
    _, inputs = data_item

    return inputs["inputs"]


dataset_config = {"data_source": os.path.join(DATA_DIR, "LibriSpeech/dev-clean")}
data_loader = LibriSpeechDataLoader(dataset_config, samples_limit=300)
calibration_dataset = nncf.Dataset(data_loader, transform_fn)


quantized_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    model_type=ModelType.TRANSFORMER,  # specify additional transformer patterns in the model
    ignored_scope=nncf.IgnoredScope(
        names=[
            '/wav2vec2/feature_extractor/conv_layers.1/conv/Conv',
            '/wav2vec2/feature_extractor/conv_layers.2/conv/Conv',
            '/wav2vec2/encoder/layers.7/feed_forward/output_dense/MatMul'
        ],
    ),
    advanced_parameters=AdvancedQuantizationParameters(
        activations_range_estimator_params=RangeEstimatorParameters(
            min=StatisticsCollectorParameters(
                statistics_type=StatisticsType.MIN,
                aggregator_type=AggregatorType.MIN
            ),
            max=StatisticsCollectorParameters(
                statistics_type=StatisticsType.QUANTILE,
                aggregator_type=AggregatorType.MEAN,
                quantile_outlier_prob=0.0001
            ),
        )
    )
)
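
The range estimator settings above combine a per-sample statistic with an aggregator across calibration samples. The sketch below mimics that idea in plain Python under a stated assumption: the upper bound is taken as the `1 - quantile_outlier_prob` quantile of each sample and then averaged, while the lower bound is the minimum of per-sample minimums. The helper names and toy data are hypothetical, and this is not NNCF's implementation.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def estimate_range(batches, outlier_prob):
    """Compute per-batch statistics first, then aggregate them across batches."""
    per_batch_min = [min(batch) for batch in batches]
    per_batch_max = [quantile(sorted(batch), 1 - outlier_prob) for batch in batches]
    range_min = min(per_batch_min)                       # MIN statistic, MIN aggregator
    range_max = sum(per_batch_max) / len(per_batch_max)  # QUANTILE statistic, MEAN aggregator
    return range_min, range_max


# Toy activations: the second batch contains one large outlier.
batches = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-2.0, 0.0, 0.1, 0.2, 50.0],
]
range_min, range_max = estimate_range(batches, outlier_prob=0.25)
print(range_min, range_max)
```

The exaggerated `outlier_prob=0.25` is chosen only so that the effect is visible on five values per batch: the outlier `50.0` no longer stretches the upper bound. With a real calibration set and a small probability such as `0.0001`, only the most extreme activations would be clipped.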

In [None]:
MODEL_NAME = 'quantized_wav2vec2_base'
quantized_model_path = Path(f"{MODEL_NAME}_openvino_model/{MODEL_NAME}_quantized.xml")
serialize(quantized_model, str(quantized_model_path))

## Model Usage Example with Inference Pipeline
Both the initial (`FP16`) and quantized (`INT8`) models are used in exactly the same way.

Start by taking one example from the dataset to show the inference steps.

Next, load the quantized model to the inference pipeline.

In [None]:
audio = LibriSpeechDataLoader.read_flac(f'{DATA_DIR}/LibriSpeech/test-clean/121/127105/121-127105-0017.flac')

ipd.Audio(audio, rate=16000)

In [None]:
core = Core()

compiled_model = core.compile_model(model=quantized_model, device_name='CPU')

input_data = np.expand_dims(audio, axis=0)
output_layer = compiled_model.outputs[0]

Next, make a prediction.

In [None]:
predictions = compiled_model([input_data])[output_layer]

Now, decode the predicted logits to text using the `decode_logits` function defined below.

Alternatively, use the `Wav2Vec2Processor` from the `transformers` package.

In [None]:
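# Vocabulary in token-ID order, repeated here because `MetricWER` is defined only in a later cell.
ALPHABET = [
    "<pad>", "<s>", "</s>", "<unk>", "|",
    "e", "t", "a", "o", "n", "i", "h", "s", "r", "d", "l", "u",
    "m", "w", "c", "f", "g", "y", "p", "b", "v", "k", "'", "x", "j", "q", "z"]
WORDS_DELIMITER = '|'
PAD_TOKEN = '<pad>'


def decode_logits(logits):
    decoding_vocab = dict(enumerate(ALPHABET))
    token_ids = np.squeeze(np.argmax(logits, -1))  # most likely token per frame
    tokens = [decoding_vocab[idx] for idx in token_ids]
    tokens = [token_group[0] for token_group in groupby(tokens)]  # collapse repeats
    tokens = [t for t in tokens if t != PAD_TOKEN]
    res_string = ''.join([t if t != WORDS_DELIMITER else ' ' for t in tokens]).strip()
    res_string = ' '.join(res_string.split())  # collapse repeated spaces
    res_string = res_string.lower()
    return res_string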


predicted_text = decode_logits(predictions)
predicted_text
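
The decoding step follows the usual greedy CTC recipe: pick the most likely token per frame, merge consecutive repeats, then drop the padding (blank) token. The following sketch walks through it on a toy example; the six-token vocabulary and the frame IDs are assumptions for illustration and are unrelated to the real model vocabulary.

```python
from itertools import groupby

# Toy vocabulary (hypothetical): index 0 is the CTC blank/pad token, "|" separates words.
vocab = ["<pad>", "|", "h", "e", "l", "o"]


def greedy_ctc_decode(token_ids):
    collapsed = [token for token, _ in groupby(token_ids)]        # merge consecutive repeats
    kept = [vocab[t] for t in collapsed if vocab[t] != "<pad>"]   # drop blank tokens
    return "".join(" " if t == "|" else t for t in kept).strip()  # word delimiter -> space


# Per-frame argmax IDs; the blank between the two "l" runs keeps both letters.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 4, 4, 5, 0, 1]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)
```

Without the blank token between them, the two runs of `l` would be merged into a single letter, which is why repeats are collapsed before blanks are removed and not after.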

## Validate model accuracy on dataset
The code below evaluates model accuracy on the test dataset. It contains the following steps:

* Define `MetricWER` class to calculate Word Error Rate.
* Define dataloader for test dataset.
* Define functions to get inference for PyTorch and OpenVINO models.
* Define functions to compute Word Error Rate.
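
For reference, Word Error Rate is the word-level edit distance between the reference transcript and the prediction, divided by the number of words in the reference. The sketch below is a compact stand-alone version of that definition; the function name `word_error_rate` and the example sentences are assumptions for illustration, and the tutorial's own `MetricWER` class additionally accumulates the score over the whole dataset.

```python
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    # Normalize by the reference length; an empty reference yields 0.0 by convention here.
    return previous[-1] / len(ref) if ref else 0.0


# One substitution ("sat" -> "sit") and one deletion ("the") against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")
```

Because the denominator is the reference length, the order of the two arguments matters even though the edit distance itself is symmetric.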

In [None]:
class MetricWER:
    alphabet = [
        "<pad>", "<s>", "</s>", "<unk>", "|",
        "e", "t", "a", "o", "n", "i", "h", "s", "r", "d", "l", "u",
        "m", "w", "c", "f", "g", "y", "p", "b", "v", "k", "'", "x", "j", "q", "z"]
    words_delimiter = '|'
    pad_token = '<pad>'

    # Required methods
    def __init__(self):
        self._name = "WER"
        self._sum_score = 0
        self._sum_words = 0
        self._cur_score = 0
        self._decoding_vocab = dict(enumerate(self.alphabet))

    @property
    def value(self):
        """Returns accuracy metric value for the last model output."""
        return {self._name: self._cur_score}

    @property
    def avg_value(self):
        """Returns accuracy metric value for all model outputs."""
        return {self._name: self._sum_score / self._sum_words if self._sum_words != 0 else 0}

    def update(self, output, target):
        """
        Updates prediction matches.

        :param output: model output
        :param target: annotations
        """
        decoded = [decode_logits(i) for i in output]
        target = [i.lower() for i in target]
        assert len(output) == len(target), "sizes of output and target mismatch!"
        for i in range(len(output)):
            # The reference transcript (annotation) comes first, the decoded prediction second.
            self._get_metric_per_sample(target[i], decoded[i])

    def reset(self):
        """
        Resets collected matches
        """
        self._sum_score = 0
        self._sum_words = 0
        self._cur_score = 0

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-worse", "type": "WER"}}

    # Methods specific to the current implementation
    def _get_metric_per_sample(self, annotation, prediction):
        cur_score = self._editdistance_eval(annotation.split(), prediction.split())
        cur_words = len(annotation.split())

        self._sum_score += cur_score
        self._sum_words += cur_words
        # Guard against an empty annotation before dividing.
        self._cur_score = cur_score / cur_words if cur_words != 0 else 0

        return self._cur_score

    def _editdistance_eval(self, source, target):
        n, m = len(source), len(target)

        distance = np.zeros((n + 1, m + 1), dtype=int)
        distance[:, 0] = np.arange(0, n + 1)
        distance[0, :] = np.arange(0, m + 1)

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if source[i - 1] == target[j - 1] else 1

                distance[i][j] = min(distance[i - 1][j] + 1,
                                     distance[i][j - 1] + 1,
                                     distance[i - 1][j - 1] + cost)
        return distance[n][m]

In [None]:
from tqdm.notebook import tqdm


dataset_config = {"data_source": os.path.join(DATA_DIR, "LibriSpeech/test-clean")}
test_data_loader = LibriSpeechDataLoader(dataset_config, samples_limit=300)


# inference function for pytorch
def torch_infer(model, sample):
    output = model(torch.Tensor(sample[1]['inputs'])).logits
    output = output.detach().cpu().numpy()

    return output


# inference function for openvino
def ov_infer(model, sample):
    output_layer = model.output(0)
    output = model(np.array(sample[1]['inputs']))[output_layer]

    return output


def compute_wer(dataset, model, infer_fn):
    wer = MetricWER()
    for sample in tqdm(dataset):
        # run infer function on sample
        output = infer_fn(model, sample)
        # update metric on sample result
        wer.update(output, [sample[0][1]])

    return wer.avg_value

Now, compute WER for the original PyTorch model, OpenVINO IR model and quantized model.

In [None]:
compiled_fp32_ov_model = core.compile_model(ov_model)

pt_result = compute_wer(test_data_loader, torch_model, torch_infer)
ov_fp32_result = compute_wer(test_data_loader, compiled_fp32_ov_model, ov_infer)
quantized_result = compute_wer(test_data_loader, compiled_model, ov_infer)

print(f'[PyTorch]   Word Error Rate: {pt_result["WER"]:.4f}')
print(f'[OpenVINO]  Word Error Rate: {ov_fp32_result["WER"]:.4f}')
print(f'[Quantized OpenVINO]  Word Error Rate: {quantized_result["WER"]:.4f}')

## Compare Performance of the Original and Quantized Models
Finally, use the [Benchmark Tool](https://docs.openvino.ai/latest/openvino_inference_engine_tools_benchmark_tool_README.html) to measure the inference performance of the `FP16` and `INT8` models.

> **NOTE**: For more accurate performance measurements, run `benchmark_app` in a terminal/command prompt after closing other applications. Run `benchmark_app -m model.xml -d CPU` to benchmark async inference on CPU for one minute. Change `CPU` to `GPU` to benchmark on GPU. Run `benchmark_app --help` to see an overview of all command-line options.

In [None]:
# Inference FP16 model (OpenVINO IR)
! benchmark_app -m $ir_model_path -shape [1,30480] -d CPU -api async

In [None]:
# Inference INT8 model (OpenVINO IR)
! benchmark_app -m $quantized_model_path -shape [1,30480] -d CPU -api async
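
For a quick, coarse comparison directly inside the notebook, a simple timing helper can complement `benchmark_app`. The sketch below is a different and much cruder technique than the Benchmark Tool: it measures the median synchronous latency of a callable. The helper `measure_latency` and the `dummy_inference` workload are hypothetical names introduced for this example.

```python
import statistics
import time


def measure_latency(infer_fn, num_runs=20, warmup=3):
    """Return the median wall-clock latency of `infer_fn` in milliseconds."""
    for _ in range(warmup):
        infer_fn()  # warm-up runs are excluded from the measurement
    timings_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        infer_fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings_ms)


# Stand-in workload so that the example runs without any model.
def dummy_inference():
    return sum(i * i for i in range(10_000))


median_ms = measure_latency(dummy_inference)
print(f"median latency: {median_ms:.3f} ms")
```

In this tutorial, the callable could be something like `lambda: compiled_model([input_data])`. Such numbers are only indicative, since they ignore asynchronous execution and throughput-oriented settings that `benchmark_app` exercises.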