# Post-Training Quantization of Grammar Error Correction model with NNCF

The goal of this tutorial is to demonstrate how to speed up the grammar correction model by applying 8-bit post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) (Neural Network Compression Framework) and running inference on the quantized model with OpenVINO™ Toolkit. The optimization process consists of the following steps:

1. Quantize the converted OpenVINO model from [214-grammar-correction-convert notebook](214-grammar-correction-convert.ipynb) with NNCF.
2. Check the model result on a sample text.
3. Compare the model size, performance, and accuracy of the FP32 and quantized INT8 models.

> **NOTE**: Run the [214-grammar-correction-convert](214-grammar-correction-convert.ipynb) notebook first to generate the OpenVINO IR model that is used for quantization.

#### Table of contents:
- [Prerequisites](#Prerequisites-$\Uparrow$)
- [Quantization](#Quantization-$\Uparrow$)
    - [Prepare calibration dataset](#Prepare-calibration-dataset-$\Uparrow$)
    - [Run quantization](#Run-quantization-$\Uparrow$)
- [Run grammar correction with quantized OpenVINO model](#Run-grammar-correction-with-quantized-OpenVINO-model-$\Uparrow$)
- [Compare model size, performance and accuracy](#Compare-model-size,-performance-and-accuracy-$\Uparrow$)


## Prerequisites [$\Uparrow$](#Table-of-contents:)

First, we define the prerequisites and load the models in the same way as in the [214-grammar-correction-convert](214-grammar-correction-convert.ipynb) notebook.

In [1]:
%pip install -q "git+https://github.com/openvinotoolkit/nncf.git@9c671f0ae0a118e4bc2de8b09e66425931c0bfa4"
%pip install -q datasets
%pip install -q jiwer

Note: you may need to restart the kernel to use updated packages.


Select inference device

In [2]:
import ipywidgets as widgets
from openvino.runtime import Core

core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Load Grammar Checker


In [3]:
from pathlib import Path
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForSequenceClassification

grammar_checker_model_id = "textattack/roberta-base-CoLA"
grammar_checker_dir = Path("roberta-base-cola")
grammar_checker_tokenizer = AutoTokenizer.from_pretrained(grammar_checker_model_id)

if grammar_checker_dir.exists():
    grammar_checker_model = OVModelForSequenceClassification.from_pretrained(grammar_checker_dir, device=device.value)
else:
    grammar_checker_model = OVModelForSequenceClassification.from_pretrained(grammar_checker_model_id, export=True, device=device.value)
    grammar_checker_model.save_pretrained(grammar_checker_dir)

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.7'
2023-09-20 18:27:16.343735: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-20 18:27:16.378509: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Compiling the model...
Set CACHE_DIR to roberta-base-cola/model_cache


Load Grammar Corrector

In [4]:
from optimum.intel.openvino import OVModelForSeq2SeqLM

grammar_corrector_model_id = "pszemraj/flan-t5-large-grammar-synthesis"
grammar_corrector_dir = Path("flan-t5-large-grammar-synthesis")
grammar_corrector_tokenizer = AutoTokenizer.from_pretrained(grammar_corrector_model_id)

if grammar_corrector_dir.exists():
    grammar_corrector_model_fp32 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_dir, device=device.value)
else:
    grammar_corrector_model_fp32 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_model_id, export=True, device=device.value)
    grammar_corrector_model_fp32.save_pretrained(grammar_corrector_dir)

Compiling the encoder...
Compiling the decoder...
Compiling the decoder...


Create grammar checker and corrector pipelines

In [5]:
from transformers import pipeline

grammar_checker_pipe = pipeline("text-classification", model=grammar_checker_model, tokenizer=grammar_checker_tokenizer)
grammar_corrector_pipe_fp32 = pipeline("text2text-generation", model=grammar_corrector_model_fp32, tokenizer=grammar_corrector_tokenizer)

## Quantization [$\Uparrow$](#Table-of-contents:)

[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers to the model graph and then using a subset of the training dataset to initialize the parameters of these additional layers. Quantized operations are executed in `INT8` instead of `FP32`/`FP16`, which makes model inference faster.
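
To make the idea of 8-bit quantization concrete, here is a minimal NumPy sketch of per-tensor asymmetric quantization and dequantization. It is only an illustration of the arithmetic: the helper names `quantize_int8` and `dequantize` are hypothetical, and the quantizers NNCF actually inserts may use different schemes and granularity.

```python
import numpy as np

def quantize_int8(x):
    """Map float values to int8 using one scale and zero point for the whole tensor.

    Assumes x is not constant, so the scale is non-zero.
    """
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0           # int8 has 256 levels, so 255 steps
    zero_point = round(-128 - x_min / scale)  # integer that x_min is mapped close to -128 with
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 3.0, num=9, dtype=np.float32)
q, scale, zero_point = quantize_int8(x)
x_hat = dequantize(q, scale, zero_point)
max_error = float(np.abs(x - x_hat).max())
print(f"scale={scale:.5f}, zero_point={zero_point}, max abs error={max_error:.5f}")
```

The reconstruction error stays within one quantization step, which is the price paid for storing and computing with 8-bit integers.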

The grammar checker model accounts for only a small portion of the whole text correction pipeline, so we optimize only the grammar corrector model. The grammar corrector itself consists of three models: the encoder, the first-call decoder, and the decoder with past. The last model accounts for about 90% of the inference time, so it is the only one we quantize.
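
One plausible reason why the decoder with past dominates is the number of times it runs. The sketch below counts calls per sub-model under the assumption that generation invokes a decoder once per generated token. The function `count_model_calls` is hypothetical, and the share of calls is not the same as the share of time, so this does not reproduce the 90% figure.

```python
def count_model_calls(num_generated_tokens):
    """Count calls per sub-model for one generated sequence of at least one token.

    Assumes the first token comes from the decoder without cache and every
    later token from the decoder with past.
    """
    return {
        "encoder": 1,
        "decoder": 1,
        "decoder_with_past": max(num_generated_tokens - 1, 0),
    }

calls = count_model_calls(num_generated_tokens=40)
share = calls["decoder_with_past"] / sum(calls.values())
print(calls, f"decoder_with_past share of calls: {share:.0%}")
```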

The optimization process contains the following steps:

1. Create a calibration dataset for quantization.
2. Run `nncf.quantize` to obtain the quantized model.
3. Serialize the `INT8` model using the `openvino.runtime.serialize` function.

### Prepare calibration dataset [$\Uparrow$](#Table-of-contents:)

To collect a calibration dataset for the decoder model, we need to capture the tensors passed to its inputs. To do this, we wrap one of the methods used during its inference. The wrapper intercepts the input tensors and appends them to a separate `calibration_data` list. After we run the grammar corrector on some text samples, this list contains the input data needed to quantize this model.

We use the first several samples from the validation split of the [jfleg](https://huggingface.co/datasets/jfleg) text correction dataset.
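
The interception pattern can be shown in isolation. The sketch below uses a hypothetical `FakeRequest` class in place of the real inference request and a hypothetical `InputRecorder` helper. It only illustrates the wrap-and-record idea, not the OpenVINO API.

```python
from contextlib import contextmanager

class FakeRequest:
    """Hypothetical stand-in for an inference request object."""
    def start_async(self, inputs, shared_memory=False):
        return sum(inputs.values())

class InputRecorder:
    """Records the inputs passed to a wrapped method while recording is enabled."""
    def __init__(self):
        self.enabled = False
        self.samples = []

    @contextmanager
    def recording(self):
        self.enabled = True
        try:
            yield
        finally:
            self.enabled = False  # always switch recording off again

    def attach(self, obj, method_name):
        original_fn = getattr(obj, method_name)

        def wrapper(*args, **kwargs):
            # Inputs may arrive as a keyword argument or as the first positional one
            inputs = kwargs["inputs"] if "inputs" in kwargs else args[0]
            if self.enabled:
                self.samples.append(inputs)
            return original_fn(*args, **kwargs)

        setattr(obj, method_name, wrapper)

request = FakeRequest()
recorder = InputRecorder()
recorder.attach(request, "start_async")

request.start_async({"a": 1})                       # outside the context: not recorded
with recorder.recording():
    result = request.start_async({"a": 2, "b": 3})  # recorded, result passes through
    request.start_async(inputs={"a": 4})            # keyword form is recorded too
print(len(recorder.samples), result)
```

Because the wrapper forwards every call unchanged, inference behaves as before and recording only happens inside the `with` block.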

In [6]:
import datasets
from contextlib import contextmanager
from tqdm.notebook import tqdm

COLLECT_CALIBRATION_DATA = False
calibration_data = []
ov_decoder = grammar_corrector_pipe_fp32.model.decoder_with_past

@contextmanager
def calibration_data_collection():
    global COLLECT_CALIBRATION_DATA
    try:
        COLLECT_CALIBRATION_DATA = True
        yield
    finally:
        COLLECT_CALIBRATION_DATA = False

def wrap_for_data_collection():
    original_fn = ov_decoder.request.start_async
    def wrapper(*args, **kwargs):
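        # Avoid kwargs.get("inputs", args[0]): the default is evaluated eagerly and fails when args is empty
        inputs = kwargs["inputs"] if "inputs" in kwargs else args[0]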
        if COLLECT_CALIBRATION_DATA:
            calibration_data.append(inputs)
        return original_fn(*args, **kwargs)
    ov_decoder.request.start_async = wrapper

In [7]:
CALIBRATION_DATASET_SIZE = 10

wrap_for_data_collection()

calibration_dataset = datasets.load_dataset("jfleg", split=f"validation[:{CALIBRATION_DATASET_SIZE}]")
with calibration_data_collection():
    for data_item in tqdm(calibration_dataset, total=CALIBRATION_DATASET_SIZE, desc="Collecting calibration data"):
        grammar_corrector_pipe_fp32(data_item["sentence"])

Collecting calibration data:   0%|          | 0/10 [00:00<?, ?it/s]



### Run quantization [$\Uparrow$](#Table-of-contents:)

In [8]:
import openvino.runtime as ov
import nncf
from nncf.quantization.range_estimator import RangeEstimatorParameters, StatisticsCollectorParameters, StatisticsType

quantized_model_path = Path("quantized_decoder_with_past") / "openvino_model.xml"

quantized_model = nncf.quantize(
    ov_decoder.model,
    calibration_dataset=nncf.Dataset(calibration_data),
    subset_size=len(calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        disable_bias_correction=True,  # Disable bias correction because the model does not contain quantizable operations with bias
        smooth_quant_alpha=0.95,  # The value of 0.95 was selected by grid search
        activations_range_estimator_params=RangeEstimatorParameters(
            # Quantile statistic is employed due to outliers in some activations; this parameter was found by quantize_with_accuracy_control method
            max=StatisticsCollectorParameters(StatisticsType.QUANTILE)
        )
    ),
)

quantized_model_path.parent.mkdir(parents=True, exist_ok=True)
ov.serialize(quantized_model, quantized_model_path)
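
The comment about the quantile statistic can be illustrated with a small NumPy sketch. It compares the step size of a symmetric 8-bit grid when the range is taken from the absolute maximum versus from a high quantile. The data is synthetic, the quantile value `0.9999` is an arbitrary assumption, and NNCF's internal statistics collection is more involved than this.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)
activations[0] = 100.0  # a single artificial outlier

def int8_step(upper_bound):
    """Step size of a symmetric 8-bit grid covering [-upper_bound, upper_bound]."""
    return upper_bound / 127.0

max_bound = float(np.abs(activations).max())
quantile_bound = float(np.quantile(np.abs(activations), 0.9999))

print(f"max-based range:      {max_bound:.2f}, step {int8_step(max_bound):.4f}")
print(f"quantile-based range: {quantile_bound:.2f}, step {int8_step(quantile_bound):.4f}")
```

With the maximum, one outlier stretches the range so that almost all values share only a few quantization levels. The quantile-based range clips the outlier and keeps a much finer step for the bulk of the values.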


## Run grammar correction with quantized OpenVINO model [$\Uparrow$](#Table-of-contents:)

Create a quantized grammar corrector pipeline and run correction with it.

In [9]:
grammar_corrector_model_int8 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_dir, device=device.value)
grammar_corrector_model_int8.decoder_with_past.model = quantized_model
grammar_corrector_model_int8.decoder_with_past.request = None
grammar_corrector_model_int8.decoder_with_past._compile()
grammar_corrector_pipe_int8 = pipeline("text2text-generation", model=grammar_corrector_model_int8, tokenizer=grammar_corrector_tokenizer)

Compiling the encoder...
Compiling the decoder...
Compiling the decoder...
Compiling the decoder...


In [10]:
from utils import correct_text

default_text = (
    "Most of the course is about semantic or  content of language but there are also interesting"
    " topics to be learned from the servicefeatures except statistics in characters in documents.At"
    " this point, He introduces herself as his native English speaker and goes on to say that if"
    " you contine to work on social scnce"
)

corrected_text_fp32 = correct_text(default_text, grammar_checker_pipe, grammar_corrector_pipe_fp32)
corrected_text_int8 = correct_text(default_text, grammar_checker_pipe, grammar_corrector_pipe_int8)

correcting text..:   0%|          | 0/1 [00:00<?, ?it/s]



correcting text..:   0%|          | 0/1 [00:00<?, ?it/s]

Let's see the results. The texts generated by the quantized `INT8` model and the original `FP32` model should be almost the same.

In [11]:
print(f"Input text:                   {default_text}\n")
print(f'Generated text by FP32 model: {corrected_text_fp32}\n')
print(f'Generated text by INT8 model: {corrected_text_int8}')

Input text:                   Most of the course is about semantic or  content of language but there are also interesting topics to be learned from the servicefeatures except statistics in characters in documents.At this point, He introduces herself as his native English speaker and goes on to say that if you contine to work on social scnce

Generated text by FP32 model: Most of the course is about the semantic content of language but there are also interesting topics to be learned from the service features except statistics in characters in documents. At this point, she introduces herself as a native English speaker and goes on to say that if you continue to work on social science, you will continue to be successful.

Generated text by INT8 model: Most of the course is about the semantic content of language but there are also interesting topics to be learned from the service features except statistics in characters in documents. At this point, she introduces himself as a native Englis
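
To spot where two generated texts diverge, a word-level diff can help. The sketch below uses `difflib` from the standard library on short fragments of the outputs above. The helper `word_diff` is hypothetical, and in the notebook one would pass `corrected_text_fp32` and `corrected_text_int8` instead.

```python
import difflib

def word_diff(reference, candidate):
    """Return (tag, reference_words, candidate_words) for each differing span."""
    ref_words, cand_words = reference.split(), candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], cand_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

fp32_fragment = "At this point, she introduces herself as a native"
int8_fragment = "At this point, she introduces himself as a native"
differences = word_diff(fp32_fragment, int8_fragment)
print(differences)
```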

## Compare model size, performance and accuracy [$\Uparrow$](#Table-of-contents:)

First, we compare the file sizes of the `FP32` and `INT8` models.

In [12]:
def calculate_compression_rate(model_path_ov, model_path_ov_int8):
    model_size_fp32 = model_path_ov.with_suffix(".bin").stat().st_size / 1024
    model_size_int8 = model_path_ov_int8.with_suffix(".bin").stat().st_size / 1024
    print(f"Model: {model_path_ov.stem}")
    print(f"    * FP32 IR model size: {model_size_fp32:.2f} KB")
    print(f"    * INT8 IR model size: {model_size_int8:.2f} KB")
    print(f"    * Model compression rate: {model_size_fp32 / model_size_int8:.3f}")

calculate_compression_rate(grammar_corrector_dir / "openvino_decoder_with_past_model.xml", quantized_model_path)

Model: openvino_decoder_with_past_model
    * FP32 IR model size: 1658150.16 KB
    * INT8 IR model size: 513467.73 KB
    * Model compression rate: 3.229


Second, we compare the two grammar correction pipelines in terms of performance and accuracy.

We again use the [jfleg](https://huggingface.co/datasets/jfleg) dataset, but this time the test split is selected. Each dataset sample consists of a text with errors as input and several corrected versions as labels.

When measuring accuracy we use mean `(1 - WER)` against corrected text versions, where WER is Word Error Rate metric.
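
As a reminder of what the metric measures, WER is the word-level edit distance between reference and prediction divided by the number of reference words. The pure-Python sketch below is an illustration only. The function `word_error_rate` is hypothetical, it splits on whitespace without the text normalization that `jiwer` applies, and it scores a single pair rather than the whole dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # At the start of step i, previous[j] is the edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

reference = "Do not use in the car ."
prediction = "Do not use the car ."
accuracy = (1 - word_error_rate(reference, prediction)) * 100
print(f"Accuracy: {accuracy:.2f}%")
```

Here one of the seven reference words is missing from the prediction, so WER is 1/7 and the accuracy is about 85.71%.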

In [13]:
import time
from jiwer import wer, wer_standardize

TEST_DATASET_SIZE = 50
test_dataset = datasets.load_dataset("jfleg", split=f"test[:{TEST_DATASET_SIZE}]")

def calculate_inference_time_and_accuracy(grammar_corrector_pipe):
    ground_truths = []
    predictions = []
    inference_time = []
    for data_item in tqdm(test_dataset, total=TEST_DATASET_SIZE, desc="Evaluation"):
        input_text = data_item["sentence"]  # e.g., "For not use car . "
        references = data_item["corrections"]  # e.g., [ "Not for use with a car . ", "Do not use in the car . ", "Car not for use . ", "Can not use the car . " ]

        start_time = time.perf_counter()
        corrected_text = correct_text(input_text, grammar_checker_pipe, grammar_corrector_pipe, disable_tqdm=True)
        end_time = time.perf_counter()
        delta_time = end_time - start_time

        ground_truths.extend(references)
        predictions.extend([corrected_text] * len(references))
        inference_time.append(delta_time)

    word_accuracy = (1 - wer(ground_truths, predictions, reference_transform=wer_standardize, hypothesis_transform=wer_standardize)) * 100
    sum_inference_time = sum(inference_time)
    return sum_inference_time, word_accuracy

inference_time_fp32, accuracy_fp32 = calculate_inference_time_and_accuracy(grammar_corrector_pipe_fp32)
print(f"Evaluation results of FP32 grammar correction pipeline. Accuracy: {accuracy_fp32:.2f}%. Time: {inference_time_fp32:.2f} sec.")
inference_time_int8, accuracy_int8 = calculate_inference_time_and_accuracy(grammar_corrector_pipe_int8)
print(f"Evaluation results of INT8 grammar correction pipeline. Accuracy: {accuracy_int8:.2f}%. Time: {inference_time_int8:.2f} sec.")
print(f"Performance speedup: {inference_time_fp32 / inference_time_int8:.3f}")
print(f"Accuracy drop :{accuracy_fp32 - accuracy_int8:.2f}%.")

Evaluation:   0%|          | 0/50 [00:00<?, ?it/s]



Evaluation results of FP32 grammar correction pipeline. Accuracy: 67.59%. Time: 53.54 sec.


Evaluation:   0%|          | 0/50 [00:00<?, ?it/s]

Evaluation results of INT8 grammar correction pipeline. Accuracy: 68.80%. Time: 39.50 sec.
Performance speedup: 1.356
Accuracy drop :-1.21%.


## Interactive demo

In [14]:
import gradio as gr


def correct(text, _=gr.Progress(track_tqdm=True)):
    return correct_text(text, grammar_checker_pipe, grammar_corrector_pipe_int8)


demo = gr.Interface(
    correct,
    gr.Textbox(label="Text"),
    gr.Textbox(label="Correction"),
    examples=[default_text],
    allow_flagging="never",
)
try:
    demo.queue().launch(debug=True, server_port=7860)
except Exception:
    demo.queue().launch(share=True, debug=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Keyboard interruption in main thread... closing server.
