# Post-Training Quantization of Grammar Error Correction model with NNCF

This tutorial demonstrates how to speed up the grammar correction model by applying 8-bit post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) (Neural Network Compression Framework) and running the quantized model with OpenVINO™ Toolkit. The optimization process consists of the following steps:

1. Quantize the converted OpenVINO model from [214-grammar-correction-convert notebook](214-grammar-correction-convert.ipynb) with NNCF.
2. Check the output of the quantized model on a sample text.
3. Compare the model size, performance and accuracy of the FP32 and quantized INT8 models.

> **NOTE**: Run the [214-grammar-correction-convert](214-grammar-correction-convert.ipynb) notebook first to generate the OpenVINO IR model that is used for quantization.

Table of content:
- [Prerequisites](#Prerequisites-$\Uparrow$)
- [Quantize Grammar Corrector](#Quantize-Grammar-Corrector-$\Uparrow$)
    - [Collect calibration dataset](#Collect-calibration-dataset)
    - [Quantize](#Quantize)
- [Compare model size and performance](#Compare-model-size-and-performance)
- [Interactive demo](#Interactive-demo)


## Prerequisites [$\Uparrow$](#Table-of-content:)

First, install the prerequisites and load the models in the same way as in the [214-grammar-correction-convert](214-grammar-correction-convert.ipynb) notebook.

In [15]:
# %pip install -q "nncf>=2.6.0"
# %pip install -q datasets
# %pip install evaluate jiwer

Select inference device

In [16]:
import ipywidgets as widgets
from openvino.runtime import Core

core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')

Load Grammar Checker


In [17]:
from pathlib import Path
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForSequenceClassification

grammar_checker_model_id = "textattack/roberta-base-CoLA"
grammar_checker_dir = Path("roberta-base-cola")
grammar_checker_tokenizer = AutoTokenizer.from_pretrained(grammar_checker_model_id)

if grammar_checker_dir.exists():
    grammar_checker_model = OVModelForSequenceClassification.from_pretrained(grammar_checker_dir, device=device.value)
else:
    grammar_checker_model = OVModelForSequenceClassification.from_pretrained(grammar_checker_model_id, export=True, device=device.value)
    grammar_checker_model.save_pretrained(grammar_checker_dir)

Compiling the model...
Set CACHE_DIR to roberta-base-cola/model_cache


Load Grammar Corrector

In [18]:
from optimum.intel.openvino import OVModelForSeq2SeqLM

grammar_corrector_model_id = "pszemraj/flan-t5-large-grammar-synthesis"
grammar_corrector_dir = Path("flan-t5-large-grammar-synthesis")
grammar_corrector_tokenizer = AutoTokenizer.from_pretrained(grammar_corrector_model_id)

if grammar_corrector_dir.exists():
    grammar_corrector_model_fp32 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_dir, device=device.value)
else:
    grammar_corrector_model_fp32 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_model_id, export=True, device=device.value)
    grammar_corrector_model_fp32.save_pretrained(grammar_corrector_dir)

Compiling the encoder...
Compiling the decoder...
Compiling the decoder...


Create grammar checker and corrector pipelines

In [19]:
from transformers import pipeline

grammar_checker_pipe = pipeline("text-classification", model=grammar_checker_model, tokenizer=grammar_checker_tokenizer)
grammar_corrector_pipe_fp32 = pipeline("text2text-generation", model=grammar_corrector_model_fp32, tokenizer=grammar_corrector_tokenizer)

## Quantize Grammar Corrector [$\Uparrow$](#Table-of-content:)

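Only the `decoder_with_past` part of the grammar corrector is quantized in this notebook. The process has two steps. First, the inputs passed to this decoder are recorded while the FP32 pipeline corrects a few sentences from the `jfleg` dataset. Second, `nncf.quantize` uses the recorded inputs as calibration data to produce an INT8 model, which is then saved to disk.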

### Collect calibration dataset

In [20]:
import datasets
from tqdm.notebook import tqdm

CALIBRATION_DATASET_SIZE = 10
calibration_data = []
ov_decoder = grammar_corrector_pipe_fp32.model.decoder_with_past

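def wrap_for_data_collection():
    # Patch the decoder's infer request so that every call also records its inputs.
    original_fn = ov_decoder.request.start_async
    def wrapper(*args, **kwargs):
        # Inputs may arrive by keyword or by position; do not touch args[0] when args is empty.
        inputs = kwargs["inputs"] if "inputs" in kwargs else args[0]
        calibration_data.append(inputs)
        return original_fn(*args, **kwargs)
    ov_decoder.request.start_async = wrapper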

wrap_for_data_collection()

calibration_dataset = datasets.load_dataset("jfleg", split=f"validation[:{CALIBRATION_DATASET_SIZE}]")

for data_item in tqdm(calibration_dataset, total=CALIBRATION_DATASET_SIZE, desc="Collecting calibration data"):
    grammar_corrector_pipe_fp32(data_item["sentence"])

Collecting calibration data:   0%|          | 0/10 [00:00<?, ?it/s]
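
The collection step above relies on temporarily replacing a method with a wrapper that records its inputs. Below is a minimal, library-free sketch of that pattern. `FakeRequest`, `record_calls` and the `restore` helper are hypothetical names introduced only for illustration; they are not part of OpenVINO or Optimum, and the real infer request object behaves differently.

```python
class FakeRequest:
    """Hypothetical stand-in for an inference request object."""

    def start_async(self, inputs, shared_memory=False):
        return f"inferred {len(inputs)} tensors"


def record_calls(obj, method_name, storage):
    """Patch obj.<method_name> so each call stores its inputs in storage.

    Returns a callable that puts the original method back.
    """
    original_fn = getattr(obj, method_name)

    def wrapper(*args, **kwargs):
        # Accept inputs passed either by keyword or as the first positional argument.
        inputs = kwargs["inputs"] if "inputs" in kwargs else args[0]
        storage.append(inputs)
        return original_fn(*args, **kwargs)

    setattr(obj, method_name, wrapper)

    def restore():
        setattr(obj, method_name, original_fn)

    return restore


collected = []
request = FakeRequest()
restore = record_calls(request, "start_async", collected)

request.start_async({"input_ids": [1, 2]})
request.start_async(inputs={"input_ids": [3]}, shared_memory=True)

restore()
request.start_async({"input_ids": [4]})  # not recorded: the wrapper was removed

print(len(collected))
```

Restoring the original method after collection is a design choice of this sketch. The notebook cell leaves the wrapper in place, so later FP32 inference keeps appending to `calibration_data`.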



### Quantize

In [21]:
import openvino.runtime as ov
import nncf
from nncf.quantization.range_estimator import RangeEstimatorParameters, StatisticsCollectorParameters, StatisticsType, AggregatorType

quantized_model_path = Path("quantized_decoder_with_past") / "openvino_model.xml"

quantized_model = nncf.quantize(
    ov_decoder.model,
    calibration_dataset=nncf.Dataset(calibration_data),
    subset_size=len(calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        smooth_quant_alpha=0.95,
        activations_range_estimator_params=RangeEstimatorParameters(
            max=StatisticsCollectorParameters(statistics_type=StatisticsType.QUANTILE, aggregator_type=AggregatorType.MEAN)
        )
    ),
)

quantized_model_path.parent.mkdir(parents=True, exist_ok=True)
ov.serialize(quantized_model, quantized_model_path)

Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 279/279 [00:17<00:00, 15.73it/s]
Applying Smooth Quant: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 145/145 [00:05<00:00, 24.97it/s]
Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 279/279 [00:57<00:00,  4.85it/s]
Applying Fast Bias correction: 0it [00:00, ?it/s]
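
To give an intuition for what 8-bit quantization does to a tensor, here is a simplified sketch of symmetric per-tensor quantization in plain Python. It is an illustration under stated assumptions, not NNCF's algorithm: NNCF estimates ranges from the calibration data, applies SmoothQuant and bias correction, and chooses its own quantization scheme per layer. The helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is bounded by half of the quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as one small integer plus a shared scale, which is where the size reduction comes from; the price is the rounding error measured by `max_error`.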


Create quantized grammar corrector pipeline

In [30]:
# Load a second copy of the grammar corrector, then swap in the quantized decoder.
grammar_corrector_model_int8 = OVModelForSeq2SeqLM.from_pretrained(grammar_corrector_dir, device=device.value)
grammar_corrector_model_int8.decoder_with_past.model = quantized_model
# Drop the existing compiled request and recompile so that the quantized model is used.
grammar_corrector_model_int8.decoder_with_past.request = None
grammar_corrector_model_int8.decoder_with_past._compile()
grammar_corrector_pipe_int8 = pipeline("text2text-generation", model=grammar_corrector_model_int8, tokenizer=grammar_corrector_tokenizer)

Compiling the encoder...
Compiling the decoder...
Compiling the decoder...
Compiling the decoder...


Let us see it in action.

In [23]:
from utils import correct_text

default_text = (
    "Most of the course is about semantic or  content of language but there are also interesting"
    " topics to be learned from the servicefeatures except statistics in characters in documents.At"
    " this point, He introduces herself as his native English speaker and goes on to say that if"
    " you contine to work on social scnce"
)

corrected_text = correct_text(default_text, grammar_checker_pipe, grammar_corrector_pipe_int8)

correcting text..:   0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
print(f"input text:     {default_text}\n") 
print(f'generated text: {corrected_text}')

input text:     Most of the course is about semantic or  content of language but there are also interesting topics to be learned from the servicefeatures except statistics in characters in documents.At this point, He introduces herself as his native English speaker and goes on to say that if you contine to work on social scnce

generated text: Most of the course is about the semantic content of language but there are also interesting topics to be learned from the service features except statistics in characters in documents. At this point, she introduces herself as a native English speaker and goes on to say that if you continue to work on social science, you will continue to be successful.


## Compare model size and performance

In [25]:
def calculate_compression_rate(model_path_ov, model_path_ov_int8):
    model_size_fp32 = model_path_ov.with_suffix(".bin").stat().st_size / 1024
    model_size_int8 = model_path_ov_int8.with_suffix(".bin").stat().st_size / 1024
    print(f"Model: {model_path_ov.stem}")
    print(f"    * FP32 IR model size: {model_size_fp32:.2f} KB")
    print(f"    * INT8 IR model size: {model_size_int8:.2f} KB")
    print(f"    * Model compression rate: {model_size_fp32 / model_size_int8:.3f}")

calculate_compression_rate(grammar_corrector_dir / "openvino_decoder_with_past_model.xml", quantized_model_path)

Model: openvino_decoder_with_past_model
    * FP32 IR model size: 1658150.16 KB
    * INT8 IR model size: 513467.73 KB
    * Model compression rate: 3.229
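
The reported numbers can be checked with a few lines of arithmetic. The sketch below reuses the two sizes printed above and also expresses them in GB and as a fraction of disk space saved; the variable names are illustrative only.

```python
# Sizes of the .bin files in KB, copied from the output above.
fp32_size_kb = 1658150.16
int8_size_kb = 513467.73

compression_rate = fp32_size_kb / int8_size_kb
fp32_size_gb = fp32_size_kb / 1024 ** 2
int8_size_gb = int8_size_kb / 1024 ** 2
saved_fraction = 1 - int8_size_kb / fp32_size_kb

print(f"FP32: {fp32_size_gb:.2f} GB, INT8: {int8_size_gb:.2f} GB")
print(f"Compression rate: {compression_rate:.3f}")
print(f"Disk space saved: {saved_fraction:.1%}")
```

Storing every 32-bit value in 8 bits would give a rate of at most 4. A rate of about 3.2 is consistent with part of the model staying in higher precision, although the notebook output does not break this down.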


In [31]:
import time
import numpy as np
from jiwer import wer, wer_standardize

TEST_DATASET_SIZE = 50
test_dataset = datasets.load_dataset("jfleg", split=f"test[:{TEST_DATASET_SIZE}]")

def calculate_inference_time_and_accuracy(grammar_corrector_pipe):
    ground_truths = []
    predictions = []
    inference_time = []
    for data_item in tqdm(test_dataset, total=TEST_DATASET_SIZE, desc="Evaluation"):
        input_text = data_item["sentence"]
        reference = data_item["corrections"][0]

        start_time = time.perf_counter()
        corrected_text = correct_text(input_text, grammar_checker_pipe, grammar_corrector_pipe, disable_tqdm=True)
        end_time = time.perf_counter()
        delta_time = end_time - start_time

        ground_truths.append(reference)
        predictions.append(corrected_text)
        inference_time.append(delta_time)

    word_accuracy = (1 - wer(ground_truths, predictions, reference_transform=wer_standardize, hypothesis_transform=wer_standardize)) * 100
    mean_inference_time = np.mean(inference_time)
    return mean_inference_time, word_accuracy

inference_time_fp32, accuracy_fp32 = calculate_inference_time_and_accuracy(grammar_corrector_pipe_fp32)
inference_time_int8, accuracy_int8 = calculate_inference_time_and_accuracy(grammar_corrector_pipe_int8)
print(f"Grammar correction performance speedup: {inference_time_fp32 / inference_time_int8:.3f}")
print(f"Grammar correction word accuracy. FP32: {accuracy_fp32:.2f}%. INT8: {accuracy_int8:.2f}%. Accuracy drop :{accuracy_fp32 - accuracy_int8:.2f}%.")

Evaluation:   0%|          | 0/50 [00:00<?, ?it/s]



Evaluation:   0%|          | 0/50 [00:00<?, ?it/s]

Grammar correction performance speedup: 1.217
Grammar correction word accuracy. FP32: 68.55%. INT8: 68.82%. Accuracy drop :-0.27%.
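
The word accuracy above is computed as `(1 - WER) * 100`, where WER is the word error rate returned by `jiwer`. The following sketch shows the idea behind WER with a word-level edit distance in plain Python. It is a simplified illustration: the normalization here is only lowercasing and whitespace splitting, which is an assumption and not a reproduction of `wer_standardize`, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(references, hypotheses):
    """Total word-level edit distance divided by the total number of reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()

        # Levenshtein distance over words, keeping only the previous row.
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution_cost = 0 if ref_word == hyp_word else 1
                current.append(min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + substitution_cost,   # match or substitution
                ))
            previous = current

        total_errors += previous[-1]
        total_words += len(ref_words)
    return total_errors / total_words


references = ["she is a native speaker", "the cat sat"]
hypotheses = ["she is native speaker", "the cat sat"]

error_rate = word_error_rate(references, hypotheses)
word_accuracy = (1 - error_rate) * 100
print(f"WER: {error_rate:.3f}, word accuracy: {word_accuracy:.1f}%")
```

In this toy example one word out of eight reference words is missing, so the accuracy is 87.5%. In the notebook results the accuracy drop is negative, which means the INT8 pipeline scored slightly higher than FP32 on these 50 test sentences.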


## Interactive demo

In [27]:
# import gradio as gr
#
#
# def correct(text, _=gr.Progress(track_tqdm=True)):
#     return correct_text(text, grammar_checker_pipe, grammar_corrector_pipe_int8)
#
#
# demo = gr.Interface(
#     correct,
#     gr.Textbox(label="Text"),
#     gr.Textbox(label="Correction"),
#     examples=[default_text],
#     allow_flagging="never",
# )
# try:
#     demo.queue().launch(debug=True, server_port=7860)
# except Exception:
#     demo.queue().launch(share=True, debug=True)
# # if you are launching remotely, specify server_name and server_port
# # demo.launch(server_name='your server name', server_port='server port in int')
# # Read more in the docs: https://gradio.app/docs/