# Quantize NLP models with Post-Training Optimization Tool in OpenVINO™
This tutorial demonstrates how to improve the performance of sparse NLP models with [OpenVINO](https://docs.openvino.ai/) on 4th Gen Intel® Xeon® Scalable processors. It takes a pre-trained model from the [HuggingFace Transformers](https://huggingface.co/transformers/) library, converts it to the OpenVINO™ IR format, and runs inference on the CPU with a dedicated runtime option that enables sparsity optimizations. It also shows how to gain additional performance by combining sparsity with 8-bit quantization. To simplify the workflow, the [HuggingFace Optimum](https://huggingface.co/docs/optimum) library is used both to convert the model to the OpenVINO™ IR format and to quantize it. The tutorial consists of the following steps:

- Download and convert the sparse BERT model.
- Compare sparse vs. dense inference performance.
- Quantize the model.
- Compare sparse 8-bit vs. dense 8-bit inference performance.


## Imports

In [12]:
import time
from functools import partial

import numpy as np

from datasets import load_dataset

from transformers import AutoModelForSequenceClassification, AutoTokenizer

import openvino.runtime as ov

from optimum.intel.openvino import OVModelForSequenceClassification
from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig

## Prepare the Model

In [2]:
model_id = "neuralmagic/oBERT-12-downstream-pruned-unstructured-90-mnli"
local_path = "./bert_90_sparse"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(local_path)

Instantiate a model using OpenVINO Python API:

In [3]:
core = ov.Core()

ov_model = core.read_model(local_path + "/openvino_model.xml")

## Prepare model inputs

In [4]:
text = "This is a great restaurant. This is a great restaurant. This is a great restaurant. This is a great restaurant. This is a great restaurant"
inputs = tokenizer(text, return_tensors="np")
ov_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "token_type_ids": inputs["token_type_ids"],
}

## Benchmark dense inference performance

In [5]:
dense_compiled = core.compile_model(ov_model, "CPU")

# Time each inference call separately and report the median latency in seconds
attempts = 1000
dense_counters = []
for i in range(attempts):
    m_start = time.time()
    output = dense_compiled(ov_inputs)
    dense_counters.append(time.time() - m_start)

dense_median = np.median(np.array(dense_counters))

print(f"Dense model median elapsed time: {dense_median}")


Dense model median elapsed time: 0.009908080101013184
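
The same timing loop is repeated for the sparse and quantized models, so it can be factored into a helper. The following is a minimal sketch rather than the notebook's own code: the names `benchmark` and `warmup` are hypothetical, `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic high-resolution clock, and a few untimed warm-up calls are added on the assumption that the first calls are not representative. A stand-in callable is used so that the sketch runs without OpenVINO installed.

```python
import statistics
import time


def benchmark(infer, inputs, attempts=1000, warmup=10):
    """Return the median wall-clock latency in seconds of infer(inputs)."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        infer(inputs)

    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        infer(inputs)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)


# Stand-in workload (not a real model): it only records that it was called.
calls = []
median_s = benchmark(lambda x: calls.append(x), {"dummy": 1}, attempts=50, warmup=5)
print(f"Median latency: {median_s:.6f} s over {len(calls) - 5} timed calls")
```

With the objects defined in this notebook, the call would look like `benchmark(dense_compiled, ov_inputs)`, since a compiled model can be invoked directly with the inputs dictionary.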


## Benchmark sparse inference performance

In [9]:
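# Request sparse weights decompression for layers whose weight sparsity
# is above the 0.8 threshold.
# Alternative: set the property on a Core object instead of passing a config:
# core2 = ov.Core()
# core2.set_property("CPU", ov.properties.intel_cpu.sparse_weights_decompression_rate(0.8))
config = {"SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.8}
sparse_compiled = core.compile_model(ov_model, "CPU", config)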

sparse_counters = []
for i in range(attempts):
    m_start = time.time()
    output = sparse_compiled(ov_inputs)
    sparse_counters.append(time.time() - m_start)

sparse_median = np.median(np.array(sparse_counters))

print(f"Sparse model median elapsed time: {sparse_median}")

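RuntimeError: [ NOT_FOUND ] Unsupported property SPARSE_WEIGHTS_DECOMPRESSION_RATE by CPU plugin

The run recorded here failed because the installed CPU plugin did not recognize the property key, so no sparse timing was collected. The key name appears to vary between OpenVINO releases (some document it as `CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE`); check the CPU device documentation for the installed version.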
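
The value `0.8` passed as the decompression rate acts as a sparsity threshold: according to the OpenVINO CPU device documentation, only layers whose weights are sparser than this rate are routed through the sparse path. The sketch below illustrates that selection rule on toy matrices. The helper names `sparsity_rate` and `qualifies_for_sparse_path` are hypothetical and are not part of OpenVINO, and the real plugin applies further conditions that are not modeled here.

```python
def sparsity_rate(weights):
    """Return the fraction of zero entries in a 2-D weight matrix (list of rows)."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for value in row if value == 0.0)
    return zeros / total


def qualifies_for_sparse_path(weights, threshold):
    # Assumption: a layer qualifies only when its sparsity rate is strictly
    # above the threshold.
    return sparsity_rate(weights) > threshold


# Toy matrices for illustration only; real BERT layers are far larger.
pruned = [[0.0, 0.0, 0.7, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]  # 9 of 10 entries are zero
dense = [[0.1, 0.2], [0.0, 0.4]]  # 1 of 4 entries is zero

for name, matrix in (("pruned", pruned), ("dense", dense)):
    rate = sparsity_rate(matrix)
    print(name, rate, qualifies_for_sparse_path(matrix, threshold=0.8))
```

The model name used in this tutorial suggests roughly 90% unstructured sparsity, which would place its pruned layers above a 0.8 threshold.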

## Quantize model with HuggingFace Optimum API

In [14]:
save_dir = "./bert_90_sparse_quantized"

torch_model = AutoModelForSequenceClassification.from_pretrained(model_id)

def preprocess_function(examples, tokenizer):
    return tokenizer(
        examples["premise"], examples["hypothesis"], padding="max_length", max_length=128, truncation=True
    )

# Load the default quantization configuration
quantization_config = OVConfig()
# Instantiate the OVQuantizer for the PyTorch model
quantizer = OVQuantizer.from_pretrained(torch_model, feature="sequence-classification")

# Load GLUE MNLI once so it is cached locally; get_calibration_dataset below loads it again by name
dataset = load_dataset("glue", "mnli")

# Create the calibration dataset used to perform static quantization
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="mnli",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config, calibration_dataset=calibration_dataset, save_directory=save_dir
)

Found cached dataset glue (/home/alex/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/5 [00:00<?, ?it/s]

Found cached dataset glue (/home/alex/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Loading cached shuffled indices for dataset at /home/alex/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-2bba8406484faf80.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

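
Static 8-bit quantization maps each floating-point value onto one of 256 evenly spaced levels inside a calibrated range, which is why a calibration dataset is needed. The sketch below shows that arithmetic for a single value. It is a simplified illustration that assumes identical input and output ranges; the function name `fake_quantize` and the range `[-1.0, 1.0]` are chosen for the example and are not taken from the quantizer's actual configuration.

```python
def fake_quantize(x, low, high, levels=256):
    """Snap x onto one of `levels` evenly spaced values in [low, high]."""
    # Values outside the calibrated range are clamped to its ends.
    if x <= low:
        return low
    if x > high:
        return high
    step = (high - low) / (levels - 1)
    return round((x - low) / step) * step + low


low, high = -1.0, 1.0  # illustrative range, not a calibrated one
step = (high - low) / 255

for value in (-5.0, 0.3, 5.0):
    print(value, "->", fake_quantize(value, low, high))

# Every input in the range collapses onto at most 256 distinct outputs.
grid = {fake_quantize(i / 1000, low, high) for i in range(-1000, 1001)}
print("distinct quantized values:", len(grid))
```

The rounding error for in-range values is bounded by half a step, which is the accuracy cost traded for the smaller 8-bit representation.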


## Benchmark quantized dense inference performance

In [None]:
q_ov_model = core.read_model(save_dir + "/openvino_model.xml")
q_dense_compiled = core.compile_model(q_ov_model, "CPU")

attempts = 1000
q_dense_counters = []
for i in range(attempts):
    m_start = time.time()
    output = q_dense_compiled(ov_inputs)
    q_dense_counters.append(time.time() - m_start)

q_dense_median = np.median(np.array(q_dense_counters))

print(f"Quantized dense model median elapsed time: {q_dense_median}")
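
Once two configurations have been timed, their medians can be compared as a speedup ratio. The sketch below shows one way to do that; the function name `speedup` is hypothetical, and the latency lists are made-up placeholder values for illustration, not measurements from this tutorial.

```python
import statistics


def speedup(baseline_latencies, candidate_latencies):
    """Return how many times faster the candidate is, based on median latency."""
    baseline = statistics.median(baseline_latencies)
    candidate = statistics.median(candidate_latencies)
    return baseline / candidate


# Placeholder latencies in seconds; NOT measured results.
baseline_times = [0.010, 0.012, 0.011]
candidate_times = [0.0050, 0.0060, 0.0055]

ratio = speedup(baseline_times, candidate_times)
print(f"Speedup: {ratio:.2f}x")
```

In this notebook the inputs would be the timing lists collected above, for example `dense_counters` and `q_dense_counters`.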