# Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors
This tutorial demonstrates how to improve the performance of sparse Transformer models with [OpenVINO](https://docs.openvino.ai/) on 4th Gen Intel® Xeon® Scalable processors. It takes a pre-trained model from the [HuggingFace Transformers](https://huggingface.co/transformers/) library, converts it to the OpenVINO™ IR format, and runs inference on the CPU with a dedicated runtime option that enables sparsity optimizations. It also shows how to gain additional performance by combining sparsity with 8-bit quantization. To simplify the workflow, the [HuggingFace Optimum](https://huggingface.co/docs/optimum) library is used to convert the model to the OpenVINO™ IR format and to quantize it. The tutorial consists of the following steps:

- Download and convert the sparse BERT model.
- Compare sparse vs. dense inference performance.
- Quantize the model.
- Compare sparse 8-bit vs. dense 8-bit inference performance.


## Imports

In [1]:
import time
from functools import partial
from pathlib import Path

import numpy as np

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

import openvino.runtime as ov

from optimum.intel.openvino import OVModelForSequenceClassification
from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig


INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
INFO:nncf:Compiling and loading extensions for quantization...
INFO:nncf:Compiling and loading extensions for binarization...


## Prepare the Model

In [2]:
model_id = "neuralmagic/oBERT-12-downstream-pruned-unstructured-90-mnli"
sparse_path = Path("bert_90_sparse")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(sparse_path)

Read the converted model with the OpenVINO Python API:

In [3]:
core = ov.Core()
ov_model = core.read_model(sparse_path / "openvino_model.xml")

## Prepare model inputs

In [4]:
text = "This is a great restaurant. This is a great restaurant. This is a great restaurant. This is a great restaurant. This is a great restaurant"
inputs = tokenizer(text, return_tensors="np")
ov_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "token_type_ids": inputs["token_type_ids"],
}

## Benchmark dense inference performance

In [5]:
dense_compiled = core.compile_model(ov_model, "CPU")

attempts = 1000
dense_counters = []
for _ in range(attempts):
    m_start = time.perf_counter()  # monotonic, high-resolution timer
    output = dense_compiled(ov_inputs)
    dense_counters.append(time.perf_counter() - m_start)

dense_median = np.median(np.array(dense_counters))

print(f"Dense model median elapsed time: {dense_median}")


Dense model median elapsed time: 0.009994864463806152
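
The timing loop above is repeated for every model variant in this tutorial. As a minimal sketch, it could be factored into a helper that takes any callable. The name `median_latency` and the `warmup` parameter are assumptions introduced here for illustration; they are not part of the original notebook or of OpenVINO.

```python
import statistics
import time
from typing import Callable


def median_latency(run: Callable[[], object], attempts: int = 1000, warmup: int = 10) -> float:
    """Return the median wall-clock time, in seconds, of calling run()."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Untimed warm-up calls so one-off setup costs do not skew the result.
    for _ in range(warmup):
        run()
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs on its own. In this tutorial the
# callable would be something like: lambda: dense_compiled(ov_inputs)
latency = median_latency(lambda: sum(range(1000)), attempts=50, warmup=5)
print(f"Median elapsed time: {latency:.6f} s")
```

The median is used, as in the cells of this tutorial, because it is less sensitive to occasional slow iterations than the mean.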


## Benchmark sparse inference performance

In [6]:
config = {"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.8}
sparse_compiled = core.compile_model(ov_model, "CPU", config)

sparse_counters = []
for _ in range(attempts):
    m_start = time.perf_counter()  # monotonic, high-resolution timer
    output = sparse_compiled(ov_inputs)
    sparse_counters.append(time.perf_counter() - m_start)

sparse_median = np.median(np.array(sparse_counters))

print(f"Sparse model median elapsed time: {sparse_median}")

Sparse model median elapsed time: 0.009945034980773926
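
The value `0.8` passed as `CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE` acts, as far as the OpenVINO CPU plugin documentation describes it, as a sparsity threshold: only layers whose weights are sparser than the threshold are candidates for the sparse execution path. The sketch below illustrates what a sparsity rate means for a single weight matrix. It uses plain Python lists, and `sparsity_rate` and `is_eligible` are hypothetical helper names, not OpenVINO API; the plugin's actual eligibility check may involve further conditions.

```python
from typing import Sequence


def sparsity_rate(weights: Sequence[Sequence[float]]) -> float:
    """Return the fraction of exactly-zero entries in a 2-D weight matrix."""
    total = sum(len(row) for row in weights)
    if total == 0:
        raise ValueError("weight matrix is empty")
    zeros = sum(1 for row in weights for w in row if w == 0)
    return zeros / total


def is_eligible(weights: Sequence[Sequence[float]], threshold: float = 0.8) -> bool:
    """Hypothetical check: is the matrix sparser than the configured threshold?"""
    return sparsity_rate(weights) > threshold


# Toy matrix with 9 zeros out of 10 entries.
example = [
    [0.0, 0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
rate = sparsity_rate(example)
print(f"sparsity rate: {rate}, eligible at 0.8: {is_eligible(example)}")
```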


## Quantize model with HuggingFace Optimum API

In [8]:
quantized_sparse_dir = Path("bert_90_sparse_quantized")

torch_model = AutoModelForSequenceClassification.from_pretrained(model_id)

def preprocess_function(examples, tokenizer):
    return tokenizer(
        examples["premise"], examples["hypothesis"], padding="max_length", max_length=128, truncation=True
    )

quantization_config = OVConfig()
quantizer = OVQuantizer.from_pretrained(torch_model, feature="sequence-classification")

dataset = load_dataset("glue", "mnli")
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="mnli",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config, calibration_dataset=calibration_dataset, save_directory=quantized_sparse_dir
)



100%|██████████| 5/5 [00:00<00:00, 446.13it/s]
100%|██████████| 1/1 [00:00<00:00, 51.54ba/s]
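
To give an intuition for why sparsity and 8-bit quantization can be combined, the following is a simplified sketch of symmetric per-tensor int8 quantization in plain Python. It is not the algorithm that `OVQuantizer` or NNCF applies, which is calibration-based and more elaborate; `quantize_int8` and `dequantize` are hypothetical helpers. The point it illustrates is that, assuming a symmetric scheme, zero weights map to zero, so the sparsity pattern is preserved.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero tensor to avoid division by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.0, 0.25, 0.0, -1.0, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

Every zero in `weights` remains exactly zero in both `q` and `restored`, while non-zero values are reproduced to within half a quantization step.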


## Benchmark quantized dense inference performance

In [9]:
q_ov_model = core.read_model(quantized_sparse_dir / "openvino_model.xml")
q_dense_compiled = core.compile_model(q_ov_model, "CPU")

attempts = 1000
q_dense_counters = []
for _ in range(attempts):
    m_start = time.perf_counter()  # monotonic, high-resolution timer
    output = q_dense_compiled(ov_inputs)
    q_dense_counters.append(time.perf_counter() - m_start)

q_dense_median = np.median(np.array(q_dense_counters))

print(f"Dense quantized model median elapsed time: {q_dense_median}")

Dense quantized model median elapsed time: 0.006369829177856445


## Benchmark quantized sparse inference performance

In [10]:
q_sparse_compiled = core.compile_model(q_ov_model, "CPU", config)

q_sparse_counters = []
for _ in range(attempts):
    m_start = time.perf_counter()  # monotonic, high-resolution timer
    output = q_sparse_compiled(ov_inputs)
    q_sparse_counters.append(time.perf_counter() - m_start)

q_sparse_median = np.median(np.array(q_sparse_counters))

print(f"Sparse quantized model median elapsed time: {q_sparse_median}")

Sparse quantized model median elapsed time: 0.0063517093658447266
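
To compare the four configurations side by side, the medians printed above can be expressed in milliseconds and as speedups relative to the dense FP32 baseline. The sketch below hard-codes the values from the recorded run in this notebook; the dictionary keys are labels chosen here for illustration, and results on other hardware or software versions will differ.

```python
# Median latencies in seconds, copied from the cell outputs in this tutorial.
medians = {
    "dense": 0.009994864463806152,
    "sparse": 0.009945034980773926,
    "dense_int8": 0.006369829177856445,
    "sparse_int8": 0.0063517093658447266,
}

baseline = medians["dense"]
# Speedup > 1.0 means the configuration is faster than the dense baseline.
speedups = {name: baseline / value for name, value in medians.items()}

for name, value in medians.items():
    print(f"{name:12s} {value * 1000:6.2f} ms  {speedups[name]:.3f}x vs dense")
```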
