# Accelerate inference of sparse Transformer models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable processors
This tutorial demonstrates how to improve the performance of sparse Transformer models with [OpenVINO](https://docs.openvino.ai/) on 4th Gen Intel® Xeon® Scalable processors. It takes a pre-trained model from the [HuggingFace Transformers](https://huggingface.co/transformers/) library, converts it to the OpenVINO™ IR format, and runs inference on the CPU with a dedicated runtime option that enables sparsity optimizations. It also shows how to gain further performance by combining sparsity with 8-bit quantization. To simplify the workflow, the [HuggingFace Optimum](https://huggingface.co/docs/optimum) library is used to convert the model to the OpenVINO™ IR format and quantize it. The tutorial consists of the following steps:

- Download a sparse BERT model from the public HuggingFace Hub and quantize it using HuggingFace Optimum for OpenVINO.
- Compare sparse 8-bit vs. dense 8-bit inference performance.


## Imports

In [None]:
import time
from functools import partial
from pathlib import Path

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig

## Quantize model with HuggingFace Optimum API

In [None]:
model_id = "neuralmagic/oBERT-12-downstream-pruned-unstructured-90-mnli"
quantized_sparse_dir = Path("bert_90_sparse_quantized")

torch_model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_function(examples, tokenizer):
    return tokenizer(
        examples["premise"], examples["hypothesis"], padding="max_length", max_length=128, truncation=True
    )

quantization_config = OVConfig()
quantizer = OVQuantizer.from_pretrained(torch_model, feature="sequence-classification")

dataset = load_dataset("glue", "mnli")
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="mnli",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config, calibration_dataset=calibration_dataset, save_directory=quantized_sparse_dir
)

## Benchmark quantized dense inference performance

Benchmark dense inference performance using parallel execution on four CPU cores. The sequence length is 32, which is representative of the average sequence length in popular benchmark datasets.

In [None]:
# Dump benchmarking config for dense inference
with open("perf_config.json", "w") as outfile:
    outfile.write(
"""
{
    "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
}
""")

In [None]:
!benchmark_app -m bert_90_sparse_quantized/openvino_model.xml -shape "input_ids[1,32],attention_mask[1,32],token_type_ids[1,32]" -load_config perf_config.json
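The config file written above is plain JSON, so it could also be generated from a dictionary instead of a hand-written string, which makes syntax errors less likely when properties are added. The following is a minimal sketch; `build_perf_config` and `write_perf_config` are hypothetical helper names, not part of OpenVINO.

```python
import json
from pathlib import Path


def build_perf_config(num_streams, num_threads, **extra_cpu_properties):
    """Return a config dict for the CPU device in the layout used above.

    Hypothetical helper for illustration; not an OpenVINO API.
    """
    cpu_properties = {
        "NUM_STREAMS": num_streams,
        "INFERENCE_NUM_THREADS": num_threads,
    }
    # Any additional CPU properties are passed through unchanged.
    cpu_properties.update(extra_cpu_properties)
    return {"CPU": cpu_properties}


def write_perf_config(path, config):
    """Serialize the config dict to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=4))
    return path


dense_config = build_perf_config(num_streams=4, num_threads=4)
print(json.dumps(dense_config, indent=4))
```

Calling `write_perf_config("perf_config.json", dense_config)` would then produce a file with the same content as the one written by the cell above.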

## Benchmark quantized sparse inference performance

In [None]:
# Dump benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" sets the minimum sparsity rate that weights must have
# to be considered for sparse optimization at runtime.
with open("perf_config_sparse.json", "w") as outfile:
    outfile.write(
"""
{
    "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.8}
}
""")

In [None]:
!benchmark_app -m bert_90_sparse_quantized/openvino_model.xml -shape "input_ids[1,32],attention_mask[1,32],token_type_ids[1,32]" -load_config perf_config_sparse.json
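To compare the dense and sparse runs numerically, the throughput figures can be extracted from the two `benchmark_app` reports. The sketch below assumes each report contains a line of the form `Throughput: <number> FPS`; check this against the output of your OpenVINO version. `parse_throughput` and `speedup` are hypothetical helper names.

```python
import re

# Assumed report line format: "Throughput: <number> FPS"
_THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*FPS")


def parse_throughput(report):
    """Return the first throughput value (in FPS) found in a text report."""
    match = _THROUGHPUT_RE.search(report)
    if match is None:
        raise ValueError("no 'Throughput: <number> FPS' line found")
    return float(match.group(1))


def speedup(dense_report, sparse_report):
    """Ratio of sparse throughput to dense throughput."""
    return parse_throughput(sparse_report) / parse_throughput(dense_report)
```

In a notebook, the output of a shell command can be captured with `report = !benchmark_app ...`, which yields a list of lines; joining them with `"\n".join(report)` gives a string suitable for `parse_throughput`.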

## When this might be helpful

This feature can improve inference performance for models with sparse weights when the model is deployed to handle multiple requests in parallel.
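Whether weights count as sparse enough is decided by comparing their sparsity rate with the minimum rate configured earlier (0.8 in this tutorial). The sketch below only illustrates that notion on a toy matrix: it assumes sparsity rate means the fraction of exactly-zero entries and that the comparison is inclusive. The criterion actually applied inside the CPU plugin is not shown in this tutorial, and `sparsity_rate` and `meets_threshold` are hypothetical helper names.

```python
def sparsity_rate(weights):
    """Fraction of exactly-zero entries in a 2D list of weights."""
    total = sum(len(row) for row in weights)
    if total == 0:
        raise ValueError("weights must contain at least one element")
    zeros = sum(1 for row in weights for value in row if value == 0)
    return zeros / total


def meets_threshold(weights, min_rate=0.8):
    """Assumption: weights qualify when their sparsity rate is >= min_rate."""
    return sparsity_rate(weights) >= min_rate


# Toy matrix made up for illustration: 10 of its 12 entries are zero.
toy_weights = [
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, -0.1, 0.0, 0.0],
]
print(sparsity_rate(toy_weights), meets_threshold(toy_weights))
```

The model used here is pruned to 90% unstructured sparsity according to its name, so a threshold of 0.8 leaves some margin for layers that are less sparse than the overall figure.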