# Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors
This tutorial demonstrates how to improve performance of sparse Transformer models with [OpenVINO](https://docs.openvino.ai/) on 4th Gen Intel® Xeon® Scalable processors. 

The tutorial downloads a BERT-base model that has been quantized, sparsified, and fine-tuned on the SST-2 dataset using [Optimum-Intel](https://github.com/huggingface/optimum-intel). It demonstrates the inference performance advantage on 4th Gen Intel® Xeon® Scalable processors by running the model with [Sparse Weight Decompression](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_CPU.html#sparse-weights-decompression), a runtime option that exploits model sparsity for efficiency. The notebook consists of the following steps:

- Install prerequisites.
- Download a quantized sparse BERT model from a public source using the OpenVINO integration with Hugging Face Optimum.
- Compare sparse 8-bit vs. dense 8-bit inference performance.


## Prerequisites

In [1]:
!pip install optimum[openvino] openvino-dev datasets


## Imports

In [2]:
import shutil
from pathlib import Path

from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
from huggingface_hub import hf_hub_download



### Download a quantized sparse model using the Hugging Face Optimum API

The first step is to download a quantized sparse Transformer model that has already been converted to OpenVINO IR. The model is then run through a text-classification pipeline as a simple check that the downloaded model works.

In [3]:
# The following model was quantized and sparsified using Optimum-Intel 1.7, which is enabled by OpenVINO and NNCF.
# For reproducibility, refer to https://huggingface.co/OpenVINO/bert-base-uncased-sst2-int8-unstructured80
model_id = "OpenVINO/bert-base-uncased-sst2-int8-unstructured80"

# The following two steps set up the model and tokenizer and download their files to the HF cache folder
ov_model = OVModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Let's take the model for a spin!
sentiment_classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)

text = "He's a dreadful magician."
outputs = sentiment_classifier(text)

print(outputs)


[{'label': 'negative', 'score': 0.9981877207756042}]


For benchmarking, we will use OpenVINO's benchmark application (`benchmark_app`). First, copy the IR files into a single folder.

In [4]:
# create a folder
quantized_sparse_dir = Path("bert_80pc_sparse_quantized_ir")
quantized_sparse_dir.mkdir(parents=True, exist_ok=True)

# hf_hub_download returns the path to the specified file in the cache folder, downloading it if not found
ov_ir_xml_path = hf_hub_download(repo_id=model_id, filename="openvino_model.xml")
ov_ir_bin_path = hf_hub_download(repo_id=model_id, filename="openvino_model.bin")

# copy IRs to our folder
shutil.copy(ov_ir_xml_path, quantized_sparse_dir)
shutil.copy(ov_ir_bin_path, quantized_sparse_dir)

'bert_80pc_sparse_quantized_ir/openvino_model.bin'

## Benchmark quantized dense inference performance
Benchmark dense inference performance using parallel execution on four CPU cores to simulate a small cloud instance. The sequence length is set to 16, which is common in many use cases such as conversational AI.

In [5]:
# Dump benchmarking config for dense inference
with open(f"{quantized_sparse_dir}/perf_config.json", "w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
        }
        """
    )
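
The same configuration can also be produced with the `json` module instead of a hand-written string, which avoids syntax slips when keys are added or changed. The following is a minimal sketch; the helper name `write_perf_config` and its parameters are hypothetical and not part of OpenVINO. The optional `sparse_rate` argument adds the sparse weight decompression key that this notebook uses later.

```python
import json
from pathlib import Path


def write_perf_config(path, num_streams=4, num_threads=4, sparse_rate=None):
    """Write a benchmark_app-style CPU config file and return its path.

    Hypothetical helper: it only serializes the same keys that the
    notebook writes by hand.
    """
    cpu = {"NUM_STREAMS": num_streams, "INFERENCE_NUM_THREADS": num_threads}
    if sparse_rate is not None:
        # Only added for the sparse run.
        cpu["CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE"] = sparse_rate

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"CPU": cpu}, indent=4))
    return path


if __name__ == "__main__":
    out = write_perf_config("bert_80pc_sparse_quantized_ir/perf_config.json")
    print(out.read_text())
```

Building the dictionary in Python keeps the dense and sparse configs in sync, since the two differ by a single key.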

In [6]:
!(benchmark_app \
  -m bert_80pc_sparse_quantized_ir/openvino_model.xml \
  -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" \
  -load_config bert_80pc_sparse_quantized_ir/perf_config.json)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ] 
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 130.04 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] /

## Benchmark quantized sparse inference performance

To enable the sparse weight decompression feature, add it to the runtime config as shown below. `CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE` accepts a value between 0.5 and 1.0. It is a layer-level sparsity threshold: the feature is enabled only for layers whose weight sparsity reaches this threshold.

In [7]:
# Dump benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" sets the minimum sparsity rate that weights must have
# to be considered for sparse optimization at runtime.
with open(f"{quantized_sparse_dir}/perf_config_sparse.json", "w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.75}
        }
        """
    )
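
To make the threshold concrete, the sketch below computes the sparsity rate of a weight matrix (the fraction of zero entries) and compares it with the configured rate. This is only an illustration in plain Python, not the CPU plugin's implementation. The function names are hypothetical, and treating the boundary as inclusive (`>=`) is an assumption.

```python
def sparsity_rate(weights):
    """Return the fraction of zero entries in a 2-D weight matrix."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0) / len(flat)


def qualifies_for_sparse_path(weights, threshold=0.75):
    """Check whether a layer's sparsity reaches the configured threshold.

    Assumption: the comparison is inclusive. The runtime's exact
    boundary behavior is not stated in this notebook.
    """
    return sparsity_rate(weights) >= threshold


dense_layer = [[1, -2, 3, 4], [5, 6, -7, 8]]   # no zeros
sparse_layer = [[0, 0, 3, 0], [0, 0, 0, 8]]    # 6 of 8 entries are zero

print(sparsity_rate(dense_layer), qualifies_for_sparse_path(dense_layer))    # 0.0 False
print(sparsity_rate(sparse_layer), qualifies_for_sparse_path(sparse_layer))  # 0.75 True
```

With a threshold of 0.75, a model sparsified to roughly 80% overall can still contain individual layers below the threshold, and those layers would run through the regular dense path.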

In [8]:
!(benchmark_app \
  -m bert_80pc_sparse_quantized_ir/openvino_model.xml \
  -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" \
  -load_config bert_80pc_sparse_quantized_ir/perf_config_sparse.json)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ] 
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 142.44 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] /
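
The logs shown in this notebook are cut off before the final results. To compare the two runs, one option is to extract the throughput figure from each complete log and compute the ratio. The sketch below assumes that `benchmark_app` ends its report with a line of the form `Throughput: <number> FPS`; check this against the output of your OpenVINO version. The helper names are hypothetical, and the sample numbers are invented purely to exercise the code, not measured results.

```python
import re

# Assumed format of the final report line, e.g. "[ INFO ] Throughput:   123.45 FPS".
THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*FPS")


def parse_throughput(log_text):
    """Return the throughput in FPS found in a benchmark log."""
    match = THROUGHPUT_RE.search(log_text)
    if match is None:
        raise ValueError("no throughput line found in log")
    return float(match.group(1))


def speedup(dense_log, sparse_log):
    """Ratio of sparse throughput to dense throughput."""
    return parse_throughput(sparse_log) / parse_throughput(dense_log)


# Placeholder logs with made-up numbers (NOT real measurements).
dense_log = "[ INFO ] Throughput:   100.00 FPS"
sparse_log = "[ INFO ] Throughput:   150.00 FPS"
print(f"{speedup(dense_log, sparse_log):.2f}x")
```
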

## When this might be helpful

This feature can improve inference performance for models with sparse weights when the model is deployed to handle multiple requests in parallel asynchronously. It is especially helpful for short sequence lengths, e.g. 32 and lower.
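
As a rough illustration of that deployment scenario, the sketch below submits several requests in parallel through OpenVINO's `AsyncInferQueue`, with the sparse weight decompression key set at compile time. It assumes the `openvino.runtime` Python API from the 2022.3 release used in this notebook. The function name, the dummy all-ones inputs, and the choice of four jobs are illustrative assumptions, not a tuned setup.

```python
def run_async_batch(model_xml, num_requests=8, seq_len=16, sparse_rate=0.75):
    """Run dummy requests asynchronously and return the outputs keyed by job id.

    Hypothetical helper. Imports are local so that the function can be
    defined even where OpenVINO is not installed.
    """
    import numpy as np
    from openvino.runtime import AsyncInferQueue, Core, PartialShape

    input_names = ("input_ids", "attention_mask", "token_type_ids")

    core = Core()
    model = core.read_model(model_xml)
    # Fix the dynamic [?,?] inputs to a static [1, seq_len] shape.
    model.reshape({name: PartialShape([1, seq_len]) for name in input_names})

    # Same keys as the JSON config in this notebook, passed as strings.
    config = {
        "NUM_STREAMS": "4",
        "INFERENCE_NUM_THREADS": "4",
        "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": str(sparse_rate),
    }
    compiled = core.compile_model(model, "CPU", config)

    results = {}

    def on_done(request, job_id):
        # Copy the output because the request's tensor is reused by later jobs.
        results[job_id] = request.get_output_tensor(0).data.copy()

    queue = AsyncInferQueue(compiled, 4)  # four parallel jobs, one per stream
    queue.set_callback(on_done)

    for job_id in range(num_requests):
        # Dummy token ids. A real application would pass tokenizer output here.
        inputs = {name: np.ones((1, seq_len), dtype=np.int64) for name in input_names}
        queue.start_async(inputs, userdata=job_id)

    queue.wait_all()
    return results
```

A call such as `run_async_batch("bert_80pc_sparse_quantized_ir/openvino_model.xml")` would return one logits array per submitted request.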

For more details about asynchronous inference with OpenVINO please refer to the following documentation:
- [Deployment Optimization Guide](https://docs.openvino.ai/latest/openvino_docs_deployment_optimization_guide_common.html#doxid-openvino-docs-deployment-optimization-guide-common-1async-api)
- [Inference Request API](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Infer_request.html#doxid-openvino-docs-o-v-u-g-infer-request-1in-out-tensors)