## ⚠️ **DEPRECATED**

This notebook is deprecated and may no longer be maintained.
Please use it with caution or refer to updated resources.


# Quick Get Started Notebook of Intel® Neural Compressor for ONNXRuntime


This notebook is an easy-to-follow guide to getting started with the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) library for the [ONNXRuntime](https://github.com/microsoft/onnxruntime) framework.

In the following sections, we are going to use a DistilBert model fine-tuned on SST-2 as an example to show how to apply post-training quantization on [ONNX](https://github.com/onnx/onnx) models using the INC library.


The main objectives of this notebook are:

1. Prerequisite: Prepare the necessary environment, model, and dataset.
2. Quantization with INC: Walk through the step-by-step process of applying post-training quantization.
3. Benchmark with INC: Evaluate and compare the performance of the FP32 and INT8 models.


## 1. Prerequisite

### 1.1 Environment

If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [neural-compressor](https://github.com/intel/neural-compressor), [onnxruntime](https://github.com/microsoft/onnxruntime) and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```shell
conda create -n inc_notebook python==3.8
conda activate inc_notebook
pip install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook, and we can open this notebook in a browser to continue.

Then, let's install the necessary packages.

In [None]:
# install neural-compressor from source
import sys
!git clone https://github.com/intel/neural-compressor.git
%cd ./neural-compressor
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} setup.py install
%cd ..
# or install stable basic version from pypi
# pip install neural-compressor


In [None]:
# install required packages
!{sys.executable} -m pip install -r requirements.txt


### 1.2 Prepare model

Export the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model to ONNX with the [Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model) command-line interface.


In [1]:
!optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --task text-classification onnx-model/

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
  mask, torch.tensor(torch.finfo(scores.dtype).min)
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating models in subprocesses...
Validating ONNX model onnx-model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 2) matches (2, 2)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: onnx-model


### 1.3 Prepare dataset

The General Language Understanding Evaluation (GLUE) benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. This dataset is built from the SQuAD dataset.
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs2) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 1 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not. This dataset is built from the Winograd Schema Challenge dataset.

Here, we download the data for the SST-2 task.

In [2]:
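# The script writes to the literal directory ./GLUE_DIR, which `data_path` points to later.
# (A `!export` line has no effect here, because each `!` command runs in its own subshell.)
!wget https://raw.githubusercontent.com/Shimao-Zhang/Download_GLUE_Data/master/download_glue_data.py
!{sys.executable} download_glue_data.py --data_dir=GLUE_DIR --tasks=SST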


--2023-10-10 16:56:47--  https://raw.githubusercontent.com/Shimao-Zhang/Download_GLUE_Data/master/download_glue_data.py
Resolving proxy-prc.intel.com (proxy-prc.intel.com)... 10.240.252.16
Connecting to proxy-prc.intel.com (proxy-prc.intel.com)|10.240.252.16|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 7045 (6.9K) [text/plain]
Saving to: ‘download_glue_data.py’


2023-10-10 16:56:48 (4.21 MB/s) - ‘download_glue_data.py’ saved [7045/7045]

Downloading and extracting SST...
	Completed!


## 2. Quantization with Intel® Neural Compressor

Define the variables that will be used.

In [3]:
model_name_or_path = "distilbert-base-uncased-finetuned-sst-2-english"
fp32_model_path = "onnx-model/model.onnx"
int8_model_path = "onnx-model/int8-model.onnx"
data_path = "./GLUE_DIR/SST-2"
task = "sst-2"
batch_size = 8


### 2.1 Define dataset and dataloader

In this part, we define a GLUE dataset and register it as an INC dataloader.

Refer to doc [dataset.md](https://github.com/intel/neural-compressor/blob/master/docs/source/dataset.md#user-specific-dataset) and [dataloader.md](https://github.com/intel/neural-compressor/blob/master/docs/source/dataloader.md#build-custom-dataloader-with-python-apiapi) for how to build your own dataset and dataloader.
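
As a rough illustration of the shape such a dataset takes, the sketch below implements `__len__` and `__getitem__` and returns an `(inputs, label)` pair per index, mirroring the `GLUEDataset` defined in this section. `ToyDataset` and `iter_batches` are hypothetical names used only here, and `iter_batches` is a simplified stand-in for batching, not INC's `DataLoader`.

```python
class ToyDataset:
    """Hypothetical dataset: each item is ((input_ids, attention_mask), label)."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        input_ids, attention_mask, label = self.samples[index]
        return (input_ids, attention_mask), label


def iter_batches(dataset, batch_size):
    """Simplified batching (an assumption, not INC's DataLoader)."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        inputs = [item[0] for item in items]
        labels = [item[1] for item in items]
        yield inputs, labels


# Made-up token ids, for illustration only.
toy = ToyDataset([([101, 2023, 102], [1, 1, 1], 1),
                  ([101, 2919, 102], [1, 1, 1], 0),
                  ([101, 2204, 102], [1, 1, 1], 1)])
batches = list(iter_batches(toy, batch_size=2))
```

The last batch may be smaller than `batch_size`; here the three samples split into batches of two and one.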


In [4]:
import os
import onnx
import torch
import logging
import numpy as np
import transformers
from transformers.data import InputFeatures

logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.WARN)

class GLUEDataset:
    """Dataset used for GLUE."""
    def __init__(self, model, data_dir, model_name_or_path, max_seq_length=128,\
                do_lower_case=True, task='mrpc', model_type='bert', dynamic_length=False,\
                evaluate=True, transform=None, filter=None):
        self.inputs = [inp.name for inp in onnx.load(model).graph.input]
        task = task.lower()
        model_type = model_type.lower()
        assert task in ['mrpc', 'qqp', 'qnli', 'rte', 'sts-b', 'cola', 'mnli', 'wnli', 'sst-2'], 'Unsupported task type'
        assert model_type in ['distilbert', 'bert', 'mobilebert', 'roberta'], 'Unsupported model type'

        tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path, do_lower_case=do_lower_case)
        self.dataset = load_and_cache_examples(data_dir, model_name_or_path, \
            max_seq_length, task, model_type, tokenizer, evaluate)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        batch = tuple(t.detach().cpu().numpy() if not isinstance(t, np.ndarray) else t for t in self.dataset[index])
        return batch[:len(self.inputs)], batch[-1]

def load_and_cache_examples(data_dir, model_name_or_path, max_seq_length, task, model_type, tokenizer, evaluate):
    from torch.utils.data import TensorDataset

    processor = transformers.glue_processors[task]()
    output_mode = transformers.glue_output_modes[task]
    # Load data features from cache or dataset file
    if not os.path.exists("./dataset_cached"):
        os.makedirs("./dataset_cached")
    cached_features_file = os.path.join("./dataset_cached", 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, model_name_or_path.split('/'))).pop(),
        str(max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file):
        logger.info("Load features from cached file {}.".format(cached_features_file))
        features = torch.load(cached_features_file)
    else:
        logger.info("Create features from dataset file at {}.".format(data_dir))
        label_list = processor.get_labels()
        examples = processor.get_dev_examples(data_dir) if evaluate else \
            processor.get_train_examples(data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                task=task,
                                                label_list=label_list,
                                                max_length=max_seq_length,
                                                output_mode=output_mode,
        )
        logger.info("Save features into cached file {}.".format(cached_features_file))
        torch.save(features, cached_features_file)
    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    # all_seq_lengths = torch.tensor([f.seq_length for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

def convert_examples_to_features(examples, tokenizer, max_length=128, task=None, label_list=None, 
                                 output_mode="classification", pad_token=0, pad_token_segment_id=0, 
                                 mask_padding_with_zero=True,):
    processor = transformers.glue_processors[task]()
    if label_list is None:
        label_list = processor.get_labels()
        logger.info("Use label list {} for task {}.".format(label_list, task))
    label_map = {label: i for i, label in enumerate(label_list)}
    features = []
    for (ex_index, example) in enumerate(examples):
        inputs = tokenizer.encode_plus(
            example.text_a,
            example.text_b,
            add_special_tokens=True,
            max_length=max_length,
            return_token_type_ids=True,
            truncation=True,
        )
        input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        seq_length = len(input_ids)
        padding_length = max_length - len(input_ids)

        input_ids = input_ids + ([pad_token] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_length, \
            "Error with input_ids length {} vs {}".format(len(input_ids), max_length)
        assert len(attention_mask) == max_length, \
            "Error with attention_mask length {} vs {}".format(len(attention_mask), max_length)
        assert len(token_type_ids) == max_length, \
            "Error with token_type_ids length {} vs {}".format(len(token_type_ids), max_length)
        if output_mode == "classification":
            label = label_map[example.label]
        elif output_mode == "regression":
            label = float(example.label)
        else:
            raise KeyError(output_mode)

        feats = InputFeatures(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            label=label
        )
        features.append(feats)
    return features
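
The padding step inside `convert_examples_to_features` can be looked at in isolation. The sketch below is a simplified version for the default case (`mask_padding_with_zero=True`); `pad_to_max_length` is a hypothetical helper name and the token ids are made up.

```python
def pad_to_max_length(input_ids, max_length, pad_token=0):
    """Right-pad token ids and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("input is longer than max_length; truncate it first")
    padding_length = max_length - len(input_ids)
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(input_ids) + [0] * padding_length
    padded_ids = input_ids + [pad_token] * padding_length
    return padded_ids, attention_mask


ids, mask = pad_to_max_length([101, 2307, 3185, 102], max_length=8)
print(ids)   # [101, 2307, 3185, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding every example to the same `max_length` is what lets the features be stacked into fixed-shape tensors afterwards.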


INC provides a unified `DataLoader` API, which takes a dataset as its input parameter and loads data from the dataset when needed.

In [5]:
from neural_compressor.data import DataLoader

dataset = GLUEDataset(fp32_model_path,
                      data_dir=data_path,
                      model_name_or_path=model_name_or_path,
                      model_type="distilbert",
                      task=task)
dataloader = DataLoader(framework="onnxruntime", dataset=dataset, batch_size=batch_size)




### 2.2 Define metric and evaluate function

In this part, we define a GLUE metric and use it to build an evaluation function for INC.

Refer to doc [metric.md](https://github.com/intel/neural-compressor/blob/master/docs/source/metric.md#build-custom-metric-with-python-api) for how to build your own metric.

In [6]:
class GLUEMetric:
    """Computes GLUE score."""
    def __init__(self, task='mrpc'):
        assert task in ['mrpc', 'qqp', 'qnli', 'rte', 'sts-b', 'cola', 'mnli', 'wnli', 'sst-2'], 'Unsupported task type'
        self.pred_list = None
        self.label_list = None
        self.task = task
        self.return_key = {
            "cola": "mcc",
            "mrpc": "f1",
            "sts-b": "corr",
            "qqp": "acc",
            "mnli": "mnli/acc",
            "qnli": "acc",
            "rte": "acc",
            "wnli": "acc",
            "sst-2": "acc"
        }

    def update(self, preds, labels):
        """add preds and labels to storage"""
        if isinstance(preds, list) and len(preds) == 1:
            preds = preds[0]
        if isinstance(labels, list) and len(labels) == 1:
            labels = labels[0]
        if self.pred_list is None:
            self.pred_list = preds
            self.label_list = labels
        else:
            self.pred_list = np.append(self.pred_list, preds, axis=0)
            self.label_list = np.append(self.label_list, labels, axis=0)

    def reset(self):
        """clear preds and labels storage"""
        self.pred_list = None
        self.label_list = None

    def result(self):
        """calculate metric"""
        assert self.pred_list is not None, "Predict list in GLUE metric is None."
        assert self.label_list is not None, "Label list in GLUE metric is None."
        
        output_mode = transformers.glue_output_modes[self.task]

        if output_mode == "classification":
            processed_preds = np.argmax(self.pred_list, axis=1)
        elif output_mode == "regression":
            processed_preds = np.squeeze(self.pred_list)
        result = transformers.glue_compute_metrics(self.task, processed_preds, self.label_list)
        return result[self.return_key[self.task]]
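
For a classification task such as SST-2, where `return_key` selects `acc`, the `result()` step amounts to taking the argmax of each row of logits and comparing it with the label. The sketch below reproduces that idea in plain Python; `accuracy_from_logits` is a hypothetical helper and the logits are made up, so this is an illustration rather than the code path `transformers` actually runs.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits for four sentences with two classes each.
logits = [[-1.2, 2.3], [0.8, -0.5], [0.1, 0.4], [1.5, 0.2]]
labels = [1, 0, 0, 1]
acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.5 (two of the four predictions match)
```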


The evaluation function for INC takes a model as its parameter and returns a scalar accuracy value.

In [7]:
import onnxruntime as ort
from onnx import ModelProto

metric = GLUEMetric(task)

def eval_func(model: ModelProto):
    metric.reset()
    session = ort.InferenceSession(model.SerializeToString(), 
                                   providers=ort.get_available_providers())
    ort_inputs = {}
    len_inputs = len(session.get_inputs())
    inputs_names = [session.get_inputs()[i].name for i in range(len_inputs)]
    for idx, (inputs, labels) in enumerate(dataloader):
        if not isinstance(labels, list):
            labels = [labels]
        inputs = inputs[:len_inputs]
        for i in range(len_inputs):
            ort_inputs.update({inputs_names[i]: inputs[i]})
        predictions = session.run(None, ort_inputs)
        metric.update(predictions[0], labels)
    return metric.result()


### 2.3 Optimize the model

It is recommended to try the [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers) on FP32 ONNX models. It can help verify whether the model can be fully optimized, and it provides performance results.

In [8]:
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

model_type = 'bert'
num_heads = 12
hidden_size = 768

opt_options = FusionOptions(model_type)
opt_options.enable_embed_layer_norm = False

model_optimizer = optimizer.optimize_model(
    fp32_model_path,
    model_type,
    num_heads=num_heads,
    hidden_size=hidden_size,
    optimization_options=opt_options)
model = model_optimizer.model


input: "/distilbert/transformer/layer.0/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.0/attention/Div_output_0"
name: "/distilbert/transformer/layer.0/attention/Div"
op_type: "Div"

input: "/distilbert/transformer/layer.1/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.1/attention/Div_output_0"
name: "/distilbert/transformer/layer.1/attention/Div"
op_type: "Div"

input: "/distilbert/transformer/layer.2/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.2/attention/Div_output_0"
name: "/distilbert/transformer/layer.2/attention/Div"
op_type: "Div"

input: "/distilbert/transformer/layer.3/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.3/attention/Div_output_0"
name: "/distilbert/transformer/layer.3/attention/Div"
op_type: "Div"

input: "/distilbert/transformer/layer.4/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.4/attention/Div_output_0"
name: "/distilbert/transformer/laye

### 2.4 Run quantization

Now we can finally start to quantize the model.

To start, we set the configuration for post-training quantization using the `PostTrainingQuantConfig` class. Once the configuration is set, we call the `quantization.fit()` function, which performs the quantization process on the model and returns the best quantized model.
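
To give a feel for what static post-training quantization does with calibration data, here is a minimal sketch of affine INT8 quantization in plain Python. It is not INC's implementation: `calibrate`, `quantize`, and `dequantize` are hypothetical helpers, and the calibration values are made up.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive a scale and zero point from the observed value range."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the representable range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input; any scale works
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the INT8 range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q_values, scale, zero_point):
    """Map the integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


calibration_data = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = calibrate(calibration_data)
q = quantize(calibration_data, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(calibration_data, restored))
```

The round-trip error stays within one quantization step (`scale`), which is the kind of accuracy cost the tuning process in the next cell checks against the evaluation function.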

In [9]:
from neural_compressor import quantization, PostTrainingQuantConfig

config = PostTrainingQuantConfig(approach='static')
q_model = quantization.fit(model, 
                           config,
                           eval_func=eval_func,
                           calib_dataloader=dataloader)
q_model.save(int8_model_path)


2023-10-10 17:07:33 [INFO] Start auto tuning.
2023-10-10 17:07:33 [INFO] Execute the tuning process due to detect the evaluation function.
2023-10-10 17:07:33 [INFO] Adaptor has 5 recipes.
2023-10-10 17:07:33 [INFO] 0 recipes specified by user.
2023-10-10 17:07:33 [INFO] 3 recipes require future tuning.
2023-10-10 17:07:33 [INFO] *** Initialize auto tuning
2023-10-10 17:07:33 [INFO] {
2023-10-10 17:07:33 [INFO]     'PostTrainingQuantConfig': {
2023-10-10 17:07:33 [INFO]         'AccuracyCriterion': {
2023-10-10 17:07:33 [INFO]             'criterion': 'relative',
2023-10-10 17:07:33 [INFO]             'higher_is_better': True,
2023-10-10 17:07:33 [INFO]             'tolerable_loss': 0.01,
2023-10-10 17:07:33 [INFO]             'absolute': None,
2023-10-10 17:07:33 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x7fd9e7e53ee0>>,
2023-10-10 17:07:33 [INFO]             'relative': 0.01
2023-10-10 17:07:33 [INFO]    
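
The log above shows the default accuracy criterion: `relative`, with `tolerable_loss` of 0.01 and `higher_is_better` set to `True`. One plausible reading of that check is sketched below; `meets_relative_criterion` is a hypothetical helper, not INC's code, and the accuracy values are invented for illustration.

```python
def meets_relative_criterion(baseline, candidate,
                             tolerable_loss=0.01, higher_is_better=True):
    """Return True if the candidate metric is within a relative loss budget."""
    if higher_is_better:
        return candidate >= baseline * (1 - tolerable_loss)
    return candidate <= baseline * (1 + tolerable_loss)


baseline_acc = 0.9100  # hypothetical FP32 accuracy, not taken from the logs
print(meets_relative_criterion(baseline_acc, 0.9050))  # True
print(meets_relative_criterion(baseline_acc, 0.8900))  # False
```

With a 1% relative budget, a quantized model has to keep at least 99% of the baseline metric to be accepted.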

## 3. Benchmark with Intel® Neural Compressor

INC provides a benchmark feature to measure the model performance with the objective settings.
We now have two models under the `onnx-model` directory: the original FP32 model `model.onnx` and the quantized INT8 model `int8-model.onnx`. Next, we compare their performance.

To avoid conflicts between the Jupyter Notebook kernel and the benchmark process, we create a `benchmark.py` script and run it directly.

In [10]:
# FP32 benchmark
!python benchmark.py --input_model ./onnx-model/model.onnx 2>&1|tee fp32_benchmark.log

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
You are using a model of type distilbert to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-10-10 17:08:36 [INFO] Start to run Benchmark.
2023-10-10 17:08:36 [INFO] num of instance: 1
2023-10-10 17:08:36 [INFO] cores per instance: 4
2023-10-10 17:08:37 [INFO] Running command is
OMP_NUM_THREADS=4 numactl --localalloc --physcpubind=0,1,2,3 /home/yuwenzho/miniconda3/envs/example/bin/python benchmark.py --input_model ./onnx-model/model.onnx 2>&1|tee 1_4_0.log & \
wait
You are using a model of type distilbert to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-10-10 17:08:41 [INFO] Start 

In [11]:
# INT8 benchmark
!python benchmark.py --input_model ./onnx-model/int8-model.onnx 2>&1|tee int8_benchmark.log

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
You are using a model of type distilbert to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-10-10 17:08:51 [INFO] Start to run Benchmark.
2023-10-10 17:08:51 [INFO] num of instance: 1
2023-10-10 17:08:51 [INFO] cores per instance: 4
2023-10-10 17:08:51 [INFO] Running command is
OMP_NUM_THREADS=4 numactl --localalloc --physcpubind=0,1,2,3 /home/yuwenzho/miniconda3/envs/example/bin/python benchmark.py --input_model ./onnx-model/int8-model.onnx 2>&1|tee 1_4_0.log & \
wait
You are using a model of type distilbert to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-10-10 17:08:56 [INFO] S

As shown in the logs, the performance gain of the INT8 model over the FP32 model is about 98.000/40.652 = 2.41x.
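
The ratio can be reproduced with a couple of lines of Python. The two numbers are the ones quoted in the sentence above; the variable names are illustrative.

```python
fp32_result = 40.652  # FP32 figure quoted above
int8_result = 98.000  # INT8 figure quoted above

speedup = int8_result / fp32_result
print(f"{speedup:.2f}x")  # 2.41x
```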