### Quantization with Huggingface Optimum (and ONNX Runtime)

Scaling and productionizing Transformers with millions of parameters is difficult 😔.

To address this, Hugging Face recently released a tool called **Optimum** 💥 (https://huggingface.co/blog/hardware-partners-program), which aims to speed up the inference time of Transformers. It lets ML practitioners use the features of the available hardware to quantize their models.


Quantization is the process of approximating a model's floating-point parameters (and possibly its activations) with low-bit-width numbers, such as 8-bit integers. As a result, the deep learning model becomes smaller and takes fewer resources to run 👶.
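To make the idea concrete, here is a minimal sketch of one common scheme, affine (asymmetric) quantization to signed 8-bit integers, written in plain Python. The function names and the toy weights are illustrative assumptions; the libraries used later in this notebook may differ in details such as per-channel scales or symmetric ranges.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine (asymmetric) quantization of floats to signed 8-bit integers."""
    qmin, qmax = -128, 127
    # The represented range is widened to include 0 so that 0.0 maps exactly.
    lo, hi = min(values + [0.0]), max(values + [0.0])
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_int8(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map the 8-bit integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # toy values, not real model weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a small reconstruction error that is bounded by the scale.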


This notebook demonstrates some experiments on quantizing **pre-trained** Hugging Face models for *sentiment analysis* and *summarization*. It also compares the performance of Optimum with Lpot (Intel Neural Compressor) quantization, ONNX/ONNX Runtime quantization, and the baseline model. The results are summarized in tables after Section 2 and Section 3.

We recommend running this notebook on Google Cloud AI Platform with an N2-standard-4 machine, since this machine type supports modern optimization frameworks. On Colab, the speedups will probably be less impressive.

## 1. Setup

In [1]:
!pip install -q transformers datasets
!pip install optimum[intel]

Collecting optimum[intel]
  Downloading optimum-0.1.3-py3-none-any.whl (41 kB)
Collecting coloredlogs
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)

In [None]:
!pip install -q onnxruntime onnxruntime-gpu onnxruntime-tools onnx psutil

In [None]:
# import unittest
import time
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = ""
import numpy as np

from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    default_data_collator,
    TrainingArguments,
    pipeline
)
from datasets import load_dataset, load_metric, list_metrics
import optimum.intel.neural_compressor
from optimum.intel.neural_compressor.quantization import IncQuantizer, IncQuantizerForSequenceClassification


In [None]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
    "summarization": ("article", None)
}

## 2. Quantization - Sentiment analysis 🎭

We first quantize a model for a sentiment analysis task. The cell below configures the pretrained model, its assigned task (sentiment analysis), and the validation dataset, which is used later during quantization.

In [None]:
model_name = "textattack/bert-base-uncased-SST-2"
# config_path = "content/"
task = "sst2"
padding = "max_length"
max_seq_length = 128
max_eval_samples = 200
metric_name = "eval_accuracy"
dataset = load_dataset("glue", task, split="validation")
metric = load_metric("glue", task)
data_collator = default_data_collator
sentence1_key, sentence2_key = task_to_keys[task]


Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...



Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



### Neural compressor quantization

Configure a Neural Compressor-based quantizer for sequence classification, which suits the task at hand (SST-2). As a side note, Intel® Neural Compressor (previously called Lpot, the Intel® Low Precision Optimization Tool) is a tool that supports automatic accuracy-driven tuning strategies to help users quickly find the best quantized model. Interested readers can find more at https://github.com/intel/neural-compressor.
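The idea of accuracy-driven tuning can be sketched in a few lines of plain Python. This is only an illustration and not Neural Compressor's actual algorithm: the recipe names, their accuracies, and the 3% tolerance are made up, and `evaluate` plays the role of the evaluation function that is attached to the quantizer later in this notebook.

```python
from typing import Callable, Iterable, Optional, Tuple


def accuracy_driven_search(
    baseline_accuracy: float,
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    max_relative_drop: float = 0.03,  # hypothetical tolerance
) -> Optional[Tuple[str, float]]:
    """Return the first candidate whose accuracy stays within the allowed drop."""
    threshold = baseline_accuracy * (1.0 - max_relative_drop)
    for candidate in candidates:
        accuracy = evaluate(candidate)
        if accuracy >= threshold:
            return candidate, accuracy
    return None  # no candidate met the accuracy criterion


# Made-up accuracies for three hypothetical quantization recipes.
fake_results = {"recipe_a": 0.80, "recipe_b": 0.91, "recipe_c": 0.915}
best = accuracy_driven_search(0.915, fake_results, lambda name: fake_results[name])
print(best)
```

The search stops at the first recipe that is accurate enough, which is why a fast and reliable evaluation function matters for tuning time.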

Downloading the config yaml file:

In [None]:
!wget https://raw.githubusercontent.com/ml6team/quick-tips/main/nlp/2021_10_12_huggingface_optimum/quantization.yml

--2022-02-14 10:02:20--  https://raw.githubusercontent.com/ml6team/quick-tips/main/nlp/2021_10_12_huggingface_optimum/quantization.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1595 (1.6K) [text/plain]
Saving to: ‘quantization.yml’


2022-02-14 10:02:20 (24.8 MB/s) - ‘quantization.yml’ saved [1595/1595]


Defining the quantizer:

In [None]:
quantizer = IncQuantizerForSequenceClassification.from_config(
    model_name_or_path = "textattack/bert-base-uncased-SST-2",
    inc_config="quantization.yml"
)


The pretrained model can be used directly for sentiment analysis predictions. Before turning to prediction, however, let us quantize it. The cells below set up the components needed to quantize this pretrained model.

In [None]:
tokenizer = quantizer.tokenizer
model = quantizer.model

For the pre-processing of the evaluation data:

In [None]:
def preprocess_function(examples):
  args = (
    (examples[sentence1_key],) if sentence2_key is None else (
    examples[sentence1_key], examples[sentence2_key])
  )
  result = tokenizer(*args, padding=padding, max_length=max_seq_length, truncation=True)
  return result

eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.select(range(max_eval_samples))


The evaluation data is now stored in `eval_dataset`. The cells below define the metric to compute. Note that for the SST-2 task, the `glue` dataset consists of movie reviews with human annotations of their sentiment (Stanford Sentiment Treebank), and accuracy is the evaluation criterion.

In [None]:
def compute_metrics(p: EvalPrediction):
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = metric.compute(predictions=preds, references=p.label_ids)
  if len(result) > 1:
    result["combined_score"] = np.mean(list(result.values())).item()
  print(result)
  return result

We are now ready to instantiate a `Trainer` object that can be used to evaluate model accuracy during the tuning phase of quantization.

In [None]:
trainer = Trainer(
            model=model,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

In [None]:
def take_eval_steps(model, trainer, metric_name):
  trainer.model = model
  metrics = trainer.evaluate()
  return metrics.get(metric_name)

def eval_func(model):
  return take_eval_steps(model, trainer, metric_name)

In [None]:
quantizer.eval_func = eval_func
q_model = quantizer.fit_dynamic()

2022-02-14 10:45:08 [INFO] Pass query framework capability elapsed time: 3.32 ms
2022-02-14 10:45:08 [INFO] Get FP32 model baseline.
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


2022-02-14 10:46:33 [INFO] Save tuning history to /content/nc_workspace/2022-02-14_10-00-09/./history.snapshot.
2022-02-14 10:46:33 [INFO] FP32 baseline is: [Accuracy: 0.9150, Duration (seconds): 85.7602]


{'accuracy': 0.915}


2022-02-14 10:46:35 [INFO] |*****Mixed Precision Statistics*****|
2022-02-14 10:46:35 [INFO] +--------------+-----------+---------+
2022-02-14 10:46:35 [INFO] |   Op Type    |   Total   |   INT8  |
2022-02-14 10:46:35 [INFO] +--------------+-----------+---------+
2022-02-14 10:46:35 [INFO] |    Linear    |     74    |    74   |
2022-02-14 10:46:35 [INFO] +--------------+-----------+---------+
2022-02-14 10:46:35 [INFO] Pass quantize model elapsed time: 2108.28 ms
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8
2022-02-14 10:47:40 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.9100|0.9150, Duration (seconds) (int8|fp32): 64.7209|85.7602], Best tune result is: [Accuracy: 0.9100, Duration (seconds): 64.7209]
2022-02-14 10:47:40 [INFO] Save tuning history to /content/nc_workspace/2022-02-14_10-00-09/./histo

{'accuracy': 0.91}


A quantized model is found! Let's see how long it takes to run inference on the validation set.

In [None]:
start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'accuracy': 0.91}
Quantized model obtained with eval_accuracy of 0.91, time elapsed: 70.50518870353699
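A single wall-clock measurement like the one above can be noisy. As an optional sketch, and not what this notebook does, the helper below adds warm-up runs and repeats using only the standard library. The name `benchmark`, its defaults, and the stand-in workload are assumptions for illustration.

```python
import time
from statistics import mean
from typing import Any, Callable, List, Tuple


def benchmark(fn: Callable[[], Any], warmup: int = 1, repeats: int = 3) -> Tuple[Any, List[float]]:
    """Call fn several times; return its last result and the per-run timings in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        timings.append(time.perf_counter() - start)
    return result, timings


# Stand-in workload; in the notebook this could wrap the evaluation call instead.
result, timings = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"result={result}, mean={mean(timings):.4f}s, min={min(timings):.4f}s")
```

Reporting both the mean and the minimum makes it easier to spot runs disturbed by other processes on the machine.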


### ONNX quantization

Another approach to quantization uses ONNX/ONNX Runtime (https://huggingface.co/transformers/serialization.html). We first export the pretrained model to ONNX format, then optimize and quantize it. The quantized model then performs inference on `eval_dataset`.

In [None]:
!rm -rf onnx/ 
from pathlib import Path
from transformers import AutoTokenizer
from transformers.convert_graph_to_onnx import convert

# Exporting the model to ONNX
convert(pipeline_name="sentiment-analysis",
        framework="pt",
        model="textattack/bert-base-uncased-SST-2",
        tokenizer="textattack/bert-base-uncased-SST-2",
        output=Path("onnx/bert-base-uncased-SST-2.onnx"),
        opset=11)

loading configuration file https://huggingface.co/textattack/bert-base-uncased-SST-2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/293ab95645c102b941dee443ccf73fb9b5b5a9706b9893f09b5f1941b1bd0c8b.32da30c4245b376f0c4fd55aaf1c536c5ef13f10c248390e0311fcb4ca48f475
Model config BertConfig {
  "_name_or_path": "textattack/bert-base-uncased-SST-2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "sst-2",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading configuration

ONNX opset version set to: 11
Loading pipeline (model: textattack/bert-base-uncased-SST-2, tokenizer: textattack/bert-base-uncased-SST-2)


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at textattack/bert-base-uncased-SST-2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.

Creating folder onnx
Using framework PyTorch: 1.10.0+cu111
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']




In [None]:
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_all_providers
from contextlib import contextmanager
from dataclasses import dataclass
from tqdm import trange

def create_model_for_provider(model_path: str, provider: str) -> InferenceSession: 
  
  assert provider in get_all_providers(), f"provider {provider} not found, {get_all_providers()}"

  # Few properties that might have an impact on performances (provided by MS)
  options = SessionOptions()
  options.intra_op_num_threads = 1
  options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

  # Load the model as a graph and prepare the CPU backend 
  session = InferenceSession(model_path, options, providers=[provider])
  session.disable_fallback()
    
  return session

In [None]:
from transformers.convert_graph_to_onnx import quantize, optimize

# Optimize the ONNX model
optimized_model_path = optimize(Path("onnx/bert-base-uncased-SST-2.onnx"))

# Quantize the previously optimized ONNX
quantized_model_path = quantize(optimized_model_path)

# Then you just have to load through ONNX runtime
quantized_model_onnx = create_model_for_provider(quantized_model_path.as_posix(), "CPUExecutionProvider")

Optimized model has been written at onnx/bert-base-uncased-SST-2-optimized.onnx: ✔
/!\ Optimized model contains hardware specific operators which might not be portable. /!\


         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.


2022-02-14 10:49:58 [INFO] Quantization parameters for tensor:"239" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"298" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"277" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"312" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"327" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"338" not specified
2022-02-14 10:49:59 [INFO] Quantization parameters for tensor:"353" not specified
2022-02-14 10:50:00 [INFO] Quantization parameters for tensor:"412" not specified
2022-02-14 10:50:00 [INFO] Quantization parameters for tensor:"391" not specified
2022-02-14 10:50:00 [INFO] Quantization parameters for tensor:"426" not specified
2022-02-14 10:50:00 [INFO] Quantization parameters for tensor:"441" not specified
2022-02-14 10:50:00 [INFO] Quantization parameters for tensor:"452" not specified
2022-02-14 10:50

Quantized model has been written at onnx/bert-base-uncased-SST-2-optimized-quantized.onnx: ✔


In [None]:
import os

print('ONNX full precision model size (MB):', os.path.getsize("onnx/bert-base-uncased-SST-2.onnx")/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize("onnx/bert-base-uncased-SST-2-optimized-quantized.onnx")/(1024*1024))

ONNX full precision model size (MB): 417.7162866592407
ONNX quantized model size (MB): 106.53940677642822
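The two sizes above translate directly into a compression ratio. The short calculation below uses only the numbers printed by the previous cell.

```python
# File sizes in MB, copied from the output above.
full_precision_mb = 417.7162866592407
quantized_mb = 106.53940677642822

compression_ratio = full_precision_mb / quantized_mb
size_reduction = 1.0 - quantized_mb / full_precision_mb

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"size reduction: {size_reduction:.1%}")
```

The ratio comes out slightly below 4x, which is close to what one would expect if most 32-bit floating-point weights are stored as 8-bit integers.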


Preparing inputs for the ONNX model:

In [None]:
inputs_onnx = {
    'attention_mask': eval_dataset['attention_mask'],
    'input_ids': eval_dataset['input_ids'],
    'token_type_ids': eval_dataset['token_type_ids']
}

Let's start performing the benchmark.

In [None]:
import time 

start_time = time.time()
outputs_batch = quantized_model_onnx.run(None, inputs_onnx)
elapsed_time = time.time() - start_time

pred = np.argmax(np.array(outputs_batch[0]), axis=1) # compare it with the eval_dataset['label']
metric_quantized_onnx = np.sum(pred == np.array(eval_dataset['label']))/len(pred)

print(f"Quantized model obtained with {metric_name} of {metric_quantized_onnx}, time elapsed: {elapsed_time}")

Quantized model obtained with eval_accuracy of 0.92, time elapsed: 69.48547196388245
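Note that the `Trainer` evaluations logged earlier use a batch size of 8, whereas the ONNX call above feeds all 200 examples as one batch. For a closer like-for-like timing, the inputs could be split into smaller batches first. The helper below is a sketch of that idea on toy data; `iter_batches` and `toy_inputs` are hypothetical names that are not part of the notebook.

```python
from typing import Dict, Iterator, List, Sequence


def iter_batches(inputs: Dict[str, Sequence], batch_size: int) -> Iterator[Dict[str, List]]:
    """Yield dictionaries holding consecutive slices of every input column."""
    n_rows = len(next(iter(inputs.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: list(column[start:start + batch_size]) for name, column in inputs.items()}


# Toy stand-in for `inputs_onnx`: 5 rows of 3 token ids each.
toy_inputs = {
    "input_ids": [[101, 7, 102]] * 5,
    "attention_mask": [[1, 1, 1]] * 5,
}
batches = list(iter_batches(toy_inputs, batch_size=2))
print([len(batch["input_ids"]) for batch in batches])
```

In the notebook, each batch could then be passed to `quantized_model_onnx.run(None, batch)` and the per-batch outputs concatenated before computing the accuracy.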


### Non-quantized version

In the summary table below, the ONNX-quantized model takes about twice as long as the Lpot-quantized model, although in the run logged above the two timings are much closer (about 69 s versus 71 s). So far we have only looked at the *quantized* models; what about the *non-quantized* version? How accurate is it?

In [None]:
tokenizer_std = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
model_std = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")


In [None]:
trainer_std = Trainer(
            model=model_std,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer_std,
            data_collator=data_collator,
        )

start_time = time.time()
metric_std = take_eval_steps(model_std, trainer_std, metric_name)
elapsed_time = time.time() - start_time
print(f"Non-quantized model obtained with {metric_name} of {metric_std}, time elapsed: {elapsed_time}")

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'accuracy': 0.915}
Non-quantized model obtained with eval_accuracy of 0.915, time elapsed: 80.45638847351074


From these results, the Lpot-quantized model is the fastest at inference on the evaluation set. However, the size of the speedup depends on the CPU in use. Reassuringly, the evaluation accuracy stays essentially the same across all models. The results are summarized in the table below.

Metric|Lpot quantization|ONNX quantization|Baseline
-----|-----|-----|-----
*Accuracy*|0.915|0.915|0.915
*Inference time (s)*|10.58|22.27|20.73
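The relative speedups follow directly from the table. The snippet below computes them, assuming the inference times are in seconds, since they come from `time.time()` differences as in the cells above.

```python
# Inference times copied from the summary table (assumed to be in seconds).
inference_time = {"lpot": 10.58, "onnx": 22.27, "baseline": 20.73}

# Speedup relative to the non-quantized baseline (values above 1 mean faster).
speedups = {
    name: inference_time["baseline"] / elapsed
    for name, elapsed in inference_time.items()
    if name != "baseline"
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x relative to the baseline")
```

With these numbers, the Lpot-quantized model is roughly twice as fast as the baseline, while the ONNX-quantized model is slightly slower than it.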


## 3. Quantization - Summarization 🤏

Let us now turn to another well-known NLP task: *summarization*. The question remains the same: how fast can the model get with Optimum?

For the *quantized* version, we start with the following imports:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments # Sequence-to-sequence language modelling

As before, we look at the quantized model first and the non-quantized model afterwards. After installing `sacrebleu` for the evaluation metric, we create some global variables to configure the model.

In [None]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting portalocker
  Downloading portalocker-2.3.2-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.4 portalocker-2.3.2 sacreble

In [None]:
model_name = "sshleifer/distilbart-cnn-12-6"
# config_path = "content/"
task = "summarization"
padding = "max_length"
max_seq_length = 128 
max_eval_samples = 8
metric_name = "eval_score"
dataset = load_dataset("cnn_dailymail",
                       '3.0.0',
                       split="validation[:5%]") # 5% of the validation data
metric = load_metric("sacrebleu")
data_collator = default_data_collator
sentence1_key, sentence2_key = "article", None


Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234...



Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234. Subsequent calls will reuse this data.



### INC (Intel Neural Compressor) quantization

Note that the evaluation set used for quantization tuning is very small (8 samples). We could increase this number; however, quantization might then fail to find a model that matches the accuracy of the baseline model.

`IncQuantizer` doesn't come with a sequence-to-sequence class, so we create one in the cell below:

In [None]:
class IncQuantizerForSequenceToSequenceLM(IncQuantizer):
  TRANSFORMERS_AUTO_CLASS = AutoModelForSeq2SeqLM
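The subclass above only overrides a class attribute. Presumably the base class consults that attribute when it loads the model, but that is an assumption about Optimum's internals. The toy classes below (all names hypothetical, none taken from Optimum) sketch how such a pattern works in plain Python.

```python
class ToyQuantizer:
    """Toy stand-in for a quantizer base class (not the Optimum implementation)."""

    AUTO_CLASS = None  # subclasses set this to the loader they need

    def __init__(self, model):
        self.model = model

    @classmethod
    def from_config(cls, model_name: str) -> "ToyQuantizer":
        if cls.AUTO_CLASS is None:
            raise NotImplementedError(f"{cls.__name__} does not define AUTO_CLASS")
        # The loader is looked up on the subclass, so no method needs overriding.
        return cls(cls.AUTO_CLASS(model_name))


class ToySeq2SeqLoader:
    """Hypothetical loader that just records which model was requested."""

    def __init__(self, model_name: str):
        self.model_name = model_name


class ToyQuantizerForSeq2Seq(ToyQuantizer):
    AUTO_CLASS = ToySeq2SeqLoader


toy_quantizer = ToyQuantizerForSeq2Seq.from_config("some-seq2seq-model")
print(type(toy_quantizer.model).__name__, toy_quantizer.model.model_name)
```

The design keeps all loading logic in the base class, so supporting a new task only requires a two-line subclass.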

Instantiate the quantizer:

In [None]:
# Don't forget to upload 'quantization.yml' before running this line
quantizer = IncQuantizerForSequenceToSequenceLM.from_config(
    model_name_or_path="sshleifer/distilbart-cnn-12-6",
    inc_config="quantization.yml",
)

https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpihnl1us6



storing https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/f5316f64f9716436994a7ad76a354dc20ecb2dd74eb61d278f103a9c8b80291f.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
creating metadata file for /root/.cache/huggingface/transformers/f5316f64f9716436994a7ad76a354dc20ecb2dd74eb61d278f103a9c8b80291f.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp848gpdvi



storing https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/adac95cf641be69365b3dd7fe00d4114b3c7c77fb0572931db31a92d4995053b.a50597c2c8b540e8d07e03ca4d58bf615a365f134fb10ca988f4f67881789178
creating metadata file for /root/.cache/huggingface/transformers/adac95cf641be69365b3dd7fe00d4114b3c7c77fb0572931db31a92d4995053b.a50597c2c8b540e8d07e03ca4d58bf615a365f134fb10ca988f4f67881789178
loading configuration file https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/adac95cf641be69365b3dd7fe00d4114b3c7c77fb0572931db31a92d4995053b.a50597c2c8b540e8d07e03ca4d58bf615a365f134fb10ca988f4f67881789178
Model config BartConfig {
  "_name_or_path": "sshleifer/distilbart-cnn-12-6",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartF


storing https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/9951e68693b9a7c583ae677e9cb53c02715d9bd0311a78706401372653cdea0a.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05
creating metadata file for /root/.cache/huggingface/transformers/9951e68693b9a7c583ae677e9cb53c02715d9bd0311a78706401372653cdea0a.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05
https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp6qch1dy3



storing https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/merges.txt in cache at /root/.cache/huggingface/transformers/7588c8d398d659b230a038240cc023f67b6848117d2999f06ab625af7bfc7ec1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
creating metadata file for /root/.cache/huggingface/transformers/7588c8d398d659b230a038240cc023f67b6848117d2999f06ab625af7bfc7ec1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/9951e68693b9a7c583ae677e9cb53c02715d9bd0311a78706401372653cdea0a.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05
loading file https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/7588c8d398d659b230a038240cc023f67b6848117d2999f06ab625af7bfc7ec1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c811577


storing https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/b336fa0b874ea92e3e22f07a7e6f8fa9da01221759c33abeb2679d6d98fe7755.585965cf7e82e4536033cd21d76c486af3d6b1c2a34b3a847840d4e7fe9d8844
creating metadata file for /root/.cache/huggingface/transformers/b336fa0b874ea92e3e22f07a7e6f8fa9da01221759c33abeb2679d6d98fe7755.585965cf7e82e4536033cd21d76c486af3d6b1c2a34b3a847840d4e7fe9d8844
loading weights file https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/b336fa0b874ea92e3e22f07a7e6f8fa9da01221759c33abeb2679d6d98fe7755.585965cf7e82e4536033cd21d76c486af3d6b1c2a34b3a847840d4e7fe9d8844
All model checkpoint weights were used when initializing BartForConditionalGeneration.

All the weights of BartForConditionalGeneration were initialized from the model checkpoint at sshleifer/distilbart-cnn-12-6.
If your task is similar to the ta

In [None]:
tokenizer = quantizer.tokenizer
model = quantizer.model # BartForConditionalGeneration 

With the tokenizer in hand, define a preprocessing function for the inputs:

In [None]:
def preprocess_function(examples):
  result = tokenizer(examples['article'],
                     padding=padding,
                     max_length=max_seq_length,
                     truncation=True)
  
  ground_truth = tokenizer(examples['highlights'],
                     padding=padding,
                     max_length=max_seq_length,
                     truncation=True)

  result["labels"] = ground_truth["input_ids"]
  result["label_ids"] = ground_truth["input_ids"]

  return result

def print_article(examples):
  print(examples['article'])

eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.select(range(max_eval_samples))


In [None]:
# The lines below test how the model outputs
# dummy_input = tokenizer("summarize: " + eval_dataset[0]['article'], return_tensors='pt', 
#                         max_length=128, truncation=True)
# dummy_output = model.generate(dummy_input["input_ids"], max_length=150, min_length=40,
#                               length_penalty=2.0, num_beams=4, early_stopping=True)

In [None]:
# tokenizer.decode(dummy_output[0])

Define a *compute_metrics* function that serves as an input to the Seq2SeqTrainer class. It decodes the predictions and the labels back into text and computes the BLEU score between them.

In [None]:
def compute_metrics(p: EvalPrediction):
  label_ids = p.label_ids
  pred_ids = p.predictions
  pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
  label_ids[label_ids == -100] = tokenizer.pad_token_id
  label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
  label_str = [[label] for label in label_str]

  bleu_output = metric.compute(predictions=pred_str, references=label_str)
  return bleu_output
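
The `-100` replacement in `compute_metrics` is needed because `-100` is the value the loss ignores at padded label positions, and it is not a valid vocabulary index, so it cannot be decoded. A minimal pure-Python sketch of that one step, assuming list inputs instead of arrays; the helper name `unmask_labels` and the token ids are hypothetical and only for illustration:

```python
IGNORE_INDEX = -100  # value used to mask padded label positions
PAD_TOKEN_ID = 1     # hypothetical pad id; use tokenizer.pad_token_id in practice


def unmask_labels(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Return a copy of label_ids with every ignore_index replaced by pad_token_id."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in row]
        for row in label_ids
    ]


# Two made-up label rows, padded with -100 to the same length.
labels = [[0, 713, 16, 2, -100, -100], [0, 9226, 2, -100, -100, -100]]
restored = unmask_labels(labels)
print(restored)
```

Unlike the in-place assignment in the notebook, this sketch returns a new list and leaves the original labels untouched.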

In [None]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    output_dir="content/"
)

trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
def take_eval_steps(model, trainer, metric_name):
  trainer.model = model
  metrics = trainer.evaluate(max_length=max_seq_length, num_beams=4)
  return metrics[metric_name]

def eval_func(model):
  return take_eval_steps(model, trainer, metric_name)

In [None]:
quantizer.eval_func = eval_func
q_model = quantizer.fit_dynamic()

2022-02-14 11:06:00 [INFO] Pass query framework capability elapsed time: 6.03 ms
2022-02-14 11:06:00 [INFO] Get FP32 model baseline.
The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: highlights, id, article.
***** Running Evaluation *****
  Num examples = 8
  Batch size = 8


Trainer is attempting to log a value of "[98, 34, 23, 17]" of type <class 'list'> for key "eval/counts" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[434, 426, 418, 410]" of type <class 'list'> for key "eval/totals" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[22.580645161290324, 7.981220657276995, 5.502392344497608, 4.146341463414634]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
2022-02-14 11:06:41 [INFO] Save tuning history to /content/nc_workspace/2022-02-14_10-00-09/./history.snapshot.
2022-02-14 11:06:41 [INFO] FP32 baseline is: [Accuracy: 8.0077, Duration (seconds): 41.0515]
2022-02-14 11:06:51 [INFO] |*****Mixed Precision Statistics*****|
2022-02-14 11:06:51 [

Let's measure how long the quantized model takes to run the evaluation.

In [None]:
start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: highlights, id, article.
***** Running Evaluation *****
  Num examples = 8
  Batch size = 8


Trainer is attempting to log a value of "[101, 41, 27, 17]" of type <class 'list'> for key "eval/counts" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[465, 457, 449, 441]" of type <class 'list'> for key "eval/totals" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[21.72043010752688, 8.971553610503282, 6.013363028953229, 3.854875283446712]" of type <class 'list'> for key "eval/precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


Quantized model obtained with eval_score of 8.198151847957936, time elapsed: 28.835773706436157
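
The start/stop timing pattern used above can be wrapped in a small helper so that every model is timed the same way. This is only a sketch: `timed` and `fake_eval` are hypothetical names, and `fake_eval` stands in for a call such as `take_eval_steps`. It uses `time.perf_counter()`, which is a monotonic clock intended for measuring intervals, rather than `time.time()`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_eval(n):
    """Stand-in workload (hypothetical); replace with the real evaluation call."""
    return sum(i * i for i in range(n))


score, seconds = timed(fake_eval, 1000)
print(f"result: {score}, time elapsed: {seconds:.6f} s")
```

A single run on a shared machine can be noisy, so repeating the measurement a few times and comparing the runs gives a more trustworthy number.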


How does the quantized model perform when we evaluate it on a larger sample?

In [None]:
max_eval_samples = 100
eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset_large = eval_dataset.select(range(max_eval_samples))

trainer = Seq2SeqTrainer(
            model=q_model.model,
            args=training_args,
            eval_dataset=eval_dataset_large,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: highlights, id, article.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


### Non-quantized version

For the non-quantized version, we have the following:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

In [None]:
trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            eval_dataset=eval_dataset_large,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

In [None]:
start_time = time.time()
metric_std = take_eval_steps(model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Non-quantized model obtained with {metric_name} of {metric_std}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


Non-quantized model obtained with eval_score of 8.269434672683811, time elapsed: 191.84312677383423


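The non-quantized model takes roughly twice as long as the quantized model to summarize the 100 evaluation samples (191.84 s vs. 87.76 s). However, the speedup comes at a cost in accuracy: the BLEU score of the non-quantized model is about 30% higher than that of the quantized model (8.27 vs. 6.31). On the 8 samples used during tuning, the quantized model actually scored slightly higher than the FP32 baseline (8.20 vs. 8.01), so the evaluation dataset used in the tuning phase was evidently too small to expose this loss.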

Metric (100 samples)|LPOT quantization|Baseline
-----|-----|-----
*BLEU*|6.31|8.27
*Inference time (s)*|87.76|191.84
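
The ratios behind this comparison can be recomputed directly from the table. The short sketch below does the arithmetic; the dictionary layout and variable names are illustrative assumptions, while the numbers are the ones reported above.

```python
# Values copied from the results table (100 evaluation samples).
results = {
    "quantized": {"bleu": 6.31, "seconds": 87.76},
    "baseline": {"bleu": 8.27, "seconds": 191.84},
}

q, b = results["quantized"], results["baseline"]

# How many times faster the quantized model is than the baseline.
speedup = b["seconds"] / q["seconds"]

# BLEU difference expressed relative to each model in turn.
baseline_gain = (b["bleu"] - q["bleu"]) / q["bleu"]   # baseline vs. quantized
quantized_drop = (b["bleu"] - q["bleu"]) / b["bleu"]  # quantized vs. baseline

print(f"speedup: {speedup:.2f}x")
print(f"baseline BLEU is {baseline_gain:.1%} higher than quantized")
print(f"quantized BLEU is {quantized_drop:.1%} lower than baseline")
```

Note that the same BLEU gap reads as a larger percentage when measured against the quantized score than against the baseline score, so it is worth stating which one is the reference.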

## 4. Conclusion

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Model quantization in itself is a powerful technique for putting transformer models into production.

Furthermore, quantization with Neural Compressor (the toolkit formerly called LPOT) offers a strong and mature-feeling alternative to the better-known ONNX quantization.

Definitely a keeper!