### Quantization with Hugging Face Optimum (and ONNX Runtime)

Scaling and productionizing Transformers with millions of parameters are difficult tasks 😔. 

To address this, Hugging Face recently released a new tool called **Optimum** 💥 (https://huggingface.co/blog/hardware-partners-program), which aims to speed up the inference time of Transformers. It enables ML practitioners to leverage the available hardware features when quantizing models.


Quantization is the process of approximating a model's floating-point parameters (and possibly activations) with low-bit-width numbers, such as 8-bit integers. As a result, the deep learning model becomes smaller and takes fewer resources to run 👶.
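
As a rough illustration of the idea (not what Optimum or Lpot does internally), the sketch below applies symmetric 8-bit quantization to a handful of made-up weights. The helper names `quantize_int8` and `dequantize` and the weight values are hypothetical and only serve this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


weights = [0.42, -1.30, 0.07, 0.98, -0.55]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The rounding error per weight is bounded by half the scale.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight is now stored as a small integer plus one shared scale, which is where the size reduction comes from; the price is the small rounding error printed at the end.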


This notebook demonstrates some experiments on quantizing Hugging Face **pre-trained** models for a *sentiment analysis* task and a *summarization* task. It also compares the performance of Optimum x Lpot quantization, ONNX/ONNX Runtime quantization, and the baseline model. The results are summarized in tables at the end of Section 2 and Section 3.

It's recommended to run this notebook on Google Cloud AI Platform with an N2-standard-4 CPU, since this supports modern optimization frameworks. Results when running on Colab will probably be less impressive in terms of speedup.

## 1. Setup

In [None]:
!pip install -q transformers datasets optimum lpot 

In [None]:
!pip install -q onnxruntime onnxruntime-gpu onnxruntime-tools onnx psutil

In [None]:
# import unittest
import time
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = ""
import numpy as np

from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    default_data_collator,
    TrainingArguments,
    pipeline
)
from datasets import load_dataset, load_metric, list_metrics
from optimum.intel import lpot
from optimum.intel.lpot.quantization import LpotQuantizer, LpotQuantizerForSequenceClassification


In [None]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
    "summarization": ("article", None)
}

## 2. Quantization - Sentiment analysis 🎭

We first perform quantization on a sentiment analysis task. The cell below configures the pretrained model, its assigned task (sentiment analysis), and the validation dataset, which will be used later during quantization.

In [None]:
model_name = "textattack/bert-base-uncased-SST-2"
# config_path = "content/"
task = "sst2"
padding = "max_length"
max_seq_length = 128
max_eval_samples = 200
metric_name = "eval_accuracy"
dataset = load_dataset("glue", task, split="validation")
metric = load_metric("glue", task)
data_collator = default_data_collator
sentence1_key, sentence2_key = task_to_keys[task]

### LPOT quantization

Configure an Lpot-based quantizer for sequence classification, which suits the task at hand (SST-2). As a side note, Lpot, the Intel® Low Precision Optimization Tool, supports automatic accuracy-driven tuning strategies that help users quickly find the best quantized model. Interested readers can find more at https://github.com/intel/neural-compressor.

Downloading the config yaml file:

In [None]:
!wget https://raw.githubusercontent.com/ml6team/quick-tips/main/nlp/2021_10_12_huggingface_optimum/quantization.yml

Defining the quantizer:

In [None]:
quantizer = LpotQuantizerForSequenceClassification.from_config(
    os.getcwd(),
    "quantization.yml",
    model_name_or_path="textattack/bert-base-uncased-SST-2")

We could directly use the pretrained model to predict sentiment. Before turning to prediction, however, let us quantize the model. The cells below set up the components needed to quantize this pretrained model.

In [None]:
tokenizer = quantizer.tokenizer
model = quantizer.model

For the pre-processing of the evaluation data:

In [None]:
def preprocess_function(examples):
  args = (
    (examples[sentence1_key],) if sentence2_key is None else (
    examples[sentence1_key], examples[sentence2_key])
  )
  result = tokenizer(*args, padding=padding, max_length=max_seq_length, truncation=True)
  return result

eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.select(range(max_eval_samples))

The evaluation data is now stored in `eval_dataset`. The cell below defines the metric to compute. Note that for the SST-2 task, the `glue` dataset consists of movie reviews with human annotations of their sentiment (Stanford Sentiment Treebank), and accuracy is the evaluation criterion.

In [None]:
def compute_metrics(p: EvalPrediction):
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = metric.compute(predictions=preds, references=p.label_ids)
  if len(result) > 1:
    result["combined_score"] = np.mean(list(result.values())).item()
  print(result)
  return result
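
To make the argmax-then-compare step in `compute_metrics` concrete, here is a small standalone sketch that uses plain Python lists instead of the `glue` metric. The function name `accuracy_from_logits` and the toy logits and labels are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four toy examples with two classes each (e.g. negative / positive).
toy_logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of the 4 predictions match the labels
```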

We are now ready to instantiate a `Trainer` object that can be used to evaluate model accuracy during the tuning phase of quantization.

In [None]:
trainer = Trainer(
            model=model,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

In [None]:
def take_eval_steps(model, trainer, metric_name):
  trainer.model = model
  metrics = trainer.evaluate()
  return metrics.get(metric_name)

def eval_func(model):
  return take_eval_steps(model, trainer, metric_name)

In [None]:
quantizer.eval_func = eval_func
q_model = quantizer.fit_dynamic()

A quantized model is found! Let's investigate how long it takes to perform inference on the validation set.

In [None]:
start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'accuracy': 0.915}
Quantized model obtained with eval_accuracy of 0.915, time elapsed: 10.579970121383667
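
The timing above comes from a single run, so it can be noisy. A possible refinement, sketched here under the assumption that repeating the evaluation is affordable, is to warm up once and report the median of several runs. The helper `time_call` is hypothetical and not part of the original notebook.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; replace it with the evaluation call to be measured.
median_seconds = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"median: {median_seconds:.4f} s")
```

In this notebook, `fn` could for instance be `lambda: take_eval_steps(q_model.model, trainer, metric_name)`.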


### ONNX quantization

Another approach to quantization is to use ONNX/ONNX Runtime (https://huggingface.co/transformers/serialization.html). We first export the pretrained model to ONNX format, and then optimize and quantize it. The quantized model will then perform inference on `eval_dataset`.

In [None]:
!rm -rf onnx/ 
from pathlib import Path
from transformers import AutoTokenizer
from transformers.convert_graph_to_onnx import convert

# Exporting the model to ONNX
convert(pipeline_name="sentiment-analysis",
        framework="pt",
        model="textattack/bert-base-uncased-SST-2",
        tokenizer="textattack/bert-base-uncased-SST-2",
        output=Path("onnx/bert-base-uncased-SST-2.onnx"),
        opset=11)

In [None]:
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_all_providers

def create_model_for_provider(model_path: str, provider: str) -> InferenceSession: 
  
  assert provider in get_all_providers(), f"provider {provider} not found, {get_all_providers()}"

  # A few properties that might have an impact on performance (provided by MS)
  options = SessionOptions()
  options.intra_op_num_threads = 1
  options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

  # Load the model as a graph and prepare the CPU backend 
  session = InferenceSession(model_path, options, providers=[provider])
  session.disable_fallback()
    
  return session

In [None]:
from transformers.convert_graph_to_onnx import quantize, optimize

# Optimize the ONNX model
optimized_model_path = optimize(Path("onnx/bert-base-uncased-SST-2.onnx"))

# Quantize the previously optimized ONNX
quantized_model_path = quantize(Path("onnx/bert-base-uncased-SST-2-optimized.onnx"))

# Then you just have to load through ONNX runtime
quantized_model_onnx = create_model_for_provider(quantized_model_path.as_posix(), "CPUExecutionProvider")

In [None]:
import os

print('ONNX full precision model size (MB):', os.path.getsize("onnx/bert-base-uncased-SST-2.onnx")/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize("onnx/bert-base-uncased-SST-2-optimized-quantized.onnx")/(1024*1024))

ONNX full precision model size (MB): 417.7162857055664
ONNX quantized model size (MB): 106.53945922851562
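
The two sizes printed above can be turned into a compression ratio. The short sketch below only re-uses those two numbers. The ratio comes out slightly under the 4x that replacing 32-bit floats with 8-bit integers would give in the ideal case, presumably because not every tensor in the graph is quantized.

```python
full_precision_mb = 417.7162857055664  # size printed above
quantized_mb = 106.53945922851562      # size printed above

compression_ratio = full_precision_mb / quantized_mb
saved_fraction = 1 - quantized_mb / full_precision_mb

print(f"{compression_ratio:.2f}x smaller, {saved_fraction:.1%} of the size saved")
```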


Preparing inputs for the ONNX model:

In [None]:
inputs_onnx = {
    'attention_mask': eval_dataset['attention_mask'],
    'input_ids': eval_dataset['input_ids'],
    'token_type_ids': eval_dataset['token_type_ids']
}

Let's start performing the benchmark.

In [None]:
import time 

start_time = time.time()
outputs_batch = quantized_model_onnx.run(None, inputs_onnx)
elapsed_time = time.time() - start_time

pred = np.argmax(np.array(outputs_batch[0]), axis=1) # compare it with the eval_dataset['label']
metric_quantized_onnx = np.sum(pred == np.array(eval_dataset['label']))/len(pred)

print(f"Quantized model obtained with {metric_name} of {metric_quantized_onnx}, time elapsed: {elapsed_time}")

Quantized model obtained with eval_accuracy of 0.915, time elapsed: 22.273820400238037


### Non-quantized version

The ONNX-quantized model takes about twice as long as the Lpot-quantized model. So much for the *quantized* models; how about the *non-quantized* version? How long does it take, and how accurate is it?

In [None]:
tokenizer_std = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
model_std = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")

In [None]:
trainer_std = Trainer(
            model=model_std,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer_std,
            data_collator=data_collator,
        )

start_time = time.time()
metric_std = take_eval_steps(model_std, trainer_std, metric_name)
elapsed_time = time.time() - start_time
print(f"Non-quantized model obtained with {metric_name} of {metric_std}, time elapsed: {elapsed_time}")

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'accuracy': 0.915}
Non-quantized model obtained with eval_accuracy of 0.915, time elapsed: 20.731496810913086


From the results, we see that the Lpot-quantized model is the fastest at inference on the evaluation set, taking roughly half the time of the baseline. However, depending on the CPU in use, this speedup can vary. A good thing is that the evaluation accuracy is the same for all three models. A summary table of the results can be found below.

Metric|Lpot-quantization |ONNX-quantization |Baseline
-----|-----|-----|----- 
*Accuracy*|0.915|0.915|0.915
*Inference time (s)*|10.58|22.27|20.73


## 3. Quantization - Summarization 🤏

Let us now turn to another well-known NLP task: *summarization*. The question remains the same: how fast can the model get with Optimum?

The cell below imports the sequence-to-sequence classes needed for this task:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments # Sequence-to-sequence language modelling

Following the same sequence as before, let's first have a look at the quantized model before checking out the non-quantized model. We first create some global variables to configure the model.

In [None]:
!pip install sacrebleu

In [None]:
model_name = "sshleifer/distilbart-cnn-12-6"
# config_path = "content/"
task = "summarization"
padding = "max_length"
max_seq_length = 128 
max_eval_samples = 8
metric_name = "eval_score"
dataset = load_dataset("cnn_dailymail",
                       '3.0.0',
                       split="validation[:5%]") # 5% of the validation data
metric = load_metric("sacrebleu")
data_collator = default_data_collator
sentence1_key, sentence2_key = "article", None

### LPOT quantization

Note that the evaluation sample used for quantization tuning is very small (8 examples). We can increase this number; however, the tuning may then fail to find a quantized model that matches the accuracy of the baseline model.

`LpotQuantizer` doesn't come with a sequence-to-sequence class, so we create one in the cell below.

In [None]:
class LpotQuantizerForSequenceToSequenceLM(LpotQuantizer):
  TRANSFORMERS_AUTO_CLASS = AutoModelForSeq2SeqLM

Initiate the quantizer:

In [None]:
# Don't forget to upload 'quantization.yml' before running this line
quantizer = LpotQuantizerForSequenceToSequenceLM.from_config(
    os.getcwd(),
    "quantization.yml",
    model_name_or_path="sshleifer/distilbart-cnn-12-6")

In [None]:
tokenizer = quantizer.tokenizer
model = quantizer.model # BartForConditionalGeneration 

With the tokenizer available, define a preprocessing function for the inputs:

In [None]:
def preprocess_function(examples):
  result = tokenizer(examples['article'],
                     padding=padding,
                     max_length=max_seq_length,
                     truncation=True)
  
  ground_truth = tokenizer(examples['highlights'],
                     padding=padding,
                     max_length=max_seq_length,
                     truncation=True)

  result["labels"] = ground_truth["input_ids"]
  result["label_ids"] = ground_truth["input_ids"]

  return result

def print_article(examples):
  print(examples['article'])

eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.select(range(max_eval_samples))

In [None]:
# The lines below test how the model outputs
# dummy_input = tokenizer("summarize: " + eval_dataset[0]['article'], return_tensors='pt', 
#                         max_length=128, truncation=True)
# dummy_output = model.generate(dummy_input["input_ids"], max_length=150, min_length=40,
#                               length_penalty=2.0, num_beams=4, early_stopping=True)

In [None]:
# tokenizer.decode(dummy_output[0])

Define a `compute_metrics` function that serves as an input to the `Seq2SeqTrainer` class.

In [None]:
def compute_metrics(p: EvalPrediction):
  label_ids = p.label_ids
  pred_ids = p.predictions
  pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
  label_ids[label_ids == -100] = tokenizer.pad_token_id
  label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
  label_str = [[label] for label in label_str]

  bleu_output = metric.compute(predictions=pred_str, references=label_str)
  return bleu_output
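
BLEU is built on clipped n-gram precision: the share of n-grams in the prediction that also occur in the reference, where each reference n-gram can only be matched as often as it appears. The sketch below is a simplified stand-in for that single ingredient, not sacreBLEU itself, which additionally combines several n-gram orders, applies a brevity penalty, and uses its own tokenization. The function name `clipped_precision` and the example sentences are made up.

```python
from collections import Counter


def clipped_precision(prediction, reference, n=1):
    """Fraction of prediction n-grams that are matched by the reference."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    pred_ngrams = Counter(
        tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each n-gram count by how often it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in pred_ngrams.items())
    return overlap / total


# Repeating a correct word does not inflate the score: only one "the" is matched.
unigram_score = clipped_precision("the the the", "the cat")
bigram_score = clipped_precision("the cat sat", "the cat sat on the mat", n=2)
print(unigram_score, bigram_score)
```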

In [None]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    output_dir="content/"
)

trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

In [None]:
def take_eval_steps(model, trainer, metric_name):
  trainer.model = model
  metrics = trainer.evaluate(max_length=max_seq_length, num_beams=4)
  return metrics[metric_name]

def eval_func(model):
  return take_eval_steps(model, trainer, metric_name)

In [None]:
quantizer.eval_func = eval_func
q_model = quantizer.fit_dynamic()

Let's measure how long the evaluation takes:

In [None]:
start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article.
***** Running Evaluation *****
  Num examples = 8
  Batch size = 8


Quantized model obtained with eval_score of 8.280828935558953, time elapsed: 6.632972240447998


How does this quantized model perform if we evaluate it on a larger sample?

In [None]:
max_eval_samples = 100
eval_dataset = dataset.map(preprocess_function, batched=True)
eval_dataset_large = eval_dataset.select(range(max_eval_samples))

trainer = Seq2SeqTrainer(
            model=q_model.model,
            args=training_args,
            eval_dataset=eval_dataset_large,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

start_time = time.time()
metric_quantized = take_eval_steps(q_model.model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Quantized model obtained with {metric_name} of {metric_quantized}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


Quantized model obtained with eval_score of 6.314764697064099, time elapsed: 87.75756764411926


### Non-quantized version

For the non-quantized version, we have the following:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

In [None]:
trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            eval_dataset=eval_dataset_large,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

In [None]:
start_time = time.time()
metric_std = take_eval_steps(model, trainer, metric_name)
elapsed_time = time.time() - start_time
print(f"Non-quantized model obtained with {metric_name} of {metric_std}, time elapsed: {elapsed_time}")

The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, highlights, article.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


Non-quantized model obtained with eval_score of 8.269434672683811, time elapsed: 191.84312677383423


The non-quantized model takes roughly twice as long as the quantized model to summarize the same 100 articles. However, this speedup comes at a cost in quality: the BLEU score of the non-quantized model is about 30% higher than that of the quantized model. This suggests that the evaluation sample used during the tuning phase (8 examples) is too small.
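
The two ratios quoted here can be re-derived from the evaluation runs printed above. The sketch below only re-uses those printed numbers; the dictionary layout is an arbitrary choice for this example.

```python
# Values copied from the two 100-example evaluation runs printed above.
quantized = {"bleu": 6.314764697064099, "seconds": 87.75756764411926}
baseline = {"bleu": 8.269434672683811, "seconds": 191.84312677383423}

# How many times faster the quantized model is than the baseline.
speedup = baseline["seconds"] / quantized["seconds"]

# How much higher the baseline BLEU is, relative to the quantized BLEU.
bleu_gap = baseline["bleu"] / quantized["bleu"] - 1

print(f"speedup: {speedup:.2f}x, baseline BLEU higher by {bleu_gap:.0%}")
```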

Metric|Lpot-quantization |Baseline
-----|-----|-----
*BLEU*|6.31|8.27
*Inference time (s)*|87.76|191.84

## 4. Conclusion

Model quantization in itself is a powerful technique for putting transformer models into production.

Furthermore, LPOT quantization offers a strong and mature-feeling alternative to the more well-known ONNX quantization.

Definitely a keeper!