# Quantize PyTorch Models with Intel® Neural Compressor (INC)

In this notebook we will look at a real-world text classification use case with a Hugging Face model. We will first use a stock FP32 PyTorch model to generate predictions. Then we will perform INT8 quantization with the easy-to-use APIs provided by Intel® Neural Compressor (INC) and measure the speedup over stock PyTorch on Intel® hardware.

In [None]:
import torch
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import transformers
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    PretrainedConfig,
    set_seed,
)

In [None]:
#######################################################################################################################################
import neural_compressor # Intel® Neural Compressor (INC) is an open-source Python library that supports popular model compression techniques
#######################################################################################################################################

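# check that Intel® Neural Compressor (INC) is v2.0 or newer
# (float() fails on version strings such as "2.4.1", so compare parsed versions instead)
from packaging import version
assert version.parse(neural_compressor.__version__) >= version.parse("2.0"), "The below APIs work with Intel® Neural Compressor (INC) 2.0 and above"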

In [None]:
# set a constant seed for reproducibility
set_seed(0)

**Helper Functions**

The functions below help us measure model runtime and plot comparison charts that summarize the optimizations.

In [None]:
def get_average_inference_time(model, data):
    """
    does a model warm up and times the model runtime
    """
    with torch.no_grad():
        # warm up
        for _ in range(25):
            model(input_ids=data[0], attention_mask=data[1])

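        # measure with a monotonic, high-resolution clock
        import time
        start = time.perf_counter()
        for _ in range(25):
            output = model(input_ids=data[0], attention_mask=data[1])
        end = time.perf_counter()
        # average time per forward pass, in milliseconds
        average_inference_time = (end-start)/25*1000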
    
    return average_inference_time

def plot_speedup(inference_time_stock, inference_time_optimized):
    """
    Plots a bar chart comparing the time taken by stock PyTorch model and time taken by
    the quantized model
    """
    data = {'FP32': inference_time_stock, 'INT8': inference_time_optimized}
    model_type = list(data.keys())
    times = list(data.values())

    fig = plt.figure(figsize = (10, 5))

    # creating the bar plot
    plt.bar(model_type, times, color ='blue',
            width = 0.4)

    plt.ylabel("Runtime (ms)")
    plt.title(f"Speedup achieved - {inference_time_stock/inference_time_optimized:.2f}x")
    plt.savefig('inc_speedup.png')
    plt.show()
    

**Model**

Instantiate an FP32 BERT model from Hugging Face that has been fine-tuned on the IMDB dataset. In a real-world use case this could be any model that has been trained and is ready to be deployed.

In [None]:
# Load pretrained model and tokenizer
model_name = "JiaqiLee/imdb-finetuned-bert-base-uncased"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

**Data**

We will use the IMDB dataset from the Hugging Face datasets library to demonstrate the role that the dataset plays in quantizing the model with Intel® Neural Compressor (INC).

In [None]:
# set up data
from datasets import load_dataset

# data = load_dataset("sms_spam")
data = load_dataset("imdb")

Let's look at one example from the dataset.

In [None]:
data['test'][20000]

In [None]:
# The torch.utils.data.Dataset class IMDBDataset allows us to prepare a tokenized dataset
from dataset import IMDBDataset

text = data['test']['text']
labels = data['test']['label'] 

test_dataset = IMDBDataset(text, labels, tokenizer=tokenizer, data_size=1200)

In [None]:
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64)

**Evaluation Function**

The eval function is an important part of quantizing with Intel® Neural Compressor (INC). It computes the metric that we care about and want to preserve as much as possible after quantization, for example accuracy or F1 score.
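
To make the idea of "preserving" a metric concrete, here is a minimal sketch of how a tuner could compare the eval function's result for a quantized model against the FP32 baseline. The helper `meets_accuracy_goal` and the sample accuracy values are hypothetical and only illustrate the general idea of a relative versus an absolute tolerance; they are not INC's internal implementation.

```python
def meets_accuracy_goal(baseline, quantized, tolerable_loss, criterion="relative"):
    """Hypothetical helper: is the metric drop within the allowed tolerance?"""
    drop = baseline - quantized
    if criterion == "relative":
        # tolerance is a fraction of the baseline metric
        return drop <= tolerable_loss * baseline
    if criterion == "absolute":
        # tolerance is expressed in the metric's own units
        return drop <= tolerable_loss
    raise ValueError(f"unknown criterion: {criterion}")


# Made-up accuracy values, for illustration only
ok_relative = meets_accuracy_goal(0.90, 0.895, tolerable_loss=0.01, criterion="relative")
ok_absolute = meets_accuracy_goal(0.90, 0.88, tolerable_loss=0.01, criterion="absolute")
print(ok_relative, ok_absolute)
```

With these made-up numbers the first candidate passes (a drop of 0.005 against an allowance of 0.009) and the second fails (a drop of 0.02 against an allowance of 0.01).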

In [None]:
# The following eval function computes accuracy on the test loader
def eval_func(model_q):
    test_preds = []
    test_labels = []
    # gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, labels in test_loader:
            pred = model_q(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
            )
            # save predictions and labels from every batch to calculate accuracy later
            test_preds.extend(pred.logits.argmax(-1).tolist())
            test_labels.extend(torch.as_tensor(labels).tolist())
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(test_labels, test_preds)

Data for Benchmarking - Let's pick one batch from the test dataloader

In [None]:
batch = next(iter(test_loader))
# note: this rebinds `data`, which previously held the IMDB dataset
data = (batch[0]['input_ids'], batch[0]['attention_mask'])

Benchmark Stock PyTorch Model (FP32)

In [None]:
inference_time_stock = get_average_inference_time(model.eval(), data)

print(f"time taken for forward pass: {inference_time_stock} ms")

**Quantization**

Quantization is a very popular deep learning model optimization technique for improving inference speed. It reduces the number of bits required to represent the weights, the activations, or both in a neural network. This is done by converting real-valued numbers into lower-bit representations such as int8 or int4, mainly for the inference phase, with minimal to no loss in accuracy.
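
As a rough illustration of that conversion, the sketch below maps a few float values to unsigned 8-bit integers with a scale and zero point, then maps them back. It is a plain-Python sketch of generic affine quantization; the helper names and sample values are made up for illustration and are not part of INC or PyTorch.

```python
def quantize_params(x_min, x_max, num_bits=8):
    """Return (scale, zero_point) for an unsigned affine quantizer."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # the float range must contain 0.0 so that zero maps to an exact integer
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    qmax = 2 ** num_bits - 1
    # round to the nearest integer step, shift by zero_point, clamp to [0, qmax]
    return [min(max(int(round(v / scale)) + zero_point, 0), qmax) for v in values]


def dequantize(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]


weights = [-0.5, -0.1, 0.0, 0.75, 2.0]  # illustrative values
scale, zero_point = quantize_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The round trip is lossy: each restored value can differ from the original by up to half a quantization step (`scale / 2`), which is the source of the small accuracy loss mentioned above.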

Intel® Neural Compressor (INC) provides three types of Quantization APIs:
- post training dynamic quantization
- post training static quantization
- quantization aware training

In [None]:
#######################################################################################################################################
from neural_compressor.quantization import fit
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion, AccuracyCriterion
#######################################################################################################################################

Dynamic Quantization

In dynamic quantization, the weights of the neural network are converted from float32 to int8 format offline (post training), while the activations are quantized on the fly during inference.

Quantizing a tensor requires its min/max range. For activations, these ranges are collected at inference runtime, i.e. when data is passed through the model. As we will see, this is not the case for Static Quantization.
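
The sketch below illustrates the "range collected at runtime" idea in plain Python: the weight scale is computed once ahead of time, while the activation scale is recomputed from whatever data arrives. The helper names and sample values are hypothetical, and a symmetric int8 scheme is assumed for simplicity; real backends may use a different scheme.

```python
def minmax_scale(values, num_bits=8):
    """Symmetric scale: the largest |value| maps to the largest int magnitude."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


# Weights are known before deployment, so their scale is fixed offline.
weights = [0.5, -1.27, 0.8]
weight_scale = minmax_scale(weights)


def dynamic_activation_scale(activations):
    # Dynamic quantization: the range is measured on the incoming data itself.
    return minmax_scale(activations)


batch_a = [0.1, -0.2, 0.254]  # small activations
batch_b = [5.0, -12.7, 3.0]   # large activations
scale_a = dynamic_activation_scale(batch_a)
scale_b = dynamic_activation_scale(batch_b)
print(weight_scale, scale_a, scale_b)
```

Each batch gets a scale that fits its own range, at the cost of measuring that range on every forward pass.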

In [None]:
tuning_criterion = TuningCriterion(max_trials=5)
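# accept a quantized model only if its eval_func metric drops by at most `tolerable_loss` (absolute) versus the FP32 baseline
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.73, criterion='absolute')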
conf = PostTrainingQuantConfig(approach="dynamic", tuning_criterion=tuning_criterion, accuracy_criterion=accuracy_criterion)
q_model = fit(model, conf=conf, eval_func=eval_func)

In [None]:
inference_time_optimized = get_average_inference_time(q_model.eval(), data)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

Static Quantization

In static quantization, both the weights and the activations of the neural network are converted from float32 to int8 format offline (post training).

Quantizing a tensor requires its min/max range. Here the ranges are collected ahead of time using a calibration dataset, which should be representative of the data the model will see in production. The calibration process runs on the original FP32 model and records the tensor distributions needed to calculate the scale and zero point. Around 100 samples are usually enough for calibration.
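
A minimal plain-Python sketch of the calibration idea follows: an observer records the running min/max over a few calibration batches, and the scale and zero point are then computed once and frozen. The `RangeObserver` class and the sample batches are hypothetical illustrations, not INC's or PyTorch's observer implementation.

```python
class RangeObserver:
    """Hypothetical helper that tracks the running min/max of observed values."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def compute_qparams(self, num_bits=8):
        qmin, qmax = 0, 2 ** num_bits - 1
        # include 0.0 in the range so that zero is exactly representable
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        return scale, zero_point


# Illustrative activation values from three calibration batches
calibration_batches = [[-1.0, 0.5, 2.0], [0.0, 3.0, 1.5], [-2.1, 0.2, 0.9]]

observer = RangeObserver()
for calibration_batch in calibration_batches:
    observer.observe(calibration_batch)

# The parameters are computed once and reused unchanged for every inference call.
scale, zero_point = observer.compute_qparams()
print(scale, zero_point)
```

Because the range is frozen after calibration, inference avoids the per-batch range measurement of dynamic quantization, but values outside the calibrated range get clipped, which is why the calibration data needs to be representative.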

In [None]:
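tuning_criterion = TuningCriterion(max_trials=5)
# pass the tuning criterion to the config so that max_trials takes effect
conf = PostTrainingQuantConfig(approach="static", tuning_criterion=tuning_criterion)
q_model = fit(model, conf=conf, eval_func=eval_func, calib_dataloader=test_loader)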

In [None]:
# benchmark on the same batch (`data`) that was used for the FP32 model
inference_time_optimized = get_average_inference_time(q_model.eval(), data)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

Apart from Quantization, Intel® Neural Compressor (INC) provides a host of other model compression techniques and tools, such as Advanced Mixed Precision, Pruning (Sparsity), Distillation, Orchestration and Benchmarking.

Please visit our [GitHub](https://github.com/intel/neural-compressor)