# Quantizing Models with PyTorch

Quantizing an NLP model in PyTorch reduces the precision of the model's parameters to improve inference speed and shrink the memory footprint. The process converts floating-point parameters to integers and can be implemented with a few lines of code.
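
To make the float-to-integer conversion concrete, here is a minimal pure-Python sketch of affine (scale plus zero-point) quantization to 8 bits. The helper names `quantize_affine` and `dequantize_affine` are hypothetical and written only for illustration. They are not part of PyTorch or Archai, and real backends may use a different scheme (for example symmetric or per-channel quantization).

```python
def quantize_affine(values, num_bits=8):
    """Map floats to unsigned integers with an affine (scale + zero-point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    # Round to the nearest integer level and clamp into the representable range.
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [scale * (q - zero_point) for q in quantized]


# Illustrative values only, not taken from any real model.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each value is stored in one byte instead of four, at the cost of a rounding error on the order of one quantization step (`scale`).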

## Loading the Model

The first step is to load an NLP model. In this notebook, we use a pre-trained GPT-2 model from the Hugging Face Hub.

In [1]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

## Post-Training Quantization (PTQ)

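Post-Training Quantization (PTQ) quantizes a model that has already been trained, without any further training. This notebook uses dynamic quantization, a form of PTQ in which weights are converted to integers ahead of time while the quantization parameters for activations are computed at runtime from the values actually observed.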
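
The runtime aspect of dynamic quantization can be sketched with a toy linear layer in pure Python. The class `DynamicQuantLinear` and its helpers are hypothetical and exist only for this illustration. They assume symmetric per-tensor int8 quantization, which is not necessarily what PyTorch or Archai use internally.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto `qmax`."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def to_int8(values, scale, qmax=127):
    """Round to the nearest integer level and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


class DynamicQuantLinear:
    """Toy linear layer: int8 weights fixed ahead of time, activations quantized per call."""

    def __init__(self, weight_rows):
        flat = [w for row in weight_rows for w in row]
        self.w_scale = symmetric_scale(flat)
        self.w_int8 = [to_int8(row, self.w_scale) for row in weight_rows]

    def __call__(self, x):
        # "Dynamic": the activation scale is derived from this specific input.
        x_scale = symmetric_scale(x)
        x_int8 = to_int8(x, x_scale)
        # Integer accumulation, then one floating-point rescale per output.
        return [
            sum(w * a for w, a in zip(row, x_int8)) * self.w_scale * x_scale
            for row in self.w_int8
        ]


# Illustrative values only.
weight = [[0.5, -0.25], [1.0, 0.75]]
layer = DynamicQuantLinear(weight)
x = [2.0, -4.0]
y_quant = layer(x)
y_float = [sum(w * a for w, a in zip(row, x)) for row in weight]
print(y_quant, y_float)
```

Because the activation scale is recomputed for every input, no calibration dataset is needed. The trade-off is a small amount of extra work on each forward pass.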

Archai offers a wrapper function, `dynamic_quantization_torch()`, which takes care of dynamically quantizing the pre-trained model.

*Note that we set PyTorch's number of threads to 1 because quantized models will only use a single thread.*

In [2]:
import torch
from archai.quantization.ptq import dynamic_quantization_torch

torch.set_num_threads(1)
model_qnt = dynamic_quantization_torch(model)

2023-02-03 12:05:52,756 - archai.quantization.ptq — INFO —  Quantizing model ...


## Comparing Default and Quantized Models

Finally, we can compare the sizes of the default and quantized models, as well as the difference between their logits. Note that if the model has not been trained with Quantization Aware Training (QAT), the quantized version might produce different logits and perform worse.

In [3]:
from archai.common.file_utils import calculate_torch_model_size

print(f"Model: {calculate_torch_model_size(model)}MB")
print(f"Model-QNT: {calculate_torch_model_size(model_qnt)}MB")

inputs = {"input_ids": torch.randint(1, 10, (1, 192))}
logits = model(**inputs).logits
logits_qnt = model_qnt(**inputs).logits

print(f"Difference between logits: {logits_qnt - logits}")

Model: 510.391647MB
Model-QNT: 431.250044MB
Difference between logits: tensor([[[ -0.3091,  -0.5829,  -0.1439,  ...,   3.1061,   2.7097,  -1.1030],
         [ -1.3238,  -0.7332,  -3.8590,  ...,  -2.8122,  -3.3422,  -1.6324],
         [ -2.3850,  -5.1132,  -6.7728,  ...,  -4.2977,  -4.5302,  -1.9685],
         ...,
         [ -1.6885,  -5.1900,  -9.1044,  ...,   1.7422,  -1.2876,   0.9441],
         [ -5.2036,  -8.5287, -11.4208,  ...,  -3.6595,  -5.0663,  -2.8279],
         [ -4.3205,  -7.2593, -10.5583,  ...,  -2.6262,  -3.7815,  -1.0048]]],
       grad_fn=<SubBackward0>)
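
The raw difference tensor is hard to read at a glance, so it can help to reduce it to a few scalars. The following is a minimal sketch in pure Python. `summarize_difference` is a hypothetical helper, not an Archai or PyTorch function, and the input values are made up for illustration.

```python
import math


def summarize_difference(reference, candidate):
    """Return scalar summaries of the element-wise gap between two flat sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [c - r for r, c in zip(reference, candidate)]
    return {
        "max_abs": max(abs(d) for d in diffs),
        "mean_abs": sum(abs(d) for d in diffs) / len(diffs),
        "rmse": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
    }


# Made-up values standing in for flattened logits.
reference = [1.0, -2.0, 0.5, 3.0]
candidate = [1.5, -2.0, 0.0, 2.0]
summary = summarize_difference(reference, candidate)
print(summary)
```

With the tensors from the cell above, the same helper could be fed `logits.detach().flatten().tolist()` and `logits_qnt.detach().flatten().tolist()`, although for large outputs computing the same statistics directly on the tensors would be faster.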
