# Lesson 4: Quantization Theory

**Objective:**
Understand linear quantization techniques and implement weight quantization for pretrained language models using Hugging Face’s quanto library. This tutorial reorganizes the original material, adds explanatory notes, illustrative examples, calibration strategies, links to SOTA papers, and expert tips.

# 1. Introduction to Quantization

Quantization maps a large set of continuous values to a smaller, discrete set. In deep learning, reducing bit-width of weights and activations (e.g., FP32 → INT8) saves memory and speeds up inference, typically with minimal accuracy degradation.

**Why Quantize?**
- **Memory savings:** 32-bit → 8-bit yields 4× reduction in model size.
- **Compute efficiency:** Integer operations (INT8) are faster and more energy-efficient on many accelerators.
- **Deployment:** Smaller models fit on edge devices and reduce bandwidth requirements.

**Linear Quantization Overview:**
1. **Range estimation:** Find tensor’s minimum $r_{\min}$ and maximum $r_{\max}$.  
2. **Compute parameters:**  
   $$s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}, \quad z = \mathrm{round}\bigl( -r_{\min} / s\bigr).$$
3. **Quantize:**  
   $$q = \mathrm{clip}\bigl(\mathrm{round}(r / s) + z, \; q_{\min}, q_{\max}\bigr).$$
4. **Dequantize:**  
   $$\hat r = s\,(q - z).$$

- Here, $[q_{\min}, q_{\max}]$ is typically $[-128, 127]$ for signed INT8.
- **Quantization error** $\hat r - r$ arises but can be mitigated with calibration and training.

In [1]:
# Function to compute scale (s) and zero-point (z)
def get_quant_params(r_min, r_max, q_min=-128, q_max=127):
    """
    Returns scale s and zero-point z for mapping floats in [r_min, r_max]
    to ints in [q_min, q_max].
    """
    s = (r_max - r_min) / (q_max - q_min)
    z = int(round(-r_min / s))
    z = max(q_min, min(q_max, z))
    return s, z

# Example usage
r_min, r_max = -1.0, 1.0
scale, zero_point = get_quant_params(r_min, r_max)
print(f"Scale: {scale:.4f}, Zero-point: {zero_point}")

Scale: 0.0078, Zero-point: 127


In [2]:
import numpy as np

def quantize_tensor(r, s, z, q_min=-128, q_max=127):
    q = np.round(r / s) + z
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize_tensor(q, s, z):
    return s * (q.astype(np.int32) - z)

# Synthetic example
tensor = np.linspace(-1, 1, num=10)
q_tensor = quantize_tensor(tensor, scale, zero_point)
reconstructed = dequantize_tensor(q_tensor, scale, zero_point)

print("Original:", tensor)
print("Quantized:", q_tensor)
print("Reconstructed:", np.round(reconstructed, 4))
print("Reconstruction Error:", np.round(reconstructed - tensor, 4))

Original: [-1.         -0.77777778 -0.55555556 -0.33333333 -0.11111111  0.11111111
  0.33333333  0.55555556  0.77777778  1.        ]
Quantized: [ -1  28  56  84 113 127 127 127 127 127]
Reconstructed: [-1.0039 -0.7765 -0.5569 -0.3373 -0.1098  0.      0.      0.      0.
  0.    ]
Reconstruction Error: [-0.0039  0.0013 -0.0013 -0.0039  0.0013 -0.1111 -0.3333 -0.5556 -0.7778
 -1.    ]


## 0. Set-up

In [3]:
# Install required libraries
#!pip install transformers==4.35.0
#!pip install quanto==0.0.11
#!pip install torch==2.1.1

#!pip install datasets

In [4]:

# Imports and helper setup
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from helper import compute_module_sizes
import numpy as np
from datasets import load_dataset

## 1. Load model

In [5]:
# Load model & tokenizer
model_name = "EleutherAI/pythia-410m"
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Compute FP32 model size
total_bytes = sum(compute_module_sizes(model).values())
print(f"FP32 model size: {total_bytes / 1024**3:.2f} GB")

FP32 model size: 9.23 GB


## 2. Weight quantization with `quanto`

In [6]:
from quanto import quantize, freeze

# Apply static INT8 quantization to weights
quantize(model, weights=torch.int8, activations=None)

# Finalize quantized model (freeze modifies model in-place)
freeze(model)
qmodel = model

# Compute size of quantized model
total_q_bytes = sum(compute_module_sizes(qmodel).values())
print(f"INT8 model size: {total_q_bytes / 1024**3:.2f} GB")

INT8 model size: 3.22 GB


In [7]:
# Try with smaller model as in L4 tutorial: FLAN-T5 small

#!pip install sentencepiece==0.2.0

In [8]:
# Try with smaller model as in L4 tutorial: FLAN-T5 small

# Load Google FLAN-T5 Small (original tutorial model)
import sentencepiece as spm 
from transformers import T5Tokenizer, T5ForConditionalGeneration
from helper import compute_module_sizes

model_name = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Generate a sample output
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print("FP32 output:", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Compute FP32 model size
module_sizes = compute_module_sizes(model)
print(f"FP32 model size: {module_sizes[''] / 1024**3:.2f} GB")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


FP32 output: annie scott
FP32 model size: 0.29 GB


In [9]:
# Quantize Google FLAN-T5 Small weights to INT8
from quanto import quantize, freeze
import torch

quantize(model, weights=torch.int8, activations=None)

# Display model to confirm quantization wrappers
print(model)

# Finalize quantized model (freeze modifies in-place)
freeze(model)
qmodel = model

# Compute quantized model size
total_q_bytes = sum(compute_module_sizes(qmodel).values())
print(f"INT8 model size: {total_q_bytes / 1024**3:.2f} GB")

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): QLinear(in_features=512, out_features=384, bias=False)
              (k): QLinear(in_features=512, out_features=384, bias=False)
              (v): QLinear(in_features=512, out_features=384, bias=False)
              (o): QLinear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): QLinear(in_features=512, out_features=1024, bias=False)
              (wi_1): QLinear(in_features=512, out_features=1024, bias=False)
              

In [10]:
# Compare text generation outputs

def sample_text(mdl, prompt="Hello, my name is", max_new_tokens=10):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = mdl.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print("FP32 ->", sample_text(model))
print("INT8 ->", sample_text(qmodel))

FP32 -> annie scott
INT8 -> annie scott


## 3. Activation Quantization & Calibration

In [11]:
# Activation quantization requires matching modules (e.g., nn.Linear).
# FLAN-T5 uses custom blocks; activations may not be quantized by default.

from transformers import T5ForConditionalGeneration
# Reload FP32 model
model_act = T5ForConditionalGeneration.from_pretrained(model_name)

# Quantize both weights and activations
quantize(model_act, weights=torch.int8, activations=torch.int8)

# Verify which modules were wrapped
from quanto.nn.qlinear import QLinear
wrapped = [(n, m) for n, m in model_act.named_modules() if isinstance(m, QLinear)]
if not wrapped:
    print("No QLinear modules found. Weight+activation quantization skipped for custom layers.")
else:
    print(f"Found {len(wrapped)} QLinear modules for activation quantization.")

# Prepare texts and batch size for calibration
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"][:50]
batch_size = 8

# Calibration: run both encoder and decoder to record activations
# We'll use input_ids as decoder_input_ids to satisfy forward requirements
model_act.eval()
with torch.no_grad():
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors="pt", truncation=True, padding=True)
        # Provide decoder_input_ids equal to input_ids for teacher forcing
        batch_outputs = model_act(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            decoder_input_ids=inputs.input_ids
        )

# Freeze calibrated model
freeze(model_act)
qmodel_act = model_act

# Compute size and sample generation
size_act = sum(compute_module_sizes(qmodel_act).values()) / 1024**3
print(f"Calibrated INT8 model size: {size_act:.2f} GB")
# Use generate to produce text
# print("Sample generation after activation quantization:", sample_text(qmodel_act))
# Note:
# - We passed `decoder_input_ids` only for calibration, which may not fully capture decoder dynamics when generating text.
# - Calibration on teacher-forced outputs can leave the model ill-prepared for free-form generation, leading to gibberish.


Found 145 QLinear modules for activation quantization.


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Calibrated INT8 model size: 0.66 GB


## 4. Quantization-Aware Training (QAT)
Quantization-aware training interleaves quantization during training so that the model learns to adapt to lower precision.

**Key concept:** Maintain two sets of weights:
- **Float weights:** used for gradient updates.
- **Fake-quantized weights:** used in forward pass to simulate INT8 behavior.

During backprop, gradients flow through the fake quantization (“Straight-Through Estimator”) to update float weights.

**High-level Steps:**
1. Insert quant/dequant op wrappers around key layers.
2. Train on task data; forward uses quantized weights, backward updates float weights.
3. Export final weights by quantizing the trained float weights.

## 5. Recent SOTA Quantization in LLMs
 
| Method   | Bits | Key Idea                                    | Calibration? | Summary                                                                                         | Paper Link                                      |
|----------|------|---------------------------------------------|--------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------|
| LLM.INT8 | 8    | Outlier-aware two-stage (detect & scale)    | No           | Detects and rescales rare large weights (outliers) to minimize distortion.                       | [LLM.INT8](https://arxiv.org/abs/2208.07339)      |
| QLoRA    | 4    | LoRA adapters + QAT at 4-bit                | No           | Fine-tunes low-rank adapters on a quantized model to recover performance in 4-bit precision.      | [QLoRA](https://arxiv.org/abs/2305.14314)        |
| AWQ      | 4    | Per-channel activation-aware scaling        | Yes          | Learns per-channel scales based on activation statistics, improving INT4/INT8 accuracy.          | [AWQ](https://arxiv.org/abs/2306.00978)         |
| GPTQ     | 4    | Hessian-aware greedy rounding               | No           | Uses Hessian information to guide greedy weight rounding, preserving model loss surface.          | [GPTQ](https://arxiv.org/abs/2210.17323)     |
| HQQ      | 2    | Hybrid quant + learned reconstruction       | Yes          | Combines coarse quantization with reconstruction layers to achieve robust 2-bit representation. | [HQQ](https://mobiusml.github.io/hqq_blog/)         |
| QuIP     | 2    | Importance-based pruning + quantization     | Yes          | Prunes negligible weights based on importance, then applies quantization to remaining ones.       | [QuIP](https://arxiv.org/abs/2307.13304)        |

> **Tip:** Many of these methods include open-source implementations—start by experimenting on small models to familiarize yourself with their pipelines.

## 6. Conclusion & References

**Key takeaways:**
- Linear quantization (FP32→INT8) yields substantial memory and compute gains with simple formulas.
- `quanto` automates weight quantization; calibration further refines activations.
- QAT integrates quantization into training to reduce error.
- SOTA LLM quantization spans 8-bit to 2-bit, balancing size vs. accuracy.

**References:**
1. T. Dettmers, "LLM.INT8: Outlier-aware Quantization" (Aug 2022)
2. T. Dettmers, "QLoRA: 4-bit LoRA Fine-tuning" (May 2023)
3. R. Fan et al., "AWQ: Activation-Aware Quantization" (2024)
4. A. Frantar et al., "GPTQ: Optimal Brain Quantization" (2023)
5. H. Badri et al., "HQQ: Hybrid Quantization with Reconstruction" (Nov 2023)
6. R. Tseng et al., "QuIP: Quantization via Importance-based Pruning" (Jul 2023)