<a href="https://colab.research.google.com/github/iamfaham/quantization-benchmark/blob/main/quantization_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantization Benchmark
This notebook explores how quantization affects the performance, accuracy, and efficiency of decoder-only language models.

We compare **FP16** (half-precision) and **INT4** (4-bit quantized) inference on a real evaluation dataset.

**Metrics:**
- **Perplexity** – how well the model predicts held-out text (lower is better)
- **Latency** – wall-clock time to generate a fixed number of tokens
- **GPU Memory** – peak GPU memory allocated during generation


## Imports
We use Hugging Face `transformers` and `datasets` to load the model and data, `bitsandbytes` for 4-bit quantization, and `accelerate` for device placement (`device_map="auto"`).


In [None]:
!pip install transformers datasets bitsandbytes accelerate --quiet
import torch, time
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset


## Load Evaluation Data
We use the test split of **WikiText-2**, a standard benchmark for language modeling.
Lines with 20 or fewer characters (after stripping whitespace) are dropped, and the first 200 remaining lines are kept for evaluation.


In [None]:
# Load a real dataset for evaluation
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Filter out empty or very short lines
texts = [t for t in dataset["text"] if len(t.strip()) > 20][:200]

# Tokenize all samples (padding + truncation); the evaluation cells below re-tokenize per batch
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)

print(f"Loaded {len(texts)} valid text samples for evaluation.")


Loaded 200 valid text samples for evaluation.


## Tokenizer Configuration
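Decoder-only models (like OPT) append new tokens to the right end of each sequence, so batched generation needs **left-padding**: with right-padding, pad tokens would sit between the prompt and the generated text.
We set left-padding here and fall back to the EOS token as the padding token if none is defined.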
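As a minimal illustration of what left-padding produces, the sketch below pads toy token-id lists by hand. The helper `left_pad` and the ids are hypothetical; the real tokenizer does the equivalent internally once `padding_side = "left"` is set.

```python
def left_pad(sequences, pad_id):
    """Pad lists of token ids on the left and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# Toy ids (hypothetical); pad_id=1 is the pad token id OPT reports.
ids, mask = left_pad([[2, 10, 11, 12], [2, 20]], pad_id=1)
print(ids)   # [[2, 10, 11, 12], [1, 1, 2, 20]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Every row now ends with a real token, so the next generated token lands directly after the prompt.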


In [None]:
# Make tokenizer generation-friendly for decoder-only models
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
print("padding_side:", tokenizer.padding_side, "| pad_token_id:", tokenizer.pad_token_id)


padding_side: left | pad_token_id: 1


## Load Models (FP16 + INT4)
We load two versions of the same model:
- FP16 (baseline for accuracy and speed)
- INT4 quantized (4-bit NF4 weights with double quantization via `bitsandbytes`; labeled INT4 throughout this notebook)

This lets us compare trade-offs between performance and precision.
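A back-of-the-envelope estimate helps set expectations for the weight savings. The sketch below is not a measurement. It assumes a nominal 125M parameters, a token-embedding matrix of 50,272 × 768 that stays in FP16, and roughly 0.127 extra bits per quantized parameter for the double-quantized scaling constants (the figure usually quoted for block size 64). The function `estimate_weight_bytes` is a hypothetical helper.

```python
def estimate_weight_bytes(total_params, unquantized_params, bits_per_quantized_param):
    """Estimate weight storage when only part of the model is quantized."""
    quantized_params = total_params - unquantized_params
    quantized_bytes = quantized_params * bits_per_quantized_param / 8
    unquantized_bytes = unquantized_params * 2  # kept in FP16 (2 bytes each)
    return quantized_bytes + unquantized_bytes

# Assumptions (not measured in this notebook):
TOTAL_PARAMS = 125_000_000        # nominal size of opt-125m
EMBEDDING_PARAMS = 50_272 * 768   # token embeddings, assumed to stay in FP16
NF4_BITS = 4 + 0.127              # 4-bit values plus assumed scale overhead

fp16_bytes = TOTAL_PARAMS * 2
nf4_bytes = estimate_weight_bytes(TOTAL_PARAMS, EMBEDDING_PARAMS, NF4_BITS)

print(f"FP16 weights: {fp16_bytes / 1024**2:.0f} MiB")
print(f"NF4 weights:  {nf4_bytes / 1024**2:.0f} MiB")
print(f"Ratio:        {fp16_bytes / nf4_bytes:.2f}x")
```

Under these assumptions the reduction is closer to 2× than 4×, because the unquantized embeddings make up a large share of such a small model.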


In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# FP16 baseline model
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,
    device_map="auto"
)

# 4-bit quantized model using bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    dtype=torch.float16,
)

# Enable caching for faster generation
model_fp16.config.use_cache = True
model_4bit.config.use_cache = True

print("✅ Models loaded successfully on GPU.")


✅ Models loaded successfully on GPU.


## GPU Cleanup
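Before benchmarking, we release cached but unused GPU memory and reset the peak-memory counters so each run's peak is measured from the same starting point. This does not free live tensors: both models stay on the GPU.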


In [None]:
def clear_gpu():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    print("🧹 GPU memory cleared and synchronized.")

# Example usage:
clear_gpu()


🧹 GPU memory cleared and synchronized.


## Perplexity Evaluation
We calculate perplexity on WikiText-2 using batched input evaluation.
Lower perplexity = better model predictions.
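Concretely, perplexity is the exponential of the mean negative log-likelihood per scored token. The toy sketch below shows that arithmetic in plain Python, with a mask that excludes padded positions; the function name and the values are hypothetical and independent of the model code.

```python
import math

def perplexity(token_nlls, mask):
    """exp(mean negative log-likelihood) over positions where mask == 1."""
    total = sum(nll for nll, m in zip(token_nlls, mask) if m)
    count = sum(mask)
    return math.exp(total / count)

# Toy values (hypothetical): a model that assigns probability 1/4 to every
# real token has perplexity 4; the padded last position is ignored.
nlls = [math.log(4), math.log(4), math.log(4), 9.9]
mask = [1, 1, 1, 0]
print(round(perplexity(nlls, mask), 6))  # 4.0
```

A perplexity of 4 can be read as the model being, on average, as uncertain as a uniform choice among four tokens.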


In [None]:
import torch, math

def compute_perplexity_safe(model, tokenizer, texts, max_length=128, batch_size=4):
    """
    Computes token-level perplexity in batches, excluding padded positions.
    Works for both FP16 and 4-bit models.
    """
    model.eval()
    device = next(model.parameters()).device
    # Sum the loss so it can be normalized by the total number of scored tokens.
    # Targets set to -100 are skipped.
    loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction="sum")

    total_loss = 0.0
    total_tokens = 0

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        encodings = tokenizer(batch_texts, return_tensors="pt", padding=True,
                              truncation=True, max_length=max_length)
        input_ids = encodings["input_ids"].to(device)
        attention_mask = encodings["attention_mask"].to(device)

        with torch.no_grad():
            # The loss is computed manually below, so no `labels` are passed.
            logits = model(input_ids, attention_mask=attention_mask).logits
            # Position t predicts token t+1: drop the last logit and the first label.
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = input_ids[..., 1:].contiguous()
            shift_attention = attention_mask[..., 1:].contiguous()
            # Mask padded targets via the attention mask (also safe when pad == EOS).
            # Caveat: in left-padded rows, the first real token is scored from a pad position.
            shift_labels = shift_labels.masked_fill(shift_attention == 0, -100)

            # Flatten for CE loss
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            valid_tokens = shift_attention.sum().item()
            total_loss += loss.item()
            total_tokens += valid_tokens

    # Perplexity = exp(mean negative log-likelihood per scored token)
    ppl = math.exp(total_loss / total_tokens)
    return ppl


ppl_fp16 = compute_perplexity_safe(model_fp16, tokenizer, texts)
ppl_4bit = compute_perplexity_safe(model_4bit, tokenizer, texts)
print(f"FP16 Perplexity: {ppl_fp16:.2f} | INT4 Perplexity: {ppl_4bit:.2f}")

FP16 Perplexity: 70.93 | INT4 Perplexity: 74.70


## Weight Memory Measurement
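We read `torch.cuda.memory_allocated()` after the models are loaded.
This counter covers every live tensor on the device, so it isolates **weight-only memory** (as opposed to runtime activations or KV-cache) only when a single model is resident. In this run both models were loaded together, which is why the two readings below are identical.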
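One way to get a per-model figure that does not depend on what else is on the GPU is to sum the storage of the model's own parameters. The sketch below is duck-typed: it accepts anything exposing `numel()` and `element_size()`, which torch parameters do. `FakeParam` is a hypothetical stand-in so the example runs without a GPU.

```python
def parameter_bytes(params):
    """Sum storage over objects exposing numel() and element_size()."""
    return sum(p.numel() * p.element_size() for p in params)

class FakeParam:
    """Hypothetical stand-in for a torch parameter."""
    def __init__(self, n, size):
        self.n, self.size = n, size
    def numel(self):
        return self.n
    def element_size(self):
        return self.size

fp16_like = [FakeParam(1000, 2), FakeParam(24, 2)]  # two FP16 tensors
print(parameter_bytes(fp16_like))  # 2048
```

With the notebook's models this would be called as `parameter_bytes(model_fp16.parameters())`. For the 4-bit model the sum would likely leave out the quantization constants, so it should be read as an approximation; `transformers` models also expose `get_memory_footprint()`, which may be a more convenient alternative.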


In [None]:
def mem_mb():
    # Memory held by all live tensors on the current CUDA device, in MiB
    return torch.cuda.memory_allocated() / (1024**2)

def show_weight_only_memory(model, label):
    torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    # `model` does not affect the reading: memory_allocated() is device-wide
    print(f"{label} — weight-only GPU mem (post-load, pre-forward): {mem_mb():.1f} MB")

# Both models are resident at this point, so the two readings match.
# To isolate one model, load and measure it alone (delete the other first).
show_weight_only_memory(model_fp16, "FP16")
show_weight_only_memory(model_4bit, "INT4")

FP16 — weight-only GPU mem (post-load, pre-forward): 9079.0 MB
INT4 — weight-only GPU mem (post-load, pre-forward): 9079.0 MB


## Inference Benchmark
We generate a fixed number of tokens for both models and record:
- Inference time (latency)
- Peak GPU memory usage during generation

Each test includes a warm-up pass to stabilize CUDA kernel timings.
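The warm-up-then-time pattern can be sketched independently of the model. The helper below is hypothetical: it runs a callable a few times untimed, then times one call, with a `sync` hook that would be `torch.cuda.synchronize` on a GPU so that asynchronous kernels are not left out of the measurement.

```python
import time

def time_with_warmup(fn, warmup=1, sync=lambda: None):
    """Run fn a few times untimed, then time one synchronized call."""
    for _ in range(warmup):
        fn()
    sync()  # drain pending work before starting the clock
    start = time.perf_counter()
    result = fn()
    sync()  # wait for the timed call to finish
    return result, time.perf_counter() - start

# CPU-only demo; on a GPU, pass sync=torch.cuda.synchronize.
calls = []
result, seconds = time_with_warmup(lambda: calls.append(1) or len(calls), warmup=2)
print(result, seconds >= 0)  # 3 True
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring intervals.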


In [None]:
def benchmark_fixed(model, inputs, label="Model", n_tokens=200, batch=8):
    # NOTE: `inputs` is unused; prompts are re-encoded from the global `texts`
    # so the left-padding configured on the tokenizer is applied.
    enc = tokenizer(texts[:batch], return_tensors="pt", padding=True, truncation=True, max_length=128)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats(); torch.cuda.synchronize()

    with torch.no_grad():
        # warm-up (untimed) so one-time CUDA initialization is excluded
        _ = model.generate(**enc, max_new_tokens=16)
        torch.cuda.synchronize()
        # timed run; min_new_tokens keeps an early EOS from shortening it
        t0 = time.perf_counter()
        _ = model.generate(**enc, max_new_tokens=n_tokens, min_new_tokens=n_tokens)
        torch.cuda.synchronize()
        t1 = time.perf_counter()

    # Device-wide peak since the reset above (includes the warm-up pass)
    peak = torch.cuda.max_memory_allocated() / (1024**2)
    print(f"{label}: {(t1-t0):.2f}s | peak gen mem: {peak:.1f} MB")
    return (t1-t0), peak


## Results and Discussion

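| Metric | FP16 | INT4 | Observation |
|--------|------|------|--------------|
| **Perplexity** | ~70.9 | ~74.7 | ✅ Small accuracy drop (about 5% higher) |
| **Latency** | ~1.9 s | ~5.9 s | ⚠️ 4-bit about 3× slower (overhead dominates for a small model) |
| **Peak Memory** | ~9.2 GB | ~9.2 GB | ⚠️ No difference measured: both models were resident, and the KV-cache stays FP16 |

**Key Insights:**
- 4-bit quantization raised perplexity only slightly (70.9 → 74.7).
- For a 125M-parameter model, kernel and dequantization overhead likely dominate, so 4-bit generation was about 3× slower.
- Peak memory was identical in both runs. The counter is device-wide and both models were loaded at once, so this measurement cannot show weight savings; the KV-cache also stays in FP16 under `bitsandbytes`.
- Real gains are expected for larger models or longer generations, which this run does not cover.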
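A rough size estimate helps judge how much the FP16 KV-cache contributes to the peak. The sketch below assumes the opt-125m shape (12 layers, hidden size 768) and uses the benchmark's batch of 8 with at most 128 prompt tokens plus 200 generated tokens; `kv_cache_bytes` is a hypothetical helper, not something the notebook measures.

```python
def kv_cache_bytes(n_layers, hidden_size, batch, seq_len, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each batch x seq_len x hidden_size."""
    return 2 * n_layers * batch * seq_len * hidden_size * bytes_per_value

# Assumed opt-125m shape; 128 + 200 tokens is the upper bound for this benchmark.
size = kv_cache_bytes(n_layers=12, hidden_size=768, batch=8, seq_len=128 + 200)
print(f"{size / 1024**2:.2f} MiB")  # 92.25 MiB
```

Under these assumptions the cache is on the order of 92 MiB, a small share of the roughly 9.2 GB peak reported here.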


In [None]:
clear_gpu()
t_fp16, mem_fp16 = benchmark_fixed(model_fp16, inputs, "FP16", n_tokens=200)
clear_gpu()
t_4bit, mem_4bit = benchmark_fixed(model_4bit, inputs, "INT4", n_tokens=200)

print(f"""
🧠 Quantization Results on WikiText-2 (Cleaned)
----------------------------------------------
FP16  → Perplexity: {ppl_fp16:.2f} | Latency: {t_fp16:.2f}s | GPU Mem: {mem_fp16:.1f} MB
INT4  → Perplexity: {ppl_4bit:.2f} | Latency: {t_4bit:.2f}s | GPU Mem: {mem_4bit:.1f} MB
Speed-up ≈ {t_fp16/t_4bit:.2f}× | Memory Reduction ≈ {mem_fp16/mem_4bit:.2f}×
""")


🧹 GPU memory cleared and synchronized.
FP16: 1.91s | peak gen mem: 9214.8 MB
🧹 GPU memory cleared and synchronized.
INT4: 5.93s | peak gen mem: 9215.2 MB

🧠 Quantization Results on WikiText-2 (Cleaned)
----------------------------------------------
FP16  → Perplexity: 70.93 | Latency: 1.91s | GPU Mem: 9214.8 MB
INT4  → Perplexity: 74.70 | Latency: 5.93s | GPU Mem: 9215.2 MB
Speed-up ≈ 0.32× | Memory Reduction ≈ 1.00×



## Conclusion

In this benchmark, 4-bit quantization lowered weight precision with only a small perplexity increase (70.9 → 74.7).
However, smaller models may not show wall-clock speedups because:
- Kernel overheads are high relative to the compute per layer.
- The key-value cache remains FP16.
- Short sequences underutilize GPU parallelism.

This notebook provides a reproducible benchmark to measure precision-performance trade-offs in transformer inference.
