# Post-Training Quantization with Min-Max Calibration using TensorRT Model Optimizer PTQ

This notebook demonstrates how to apply standard Post-Training Quantization (PTQ) with min-max calibration to an LLM, specifically `meta-llama/Llama-3.1-8B-Instruct`, using the PTQ toolkit in NVIDIA's TensorRT Model Optimizer (ModelOpt). We walk through loading the model, calibrating it on a sample of the CNN/DailyMail dataset, applying FP8 quantization, generating outputs, and exporting the quantized model.

Key Dependencies:
- nvidia-modelopt
- torch
- transformers

## Standard FP4/FP8 Quantization with Min-Max Calibration

### 1. Import Dependencies
Import all necessary libraries:

- `torch`: Used for tensor computation and model execution.

- `modelopt.torch.quantization`: Core API for quantization using TensorRT ModelOpt PTQ.

- `transformers`: Hugging Face interface to load and tokenize LLMs.

- `get_dataset_dataloader` and `create_forward_loop`: Utilities to prepare calibration data and run calibration.

- `login`: Required to download gated models (like Llama 3.1) from Hugging Face.

💡 If you're using this in Colab or a restricted environment, make sure all packages are installed and CUDA is available.
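
One way to check the environment before importing anything heavy is sketched below. `missing_packages` is a hypothetical helper, not part of ModelOpt. It uses only the standard library and assumes the top-level import names are `torch`, `transformers`, `huggingface_hub`, and `modelopt`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the dependencies used in this notebook.
required = ["torch", "transformers", "huggingface_hub", "modelopt"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Once `torch` imports cleanly, `torch.cuda.is_available()` reports whether a GPU is visible to PyTorch.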

In [1]:
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.utils.dataset_utils import create_forward_loop, get_dataset_dataloader

### 2. Set Configurations and Login to Hugging Face

Set the model you want to quantize (Llama-3.1-8B-Instruct) and the dataset to use for calibration (cnn_dailymail).

- `batch_size` and `calib_samples` control how much data is used during calibration. More samples generally improve accuracy but increase calibration time.

🔐 You must `login()` with a valid Hugging Face token to access gated models. Get your token at hf.co/settings/tokens.

🔁 You can substitute your own model or dataset as long as the inputs are compatible with the model's tokenizer.

In [None]:
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset_name = "cnn_dailymail"
batch_size = 8
calib_samples = 512

login()
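
To get a feel for the cost of these settings, the short calculation below estimates how many forward passes calibration will run. It is a sketch that assumes the dataloader keeps a final partial batch rather than dropping it.

```python
import math

batch_size = 8
calib_samples = 512

# Number of calibration forward passes, assuming a final
# partial batch is kept rather than dropped.
num_batches = math.ceil(calib_samples / batch_size)
print(f"{calib_samples} samples at batch size {batch_size} -> {num_batches} batches")
```

Doubling `calib_samples` doubles the number of batches, so calibration time grows roughly linearly with the sample count.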

### 3. Load Model and Tokenizer

- Load the model into GPU memory.
- Set `pad_token` to `eos_token` to prevent padding errors in decoder-only models like Llama.

💡 Always check the console for token mismatch warnings when loading the tokenizer.

🧠 Setting `pad_token` helps avoid errors during batch generation or dataset collation.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
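
To see why a padding token is needed at all, the pure-Python sketch below pads sequences of different lengths into a rectangular batch and builds the matching attention mask. `pad_batch` is a hypothetical helper, and the token ids are made up for illustration. The real tokenizer does this internally when asked to pad.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        real = [1] * len(seq)
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


# Made-up token ids; 0 stands in for the pad (here: eos) token id.
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(ids)   # [[11, 12, 13], [21, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Without a defined pad id there is nothing to fill the shorter rows with, which is the error that setting `pad_token` avoids.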

### 4. Configure Dataloader
- Load a few batches of real-world text to extract representative activation ranges.
- The calibration dataset should reflect your expected inference use case for best results.

⚠️ More samples generally give better accuracy but take longer to calibrate. We recommend 512 samples or more.

🧪 Use your target task’s dataset (e.g., chat, summarization, code) for domain-specific calibration.

In [None]:
dataloader = get_dataset_dataloader(
    dataset_name=dataset_name,
    tokenizer=tokenizer,
    batch_size=batch_size,
    num_samples=calib_samples,
    device="cuda",
)

### 5. Create the Forward Loop
- Wraps your `dataloader` in a loop that feeds batches into the model.
- Required by `mtq.quantize()` to perform the calibration pass.

🧰 You can create your own custom forward loop if you're doing multi-modal or conditional generation tasks.

🧠 The loop only needs to run forward passes. ModelOpt records the activation statistics for min-max calibration while the batches flow through the model.

In [5]:
forward_loop = create_forward_loop(dataloader=dataloader)
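
A custom forward loop can be sketched as a plain function. The example below assumes the loop is called with the model as its only argument and that each batch is a dict of keyword arguments. `make_forward_loop` and `RecordingModel` are hypothetical names used only for this illustration, with a stand-in model so the snippet runs without a GPU.

```python
def make_forward_loop(batches):
    """Build a calibration loop over an iterable of keyword-argument batches."""

    def forward_loop(model):
        for batch in batches:
            # Outputs are discarded; only the forward pass itself matters here.
            model(**batch)

    return forward_loop


class RecordingModel:
    """Stand-in for a real model that records which inputs it received."""

    def __init__(self):
        self.calls = []

    def __call__(self, **kwargs):
        self.calls.append(sorted(kwargs))


# Made-up batches in the shape a tokenizer would produce.
toy_batches = [
    {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]},
    {"input_ids": [[4, 5, 0]], "attention_mask": [[1, 1, 0]]},
]
toy_model = RecordingModel()
make_forward_loop(toy_batches)(toy_model)
print(len(toy_model.calls))  # 2
```

With a real PyTorch model, the body of the loop would typically run under `torch.no_grad()` since no gradients are needed for calibration.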

### 6. Set Quantization Configuration and Apply
- Apply FP8 quantization using the default min-max config provided by TensorRT ModelOpt.
- This pass captures the range of activations and applies a quantization transform.
- To change the quantization configuration, change the value of the `quant_cfg` variable. For example, to switch from FP8 to NVFP4, set it to `mtq.NVFP4_DEFAULT_CFG`.

📏 Min-max calibration uses the minimum and maximum values observed for each tensor to set its scaling range.

💡 You can experiment with other formats (e.g., FP4, INT8) by swapping out `quant_cfg`.

In [None]:
quant_cfg = mtq.FP8_DEFAULT_CFG  # use mtq.NVFP4_DEFAULT_CFG instead for NVFP4
model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
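
The arithmetic behind min-max calibration can be illustrated in a few lines of pure Python. This is a conceptual sketch, not ModelOpt's implementation. It assumes symmetric per-tensor scaling, where the observed range reduces to the largest absolute value, and the FP8 E4M3 format, whose largest finite value is 448. The activation values are made up.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def calibrate_amax(batches):
    """Track the largest absolute value seen across all calibration batches."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(x) for x in batch))
    return amax


def fp8_scale(amax):
    """Scale that maps the observed range onto the FP8 representable range."""
    return amax / FP8_E4M3_MAX


# Made-up activations standing in for two calibration batches.
activations = [[0.5, -2.0, 1.25], [3.5, -0.75]]
amax = calibrate_amax(activations)
scale = fp8_scale(amax)
print(amax, scale)  # 3.5 0.0078125
```

A single outlier raises `amax` and therefore coarsens the resolution for every other value, which is why representative calibration data matters.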

### 7. Quick Test of Quantized Model
- Test the quantized model with a simple prompt.
- This helps verify that quantization didn’t break forward generation or drastically harm output quality.

✅ Expect slightly more variation or truncation in output compared to the original model, but it should still be coherent.

🧪 You can test on more complex prompts to evaluate qualitative performance further.

In [None]:
model = torch.compile(model)
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)

In [None]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
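
For a rough numeric check alongside reading the text, generated token ids can be compared position by position. The sketch below is a hypothetical helper and assumes you saved a baseline generation from the original model before quantizing, which this notebook does not do. The token ids are made up.

```python
def token_agreement(reference, candidate):
    """Fraction of positions, over the longer sequence, where tokens match."""
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    matches = sum(1 for a, b in zip(reference, candidate) if a == b)
    return matches / longest


# Made-up token ids standing in for baseline and quantized generations.
baseline_ids = [101, 7, 8, 9, 10]
quantized_ids = [101, 7, 8, 42, 10]
print(token_agreement(baseline_ids, quantized_ids))  # 0.8
```

Exact token agreement is a strict measure, since one early divergence can change everything after it, so treat a low score as a prompt to inspect the output rather than as proof of degraded quality.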

### 8. Export Quantized Checkpoint
- Save the quantized model in a Hugging Face-compatible format for reuse or deployment.
- The export includes weights and config files in the standard structure.

📁 This allows you to upload it to Hugging Face Hub or load it later with `from_pretrained()`.

🧰 You can also use this exported model with inference engines like vLLM, SGLang, or TensorRT-LLM.

In [None]:
from modelopt.torch.export import export_hf_checkpoint

export_path = "./quantized_model_min-max/"
export_hf_checkpoint(model, export_dir=export_path)
tokenizer.save_pretrained(export_path)
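
To confirm what was written, the export directory can be listed with the standard library. `summarize_export` is a hypothetical helper. It makes no assumption about which file names the export produces and returns an empty list if the directory does not exist.

```python
from pathlib import Path


def summarize_export(export_dir):
    """Return (file name, size in bytes) pairs for files directly in export_dir."""
    root = Path(export_dir)
    if not root.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


# Same path as export_path in the cell above.
for name, size in summarize_export("./quantized_model_min-max/"):
    print(f"{name}: {size} bytes")
```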

# ✅ Conclusion & Key Takeaways
✅ Min-max calibration is a fast and simple way to apply quantization with good performance tradeoffs.

✅ TensorRT Model Optimizer (ModelOpt) PTQ abstracts away many of the complexities of quantization while still offering flexibility and export options.

✅ Using a representative dataset like `cnn_dailymail` improves calibration accuracy for summarization-style models.

✅ The quantized model remains Hugging Face-compatible, meaning it can be deployed or fine-tuned using existing tools.

✅ You can easily customize:
- The quantization format (e.g., INT8, FP4)
- The number of calibration samples and the batch size
- Dataset/task alignment