# AutoQuantization with TensorRT Model Optimizer PTQ

This notebook demonstrates how to use the `auto_quantize` feature of ModelOpt PTQ to perform automated mixed-precision quantization on the Meta-Llama-3-8B model. You define a target effective bit width (e.g., 6.0, the value used below), provide a search space of quantization formats, and optionally add KV cache quantization.

The search automatically picks a quantization format for each layer so that the model meets the target bit constraint while minimizing accuracy loss. Candidate formats are scored by their loss on real calibration data.
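
To make the idea of an effective bit target concrete, the sketch below computes a parameter-weighted average bit width for a toy per-layer assignment. The `effective_bits` helper and the layer sizes are hypothetical, and ModelOpt's own accounting may differ (for example, in how scaling factors are counted).

```python
def effective_bits(layers):
    """Parameter-weighted average bit width.

    `layers` is a list of (num_params, bits_per_weight) tuples.
    Illustrative only; not ModelOpt's internal accounting.
    """
    total_params = sum(num_params for num_params, _ in layers)
    weighted_bits = sum(num_params * bits for num_params, bits in layers)
    return weighted_bits / total_params


# Hypothetical two-layer model: half the weights at 8 bits, half at 4 bits.
toy_assignment = [
    (4_000_000, 8.0),
    (4_000_000, 4.0),
]
avg_bits = effective_bits(toy_assignment)
print(f"Effective bits: {avg_bits}")
```

Under this simplified view, a 6.0-bit target with an 8-bit and a 4-bit format in the search space leaves room for roughly half of the weights to stay at the higher precision.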

Key Dependencies: 
- nvidia-modelopt
- torch
- transformers

# Applying AutoQuantization

### 1. Import Dependencies

Load general-purpose packages and the core ModelOpt utilities, then fix the random seeds. Logging in to Hugging Face (next step) is required to pull the model.

- `modelopt.torch.quantization` (imported as `mtq`) and the `dataset_utils` helpers cover quantization, calibration, and dataset handling.
- `init_quantized_weights` is used internally by some ModelOpt features for weight initialization. It is not imported or called in this notebook, but it supports hybrid quantized weight loading when needed.
- Fixed seeds make calibration data selection and the loss values used for scoring repeatable across runs.

In [None]:
import random
import time

import numpy as np
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.utils.dataset_utils import (
    create_forward_loop,
    get_dataset_dataloader,
    get_max_batch_size,
)

SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
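
As a small illustration of why the seeds matter, the sketch below draws a calibration subset twice with the same seed and gets the same indices. It uses only Python's standard `random` module; `pick_calibration_indices` is a hypothetical helper, and how ModelOpt's dataset utilities actually sample data is not shown here.

```python
import random


def pick_calibration_indices(seed, dataset_size, num_samples):
    """Draw `num_samples` distinct indices using a private, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), num_samples)


first_run = pick_calibration_indices(1234, dataset_size=10_000, num_samples=5)
second_run = pick_calibration_indices(1234, dataset_size=10_000, num_samples=5)

# Same seed, same subset: scores computed on it are comparable between runs.
print(first_run == second_run)
```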

### 2. Set Configurations and Login to Hugging Face

Define all major tuning knobs:

- `EFFECTIVE_BITS` is the target average precision across quantized layers.
- `Q_FORMATS` is the comma-separated list of quantization formats searched by AutoQuant.
- `KV_FORMAT` selects optional KV cache quantization, applied after the main quantization pass; `"none"` skips it.
- `EXPORT_FMT` supports exporting to Hugging Face (`hf`) or TensorRT-LLM (`tensorrt_llm`).

The cell also logs in to Hugging Face.

💡 Try adding formats such as `"nvfp4"` or `"w4a8_awq"` to `Q_FORMATS` to explore tradeoffs.

In [None]:
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
DATASET = "cnn_dailymail"
CALIB_SAMPLES = 512
EFFECTIVE_BITS = 6.0  # search target
Q_FORMATS = "fp8,int4_awq"  # search space
KV_FORMAT = "none"  # "fp8", "nvfp4", "nvfp4_affine", or "none" to skip
EXPORT_DIR = "llama3_8b_autoq"  # output folder
EXPORT_FMT = "tensorrt_llm"  # or "hf"
# ----------------------------
DEVICE = "cuda"
DTYPE = torch.float16  # keep default for faster search

login()

### 3. Load Model and Tokenizer

Load model and tokenizer into memory:

- `torch_dtype=torch.float16` reduces memory use during calibration.
- Left padding is preferred for decoder-only LLMs so that the real prompt tokens sit at the end of each sequence during calibration.

⚠️ Make sure `pad_token` is set before batching; some LLaMA variants do not define one by default, so this cell reuses the EOS token.

In [None]:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=DTYPE).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model.eval()
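
The effect of left padding can be shown without a tokenizer. The sketch below pads plain lists of token ids on the left and builds the matching attention masks; the token ids and the `left_pad` helper are hypothetical stand-ins for what the tokenizer does when `padding_side = "left"`.

```python
def left_pad(sequences, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Hypothetical token ids; 0 stands in for the pad (here: EOS) token id.
batch_ids, batch_mask = left_pad([[11, 12, 13], [21]], pad_id=0)
print(batch_ids)   # [[11, 12, 13], [0, 0, 21]]
print(batch_mask)  # [[1, 1, 1], [0, 0, 1]]
```

With this layout, the last position of every row holds a real token, which is where a decoder-only model continues from.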

### 4. Configure Data Loader and Forward Loop

- Use a small number of real samples to capture representative activations and enable loss-based scoring.
- `include_labels=True` is important for loss computation, which guides AutoQuant format decisions.

⚙️ `get_max_batch_size()` estimates the largest batch that fits in memory given model size and hardware.

In [None]:
batch_size = min(get_max_batch_size(model), CALIB_SAMPLES)
calib_loader = get_dataset_dataloader(
    dataset_name=DATASET,
    tokenizer=tokenizer,
    batch_size=batch_size,
    num_samples=CALIB_SAMPLES,
    device=DEVICE,
    include_labels=True,
)
forward_loop = create_forward_loop(dataloader=calib_loader)
print(f"Calibration batches: {len(calib_loader)}  |  Batch size: {batch_size}")
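
To see why labels are needed for loss-based scoring, the sketch below builds causal-LM labels from a padded batch following the Hugging Face convention that positions labeled `-100` are ignored by the loss. Whether `get_dataset_dataloader` constructs labels in exactly this way is an assumption; `make_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that Hugging Face causal-LM losses skip


def make_labels(input_ids, attention_mask):
    """Copy input ids to labels, masking out padded positions."""
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Hypothetical left-padded batch (pad id 0).
input_ids = [[11, 12, 13], [0, 0, 21]]
attention_mask = [[1, 1, 1], [0, 0, 1]]
labels = make_labels(input_ids, attention_mask)
print(labels)  # [[11, 12, 13], [-100, -100, 21]]
```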

### 5. Possible Quantization Configurations

Define lookup tables for available quantization config presets. These are used to construct the format search space for AutoQuant.

✅ You can freely extend these dictionaries to add custom formats or constraints.

In [None]:
QUANT_CFG = {
    "int8": mtq.INT8_DEFAULT_CFG,
    "int8_sq": mtq.INT8_SMOOTHQUANT_CFG,
    "fp8": mtq.FP8_DEFAULT_CFG,
    "int4_awq": mtq.INT4_AWQ_CFG,
    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
    "nvfp4_awq": mtq.NVFP4_AWQ_LITE_CFG,
    "w4a8_awq": mtq.W4A8_AWQ_BETA_CFG,
}

KV_CFG = {
    "none": None,
    "fp8": mtq.FP8_KV_CFG["quant_cfg"],
    "nvfp4": mtq.NVFP4_KV_CFG["quant_cfg"],
    "nvfp4_affine": mtq.NVFP4_AFFINE_KV_CFG["quant_cfg"],
}
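
Because the search space is given as a comma-separated string, a typo or a stray space in `Q_FORMATS` would surface later as a `KeyError` during the lookup. A minimal validation sketch is shown below; `parse_formats` is a hypothetical helper, and `KNOWN_FORMATS` simply mirrors the keys of `QUANT_CFG` above.

```python
# Names mirror the keys of the QUANT_CFG lookup table in this notebook.
KNOWN_FORMATS = {"int8", "int8_sq", "fp8", "int4_awq", "nvfp4", "nvfp4_awq", "w4a8_awq"}


def parse_formats(spec, known=KNOWN_FORMATS):
    """Split a comma-separated format string and reject unknown names."""
    names = [name.strip() for name in spec.split(",") if name.strip()]
    if not names:
        raise ValueError("At least one quantization format is required")
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"Unknown quantization format(s): {unknown}")
    return names


selected = parse_formats("fp8, int4_awq")
print(selected)  # ['fp8', 'int4_awq']
```

The returned names can then be mapped through `QUANT_CFG` in place of a bare `Q_FORMATS.split(",")`.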

### 6. Start AutoQuantization Search and Optimization

Wrap the model's native loss function. AutoQuant requires it to score candidate formats: a smaller loss means a better format match.

`mtq.auto_quantize` then searches for the best per-layer quantization format mapping:

- `constraints` sets the target average bit precision (`effective_bits`).
- Loss is evaluated across candidate formats to preserve accuracy.
- `disabled_layers=["*lm_head*"]` keeps the final layer unquantized (important for generation quality).

🔍 Verbose mode shows layer-level decisions and scoring for each candidate format.

In [None]:
def loss_fn(out, batch):  # tiny wrapper around HF loss
    return out.loss


print("🚧  Launching auto_quantize ...")
t0 = time.time()

model, _ = mtq.auto_quantize(
    model,
    constraints={"effective_bits": EFFECTIVE_BITS},
    data_loader=calib_loader,
    forward_step=lambda m, b: m(**b),
    loss_func=loss_fn,
    quantization_formats=[QUANT_CFG[q] for q in Q_FORMATS.split(",")],
    num_calib_steps=len(calib_loader),
    num_score_steps=len(calib_loader),
    verbose=True,
    disabled_layers=["*lm_head*"],  # keep LM head in fp16
)
print(f"✅ Done in {time.time() - t0:.1f}s")
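
The kind of problem being solved here can be sketched as a constrained selection: pick one format per layer so that the average bit width stays within the target while the total sensitivity score is as low as possible. The toy below does this by exhaustive enumeration. It is not ModelOpt's search algorithm; the layer names, sizes, and scores are made up, and treating `int4_awq` as exactly 4 bits is a simplification.

```python
from itertools import product


def toy_search(layers, target_bits):
    """Exhaustively pick one format per layer under an average-bit budget.

    `layers` maps layer name -> (num_params, {format: (bits, score)}),
    where a lower score means less loss degradation. Returns the feasible
    assignment with the lowest total score, or None if none is feasible.
    """
    names = list(layers)
    total_params = sum(layers[name][0] for name in names)
    best_score, best_choice = None, None
    for choice in product(*(layers[name][1] for name in names)):
        avg_bits = sum(
            layers[name][0] * layers[name][1][fmt][0]
            for name, fmt in zip(names, choice)
        ) / total_params
        if avg_bits > target_bits:
            continue  # violates the effective-bits constraint
        score = sum(layers[name][1][fmt][1] for name, fmt in zip(names, choice))
        if best_score is None or score < best_score:
            best_score, best_choice = score, dict(zip(names, choice))
    return best_choice


# Hypothetical layers: the first is far more sensitive to 4-bit quantization.
toy_layers = {
    "layers.0.mlp": (100, {"fp8": (8, 0.125), "int4_awq": (4, 1.0)}),
    "layers.1.mlp": (100, {"fp8": (8, 0.25), "int4_awq": (4, 0.5)}),
}
mapping = toy_search(toy_layers, target_bits=6.0)
print(mapping)
```

In this toy, keeping both layers in FP8 exceeds the 6.0-bit budget, so the less sensitive layer is the one pushed to 4 bits.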

### 7. [Optional] KV Cache Quantization

- This step runs after the main quantization and is skipped when `KV_FORMAT` is `"none"`.
- Only the KV-specific quantizers are enabled and calibrated during this pass; it does not involve a format search.

⚠️ Quantizing the KV cache may affect generation quality and context retention, so test thoroughly.

In [None]:
if KV_FORMAT != "none":
    print(f"Enabling KV cache quantization ⟶ {KV_FORMAT}")
    kv_cfg = KV_CFG[KV_FORMAT]

    # Plug only the KV quantizers
    mtq.set_quantizer_by_cfg(model, quant_cfg=kv_cfg)

    # Calibrate **only** those quantizers
    with mtq.set_quantizer_by_cfg_context(model, {"*": {"enable": False}, **kv_cfg}):
        mtq.calibrate(model, algorithm="max", forward_loop=forward_loop)
else:
    print("KV cache left unquantized.")
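
A rough estimate helps judge whether KV cache quantization is worth enabling. The sketch below computes the cache size from the model shape; the Llama-3-8B figures (32 layers, 8 KV heads, head dimension 128) are assumed here rather than read from the loaded config, and the estimate ignores any scaling-factor overhead that a quantized cache adds.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed Llama-3-8B shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192, batch_size=1)
fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)
print(f"FP16: {fp16_bytes / 2**30:.2f} GiB, FP8: {fp8_bytes / 2**30:.2f} GiB")
```

The saving scales linearly with batch size and sequence length, so it matters most for long-context or high-throughput serving.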

### 8. Inspect the Quantized Layers

Print a full summary of quantized layers, formats, and bit precision estimates. This is useful for debugging or profiling.

In [None]:
mtq.print_quant_summary(model)

### 9. Quick Test of Quantized Model

Sanity check: Run a quick generation to verify the quantized model produces reasonable output.

🧪 Consider adding prompts from your real use case to validate quality before deployment.

In [None]:
sample = "Tell me a short story about a quantized llama."
inputs = tokenizer(sample, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    gen_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))

### 10. Export Model for TensorRT-LLM

Export the quantized model to the desired format:

- Use `tensorrt_llm` for high-performance deployment on NVIDIA accelerators.
- Use `hf` to reload with Hugging Face APIs or inference frameworks like vLLM.

📁 Check the contents of the output folder to confirm all weights/configs are present.

In [None]:
from modelopt.torch.export import export_hf_checkpoint, export_tensorrt_llm_checkpoint

if EXPORT_FMT == "tensorrt_llm":
    export_tensorrt_llm_checkpoint(
        model,
        "llama",  # decoder type, passed positionally
        export_dir=EXPORT_DIR,
        inference_tensor_parallel=1,
        inference_pipeline_parallel=1,
    )
else:
    export_hf_checkpoint(model, export_dir=EXPORT_DIR)

print(f"📦  Saved quantized model to →  {EXPORT_DIR}")
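
One way to check the output folder is to list every exported file with its size. The helpers below are a hypothetical sketch using only `pathlib`; they make no assumption about which file names the exporters produce.

```python
from pathlib import Path


def list_export_files(export_dir):
    """Return (relative_path, size_in_bytes) for every file under export_dir."""
    root = Path(export_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Export directory not found: {root}")
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def print_export_summary(export_dir):
    """Print one line per exported file: size in bytes, then relative path."""
    for rel_path, size in list_export_files(export_dir):
        print(f"{size:>14,d}  {rel_path}")
```

After the export cell has run, `print_export_summary(EXPORT_DIR)` gives a quick view of what was written; unexpectedly small or missing weight files are a sign that the export did not complete.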

# ✅ Conclusion & Key Takeaways

✅ AutoQuant in TensorRT Model Optimizer (ModelOpt) automates mixed-precision quantization by searching across multiple formats (e.g., FP8, INT4-AWQ) to meet a user-defined effective bit constraint.

✅ Using a small calibration set with loss-based scoring, AutoQuant selects a quantization format per layer, balancing model size, performance, and accuracy.

✅ The workflow supports flexible search spaces and fine-grained control over disabled layers, block sizes, and forward calibration passes.

✅ Optional KV cache quantization provides further memory and bandwidth savings, but should be enabled only after validating generation quality.

✅ Quantized models can be exported for either the Hugging Face or the TensorRT-LLM inference runtime, which covers a wide range of deployment targets.