<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Quantization Tutorial.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Model Quantization Tutorial

This tutorial demonstrates how to use AWQ (Activation-aware Weight Quantization) to compress large language models to 4-bit weights while preserving most of their accuracy.

## Prerequisites

❗**NOTICE:** Model quantization requires a GPU. If running on Google Colab, you must use a GPU runtime (Colab Menu: `Runtime` -> `Change runtime type` -> Select `T4 GPU` or better).

⚠️ **DEVELOPMENT STATUS**: The quantization feature is currently under active development. Some features may change in future releases.

First, let's install Oumi with GPU support and the required quantization libraries:

```bash
pip install "oumi[gpu]"  # Quotes keep shells such as zsh from expanding the brackets
pip install autoawq
pip install triton==3.0.0  # Required for AWQ inference compatibility
```
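
Before moving on, it can help to confirm that the packages are importable and that a CUDA device is visible. The following is a minimal sketch, not part of Oumi: `check_environment` is a hypothetical helper, and the import names `oumi`, `awq` (installed by `autoawq`), `triton`, and `torch` are assumed.

```python
import importlib.util


def check_environment(packages=("oumi", "awq", "triton", "torch")):
    """Report which packages are importable and whether CUDA is visible."""
    # find_spec locates a top-level package without importing it.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}

    # Only query CUDA when torch is importable; otherwise report False.
    cuda_available = False
    if status.get("torch"):
        try:
            import torch

            cuda_available = torch.cuda.is_available()
        except Exception:
            # A broken torch install is treated the same as "no GPU".
            cuda_available = False
    status["cuda"] = cuda_available
    return status


print(check_environment())
```

Any entry reported as `False` points at a missing package or, for `cuda`, at a runtime without a usable GPU.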

## 1. Basic AWQ Quantization

Let's start by quantizing TinyLlama to 4-bit using AWQ:

In [1]:
from oumi.core.configs import ModelParams, QuantizationConfig  # type: ignore
from oumi.quantize import quantize  # type: ignore

# Configure quantization
config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="awq_q4_0",  # 4-bit AWQ quantization
    output_format="pytorch",
    output_path="tinyllama_awq_4bit",  # Section 2 reloads the model from this path
    calibration_samples=32,  # Number of calibration samples
    # 32 for fast testing, 1024 for better accuracy
)

# Run quantization
print("Starting AWQ quantization...")
result = quantize(config)

# Calculate sizes and compression
original_size_gb = 2.2  # TinyLlama 1.1B in fp16
quantized_size_gb = result.quantized_size_bytes / (1024**3)  # type: ignore
compression_ratio = original_size_gb / quantized_size_gb

print("\n✅ Quantization complete!")
print(f"Original size (fp16): {original_size_gb:.2f}GB")
print(f"Quantized size (4-bit): {quantized_size_gb:.2f}GB")
print(f"Compression ratio: {compression_ratio:.1f}x")
size_reduction_pct = (original_size_gb - quantized_size_gb) / original_size_gb * 100
print(f"Size reduction: {size_reduction_pct:.1f}%")

Starting AWQ quantization...
[2025-08-04 17:00:12,262][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:86] Starting AWQ quantization pipeline...
[2025-08-04 17:00:12,263][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:118] Loading model for AWQ quantization: TinyLlama/TinyLlama-1.1B-Chat-v1.0
[2025-08-04 17:00:12,264][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:121] 📥 Loading base model...


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

[2025-08-04 17:00:12,606][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:138] 🔧 Configuring AWQ quantization parameters...
[2025-08-04 17:00:12,607][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:155] ⚙️  AWQ config: {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
[2025-08-04 17:00:12,608][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:156] 📊 Using 32 calibration samples
[2025-08-04 17:00:12,609][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:157] 🧮 Starting AWQ calibration and quantization...


Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████| 22/22 [01:10<00:00,  3.20s/it]


[2025-08-04 17:01:23,973][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:92] PyTorch format requested. Saving AWQ model...
[2025-08-04 17:01:24,658][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:99] ✅ AWQ quantization successful! Saved as PyTorch format.
[2025-08-04 17:01:24,659][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:100] 📊 Quantized size: 734.0 MB
[2025-08-04 17:01:24,660][oumi][rank0][pid:173685][MainThread][INFO]][awq_quantizer.py:101] 💡 Use this model with: AutoAWQForCausalLM.from_quantized('tinyllama_awq_4bit')

✅ Quantization complete!
Original size (fp16): 2.20GB
Quantized size (4-bit): 0.72GB
Compression ratio: 3.1x
Size reduction: 67.4%
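
The printed figures can be reproduced by hand from the logged quantized size. The sketch below rests on two assumptions: that the log's `734.0 MB` means mebibytes (only that reading matches the printed 0.72GB), and that the hard-coded `2.2` is decimal gigabytes (roughly 1.1B parameters at 2 bytes each). Under those assumptions the cell compares a GB value against a GiB value, so a like-for-like ratio comes out slightly lower.

```python
GIB = 1024**3

quantized_bytes = 734.0 * 1024**2      # logged size, read as MiB (assumption)
quantized_gib = quantized_bytes / GIB  # about 0.72

# As computed in the cell: 2.2 is used directly against a GiB value.
ratio_as_printed = 2.2 / quantized_gib
reduction_as_printed = (2.2 - quantized_gib) / 2.2 * 100

# Like-for-like: convert 2.2 decimal GB to GiB first (assumption: 2.2e9 bytes).
original_gib = 2.2e9 / GIB
ratio_same_units = original_gib / quantized_gib
reduction_same_units = (1 - quantized_gib / original_gib) * 100

print(f"As printed: {ratio_as_printed:.1f}x, {reduction_as_printed:.1f}%")
print(f"Same units: {ratio_same_units:.1f}x, {reduction_same_units:.1f}%")
```

The first line reproduces the 3.1x and 67.4% shown above. With both sizes in GiB, the result is about 2.9x and 65.0%.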


## 2. Using the Quantized Model

Now let's load and use the quantized model for inference:

In [2]:
import torch  # type: ignore
from awq import AutoAWQForCausalLM  # type: ignore
from transformers import AutoTokenizer  # type: ignore

# Load the quantized model
model_path = "tinyllama_awq_4bit"

print(f"Loading AWQ model from: {model_path}")
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,  # Disable layer fusion to avoid compatibility issues
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Model loaded! GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f}GB")

Loading AWQ model from: tinyllama_awq_4bit


Replacing layers...: 100%|██████████| 22/22 [00:03<00:00,  5.96it/s]


  0%|          | 0/509 [00:00<?, ?w/s]

✅ Model loaded! GPU memory: 0.03GB


In [3]:
# Test inference
prompt = "Explain the benefits of model quantization in simple terms:"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate
print(f"Prompt: {prompt}\n")
print("Generating response...")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode and print response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Response:\n{response}")

Prompt: Explain the benefits of model quantization in simple terms:

Generating response...
Response:
Explain the benefits of model quantization in simple terms: A model quantized to a lower bitwidth can achieve higher efficiency and lower power consumption compared to a model with the same parameters trained on the original data. This is because the model is compressed into a smaller space that can be processed in a faster, lower-power processor. Additionally, model quantization can be used to reduce the number of weights or parameters in a model, which can also improve its efficiency. This is because fewer weights need to be updated during inference, reducing the number of computations required to achieve the same accuracy. Overall, model quantization can help to improve the efficiency, power consumption, and accuracy of deep learning models.
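
Because `generate` returns the prompt tokens followed by the new tokens, the decoded response starts by repeating the prompt. One way to show only the continuation is a small string helper. `strip_prompt` is a hypothetical name used for illustration, not an Oumi or transformers function, and the sample response is shortened from the output above.

```python
def strip_prompt(response: str, prompt: str) -> str:
    """Return only the generated continuation when the response echoes the prompt."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    # Prompt not echoed verbatim; leave the text untouched.
    return response


prompt = "Explain the benefits of model quantization in simple terms:"
response = prompt + " A model quantized to a lower bitwidth can achieve higher efficiency."
print(strip_prompt(response, prompt))
```

An alternative is to slice the prompt tokens off the generated ids before decoding, which avoids relying on an exact string match.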


## 3. Advanced Configuration

AWQ offers several configuration options for tuning the quantization process:

In [4]:
# Advanced AWQ configuration with more calibration samples
advanced_config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="awq_q4_0",
    output_path="tinyllama_awq_advanced.safetensors",
    output_format="safetensors",  # Use SafeTensors format
    # AWQ-specific parameters (the last three match the defaults logged in Section 1)
    calibration_samples=1024,  # More samples for better calibration, at the cost of a longer run
    awq_group_size=128,  # Number of weights that share one scale and zero point
    awq_version="GEMM",  # AWQ kernel variant
    awq_zero_point=True,  # Use asymmetric (zero-point) quantization
)

print("Configuration:")
print(f"- Output format: {advanced_config.output_format}")
print(f"- Calibration samples: {advanced_config.calibration_samples}")
print(f"- Group size: {advanced_config.awq_group_size}")
print(f"- AWQ version: {advanced_config.awq_version}")
print(f"- Zero point: {advanced_config.awq_zero_point}")

Configuration:
- Output format: safetensors
- Calibration samples: 1024
- Group size: 128
- AWQ version: GEMM
- Zero point: True
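
To make the group size and zero-point options concrete, here is a simplified sketch of group-wise 4-bit quantization in plain Python. It is an illustration only, not AutoAWQ's implementation: it leaves out AWQ's activation-aware scaling step (the part that uses the calibration samples) and the kernel-specific weight packing. All function names are hypothetical.

```python
import random


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one weight group to w_bit integers."""
    q_max = 2**w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:
        scale = 1.0  # Constant group: any scale works, avoid dividing by zero.
    # The zero point is the integer code that represents 0.0.
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, w_bit=4):
    """Split a weight row into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(row[i : i + group_size], w_bit)
        for i in range(0, len(row), group_size)
    ]


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize_row(row, group_size=128)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.5f}")
```

A smaller group size gives each scale fewer weights to cover, which tends to lower the reconstruction error but stores more scales and zero points.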


## Summary

In this tutorial, you learned how to:

1. ✅ Quantize models using AWQ to 4-bit precision
2. ✅ Load and use AWQ quantized models for inference
3. ✅ Configure AWQ parameters for better quality


### Key Benefits of AWQ:
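- **Memory Efficiency**: Up to ~75% smaller weights than fp16 in theory (4 bits instead of 16); the TinyLlama run above measured a 67.4% reduction
- **Speed**: Faster inference where decoding is limited by memory bandwidth, since fewer bytes are read per weight
- **Quality**: Typically a small accuracy loss, which more calibration samples can help reduce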
- **Compatibility**: Works with most transformer models
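
The memory figure follows from storing 4 bits instead of 16 per weight, plus a small per-group overhead. Here is a rough sketch of that arithmetic, assuming one 16-bit scale and one 4-bit zero point per group. These storage widths are assumptions for illustration, not values read from Oumi or AutoAWQ.

```python
def effective_bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits stored per quantized weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size


def reduction_vs_fp16(bits_per_weight):
    """Percentage size reduction relative to 16-bit weights."""
    return (1 - bits_per_weight / 16) * 100


bpw = effective_bits_per_weight()
print(f"{bpw:.3f} bits/weight, {reduction_vs_fp16(bpw):.1f}% smaller than fp16")
```

Under these assumptions, the quantized layers alone shrink by about 74%. Any tensors kept in higher precision (typically the embeddings and the output head) pull the whole-model figure lower.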

Happy quantizing! 🚀