<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Quantization Tutorial.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Model Quantization Tutorial

This tutorial demonstrates how to use llm_compressor to compress large language models with AWQ (Activation-aware Weight Quantization) and other quantization methods while preserving most of the model's quality.

## Prerequisites

❗**NOTICE:** Model quantization requires a GPU. If running on Google Colab, you must use a GPU runtime (Colab Menu: `Runtime` -> `Change runtime type` -> Select `T4 GPU` or better).
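
To confirm the GPU requirement above before going further, a quick check along these lines may help. This is a minimal sketch: `gpu_available` is a hypothetical helper name (not part of Oumi), and it assumes PyTorch is the framework in use; if PyTorch is not installed yet, it simply reports that no GPU was found.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see at least one CUDA device."""
    try:
        import torch  # imported lazily so the check also runs before installation
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if gpu_available():
    import torch

    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; switch to a GPU runtime before quantizing.")
```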

⚠️ **DEVELOPMENT STATUS**: The quantization feature is currently under active development. Some features may change in future releases.

First, let's install Oumi with GPU support and the required quantization libraries:

In [None]:
%pip install oumi[gpu,quantization]

## 1. Basic Quantization with llm_compressor

Let's start by quantizing TinyLlama to 4-bit using llm_compressor's W4A16 (4-bit weights, 16-bit activations) scheme:

In [None]:
from oumi.core.configs import ModelParams, QuantizationConfig  # type: ignore
from oumi.quantize import quantize  # type: ignore

# Configure quantization
config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="llmc_W4A16_ASYM",  # 4-bit asymmetric AWQ quantization
    output_path="tinyllama_w4a16_tutorial",
    # Number of calibration samples: 32 for fast testing, 512 for better accuracy
    calibration_samples=32,
    calibration_dataset="open_platypus",  # Dataset for calibration
)

# Run quantization
print("Starting llm_compressor quantization...")
result = quantize(config)

# Calculate sizes and compression (decimal GB, i.e. 1e9 bytes, for both values)
original_size_gb = 2.2  # TinyLlama 1.1B in fp16: 1.1e9 params x 2 bytes
quantized_size_gb = result.quantized_size_bytes / 1e9  # type: ignore
compression_ratio = original_size_gb / quantized_size_gb

print("\n✅ Quantization complete!")
print(f"Original size (fp16): {original_size_gb:.2f}GB")
print(f"Quantized size (4-bit): {quantized_size_gb:.2f}GB")
print(f"Compression ratio: {compression_ratio:.1f}x")
size_reduction_pct = (original_size_gb - quantized_size_gb) / original_size_gb * 100
print(f"Size reduction: {size_reduction_pct:.1f}%")
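
As a sanity check on the compression ratio reported above, the expected size of a W4A16 model can be estimated by hand. The sketch below is back-of-the-envelope arithmetic, not llm_compressor's actual storage layout: it assumes a group size of 128 (the `llmc_group_size` value used later in this tutorial), one 16-bit scale and one 4-bit zero point per group, and a hypothetical 10% of parameters (for example an excluded `lm_head`) left in fp16. The helper names are illustrative.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage per weight once per-group metadata is included."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def estimated_size_gb(num_params, bits_per_weight, unquantized_fraction=0.0, baseline_bits=16):
    """Rough model size in decimal GB; unquantized weights stay at baseline_bits."""
    quantized_bits = num_params * (1 - unquantized_fraction) * bits_per_weight
    kept_bits = num_params * unquantized_fraction * baseline_bits
    return (quantized_bits + kept_bits) / 8 / 1e9


NUM_PARAMS = 1.1e9  # TinyLlama 1.1B, as in the cell above
FP16_SIZE_GB = 2.2  # same fp16 baseline as the cell above

# Assumed layout: 4-bit weights, 16-bit scale and 4-bit zero point per group of 128.
bits = effective_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_point_bits=4)
ideal_gb = estimated_size_gb(NUM_PARAMS, bits)
# Hypothetical: assume 10% of the parameters are left in fp16.
realistic_gb = estimated_size_gb(NUM_PARAMS, bits, unquantized_fraction=0.10)

print(f"Effective bits per weight: {bits:.3f}")
print(f"Ideal quantized size: {ideal_gb:.2f}GB ({FP16_SIZE_GB / ideal_gb:.1f}x smaller than fp16)")
print(f"With 10% kept in fp16: {realistic_gb:.2f}GB ({FP16_SIZE_GB / realistic_gb:.1f}x smaller)")
```

If the measured ratio is well below the ideal estimate, unquantized layers are the usual explanation.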

## 2. Using the Quantized Model

Now let's load and use the quantized model for inference. Models quantized with llm_compressor can be loaded directly with transformers:

In [None]:
import torch  # type: ignore
from transformers import AutoModelForCausalLM, AutoTokenizer  # type: ignore

# Load the quantized model
model_path = "tinyllama_w4a16_tutorial"

print(f"Loading quantized model from: {model_path}")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Model loaded! GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f}GB")

In [3]:
# Test inference
prompt = "Explain the benefits of model quantization in simple terms:"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
print(f"Prompt: {prompt}\n")
print("Generating response...")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode and print response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Response:\n{response}")

Prompt: Explain the benefits of model quantization in simple terms:

Generating response...
Response:
Explain the benefits of model quantization in simple terms:

Model quantization is the process of compressing a neural network model into a smaller number of parameters without sacrificing performance. Here are some benefits of model quantization:

1. Improved Model Size: Model quantization reduces the model's size, which can be beneficial for storage and transmission.

2. Faster Training: Model quantization can lead to faster training, especially for smaller models.

3. Reduced Inference Time: With less parameters, inference time can be reduced, leading to faster and more accurate inference.

4. Improved Convergence: Model quantization can improve convergence, as the model's parameters are more accurately represented and optimized.
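
The response above begins with the prompt itself because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of dropping the prompt before decoding is shown below; `strip_prompt` is a hypothetical helper and the token ids are toy values standing in for real tokenizer output.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the token ids generated after the prompt."""
    return list(output_ids[prompt_length:])


# Toy token ids standing in for real tokenizer output.
prompt_ids = [1, 15, 27, 42]
generated_ids = prompt_ids + [99, 100, 2]  # generate() echoes the prompt first

new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)

# In the inference cell above, the equivalent slicing would be:
#   prompt_length = inputs["input_ids"].shape[1]
#   response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```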




## 3. Advanced Configuration

llm_compressor offers several configuration options for tuning the quantization process:

In [None]:
# Advanced llm_compressor configuration with more calibration samples
advanced_config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="llmc_W4A16_ASYM",  # Can also use llmc_W8A8_INT, llmc_W8A8_FP8, etc.
    output_path="tinyllama_advanced",
    output_format="safetensors",  # Use SafeTensors format
    # llm_compressor-specific parameters
    calibration_samples=512,  # More samples for better calibration
    calibration_dataset="open_platypus",  # Dataset for calibration
    max_seq_length=2048,  # Maximum sequence length for calibration
    llmc_group_size=128,  # Weight grouping size
    llmc_targets=["Linear"],  # Target layer types
    llmc_ignore=["lm_head"],  # Layers to exclude from quantization
)

print("Configuration:")
print(f"- Method: {advanced_config.method}")
print(f"- Output format: {advanced_config.output_format}")
print(f"- Calibration samples: {advanced_config.calibration_samples}")
print(f"- Calibration dataset: {advanced_config.calibration_dataset}")
print(f"- Group size: {advanced_config.llmc_group_size}")
print(f"- Target layers: {advanced_config.llmc_targets}")
print(f"- Ignored layers: {advanced_config.llmc_ignore}")
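
To build intuition for what the group size and the asymmetric scheme control, here is a toy round-to-nearest quantizer in pure Python. It is only an illustration under simplifying assumptions (min/max ranges, one scale and one zero point per group) and is not the AWQ algorithm or llm_compressor's implementation; all function names are hypothetical.

```python
def quantize_group_asym(weights, num_bits=4):
    """Asymmetric min/max quantization of one group to unsigned integers."""
    qmax = 2**num_bits - 1
    # Include 0.0 in the range so the zero point always lands in [0, qmax].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:  # all-zero group: any scale works
        scale = 1.0
    zero_point = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def mean_abs_error(weights, group_size, num_bits=4):
    """Quantize in groups of group_size and report the mean reconstruction error."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        q, scale, zero_point = quantize_group_asym(group, num_bits)
        restored = dequantize_group(q, scale, zero_point)
        total += sum(abs(w - r) for w, r in zip(group, restored))
    return total / len(weights)


# Small weights next to a few large ones: one shared range wastes precision.
weights = [0.10, -0.20, 0.05, 0.15, 2.00, -1.50, 0.50, 1.00]
error_one_group = mean_abs_error(weights, group_size=8)
error_two_groups = mean_abs_error(weights, group_size=4)
print(f"Mean abs error, one group of 8:  {error_one_group:.4f}")
print(f"Mean abs error, two groups of 4: {error_two_groups:.4f}")
```

Smaller groups give each block of weights its own range, which lowers the error here, at the cost of storing more scales and zero points.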

## Summary

In this tutorial, you learned how to:

1. ✅ Quantize models using llm_compressor to 4-bit precision (W4A16)
2. ✅ Load and use quantized models for inference with transformers
3. ✅ Configure llm_compressor parameters for better quality

### Available Quantization Methods:
- **llmc_W4A16 / llmc_W4A16_ASYM**: 4-bit weights, 16-bit activations (AWQ)
- **llmc_W8A16**: 8-bit weights, 16-bit activations (GPTQ)
- **llmc_W8A8_INT**: 8-bit weights and activations (INT8 with SmoothQuant)
- **llmc_W8A8_FP8**: 8-bit weights and activations (FP8)
- **llmc_FP8_BLOCK**: FP8 block quantization

### Key Benefits:
- **Memory Efficiency**: Up to ~75% smaller than fp16 with W4A16 (4-bit instead of 16-bit weights)
- **Speed**: Faster inference because fewer bytes of weights are read from GPU memory per token
- **Quality**: Minimal accuracy degradation when calibrated on representative data
- **Compatibility**: Works with most transformer models via vLLM
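
The speed benefit listed above can be approximated with a simple memory-bandwidth bound: during single-stream decoding, each new token needs roughly one full read of the weights, so throughput cannot exceed bandwidth divided by model size. The sketch below uses illustrative numbers only; the bandwidth and the quantized size are hypothetical placeholders, and the bound ignores the KV cache, activations, dequantization overhead, and kernel efficiency.

```python
def decode_tokens_per_second_bound(model_size_gb, memory_bandwidth_gb_s):
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    return memory_bandwidth_gb_s / model_size_gb


BANDWIDTH_GB_S = 300.0    # hypothetical GPU memory bandwidth
FP16_SIZE_GB = 2.2        # TinyLlama 1.1B in fp16, from section 1
QUANTIZED_SIZE_GB = 0.75  # hypothetical W4A16 size; substitute your measured value

fp16_bound = decode_tokens_per_second_bound(FP16_SIZE_GB, BANDWIDTH_GB_S)
w4_bound = decode_tokens_per_second_bound(QUANTIZED_SIZE_GB, BANDWIDTH_GB_S)

print(f"fp16 bound:  {fp16_bound:.0f} tokens/s")
print(f"W4A16 bound: {w4_bound:.0f} tokens/s")
print(f"Best-case speedup: {w4_bound / fp16_bound:.1f}x")
```

Measured speedups are typically smaller than this best case, so benchmark on your own hardware before relying on the estimate.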

Happy quantizing! 🚀

# 🧭 What's Next?

Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).

📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [YouTube](https://www.youtube.com/@Oumi_AI)!

⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi-ai.typeform.com/early-access) to the Oumi Platform, or [chat with us](https://calendly.com/d/ctcx-nps-47m/chat-with-us-get-early-access-to-the-oumi-platform) to learn more!