Merged
34 changes: 17 additions & 17 deletions README.md
@@ -16,10 +16,10 @@ AutoRound

## 🚀 What is AutoRound?

AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility.
For more details, see our [paper](https://arxiv.org/pdf/2309.05516) and explore quantized models available on several Hugging Face Spaces, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup), and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).
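The sign-gradient idea can be sketched on a toy example: instead of always rounding each weight to the nearest grid point, a small per-weight rounding offset is learned and updated with the *sign* of the reconstruction-loss gradient. The sketch below is illustrative only (NumPy, simplified straight-through gradient), not the library's actual implementation:

```python
import numpy as np

def fake_quant(w, v, scale, qmin=-8, qmax=7):
    # Round with a learnable offset v in [-0.5, 0.5], then dequantize (4-bit grid).
    return np.clip(np.round(w / scale + v), qmin, qmax) * scale

def sign_sgd_round(w, x, scale, iters=200, lr=5e-3):
    """Tune rounding offsets to shrink the layer-output error ||x * (q(w) - w)||^2."""
    v = np.zeros_like(w)
    for _ in range(iters):
        err = x * (fake_quant(w, v, scale) - w)  # per-element output error
        grad = err * x * scale                   # straight-through estimate of d(loss)/d(v)
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)
    return fake_quant(w, v, scale)

w = np.array([0.13, -0.27, 0.41])
q = sign_sgd_round(w, x=np.array([1.0, 2.0, 3.0]), scale=0.1)
```

Because the offset is bounded to half a grid step in either direction, the tuned weight never drifts more than one quantization step from the original value.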

<p align="center">
<img src="docs/imgs/autoround_overview.png" alt="AutoRound Overview" width="80%">
@@ -84,7 +84,7 @@ Choose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your
✅ Advanced Utilities
Includes [multi-GPU quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#devicemulti-gpu-setting-in-quantization), [multiple calibration datasets](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#default-dataset), and support for [10+ runtime backends](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#specify-inference-backend).

Beyond weight-only quantization: we are actively expanding support for additional data types such as **MXFP**, NVFP, W8A8, and more.
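For context, MXFP-style formats pair very low-bit elements with a power-of-two scale shared across a small block of values (32 elements in the OCP MX specification). A toy block quantizer in that spirit (illustrative only — not the MX spec and not AutoRound's implementation):

```python
import numpy as np

def mx_block_quant(block):
    # One shared power-of-two scale per block; element magnitudes snap to an
    # E2M1-like 4-bit grid. Toy sketch, not the OCP MX spec.
    grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitudes
    scale = 2.0 ** np.floor(np.log2(np.max(np.abs(block)) / grid[-1]))
    mags = np.abs(block) / scale
    snapped = grid[np.argmin(np.abs(mags[:, None] - grid[None, :]), axis=1)]
    return np.sign(block) * snapped * scale
```

Sharing one coarse scale per block keeps metadata overhead tiny while letting each block adapt to its own dynamic range — the property that makes these formats attractive beyond weight-only settings.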


## Installation
@@ -164,25 +164,25 @@ configuration to suit your specific requirements and available resources.
### API Usage

```python
from auto_round import AutoRound

# Load a model (supports BF16/FP16/FP8/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower); `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with a slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Save the quantized model
output_dir = "./tmp_autoround"
# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m"
ar.quantize_and_save(output_dir, format="auto_round")
```
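Scheme strings of the form `W<n>A<m>` encode weight and activation precision — e.g. `"W4A16"` means 4-bit weights with 16-bit activations. A hypothetical decoder to make that convention concrete (for illustration only; it is not part of the AutoRound API, which resolves schemes internally):

```python
import re

def decode_scheme(scheme: str) -> tuple[int, int]:
    # Split "W4A16"-style strings into (weight_bits, activation_bits).
    # Hypothetical helper; named formats like "NVFP4" or "GGUF:Q4_K_M"
    # follow their own conventions and are rejected here.
    m = re.fullmatch(r"W(\d+)A(\d+)", scheme)
    if m is None:
        raise ValueError(f"not a W/A scheme string: {scheme!r}")
    return int(m.group(1)), int(m.group(2))
```

So `decode_scheme("W4A16")` yields `(4, 16)`: weights are stored in 4 bits while compute still happens in 16-bit activations, which is why no activation calibration is required for these schemes.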

<details>
4 changes: 2 additions & 2 deletions auto_round/schemes.py
@@ -142,7 +142,7 @@ def is_preset_scheme(name: str) -> bool:
# "act_data_type": "fp",
# }))

FP8_STATIC = QuantizationScheme.from_dict(
{
"bits": 8,
"group_size": -1,
@@ -163,7 +163,7 @@ def is_preset_scheme(name: str) -> bool:
"MXFP8": MXFP8,
"NVFP4": NVFP4,
"FPW8A16": FPW8A16,
"FPW8_STATIC": FPW8_STATIC,
"FP8_STATIC": FP8_STATIC,
}
from auto_round.export.export_to_gguf.config import GGUF_CONFIG
