From 44d2d441bbe860e7622df664e47cc0ebc461cc87 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 10:54:27 +0800
Subject: [PATCH 1/7] update readme

---
 README.md | 62 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/README.md b/README.md
index 6b9def6c8..abbf3bafa 100644
--- a/README.md
+++ b/README.md
@@ -65,26 +65,23 @@ refer to the documentation for accuracy [results](./docs/auto_scheme_acc.md) and
 Delivers strong performance even at 2–3 bits [example models](https://huggingface.co/collections/OPEA/2-3-bits-67a5f0bc6b49d73c01b4753b), with leading results at 4 bits [benchmark](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard).
 
 ✅ **Ecosystem Integration**
-Seamlessly works with **Transformers, vLLM,** and more.
+Seamlessly works with **Transformers, vLLM, SGLang**, and more.
 
 ✅ **Multiple Formats Export**
 Support **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in [export formats](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#supported-export-formats)
 
+✅ **Fast Mixed Bits/Dtypes Scheme Generation**
+Automatically generates a mixed bits/dtypes scheme in minutes, with about 1.1X–1.5X the model’s BF16 RAM size as overhead. See accuracy [results](./docs/auto_scheme_acc.md) and the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
+
+✅ **Optimized Round-to-Nearest Mode**
+Use `--iters 0` for fast quantization, with some accuracy drop at 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode)
+
 ✅ **Affordable Quantization Cost**
 Quantize 7B models in about 10 minutes on a single GPU. Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs)
 
-✅ **Fast Mixed Bits/Dtypes Scheme Generation**
-Automatically configure in minutes, with about 2X-4X the model’s BF16 VRAM size as overhead. Accuracy [results](./docs/auto_scheme_acc.md) and [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
-
 ✅ **10+ VLMs Support**
 Out-of-the-box quantization for 10+ vision-language models [example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa), [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix)
 
-✅ **Layerwise Mixed Bits Quantization**
-Assign different bits per layer for fine-grained accuracy/performance trade-offs. Details are shown in [mixed bits quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#mixed-bits-usage)
-
-✅ **Optimized Round-to-Nearest Mode**
-Use `--iters 0` for fast, calibration-free quantization with some accuracy drop for 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode)
-
 ✅ **Multiple Recipes**
 Choose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your needs. Details are shown in [quantization recipes](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#recipe-recommendation)
 
@@ -187,21 +184,6 @@ ar = AutoRound(model_name_or_path, scheme="W4A16")
 ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
 ```
 
-### AutoScheme Usage
-Please refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.
-~~~python
-from auto_round import AutoRound, AutoScheme
-
-model_name = "Qwen/Qwen3-8B"
-avg_bits = 3.0
-scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
-layer_config = {"lm_head": "GGUF:Q6_K"}
-
-# Change iters to 200 for non-GGUF schemes
-ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
-ar.quantize_and_save()
-~~~
-
 Important Hyperparameters
 
@@ -212,7 +194,6 @@ ar.quantize_and_save()
 - **`sym` (bool)**: Whether to use symmetric quantization (default is `None`). If not None, it will override the scheme setting.
 - **`layer_config` (dict)**: Configuration for weight quantization (default is `None`), mainly for mixed schemes.
-
 ##### Algorithm Settings
 - **`enable_alg_ext` (bool)**: Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is `False`.
 - **`disable_opt_rtn` (bool)**: Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is `False` (improved RTN enabled).
@@ -227,7 +208,6 @@ ar.quantize_and_save()
 - **`nsamples` (int)**: Number of samples for tuning (default is `128`).
 - **`seqlen` (int)**: Data length of the sequence for tuning (default is `2048`).
-
 ##### Device/Speed Configuration
 - **`enable_torch_compile` (bool)**: If no exception is raised, typically we recommend setting it to True for faster quantization with lower resource.
 - **`low_gpu_mem_usage` (bool)**: Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is `False`).
@@ -235,6 +215,34 @@ ar.quantize_and_save()
+### AutoScheme Usage
+Please refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.
+~~~python
+from auto_round import AutoRound, AutoScheme
+
+model_name = "Qwen/Qwen3-8B"
+avg_bits = 3.0
+scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
+layer_config = {"lm_head": "GGUF:Q6_K"}
+
+# Change iters to 200 for non-GGUF schemes
+ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
+ar.quantize_and_save()
+~~~
+
+Important Hyperparameters of AutoScheme
+
+
+##### AutoScheme Hyperparameters
+
+- **`avg_bits` (float)**: Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
+- **`options` (str | list[str] | list[QuantizationScheme])**: Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., `"W4A16,W2A16"`), a list of strings (e.g., `["W4A16", "W2A16"]`), or a list of `QuantizationScheme` objects.
+- **`ignore_scale_zp_bits` (bool)**: Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: `False`).
+- **`shared_layers` (Iterable[Iterable[str]], optional)**: Only supported in API usage. Defines groups of layers that share quantization settings.
+- **`batch_size` (int, optional)**: Only supported in API usage. Can be set to `1` to reduce VRAM usage at the expense of longer tuning time.
+
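A minimal sketch of how these AutoScheme hyperparameters combine in a single call, reusing the `AutoRound`/`AutoScheme` API from the usage example above; the model name and the `shared_layers` patterns are illustrative assumptions, not taken from the repository:

```python
from auto_round import AutoRound, AutoScheme

scheme = AutoScheme(
    avg_bits=3.0,                # target average bit-width over the quantized layers
    options=["W4A16", "W2A16"],  # candidates as a list of strings; a comma-separated string also works
    ignore_scale_zp_bits=False,  # keep scale/zero-point bits in the average-bit calculation
    shared_layers=[["*.gate_proj", "*.up_proj"]],  # hypothetical group sharing one setting
    batch_size=1,                # reduce VRAM usage at the cost of longer tuning time
)

ar = AutoRound(model="Qwen/Qwen3-8B", scheme=scheme)
ar.quantize_and_save()
```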
 ### API Usage for VLMs
 
 If you encounter issues during quantization, try setting iters=0 (to enable RTN) and use group_size=32 for better

From a1869da18e302dfbfca687909200dc0e3c949835 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 10:56:59 +0800
Subject: [PATCH 2/7] update readme

---
 README.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/README.md b/README.md
index abbf3bafa..8cd9aa93e 100644
--- a/README.md
+++ b/README.md
@@ -48,11 +48,7 @@ refer to the documentation for accuracy [results](./docs/auto_scheme_acc.md) and
 all bits other than 3 bits. **A more advanced algorithm** tailored for specific configurations may be available in v0.8.1.
 
-[2025/05] AutoRound has been integrated into **vLLM**. You can now run models in the AutoRound format directly with
- vLLM versions later than v0.85.post1.
-
-[2025/04] AutoRound has been integrated into **Transformers**. You can run models in the AutoRound format directly
- with Transformers versions later than 4.51.3.
+[2025/05] AutoRound has been integrated into **Transformers** and **vLLM**.
 
 [2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy. Check out [OPEA/DeepSeek-R1-int2-mixed-sym-inc](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).

From f7ee799e4bb7787641287dbfc9ede6617bfdca54 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 10:58:45 +0800
Subject: [PATCH 3/7] fix

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 8cd9aa93e..ff0632369 100644
--- a/README.md
+++ b/README.md
@@ -239,6 +239,7 @@ ar.quantize_and_save()
 - **`batch_size` (int, optional)**: Only supported in API usage. Can be set to `1` to reduce VRAM usage at the expense of longer tuning time.
 
+
 ### API Usage for VLMs
 
 If you encounter issues during quantization, try setting iters=0 (to enable RTN) and use group_size=32 for better

From 8dd309c0787174300f1a94cadcbb8612eed178a8 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 11:30:22 +0800
Subject: [PATCH 4/7] refine

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ff0632369..75e8c0f0d 100644
--- a/README.md
+++ b/README.md
@@ -207,7 +207,7 @@ ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
 ##### Device/Speed Configuration
 - **`enable_torch_compile` (bool)**: If no exception is raised, typically we recommend setting it to True for faster quantization with lower resource.
 - **`low_gpu_mem_usage` (bool)**: Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is `False`).
-- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `"cpu"`, `"cuda"`, `"0,1,2"` (default is `'0'`).
+- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `"auto"`, `"cpu"`, `"cuda"`, `"0,1,2"` (default is `'0'`). When using `"auto"`, it will try to use all available GPUs.

From 069b53a8831188cb6ca2ac835172e88819e33976 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 13:01:57 +0800
Subject: [PATCH 5/7] fix

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 75e8c0f0d..5bcac30bf 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and
 Vision-Language Models (VLMs).
 It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility.
-For more details, see our [paper](https://arxiv.org/pdf/2309.05516) for more details and explore quantized models available on several Hugging Face Spaces, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup)
+See our [paper](https://arxiv.org/pdf/2309.05516) for more details and explore quantized models available on several Hugging Face Spaces, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup)
 and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions, please refer to [User Guide](./docs/step_by_step.md).

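The `device_map` values documented in PATCH 4/7 plug into the same `AutoRound` constructor used throughout the README. A minimal sketch, with the model name as an illustrative assumption:

```python
from auto_round import AutoRound

model_name = "Qwen/Qwen3-8B"  # illustrative; any model id or local path is passed the same way

# "auto" tries to use all available GPUs, per the updated bullet in PATCH 4/7
ar = AutoRound(model_name, scheme="W4A16", device_map="auto")

# A single device string or comma-separated GPU indices are also accepted
ar_cpu = AutoRound(model_name, scheme="W4A16", device_map="cpu")
ar_multi_gpu = AutoRound(model_name, scheme="W4A16", device_map="0,1,2")
```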
From 5de4c27e971734eea14ac7848b1c0b837d533f22 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 13:05:28 +0800
Subject: [PATCH 6/7] update

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 5bcac30bf..ddaecc53a 100644
--- a/README.md
+++ b/README.md
@@ -22,8 +22,7 @@ AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and
 Vision-Language Models (VLMs).
 It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility.
-See our [paper](https://arxiv.org/pdf/2309.05516) for more details and explore quantized models available on several Hugging Face Spaces, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup)
-and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions, please refer to [User Guide](./docs/step_by_step.md).
+See our [paper](https://arxiv.org/pdf/2309.05516) for more details. For usage instructions, please refer to [User Guide](./docs/step_by_step.md).

 AutoRound Overview

From 7ab306b72f71ea27631d7bbf64ce02d128f9dd79 Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Thu, 30 Oct 2025 13:22:17 +0800
Subject: [PATCH 7/7] fix a critical regression

---
 auto_round/compressors/base.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/auto_round/compressors/base.py b/auto_round/compressors/base.py
index 869fc74de..c56e10750 100644
--- a/auto_round/compressors/base.py
+++ b/auto_round/compressors/base.py
@@ -141,7 +141,7 @@ def __init__(
         device_map: Union[str, torch.device, int, dict] = 0,
         enable_torch_compile: bool = False,
         enable_alg_ext: bool = False,
-        disable_opt_rtn: bool = True,
+        disable_opt_rtn: bool = False,
         seed: int = 42,
         **kwargs,
     ):
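The one-line change in PATCH 7/7 restores the documented default (`disable_opt_rtn=False`, i.e. improved RTN enabled). A short sketch of what the flip means at the call site; the scheme and model name are illustrative assumptions:

```python
from auto_round import AutoRound

# With the fix, omitting disable_opt_rtn again selects the improved RTN path,
# matching the README's documented default of False.
ar = AutoRound("Qwen/Qwen3-8B", scheme="W4A16", iters=0)

# Pure RTN now requires an explicit opt-in, as before the regression.
ar_pure_rtn = AutoRound("Qwen/Qwen3-8B", scheme="W4A16", iters=0, disable_opt_rtn=True)
```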