diff --git a/README.md b/README.md
index 6b9def6c8..ddaecc53a 100644
--- a/README.md
+++ b/README.md
@@ -22,8 +22,7 @@
 AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility.
-For more details, see our [paper](https://arxiv.org/pdf/2309.05516) for more details and explore quantized models available on several Hugging Face Spaces, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup)
-and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions, please refer to [User Guide](./docs/step_by_step.md).
+See our [paper](https://arxiv.org/pdf/2309.05516) for more details. For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).

 AutoRound Overview
@@ -48,11 +47,7 @@ refer to the documentation for accuracy [results](./docs/auto_scheme_acc.md) and
 all bits other than 3 bits. **A more advanced algorithm** tailored for specific configurations may be available in v0.8.1.
-[2025/05] AutoRound has been integrated into **vLLM**. You can now run models in the AutoRound format directly with
- vLLM versions later than v0.85.post1.
-
-[2025/04] AutoRound has been integrated into **Transformers**. You can run models in the AutoRound format directly
- with Transformers versions later than 4.51.3.
+[2025/05] AutoRound has been integrated into **Transformers** and **vLLM**.

 [2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy. Check out [OPEA/DeepSeek-R1-int2-mixed-sym-inc](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
@@ -65,26 +60,23 @@ refer to the documentation for accuracy [results](./docs/auto_scheme_acc.md) and
 Delivers strong performance even at 2–3 bits [example models](https://huggingface.co/collections/OPEA/2-3-bits-67a5f0bc6b49d73c01b4753b), with leading results at 4 bits [benchmark](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard).

 ✅ **Ecosystem Integration**
-Seamlessly works with **Transformers, vLLM,** and more.
+Seamlessly works with **Transformers, vLLM, SGLang,** and more.

 ✅ **Multiple Formats Export**
 Support **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in [export formats](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#supported-export-formats)

+✅ **Fast Mixed Bits/Dtypes Scheme Generation**
+Automatically generates a mixed bits/dtypes scheme in minutes, with about 1.1X-1.5X the model’s BF16 RAM size as overhead. Accuracy [results](./docs/auto_scheme_acc.md) and [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
+
+✅ **Optimized Round-to-Nearest Mode**
+Use `--iters 0` for fast quantization with some accuracy drop at 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode)
+
 ✅ **Affordable Quantization Cost**
 Quantize 7B models in about 10 minutes on a single GPU. Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs)

-✅ **Fast Mixed Bits/Dtypes Scheme Generation**
-Automatically configure in minutes, with about 2X-4X the model’s BF16 VRAM size as overhead. Accuracy [results](./docs/auto_scheme_acc.md) and [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
-
 ✅ **10+ VLMs Support**
 Out-of-the-box quantization for 10+ vision-language models [example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa), [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix)

-✅ **Layerwise Mixed Bits Quantization**
-Assign different bits per layer for fine-grained accuracy/performance trade-offs. Details are shown in [mixed bits quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#mixed-bits-usage)
-
-✅ **Optimized Round-to-Nearest Mode**
-Use `--iters 0` for fast, calibration-free quantization with some accuracy drop for 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode)
-
 ✅ **Multiple Recipes**
 Choose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your needs.
 Details are shown in [quantization recipes](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#recipe-recommendation)
@@ -187,21 +179,6 @@ ar = AutoRound(model_name_or_path, scheme="W4A16")
 ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
 ```
-### AutoScheme Usage
-Please refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.
-~~~python
-from auto_round import AutoRound, AutoScheme
-
-model_name = "Qwen/Qwen3-8B"
-avg_bits = 3.0
-scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
-layer_config = {"lm_head": "GGUF:Q6_K"}
-
-# Change iters to 200 for non-GGUF schemes
-ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
-ar.quantize_and_save()
-~~~
-

 Important Hyperparameters
@@ -212,7 +189,6 @@ ar.quantize_and_save()
 - **`sym` (bool)**: Whether to use symmetric quantization (default is `None`). If not None, it will override the scheme setting.
 - **`layer_config` (dict)**: Configuration for weight quantization (default is `None`), mainly for mixed schemes.
-
 ##### Algorithm Settings
 - **`enable_alg_ext` (bool)**: Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is `False`.
 - **`disable_opt_rtn` (bool)**: Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is `False` (improved RTN enabled).
@@ -227,11 +203,39 @@ ar.quantize_and_save()
 - **`nsamples` (int)**: Number of samples for tuning (default is `128`).
 - **`seqlen` (int)**: Data length of the sequence for tuning (default is `2048`).
-
 ##### Device/Speed Configuration
 - **`enable_torch_compile` (bool)**: If no exception is raised, typically we recommend setting it to True for faster quantization with lower resource.
 - **`low_gpu_mem_usage` (bool)**: Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is `False`).
-- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `"cpu"`, `"cuda"`, `"0,1,2"` (default is `'0'`).
+- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `"auto"`, `"cpu"`, `"cuda"`, `"0,1,2"` (default is `'0'`). When set to `"auto"`, it will try to use all available GPUs.
+
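+For illustration only, the hyperparameters above can be passed as keyword arguments to `AutoRound`; the model name and the values below are placeholders rather than recommended settings:
+```python
+from auto_round import AutoRound
+
+model_name = "Qwen/Qwen3-8B"  # placeholder model id
+
+ar = AutoRound(
+    model_name,
+    scheme="W4A16",
+    nsamples=128,               # number of calibration samples for tuning
+    seqlen=2048,                # sequence length of the calibration data
+    enable_torch_compile=True,  # typically faster quantization when no exception is raised
+    low_gpu_mem_usage=True,     # offload intermediate features to CPU (~20% more tuning time)
+    device_map="auto",          # try to use all available GPUs
+)
+ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
+```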
+
+### AutoScheme Usage
+Please refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.
+~~~python
+from auto_round import AutoRound, AutoScheme
+
+model_name = "Qwen/Qwen3-8B"
+avg_bits = 3.0
+scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
+layer_config = {"lm_head": "GGUF:Q6_K"}
+
+# Change iters to 200 for non-GGUF schemes
+ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
+ar.quantize_and_save()
+~~~
+
+Important Hyperparameters of AutoScheme
+
+##### AutoScheme Hyperparameters
+
+- **`avg_bits` (float)**: Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
+- **`options` (str | list[str] | list[QuantizationScheme])**: Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., `"W4A16,W2A16"`), a list of strings (e.g., `["W4A16", "W2A16"]`), or a list of `QuantizationScheme` objects.
+- **`ignore_scale_zp_bits` (bool)**: Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: `False`).
+- **`shared_layers` (Iterable[Iterable[str]], optional)**: Only supported in API usage. Defines groups of layers that share quantization settings.
+- **`batch_size` (int, optional)**: Only supported in API usage. Can be set to `1` to reduce VRAM usage at the expense of longer tuning time.
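+
+As an illustrative sketch (the layer-name group below is a hypothetical example, not a required name), `options` may also be given as a list of strings, and `shared_layers` and `batch_size` can be combined with it:
+~~~python
+from auto_round import AutoRound, AutoScheme
+
+model_name = "Qwen/Qwen3-8B"  # placeholder model id
+
+scheme = AutoScheme(
+    avg_bits=3.0,
+    options=["W4A16", "W2A16"],                      # candidate schemes as a list of strings
+    shared_layers=[["q_proj", "k_proj", "v_proj"]],  # hypothetical group sharing one setting
+    batch_size=1,                                    # lower VRAM usage, longer tuning time
+)
+
+ar = AutoRound(model=model_name, scheme=scheme)
+ar.quantize_and_save()
+~~~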