Conversation
Contributor
Pull request overview
Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.
Changes:
- Introduces `auto_round.compressors.model_free` with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
- Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
- Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops/vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/main.py | Adds --model_free CLI flag and routes to model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting + updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section and related corrections. |
n1ck-guo reviewed Apr 21, 2026
yiliu30 reviewed Apr 21, 2026
Contributor
I think it would be better to wrap it as a class and use a unified interface with auto_round.
Contributor
Author
Thanks for the reminder; conv1d is skipped now.
Contributor
Author
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
wenhuach21 reviewed Apr 26, 2026
Contributor
Author
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Contributor
Author
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description
Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.
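At its core, the flow streams each safetensors shard, quantizes the 2D Linear weights it finds, and writes a packed shard back out. A simplified illustration of that loop follows (not the PR's actual implementation; `quantize_fn` here is a placeholder for the RTN quantize + pack step):

```python
# Simplified illustration of shard-by-shard streaming: only one shard is open
# at a time, and tensors are loaded one by one. Not the PR's actual code;
# `quantize_fn` is a stand-in for the RTN quantize + pack step.
import glob
import os

from safetensors import safe_open
from safetensors.torch import save_file

def quantize_shards(model_dir: str, out_dir: str, quantize_fn):
    for shard_path in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        out_tensors = {}
        with safe_open(shard_path, framework="pt") as f:
            for name in f.keys():
                t = f.get_tensor(name)  # one tensor resident at a time
                if name.endswith(".weight") and t.dim() == 2:
                    # quantize_fn returns packed entries, e.g. qweight/scales/zeros
                    out_tensors.update(quantize_fn(name, t))
                else:
                    out_tensors[name] = t  # pass non-Linear tensors through
        save_file(out_tensors, os.path.join(out_dir, os.path.basename(shard_path)))
```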
Key features:
- Only `config.json` and the safetensors files are needed
- Supports `--layer_config` for per-layer bit-width overrides and `--ignore_layers` to keep specific layers in full precision
- Bit-exact parity with the `--iters 0 --disable_opt_rtn` flow for all supported schemes

Supported schemes
Model-free mode currently supports the following integer weight-only preset schemes (packed in the `auto_round:auto_gptq` format): `W2A16`, `W2A16G32`, `W2A16G64`, `W4A16` (default), `W4A16_MIXED`, and `W8A16`.

All of the above presets also support asymmetric quantization (`sym=False`) for the 2-bit and 8-bit variants (`W2A16`, `W2A16G32`, `W2A16G64`, `W8A16`), producing `auto_round:auto_gptq`-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization the regular flow uses `auto_round:auto_awq` packing as suggested; use the standard AutoRound flow for that case.

You can also pass a custom `QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16)` with `bits` in {2, 4, 8} and any group_size / sym configuration.

Schemes that require special packing kernels (`W3A16`, `FPW8A16`, `BF16`, `MXFP4`, `MXFP8`, `MXINT4`, `NVFP4`, `FP8_BLOCK`, `FP8_STATIC`, `INT8_W8A8`, `GGUF:*`, ...) are not supported in model-free mode and will raise `ValueError`. Use the regular AutoRound flow for those.

CLI Usage
API Usage
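A hedged sketch of the Python-level flow. The entry-point name `quantize_model_free` and its keyword arguments below are illustrative placeholders, not the confirmed signature exported by `auto_round.compressors.model_free`; the `QuantizationScheme` fields are the ones listed above:

```python
# Placeholder API sketch: `quantize_model_free` and its keywords are
# illustrative names, not the confirmed signature from this PR.
from auto_round.compressors.model_free import quantize_model_free  # hypothetical name
from auto_round.schemes import QuantizationScheme

# A custom integer WOQ scheme, as described in "Supported schemes".
scheme = QuantizationScheme(
    bits=4, group_size=128, sym=True, data_type="int", act_bits=16
)

quantize_model_free(
    model="Qwen/Qwen3.5-35B-A3B",   # repo id or local dir with safetensors
    scheme=scheme,                   # or a preset name such as "W4A16"
    ignore_layers="lm_head",         # keep listed layers in full precision
    output_dir="./Qwen3.5-35B-A3B-W4A16",
)
```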
Memory and performance optimizations
- Reworks the `quantize_weight_rtn` function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs. (A simplified sketch of both ideas follows.)
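A condensed sketch of the two techniques, symmetric 4-bit only; the PR's `quantize_weight_rtn` is more general, so treat this as an outline rather than its actual code:

```python
# Sketch of in-place RTN quantization plus vectorized 4-bit packing.
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Symmetric per-group RTN; in-place ops keep peak memory near one extra buffer."""
    maxq = 2 ** bits - 1
    zp = 2 ** (bits - 1)
    g = w.reshape(-1, group_size)                        # view for contiguous weights
    scale = g.abs().amax(dim=1, keepdim=True).div_(zp).clamp_(min=1e-9)
    q = g / scale                                        # the only full-size temporary
    q.round_().add_(zp).clamp_(0, maxq)                  # in-place: no extra allocations
    return q.to(torch.uint8).reshape(w.shape), scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Vectorized packing: eight 4-bit values per int32, no Python-level loop."""
    q = q.to(torch.int32).reshape(q.shape[0], -1, 8)
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)   # 0, 4, ..., 28
    return (q << shifts).sum(dim=-1, dtype=torch.int32)
```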
Fused expert tensor handling
- Adds splitting of fused MoE expert tensors (in `auto_round/utils/missing_tensors.py`) into per-expert weights, so stacked expert projections can be quantized like ordinary Linear layers. (See the illustrative sketch below.)
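For illustration, splitting a stacked per-expert weight might look like the following; the tensor layout and naming convention are assumptions, not the PR's exact logic:

```python
# Illustrative sketch (layout and names assumed): split a stacked
# (num_experts, out_features, in_features) MoE weight into per-expert
# 2D tensors so each can be quantized as an ordinary Linear weight.
import torch

def split_fused_experts(name: str, fused: torch.Tensor) -> dict[str, torch.Tensor]:
    assert fused.dim() == 3, "expected a stacked per-expert weight"
    # e.g. "...mlp.experts.gate_proj" -> "...mlp.experts.{i}.gate_proj.weight"
    prefix, leaf = name.rsplit(".", 1)
    return {
        f"{prefix}.{i}.{leaf}.weight": fused[i].contiguous()
        for i in range(fused.shape[0])
    }
```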
Utility function improvements
- Improves the `compress_layer_names` utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns. (An illustrative sketch follows.)
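An illustrative take on the "compress until stable" behavior; the real implementation in `auto_round/utils/common.py` may differ in pattern and output format:

```python
# Illustrative sketch of compressing numbered layer names until stable;
# not the actual implementation from auto_round/utils/common.py.
import re

def compress_layer_names(names):
    """Collapse one numeric level per pass until the name set stops shrinking."""
    names = set(names)
    while True:
        collapsed = {re.sub(r"\.\d+\.", ".*.", n, count=1) for n in names}
        if collapsed == names:
            return sorted(names)
        names = collapsed

# ["model.layers.0.experts.0.w1", "model.layers.1.experts.3.w1"]
# -> ["model.layers.*.experts.*.w1"]
```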
Documentation and minor corrections
- Documents Model-Free Mode usage (CLI/API) in `docs/step_by_step.md` and aligns `docs/step_by_step_CN.md` with the English section, including related corrections.

Time and memory usage
Qwen/Qwen3.5-35B-A3B:

| Flow | Total time | Peak RAM | Peak VRAM |
|---|---|---|---|
| Model free | 153.61 s | 8.86 GB | 0.7 GB |
| `--iters=0 --disable_opt_rtn` | 220 s | 48.13 GB | 0.44 GB |
Related Issues
Fixes or relates to #1491