support model_free WOQ quantization #1699

Open
xin3he wants to merge 37 commits into main from xinhe/4-14

Conversation

@xin3he
Contributor

@xin3he xin3he commented Apr 17, 2026

Description

Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.

Auto-enabled by default. As of v0.13, when you pass --iters 0 --disable_opt_rtn together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This path is bit-exact with the regular --iters 0 --disable_opt_rtn flow but uses far less memory. Use --disable_model_free to opt out and force the original flow.

Key features:

  • No model object required: only config.json and the safetensors files are needed
  • Low disk usage when no local model files exist: downloads and quantizes one shard at a time, deleting each source shard after processing
  • Per-layer configuration: supports --layer_config for per-layer bit-width overrides and --ignore_layers to keep specific layers in full precision
  • Predefined ignore layers: automatically skips model-specific layers (e.g., MoE gates, MTP layers) based on config detection
  • Bit-exact parity with the standard --iters 0 --disable_opt_rtn flow for all supported schemes
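As background, the per-tensor RTN step that model-free mode applies to each Linear weight can be sketched as below. This is a minimal NumPy illustration of per-group round-to-nearest quantization; the function name and the exact scale/clip handling are simplifications, not the PR's actual implementation:

```python
import numpy as np

def rtn_quantize(weight: np.ndarray, bits: int = 4, group_size: int = 128, sym: bool = True):
    """Round-to-nearest quantization of a 2D weight, one scale per group of columns."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    if sym:
        qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
        scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
        scale[scale == 0] = 1.0                           # avoid division by zero
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        zero = np.zeros_like(scale)
    else:
        qmax = 2 ** bits - 1                              # e.g. 15 for 4-bit
        wmin = w.min(axis=-1, keepdims=True)
        wmax = w.max(axis=-1, keepdims=True)
        scale = (wmax - wmin) / qmax
        scale[scale == 0] = 1.0
        zero = np.round(-wmin / scale)
        q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q.astype(np.int32), scale, zero
```

Each quantized group is later dequantized as `(q - zero) * scale`; model-free mode runs this weight-by-weight over safetensors shards, so no model object is ever built.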

Supported schemes

Model-free mode currently supports the following integer weight-only preset schemes (packed in the auto_round:auto_gptq format):

| Preset          | Bits | Group size | Sym  |
|-----------------|------|------------|------|
| W2A16           | 2    | 128        | true |
| W2A16G32        | 2    | 32         | true |
| W2A16G64        | 2    | 64         | true |
| W4A16 (default) | 4    | 128        | true |
| W4A16_MIXED     | 4    | 128        | true |
| W8A16           | 8    | 128        | true |

The 2-bit and 8-bit presets (W2A16, W2A16G32, W2A16G64, W8A16) also support asymmetric quantization (sym=False), producing auto_round:auto_gptq-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization the regular flow prefers auto_round:auto_awq packing; use the standard AutoRound flow for that case.

You can also pass a custom QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16) with bits in {2, 4, 8} and any group_size / sym configuration.

Schemes that require special packing kernels (W3A16, FPW8A16, BF16, MXFP4, MXFP8, MXINT4, NVFP4, FP8_BLOCK, FP8_STATIC, INT8_W8A8, GGUF:*, ...) are not supported in model-free mode and will raise ValueError. Use the regular AutoRound flow for those.
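The scheme gate described above can be sketched roughly as follows; the preset names come from the table, but the function name, error message, and standalone structure are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: model-free mode accepts only the integer weight-only
# presets and rejects everything else with ValueError.
SUPPORTED_MODEL_FREE = {"W2A16", "W2A16G32", "W2A16G64", "W4A16", "W4A16_MIXED", "W8A16"}

def check_model_free_scheme(scheme: str) -> str:
    """Return the normalized preset name, or raise ValueError for unsupported schemes."""
    name = scheme.upper()
    if name not in SUPPORTED_MODEL_FREE:
        raise ValueError(
            f"Scheme {scheme!r} is not supported in model-free mode; "
            "use the regular AutoRound flow."
        )
    return name
```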

CLI Usage

# Easiest: --iters 0 --disable_opt_rtn auto-routes to model-free
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn \
  --output_dir ./int4-llama

# Equivalent explicit invocation
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --output_dir ./int4-llama

# Opt out of auto-routing and use the regular flow instead
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn --disable_model_free \
  --output_dir ./int4-llama

# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama

API Usage

from auto_round import AutoRound

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme="W4A16",  # Or a QuantizationScheme instance for custom group_size / sym.
    layer_config={
        ".*k_proj": {"bits": 8, "group_size": 32},
        ".*v_proj": {"bits": 8, "group_size": 32},
    },
    ignore_layers="mlp",
    model_free=True,
).quantize_and_save("./int4-llama")

Note: Model-free mode only supports the auto_round output format and uses RTN (no calibration data, no iterative tuning). For higher-quality quantization or schemes outside the supported list, use the standard AutoRound flow.

Memory and performance optimizations

  • Improved the quantize_weight_rtn function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs.
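Vectorized bit packing of the kind mentioned here can be illustrated as follows. This is a NumPy sketch assuming unsigned 4-bit values packed eight per int32 word, lowest nibble first; it is not the actual quantize_weight_rtn code:

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) into int32 words, 8 values per word."""
    assert q.size % 8 == 0 and q.min() >= 0 and q.max() <= 15
    flat = q.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4            # 0, 4, 8, ..., 28
    # Shift every nibble into place and OR (sum) them in one vectorized pass,
    # with no Python-level loop over elements.
    packed = (flat << shifts).sum(axis=1, dtype=np.uint32)
    return packed.view(np.int32)
```

Because the shifted nibbles occupy disjoint bit ranges, the sum is equivalent to a bitwise OR, and the whole tensor is packed without allocating per-element intermediates in Python.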

Fused expert tensor handling

  • Added logic to automatically split fused 3D expert tensors (common in MoE models) into per-expert 2D tensors, ensuring compatibility with quantization routines and improving support for a wider range of model architectures.
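The splitting step can be sketched like this. It is a NumPy illustration; the [num_experts, out_features, in_features] layout and the ".{i}.weight" naming convention are assumptions, not necessarily the PR's exact behavior:

```python
import numpy as np

def split_fused_experts(tensor: np.ndarray, prefix: str) -> dict:
    """Split a fused [num_experts, out, in] 3D expert tensor into per-expert
    2D weights keyed by a per-expert layer name."""
    assert tensor.ndim == 3, "expected a fused 3D expert tensor"
    return {f"{prefix}.{i}.weight": tensor[i] for i in range(tensor.shape[0])}
```

Each resulting 2D tensor can then go through the same per-Linear RTN routine as any other weight.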

Utility function improvements

  • Enhanced the compress_layer_names utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns.
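The repeated-compression idea can be sketched as below; this is a standalone illustration, and the real compress_layer_names implementation and its wildcard token may differ:

```python
import re

def compress_layer_names(names):
    """Collapse numeric indices in dotted layer names (layers.0, layers.1 -> layers.*),
    repeating the pass until no further reduction happens, so multi-level
    numbered names (layers.0.experts.3.w1) fully reduce."""
    compressed = set(names)
    while True:
        reduced = {re.sub(r"\.\d+(\.|$)", r".*\1", n, count=1) for n in compressed}
        if reduced == compressed:          # fixed point reached
            return sorted(compressed)
        compressed = reduced
```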

Documentation and minor corrections

  • Updated documentation for both English and Chinese users, including new sections on model-free mode, corrected quantization scheme tables, and clarified quantization backend support.

Time and memory usage

Qwen/Qwen3.5-35B-A3B

| Mode                        | Total time | Peak RAM | Peak VRAM |
|-----------------------------|------------|----------|-----------|
| Model-free                  | 153.61 s   | 8.86 GB  | 0.7 GB    |
| --iters 0 --disable_opt_rtn | 220 s      | 48.13 GB | 0.44 GB   |

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1491

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

xin3he added 13 commits April 14, 2026 13:00
Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings April 17, 2026 03:56
xin3he and others added 4 commits April 17, 2026 11:56
Signed-off-by: Xin He <xin3.he@intel.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.

Changes:

  • Introduces auto_round.compressors.model_free with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
  • Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
  • Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops and vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/main.py | Adds --model_free CLI flag and routes to the model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting and updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section and related corrections. |

Signed-off-by: Xin He <xin3.he@intel.com>
@n1ck-guo
Contributor

I think it would be better to wrap it as a class and use a unified interface with auto_round.

@xin3he
Contributor Author

xin3he commented Apr 25, 2026

> Have you tested a model with Conv1D layers where the weights need to be transposed before quantization? How do you detect the layer type?

Thanks for the reminder; Conv1D layers are now skipped.

@xin3he
Contributor Author

xin3he commented Apr 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested review from n1ck-guo and yiliu30 April 25, 2026 14:31
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he added 3 commits April 26, 2026 09:49
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Apr 26, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 4 commits April 27, 2026 07:32
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Apr 28, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 2 commits April 28, 2026 06:02
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>