support model_free WOQ quantization #1699

Open
xin3he wants to merge 37 commits into main from xinhe/4-14

Conversation

@xin3he
Contributor

@xin3he xin3he commented Apr 17, 2026

Description

Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.

Auto-enabled by default. As of v0.13, when you pass --iters 0 --disable_opt_rtn together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This path is bit-exact with the regular --iters 0 --disable_opt_rtn flow but uses far less memory. Use --disable_model_free to opt out and force the original flow.

Key features:

  • No model object required: only config.json and the safetensors files are needed
  • Low disk usage when no local model files exist: downloads and quantizes one shard at a time, deleting each source shard after processing
  • Per-layer configuration: supports --layer_config for per-layer bit-width overrides and --ignore_layers to keep specific layers in full precision
  • Predefined ignore layers: automatically skips model-specific layers (e.g., MoE gates, MTP layers) based on config detection
  • Bit-exact parity with the standard --iters 0 --disable_opt_rtn flow for all supported schemes
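As background, the per-tensor RTN step that model-free mode applies to each Linear weight can be sketched as below. This is a minimal NumPy illustration of per-group round-to-nearest quantization; the function name and the exact scale/clip handling are simplifications, not the PR's actual implementation:

```python
import numpy as np

def rtn_quantize(weight: np.ndarray, bits: int = 4, group_size: int = 128, sym: bool = True):
    """Round-to-nearest quantization of a 2D weight, one scale per group of columns."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    if sym:
        qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
        scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
        scale[scale == 0] = 1.0                           # avoid division by zero
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        zero = np.zeros_like(scale)
    else:
        qmax = 2 ** bits - 1                              # e.g. 15 for 4-bit
        wmin = w.min(axis=-1, keepdims=True)
        wmax = w.max(axis=-1, keepdims=True)
        scale = (wmax - wmin) / qmax
        scale[scale == 0] = 1.0
        zero = np.round(-wmin / scale)
        q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q.astype(np.int32), scale, zero
```

Each quantized group is later dequantized as `(q - zero) * scale`; model-free mode runs this weight-by-weight over safetensors shards, so no model object is ever built.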

Supported schemes

Model-free mode currently supports the following integer weight-only preset schemes (packed in the auto_round:auto_gptq format):

| Preset          | Bits | Group size | Sym  |
|-----------------|------|------------|------|
| W2A16           | 2    | 128        | true |
| W2A16G32        | 2    | 32         | true |
| W2A16G64        | 2    | 64         | true |
| W4A16 (default) | 4    | 128        | true |
| W4A16_MIXED     | 4    | 128        | true |
| W8A16           | 8    | 128        | true |

The 2-bit and 8-bit presets (W2A16, W2A16G32, W2A16G64, W8A16) also support asymmetric quantization (sym=False), producing auto_round:auto_gptq-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization the regular flow prefers auto_round:auto_awq packing; use the standard AutoRound flow for that case.

You can also pass a custom QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16) with bits in {2, 4, 8} and any group_size / sym configuration.

Schemes that require special packing kernels (W3A16, FPW8A16, BF16, MXFP4, MXFP8, MXINT4, NVFP4, FP8_BLOCK, FP8_STATIC, INT8_W8A8, GGUF:*, ...) are not supported in model-free mode and will raise ValueError. Use the regular AutoRound flow for those.
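The scheme gate described above can be sketched roughly as follows; the preset names come from the table, but the function name, error message, and standalone structure are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: model-free mode accepts only the integer weight-only
# presets and rejects everything else with ValueError.
SUPPORTED_MODEL_FREE = {"W2A16", "W2A16G32", "W2A16G64", "W4A16", "W4A16_MIXED", "W8A16"}

def check_model_free_scheme(scheme: str) -> str:
    """Return the normalized preset name, or raise ValueError for unsupported schemes."""
    name = scheme.upper()
    if name not in SUPPORTED_MODEL_FREE:
        raise ValueError(
            f"Scheme {scheme!r} is not supported in model-free mode; "
            "use the regular AutoRound flow."
        )
    return name
```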

CLI Usage

# Easiest: --iters 0 --disable_opt_rtn auto-routes to model-free
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn \
  --output_dir ./int4-llama

# Equivalent explicit invocation
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --output_dir ./int4-llama

# Opt out of auto-routing and use the regular flow instead
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn --disable_model_free \
  --output_dir ./int4-llama

# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama

API Usage

from auto_round import AutoRound

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme="W4A16",  # Or a QuantizationScheme instance for custom group_size / sym.
    layer_config={
        ".*k_proj": {"bits": 8, "group_size": 32},
        ".*v_proj": {"bits": 8, "group_size": 32},
    },
    ignore_layers="mlp",
    model_free=True,
).quantize_and_save("./int4-llama")

Note: Model-free mode only supports the auto_round output format and uses RTN (no calibration data, no iterative tuning). For higher-quality quantization or schemes outside the supported list, use the standard AutoRound flow.

Memory and performance optimizations

  • Improved the quantize_weight_rtn function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs.
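Vectorized bit packing of the kind mentioned here can be illustrated as follows. This is a NumPy sketch assuming unsigned 4-bit values packed eight per int32 word, lowest nibble first; it is not the actual quantize_weight_rtn code:

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) into int32 words, 8 values per word."""
    assert q.size % 8 == 0 and q.min() >= 0 and q.max() <= 15
    flat = q.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4            # 0, 4, 8, ..., 28
    # Shift every nibble into place and OR (sum) them in one vectorized pass,
    # with no Python-level loop over elements.
    packed = (flat << shifts).sum(axis=1, dtype=np.uint32)
    return packed.view(np.int32)
```

Because the shifted nibbles occupy disjoint bit ranges, the sum is equivalent to a bitwise OR, and the whole tensor is packed without allocating per-element intermediates in Python.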

Fused expert tensor handling

  • Added logic to automatically split fused 3D expert tensors (common in MoE models) into per-expert 2D tensors, ensuring compatibility with quantization routines and improving support for a wider range of model architectures.
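The splitting step can be sketched like this. It is a NumPy illustration; the [num_experts, out_features, in_features] layout and the ".{i}.weight" naming convention are assumptions, not necessarily the PR's exact behavior:

```python
import numpy as np

def split_fused_experts(tensor: np.ndarray, prefix: str) -> dict:
    """Split a fused [num_experts, out, in] 3D expert tensor into per-expert
    2D weights keyed by a per-expert layer name."""
    assert tensor.ndim == 3, "expected a fused 3D expert tensor"
    return {f"{prefix}.{i}.weight": tensor[i] for i in range(tensor.shape[0])}
```

Each resulting 2D tensor can then go through the same per-Linear RTN routine as any other weight.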

Utility function improvements

  • Enhanced the compress_layer_names utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns.
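The repeated-compression idea can be sketched as below; this is a standalone illustration, and the real compress_layer_names implementation and its wildcard token may differ:

```python
import re

def compress_layer_names(names):
    """Collapse numeric indices in dotted layer names (layers.0, layers.1 -> layers.*),
    repeating the pass until no further reduction happens, so multi-level
    numbered names (layers.0.experts.3.w1) fully reduce."""
    compressed = set(names)
    while True:
        reduced = {re.sub(r"\.\d+(\.|$)", r".*\1", n, count=1) for n in compressed}
        if reduced == compressed:          # fixed point reached
            return sorted(compressed)
        compressed = reduced
```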

Documentation and minor corrections

  • Updated documentation for both English and Chinese users, including new sections on model-free mode, corrected quantization scheme tables, and clarified quantization backend support.

Time and memory usage

Qwen/Qwen3.5-35B-A3B

| Mode                        | Total time | Peak RAM | Peak VRAM |
|-----------------------------|------------|----------|-----------|
| Model-free                  | 153.61 s   | 8.86 GB  | 0.7 GB    |
| --iters 0 --disable_opt_rtn | 220 s      | 48.13 GB | 0.44 GB   |

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1491

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

xin3he added 13 commits April 14, 2026 13:00
Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings April 17, 2026 03:56
xin3he and others added 4 commits April 17, 2026 11:56
Signed-off-by: Xin He <xin3.he@intel.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.

Changes:

  • Introduces auto_round.compressors.model_free with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
  • Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
  • Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops and vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/main.py | Adds --model_free CLI flag and routes to the model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting and updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section and related corrections. |

Signed-off-by: Xin He <xin3.he@intel.com>
@n1ck-guo
Contributor

I think it would be better to wrap it as a class and use a unified interface with auto_round.

@xin3he
Contributor Author

xin3he commented Apr 25, 2026

> Have you tested a model with Conv1D layers where the weights need to be transposed before quantization? How do you detect the layer type?

Thanks for the reminder; Conv1D layers are now skipped.

@xin3he
Contributor Author

xin3he commented Apr 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested review from n1ck-guo and yiliu30 April 25, 2026 14:31
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he added 3 commits April 26, 2026 09:49
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Apr 26, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 4 commits April 27, 2026 07:32
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Apr 28, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 2 commits April 28, 2026 06:02
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>