
Optimize CPU RAM peak memory during quantization#1386

Merged
chensuyue merged 78 commits into `main` from `lvl/ram_usage_optimization` on Mar 10, 2026

Conversation

@lvliang-intel
Contributor

Description

Optimize CPU RAM peak memory during quantization:

  1. Two optional CPU RAM optimizations, gated by `low_cpu_mem_usage`:
     - `cpu_stream_offload_blocks`: offload block weights to disk and load them on demand during block-wise quantization, then re-offload the quantized weights; restore everything at the end.
     - `cpu_stream_loss`: avoid caching block outputs by computing targets on-the-fly with a frozen block copy (requires `nblocks=1`).

  2. The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming the loss to keep peak CPU RAM low.
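The flow above can be sketched roughly as follows. This is an illustrative sketch with hypothetical helper names (`offload_block`, `restore_block`, `quantize_blockwise`), not the actual auto-round implementation:

```python
import os
import tempfile

import torch
import torch.nn as nn


def offload_block(block: nn.Module, path: str) -> None:
    """Save the block's weights to disk, then drop the in-RAM storage."""
    torch.save(block.state_dict(), path)
    for p in block.parameters():
        p.data = torch.empty(0)  # free the tensor storage


def restore_block(block: nn.Module, path: str) -> None:
    """Reload previously offloaded weights from disk."""
    state = torch.load(path)
    for name, p in block.named_parameters():
        p.data = state[name]  # reattach the saved tensor storage


def quantize_blockwise(blocks, cached_inputs, quantize_fn, offload_dir):
    """Process blocks one at a time so only one block's weights stay resident."""
    paths = [os.path.join(offload_dir, f"block_{i}.pt") for i in range(len(blocks))]
    for block, path in zip(blocks, paths):  # offload everything up front
        offload_block(block, path)
    x = cached_inputs  # inputs are cached once
    for block, path in zip(blocks, paths):
        restore_block(block, path)   # load on demand
        quantize_fn(block, x)        # quantize this block
        with torch.no_grad():
            x = block(x)             # activations feed the next block
        offload_block(block, path)   # re-offload the quantized weights
    for block, path in zip(blocks, paths):  # restore at the end
        restore_block(block, path)
```

Peak RAM then scales with one block's weights plus the cached inputs, rather than the whole model.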

Test

Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.

Optimization options:

  1. `cpu_stream_offload_blocks`: Offload block weights to disk, load on demand
  2. `cpu_stream_loss`: Compute loss on-the-fly using a frozen block copy
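The streaming-loss idea can be sketched like this (hypothetical names, not the PR's actual code): instead of precomputing and caching the frozen block's outputs for every calibration batch, a frozen copy of the block produces each target on demand, so only one batch of activations is resident at a time.

```python
import copy

import torch
import torch.nn as nn


def streaming_block_loss(block: nn.Module, batches) -> float:
    """Accumulate loss against a frozen copy without caching its outputs."""
    frozen = copy.deepcopy(block).eval()
    for p in frozen.parameters():
        p.requires_grad_(False)

    total = 0.0
    for x in batches:
        with torch.no_grad():
            target = frozen(x)   # target computed on-the-fly, then discarded
        out = block(x)           # output of the block being tuned
        total += nn.functional.mse_loss(out, target).item()
    return total
```

With a cached-output scheme, all `len(batches)` target tensors would live in RAM at once; here only the current batch's target does, at the cost of one extra forward pass per batch.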

Summary: Peak RAM Comparison

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 24.29 | 1582.3 | baseline |
| + offload_blocks | 20.26 | 1609.1 | -4.03 GB |
| + stream_loss | 21.31 | 1364.0 | -2.98 GB |
| All optimizations | 15.57 | 1269.3 | -8.72 GB |
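Peak RAM figures like the ones above can be collected with a small helper. This is an illustrative sketch using the standard-library `resource` module (Unix only), not the measurement code used by the PR's benchmark scripts:

```python
import resource
import sys


def peak_ram_gb() -> float:
    """Peak resident set size of this process, in GB."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return ru_maxrss * scale / 1024**3
```

Calling `peak_ram_gb()` after quantization completes gives the high-water mark for the whole process, which is what the tables here compare.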

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings February 3, 2026 07:06
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.

Changes:

  • Added CPU RAM optimization options (cpu_stream_offload_blocks and cpu_stream_loss) to reduce memory usage during quantization
  • Modified export logic to only save quantization config attributes that differ from scheme defaults
  • Added comprehensive test for CPU RAM optimization with memory tracking

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| auto_round/compressors/base.py | Core implementation of CPU RAM optimization with block offloading and streaming loss computation |
| auto_round/utils/model.py | Added utility functions for saving/loading/clearing module weights to support offloading |
| auto_round/export/export_to_autoround/export.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_fp8.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py | Modified to only save non-default config attributes in extra_config |
| test/test_cuda/advanced/test_cpu_ram_optimization.py | New test file to validate CPU RAM optimization features |
| test/test_cuda/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cpu/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cuda/integrations/test_sglang.py | Updated test configuration and assertions |
| test/test_cpu/quantization/test_act_quantization.py | Removed assertions for default config values |
| test/test_cuda/export/test_gguf.py | Changed device specification from integer to string format |
| auto_round/auto_scheme/utils.py | Added fallback device handling for string device specifications |
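The "only save non-default config attributes" change in the export files can be illustrated with a generic sketch (hypothetical field and function names; the real code compares against the scheme's own defaults):

```python
from dataclasses import dataclass, fields


@dataclass
class QuantScheme:
    """Toy stand-in for a quantization scheme with defaults."""
    bits: int = 4
    group_size: int = 128
    sym: bool = True


def non_default_attrs(cfg: QuantScheme, defaults: QuantScheme) -> dict:
    """Keep only the attributes that differ from the scheme defaults."""
    return {
        f.name: getattr(cfg, f.name)
        for f in fields(cfg)
        if getattr(cfg, f.name) != getattr(defaults, f.name)
    }


# Only `bits` differs from the defaults, so only it would be serialized.
extra_config = non_default_attrs(QuantScheme(bits=8), QuantScheme())
```

This keeps the exported `extra_config` minimal, which is what the updated `test_mix_bits.py` assertions check.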

WeiweiZhang1 and others added 4 commits February 3, 2026 07:17
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
yiliu30 and others added 8 commits February 4, 2026 03:10
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…atible) (#1374)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: n1ck-guo <heng.guo@intel.com>
Co-authored-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

lvliang-intel commented Feb 9, 2026

Support AutoScheme CPU RAM Optimization:

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model /models/Qwen2.5-3B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1
[Result] low_cpu_mem_usage=False -> 'peak_ram': 13.65GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=False -> time=688.6s
[Result] low_cpu_mem_usage=True -> 'peak_ram': 8.83GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=True -> time=637.2s
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 13.65 | 688.6 | -- |
| Optimizations | 8.83 | 637.2 | 4.82 GB |
| Ratio | 0.65x | 0.92x | |

lvliang-intel and others added 12 commits February 9, 2026 06:55
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
…ble type: 'set' (#1425)

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…e g_idx for gptqmodel backend (#1429)

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
…/auto-round into lvl/ram_usage_optimization

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

lvliang-intel commented Mar 3, 2026

compare_compressor_ram.py
compare_auto_scheme_ram.py

Compressor RAM Optimization Test Result:

1 Qwen/Qwen3-4B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-4B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 15.05 | 43.7 | baseline |
| Optimized | 5.35 | 56.2 | +9.70 GB |

Best config (Optimized): RAM 0.36x baseline | Time 1.29x baseline

2 Qwen/Qwen2.5-7B-Instruct

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen2.5-7B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 19.04 | 48.4 | baseline |
| Optimized | 7.57 | 76.4 | +11.47 GB |

Best config (Optimized): RAM 0.40x baseline | Time 1.58x baseline

3 Qwen/Qwen3-14B

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-14B/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 31.59 | 79.5 | baseline |
| Optimized | 5.19 | 135.5 | +26.40 GB |

Best config (Optimized): RAM 0.16x baseline | Time 1.70x baseline

4 Qwen/Qwen3-30B-A3B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-30B-A3B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 67.81 | 500.7 | baseline |
| Optimized | 12.98 | 612.1 | +54.83 GB |

Best config (Optimized): RAM 0.19x baseline | Time 1.22x baseline

AutoScheme RAM Optimization Test Result:

1 Qwen/Qwen3-4B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen3-4B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 18.86 | 230.1 | -- |
| Optimized | 3.83 | 235.6 | 15.03 GB |
| Ratio | 0.20x | 1.02x | |

2 Qwen/Qwen2.5-7B-Instruct

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen2.5-7B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 28.72 | 371.8 | -- |
| Optimized | 4.70 | 351.1 | 24.02 GB |
| Ratio | 0.16x | 0.94x | |

3 Qwen/Qwen3-14B

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen3-14B/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 61.18 | 673.0 | -- |
| Optimized | 6.51 | 653.3 | 54.67 GB |
| Ratio | 0.11x | 0.97x | |

@lvliang-intel lvliang-intel force-pushed the lvl/ram_usage_optimization branch from 9718d68 to 89a44e8 Compare March 3, 2026 14:40
lvliang-intel and others added 5 commits March 4, 2026 10:40
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

Still has a CI issue caused by #1478:

```
=============================== FAILURES ================================
________________________ TestGGUF.test_q2k_mixed ________________________

self = <test.test_cpu.export.test_gguf_format.TestGGUF object at 0x7fc706c720d0>

    def test_q2k_mixed(self):
        model_name = get_model_path("Qwen/Qwen1.5-MoE-A2.7B")
        saved_tiny_model_path = save_tiny_model(
            model_name,
            "./tmp/tiny_qwen_model_path",
            num_layers=3,
            is_mllm=False,
        )
        autoround = AutoRound(
            saved_tiny_model_path,
            iters=0,
            nsamples=1,
            seqlen=16,
            disable_opt_rtn=True,
        )
        quantized_model_path = "./saved"
        autoround.quantize_and_save(output_dir=quantized_model_path, format="gguf:q2_k_mixed")
        gguf_file = os.listdir(quantized_model_path)[0]
        file_size = os.path.getsize(os.path.join(quantized_model_path, gguf_file)) / 1024**2
        assert abs(file_size - 1362) < 5.0
        from gguf.gguf_reader import GGUFReader
        gguf_model = GGUFReader(os.path.join(quantized_model_path, gguf_file))
        assert gguf_model.get_tensor(2).name == "blk.0.attn_k.weight"
        assert gguf_model.get_tensor(2).tensor_type.name == "Q4_K"
>       assert gguf_model.get_tensor(10).name == "blk.0.ffn_up_exps.weight"
E       AssertionError: assert 'blk.0.ffn_down_exps.weight' == 'blk.0.ffn_up_exps.weight'
E       - blk.0.ffn_up_exps.weight
E       ?             ^^
E       + blk.0.ffn_down_exps.weight
E       ?             ^^^^

test/test_cpu/export/test_gguf_format.py:347: AssertionError
```

@n1ck-guo
Contributor

LGTM

@chensuyue chensuyue merged commit 0d37215 into main Mar 10, 2026
29 checks passed
@chensuyue chensuyue deleted the lvl/ram_usage_optimization branch March 10, 2026 08:41