
Optimize CPU RAM peak memory during quantization#1386

Merged
chensuyue merged 78 commits into `main` from `lvl/ram_usage_optimization` on Mar 10, 2026

Conversation

@lvliang-intel
Contributor

Description

Optimize CPU RAM peak memory during quantization:

  1. Two optional CPU RAM optimizations, gated by `low_cpu_mem_usage`:
     - `cpu_stream_offload_blocks`: offload block weights to disk and load them on demand during block-wise quantization, then re-offload the quantized weights; restore everything at the end.
     - `cpu_stream_loss`: avoid caching block outputs by computing targets on-the-fly with a frozen block copy (requires `nblocks=1`).

  2. The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming the loss to keep peak CPU RAM low.
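The flow above can be sketched roughly as follows. This is an illustrative sketch with hypothetical helper names (`offload_block`, `restore_block`, `quantize_blockwise`), not the actual auto-round implementation:

```python
import os
import tempfile

import torch
import torch.nn as nn


def offload_block(block: nn.Module, path: str) -> None:
    """Save the block's weights to disk, then drop the in-RAM storage."""
    torch.save(block.state_dict(), path)
    for p in block.parameters():
        p.data = torch.empty(0)  # free the tensor storage


def restore_block(block: nn.Module, path: str) -> None:
    """Reload previously offloaded weights from disk."""
    state = torch.load(path)
    for name, p in block.named_parameters():
        p.data = state[name]  # reattach the saved tensor storage


def quantize_blockwise(blocks, cached_inputs, quantize_fn, offload_dir):
    """Process blocks one at a time so only one block's weights stay resident."""
    paths = [os.path.join(offload_dir, f"block_{i}.pt") for i in range(len(blocks))]
    for block, path in zip(blocks, paths):  # offload everything up front
        offload_block(block, path)
    x = cached_inputs  # inputs are cached once
    for block, path in zip(blocks, paths):
        restore_block(block, path)   # load on demand
        quantize_fn(block, x)        # quantize this block
        with torch.no_grad():
            x = block(x)             # activations feed the next block
        offload_block(block, path)   # re-offload the quantized weights
    for block, path in zip(blocks, paths):  # restore at the end
        restore_block(block, path)
```

Peak RAM then scales with one block's weights plus the cached inputs, rather than the whole model.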

Test

Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.

Optimization options:

  1. `cpu_stream_offload_blocks`: Offload block weights to disk, load on demand
  2. `cpu_stream_loss`: Compute loss on-the-fly using a frozen block copy
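The streaming-loss idea can be sketched like this (hypothetical names, not the PR's actual code): instead of precomputing and caching the frozen block's outputs for every calibration batch, a frozen copy of the block produces each target on demand, so only one batch of activations is resident at a time.

```python
import copy

import torch
import torch.nn as nn


def streaming_block_loss(block: nn.Module, batches) -> float:
    """Accumulate loss against a frozen copy without caching its outputs."""
    frozen = copy.deepcopy(block).eval()
    for p in frozen.parameters():
        p.requires_grad_(False)

    total = 0.0
    for x in batches:
        with torch.no_grad():
            target = frozen(x)   # target computed on-the-fly, then discarded
        out = block(x)           # output of the block being tuned
        total += nn.functional.mse_loss(out, target).item()
    return total
```

With a cached-output scheme, all `len(batches)` target tensors would live in RAM at once; here only the current batch's target does, at the cost of one extra forward pass per batch.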

Summary: Peak RAM Comparison

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 24.29 | 1582.3 | baseline |
| + offload_blocks | 20.26 | 1609.1 | -4.03 GB |
| + stream_loss | 21.31 | 1364.0 | -2.98 GB |
| All optimizations | 15.57 | 1269.3 | -8.72 GB |
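Peak RAM figures like the ones above can be collected with a small helper. This is an illustrative sketch using the standard-library `resource` module (Unix only), not the measurement code used by the PR's benchmark scripts:

```python
import resource
import sys


def peak_ram_gb() -> float:
    """Peak resident set size of this process, in GB."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return ru_maxrss * scale / 1024**3
```

Calling `peak_ram_gb()` after quantization completes gives the high-water mark for the whole process, which is what the tables here compare.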

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings February 3, 2026 07:06
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.

Changes:

  • Added CPU RAM optimization options (cpu_stream_offload_blocks and cpu_stream_loss) to reduce memory usage during quantization
  • Modified export logic to only save quantization config attributes that differ from scheme defaults
  • Added comprehensive test for CPU RAM optimization with memory tracking

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| auto_round/compressors/base.py | Core implementation of CPU RAM optimization with block offloading and streaming loss computation |
| auto_round/utils/model.py | Added utility functions for saving/loading/clearing module weights to support offloading |
| auto_round/export/export_to_autoround/export.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_fp8.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py | Modified to only save non-default config attributes in extra_config |
| test/test_cuda/advanced/test_cpu_ram_optimization.py | New test file to validate CPU RAM optimization features |
| test/test_cuda/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cpu/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cuda/integrations/test_sglang.py | Updated test configuration and assertions |
| test/test_cpu/quantization/test_act_quantization.py | Removed assertions for default config values |
| test/test_cuda/export/test_gguf.py | Changed device specification from integer to string format |
| auto_round/auto_scheme/utils.py | Added fallback device handling for string device specifications |
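The "only save non-default config attributes" change in the export files can be illustrated with a generic sketch (hypothetical field and function names; the real code compares against the scheme's own defaults):

```python
from dataclasses import dataclass, fields


@dataclass
class QuantScheme:
    """Toy stand-in for a quantization scheme with defaults."""
    bits: int = 4
    group_size: int = 128
    sym: bool = True


def non_default_attrs(cfg: QuantScheme, defaults: QuantScheme) -> dict:
    """Keep only the attributes that differ from the scheme defaults."""
    return {
        f.name: getattr(cfg, f.name)
        for f in fields(cfg)
        if getattr(cfg, f.name) != getattr(defaults, f.name)
    }


# Only `bits` differs from the defaults, so only it would be serialized.
extra_config = non_default_attrs(QuantScheme(bits=8), QuantScheme())
```

This keeps the exported `extra_config` minimal, which is what the updated `test_mix_bits.py` assertions check.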

WeiweiZhang1 and others added 4 commits February 3, 2026 07:17
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
yiliu30 and others added 8 commits February 4, 2026 03:10
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…atible) (#1374)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: n1ck-guo <heng.guo@intel.com>
Co-authored-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

lvliang-intel commented Feb 9, 2026

Support AutoScheme CPU RAM Optimization:

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model /models/Qwen2.5-3B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1
[Result] low_cpu_mem_usage=False -> 'peak_ram': 13.65GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=False -> time=688.6s
[Result] low_cpu_mem_usage=True -> 'peak_ram': 8.83GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=True -> time=637.2s
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 13.65 | 688.6 | -- |
| Optimizations | 8.83 | 637.2 | 4.82 GB |
| Ratio | 0.65x | 0.92x | |

lvliang-intel and others added 12 commits February 9, 2026 06:55
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
…ble type: 'set' (#1425)

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…e g_idx for gptqmodel backend (#1429)

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
…/auto-round into lvl/ram_usage_optimization

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

lvliang-intel commented Mar 3, 2026

compare_compressor_ram.py
compare_auto_scheme_ram.py

Compressor RAM Optimization Test Result:

1 Qwen/Qwen3-4B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-4B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 15.05 | 43.7 | baseline |
| Optimized | 5.35 | 56.2 | +9.70 GB |

Best config (Optimized): RAM 0.36x baseline | Time 1.29x baseline

2 Qwen/Qwen2.5-7B-Instruct

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen2.5-7B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 19.04 | 48.4 | baseline |
| Optimized | 7.57 | 76.4 | +11.47 GB |

Best config (Optimized): RAM 0.40x baseline | Time 1.58x baseline

3 Qwen/Qwen3-14B

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-14B/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 31.59 | 79.5 | baseline |
| Optimized | 5.19 | 135.5 | +26.40 GB |

Best config (Optimized): RAM 0.16x baseline | Time 1.70x baseline

4 Qwen/Qwen3-30B-A3B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=4 python compare_compressor_ram.py --model ./Qwen3-30B-A3B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1 --iters 20
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 67.81 | 500.7 | baseline |
| Optimized | 12.98 | 612.1 | +54.83 GB |

Best config (Optimized): RAM 0.19x baseline | Time 1.22x baseline

AutoScheme RAM Optimization Test Result:

1 Qwen/Qwen3-4B-Instruct-2507

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen3-4B-Instruct-2507/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 18.86 | 230.1 | -- |
| Optimized | 3.83 | 235.6 | 15.03 GB |
| Ratio | 0.20x | 1.02x | |

2 Qwen/Qwen2.5-7B-Instruct

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen2.5-7B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 28.72 | 371.8 | -- |
| Optimized | 4.70 | 351.1 | 24.02 GB |
| Ratio | 0.16x | 0.94x | |

3 Qwen/Qwen3-14B

```
CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model ./Qwen3-14B/ --nsamples 8 --seqlen 256 --batch-size 1
```

| Configuration | Peak RAM (GB) | Time (s) | RAM Saved |
|---|---|---|---|
| Baseline | 61.18 | 673.0 | -- |
| Optimized | 6.51 | 653.3 | 54.67 GB |
| Ratio | 0.11x | 0.97x | |

@lvliang-intel lvliang-intel force-pushed the lvl/ram_usage_optimization branch from 9718d68 to 89a44e8 Compare March 3, 2026 14:40
lvliang-intel and others added 5 commits March 4, 2026 10:40
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

Still has a CI issue caused by #1478:

```
=============================== FAILURES ================================
________________________ TestGGUF.test_q2k_mixed ________________________

self = <test.test_cpu.export.test_gguf_format.TestGGUF object at 0x7fc706c720d0>

    def test_q2k_mixed(self):
        model_name = get_model_path("Qwen/Qwen1.5-MoE-A2.7B")
        saved_tiny_model_path = save_tiny_model(
            model_name,
            "./tmp/tiny_qwen_model_path",
            num_layers=3,
            is_mllm=False,
        )
        autoround = AutoRound(
            saved_tiny_model_path,
            iters=0,
            nsamples=1,
            seqlen=16,
            disable_opt_rtn=True,
        )
        quantized_model_path = "./saved"
        autoround.quantize_and_save(output_dir=quantized_model_path, format="gguf:q2_k_mixed")
        gguf_file = os.listdir(quantized_model_path)[0]
        file_size = os.path.getsize(os.path.join(quantized_model_path, gguf_file)) / 1024**2
        assert abs(file_size - 1362) < 5.0
        from gguf.gguf_reader import GGUFReader
        gguf_model = GGUFReader(os.path.join(quantized_model_path, gguf_file))
        assert gguf_model.get_tensor(2).name == "blk.0.attn_k.weight"
        assert gguf_model.get_tensor(2).tensor_type.name == "Q4_K"
>       assert gguf_model.get_tensor(10).name == "blk.0.ffn_up_exps.weight"
E       AssertionError: assert 'blk.0.ffn_down_exps.weight' == 'blk.0.ffn_up_exps.weight'
E       - blk.0.ffn_up_exps.weight
E       ?             ^^
E       + blk.0.ffn_down_exps.weight
E       ?             ^^^^

test/test_cpu/export/test_gguf_format.py:347: AssertionError
```

@n1ck-guo
Contributor

LGTM

@chensuyue chensuyue merged commit 0d37215 into main Mar 10, 2026
29 checks passed
@chensuyue chensuyue deleted the lvl/ram_usage_optimization branch March 10, 2026 08:41