Update MOE linear_loop implementation for speedup and matching name in vLLM #1478
Conversation
Pull request overview
This PR updates AutoRound’s Transformers MoE “linear_loop” pathway to better align naming with vLLM conventions and reduce unfusing overhead, and adjusts backend documentation/metadata for AutoGPTQ backends.
Changes:
- Split fused `gate_up_proj` handling into separate `gate_proj` + `up_proj` during MoE unfusing, and update the linear-loop forward accordingly.
- Optimize MoE unfusing by creating `nn.Linear` shells on the `meta` device and assigning per-expert weight slices.
- Update AutoGPTQ backend registration/requirements (and reflect priority changes in docs), and remap MoE expert parameter keys during shard saving.
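The meta-device trick in the second bullet is what avoids the large per-expert copies. A minimal sketch of the idea (the function name and shapes are assumptions for illustration, not the PR's actual code) looks like this:

```python
import torch
import torch.nn as nn


def unfuse_experts(fused_weight: torch.Tensor) -> nn.ModuleList:
    """Split a fused (num_experts, out_features, in_features) weight into
    per-expert nn.Linear modules without materializing new storage."""
    num_experts, out_features, in_features = fused_weight.shape
    experts = nn.ModuleList()
    for idx in range(num_experts):
        # Creating the shell on "meta" skips allocating a fresh weight tensor...
        linear = nn.Linear(in_features, out_features, bias=False, device="meta")
        # ...and assigning a view into the fused tensor avoids a full copy.
        linear.weight = nn.Parameter(fused_weight[idx], requires_grad=False)
        experts.append(linear)
    return experts
```

Because each expert's weight is a view into the fused tensor, unfusing becomes almost free, which matches the millisecond-level timings reported in the description below.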
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/step_by_step.md | Updates backend table entries (priority/packing format/requirements). |
| auto_round/modeling/fused_moe/moe_experts_interface.py | Refactors MoE projection naming and unfuse implementation; adds memory monitoring logs. |
| auto_round/inference/backend.py | Adjusts AutoGPTQ backend registration and requirements gating. |
| auto_round/compressors/shard_writer.py | Remaps expert parameter keys to {experts}.{idx}.{proj} style and skips meta tensors in finalize. |
verified with Qwen3.5-35B-A3B + vLLM
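For context, a linear-loop MoE forward with separate `gate_proj`/`up_proj` modules (rather than a fused `gate_up_proj`) can be sketched roughly as follows; the routing shapes, SwiGLU activation, and function name are assumptions for illustration, not the PR's actual code:

```python
import torch
import torch.nn.functional as F


def linear_loop_forward(hidden, gate_projs, up_projs, down_projs,
                        topk_ids, topk_weights):
    """Route each token through its selected experts one nn.Linear at a time.

    hidden:       (num_tokens, hidden_dim)
    topk_ids:     (num_tokens, top_k) expert indices per token
    topk_weights: (num_tokens, top_k) routing weights per token
    """
    out = torch.zeros_like(hidden)
    for expert_idx, (gate, up, down) in enumerate(
            zip(gate_projs, up_projs, down_projs)):
        # Find tokens routed to this expert and which top-k slot chose it.
        token_ids, slot = torch.where(topk_ids == expert_idx)
        if token_ids.numel() == 0:
            continue
        x = hidden[token_ids]
        # SwiGLU-style MLP: silu(gate(x)) * up(x), then down-projection.
        y = down(F.silu(gate(x)) * up(x))
        w = topk_weights[token_ids, slot].unsqueeze(-1)
        out.index_add_(0, token_ids, y * w)
    return out
```

Keeping `gate_proj` and `up_proj` as separate modules matches vLLM's per-expert naming, which is what lets the saved checkpoints load without extra key translation.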
1. As Liang's PR shows, transformers v5.2.0 transposes the weights. However, I don't see any transformers version check here; better add one if the code is not compatible with < 5.2.0.
2. Better test 2 more different model families.
3. Test the transformers backend as well.
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Tested with Qwen3-VL and Qwen3-Next on transformers 5.1.0 and 4.57.6.
Description
Remaps `experts.down_proj.[idx]` to `experts.[idx].down_proj` during saving.
2026-02-27 19:45:30 INFO moe_experts_interface.py L478: [MoE Prep] Before unfuse: 'peak_ram': 1.26GB
2026-02-27 19:45:47 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 40 MOE experts modules
2026-02-27 19:45:47 INFO moe_experts_interface.py L502: [MoE Prep] After unfuse: 'peak_ram': 41.34GB
2026-02-27 19:45:47 INFO replace_modules.py L81: Prepared 40 MOE modules for quantization
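The `experts.down_proj.[idx]` → `experts.[idx].down_proj` remap during saving can be sketched as a small key rewrite; the helper name and regex are assumptions for illustration, not the actual `shard_writer.py` code:

```python
import re


def remap_expert_key(key: str) -> str:
    """Rewrite 'experts.<proj>.<idx>.' segments to 'experts.<idx>.<proj>.'
    so saved shard keys match the per-expert nn.Linear layout."""
    return re.sub(
        r"experts\.(gate_proj|up_proj|down_proj)\.(\d+)\.",
        r"experts.\2.\1.",
        key,
    )
```

Keys that do not match the expert-projection pattern pass through unchanged, so non-MoE parameters are unaffected.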
Previous time:
2026-02-27 17:25:14 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.0.mlp.experts': 49542.75 ms
2026-02-27 17:26:03 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.1.mlp.experts': 47986.69 ms
2026-02-27 17:26:58 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.2.mlp.experts': 55245.50 ms
Now:
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.0.mlp.experts': 243.57 ms
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.1.mlp.experts': 206.36 ms
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.2.mlp.experts': 189.15 ms
Type of Change
Related Issues
Fixes or relates to #1464
Checklist Before Submitting