Update MOE linear_loop implementation for speedup and matching name in vLLM #1478
Conversation
Pull request overview
This PR updates AutoRound’s Transformers MoE “linear_loop” pathway to better align naming with vLLM conventions and reduce unfusing overhead, and adjusts backend documentation/metadata for AutoGPTQ backends.
Changes:
- Split fused `gate_up_proj` handling into separate `gate_proj` + `up_proj` during MoE unfusing, and update the linear-loop forward accordingly.
- Optimize MoE unfusing by creating `nn.Linear` shells on the `meta` device and assigning per-expert weight slices.
- Update AutoGPTQ backend registration/requirements (and reflect priority changes in docs), and remap MoE expert parameter keys during shard saving.
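The meta-device trick in the second bullet is what avoids the large per-expert copies. A minimal sketch of the idea (the function name and shapes are assumptions for illustration, not the PR's actual code) looks like this:

```python
import torch
import torch.nn as nn


def unfuse_experts(fused_weight: torch.Tensor) -> nn.ModuleList:
    """Split a fused (num_experts, out_features, in_features) weight into
    per-expert nn.Linear modules without materializing new storage."""
    num_experts, out_features, in_features = fused_weight.shape
    experts = nn.ModuleList()
    for idx in range(num_experts):
        # Creating the shell on "meta" skips allocating a fresh weight tensor...
        linear = nn.Linear(in_features, out_features, bias=False, device="meta")
        # ...and assigning a view into the fused tensor avoids a full copy.
        linear.weight = nn.Parameter(fused_weight[idx], requires_grad=False)
        experts.append(linear)
    return experts
```

Because each expert's weight is a view into the fused tensor, unfusing becomes almost free, which matches the millisecond-level timings reported in the description below.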
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/step_by_step.md | Updates backend table entries (priority/packing format/requirements). |
| auto_round/modeling/fused_moe/moe_experts_interface.py | Refactors MoE projection naming and unfuse implementation; adds memory monitoring logs. |
| auto_round/inference/backend.py | Adjusts AutoGPTQ backend registration and requirements gating. |
| auto_round/compressors/shard_writer.py | Remaps expert parameter keys to {experts}.{idx}.{proj} style and skips meta tensors in finalize. |
verified with Qwen3.5-35B-A3B + vLLM
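For context, a linear-loop MoE forward with separate `gate_proj`/`up_proj` modules (rather than a fused `gate_up_proj`) can be sketched roughly as follows; the routing shapes, SwiGLU activation, and function name are assumptions for illustration, not the PR's actual code:

```python
import torch
import torch.nn.functional as F


def linear_loop_forward(hidden, gate_projs, up_projs, down_projs,
                        topk_ids, topk_weights):
    """Route each token through its selected experts one nn.Linear at a time.

    hidden:       (num_tokens, hidden_dim)
    topk_ids:     (num_tokens, top_k) expert indices per token
    topk_weights: (num_tokens, top_k) routing weights per token
    """
    out = torch.zeros_like(hidden)
    for expert_idx, (gate, up, down) in enumerate(
            zip(gate_projs, up_projs, down_projs)):
        # Find tokens routed to this expert and which top-k slot chose it.
        token_ids, slot = torch.where(topk_ids == expert_idx)
        if token_ids.numel() == 0:
            continue
        x = hidden[token_ids]
        # SwiGLU-style MLP: silu(gate(x)) * up(x), then down-projection.
        y = down(F.silu(gate(x)) * up(x))
        w = topk_weights[token_ids, slot].unsqueeze(-1)
        out.index_add_(0, token_ids, y * w)
    return out
```

Keeping `gate_proj` and `up_proj` as separate modules matches vLLM's per-expert naming, which is what lets the saved checkpoints load without extra key translation.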
1. As Liang's PR shows, transformers v5.2.0 transposes the weights. However, I don't see any transformers version check here; better add one if the code is not compatible with < 5.2.0.
2. Better test 2 more different model families.
3. Test the transformers backend as well.
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Tested with Qwen3-VL and Qwen3-Next on transformers 5.1.0 and 4.57.6.
Description
Remaps `experts.down_proj.[idx]` to `experts.[idx].down_proj` during saving.
2026-02-27 19:45:30 INFO moe_experts_interface.py L478: [MoE Prep] Before unfuse: 'peak_ram': 1.26GB
2026-02-27 19:45:47 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 40 MOE experts modules
2026-02-27 19:45:47 INFO moe_experts_interface.py L502: [MoE Prep] After unfuse: 'peak_ram': 41.34GB
2026-02-27 19:45:47 INFO replace_modules.py L81: Prepared 40 MOE modules for quantization
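The `experts.down_proj.[idx]` → `experts.[idx].down_proj` remap during saving can be sketched as a small key rewrite; the helper name and regex are assumptions for illustration, not the actual `shard_writer.py` code:

```python
import re


def remap_expert_key(key: str) -> str:
    """Rewrite 'experts.<proj>.<idx>.' segments to 'experts.<idx>.<proj>.'
    so saved shard keys match the per-expert nn.Linear layout."""
    return re.sub(
        r"experts\.(gate_proj|up_proj|down_proj)\.(\d+)\.",
        r"experts.\2.\1.",
        key,
    )
```

Keys that do not match the expert-projection pattern pass through unchanged, so non-MoE parameters are unaffected.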
Previous time:
2026-02-27 17:25:14 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.0.mlp.experts': 49542.75 ms
2026-02-27 17:26:03 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.1.mlp.experts': 47986.69 ms
2026-02-27 17:26:58 INFO moe_experts_interface.py L474: [MoE Prep] Unfused 'model.language_model.layers.2.mlp.experts': 55245.50 ms
Now:
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.0.mlp.experts': 243.57 ms
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.1.mlp.experts': 206.36 ms
2026-02-27 19:07:44 INFO moe_experts_interface.py L491: [MoE Prep] Unfused 'model.language_model.layers.2.mlp.experts': 189.15 ms
Type of Change
Related Issues
Fixes or relates to #1464
Checklist Before Submitting