Tp param level by 3outeille · Pull Request #46290 · huggingface/transformers

3outeille · 2026-05-29T19:42:34Z

Done:
- handling sparse + dense in the sp_plan for models like qwen3_moe
- better tests to check ep/tp/sp path during forward/backward
- Double check MoEExpertsParallel
- go from module level to param level ("layers.*.mlp.experts.gate_up_proj": "grouped_gemm", "layers.*.mlp.experts.down_proj": "grouped_gemm", "layers.*.mlp.experts": "moe_experts_allreduce")
- Double check ep_router => is it the same as main and Amine
- fixing ep_backward test
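
The param-level entries in the list above are keyed by patterns in which `*` stands for a layer index. The sketch below shows one way such a key could be matched against a concrete parameter or module name, assuming numeric path components are generalized to `*` before the lookup. `resolve_style` is a hypothetical helper written for illustration and is not taken from the PR.

```python
from __future__ import annotations

# EP plan entries quoted in the PR description.
EP_PLAN = {
    "layers.*.mlp.experts.gate_up_proj": "grouped_gemm",
    "layers.*.mlp.experts.down_proj": "grouped_gemm",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}


def resolve_style(name: str, plan: dict[str, str]) -> str | None:
    """Look up the plan style for a dotted name (hypothetical helper).

    Every purely numeric path component (a layer index) is replaced by "*",
    then the generalized name is looked up as an exact key.
    """
    generic = ".".join("*" if part.isdigit() else part for part in name.split("."))
    return plan.get(generic)


# Parameter-level entry: the fused expert weight gets a sharding style.
print(resolve_style("layers.3.mlp.experts.gate_up_proj", EP_PLAN))
# Module-level entry: the experts module itself keeps the forward-comm style.
print(resolve_style("layers.3.mlp.experts", EP_PLAN))
```

With this reading, a parameter and its parent module can carry different styles, which is what lets weight sharding and forward communication be declared separately.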

HuggingFaceDocBuilderDev · 2026-05-29T19:55:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Decompose MoE tensor/expert parallelism per review feedback: weight sharding is declared per-parameter, while the experts module entry stays forward-comm only.
- MoEParamShard: parameter-only style wrapping named expert weights as DTensor placeholders (no forward hook). grouped_gemm shards the expert dim and updates module.num_experts to the per-rank local count.
- Register grouped_gemm (Shard(0)), moe_gate_up_colwise (_StridedShard(-2)), moe_gate_up_colwise_alt (_StridedShard(-1)), moe_down_rowwise (Shard(-1)).
- EpRouterParallel (ep_router): forward-only slicing of router outputs to local experts, ported from the original RouterParallel (#39501).
- moe_experts_allreduce is now forward-comm only: strip the baked shard_plan and drop the now-dead shard_plan ctor arg / _moe_shard_plan / shard_parameters override from MoEExpertsParallel; skip _AllReduceBackward on routing weights under EP.
- verify_tp_plan: treat moe_experts_allreduce / ep_router as forward-only.
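
To make the placements above more concrete, here is a minimal sketch of the index arithmetic they imply. It does not use DTensor. It assumes even divisibility, and it assumes the fused gate_up dimension stores the gate half followed by the up half, so that a strided shard gives each rank the same slice of both halves. Both function names are hypothetical.

```python
from __future__ import annotations


def local_expert_range(num_experts: int, world_size: int, rank: int) -> range:
    """Experts owned by `rank` when the expert dim is split evenly (Shard(0)-like)."""
    if num_experts % world_size != 0:
        raise ValueError("sketch assumes num_experts is divisible by world_size")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def fused_gate_up_indices(intermediate: int, world_size: int, rank: int) -> list[int]:
    """Indices owned by `rank` along a fused [gate; up] dim of size 2 * intermediate.

    Assumption: gate occupies [0, intermediate) and up occupies
    [intermediate, 2 * intermediate); each rank takes the same slice of both.
    """
    if intermediate % world_size != 0:
        raise ValueError("sketch assumes intermediate is divisible by world_size")
    chunk = intermediate // world_size
    gate = list(range(rank * chunk, (rank + 1) * chunk))
    up = [intermediate + i for i in gate]
    return gate + up


# 8 experts over 4 ranks: rank 1 holds experts 2 and 3, so its local count is 2.
print(list(local_expert_range(8, 4, 1)))
# Fused dim of size 8 (intermediate=4) over 2 ranks.
print(fused_gate_up_indices(4, 2, 0), fused_gate_up_indices(4, 2, 1))
```

The length of `local_expert_range(...)` corresponds to the per-rank local count that the commit message says is written back to module.num_experts.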

Run tensor parallelism in two passes:
- Pass 1 (param-level): walk named_parameters() and, for styles in PARAM_ONLY_STYLES (grouped_gemm, moe_gate_up_colwise[_alt], moe_down_rowwise), shard the parameter directly via shard_parameters(). No forward hook.
- Pass 2 (module-level): the existing named_modules() loop for forward hooks, now skipping PARAM_ONLY_STYLES.

Param sharding runs first so module forward hooks (moe_experts_allreduce) see the already-sharded DTensor params. Also wire the EP-plan fallback so enable_expert_parallel uses model._ep_plan when no explicit plan is passed.
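
The ordering of the two passes can be sketched without any distributed setup. The toy below records which action each pass would take instead of sharding anything. `apply_plan` and `generalize` are hypothetical names, and the lists of names stand in for what named_parameters() and named_modules() would yield.

```python
from __future__ import annotations

# Style names as listed in the commit message above.
PARAM_ONLY_STYLES = {
    "grouped_gemm",
    "moe_gate_up_colwise",
    "moe_gate_up_colwise_alt",
    "moe_down_rowwise",
}


def generalize(name: str) -> str:
    """Replace numeric path components (layer indices) with "*"."""
    return ".".join("*" if part.isdigit() else part for part in name.split("."))


def apply_plan(param_names, module_names, plan):
    """Return the ordered (action, name, style) steps of a two-pass application."""
    actions = []
    # Pass 1: parameters with a param-only style are sharded directly, no hook.
    for name in param_names:
        style = plan.get(generalize(name))
        if style in PARAM_ONLY_STYLES:
            actions.append(("shard_param", name, style))
    # Pass 2: modules get forward hooks; param-only styles are skipped here.
    for name in module_names:
        style = plan.get(generalize(name))
        if style is not None and style not in PARAM_ONLY_STYLES:
            actions.append(("install_hook", name, style))
    return actions


plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
params = ["layers.0.mlp.experts.gate_up_proj", "layers.0.mlp.experts.down_proj"]
modules = ["layers.0.mlp", "layers.0.mlp.experts"]
for step in apply_plan(params, modules, plan):
    print(step)
```

Because all shard_param steps precede every install_hook step, a hook installed in pass 2 would only ever observe parameters that pass 1 has already handled.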

…tion
- New tests/distributed/test_moe_tensor_parallel_plan.py: plan resolution, placement expectations for grouped_gemm / moe_gate_up_colwise[_alt] / moe_down_rowwise, gloo distributed integration (EP Shard(0), TP _StridedShard(-2)+Shard(-1), ep_router slicing), and a registry guard that moe_experts_allreduce carries no baked shard plan.
- _verify_tp_sharding: add a two-sided check asserting that every parameter whose plan entry is a weight-sharding style actually comes back as a non-replicate DTensor. The prior check only validated params that happened to be sharded, so a style that gracefully degrades to replicated when unsharded (e.g. MoEExpertsParallel) could pass output-equality while silently running unparallelized.
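
The two-sided check described above can be illustrated with plain dictionaries. In this sketch, `placements` maps a parameter name to a placement label and stands in for inspecting real DTensor placements; a parameter missing from the map is treated as a plain, unsharded tensor. The function name and the label strings are assumptions made for the example.

```python
from __future__ import annotations

# Weight-sharding styles named in the commit messages above.
WEIGHT_SHARDING_STYLES = {"grouped_gemm", "moe_gate_up_colwise", "moe_down_rowwise"}


def generalize(name: str) -> str:
    """Replace numeric path components (layer indices) with "*"."""
    return ".".join("*" if part.isdigit() else part for part in name.split("."))


def find_silently_replicated(param_names, placements, plan):
    """Return params whose plan entry demands sharding but that are not sharded.

    This is the direction the older check missed: it starts from the plan,
    not from the set of parameters that happen to be sharded.
    """
    offenders = []
    for name in param_names:
        style = plan.get(generalize(name))
        if style not in WEIGHT_SHARDING_STYLES:
            continue  # no sharding expected for this parameter
        if placements.get(name, "Replicate") == "Replicate":
            offenders.append(name)
    return offenders


plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
}
names = ["layers.0.mlp.experts.gate_up_proj", "layers.0.mlp.experts.down_proj"]
# down_proj degraded to replicated: outputs could still match, but nothing is parallel.
observed = {"layers.0.mlp.experts.gate_up_proj": "_StridedShard(-2)"}
print(find_silently_replicated(names, observed, plan))
```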

For every TP/SP plan that sharded experts, declare per-parameter entries:
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise"
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise"
while keeping the forward-only "layers.*.mlp.experts": "moe_experts_allreduce". This matches the now-empty moe_experts_allreduce shard_plan; sharding is declared in config at parameter granularity. EP plans already used "grouped_gemm" and are unchanged.

hy_v3 and laguna previously used "packed_colwise" / "rowwise_allreduce" on the 3D expert *parameters*; those styles are module-level and were silently no-ops on params (the bundled shard_plan did the work). They now use the param-level moe_gate_up_colwise / moe_down_rowwise like every other MoE model.

Edited modular files where they own the plan literal; generated configs and inherited plans (e.g. from qwen3_moe) propagated via modular conversion.
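
The hy_v3 / laguna issue above, a module-level style attached to an expert parameter key, is the kind of mistake a small lint over the plan literal could catch. The sketch below is hypothetical: the key suffixes and the `old_plan` literal are illustrative and are not copied from either model's config.

```python
from __future__ import annotations

# Styles the commit message describes as module-level.
MODULE_LEVEL_STYLES = {"packed_colwise", "rowwise_allreduce"}
# Plan keys that point at the fused 3D expert parameters.
EXPERT_PARAM_SUFFIXES = (".mlp.experts.gate_up_proj", ".mlp.experts.down_proj")


def misplaced_entries(plan: dict[str, str]) -> list[str]:
    """Return expert-parameter keys that carry a module-level style."""
    return sorted(
        key
        for key, style in plan.items()
        if key.endswith(EXPERT_PARAM_SUFFIXES) and style in MODULE_LEVEL_STYLES
    )


# Illustrative "before" plan in the shape the commit message describes.
old_plan = {
    "layers.*.mlp.experts.gate_up_proj": "packed_colwise",
    "layers.*.mlp.experts.down_proj": "rowwise_allreduce",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
# "After" plan using the param-level styles from this commit.
new_plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
print(misplaced_entries(old_plan))
print(misplaced_entries(new_plan))
```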

- expert_parallelism.md: describe the param-level decomposition (grouped_gemm, ep_router, moe_experts_allreduce) instead of the removed GroupedGemmParallel class, and note the TP equivalents (moe_gate_up_colwise / moe_down_rowwise).
- weightconverter.md: note that fused expert weights are sharded at parameter granularity by the parallel plan.

Rename registry and plan entries so TP-on-expert sharding is distinct from EP (grouped_gemm) and dense packed_colwise: moe_gate_up_colwise -> moe_tp_gate_up_colwise, moe_down_rowwise -> moe_tp_down_rowwise. Drop unused moe_tp_gate_up_colwise_alt (GPT-OSS-style layouts stay EP-only).
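
For a plan literal written against the earlier names, the rename above amounts to a value-level mapping. The helper below is a hypothetical migration sketch, not part of the PR. Rejecting the dropped `_alt` style with an error is a design choice made for this example.

```python
from __future__ import annotations

# Old style name -> new style name, as stated in the commit message.
RENAMES = {
    "moe_gate_up_colwise": "moe_tp_gate_up_colwise",
    "moe_down_rowwise": "moe_tp_down_rowwise",
}
# Style removed by this commit (pre-rename spelling).
DROPPED = {"moe_gate_up_colwise_alt"}


def migrate_plan(plan: dict[str, str]) -> dict[str, str]:
    """Return a copy of `plan` with renamed styles; fail on dropped ones."""
    migrated = {}
    for key, style in plan.items():
        if style in DROPPED:
            raise ValueError(f"{key!r} uses removed style {style!r}")
        # Styles that were not renamed (e.g. grouped_gemm) pass through unchanged.
        migrated[key] = RENAMES.get(style, style)
    return migrated


tp_plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
print(migrate_plan(tp_plan))
```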

github-actions · 2026-06-03T06:36:55Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_moe, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, ernie4_5_vl_moe, exaone_moe, flex_olmo, gemma4, glm4_moe, glm4_moe_lite, glm4v_moe, glm_moe_dsa, gpt_oss, hrm_text, hy_v3

github-actions · 2026-06-03T06:53:50Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46290&sha=b5d0bb

3outeille added 5 commits May 29, 2026 20:27

3outeille mentioned this pull request May 29, 2026

DIstributed branch base #46269

Open

3outeille force-pushed the tp_param_level branch from 2326431 to ec2212b Compare May 29, 2026 22:38

3outeille mentioned this pull request May 30, 2026

sp + ep training / tp + ep inference #46292

Draft

3outeille added 15 commits May 30, 2026 07:16

handle sparse and dense sp plan for qwen3_moe

654fa22

better tests coverage for sp & ep

57b5a6e

linting

e72ee62

uniformize TP Api to avoid confusion with torch native ops

2ff3821

inline tp

f549368

rename

4089761

cleaning

18bccfa

inline

5da7549

cleaning

a08c6ac

cleaning

b78f234

linting

4592599

fix ci ep_backward

229235d

linting

8cdcd88

remove flag expert parallel

6d0e153

fix

df59a89

3outeille marked this pull request as ready for review June 1, 2026 22:00

3outeille requested a review from ArthurZucker June 1, 2026 22:01

3outeille added 3 commits June 1, 2026 23:19

add tp plan + ep_plan

9add329

revert doc

4bc3980

fix install_forward

8393d5a

3outeille added 4 commits June 3, 2026 02:37

linting

36a3c37

add moe identity back

24d042e

no need aymore

69e118d

update tp_plan for ernie4_5_vl_moe

b5d0bba

3outeille wants to merge 28 commits into distributed from tp_param_level
