Tp param level #46290

Open
3outeille wants to merge 28 commits into distributed from tp_param_level

@3outeille (Member) commented May 29, 2026

Done:
- handling sparse + dense in the sp_plan for models like qwen3_moe
- better tests to check ep/tp/sp path during forward/backward
- Double check MoEExpertsParallel
go from module level to param level ("layers.*.mlp.experts.gate_up_proj": "grouped_gemm", "layers.*.mlp.experts.down_proj": "grouped_gemm", "layers.*.mlp.experts": "moe_experts_allreduce")
- Double check ep_router => is it the same as main and Amine
- fixing ep_backward test
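
As a rough illustration of the module-level to param-level move, the sketch below resolves concrete names against the plan entries quoted above. The helper `resolve_style` and the rule of replacing numeric indices with `*` before the lookup are assumptions made for this example, not the PR's actual lookup code.

```python
import re

# Plan entries as quoted in the description: two parameter-level entries
# plus one module-level entry for the experts container.
EP_PLAN = {
    "layers.*.mlp.experts.gate_up_proj": "grouped_gemm",
    "layers.*.mlp.experts.down_proj": "grouped_gemm",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}


def resolve_style(name, plan):
    """Return the plan style for a dotted name, or None (hypothetical helper).

    Assumption: numeric path components such as layer indices are
    generalized to "*" before the dictionary lookup.
    """
    generic_name = re.sub(r"\d+", "*", name)
    return plan.get(generic_name)


param_style = resolve_style("layers.3.mlp.experts.gate_up_proj", EP_PLAN)
module_style = resolve_style("layers.3.mlp.experts", EP_PLAN)
unplanned = resolve_style("layers.3.self_attn.q_proj.weight", EP_PLAN)
```

With this layout the parameter entries and the module entry can coexist: the parameter names carry the sharding style, and the parent module name carries only the forward-communication style.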

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

3outeille added 5 commits May 29, 2026 20:27
Decompose MoE tensor/expert parallelism per review feedback: weight sharding
is declared per-parameter, while the experts module entry stays forward-comm only.

- MoEParamShard: parameter-only style wrapping named expert weights as DTensor
  placeholders (no forward hook). grouped_gemm shards the expert dim and updates
  module.num_experts to the per-rank local count.
- Register grouped_gemm (Shard(0)), moe_gate_up_colwise (_StridedShard(-2)),
  moe_gate_up_colwise_alt (_StridedShard(-1)), moe_down_rowwise (Shard(-1)).
- EpRouterParallel (ep_router): forward-only slicing of router outputs to local
  experts, ported from the original RouterParallel (#39501).
- moe_experts_allreduce is now forward-comm only: strip the baked shard_plan and
  drop the now-dead shard_plan ctor arg / _moe_shard_plan / shard_parameters
  override from MoEExpertsParallel; skip _AllReduceBackward on routing weights
  under EP.
- verify_tp_plan: treat moe_experts_allreduce / ep_router as forward-only.
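
A minimal sketch of what sharding the expert dimension and slicing router outputs to local experts could look like, using plain Python lists instead of DTensors. All function names here are hypothetical, and the sketch assumes the expert count divides evenly across ranks; the real ep_router may do more than a column slice.

```python
def local_expert_slice(num_experts, world_size, rank):
    """Contiguous block of expert indices owned by `rank` (hypothetical helper)."""
    if num_experts % world_size != 0:
        raise ValueError("this sketch assumes num_experts divides evenly")
    local = num_experts // world_size
    return slice(rank * local, (rank + 1) * local)


def shard_experts(expert_weights, world_size, rank):
    """Keep only this rank's experts, i.e. a Shard(0)-style split on the expert dim."""
    return expert_weights[local_expert_slice(len(expert_weights), world_size, rank)]


def slice_router_scores(scores, world_size, rank):
    """For every token, keep only the score columns of the local experts."""
    return [row[local_expert_slice(len(row), world_size, rank)] for row in scores]


experts = [f"expert_{i}" for i in range(8)]
local_experts = shard_experts(experts, world_size=4, rank=1)
# Analogous to updating module.num_experts to the per-rank local count.
num_local_experts = len(local_experts)
local_scores = slice_router_scores([[0.1, 0.2, 0.3, 0.4]], world_size=2, rank=1)
```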

Run tensor parallelism in two passes:
- Pass 1 (param-level): walk named_parameters() and, for styles in PARAM_ONLY_STYLES
  (grouped_gemm, moe_gate_up_colwise[_alt], moe_down_rowwise), shard the parameter
  directly via shard_parameters(). No forward hook.
- Pass 2 (module-level): the existing named_modules() loop for forward hooks, now
  skipping PARAM_ONLY_STYLES.

Param sharding runs first so module forward hooks (moe_experts_allreduce) see the
already-sharded DTensor params. Also wire the EP-plan fallback so
enable_expert_parallel uses model._ep_plan when no explicit plan is passed.
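
The two-pass ordering can be sketched as follows. This is an illustration over plain name lists rather than a real `nn.Module`; `apply_tp_plan`, `lookup`, and the recorded action tuples are hypothetical, and the digit-to-`*` matching rule is an assumption.

```python
import re

PARAM_ONLY_STYLES = {
    "grouped_gemm",
    "moe_gate_up_colwise",
    "moe_gate_up_colwise_alt",
    "moe_down_rowwise",
}


def lookup(name, plan):
    """Match a concrete name against wildcard plan keys (assumed rule)."""
    return plan.get(re.sub(r"\d+", "*", name))


def apply_tp_plan(param_names, module_names, plan):
    """Record what each pass would do, in execution order."""
    actions = []
    # Pass 1 (param-level): shard parameters directly, no forward hook.
    for name in param_names:
        style = lookup(name, plan)
        if style in PARAM_ONLY_STYLES:
            actions.append(("shard_param", name, style))
    # Pass 2 (module-level): forward hooks only, skipping param-only styles.
    for name in module_names:
        style = lookup(name, plan)
        if style is not None and style not in PARAM_ONLY_STYLES:
            actions.append(("add_hook", name, style))
    return actions


plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
actions = apply_tp_plan(
    param_names=["layers.0.mlp.experts.gate_up_proj", "layers.0.mlp.experts.down_proj"],
    module_names=["layers.0.mlp", "layers.0.mlp.experts"],
    plan=plan,
)
```

Because pass 1 completes before pass 2 starts, every hook registered in pass 2 is attached to a module whose parameters are already in their sharded form.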

…tion

- New tests/distributed/test_moe_tensor_parallel_plan.py: plan resolution, placement
  expectations for grouped_gemm / moe_gate_up_colwise[_alt] / moe_down_rowwise, gloo
  distributed integration (EP Shard(0), TP _StridedShard(-2)+Shard(-1), ep_router
  slicing), and a registry guard that moe_experts_allreduce carries no baked shard plan.
- _verify_tp_sharding: add a two-sided check asserting that every parameter whose plan
  entry is a weight-sharding style actually comes back as a non-replicate DTensor. The
  prior check only validated params that happened to be sharded, so a style that
  gracefully degrades to replicated when unsharded (e.g. MoEExpertsParallel) could pass
  output-equality while silently running unparallelized.
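
A small sketch of why the second side of the check matters, assuming placements are represented as strings and an unsharded parameter shows up as `None`. The function names and data layout are hypothetical and do not mirror `_verify_tp_sharding`.

```python
UNSHARDED = (None, "Replicate")


def check_sharded_params(planned, observed):
    """Side 1 (prior behavior): only params that are sharded get validated."""
    problems = []
    for name, placement in observed.items():
        if placement not in UNSHARDED and planned.get(name) != placement:
            problems.append(f"{name}: found {placement}, planned {planned.get(name)}")
    return problems


def check_planned_params(planned, observed):
    """Side 2 (new): every param planned for sharding must really be sharded."""
    problems = []
    for name, expected in planned.items():
        if observed.get(name) in UNSHARDED:
            problems.append(f"{name}: planned {expected} but found unsharded")
    return problems


def verify_sharding(planned, observed):
    """Two-sided check: run both directions and collect all problems."""
    return check_sharded_params(planned, observed) + check_planned_params(planned, observed)


planned = {"experts.gate_up_proj": "Shard(0)", "experts.down_proj": "Shard(0)"}
# down_proj silently stayed a plain tensor, e.g. a style that degraded to replicated.
observed = {"experts.gate_up_proj": "Shard(0)", "experts.down_proj": None}

one_sided = check_sharded_params(planned, observed)
two_sided = verify_sharding(planned, observed)
```

In this example the one-sided check reports nothing, while the two-sided check flags `experts.down_proj`.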

For every TP/SP plan that sharded experts, declare per-parameter entries:
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise"
    "layers.*.mlp.experts.down_proj":    "moe_down_rowwise"
while keeping the forward-only "layers.*.mlp.experts": "moe_experts_allreduce".

This matches the now-empty moe_experts_allreduce shard_plan; sharding is declared in
config at parameter granularity. EP plans already used "grouped_gemm" and are unchanged.
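
To see why a column-wise up projection, a row-wise down projection, and a single all-reduce fit together, here is a toy single-expert example with integer matrices. It is a simplification under stated assumptions: it ignores the fused gate/up layout and strided sharding, uses ReLU as a stand-in activation, and simulates the all-reduce with a plain sum over two ranks.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, -2], [3, 4]]                       # 2 tokens, hidden size 2
w_up = [[1, 0, -1, 2], [2, 1, 0, -1]]       # hidden 2 -> intermediate 4
w_down = [[1, 0], [0, 1], [2, -1], [1, 1]]  # intermediate 4 -> hidden 2

# Unsharded reference.
full = matmul(relu(matmul(x, w_up)), w_down)

# Two ranks: each holds 2 columns of w_up and the matching 2 rows of w_down.
partials = []
for rank in range(2):
    lo, hi = rank * 2, rank * 2 + 2
    up_shard = [row[lo:hi] for row in w_up]  # column-wise shard
    down_shard = w_down[lo:hi]               # row-wise shard
    partials.append(matmul(relu(matmul(x, up_shard)), down_shard))

# The sum over ranks plays the role of the forward all-reduce.
reduced = add(partials[0], partials[1])
```

The elementwise activation acts independently on each intermediate column, so each rank can apply it locally and only the final partial outputs need to be summed.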

hy_v3 and laguna previously used "packed_colwise" / "rowwise_allreduce" on the 3D expert
*parameters*; those styles are module-level and were silently no-ops on params (the bundled
shard_plan did the work). They now use the param-level moe_gate_up_colwise / moe_down_rowwise
like every other MoE model.

Edited modular files where they own the plan literal; generated configs and inherited
plans (e.g. from qwen3_moe) propagated via modular conversion.

- expert_parallelism.md: describe the param-level decomposition (grouped_gemm,
  ep_router, moe_experts_allreduce) instead of the removed GroupedGemmParallel class,
  and note the TP equivalents (moe_gate_up_colwise / moe_down_rowwise).
- weightconverter.md: note that fused expert weights are sharded at parameter
  granularity by the parallel plan.

Rename registry and plan entries so TP-on-expert sharding is distinct
from EP (grouped_gemm) and dense packed_colwise: moe_gate_up_colwise ->
moe_tp_gate_up_colwise, moe_down_rowwise -> moe_tp_down_rowwise. Drop
unused moe_tp_gate_up_colwise_alt (GPT-OSS-style layouts stay EP-only).
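
For a plan written against the earlier names, the rename amounts to a simple mapping. The `migrate_plan` helper below is hypothetical and only illustrates the rename described above; treating the pre-rename `moe_gate_up_colwise_alt` as removed is an assumption.

```python
RENAMED_STYLES = {
    "moe_gate_up_colwise": "moe_tp_gate_up_colwise",
    "moe_down_rowwise": "moe_tp_down_rowwise",
}

# The "_alt" style was dropped; the old spelling is included as an assumption.
REMOVED_STYLES = {"moe_tp_gate_up_colwise_alt", "moe_gate_up_colwise_alt"}


def migrate_plan(plan):
    """Return a copy of `plan` using the renamed TP-on-expert styles."""
    migrated = {}
    for key, style in plan.items():
        if style in REMOVED_STYLES:
            raise ValueError(f"{key}: style {style!r} no longer exists")
        migrated[key] = RENAMED_STYLES.get(style, style)
    return migrated


old_plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
new_plan = migrate_plan(old_plan)
```

EP entries using `grouped_gemm` and the forward-only `moe_experts_allreduce` entry pass through unchanged.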

3outeille marked this pull request as ready for review June 1, 2026 22:00
3outeille requested a review from ArthurZucker June 1, 2026 22:01
@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_moe, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, ernie4_5_vl_moe, exaone_moe, flex_olmo, gemma4, glm4_moe, glm4_moe_lite, glm4v_moe, glm_moe_dsa, gpt_oss, hrm_text, hy_v3

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46290&sha=b5d0bb

2 participants