Skip to content

refactor + robust tests for Tensor Parallel #42809

Merged
3outeille merged 65 commits into main from
v5-test_tensor_parallel_moe
Jan 30, 2026

Conversation

@3outeille
Member

@3outeille 3outeille commented Dec 11, 2025

(image attached in the PR description)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


🚀 starting to look good!

@3outeille 3outeille force-pushed the v5-test_tensor_parallel_moe branch from 88989a6 to d677102 on December 17, 2025 at 15:26
@3outeille 3outeille changed the title from "add tensor parallel test for MoE" to "Fix tensor parallel for MoE" on Dec 19, 2025
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts
Member Author

@3outeille 3outeille Dec 19, 2025


In TP, all ranks have all experts, just with each expert's weights sharded across GPUs. If we use RouterParallel, we end up masking an expert that is actually needed (because EP assumes each GPU holds a few full experts, not every expert sharded across all GPUs), so that expert's output is missing, the output is only partial, and the all_reduce that happens at the local_rowwise down_proj is incorrect.
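
To make the failure mode concrete, here is a minimal sketch (not the transformers implementation; the function name, the plain softmax routing without top-k, and the 2-D weight shards are all made up for illustration) of a single TP rank's MoE forward: every rank runs every expert on its local weight shard, and the result only becomes correct after the final all_reduce, which is exactly what EP-style expert masking would break.

```python
import torch

num_experts, hidden, ffn, tp = 4, 8, 16, 2
rank_ffn = ffn // tp  # each rank holds a slice of every expert's intermediate dim

def moe_forward_tp_rank(x, gate_w, up_w_shard, down_w_shard):
    # gate_w is replicated (the "ep_router" plan above): identical routing on every rank
    scores = torch.softmax(x @ gate_w, dim=-1)            # (tokens, num_experts)
    out = torch.zeros_like(x)
    for e in range(num_experts):                          # every expert runs on every rank
        h = torch.relu(x @ up_w_shard[e])                 # local_colwise shard -> (tokens, rank_ffn)
        out += scores[:, e:e + 1] * (h @ down_w_shard[e]) # local_rowwise shard -> partial sum
    # torch.distributed.all_reduce(out) would sum the partial outputs across TP ranks here;
    # masking any expert on a rank (EP-style routing) would make this sum wrong.
    return out
```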

"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts
"layers.*.mlp.experts.gate_up_proj": "local_colwise",
"layers.*.mlp.experts.gate_up_proj": "local_packed_colwise",
Member Author


gate_up is packed, so we need to use a local version of packed_colwise.
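
As an illustration of what "packed colwise" means, here is a minimal sketch (hypothetical helper, assuming a 2-D [gate | up] packed weight; the real experts.gate_up_proj may carry an extra expert dimension): each half is sharded separately and re-packed per rank, rather than naively chunking the packed output dim, which would give one rank only gate columns and another only up columns.

```python
import torch

def packed_colwise_shard(gate_up: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # gate_up: (hidden, 2 * ffn) = [gate (hidden, ffn) | up (hidden, ffn)]
    gate, up = gate_up.chunk(2, dim=-1)
    gate_shard = gate.chunk(world_size, dim=-1)[rank]
    up_shard = up.chunk(world_size, dim=-1)[rank]
    # re-pack the per-rank halves: (hidden, 2 * ffn / world_size)
    return torch.cat([gate_shard, up_shard], dim=-1)
```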

@3outeille 3outeille changed the title from "Fix tensor parallel for MoE" to "Fix distributed training for MoE" on Dec 20, 2025
@3outeille
Member Author

run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/afmoe", "models/apertus", "models/arcee", "models/aria", "models/bamba", "models/cohere", "models/cohere2", "models/cwm", "models/dbrx", "models/deepseek_v2", "models/deepseek_v3", "models/diffllama", "models/doge", "models/dots1", "models/emu3", "models/ernie4_5"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice!
Thanks for taking all the feedback into account! Let's goooooo

Comment on lines +746 to +749
Args:
split_input: If True, splits replicated input before matmul. Use when input
comes from a non-parallelizable operation (chunk/slice).
Default False (expects pre-sharded input from colwise layer).
Collaborator


Okay, I am not certain here. QOL-wise it might be better to always split the input if its shape is not what we expect (which DTensor did on its own).
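
A rough sketch of the shape-based fallback being suggested (hypothetical helper and names, not the PR's code): infer from the last dimension whether the input is already sharded, and split it only when it is still replicated.

```python
import torch

def rowwise_input(x: torch.Tensor, local_in_features: int, rank: int, world_size: int) -> torch.Tensor:
    if x.shape[-1] == local_in_features:
        return x  # already sharded by a preceding colwise layer
    # replicated input (e.g. produced by a chunk/slice): take this rank's slice
    return x.chunk(world_size, dim=-1)[rank]
```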

outputs = outputs + mod._bias
# back to local tensor if use_local_output is True
return outputs
return parameter.to(device=device, dtype=dtype)
Collaborator


I think get_tensor_shard should already handle the device, and thus we should not need to move here, no?

Member


IMO it is maybe very slightly clearer to keep get_tensor_shard only for the splitting and move the device/dtype handling here, to avoid overloading the other; no strong opinion at all, both work.
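
For illustration, a tiny sketch of the separation of concerns being discussed (hypothetical names and signature, not the actual transformers API): the shard helper only slices, and the caller handles device/dtype placement.

```python
import torch

def get_tensor_shard_sketch(param: torch.Tensor, rank: int, world_size: int, dim: int) -> torch.Tensor:
    # splitting only, no placement
    return param.chunk(world_size, dim=dim)[rank]

def materialize_shard(param, rank, world_size, dim, device, dtype):
    shard = get_tensor_shard_sketch(param, rank, world_size, dim)
    # device/dtype handled by the caller, as suggested above
    return shard.to(device=device, dtype=dtype)
```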

Member

@Cyrilvallez Cyrilvallez left a comment


All right! Much better version, thanks a lot for taking all the changes into account! Great work! 🤗🚀
It is hard to make sure everything will work perfectly by simply reading distributed code though: could you notably confirm that chaining several "matching" layers, such as an MLP which contains 2 colwise and 1 rowwise, will only incur a communication at the very end, after the final rowwise?
Otherwise, my main comment is to remove all the redundant prepare_module_tp as they are all similar!

@3outeille
Member Author

> could you notably confirm that chaining several "matching" layers, such as an MLP which contains 2 colwise and 1 rowwise, will only incur a communication at the very end, after the final rowwise?

yes!
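
For reference, a minimal sketch of why that holds (illustrative only; the weight shards are assumed to be plain local tensors): both colwise matmuls produce local slices with no communication, the rowwise matmul yields a partial sum, and a single all_reduce at the very end combines the partial sums.

```python
import torch
import torch.nn.functional as F

def mlp_forward_tp_rank(x, gate_w_col, up_w_col, down_w_row):
    # colwise shards: each rank computes its own slice of the intermediate dim -> no communication
    h = F.silu(x @ gate_w_col) * (x @ up_w_col)
    # rowwise shard: each rank produces a partial sum over the full hidden dim
    partial = h @ down_w_row
    # torch.distributed.all_reduce(partial) is the single communication, after the final rowwise
    return partial
```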

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5

@3outeille 3outeille enabled auto-merge (squash) January 30, 2026 15:10
@3outeille 3outeille merged commit eaab9f2 into main Jan 30, 2026
26 checks passed
@3outeille 3outeille deleted the v5-test_tensor_parallel_moe branch January 30, 2026 15:22