refactor + robust tests for Tensor Parallel #42809
Conversation
eaeaae fix tensor parallel MoE test
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
force-pushed from 88989a6 to d677102
…l not transfer its attribute (especially _is_hf_initialized)
| "layers.*.self_attn.k_proj": "colwise", | ||
| "layers.*.self_attn.v_proj": "colwise", | ||
| "layers.*.self_attn.o_proj": "rowwise", | ||
| "layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts |
In TP, all ranks have all experts, but the weights are sharded across GPUs. If we use RouterParallel, we end up masking an expert that is actually needed (because EP assumes a GPU holds full experts, not every expert sharded across all GPUs), thus missing its output, producing a partial output, and getting an incorrect all_reduce at the local_rowwise in down_proj.
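A toy sketch of the failure mode described above, with made-up shapes and names (not the PR's code):

```python
import torch

# Toy numbers: 4 experts, hidden size 8, TP degree 2.
hidden, num_experts, world_size = 8, 4, 2

# In TP every rank keeps ALL experts, but each expert's down_proj is sharded
# on its input dimension, so per-expert outputs are partial sums that only
# become correct after an all_reduce over the TP group.
down_proj_shard = torch.randn(num_experts, hidden // world_size, hidden)

def moe_rank_output(expert_inputs, routed_experts):
    # expert_inputs: [num_experts, hidden // world_size], this rank's slice
    # of each expert's intermediate activation.
    out = torch.zeros(hidden)
    for e in routed_experts:
        out += expert_inputs[e] @ down_proj_shard[e]  # partial contribution
    return out  # still partial: needs an all_reduce across TP ranks

expert_inputs = torch.randn(num_experts, hidden // world_size)
partial = moe_rank_output(expert_inputs, routed_experts=[0, 2])

# If the router were EP-style (RouterParallel), expert 2 might be masked on
# this rank even though its sharded weights live here; its partial term would
# be missing and the later all_reduce in down_proj would sum incomplete outputs.
```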
| "layers.*.self_attn.o_proj": "rowwise", | ||
| "layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts | ||
| "layers.*.mlp.experts.gate_up_proj": "local_colwise", | ||
| "layers.*.mlp.experts.gate_up_proj": "local_packed_colwise", |
gate_up is packed so we need to use a local version of packed_colwise
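A toy illustration of why the packed gate_up weight needs its own colwise variant; the shapes and the splitting scheme here are assumptions for illustration only, not the PR's exact code:

```python
import torch

hidden, intermediate, world_size, rank = 8, 16, 2, 0

# gate_up_proj packs gate and up along the output dimension: [2 * intermediate, hidden].
gate_up = torch.randn(2 * intermediate, hidden)

# A naive colwise chunk over the packed dimension would hand rank 0 only the
# gate half and rank 1 only the up half. A "packed" colwise split shards the
# gate and up halves independently, so every rank keeps matching slices of both.
gate, up = gate_up.chunk(2, dim=0)
local_gate = gate.chunk(world_size, dim=0)[rank]
local_up = up.chunk(world_size, dim=0)[rank]
local_gate_up = torch.cat([local_gate, local_up], dim=0)
print(local_gate_up.shape)  # torch.Size([16, 8]) == [2 * intermediate / world_size, hidden]
```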
run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5
This comment contains models: ["models/afmoe", "models/apertus", "models/arcee", "models/aria", "models/bamba", "models/cohere", "models/cohere2", "models/cwm", "models/dbrx", "models/deepseek_v2", "models/deepseek_v3", "models/diffllama", "models/doge", "models/dots1", "models/emu3", "models/ernie4_5"]
CI Results: ✅ No failing test specific to this PR 🎉!
ArthurZucker left a comment
Very nice!
Thanks for taking all the feedback into account! Let's goooooo
Args:
    split_input: If True, splits replicated input before matmul. Use when input
        comes from a non-parallelizable operation (chunk/slice).
        Default False (expects pre-sharded input from colwise layer).
Okay, I am not certain here; QOL-wise it might be better to always split the input if the shape is not what we expect (which DTensor did on its own).
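A minimal sketch of that fallback, assuming a hypothetical rowwise forward; the helper and attribute names (`mod.weight`, `mod._bias`, `rank`, `world_size`) are illustrative, not the PR's actual hook:

```python
import torch
import torch.distributed as dist

def rowwise_forward(mod, x, rank, world_size):
    # mod.weight is the local shard: [out_features, in_features // world_size].
    local_in = mod.weight.shape[1]
    if x.shape[-1] == local_in * world_size:
        # Input is still replicated (e.g. it came from a chunk/slice that could
        # not be parallelized): take this rank's slice of the last dimension.
        x = x.split(local_in, dim=-1)[rank]
    # Otherwise the input is already sharded by the preceding colwise layer.
    out = torch.nn.functional.linear(x, mod.weight)
    dist.all_reduce(out)  # sum the partial results across TP ranks
    if getattr(mod, "_bias", None) is not None:
        out = out + mod._bias  # bias is added once, after the all_reduce
    return out
```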
outputs = outputs + mod._bias
# back to local tensor if use_local_output is True
return outputs
return parameter.to(device=device, dtype=dtype)
I think get_tensor_shard should already use the device, and thus we should not need to move it here, no?
IMO maybe very slightly more clear to keep get_tensor_shard only for the splitting and then move device/dtype here to avoid overloading the other, but no strong opinions at all, both work
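A minimal sketch of the split of responsibilities being discussed; the signature of `get_tensor_shard` is assumed here for illustration:

```python
import torch

def get_tensor_shard(param: torch.Tensor, rank: int, world_size: int, dim: int = 0) -> torch.Tensor:
    # Only responsible for the slicing: return this rank's shard along `dim`.
    return param.chunk(world_size, dim=dim)[rank]

# Device/dtype placement stays at the call site, so the helper is not overloaded:
#   parameter = get_tensor_shard(parameter, rank, world_size, dim=0)
#   return parameter.to(device=device, dtype=dtype)
```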
All right! Much better version, thanks a lot for taking all the changes into account! Great work! 🤗🚀
Hard to make sure everything will work perfectly by simply reading distributed code, though. Could you notably confirm that chaining several "matching" layers, such as an MLP which contains 2 colwise and 1 rowwise, will only incur a communication at the very end, after the final rowwise?
Otherwise, my main comment is to remove all the redundant prepare_module_tp as they are all similar!
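For reference, the property asked about above is the classic colwise-then-rowwise pattern; a toy single-rank-view sketch with illustrative names, not the PR's implementation:

```python
import torch
import torch.distributed as dist

def tp_mlp_forward(x, gate_w, up_w, down_w):
    # gate_w, up_w are column-sharded: [intermediate // world_size, hidden].
    # No communication needed: each rank computes its own slice of the
    # intermediate activation from the replicated input x.
    h = torch.nn.functional.silu(x @ gate_w.T) * (x @ up_w.T)
    # down_w is row-sharded: [hidden, intermediate // world_size], so the
    # matmul yields a partial sum on every rank.
    out = h @ down_w.T
    # The only communication for the whole MLP: one all_reduce after the
    # final rowwise layer.
    dist.all_reduce(out)
    return out
```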
IMO maybe very slightly more clear to keep get_tensor_shard only for the splitting and then move device/dtype here to avoid overloading the other, but no strong opinions at all, both work
yes !
[For maintainers] Suggested jobs to run (before merge): run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5