Skip to content

refactor + robust tests for Tensor Parallel #42809

Merged
3outeille merged 65 commits into main from
v5-test_tensor_parallel_moe
Jan 30, 2026

Conversation

@3outeille
Member

@3outeille 3outeille commented Dec 11, 2025

(image attached in the PR description)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


🚀 starting to look good!

@3outeille 3outeille force-pushed the v5-test_tensor_parallel_moe branch from 88989a6 to d677102 on December 17, 2025 at 15:26
@3outeille 3outeille changed the title from "add tensor parallel test for MoE" to "Fix tensor parallel for MoE" on Dec 19, 2025
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts
Member Author

@3outeille 3outeille Dec 19, 2025


In TP, all ranks have all experts, just with each expert's weights sharded across GPUs. If we use RouterParallel, we end up masking an expert that is actually needed (because EP assumes each GPU holds a few full experts, not every expert sharded across all GPUs), so that expert's output is missing, the output is only partial, and the all_reduce that happens at the local_rowwise down_proj is incorrect.
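
To make the failure mode concrete, here is a minimal sketch (not the transformers implementation; the function name, the plain softmax routing without top-k, and the 2-D weight shards are all made up for illustration) of a single TP rank's MoE forward: every rank runs every expert on its local weight shard, and the result only becomes correct after the final all_reduce, which is exactly what EP-style expert masking would break.

```python
import torch

num_experts, hidden, ffn, tp = 4, 8, 16, 2
rank_ffn = ffn // tp  # each rank holds a slice of every expert's intermediate dim

def moe_forward_tp_rank(x, gate_w, up_w_shard, down_w_shard):
    # gate_w is replicated (the "ep_router" plan above): identical routing on every rank
    scores = torch.softmax(x @ gate_w, dim=-1)            # (tokens, num_experts)
    out = torch.zeros_like(x)
    for e in range(num_experts):                          # every expert runs on every rank
        h = torch.relu(x @ up_w_shard[e])                 # local_colwise shard -> (tokens, rank_ffn)
        out += scores[:, e:e + 1] * (h @ down_w_shard[e]) # local_rowwise shard -> partial sum
    # torch.distributed.all_reduce(out) would sum the partial outputs across TP ranks here;
    # masking any expert on a rank (EP-style routing) would make this sum wrong.
    return out
```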

"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate": "ep_router", # we need to replicate here to correctly route experts
"layers.*.mlp.experts.gate_up_proj": "local_colwise",
"layers.*.mlp.experts.gate_up_proj": "local_packed_colwise",
Member Author


gate_up is packed, so we need to use a local version of packed_colwise.
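
As an illustration of what "packed colwise" means, here is a minimal sketch (hypothetical helper, assuming a 2-D [gate | up] packed weight; the real experts.gate_up_proj may carry an extra expert dimension): each half is sharded separately and re-packed per rank, rather than naively chunking the packed output dim, which would give one rank only gate columns and another only up columns.

```python
import torch

def packed_colwise_shard(gate_up: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # gate_up: (hidden, 2 * ffn) = [gate (hidden, ffn) | up (hidden, ffn)]
    gate, up = gate_up.chunk(2, dim=-1)
    gate_shard = gate.chunk(world_size, dim=-1)[rank]
    up_shard = up.chunk(world_size, dim=-1)[rank]
    # re-pack the per-rank halves: (hidden, 2 * ffn / world_size)
    return torch.cat([gate_shard, up_shard], dim=-1)
```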

@3outeille 3outeille changed the title from "Fix tensor parallel for MoE" to "Fix distributed training for MoE" on Dec 20, 2025
@3outeille
Member Author

run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/afmoe", "models/apertus", "models/arcee", "models/aria", "models/bamba", "models/cohere", "models/cohere2", "models/cwm", "models/dbrx", "models/deepseek_v2", "models/deepseek_v3", "models/diffllama", "models/doge", "models/dots1", "models/emu3", "models/ernie4_5"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice!
Thanks for taking all the feedback into account! Let's goooooo

Comment on lines +746 to +749
Args:
split_input: If True, splits replicated input before matmul. Use when input
comes from a non-parallelizable operation (chunk/slice).
Default False (expects pre-sharded input from colwise layer).
Collaborator


Okay, I am not certain here. QOL-wise it might be better to always split the input if its shape is not what we expect (which DTensor did on its own).
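
A rough sketch of the shape-based fallback being suggested (hypothetical helper and names, not the PR's code): infer from the last dimension whether the input is already sharded, and split it only when it is still replicated.

```python
import torch

def rowwise_input(x: torch.Tensor, local_in_features: int, rank: int, world_size: int) -> torch.Tensor:
    if x.shape[-1] == local_in_features:
        return x  # already sharded by a preceding colwise layer
    # replicated input (e.g. produced by a chunk/slice): take this rank's slice
    return x.chunk(world_size, dim=-1)[rank]
```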

outputs = outputs + mod._bias
# back to local tensor if use_local_output is True
return outputs
return parameter.to(device=device, dtype=dtype)
Collaborator


I think get_tensor_shard should already handle the device, and thus we should not need to move here, no?

Member


IMO it is maybe very slightly clearer to keep get_tensor_shard only for the splitting and move the device/dtype handling here, to avoid overloading the other; no strong opinion at all, both work.
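
For illustration, a tiny sketch of the separation of concerns being discussed (hypothetical names and signature, not the actual transformers API): the shard helper only slices, and the caller handles device/dtype placement.

```python
import torch

def get_tensor_shard_sketch(param: torch.Tensor, rank: int, world_size: int, dim: int) -> torch.Tensor:
    # splitting only, no placement
    return param.chunk(world_size, dim=dim)[rank]

def materialize_shard(param, rank, world_size, dim, device, dtype):
    shard = get_tensor_shard_sketch(param, rank, world_size, dim)
    # device/dtype handled by the caller, as suggested above
    return shard.to(device=device, dtype=dtype)
```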

Member

@Cyrilvallez Cyrilvallez left a comment


All right! Much better version, thanks a lot for taking all the changes into account! Great work! 🤗🚀
It is hard to make sure everything will work perfectly by simply reading distributed code though: could you notably confirm that chaining several "matching" layers, such as an MLP which contains 2 colwise and 1 rowwise, will only incur a communication at the very end, after the final rowwise?
Otherwise, my main comment is to remove all the redundant prepare_module_tp as they are all similar!

@3outeille
Member Author

> could you notably confirm that chaining several "matching" layers, such as an MLP which contains 2 colwise and 1 rowwise, will only incur a communication at the very end, after the final rowwise?

yes!
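
For reference, a minimal sketch of why that holds (illustrative only; the weight shards are assumed to be plain local tensors): both colwise matmuls produce local slices with no communication, the rowwise matmul yields a partial sum, and a single all_reduce at the very end combines the partial sums.

```python
import torch
import torch.nn.functional as F

def mlp_forward_tp_rank(x, gate_w_col, up_w_col, down_w_row):
    # colwise shards: each rank computes its own slice of the intermediate dim -> no communication
    h = F.silu(x @ gate_w_col) * (x @ up_w_col)
    # rowwise shard: each rank produces a partial sum over the full hidden dim
    partial = h @ down_w_row
    # torch.distributed.all_reduce(partial) is the single communication, after the final rowwise
    return partial
```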

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, apertus, arcee, aria, bamba, cohere, cohere2, cwm, dbrx, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5

@3outeille 3outeille enabled auto-merge (squash) January 30, 2026 15:10
@3outeille 3outeille merged commit eaab9f2 into main Jan 30, 2026
26 checks passed
@3outeille 3outeille deleted the v5-test_tensor_parallel_moe branch January 30, 2026 15:22