[GPU] Add all-to-all support to S-curve model #34143
felixwqp
left a comment
thank you Terry!
// `num_nodes`: the number of nodes participating in the all-to-all.
// `num_communicators`: the number of communicators participating in the
// all-to-all.
absl::StatusOr<absl::Duration> AllToAllLatency(int64_t buff_size_bytes,
For consistency, since we're adding AllToAllLatency, perhaps we should rename RingLatency to RingCollectiveLatency? That would make clear that both interfaces are named after the communication operation they model.
I think "ring" is more of a pattern name. I haven't found a good pattern name for all-to-all, so I'm using the collective name here -- it's one of a kind, so there shouldn't be much ambiguity.
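For context on the interface being discussed, here is a minimal, hypothetical sketch of an S-curve latency model for all-to-all. It is not the actual XLA SolGPUCostModel API: the function names, the logistic form of the bandwidth curve, and all constants (peak bandwidth, midpoint, steepness) are illustrative assumptions. The idea is that effective bus bandwidth ramps up with message size along an S-shaped saturation curve toward the peak link bandwidth.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical S-curve: effective bandwidth (GB/s) saturates logistically
// toward `peak_gbps` as the message size grows past `midpoint_bytes`.
double EffectiveBandwidthGBps(double bytes, double peak_gbps,
                              double midpoint_bytes, double steepness) {
  return peak_gbps /
         (1.0 + std::exp(-steepness *
                         (std::log2(bytes) - std::log2(midpoint_bytes))));
}

// Hypothetical all-to-all latency in microseconds: each of the `num_ranks`
// ranks sends buff_size_bytes / num_ranks to every other rank, so the total
// traffic leaving one rank is per_peer * (num_ranks - 1).
double AllToAllLatencyUs(int64_t buff_size_bytes, int num_ranks,
                         double peak_gbps) {
  double per_peer = static_cast<double>(buff_size_bytes) / num_ranks;
  double total_sent = per_peer * (num_ranks - 1);
  double bw = EffectiveBandwidthGBps(total_sent, peak_gbps,
                                     /*midpoint_bytes=*/1 << 20,
                                     /*steepness=*/1.0);
  // GB/s is 1e3 bytes/us, so time_us = bytes / (bw * 1e3).
  return total_sent / (bw * 1e3);
}
```

At the midpoint the logistic yields exactly half the peak bandwidth, which is why small messages are modeled as far below line rate while large messages approach it.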
 /*num_nodes=*/1, absl::Microseconds(292)},
{SolGPUCostModel::CollectiveType::kSendRecv,
 /*num_nodes=*/2, absl::Microseconds(485)},
{SolGPUCostModel::CollectiveType::kAllToAll,
Can you double-check that the single-host (intra-host) all-to-all uses the perf table and the multi-host (cross-partition) all-to-all uses the new model you are adding?
I ask because the function CommunicationType is used to determine the type of an all-to-all, and hence which model is used to estimate it.
We may need to make sure all-to-all is properly classified in CommunicationType.
With that said, can you add another unit test to xla/service/gpu/model/sol_latency_estimator_test.cc? We need to ensure the all-to-all is dispatched to the right underlying model (S-curve or perf table).
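The dispatch behavior the reviewer is asking about can be sketched as follows. This is a hypothetical illustration, not the actual XLA code: CommunicationType in XLA is a real function but its signature and return type here are invented for the sketch. The point is simply that classification (single-host vs. cross-host) decides which estimator runs.

```cpp
#include <string>

// Illustrative classification: an all-to-all that spans more than one host
// is "cross-partition"; otherwise it stays inside one host's NVLink domain.
enum class CommunicationType { kSingleHost, kCrossHost };

CommunicationType ClassifyAllToAll(int num_hosts) {
  return num_hosts > 1 ? CommunicationType::kCrossHost
                       : CommunicationType::kSingleHost;
}

// Illustrative dispatch: intra-host all-to-all is interpolated from a
// measured perf table; cross-host all-to-all uses the analytical S-curve
// model added in this PR.
std::string ModelForAllToAll(int num_hosts) {
  switch (ClassifyAllToAll(num_hosts)) {
    case CommunicationType::kSingleHost:
      return "perf-table";  // interpolate measured timings
    case CommunicationType::kCrossHost:
      return "s-curve";     // analytical bandwidth-saturation model
  }
  return "unknown";
}
```

A unit test for this dispatch would assert on exactly these two branches, which is what the model-dispatching test requested below covers.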
Added a model dispatching test.
                    interpolator.get())
                    .ok());

// IB collective should use S-curve model (world-level across 2 hosts).
Can you rephrase it as "cross-partition" collective?
IB may not apply to every XLA user's transport.
Rephrased.
Imported from GitHub PR #34143

📝 Summary of Changes
Added all-to-all support to the S-curve model.

🎯 Justification
The S-curve model doesn't support all-to-all, and falling back may lead to bad performance. Benchmarking showed that the added all-to-all model improves performance for models with cross-NVL-domain all-to-all.

🚀 Kind of Contribution
⚡️ Performance Improvement / ✨ New Feature

📊 Benchmark (for Performance Improvements)

| Branch | End-to-end execution time mean on mixtral_8x7b_bf16_2x8 |
| :------- | :------: |
| main | 1128328 us |
| terryysun/a2a_s_curve (this branch) | 1009397 us |

Speedup over main: 11.78%.

🧪 Unit Tests: Added exact-matching unit tests to guard the estimation value.
🧪 Execution Tests: Added execution tests to guard the comm-compute overlapping behavior.

Copybara import of the project:

-- 794ef56 by Terry Sun <tesun@nvidia.com>: s-curve a2a support
-- 4f85dae by Terry Sun <tesun@nvidia.com>: fix buffer size calculation
-- 1dc94f7 by Terry Sun <tesun@nvidia.com>: add LHS test
-- 56aef84 by Terry Sun <tesun@nvidia.com>: add model dispatching test
-- d20ed93 by Terry Sun <tesun@nvidia.com>: fix merge issue
-- f09474b by Terry Sun <tesun@nvidia.com>: rephase doc string

Merging this change closes #34143

FUTURE_COPYBARA_INTEGRATE_REVIEW=#34143 from terryysun:terryysun/a2a_s_curve f09474b
PiperOrigin-RevId: 842901679
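The reported 11.78% follows from the benchmark table, taking "speedup over main" as the ratio of main's mean execution time to the branch's, minus one. A quick sanity check (illustrative helper, not part of the PR):

```cpp
// Speedup of `branch` over `main` in percent, defined as main/branch - 1:
// with main at 1128328 us and the branch at 1009397 us this gives ~11.78%.
double SpeedupPercent(double main_us, double branch_us) {
  return (main_us / branch_us - 1.0) * 100.0;
}
```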