
Conversation

@terryysun terryysun (Contributor) commented Nov 19, 2025

📝 Summary of Changes
Added all-to-all support to S-curve model.

🎯 Justification
The S-curve model doesn't support all-to-all, and falling back to other estimates may lead to poor performance. Benchmarking showed that the added all-to-all model improves performance for models with cross-NVL-domain all-to-all.

🚀 Kind of Contribution
⚡️ Performance Improvement/✨ New Feature

📊 Benchmark (for Performance Improvements)

| Branch | End-to-end execution time mean on mixtral_8x7b_bf16_2x8 |
| :------- | :------: |
| main     | 1128328 us   |
| terryysun/a2a_s_curve (this branch)   | 1009397 us |

Speedup over main: 11.78%.
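For reference, the reported figure follows from the table above when speedup is computed as time saved relative to the branch's execution time (a sketch, not project code):

```python
# Re-derivation of the reported speedup from the benchmark table.
main_us = 1_128_328    # mean end-to-end time on main
branch_us = 1_009_397  # mean end-to-end time on this branch

speedup_pct = (main_us - branch_us) / branch_us * 100
print(f"{speedup_pct:.2f}%")  # -> 11.78%
```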

🧪 Unit Tests:
Added exact-matching unit tests to guard the estimated values.

🧪 Execution Tests:
Added execution tests to guard the communication/compute overlap behavior.

@terryysun terryysun changed the title Terryysun/a2a s curve [GPU] Add all-to-all support to S-curve model Nov 25, 2025
@felixwqp felixwqp (Member) left a comment

thank you Terry!

// `num_nodes`: the number of nodes participating in the all-to-all.
// `num_communicators`: the number of communicators participating in the
// all-to-all.
absl::StatusOr<absl::Duration> AllToAllLatency(int64_t buff_size_bytes,
Member:

To improve consistency, since we're adding AllToAllLatency, perhaps we should rename RingLatency to RingCollectiveLatency? That would clarify that both interface names refer to the communication operation.

Contributor Author:

I think "ring" is more of a pattern name. I haven't found a good pattern name for all-to-all, so I'm using the collective name here -- it's one of a kind, so there shouldn't be much ambiguity.
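For intuition only: an "S-curve" cost model typically maps message size to an effective bandwidth that ramps up sigmoidally and saturates at the link peak. A minimal sketch, with entirely hypothetical function names and parameter values (the actual XLA model and its constants differ):

```python
import math

def s_curve_latency_us(buff_size_bytes: int,
                       peak_bw_gbps: float = 400.0,
                       midpoint_log2: float = 20.0,
                       steepness: float = 2.0) -> float:
    """Toy S-curve latency estimate: effective bandwidth ramps up
    sigmoidally with message size, then saturates at the peak."""
    x = math.log2(max(buff_size_bytes, 1))
    eff_bw_gbps = peak_bw_gbps / (1.0 + math.exp(-(x - midpoint_log2) / steepness))
    # bytes / (GB/s) gives seconds; scale to microseconds.
    return buff_size_bytes / (eff_bw_gbps * 1e9) * 1e6

# Small transfers see only a fraction of peak bandwidth;
# large transfers approach it.
print(s_curve_latency_us(1 << 20))  # 1 MiB, bandwidth-limited regime
print(s_curve_latency_us(1 << 30))  # 1 GiB, near peak bandwidth
```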

/*num_nodes=*/1, absl::Microseconds(292)},
{SolGPUCostModel::CollectiveType::kSendRecv,
/*num_nodes=*/2, absl::Microseconds(485)},
{SolGPUCostModel::CollectiveType::kAllToAll,
Member:

Can you double-check that single-host (intra-host) all-to-all uses the perf table and multi-host (cross-partition) all-to-all uses the new model you're adding?

I ask because the function CommunicationType determines the type of the all-to-all, and hence which model is used to estimate it.

We may need to make sure the all-to-all is properly classified in CommunicationType.

With that said, can you add another unit test to xla/service/gpu/model/sol_latency_estimator_test.cc? We need to ensure the all-to-all is dispatched to the proper underlying model (S-curve or perf table).

Contributor Author:

Added a model dispatching test.
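The dispatching behavior under discussion can be sketched as follows; the names here are illustrative, not the actual CommunicationType API:

```python
from enum import Enum

class Model(Enum):
    PERF_TABLE = "perf-table"  # measured, single-host lookups
    S_CURVE = "s-curve"        # analytical, cross-partition estimates

def dispatch_all_to_all(num_hosts_in_group: int) -> Model:
    """Toy dispatcher: intra-host all-to-all is looked up in the measured
    perf table; cross-partition all-to-all uses the S-curve model."""
    return Model.PERF_TABLE if num_hosts_in_group == 1 else Model.S_CURVE

assert dispatch_all_to_all(1) is Model.PERF_TABLE  # single-host
assert dispatch_all_to_all(2) is Model.S_CURVE     # cross-partition
```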

@terryysun terryysun requested a review from felixwqp December 2, 2025 18:40
@terryysun terryysun force-pushed the terryysun/a2a_s_curve branch from bc389ff to 0d9c022 on December 4, 2025 23:05
@terryysun terryysun force-pushed the terryysun/a2a_s_curve branch from 0d9c022 to 56aef84 on December 4, 2025 23:51
interpolator.get())
.ok());

// IB collective should use S-curve model (world-level across 2 hosts).
Member:

Can you rephrase it to "cross-partition" collective?

IB may not apply to all XLA users' transports.

Contributor Author:

Rephrased.

@terryysun terryysun requested a review from felixwqp December 9, 2025 04:20
copybara-service bot pushed a commit that referenced this pull request Dec 10, 2025
Imported from GitHub PR #34143

📝 Summary of Changes
Added all-to-all support to S-curve model.

🎯 Justification
S-curve model doesn't support all-to-all, fallback may lead to bad performance, benchmarking justified that the added all-to-all model can improve performance for models with cross-NVL domain all-to-all.

🚀 Kind of Contribution
⚡️ Performance Improvement/✨ New Feature

📊 Benchmark (for Performance Improvements)
| Branch | End-to-end execution time mean on mixtral_8x7b_bf16_2x8 |
| :------- | :------: |
| main     | 1128328 us   |
| terryysun/a2a_s_curve (this branch)   | 1009397 us |

Speedup over main: 11.78%.

🧪 Unit Tests:
Added exact-matching unit tests to guard the estimation value.

🧪 Execution Tests:
Added execution tests to guard the comm-compute overlapping behavior.

Copybara import of the project:

--
794ef56 by Terry Sun <tesun@nvidia.com>:

s-curve a2a support

--
4f85dae by Terry Sun <tesun@nvidia.com>:

fix buffer size calculation

--
1dc94f7 by Terry Sun <tesun@nvidia.com>:

add LHS test

--
56aef84 by Terry Sun <tesun@nvidia.com>:

add model dispatching test

--
d20ed93 by Terry Sun <tesun@nvidia.com>:

fix merge issue

--
f09474b by Terry Sun <tesun@nvidia.com>:

rephase doc string

Merging this change closes #34143

FUTURE_COPYBARA_INTEGRATE_REVIEW=#34143 from terryysun:terryysun/a2a_s_curve f09474b
PiperOrigin-RevId: 842901679
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 10, 2025
copybara-service bot pushed a commit that referenced this pull request Dec 11, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 11, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 11, 2025
Imported from GitHub PR openxla/xla#34143

Merging this change closes #34143

PiperOrigin-RevId: 843090023

2 participants