Add: L3 broadcast and all-to-all distributed collectives by georgebisbas · Pull Request #888 · hw-native-sys/simpler

georgebisbas · 2026-05-28T15:10:08Z

Complete the canonical collective set with two new examples that follow the existing scratch-window + TNOTIFY/TWAIT pattern used by allgather and reduce-scatter.

- broadcast_distributed: root stages, barrier, all ranks read root scratch
- all_to_all_distributed: dest-indexed scratch staging and peer gather
- pytest wrappers parametrize 2 and 4 devices on a2a3sim/a2a3/a5sim
- README: index allgather, reduce_scatter, broadcast, and all_to_all rows

coderabbitai · 2026-05-28T15:10:26Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

Added two new L3 worker examples demonstrating distributed communication patterns. All-to-all implements symmetric 3-phase exchange with per-rank staging, cross-rank synchronization, and remote scratch reads. Broadcast implements 3-phase communication where root stages data, all ranks synchronize, then read from root. Both include AICORE kernels, C++ orchestration shims, Python drivers with CLI interfaces, and parametrized integration tests.

Changes

All-to-All Distributed Exchange

| Layer / File(s) | Summary |
| --- | --- |
| Documentation and package structure: `examples/workers/l3/README.md`, `examples/workers/l3/all_to_all_distributed/__init__.py` | README table extended with four new distributed communication examples; package marker created to enable test imports from local `main` module. |
| AICORE kernel and orchestration shim: `examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp`, `examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp` | All-to-all kernel executes 3-phase workflow: (1) each rank stages input to local scratch with MTE2/MTE3 flag synchronization, (2) cross-rank barrier using atomic add notifications and wait-on-signal primitives, (3) each rank reads remote peer scratches and writes to output. Orchestration shim exposes configuration returning 5 expected arguments and submission entry point mapping three tensors and two scalars. |
| Python orchestration example and validation: `examples/workers/l3/all_to_all_distributed/main.py`, `examples/workers/l3/all_to_all_distributed/test_all_to_all.py` | End-to-end driver parses device ranges (2–16 devices), compiles kernel and orchestration, allocates per-rank `Worker` domains with scratch buffers, submits orchestration DAG tasks, and validates output via golden reference with numeric tolerance; parametrized test runs on 2 and 4 device configurations. |
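
The three all-to-all phases summarized above can be modeled on the host as a rough sketch. This is not the PR's kernel or its golden reference: the function name `all_to_all_model`, the `chunk` parameter, and the assumption that each rank's output is ordered by source rank are all hypothetical, chosen only to illustrate dest-indexed staging followed by a peer gather.

```python
from typing import List


def all_to_all_model(inputs: List[List[float]], chunk: int) -> List[List[float]]:
    """Sequential host-side model of a 3-phase all-to-all (illustrative only).

    inputs[r] is rank r's send buffer: nranks chunks of length `chunk`,
    where chunk d is destined for rank d (an assumed layout).
    """
    nranks = len(inputs)
    for rank, row in enumerate(inputs):
        if len(row) != nranks * chunk:
            raise ValueError(f"rank {rank}: expected {nranks * chunk} elements")

    # Phase 1: every rank stages its input into its own scratch,
    # indexed by destination rank.
    scratch = [list(row) for row in inputs]

    # Phase 2: cross-rank barrier. In this sequential model all staging
    # has already finished, so nothing needs to happen here.

    # Phase 3: rank r reads slot r from every peer's scratch and
    # concatenates the slots in source-rank order (assumed ordering).
    outputs = []
    for r in range(nranks):
        out: List[float] = []
        for src in range(nranks):
            out.extend(scratch[src][r * chunk:(r + 1) * chunk])
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    result = all_to_all_model([[0, 1, 2, 3], [10, 11, 12, 13]], chunk=2)
    print(result)  # [[0, 1, 10, 11], [2, 3, 12, 13]]
```

A model like this is mainly useful for reasoning about which scratch slot each rank must read after the barrier; the actual driver validates against its own golden reference with a numeric tolerance.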

Broadcast Distributed Communication

| Layer / File(s) | Summary |
| --- | --- |
| Package structure: `examples/workers/l3/broadcast_distributed/__init__.py` | Package marker module with license header and docstring for test import behavior. |
| AICORE kernel and orchestration shim: `examples/workers/l3/broadcast_distributed/kernels/aiv/broadcast_kernel.cpp`, `examples/workers/l3/broadcast_distributed/kernels/orchestration/broadcast_orch.cpp` | Broadcast kernel executes 3-phase workflow: (1) root rank only stages input to local scratch with MTE flag synchronization, (2) all ranks participate in cross-rank barrier via atomic notifications and conditional waits, (3) all ranks use `CommRemotePtr` to compute root's remote scratch address, load via HCCL TLOAD/TSTORE with MTE synchronization, and write to output. Orchestration shim exposes configuration returning 6 expected arguments and submission entry point mapping three tensors and three scalars (nranks, root, CommContext). |
| Python orchestration example and validation: `examples/workers/l3/broadcast_distributed/main.py`, `examples/workers/l3/broadcast_distributed/test_broadcast.py` | End-to-end driver parses device ranges (2–16 devices), compiles kernel and orchestration with platform-specific text section extraction, allocates per-rank `Worker` domains with scratch and window buffers, submits broadcast DAG tasks with configurable root rank, and validates output via golden reference; parametrized test runs on 2 and 4 device configurations. |
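
The broadcast phases described above can likewise be sketched as a small host-side model. The name `broadcast_model` and the list-of-lists representation are hypothetical and stand in for the kernel's scratch windows; the sketch only shows that non-root inputs never reach scratch and that every rank, root included, reads from the root's slot after the barrier.

```python
from typing import List, Optional


def broadcast_model(inputs: List[List[float]], root: int) -> List[List[float]]:
    """Sequential host-side model of a 3-phase broadcast (illustrative only)."""
    nranks = len(inputs)
    if not 0 <= root < nranks:
        raise ValueError(f"root {root} out of range for {nranks} ranks")

    # Phase 1: only the root stages its input into its local scratch.
    scratch: List[Optional[List[float]]] = [None] * nranks
    scratch[root] = list(inputs[root])

    # Phase 2: cross-rank barrier. Sequential execution means the root's
    # staging is already complete before any rank reads.

    # Phase 3: every rank reads the root's scratch and writes its output.
    return [list(scratch[root]) for _ in range(nranks)]


if __name__ == "__main__":
    print(broadcast_model([[1, 2], [3, 4], [5, 6]], root=1))
    # [[3, 4], [3, 4], [3, 4]]
```

Varying `root` in such a model mirrors the configurable root rank the driver exposes.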

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and clearly summarizes the main change: adding two new distributed collective examples (broadcast and all-to-all) at the L3 worker level. |
| Description check | ✅ Passed | The description is well-related to the changeset, detailing the two new collectives, their implementation pattern, test coverage, and documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist

Code Review

This pull request introduces new distributed communication examples for Level 3 workers, specifically implementing all_to_all_distributed and broadcast_distributed. Each example includes a symmetric 3-phase C++ kernel utilizing HCCL-window scratch patterns, an orchestration shim, a Python main script for end-to-end execution, and corresponding pytest suites. The main README has also been updated to document these additions. There are no review comments, so no feedback is provided.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp`:
- Around line 108-118: The barrier can pass early because signal slots may be
non-zero; fix by making waits monotonic: read the current local counter from
pto::comm::Signal(signal_base + my_rank) to compute a per-phase target = current
+ 1, then perform the remote increments with CommRemotePtr/pto::comm::TNOTIFY as
before and change the waits to TWAIT(..., target, ...) against each peer's
signal_base slot (use the same target for all peers) instead of waiting for >=
1; reference pto::comm::Signal, CommRemotePtr, TNOTIFY and TWAIT to locate and
update the logic.


📥 Commits

Reviewing files that changed from the base of the PR and between 61ba501 and 910d8ff.

📒 Files selected for processing (11)

examples/workers/l3/README.md
examples/workers/l3/all_to_all_distributed/__init__.py
examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp
examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp
examples/workers/l3/all_to_all_distributed/main.py
examples/workers/l3/all_to_all_distributed/test_all_to_all.py
examples/workers/l3/broadcast_distributed/__init__.py
examples/workers/l3/broadcast_distributed/kernels/aiv/broadcast_kernel.cpp
examples/workers/l3/broadcast_distributed/kernels/orchestration/broadcast_orch.cpp
examples/workers/l3/broadcast_distributed/main.py
examples/workers/l3/broadcast_distributed/test_broadcast.py

coderabbitai · 2026-05-28T15:20:05Z

+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) continue;
+        __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
+        pto::comm::Signal sig(remote_signal);
+        pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
+    }
+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) continue;
+        pto::comm::Signal sig(signal_base + peer);
+        pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Phase-2 barrier can pass early due to stale signal counters.

At line 117, waiting for >= 1 assumes each signal_base[...] slot starts at zero. These slots are never initialized in this kernel, so reused or otherwise non-zero memory can satisfy the waits immediately and break cross-rank synchronization.

💡 Proposed fix (monotonic wait targets)

+    int32_t wait_target[kMaxSupportedRanks];
+    for (int peer = 0; peer < nranks; ++peer) {
+        wait_target[peer] = signal_base[peer];
+        if (peer != my_rank) {
+            wait_target[peer] += 1;
+        }
+    }
+
     for (int peer = 0; peer < nranks; ++peer) {
         if (peer == my_rank) continue;
         __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
         pto::comm::Signal sig(remote_signal);
         pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
     }
     for (int peer = 0; peer < nranks; ++peer) {
         if (peer == my_rank) continue;
         pto::comm::Signal sig(signal_base + peer);
-        pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
+        pto::comm::TWAIT(sig, wait_target[peer], pto::comm::WaitCmp::GE);
     }
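
The stale-counter hazard and the monotonic-target idea can be illustrated with a small host-side model. This is a hedged sketch, not the kernel: the helper names `barrier_ready`, `fixed_targets`, `monotonic_targets`, and `notify_from_peers` are hypothetical, and the model assumes the snapshot of the slots is taken before any peer's notification for the current phase has arrived. Whether that ordering holds on the device is not established in this thread.

```python
from typing import List


def barrier_ready(slots: List[int], my_rank: int, targets: List[int]) -> bool:
    """Model of the wait loop: each peer slot must be >= its target (GE compare)."""
    return all(
        slots[peer] >= targets[peer]
        for peer in range(len(slots))
        if peer != my_rank
    )


def fixed_targets(nranks: int) -> List[int]:
    """Original behavior: every wait uses the constant target 1."""
    return [1] * nranks


def monotonic_targets(slots: List[int], my_rank: int) -> List[int]:
    """Proposed behavior: target = snapshot of the slot, plus 1 for each peer."""
    return [v + (0 if peer == my_rank else 1) for peer, v in enumerate(slots)]


def notify_from_peers(slots: List[int], my_rank: int) -> List[int]:
    """Model each peer performing one atomic add of 1 on its slot."""
    return [v + (0 if peer == my_rank else 1) for peer, v in enumerate(slots)]


if __name__ == "__main__":
    my_rank = 0
    stale = [0, 2, 2]  # slots left non-zero by some earlier use (assumed)

    # Constant target: the barrier passes before any peer has notified.
    print(barrier_ready(stale, my_rank, fixed_targets(3)))  # True

    # Snapshot-based targets: the barrier holds until peers notify.
    targets = monotonic_targets(stale, my_rank)
    print(barrier_ready(stale, my_rank, targets))  # False
    print(barrier_ready(notify_from_peers(stale, my_rank), my_rank, targets))  # True
```

With zero-initialized slots both target choices behave identically in this model, which is why the problem only shows up when the signal memory is reused.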



georgebisbas mentioned this pull request May 28, 2026

Update: index all L3 worker examples in README #887

Closed


georgebisbas force-pushed the feat/l3-broadcast-alltoall branch from 910d8ff to e7c8e25 · May 28, 2026 15:25

georgebisbas added 2 commits May 28, 2026 17:41

Support: cap macOS sim resource phase at --max-parallel 1

36dd108

L3 subprocesses fork chip children and load torch/libomp; running several in parallel on macos-latest has caused sporadic SIGABRT flakes in unrelated collectives. Linux sim jobs keep --max-parallel auto.

Support: cap Ubuntu sim CI at --max-parallel 3

ddca704

Pin Linux st-sim jobs below auto to reduce L3 resource-phase native flakes while keeping macOS at --max-parallel 1. Document both caps in docs/ci.md.

georgebisbas wants to merge 3 commits into hw-native-sys:main from georgebisbas:feat/l3-broadcast-alltoall
