Add: L3 broadcast and all-to-all distributed collectives #888

Open
georgebisbas wants to merge 3 commits into hw-native-sys:main from georgebisbas:feat/l3-broadcast-alltoall

@georgebisbas
Contributor

Complete the canonical collective set with two new examples that follow the existing scratch-window + TNOTIFY/TWAIT pattern used by allgather and reduce-scatter.

  • broadcast_distributed: root stages, barrier, all ranks read root scratch
  • all_to_all_distributed: dest-indexed scratch staging and peer gather
  • pytest wrappers parametrize 2 and 4 devices on a2a3sim/a2a3/a5sim
  • README: index allgather, reduce_scatter, broadcast, and all_to_all rows
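
As a reading aid, the intended results of the two collectives can be sketched on the host in plain Python. This is a minimal model and not the PR's golden reference: it assumes each rank holds a flat list, that the all-to-all input splits into `nranks` equal chunks, and both function names are hypothetical.

```python
def broadcast_reference(inputs, root):
    """Hypothetical reference: every rank ends up with a copy of the root's buffer."""
    return [list(inputs[root]) for _ in inputs]


def all_to_all_reference(inputs):
    """Hypothetical reference: rank `dst` receives chunk `dst` from every source rank.

    Assumes len(inputs[r]) is the same for all ranks and divisible by nranks.
    """
    nranks = len(inputs)
    chunk = len(inputs[0]) // nranks
    outputs = []
    for dst in range(nranks):
        out = []
        for src in range(nranks):
            # Chunk `dst` of the source buffer is the part addressed to rank `dst`.
            out.extend(inputs[src][dst * chunk:(dst + 1) * chunk])
        outputs.append(out)
    return outputs
```

With two ranks holding `[0, 1]` and `[10, 11]`, the all-to-all model gives rank 0 the first element from each rank and rank 1 the second.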

@coderabbitai

coderabbitai Bot commented May 28, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9899cfbc-88f4-4937-86dd-62d4477095bf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Added two new L3 worker examples demonstrating distributed communication patterns. All-to-all implements symmetric 3-phase exchange with per-rank staging, cross-rank synchronization, and remote scratch reads. Broadcast implements 3-phase communication where root stages data, all ranks synchronize, then read from root. Both include AICORE kernels, C++ orchestration shims, Python drivers with CLI interfaces, and parametrized integration tests.
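
The three-phase shape described above (stage, synchronize, read) can be illustrated with a small host-side simulation of the broadcast case. This is a sketch under stated assumptions, not the kernel: each rank is a Python thread, scratch buffers are plain lists, and `threading.Barrier` stands in for the cross-rank notify/wait barrier. The function name is hypothetical.

```python
import threading


def broadcast_three_phase(inputs, root):
    """Simulate stage -> barrier -> read-from-root with one thread per rank."""
    nranks = len(inputs)
    scratch = [None] * nranks            # one scratch buffer per rank
    outputs = [None] * nranks
    barrier = threading.Barrier(nranks)  # stand-in for the notify/wait barrier

    def rank_main(rank):
        # Phase 1: only the root stages its input into its own scratch.
        if rank == root:
            scratch[root] = list(inputs[root])
        # Phase 2: no rank reads until every rank has arrived.
        barrier.wait()
        # Phase 3: every rank copies the root's scratch into its output.
        outputs[rank] = list(scratch[root])

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```

Without the barrier in phase 2, a non-root rank could reach phase 3 before the root has staged anything, which is the ordering the real barrier is there to prevent.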

Changes

All-to-All Distributed Exchange

| Layer / File(s) | Summary |
| --- | --- |
| Documentation and package structure (`examples/workers/l3/README.md`, `examples/workers/l3/all_to_all_distributed/__init__.py`) | README table extended with four new distributed communication examples; package marker created to enable test imports from local main module. |
| AICORE kernel and orchestration shim (`examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp`, `examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp`) | All-to-all kernel executes 3-phase workflow: (1) each rank stages input to local scratch with MTE2/MTE3 flag synchronization, (2) cross-rank barrier using atomic add notifications and wait-on-signal primitives, (3) each rank reads remote peer scratches and writes to output. Orchestration shim exposes configuration returning 5 expected arguments and submission entry point mapping three tensors and two scalars. |
| Python orchestration example and validation (`examples/workers/l3/all_to_all_distributed/main.py`, `examples/workers/l3/all_to_all_distributed/test_all_to_all.py`) | End-to-end driver parses device ranges (2–16 devices), compiles kernel and orchestration, allocates per-rank Worker domains with scratch buffers, submits orchestration DAG tasks, and validates output via golden reference with numeric tolerance; parametrized test runs on 2 and 4 device configurations. |

Broadcast Distributed Communication

| Layer / File(s) | Summary |
| --- | --- |
| Package structure (`examples/workers/l3/broadcast_distributed/__init__.py`) | Package marker module with license header and docstring for test import behavior. |
| AICORE kernel and orchestration shim (`examples/workers/l3/broadcast_distributed/kernels/aiv/broadcast_kernel.cpp`, `examples/workers/l3/broadcast_distributed/kernels/orchestration/broadcast_orch.cpp`) | Broadcast kernel executes 3-phase workflow: (1) root rank only stages input to local scratch with MTE flag synchronization, (2) all ranks participate in cross-rank barrier via atomic notifications and conditional waits, (3) all ranks use CommRemotePtr to compute root's remote scratch address, load via HCCL TLOAD/TSTORE with MTE synchronization, and write to output. Orchestration shim exposes configuration returning 6 expected arguments and submission entry point mapping three tensors and three scalars (nranks, root, CommContext). |
| Python orchestration example and validation (`examples/workers/l3/broadcast_distributed/main.py`, `examples/workers/l3/broadcast_distributed/test_broadcast.py`) | End-to-end driver parses device ranges (2–16 devices), compiles kernel and orchestration with platform-specific text section extraction, allocates per-rank Worker domains with scratch and window buffers, submits broadcast DAG tasks with configurable root rank, and validates output via golden reference; parametrized test runs on 2 and 4 device configurations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Two hoppy examples now take flight,
All-to-all exchanges, broadcast's might,
Phases three: stage, sync, and share,
Distributed rabbits everywhere! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and clearly summarizes the main change: adding two new distributed collective examples (broadcast and all-to-all) at the L3 worker level. |
| Description check | ✅ Passed | The description is well-related to the changeset, detailing the two new collectives, their implementation pattern, test coverage, and documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces new distributed communication examples for Level 3 workers, specifically implementing all_to_all_distributed and broadcast_distributed. Each example includes a symmetric 3-phase C++ kernel utilizing HCCL-window scratch patterns, an orchestration shim, a Python main script for end-to-end execution, and corresponding pytest suites. The main README has also been updated to document these additions. There are no review comments, so no feedback is provided.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp`:
- Around line 108-118: The barrier can pass early because signal slots may be
non-zero; fix by making waits monotonic: read the current local counter from
pto::comm::Signal(signal_base + my_rank) to compute a per-phase target = current
+ 1, then perform the remote increments with CommRemotePtr/pto::comm::TNOTIFY as
before and change the waits to TWAIT(..., target, ...) against each peer's
signal_base slot (use the same target for all peers) instead of waiting for >=
1; reference pto::comm::Signal, CommRemotePtr, TNOTIFY and TWAIT to locate and
update the logic.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad889782-193a-43a0-8a91-0183409f4553

📥 Commits

Reviewing files that changed from the base of the PR and between 61ba501 and 910d8ff.

📒 Files selected for processing (11)
  • examples/workers/l3/README.md
  • examples/workers/l3/all_to_all_distributed/__init__.py
  • examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp
  • examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp
  • examples/workers/l3/all_to_all_distributed/main.py
  • examples/workers/l3/all_to_all_distributed/test_all_to_all.py
  • examples/workers/l3/broadcast_distributed/__init__.py
  • examples/workers/l3/broadcast_distributed/kernels/aiv/broadcast_kernel.cpp
  • examples/workers/l3/broadcast_distributed/kernels/orchestration/broadcast_orch.cpp
  • examples/workers/l3/broadcast_distributed/main.py
  • examples/workers/l3/broadcast_distributed/test_broadcast.py

Comment on lines +108 to +118
for (int peer = 0; peer < nranks; ++peer) {
    if (peer == my_rank) continue;
    __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
    pto::comm::Signal sig(remote_signal);
    pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
}
for (int peer = 0; peer < nranks; ++peer) {
    if (peer == my_rank) continue;
    pto::comm::Signal sig(signal_base + peer);
    pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Phase-2 barrier can pass early due to stale signal counters.

At line 117, waiting for >= 1 assumes each signal_base[...] slot starts at zero. These slots are never initialized here, so reused or non-zero tail memory can satisfy the waits immediately and break cross-rank synchronization.
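
The failure mode can be reproduced with a small host-side model. The sketch below is not kernel code: it models one rank's signal slots as plain Python integers, and `atomic_add` and `ge` are hypothetical stand-ins for a TNOTIFY with AtomicAdd and the condition checked by a TWAIT with GE. It compares a fixed target of 1 against a target derived from a snapshot of the current counter plus one.

```python
class SignalWindow:
    """Host-side model of one rank's signal slots (one counter per peer)."""

    def __init__(self, nranks, initial=0):
        self.slots = [initial] * nranks

    def atomic_add(self, slot, value=1):
        # Stand-in for a remote notify that atomically adds to the slot.
        self.slots[slot] += value

    def ge(self, slot, target):
        # Stand-in for the wait condition; a non-blocking check in this model.
        return self.slots[slot] >= target


def barrier_ready(window, my_rank, targets):
    """True when every peer slot has reached its target."""
    return all(
        window.ge(peer, targets[peer])
        for peer in range(len(window.slots))
        if peer != my_rank
    )


def snapshot_targets(window, my_rank):
    """Per-peer targets: current counter value, plus one for every peer slot."""
    return [value + (1 if peer != my_rank else 0)
            for peer, value in enumerate(window.slots)]


def demo():
    nranks, me = 2, 0
    window = SignalWindow(nranks, initial=3)    # stale counts from earlier use
    early = barrier_ready(window, me, [1] * nranks)    # fixed target of 1
    targets = snapshot_targets(window, me)             # snapshot + 1
    before = barrier_ready(window, me, targets)
    window.atomic_add(1)                               # peer 1 notifies
    after = barrier_ready(window, me, targets)
    return early, before, after
```

In this model `demo()` returns `(True, False, True)`: the fixed target passes before the peer has notified, while the snapshot-based target only passes afterwards. The model takes its snapshot before any peer notifies; whether that ordering holds on real hardware would need checking against the actual kernel, since a peer's increment that lands before the snapshot would be counted into the baseline.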

💡 Proposed fix (monotonic wait targets)
+    int32_t wait_target[kMaxSupportedRanks];
+    for (int peer = 0; peer < nranks; ++peer) {
+        wait_target[peer] = signal_base[peer];
+        if (peer != my_rank) {
+            wait_target[peer] += 1;
+        }
+    }
+
     for (int peer = 0; peer < nranks; ++peer) {
         if (peer == my_rank) continue;
         __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
         pto::comm::Signal sig(remote_signal);
         pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
     }
     for (int peer = 0; peer < nranks; ++peer) {
         if (peer == my_rank) continue;
         pto::comm::Signal sig(signal_base + peer);
-        pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
+        pto::comm::TWAIT(sig, wait_target[peer], pto::comm::WaitCmp::GE);
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for (int peer = 0; peer < nranks; ++peer) {
-        if (peer == my_rank) continue;
-        __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
-        pto::comm::Signal sig(remote_signal);
-        pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
-    }
-    for (int peer = 0; peer < nranks; ++peer) {
-        if (peer == my_rank) continue;
-        pto::comm::Signal sig(signal_base + peer);
-        pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
-    }
+    int32_t wait_target[kMaxSupportedRanks];
+    for (int peer = 0; peer < nranks; ++peer) {
+        wait_target[peer] = signal_base[peer];
+        if (peer != my_rank) {
+            wait_target[peer] += 1;
+        }
+    }
+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) continue;
+        __gm__ int32_t *remote_signal = CommRemotePtr(commCtx, signal_base + my_rank, peer);
+        pto::comm::Signal sig(remote_signal);
+        pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
+    }
+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) continue;
+        pto::comm::Signal sig(signal_base + peer);
+        pto::comm::TWAIT(sig, wait_target[peer], pto::comm::WaitCmp::GE);
+    }

georgebisbas force-pushed the feat/l3-broadcast-alltoall branch from 910d8ff to e7c8e25 (May 28, 2026 15:25)

L3 subprocesses fork chip children and load torch/libomp; running several
in parallel on macos-latest has caused sporadic SIGABRT flakes in unrelated
collectives. Linux sim jobs keep --max-parallel auto.

Pin Linux st-sim jobs below auto to reduce L3 resource-phase native flakes
while keeping macOS at --max-parallel 1. Document both caps in docs/ci.md.