Add: HCCL backend for comm_* C API with C++ hardware UT #592

Merged
ChaoWao merged 2 commits into hw-native-sys:main from ChaoWao:pr-571-l1-platform-comm
Apr 20, 2026
@ChaoWao ChaoWao commented Apr 18, 2026

Summary

First slice of the distributed runtime work extracted from #571. Scope is
narrow on purpose: platform C API + HCCL implementation + ACL lifecycle
(owned by DeviceRunner) + a C++ hardware UT wired into CI. Sim backend,
ChipWorker wrappers, and Python bindings land in a follow-up PR. This keeps
the surface that couples to CANN-private symbols isolated and guarded by
its own test.

API shape (caller-driven lifecycle)

```c
// caller owns ACL + stream; comm_init does NOT aclInit or create streams.
CommHandle comm_init(int rank, int nranks, void *stream, const char *rootinfo_path);
int comm_alloc_windows(CommHandle h, size_t win_size, uint64_t *device_ctx_out);
int comm_get_local_window_base(CommHandle h, uint64_t *base_out);
int comm_get_window_size(CommHandle h, size_t *size_out);
int comm_barrier(CommHandle h);
int comm_destroy(CommHandle h);
```
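The call order implied by these signatures can be sketched as follows. This is a minimal illustration, not the PR's code: the `FakeComm` struct and the function bodies are hypothetical stand-ins so the sequence compiles without CANN or hardware, the handle is assumed to be an opaque pointer, and `run_rank` is an invented helper name.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Stand-in backend (hypothetical): mimics the comm_* signatures so the call
// order can be exercised without CANN. None of these bodies are the real ones.
struct FakeComm { int rank; int nranks; void *stream; size_t win_size; uint64_t base; };
using CommHandle = FakeComm *;  // assumption: the real handle is an opaque pointer

CommHandle comm_init(int rank, int nranks, void *stream, const char *rootinfo_path) {
    if (rootinfo_path == nullptr || nranks <= 0 || rank < 0 || rank >= nranks) return nullptr;
    return new FakeComm{rank, nranks, stream, 0, 0};
}
int comm_alloc_windows(CommHandle h, size_t win_size, uint64_t *device_ctx_out) {
    if (h == nullptr || device_ctx_out == nullptr) return -1;
    h->win_size = win_size;
    h->base = 0x1000u * static_cast<uint64_t>(h->rank + 1);  // made-up address
    *device_ctx_out = h->base;
    return 0;
}
int comm_get_local_window_base(CommHandle h, uint64_t *base_out) {
    if (h == nullptr || base_out == nullptr) return -1;
    *base_out = h->base;
    return 0;
}
int comm_get_window_size(CommHandle h, size_t *size_out) {
    if (h == nullptr || size_out == nullptr) return -1;
    *size_out = h->win_size;
    return 0;
}
int comm_barrier(CommHandle h) { return h != nullptr ? 0 : -1; }
int comm_destroy(CommHandle h) {
    if (h == nullptr) return -1;
    delete h;
    return 0;
}

// Caller-driven lifecycle: the caller supplies the stream, checks every
// return code, and always destroys the handle once init has succeeded.
int run_rank(int rank, int nranks, void *stream, const char *rootinfo_path) {
    CommHandle h = comm_init(rank, nranks, stream, rootinfo_path);
    if (h == nullptr) return 1;
    uint64_t device_ctx = 0, base = 0;
    size_t size = 0;
    int rc = comm_alloc_windows(h, static_cast<size_t>(1) << 20, &device_ctx);
    if (rc == 0) rc = comm_get_local_window_base(h, &base);
    if (rc == 0) rc = comm_get_window_size(h, &size);
    if (rc == 0) rc = comm_barrier(h);
    const int rc_destroy = comm_destroy(h);  // runs even after an earlier failure
    return (rc != 0 || rc_destroy != 0) ? 2 : 0;
}
```

In a real caller the `stream` argument would be an `aclrtStream` created after ACL is up; passing `nullptr` only works here because the stand-in never touches it.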

Key architectural decisions

  1. ACL lifecycle lives in DeviceRunner. `DeviceRunner::ensure_acl_ready`
    does `aclInit` + `aclrtSetDevice`; `finalize()` does the symmetric
    `aclrtResetDevice` + `aclFinalize` behind an `acl_ready_` flag.
    rt-only runtimes (no comm) stay unaffected.

  2. Stream is injected by the caller. Matches HCCL's API shape
    (`HcclBarrier(comm, stream)`); lets callers choose share-with-compute
    (serialize) vs dedicated (overlap).

  3. `CommContext` ABI is locked at compile time. Every field offset and
    the struct's `sizeof` are `static_assert`-ed, so a CANN upgrade that
    shifts the layout fails the build instead of silently misreading
    memory at runtime.
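The compile-time ABI lock from item 3 can be illustrated with a small sketch. `DemoContext` and `DEMO_MAX_RANK_NUM` below are hypothetical and do not reproduce the real `CommContext` layout; they only show the pattern of pinning every offset and the total size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t DEMO_MAX_RANK_NUM = 64;  // assumption: mirrors the 64-slot arrays

// Hypothetical layout for illustration only; not the real CommContext.
struct DemoContext {
    uint32_t rankId;
    uint32_t rankNum;
    uint64_t winSize;
    uint64_t windowsIn[DEMO_MAX_RANK_NUM];
    uint64_t windowsOut[DEMO_MAX_RANK_NUM];
};

// offsetof is only well-defined for standard-layout types, so check that first.
static_assert(std::is_standard_layout<DemoContext>::value, "must be standard layout");

// Pin every field offset and the total size. Any reordering, resizing, or
// padding change turns into a compile error at this point.
static_assert(offsetof(DemoContext, rankId) == 0, "rankId moved");
static_assert(offsetof(DemoContext, rankNum) == 4, "rankNum moved");
static_assert(offsetof(DemoContext, winSize) == 8, "winSize moved");
static_assert(offsetof(DemoContext, windowsIn) == 16, "windowsIn moved");
static_assert(offsetof(DemoContext, windowsOut) == 16 + 8 * DEMO_MAX_RANK_NUM, "windowsOut moved");
static_assert(sizeof(DemoContext) == 16 + 2 * 8 * DEMO_MAX_RANK_NUM, "size changed");
```

Because the checks run at compile time, they cost nothing at runtime and need no test harness to fire.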

Hardening (review-driven)

  • `head.rankSize` bounds-checked against `COMM_MAX_RANK_NUM` (RING + MESH
    paths). `windowsIn[64]` can no longer overflow.
  • `windowsOut[i]` populated alongside `windowsIn[i]` in RING path —
    required for remote-write (TPUT) kernels.
  • `comm_barrier` / `comm_destroy` propagate underlying HCCL / ACL errors
    instead of swallowing them.
  • RING-path workspace `aclrtMemcpy` now checks return code.
  • `file_barrier` takes a 120 s timeout + returns bool. A dead peer no
    longer hangs the group; `comm_init` / `comm_alloc_windows` abort on
    timeout, `comm_destroy` logs and continues so local teardown still runs.
  • `comm_init` null-checks `rootinfo_path` and validates `rank`/`nranks`
    against `COMM_MAX_RANK_NUM`.
  • The three `extern "C"` entry points that allocate (`comm_init`,
    `comm_alloc_windows`, `comm_destroy`) are function-try-blocks — no C++
    exception escapes the C ABI.
  • `make_run_token` uses `steady_clock` (NTP-immune) instead of
    `system_clock`.
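Two of the hardening items above, the timeout-bounded wait and the function-try-block at the C boundary, can be sketched together. This is an assumption-laden illustration: `wait_for_file` and `demo_wait_ready` are invented names, and the polling interval is arbitrary; the PR's `file_barrier` may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <exception>
#include <fstream>
#include <string>
#include <thread>

// Hypothetical helper: polls until `path` can be opened or `timeout` elapses.
// steady_clock is monotonic, so wall-clock adjustments cannot stretch or
// shorten the wait.
static bool wait_for_file(const std::string &path, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (std::ifstream(path).good()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

// Hypothetical C entry point. The function-try-block wraps the entire body,
// so anything thrown by std::string or the stream is caught before it can
// cross the C ABI. Returns 0 on success and -1 on bad input, timeout, or
// exception.
extern "C" int demo_wait_ready(const char *path, int timeout_ms) try {
    if (path == nullptr || timeout_ms < 0) {
        std::fprintf(stderr, "demo_wait_ready: invalid argument\n");
        return -1;
    }
    const std::string p(path);  // may throw std::bad_alloc
    return wait_for_file(p, std::chrono::milliseconds(timeout_ms)) ? 0 : -1;
} catch (const std::exception &e) {
    std::fprintf(stderr, "demo_wait_ready: %s\n", e.what());
    return -1;
} catch (...) {
    std::fprintf(stderr, "demo_wait_ready: unknown exception\n");
    return -1;
}
```

Validating the pointer before constructing `std::string` matters: building a `std::string` from a null `const char *` is undefined behavior, not a catchable exception.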

CANN internal dependencies (known fragility)

This is the one place in the tree that couples to CANN-private pieces:

  • Link `libhcomm.so` (CANN 9.x private).
  • Forward-declared private symbols: `HcclAllocComResourceByTiling`,
    `HcomGetCommHandleByGroup`, `HcomGetL0TopoTypeEx` — exported but not in
    any public header.
  • Reverse-engineered `HcclOpResParam` / `LocalResInfoV2` /
    `HcclRankRelationResV2` with `static_assert` offset locks — CANN
    upgrade drift fails compilation rather than silently garbage-reading.

Hardware UT

`tests/ut/cpp/test_hccl_comm.cpp` — what it actually guards:

The interesting part isn't "six functions return 0" — the interesting
part is what's inside `CommContext` after `comm_alloc_windows`
returns.

Each rank child:

  1. dlopens libhost_runtime.so (the subject under test; mirrors
    ChipWorker's runtime selection)
  2. `create_device_context` → `ensure_acl_ready_ctx` → `aclrtCreateStream`
  3. `comm_init` → `comm_alloc_windows`
  4. `aclrtMemcpy(D2H, device_ctx_out)` reads back the populated
    `CommContext` and asserts:
    • `rankId == rank` we passed to `comm_init`
    • `rankNum == nranks` we passed
    • `winSize == comm_get_window_size(h)` (cross-API consistency)
    • `windowsIn[rank] == comm_get_local_window_base(h)`
    • `windowsIn[0..nranks-1]` all non-zero
  5. `comm_barrier` → `comm_destroy` → stream destroy → ctx destroy

A CANN upgrade that moves any field lands as `EXIT_CTX_FIELDS` (56) —
distinct from `EXIT_ALLOC` (30) or `EXIT_BARRIER` (60) — so hardware CI
failures pinpoint where the ABI contract broke.
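The readback assertions in step 4 and the stage-specific exit codes could look roughly like the sketch below. The three non-zero codes are the values quoted above; `EXIT_OK`, `DemoContext`, and `check_ctx_fields` are hypothetical names, and the struct is a simplified stand-in for the host copy of `CommContext`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Exit codes quoted in the PR description; EXIT_OK is an assumption.
enum ExitCode : int { EXIT_OK = 0, EXIT_ALLOC = 30, EXIT_CTX_FIELDS = 56, EXIT_BARRIER = 60 };

// Simplified, hypothetical host-side copy of the context after the D2H memcpy.
struct DemoContext {
    uint32_t rankId;
    uint32_t rankNum;
    uint64_t winSize;
    uint64_t windowsIn[64];
};

// Compares the read-back fields against the values passed to comm_init and
// the values reported by the getter calls. Any mismatch maps to the single
// EXIT_CTX_FIELDS code so CI can tell it apart from alloc or barrier failures.
int check_ctx_fields(const DemoContext &ctx, int rank, int nranks,
                     std::size_t win_size, uint64_t local_base) {
    if (nranks <= 0 || nranks > 64 || rank < 0 || rank >= nranks) return EXIT_CTX_FIELDS;
    if (ctx.rankId != static_cast<uint32_t>(rank)) return EXIT_CTX_FIELDS;
    if (ctx.rankNum != static_cast<uint32_t>(nranks)) return EXIT_CTX_FIELDS;
    if (ctx.winSize != win_size) return EXIT_CTX_FIELDS;
    if (ctx.windowsIn[rank] != local_base) return EXIT_CTX_FIELDS;
    for (int i = 0; i < nranks; ++i) {
        if (ctx.windowsIn[i] == 0) return EXIT_CTX_FIELDS;  // every peer window must be set
    }
    return EXIT_OK;
}
```

A rank child would call a check like this right after the readback and pass a non-zero result straight to `_exit`, so the parent sees which stage failed from the exit status alone.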

Build-system / CI notes

  • `SIMPLER_ENABLE_HARDWARE_TESTS` CMake gate. no-hw `ut` job
    configures without the flag → hw-only tests are not added to the build
    at all. hw jobs (`ut-a2a3` / `ut-a5`) pass
    `-DSIMPLER_ENABLE_HARDWARE_TESTS=ON`. Future hw tests that need to
    link CANN just go under the gate; no-hw build stays clean.
  • GoogleTest FetchContent fallback. Self-hosted runners without a
    system gtest auto-fetch v1.14.0, built with `-D_GLIBCXX_USE_CXX11_ABI=0`
    so the ABI matches the test binaries. GH-hosted runners keep using
    the apt / brew fast path.
  • libascendcl linked directly. libhost_runtime.so stays dlopen'd
    (it's the test subject); libascendcl is generic CANN infra — going
    through dlsym here only hides types.
  • `set +e` around `source setenv.bash` in all hw steps. CANN's
    env script returns non-zero in some optional branches; without this,
    `bash -e` would kill the step before pytest/ctest could run.
  • ctest label `requires_hardware_a2a3` + `RESOURCE_GROUPS "2,npus:1"`
    drive device allocation via `--resource-spec-file`, not env vars.
  • `ci.yml`: `ut-py` + `ut-cpp` merged into one `ut` job; hardware UTs
    merged into `ut-a2a3` / `ut-a5`; a2a3 hw gated behind
    `detect-changes` so pure-a5 PRs skip.

Test plan

  • macOS local build (no-hw): test_hccl_comm not added to build
    targets (gate works).
  • macOS local build with `-DSIMPLER_ENABLE_HARDWARE_TESTS=ON`: fails
    with clear error on missing `ASCEND_HOME_PATH`.
  • ubuntu/macOS a2a3sim runtime build unaffected.
  • Hardware CI `ut-a2a3`: expected to surface an `aclrtSynchronizeStream
    507018` error at `comm_barrier` — this is a separate HCCL integration
    issue the UT is correctly catching, debug tracked separately.

Follow-up (separate PRs)

  • L1b (next): sim backend (`comm_sim.cpp`, new signature, stream
    ignored), `ChipWorker.comm_*` methods + nanobind bindings +
    `simpler.task_interface` wrappers, Python hardware UT.
  • barrier 507018 debug: real HCCL-runtime issue surfaced by this UT;
    needs hardware access to diagnose.
  • CANN version lock: dlopen-time version check against a whitelist.
  • L2+: `DistChipBootstrapChannel`, `Worker` orchestration, end-to-end
    L3 bootstrap ST.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a backend-neutral distributed communication API (comm_*) with backends for both HCCL and POSIX shared memory simulation, including Python bindings and a hardware smoke test. Review feedback identifies several critical issues: missing input validation for rank counts which could lead to buffer overflows in fixed-size arrays, and the incomplete population of the CommDeviceContext structure (specifically the windowsOut field) across both backends. Additionally, the feedback suggests addressing potential memory leaks in the allocation path and implementing timeout handling for the simulation backend's synchronization loop.

Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/sim/host/comm_sim.cpp Outdated
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/sim/host/comm_sim.cpp Outdated
Comment thread src/a2a3/platform/sim/host/comm_sim.cpp Outdated
@ChaoWao ChaoWao force-pushed the pr-571-l1-platform-comm branch from ed2244e to 17e22c1 on April 18, 2026 13:50
@ChaoWao ChaoWao changed the title from "Add: platform-level comm_* C API for distributed runtime" to "Add: HCCL backend for comm_* C API with C++ hardware UT" on Apr 18, 2026
@ChaoWao ChaoWao force-pushed the pr-571-l1-platform-comm branch from b621a65 to 0f6234c on April 19, 2026 06:46
ChaoWao commented Apr 19, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a backend-neutral distributed communication C API and its HCCL implementation for Ascend hardware. It includes the definition of the CommContext ABI, the comm_* host API, and the comm_hccl backend which supports both MESH and RING topologies. Additionally, the DeviceRunner is updated to manage the ACL lifecycle, and a hardware-based unit test is provided. Feedback focuses on improving robustness and safety in the HCCL backend, specifically regarding bounds checking for rank sizes, error handling for memory copies, implementing timeouts for file-based barriers, catching exceptions in C-exported functions, and optimizing file cleanup and token generation.

Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp Outdated
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp Outdated
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp
Comment thread src/a2a3/platform/onboard/host/comm_hccl.cpp Outdated
@ChaoWao ChaoWao force-pushed the pr-571-l1-platform-comm branch 9 times, most recently from 7803db5 to db40885 on April 20, 2026 01:04
Introduces the CANN/HCCL-dependent portion of the distributed runtime
comm API. Scope is deliberately narrow — platform C API + HCCL
implementation + ACL lifecycle in DeviceRunner + C++ UT + CI wiring.
Sim backend, ChipWorker wrappers, and Python bindings land in a
follow-up PR on top of this.

Platform C API (caller-driven ACL lifecycle and stream ownership):
- comm_context.h: CommContext is both the host<->device ABI and the
  direct reinterpret_cast target for HCCL MESH topology.  Layout is
  static_assert-locked (sizeof + every field offset) so accidental
  drift fails the build before device kernels silently garble reads.
- comm.h: comm_init(rank, nranks, stream, rootinfo_path).  Caller
  owns aclInit / aclrtSetDevice / stream creation; comm never owns
  ACL or stream lifecycle.
- comm_hccl.cpp: HCCL backend.  Handles MESH and RING topologies via
  HcclAllocComResourceByTiling; parses HcclOpResParam internals to
  extract per-rank RDMA window GVAs for both windowsIn and windowsOut
  so kernels using TPUT / remote-write semantics can resolve their
  targets.  Rank-token'd rootinfo handshake using steady_clock so
  consecutive runs do not alias each other's barrier files and NTP
  wall-clock jumps cannot desynchronize ranks.  Named constants for
  every CANN-internal MC2 enum value so the call site reads without
  reverse-engineering.  static_assert on sizeof HcclOpResParam /
  LocalResInfoV2 and the field offsets we consume so CANN-internal
  layout drift fails at compile time.

Hardening of the HCCL backend (bounds, error surface, timeouts,
exception safety):
- nranks is bounds-checked against COMM_MAX_RANK_NUM at comm_init
  entry, head.rankSize is checked after the HcclOpResParam head read
  (RING path), and h->host_ctx.rankNum is checked on the MESH path.
  Either out-of-range value is rejected with a logged error instead
  of overflowing the fixed 64-slot windowsIn / windowsOut buffers.
- comm_barrier propagates HcclBarrier and aclrtSynchronizeStream
  return codes; the previous silent-success behavior is fixed.
- RING-path aclrtMemcpy of workspace fields now checks its return
  code, matching its siblings.
- file_barrier takes a timeout (default 120s, same as
  wait_for_rootinfo) and returns bool.  A dead peer no longer hangs
  the surviving ranks forever; comm_init / comm_alloc_windows abort
  on timeout, comm_destroy logs and continues so local teardown
  still runs.
- comm_init null-checks rootinfo_path and validates rank/nranks
  before constructing any state; a bad pointer used to crash
  std::string.
- comm_destroy logs HcclCommDestroy errors and surfaces them via the
  return code while still guaranteeing local cleanup.
- comm_init / comm_alloc_windows / comm_destroy are extern "C" but
  allocate std::string, std::vector, and open fstream — any of which
  can throw.  Wrapped in function-try-blocks so exceptions never
  escape the C ABI; catches log via fprintf and return null / -1.

Device/ACL lifecycle (owned by DeviceRunner):
- device_runner.{h,cpp}: new ensure_acl_ready(device_id) method does
  aclInit (process-wide, tolerates 100002 repeat) + aclrtSetDevice
  (per-thread).  acl_ready_ flag tracks this so finalize() only
  drives aclrtResetDevice + aclFinalize when we actually brought ACL
  up; pure rt-layer runtimes stay unaffected.
- pto_runtime_c_api.cpp exposes ensure_acl_ready_ctx() for dlsym.

CANN internal dependencies pulled in (known fragility, recorded in
file-level comments so follow-up reviewers have one place to look):
- Link libhcomm.so (CANN 9.x private) for internal HCCL symbols
  exported but not in any public header: HcclAllocComResourceByTiling,
  HcomGetCommHandleByGroup, HcomGetL0TopoTypeEx.
- Reverse-engineered structs carry static_asserts so CANN upgrade
  drift produces a hard compile failure instead of a silent offset
  shift.

CI / test:
- tests/ut/cpp/test_hccl_comm.cpp: GoogleTest hardware UT.  Each rank
  child dlopens libhost_runtime.so + libascendcl.so, brings up ACL
  via ensure_acl_ready_ctx, creates its own aclrtStream, and drives
  comm_init -> alloc_windows -> get_local_window_base ->
  get_window_size -> barrier -> destroy.  Per-stage exit codes so
  hardware CI failures pinpoint where the contract broke.  Tagged
  with the requires_hardware_a2a3 CTest label and RESOURCE_GROUPS
  "2,npus:1" so CTest resource allocation (not env vars) drives
  device selection.
- .github/workflows/ci.yml: ut-py + ut-cpp merged into a single
  ubuntu+macos `ut` job; hardware UT jobs unified into ut-a2a3 and
  ut-a5 with CTest --resource-spec-file drive.  a2a3-hw jobs gated
  behind detect-changes so pure-a5 PRs skip them.  Hardware jobs use
  plain `pip install '.[test]'` (not --no-build-isolation): in CI
  the slightly-slower build-isolated install is not worth the extra
  scikit-build-core discovery burden we would otherwise have to
  handle by hand.
- docs/ci.md and docs/testing.md updated to match new job layout.

Follow-ups deliberately deferred to later PRs:
- comm_sim.cpp (windowsOut population, nranks validation, ftruncate
  wait timeout) lands with the sim backend in L1b.
- Runtime canary (write/barrier/read-neighbor magic).
- HCCL version query + dlopen-time version lock.
@ChaoWao ChaoWao force-pushed the pr-571-l1-platform-comm branch from db40885 to 10ee2f9 on April 20, 2026 01:17
HcclBarrier internally switches the thread's ACL context, which
invalidates the caller-owned stream for subsequent context-checked
ACL calls (aclrtSynchronizeStream returns 507018).  The sync was
redundant anyway — HcclBarrier is synchronous and blocks until all
ranks arrive.
@ChaoWao ChaoWao merged commit 18edd2d into hw-native-sys:main Apr 20, 2026
14 checks passed
@ChaoWao ChaoWao deleted the pr-571-l1-platform-comm branch April 20, 2026 04:01
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request Apr 20, 2026
Follow-up to the L1a HCCL backend (hw-native-sys#592).  L1a landed the CANN-dependent
HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds
the rest of the user surface so non-hardware developers and Python users
can drive the same primitives.

Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp):
- POSIX shm_open + mmap to share window regions across rank processes.
- Atomic barrier / ready / destroy counters live in a 4 KiB header at
  the front of the shared segment, using __atomic_* intrinsics.
- Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...);
  stream is ignored in sim (no ACL concept).
- Hardening to match the L1a review bar:
  * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any
    write to the fixed-size windowsIn / windowsOut arrays.
  * windowsOut[i] is populated alongside windowsIn[i] so kernels that
    consume windowsOut on HCCL still resolve on sim.
  * ftruncate wait, ready-count barrier, phase barrier, and destroy
    barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock
    (NTP-safe) so a dead peer cannot hang the surviving ranks.
  * extern "C" entry points wrapped in function-try-blocks to keep
    std::string / new allocations from escaping the C ABI.
- sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE;
  macOS has shm_open in libSystem and has no librt.

ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}):
- Six new methods: comm_init / comm_alloc_windows /
  comm_get_local_window_base / comm_get_window_size / comm_barrier /
  comm_destroy.
- Symbols resolved via load_optional_symbol so existing runtimes that
  predate the distributed extension still init cleanly; the per-method
  guards raise a clear runtime_error only when someone actually tries
  to invoke a missing primitive.
- stream is carried as uint64_t across the ChipWorker boundary (raw
  aclrtStream address) and cast to void * at the C API call.

Nanobind + Python (python/bindings/task_interface.cpp,
python/simpler/task_interface.py):
- Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker
  wrapper with type annotations and int(...) / str(...) coercion.
- Option A from the split plan: stream is an explicit arg, users
  create it themselves (matches the raw C API).

Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py):
- Two-rank fork subprocess test guarded by requires_hardware +
  platforms(["a2a3"]) + device_count(2); skips cleanly without
  --platform (macOS local, no hardware).
- Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream
  (ctypes against libascendcl.so) -> comm_init -> alloc_windows ->
  get_base -> get_size -> CommContext field readback via aclrtMemcpy
  -> comm_barrier -> comm_destroy -> finalize.
- CommContext is mirrored as a ctypes.Structure with a sizeof==1056
  assert so any drift from the C++ static_asserts surfaces at test
  import rather than silently mis-reading device memory.
- Cross-rank invariant: every rank's local_base must appear at index
  [rank] in every other rank's windowsIn - the exact invariant a
  kernel relies on when it DMAs to a peer window.
- Inherits the L1a HCCL 507018 barrier regression: the test surfaces
  a barrier failure as a warnings.warn instead of a test failure so
  the load-bearing assertions (init / alloc / ctx-fields / destroy)
  still gate the PR while that separate CANN-coupling bug is
  debugged in its own branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request Apr 20, 2026
Follow-up to the L1a HCCL backend (hw-native-sys#592).  L1a landed the CANN-dependent
HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds
the rest of the user surface so non-hardware developers and Python users
can drive the same primitives.

Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp):
- POSIX shm_open + mmap to share window regions across rank processes.
- Atomic barrier / ready / destroy counters live in a 4 KiB header at
  the front of the shared segment, using __atomic_* intrinsics.
- Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...);
  stream is ignored in sim (no ACL concept).
- Hardening to match the L1a review bar:
  * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any
    write to the fixed-size windowsIn / windowsOut arrays.
  * windowsOut[i] is populated alongside windowsIn[i] so kernels that
    consume windowsOut on HCCL still resolve on sim.
  * ftruncate wait, ready-count barrier, phase barrier, and destroy
    barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock
    (NTP-safe) so a dead peer cannot hang the surviving ranks.
  * extern "C" entry points wrapped in function-try-blocks to keep
    std::string / new allocations from escaping the C ABI.
- sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE;
  macOS has shm_open in libSystem and has no librt.

ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}):
- Six new methods: comm_init / comm_alloc_windows /
  comm_get_local_window_base / comm_get_window_size / comm_barrier /
  comm_destroy.
- Symbols resolved via load_optional_symbol so existing runtimes that
  predate the distributed extension still init cleanly; the per-method
  guards raise a clear runtime_error only when someone actually tries
  to invoke a missing primitive.
- stream is carried as uint64_t across the ChipWorker boundary (raw
  aclrtStream address) and cast to void * at the C API call.
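
The optional-symbol pattern above can be sketched in Python as follows. This is an illustration under assumptions, not the ChipWorker code: OptionalCommApi and stream_to_void_p are hypothetical names, and the library object is anything that exposes the comm_* functions as attributes (a ctypes.CDLL raises AttributeError for a missing symbol, which getattr with a default turns into None).

```python
import ctypes

COMM_SYMBOLS = (
    "comm_init",
    "comm_alloc_windows",
    "comm_get_local_window_base",
    "comm_get_window_size",
    "comm_barrier",
    "comm_destroy",
)


class OptionalCommApi:
    """Resolve comm_* symbols lazily; fail only when a missing one is called."""

    def __init__(self, lib):
        # Missing symbols are recorded as None so construction never fails.
        self._fns = {name: getattr(lib, name, None) for name in COMM_SYMBOLS}

    def available(self, name):
        return self._fns.get(name) is not None

    def call(self, name, *args):
        fn = self._fns.get(name)
        if fn is None:
            raise RuntimeError(f"{name} is not provided by this runtime")
        return fn(*args)


def stream_to_void_p(stream_addr):
    """Carry the stream as a plain integer and cast it at the C boundary."""
    return ctypes.c_void_p(stream_addr)
```

With this shape, a runtime that predates the distributed extension still loads, and only an actual call to a missing primitive raises.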

Nanobind + Python (python/bindings/task_interface.cpp,
python/simpler/task_interface.py):
- Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker
  wrapper with type annotations and int(...) / str(...) coercion.
- Option A from the split plan: stream is an explicit arg, users
  create it themselves (matches the raw C API).

Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py):
- Two-rank fork subprocess test guarded by requires_hardware +
  platforms(["a2a3"]) + device_count(2); skips cleanly without
  --platform (macOS local, no hardware).
- Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream
  (ctypes against libascendcl.so) -> comm_init -> alloc_windows ->
  get_base -> get_size -> CommContext field readback via aclrtMemcpy
  -> comm_barrier -> comm_destroy -> finalize.
- CommContext is mirrored as a ctypes.Structure with a sizeof==1056
  assert so any drift from the C++ static_asserts surfaces at test
  import rather than silently mis-reading device memory.
- Cross-rank invariant: every rank's local_base must appear at index
  [rank] in every other rank's windowsIn - the exact invariant a
  kernel relies on when it DMAs to a peer window.
- Inherits the L1a HCCL 507018 barrier regression: the test surfaces
  a barrier failure as a warnings.warn instead of a test failure so
  the load-bearing assertions (init / alloc / ctx-fields / destroy)
  still gate the PR while that separate CANN-coupling bug is
  debugged in its own branch.
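
A hedged sketch of the ctypes mirror and the cross-rank invariant described above. Only windowsIn, windowsOut, COMM_MAX_RANK_NUM (64) and the 1056-byte size come from this message; the header field names and their order (workSpace, workSpaceSize, rankId, rankNum, winSize) are assumptions chosen so the sizes add up, and check_cross_rank_invariant is a hypothetical helper, not the test's actual code.

```python
import ctypes

COMM_MAX_RANK_NUM = 64


class CommContextMirror(ctypes.Structure):
    # ASSUMED header layout: 8 + 8 + 4 + 4 + 8 = 32 bytes.
    # The two window arrays add 2 * 64 * 8 = 1024 bytes, giving 1056 in total.
    _fields_ = [
        ("workSpace", ctypes.c_uint64),
        ("workSpaceSize", ctypes.c_uint64),
        ("rankId", ctypes.c_uint32),
        ("rankNum", ctypes.c_uint32),
        ("winSize", ctypes.c_uint64),
        ("windowsIn", ctypes.c_uint64 * COMM_MAX_RANK_NUM),
        ("windowsOut", ctypes.c_uint64 * COMM_MAX_RANK_NUM),
    ]


# Evaluated at import time, so layout drift fails before any device read.
assert ctypes.sizeof(CommContextMirror) == 1056


def check_cross_rank_invariant(contexts, local_bases):
    """True if each rank's local base sits at its index in every peer's windowsIn."""
    nranks = len(contexts)
    for reader in range(nranks):
        for owner in range(nranks):
            if reader == owner:
                continue  # the text only constrains what peers see
            if contexts[reader].windowsIn[owner] != local_bases[owner]:
                return False
    return True
```

Bytes copied back from the device can be decoded with CommContextMirror.from_buffer_copy(raw) before the invariant is checked.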

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit that referenced this pull request Apr 20, 2026
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
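Exercise end to end the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Through Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums every rank's input across ranks via CommRemotePtr using MTE2, writes
the result back to its own output, and reads it back with worker.copy_from
for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: carried over directly from
  hw-native-sys#307 (PKUZHOU / echo_stone); only one include path changed
  ("common/comm_context.h" → "platform_comm/comm_context.h") to match the
  header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  unchanged to the AIV task, bypassing the Tensor wrapper (the Tensor path
  would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window
  during the bootstrap stage via SharedMemory + HostBufferStaging, and the
  shm is unlinked after init; orch_fn issues add_scalar × 5 per chip and
  submits through submit_next_level; copy_from reads back the output for
  verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: tagged with
  device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches
  main() automatically.

WIP: only static checks (AST parse + import-name verification) were done
locally; the code has not been compiled or run. Next step is to bring it up
on a 2-card a2a3 environment; the known points that need verification are
listed in the PR body.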
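
For reference, the host-side check that the harness description implies can be sketched as below. This is an assumption-laden illustration, not code from main.py: the function names, the list-of-floats representation, and the tolerance are all hypothetical, since the commit message does not state the element type or the comparison used.

```python
def expected_allreduce_sum(inputs_per_rank):
    """Element-wise sum over ranks; every rank should end up with this vector."""
    lengths = {len(x) for x in inputs_per_rank}
    if len(lengths) != 1:
        raise ValueError("all ranks must contribute inputs of the same length")
    return [sum(values) for values in zip(*inputs_per_rank)]


def outputs_match(outputs_per_rank, inputs_per_rank, tol=1e-6):
    """True if each rank's output equals the cross-rank sum within tol (assumed)."""
    expected = expected_allreduce_sum(inputs_per_rank)
    for out in outputs_per_rank:
        if len(out) != len(expected):
            return False
        if any(abs(a - b) > tol for a, b in zip(out, expected)):
            return False
    return True
```

In a two-rank run, outputs_per_rank would hold the buffers read back per rank, and both entries are expected to equal the same summed vector.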

Co-authored-by: echo_stone <liulei281@huawei.com>