Add: HCCL backend for comm_* C API with C++ hardware UT #592
ChaoWao merged 2 commits into hw-native-sys:main
Code Review
This pull request implements a backend-neutral distributed communication API (comm_*) with backends for both HCCL and POSIX shared memory simulation, including Python bindings and a hardware smoke test. Review feedback identifies several critical issues: missing input validation for rank counts which could lead to buffer overflows in fixed-size arrays, and the incomplete population of the CommDeviceContext structure (specifically the windowsOut field) across both backends. Additionally, the feedback suggests addressing potential memory leaks in the allocation path and implementing timeout handling for the simulation backend's synchronization loop.
/gemini review
Code Review
This pull request introduces a backend-neutral distributed communication C API and its HCCL implementation for Ascend hardware. It includes the definition of the CommContext ABI, the comm_* host API, and the comm_hccl backend which supports both MESH and RING topologies. Additionally, the DeviceRunner is updated to manage the ACL lifecycle, and a hardware-based unit test is provided. Feedback focuses on improving robustness and safety in the HCCL backend, specifically regarding bounds checking for rank sizes, error handling for memory copies, implementing timeouts for file-based barriers, catching exceptions in C-exported functions, and optimizing file cleanup and token generation.
7803db5 to
db40885
Compare
Introduces the CANN/HCCL-dependent portion of the distributed runtime
comm API. Scope is deliberately narrow — platform C API + HCCL
implementation + ACL lifecycle in DeviceRunner + C++ UT + CI wiring.
Sim backend, ChipWorker wrappers, and Python bindings land in a
follow-up PR on top of this.
Platform C API (caller-driven ACL lifecycle and stream ownership):
- comm_context.h: CommContext is both the host<->device ABI and the
direct reinterpret_cast target for HCCL MESH topology. Layout is
static_assert-locked (sizeof + every field offset) so accidental
drift fails the build before device kernels silently garble reads.
- comm.h: comm_init(rank, nranks, stream, rootinfo_path). Caller
owns aclInit / aclrtSetDevice / stream creation; comm never owns
ACL or stream lifecycle.
- comm_hccl.cpp: HCCL backend. Handles MESH and RING topologies via
HcclAllocComResourceByTiling; parses HcclOpResParam internals to
extract per-rank RDMA window GVAs for both windowsIn and windowsOut
so kernels using TPUT / remote-write semantics can resolve their
targets. Rank-token'd rootinfo handshake using steady_clock so
consecutive runs do not alias each other's barrier files and NTP
wall-clock jumps cannot desynchronize ranks. Named constants for
every CANN-internal MC2 enum value so the call site reads without
reverse-engineering. static_assert on sizeof HcclOpResParam /
LocalResInfoV2 and the field offsets we consume so CANN-internal
layout drift fails at compile time.
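The layout-locking idea described above can be sketched as follows. This is a minimal illustration, not the contents of comm_context.h: the struct name and the workSpace, workSpaceSize, rankId, and winSize fields are assumptions chosen so that the total matches the 1056-byte size mentioned elsewhere in this PR; only rankNum, windowsIn, windowsOut, and the 64-slot limit come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Mirrors the 64-slot limit (COMM_MAX_RANK_NUM) named in the PR description.
constexpr std::size_t kMaxRankNum = 64;

// Hypothetical mirror of a host<->device context. Field names other than
// rankNum / windowsIn / windowsOut are assumptions made for this sketch.
struct CommContextSketch {
    uint64_t workSpace;      // assumed: device workspace address
    uint64_t workSpaceSize;  // assumed: workspace size in bytes
    uint32_t rankId;         // assumed: this rank's index
    uint32_t rankNum;        // number of ranks
    uint64_t winSize;        // assumed: per-rank window size in bytes
    uint64_t windowsIn[kMaxRankNum];
    uint64_t windowsOut[kMaxRankNum];
};

// Lock the size and every field offset, so that reordering or resizing a
// field breaks the build instead of silently shifting what a kernel reads.
static_assert(std::is_standard_layout<CommContextSketch>::value,
              "offsetof requires a standard-layout type");
static_assert(sizeof(CommContextSketch) == 1056, "CommContext size drifted");
static_assert(offsetof(CommContextSketch, workSpace) == 0, "workSpace moved");
static_assert(offsetof(CommContextSketch, workSpaceSize) == 8, "workSpaceSize moved");
static_assert(offsetof(CommContextSketch, rankId) == 16, "rankId moved");
static_assert(offsetof(CommContextSketch, rankNum) == 20, "rankNum moved");
static_assert(offsetof(CommContextSketch, winSize) == 24, "winSize moved");
static_assert(offsetof(CommContextSketch, windowsIn) == 32, "windowsIn moved");
static_assert(offsetof(CommContextSketch, windowsOut) == 544, "windowsOut moved");
```

The same pattern applies to the reverse-engineered CANN structs: asserting only the fields actually consumed keeps the check narrow while still catching a layout change at compile time.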
Hardening of the HCCL backend (bounds, error surface, timeouts,
exception safety):
- nranks is bounds-checked against COMM_MAX_RANK_NUM at comm_init
entry, head.rankSize is checked after the HcclOpResParam head read
(RING path), and h->host_ctx.rankNum is checked on the MESH path.
Any out-of-range value is rejected with a logged error instead
of overflowing the fixed 64-slot windowsIn / windowsOut buffers.
- comm_barrier propagates HcclBarrier and aclrtSynchronizeStream
return codes; the previous silent-success behavior is fixed.
- RING-path aclrtMemcpy of workspace fields now checks its return
code, matching its siblings.
- file_barrier takes a timeout (default 120s, same as
wait_for_rootinfo) and returns bool. A dead peer no longer hangs
the surviving ranks forever; comm_init / comm_alloc_windows abort
on timeout, comm_destroy logs and continues so local teardown
still runs.
- comm_init null-checks rootinfo_path and validates rank/nranks
before constructing any state; a bad pointer used to crash
std::string.
- comm_destroy logs HcclCommDestroy errors and surfaces them via the
return code while still guaranteeing local cleanup.
- comm_init / comm_alloc_windows / comm_destroy are extern "C" but
allocate std::string, std::vector, and open fstream — any of which
can throw. Wrapped in function-try-blocks so exceptions never
escape the C ABI; catches log via fprintf and return null / -1.
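As an illustration of the entry-point hardening listed above, the sketch below combines the null check, the rank bounds check, and a function-try-block on an extern "C" function. The names comm_init_sketch, comm_destroy_sketch, and CommHandleSketch are hypothetical stand-ins; the real comm_init also performs the HCCL handshake, which is omitted here.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <string>

constexpr int kMaxRankNum = 64;  // mirrors COMM_MAX_RANK_NUM from the PR text

// Hypothetical handle type for this sketch only.
struct CommHandleSketch {
    int rank;
    int nranks;
    std::string rootinfo_path;
};

// Function-try-block: the whole body, including the std::string construction
// and the allocation, is covered, so no exception can cross the C ABI.
extern "C" CommHandleSketch *comm_init_sketch(int rank, int nranks, void *stream,
                                              const char *rootinfo_path)
try {
    (void)stream;  // caller-owned; unused in this sketch
    if (rootinfo_path == nullptr) {
        std::fprintf(stderr, "comm_init: rootinfo_path is null\n");
        return nullptr;
    }
    if (nranks <= 0 || nranks > kMaxRankNum || rank < 0 || rank >= nranks) {
        std::fprintf(stderr, "comm_init: bad rank/nranks %d/%d\n", rank, nranks);
        return nullptr;
    }
    return new CommHandleSketch{rank, nranks, std::string(rootinfo_path)};
} catch (const std::exception &e) {
    std::fprintf(stderr, "comm_init: exception: %s\n", e.what());
    return nullptr;
} catch (...) {
    std::fprintf(stderr, "comm_init: unknown exception\n");
    return nullptr;
}

extern "C" int comm_destroy_sketch(CommHandleSketch *h)
try {
    if (h == nullptr) return -1;
    delete h;
    return 0;
} catch (...) {
    return -1;
}
```

Validating before any state is constructed keeps the failure paths free of cleanup logic: a rejected call has allocated nothing.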
Device/ACL lifecycle (owned by DeviceRunner):
- device_runner.{h,cpp}: new ensure_acl_ready(device_id) method does
aclInit (process-wide, tolerates 100002 repeat) + aclrtSetDevice
(per-thread). acl_ready_ flag tracks this so finalize() only
drives aclrtResetDevice + aclFinalize when we actually brought ACL
up; pure rt-layer runtimes stay unaffected.
- pto_runtime_c_api.cpp exposes ensure_acl_ready_ctx() for dlsym.
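The ownership rule described above (finalize tears ACL down only if this runner brought it up) might look roughly like the sketch below. To stay self-contained it does not call the real ACL API: AclHooks and DeviceRunnerSketch are hypothetical, and the four hooks stand in for aclInit, aclrtSetDevice, aclrtResetDevice, and aclFinalize. The only value taken from the PR text is the tolerated repeat-init code 100002.

```cpp
#include <cassert>
#include <functional>
#include <utility>

constexpr int kAclRepeatInit = 100002;  // repeat-init code the PR says is tolerated

// Hypothetical injection points standing in for the real ACL entry points.
struct AclHooks {
    std::function<int()> init;             // stands in for aclInit
    std::function<int(int)> set_device;    // stands in for aclrtSetDevice
    std::function<int(int)> reset_device;  // stands in for aclrtResetDevice
    std::function<int()> finalize;         // stands in for aclFinalize
};

class DeviceRunnerSketch {
public:
    explicit DeviceRunnerSketch(AclHooks hooks) : hooks_(std::move(hooks)) {}

    // Brings ACL up; a repeated process-wide init is not treated as an error.
    int ensure_acl_ready(int device_id) {
        int rc = hooks_.init();
        if (rc != 0 && rc != kAclRepeatInit) return rc;
        rc = hooks_.set_device(device_id);
        if (rc != 0) return rc;
        device_id_ = device_id;
        acl_ready_ = true;  // remember that teardown is now our responsibility
        return 0;
    }

    // Tears ACL down only if ensure_acl_ready succeeded on this runner.
    int finalize() {
        if (!acl_ready_) return 0;
        const int reset_rc = hooks_.reset_device(device_id_);
        const int fin_rc = hooks_.finalize();
        acl_ready_ = false;
        return reset_rc != 0 ? reset_rc : fin_rc;
    }

private:
    AclHooks hooks_;
    int device_id_ = -1;
    bool acl_ready_ = false;
};
```

A runner that never calls ensure_acl_ready leaves ACL untouched in finalize, which is the behavior the description promises for pure rt-layer runtimes.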
CANN internal dependencies pulled in (known fragility, recorded in
file-level comments so follow-up reviewers have one place to look):
- Link libhcomm.so (CANN 9.x private) for internal HCCL symbols
exported but not in any public header: HcclAllocComResourceByTiling,
HcomGetCommHandleByGroup, HcomGetL0TopoTypeEx.
- Reverse-engineered structs carry static_asserts so CANN upgrade
drift produces a hard compile failure instead of a silent offset
shift.
CI / test:
- tests/ut/cpp/test_hccl_comm.cpp: GoogleTest hardware UT. Each rank
child dlopens libhost_runtime.so + libascendcl.so, brings up ACL
via ensure_acl_ready_ctx, creates its own aclrtStream, and drives
comm_init -> alloc_windows -> get_local_window_base ->
get_window_size -> barrier -> destroy. Per-stage exit codes so
hardware CI failures pinpoint where the contract broke. Tagged
with the requires_hardware_a2a3 CTest label and RESOURCE_GROUPS
"2,npus:1" so CTest resource allocation (not env vars) drives
device selection.
- .github/workflows/ci.yml: ut-py + ut-cpp merged into a single
ubuntu+macos `ut` job; hardware UT jobs unified into ut-a2a3 and
ut-a5, driven by CTest --resource-spec-file. a2a3-hw jobs gated
behind detect-changes so pure-a5 PRs skip them. Hardware jobs use
plain `pip install '.[test]'` (not --no-build-isolation): in CI
the small speedup from skipping build isolation is not worth the
extra scikit-build-core discovery burden we would otherwise have
to handle by hand.
- docs/ci.md and docs/testing.md updated to match new job layout.
Follow-ups deliberately deferred to later PRs:
- comm_sim.cpp (windowsOut population, nranks validation, ftruncate
wait timeout) lands with the sim backend in L1b.
- Runtime canary (write/barrier/read-neighbor magic).
- HCCL version query + dlopen-time version lock.
HcclBarrier internally switches the thread's ACL context, which invalidates the caller-owned stream for subsequent context-checked ACL calls (aclrtSynchronizeStream returns 507018). The sync was redundant anyway — HcclBarrier is synchronous and blocks until all ranks arrive.
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives.
Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp):
- POSIX shm_open + mmap to share window regions across rank processes.
- Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics.
- Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept).
- Hardening to match the L1a review bar:
  * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays.
  * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim.
  * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks.
  * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI.
- sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt.
ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}):
- Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy.
- Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive.
- stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call.
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py):
- Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion.
- Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API).
Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py):
- Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware).
- Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize.
- CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory.
- Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window.
- Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
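The timeout-gated phase barrier described for the sim backend could be sketched as below. This is an assumption-laden illustration rather than the code in comm_sim.cpp: BarrierHeaderSketch and barrier_wait_sketch are hypothetical names, the header here is a plain struct rather than the front of an mmap'd shared segment, and the __atomic_* builtins are GCC/Clang extensions.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical counters; in the sim backend the equivalent fields would live
// in the header at the front of the shared-memory segment.
struct BarrierHeaderSketch {
    uint32_t arrived;  // ranks that have reached the current barrier
    uint32_t phase;    // incremented once per completed barrier
};

// Returns true once all nranks arrive, false if the deadline passes first.
// After a timeout the arrival count is left as-is, so the barrier must be
// treated as broken, matching the abort-on-timeout behavior described above.
bool barrier_wait_sketch(BarrierHeaderSketch *hdr, uint32_t nranks,
                         std::chrono::milliseconds timeout) {
    // steady_clock is monotonic, so wall-clock adjustments cannot stretch
    // or shrink the deadline.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    const uint32_t my_phase = __atomic_load_n(&hdr->phase, __ATOMIC_ACQUIRE);

    if (__atomic_add_fetch(&hdr->arrived, 1u, __ATOMIC_ACQ_REL) == nranks) {
        // Last arriver: reset the count first, then publish the new phase.
        __atomic_store_n(&hdr->arrived, 0u, __ATOMIC_RELAXED);
        __atomic_store_n(&hdr->phase, my_phase + 1, __ATOMIC_RELEASE);
        return true;
    }
    while (__atomic_load_n(&hdr->phase, __ATOMIC_ACQUIRE) == my_phase) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return true;
}
```

Waiting on a phase change rather than on the count itself lets the same header be reused for consecutive barriers without a rank from the next round being mistaken for a late arrival in the previous one.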
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives. Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp): - POSIX shm_open + mmap to share window regions across rank processes. - Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics. - Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept). - Hardening to match the L1a review bar: * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays. * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim. * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks. * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI. - sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt. ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}): - Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy. - Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive. - stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call. 
Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py): - Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion. - Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API). Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py): - Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware). - Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize. - CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory. - Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window. - Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the L1a HCCL backend (hw-native-sys#592). L1a landed the CANN-dependent HCCL implementation of the comm_* C API plus a C++ hardware UT; L1b adds the rest of the user surface so non-hardware developers and Python users can drive the same primitives.

Sim backend (src/a2a3/platform/sim/host/comm_sim.cpp):
- POSIX shm_open + mmap to share window regions across rank processes.
- Atomic barrier / ready / destroy counters live in a 4 KiB header at the front of the shared segment, using __atomic_* intrinsics.
- Signature aligned to L1a: comm_init(rank, nranks, void *stream, ...); stream is ignored in sim (no ACL concept).
- Hardening to match the L1a review bar:
  * nranks bounds-checked against COMM_MAX_RANK_NUM (64) before any write to the fixed-size windowsIn / windowsOut arrays.
  * windowsOut[i] is populated alongside windowsIn[i] so kernels that consume windowsOut on HCCL still resolve on sim.
  * ftruncate wait, ready-count barrier, phase barrier, and destroy barrier all gated by SIM_COMM_TIMEOUT_SECONDS via steady_clock (NTP-safe) so a dead peer cannot hang the surviving ranks.
  * extern "C" entry points wrapped in function-try-blocks to keep std::string / new allocations from escaping the C ABI.
- sim/host/CMakeLists.txt: librt linked only on UNIX AND NOT APPLE; macOS has shm_open in libSystem and has no librt.

ChipWorker C++ (src/common/worker/chip_worker.{h,cpp}):
- Six new methods: comm_init / comm_alloc_windows / comm_get_local_window_base / comm_get_window_size / comm_barrier / comm_destroy.
- Symbols resolved via load_optional_symbol so existing runtimes that predate the distributed extension still init cleanly; the per-method guards raise a clear runtime_error only when someone actually tries to invoke a missing primitive.
- stream is carried as uint64_t across the ChipWorker boundary (raw aclrtStream address) and cast to void * at the C API call.

Nanobind + Python (python/bindings/task_interface.cpp, python/simpler/task_interface.py):
- Six .def() entries on _ChipWorker, mirrored in the Python ChipWorker wrapper with type annotations and int(...) / str(...) coercion.
- Option A from the split plan: stream is an explicit arg, users create it themselves (matches the raw C API).

Python hardware UT (tests/ut/py/test_worker/test_platform_comm.py):
- Two-rank fork subprocess test guarded by requires_hardware + platforms(["a2a3"]) + device_count(2); skips cleanly without --platform (macOS local, no hardware).
- Full lifecycle: ChipWorker.init -> set_device -> aclrtCreateStream (ctypes against libascendcl.so) -> comm_init -> alloc_windows -> get_base -> get_size -> CommContext field readback via aclrtMemcpy -> comm_barrier -> comm_destroy -> finalize.
- CommContext is mirrored as a ctypes.Structure with a sizeof==1056 assert so any drift from the C++ static_asserts surfaces at test import rather than silently mis-reading device memory.
- Cross-rank invariant: every rank's local_base must appear at index [rank] in every other rank's windowsIn - the exact invariant a kernel relies on when it DMAs to a peer window.
- Inherits the L1a HCCL 507018 barrier regression: the test surfaces a barrier failure as a warnings.warn instead of a test failure so the load-bearing assertions (init / alloc / ctx-fields / destroy) still gate the PR while that separate CANN-coupling bug is debugged in its own branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercises end to end the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613. Through Worker(level=3, chip_bootstrap_configs=...), each of the two chips sums every rank's input across ranks over MTE2 via CommRemotePtr, writes the result to its own output, and the output is read back with worker.copy_from for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: copied directly from hw-native-sys#307 (PKUZHOU / echo_stone) with a single include path changed ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) through to the AIV task unchanged, without Tensor wrapping (the Tensor path rewrites pointers).
- main.py: 2-chip harness. Per-rank input is staged into the window during bootstrap using SharedMemory + HostBufferStaging, and the shm is unlinked after init; orch_fn calls add_scalar × 5 per chip and submits to submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: marked with device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches main() automatically.

WIP: only static checks were run locally (AST parse + import name verification); the code has not been compiled or run. Next step is to bring it up on a 2-chip a2a3 environment; the known points that need verification are listed in the PR body.

Co-authored-by: echo_stone <liulei281@huawei.com>
Summary
First slice of the distributed runtime work extracted from #571. Scope is
narrow on purpose: platform C API + HCCL implementation + ACL lifecycle
(owned by DeviceRunner) + a C++ hardware UT wired into CI. Sim backend,
ChipWorker wrappers, and Python bindings land in a follow-up PR. This keeps
the surface that couples to CANN-private symbols isolated and guarded by
its own test.
API shape (caller-driven lifecycle)
```c
// caller owns ACL + stream; comm_init does NOT aclInit or create streams.
CommHandle comm_init(int rank, int nranks, void *stream, const char *rootinfo_path);
int comm_alloc_windows(CommHandle h, size_t win_size, uint64_t *device_ctx_out);
int comm_get_local_window_base(CommHandle h, uint64_t *base_out);
int comm_get_window_size(CommHandle h, size_t *size_out);
int comm_barrier(CommHandle h);
int comm_destroy(CommHandle h);
```
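The call order implied by these prototypes can be sketched as follows. This is only an illustration: the `comm_*` bodies are in-process stubs so the snippet compiles without CANN, and the `CommHandle` typedef, the null-handle-on-failure convention, the 0-means-success return codes, and the fake addresses are all assumptions rather than the PR's actual behavior. `run_lifecycle` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ---- Stubs standing in for a real backend (assumptions, not the PR's code) ----
struct DemoComm { int rank; int nranks; std::size_t win_size; };
using CommHandle = DemoComm *;  // assumed; the real typedef is not shown above

CommHandle comm_init(int rank, int nranks, void * /*stream*/, const char * /*rootinfo_path*/) {
    if (nranks <= 0 || rank < 0 || rank >= nranks) return nullptr;
    return new DemoComm{rank, nranks, 0};
}
int comm_alloc_windows(CommHandle h, std::size_t win_size, uint64_t *device_ctx_out) {
    if (h == nullptr || device_ctx_out == nullptr) return -1;
    h->win_size = win_size;
    *device_ctx_out = 0x1000;  // fake device address
    return 0;
}
int comm_get_local_window_base(CommHandle h, uint64_t *base_out) {
    if (h == nullptr || base_out == nullptr) return -1;
    *base_out = 0x2000;  // fake window base
    return 0;
}
int comm_get_window_size(CommHandle h, std::size_t *size_out) {
    if (h == nullptr || size_out == nullptr) return -1;
    *size_out = h->win_size;
    return 0;
}
int comm_barrier(CommHandle h) { return h != nullptr ? 0 : -1; }
int comm_destroy(CommHandle h) { delete h; return 0; }

// ---- Caller-side lifecycle: the caller owns ACL and the stream ----
int run_lifecycle(int rank, int nranks, void *stream, const char *rootinfo_path,
                  std::size_t win_size) {
    CommHandle h = comm_init(rank, nranks, stream, rootinfo_path);
    if (h == nullptr) return 1;

    uint64_t device_ctx = 0;
    uint64_t base = 0;
    std::size_t size = 0;
    int rc = comm_alloc_windows(h, win_size, &device_ctx);
    if (rc == 0) rc = comm_get_local_window_base(h, &base);
    if (rc == 0) rc = comm_get_window_size(h, &size);
    if (rc == 0) rc = comm_barrier(h);

    // Tear down even if an earlier step failed, then report the first error.
    int destroy_rc = comm_destroy(h);
    return rc != 0 ? rc : destroy_rc;
}
```

With a real backend the stream argument would be one the caller created itself, and the same sequence would apply unchanged.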
Key architectural decisions
ACL lifecycle lives in DeviceRunner. `DeviceRunner::ensure_acl_ready`
does `aclInit` + `aclrtSetDevice`; `finalize()` does the symmetric
`aclrtResetDevice` + `aclFinalize` behind an `acl_ready_` flag.
rt-only runtimes (no comm) stay unaffected.
Stream is injected by the caller. Matches HCCL's API shape
(`HcclBarrier(comm, stream)`); lets callers choose share-with-compute
(serialize) vs dedicated (overlap).
`CommContext` ABI is locked at compile time. Every field offset and
`sizeof` is `static_assert`-ed, so a CANN upgrade that shifts the layout
fails the build instead of silently misbehaving at runtime.
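A minimal sketch of this kind of compile-time layout lock is shown below. `DemoCommContext` is a hypothetical stand-in: only the two 64-entry window arrays and the 1056-byte total appear in this PR's text, and the leading fields are invented so that the numbers add up. The real `CommContext` layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t kMaxRankNum = 64;  // mirrors COMM_MAX_RANK_NUM

// Hypothetical layout; every field marked "assumed" is illustrative only.
struct DemoCommContext {
    uint64_t workspace;  // assumed
    uint64_t win_size;   // assumed
    uint32_t rank_id;    // assumed
    uint32_t rank_num;   // assumed
    uint64_t reserved;   // assumed
    uint64_t windowsIn[kMaxRankNum];
    uint64_t windowsOut[kMaxRankNum];
};

// offsetof is only well-defined for standard-layout types, so check that first.
static_assert(std::is_standard_layout<DemoCommContext>::value,
              "DemoCommContext must be standard layout");

// One assert per field: a shifted field names itself in the compiler error.
static_assert(offsetof(DemoCommContext, workspace) == 0, "workspace moved");
static_assert(offsetof(DemoCommContext, win_size) == 8, "win_size moved");
static_assert(offsetof(DemoCommContext, rank_id) == 16, "rank_id moved");
static_assert(offsetof(DemoCommContext, rank_num) == 20, "rank_num moved");
static_assert(offsetof(DemoCommContext, reserved) == 24, "reserved moved");
static_assert(offsetof(DemoCommContext, windowsIn) == 32, "windowsIn moved");
static_assert(offsetof(DemoCommContext, windowsOut) == 544, "windowsOut moved");
static_assert(sizeof(DemoCommContext) == 1056, "DemoCommContext size changed");

// Runtime accessor for the same fact, useful when logging from a test.
constexpr std::size_t demo_windows_in_offset() {
    return offsetof(DemoCommContext, windowsIn);
}
```

Asserting each offset separately, rather than only `sizeof`, catches reorderings that leave the total size unchanged.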
Hardening (review-driven)
paths). `windowsIn[64]` can no longer overflow.
required for remote-write (TPUT) kernels.
instead of swallowing them.
longer hangs the group; `comm_init` / `comm_alloc_windows` abort on
timeout, `comm_destroy` logs and continues so local teardown still runs.
against `COMM_MAX_RANK_NUM`.
`comm_alloc_windows`, `comm_destroy`) are function-try-blocks — no C++
exception escapes the C ABI.
`system_clock`.
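Three of the hardening ideas in this list (rank bounds checks against `COMM_MAX_RANK_NUM`, function-try-blocks on `extern "C"` entry points, and timeouts measured on a monotonic clock) can be combined in one small sketch. `demo_comm_barrier`, `wait_for_count`, the `void *` counter parameter, and the return codes are hypothetical and are not the PR's actual signatures.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <thread>

constexpr int kMaxRankNum = 64;  // mirrors COMM_MAX_RANK_NUM

// Poll `counter` until it reaches `expected` or `timeout` elapses.
// steady_clock is monotonic, so wall-clock adjustments cannot move the deadline.
bool wait_for_count(const std::atomic<int> &counter, int expected,
                    std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load(std::memory_order_acquire) < expected) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}

// Hypothetical C entry point. The function-try-block covers the whole body,
// so nothing thrown inside can propagate across the C ABI.
// Returns 0 on success, -1 on bad ranks, -2 on timeout, -3 on an exception.
extern "C" int demo_comm_barrier(int rank, int nranks, void *shared_counter,
                                 int timeout_ms) try {
    // Validate before touching any fixed-size, rank-indexed state.
    if (nranks <= 0 || nranks > kMaxRankNum || rank < 0 || rank >= nranks) return -1;
    if (shared_counter == nullptr) throw std::invalid_argument("null counter");

    auto *arrived = static_cast<std::atomic<int> *>(shared_counter);
    arrived->fetch_add(1, std::memory_order_acq_rel);
    const bool ok = wait_for_count(*arrived, nranks, std::chrono::milliseconds(timeout_ms));
    return ok ? 0 : -2;
} catch (const std::exception &e) {
    std::fprintf(stderr, "demo_comm_barrier: %s\n", e.what());
    return -3;
} catch (...) {
    return -3;
}
```

Whether a timeout aborts or is merely logged is a per-call policy: as described above, `comm_init` and `comm_alloc_windows` abort, while `comm_destroy` logs and continues so local teardown still runs.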
CANN internal dependencies (known fragility)
This is the one place in the tree that couples to CANN-private pieces:
`HcomGetCommHandleByGroup`, `HcomGetL0TopoTypeEx` — exported but not in
any public header.
`HcclRankRelationResV2` with `static_assert` offset locks — CANN
upgrade drift fails compilation rather than silently garbage-reading.
Hardware UT
`tests/ut/cpp/test_hccl_comm.cpp` — what it actually guards:
Each rank child:
ChipWorker's runtime selection)
`CommContext` and asserts:
A CANN upgrade that moves any field lands as `EXIT_CTX_FIELDS` (56) —
distinct from `EXIT_ALLOC` (30) or `EXIT_BARRIER` (60) — so hardware CI
failures pinpoint where the ABI contract broke.
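The stage-specific exit code pattern can be sketched with plain `fork` / `waitpid`. The three non-zero codes are the ones quoted above; `run_ranks`, `RankBody`, and the 64-rank cap are hypothetical names and limits chosen for illustration, and this is not the code in `test_hccl_comm.cpp`.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stage-specific exit codes quoted from the PR description.
enum RankExit : int { EXIT_ALLOC = 30, EXIT_CTX_FIELDS = 56, EXIT_BARRIER = 60 };

// Hypothetical per-rank test body: returns 0 or the code of the failing stage.
using RankBody = int (*)(int rank);

// Fork one child per rank and return the first non-zero exit code seen
// (0 if every rank passed, -1 on fork/wait problems or abnormal exits).
int run_ranks(int nranks, RankBody body) {
    constexpr int kMaxRanks = 64;
    if (nranks < 1 || nranks > kMaxRanks || body == nullptr) return -1;

    pid_t pids[kMaxRanks];
    int spawned = 0;
    int result = 0;
    for (; spawned < nranks; ++spawned) {
        const pid_t pid = fork();
        if (pid < 0) { result = -1; break; }
        if (pid == 0) _exit(body(spawned));  // child: skip atexit handlers and stdio flush
        pids[spawned] = pid;
    }

    // Reap every child that was started, even after a failure.
    for (int r = 0; r < spawned; ++r) {
        int status = 0;
        int code = -1;
        if (waitpid(pids[r], &status, 0) == pids[r] && WIFEXITED(status)) {
            code = WEXITSTATUS(status);
        }
        if (code != 0 && result == 0) result = code;
    }
    return result;
}

// Map an exit code back to the stage that failed.
const char *describe_exit(int code) {
    switch (code) {
        case 0: return "ok";
        case EXIT_ALLOC: return "comm_alloc_windows failed";
        case EXIT_CTX_FIELDS: return "CommContext field mismatch";
        case EXIT_BARRIER: return "comm_barrier failed";
        default: return "unknown failure";
    }
}
```

A parent that prints `describe_exit(run_ranks(2, body))` reports the failing stage directly, without anyone having to read the child logs.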
Build-system / CI notes
configures without the flag → hw-only tests are not added to the build
at all. hw jobs (`ut-a2a3` / `ut-a5`) pass
`-DSIMPLER_ENABLE_HARDWARE_TESTS=ON`. Future hw tests that need to
link CANN just go under the gate; no-hw build stays clean.
system gtest auto-fetch v1.14.0, built with `-D_GLIBCXX_USE_CXX11_ABI=0`
so the ABI matches the test binaries. GH-hosted runners keep using
the apt / brew fast path.
(it's the test subject); libascendcl is generic CANN infra — going
through dlsym here only hides types.
env script returns non-zero in some optional branches; without this,
`bash -e` would kill the step before pytest/ctest could run.
drive device allocation via `--resource-spec-file`, not env vars.
merged into `ut-a2a3` / `ut-a5`; a2a3 hw gated behind
`detect-changes` so pure-a5 PRs skip.
Test plan
targets (gate works).
with clear error on missing `ASCEND_HOME_PATH`.
507018` error at `comm_barrier` — this is a separate HCCL integration
issue the UT is correctly catching, debug tracked separately.
Follow-up (separate PRs)
ignored), `ChipWorker.comm_*` methods + nanobind bindings +
`simpler.task_interface` wrappers, Python hardware UT.
needs hardware access to diagnose.
L3 bootstrap ST.