Fix: Python-side acquire/release on mailbox state #609
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 21, 2026
Code Review
This pull request introduces atomic acquire-load and release-store helpers in C++ and exposes them to Python to ensure correct memory ordering for mailbox state transitions across multiple processes. By replacing standard struct packing and unpacking with these helpers, the implementation guarantees that payload writes, such as error messages, are visible to other processes before state changes are observed, which is critical for weakly-ordered architectures like aarch64. The changes include a new test suite to verify these atomic invariants and cross-process visibility. I have no feedback to provide as the existing review comments were purely validating the implementation.
The C++ WorkerThread::dispatch_process already uses ldar/stlr (aarch64) and
compiler-barriered plain store (x86_64) when it reads/writes the mailbox
OFF_STATE word. The Python side of the handshake — the three worker loops
(_sub_worker_loop, _chip_process_loop, _child_worker_loop), _chip_control,
and the init/close paths — was using plain struct.pack_into("i",
buf, _OFF_STATE, …) / unpack_from. On aarch64 that lets the state flip
leak ahead of the preceding OFF_ERROR / OFF_ERROR_MSG writes, so a parent
that observes TASK_DONE can read a stale error message.
Design decisions (per .docs/l3-audit.md):
1. Exposure: add inline mailbox_load_i32 / mailbox_store_i32 in
worker_bind.h and bind them as _mailbox_load_i32 /
_mailbox_store_i32 on _task_interface. Underscore prefix keeps them
out of task_interface.__all__ — only simpler.worker imports them.
2. ABI: aarch64 ldar/stlr first (per .claude/rules/codestyle.md #6),
x86_64 second with __asm__ volatile("" ::: "memory") to stop the
compiler from reordering across the TSO store, fallback to
__atomic_{load,store} with ACQUIRE / RELEASE.
3. addr type: uint64_t. Python computes via
ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset; C++
reinterpret_casts to volatile int32_t*. No void*.
4. Field ordering: all OFF_ERROR / OFF_ERROR_MSG / _CTRL_OFF_RESULT
writes happen BEFORE the release-store of OFF_STATE, and all
non-state reads happen AFTER the acquire-load of OFF_STATE. Every
state access in python/simpler/worker.py now goes through the
helper — 18 sites across the three loops, _chip_control, init, and
close — so the invariant is mechanical.
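A minimal sketch of the address computation from decision 3, assuming a bytearray stands in for the shared mailbox. The names buffer_field_addr, load_i32, store_i32 and the offset value are hypothetical; the two accessors are plain ctypes reads and writes that only mimic the call shape of _mailbox_load_i32 / _mailbox_store_i32 and provide no acquire/release ordering.

```python
import ctypes
import struct

OFF_STATE = 0  # hypothetical offset; the real layout lives in worker.py


def buffer_field_addr(buf, offset):
    """Return the absolute address of buf[offset] as a plain integer (uint64 on the C++ side)."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset


def load_i32(addr):
    # Stand-in for _mailbox_load_i32: plain read, NO acquire semantics.
    return ctypes.c_int32.from_address(addr).value


def store_i32(addr, value):
    # Stand-in for _mailbox_store_i32: plain write, NO release semantics.
    ctypes.c_int32.from_address(addr).value = value


buf = bytearray(64)
addr = buffer_field_addr(buf, OFF_STATE)
store_i32(addr, 7)

# The helper path and the old struct path address the same four bytes.
state_via_helper = load_i32(addr)
state_via_struct = struct.unpack_from("i", buf, OFF_STATE)[0]
```

Passing the address as an integer keeps the binding signature trivial; the address is only valid while the underlying buffer is alive and not resized.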
Adds tests/ut/py/test_worker/test_mailbox_atomics.py with four cases
(roundtrip, cross-process visibility, payload-before-state ordering over
1000 fork iterations, refactored sub-worker dispatch). Existing L4
error-propagation tests still green.
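The payload-before-state case could look roughly like the sketch below. It is not the test from this PR: the layout, the iteration count, and the helper names are assumptions, and load_i32 / store_i32 are plain ctypes stand-ins for the bound helpers, so the sketch shows the shape of the check without providing the ordering guarantee on aarch64.

```python
import ctypes
import mmap
import os
import struct
import time

OFF_STATE, OFF_PAYLOAD = 0, 8      # hypothetical layout for this sketch
SENTINEL = 0xDEADBEEFCAFEBABE
ITERATIONS = 20                    # the PR's test uses 1000 fork iterations


def field_addr(buf, offset):
    return ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset


def load_i32(addr):
    # Stand-in for _mailbox_load_i32 (plain read, no acquire).
    return ctypes.c_int32.from_address(addr).value


def store_i32(addr, value):
    # Stand-in for _mailbox_store_i32 (plain write, no release).
    ctypes.c_int32.from_address(addr).value = value


def run_once(timeout_s=5.0):
    shm = mmap.mmap(-1, mmap.PAGESIZE)   # anonymous mapping, MAP_SHARED by default on Unix
    state_addr = field_addr(shm, OFF_STATE)
    pid = os.fork()
    if pid == 0:
        try:
            # Child: payload first, state last.
            struct.pack_into("Q", shm, OFF_PAYLOAD, SENTINEL)
            store_i32(state_addr, 1)
        finally:
            os._exit(0)
    # Parent: state first, payload only after state == 1 is observed.
    deadline = time.monotonic() + timeout_s
    while load_i32(state_addr) != 1:
        if time.monotonic() > deadline:
            raise TimeoutError("child never published state=1")
    payload = struct.unpack_from("Q", shm, OFF_PAYLOAD)[0]
    os.waitpid(pid, 0)
    shm.close()
    return payload


results = [run_once() for _ in range(ITERATIONS)]
```

With the real helpers in place of the stand-ins, every entry in results must equal the sentinel; a mismatch would mean the state flip became visible before the payload.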
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6bfbee3 to f49abb1
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request on Apr 21, 2026
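Bring up the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613. Using Worker(level=3, chip_bootstrap_configs=...), each of the two cards sums the inputs of all ranks over cross-rank MTE2 via CommRemotePtr, writes the result back to its own output, and the output is read back with worker.copy_from for verification.
Files:
- kernels/aiv/allreduce_kernel.cpp: carried over directly from hw-native-sys#307 (PKUZHOU / echo_stone), with a single include path changed ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) through to the AIV task unchanged, bypassing the Tensor wrapper (the Tensor path would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window during the bootstrap phase using SharedMemory + HostBufferStaging, and the shm is unlinked after init; orch_fn issues add_scalar × 5 per chip and submits to submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: marked with device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches main() automatically.
WIP: only static checks (AST parse + import name verification) were done locally; the code has not been compiled or run. The next step is to bring it up on a 2-card a2a3 environment; the known points that still need validation are listed in the PR body.
Co-authored-by: echo_stone <liulei281@huawei.com>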
Summary
Part 3 of splitting #571. Already landed: L1a (#592 HCCL platform C API), L1b (#597 sim+ChipWorker+Python), L4 (#605 error propagation). Independent follow-ups (L2 DistChipBootstrapChannel, L5/L6/L7) are separate PRs.
The C++ side of the mailbox handshake (WorkerThread::read_mailbox_state / write_mailbox_state in worker_manager.cpp) uses ldar / stlr on aarch64 and a compiler-barriered plain store on x86_64 for MAILBOX_OFF_STATE. The Python side (three worker loops, _chip_control, and init/close) used plain struct.pack_into("i", buf, _OFF_STATE, …) / unpack_from with no memory barrier.
On aarch64 that lets the child's OFF_STATE = TASK_DONE store, which is meant to act as a release, become visible before the preceding OFF_ERROR / OFF_ERROR_MSG plain writes, so a parent that acquire-loads TASK_DONE can read stale error text. Program order is not memory order on a weakly-consistent CPU.
Changes
- python/bindings/worker_bind.h: add inline mailbox_load_i32 / mailbox_store_i32 (aarch64 ldar / stlr first per codestyle.md #6, x86_64 compiler barrier + plain access, __atomic_{load,store} with ACQUIRE / RELEASE fallback). Expose on _task_interface as _mailbox_load_i32 / _mailbox_store_i32. Underscore prefix keeps them out of task_interface.__all__; only simpler.worker imports them.
- python/simpler/worker.py: every OFF_STATE read/write (18 sites across _sub_worker_loop, _chip_process_loop, _child_worker_loop, _chip_control, _init_hierarchical, close()) now goes through the helper. Error-path fields stay plain; they are published by the subsequent release-store. Adds _buffer_field_addr(buf, offset) helper.
- tests/ut/py/test_worker/test_mailbox_atomics.py: 4 cases
  - … 0xDEADBEEFCAFEBABE payload then release-stores state=1; parent acquire-loads state==1 then must observe the sentinel. Exercises the exact invariant the three loops rely on to publish OFF_ERROR_MSG with TASK_DONE.
  - … MAP_SHARED counter to verify the refactored loop round-trips cleanly.
Design decisions
1. Exposure: inline helpers in worker_bind.h bound via nanobind: one implementation, one ABI, no ctypes wrapper on the Python side.
2. ABI: __aarch64__ → __x86_64__ → fallback, per codestyle.md #6.
3. addr type: uint64_t. Python computes ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset; C++ reinterpret_casts to volatile int32_t*. No void* (nanobind would treat it as a Python object).
4. Field ordering: … _chip_control, the state store is the last mailbox write of a transition; the state load is the first mailbox read. No field other than state needs atomic helpers.
Test plan
- pytest tests/ut/py/test_worker/: 48 passed, 1 skipped (platform_comm needs hw), 0 regressions
- test_mailbox_atomics.py: 6 passed including 1000-iter fork race
- test_error_propagation.py: 5 passed (L4 paths exercise every refactored site end-to-end)
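The ordering rule applied at the refactored sites can be sketched as follows. This is an illustration under stated assumptions, not the code in worker.py: the offsets, the message size, the TASK_READY value, and the function names serve_one_task / read_result are hypothetical, and load_i32 / store_i32 are plain ctypes stand-ins for _mailbox_load_i32 / _mailbox_store_i32.

```python
import ctypes
import struct

# Hypothetical layout and state values for this sketch.
OFF_STATE, OFF_ERROR, OFF_ERROR_MSG = 0, 4, 8
ERROR_MSG_SIZE = 64
TASK_READY, TASK_DONE = 1, 2


def field_addr(buf, offset):
    return ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset


def load_i32(addr):
    # Stand-in for _mailbox_load_i32 (plain read, no acquire).
    return ctypes.c_int32.from_address(addr).value


def store_i32(addr, value):
    # Stand-in for _mailbox_store_i32 (plain write, no release).
    ctypes.c_int32.from_address(addr).value = value


def serve_one_task(buf, task):
    """Child side: state load first, error fields next, state store last."""
    state_addr = field_addr(buf, OFF_STATE)
    if load_i32(state_addr) != TASK_READY:
        return False
    code, msg = 0, b""
    try:
        task()
    except Exception as exc:
        code, msg = 1, str(exc).encode()[:ERROR_MSG_SIZE - 1]
    # Plain writes: they are published by the state store that follows.
    struct.pack_into("i", buf, OFF_ERROR, code)
    struct.pack_into(f"{ERROR_MSG_SIZE}s", buf, OFF_ERROR_MSG, msg)
    store_i32(state_addr, TASK_DONE)
    return True


def read_result(buf):
    """Parent side: read error fields only after observing TASK_DONE."""
    if load_i32(field_addr(buf, OFF_STATE)) != TASK_DONE:
        return None
    code = struct.unpack_from("i", buf, OFF_ERROR)[0]
    raw = struct.unpack_from(f"{ERROR_MSG_SIZE}s", buf, OFF_ERROR_MSG)[0]
    return code, raw.split(b"\0", 1)[0].decode()


def failing_task():
    raise ValueError("boom")


mailbox = bytearray(128)
store_i32(field_addr(mailbox, OFF_STATE), TASK_READY)
served = serve_one_task(mailbox, failing_task)
result = read_result(mailbox)
```

Because the state word is the only field that crosses the helper, the error code and message need no special treatment as long as they are written before the store and read after the load.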