Add: ChipWorker.bootstrap_context one-shot chip bring-up (L5) #610
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 21, 2026
Code Review
This pull request introduces a one-shot bootstrap mechanism for chip workers, enabling communicator initialization, memory window allocation, and host-to-device data staging via shared memory. It adds several configuration dataclasses and implements the bootstrap_context and shutdown_bootstrap methods within the ChipWorker class. Comprehensive hardware and simulation tests are also provided. Review feedback highlights a contradiction between the documentation and implementation regarding null communicators, a potential crash when handling zero-sized staging buffers, and the need for more robust handle cleanup in the shutdown process.
Wraps L1b's ChipWorker.comm_* + copy_to + set_device and L2's
ChipBootstrapChannel into a single ChipWorker.bootstrap_context(device_id,
cfg, channel=None) entry point that:
1. Sets the NPU device (set_device already wires ACL bring-up per L1b).
2. Brings up the communicator (comm_init + comm_alloc_windows +
comm_get_local_window_base + comm_get_window_size) when
cfg.comm is non-None. Skips the whole step when cfg.comm is None.
3. Carves the per-rank window sequentially into the ChipBufferSpec[]
list, validating placement=="window" and that the cumulative
nbytes does not overflow the actual (possibly rounded-up) window
size returned by the backend. buffer_ptrs is 1:1 aligned with
cfg.buffers so L6's ChipContext can build its name->ptr dict by zip.
4. For every ChipBufferSpec with load_from_host=True, attaches to the
matching HostBufferStaging POSIX shm (parent is expected to have
created + filled it pre-fork), copies the bytes into the device
window slice via ChipWorker.copy_to, and closes the local mapping.
5. Publishes ChipBootstrapResult(device_ctx, local_window_base,
actual_window_size, buffer_ptrs) via channel.write_success when
a channel is provided; on any exception, publishes
channel.write_error(1, "<ExceptionType>: <message>") first and
re-raises.
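
The sequential carve in step 3 can be sketched as follows. This is a minimal illustration, not the merged implementation: `BufferSpec` is a hypothetical stand-in for ChipBufferSpec that carries only the fields used here, `carve_window` is an invented helper name, and the sketch assumes buffers are packed back to back with no alignment padding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferSpec:
    """Hypothetical stand-in for ChipBufferSpec; only the fields used here."""
    name: str
    nbytes: int
    placement: str = "window"


def carve_window(specs: List[BufferSpec], window_base: int, window_size: int) -> List[int]:
    """Give each spec a consecutive slice of the window; return one pointer per spec."""
    ptrs: List[int] = []
    offset = 0
    for spec in specs:
        if spec.placement != "window":
            raise ValueError(
                f"unsupported placement {spec.placement!r} for buffer {spec.name!r}"
            )
        # Validate against the actual window size, which may exceed the request.
        if offset + spec.nbytes > window_size:
            raise ValueError(
                f"buffer {spec.name!r} overflows window: "
                f"{offset + spec.nbytes} > {window_size}"
            )
        ptrs.append(window_base + offset)
        offset += spec.nbytes
    return ptrs


specs = [BufferSpec("input", 64), BufferSpec("output", 128)]
ptrs = carve_window(specs, window_base=0x1000, window_size=256)
# The pointer list is 1:1 with the spec list, so a name->ptr dict is a zip away.
name_to_ptr = dict(zip((s.name for s in specs), ptrs))
```

Because the returned list keeps the order of the input specs, a consumer can build its lookup table without any extra bookkeeping.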
Also adds ChipWorker.shutdown_bootstrap(), the matching teardown: it
releases the HCCL comm handle stashed on self._comm_handle by
bootstrap_context inside a try/finally that always clears the field,
so the zero-handle guard makes the method truly idempotent even if
comm_destroy raises.
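
The teardown pattern described above can be sketched as follows. `BootstrapTeardownSketch`, `adopt`, and the injected `destroy_fn` are hypothetical names used only for illustration; `destroy_fn` stands in for comm_destroy, whose real signature is not shown in this PR text.

```python
class BootstrapTeardownSketch:
    """Hypothetical stand-in for the comm-handle part of ChipWorker."""

    def __init__(self, destroy_fn):
        self._comm_handle = 0          # zero means "no communicator held"
        self._destroy_fn = destroy_fn  # stand-in for comm_destroy

    def adopt(self, handle: int) -> None:
        """Record a handle, as bootstrap_context is described to do after comm_init."""
        self._comm_handle = handle

    def shutdown_bootstrap(self) -> None:
        handle = self._comm_handle
        if handle == 0:
            return  # zero-handle guard: nothing to release
        try:
            self._destroy_fn(handle)
        finally:
            # Clear the field even when destroy raises, so a retry is a no-op.
            self._comm_handle = 0


destroyed = []
worker = BootstrapTeardownSketch(destroyed.append)
worker.adopt(7)
worker.shutdown_bootstrap()
worker.shutdown_bootstrap()  # second call hits the guard and does nothing
```

Clearing the field in `finally` rather than after the destroy call is what keeps a failed destroy from being retried on a handle whose state is unknown.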
Design decisions (4):
1. Channel parameter is Optional[ChipBootstrapChannel], not required.
L5 unit tests -- especially the sim path where the child process
consumes the return value directly -- must be able to drive
bootstrap_context without allocating a per-chip mailbox. The
channel is the L6 publish hook for the child-to-parent handshake,
not a structural component of L5 itself. When channel=None,
exceptions still propagate normally; the only thing skipped is
the write_success/write_error side effect.
2. New dataclasses live in python/simpler/task_interface.py, not in
worker.py. ChipWorker is a task_interface module type and its
one-shot config -- ChipCommBootstrapConfig, ChipBufferSpec,
HostBufferStaging, ChipBootstrapConfig, ChipBootstrapResult --
belongs alongside it. worker.py describes L3+ Worker concerns
(scheduler, ring, mailbox), which L5 does not touch.
3. Failure mode collapses all exceptions to code=1 with a
"<ExceptionType>: <message>" body before rethrowing. The single
exit point wraps everything from set_device through the final
channel.write_success, so callers never need to distinguish
"before" vs "after" the communicator came up. code=1 matches the
L4 convention so downstream consumers that already multiplex on
the mailbox error_code do not see a new value. When channel is
None, the exception is simply re-raised; there is no mailbox
write path to skip.
4. Comm handle lifecycle is explicit. On successful comm_init,
bootstrap_context stashes the handle at self._comm_handle.
shutdown_bootstrap() is the matching release: it comm_destroys
the handle inside a try/finally and clears the field to zero, so
a double call is a no-op -- and so is a retry after comm_destroy
itself raises. finalize() is intentionally NOT wired to this
method; ChipWorker.finalize keeps its pre-L5 semantics and the
teardown order (shutdown_bootstrap then finalize) is L6's
orchestration concern. Tests verify this order explicitly.
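
Design decisions 1 and 3 combine into a small control-flow pattern, sketched below. `RecordingChannel` is a hypothetical stand-in for ChipBootstrapChannel that only records calls, and `run_bootstrap` with its `bring_up` callable is an invented wrapper; the real write_success takes the ChipBootstrapResult fields, which this sketch does not model.

```python
from typing import Callable, Optional


class RecordingChannel:
    """Hypothetical stand-in for ChipBootstrapChannel that records writes."""

    def __init__(self):
        self.writes = []

    def write_success(self, result) -> None:
        self.writes.append(("success", result))

    def write_error(self, code: int, message: str) -> None:
        self.writes.append(("error", code, message))


def run_bootstrap(bring_up: Callable[[], object],
                  channel: Optional[RecordingChannel] = None):
    """Run bring_up; publish the outcome when a channel is given; always re-raise."""
    try:
        result = bring_up()
        if channel is not None:
            channel.write_success(result)
        return result
    except Exception as exc:
        if channel is not None:
            # Every failure collapses to code=1 with "<ExceptionType>: <message>".
            channel.write_error(1, f"{type(exc).__name__}: {exc}")
        raise


def failing_bring_up():
    raise ValueError("unsupported placement 'bogus'")


channel = RecordingChannel()
try:
    run_bootstrap(failing_bring_up, channel)
except ValueError:
    pass  # the exception still reaches the caller after the mailbox write
```

With `channel=None` the same function degrades to a plain call: the result is returned or the exception propagates, and no mailbox is needed.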
Tests:
- tests/ut/py/test_worker/test_bootstrap_context_sim.py (no hardware):
* happy path: 2-rank fork on a2a3sim; each rank's ChipBootstrapResult
has non-zero local_window_base, actual_window_size>=requested, and
buffer_ptrs == [local_window_base].
* load_from_host: parent stages 64 bytes in POSIX shm, child 0 runs
bootstrap_context with load_from_host=True, then copy_from reads
the device window back to host and asserts the payload round-
tripped unchanged.
* channel integration: parent allocates one mailbox shm per rank,
children publish via ChipBootstrapChannel; parent verifies
state==SUCCESS and every field matches the return value.
* error path: single-process fork with placement="bogus" raises
ValueError; parent reads ERROR state with error_code=1 and
error_message starting "ValueError: " and containing "bogus".
- tests/ut/py/test_worker/test_bootstrap_context_hw.py (hardware):
* 2-rank tensormap_and_ringbuffer bootstrap on a2a3 devices.
Asserts device_ctx!=0, local_window_base!=0,
actual_window_size>=requested, buffer_ptrs == [local_window_base].
Deliberately does NOT call comm_barrier, so the known HCCL 507018
failure path (already documented in L1b's test_platform_comm.py)
cannot regress this test.
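
The host-side half of the load_from_host test can be sketched with the standard library alone. This is an illustration under assumptions: both sides run in one process here instead of across a fork, and a plain bytearray stands in for the device window slice that ChipWorker.copy_to would fill.

```python
from multiprocessing import shared_memory

payload = bytes(range(64))

# Parent side: create the POSIX shm segment and fill it before forking.
staging = shared_memory.SharedMemory(create=True, size=len(payload))
staging.buf[:len(payload)] = payload

# Child side: attach by name, copy the bytes out, then close the local mapping.
attached = shared_memory.SharedMemory(name=staging.name)
device_window = bytearray(len(payload))  # stand-in for the device window slice
# Slice explicitly: some platforms round the segment size up to a page.
device_window[:] = bytes(attached.buf[:len(payload)])
attached.close()

# Parent side: release its own mapping and remove the segment.
staging.close()
staging.unlink()
```

The child only closes its mapping; unlinking stays with the parent that created the segment, which matches the ownership split described for HostBufferStaging.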
Incidental fix: src/common/platform_comm/comm_sim.cpp:make_shm_name
shortened so the worst-case shm name fits macOS's PSHMNAMLEN=31 limit.
The prior format `/simpler_comm_<pid>_<hash64>` reached 36 characters
and failed shm_open with ENAMETOOLONG on darwin, which the
new L5 sim tests exercise for the first time in CI (the older
test_platform_comm.py is requires_hardware and so never ran on macOS).
The new format `/simpler_<pidhex>_<hash32>` is <= 26 characters and
works on both macOS and Linux; 32 bits of rootinfo-path hash is still
collision-resistant for the "one driver spawns N ranks" launch
pattern this backend is designed for.
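
The length arithmetic behind the rename can be checked with a short Python sketch. The real make_shm_name is C++ in comm_sim.cpp; the hash function used here (truncated SHA-256), the decimal pid in the old format, and the helper names are assumptions made only to reproduce the character counts quoted above.

```python
import hashlib

PSHMNAMLEN = 31  # macOS limit on POSIX shm names, as cited above


def _path_hash(rootinfo_path: str, nbytes: int) -> int:
    """Assumed hash: the first nbytes of SHA-256 over the path."""
    return int.from_bytes(hashlib.sha256(rootinfo_path.encode()).digest()[:nbytes], "big")


def old_shm_name(rootinfo_path: str, pid: int) -> str:
    """Prior format: /simpler_comm_<pid>_<hash64>."""
    return f"/simpler_comm_{pid}_{_path_hash(rootinfo_path, 8):016x}"


def new_shm_name(rootinfo_path: str, pid: int) -> str:
    """New format: /simpler_<pidhex>_<hash32>."""
    return f"/simpler_{pid:x}_{_path_hash(rootinfo_path, 4):08x}"


old_len = len(old_shm_name("/tmp/rootinfo", 12345))             # 14 + 5 + 1 + 16
worst_new_len = len(new_shm_name("/tmp/rootinfo", 0xFFFFFFFF))  # 9 + 8 + 1 + 8
```

With a five-digit pid the old format comes to 36 characters, over the limit, while the new format tops out at 26 even for a 32-bit pid.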
Scope:
- python/simpler/task_interface.py: new dataclasses +
bootstrap_context + shutdown_bootstrap + re-exports of
CHIP_BOOTSTRAP_MAILBOX_SIZE, ChipBootstrapChannel, and
ChipBootstrapMailboxState.
- src/common/platform_comm/comm_sim.cpp: shm name length fix above.
- Does not touch worker.py, nanobind bindings, or any runtime code --
L5 is otherwise purely a Python composition layer over the
L1a/L1b/L2 surfaces already merged upstream.
- L6 (parent-side Worker.init fork orchestration) is deliberately not
addressed here; that is a separate PR that builds on this API.
Audit of existing ChipWorker signatures + the sim backend
ready-count-barrier constraint that forces the sim tests to fork
N rank children lives in .docs/l5-audit.md (local-only per repo
.gitignore convention).
Verified locally: tests/ut/py/test_worker (macOS arm64, Python 3.14)
59 passed, 2 skipped (HCCL hardware + test_platform_comm).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request on Apr 21, 2026
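Exercises end to end the distributed stack assembled from hw-native-sys#592, hw-native-sys#597, hw-native-sys#605, hw-native-sys#608, hw-native-sys#609, hw-native-sys#610, and hw-native-sys#613. Through Worker(level=3, chip_bootstrap_configs=...), each of the two chips sums every rank's input across ranks via CommRemotePtr (MTE2) and writes the result back to its own output; worker.copy_from then reads the output back for verification.
Files:
- kernels/aiv/allreduce_kernel.cpp: copied directly from hw-native-sys#307 (PKUZHOU / echo_stone), with a single include path changed ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) through to the AIV task unchanged, bypassing the Tensor wrapper (the Tensor path would rewrite the pointers).
- main.py: 2-chip harness. Per-rank input is delivered into the window during bootstrap via SharedMemory + HostBufferStaging, and the shm is unlinked after init; orch_fn issues add_scalar × 5 per chip and submits to submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: attaches device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches main() automatically.
WIP: only static checks were done locally (AST parse + import-name verification); the code has not been compiled or run. The next step is bring-up on a 2-chip a2a3 environment; the known points that need verification are listed in the PR body.
Co-authored-by: echo_stone <liulei281@huawei.com>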
Summary
- ChipWorker.bootstrap_context(device_id, cfg, channel=None): a single entry point that composes L1b's set_device + comm_* + copy_to and L2's ChipBootstrapChannel into the one-shot per-chip bring-up L6 will call for every forked chip child.
- ChipWorker.shutdown_bootstrap(): idempotent release of the HCCL comm handle stashed by bootstrap_context.
- New dataclasses in python/simpler/task_interface.py to describe the inputs/outputs: ChipCommBootstrapConfig, ChipBufferSpec, HostBufferStaging, ChipBootstrapConfig, ChipBootstrapResult.
- Re-exports of CHIP_BOOTSTRAP_MAILBOX_SIZE, ChipBootstrapChannel, and ChipBootstrapMailboxState from simpler.task_interface so callers need a single import.

Part of the PR #571 split (see the L1a/L1b/L2/L4 predecessors). L6 (parent-side Worker.init fork orchestration) is a separate PR that builds on this API.

Design decisions
1. Channel is optional. L5 unit tests call bootstrap_context directly and consume the return value; requiring a channel would force every test to allocate a mailbox shm. Channel is the L6 publish hook, not a structural component of L5.
2. New dataclasses live in task_interface.py, not worker.py. They describe ChipWorker inputs, so they belong alongside ChipWorker. worker.py is L3+ concerns (scheduler, ring, mailbox).
3. Failures collapse to code=1 + "<ExceptionType>: <message>". A single try/except wraps the whole bring-up; callers never need to distinguish "before" vs "after" comm came up. code=1 aligns with L4.
4. Comm handle lifecycle is explicit. bootstrap_context stashes the HCCL handle on self._comm_handle; shutdown_bootstrap() releases it (zero-handle guard makes double-call a no-op). finalize() deliberately does NOT chain into shutdown_bootstrap; L6 owns the teardown order.

Test plan
- pytest tests/ut/py/test_worker/test_bootstrap_context_sim.py (no hardware; 4 cases: happy path, load_from_host round-trip, channel integration, error path)
- pytest tests/ut/py/test_worker/test_bootstrap_context_hw.py --platform a2a3 --device 0-1 on a2a3 hardware (1 case: 2-rank HCCL bootstrap, no barrier; avoids known 507018)
- tests/ut/py/test_worker still green under Linux CI (macOS sim masks some Linux-only failures, so watching the CI run is load-bearing here)