Add: L4 error propagation from child workers to Worker.run by ChaoWao · Pull Request #605 · hw-native-sys/simpler

ChaoWao · 2026-04-20T09:33:00Z

Summary

Surfaces Python exceptions from forked SubWorker / ChipWorker / L4-inner-Worker
loops up to the caller of Worker.run(). Before this change, a child-side
exception was written to OFF_ERROR in the mailbox but never read by
WorkerThread::dispatch_process — the parent wrote IDLE on TASK_DONE
regardless and the caller saw silent success with garbage output. The
CONTROL path read OFF_ERROR but threw a message with no cause.

Add MAILBOX_OFF_ERROR_MSG (256 B at the mailbox tail; no existing
offset shifts) and expose it via nanobind so Python cannot drift.
dispatch_process reads error + msg after TASK_DONE and throws
std::runtime_error with the child-written message.
WorkerThread::loop() wraps dispatch_* in try/catch so an uncaught
exception cannot terminate the std::thread; failures route to
WorkerManager (first-error-wins std::exception_ptr).
Orchestrator::submit_impl and drain() check WorkerManager::has_error()
and rethrow — submit is fail-fast, drain waits for in-flight tasks to
finish before rethrowing so ring slots don't leak.
scope_end deliberately does NOT throw (would strand scope refs and
hang drain); the throw point is submit_* or drain.
Python loops collapse error codes to code=1 + message; Worker.run
clears the error slot before _scope_begin so a prior failed run
doesn't poison the next one.

Design decisions (4)

Error carrier: int32 code + 256-byte NUL-terminated message at
the mailbox tail. Existing offsets unchanged.
Code semantics: collapse 1 (registry miss) / 2 (callable
raised) to code=1 + msg. Subclassing exceptions deferred.
Scope mid-failure: fail-fast at submit_*, drain rethrow on
exit. scope_end never throws. In-flight tasks complete naturally.
Multi-child: no active signal to peers. Fail-fast prevents new
submits; in-flight peers finish their current task; close() writes
SHUTDOWN as usual.

See the commit message for the full rationale.

Test plan

tests/ut/py/test_worker/test_error_propagation.py (new, 5 cases)
tests/ut/py/test_worker entire suite (42 passed, 1 HCCL skip)
tests/ut/cpp entire suite via ctest (all 8 passed)
Hardware CI (no local hardware; relying on CI)

Before this change, a Python exception in a forked SubWorker / ChipWorker / L4-inner-Worker loop was written to OFF_ERROR in the mailbox but never read by WorkerThread::dispatch_process. The parent wrote IDLE on TASK_DONE regardless of error and the caller of Worker.run() saw silent success with garbage output. The CONTROL path read OFF_ERROR but threw a generic message with no cause. Design decisions (4): 1. Error carrier: int32 code + 256-byte message region at the mailbox tail (MAILBOX_OFF_ERROR_MSG). No existing offsets shift — OFF_ARGS and everything else stay byte-compatible. 256 B holds "<ExceptionType>: <message>" from the three Python loops; longer messages are truncated and NUL-terminated. MAILBOX_OFF_ERROR_MSG and MAILBOX_ERROR_MSG_SIZE are exposed via nanobind m.attr(...) so Python reads them from task_interface instead of hardcoding. 2. Error-code semantics: all failures collapse to code=1 + filled msg. The previous 1-vs-2 distinction (registry miss vs callable raised) was not actionable; the message already identifies the cause. Subclassing exceptions on the C++ side is deferred until there is a consumer that branches on them. 3. Scope mid-failure: fail-fast. Orchestrator::submit_* checks WorkerManager::has_error() at entry and rethrows the stored exception_ptr — the orch fn unwinds, Worker.run's finally runs _scope_end and _drain. drain() waits for active_tasks_ == 0 and rethrows (so in-flight tasks drain naturally and ring slots do not leak). scope_end deliberately does NOT throw: aborting it would leave scope refs unreleased and drain would hang forever. The rethrow moves from drain() once the allocator has been reset, so the next Worker.run() (after _clear_error) starts from a clean task_id = 0. 4. Multi-child: no active signal. When one child fails, peer children complete their current task and their WorkerThread::loop() catches any subsequent throw from dispatch_process. The parent's fail-fast at submit_* ensures no new tasks go out. close() writes SHUTDOWN to every mailbox as usual. C++ changes: - WorkerThread::loop() wraps dispatch_thread/dispatch_process in a try/catch — an uncaught exception would terminate the std::thread via std::terminate. The caught exception_ptr is reported to WorkerManager; on_complete_(slot) still fires so the scheduler releases consumers and drain() reaches zero. - WorkerThread::dispatch_process clears OFF_ERROR / OFF_ERROR_MSG before TASK_READY, reads them after TASK_DONE, and throws std::runtime_error with the child-written message on non-zero. - WorkerThread::control_malloc does the same for CONTROL_REQUEST. - WorkerManager gains report_error / has_error / take_error / clear_error, protecting a single std::exception_ptr under a mutex with first-error-wins semantics. - Orchestrator::submit_impl and Orchestrator::drain both check manager_->has_error() and rethrow. Orchestrator::clear_error delegates to WorkerManager. Python changes: - _sub_worker_loop / _chip_process_loop (TASK and CONTROL paths) / _child_worker_loop: write code=1 + `f"{prefix}: {type(e).__name__}: {e}"` into OFF_ERROR_MSG via a new _write_error helper that truncates and zero-pads. Previous code writing error=2 for "callable raised" is folded into code=1. - Worker.run calls Orchestrator._clear_error() before _scope_begin so a prior failed run does not poison the next one. - _chip_control reads OFF_ERROR_MSG and includes it in the RuntimeError it raises (parity with WorkerThread::control_malloc). Tests (tests/ut/py/test_worker/test_error_propagation.py, no hardware): - SubWorker callable raises → Worker.run raises RuntimeError containing original exception type and message. - Registry miss → surfaces with "not registered" in the message. - Failed run does not wedge the Worker; next Worker.run with a clean orch completes. - Sequential submits with fail-fast: second submit after a failure observes has_error and rethrows immediately. - L4 → L3 → SubWorker chain: exception raised in the innermost sub surfaces at the L4 caller with both child_worker and sub_worker prefixes in the chained message. Scope: - Does not touch the scheduler, ring, scope, tensormap, submit_impl body beyond the entry check, or any runtime code under src/{arch}/. - Does not change SubmitResult or TaskState. Error subclassing is deferred. The chip_process init-failure hang is NOT fixed here — a message is now written for the failure path but the state-handshake bug is out of scope for L4 and is tracked locally in KNOWN_ISSUES.md. Audit of writers/readers of OFF_ERROR + design rationale is captured in .docs/l4-audit.md (local-only per repo .gitignore convention). Verified: - tests/ut/py/test_worker: 42 passed, 1 skipped (HCCL, hardware only) - tests/ut/cpp (build/ut_cpp, ctest): all 8 tests passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements a robust error propagation mechanism across the hierarchical worker system. It introduces a dedicated error message region in the shared memory mailbox, allowing child processes to report detailed exception information back to the parent. The C++ WorkerManager and WorkerThread have been updated to capture these errors and rethrow them in the orchestrator thread, complemented by a fail-fast check in submit_impl and error state management in Worker.run. Comprehensive unit tests were also added to verify error surfacing across multiple worker levels. I have no feedback to provide as the existing review comments were either purely validating the implementation or lacked actionable code suggestions.

走通 hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613 拼起来的分布式 stack。通过 Worker(level=3, chip_bootstrap_configs=...) 让两卡各自把所有 rank 的 input 经 CommRemotePtr 跨 rank MTE2 求和,再写回自己的 output,用 worker.copy_from 读回校验。文件: - kernels/aiv/allreduce_kernel.cpp —— 从 hw-native-sys#307 (PKUZHOU / echo_stone) 直接搬过来,只改了一处 include 路径 ("common/comm_context.h" → "platform_comm/comm_context.h"),对齐 L1b 移动后的 header 位置。 - kernels/orchestration/allreduce_orch.cpp —— 把 ChipStorageTaskArgs 里的 5 个 scalar (input_ptr, output_ptr, nranks, root, device_ctx) 原样透给 AIV task,不走 Tensor 包装(Tensor 路径会改写指针)。 - main.py —— 2 卡 harness:per-rank input 用 SharedMemory + HostBufferStaging 在 bootstrap 阶段送进 window,init 后 unlink shm;orch_fn 每 chip add_scalar × 5 提交到 submit_next_level;copy_from 读回 output 校验。 - tests/st/workers_l3/test_allreduce_distributed_hw.py —— 挂 device_count(2) + platforms(["a2a3"]) 让 st-onboard-a2a3 自动拉起 main()。 WIP:本机只做了静态检查 (AST parse + import name 核对),没编译过没跑过。下一步带到 2 卡 a2a3 环境调通;已知需要验证的点见 PR body。 Co-authored-by: echo_stone <liulei281@huawei.com>

gemini-code-assist bot reviewed Apr 20, 2026

View reviewed changes

ChaoWao merged commit a9b521b into hw-native-sys:main Apr 20, 2026
14 checks passed

ChaoWao deleted the feat/l4-error-propagation branch April 20, 2026 12:14

ChaoWao mentioned this pull request Apr 20, 2026

Fix: Python-side acquire/release on mailbox state #609

Merged

4 tasks

ChaoWao mentioned this pull request Apr 21, 2026

Add: examples/workers/l3/allreduce_distributed example #307

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add: L4 error propagation from child workers to Worker.run#605

Add: L4 error propagation from child workers to Worker.run#605
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:feat/l4-error-propagation

ChaoWao commented Apr 20, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChaoWao commented Apr 20, 2026

Summary

Design decisions (4)

Test plan

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant