Add: L4 error propagation from child workers to Worker.run#605
Add: L4 error propagation from child workers to Worker.run#605ChaoWao merged 1 commit intohw-native-sys:mainfrom
Conversation
Before this change, a Python exception in a forked SubWorker / ChipWorker /
L4-inner-Worker loop was written to OFF_ERROR in the mailbox but never
read by WorkerThread::dispatch_process. The parent wrote IDLE on
TASK_DONE regardless of error and the caller of Worker.run() saw silent
success with garbage output. The CONTROL path read OFF_ERROR but threw
a generic message with no cause.
Design decisions (4):
1. Error carrier: int32 code + 256-byte message region at the mailbox
tail (MAILBOX_OFF_ERROR_MSG). No existing offsets shift — OFF_ARGS
and everything else stay byte-compatible. 256 B holds
"<ExceptionType>: <message>" from the three Python loops; longer
messages are truncated and NUL-terminated. MAILBOX_OFF_ERROR_MSG and
MAILBOX_ERROR_MSG_SIZE are exposed via nanobind m.attr(...) so
Python reads them from task_interface instead of hardcoding.
2. Error-code semantics: all failures collapse to code=1 + filled msg.
The previous 1-vs-2 distinction (registry miss vs callable raised)
was not actionable; the message already identifies the cause.
Subclassing exceptions on the C++ side is deferred until there is
a consumer that branches on them.
3. Scope mid-failure: fail-fast. Orchestrator::submit_* checks
WorkerManager::has_error() at entry and rethrows the stored
exception_ptr — the orch fn unwinds, Worker.run's finally runs
_scope_end and _drain. drain() waits for active_tasks_ == 0 and
rethrows (so in-flight tasks drain naturally and ring slots do not
leak). scope_end deliberately does NOT throw: aborting it would
leave scope refs unreleased and drain would hang forever. The
rethrow moves from drain() once the allocator has been reset, so
the next Worker.run() (after _clear_error) starts from a clean
task_id = 0.
4. Multi-child: no active signal. When one child fails, peer children
complete their current task and their WorkerThread::loop() catches
any subsequent throw from dispatch_process. The parent's fail-fast
at submit_* ensures no new tasks go out. close() writes SHUTDOWN to
every mailbox as usual.
C++ changes:
- WorkerThread::loop() wraps dispatch_thread/dispatch_process in a
try/catch — an uncaught exception would terminate the std::thread
via std::terminate. The caught exception_ptr is reported to
WorkerManager; on_complete_(slot) still fires so the scheduler
releases consumers and drain() reaches zero.
- WorkerThread::dispatch_process clears OFF_ERROR / OFF_ERROR_MSG
before TASK_READY, reads them after TASK_DONE, and throws
std::runtime_error with the child-written message on non-zero.
- WorkerThread::control_malloc does the same for CONTROL_REQUEST.
- WorkerManager gains report_error / has_error / take_error /
clear_error, protecting a single std::exception_ptr under a mutex
with first-error-wins semantics.
- Orchestrator::submit_impl and Orchestrator::drain both check
manager_->has_error() and rethrow. Orchestrator::clear_error
delegates to WorkerManager.
Python changes:
- _sub_worker_loop / _chip_process_loop (TASK and CONTROL paths) /
_child_worker_loop: write code=1 + `f"{prefix}: {type(e).__name__}: {e}"`
into OFF_ERROR_MSG via a new _write_error helper that truncates and
zero-pads. Previous code writing error=2 for "callable raised" is
folded into code=1.
- Worker.run calls Orchestrator._clear_error() before _scope_begin so
a prior failed run does not poison the next one.
- _chip_control reads OFF_ERROR_MSG and includes it in the RuntimeError
it raises (parity with WorkerThread::control_malloc).
Tests (tests/ut/py/test_worker/test_error_propagation.py, no hardware):
- SubWorker callable raises → Worker.run raises RuntimeError containing
original exception type and message.
- Registry miss → surfaces with "not registered" in the message.
- Failed run does not wedge the Worker; next Worker.run with a clean
orch completes.
- Sequential submits with fail-fast: second submit after a failure
observes has_error and rethrows immediately.
- L4 → L3 → SubWorker chain: exception raised in the innermost sub
surfaces at the L4 caller with both child_worker and sub_worker
prefixes in the chained message.
Scope:
- Does not touch the scheduler, ring, scope, tensormap, submit_impl
body beyond the entry check, or any runtime code under src/{arch}/.
- Does not change SubmitResult or TaskState. Error subclassing is
deferred. The chip_process init-failure hang is NOT fixed here — a
message is now written for the failure path but the state-handshake
bug is out of scope for L4 and is tracked locally in KNOWN_ISSUES.md.
Audit of writers/readers of OFF_ERROR + design rationale is captured
in .docs/l4-audit.md (local-only per repo .gitignore convention).
Verified:
- tests/ut/py/test_worker: 42 passed, 1 skipped (HCCL, hardware only)
- tests/ut/cpp (build/ut_cpp, ctest): all 8 tests passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements a robust error propagation mechanism across the hierarchical worker system. It introduces a dedicated error message region in the shared memory mailbox, allowing child processes to report detailed exception information back to the parent. The C++ WorkerManager and WorkerThread have been updated to capture these errors and rethrow them in the orchestrator thread, complemented by a fail-fast check in submit_impl and error state management in Worker.run. Comprehensive unit tests were also added to verify error surfacing across multiple worker levels. I have no feedback to provide as the existing review comments were either purely validating the implementation or lacked actionable code suggestions.
走通 hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613 拼起来的分布式 stack。 通过 Worker(level=3, chip_bootstrap_configs=...) 让两卡各自把所有 rank 的 input 经 CommRemotePtr 跨 rank MTE2 求和,再写回自己的 output,用 worker.copy_from 读回校验。 文件: - kernels/aiv/allreduce_kernel.cpp —— 从 hw-native-sys#307 (PKUZHOU / echo_stone) 直接搬过来,只改了一处 include 路径 ("common/comm_context.h" → "platform_comm/comm_context.h"),对齐 L1b 移动后的 header 位置。 - kernels/orchestration/allreduce_orch.cpp —— 把 ChipStorageTaskArgs 里的 5 个 scalar (input_ptr, output_ptr, nranks, root, device_ctx) 原样透给 AIV task,不走 Tensor 包装(Tensor 路径会改写指针)。 - main.py —— 2 卡 harness:per-rank input 用 SharedMemory + HostBufferStaging 在 bootstrap 阶段送进 window,init 后 unlink shm;orch_fn 每 chip add_scalar × 5 提交到 submit_next_level;copy_from 读回 output 校验。 - tests/st/workers_l3/test_allreduce_distributed_hw.py —— 挂 device_count(2) + platforms(["a2a3"]) 让 st-onboard-a2a3 自动拉起 main()。 WIP:本机只做了静态检查 (AST parse + import name 核对),没编译过 没跑过。下一步带到 2 卡 a2a3 环境调通;已知需要验证的点见 PR body。 Co-authored-by: echo_stone <liulei281@huawei.com>
Summary
Surfaces Python exceptions from forked SubWorker / ChipWorker / L4-inner-Worker
loops up to the caller of
Worker.run(). Before this change, a child-sideexception was written to
OFF_ERRORin the mailbox but never read byWorkerThread::dispatch_process— the parent wroteIDLEonTASK_DONEregardless and the caller saw silent success with garbage output. The
CONTROLpath readOFF_ERRORbut threw a message with no cause.MAILBOX_OFF_ERROR_MSG(256 B at the mailbox tail; no existingoffset shifts) and expose it via nanobind so Python cannot drift.
dispatch_processreads error + msg afterTASK_DONEand throwsstd::runtime_errorwith the child-written message.WorkerThread::loop()wrapsdispatch_*in try/catch so an uncaughtexception cannot terminate the
std::thread; failures route toWorkerManager(first-error-winsstd::exception_ptr).Orchestrator::submit_implanddrain()checkWorkerManager::has_error()and rethrow — submit is fail-fast, drain waits for in-flight tasks to
finish before rethrowing so ring slots don't leak.
scope_enddeliberately does NOT throw (would strand scope refs andhang drain); the throw point is
submit_*ordrain.code=1+ message;Worker.runclears the error slot before
_scope_beginso a prior failed rundoesn't poison the next one.
Design decisions (4)
the mailbox tail. Existing offsets unchanged.
1(registry miss) /2(callableraised) to
code=1+ msg. Subclassing exceptions deferred.submit_*, drain rethrow onexit.
scope_endnever throws. In-flight tasks complete naturally.submits; in-flight peers finish their current task;
close()writesSHUTDOWN as usual.
See the commit message for the full rationale.
Test plan
tests/ut/py/test_worker/test_error_propagation.py(new, 5 cases)tests/ut/py/test_workerentire suite (42 passed, 1 HCCL skip)tests/ut/cppentire suite via ctest (all 8 passed)