
Add: L4 error propagation from child workers to Worker.run #605

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/l4-error-propagation on Apr 20, 2026

@ChaoWao ChaoWao commented Apr 20, 2026

Summary

Surfaces Python exceptions from forked SubWorker / ChipWorker / L4-inner-Worker
loops up to the caller of Worker.run(). Before this change, a child-side
exception was written to OFF_ERROR in the mailbox but never read by
WorkerThread::dispatch_process — the parent wrote IDLE on TASK_DONE
regardless and the caller saw silent success with garbage output. The
CONTROL path read OFF_ERROR but threw a message with no cause.

  • Add MAILBOX_OFF_ERROR_MSG (256 B at the mailbox tail; no existing
    offset shifts) and expose it via nanobind so Python cannot drift.
  • dispatch_process reads error + msg after TASK_DONE and throws
    std::runtime_error with the child-written message.
  • WorkerThread::loop() wraps dispatch_* in try/catch so an uncaught
    exception cannot terminate the std::thread; failures route to
    WorkerManager (first-error-wins std::exception_ptr).
  • Orchestrator::submit_impl and drain() check WorkerManager::has_error()
    and rethrow — submit is fail-fast, drain waits for in-flight tasks to
    finish before rethrowing so ring slots don't leak.
  • scope_end deliberately does NOT throw (would strand scope refs and
    hang drain); the throw point is submit_* or drain.
  • Python loops collapse error codes to code=1 + message; Worker.run
    clears the error slot before _scope_begin so a prior failed run
    doesn't poison the next one.

Design decisions (4)

  1. Error carrier: int32 code + 256-byte NUL-terminated message at
    the mailbox tail. Existing offsets unchanged.
  2. Code semantics: collapse 1 (registry miss) / 2 (callable
    raised) to code=1 + msg. Subclassing exceptions deferred.
  3. Scope mid-failure: fail-fast at submit_*, drain rethrow on
    exit. scope_end never throws. In-flight tasks complete naturally.
  4. Multi-child: no active signal to peers. Fail-fast prevents new
    submits; in-flight peers finish their current task; close() writes
    SHUTDOWN as usual.
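
To make decision 1 concrete, here is a minimal sketch of how a parent might decode the error carrier from a mailbox buffer. The offsets `OFF_ERROR` and `OFF_ERROR_MSG` below are placeholder values chosen for illustration, and `read_error` is a hypothetical helper; the real constants are the ones exposed through task_interface.

```python
import struct

# Placeholder layout for illustration only. The real values come from
# task_interface (MAILBOX_OFF_ERROR_MSG, MAILBOX_ERROR_MSG_SIZE).
OFF_ERROR = 0          # assumed offset of the int32 error code
OFF_ERROR_MSG = 64     # assumed offset of the message region
ERROR_MSG_SIZE = 256   # message region size stated in this PR


def read_error(mailbox):
    """Return (code, message) decoded from a bytes-like mailbox buffer."""
    (code,) = struct.unpack_from("<i", mailbox, OFF_ERROR)
    raw = bytes(mailbox[OFF_ERROR_MSG:OFF_ERROR_MSG + ERROR_MSG_SIZE])
    # The message is NUL-terminated: keep only the bytes before the first NUL.
    # errors="replace" tolerates a multi-byte character cut by truncation.
    message = raw.split(b"\x00", 1)[0].decode("utf-8", errors="replace")
    return code, message
```

A code of 0 with an empty message corresponds to the cleared state; any non-zero code is expected to come with a filled message.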

See the commit message for the full rationale.

Test plan

  • tests/ut/py/test_worker/test_error_propagation.py (new, 5 cases)
  • tests/ut/py/test_worker entire suite (42 passed, 1 HCCL skip)
  • tests/ut/cpp entire suite via ctest (all 8 passed)
  • Hardware CI (no local hardware; relying on CI)

Before this change, a Python exception in a forked SubWorker / ChipWorker /
L4-inner-Worker loop was written to OFF_ERROR in the mailbox but never
read by WorkerThread::dispatch_process. The parent wrote IDLE on
TASK_DONE regardless of error and the caller of Worker.run() saw silent
success with garbage output. The CONTROL path read OFF_ERROR but threw
a generic message with no cause.

Design decisions (4):

1. Error carrier: int32 code + 256-byte message region at the mailbox
   tail (MAILBOX_OFF_ERROR_MSG). No existing offsets shift — OFF_ARGS
   and everything else stay byte-compatible. 256 B holds
   "<ExceptionType>: <message>" from the three Python loops; longer
   messages are truncated and NUL-terminated. MAILBOX_OFF_ERROR_MSG and
   MAILBOX_ERROR_MSG_SIZE are exposed via nanobind m.attr(...) so
   Python reads them from task_interface instead of hardcoding.

2. Error-code semantics: all failures collapse to code=1 + filled msg.
   The previous 1-vs-2 distinction (registry miss vs callable raised)
   was not actionable; the message already identifies the cause.
   Subclassing exceptions on the C++ side is deferred until there is
   a consumer that branches on them.

3. Scope mid-failure: fail-fast. Orchestrator::submit_* checks
   WorkerManager::has_error() at entry and rethrows the stored
   exception_ptr — the orch fn unwinds, Worker.run's finally runs
   _scope_end and _drain. drain() waits for active_tasks_ == 0 and
   rethrows (so in-flight tasks drain naturally and ring slots do not
   leak). scope_end deliberately does NOT throw: aborting it would
   leave scope refs unreleased and drain would hang forever. drain()
   rethrows only once the allocator has been reset, so the next
   Worker.run() (after _clear_error) starts from a clean task_id = 0.

4. Multi-child: no active signal. When one child fails, peer children
   complete their current task and their WorkerThread::loop() catches
   any subsequent throw from dispatch_process. The parent's fail-fast
   at submit_* ensures no new tasks go out. close() writes SHUTDOWN to
   every mailbox as usual.
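
The throw points in decisions 3 and 4 can be illustrated with a small single-threaded toy model. `ToyOrchestrator` and `run` are hypothetical names, tasks execute inline instead of being dispatched to child workers, and there is no real ring or allocator; only the ordering of the entry check, scope_end, drain, and clear_error follows the description above.

```python
class ToyOrchestrator:
    """Toy stand-in for the Orchestrator; not the real implementation."""

    def __init__(self):
        self._error = None   # stands in for the stored exception_ptr
        self.completed = 0

    def submit(self, task):
        # Fail-fast entry check: refuse new work once an error is recorded.
        if self._error is not None:
            raise self._error
        try:
            task()           # the real system dispatches this asynchronously
            self.completed += 1
        except Exception as exc:
            if self._error is None:   # first error wins
                self._error = exc

    def scope_end(self):
        # Deliberately never raises, so scope refs are always released.
        pass

    def drain(self):
        # The real drain first waits for in-flight tasks to finish.
        if self._error is not None:
            raise self._error

    def clear_error(self):
        self._error = None


def run(orch, orch_fn):
    """Mirror of the Worker.run ordering described in this PR."""
    orch.clear_error()       # a prior failed run must not poison this one
    try:
        orch_fn(orch)
    finally:
        orch.scope_end()
        orch.drain()
```

In this model a failing task does not raise at its own submit; the failure surfaces at the next submit or at drain, which matches the fail-fast description.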

C++ changes:

- WorkerThread::loop() wraps dispatch_thread/dispatch_process in a
  try/catch — an uncaught exception would terminate the std::thread
  via std::terminate. The caught exception_ptr is reported to
  WorkerManager; on_complete_(slot) still fires so the scheduler
  releases consumers and drain() reaches zero.
- WorkerThread::dispatch_process clears OFF_ERROR / OFF_ERROR_MSG
  before TASK_READY, reads them after TASK_DONE, and throws
  std::runtime_error with the child-written message on non-zero.
- WorkerThread::control_malloc does the same for CONTROL_REQUEST.
- WorkerManager gains report_error / has_error / take_error /
  clear_error, protecting a single std::exception_ptr under a mutex
  with first-error-wins semantics.
- Orchestrator::submit_impl and Orchestrator::drain both check
  manager_->has_error() and rethrow. Orchestrator::clear_error
  delegates to WorkerManager.
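
As a rough Python model of the C++ changes above (the real code uses std::exception_ptr and a mutex), the first-error-wins slot and the guarded worker step might look like the sketch below. `ErrorSlot` and `worker_step` are hypothetical names. Whether the real take_error clears the slot is not stated here, so this sketch assumes it leaves the error in place until clear_error is called.

```python
import threading


class ErrorSlot:
    """Holds at most one exception; later reports are ignored."""

    def __init__(self):
        self._lock = threading.Lock()
        self._error = None

    def report_error(self, exc):
        with self._lock:
            if self._error is None:   # first error wins
                self._error = exc

    def has_error(self):
        with self._lock:
            return self._error is not None

    def take_error(self):
        # Assumption: returns the stored error without clearing it.
        with self._lock:
            return self._error

    def clear_error(self):
        with self._lock:
            self._error = None


def worker_step(slot, dispatch, on_complete, task_slot):
    """One loop iteration: a dispatch failure is recorded, never propagated."""
    try:
        dispatch(task_slot)
    except Exception as exc:
        slot.report_error(exc)
    # Completion always fires so consumers are released and drain can finish.
    on_complete(task_slot)
```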

Python changes:

- _sub_worker_loop / _chip_process_loop (TASK and CONTROL paths) /
  _child_worker_loop: write code=1 + `f"{prefix}: {type(e).__name__}: {e}"`
  into OFF_ERROR_MSG via a new _write_error helper that truncates and
  zero-pads. Previous code writing error=2 for "callable raised" is
  folded into code=1.
- Worker.run calls Orchestrator._clear_error() before _scope_begin so
  a prior failed run does not poison the next one.
- _chip_control reads OFF_ERROR_MSG and includes it in the RuntimeError
  it raises (parity with WorkerThread::control_malloc).
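
A minimal sketch of what a helper in the spirit of _write_error could look like, assuming a writable bytes-like mailbox. The function name `write_error` and the two offsets are placeholders for illustration; the real helper reads its constants from task_interface and may differ in detail.

```python
import struct

# Placeholder layout for illustration only.
OFF_ERROR = 0          # assumed offset of the int32 error code
OFF_ERROR_MSG = 64     # assumed offset of the message region
ERROR_MSG_SIZE = 256   # message region size stated in this PR


def write_error(mailbox, prefix, exc):
    """Write code=1 plus a truncated, zero-padded message into the mailbox."""
    text = f"{prefix}: {type(exc).__name__}: {exc}"
    # Reserve one byte so the region always ends with a NUL terminator.
    data = text.encode("utf-8")[:ERROR_MSG_SIZE - 1]
    # Zero-pad to the full region so stale bytes from a prior message vanish.
    padded = data.ljust(ERROR_MSG_SIZE, b"\x00")
    mailbox[OFF_ERROR_MSG:OFF_ERROR_MSG + ERROR_MSG_SIZE] = padded
    # Write the code after the message so a reader that sees a non-zero
    # code also finds the message already in place.
    struct.pack_into("<i", mailbox, OFF_ERROR, 1)
```

Truncating at a byte boundary can split a multi-byte UTF-8 character, so a reader of this sketch would need to decode leniently.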

Tests (tests/ut/py/test_worker/test_error_propagation.py, no hardware):

- SubWorker callable raises → Worker.run raises RuntimeError containing
  original exception type and message.
- Registry miss → surfaces with "not registered" in the message.
- Failed run does not wedge the Worker; next Worker.run with a clean
  orch completes.
- Sequential submits with fail-fast: second submit after a failure
  observes has_error and rethrows immediately.
- L4 → L3 → SubWorker chain: exception raised in the innermost sub
  surfaces at the L4 caller with both child_worker and sub_worker
  prefixes in the chained message.
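
The chained message in the last case can be pictured with the sketch below, where each level formats the exception it caught and the level above sees it as a RuntimeError. `format_error` and `run_level` are hypothetical helpers; the exact chained text produced by the real loops may differ, and the test only relies on both prefixes being present.

```python
def format_error(prefix, exc):
    """Same shape as the message format quoted under 'Python changes'."""
    return f"{prefix}: {type(exc).__name__}: {exc}"


def run_level(prefix, inner):
    """Simulate one level: catch the child failure and surface it upward."""
    try:
        inner()
    except Exception as exc:
        # The parent side raises a RuntimeError carrying the child's message.
        raise RuntimeError(format_error(prefix, exc)) from None


def innermost():
    raise ValueError("boom")


def l4_call():
    # L4 child_worker level wrapping the L3 sub_worker level.
    run_level("child_worker", lambda: run_level("sub_worker", innermost))
```

Calling `l4_call()` in this sketch raises a RuntimeError whose text contains both the child_worker and sub_worker prefixes along with the original ValueError type and message.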

Scope:

- Does not touch the scheduler, ring, scope, tensormap, submit_impl
  body beyond the entry check, or any runtime code under src/{arch}/.
- Does not change SubmitResult or TaskState. Error subclassing is
  deferred. The chip_process init-failure hang is NOT fixed here — a
  message is now written for the failure path but the state-handshake
  bug is out of scope for L4 and is tracked locally in KNOWN_ISSUES.md.

Audit of writers/readers of OFF_ERROR + design rationale is captured
in .docs/l4-audit.md (local-only per repo .gitignore convention).

Verified:
- tests/ut/py/test_worker: 42 passed, 1 skipped (HCCL, hardware only)
- tests/ut/cpp (build/ut_cpp, ctest): all 8 tests passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a robust error propagation mechanism across the hierarchical worker system. It introduces a dedicated error message region in the shared memory mailbox, allowing child processes to report detailed exception information back to the parent. The C++ WorkerManager and WorkerThread have been updated to capture these errors and rethrow them in the orchestrator thread, complemented by a fail-fast check in submit_impl and error state management in Worker.run. Comprehensive unit tests were also added to verify error surfacing across multiple worker levels. I have no feedback to provide as the existing review comments were either purely validating the implementation or lacked actionable code suggestions.

@ChaoWao ChaoWao merged commit a9b521b into hw-native-sys:main Apr 20, 2026
14 checks passed
@ChaoWao ChaoWao deleted the feat/l4-error-propagation branch April 20, 2026 12:14
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
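Exercises, end to end, the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Using Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums the input of every rank via cross-rank MTE2 through CommRemotePtr,
writes the result back to its own output, and worker.copy_from reads it
back for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: copied directly from hw-native-sys#307 (PKUZHOU / echo_stone),
  with only one include path changed ("common/comm_context.h" →
  "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  through to the AIV task unchanged, bypassing the Tensor wrapper (the Tensor path would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window during the
  bootstrap stage using SharedMemory + HostBufferStaging, and the shm is unlinked after init;
  orch_fn does add_scalar × 5 per chip and submits via submit_next_level; copy_from reads back the output for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: attaches device_count(2)
  + platforms(["a2a3"]) so that st-onboard-a2a3 launches main() automatically.

WIP: only static checks (AST parse + import-name verification) were done locally;
the code has not been compiled or run. The next step is to bring it up in a 2-card a2a3 environment; the known points that need verification are listed in the PR body.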

Co-authored-by: echo_stone <liulei281@huawei.com>