Fix: per-device staged-SO naming to stop paired-die AICPU device fault (507899 cascade) #890
📝 Walkthrough
This PR adds Ascend device-side logging environment variables to the …
Changes: Ascend logging environment setup
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)
483-489: Consider adding a workflow guard to prevent accidental merge.
Since this is a diagnostic branch explicitly marked "do not merge", consider adding a step that fails if running on the main branch to provide an extra safeguard against accidental merge. However, given this is a temporary diagnostic branch and the PR review process should catch this, this is optional.
Optional: Example workflow guard
```yaml
steps:
  - name: Block merge to main (DEBUG branch only)
    if: github.ref == 'refs/heads/main'
    run: |
      echo "::error::This is a DEBUG branch and must not be merged to main"
      exit 1
  - name: Checkout repository
    uses: actions/checkout@v5
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yml around lines 483 - 489, This workflow adds debug-only environment vars (ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL) but lacks a guard to prevent accidental merges to main; add a CI step near the top of the job that checks github.ref == 'refs/heads/main' and exits non‑zero with an error message if true so the job fails on main, ensuring the diagnostic branch cannot run on main; reference the env keys ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL in the step’s message to make intent explicit.
743cdb3 to 1b08643
st-onboard-a2a3 fails intermittently with a whole-suite collapse (507899/507018/prepare_callable -1). It is NOT OOM (npu-smi shows HBM ~3G/64G used): the earliest error is an AICPU exception in simpler_aicpu_exec (507018, errorCode 0x2a) that faults the whole chip, after which every rtStreamCreate/rtMalloc on it returns 507899 [driver error:internal error] / 507901 [hdc disconnect].

Root cause: on a2a3 the two dies of one chip (npu-smi Phy-IDs 8/9, 4/5, ...) share the preinstall filesystem, but the runtime staged SOs there under names that are identical across dies. Concurrent bootstrap/staging on the same file corrupts the mmap'd image and traps the AICPU kernel on both dies at once. A single-die 50x solo loop never reproduces; only the parallel multi-die suite does, which points to a cross-die shared-file race.

Fix: make every SO staged under /usr/lib64/aicpu_kernels/0/ per-device:
- Dispatcher inner SO: simpler_inner_<fp>_<device_id>.so. The real device_id (was hardcoded 0) is threaded from DeviceRunner through BootstrapDispatcher into the host JSON reader name (MakeInnerSoBasename) and the device-side writer (DeviceArgs.device_id -> MakeInnerSoPath). Bootstrap cache keyed by (fp, device_id).
- AICPU executor orch SO: libdevice_orch_<pid>_<cid>_<device_id>.so. device_id reaches the AICPU via a new trailing KernelArgs.device_id field, pushed in kernel.cpp through set_orch_device_id() and read by the executor via get_orch_device_id() (platform_regs).
- Update tests/ut/cpp test_orch_so_file for the new signature and add a distinct-device-id case.

Applied symmetrically to a2a3 and a5 (onboard + sim). Validated with 5/5 consecutive st-onboard-a2a3 passes; investigation written up in docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1b08643 to 1cdaed8
…ace (#925)

PR #710 added a silent fallback to 0xDEADBEEF placeholder addresses in get_aicore_regs(AIC_CTRL) when halMemCtl rejected the query, on the claim that "the dispatch path does not actually dereference these addresses". The claim is wrong: platform_init_aicore_regs and platform_deinit_aicore_regs do raw MMIO writes/reads through these addresses (FAST_PATH_ENABLE, DATA_MAIN_BASE, COND), so any chip that hit the fallback dispatched its first AICore task to 0xDEADBEEF. AICore never reached FAST_PATH_OPEN, the AICPU stream hung, and the host surfaced ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after ~2 s. This matches the recurring `device_id=11` stream-sync timeout flake on st-onboard-a2a3 (e.g. 2026-05-29 sdma_async_completion_demo, the 2026-05-30 *_distributed[4] sequence).

The placeholder fix is symmetric across a2a3 + a5: propagate the HAL rc instead of synthesizing addresses. On a5, get_aicore_regs gains an int return type, because the prior `host_regs.empty()` guard never fired: get_aicore_reg_info pre-resizes the vector.

The upstream rc=13 (EACCES) on a2a3 has its own root cause: when 4 chip_processes for the same a2a3 runner fork concurrently (4-device distributed cases), a narrow driver-side serialization window for halMemCtl(AIC_CTRL) drops one request (empirically always dev=11). A short bounded retry (50 ms × 3) absorbs the race window without masking permanent failure modes. The retry applies only on a2a3; a5 uses halResMap and has not exhibited the symptom.

PR #890's per-device staged-SO naming fix addressed a different 507046 path (paired-die simpler_inner_<fp>.so file race producing a simpler_aicpu_exec 0x2a cascade). The two bugs share the surface code but are independent; both fixes need to be present to keep the flake rate down to zero.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
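The bounded retry in the #925 message can be sketched roughly as below. This is an illustration under stated assumptions, not the code from that PR: `retry_bounded` is a hypothetical helper, the query is modeled as a callable returning an `int` rc with 0 meaning success, and the real `halMemCtl` signature is not reproduced here. The sketch shows the two properties the message describes: the number of attempts is fixed, and the last rc is returned to the caller rather than replaced by placeholder data.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not from the PR): call `query` up to `max_attempts`
// times, sleeping `delay` between failed attempts. Returns 0 on the first
// success; otherwise returns the rc of the last failed attempt so the caller
// can propagate it instead of synthesizing a result.
inline int retry_bounded(const std::function<int()>& query,
                         int max_attempts,
                         std::chrono::milliseconds delay) {
    int rc = -1;  // returned unchanged if max_attempts <= 0
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = query();
        if (rc == 0) {
            return 0;
        }
        // Do not sleep after the final attempt; just report the failure.
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
        }
    }
    return rc;
}
```

Assuming "50 ms × 3" counts attempts, a call would look like `retry_bounded(query, 3, std::chrono::milliseconds(50))`. A permanent failure still surfaces after the last attempt, which is what keeps the retry from masking non-transient errors.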
Symptom
`st-onboard-a2a3` fails intermittently (~19% of runs after #870, vs ~10% before) with a whole-suite collapse: the L2 `tensormap_and_ringbuffer` phase reports ~10 failed + ~23 errors at once: `simpler_init failed with code 507899`, `prepare_callable failed -1`, `run_prepared failed with code 507018/507046/507901`, `rtMalloc failed: 507899`.

It looks like an out-of-memory failure. It is not: `npu-smi info` taken right after the failure shows every chip `Health=OK` with HBM ~3 GB / 64 GB used.

Root cause
`507899` is `[driver error:internal error]` and `507901` is `[hdc disconnect]`: a cascade after a device fault, not the cause. Surfacing the CANN device slog (`ASCEND_SLOG_PRINT_TO_STDOUT=1`) shows the earliest error is an AICPU exception in `simpler_aicpu_exec` (507018, errorCode 0x2a). `simpler_aicpu_exec` faults the whole chip; afterwards every `rtStreamCreate`/`rtMalloc` on that chip returns 507899/507901, so the next test's `simpler_init` fails and the suite collapses.
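The "cascade, not cause" reading can be made concrete with a small triage helper. This is a hypothetical sketch, not part of the PR: `is_cascade_code` and `first_non_cascade` are made-up names, and treating exactly 507899 and 507901 as cascade codes is an assumption taken from the analysis in this write-up.

```cpp
#include <vector>

// Assumption from this write-up: these two codes are follow-on errors
// reported after a device fault, not the fault itself.
inline bool is_cascade_code(int code) {
    return code == 507899     // [driver error:internal error]
           || code == 507901; // [hdc disconnect]
}

// Given error codes in the order they were logged, return the first code
// that is not a cascade code (the candidate root-cause error), or 0 if
// every logged code is a cascade code.
inline int first_non_cascade(const std::vector<int>& codes) {
    for (int code : codes) {
        if (!is_cascade_code(code)) {
            return code;
        }
    }
    return 0;
}
```

Applied to a log ordered by time, this skips the 507899/507901 noise and points at the first error worth investigating, such as the 507018 AICPU exception described here.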
Two facts pin the mechanism:
- `npu-smi` Phy-IDs pair as die0/die1 of one Ascend910 (devices 8/9, 4/5, …); the exception fires on both dies of one chip at the same instant, so a chip-shared resource was corrupted.
- The shared preinstall dir is `/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/`. The #870 dispatcher (Feat: AICPU dispatcher bootstrap + cached AICore rtRegisterAllKernel handle, re-apply #537) wrote the AICPU runtime SO there under a content-fingerprint-only name, `simpler_inner_<fp>.so`, identical across both dies. Paired dies share that filesystem, so both wrote/renamed/executed the same file; concurrent bootstrap corrupted the mmap'd image and trapped `simpler_aicpu_exec`.

A single-die 50× solo loop of the unregister/re-prepare/dedup tests never reproduced; only the parallel multi-die suite did, consistent with a cross-die shared-file race, not an intra-process use-after-free.
Fix
Make every SO staged under the shared preinstall dir per-device:
- Dispatcher inner SO: `simpler_inner_<fp>_<device_id>.so`. The real `device_id` (was hardcoded `0`) is threaded from `DeviceRunner` through `BootstrapDispatcher` into both the host JSON reader name (`MakeInnerSoBasename`) and the device-side writer (`DeviceArgs.device_id` → `MakeInnerSoPath`). The process-level bootstrap cache is keyed by `(fp, device_id)`.
- AICPU executor orch SO: `libdevice_orch_<pid>_<cid>_<device_id>.so`. `device_id` reaches the AICPU via a new trailing `KernelArgs.device_id` field, pushed in `kernel.cpp` via `set_orch_device_id()` and read by the executor via `get_orch_device_id()` (platform_regs). This SO was already pid-named (mostly safe); the suffix is defense-in-depth for colliding device-side pids.

Applied symmetrically to a2a3 and a5 (onboard + sim signatures in lockstep).
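The naming scheme and cache key can be sketched as follows. This is a minimal illustration, not the PR's code: `inner_so_basename`, `orch_so_basename`, and `BootstrapCache` are hypothetical stand-ins, the real `MakeInnerSoBasename` / `MakeInnerSoPath` signatures are not shown in this write-up, and the integer types chosen for `pid`, `cid`, and `device_id` are assumptions.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the inner-SO naming: the device_id suffix is
// what makes paired dies resolve to different files.
inline std::string inner_so_basename(const std::string& fingerprint, int device_id) {
    return "simpler_inner_" + fingerprint + "_" + std::to_string(device_id) + ".so";
}

// Hypothetical stand-in for the orch-SO naming; pid/cid types are assumed.
inline std::string orch_so_basename(long pid, long cid, int device_id) {
    return "libdevice_orch_" + std::to_string(pid) + "_" + std::to_string(cid) +
           "_" + std::to_string(device_id) + ".so";
}

// Sketch of a process-level cache keyed by (fingerprint, device_id), so that
// two devices with the same fingerprint get separate entries.
struct BootstrapCache {
    using Key = std::pair<std::string, int>;
    std::map<Key, std::string> staged;  // key -> staged basename

    const std::string& get_or_stage(const std::string& fp, int device_id) {
        Key key(fp, device_id);
        auto it = staged.find(key);
        if (it == staged.end()) {
            // First request for this (fp, device_id): record its own name.
            it = staged.emplace(key, inner_so_basename(fp, device_id)).first;
        }
        return it->second;
    }
};
```

With a fingerprint-only key or name, devices 8 and 9 would map to the same entry and the same file; adding `device_id` to both is what separates them.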
Testing
- `tests/ut/cpp` `test_orch_so_file` updated for the new signature, plus a new `DistinctDeviceIdsProduceDistinctPaths` case; full no-hardware ctest passes.
- `st-onboard-a2a3` passes on the inner-SO fix.
- Investigation written up in `docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md`.
🤖 Generated with Claude Code