Fix: per-device staged-SO naming to stop paired-die AICPU device fault (507899 cascade) by ChaoWao · Pull Request #890 · hw-native-sys/simpler

ChaoWao · 2026-05-29T02:36:01Z

Symptom

st-onboard-a2a3 fails intermittently (~19% of runs after #870, vs ~10% before)
with a whole-suite collapse: the L2 tensormap_and_ringbuffer phase reports ~10
failed + ~23 errors at once —
simpler_init failed with code 507899, prepare_callable failed -1,
run_prepared failed with code 507018/507046/507901, rtMalloc failed: 507899.

It looks like an out-of-memory. It is not — npu-smi info taken right after
the failure shows every chip Health=OK with HBM ~3 GB / 64 GB used.

Root cause

507899 is [driver error:internal error] and 507901 is [hdc disconnect] —
a cascade after a device fault, not the cause. Surfacing the CANN device slog
(ASCEND_SLOG_PRINT_TO_STDOUT=1) shows the earliest error is an AICPU exception:

ProcessStarsAicpuErrorInfo: error from device(chipId:N, dieId:0/1),
                            an exception occurred during AICPU execution
PrintAicpuErrorInfo: Aicpu kernel execute failed,
                     soName=simpler_inner_<fp>.so, funcName=simpler_aicpu_exec,
                     errorCode=0x2a            # 507018 ACL_ERROR_RT_AICPU_EXCEPTION

simpler_aicpu_exec faults the whole chip; afterwards every rtStreamCreate /
rtMalloc on that chip returns 507899/507901, so the next test's simpler_init
fails and the suite collapses.
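
The triage rule above (skip the cascade codes, look for the earliest remaining error) can be sketched as a small helper. This is only an illustration: `ERROR_MEANING`, `CASCADE_CODES`, and `first_non_cascade` are hypothetical names, the table holds only the codes quoted in this description, and the input is assumed to be in chronological order as read from the device slog.

```python
from typing import Iterable, Optional

# Meanings as quoted in this PR description; not a complete CANN error table.
ERROR_MEANING = {
    507018: "AICPU exception (ACL_ERROR_RT_AICPU_EXCEPTION)",
    507899: "driver error: internal error",
    507901: "hdc disconnect",
}

# Codes this analysis treats as aftermath of an earlier device fault.
CASCADE_CODES = {507899, 507901}


def first_non_cascade(codes: Iterable[int]) -> Optional[int]:
    """Return the earliest code that is not a known cascade code, or None."""
    for code in codes:
        if code not in CASCADE_CODES:
            return code
    return None
```

If only cascade codes are present, the helper returns `None`, which is a hint that the real first error was logged somewhere else (here, in the device slog rather than the host output).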

Two facts pin the mechanism:

- On a2a3 the npu-smi Phy-IDs pair as die0/die1 of one Ascend910 (devices
  8/9, 4/5, …); the exception fires on both dies of one chip at the same
  instant, so a chip-shared resource was corrupted.
- The runtime stages SOs under the shared preinstall dir
  /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/. The dispatcher introduced
  in #870 (Feat: AICPU dispatcher bootstrap + cached AICore rtRegisterAllKernel
  handle (re-apply #537)) wrote the AICPU runtime SO there under a
  content-fingerprint-only name, simpler_inner_<fp>.so, identical across both
  dies. Paired dies share that filesystem, so both wrote/renamed/executed the
  same file; concurrent bootstrap corrupted the mmap'd image and trapped
  simpler_aicpu_exec.

A single-die 50× solo loop of the unregister/re-prepare/dedup tests never
reproduced; only the parallel multi-die suite did — consistent with a cross-die
shared-file race, not an intra-process use-after-free.
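
One way to check the "both dies of one chip at the same instant" observation against a list of faulted devices is sketched below. It assumes that Phy-IDs pair as (2k, 2k+1), which matches the 8/9 and 4/5 examples but is not confirmed for every a2a3 host; `chip_of` and `chips_with_both_dies_faulted` are hypothetical helpers, and the chip index they compute is not the chipId printed in the slog.

```python
from collections import defaultdict
from typing import Iterable, List


def chip_of(phy_id: int) -> int:
    # Assumption: dies pair as (2k, 2k+1), consistent with 8/9 and 4/5.
    return phy_id // 2


def chips_with_both_dies_faulted(faulted_phy_ids: Iterable[int]) -> List[int]:
    """Return the (hypothetical) chip indexes where die0 and die1 both faulted."""
    dies_by_chip = defaultdict(set)
    for phy_id in faulted_phy_ids:
        dies_by_chip[chip_of(phy_id)].add(phy_id % 2)
    return sorted(chip for chip, dies in dies_by_chip.items() if dies == {0, 1})
```

A chip that shows up in the result is a candidate for a chip-shared cause; a chip with only one faulted die is more likely an independent failure.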

Fix

Make every SO staged under the shared preinstall dir per-device:

- Dispatcher inner SO → simpler_inner_<fp>_<device_id>.so. The real
  device_id (was hardcoded 0) is threaded from DeviceRunner through
  BootstrapDispatcher into both the host JSON reader name (MakeInnerSoBasename)
  and the device-side writer (DeviceArgs.device_id → MakeInnerSoPath). The
  process-level bootstrap cache is keyed by (fp, device_id).
- AICPU executor orchestration SO → libdevice_orch_<pid>_<cid>_<device_id>.so.
  device_id reaches the AICPU via a new trailing KernelArgs.device_id field,
  pushed in kernel.cpp via set_orch_device_id() and read by the executor via
  get_orch_device_id() (platform_regs). This SO was already pid-named (mostly
  safe); the suffix is defense-in-depth for colliding device-side pids.

Applied symmetrically to a2a3 and a5 (onboard + sim signatures in lockstep).
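
The naming scheme and the (fp, device_id) cache key described above can be illustrated with a minimal Python sketch. It is not the repository's C++ code: the function names merely mirror MakeInnerSoBasename / MakeInnerSoPath, `BootstrapCache` and `make_orch_so_basename` are hypothetical, and the format of `fp`, `pid`, and `cid` is assumed to be whatever the real runtime already produces.

```python
import posixpath
from typing import Callable, Dict, Tuple

# Shared preinstall directory named in the root-cause section.
PREINSTALL_DIR = "/usr/lib64/aicpu_kernels/0/aicpu_kernels_device"


def make_inner_so_basename(fp: str, device_id: int) -> str:
    # Per-device name: the same fingerprint on two dies yields two files.
    return f"simpler_inner_{fp}_{device_id}.so"


def make_inner_so_path(fp: str, device_id: int) -> str:
    return posixpath.join(PREINSTALL_DIR, make_inner_so_basename(fp, device_id))


def make_orch_so_basename(pid: int, cid: int, device_id: int) -> str:
    return f"libdevice_orch_{pid}_{cid}_{device_id}.so"


class BootstrapCache:
    """Process-level cache keyed by (fp, device_id) instead of fp alone."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], object] = {}

    def get_or_bootstrap(self, fp: str, device_id: int,
                         bootstrap: Callable[[str], object]) -> object:
        key = (fp, device_id)
        if key not in self._entries:
            # bootstrap() stands in for the real staging step (hypothetical).
            self._entries[key] = bootstrap(make_inner_so_path(fp, device_id))
        return self._entries[key]
```

With this shape, two paired dies that load the same SO content never touch the same staged file, and a cache hit for one device cannot suppress the bootstrap for its sibling.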

Testing

tests/ut/cpp test_orch_so_file updated for the new signature + a new
DistinctDeviceIdsProduceDistinctPaths case; full no-hardware ctest passes.
Validated with 5/5 consecutive st-onboard-a2a3 passes on the inner-SO fix.
Investigation + diagnosis recipe written up in
docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md.

🤖 Generated with Claude Code

gemini-code-assist · 2026-05-29T02:36:06Z

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

coderabbitai · 2026-05-29T02:36:13Z

📝 Walkthrough

Walkthrough

This PR adds Ascend device-side logging environment variables to the st-onboard-a2a3 CI job, enabling slog output (including rtMalloc and OOM diagnostics) to be captured in CI logs for diagnostic visibility.

Changes

Ascend logging environment setup

Layer / File(s)	Summary
Ascend slog environment variables `.github/workflows/ci.yml`	Environment variables `ASCEND_SLOG_PRINT_TO_STDOUT` and `ASCEND_GLOBAL_LOG_LEVEL` are added to the `st-onboard-a2a3` CI job to surface device-side slog output in CI logs.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

A rabbit hops through CI logs with glee,
"Now slog shall flow for all to see!"
With Ascend's light and malloc's song,
Device diagnostics right along. 🐰📋

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The PR title describes fixing per-device SO naming to address a device fault, but the actual changeset only adds Ascend logging environment variables for diagnostic purposes in CI.	Update the title to reflect that this is a diagnostic-only change adding CANN logging to CI (e.g., 'DEBUG: dump CANN device slog to stdout on st-onboard-a2a3'), and explicitly note it should not be merged.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description is related to the changeset, explaining the root cause investigation and diagnostic goals that motivate adding logging to the CI job.


coderabbitai

🧹 Nitpick comments (1)

.github/workflows/ci.yml (1)

483-489: Consider adding a workflow guard to prevent accidental merge.

Since this is a diagnostic branch explicitly marked "do not merge", consider adding a step that fails if running on the main branch to provide an extra safeguard against accidental merge. However, given this is a temporary diagnostic branch and the PR review process should catch this, this is optional.

Optional: Example workflow guard

    steps:
      - name: Block merge to main (DEBUG branch only)
        if: github.ref == 'refs/heads/main'
        run: |
          echo "::error::This is a DEBUG branch and must not be merged to main"
          exit 1
      
      - name: Checkout repository
        uses: actions/checkout@v5

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 483 - 489, This workflow adds
debug-only environment vars (ASCEND_SLOG_PRINT_TO_STDOUT and
ASCEND_GLOBAL_LOG_LEVEL) but lacks a guard to prevent accidental merges to main;
add a CI step near the top of the job that checks github.ref ==
'refs/heads/main' and exits non‑zero with an error message if true so the job
fails on main, ensuring the diagnostic branch cannot run on main; reference the
env keys ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL in the step’s
message to make intent explicit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b636ee88-6924-41b8-93ad-1c627e41f623

📥 Commits

Reviewing files that changed from the base of the PR and between 1de6a90 and 7d1863d.

📒 Files selected for processing (1)

.github/workflows/ci.yml

st-onboard-a2a3 fails intermittently with a whole-suite collapse (507899/507018/prepare_callable -1). It is NOT OOM (npu-smi shows HBM ~3G/64G used): the earliest error is an AICPU exception in simpler_aicpu_exec (507018, errorCode 0x2a) that faults the whole chip, after which every rtStreamCreate/rtMalloc on it returns 507899 [driver error:internal error] / 507901 [hdc disconnect].

Root cause: on a2a3 the two dies of one chip (npu-smi Phy-IDs 8/9, 4/5, ...) share the preinstall filesystem, but the runtime staged SOs there under names that are identical across dies. Concurrent bootstrap/staging on the same file corrupts the mmap'd image and traps the AICPU kernel on both dies at once. A single-die 50x solo loop never reproduces; only the parallel multi-die suite does, which points to a cross-die shared-file race.

Fix: make every SO staged under /usr/lib64/aicpu_kernels/0/ per-device:

- Dispatcher inner SO: simpler_inner_<fp>_<device_id>.so. The real device_id (was hardcoded 0) is threaded from DeviceRunner through BootstrapDispatcher into the host JSON reader name (MakeInnerSoBasename) and the device-side writer (DeviceArgs.device_id -> MakeInnerSoPath). Bootstrap cache keyed by (fp, device_id).
- AICPU executor orch SO: libdevice_orch_<pid>_<cid>_<device_id>.so. device_id reaches the AICPU via a new trailing KernelArgs.device_id field, pushed in kernel.cpp through set_orch_device_id() and read by the executor via get_orch_device_id() (platform_regs).
- Update tests/ut/cpp test_orch_so_file for the new signature + add a distinct-device-id case.

Applied symmetrically to a2a3 and a5 (onboard + sim). Validated with 5/5 consecutive st-onboard-a2a3 passes; investigation written up in docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ace (#925)

PR #710 added a silent fallback to 0xDEADBEEF placeholder addresses in get_aicore_regs(AIC_CTRL) when halMemCtl rejected the query, on the claim that "the dispatch path does not actually dereference these addresses". The claim is wrong: platform_init_aicore_regs and platform_deinit_aicore_regs do raw MMIO writes/reads through these addresses (FAST_PATH_ENABLE, DATA_MAIN_BASE, COND), so any chip that hit the fallback dispatched its first AICore task to 0xDEADBEEF. AICore never reached FAST_PATH_OPEN, the AICPU stream hung, and the host surfaced ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after ~2 s. This matches the recurring `device_id=11` stream-sync timeout flake on st-onboard-a2a3 (e.g. 2026-05-29 sdma_async_completion_demo, the 2026-05-30 *_distributed[4] sequence).

The placeholder fix is symmetric across a2a3 + a5: propagate the HAL rc instead of synthesizing addresses. On a5, get_aicore_regs gains an int return type, because the prior `host_regs.empty()` guard never fired: get_aicore_reg_info pre-resizes the vector.

The upstream rc=13 (EACCES) on a2a3 has its own root cause: when 4 chip_processes for the same a2a3 runner fork concurrently (4-device distributed cases), a narrow driver-side serialization window for halMemCtl(AIC_CTRL) drops one request, empirically always dev=11. A short bounded retry (50 ms × 3) absorbs the race window without masking permanent failure modes. Only on a2a3; a5 uses halResMap and has not exhibited the symptom.

PR #890's per-device staged-SO naming fix addressed a different 507046 path (paired-die simpler_inner_<fp>.so file race producing a simpler_aicpu_exec 0x2a cascade). The two bugs share the surface code but are independent; both need to be present to keep the flake rate down to zero.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
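
The bounded retry described in that commit message could look roughly like the Python sketch below. It is an illustration only: `query_with_retry` and its parameters are hypothetical, `query` stands in for the halMemCtl(AIC_CTRL) call, and "50 ms × 3" is read here as three attempts in total, which the message does not spell out.

```python
import time
from typing import Callable

EACCES = 13  # rc value quoted in the commit message


def query_with_retry(query: Callable[[], int],
                     attempts: int = 3,
                     delay_s: float = 0.05,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call query() up to `attempts` times, retrying only on EACCES.

    The last return code is propagated to the caller unchanged, so a
    permanent failure is still visible instead of being papered over.
    """
    rc = query()
    for _ in range(attempts - 1):
        if rc != EACCES:
            break
        sleep(delay_s)
        rc = query()
    return rc
```

Retrying only on EACCES and returning the final rc keeps the two halves of the fix aligned: the transient race is absorbed, while any other error (or a persistent EACCES) reaches the caller rather than being replaced with placeholder addresses.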

coderabbitai Bot reviewed May 29, 2026


ChaoWao changed the title ~~DEBUG: dump CANN device slog to stdout on st-onboard-a2a3 (do not merge)~~ Fix: per-device simpler_inner_<fp>_<device_id>.so — stop paired-die shared-file AICPU fault (507899 cascade) May 29, 2026

ChaoWao force-pushed the debug/a2a3-ascend-slog-stdout branch from 743cdb3 to 1b08643 Compare May 29, 2026 06:36

ChaoWao force-pushed the debug/a2a3-ascend-slog-stdout branch from 1b08643 to 1cdaed8 Compare May 29, 2026 06:50

ChaoWao changed the title ~~Fix: per-device simpler_inner_<fp>_<device_id>.so — stop paired-die shared-file AICPU fault (507899 cascade)~~ Fix: per-device staged-SO naming to stop paired-die AICPU device fault (507899 cascade) May 29, 2026

ChaoWao merged commit 204a384 into hw-native-sys:main May 29, 2026
28 of 31 checks passed

ChaoWao deleted the debug/a2a3-ascend-slog-stdout branch May 29, 2026 07:19

hw-native-sys-bot mentioned this pull request May 30, 2026

Fix: drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES race #925

Merged

3 tasks

ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:debug/a2a3-ascend-slog-stdout
