Fix: per-device staged-SO naming to stop paired-die AICPU device fault (507899 cascade) #890

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:debug/a2a3-ascend-slog-stdout
May 29, 2026
@ChaoWao ChaoWao commented May 29, 2026

Symptom

st-onboard-a2a3 fails intermittently (~19% of runs after #870, vs ~10% before)
with a whole-suite collapse: the L2 tensormap_and_ringbuffer phase reports ~10
failed + ~23 errors at once —
simpler_init failed with code 507899, prepare_callable failed -1,
run_prepared failed with code 507018/507046/507901, rtMalloc failed: 507899.

It looks like an out-of-memory condition, but it is not: npu-smi info taken right after
the failure shows every chip Health=OK with HBM ~3 GB / 64 GB used.

Root cause

507899 is [driver error:internal error] and 507901 is [hdc disconnect]:
both are a cascade after a device fault, not the cause. Surfacing the CANN device slog
(ASCEND_SLOG_PRINT_TO_STDOUT=1) shows the earliest error is an AICPU exception:

ProcessStarsAicpuErrorInfo: error from device(chipId:N, dieId:0/1),
                            an exception occurred during AICPU execution
PrintAicpuErrorInfo: Aicpu kernel execute failed,
                     soName=simpler_inner_<fp>.so, funcName=simpler_aicpu_exec,
                     errorCode=0x2a            # 507018 ACL_ERROR_RT_AICPU_EXCEPTION

simpler_aicpu_exec faults the whole chip; afterwards every rtStreamCreate /
rtMalloc on that chip returns 507899/507901, so the next test's simpler_init
fails and the suite collapses.
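
The triage step above (find the earliest AICPU failure, treat 507899/507901 as downstream noise) can be sketched as a small log scanner. This is a minimal illustration, not part of the PR: the function names are hypothetical, the regular expression assumes the slog record layout quoted above, and the set of cascade codes contains only the two codes this description names.

```python
import re

# Codes this PR describes as post-fault cascade symptoms (not exhaustive).
CASCADE_CODES = {507899, 507901}

# Matches the "Aicpu kernel execute failed" record once whitespace is collapsed.
AICPU_FAIL = re.compile(
    r"Aicpu kernel execute failed,\s*soName=(?P<so>\S+?),\s*"
    r"funcName=(?P<func>\S+?),\s*errorCode=(?P<code>0x[0-9a-fA-F]+)"
)


def is_cascade_code(code):
    """True if the code is one of the downstream symptoms named in this PR."""
    return code in CASCADE_CODES


def first_aicpu_failure(log_text):
    """Return (so_name, func_name, error_code) for the earliest AICPU failure.

    Returns None when the log contains no such record.
    """
    # The record wraps across lines in the slog, so collapse whitespace first.
    flat = " ".join(log_text.split())
    match = AICPU_FAIL.search(flat)
    if match is None:
        return None
    return match.group("so"), match.group("func"), int(match.group("code"), 16)
```

Running `first_aicpu_failure` over the captured stdout of a failed run, before reading any 507899 line, points at the SO and function that actually trapped.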

Two facts pin the mechanism:

  1. On a2a3 the npu-smi Phy-IDs pair as die0/die1 of one Ascend910 (devices
    8/9, 4/5, …); the exception fires on both dies of one chip at the same
    instant, which indicates that a chip-shared resource was corrupted.
  2. The runtime stages SOs under the shared preinstall dir
    /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/. The #870 dispatcher (Feat: AICPU dispatcher bootstrap + cached AICore rtRegisterAllKernel handle, re-apply #537) wrote
    the AICPU runtime SO there under a content-fingerprint-only name
    simpler_inner_<fp>.so — identical across both dies. Paired dies share that
    filesystem, so both wrote/renamed/executed the same file; concurrent bootstrap
    corrupted the mmap'd image and trapped simpler_aicpu_exec.

A single-die 50× solo loop of the unregister/re-prepare/dedup tests never
reproduced; only the parallel multi-die suite did — consistent with a cross-die
shared-file race, not an intra-process use-after-free.
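
The paired-die observation can be checked mechanically against a captured slog. The sketch below is an assumption-laden illustration: it relies only on the `error from device(chipId:N, dieId:D)` text quoted earlier, the function name is hypothetical, and it does not compare timestamps, so "same instant" still has to be confirmed by eye.

```python
import re
from collections import defaultdict

# Matches the device locator in the ProcessStarsAicpuErrorInfo record.
DEVICE_ERR = re.compile(r"error from device\(chipId:(\d+),\s*dieId:(\d+)\)")


def chips_with_both_dies_faulted(log_text):
    """Return the sorted chip IDs for which both die 0 and die 1 reported an error."""
    dies_by_chip = defaultdict(set)
    for chip, die in DEVICE_ERR.findall(log_text):
        dies_by_chip[int(chip)].add(int(die))
    # A chip counts only when both dies of the pair appear in the log.
    return sorted(chip for chip, dies in dies_by_chip.items() if {0, 1} <= dies)
```

A non-empty result is consistent with a chip-shared resource being corrupted; an empty result suggests looking for a per-die cause instead.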

Fix

Make every SO staged under the shared preinstall dir per-device:

  • Dispatcher inner SO: simpler_inner_<fp>_<device_id>.so. The real
    device_id (was hardcoded 0) is threaded from DeviceRunner through
    BootstrapDispatcher into both the host JSON reader name (MakeInnerSoBasename)
    and the device-side writer (DeviceArgs.device_id -> MakeInnerSoPath). The
    process-level bootstrap cache is keyed by (fp, device_id).
  • AICPU executor orchestration SO: libdevice_orch_<pid>_<cid>_<device_id>.so.
    device_id reaches the AICPU via a new trailing KernelArgs.device_id field,
    pushed in kernel.cpp via set_orch_device_id() and read by the executor via
    get_orch_device_id() (platform_regs). This SO was already pid-named (mostly
    safe); the suffix is defense-in-depth for colliding device-side pids.
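
The naming scheme and cache key described in the two bullets above can be sketched as follows. This is an illustrative model only: the real helpers (MakeInnerSoBasename, MakeInnerSoPath) live in the runtime and their signatures are not shown in this PR, so the Python names, parameters, and the `BootstrapCache` class here are hypothetical.

```python
def make_inner_so_basename(fp, device_id):
    """Per-device dispatcher inner SO name: simpler_inner_<fp>_<device_id>.so."""
    return f"simpler_inner_{fp}_{device_id}.so"


def make_orch_so_name(pid, cid, device_id):
    """Per-device orchestration SO name: libdevice_orch_<pid>_<cid>_<device_id>.so."""
    return f"libdevice_orch_{pid}_{cid}_{device_id}.so"


class BootstrapCache:
    """Process-level cache keyed by (fp, device_id) rather than fp alone."""

    def __init__(self):
        self._entries = {}

    def get_or_create(self, fp, device_id, factory):
        # Two dies with the same fingerprint get separate entries,
        # so neither reuses a file staged for the other.
        key = (fp, device_id)
        if key not in self._entries:
            self._entries[key] = factory()
        return self._entries[key]
```

With a fingerprint-only key or name, devices 8 and 9 would resolve to the same staged file; adding `device_id` to both the name and the cache key keeps the paired dies apart.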

Applied symmetrically to a2a3 and a5 (onboard + sim signatures in lockstep).

Testing

  • tests/ut/cpp test_orch_so_file updated for the new signature + a new
    DistinctDeviceIdsProduceDistinctPaths case; full no-hardware ctest passes.
  • Validated with 5/5 consecutive st-onboard-a2a3 passes on the inner-SO fix.
  • Investigation + diagnosis recipe written up in
    docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md.

🤖 Generated with Claude Code

@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@coderabbitai

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

This PR adds Ascend device-side logging environment variables to the st-onboard-a2a3 CI job, enabling slog output (including rtMalloc and OOM diagnostics) to be captured in CI logs for diagnostic visibility.

Changes

Ascend logging environment setup

| Layer / File(s) | Summary |
| --- | --- |
| Ascend slog environment variables (.github/workflows/ci.yml) | Environment variables ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL are added to the st-onboard-a2a3 CI job to surface device-side slog output in CI logs. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

A rabbit hops through CI logs with glee,
"Now slog shall flow for all to see!"
With Ascend's light and malloc's song,
Device diagnostics right along. 🐰📋

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title describes fixing per-device SO naming to address a device fault, but the actual changeset only adds Ascend logging environment variables for diagnostic purposes in CI. | Update the title to reflect that this is a diagnostic-only change adding CANN logging to CI (e.g., 'DEBUG: dump CANN device slog to stdout on st-onboard-a2a3'), and explicitly note it should not be merged. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description is related to the changeset, explaining the root cause investigation and diagnostic goals that motivate adding logging to the CI job. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

483-489: Consider adding a workflow guard to prevent accidental merge.

Since this is a diagnostic branch explicitly marked "do not merge", consider adding a step that fails if running on the main branch to provide an extra safeguard against accidental merge. However, given this is a temporary diagnostic branch and the PR review process should catch this, this is optional.

Optional: Example workflow guard
    steps:
      - name: Block merge to main (DEBUG branch only)
        if: github.ref == 'refs/heads/main'
        run: |
          echo "::error::This is a DEBUG branch and must not be merged to main"
          exit 1
      
      - name: Checkout repository
        uses: actions/checkout@v5
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 483 - 489, This workflow adds
debug-only environment vars (ASCEND_SLOG_PRINT_TO_STDOUT and
ASCEND_GLOBAL_LOG_LEVEL) but lacks a guard to prevent accidental merges to main;
add a CI step near the top of the job that checks github.ref ==
'refs/heads/main' and exits non‑zero with an error message if true so the job
fails on main, ensuring the diagnostic branch cannot run on main; reference the
env keys ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL in the step’s
message to make intent explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 483-489: This workflow adds debug-only environment vars
(ASCEND_SLOG_PRINT_TO_STDOUT and ASCEND_GLOBAL_LOG_LEVEL) but lacks a guard to
prevent accidental merges to main; add a CI step near the top of the job that
checks github.ref == 'refs/heads/main' and exits non‑zero with an error message
if true so the job fails on main, ensuring the diagnostic branch cannot run on
main; reference the env keys ASCEND_SLOG_PRINT_TO_STDOUT and
ASCEND_GLOBAL_LOG_LEVEL in the step’s message to make intent explicit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b636ee88-6924-41b8-93ad-1c627e41f623

📥 Commits

Reviewing files that changed from the base of the PR and between 1de6a90 and 7d1863d.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml

@ChaoWao ChaoWao changed the title DEBUG: dump CANN device slog to stdout on st-onboard-a2a3 (do not merge) Fix: per-device simpler_inner_<fp>_<device_id>.so — stop paired-die shared-file AICPU fault (507899 cascade) May 29, 2026
@ChaoWao ChaoWao force-pushed the debug/a2a3-ascend-slog-stdout branch from 743cdb3 to 1b08643 Compare May 29, 2026 06:36
st-onboard-a2a3 fails intermittently with a whole-suite collapse
(507899/507018/prepare_callable -1). It is NOT OOM (npu-smi shows HBM
~3G/64G used): the earliest error is an AICPU exception in simpler_aicpu_exec
(507018, errorCode 0x2a) that faults the whole chip, after which every
rtStreamCreate/rtMalloc on it returns 507899 [driver error:internal error] /
507901 [hdc disconnect].

Root cause: on a2a3 the two dies of one chip (npu-smi Phy-IDs 8/9, 4/5, ...)
share the preinstall filesystem, but the runtime staged SOs there under names
that are identical across dies. Concurrent bootstrap/staging on the same file
corrupts the mmap'd image and traps the AICPU kernel on both dies at once. A
single-die 50x solo loop never reproduces; only the parallel multi-die suite
does — a cross-die shared-file race.

Fix — make every SO staged under /usr/lib64/aicpu_kernels/0/ per-device:
- Dispatcher inner SO: simpler_inner_<fp>_<device_id>.so. The real device_id
  (was hardcoded 0) is threaded from DeviceRunner through BootstrapDispatcher
  into the host JSON reader name (MakeInnerSoBasename) and the device-side
  writer (DeviceArgs.device_id -> MakeInnerSoPath). Bootstrap cache keyed by
  (fp, device_id).
- AICPU executor orch SO: libdevice_orch_<pid>_<cid>_<device_id>.so. device_id
  reaches the AICPU via a new trailing KernelArgs.device_id field, pushed in
  kernel.cpp through set_orch_device_id() and read by the executor via
  get_orch_device_id() (platform_regs).
- Update tests/ut/cpp test_orch_so_file for the new signature + add a
  distinct-device-id case.

Applied symmetrically to a2a3 and a5 (onboard + sim). Validated with 5/5
consecutive st-onboard-a2a3 passes; investigation written up in
docs/troubleshooting/a2a3-507899-aicpu-shared-so-fault.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the debug/a2a3-ascend-slog-stdout branch from 1b08643 to 1cdaed8 Compare May 29, 2026 06:50
@ChaoWao ChaoWao changed the title Fix: per-device simpler_inner_<fp>_<device_id>.so — stop paired-die shared-file AICPU fault (507899 cascade) Fix: per-device staged-SO naming to stop paired-die AICPU device fault (507899 cascade) May 29, 2026
@ChaoWao ChaoWao merged commit 204a384 into hw-native-sys:main May 29, 2026
28 of 31 checks passed
@ChaoWao ChaoWao deleted the debug/a2a3-ascend-slog-stdout branch May 29, 2026 07:19
ChaoWao added a commit that referenced this pull request May 31, 2026
…ace (#925)

PR #710 added a silent fallback to 0xDEADBEEF placeholder addresses in
get_aicore_regs(AIC_CTRL) when halMemCtl rejected the query, on the
claim that "the dispatch path does not actually dereference these
addresses". The claim is wrong: platform_init_aicore_regs and
platform_deinit_aicore_regs do raw MMIO writes/reads through these
addresses (FAST_PATH_ENABLE, DATA_MAIN_BASE, COND), so any chip that
hit the fallback dispatched its first AICore task to 0xDEADBEEF —
AICore never reached FAST_PATH_OPEN, the AICPU stream hung, and the
host surfaced ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after ~2 s.
This matches the recurring `device_id=11` stream-sync timeout flake
on st-onboard-a2a3 (e.g. 2026-05-29 sdma_async_completion_demo, the
2026-05-30 *_distributed[4] sequence).

The placeholder fix is symmetric across a2a3 + a5: propagate the HAL
rc instead of synthesizing addresses. On a5, get_aicore_regs gains an
int return type — the prior `host_regs.empty()` guard never fired
because get_aicore_reg_info pre-resizes the vector.

The upstream rc=13 (EACCES) on a2a3 has its own root cause: when 4
chip_processes for the same a2a3 runner fork concurrently (4-device
distributed cases), a narrow driver-side serialization window for
halMemCtl(AIC_CTRL) drops one request — empirically always dev=11.
A short bounded retry (50 ms × 3) absorbs the race window without
masking permanent failure modes. Only on a2a3; a5 uses halResMap and
has not exhibited the symptom.
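
A bounded retry of this shape can be sketched as below. This is a hedged illustration, not the driver code: `query` stands in for the halMemCtl(AIC_CTRL) call and is assumed to return an `(rc, value)` pair, the helper name is hypothetical, and "50 ms × 3" is read here as up to three retries after the first attempt. The final rc is returned to the caller unchanged, with no placeholder value substituted.

```python
import time


def query_with_bounded_retry(query, retries=3, delay_s=0.05, sleep=time.sleep):
    """Call query() and retry up to `retries` times while it returns a non-zero rc.

    Returns the last (rc, value) pair so a permanent failure still reaches the caller.
    """
    rc, value = query()
    for _ in range(retries):
        if rc == 0:
            break
        # Short fixed pause to step past the transient serialization window.
        sleep(delay_s)
        rc, value = query()
    return rc, value
```

The `sleep` parameter exists only so tests can inject a no-op; callers would normally leave it at its default.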

PR #890's per-device staged-SO naming fix addressed a different
507046 path (paired-die simpler_inner_<fp>.so file race producing a
simpler_aicpu_exec 0x2a cascade). The two bugs share the surface
code but are independent; both fixes need to be present to keep the flake
rate down to zero.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>