Add: hardware docs and CANN query tools #883

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:feat/hardware-docs-and-cann-query-tool
May 30, 2026

@hw-native-sys-bot hw-native-sys-bot commented May 28, 2026

Bundles seven related threads that together answer "how do I reason
about Ascend hardware in this repo and not get burned by a wrong
--platform invocation."

  1. New docs/hardware/ cross-chip tree:

    • chip-architecture.md: Host CPU + DDR attached via PCIe (x86) or
      UB / HCCS (Kunpeng), on-chip AICPU + AICore clusters, GM,
      end-to-end task flow, off-chip vs on-chip cost model.
    • SoC family <-> arch mapping table cites vllm-ascend FAQ #21
      (Atlas A2 = ascend910b1, Atlas A3 = ascend910_9391) plus the
      toolchain.py dav-c220 / dav-c310 bridge as authoritative
      sources.
    • cache-coherency.md moved from src/a2a3/docs/ and generalized:
      dcci + cache_invalidate_range live on both a2a3 and a5. Inbound
      refs in src/a2a3/docs/platform.md and the AICPU L2 perf
      collector comment updated.
  2. Per-chip facts in src/{a2a3,a5}/docs/hardware.md:

    • a2a3: a2 vs a3 packaging (single die vs dual die / 2 device IDs
      sharing AICPU OS), per-die 24 AIC + 48 AIV, 64 GiB HBM,
      UB 1.0 / HCCS on Kunpeng.
    • a5: 2 dies as 1 device id, per-die layout, UB 2.0 on Kunpeng.
    • Three views of "how many cores" section: spec view (delivered
      to user code) vs HAL view vs CANN ini view, with the observed
      discrepancy resolved by the device-side probe in thread 5.
      a3 closure: cpu_id 0 = AICPU OS scheduler, cpu_id 1 = PG
      fab-disabled, cpu_id 2..7 = 6 user-schedulable. a5 same
      pattern is calibrated inference pending its own probe run.
  3. Rules reorg under .claude/rules/:

    • architecture.md + ascend-device.md merged then split by
      audience: ascend.md (HW + SW arch quick reference + AIC / AIV /
      AICPU terminology) and project-layout.md (Python wheel split,
      build system lookup, test layout).
    • Inbound refs in docs/python-packaging.md and
      review-pr/SKILL.md updated.
  4. tools/cann-examples/query/ — host-side CLI:

    • Subcommands: devices, device <id> (full per-device dump:
      identification + cores + memory hierarchy with per-field
      comments), mem <id>, version (compiler/version.info: toolkit
      version, not aclrtGetVersion's runtime lib version).
    • Compile-time link to ascendcl + runtime + ascend_hal +
      drvdsmi_host (no dlopen). ASCEND_DRIVER_PATH defaults to the
      sibling of ASCEND_HOME_PATH, override via cmake -D.
    • Buffer sizes (UB / L1 / L0A/B/C) read from CANN platform_config
      ini because the matching ACL device-attribute queries return 0
      on CANN 9.0 / a3.
  5. tools/cann-examples/aicpu-device-query/ — device-side HAL probe:

    • halGetDeviceInfo has queries flagged "used in device" in the
      header (notably AICPU + OS_SCHED, AICPU + PF_OCCUPY) that only
      succeed when called from inside an AICPU OS process. This tool
      uploads a small inner SO via the dispatcher bootstrap path
      (rtAicpuKernelLaunchExWithArgs with libaicpu_extend_kernels.so
      in Mode A, no sudo / no pre-deployment), runs the queries
      device-side, and reads results back through GM.
    • Closes the long-standing a3 question of whether the 8 -> 6
      AICPU gap is OS-reservation or PG: device-side OS_SCHED = 0x1
      proves cpu_id 0 is OS-owned (single bit), and the absence of
      cpu_id 1 from every other CPU module's OCCUPY mask plus
      not-in-vNPU-mode rules out virtualization remapping. The gap
      is therefore 1 OS + 1 PG, not 2 OS.
    • Tool README documents how to run it on a5 to close the
      analogous question there.
  6. .claude/skills/onboard-arch-precheck/ — wrong-arch gate:

    • Refuses pytest / task-submit invocations with
      --platform a2a3|a5 when the host's actual silicon is the other
      family, before any device lock is acquired. CI is fine because
      each onboard runner is labeled with its arch; local hardware
      work bypasses that protection. Wrong-arch runs produce
      507018 / 507899 cascades that LOOK LIKE genuine bugs and
      routinely waste hours on phantom investigations.
    • Detection reads the same source as the query tool: npu-smi for
      Chip Name + NPU Name, then
      $ASCEND_HOME_PATH/{arch}-linux/data/platform_config/<SoC>.ini
      for Short_SoC_version, then maps to repo arch. No ACL init,
      no device binding, ~600 ms cold and ~5 ms cached
      (/tmp/onboard-arch-precheck.cache, 1 hr TTL). Sim variants
      (a2a3sim, a5sim) pass through unconditionally.
    • .claude/rules/task-submit-isolation.md links to the skill from
      its pre-flight section and adds bypass-the-precheck to the
      anti-patterns list.
  7. CI integration in .github/workflows/ci.yml:

    • ut-a2a3 and ut-a5 jobs build tools/cann-examples/query and run
      query version (no device locked, no resource-spec conflict).
    • Same jobs build tools/cann-examples/aicpu-device-query
      (cross-compiled device SO + native host) as a link smoke test.
    • docs/ci.md job table updated; tools/README.md updated.
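
The wrong-arch gate from item 6 can be sketched roughly as follows. This is a Python illustration, not the contents of check.sh: `cached_arch`, `precheck`, and the JSON cache layout are hypothetical names chosen for the sketch, and the real detection step (npu-smi plus the platform_config ini) is stood in for by a caller-supplied `detect` function. Only the sim pass-through, the refusal on a family mismatch, and the 1 hr TTL are taken from the description above.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

REAL_ARCHES = {"a2a3", "a5"}          # hardware platforms named in the description
SIM_PLATFORMS = {"a2a3sim", "a5sim"}  # simulator variants pass through unchecked
CACHE_TTL_S = 3600                    # 1 hr TTL, as stated for the precheck cache


def cached_arch(cache: Path, detect: Callable[[], str],
                now: Optional[float] = None) -> str:
    """Return the host arch, calling `detect` only when the cache is stale."""
    now = time.time() if now is None else now
    try:
        entry = json.loads(cache.read_text())
        if now - entry["ts"] < CACHE_TTL_S and entry["arch"] in REAL_ARCHES:
            return entry["arch"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or corrupt cache: fall through to a fresh detection
    arch = detect()
    # Hypothetical cache format: one JSON object with the arch and a timestamp.
    cache.write_text(json.dumps({"arch": arch, "ts": now}))
    return arch


def precheck(platform: str, host_arch: Callable[[], str]) -> None:
    """Raise before any device work if --platform names the other family."""
    if platform in SIM_PLATFORMS:
        return  # simulators never touch real silicon
    if platform not in REAL_ARCHES:
        raise ValueError(f"unknown platform {platform!r}")
    actual = host_arch()
    if actual != platform:
        raise RuntimeError(
            f"--platform {platform} requested but host silicon is {actual}"
        )
```

A caller would wire the two together with something like `precheck(args.platform, lambda: cached_arch(cache_path, detect))`, so the refusal happens before any device lock is taken.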

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@hw-native-sys-bot, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a06c22a-71e9-4f4d-8718-1fdb5efd31ad

📥 Commits

Reviewing files that changed from the base of the PR and between d48fe39 and 2f28986.

📒 Files selected for processing (24)
  • .claude/rules/ascend-device.md
  • .claude/rules/ascend.md
  • .claude/rules/project-layout.md
  • .claude/rules/task-submit-isolation.md
  • .claude/skills/onboard-arch-precheck/SKILL.md
  • .claude/skills/onboard-arch-precheck/check.sh
  • .claude/skills/review-pr/SKILL.md
  • .github/workflows/ci.yml
  • docs/ci.md
  • docs/hardware/cache-coherency.md
  • docs/hardware/chip-architecture.md
  • docs/python-packaging.md
  • src/a2a3/docs/hardware.md
  • src/a2a3/docs/platform.md
  • src/a5/docs/hardware.md
  • src/a5/docs/platform.md
  • tools/README.md
  • tools/cann-examples/aicpu-device-query/README.md
  • tools/cann-examples/aicpu-device-query/device/CMakeLists.txt
  • tools/cann-examples/aicpu-device-query/device/aicpu_query.cpp
  • tools/cann-examples/aicpu-device-query/host/CMakeLists.txt
  • tools/cann-examples/aicpu-device-query/host/query_device_hal.cpp
  • tools/cann-examples/query/CMakeLists.txt
  • tools/cann-examples/query/query.cpp


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request reorganizes and expands the documentation regarding the Ascend NPU architecture (covering both a2a3 and a5 generations), cache coherency, and project layout. It also introduces a standalone host-side device-info CLI tool (query) under tools/cann-examples/query to query device counts, SoC names, core counts, and HBM memory info using CANN ACL APIs, integrating its build and execution into the CI workflow. The review feedback highlights two important improvements: ensuring that aclrtResetDevice is reliably called in cmd_mem even if aclrtGetMemInfo fails to prevent thread-state pollution, and wrapping ASCEND_HOME_PATH references in double quotes within the CMake configuration to safely handle paths containing spaces.

Comment thread tools/cann-examples/query/query.cpp
Comment thread tools/cann-examples/query/CMakeLists.txt
@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/hardware-docs-and-cann-query-tool branch from 7b637a4 to e509bae Compare May 30, 2026 08:19
@hw-native-sys-bot hw-native-sys-bot changed the title Add: hardware docs and CANN query tool Add: hardware docs and CANN query tools May 30, 2026
@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/hardware-docs-and-cann-query-tool branch from e509bae to ba5e9df Compare May 30, 2026 08:31
@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/hardware-docs-and-cann-query-tool branch from ba5e9df to 2f28986 Compare May 30, 2026 08:47
@ChaoWao ChaoWao merged commit a51429a into hw-native-sys:main May 30, 2026
29 of 31 checks passed
@ChaoWao ChaoWao deleted the feat/hardware-docs-and-cann-query-tool branch May 30, 2026 09:19
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 30, 2026
Smallest possible end-to-end demonstration of the AICPU kernel launch
pipeline used by this repo's runtime — no scene-test plumbing, no
ringbuffer / tensormap, no ChipWorker fork. Strips PR hw-native-sys#883's
aicpu-device-query down to the bootstrap path itself with a trivial
inner kernel so a reader who wants to add new AICPU work can use it as
a copy-paste template.

Pipeline (Method 2 / "Path A" — see docs/aicpu-kernel-launch-mechanisms.md):
  1. host  : rtAicpuKernelLaunchExWithArgs(KERNEL_TYPE_AICPU_KFC,
             libaicpu_extend_kernels.so) hands dispatcher SO + inner SO
             bytes to the AICPU OS process.
  2. device: dispatcher (DynTileFwkBackendKernelServerInit) writes
             inner SO to /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/
             simpler_inner_<fp>_<dev>.so.
  3. host  : fingerprint inner SO by ELF Build-ID, emit a JSON descriptor
             pointing at the preinstall basename, register via
             rtsBinaryLoadFromFile.
  4. host  : rtsFuncGetByName for init + run handles.
  5. host  : rewrite DeviceArgs (offsets 96 / 104) to point at the
             HelloResult + input_token, rtsLaunchCpuKernel(run).
  6. device: kernel logs via DlogRecord, calls halGetDeviceInfo, writes
             HelloResult{ magic, echoed_token, hal_rc, hal_value }.
  7. host  : D2H + verify magic == 0xDEADBEEFC0FFEE01 + echoed token.
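
The host-side check in step 7 can be sketched as follows, in Python for brevity rather than the C++ of launch_hello.cpp. The field names and the magic value come from steps 6 and 7. The binary layout (four little-endian 64-bit fields, in the listed order) is an assumption made for illustration, and `verify_hello_result` is a hypothetical helper name.

```python
import struct

EXPECTED_MAGIC = 0xDEADBEEFC0FFEE01  # value quoted in step 7

# ASSUMED layout: magic, echoed_token as unsigned 64-bit; hal_rc, hal_value
# as signed 64-bit. The real struct in hello_aicpu.cpp may differ.
HELLO_RESULT = struct.Struct("<QQqq")


def verify_hello_result(raw: bytes, input_token: int) -> dict:
    """Decode a HelloResult copied back from GM and check magic + token."""
    if len(raw) < HELLO_RESULT.size:
        raise ValueError("buffer shorter than HelloResult")
    magic, echoed, hal_rc, hal_value = HELLO_RESULT.unpack_from(raw)
    if magic != EXPECTED_MAGIC:
        raise ValueError(f"bad magic {magic:#x}: kernel did not write a result")
    if echoed != input_token:
        raise ValueError(f"token mismatch: sent {input_token}, got {echoed}")
    return {"hal_rc": hal_rc, "hal_value": hal_value}
```

Checking the echoed token as well as the magic guards against reading a stale result left in GM by an earlier launch.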

Files:
- docs/aicpu-kernel-launch-mechanisms.md : NEW canonical doc covering
  all three known methods of getting a custom AICPU SO onto the device
  (tar.gz pre-deployment, Path A dispatcher bootstrap, broken Path B
  KERNEL_TYPE_AICPU_CUSTOM). Records issue hw-native-sys#822's full failure
  forensics: cust-subprocess L1 stale on AICore HBM writes, the four
  user-space workarounds that all fail (volatile / ldar / dc civac /
  dc ivac) with the architectural reason each fails, and the CANN-side
  fix options (A/B/C/D). Sedimentation of the PR hw-native-sys#537 debugging
  session so future readers don't re-derive any of it.
- tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp :
  simpler_aicpu_init no-op + simpler_aicpu_run reads DeviceArgs, calls
  halGetDeviceInfo(AICPU, CORE_NUM), writes HelloResult to GM.
- tools/cann-examples/aicpu-kernel-launch/device/CMakeLists.txt :
  aarch64 cross, links ascend_hal, emits --build-id for fingerprint
  stability.
- tools/cann-examples/aicpu-kernel-launch/host/launch_hello.cpp :
  full Mode A bootstrap + JSON descriptor + binary load + launch +
  verify. Inlined ELF Build-ID reader so the example is standalone
  (no headers from src/).
- tools/cann-examples/aicpu-kernel-launch/host/CMakeLists.txt :
  links ascendcl + runtime.
- tools/cann-examples/aicpu-kernel-launch/README.md : pipeline
  diagram, I/O contract, build/run, scope+limits; references the
  mechanisms doc for the full comparison.

Cross-links updated:
- src/common/aicpu_dispatcher/README.md and
- tools/cann-examples/aicpu-device-query/README.md
  both now point at the new mechanisms doc so anyone reading the
  dispatcher or the device-query tool can find the full Path A vs
  Path B vs tar.gz comparison.

Integration:
- ut-a2a3 + ut-a5 CI jobs build the device SO (cross-compile) + host
  launcher as a link smoke test, mirroring the aicpu-device-query
  precedent. No device locked, no resource-spec conflict.
- tools/README.md adds an "aicpu-kernel-launch" subsection.
- docs/ci.md job table updated.

Why bundle as one PR: the reference tool and the mechanisms doc
exist to answer the same question ("how do I launch a custom AICPU
kernel"). Splitting them would mean the doc points at a tool that
doesn't exist yet, or the tool points at a doc that doesn't exist
yet. Reviewable as one logical unit.
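
The Build-ID fingerprint used in step 3 relies on the GNU build-id note that the linker's --build-id option emits. A minimal sketch of such a reader is shown below in Python, assuming a little-endian ELF64 image (consistent with the aarch64 device SO) whose note is reachable through a PT_NOTE program header. The function names are hypothetical and this is not the reader inlined in launch_hello.cpp.

```python
import struct
from typing import Optional

PT_NOTE = 4          # program-header type for note segments
NT_GNU_BUILD_ID = 3  # note type the GNU toolchain uses for build IDs


def _align4(n: int) -> int:
    """Round up to the 4-byte alignment used inside note segments."""
    return (n + 3) & ~3


def build_id_from_notes(notes: bytes) -> Optional[bytes]:
    """Scan one note segment and return the GNU build-id payload, if any."""
    off = 0
    while off + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from("<III", notes, off)
        name_start = off + 12
        desc_start = name_start + _align4(namesz)
        desc_end = desc_start + descsz
        if desc_end > len(notes):
            break  # truncated note: stop rather than read past the segment
        name = notes[name_start:name_start + namesz]
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return notes[desc_start:desc_end]
        off = desc_start + _align4(descsz)
    return None


def elf_build_id(image: bytes) -> Optional[str]:
    """Return the hex build-id of a little-endian ELF64 image, or None."""
    if len(image) < 64 or image[:4] != b"\x7fELF":
        return None
    if image[4] != 2 or image[5] != 1:  # ELFCLASS64 + ELFDATA2LSB only
        return None
    (e_phoff,) = struct.unpack_from("<Q", image, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 0x36)
    for i in range(e_phnum):
        ph = e_phoff + i * e_phentsize
        if ph + 56 > len(image):
            break
        (p_type,) = struct.unpack_from("<I", image, ph)
        if p_type != PT_NOTE:
            continue
        (p_offset,) = struct.unpack_from("<Q", image, ph + 8)
        (p_filesz,) = struct.unpack_from("<Q", image, ph + 32)
        found = build_id_from_notes(image[p_offset:p_offset + p_filesz])
        if found is not None:
            return found.hex()
    return None
```

Because the build-id is derived from the linked contents, two builds of the same inner SO yield the same fingerprint, which is what makes it usable as part of a stable on-device file name.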
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
ChaoWao added a commit that referenced this pull request May 31, 2026
#923)

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
2 participants