Refactor: Extract shared PTO Runtime C API to common header #21

Merged: ChaoWao merged 1 commit into hw-native-sys:main from ChaoZheng109:platform-include on Jan 30, 2026

@ChaoZheng109
Collaborator

Consolidate duplicate pto_runtime_c_api.h headers from a2a3 and a2a3sim platforms into a single shared header, ensuring interface consistency across platforms.

Changes:

  • Add src/platform/include/host/pto_runtime_c_api.h as common header
  • Remove src/platform/a2a3/host/pto_runtime_c_api.h
  • Remove src/platform/a2a3sim/host/pto_runtime_c_api.h
  • Update a2a3 and a2a3sim CMakeLists.txt to include platform/include
  • Update include paths in both platform implementations

@ChaoWao merged commit 0db0ea4 into hw-native-sys:main on Jan 30, 2026
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 30, 2026
Bundles seven related threads that together answer "how do I reason
about Ascend hardware in this repo and not get burned by a wrong
--platform invocation."

1. New docs/hardware/ cross-chip tree:
   - chip-architecture.md: Host CPU + DDR attached via PCIe (x86) or
     UB / HCCS (Kunpeng), on-chip AICPU + AICore clusters, GM,
     end-to-end task flow, off-chip vs on-chip cost model.
   - SoC family <-> arch mapping table cites vllm-ascend FAQ #21
     (Atlas A2 = ascend910b1, Atlas A3 = ascend910_9391) plus the
     toolchain.py dav-c220 / dav-c310 bridge as authoritative
     sources.
   - cache-coherency.md moved from src/a2a3/docs/ and generalized:
     dcci + cache_invalidate_range live on both a2a3 and a5. Inbound
     refs in src/a2a3/docs/platform.md and the AICPU L2 perf
     collector comment updated.

2. Per-chip facts in src/{a2a3,a5}/docs/hardware.md:
   - a2a3: a2 vs a3 packaging (single die vs dual die / 2 device IDs
     sharing AICPU OS), per-die 24 AIC + 48 AIV, 64 GiB HBM,
     UB 1.0 / HCCS on Kunpeng.
   - a5: 2 dies as 1 device id, per-die layout, UB 2.0 on Kunpeng.
   - Three views of "how many cores" section: spec view (delivered
     to user code) vs HAL view vs CANN ini view, with the observed
     discrepancy resolved by the device-side probe in thread 5.
     a3 closure: cpu_id 0 = AICPU OS scheduler, cpu_id 1 = PG
     fab-disabled, cpu_id 2..7 = 6 user-schedulable. The same pattern
     on a5 is a calibrated inference pending its own probe run.

3. Rules reorg under .claude/rules/:
   - architecture.md + ascend-device.md merged then split by
     audience: ascend.md (HW + SW arch quick reference + AIC / AIV /
     AICPU terminology) and project-layout.md (Python wheel split,
     build system lookup, test layout).
   - Inbound refs in docs/python-packaging.md and
     review-pr/SKILL.md updated.

4. tools/cann-examples/query/ — host-side CLI:
   - Subcommands: devices, device <id> (full per-device dump:
     identification + cores + memory hierarchy with per-field
     comments), mem <id>, version (compiler/version.info — toolkit
     version, not aclrtGetVersion's runtime lib version).
   - Compile-time link to ascendcl + runtime + ascend_hal +
     drvdsmi_host (no dlopen). ASCEND_DRIVER_PATH defaults to the
     sibling of ASCEND_HOME_PATH, override via cmake -D.
   - Buffer sizes (UB / L1 / L0A/B/C) read from CANN platform_config
     ini because the matching ACL device-attribute queries return 0
     on CANN 9.0 / a3.

5. tools/cann-examples/aicpu-device-query/ — device-side HAL probe:
   - halGetDeviceInfo has queries flagged "used in device" in the
     header (notably AICPU + OS_SCHED, AICPU + PF_OCCUPY) that only
     succeed when called from inside an AICPU OS process. This tool
     uploads a small inner SO via the dispatcher bootstrap path
     (rtAicpuKernelLaunchExWithArgs with libaicpu_extend_kernels.so
     in Mode A, no sudo / no pre-deployment), runs the queries
     device-side, and reads results back through GM.
   - Closes the long-standing a3 question of whether the 8 -> 6
     AICPU gap is OS-reservation or PG: device-side OS_SCHED = 0x1
     proves cpu_id 0 is OS-owned (single bit), and the absence of
     cpu_id 1 from every other CPU module's OCCUPY mask plus
     not-in-vNPU-mode rules out virtualization remapping. The gap
     is therefore 1 OS + 1 PG, not 2 OS.
   - Tool README documents how to run it on a5 to close the
     analogous question there.

6. .claude/skills/onboard-arch-precheck/ — wrong-arch gate:
   - Refuses pytest / task-submit invocations with
     --platform a2a3|a5 when the host's actual silicon is the other
     family, before any device lock is acquired. CI is fine because
     each onboard runner is labeled with its arch; local hardware
     work bypasses that protection. Wrong-arch runs produce
     507018 / 507899 cascades that LOOK LIKE genuine bugs and
     routinely waste hours on phantom investigations.
   - Detection reads the same source as the query tool: npu-smi for
     Chip Name + NPU Name, then
     $ASCEND_HOME_PATH/{arch}-linux/data/platform_config/<SoC>.ini
     for Short_SoC_version, then maps to repo arch. No ACL init,
     no device binding, ~600 ms cold and ~5 ms cached
     (/tmp/onboard-arch-precheck.cache, 1 hr TTL). Sim variants
     (a2a3sim, a5sim) pass through unconditionally.
   - .claude/rules/task-submit-isolation.md links to the skill from
     its pre-flight section and adds bypass-the-precheck to the
     anti-patterns list.

7. CI integration in .github/workflows/ci.yml:
   - ut-a2a3 and ut-a5 jobs build tools/cann-examples/query and run
     `query version` (no device locked, no resource-spec conflict).
   - Same jobs build tools/cann-examples/aicpu-device-query
     (cross-compiled device SO + native host) as a link smoke test.
   - docs/ci.md job table updated; tools/README.md updated.
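
The 8 -> 6 AICPU accounting in threads 2 and 5 of the commit message above can be checked with a few lines of mask arithmetic. This is only an illustrative sketch: the OS_SCHED value 0x1 and the cpu_id ranges come from the commit message, while the names `USER_MASK`, `cpu_ids`, and `classify` are hypothetical, and expressing the user-schedulable cores as a bitmask is an assumption made here for the example.

```python
TOTAL_AICPU = 8       # physical AICPU count implied by the "8 -> 6" gap
OS_SCHED_MASK = 0x1   # device-side OS_SCHED value quoted in the commit message

# Assumption: cpu_id 2..7 (the six user-schedulable cores) written as a bitmask.
USER_MASK = sum(1 << cpu for cpu in range(2, 8))


def cpu_ids(mask: int) -> list[int]:
    """Return the cpu_ids whose bits are set in mask."""
    return [cpu for cpu in range(TOTAL_AICPU) if (mask >> cpu) & 1]


def classify(os_mask: int, user_mask: int) -> dict[str, list[int]]:
    """Split the cores into OS-owned, user-schedulable, and unclaimed."""
    all_cores = (1 << TOTAL_AICPU) - 1
    unclaimed = all_cores & ~(os_mask | user_mask)
    return {
        "os": cpu_ids(os_mask),
        "user": cpu_ids(user_mask),
        "unclaimed": cpu_ids(unclaimed),
    }


result = classify(OS_SCHED_MASK, USER_MASK)
print(result)
```

The single unclaimed core is cpu_id 1. The arithmetic alone does not say why it is missing; the commit message attributes it to PG because the probe ruled out a second OS-reserved core and vNPU remapping.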
ChaoWao added a commit that referenced this pull request May 30, 2026

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
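
The wrong-arch gate described in thread 6 of the commit message (read Short_SoC_version from a platform_config ini, map it to a repo arch, cache the result for an hour, let sim variants through) might look roughly like the sketch below. It is not the skill's actual code. The function names, the JSON cache format, the choice to refuse unknown SoCs, and the `SOC_TO_ARCH` entries are all assumptions made for illustration, and the npu-smi step is left out because its invocation is not shown in the source.

```python
import configparser
import json
import time
from pathlib import Path

CACHE_TTL_S = 3600  # 1 hr TTL, as quoted in the commit message

# Hypothetical table: the real Short_SoC_version -> repo arch mapping is not
# shown in the commit message, and a5 entries are omitted for that reason.
SOC_TO_ARCH = {"Ascend910B": "a2a3", "Ascend910_93": "a2a3"}


def read_short_soc(ini_path):
    """Return Short_SoC_version from the ini, whichever section holds it."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read(ini_path)
    for section in parser.sections():
        if parser.has_option(section, "Short_SoC_version"):
            return parser.get(section, "Short_SoC_version")
    return None


def load_cached_arch(cache_path, now):
    """Return the cached arch, or None if the cache is missing, bad, or stale."""
    try:
        entry = json.loads(Path(cache_path).read_text())
    except (OSError, ValueError):
        return None
    if not isinstance(entry, dict):
        return None
    if now - entry.get("stamp", 0) > CACHE_TTL_S:
        return None
    return entry.get("arch")


def check_platform(requested, ini_path, cache_path, now=None):
    """Return (ok, reason) for a requested --platform value."""
    if requested.endswith("sim"):
        # Sim variants never touch silicon, so they pass unconditionally.
        return True, "sim variant, passed through"
    now = time.time() if now is None else now
    arch = load_cached_arch(cache_path, now)
    if arch is None:
        soc = read_short_soc(ini_path)
        arch = SOC_TO_ARCH.get(soc)
        if arch is None:
            # Assumption: refuse rather than guess when the SoC is unknown.
            return False, f"cannot map SoC {soc!r} to a repo arch"
        Path(cache_path).write_text(json.dumps({"arch": arch, "stamp": now}))
    if arch != requested:
        return False, f"host is {arch}, but --platform {requested} was requested"
    return True, f"host arch {arch} matches"
```

A caller would run `check_platform("a5", ini_path, cache_path)` before acquiring any device lock and abort when the first element of the result is false.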