Refactor: Extract shared PTO Runtime C API to common header #21
Merged
Consolidate duplicate pto_runtime_c_api.h headers from a2a3 and a2a3sim platforms into a single shared header, ensuring interface consistency across platforms.
Changes:
- Add src/platform/include/host/pto_runtime_c_api.h as common header
- Remove src/platform/a2a3/host/pto_runtime_c_api.h
- Remove src/platform/a2a3sim/host/pto_runtime_c_api.h
- Update a2a3 and a2a3sim CMakeLists.txt to include platform/include
- Update include paths in both platform implementations
hw-native-sys-bot
pushed a commit
to hw-native-sys-bot/simpler
that referenced
this pull request
May 30, 2026
Bundles seven related threads that together answer "how do I reason
about Ascend hardware in this repo and not get burned by a wrong
--platform invocation."
1. New docs/hardware/ cross-chip tree:
- chip-architecture.md: Host CPU + DDR attached via PCIe (x86) or
UB / HCCS (Kunpeng), on-chip AICPU + AICore clusters, GM,
end-to-end task flow, off-chip vs on-chip cost model.
- SoC family <-> arch mapping table cites vllm-ascend FAQ hw-native-sys#21
(Atlas A2 = ascend910b1, Atlas A3 = ascend910_9391) plus the
toolchain.py dav-c220 / dav-c310 bridge as authoritative
sources.
- cache-coherency.md moved from src/a2a3/docs/ and generalized:
dcci + cache_invalidate_range live on both a2a3 and a5. Inbound
refs in src/a2a3/docs/platform.md and the AICPU L2 perf
collector comment updated.
2. Per-chip facts in src/{a2a3,a5}/docs/hardware.md:
- a2a3: a2 vs a3 packaging (single die vs dual die / 2 device IDs
sharing AICPU OS), per-die 24 AIC + 48 AIV, 64 GiB HBM,
UB 1.0 / HCCS on Kunpeng.
- a5: 2 dies as 1 device id, per-die layout, UB 2.0 on Kunpeng.
- Three views of "how many cores" section: spec view (delivered
to user code) vs HAL view vs CANN ini view, with the observed
discrepancy resolved by the device-side probe in thread 5.
a3 closure: cpu_id 0 = AICPU OS scheduler, cpu_id 1 = PG
fab-disabled, cpu_id 2..7 = 6 user-schedulable. a5 same
pattern is calibrated inference pending its own probe run.
3. Rules reorg under .claude/rules/:
- architecture.md + ascend-device.md merged then split by
audience: ascend.md (HW + SW arch quick reference + AIC / AIV /
AICPU terminology) and project-layout.md (Python wheel split,
build system lookup, test layout).
- Inbound refs in docs/python-packaging.md and
review-pr/SKILL.md updated.
4. tools/cann-examples/query/ — host-side CLI:
- Subcommands: devices, device <id> (full per-device dump:
identification + cores + memory hierarchy with per-field
comments), mem <id>, version (compiler/version.info — toolkit
version, not aclrtGetVersion's runtime lib version).
- Compile-time link to ascendcl + runtime + ascend_hal +
drvdsmi_host (no dlopen). ASCEND_DRIVER_PATH defaults to the
sibling of ASCEND_HOME_PATH, override via cmake -D.
- Buffer sizes (UB / L1 / L0A/B/C) read from CANN platform_config
ini because the matching ACL device-attribute queries return 0
on CANN 9.0 / a3.
5. tools/cann-examples/aicpu-device-query/ — device-side HAL probe:
- halGetDeviceInfo has queries flagged "used in device" in the
header (notably AICPU + OS_SCHED, AICPU + PF_OCCUPY) that only
succeed when called from inside an AICPU OS process. This tool
uploads a small inner SO via the dispatcher bootstrap path
(rtAicpuKernelLaunchExWithArgs with libaicpu_extend_kernels.so
in Mode A, no sudo / no pre-deployment), runs the queries
device-side, and reads results back through GM.
- Closes the long-standing a3 question of whether the 8 -> 6
AICPU gap is OS-reservation or PG: device-side OS_SCHED = 0x1
proves cpu_id 0 is OS-owned (single bit), and the absence of
cpu_id 1 from every other CPU module's OCCUPY mask plus
not-in-vNPU-mode rules out virtualization remapping. The gap
is therefore 1 OS + 1 PG, not 2 OS.
- Tool README documents how to run it on a5 to close the
analogous question there.
6. .claude/skills/onboard-arch-precheck/ — wrong-arch gate:
- Refuses pytest / task-submit invocations with
--platform a2a3|a5 when the host's actual silicon is the other
family, before any device lock is acquired. CI is fine because
each onboard runner is labeled with its arch; local hardware
work bypasses that protection. Wrong-arch runs produce
507018 / 507899 cascades that LOOK LIKE genuine bugs and
routinely waste hours on phantom investigations.
- Detection reads the same source as the query tool: npu-smi for
Chip Name + NPU Name, then
$ASCEND_HOME_PATH/{arch}-linux/data/platform_config/<SoC>.ini
for Short_SoC_version, then maps to repo arch. No ACL init,
no device binding, ~600 ms cold and ~5 ms cached
(/tmp/onboard-arch-precheck.cache, 1 hr TTL). Sim variants
(a2a3sim, a5sim) pass through unconditionally.
- .claude/rules/task-submit-isolation.md links to the skill from
its pre-flight section and adds bypass-the-precheck to the
anti-patterns list.
7. CI integration in .github/workflows/ci.yml:
- ut-a2a3 and ut-a5 jobs build tools/cann-examples/query and run
`query version` (no device locked, no resource-spec conflict).
- Same jobs build tools/cann-examples/aicpu-device-query
(cross-compiled device SO + native host) as a link smoke test.
- docs/ci.md job table updated; tools/README.md updated.
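The core accounting in item 5 of the commit message can be sketched as a small bitmask exercise. This is only an illustration: the function names are hypothetical, the 0xFC user mask is an assumed value chosen to match "cpu_id 2..7", and the real probe reads its masks through halGetDeviceInfo on the device rather than from Python.

```python
def mask_to_cpu_ids(mask):
    """Return the cpu_ids whose bits are set in an occupancy-style bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def classify_cpus(total, os_sched_mask, occupy_masks):
    """Label each cpu_id as 'os', 'user', or 'unclaimed'.

    A core that appears in no mask is only *unclaimed* here; the commit
    message attributes that core to PG after also ruling out vNPU remapping.
    """
    claimed = 0
    for m in occupy_masks:
        claimed |= m
    roles = {}
    for cpu in range(total):
        bit = 1 << cpu
        if os_sched_mask & bit:
            roles[cpu] = "os"
        elif claimed & bit:
            roles[cpu] = "user"
        else:
            roles[cpu] = "unclaimed"
    return roles


# a3 numbers from the commit message: 8 cores, OS_SCHED = 0x1.
# 0xFC is an assumed user mask covering cpu_id 2..7.
roles = classify_cpus(8, 0x01, [0xFC])
```

With these inputs cpu_id 0 is labeled "os", cpu_id 1 is "unclaimed", and cpu_id 2..7 are "user", which matches the 1 OS + 1 PG + 6 user-schedulable split described above.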
ChaoWao
added a commit
that referenced
this pull request
May 30, 2026
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
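The wrong-arch gate from item 6 of the commit message (sim passthrough, a cached detection with a 1 hr TTL, refusal on mismatch) might look roughly like the sketch below. The skill's actual implementation is not shown in this conversation, so everything here is an assumption: `detect` is a hypothetical callable standing in for the npu-smi plus platform_config lookup, the JSON cache layout is invented for illustration, and the function names are placeholders.

```python
import json
import time

CACHE_TTL_S = 3600  # 1 hr TTL, as stated in the commit message
SIM_PLATFORMS = {"a2a3sim", "a5sim"}  # pass through unconditionally


def cached_arch(detect, cache_path, now=None):
    """Return the repo arch, reusing a cached value younger than the TTL.

    `detect` is a hypothetical zero-argument callable that performs the
    slow (cold) detection and returns a repo arch such as "a2a3" or "a5".
    """
    now = time.time() if now is None else now
    try:
        with open(cache_path) as f:
            entry = json.load(f)
        if now - entry["ts"] < CACHE_TTL_S:
            return entry["arch"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or corrupt cache: fall through to a cold detection
    arch = detect()
    with open(cache_path, "w") as f:
        json.dump({"arch": arch, "ts": now}, f)
    return arch


def precheck(platform, detect, cache_path):
    """Return True if the run may proceed, False if it should be refused."""
    if platform in SIM_PLATFORMS:
        return True  # simulators never touch the host's silicon
    return platform == cached_arch(detect, cache_path)
```

A caller would run `precheck` before acquiring any device lock and abort when it returns False. Passing `detect` in as a parameter keeps the sketch testable without npu-smi or a CANN install.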