Refactor: unify profiling collector framework across a2a3/a5 (-855 lines) #944
📝 Walkthrough
This PR generalizes the profiling infrastructure to support both SVM and non-SVM platforms through explicit buffer ownership tracking, architecture-aware copy callbacks, and centralized paired buffer allocation in ProfilerBase. The a5-specific buffer pool manager moves to common platform, buffer ownership detection shifts from pointer comparison to an explicit malloc_shadows_ set, and all collectors delegate allocation to the base class instead of maintaining local allocator methods.
Changes
Cross-Platform Profiling Infrastructure Refactoring
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request unifies the profiling framework across SVM (a2a3) and non-SVM (a5) architectures by moving "buffer_pool_manager.h" and "profiler_base.h" to a shared common directory. It refactors "ProfilerBase" to support both memory models at runtime, introducing a unified "alloc_paired_buffer" helper that replaces duplicate "alloc_single_buffer" implementations in individual collectors. Additionally, it adds stub implementations of "profiling_copy" for "a2a3" to satisfy linker requirements. No review comments were provided, and the implementation appears clean and well-structured, so I have no further feedback.
9927e4d to bd8cce5
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a5/platform/src/host/l2_swimlane_collector.cpp (1)
87-110: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Unwind paired allocations on init failure.
Once `alloc_paired_buffer()` succeeds here, the buffer and its host shadow are registered in `manager_`. Any later `return -1` in `initialize()` leaks those allocations because `finalize()` short-circuits while `shm_host_ == nullptr`. Add an init-scope cleanup guard or explicit unwind for all earlier paired allocations before each failure return.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/src/host/l2_swimlane_collector.cpp` around lines 87 - 110, The alloc_paired_buffer call registers the paired device/host buffer in manager_ but later early returns in initialize() leak that allocation because finalize() is a no-op when shm_host_==nullptr; wrap the init path after alloc_paired_buffer (and any prior paired allocations) with an init-scope cleanup guard (RAII) or explicitly unwind the paired allocations from manager_ on every error return: ensure you undo the registration for perf_dev_ptr/perf_host_ptr (the entries added by alloc_paired_buffer) before returning -1, or set shm_host_ appropriately so finalize() will free them; update code around set_memory_context, alloc_paired_buffer, perf_dev_ptr/perf_host_ptr, initialize(), finalize(), and manager_ usage to perform the cleanup on all failure paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/src/host/dep_gen_collector.cpp`:
- Around line 117-120: The call to set_memory_context currently passes shm_host_
twice and hardcodes shm_size to 0, leaving the profiler with incorrect SHM
metadata; update the call in dep_gen_collector (the set_memory_context
invocation that currently uses alloc_cb, register_cb, free_cb,
/*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/shm_host_, shm_host_,
/*shm_size=*/0, device_id) to pass the actual shm_dev_, shm_host_, and shm_size_
members (shm_dev_, shm_host_, shm_size_) so the registered path receives correct
shared-memory tuple information.
In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp`:
- Around line 121-124: The initial set_memory_context call seeds the callbacks
but leaves SHM fields as nullptr/0 so ProfilerBase sees an empty SHM context;
after the SHM allocation succeeds and perf_dev_ptr, perf_host_ptr (and the SHM
size) are known, call set_memory_context again with the same alloc_cb,
register_cb, free_cb and the actual shm_dev (perf_dev_ptr), shm_host
(perf_host_ptr), shm_size, and device_id to refresh the memory context used by
ProfilerBase.
In `@src/a2a3/platform/src/host/pmu_collector.cpp`:
- Around line 152-155: The call to set_memory_context is passing shm_host_ twice
and 0 for size; replace those placeholders with the actual PMU SHM metadata that
init() computed — pass the real SHM device pointer, the host pointer
(shm_host_), and the real SHM size (e.g., shm_size or shm_size_) instead of the
duplicate shm_host_ and 0 so set_memory_context(alloc_cb, register_cb, free_cb,
/*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/<actual_shm_device_ptr>,
shm_host_, /*shm_size=*/<actual_shm_size>, device_id).
In `@src/a2a3/platform/src/host/scope_stats_collector.cpp`:
- Around line 116-119: The call to set_memory_context is passing placeholders
for device pointer and size; update the call in scope_stats_collector.cpp to
forward the actual initialized shared-memory state by passing shm_dev_ as the
device pointer, shm_host_ as the host pointer, and shm_size_ as the size (keep
alloc_cb, register_cb, free_cb and device_id as-is) so the base profiler
receives the correct memory context.
In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 75-78: The initial call to set_memory_context(alloc_cb,
register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr,
/*shm_dev=*/nullptr, /*shm_host=*/nullptr, /*shm_size=*/0, device_id) seeds the
profiler with empty SHM fields and is never updated; after dump_shared_mem_dev_
and shm_host_ (and shm_size_) are initialized you must call set_memory_context
again to provide the real SHM pointers and size so the base profiler uses the
correct shared memory context. Locate where dump_shared_mem_dev_ and shm_host_
are assigned and invoke set_memory_context with the same
alloc_cb/register_cb/free_cb and the actual copy_to/copy_from (if available) and
the initialized shm_dev = dump_shared_mem_dev_, shm_host = shm_host_, shm_size =
dump_shm_size (and device_id) to overwrite the previously seeded empty context
(ensure you reference set_memory_context, dump_shared_mem_dev_, shm_host_, and
device_id).
In `@src/a5/platform/src/host/pmu_collector.cpp`:
- Around line 68-79: The PMU init path can leak resources because
alloc_paired_buffer() registers ownership in manager_ before initialized_ is set
and later failures return without finalize(); add a local rollback guard in
init() after alloc_paired_buffer(shm_size, &shm_host_local) (and after any
subsequent allocation that can fail) to free the paired SHM/device buffer,
unregister the ownership from manager_, and free any malloc-shadow allocations;
implement this as a small scope guard/RAII object (or explicit cleanup block)
that is cancelled when init() completes successfully or when initialized_
becomes true so finalize() remains the single normal cleanup path.
In `@src/a5/platform/src/host/scope_stats_collector.cpp`:
- Around line 65-76: After alloc_paired_buffer returns a valid shm_dev_local and
shm_host_local but the function later fails and returns -1 while initialized_ is
still false, you must rollback the partial init: explicitly release the device
buffer and unregister/free the host shadow and reset local pointers before
returning instead of relying on finalize(); call the same
deallocation/unregister routines used elsewhere (e.g. the free callback and host
unregister path associated with set_memory_context/alloc_paired_buffer) to free
shm_dev_local and unregister or free shm_host_local, set
shm_dev_local/shm_host_local to nullptr, and ensure no ownership is left behind
prior to returning.
In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 60-73: The init path registers paired allocations early (via
set_memory_context and alloc_paired_buffer) but lacks unwind on subsequent
allocation failures, leaking device buffers/malloc shadows; modify the init to
track shm_dev_local/shm_host_local returned by alloc_paired_buffer and wrap the
remaining allocation steps (e.g., further arena/meta-buffer allocations after
calc_dump_data_size/alloc_paired_buffer) in an error-path guard that on any
failure frees the paired allocation (use the paired free API you have for
alloc_paired_buffer), resets/shrinks any changed state (shm_host_/shm_dev_ or
related context), and only then returns an error, or alternatively defer
registering the memory context until after all allocations succeed; reference
set_memory_context, alloc_paired_buffer, shm_host_, shm_dev_local, finalize()
when making the change.
---
Outside diff comments:
In `@src/a5/platform/src/host/l2_swimlane_collector.cpp`:
- Around line 87-110: The alloc_paired_buffer call registers the paired
device/host buffer in manager_ but later early returns in initialize() leak that
allocation because finalize() is a no-op when shm_host_==nullptr; wrap the init
path after alloc_paired_buffer (and any prior paired allocations) with an
init-scope cleanup guard (RAII) or explicitly unwind the paired allocations from
manager_ on every error return: ensure you undo the registration for
perf_dev_ptr/perf_host_ptr (the entries added by alloc_paired_buffer) before
returning -1, or set shm_host_ appropriately so finalize() will free them;
update code around set_memory_context, alloc_paired_buffer,
perf_dev_ptr/perf_host_ptr, initialize(), finalize(), and manager_ usage to
perform the cleanup on all failure paths.
ℹ️ Review info
📒 Files selected for processing (23)
- src/a2a3/platform/include/host/profiling_common/buffer_pool_manager.h
- src/a2a3/platform/include/host/profiling_common/profiler_base.h
- src/a2a3/platform/onboard/host/CMakeLists.txt
- src/a2a3/platform/onboard/host/profiling_copy.cpp
- src/a2a3/platform/sim/host/CMakeLists.txt
- src/a2a3/platform/sim/host/profiling_copy.cpp
- src/a2a3/platform/src/host/dep_gen_collector.cpp
- src/a2a3/platform/src/host/l2_swimlane_collector.cpp
- src/a2a3/platform/src/host/pmu_collector.cpp
- src/a2a3/platform/src/host/scope_stats_collector.cpp
- src/a2a3/platform/src/host/tensor_dump_collector.cpp
- src/a5/platform/include/host/l2_swimlane_collector.h
- src/a5/platform/include/host/pmu_collector.h
- src/a5/platform/include/host/scope_stats_collector.h
- src/a5/platform/include/host/tensor_dump_collector.h
- src/a5/platform/onboard/host/device_runner.cpp
- src/a5/platform/src/host/l2_swimlane_collector.cpp
- src/a5/platform/src/host/pmu_collector.cpp
- src/a5/platform/src/host/scope_stats_collector.cpp
- src/a5/platform/src/host/tensor_dump_collector.cpp
- src/common/platform/include/host/profiling_common/buffer_pool_manager.h
- src/common/platform/include/host/profiling_common/profiler_base.h
- src/common/platform/include/host/profiling_copy.h
💤 Files with no reviewable changes (3)
- src/a2a3/platform/include/host/profiling_common/profiler_base.h
- src/a2a3/platform/include/host/profiling_common/buffer_pool_manager.h
- src/a5/platform/include/host/scope_stats_collector.h
…nes)
Pulls a2a3's and a5's `profiler_base.h` / `buffer_pool_manager.h` /
`profiling_copy.h` into a single shared implementation under
`src/common/platform/include/host/`, and extracts the buffer-pairing
helper that every collector's init() used to inline.
Background
----------
a2a3 and a5 host-side profiling stacks had diverged because:
- a2a3 has SVM (halHostRegister maps device pointers into host address
space), so collectors directly read/write the device-side memory
through the registered host view.
- a5 has no halHostRegister, so device↔host transfers go through
rtMemcpy (onboard) or memcpy (sim) via profiling_copy.h, with a
paired malloc'd host shadow that the mgmt loop mirrors per-tick.
The frameworks were ~47%/89% diff at the header level. This PR makes
the choice between the two paths a runtime decision driven by what the
collector installs in MemoryOps, rather than per-arch source code.
Framework changes
-----------------
- `MemoryOps` adds optional `copy_to_device` / `copy_from_device`
function fields. Non-SVM platforms install them; SVM platforms leave
them null and the manager's internal null-check makes every
mirror_/range_/copy_buffer_ method a single-call no-op on SVM.
- `BufferPoolManager::set_memory_context` now takes shm_dev + shm_host
+ shm_size + device_id. SVM platforms pass `shm_dev == shm_host` so
range/mirror operations bounds-check successfully but never copy.
- `ProfilerBase::set_memory_context` propagates the copy callbacks and
the dev/host/size triple from the collector into the manager.
- `ProfilerBase::start()` picks the register fallback when the
collector passes nullptr: identity-map on SVM (copy_to_device_ also
null) or `default_host_shadow_register` on non-SVM (copy_to_device_
installed). The previous a2a3 ProfilerBase hard-coded identity and
the a5 one hard-coded host-shadow.
- `ProfilerBase::alloc_paired_buffer(size, &host_out)` is new and
replaces a5's per-collector `alloc_single_buffer` and a2a3's inline
`if (register_cb != nullptr) {…} else {…}` blocks. It picks among
three paths from the stashed memory context: register_cb_ (a2a3
onboard halHostRegister), copy_to_device_ + malloc (a5 host-shadow),
or identity-map (a2a3 sim / pre-init).
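The three-way choice described for `alloc_paired_buffer` can be sketched roughly as follows. This is a minimal illustration, not the framework's code: the callback typedefs, `PairingContextSketch`, `PairPath`, and `alloc_paired_buffer_sketch` are all hypothetical names, and the real signatures of `register_cb_` and the copy callbacks are not shown in this PR text. Only the selection order (register callback, then copy callback plus malloc'd shadow, then identity map) and the idea of a `malloc_shadows` set come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Hypothetical callback shapes; the real typedefs are not shown in this PR.
using AllocFn = void* (*)(std::size_t size);
using FreeFn = void (*)(void* dev_ptr);
using RegisterFn = int (*)(void* dev_ptr, std::size_t size, int device_id, void** host_out);
using CopyFn = int (*)(void* dst, const void* src, std::size_t size);

enum class PairPath { Registered, HostShadow, Identity };

struct PairingContextSketch {
    AllocFn alloc_cb = nullptr;
    FreeFn free_cb = nullptr;
    RegisterFn register_cb = nullptr;  // e.g. a halHostRegister wrapper (a2a3 onboard)
    CopyFn copy_to_device = nullptr;   // installed only on non-SVM platforms (a5)
    int device_id = 0;
    std::unordered_set<void*> malloc_shadows;  // host shadows malloc'd by this sketch
};

// Allocates a device buffer and writes its host view to *host_out.
// Returns the device pointer, or nullptr on failure (nothing stays allocated).
void* alloc_paired_buffer_sketch(PairingContextSketch& ctx, std::size_t size,
                                 void** host_out, PairPath* path_out) {
    assert(ctx.alloc_cb && ctx.free_cb && host_out && path_out);
    *host_out = nullptr;
    void* dev = ctx.alloc_cb(size);
    if (dev == nullptr) return nullptr;

    if (ctx.register_cb != nullptr) {
        // Path 1: map the device buffer into the host address space.
        if (ctx.register_cb(dev, size, ctx.device_id, host_out) != 0) {
            ctx.free_cb(dev);
            *host_out = nullptr;
            return nullptr;
        }
        *path_out = PairPath::Registered;
    } else if (ctx.copy_to_device != nullptr) {
        // Path 2: no mapping available, so pair with a malloc'd host shadow.
        void* shadow = std::malloc(size);
        if (shadow == nullptr) {
            ctx.free_cb(dev);
            return nullptr;
        }
        ctx.malloc_shadows.insert(shadow);
        *host_out = shadow;
        *path_out = PairPath::HostShadow;
    } else {
        // Path 3: identity map, host and device views are the same pointer.
        *host_out = dev;
        *path_out = PairPath::Identity;
    }
    return dev;
}
```

Recording each shadow in an explicit set, rather than inferring ownership from `host != dev`, matches the walkthrough's note that ownership detection moved from pointer comparison to a `malloc_shadows_` set.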
Collector changes
-----------------
- Both arches' 4 profiling collectors (l2_swimlane, tensor_dump, pmu,
scope_stats) updated to pass the new 9-arg `set_memory_context`
signature. a2a3 collectors pass nullptr copy_* + identical shm_dev
and shm_host. a5 collectors pass profiling_copy_*_for_ops shims
(declared in the moved profiling_copy.h) + distinct shm_dev/host.
- a5's per-collector `alloc_single_buffer` private helpers (4 copies,
~30 lines each) are deleted. Init sites switched to the base's
`alloc_paired_buffer`.
- a2a3 onboard + sim now compile a tiny `profiling_copy.cpp` stub
(returns 0) so the framework's reference to profiling_copy_*
resolves at link time. The stubs are never reached on a2a3 because
the collectors never install the copy_to_device / copy_from_device
callbacks; only the symbol-resolution edge of the build requires
them.
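To illustrate how the two collector flavors differ only in what they install, here is a small sketch of the null-check behavior described above. All names (`ShmContextSketch`, `memcpy_shim`, `make_svm_context`, `make_shadow_context`, `mirror_range_from_device`) are hypothetical, and the callback signature is an assumption; the PR text only states that SVM collectors pass null copy callbacks with `shm_dev == shm_host`, and that non-SVM collectors pass copy shims with distinct pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical copy-callback shape; the real one is declared in profiling_copy.h.
using CopyFn = int (*)(void* dst, const void* src, std::size_t size);

struct ShmContextSketch {
    CopyFn copy_to_device = nullptr;
    CopyFn copy_from_device = nullptr;
    void* shm_dev = nullptr;
    void* shm_host = nullptr;
    std::size_t shm_size = 0;
};

// memcpy-based shim, in the spirit of the sim backend described above.
inline int memcpy_shim(void* dst, const void* src, std::size_t size) {
    std::memcpy(dst, src, size);
    return 0;
}

// a2a3-style wiring: null copy callbacks, one pointer serves as both views.
inline ShmContextSketch make_svm_context(void* shm, std::size_t size) {
    ShmContextSketch ctx;
    ctx.shm_dev = shm;
    ctx.shm_host = shm;
    ctx.shm_size = size;
    return ctx;
}

// a5-style wiring: copy shims installed, device and host shadow are distinct.
inline ShmContextSketch make_shadow_context(void* dev, void* host, std::size_t size) {
    assert(dev != host);
    ShmContextSketch ctx;
    ctx.copy_to_device = memcpy_shim;
    ctx.copy_from_device = memcpy_shim;
    ctx.shm_dev = dev;
    ctx.shm_host = host;
    ctx.shm_size = size;
    return ctx;
}

// Mirrors [offset, offset + size) from device to host.
// Returns -1 when the range is out of bounds, 0 on success or SVM no-op.
inline int mirror_range_from_device(const ShmContextSketch& ctx, std::size_t offset,
                                    std::size_t size) {
    if (offset > ctx.shm_size || size > ctx.shm_size - offset) return -1;
    if (ctx.copy_from_device == nullptr) return 0;  // SVM: host view is the device memory
    char* host = static_cast<char*>(ctx.shm_host) + offset;
    const char* dev = static_cast<const char*>(ctx.shm_dev) + offset;
    return ctx.copy_from_device(host, dev, size);
}
```

With this shape, the bounds check runs on both platforms while the copy itself is skipped on SVM, which is why the a2a3 stubs described above would never be reached.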
What remains divergent (deliberately)
-------------------------------------
The leaf collectors' init() bodies still carry per-arch state (a2a3
collectors keep `shm_dev_` / `shm_host_` / `shm_size_` / `shm_registered_`
as members; a5 collectors store that state on the base + manager).
Unifying those is the natural next step, but it requires per-collector
lifecycle restructuring that warrants its own PR. The framework changes
in this PR are the load-bearing pieces; merging the per-arch
leaf-collector .cpp files is left as a follow-up.
Test plan
---------
- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass
— confirms the SVM path through the unified framework matches the
previous behavior end-to-end
Net: -855 lines (23 files touched).
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
bd8cce5 to a17cafe
…lines)
Builds on hw-native-sys#944 (profiling framework unification). With the
framework's `alloc_paired_buffer` + `set_memory_context` doing the
SVM-vs-shadow disambiguation, the per-arch `scope_stats_collector` and
`tensor_dump_collector` implementations converged to identical code
(modulo header guards and arch-name comments). Move both pairs to
`src/common/platform/{include,src}/host/`.
Files moved
-----------
- `scope_stats_collector.{h,cpp}`: a5's version is canonical; a2a3's
identical copy deleted. The two files differed only in:
  * arch-name comments (a5's "a5 specifics" generalized to "SVM vs
    non-SVM" paragraph applicable to both arches)
  * header guard names (SRC_A5_… → SRC_COMMON_…)
  * a2a3 carried 3 extra members (`shm_registered_`, `shm_size_`,
    `buffers_registered_`) that were only there to gate
    `release_one_buffer(p, *_registered_ ? unregister_cb : nullptr, free_cb)`
    calls. The gate is redundant: callers already pair `register_cb`
    and `unregister_cb` consistently (a2a3 onboard passes both
    halHostRegister + halHostUnregister; a2a3 sim and a5 pass nullptr
    for both). Dropped along with the duplicate implementation.
- `tensor_dump_collector.{h,cpp}`: same pattern, same dedup.
CMakeLists updates
------------------
The 4 host CMakeLists (a2a3 onboard/sim + a5 onboard/sim) updated to
pull both collectors from `common/platform/src/host/` instead of their
per-arch `src/host/` directories.
What stays per-arch (deliberate, out of scope for this PR)
----------------------------------------------------------
- `pmu_collector.{h,cpp}`: `pmu_resolve_event_config_a2a3` /
`pmu_resolve_event_config_a5` are arch-specific PMU event tables rooted
in silicon revision differences; a5 also carries a `PmuAicoreRing`
infrastructure (with `aicore_ring_ptr` on `PmuBufferState`) that a2a3
doesn't have. Cannot be reconciled without changing the on-device
protocol layout.
- `l2_swimlane_collector.{h,cpp}`: a5 carries `SCHED_PHASE_COUNT` enum
value, `L2SwimlaneAicoreRing` struct + `aicore_ring_ptr` member,
`mismatch_record_count` and `fanout_count` fields that a2a3 doesn't
have. Same on-device protocol divergence.
Both could potentially be unified later by first reconciling the
on-device data structures, but that's an aicpu/aicore protocol refactor,
much heavier than this purely-host refactor.
Test plan
---------
- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass;
confirms the unified scope_stats + tensor_dump go through the SVM path
on a2a3 correctly, AND that l2_swimlane / pmu (still per-arch) keep
working against the unified framework.
Net: -1466 lines on this branch (12 files touched, no behavior change).
Stacks on top of hw-native-sys#944.
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
…lines) (#945)
Builds on #944 (profiling framework unification). With the framework's
`alloc_paired_buffer` + `set_memory_context` doing the SVM-vs-shadow
disambiguation, the per-arch `scope_stats_collector` and
`tensor_dump_collector` implementations converged to identical code
(modulo header guards and arch-name comments). Move both pairs to
`src/common/platform/{include,src}/host/`.
Files moved
-----------
- `scope_stats_collector.{h,cpp}`: a5's version is canonical; a2a3's
identical copy deleted. The two files differed only in:
  * arch-name comments (a5's "a5 specifics" generalized to "SVM vs
    non-SVM" paragraph applicable to both arches)
  * header guard names (SRC_A5_… → SRC_COMMON_…)
  * a2a3 carried 3 extra members (`shm_registered_`, `shm_size_`,
    `buffers_registered_`) that were only there to gate
    `release_one_buffer(p, *_registered_ ? unregister_cb : nullptr, free_cb)`
    calls. The gate is redundant: callers already pair `register_cb`
    and `unregister_cb` consistently (a2a3 onboard passes both
    halHostRegister + halHostUnregister; a2a3 sim and a5 pass nullptr
    for both). Dropped along with the duplicate implementation.
- `tensor_dump_collector.{h,cpp}`: same pattern, same dedup.
CMakeLists updates
------------------
The 4 host CMakeLists (a2a3 onboard/sim + a5 onboard/sim) updated to
pull both collectors from `common/platform/src/host/` instead of their
per-arch `src/host/` directories.
What stays per-arch (deliberate, out of scope for this PR)
----------------------------------------------------------
- `pmu_collector.{h,cpp}`: `pmu_resolve_event_config_a2a3` /
`pmu_resolve_event_config_a5` are arch-specific PMU event tables rooted
in silicon revision differences; a5 also carries a `PmuAicoreRing`
infrastructure (with `aicore_ring_ptr` on `PmuBufferState`) that a2a3
doesn't have. Cannot be reconciled without changing the on-device
protocol layout.
- `l2_swimlane_collector.{h,cpp}`: a5 carries `SCHED_PHASE_COUNT` enum
value, `L2SwimlaneAicoreRing` struct + `aicore_ring_ptr` member,
`mismatch_record_count` and `fanout_count` fields that a2a3 doesn't
have. Same on-device protocol divergence.
Both could potentially be unified later by first reconciling the
on-device data structures, but that's an aicpu/aicore protocol refactor,
much heavier than this purely-host refactor.
Test plan
---------
- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass;
confirms the unified scope_stats + tensor_dump go through the SVM path
on a2a3 correctly, AND that l2_swimlane / pmu (still per-arch) keep
working against the unified framework.
Net: -1466 lines on this branch (12 files touched, no behavior change).
Stacks on top of #944.
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Plug the latent leak CodeRabbit flagged on hw-native-sys#944: a5
PmuCollector::init, ScopeStatsCollector::init, TensorDumpCollector::init
register paired device+host buffers in BufferPoolManager via
alloc_paired_buffer BEFORE flipping the initialized_ flag. If a
subsequent allocation fails, init() returns -1; finalize() then
early-exits (gated on initialized_ / shm_host_) and every registered
device buffer + framework-malloc'd host shadow leaks.
The pattern existed pre-hw-native-sys#944 (old alloc_single_buffer +
manual register_mapping had the same shape), so this is not a regression
of the unification work, just the cleanup it enables.
Framework changes
-----------------
- BufferPoolManager::release_all_owned(release_fn) [new]: abort-path
cleanup that releases EVERY framework-tracked dev_ptr (via release_fn)
and every framework-malloc'd host shadow (via std::free), then clears
all internal containers. Distinct from release_owned_buffers() because
this also catches buffers parked in callers' SPSC free_queues (tracked
via register_mapping but not framework-owned via a queue). Drains
recycled/done/ready first (just clears; release goes via dev_to_host_
to avoid double-free) then walks the full mapping table.
- profiling_common::InitRollbackGuard<Manager> [new, profiler_base.h]:
RAII scope guard for collector init() rollback. Holds a manager
reference + release_fn + a vector of "extra direct dev_ptrs" the
collector owns outside the framework (e.g. PMU per-core PmuAicoreRings
on a5: plain alloc_cb allocations with no host shadow). On destruction
without commit(), calls manager.release_all_owned + free_cb on each
direct ptr. Move-only.
Collector wiring
----------------
- common/scope_stats_collector.cpp init(): construct guard after
set_memory_context, commit() right before return 0. Catches the shm
region + ScopeStatsBuffer entries (free_queue and recycled pool).
- common/tensor_dump_collector.cpp init(): same pattern. Catches the shm
region + per-thread arenas + DumpMetaBuffers (free_queue and recycled
pool).
- a5/pmu_collector.cpp init(): same pattern + guard.add_direct_ptr(ring)
for each per-core PmuAicoreRing (those don't go through
alloc_paired_buffer so the framework doesn't track them; register them
with the guard explicitly).
Test plan
---------
- a2a3sim ST L1+L2: pass (rollback path inert on success; guard.commit
short-circuits the destructor).
- a5sim ST L1+L2: pass (same).
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim):
clean.
Net: +122 lines (5 files touched).
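The rollback guard described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code in profiler_base.h: `InitRollbackGuardSketch`, `ToyManager`, and `toy_init` are hypothetical names, the callbacks are modeled as `std::function<void(void*)>`, and the toy manager only tracks raw pointers (the real `release_all_owned` is described as also freeing malloc'd host shadows and clearing internal queues).

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Move-only RAII guard: rolls back on destruction unless commit() was called.
template <typename Manager>
class InitRollbackGuardSketch {
public:
    using PtrFn = std::function<void(void*)>;

    InitRollbackGuardSketch(Manager& manager, PtrFn release_fn, PtrFn free_cb)
        : manager_(&manager), release_fn_(std::move(release_fn)), free_cb_(std::move(free_cb)) {}

    InitRollbackGuardSketch(const InitRollbackGuardSketch&) = delete;
    InitRollbackGuardSketch& operator=(const InitRollbackGuardSketch&) = delete;
    InitRollbackGuardSketch& operator=(InitRollbackGuardSketch&&) = delete;

    InitRollbackGuardSketch(InitRollbackGuardSketch&& other)
        : manager_(other.manager_), release_fn_(std::move(other.release_fn_)),
          free_cb_(std::move(other.free_cb_)), direct_ptrs_(std::move(other.direct_ptrs_)),
          committed_(other.committed_) {
        other.committed_ = true;  // the moved-from guard must not roll back
    }

    ~InitRollbackGuardSketch() {
        if (committed_) return;
        manager_->release_all_owned(release_fn_);      // framework-tracked buffers
        for (void* p : direct_ptrs_) free_cb_(p);      // allocations outside the framework
    }

    void add_direct_ptr(void* p) { direct_ptrs_.push_back(p); }
    void commit() { committed_ = true; }

private:
    Manager* manager_;
    PtrFn release_fn_;
    PtrFn free_cb_;
    std::vector<void*> direct_ptrs_;
    bool committed_ = false;
};

// Toy stand-in for BufferPoolManager: it only tracks device pointers.
struct ToyManager {
    std::vector<void*> tracked;
    void release_all_owned(const std::function<void(void*)>& release_fn) {
        for (void* p : tracked) release_fn(p);
        tracked.clear();
    }
};

// Toy init(): registers two tracked buffers and one direct pointer, then
// either fails midway (guard rolls back) or commits.
int toy_init(ToyManager& manager, bool fail_midway, int* released) {
    static char slots[3];
    assert(released != nullptr);
    auto count = [released](void*) { ++*released; };
    InitRollbackGuardSketch<ToyManager> guard(manager, count, count);
    manager.tracked.push_back(&slots[0]);  // stands in for the shm region
    manager.tracked.push_back(&slots[1]);  // stands in for a pooled buffer
    guard.add_direct_ptr(&slots[2]);       // stands in for a per-core ring
    if (fail_midway) return -1;            // destructor releases all three
    guard.commit();
    return 0;
}
```

Because every early `return -1` runs the destructor, the failure paths need no explicit cleanup code, and `finalize()` can stay the only cleanup path after a successful init.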
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of duplicated and oddly-placed files that the recent unification work left behind. Four orthogonal changes bundled here because each touches the same set of CMakeLists; splitting would mean three more rebuild rounds and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------
Seven files per backend (14 total) were byte-identical (or differed only in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp, orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to pick them up from common/platform/{onboard,sim}/aicpu/. The device_malloc.cpp arch tag in its @brief was the only real content diff; generalized to "Real Hardware" / "Simulation" without the arch qualifier. Backfilled a copyright header that was missing on device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp) have real arch-specific divergence (register addresses, kernel protocols) and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------
src/common/platform/include/host/profiling_common/{buffer_pool_manager, profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager, profiler_base}.h. Updated 10 #include sites and the 2 header guards. The profiling_common:: C++ namespace stays; file path and namespace don't have to match.

3. Consolidate small src/common subdirs
----------------------------------------
- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h. The file is a generic bump-arena utility, not a comm primitive; the enclosing dir name was misleading. Updated 10 #include sites "device_arena.h" → "utils/device_arena.h" and dropped the common/device_comm entry from 8 CMakeLists (replaced with common since utils/ resolves there).
- src/common/sim_context/ → src/common/platform/sim/sim_context/. The dir is sim-only infrastructure (CPU sim context for CANN intrinsic emulation), so it belongs next to the other common/platform/sim/ shared sim infrastructure. Updated:
  * the dir's own CMakeLists relative path to log/include;
  * simpler_setup/runtime_compiler.py::compile_sim_context source path;
  * 4 sim-host CMakeLists references;
  * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------
Per-arch src/{arch}/platform/src/ was confusingly nested inside the top-level src/ directory and read as "src/src" in many paths. Renamed to shared/ across all 3 trees (a2a3, a5, common), matching its actual semantic ("shared between onboard and sim within one arch"). Updated 21 files that referenced the old path: CMakeLists, host headers, docs, one test file, and the src/{arch}/docs/platform.md map.

Test plan
---------
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim) + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, +0 / -0 behavioral changes (pure layout).
…s) (#948)

Plug the latent leak CodeRabbit flagged on #944: a5 PmuCollector::init, ScopeStatsCollector::init, TensorDumpCollector::init register paired device+host buffers in BufferPoolManager via alloc_paired_buffer BEFORE flipping the initialized_ flag. If a subsequent allocation fails, init() returns -1; finalize() then early-exits (gated on initialized_ / shm_host_) and every registered device buffer + framework-malloc'd host shadow leaks. The pattern existed pre-#944 (old alloc_single_buffer + manual register_mapping had the same shape), so this is not a regression of the unification work, just the cleanup it enables.

Framework changes
-----------------
- BufferPoolManager::release_all_owned(release_fn) [new]: abort-path cleanup that releases EVERY framework-tracked dev_ptr (via release_fn) and every framework-malloc'd host shadow (via std::free), then clears all internal containers. Distinct from release_owned_buffers() because this also catches buffers parked in callers' SPSC free_queues (tracked via register_mapping but not framework-owned via a queue). Drains recycled/done/ready first (just clears; release goes via dev_to_host_ to avoid double-free) then walks the full mapping table.
- profiling_common::InitRollbackGuard<Manager> [new, profiler_base.h]: RAII scope guard for collector init() rollback. Holds a manager reference + release_fn + a vector of "extra direct dev_ptrs" the collector owns outside the framework (e.g. PMU per-core PmuAicoreRings on a5, which are plain alloc_cb allocations with no host shadow). On destruction without commit(), calls manager.release_all_owned + free_cb on each direct ptr. Move-only.

Collector wiring
----------------
- common/scope_stats_collector.cpp init(): construct guard after set_memory_context, commit() right before return 0. Catches the shm region + ScopeStatsBuffer entries (free_queue and recycled pool).
- common/tensor_dump_collector.cpp init(): same pattern. Catches the shm region + per-thread arenas + DumpMetaBuffers (free_queue and recycled pool).
- a5/pmu_collector.cpp init(): same pattern + guard.add_direct_ptr(ring) for each per-core PmuAicoreRing (those don't go through alloc_paired_buffer so the framework doesn't track them; register them with the guard explicitly).

Test plan
---------
- a2a3sim ST L1+L2: pass (rollback path inert on success; guard.commit short-circuits the destructor).
- a5sim ST L1+L2: pass (same).
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim): clean.

Net: +122 lines (5 files touched).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
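The rollback behaviour this commit message describes can be illustrated with a minimal sketch. It is not the code from profiler_base.h: ToyManager and toy_init are hypothetical stand-ins, a single release function is used for both tracked and direct pointers (the commit message names a separate free_cb for the latter), and host shadows are left out entirely.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for BufferPoolManager: it only tracks device pointers
// and releases all of them through a caller-supplied function.
struct ToyManager {
    std::unordered_set<void*> tracked;
    void register_mapping(void* dev_ptr) { tracked.insert(dev_ptr); }
    void release_all_owned(const std::function<void(void*)>& release_fn) {
        for (void* p : tracked) release_fn(p);
        tracked.clear();
    }
};

// Move-only scope guard. Unless commit() is called, the destructor releases
// everything the manager tracks plus any directly registered pointers.
template <typename Manager>
class InitRollbackGuard {
public:
    InitRollbackGuard(Manager& mgr, std::function<void(void*)> release_fn)
        : mgr_(&mgr), release_fn_(std::move(release_fn)) {}
    InitRollbackGuard(const InitRollbackGuard&) = delete;
    InitRollbackGuard& operator=(const InitRollbackGuard&) = delete;
    InitRollbackGuard(InitRollbackGuard&& other) noexcept
        : mgr_(other.mgr_), release_fn_(std::move(other.release_fn_)),
          direct_ptrs_(std::move(other.direct_ptrs_)), committed_(other.committed_) {
        other.committed_ = true;  // the moved-from guard must not roll back
    }
    InitRollbackGuard& operator=(InitRollbackGuard&&) = delete;
    ~InitRollbackGuard() {
        if (committed_) return;  // success path: nothing to undo
        mgr_->release_all_owned(release_fn_);
        for (void* p : direct_ptrs_) release_fn_(p);
    }
    void add_direct_ptr(void* p) { direct_ptrs_.push_back(p); }
    void commit() { committed_ = true; }

private:
    Manager* mgr_;
    std::function<void(void*)> release_fn_;
    std::vector<void*> direct_ptrs_;
    bool committed_ = false;
};

// Toy init(): allocates `count` buffers and fails before allocating buffer
// number `fail_at` (pass -1 to never fail). Returns 0 on success, -1 on
// failure. `live` counts buffers that have not been released yet.
int toy_init(ToyManager& mgr, int count, int fail_at, int& live) {
    assert(count >= 0);
    InitRollbackGuard<ToyManager> guard(mgr, [&live](void* p) {
        std::free(p);
        --live;
    });
    for (int i = 0; i < count; ++i) {
        if (i == fail_at) return -1;  // guard rolls back on this path
        void* p = std::malloc(64);
        if (p == nullptr) return -1;
        ++live;
        mgr.register_mapping(p);
    }
    guard.commit();  // buffers stay owned by the manager
    return 0;
}
```

In this sketch a failed toy_init leaves the manager empty and `live` at zero, while a successful one leaves every buffer registered for the normal teardown path to release.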
…n, consolidate common/, rename platform/src→shared, merge aicpu_loader)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of duplicated and oddly-placed files that the recent unification work left behind. Five orthogonal changes bundled here because each touches the same set of CMakeLists; splitting would mean four more rebuild rounds and four more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------
Seven files per backend (14 total) were byte-identical (or differed only in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp, orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to pick them up from common/platform/{onboard,sim}/aicpu/. The device_malloc.cpp arch tag in its @brief was the only real content diff; generalized to "Real Hardware" / "Simulation" without the arch qualifier. Backfilled a copyright header that was missing on device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp) have real arch-specific divergence (register addresses, kernel protocols) and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------
src/common/platform/include/host/profiling_common/{buffer_pool_manager, profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager, profiler_base}.h. Updated 10 #include sites and the 2 header guards. The profiling_common:: C++ namespace stays; file path and namespace don't have to match.

3. Consolidate small src/common subdirs
----------------------------------------
- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h. The file is a generic bump-arena utility, not a comm primitive; the enclosing dir name was misleading. Updated 10 #include sites "device_arena.h" → "utils/device_arena.h" and dropped the common/device_comm entry from 8 CMakeLists (replaced with common since utils/ resolves there).
- src/common/sim_context/ → src/common/platform/sim/sim_context/. The dir is sim-only infrastructure (CPU sim context for CANN intrinsic emulation), so it belongs next to the other common/platform/sim/ shared sim infrastructure. Updated:
  * the dir's own CMakeLists relative path to log/include;
  * simpler_setup/runtime_compiler.py::compile_sim_context source path;
  * 4 sim-host CMakeLists references;
  * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------
Per-arch src/{arch}/platform/src/ was confusingly nested inside the top-level src/ directory and read as "src/src" in many paths. Renamed to shared/ across all 3 trees (a2a3, a5, common), matching its actual semantic ("shared between onboard and sim within one arch"). Updated 21 files that referenced the old path: CMakeLists, host headers, docs, one test file, and the src/{arch}/docs/platform.md map.

5. Merge host/load_aicpu_op + aicpu_dispatcher → aicpu_loader/{host,device}
---------------------------------------------------------------------------
src/common/host/ held a single file pair (load_aicpu_op.{h,cpp}): the host-side loader that uploads runtime AICPU SOs to CANN's preinstall path. src/common/aicpu_dispatcher/ held the AICPU-side bootstrap helper the loader actually drives. They are the two halves of one mechanism and the README of aicpu_dispatcher already cross-references the loader. Merged into src/common/aicpu_loader/{host,device}/ with the shared README at the top level. host/ holds load_aicpu_op.{h,cpp}, device/ holds aicpu_dispatcher.{h,cpp}. Updated:
  * 5 #include sites (4 device_runner.{cpp,h} bare includes plus device_runner_base.h's prefixed include) to the consistent aicpu_loader/host/load_aicpu_op.h form;
  * 4 CMakeLists source-path entries;
  * dropped the now-unneeded common/host and common/aicpu_dispatcher explicit include-path entries (common/ is already on the path);
  * updated the README cross-reference comment in load_aicpu_op.h.

Test plan
---------
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim) + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~37 file renames, ~14 files extracted to common, +0 / -0 behavioral changes (pure layout).
Summary
Pulls a2a3's and a5's profiler_base.h / buffer_pool_manager.h / profiling_copy.h into a single shared implementation under src/common/platform/include/host/, and extracts the buffer-pairing helper that every collector's init() used to inline.
Why
a2a3 and a5 host-side profiling stacks had diverged because:
- a2a3 is SVM-based (halHostRegister maps device pointers into host address space), so collectors directly read/write the device-side memory through the registered host view.
- a5 has no halHostRegister, so device↔host transfers go through rtMemcpy (onboard) or memcpy (sim) via profiling_copy.h, with a paired malloc'd host shadow that the mgmt loop mirrors per-tick.
The frameworks were ~47%/89% diff at the header level. This PR makes the choice between the two paths a runtime decision driven by what the collector installs in MemoryOps, rather than per-arch source code.
Framework changes
- MemoryOps adds optional copy_to_device / copy_from_device function fields. Non-SVM platforms install them; SVM platforms leave them null and the manager's internal null-check makes every mirror_* / *_range_* / copy_buffer_* method a single-call no-op on SVM.
- BufferPoolManager::set_memory_context now takes shm_dev + shm_host + shm_size + device_id. a2a3 collectors pass shm_size=0 (everything short-circuits at the shm_size_ == 0 early-return); a5 collectors pass the real triple.
- ProfilerBase::set_memory_context propagates the copy callbacks and the dev/host/size triple from the collector into the manager.
- ProfilerBase::start() picks the register fallback when the collector passes nullptr: identity-map on SVM (copy_to_device_ also null) or an inline host-shadow malloc lambda on non-SVM (copy_to_device_ installed). The previous a2a3 ProfilerBase hard-coded identity and the a5 one hard-coded host-shadow.
- ProfilerBase::alloc_paired_buffer(size, &host_out) is new and replaces a5's per-collector alloc_single_buffer and a2a3's inline if (register_cb != nullptr) {…} else {…} blocks. It picks among three paths from the stashed memory context: register_cb_ (a2a3 onboard halHostRegister), copy_to_device_ + malloc (a5 host-shadow), or identity-map (a2a3 sim / pre-init).
- BufferPoolManager gains a malloc_shadows_ set and an add_malloc_shadow() API. The framework malloc paths (the host-shadow register lambda in start(), the copy-to-device branch in alloc_paired_buffer) add the malloc'd shadow to the set; clear_mappings() / release_owned_buffers() / free_buffer() only std::free shadows that are in the set. The previous heuristic of "free unless host_ptr == dev_ptr" was fragile on a2a3 onboard, where halHostRegister produces a HAL-managed pointer that std::free must never touch. The alias check happened to skip it because DEV_SVM_MAP_HOST returns the same VA in practice, but that's an invariant the code shouldn't depend on. The new set-based check is exact.
Collector changes
- Collectors adopt the new set_memory_context signature. a2a3 collectors pass nullptr copy_* + identical shm_dev/shm_host. a5 collectors pass profiling_copy_*_for_ops shims (declared in the moved profiling_copy.h) + distinct shm_dev/host.
- The per-collector alloc_single_buffer private helpers (4 copies, ~30 lines each) are deleted. Init sites switched to the base's alloc_paired_buffer.
- a2a3 adds a profiling_copy.cpp stub (returns 0) so the framework's reference to profiling_copy_* resolves at link time. The stubs are never reached on a2a3 because the collectors never install the copy_to_device / copy_from_device callbacks; only the symbol-resolution edge of the build requires them.
Incidental fix: tensor-dump arena host-shadow leak
Pre-PR a5 TensorDumpCollector::alloc_single_buffer was the only variant that did NOT call manager_.register_mapping, so per-thread arena host shadows were missing from dev_to_host_ and never freed by clear_mappings(). Post-PR, alloc_paired_buffer always registers, the new malloc_shadows_ set tracks them, and clear_mappings() frees them. Net: arena shadows now released cleanly at collector teardown.
What remains divergent (deliberately)
The leaf collectors' init() bodies still carry per-arch state (a2a3 collectors keep shm_dev_ / shm_host_ / shm_size_ / shm_registered_ as members; a5 collectors store on the base + manager). Unifying those is the natural next step but requires lifecycle restructuring per collector that wants its own PR. The framework changes in this PR are the load-bearing pieces; leaf-collector .cpp files staying per-arch is a follow-up.
Test plan
Net: ~-820 lines (23 files touched).
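As a rough illustration of the three pairing paths and the set-based ownership check described under Framework changes, here is a minimal sketch. ToyPool, PairPath, RegisterFn and CopyFn are hypothetical names, the precedence order (register callback, then copy callback, then identity) is an assumption read off the description, and unlike the real alloc_paired_buffer this toy takes an already allocated device pointer instead of allocating one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <unordered_map>
#include <unordered_set>

// Hypothetical callback types; the real MemoryOps fields may look different.
using RegisterFn = std::function<void*(void* dev_ptr, std::size_t size)>;
using CopyFn = std::function<int(void* dst, const void* src, std::size_t size)>;

enum class PairPath { Registered, HostShadow, Identity };

struct ToyPool {
    RegisterFn register_cb;  // installed: SVM-style host registration
    CopyFn copy_to_device;   // installed: non-SVM, host shadow + explicit copies
    std::unordered_map<void*, void*> dev_to_host;
    std::unordered_set<void*> malloc_shadows;  // host pointers this pool malloc'd

    // Pairs a device buffer with a host view and reports which path was taken.
    PairPath pair(void* dev_ptr, std::size_t size, void** host_out) {
        assert(dev_ptr != nullptr && host_out != nullptr);
        PairPath path;
        if (register_cb) {
            *host_out = register_cb(dev_ptr, size);  // externally managed pointer
            path = PairPath::Registered;
        } else if (copy_to_device) {
            *host_out = std::malloc(size);           // pool-owned host shadow
            malloc_shadows.insert(*host_out);
            path = PairPath::HostShadow;
        } else {
            *host_out = dev_ptr;                     // identity map
            path = PairPath::Identity;
        }
        dev_to_host[dev_ptr] = *host_out;
        return path;
    }

    // Frees only host pointers recorded in malloc_shadows, never registered or
    // identity-mapped ones. Returns the number of shadows freed.
    std::size_t clear_mappings() {
        std::size_t freed = 0;
        for (auto& kv : dev_to_host) {
            if (malloc_shadows.erase(kv.second) > 0) {
                std::free(kv.second);
                ++freed;
            }
        }
        dev_to_host.clear();
        return freed;
    }
};
```

The point of the set is that the free decision no longer depends on whether the host pointer happens to equal the device pointer. A registered pointer is left alone even if it differs from dev_ptr, and only pointers the pool itself malloc'd are passed to std::free.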