Refactor: unify profiling collector framework across a2a3/a5 (-855 lines)#944

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:profiling-framework-unify on May 31, 2026

Conversation

@hw-native-sys-bot
Collaborator

@hw-native-sys-bot hw-native-sys-bot commented May 31, 2026

Summary

Pulls a2a3's and a5's profiler_base.h / buffer_pool_manager.h / profiling_copy.h into a single shared implementation under src/common/platform/include/host/, and extracts the buffer-pairing helper that every collector's init() used to inline.

Why

a2a3 and a5 host-side profiling stacks had diverged because:

  • a2a3 has SVM (halHostRegister maps device pointers into host address space), so collectors directly read/write the device-side memory through the registered host view.
  • a5 has no halHostRegister, so device↔host transfers go through rtMemcpy (onboard) or memcpy (sim) via profiling_copy.h, with a paired malloc'd host shadow that the mgmt loop mirrors per-tick.

At the header level, the two frameworks differed by roughly 47% and 89%. This PR makes the choice between the two paths a runtime decision, driven by which callbacks the collector installs in MemoryOps, rather than a per-arch fork in the source code.

Framework changes

  • MemoryOps adds optional copy_to_device / copy_from_device function fields. Non-SVM platforms install them; SVM platforms leave them null and the manager's internal null-check makes every mirror_* / *_range_* / copy_buffer_* method a single-call no-op on SVM.
  • BufferPoolManager::set_memory_context now takes shm_dev + shm_host + shm_size + device_id. a2a3 collectors pass shm_size=0 (everything short-circuits at the shm_size_ == 0 early-return); a5 collectors pass the real triple.
  • ProfilerBase::set_memory_context propagates the copy callbacks and the dev/host/size triple from the collector into the manager.
  • ProfilerBase::start() picks the register fallback when the collector passes nullptr: identity-map on SVM (copy_to_device_ also null) or an inline host-shadow malloc lambda on non-SVM (copy_to_device_ installed). The previous a2a3 ProfilerBase hard-coded identity and the a5 one hard-coded host-shadow.
  • ProfilerBase::alloc_paired_buffer(size, &host_out) is new and replaces a5's per-collector alloc_single_buffer and a2a3's inline if (register_cb != nullptr) {…} else {…} blocks. It picks among three paths from the stashed memory context: register_cb_ (a2a3 onboard halHostRegister), copy_to_device_ + malloc (a5 host-shadow), or identity-map (a2a3 sim / pre-init).
  • Explicit shadow-ownership tracking. BufferPoolManager gains a malloc_shadows_ set and an add_malloc_shadow() API. The framework malloc paths (the host-shadow register lambda in start(), the copy-to-device branch in alloc_paired_buffer) add the malloc'd shadow to the set; clear_mappings() / release_owned_buffers() / free_buffer() only std::free shadows that are in the set. The previous heuristic of "free unless host_ptr == dev_ptr" was fragile on a2a3 onboard, where halHostRegister produces a HAL-managed pointer that std::free must never touch — the alias check happened to skip it because DEV_SVM_MAP_HOST returns the same VA in practice, but that's an invariant the code shouldn't depend on. The new set-based check is exact.
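
The null-check behaviour described in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's code: `MemoryOpsSketch`, `MirrorSketch`, `mirror_from_device`, and the callback signature are hypothetical stand-ins; only the field names copy_to_device / copy_from_device and the shm_size == 0 early-return come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical callback signature; the real one in the framework may differ.
using CopyFn = int (*)(void *dst, const void *src, std::size_t size);

// Stand-in for the optional copy fields that MemoryOps gains in this PR.
struct MemoryOpsSketch {
    CopyFn copy_to_device = nullptr;    // left null on SVM platforms
    CopyFn copy_from_device = nullptr;  // left null on SVM platforms
};

// Stand-in for the manager state needed by a single mirror call.
struct MirrorSketch {
    MemoryOpsSketch ops;
    void *shm_dev = nullptr;
    void *shm_host = nullptr;
    std::size_t shm_size = 0;

    // Pulls the device region into the host shadow.
    // Returns 0 on success and also when the call is a no-op.
    int mirror_from_device() const {
        // SVM: no callback installed, the host view already aliases device memory.
        if (ops.copy_from_device == nullptr || shm_size == 0) {
            return 0;
        }
        assert(shm_dev != nullptr && shm_host != nullptr);
        return ops.copy_from_device(shm_host, shm_dev, shm_size);
    }
};
```

With the callbacks left null, the call returns immediately, so an SVM collector can invoke the same mirror method every tick without paying for a copy. A non-SVM collector installs a real copy function and gets the shadow refreshed.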

Collector changes

  • Both arches' 4 profiling collectors (l2_swimlane, tensor_dump, pmu, scope_stats) updated to pass the new 9-arg set_memory_context signature. a2a3 collectors pass nullptr copy_* + identical shm_dev/shm_host. a5 collectors pass profiling_copy_*_for_ops shims (declared in the moved profiling_copy.h) + distinct shm_dev/host.
  • a5's per-collector alloc_single_buffer private helpers (4 copies, ~30 lines each) are deleted. Init sites switched to the base's alloc_paired_buffer.
  • a2a3 onboard + sim now compile a tiny profiling_copy.cpp stub (returns 0) so the framework's reference to profiling_copy_* resolves at link time. The stubs are never reached on a2a3 because the collectors never install the copy_to_device / copy_from_device callbacks; only the symbol-resolution edge of the build requires them.
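
As a rough illustration of the 9-arg call, the sketch below follows the argument order quoted in the review comments on this PR (alloc, register, free, copy_to, copy_from, shm_dev, shm_host, shm_size, device_id). Everything else is an assumption made for the example: the callback typedefs, the `ContextSketch` struct, the `is_svm` helper, the pairing assert, and `memcpy_shim`, which stands in for the profiling_copy_*_for_ops shims.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical callback signatures; the real typedefs may differ.
using AllocFn = void *(*)(std::size_t size);
using RegisterFn = int (*)(void *dev_ptr, std::size_t size, int device_id, void **host_out);
using FreeFn = void (*)(void *ptr);
using CopyFn = int (*)(void *dst, const void *src, std::size_t size);

// Stand-in for the context the base class stashes.
struct ContextSketch {
    AllocFn alloc_cb = nullptr;
    RegisterFn register_cb = nullptr;
    FreeFn free_cb = nullptr;
    CopyFn copy_to_device = nullptr;
    CopyFn copy_from_device = nullptr;
    void *shm_dev = nullptr;
    void *shm_host = nullptr;
    std::size_t shm_size = 0;
    int device_id = -1;

    void set_memory_context(AllocFn alloc, RegisterFn reg, FreeFn free_fn, CopyFn copy_to,
                            CopyFn copy_from, void *dev, void *host, std::size_t size, int dev_id) {
        // Assumption: the two copy callbacks are installed as a pair or not at all.
        assert((copy_to == nullptr) == (copy_from == nullptr));
        alloc_cb = alloc;
        register_cb = reg;
        free_cb = free_fn;
        copy_to_device = copy_to;
        copy_from_device = copy_from;
        shm_dev = dev;
        shm_host = host;
        shm_size = size;
        device_id = dev_id;
    }

    // SVM platforms are the ones that leave the copy callbacks null.
    bool is_svm() const { return copy_to_device == nullptr; }
};

inline void *host_alloc(std::size_t size) { return std::malloc(size); }
inline void host_free(void *ptr) { std::free(ptr); }

// Sim-style copy shim: a plain memcpy that always reports success.
inline int memcpy_shim(void *dst, const void *src, std::size_t size) {
    std::memcpy(dst, src, size);
    return 0;
}
```

An a2a3-style call would look like `ctx.set_memory_context(host_alloc, nullptr, host_free, nullptr, nullptr, region, region, 0, 0)`, while an a5-style call passes `memcpy_shim` twice, distinct device and host pointers, and the real size.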

Incidental fix: tensor-dump arena host-shadow leak

Pre-PR a5 TensorDumpCollector::alloc_single_buffer was the only variant that did NOT call manager_.register_mapping, so per-thread arena host shadows were missing from dev_to_host_ and never freed by clear_mappings(). Post-PR, alloc_paired_buffer always registers, the new malloc_shadows_ set tracks them, and clear_mappings() frees them. Net: arena shadows now released cleanly at collector teardown.
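
The set-based ownership rule can be sketched as below. This is an illustration only, assuming a `ShadowRegistry` class invented for the example; the member names dev_to_host_ and malloc_shadows_ and the methods register_mapping, add_malloc_shadow, and clear_mappings mirror the names in this description, but their real signatures and container types are not shown in the PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <unordered_set>

class ShadowRegistry {
public:
    // Records the host view paired with a device buffer.
    void register_mapping(void *dev, void *host) { dev_to_host_[dev] = host; }

    // Marks a host pointer as framework-malloc'd, so the registry may free it.
    void add_malloc_shadow(void *host) {
        assert(host != nullptr);
        malloc_shadows_.insert(host);
    }

    // Drops all mappings and frees only shadows present in the ownership set.
    // Returns the number of shadows freed.
    std::size_t clear_mappings() {
        std::size_t freed = 0;
        for (const auto &entry : dev_to_host_) {
            void *host = entry.second;
            // erase() returns 1 only for owned shadows, which also prevents double frees.
            if (malloc_shadows_.erase(host) > 0) {
                std::free(host);
                ++freed;
            }
        }
        dev_to_host_.clear();
        return freed;
    }

private:
    std::unordered_map<void *, void *> dev_to_host_;
    std::unordered_set<void *> malloc_shadows_;
};
```

A host view that was produced by a registration call and never added to the set is left untouched, whether or not its address happens to equal the device pointer.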

What remains divergent (deliberately)

The leaf collectors' init() bodies still carry per-arch state (a2a3 collectors keep shm_dev_ / shm_host_ / shm_size_ / shm_registered_ as members; a5 collectors store that state on the base and the manager). Unifying those is the natural next step, but it requires per-collector lifecycle restructuring that deserves its own PR. The framework changes in this PR are the load-bearing pieces; unifying the per-arch leaf-collector .cpp files is left as a follow-up.

Test plan

  • a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
  • a5sim ST L1+L2: 22/22 pass (devices 0,1)
  • a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass — confirms the SVM path through the unified framework matches the previous behavior end-to-end
  • CI green

Net: ~-820 lines (23 files touched).

@coderabbitai

coderabbitai Bot commented May 31, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 29515ab1-bd43-46c4-adef-9e37d0d35ba7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR generalizes the profiling infrastructure to support both SVM and non-SVM platforms through explicit buffer ownership tracking, architecture-aware copy callbacks, and centralized paired buffer allocation in ProfilerBase. The a5-specific buffer pool manager moves to common platform, buffer ownership detection shifts from pointer comparison to an explicit malloc_shadows_ set, and all collectors delegate allocation to the base class instead of maintaining local allocator methods.

Changes

Cross-Platform Profiling Infrastructure Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Profiling Copy Hook Infrastructure | src/common/platform/include/host/profiling_copy.h, src/a2a3/platform/onboard/host/profiling_copy.cpp, src/a2a3/platform/sim/host/profiling_copy.cpp, src/a2a3/platform/onboard/host/CMakeLists.txt, src/a2a3/platform/sim/host/CMakeLists.txt | New platform-specific profiling copy header and implementations. Declares profiling_copy_to_device and profiling_copy_from_device functions with inline operation wrappers. A2A3 implementations are no-op stubs (return 0) since SVM platforms do not require device copies. |
| Buffer Pool Manager Generalization | src/common/platform/include/host/profiling_common/buffer_pool_manager.h | Relocates from a5-specific to common platform directory. Replaces pointer-comparison ownership heuristic (host_ptr != dev_ptr) with explicit malloc_shadows_ set tracking. Methods release_owned_buffers(), clear_mappings(), and free_buffer() now check set membership to distinguish framework-allocated from HAL-managed shadows. Adds public add_malloc_shadow() method. Documentation rewritten to explain SVM vs non-SVM buffer strategies. |
| ProfilerBase: Copy Callbacks and Paired Buffer Allocation | src/common/platform/include/host/profiling_common/profiler_base.h | Extends public set_memory_context() signature to accept copy_to_device and copy_from_device callbacks before memory pointers. Adds private copy_to_device_ and copy_from_device_ members. Implements alloc_paired_buffer() to allocate device buffers with paired host views, routing through copy_to_device_ for non-SVM paths. Adds non-SVM host-shadow registration lambda in start() when copy callbacks are provided. Updates clear_memory_context() to null copy callbacks. |
| A5 Collector Refactoring | src/a5/platform/include/host/*_collector.h, src/a5/platform/src/host/*_collector.cpp, src/a5/platform/onboard/host/device_runner.cpp | Removes private alloc_single_buffer() methods from PMU, L2 Swimlane, Scope Stats, and Tensor Dump collectors. All buffer allocations (shared memory, address tables, per-core/per-thread buffers) now use ProfilerBase::alloc_paired_buffer(). Adds profiling_copy_to_device_for_ops and profiling_copy_from_device_for_ops to both set_memory_context() calls. Removes manual register_mapping() calls since alloc_paired_buffer() handles registration. |
| A2A3 Collector Memory Context Wiring | src/a2a3/platform/src/host/*_collector.cpp | Updates all a2a3 platform collectors to call expanded set_memory_context() signature with copy callbacks set to nullptr (SVM path), explicit shm_dev/shm_host/shm_size parameters. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/simpler#858: Prior refactoring of profiling framework infrastructure that this PR extends with SVM-agnostic ownership tracking and copy callback support for non-SVM platforms.

Poem

🐰 Buffers now wear name tags clear,
No guessing shadows far or near!
ProfilerBase allocates paired views,
Copy callbacks light the way through—
A5 and a2a3, hand in hand, pursue. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.48% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main refactoring effort: unifying the profiling collector framework across a2a3/a5 architectures into a shared implementation, with a quantifiable line reduction. |
| Description check | ✅ Passed | The PR description comprehensively explains the motivation (a2a3/a5 divergence), framework changes, collector updates, testing, and remaining work. It is clearly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request unifies the profiling framework across SVM (a2a3) and non-SVM (a5) architectures by moving "buffer_pool_manager.h" and "profiler_base.h" to a shared common directory. It refactors "ProfilerBase" to support both memory models at runtime, introducing a unified "alloc_paired_buffer" helper that replaces duplicate "alloc_single_buffer" implementations in individual collectors. Additionally, it adds stub implementations of "profiling_copy" for "a2a3" to satisfy linker requirements. No review comments were provided, and the implementation appears clean and well-structured, so I have no further feedback.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a5/platform/src/host/l2_swimlane_collector.cpp (1)

87-110: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unwind paired allocations on init failure.

Once alloc_paired_buffer() succeeds here, the buffer and its host shadow are registered in manager_. Any later return -1 in initialize() leaks those allocations because finalize() short-circuits while shm_host_ == nullptr. Add an init-scope cleanup guard or explicit unwind for all earlier paired allocations before each failure return.
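
One way to realize the init-scope cleanup guard suggested above is a small rollback object that runs undo actions in reverse order unless init completes. The sketch below is hypothetical and not taken from the repository: `InitRollback`, `on_failure`, `commit`, and `init_sketch` are names invented for illustration, and a plain counter stands in for the paired allocations.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Runs the registered undo actions in reverse order unless commit() was called.
class InitRollback {
public:
    InitRollback() = default;
    InitRollback(const InitRollback &) = delete;
    InitRollback &operator=(const InitRollback &) = delete;

    ~InitRollback() {
        if (!armed_) {
            return;
        }
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) {
            (*it)();
        }
    }

    // Registers an action that releases one resource if init fails later.
    void on_failure(std::function<void()> undo) {
        assert(static_cast<bool>(undo));
        undo_.push_back(std::move(undo));
    }

    // Call once init has fully succeeded; the normal teardown path takes over.
    void commit() { armed_ = false; }

private:
    std::vector<std::function<void()>> undo_;
    bool armed_ = true;
};

// Hypothetical init with two allocations; the second one may fail.
inline int init_sketch(bool second_ok, int &live_allocations) {
    InitRollback rollback;

    ++live_allocations;  // stands in for the first paired allocation
    rollback.on_failure([&live_allocations] { --live_allocations; });

    if (!second_ok) {
        return -1;  // the guard unwinds the first allocation here
    }
    ++live_allocations;  // stands in for the second paired allocation
    rollback.on_failure([&live_allocations] { --live_allocations; });

    rollback.commit();
    return 0;
}
```

With this shape every early `return -1` releases whatever was allocated so far, and finalize() remains the single cleanup path once init has succeeded.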

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/src/host/l2_swimlane_collector.cpp` around lines 87 - 110,
The alloc_paired_buffer call registers the paired device/host buffer in manager_
but later early returns in initialize() leak that allocation because finalize()
is a no-op when shm_host_==nullptr; wrap the init path after alloc_paired_buffer
(and any prior paired allocations) with an init-scope cleanup guard (RAII) or
explicitly unwind the paired allocations from manager_ on every error return:
ensure you undo the registration for perf_dev_ptr/perf_host_ptr (the entries
added by alloc_paired_buffer) before returning -1, or set shm_host_
appropriately so finalize() will free them; update code around
set_memory_context, alloc_paired_buffer, perf_dev_ptr/perf_host_ptr,
initialize(), finalize(), and manager_ usage to perform the cleanup on all
failure paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/src/host/dep_gen_collector.cpp`:
- Around line 117-120: The call to set_memory_context currently passes shm_host_
twice and hardcodes shm_size to 0, leaving the profiler with incorrect SHM
metadata; update the call in dep_gen_collector (the set_memory_context
invocation that currently uses alloc_cb, register_cb, free_cb,
/*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/shm_host_, shm_host_,
/*shm_size=*/0, device_id) to pass the actual shm_dev_, shm_host_, and shm_size_
members (shm_dev_, shm_host_, shm_size_) so the registered path receives correct
shared-memory tuple information.

In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp`:
- Around line 121-124: The initial set_memory_context call seeds the callbacks
but leaves SHM fields as nullptr/0 so ProfilerBase sees an empty SHM context;
after the SHM allocation succeeds and perf_dev_ptr, perf_host_ptr (and the SHM
size) are known, call set_memory_context again with the same alloc_cb,
register_cb, free_cb and the actual shm_dev (perf_dev_ptr), shm_host
(perf_host_ptr), shm_size, and device_id to refresh the memory context used by
ProfilerBase.

In `@src/a2a3/platform/src/host/pmu_collector.cpp`:
- Around line 152-155: The call to set_memory_context is passing shm_host_ twice
and 0 for size; replace those placeholders with the actual PMU SHM metadata that
init() computed — pass the real SHM device pointer, the host pointer
(shm_host_), and the real SHM size (e.g., shm_size or shm_size_) instead of the
duplicate shm_host_ and 0 so set_memory_context(alloc_cb, register_cb, free_cb,
/*copy_to=*/nullptr, /*copy_from=*/nullptr, /*shm_dev=*/<actual_shm_device_ptr>,
shm_host_, /*shm_size=*/<actual_shm_size>, device_id).

In `@src/a2a3/platform/src/host/scope_stats_collector.cpp`:
- Around line 116-119: The call to set_memory_context is passing placeholders
for device pointer and size; update the call in scope_stats_collector.cpp to
forward the actual initialized shared-memory state by passing shm_dev_ as the
device pointer, shm_host_ as the host pointer, and shm_size_ as the size (keep
alloc_cb, register_cb, free_cb and device_id as-is) so the base profiler
receives the correct memory context.

In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 75-78: The initial call to set_memory_context(alloc_cb,
register_cb, free_cb, /*copy_to=*/nullptr, /*copy_from=*/nullptr,
/*shm_dev=*/nullptr, /*shm_host=*/nullptr, /*shm_size=*/0, device_id) seeds the
profiler with empty SHM fields and is never updated; after dump_shared_mem_dev_
and shm_host_ (and shm_size_) are initialized you must call set_memory_context
again to provide the real SHM pointers and size so the base profiler uses the
correct shared memory context. Locate where dump_shared_mem_dev_ and shm_host_
are assigned and invoke set_memory_context with the same
alloc_cb/register_cb/free_cb and the actual copy_to/copy_from (if available) and
the initialized shm_dev = dump_shared_mem_dev_, shm_host = shm_host_, shm_size =
dump_shm_size (and device_id) to overwrite the previously seeded empty context
(ensure you reference set_memory_context, dump_shared_mem_dev_, shm_host_, and
device_id).

In `@src/a5/platform/src/host/pmu_collector.cpp`:
- Around line 68-79: The PMU init path can leak resources because
alloc_paired_buffer() registers ownership in manager_ before initialized_ is set
and later failures return without finalize(); add a local rollback guard in
init() after alloc_paired_buffer(shm_size, &shm_host_local) (and after any
subsequent allocation that can fail) to free the paired SHM/device buffer,
unregister the ownership from manager_, and free any malloc-shadow allocations;
implement this as a small scope guard/RAII object (or explicit cleanup block)
that is cancelled when init() completes successfully or when initialized_
becomes true so finalize() remains the single normal cleanup path.

In `@src/a5/platform/src/host/scope_stats_collector.cpp`:
- Around line 65-76: After alloc_paired_buffer returns a valid shm_dev_local and
shm_host_local but the function later fails and returns -1 while initialized_ is
still false, you must rollback the partial init: explicitly release the device
buffer and unregister/free the host shadow and reset local pointers before
returning instead of relying on finalize(); call the same
deallocation/unregister routines used elsewhere (e.g. the free callback and host
unregister path associated with set_memory_context/alloc_paired_buffer) to free
shm_dev_local and unregister or free shm_host_local, set
shm_dev_local/shm_host_local to nullptr, and ensure no ownership is left behind
prior to returning.

In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 60-73: The init path registers paired allocations early (via
set_memory_context and alloc_paired_buffer) but lacks unwind on subsequent
allocation failures, leaking device buffers/malloc shadows; modify the init to
track shm_dev_local/shm_host_local returned by alloc_paired_buffer and wrap the
remaining allocation steps (e.g., further arena/meta-buffer allocations after
calc_dump_data_size/alloc_paired_buffer) in an error-path guard that on any
failure frees the paired allocation (use the paired free API you have for
alloc_paired_buffer), resets/shrinks any changed state (shm_host_/shm_dev_ or
related context), and only then returns an error, or alternatively defer
registering the memory context until after all allocations succeed; reference
set_memory_context, alloc_paired_buffer, shm_host_, shm_dev_local, finalize()
when making the change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e4a61483-9e3f-4447-a298-d3d109a6c1c2

📥 Commits

Reviewing files that changed from the base of the PR and between 6ebcdc8 and bd8cce5.

📒 Files selected for processing (23)
  • src/a2a3/platform/include/host/profiling_common/buffer_pool_manager.h
  • src/a2a3/platform/include/host/profiling_common/profiler_base.h
  • src/a2a3/platform/onboard/host/CMakeLists.txt
  • src/a2a3/platform/onboard/host/profiling_copy.cpp
  • src/a2a3/platform/sim/host/CMakeLists.txt
  • src/a2a3/platform/sim/host/profiling_copy.cpp
  • src/a2a3/platform/src/host/dep_gen_collector.cpp
  • src/a2a3/platform/src/host/l2_swimlane_collector.cpp
  • src/a2a3/platform/src/host/pmu_collector.cpp
  • src/a2a3/platform/src/host/scope_stats_collector.cpp
  • src/a2a3/platform/src/host/tensor_dump_collector.cpp
  • src/a5/platform/include/host/l2_swimlane_collector.h
  • src/a5/platform/include/host/pmu_collector.h
  • src/a5/platform/include/host/scope_stats_collector.h
  • src/a5/platform/include/host/tensor_dump_collector.h
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/platform/src/host/l2_swimlane_collector.cpp
  • src/a5/platform/src/host/pmu_collector.cpp
  • src/a5/platform/src/host/scope_stats_collector.cpp
  • src/a5/platform/src/host/tensor_dump_collector.cpp
  • src/common/platform/include/host/profiling_common/buffer_pool_manager.h
  • src/common/platform/include/host/profiling_common/profiler_base.h
  • src/common/platform/include/host/profiling_copy.h
💤 Files with no reviewable changes (3)
  • src/a2a3/platform/include/host/profiling_common/profiler_base.h
  • src/a2a3/platform/include/host/profiling_common/buffer_pool_manager.h
  • src/a5/platform/include/host/scope_stats_collector.h

…nes)

Pulls a2a3's and a5's `profiler_base.h` / `buffer_pool_manager.h` /
`profiling_copy.h` into a single shared implementation under
`src/common/platform/include/host/`, and extracts the buffer-pairing
helper that every collector's init() used to inline.

Background
----------

a2a3 and a5 host-side profiling stacks had diverged because:

- a2a3 has SVM (halHostRegister maps device pointers into host address
  space), so collectors directly read/write the device-side memory
  through the registered host view.
- a5 has no halHostRegister, so device↔host transfers go through
  rtMemcpy (onboard) or memcpy (sim) via profiling_copy.h, with a
  paired malloc'd host shadow that the mgmt loop mirrors per-tick.

The frameworks were ~47%/89% diff at the header level. This PR makes
the choice between the two paths a runtime decision driven by what the
collector installs in MemoryOps, rather than per-arch source code.

Framework changes
-----------------

- `MemoryOps` adds optional `copy_to_device` / `copy_from_device`
  function fields. Non-SVM platforms install them; SVM platforms leave
  them null and the manager's internal null-check makes every
  mirror_/range_/copy_buffer_ method a single-call no-op on SVM.

- `BufferPoolManager::set_memory_context` now takes shm_dev + shm_host
  + shm_size + device_id. SVM platforms pass `shm_dev == shm_host` so
  range/mirror operations bounds-check successfully but never copy.

- `ProfilerBase::set_memory_context` propagates the copy callbacks and
  the dev/host/size triple from the collector into the manager.

- `ProfilerBase::start()` picks the register fallback when the
  collector passes nullptr: identity-map on SVM (copy_to_device_ also
  null) or `default_host_shadow_register` on non-SVM (copy_to_device_
  installed). The previous a2a3 ProfilerBase hard-coded identity and
  the a5 one hard-coded host-shadow.

- `ProfilerBase::alloc_paired_buffer(size, &host_out)` is new and
  replaces a5's per-collector `alloc_single_buffer` and a2a3's inline
  `if (register_cb != nullptr) {…} else {…}` blocks. It picks among
  three paths from the stashed memory context: register_cb_ (a2a3
  onboard halHostRegister), copy_to_device_ + malloc (a5 host-shadow),
  or identity-map (a2a3 sim / pre-init).
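
The three-way choice made by `alloc_paired_buffer` can be sketched roughly as below. This is a simplified illustration, not the framework's implementation: `MemoryContext`, `PairedPath`, `pick_path`, `demo_pick_path`, and the callback signatures are assumptions made for the example. Only the priority order (register callback, then copy callback plus malloc shadow, then identity-map) follows the bullet above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical callback signatures; the real ones may differ.
using RegisterFn = int (*)(void *dev_ptr, std::size_t size, int device_id, void **host_out);
using CopyFn = int (*)(void *dst, const void *src, std::size_t size);

// Stand-in for the memory context stashed by set_memory_context().
struct MemoryContext {
    RegisterFn register_cb = nullptr;
    CopyFn copy_to_device = nullptr;
};

enum class PairedPath { Register, HostShadow, Identity };

// Chooses how the host view of a freshly allocated device buffer is obtained.
inline PairedPath pick_path(const MemoryContext &ctx) {
    if (ctx.register_cb != nullptr) {
        return PairedPath::Register;    // a2a3 onboard: register the device pointer
    }
    if (ctx.copy_to_device != nullptr) {
        return PairedPath::HostShadow;  // a5: malloc a shadow and mirror through copies
    }
    return PairedPath::Identity;        // a2a3 sim / pre-init: host view == device pointer
}

// Small self-check that walks through the three configurations.
inline void demo_pick_path() {
    MemoryContext ctx;
    assert(pick_path(ctx) == PairedPath::Identity);

    ctx.copy_to_device = [](void *, const void *, std::size_t) { return 0; };
    assert(pick_path(ctx) == PairedPath::HostShadow);

    ctx.register_cb = [](void *, std::size_t, int, void **) { return 0; };
    assert(pick_path(ctx) == PairedPath::Register);
}
```

Keeping the decision in one function is what lets the collectors drop their per-arch branches: they call a single helper and the installed callbacks determine the path.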

Collector changes
-----------------

- Both arches' 4 profiling collectors (l2_swimlane, tensor_dump, pmu,
  scope_stats) updated to pass the new 9-arg `set_memory_context`
  signature. a2a3 collectors pass nullptr copy_* + identical shm_dev
  and shm_host. a5 collectors pass profiling_copy_*_for_ops shims
  (declared in the moved profiling_copy.h) + distinct shm_dev/host.

- a5's per-collector `alloc_single_buffer` private helpers (4 copies,
  ~30 lines each) are deleted. Init sites switched to the base's
  `alloc_paired_buffer`.

- a2a3 onboard + sim now compile a tiny `profiling_copy.cpp` stub
  (returns 0) so the framework's reference to profiling_copy_*
  resolves at link time. The stubs are never reached on a2a3 because
  the collectors never install the copy_to_device / copy_from_device
  callbacks; only the symbol-resolution edge of the build requires
  them.

What remains divergent (deliberately)
-------------------------------------

The leaf collectors' init() bodies still carry per-arch state (a2a3
collectors keep `shm_dev_` / `shm_host_` / `shm_size_` / `shm_registered_`
as members; a5 collectors store on the base + manager). Unifying those
is the natural next step but requires lifecycle restructuring per
collector that wants its own PR. The framework changes in this PR are
the load-bearing pieces — leaf-collector .cpp files staying per-arch
is a follow-up.

Test plan
---------

- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass
  — confirms the SVM path through the unified framework matches the
  previous behavior end-to-end

Net: -855 lines (23 files touched).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
@ChaoWao ChaoWao force-pushed the profiling-framework-unify branch from bd8cce5 to a17cafe Compare May 31, 2026 11:26
@ChaoWao ChaoWao merged commit d6ee27b into hw-native-sys:main May 31, 2026
16 checks passed
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
…lines)

Builds on hw-native-sys#944 (profiling framework unification). With the framework's
`alloc_paired_buffer` + `set_memory_context` doing the SVM-vs-shadow
disambiguation, the per-arch `scope_stats_collector` and
`tensor_dump_collector` implementations converged to identical code
(modulo header guards and arch-name comments). Move both pairs to
`src/common/platform/{include,src}/host/`.

Files moved
-----------

- `scope_stats_collector.{h,cpp}` — a5's version is canonical; a2a3's
  identical copy deleted. The two files differed only in:
    * arch-name comments (a5's "a5 specifics" generalized to "SVM vs
      non-SVM" paragraph applicable to both arches)
    * header guard names (SRC_A5_… → SRC_COMMON_…)
    * a2a3 carried 3 extra members (`shm_registered_`, `shm_size_`,
      `buffers_registered_`) that were only there to gate
      `release_one_buffer(p, *_registered_ ? unregister_cb : nullptr,
      free_cb)` calls. The gate is redundant — callers already pair
      `register_cb` and `unregister_cb` consistently (a2a3 onboard
      passes both halHostRegister + halHostUnregister; a2a3 sim and
      a5 pass nullptr for both). Dropped along with the duplicate
      implementation.

- `tensor_dump_collector.{h,cpp}` — same pattern, same dedup.

CMakeLists updates
------------------

The 4 host CMakeLists (a2a3 onboard/sim + a5 onboard/sim) updated to
pull both collectors from `common/platform/src/host/` instead of
their per-arch `src/host/` directories.

What stays per-arch (deliberate, out of scope for this PR)
----------------------------------------------------------

- `pmu_collector.{h,cpp}` — `pmu_resolve_event_config_a2a3` /
  `pmu_resolve_event_config_a5` are arch-specific PMU event tables
  rooted in silicon revision differences; a5 also carries a
  `PmuAicoreRing` infrastructure (with `aicore_ring_ptr` on
  `PmuBufferState`) that a2a3 doesn't have. Cannot be reconciled
  without changing the on-device protocol layout.
- `l2_swimlane_collector.{h,cpp}` — a5 carries `SCHED_PHASE_COUNT`
  enum value, `L2SwimlaneAicoreRing` struct + `aicore_ring_ptr`
  member, `mismatch_record_count` and `fanout_count` fields that
  a2a3 doesn't have. Same on-device protocol divergence.

Both could potentially be unified later by first reconciling the
on-device data structures, but that's an aicpu/aicore protocol
refactor — much heavier than this purely-host refactor.

Test plan
---------

- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass
  — confirms the unified scope_stats + tensor_dump go through the
  SVM path on a2a3 correctly, AND that l2_swimlane / pmu (still
  per-arch) keep working against the unified framework.

Net: -1466 lines on this branch (12 files touched, no behavior
change).

Stacks on top of hw-native-sys#944.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
ChaoWao added a commit that referenced this pull request May 31, 2026
…lines) (#945)

Builds on #944 (profiling framework unification). With the framework's
`alloc_paired_buffer` + `set_memory_context` doing the SVM-vs-shadow
disambiguation, the per-arch `scope_stats_collector` and
`tensor_dump_collector` implementations converged to identical code
(modulo header guards and arch-name comments). Move both pairs to
`src/common/platform/{include,src}/host/`.

Files moved
-----------

- `scope_stats_collector.{h,cpp}` — a5's version is canonical; a2a3's
  identical copy deleted. The two files differed only in:
    * arch-name comments (a5's "a5 specifics" generalized to "SVM vs
      non-SVM" paragraph applicable to both arches)
    * header guard names (SRC_A5_… → SRC_COMMON_…)
    * a2a3 carried 3 extra members (`shm_registered_`, `shm_size_`,
      `buffers_registered_`) that were only there to gate
      `release_one_buffer(p, *_registered_ ? unregister_cb : nullptr,
      free_cb)` calls. The gate is redundant — callers already pair
      `register_cb` and `unregister_cb` consistently (a2a3 onboard
      passes both halHostRegister + halHostUnregister; a2a3 sim and
      a5 pass nullptr for both). Dropped along with the duplicate
      implementation.

- `tensor_dump_collector.{h,cpp}` — same pattern, same dedup.

CMakeLists updates
------------------

The 4 host CMakeLists (a2a3 onboard/sim + a5 onboard/sim) now pull
both collectors from `common/platform/src/host/` instead of their
per-arch `src/host/` directories.

What stays per-arch (deliberate, out of scope for this PR)
----------------------------------------------------------

- `pmu_collector.{h,cpp}` — `pmu_resolve_event_config_a2a3` /
  `pmu_resolve_event_config_a5` are arch-specific PMU event tables
  rooted in silicon revision differences; a5 also carries a
  `PmuAicoreRing` infrastructure (with `aicore_ring_ptr` on
  `PmuBufferState`) that a2a3 doesn't have. Cannot be reconciled
  without changing the on-device protocol layout.
- `l2_swimlane_collector.{h,cpp}` — a5 carries `SCHED_PHASE_COUNT`
  enum value, `L2SwimlaneAicoreRing` struct + `aicore_ring_ptr`
  member, `mismatch_record_count` and `fanout_count` fields that
  a2a3 doesn't have. Same on-device protocol divergence.

Both could potentially be unified later by first reconciling the
on-device data structures, but that's an aicpu/aicore protocol
refactor — much heavier than this purely-host refactor.

Test plan
---------

- a2a3sim ST L1+L2: 38/38 pass (devices 0,1)
- a5sim ST L1+L2: 22/22 pass (devices 0,1)
- a2a3sim DFX (l2_swimlane, scope_stats, tensor_dump, pmu): 8/8 pass
  — confirms the unified scope_stats + tensor_dump go through the
  SVM path on a2a3 correctly, AND that l2_swimlane / pmu (still
  per-arch) keep working against the unified framework.

Net: -1466 lines on this branch (12 files touched, no behavior
change).

Stacks on top of #944.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
Plug the latent leak CodeRabbit flagged on hw-native-sys#944: a5 PmuCollector::init,
ScopeStatsCollector::init, TensorDumpCollector::init register paired
device+host buffers in BufferPoolManager via alloc_paired_buffer BEFORE
flipping the initialized_ flag. If a subsequent allocation fails, init()
returns -1; finalize() then early-exits (gated on initialized_ / shm_host_)
and every registered device buffer + framework-malloc'd host shadow leaks.
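
The shape of the leak can be reproduced with a toy collector. Everything in this sketch (`ToyCollector`, `fail_at`, `release_all`, plain `malloc` standing in for device allocation) is hypothetical; it only mirrors the ordering problem described above, where buffers are tracked before `initialized_` is set and `finalize()` is gated on that flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy collector reproducing the leak shape. Not the real collector code.
class ToyCollector {
public:
    // fail_at: index of the allocation that should fail (-1 = never).
    int init(int buffer_count, int fail_at) {
        for (int i = 0; i < buffer_count; ++i) {
            if (i == fail_at) return -1;   // early return: initialized_ stays false
            void *buf = std::malloc(64);
            if (buf == nullptr) return -1;
            tracked_.push_back(buf);       // tracked BEFORE initialized_ is set
        }
        initialized_ = true;
        return 0;
    }

    void finalize() {
        if (!initialized_) return;         // gate that skips cleanup after a failed init
        release_all();
        initialized_ = false;
    }

    // What a rollback path has to do regardless of initialized_.
    void release_all() {
        for (void *buf : tracked_) std::free(buf);
        tracked_.clear();
    }

    std::size_t live_buffers() const { return tracked_.size(); }

private:
    std::vector<void *> tracked_;
    bool initialized_ = false;
};

// Number of buffers still tracked after init() + finalize().
inline std::size_t live_after_init_and_finalize(int buffer_count, int fail_at) {
    ToyCollector collector;
    collector.init(buffer_count, fail_at);
    collector.finalize();
    std::size_t live = collector.live_buffers();
    collector.release_all();  // keep the demo itself leak-free
    return live;
}
```

In this toy, a failure on the third of four allocations leaves two buffers alive after `finalize()`, while a successful `init()` leaves none.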

The pattern existed pre-hw-native-sys#944 (old alloc_single_buffer + manual
register_mapping had the same shape), so this is not a regression of the
unification work — just the cleanup it enables.

Framework changes
-----------------

- BufferPoolManager::release_all_owned(release_fn) [new]: abort-path
  cleanup that releases EVERY framework-tracked dev_ptr (via release_fn)
  and every framework-malloc'd host shadow (via std::free), then clears
  all internal containers. Distinct from release_owned_buffers() because
  this also catches buffers parked in callers' SPSC free_queues (tracked
  via register_mapping but not framework-owned via a queue). Drains
  recycled/done/ready first (just clears — release goes via dev_to_host_
  to avoid double-free) then walks the full mapping table.

- profiling_common::InitRollbackGuard<Manager> [new, profiler_base.h]:
  RAII scope guard for collector init() rollback. Holds a manager
  reference + release_fn + a vector of "extra direct dev_ptrs" the
  collector owns outside the framework (e.g. PMU per-core PmuAicoreRings
  on a5 — plain alloc_cb allocations with no host shadow). On destruction
  without commit(), calls manager.release_all_owned + free_cb on each
  direct ptr. Move-only.
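
The double-free concern in the first bullet can be illustrated with a toy manager. Only `release_all_owned`, `register_mapping` and `dev_to_host_` are names taken from the message; the class name, the queue members, `push_recycled` and the return value are assumptions, and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <functional>
#include <unordered_map>

// Toy stand-in for BufferPoolManager, not the real class.
class ToyBufferPoolManager {
public:
    using ReleaseFn = std::function<void(void *dev_ptr)>;

    // Track a device pointer and its (possibly null) malloc'd host shadow.
    void register_mapping(void *dev_ptr, void *host_shadow) { dev_to_host_[dev_ptr] = host_shadow; }

    // Park an already-tracked buffer in a framework-owned queue.
    void push_recycled(void *dev_ptr) { recycled_.push_back(dev_ptr); }

    // Abort-path cleanup. Returns how many device pointers were released.
    int release_all_owned(const ReleaseFn &release_fn) {
        // Queue entries are also present in dev_to_host_, so the queues are
        // only cleared here; releasing from both places would double-free.
        recycled_.clear();
        done_.clear();
        ready_.clear();
        int released = 0;
        for (auto &entry : dev_to_host_) {
            if (release_fn) release_fn(entry.first);
            std::free(entry.second);  // std::free(nullptr) is a no-op
            ++released;
        }
        dev_to_host_.clear();
        return released;
    }

    std::size_t tracked() const { return dev_to_host_.size(); }

private:
    std::unordered_map<void *, void *> dev_to_host_;
    std::deque<void *> recycled_, done_, ready_;
};

// Registers three buffers, parks one in the recycled queue, and checks that
// every device pointer reaches release_fn exactly once.
inline bool each_released_once() {
    ToyBufferPoolManager manager;
    int dev[3] = {0, 0, 0};  // addresses stand in for device pointers
    for (int &d : dev) manager.register_mapping(&d, std::malloc(16));
    manager.push_recycled(&dev[0]);
    std::unordered_map<void *, int> calls;
    int released = manager.release_all_owned([&calls](void *p) { ++calls[p]; });
    bool once = calls.size() == 3;
    for (const auto &kv : calls) once = once && kv.second == 1;
    return once && released == 3 && manager.tracked() == 0;
}
```

Walking a single mapping table as the source of truth is what lets the same routine also cover buffers that sit in a caller-side free queue, since those are tracked in the table but never enter a framework queue.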

Collector wiring
----------------

- common/scope_stats_collector.cpp init(): construct guard after
  set_memory_context, commit() right before return 0. Catches the shm
  region + ScopeStatsBuffer entries (free_queue and recycled pool).

- common/tensor_dump_collector.cpp init(): same pattern. Catches the
  shm region + per-thread arenas + DumpMetaBuffers (free_queue and
  recycled pool).

- a5/pmu_collector.cpp init(): same pattern + guard.add_direct_ptr(ring)
  for each per-core PmuAicoreRing (those don't go through
  alloc_paired_buffer so the framework doesn't track them — register
  them with the guard explicitly).
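
The wiring above follows the usual commit-or-rollback scope guard idiom. The sketch below is an approximation under stated assumptions: `InitRollbackGuard`, `add_direct_ptr`, `commit` and `release_all_owned` are named in the message, but the constructor signature, `ToyManager`, `PtrFn`, `toy_init` and `released_by_init` are invented here for illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using PtrFn = std::function<void(void *)>;

// Minimal manager: remembers tracked pointers and releases them on request.
struct ToyManager {
    std::vector<void *> tracked;
    void release_all_owned(const PtrFn &release_fn) {
        for (void *p : tracked) {
            if (release_fn) release_fn(p);
        }
        tracked.clear();
    }
};

// Move-only guard: rolls back on destruction unless commit() was called.
template <typename Manager>
class InitRollbackGuard {
public:
    InitRollbackGuard(Manager &manager, PtrFn release_fn, PtrFn free_cb)
        : manager_(&manager), release_fn_(std::move(release_fn)), free_cb_(std::move(free_cb)) {}
    InitRollbackGuard(const InitRollbackGuard &) = delete;
    InitRollbackGuard &operator=(const InitRollbackGuard &) = delete;
    InitRollbackGuard(InitRollbackGuard &&other) noexcept
        : manager_(other.manager_), release_fn_(std::move(other.release_fn_)),
          free_cb_(std::move(other.free_cb_)), direct_ptrs_(std::move(other.direct_ptrs_)),
          committed_(other.committed_) {
        other.committed_ = true;  // a moved-from guard must not roll back
    }
    ~InitRollbackGuard() {
        if (committed_) return;
        manager_->release_all_owned(release_fn_);
        for (void *p : direct_ptrs_) {
            if (free_cb_) free_cb_(p);
        }
    }
    void add_direct_ptr(void *p) { direct_ptrs_.push_back(p); }
    void commit() { committed_ = true; }

private:
    Manager *manager_;
    PtrFn release_fn_, free_cb_;
    std::vector<void *> direct_ptrs_;
    bool committed_ = false;
};

// Toy init(): one tracked buffer, one direct pointer, then an optional failure.
inline int toy_init(ToyManager &manager, int &released, bool fail_midway) {
    static int buf_a, buf_b, ring;  // addresses stand in for device pointers
    InitRollbackGuard<ToyManager> guard(
        manager, [&released](void *) { ++released; }, [&released](void *) { ++released; });
    manager.tracked.push_back(&buf_a);  // stands in for alloc_paired_buffer
    guard.add_direct_ptr(&ring);        // allocation the manager does not track
    if (fail_midway) return -1;         // guard destructor rolls back here
    manager.tracked.push_back(&buf_b);
    guard.commit();                     // success: destructor becomes a no-op
    return 0;
}

// Number of pointers released by rollback for a given init() outcome.
inline int released_by_init(bool fail_midway) {
    ToyManager manager;
    int released = 0;
    toy_init(manager, released, fail_midway);
    return released;
}
```

Placing `commit()` immediately before `return 0` means every earlier `return -1` is covered without adding cleanup code at each failure site.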

Test plan
---------

- a2a3sim ST L1+L2: pass (rollback path inert on success — guard.commit
  short-circuits the destructor).
- a5sim ST L1+L2: pass (same).
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim): clean.

Net: +122 lines (5 files touched).
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Four orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean three more rebuild rounds
and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.
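
For readers unfamiliar with the term used for `device_arena.h` above, a bump arena hands out consecutive aligned slices of one address range by advancing an offset. The class below is a generic, hypothetical sketch of that technique; it is not the interface of `utils/device_arena.h`, and the names `BumpArena`, `alloc`, `reset` and the default alignment of 64 are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump arena over a caller-provided address range.
class BumpArena {
public:
    BumpArena(std::uintptr_t base, std::size_t size) : base_(base), size_(size) {}

    // Returns the address of an aligned block, or 0 when the arena is
    // exhausted. `align` must be a power of two. Overflow of very large
    // requests is not handled in this sketch.
    std::uintptr_t alloc(std::size_t bytes, std::size_t align = 64) {
        std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        std::uintptr_t start = (base_ + offset_ + mask) & ~mask;
        std::size_t new_offset = static_cast<std::size_t>(start - base_) + bytes;
        if (new_offset > size_) return 0;
        offset_ = new_offset;
        return start;
    }

    // Frees everything at once; individual blocks are never freed.
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::uintptr_t base_;
    std::size_t size_;
    std::size_t offset_ = 0;
};

// Two allocations fit in a 256-byte arena, the third does not.
inline bool arena_demo() {
    BumpArena arena(0x1000, 256);
    std::uintptr_t a = arena.alloc(100);  // starts at the base
    std::uintptr_t b = arena.alloc(100);  // aligned up to the next 64-byte boundary
    std::uintptr_t c = arena.alloc(100);  // exceeds the arena, so 0
    return a == 0x1000 && b == 0x1080 && c == 0 && arena.used() == 228;
}
```

Nothing in this technique is specific to communication, which is consistent with the move out of `device_comm/` into `utils/`.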

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
semantic ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, no behavioral
changes (pure layout).
ChaoWao added a commit that referenced this pull request Jun 1, 2026
…s) (#948)

Plug the latent leak CodeRabbit flagged on #944: a5 PmuCollector::init,
ScopeStatsCollector::init, TensorDumpCollector::init register paired
device+host buffers in BufferPoolManager via alloc_paired_buffer BEFORE
flipping the initialized_ flag. If a subsequent allocation fails, init()
returns -1; finalize() then early-exits (gated on initialized_ / shm_host_)
and every registered device buffer + framework-malloc'd host shadow leaks.

The pattern existed pre-#944 (old alloc_single_buffer + manual
register_mapping had the same shape), so this is not a regression of the
unification work — just the cleanup it enables.

Framework changes
-----------------

- BufferPoolManager::release_all_owned(release_fn) [new]: abort-path
  cleanup that releases EVERY framework-tracked dev_ptr (via release_fn)
  and every framework-malloc'd host shadow (via std::free), then clears
  all internal containers. Distinct from release_owned_buffers() because
  this also catches buffers parked in callers' SPSC free_queues (tracked
  via register_mapping but not framework-owned via a queue). Drains
  recycled/done/ready first (just clears — release goes via dev_to_host_
  to avoid double-free) then walks the full mapping table.

- profiling_common::InitRollbackGuard<Manager> [new, profiler_base.h]:
  RAII scope guard for collector init() rollback. Holds a manager
  reference + release_fn + a vector of "extra direct dev_ptrs" the
  collector owns outside the framework (e.g. PMU per-core PmuAicoreRings
  on a5 — plain alloc_cb allocations with no host shadow). On destruction
  without commit(), calls manager.release_all_owned + free_cb on each
  direct ptr. Move-only.

Collector wiring
----------------

- common/scope_stats_collector.cpp init(): construct guard after
  set_memory_context, commit() right before return 0. Catches the shm
  region + ScopeStatsBuffer entries (free_queue and recycled pool).

- common/tensor_dump_collector.cpp init(): same pattern. Catches the
  shm region + per-thread arenas + DumpMetaBuffers (free_queue and
  recycled pool).

- a5/pmu_collector.cpp init(): same pattern + guard.add_direct_ptr(ring)
  for each per-core PmuAicoreRing (those don't go through
  alloc_paired_buffer so the framework doesn't track them — register
  them with the guard explicitly).

Test plan
---------

- a2a3sim ST L1+L2: pass (rollback path inert on success — guard.commit
  short-circuits the destructor).
- a5sim ST L1+L2: pass (same).
- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim): clean.

Net: +122 lines (5 files touched).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Four orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean three more rebuild rounds
and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
semantic ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, no behavioral
changes (pure layout).
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Four orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean three more rebuild rounds
and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
semantic ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, +0 / -0
behavioral changes (pure layout).
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Four orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean three more rebuild rounds
and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
semantic ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, +0 / -0
behavioral changes (pure layout).
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…n, consolidate common/, rename platform/src→shared)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Four orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean three more rebuild rounds
and three more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
meaning ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~30 file renames, ~14 files extracted to common, +0 / -0
behavioral changes (pure layout).
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…n, consolidate common/, rename platform/src→shared, merge aicpu_loader)

Mechanical post-hw-native-sys#944/hw-native-sys#945/hw-native-sys#948 layout cleanup, motivated by an audit of
duplicated and oddly-placed files that the recent unification work left
behind. Five orthogonal changes bundled here because each touches the
same set of CMakeLists; splitting would mean four more rebuild rounds
and four more cmake-include-path edits across the same files.

1. Extract identical aicpu sources to common/platform/{onboard,sim}/aicpu/
------------------------------------------------------------------------

Seven files per backend (14 total) were byte-identical (or differed only
in a one-line @brief arch qualifier) between a2a3 and a5:

  cache_ops.cpp, device_log.cpp, device_time.cpp, device_malloc.cpp,
  orch_so_file.cpp, platform_aicpu_affinity.cpp, spin_hint.h

Moved a2a3's copy to common/, deleted a5's duplicate, and extended each
arch's onboard/aicpu and sim/aicpu CMakeLists COMMON_SOURCES glob to
pick them up from common/platform/{onboard,sim}/aicpu/. The
device_malloc.cpp arch tag in its @brief was the only real content
diff; generalized to "Real Hardware" / "Simulation" without the arch
qualifier. Backfilled a copyright header that was missing on
device_time.cpp (caught by the check-headers hook).

The remaining files in per-arch aicpu/ (kernel.cpp, inner_platform_regs.cpp)
have real arch-specific divergence (register addresses, kernel protocols)
and stay where they are.
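
The "byte-identical" claim above is the kind of thing a small audit helper
can confirm before a file is moved to common/. The sketch below is
illustrative only: files_identical is a hypothetical name, not a function
in this repository, and it does not handle the "differs only in a one-line
@brief" case, which still needs a manual diff.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical audit helper (not part of this repository).
// Returns true only when both files can be opened and contain
// exactly the same sequence of bytes.
inline bool files_identical(const std::string& path_a,
                            const std::string& path_b) {
    std::ifstream fa(path_a, std::ios::binary);
    std::ifstream fb(path_b, std::ios::binary);
    if (!fa || !fb) {
        return false;  // a missing or unreadable file never counts as identical
    }
    std::istreambuf_iterator<char> it_a(fa), it_b(fb), end;
    // The four-iterator overload also checks that both streams end together.
    return std::equal(it_a, end, it_b, end);
}
```

A pair that passes this check is a candidate for a single shared copy; a
pair that fails needs a look at whether the difference is cosmetic or real
arch-specific divergence.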

2. Flatten profiling_common/ subdir
------------------------------------

src/common/platform/include/host/profiling_common/{buffer_pool_manager,
profiler_base}.h → src/common/platform/include/host/{buffer_pool_manager,
profiler_base}.h. Updated 10 #include sites and the 2 header guards. The
profiling_common:: C++ namespace stays — file path and namespace don't
have to match.
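
As a minimal illustration of why the namespace can stay put: a C++
namespace is declared inside the file, so moving the file only changes the
#include line at each call site. ExampleManager and describe below are
hypothetical stand-ins, not the real BufferPoolManager or ProfilerBase.

```cpp
#include <string>

// Stand-in for a header's contents. Where the file lives on disk
// (host/profiling_common/ or host/) affects only the #include line,
// never the qualified names declared here.
namespace profiling_common {

struct ExampleManager {  // hypothetical type for illustration
    std::string name() const { return "profiling_common::ExampleManager"; }
};

}  // namespace profiling_common

// A call site keeps using the same qualified name after the move.
inline std::string describe() {
    profiling_common::ExampleManager mgr;
    return mgr.name();
}
```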

3. Consolidate small src/common subdirs
----------------------------------------

- src/common/device_comm/device_arena.h → src/common/utils/device_arena.h.
  The file is a generic bump-arena utility, not a comm primitive; the
  enclosing dir name was misleading. Updated 10 #include sites
  "device_arena.h" → "utils/device_arena.h" and dropped the
  common/device_comm entry from 8 CMakeLists (replaced with common
  since utils/ resolves there).

- src/common/sim_context/ → src/common/platform/sim/sim_context/. The
  dir is sim-only infrastructure (CPU sim context for CANN intrinsic
  emulation), so it belongs next to the other common/platform/sim/
  shared sim infrastructure. Updated:
    * the dir's own CMakeLists relative path to log/include;
    * simpler_setup/runtime_compiler.py::compile_sim_context source
      path;
    * 4 sim-host CMakeLists references;
    * a small handful of docs that named the old path.
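
For readers unfamiliar with the term used in the first bullet, a bump
arena hands out memory by advancing a single offset and frees everything
at once. The sketch below shows the general technique only; BumpArena and
its members are hypothetical and are not the contents of
utils/device_arena.h. Addresses are plain integers so the sketch never
dereferences device memory, and it assumes base + capacity does not wrap
around the address space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical illustration of a bump arena (not device_arena.h).
class BumpArena {
public:
    // The arena carves allocations out of [base, base + capacity).
    BumpArena(std::uintptr_t base, std::size_t capacity)
        : base_(base), capacity_(capacity), offset_(0) {}

    // Returns the address of `size` bytes aligned to `align`
    // (which must be a power of two), or nullopt when the arena is full.
    std::optional<std::uintptr_t> allocate(std::size_t size,
                                           std::size_t align = 8) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(align - 1);
        const std::uintptr_t current = base_ + offset_;
        const std::uintptr_t aligned = (current + mask) & ~mask;
        const std::size_t padding =
            static_cast<std::size_t>(aligned - current);
        const std::size_t remaining = capacity_ - offset_;
        if (padding > remaining || size > remaining - padding) {
            return std::nullopt;
        }
        offset_ += padding + size;  // "bump" past padding and payload
        return aligned;
    }

    // Releases every allocation at once; there is no per-block free.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::uintptr_t base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

Nothing in this pattern is specific to device communication, which is
consistent with the move from device_comm/ to utils/.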

4. Rename platform/src → platform/shared
------------------------------------------

Per-arch src/{arch}/platform/src/ was confusingly nested inside the
top-level src/ directory and read as "src/src" in many paths. Renamed
to shared/ across all 3 trees (a2a3, a5, common), matching its actual
meaning ("shared between onboard and sim within one arch"). Updated 21
files that referenced the old path: CMakeLists, host headers, docs, one
test file, and the src/{arch}/docs/platform.md map.

5. Merge host/load_aicpu_op + aicpu_dispatcher → aicpu_loader/{host,device}
---------------------------------------------------------------------------

src/common/host/ held a single file pair (load_aicpu_op.{h,cpp}) — the
host-side loader that uploads runtime AICPU SOs to CANN's preinstall
path. src/common/aicpu_dispatcher/ held the AICPU-side bootstrap helper
the loader actually drives. They are the two halves of one mechanism
and the README of aicpu_dispatcher already cross-references the loader.

Merged into src/common/aicpu_loader/{host,device}/ with the shared
README at the top level. host/ holds load_aicpu_op.{h,cpp}, device/
holds aicpu_dispatcher.{h,cpp}. Updated:
  * 5 #include sites (4 device_runner.{cpp,h} bare includes plus
    device_runner_base.h's prefixed include) to the consistent
    aicpu_loader/host/load_aicpu_op.h form;
  * 4 CMakeLists source-path entries;
  * dropped the now-unneeded common/host and common/aicpu_dispatcher
    explicit include-path entries (common/ is already on the path);
  * updated the README cross-reference comment in load_aicpu_op.h.

Test plan
---------

- Build all four libhost_runtime.so (a2a3 onboard/sim, a5 onboard/sim)
  + libcpu_sim_context.so + aicpu and aicore artifacts: clean.
- CI will run the full ST + UT suite.

Net: ~37 file renames, ~14 files extracted to common, +0 / -0
behavioral changes (pure layout).