Skip to content

Refactor: extract setup_static_arena + launch_aicore_kernel into base#922

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:extract-arena-launch-profiling
May 30, 2026
Merged

Refactor: extract setup_static_arena + launch_aicore_kernel into base#922
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:extract-arena-launch-profiling

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

Two more mechanical-clean items extracted into DeviceRunnerBase. Net -104 lines.

Moved to base

  • setup_static_arena: bodies were near-identical between a2a3 and a5. The only difference was the rollback-on-failure semantic, which is now normalized to a5's pattern: when a later region fails to commit, earlier committed peers are released.
  • launch_aicore_kernel: byte-identical body. Lazy-registers the AICore binary via rtRegisterAllKernel, caches the handle, builds rtArgsEx_t, launches via rtKernelLaunchWithHandleV2. Whether k_args is host- or device-resident is decided by the caller's run().
  • cached_gm_heap_size_ / cached_gm_sm_size_ / cached_runtime_arena_size_ moved alongside setup_static_arena.

Behavior normalization (call out)

a2a3's setup_static_arena previously kept earlier committed peers alive when a later region failed, on the rationale that "pooled pointers previously returned to callers must stay valid even if this resize attempt aborts." That rationale leaks: if the caller already committed pointers to earlier regions, they keep working; but the caller also gets a failure rc, so it cannot rely on the layout. The earlier-peers-stay-alive behavior masks the failure — the caller might happily continue using stale pointers without realizing the configured runtime arena (or whatever later region failed) never got committed.

a5's "release earlier peers on later failure" semantic is the safer default: failure means failure across the board, caller re-tries with a different layout or bails out. The current setup_static_arena callers all retry the full layout when it fails, so this is a no-op behavioral change in practice — but it removes a footgun.

Skipped (per-arch divergence too tangled)

  • init_l2_perf / init_tensor_dump / init_pmu / init_scope_stats: set arch-specific KernelArgs fields (a2a3: aicore_ring_addr; a5: aicore_l2_perf_ring_addrs + aicore_pmu_ring_addrs). Unifying needs multiple virtual hooks per method — the virtual overhead exceeds the savings.
  • finalize_collectors: a5 only stop()s collectors (per-run cleanup); a2a3 fully finalize()s with halHostUnregister + dep_gen. Semantically different functions sharing a name.

Verification

  • Both arches built clean (onboard + sim, both runtimes).
  • Local a2a3 onboard smoke (dummy_task, alternating_matmul_add, prepared_callable suite) — 8/8 passed in 15s.

Test plan

  • CI st-sim-a2a3 / st-sim-a5
  • CI st-onboard-a2a3 / st-onboard-a5
  • CI ut-a2a3 / ut-a5

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3c38119-93c5-45a6-b020-26138d3b1a6e

📥 Commits

Reviewing files that changed from the base of the PR and between fbc5618 and eac2ddb.

📒 Files selected for processing (6)
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/platform/onboard/host/device_runner.h
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/platform/onboard/host/device_runner.h
  • src/common/platform/onboard/host/device_runner_base.cpp
  • src/common/platform/onboard/host/device_runner_base.h

📝 Walkthrough

Walkthrough

Methods setup_static_arena and launch_aicore_kernel are consolidated from two platform-specific DeviceRunner classes into the shared DeviceRunnerBase class. New cached arena-size fields are added to the base, and platform-specific subclass declarations and implementations are removed.

Changes

Consolidation of arena and kernel launch logic into DeviceRunnerBase

Layer / File(s) Summary
DeviceRunnerBase interface expansion
src/common/platform/onboard/host/device_runner_base.h
DeviceRunnerBase gains public setup_static_arena and launch_aicore_kernel method declarations with documented semantics for arena commit/rollback and lazy kernel registration. Three new cached_gm_*_size_ member fields track allocation state for idempotent re-allocation checks.
DeviceRunnerBase implementation
src/common/platform/onboard/host/device_runner_base.cpp
setup_static_arena implementation commits three device arenas idempotently with full rollback on failure; launch_aicore_kernel lazily registers the AICore ELF kernel and launches it with batch-mode scheduling.
A2A3 platform cleanup
src/a2a3/platform/onboard/host/device_runner.h, device_runner.cpp
A2A3 DeviceRunner removes public method declarations and implementations for setup_static_arena and launch_aicore_kernel; comments updated to indicate inherited methods from base class.
A5 platform cleanup
src/a5/platform/onboard/host/device_runner.h, device_runner.cpp
A5 DeviceRunner removes public method declarations, implementations, and cached arena-size member fields; header documentation updated to reflect inheritance from base class.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hw-native-sys/simpler#880: Both PRs are part of the same "extract DeviceRunnerBase" refactor: the retrieved PR creates DeviceRunnerBase and moves pooled arena/tensor helpers into it, while the main PR further removes DeviceRunner::setup_static_arena and DeviceRunner::launch_aicore_kernel from each arch and re-implements them in DeviceRunnerBase (plus shared cached state/handle).

Poem

🐰 Shared methods find their home at last,
Platform divergence is now past,
Base class gathers launch and arena's dance,
One kernel, one setup—each subclass needs not glance. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main refactoring: extracting two methods (setup_static_arena and launch_aicore_kernel) into the base class.
Description check ✅ Passed The description thoroughly explains the refactoring, including which methods were moved, behavioral normalizations, and verification steps performed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the DeviceRunner implementations for the a2a3 and a5 platforms by moving the common setup_static_arena and launch_aicore_kernel methods, along with their cached arena size member variables, into the shared DeviceRunnerBase class. This consolidation normalizes the rollback behavior on arena commit failures. A review comment points out a potential issue where sequential failures during re-initialization could lead to an incomplete rollback of previously committed arenas, and suggests using a try-catch block to guarantee a clean release of all resources.

Comment thread src/common/platform/onboard/host/device_runner_base.cpp Outdated
@ChaoWao ChaoWao force-pushed the extract-arena-launch-profiling branch from e44b9f3 to eac2ddb Compare May 30, 2026 11:26
Move two more mechanical-clean items into DeviceRunnerBase (-104 net
lines):

- `setup_static_arena`: bodies were near-identical; rollback-on-failure
  semantic normalized to a5's (release earlier committed peers when a
  later region fails). a2a3's prior "keep peers committed on failure"
  behavior is dropped — that pattern silently masked errors by handing
  pooled pointers back to the caller alongside a failure rc.
- `launch_aicore_kernel`: byte-identical body. Lazy-registers the
  AICore binary via `rtRegisterAllKernel`, caches the handle, builds
  rtArgsEx_t, launches via `rtKernelLaunchWithHandleV2`.
- `cached_gm_heap_size_`, `cached_gm_sm_size_`,
  `cached_runtime_arena_size_` moved alongside `setup_static_arena`.

Skipped (per-arch divergence too tangled for clean shared body):
- `init_l2_perf` / `init_tensor_dump` / `init_pmu` / `init_scope_stats`:
  set arch-specific KernelArgs fields (a2a3 sets `aicore_ring_addr`;
  a5 sets `aicore_l2_perf_ring_addrs` / `aicore_pmu_ring_addrs`).
- `finalize_collectors`: a5 only stops collectors; a2a3 fully finalizes
  with halHostUnregister + dep_gen.

Both arches built clean. a2a3 onboard smoke 8/8 in 15s.
@ChaoWao ChaoWao merged commit d9cdf01 into hw-native-sys:main May 30, 2026
16 checks passed
ChaoWao added a commit that referenced this pull request May 31, 2026
)

Three more small helpers + one tweak to finalize_common(). Net -1
line, but real consolidation: each chunk that used to be duplicated
in two arches now lives once.

New base helpers (called from each arch's run()):
- init_runtime_args_with_metadata(Runtime&) — H2D the Runtime struct
  via kernel_args_.init_runtime_args, then publish log_level /
  log_info_v / device_id into KernelArgs from HostLogger.
- start_shared_collectors_for_run() — builds the thread_factory
  lambda and calls .start() on the four shared collectors
  (l2_perf, dump, pmu, scope_stats) gated by their enable flags.
- teardown_shared_collectors_after_run() — the post-sync
  stop+reconcile+export sequence for those four. l2_perf adds
  read_phase_header_metadata + export_swimlane_json; dump exports
  dump files; pmu just reconciles; scope_stats writes JSONL.

a2a3 still inlines its dep_gen collector start + teardown (with
dep_gen_replay_emit_deps_json) right after the shared helpers —
that's a2a3-only.

finalize_common() now resets cached_gm_heap_size_ / cached_gm_sm_size_
/ cached_runtime_arena_size_ alongside the other identity state. These
fields live on base (moved in PR #922) so resetting them there is the
natural home; each arch's finalize() no longer needs the three explicit
= 0 lines after calling finalize_common().

Both arches built clean. a2a3 onboard smoke (dummy_task,
alternating_matmul_add, prepared_callable suite, spmd_basic) — 9/9
passed in 19s.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants