Skip to content

Refactor: AICPU fills AICore rotation table (move from host)#921

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicpu-fills-rotation-table
May 30, 2026
Merged

Refactor: AICPU fills AICore rotation table (move from host)#921
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicpu-fills-rotation-table

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

Move the AICore rotation-table filling step from host into AICPU.

The table is a flat `uint64_t[num_aicore]` whose entries are per-core
`&L2PerfAicoreBufferState::rotation` device addresses. AICore reads
`aicore_ring_addr[block_idx]` from KernelArgs to find its rotation
channel.

Before

Host did the fill (a2a3 `l2_perf_collector.cpp::initialize`):

```cpp
auto host_to_dev = [&](void *host_addr) -> uint64_t {
uintptr_t offset = (uintptr_t)host_addr - (uintptr_t)perf_host_ptr;
return (uint64_t)perf_dev_ptr + offset;
};
for (int i = 0; i < num_aicore; i++) {
L2PerfAicoreBufferState *ac_state = get_aicore_buffer_state(...);
rotation_table[i] = host_to_dev(&ac_state->rotation);
}
```

This coupled host to the AICore-side shared-memory layout: had to call
`get_aicore_buffer_state` and translate host pointers to device
pointers, for a table whose entries are pure device-internal addresses
(only AICore reads them).

After

Host allocates the table bytes and hands the device pointer to AICPU
via the existing `KernelArgs::aicore_ring_addr`. AICPU fills the
table inside `l2_perf_aicpu_init` reading
`&ac_state->rotation` directly — no translation needed since AICPU
already runs on the device.

Plumbing follows the existing `set_platform_l2_perf_base` pattern:

  • New `set_platform_aicore_rotation_table` / `get_*` setters on AICPU
  • On onboard: AICPU `kernel.cpp` calls the setter from `k_args`
  • On sim: host dlsyms the setter and calls it before triggering AICPU

Scope

a2a3 only. a5 uses an independent dev alloc for each AICore ring
(`state->aicore_ring_ptr`), so the host already has the dev address at
alloc time and the translation issue doesn't exist there.

Why

This is part of a holistic L2 perf refactor (see PR #916 for the
`L2PerfBufferState` → `AicpuBufferState` rename). The principle: host
should know how many bytes to allocate and where to hand them to AICPU;
it should not have to know that `AicoreRotation` is a field of
`L2PerfAicoreBufferState` at offset Y. After this PR, host's only
contact with the rotation channel is "alloc the address table + put the
table ptr in KernelArgs". The rotation protocol itself stays purely
between AICPU (writer) and AICore (reader).

Test plan

  • `a2a3` + `a2a3sim` builds clean
  • `tests/st/.../l2_swimlane` ST (`test_l2_swimlane` +
    `test_l2_swimlane_mixed`) pass on `a2a3sim`
  • same ST passes on `a2a3` device 1

Independent of #916.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b681c7f1-da52-4c7c-bb36-d850358a7d68

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR shifts responsibility for populating an AICore rotation-table from the host to AICPU. The host now allocates the buffer and passes its device address to AICPU via a new setter function. AICPU fills the table with per-core rotation addresses during device initialization, eliminating host-side address translation.

Changes

Rotation table population coordination

Layer / File(s) Summary
Rotation table public API contract
src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h, src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
New extern "C" functions set_platform_aicore_rotation_table(uint64_t table_addr) and get_platform_aicore_rotation_table() establish the contract for passing the table address between host and AICPU.
AICPU initialization and per-core table population
src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp, src/a2a3/platform/onboard/aicpu/kernel.cpp
During device initialization, AICPU casts the table address to uint64_t* and fills each core's entry with &ac_state->rotation. The kernel entry point calls the setter to publish the table address before device setup.
Host dynamic loader integration
src/a2a3/platform/sim/host/device_runner.h, src/a2a3/platform/sim/host/device_runner.cpp
DeviceRunner adds a function-pointer member, loads the setter via dlsym, invokes it in run() with kernel_args_.aicore_ring_addr, and clears it during unload.
Host allocation refactor
src/a2a3/platform/src/host/l2_perf_collector.cpp
L2PerfCollector's Step 5c now allocates the rotation table without filling per-core entries, deferring that work to AICPU during l2_perf_aicpu_init.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A table is born on the host's side,
But AICPU claims it with rightful pride,
Each core's rotation address penned by its own hand,
No host translations needed across the device land! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: moving AICore rotation table population from host to AICPU, which matches the core objective of the entire changeset.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the before/after refactoring, the motivation, plumbing details, scope, and test results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request decouples the host from the AICore-side shared-memory layout by moving the population of the AICore rotation-table device pointer from the host to the AICPU. The host now only allocates the table and passes its pointer, while the AICPU directly populates it with device-internal addresses during initialization. The review feedback recommends defensively clamping the worker_count parameter read from shared memory to prevent potential out-of-bounds writes or stack overflows, along with logging any clamping events for observability.

Comment thread src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp (1)

144-164: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Don't populate the rotation table after AICore can already consume it.

AICore reads aicore_ring_addr[block_idx] in kernel entry, but this code only fills the table during l2_perf_aicpu_init(). In sim, src/a2a3/platform/sim/host/device_runner.cpp still starts AICore workers without waiting for that init to finish, and src/a2a3/platform/src/host/l2_perf_collector.cpp now leaves the table bytes uninitialized. That means AICore can cache a garbage/null AicoreRotation* for the whole run. Either keep the pre-launch host fill, or add an explicit init-complete barrier before AICore reads the table.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp` around lines 144 -
164, The rotation table (rotation_table / g_platform_aicore_rotation_table) is
being filled only in l2_perf_aicpu_init() after AICore may already read
aicore_ring_addr[block_idx], causing AICore to see uninitialized pointers; fix
by ensuring the table is populated before AICore starts: either restore the
pre-launch host-side fill in the host path that launches workers (the code in
l2_perf_collector.cpp that used to populate per-core rotation entries) or add an
explicit init-complete handshake (e.g., a g_platform_aicore_rotation_table_ready
atomic/flag set after filling in l2_perf_aicpu_init() and checked/waited on by
AICore startup code that reads aicore_ring_addr in the kernel entry); update
get_aicore_buffer_state / l2_perf_aicpu_init to set the flag after writing
rotation_table and ensure device_runner/AICore startup waits for that flag
before reading aicore_ring_addr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp`:
- Around line 144-164: The rotation table (rotation_table /
g_platform_aicore_rotation_table) is being filled only in l2_perf_aicpu_init()
after AICore may already read aicore_ring_addr[block_idx], causing AICore to see
uninitialized pointers; fix by ensuring the table is populated before AICore
starts: either restore the pre-launch host-side fill in the host path that
launches workers (the code in l2_perf_collector.cpp that used to populate
per-core rotation entries) or add an explicit init-complete handshake (e.g., a
g_platform_aicore_rotation_table_ready atomic/flag set after filling in
l2_perf_aicpu_init() and checked/waited on by AICore startup code that reads
aicore_ring_addr in the kernel entry); update get_aicore_buffer_state /
l2_perf_aicpu_init to set the flag after writing rotation_table and ensure
device_runner/AICore startup waits for that flag before reading
aicore_ring_addr.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d1e03ec4-9cc5-4d34-9873-0a4641e08deb

📥 Commits

Reviewing files that changed from the base of the PR and between fbc5618 and 1a7650e.

📒 Files selected for processing (6)
  • src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h
  • src/a2a3/platform/onboard/aicpu/kernel.cpp
  • src/a2a3/platform/sim/host/device_runner.cpp
  • src/a2a3/platform/sim/host/device_runner.h
  • src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_perf_collector.cpp

@ChaoWao ChaoWao force-pushed the refactor/aicpu-fills-rotation-table branch from 1a7650e to c378e9d Compare May 30, 2026 12:05
Before: host computed each per-core `&L2PerfAicoreBufferState::rotation`
device address (host-to-device translation against `perf_dev_ptr`) and
wrote the resulting uint64_t into the `aicore_ring_addr` table inside
`L2PerfCollector::initialize`. This coupled host to the AICore-side
shared-memory layout — it had to reach into `get_aicore_buffer_state`
and translate addresses for a table whose entries are pure device-
internal pointers (only AICore reads them).

After: AICPU fills the table inside `l2_perf_aicpu_init` reading
`&ac_state->rotation` directly (no translation needed — it's already
on the device). Host's job shrinks to "alloc the table bytes and hand
the device pointer to AICPU via `KernelArgs::aicore_ring_addr`".

To make this safe, AICore now defers reading the table until the first
task is dispatched (AICPU writes the dispatch register only after init
has completed, so by the time AICore reaches into the table, the entries
are populated):

- AICore kernel entry stashes the *slot pointer*
  (`&table[block_idx]`) via the new `set_aicore_rotation_slot`, NOT the
  dereferenced rotation address. Old `set_aicore_rotation` is removed.
- `get_aicore_rotation()` lazily dereferences the slot on first call and
  caches the result.
- aicore_executor moves the `get_aicore_rotation()` call from
  pre-loop init into the per-task branch (still amortized to 0 after
  first task since the getter caches).

Without this deferral, AICore's kernel-entry deref would race with
AICPU's `l2_perf_aicpu_init` table fill — observed as a hardware-only
`aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018` because
AICore stashed a garbage rotation pointer at entry and crashed when
later dcci'ing it.

The plumbing follows the existing `set_platform_l2_perf_base` pattern:
- New `set_platform_aicore_rotation_table` / `get_*` setters on AICPU
- On onboard: AICPU `kernel.cpp` calls the setter from `k_args`
- On sim: host dlsyms the setter and calls it before triggering AICPU

a2a3 only — a5 uses an independent dev alloc for each AICore ring
(`state->aicore_ring_ptr`), so the host already has the dev address and
the translation issue doesn't exist there.

Tested: a2a3 + a2a3sim build clean; tests/st/.../l2_swimlane pass on
both a2a3sim and a2a3 device 1 (test_l2_swimlane + test_l2_swimlane_mixed).
@ChaoWao ChaoWao merged commit dc913c4 into hw-native-sys:main May 30, 2026
15 checks passed
@ChaoWao ChaoWao deleted the refactor/aicpu-fills-rotation-table branch May 30, 2026 12:27
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 30, 2026
A holistic pass over the L2 perf type names in src/{a2a3,a5}/platform/.
Renames consolidate the misleading parts and drop two fields that have
been dead since hw-native-sys#878 / hw-native-sys#921. No on-device behaviour change beyond the
field removals.

== Step 1: buffer name symmetry ==

  L2PerfBuffer        -> L2PerfAicpuBuffer
  PhaseBuffer         -> AicpuPhaseBuffer

The producer is now in the name: AICPU-written buffers carry `Aicpu`,
mirroring the existing `L2PerfAicoreBuffer` for the AICore-written one.
On a2a3, AicpuPhaseBuffer also moves from a hand-written struct to the
existing `TypedBuffer<R, N>` template -- matches L2PerfAicpuBuffer and
L2PerfAicoreBuffer.

== Step 4: pool struct rename ==

  AicpuBufferState        -> L2PerfAicpuPool
  L2PerfAicoreBufferState -> L2PerfAicorePool
  PhaseBufferState        -> AicpuPhasePool   (alias of L2PerfAicpuPool)

Replaces the misleading `BufferState` suffix (it was a *pool* state
holding free_queue + counters, not a buffer) and gives the Phase usage
its own explicit type name even though it shares layout with
L2PerfAicpuPool.

== Step 2: BufferKind enum ==

  ReadyQueueEntry::is_phase (uint32_t) -> kind (BufferKind, uint32_t underlying)

  enum class BufferKind : uint32_t {
      L2PerfAicpu  = 0,
      Phase        = 1,
      L2PerfAicore = 2,
  };

The field was already a 3-value discriminator (despite the bool-shaped
name); enqueue sites and the host's switch site now read in terms of
the named enum instead of magic numbers. uint32_t underlying preserves
the byte layout -- no ABI break.

== Step 3: drop dead fields (a2a3 only) ==

  L2PerfAicpuPool::aicore_ring_ptr        -- removed
  L2PerfAicpuPool::mismatch_record_count  -- removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2PerfAicoreBuffer pool with
    its own state (L2PerfAicorePool), so `aicore_ring_ptr` no longer
    points anywhere live.
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU, removing the
    last code path that even read `aicore_ring_ptr` (host's flush join
    now indexes the AICore pool directly).
  - `mismatch_record_count` was last written before hw-native-sys#878 (legacy ring/
    task_id invariant violation) and is no longer written by anyone;
    the host reconcile() loop's reading-and-warning on a counter that
    can never advance is removed.

Pad is adjusted to keep `sizeof(L2PerfAicpuPool) == 192` (static_assert
remains green). Host reconcile() loop simplifies to `total/dropped` only.

a5 keeps both fields -- its AICore pipeline still uses the legacy
staging-ring design where the host's per-core `aicore_ring_ptr` and the
mismatch counter are both live.

== Step 5: deferred ==

Collapsing AicpuPhaseHeader into the root L2PerfDataHeader is a
shared-memory layout change and is left for a focused follow-up PR.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic pass over the L2 perf type names in src/{a2a3,a5}/platform/.
Renames consolidate the misleading parts and drop two fields that have
been dead since hw-native-sys#878 / hw-native-sys#921. No on-device behaviour change beyond the
field removals.

== Step 1: buffer name symmetry ==

  L2PerfBuffer        -> L2PerfAicpuBuffer
  PhaseBuffer         -> AicpuPhaseBuffer

The producer is now in the name: AICPU-written buffers carry `Aicpu`,
mirroring the existing `L2PerfAicoreBuffer` for the AICore-written one.
On a2a3, AicpuPhaseBuffer also moves from a hand-written struct to the
existing `TypedBuffer<R, N>` template -- matches L2PerfAicpuBuffer and
L2PerfAicoreBuffer.

== Step 4: pool struct rename ==

  AicpuBufferState        -> L2PerfAicpuPool
  L2PerfAicoreBufferState -> L2PerfAicorePool
  PhaseBufferState        -> AicpuPhasePool   (alias of L2PerfAicpuPool)

Replaces the misleading `BufferState` suffix (it was a *pool* state
holding free_queue + counters, not a buffer) and gives the Phase usage
its own explicit type name even though it shares layout with
L2PerfAicpuPool.

== Step 2: BufferKind enum ==

  ReadyQueueEntry::is_phase (uint32_t) -> kind (BufferKind, uint32_t underlying)

  enum class BufferKind : uint32_t {
      L2PerfAicpu  = 0,
      Phase        = 1,
      L2PerfAicore = 2,
  };

The field was already a 3-value discriminator (despite the bool-shaped
name); enqueue sites and the host's switch site now read in terms of
the named enum instead of magic numbers. uint32_t underlying preserves
the byte layout -- no ABI break.

== Step 3: drop dead fields (a2a3 only) ==

  L2PerfAicpuPool::aicore_ring_ptr        -- removed
  L2PerfAicpuPool::mismatch_record_count  -- removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2PerfAicoreBuffer pool with
    its own state (L2PerfAicorePool), so `aicore_ring_ptr` no longer
    points anywhere live.
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU, removing the
    last code path that even read `aicore_ring_ptr` (host's flush join
    now indexes the AICore pool directly).
  - `mismatch_record_count` was last written before hw-native-sys#878 (legacy ring/
    task_id invariant violation) and is no longer written by anyone;
    the host reconcile() loop's reading-and-warning on a counter that
    can never advance is removed.

Pad is adjusted to keep `sizeof(L2PerfAicpuPool) == 192` (static_assert
remains green). Host reconcile() loop simplifies to `total/dropped` only.

a5 keeps both fields -- its AICore pipeline still uses the legacy
staging-ring design where the host's per-core `aicore_ring_ptr` and the
mismatch counter are both live.

== Step 5: deferred ==

Collapsing AicpuPhaseHeader into the root L2PerfDataHeader is a
shared-memory layout change and is left for a focused follow-up PR.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before hw-native-sys#878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before hw-native-sys#878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before hw-native-sys#878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before hw-native-sys#878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - hw-native-sys#921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before hw-native-sys#878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
ChaoWao added a commit that referenced this pull request May 31, 2026
A holistic naming pass over the L2 swimlane subsystem:

  1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
  2. Pool struct rename (BufferState → Pool) — captures the actual
     semantic (a pool of buffers, not "a buffer's state").
  3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
     L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
  4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
     mismatch_record_count, both dead since #878 / #921.
  5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.

== Renaming scheme ==

Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:

  L2PerfRecord            -> L2SwimlaneAicpuTaskRecord
  L2PerfAicoreRecord      -> L2SwimlaneAicoreTaskRecord
  AicpuPhaseRecord        -> L2SwimlaneAicpuPhaseRecord
  L2PerfBuffer            -> L2SwimlaneAicpuTaskBuffer
  L2PerfAicoreBuffer      -> L2SwimlaneAicoreTaskBuffer
  PhaseBuffer             -> L2SwimlaneAicpuPhaseBuffer
  L2PerfBufferState       -> L2SwimlaneAicpuTaskPool
  L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
  PhaseBufferState        -> L2SwimlaneAicpuPhasePool
  L2PerfDataHeader        -> L2SwimlaneDataHeader
  L2PerfLevel             -> L2SwimlaneLevel
  L2PerfFreeQueue         -> L2SwimlaneFreeQueue
  L2PerfModule            -> L2SwimlaneModule
  L2PerfCollector         -> L2SwimlaneCollector
  AicpuPhaseId            -> L2SwimlaneAicpuPhaseId
  AicpuPhaseHeader        -> L2SwimlaneAicpuPhaseHeader
  AicoreRotation          -> L2SwimlaneAicoreRotation
  AicoreLocalState        -> L2SwimlaneAicoreLocalState
  L2PerfAicoreRing  (a5)  -> L2SwimlaneAicoreRing

Enums (uint32_t underlying — wire format preserved):

  enum class L2SwimlaneBufferKind {
      AicpuTask  = 0,
      AicpuPhase = 1,
      AicoreTask = 2,
  };
  enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };

Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.

KernelArgs cross-platform fields:
  l2_perf_data_base -> l2_swimlane_base
  aicore_ring_addr  -> l2_swimlane_aicore_rotation_table

Files:
  include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
  -> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
  Output artifact: l2_perf_records.json -> l2_swimlane_records.json

== Dead-field drop (a2a3 only) ==

L2SwimlaneAicpuTaskPool::aicore_ring_ptr        — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count  — removed

Both fields have been dead since:
  - #878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
    pool with its own state (L2SwimlaneAicoreTaskPool).
  - #921 moved the rotation-table fill from host to AICPU.
  - mismatch_record_count was last written before #878.

Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.

a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.

== Behaviour ==

Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.

== Test plan ==

- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants