Refactor: decouple AICore L2Perf writes via stable per-core staging ring #709
Merged
Code Review
This pull request introduces a unified profiling framework for the a2a3 and a5 platforms, consolidating the L2 swimlane, PMU, and Tensor Dump collectors. It replaces per-subsystem thread management with a shared ProfilerBase and BufferPoolManager architecture, improves the AICore-to-AICPU timing publication protocol by using a stable staging ring, and enhances error reporting for buffer management and invariant violations. My feedback highlights potential performance bottlenecks in the management loop's global scan and suggests evaluating whether silent record drops on invariant violations should trigger more explicit error handling.
Introduce `L2PerfAicoreRing` — a per-core ring written exclusively by AICore at `dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]`. Its address is published once via `Handshake::l2_perf_aicore_ring_addr` and never reassigned, so AICore is fully decoupled from the AICPU's records-buffer rotation:

- AICore caches the ring pointer once after init; the per-iteration handshake reload + `dcci` on `host_build_graph` go away.
- AICPU rotates `L2PerfBuffer` internally inside `l2_perf_aicpu_complete_record` the moment a buffer fills, instead of relying on dispatch-side counters (`CoreExecState::dispatch_count` / `AicpuExecutor::core_dispatch_counts_`) driving an externally visible buffer swap. The public `switch_buffer` API is gone; rotation is a private `switch_records_buffer`.
- L2Perf init moves before `handshake_all_cores` so AICore observes a non-zero ring address the moment `aicpu_ready=1` unblocks Phase 1.
- Add `mismatch_record_count` to `L2PerfBufferState` — the runtime's completion-before-dispatch invariant says a ring/task_id mismatch must never happen, so it gets its own bucket (hard error, `DEV_ERROR`-logged), distinct from the capacity-driven `dropped_record_count`. The reconcile arithmetic becomes `collected + dropped + mismatch == device_total`.
- Drop the now-unused `Runtime*` parameter from the `SchedulerContext` `dispatch_*` helpers (it was only threaded through to reach the old switch API).

Doc + comments in `docs/l2-swimlane-profiling.md` updated to match.
ChaoWao approved these changes on May 7, 2026.
Why
The previous design coupled AICore L2Perf writes to AICPU's `L2PerfBuffer` rotation: AICPU exposed a public `l2_perf_aicpu_switch_buffer(buffer_idx)` and AICore reloaded the buffer address every iteration via the handshake. Two problems:

1. Leaky API. `switch_buffer` was a public collector method driven by dispatch-side counters (`CoreExecState::dispatch_count`, `AicpuExecutor::core_dispatch_counts_`). Buffer rotation is an AICPU-internal concern, but the dispatch path had to know about it and thread a `Runtime*` through `SchedulerContext::dispatch_*` just to reach the API.
2. Two rotation races that lost records:
What
Decouple AICore from `L2PerfBuffer` rotation by introducing a stable per-core staging ring:

- `L2PerfAicoreRing` — written exclusively by AICore at `dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]`. Address is published once via `Handshake::l2_perf_aicore_ring_addr` (renamed from `l2_perf_records_addr`) and never reassigned. AICore caches it once after init; the per-iteration handshake reload + `dcci(my_hank, …)` in `host_build_graph` go away.
- `L2PerfBuffer` rotation becomes AICPU-internal — it happens inside `l2_perf_aicpu_complete_record` the moment a buffer fills, replacing the dispatch-side counters. The public `l2_perf_aicpu_switch_buffer` API is removed; rotation is now a private `switch_records_buffer`.
- L2Perf init moves before `handshake_all_cores` so AICore observes a non-zero ring address the moment `aicpu_ready=1` unblocks Phase 1.
- New `mismatch_record_count` bucket on `L2PerfBufferState` — distinguishes ring/task_id invariant violations (hard error, `DEV_ERROR`-logged) from capacity-driven `dropped_record_count`. The reconcile arithmetic becomes `collected + dropped + mismatch == device_total`.
- Drop the now-unused `Runtime*` parameter from the `SchedulerContext` `dispatch_*` helpers (it was only threaded through to reach the old switch API).
- `docs/l2-swimlane-profiling.md` updated to match.

Testing
(`collected + dropped + mismatch == device_total`, `mismatch == 0`)