[Performance] Cut device-side fanout collection from a5 L2 swimlane (dep_gen is sole source) #908

@ChaoZheng109

Description

Depends on #907 (a5 dep_gen port).

Note on scope: a5 only

a2a3 already cut this. Its l2_perf_aicpu_complete_record no longer takes fanout params and the commit path explicitly "does not touch fanout" — deps.json (from dep_gen) is the sole source, joined post-run by swimlane_converter.py (src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h:77-79, src/a2a3/platform/include/common/l2_perf_profiling.h:89-96). This issue brings a5 to the same state. (The title/request originally said "a2a3 and a5"; a2a3 needs no change.)

Summary

On a5, the L2 swimlane hot path still records L2PerfRecord::fanout[] on the scheduler critical completion path. Once a5 has dep_gen (#907), deps.json becomes the complete fanout source, so this device-side collection is redundant overhead and should be removed — mirroring a2a3.

Motivation / Use Case

a5 T&R scheduler_completion.cpp:159-175 walks the fanout linked list and builds a uint64_t fanout_arr[RUNTIME_MAX_FANOUT] (128 entries) per completed task, then l2_perf_aicpu_complete_record(... fanout_arr, fanout_n) stores it into the GM record (src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp:290-295). That is a per-task linked-list walk plus a GM store of up to ~1 KB (128 × 8 B) on the scheduler's critical fanin/completion tail, which is exactly the cost a2a3 removed when dep_gen landed.
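
To make the cost estimate above concrete, here is a rough host-side model in Python. It is only a sketch: `walk_fanout` and the `next_of` mapping are hypothetical stand-ins for the device-side linked list, not the actual a5 code. Only the 128-entry capacity and the uint64_t element type come from the text above.

```python
import struct

RUNTIME_MAX_FANOUT = 128             # capacity quoted in this issue
UINT64_SIZE = struct.calcsize("<Q")  # 8 bytes per uint64_t entry


def fanout_array_bytes(capacity: int = RUNTIME_MAX_FANOUT) -> int:
    """Size in bytes of a full uint64_t fanout array with `capacity` slots."""
    return capacity * UINT64_SIZE


def walk_fanout(head, next_of, capacity: int = RUNTIME_MAX_FANOUT):
    """Collect consumer ids by following a singly linked list.

    `next_of` is a hypothetical model of the list: it maps each node to its
    successor, and None (or a missing key) marks the end of the list.
    """
    collected = []
    node = head
    while node is not None and len(collected) < capacity:
        collected.append(node)
        node = next_of.get(node)
    return collected


if __name__ == "__main__":
    print(fanout_array_bytes())                     # 1024
    print(walk_fanout(7, {7: 9, 9: 12, 12: None}))  # [7, 9, 12]
```

The walk is linear in the number of consumers and runs once per completed task, which is why it matters on the completion tail.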

dep_gen's replay sees every submit (no "already retired" producer race, see #599), so deps.json is a strict superset of the swimlane fanout edges. Keeping device-side fanout on a5 buys nothing once #907 is in and just taxes the hot path.
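
The superset claim could be sanity-checked on a real run before the device-side collection is removed. The Python sketch below assumes a simplified data shape that is not taken from the repository: each swimlane record is a dict with `task_id` and `fanout`, and the dep_gen edges are (producer, consumer) pairs. The function names are illustrative only.

```python
def fanout_edges(records):
    """Flatten per-record fanout lists into a set of (producer, consumer) edges."""
    edges = set()
    for record in records:
        for consumer in record["fanout"]:
            edges.add((record["task_id"], consumer))
    return edges


def edges_missing_from_deps(records, dep_edges):
    """Return swimlane edges that do not appear among the dep_gen edges.

    An empty result is consistent with deps.json being a superset.
    """
    return fanout_edges(records) - {tuple(edge) for edge in dep_edges}


if __name__ == "__main__":
    records = [
        {"task_id": 1, "fanout": [2, 3]},
        {"task_id": 2, "fanout": []},  # e.g. an edge lost to a retired producer
    ]
    dep_edges = [[1, 2], [1, 3], [2, 4]]
    print(edges_missing_from_deps(records, dep_edges))  # set()
```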

Proposed API / Behavior

Mirror a2a3's already-shipped change on a5:

  • Drop the fanout / fanout_count params from a5 l2_perf_aicpu_complete_record so the commit path no longer touches fanout (src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h, .../src/aicpu/l2_perf_collector_aicpu.cpp).
  • Remove the fanout linked-list walk + fanout_arr build at a5 scheduler_completion.cpp:159-175.
  • a5 host collector: stop emitting per-record fanout; emit empty fanout fields and let swimlane_converter.py join deps.json post-run (src/a5/platform/src/host/l2_perf_collector.cpp:610-620), matching a2a3 (src/a2a3/platform/src/host/l2_perf_collector.cpp:608).
  • Handle the L2PerfRecord::fanout[] / fanout_count struct fields the same way a2a3 does (the struct lives in the shared header common/l2_perf_profiling.h; HBG still uses fanout, so follow a2a3's exact treatment rather than deleting the fields outright).
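
For orientation, the post-run join that replaces device-side fanout can be sketched as follows. This is not the actual swimlane_converter.py logic: the deps.json layout (`{"edges": [[producer, consumer], ...]}`), the record fields, and the function names are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def build_fanout_index(deps):
    """Map each producer task id to the list of its consumers."""
    index = defaultdict(list)
    for producer, consumer in deps["edges"]:
        index[producer].append(consumer)
    return index


def join_fanout(records, deps):
    """Return copies of the records with fanout filled in from the deps data.

    Records arrive with empty fanout fields, as the host collector would
    emit them after this change; the inputs are left unmodified.
    """
    index = build_fanout_index(deps)
    joined = []
    for record in records:
        consumers = list(index.get(record["task_id"], []))
        joined.append({**record, "fanout": consumers, "fanout_count": len(consumers)})
    return joined


if __name__ == "__main__":
    # Hypothetical deps.json content, inlined so the sketch is self-contained.
    deps = json.loads('{"edges": [[1, 2], [1, 3], [2, 3]]}')
    records = [
        {"task_id": 1, "fanout": [], "fanout_count": 0},
        {"task_id": 2, "fanout": [], "fanout_count": 0},
        {"task_id": 3, "fanout": [], "fanout_count": 0},
    ]
    for row in join_fanout(records, deps):
        print(row["task_id"], row["fanout"])
```

Doing the join on the host keeps the per-task work off the device entirely, at the price of requiring deps.json to be present whenever fanout edges are wanted in the swimlane output.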

Alternatives Considered

Additional Context

Labels: performance (Performance regression or optimization)
