Depends on #907 (a5 dep_gen port).
Note on scope: a5 only
a2a3 already cut this. Its l2_perf_aicpu_complete_record no longer takes fanout params and the commit path explicitly "does not touch fanout" — deps.json (from dep_gen) is the sole source, joined post-run by swimlane_converter.py (src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h:77-79, src/a2a3/platform/include/common/l2_perf_profiling.h:89-96). This issue brings a5 to the same state. (The title/request originally said "a2a3 and a5"; a2a3 needs no change.)
Summary
On a5, the L2 swimlane hot path still records L2PerfRecord::fanout[] on the scheduler's critical completion path. Once a5 has dep_gen (#907), deps.json becomes the complete fanout source, so this device-side collection is redundant overhead and should be removed, mirroring a2a3.
Motivation / Use Case
a5 T&R scheduler_completion.cpp:159-175 walks the fanout linked list and builds a uint64_t fanout_arr[RUNTIME_MAX_FANOUT] (128 entries) per completed task, then l2_perf_aicpu_complete_record(... fanout_arr, fanout_n) stores it into the GM record (src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp:290-295). That is a per-task linked-list walk plus a ~1 KB GM store on the scheduler's critical fanin/completion tail: the exact cost a2a3 removed when dep_gen landed.
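The ~1 KB figure follows from the array shape quoted above. A minimal sketch of the arithmetic, assuming the full fanout_arr is stored for every record (an upper bound; the real store may write fewer entries) and using a hypothetical task count purely for illustration:

```python
# Back-of-the-envelope size of the per-task fanout buffer described above.
RUNTIME_MAX_FANOUT = 128  # entry count quoted in this issue
ENTRY_BYTES = 8           # size of one uint64_t entry


def fanout_buffer_bytes(entries: int = RUNTIME_MAX_FANOUT,
                        entry_bytes: int = ENTRY_BYTES) -> int:
    """Bytes occupied by one fanout_arr."""
    return entries * entry_bytes


def fanout_traffic_bytes(completed_tasks: int) -> int:
    """Upper bound on fanout bytes stored to GM, assuming every
    completed task writes the whole array (an assumption)."""
    return completed_tasks * fanout_buffer_bytes()


if __name__ == "__main__":
    print(fanout_buffer_bytes())         # 1024 bytes, i.e. 1 KiB per task
    print(fanout_traffic_bytes(10_000))  # hypothetical run of 10,000 tasks
```

The store comes on top of the linked-list walk, which this sketch does not model.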
dep_gen's replay sees every submit (no "already retired" producer race, see #599), so deps.json is a strict superset of the swimlane fanout edges. Keeping device-side fanout on a5 buys nothing once #907 is in and just taxes the hot path.
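A toy model can make the superset claim concrete. This is not the runtime's actual logic: the event format and both function names are hypothetical, and the only behavior assumed is the one described in #599, namely that a consumer submitted after its producer has retired never gets wired into that producer's fanout list.

```python
def device_side_edges(events):
    """Edges captured by snapshotting a task's fanout list at completion."""
    fanout = {}       # live task -> consumers wired so far
    recorded = set()  # (producer, consumer) edges seen at completion time
    for kind, task, producers in events:
        if kind == "submit":
            fanout[task] = []
            for p in producers:
                if p in fanout:  # producer still live: wire the edge
                    fanout[p].append(task)
                # else: producer already retired, the edge is never seen
        else:  # "complete"
            for consumer in fanout.pop(task):
                recorded.add((task, consumer))
    return recorded


def replay_edges(events):
    """Edges recovered by replaying every submit with its declared producers."""
    return {(p, task)
            for kind, task, producers in events if kind == "submit"
            for p in producers}


# Task C is submitted after its producer A has already completed.
EVENTS = [
    ("submit", "A", []),
    ("submit", "B", ["A"]),
    ("complete", "A", []),
    ("submit", "C", ["A"]),
    ("complete", "B", []),
    ("complete", "C", []),
]

if __name__ == "__main__":
    device = device_side_edges(EVENTS)
    replay = replay_edges(EVENTS)
    print(sorted(replay - device))  # edges only the replay sees
```

In this model the device-side snapshot misses the A to C edge while the replay keeps it, which is the sense in which the replay-derived edge set contains the device-side one.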
Proposed API / Behavior
Mirror a2a3's already-shipped change on a5:
- Drop the fanout / fanout_count params from a5 l2_perf_aicpu_complete_record so the commit path no longer touches fanout (src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h, .../src/aicpu/l2_perf_collector_aicpu.cpp).
- Remove the fanout linked-list walk + fanout_arr build at a5 scheduler_completion.cpp:159-175.
- a5 host collector: stop emitting per-record fanout; emit empty fanout fields and let swimlane_converter.py join deps.json post-run (src/a5/platform/src/host/l2_perf_collector.cpp:610-620), matching a2a3 (src/a2a3/platform/src/host/l2_perf_collector.cpp:608).
- Handle the L2PerfRecord::fanout[] / fanout_count struct fields the same way a2a3 does (the struct lives in the shared common/l2_perf_profiling.h; HBG still uses fanout, so follow a2a3's exact treatment rather than deleting the fields outright).
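The post-run join in the third bullet could look roughly like the sketch below. It is only an illustration: the deps.json layout (an "edges" list of producer/consumer pairs), the record fields, and the function name join_fanout are all assumptions, not the actual schema or API of swimlane_converter.py.

```python
import json
from collections import defaultdict


def join_fanout(records, deps):
    """Fill each record's empty fanout from dependency edges.

    records: list of dicts with a "task_id" key (hypothetical shape)
    deps:    dict with an "edges" list of {"producer", "consumer"} (hypothetical)
    """
    fanout = defaultdict(list)
    for edge in deps["edges"]:
        fanout[edge["producer"]].append(edge["consumer"])
    # Records whose task has no consumers keep an empty fanout list.
    return [{**rec, "fanout": fanout.get(rec["task_id"], [])} for rec in records]


if __name__ == "__main__":
    deps = json.loads(
        '{"edges": [{"producer": 1, "consumer": 2},'
        ' {"producer": 1, "consumer": 3}]}'
    )
    records = [{"task_id": t, "fanout": []} for t in (1, 2, 3)]
    for rec in join_fanout(records, deps):
        print(rec)
```

Doing the join offline keeps the device-side record free of fanout work, at the cost of needing deps.json alongside the perf records when rendering the swimlane.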
Alternatives Considered
Additional Context
- swimlane_converter.py join.
- scheduler_completion.cpp:159-175, l2_perf_collector_aicpu.{h,cpp}, l2_perf_collector.cpp:610-620.
- docs/dfx/dep_gen.md, [Bug] Swimlane profiling drops fanout edges for producers completing before consumer wiring #599.