Summary
dep_gen, the complete per-submit dependency-graph collector (tensor-annotated, host-replayed deps.json), exists only on a2a3. a5 has every other DFX collector mirrored (pmu, l2_perf, tensor_dump, scope_stats) but no dep_gen at all. This issue tracks porting dep_gen to a5 so the two platforms have parity.
Motivation / Use Case
dep_gen captures the inputs to every Orchestrator::submit_task into a host-resident record stream and replays them offline through the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses. The result is deps.json, a strict superset of the swimlane fanout[] edges: it recovers edges that fanout silently drops when a producer has already retired (see #599).
On a5 this diagnostic is currently unavailable, so a5 users cannot get a complete dependency graph for debugging missed/incorrect dependencies. Every other DFX subsystem already works on both platforms; dep_gen is the lone gap.
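The record-and-replay idea described above can be illustrated with a minimal sketch. It is not the dep_gen implementation. SubmitRecord, Edge, replay_deps and demo are hypothetical names, and the "last writer per tensor id" rule is a simplifying assumption standing in for whatever compute_task_fanin / register_task_outputs actually do. The point it shows is that an offline replay has no notion of retirement, so an edge from an early producer is still recovered.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record: one entry per submit, holding the tensor ids involved.
struct SubmitRecord {
    uint32_t task_id;
    std::vector<uint64_t> inputs;   // tensor ids the task reads
    std::vector<uint64_t> outputs;  // tensor ids the task writes
};

using Edge = std::pair<uint32_t, uint32_t>;  // (producer task, consumer task)

// Replay the records in submit order. Assumption: a consumer depends on the
// most recent task that wrote each tensor it reads.
std::set<Edge> replay_deps(const std::vector<SubmitRecord>& records) {
    std::unordered_map<uint64_t, uint32_t> last_writer;
    std::set<Edge> edges;
    for (const SubmitRecord& rec : records) {
        // Fan-in step: look up the producer of every input tensor.
        for (uint64_t tensor : rec.inputs) {
            auto it = last_writer.find(tensor);
            if (it != last_writer.end() && it->second != rec.task_id) {
                edges.insert(Edge{it->second, rec.task_id});
            }
        }
        // Output registration step: this task is now the producer.
        for (uint64_t tensor : rec.outputs) {
            last_writer[tensor] = rec.task_id;
        }
    }
    return edges;
}

// Usage: task 2 reads tensor 10, written by task 0 long before task 2 is
// submitted. The replay still reports the (0, 2) edge.
inline void demo() {
    std::vector<SubmitRecord> records = {
        {0, {}, {10}},
        {1, {10}, {11}},
        {2, {10, 11}, {}},
    };
    std::set<Edge> edges = replay_deps(records);
    assert(edges.count(Edge{0u, 2u}) == 1);
}
```

Because the replay runs on the host over the full record stream, it never has to drop a producer, which is the property that makes the output a superset of the fanout-derived edges.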
Proposed API / Behavior
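Mirror the a2a3 dep_gen implementation under src/a5/..., preserving the same enable flag, record schema, and deps.json output so existing tooling and docs apply unchanged. Files to port (a2a3 → a5):
src/a2a3/platform/include/common/dep_gen.h
src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.h
src/a2a3/platform/include/host/dep_gen_collector.h
src/a2a3/platform/src/aicpu/dep_gen_collector_aicpu.cpp
src/a2a3/platform/src/host/dep_gen_collector.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}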
ST coverage: tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ → tests/st/a5/...
Wire-up in a5 kernel.cpp, device_runner.cpp, host CMakeLists, and pto_runtime_c_api, matching how a5 already wires pmu/scope_stats.
a5-specific adaptation: the a5 host collector must follow a5's no-SVM model (malloc host shadow + copy_to/from_device, per-tick shm mirror), the same pattern the a5 PmuCollector / ScopeStatsCollector ports already use. A verbatim copy of the a2a3 host collector will not work.
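The host-shadow pattern named above might look roughly like the following sketch. Everything here is an assumption made for illustration. FakeDevice, ShadowCollector and tick are hypothetical, and the copy_from_device defined in the block is a local memcpy stand-in, not the platform's real signature. The shm mirror step is left out. The sketch only shows why the host cannot read device buffers directly without SVM and has to refresh a malloc'd copy on each tick.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Stand-in for device memory. On real hardware this would be a device
// allocation that the host cannot dereference.
struct FakeDevice {
    std::vector<uint8_t> mem;
};

// Hypothetical stand-in for the platform copy helper (simulated with memcpy).
void copy_from_device(void* host_dst, const FakeDevice& dev,
                      size_t offset, size_t bytes) {
    assert(offset + bytes <= dev.mem.size());
    std::memcpy(host_dst, dev.mem.data() + offset, bytes);
}

// Hypothetical collector that owns a malloc'd host shadow of a device buffer.
class ShadowCollector {
public:
    explicit ShadowCollector(size_t bytes)
        : size_(bytes), shadow_(static_cast<uint8_t*>(std::malloc(bytes))) {
        assert(shadow_ != nullptr);
        std::memset(shadow_, 0, size_);
    }
    ~ShadowCollector() { std::free(shadow_); }
    ShadowCollector(const ShadowCollector&) = delete;
    ShadowCollector& operator=(const ShadowCollector&) = delete;

    // One tick: pull the current device contents into the host shadow.
    void tick(const FakeDevice& dev) {
        copy_from_device(shadow_, dev, 0, size_);
    }

    const uint8_t* data() const { return shadow_; }
    size_t size() const { return size_; }

private:
    size_t size_;
    uint8_t* shadow_;
};
```

In this model a write on the device side becomes visible to the host only after the next tick, whereas a shared-memory (SVM) collector would see it through the same pointer immediately.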
Alternatives Considered
Leave dep_gen a2a3-only — rejected: leaves a5 without complete dependency-graph diagnostics while every other DFX collector has parity.
Share one collector across both platforms — out of scope here; the host side genuinely differs (SVM vs no-SVM), which is why a5 maintains its own mirror of each collector.
Additional Context
Baseline + design: docs/dfx/dep_gen.md.
a2a3 reference implementation: the files listed above.
a5 already mirrors pmu / l2_perf / tensor_dump / scope_stats, so the porting pattern (incl. the no-SVM host adaptation) is well established — see the a5 scope_stats port (Add: scope_stats collector for per-scope queue-fill peaks #858) and PmuCollector for the template.