[Feature] Port dep_gen DFX collector to a5 (align with a2a3) #907

@ChaoZheng109

Summary

dep_gen — the complete per-submit dependency-graph collector (tensor-annotated, host-replayed deps.json) — exists only on a2a3. a5 has every other DFX collector mirrored (pmu, l2_perf, tensor_dump, scope_stats) but no dep_gen at all. This issue tracks porting dep_gen to a5 so the two platforms have parity.

Motivation / Use Case

dep_gen captures the inputs to every Orchestrator::submit_task call into a host-resident record stream. It then replays them offline through the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, producing deps.json. The result is a strict superset of the swimlane fanout[] edges: it recovers edges that fanout silently drops when a producer has already retired (see #599).
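The replay idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `SubmitRecord`, `Edge`, and `replay_edges` are hypothetical names, the record layout is far simpler than the real dep_gen schema, and the fan-in / output-registration steps are simplified stand-ins rather than the actual compute_task_fanin / register_task_outputs primitives.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified submit record (not the real dep_gen schema).
struct SubmitRecord {
    std::uint32_t task_id;
    std::vector<std::uint64_t> inputs;   // tensor ids the task reads
    std::vector<std::uint64_t> outputs;  // tensor ids the task writes
};

using Edge = std::pair<std::uint32_t, std::uint32_t>;  // (producer, consumer)

// Replays records in submit order and returns every producer -> consumer edge.
std::set<Edge> replay_edges(const std::vector<SubmitRecord>& records) {
    // Last task that wrote each tensor. Entries are never evicted, so a
    // producer stays visible to the replay for the whole record stream.
    std::unordered_map<std::uint64_t, std::uint32_t> last_writer;
    std::set<Edge> edges;
    for (const SubmitRecord& rec : records) {
        // Simplified fan-in step: one edge per input with a known writer.
        for (std::uint64_t tensor : rec.inputs) {
            auto it = last_writer.find(tensor);
            if (it == last_writer.end()) continue;  // external input, no edge
            assert(it->second != rec.task_id && "sketch assumes unique task ids");
            edges.emplace(it->second, rec.task_id);
        }
        // Simplified output-registration step: this task becomes the writer.
        for (std::uint64_t tensor : rec.outputs) {
            last_writer[tensor] = rec.task_id;
        }
    }
    return edges;
}
```

Because the sketch keeps its writer table on the host for the entire stream, an edge is emitted regardless of how long ago the producer was submitted. That is one plausible reading of why an offline replay can report edges that an on-device fanout list misses once the producer has retired.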

On a5 this diagnostic is currently unavailable, so a5 users cannot get a complete dependency graph for debugging missed/incorrect dependencies. Every other DFX subsystem already works on both platforms; dep_gen is the lone gap.

Proposed API / Behavior

Mirror the a2a3 dep_gen implementation under src/a5/..., preserving the same enable flag, record schema, and deps.json output so existing tooling and docs apply unchanged. Files to port (a2a3 → a5):

  • src/a2a3/platform/include/common/dep_gen.h
  • src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.h
  • src/a2a3/platform/include/host/dep_gen_collector.h
  • src/a2a3/platform/src/aicpu/dep_gen_collector_aicpu.cpp
  • src/a2a3/platform/src/host/dep_gen_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}
  • ST coverage: tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ → tests/st/a5/...
  • Wire-up in a5 kernel.cpp, device_runner.cpp, host CMakeLists, and pto_runtime_c_api, matching how a5 already wires pmu/scope_stats.

a5-specific adaptation: the a5 host collector must follow a5's no-SVM model (malloc host shadow + copy_to/from_device, per-tick shm mirror), the same pattern the a5 PmuCollector / ScopeStatsCollector ports already use. A verbatim copy of the a2a3 host collector will not work.
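A rough sketch of the host-shadow pattern described above, under loud assumptions: `copy_from_device` / `copy_to_device` here are stand-ins simulated with `memcpy` (the real a5 primitives and their signatures are not shown in this issue), and `RecordBuffer` and `HostShadow` are hypothetical names with an invented layout. The per-tick shm mirror and any host/device synchronisation are not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Stand-ins for the platform copy primitives, simulated with memcpy.
inline void copy_from_device(void* host_dst, const void* dev_src, std::size_t n) {
    std::memcpy(host_dst, dev_src, n);
}
inline void copy_to_device(void* dev_dst, const void* host_src, std::size_t n) {
    std::memcpy(dev_dst, host_src, n);
}

// Hypothetical device-side buffer layout (illustrative only).
struct RecordBuffer {
    std::uint32_t count;       // number of valid entries in records[]
    std::uint32_t records[8];  // payload written by the device
};

// Without SVM the host cannot dereference device memory directly, so it
// keeps a malloc'ed shadow and moves data with explicit copies.
class HostShadow {
public:
    explicit HostShadow(RecordBuffer* dev)
        : dev_(dev),
          shadow_(static_cast<RecordBuffer*>(std::malloc(sizeof(RecordBuffer)))) {
        assert(shadow_ != nullptr);
        std::memset(shadow_, 0, sizeof(RecordBuffer));
        copy_to_device(dev_, shadow_, sizeof(RecordBuffer));  // zero the device buffer
    }
    ~HostShadow() { std::free(shadow_); }
    HostShadow(const HostShadow&) = delete;
    HostShadow& operator=(const HostShadow&) = delete;

    // One collector tick: pull the device buffer, drain it, push the reset back.
    std::vector<std::uint32_t> tick() {
        copy_from_device(shadow_, dev_, sizeof(RecordBuffer));
        const std::uint32_t n = std::min<std::uint32_t>(shadow_->count, 8);
        std::vector<std::uint32_t> drained(shadow_->records, shadow_->records + n);
        shadow_->count = 0;
        copy_to_device(dev_, shadow_, sizeof(RecordBuffer));
        return drained;
    }

private:
    RecordBuffer* dev_;     // device address (ordinary memory in this sketch)
    RecordBuffer* shadow_;  // host-side copy
};
```

On an SVM platform the host could read the same buffer in place, which is presumably why the a2a3 collector cannot be copied verbatim. A real port would also need to handle the device writing while the host drains, which this sketch ignores.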

Alternatives Considered

  • Leave dep_gen a2a3-only — rejected: leaves a5 without complete dependency-graph diagnostics while every other DFX collector has parity.
  • Share one collector across both platforms — out of scope here; the host side genuinely differs (SVM vs no-SVM), which is why a5 maintains its own mirror of each collector.

Additional Context

  • Baseline + design: docs/dfx/dep_gen.md.
  • a2a3 reference implementation: the files listed above.
  • a5 already mirrors pmu / l2_perf / tensor_dump / scope_stats, so the porting pattern (incl. the no-SVM host adaptation) is well established — see the a5 scope_stats port (Add: scope_stats collector for per-scope queue-fill peaks #858) and PmuCollector for the template.

Labels

enhancement
