Add: dep_gen capture (SubmitTrace) for a2a3 tensormap_and_ringbuffer#736
Merged
Code Review
This pull request implements the dep_gen (SubmitTrace) feature to capture orchestrator task submissions for offline replay and dependency graph reconstruction. The changes include new shared-memory protocols, AICPU-side recording, host-side collection, and integration into the Python and C++ runtime layers. Feedback was provided to improve the cache alignment of the tensors array in the DepGenRecord structure to meet performance and architectural standards.
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 11, 2026:

…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. It skips the replay on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. The explicit path entry in tests/ut/cpp/CMakeLists.txt follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max observed local_id up to next pow2) so slot indexing never aliases.
2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and HB violation count.
3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge also appears in deps.json.
   The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
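The happens-before check described for the swimlane_converter integration can be sketched as follows. This is a minimal illustration, not the converter's actual code: the function name `classify_edges`, the dictionary layout, and the `"dependency"` label for clean edges are assumptions; only the comparison `pred.end_time > succ.start_time` and the `"hb_violation"` name come from the commit message.

```python
def classify_edges(tasks, edges):
    """Label each (pred, succ) edge as clean or as a happens-before violation.

    tasks: {task_id: {"start_time": number, "end_time": number}}  (assumed layout)
    edges: iterable of (pred_id, succ_id) pairs
    """
    flows = []
    for pred, succ in edges:
        # A dependency is violated when the predecessor finishes after
        # the successor has already started.
        violated = tasks[pred]["end_time"] > tasks[succ]["start_time"]
        flows.append({
            "pred": pred,
            "succ": succ,
            # "dependency" is a hypothetical name for clean edges.
            "name": "hb_violation" if violated else "dependency",
        })
    return flows


tasks = {
    0: {"start_time": 0, "end_time": 10},
    1: {"start_time": 12, "end_time": 20},
    2: {"start_time": 8, "end_time": 15},
}
flows = classify_edges(tasks, [(0, 1), (0, 2)])
violations = sum(1 for f in flows if f["name"] == "hb_violation")
```

Here task 2 starts at time 8 while its predecessor ends at time 10, so only the edge from task 0 to task 2 is flagged.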
…er + host collector)

Capture the inputs to every Orchestrator::submit_task call into a streaming ring buffer that the host drains to submit_trace.bin. This is phase 1 of replacing PR hw-native-sys#500; the host-side offline replay that reconstructs deps.json from these records ships in a follow-up PR. The capture path sidesteps the race window in L2PerfRecord::fanout[], where an early-finishing producer's record gets sealed before later-submitted consumers can register themselves.

Architecture mirrors PMU / L2Perf / TensorDump on a2a3:

- src/a2a3/platform/include/common/dep_gen.h: DepGenRecord (2240 B, single submit_task capture: task_id + flags + arg_types[16] + explicit_deps[16] + tensors[16][128] opaque blobs). DepGenBuffer / DepGenFreeQueue / DepGenBufferState / DepGenReadyQueueEntry / DepGenDataHeader: SPSC streaming buffer family identical in shape to PmuBuffer / PmuFreeQueue etc. Single-instance (orch is one thread).
- src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.{h,cpp}: AICPU writer with all-primitive interface (task_id raw, void* per-tensor blob, uint64* explicit_deps) so the platform layer stays runtime-agnostic. submit_task entry calls dep_gen_aicpu_record_submit gated on is_dep_gen_enabled(); the orch callsite static_asserts sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks on the runtime side let host_build_graph and future replay builds link without the AICPU strong symbols.
- src/a2a3/platform/include/host/dep_gen_collector.{h,cpp}: DepGenModule trait + DepGenCollector inheriting ProfilerBase<DepGenCollector, DepGenModule>, so the mgmt thread, buffer-pool manager, and poll loop come from the unified profiling framework. on_buffer_collected appends DepGenRecord values to submit_trace.bin; reconcile_counters cross-checks collected + dropped == device_total and reports clean/dirty so the future replay step can skip deps.json on incomplete traces.
- Device runner wiring (sim + onboard): DepGenCollector field, set_dep_gen_enabled() setter, init_dep_gen() helper, perf_cleanup RAII guard, kernel_args.dep_gen_data_base plumbed through to AICPU via set_platform_dep_gen_base, stop + reconcile + finalize at end of run, make_dep_gen_path() helper.
- AICPU executor lifecycle: set_platform_dep_gen_base / set_dep_gen_enabled / dep_gen_aicpu_set_orch_thread_idx / dep_gen_aicpu_init / dep_gen_aicpu_flush / dep_gen_aicpu_finalize hooked in alongside the existing PMU lifecycle calls.
- CallConfig + Python: `enable_dep_gen` int32 flag on ChipCallConfig, nanobind Python property (bool), conftest.py and scene_test.py `--enable-dep-gen` CLI option threaded through run_class_cases / _run_and_validate / _build_config. round > 1 disables capture (same pattern as enable_l2_swimlane).
- ProfilerBase suppression: `bugprone-crtp-constructor-accessibility` was newly tripped by the new dep_gen_collector.cpp pulling profiler_base.h into clang-tidy's scope. Suppressed with NOLINTNEXTLINE + comment explaining the intentional public ctor (all derived collectors call it).
- a2a3sim test: tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py re-uses the vector_example orchestration (5 submit_task calls), runs with --enable-dep-gen, then asserts submit_trace.bin size equals 5 * sizeof(DepGenRecord) and spot-checks the first record's tensor_count == 3.

Verified:

- a2a3sim spmd_sync_start baseline (no flag): PASSED; wiring did not perturb the default path.
- a2a3sim dep_gen_capture (with --enable-dep-gen): PASSED, post-run verification PASSED (submit_trace.bin has exactly 5 records).
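The counter cross-check that reconcile_counters performs can be illustrated with a small sketch. The commit message only states the invariant `collected + dropped == device_total` and that the result is reported as clean or dirty; treating any nonzero drop count as dirty is an assumption here, and the Python function below is a hypothetical stand-in for the C++ method.

```python
def reconcile_counters(collected, dropped, device_total):
    """Return True when the trace looks clean, False when it is dirty.

    Assumption: a trace is clean only if the counters are consistent
    AND nothing was dropped, so a later replay sees every record.
    """
    consistent = (collected + dropped == device_total)
    return consistent and dropped == 0


# All 5 submits arrived on the host: safe to replay.
clean_run = reconcile_counters(collected=5, dropped=0, device_total=5)

# One record was dropped on the device side: replay should be skipped.
lossy_run = reconcile_counters(collected=4, dropped=1, device_total=5)

# Counters do not add up at all: also treated as dirty.
inconsistent_run = reconcile_counters(collected=4, dropped=0, device_total=5)
```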
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 11, 2026:

…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. It skips the replay on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. The explicit path entry in tests/ut/cpp/CMakeLists.txt follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max observed local_id up to next pow2) so slot indexing never aliases.
2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and HB violation count.
3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge also appears in deps.json.
   The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.
4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)**: turns the replay product into a self-contained pan/zoom HTML page (Graphviz SVG + 80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color per node type so AIC (cube, blue box), AIV (vector, orange ellipse), mix (green diamond: single submit_task spanning both core types via MixedKernels), and alloc (gray dashed note: tasks from `alloc_tensors` that produced output tensors but never dispatched a kernel) stay readable even without color. Auto-loads the colocated `l2_perf_records.json` and `name_map_*.json` sidecars for label enrichment; isolated tasks (no inbound/outbound edges) still show up. `--engine sfdp` for graphs past ~500 nodes.

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
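The per-type node styling described for the viewer can be sketched as a small Graphviz DOT generator. This is not the code of `deps_to_graph.py`: the function name `to_dot`, the `node_types` mapping, and the edge-pair input format are assumptions made for illustration. Only the shape and color choices (blue box, orange ellipse, green diamond, gray dashed note) come from the commit message.

```python
# Shape and color per node type, as listed in the commit message.
NODE_STYLE = {
    "aic": "shape=box, color=blue",
    "aiv": "shape=ellipse, color=orange",
    "mix": "shape=diamond, color=green",
    "alloc": "shape=note, color=gray, style=dashed",
}


def to_dot(node_types, edges):
    """Build DOT text from {task_id: type} and a list of (pred, succ) pairs.

    Every task is declared as a node first, so isolated tasks
    (no inbound or outbound edges) still appear in the graph.
    """
    lines = ["digraph deps {"]
    for task_id in sorted(node_types):
        style = NODE_STYLE[node_types[task_id]]
        lines.append(f"  t{task_id} [{style}];")
    for pred, succ in edges:
        lines.append(f"  t{pred} -> t{succ};")
    lines.append("}")
    return "\n".join(lines)


# Task 2 is an isolated alloc task; it is still emitted as a node.
dot = to_dot({0: "aic", 1: "aiv", 2: "alloc"}, [(0, 1)])
```

The resulting text can be fed to a Graphviz layout engine such as `dot` or `sfdp` to produce the SVG.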
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 11, 2026:

…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. It skips the replay on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. The explicit path entry in tests/ut/cpp/CMakeLists.txt follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max observed local_id up to next pow2) so slot indexing never aliases.
2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and HB violation count.
3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge also appears in deps.json.
   The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.
4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)**: turns the replay product into a self-contained pan/zoom HTML page (Graphviz SVG + 80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color per node type so AIC (cube, blue box), AIV (vector, orange ellipse), mix (green diamond: single submit_task spanning both core types via MixedKernels), and alloc (gray dashed note: tasks from `alloc_tensors` that produced output tensors but never dispatched a kernel) stay readable even without color. Auto-loads the colocated `l2_perf_records.json` and `name_map_*.json` sidecars for label enrichment; isolated tasks (no inbound/outbound edges) still show up. `--engine sfdp` for graphs past ~500 nodes.
5. **In-memory capture → replay (drops `submit_trace.bin`)**: the host collector now accumulates `DepGenRecord` entries directly in a `std::vector<DepGenRecord>` instead of streaming them to `submit_trace.bin` on disk. The replay function takes a pointer + count from `DepGenCollector::records()` and skips the file round-trip entirely. deps.json is the only on-disk dep_gen artifact now; the `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer takes a path, and the replay C ABI is now `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`.

Also clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture call-site (defensive: the Arg builder already caps at MAX_TENSOR_ARGS but this prevents a future builder bypass from overflowing the stack buffers).

The weak fallback in non-dep_gen runtimes (host_build_graph) drops from LOG_WARN to LOG_DEBUG since that path is unreachable for end users; it exists only to keep the .so loadable.

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes hw-native-sys#599.
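The window auto-sizing mentioned in item 1 can be sketched like this. The commit message says only that the maximum observed local_id is rounded up to the next power of two; the sketch assumes slots are computed as `local_id & (window - 1)` and therefore rounds `max_local_id + 1`, so that the largest id cannot collide with slot 0. The helper names are hypothetical.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def window_size(local_ids):
    """Pick a power-of-two window large enough for every observed local_id."""
    if not local_ids:
        return 1
    # +1 because ids 0..max need max + 1 distinct slots (assumption).
    return next_pow2(max(local_ids) + 1)


def slot(local_id, window):
    """Map a local_id to a slot; valid because window is a power of two."""
    return local_id & (window - 1)


ids = [0, 1, 2, 3, 4]
window = window_size(ids)
slots = [slot(i, window) for i in ids]
```

With five observed ids the window becomes 8, and every id lands in its own slot.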
ChaoWao added a commit that referenced this pull request on May 11, 2026:

…s gate (#737)

Stacked on #736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. It skips the replay on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. The explicit path entry in tests/ut/cpp/CMakeLists.txt follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max observed local_id up to next pow2) so slot indexing never aliases.
2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and HB violation count.
3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge also appears in deps.json.
   The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.
4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)**: turns the replay product into a self-contained pan/zoom HTML page (Graphviz SVG + 80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color per node type so AIC (cube, blue box), AIV (vector, orange ellipse), mix (green diamond: single submit_task spanning both core types via MixedKernels), and alloc (gray dashed note: tasks from `alloc_tensors` that produced output tensors but never dispatched a kernel) stay readable even without color. Auto-loads the colocated `l2_perf_records.json` and `name_map_*.json` sidecars for label enrichment; isolated tasks (no inbound/outbound edges) still show up. `--engine sfdp` for graphs past ~500 nodes.
5. **In-memory capture → replay (drops `submit_trace.bin`)**: the host collector now accumulates `DepGenRecord` entries directly in a `std::vector<DepGenRecord>` instead of streaming them to `submit_trace.bin` on disk. The replay function takes a pointer + count from `DepGenCollector::records()` and skips the file round-trip entirely. deps.json is the only on-disk dep_gen artifact now; the `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer takes a path, and the replay C ABI is now `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`.

Also clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture call-site (defensive: the Arg builder already caps at MAX_TENSOR_ARGS but this prevents a future builder bypass from overflowing the stack buffers).

The weak fallback in non-dep_gen runtimes (host_build_graph) drops from LOG_WARN to LOG_DEBUG since that path is unreachable for end users; it exists only to keep the .so loadable.

Also fixes a width-mismatch bug in #736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.
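The width-mismatch bug can be reproduced in a few lines by packing the tags as little-endian int32 values and then reading the buffer as if it held one byte per tag. The numeric values `INPUT = 0` and `OUTPUT = 1` are assumptions chosen for illustration (the real TensorArgType values are not stated above); with them the misread yields the same (0, 0, 0) the commit message reports.

```python
import struct

# Hypothetical enum values; the real TensorArgType values may differ.
INPUT, OUTPUT = 0, 1
tags = [INPUT, INPUT, OUTPUT]

# What the caller actually holds: three little-endian int32 tags (12 bytes).
raw = struct.pack("<3i", *tags)

# Buggy read: treat the int32 buffer as uint8_t[] and take one byte per tag.
# The first three bytes all belong to the first tag's 4-byte encoding.
misread = tuple(raw[:3])

# Fix: narrow each tag to a single byte before handing it to the writer.
narrowed = bytes(t & 0xFF for t in tags)
fixed = tuple(narrowed)
```

After narrowing, the buffer holds exactly one byte per tag, which matches the uint8_t[16] arg_types layout described above.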
Summary
Capture every `Orchestrator::submit_task` call into a streaming `SubmitTrace` ring on the a2a3 `tensormap_and_ringbuffer` runtime, drained by a host collector to `submit_trace.bin`. Phase 1 of replacing #500: the host-side offline replay that reconstructs `deps.json` from these records ships in a follow-up PR; this PR is intentionally scoped to the capture path so it can land independently and be reviewed in isolation.

Motivation: today's swimlane `L2PerfRecord::fanout[]` is filled at the producer's completion-record commit moment, so a fast producer that finishes before a later consumer is submitted has its record sealed without the new edge, and those dep edges are silently lost. Capturing the orch's submit-time inputs and replaying offline reconstructs the complete logical graph because the replay can run with no eviction.
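The lost-edge scenario and its offline fix can be illustrated with a toy model. Everything below is a simplified assumption: `PerfRecord` is a hypothetical stand-in for a record whose fanout list is sealed at completion, and `replay_edges` tracks only the last writer of each named tensor, which is far simpler than the real tensor-map logic.

```python
class PerfRecord:
    """Toy per-task record whose fanout list is sealed at completion."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.fanout = []
        self.sealed = False

    def add_consumer(self, consumer_id):
        # Once sealed, a late consumer cannot register: the edge is lost.
        if self.sealed:
            return False
        self.fanout.append(consumer_id)
        return True


def replay_edges(submit_log):
    """Rebuild edges from submit-time inputs, in submit order.

    submit_log entries are (task_id, input_names, output_names).
    """
    last_writer = {}
    edges = []
    for task_id, inputs, outputs in submit_log:
        for name in inputs:
            if name in last_writer:
                edges.append((last_writer[name], task_id))
        for name in outputs:
            last_writer[name] = task_id
    return edges


# t0 produces "a" and completes before t1 (which reads "a") is submitted.
rec0 = PerfRecord(0)
rec0.sealed = True
registered = rec0.add_consumer(1)

# The submit log still carries enough information to recover the edge.
edges = replay_edges([(0, [], ["a"]), (1, ["a"], ["b"])])
```

The sealed record ends up with an empty fanout list, while the replay over the submit log recovers the edge from task 0 to task 1.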
What lands in this PR
- `src/a2a3/platform/include/common/dep_gen.h`: `DepGenRecord` (2240 B per submit: task_id, flags, arg_types[16], explicit_deps[16], tensors[16][128] as opaque blobs). Plus the PMU/L2Perf-style SPSC buffer family (FreeQueue / BufferState / ReadyQueueEntry / DataHeader). Single-instance (orch is one AICPU thread).
- `src/a2a3/platform/{include/aicpu,src/aicpu}/dep_gen_collector_aicpu.{h,cpp}`: all-primitive interface (raw uint64 task_id, void* per-Tensor blob, uint64* explicit_deps) so the platform header stays runtime-agnostic. `submit_task` entry calls `dep_gen_aicpu_record_submit` gated on `is_dep_gen_enabled()`. Orch callsite static_asserts `sizeof(Tensor) == DEP_GEN_TENSOR_SIZE` and `PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS`. Weak fallbacks let host builds link without the AICPU strong symbols.
- `src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}`: `DepGenModule` trait + `DepGenCollector : ProfilerBase<...>`, so the mgmt thread / buffer-pool manager / poll loop come from the unified profiling framework (Refactor(a2a3): decouple profiling from runtime, own it in platform #714). `on_buffer_collected` appends `DepGenRecord` values to `submit_trace.bin`; `reconcile_counters` cross-checks `collected + dropped == device_total` and returns clean/dirty so the future replay step can skip `deps.json` on incomplete traces.
- Device runner wiring (sim + onboard): `DepGenCollector` field, `set_dep_gen_enabled()` setter, `init_dep_gen()` helper, `kernel_args.dep_gen_data_base` plumbed through to AICPU, RAII cleanup, end-of-run flush + reconcile, `make_dep_gen_path()` helper.
- AICPU executor lifecycle: `set_platform_dep_gen_base` / `set_dep_gen_enabled` / `dep_gen_aicpu_set_orch_thread_idx` / `dep_gen_aicpu_init` / `dep_gen_aicpu_flush` / `dep_gen_aicpu_finalize` hooked alongside the existing PMU calls.
- CallConfig + Python: `enable_dep_gen` int32 flag, nanobind Python property (bool), `conftest.py` and `scene_test.py` `--enable-dep-gen` CLI option threaded through `run_class_cases` / `_run_and_validate` / `_build_config`. `--rounds > 1` disables capture (matches the `--enable-l2-swimlane` pattern).
- `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py`: re-uses the `vector_example` orchestration (5 `submit_task` calls), runs with `--enable-dep-gen`, then asserts `submit_trace.bin` size equals `5 * sizeof(DepGenRecord)` and spot-checks the first record's `tensor_count == 3`.

What does NOT land in this PR
- The host-side offline replay that reconstructs `deps.json`
- `swimlane_converter.py` integration that prefers `deps.json` over the in-record `fanout[]`
- […] separate wiring)

These follow in subsequent PRs once the capture path is reviewed.
Testing
- `pre-commit run` (clang-format, clang-tidy, cpplint, ruff, pyright): all pass
- `spmd_sync_start` baseline (no `--enable-dep-gen`): PASSED; wiring does not perturb the default path
- `dep_gen_capture` (with `--enable-dep-gen`): PASSED, post-run verification PASSED (`submit_trace.bin` has exactly 5 records totalling 11200 bytes)
Notes
`profiler_base.h` had to gain a one-line `// NOLINTNEXTLINE(bugprone-crtp-constructor-accessibility)` because `dep_gen_collector.cpp` is the first translation unit in this checkout to pull the header into pre-commit's clang-tidy scope. Existing PmuCollector / L2PerfCollector use the same public ctor pattern; making it `protected` would force every derived class to declare `friend ProfilerBase` solely for ctor visibility.