Refactor: replace arg_index with positional dump + func_id array #1181
📝 Walkthrough
Callable construction now uses signature-only mappings without
Changes
Callable and tensor-dump migration
Sequence Diagram(s)
sequenceDiagram
participant dump_args_for_task
participant dump_arg_record
participant TensorDumpCollector
participant export_dump_files
dump_args_for_task->>dump_arg_record: map signature entry i to payload slot i
dump_arg_record->>TensorDumpCollector: write func_count and func_ids[]
TensorDumpCollector->>export_dump_files: emit func_id array in JSON
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request simplifies the tensor dump interface by removing the explicit arg_index mapping and instead positionally mapping signature entries to payload slots. It also updates the tensor dump record to carry an array of active subtask kernel IDs (func_ids) to support cooperative mix tasks without duplicating records. The review feedback highlights critical safety and correctness issues: potential out-of-bounds reads in both tensor_dump_collector.cpp and tensor_dump_aicpu.cpp if the subtask count exceeds TENSOR_DUMP_MAX_FUNC_IDS, and a positional mapping bug in tensor_dump_aicpu.h where skipped SCALAR entries incorrectly offset the payload slot index for subsequent tensors.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/workers/l3/ep_dispatch_combine/main.py`:
- Around line 221-223: The child callables built in the dispatch/combine setup
still use compact signatures, which no longer line up with the task’s positional
payload after dropping arg_index. Update the CoreCallable.build usage for
sig_local_expert and sig_combine so they use the full positional child signature
shape expected by sig_orch, or add an explicit mapping layer for non-prefix
slots; keep the fix localized to the child signature construction in main.py.
In `@src/common/platform/shared/host/tensor_dump_collector.cpp`:
- Around line 227-230: Clamp rec.func_count before assigning it to dt.func_count
in tensor_dump_collector.cpp so later consumers like the serialization path in
the same file don’t trust an oversized count and read past dt.func_ids. Update
the ingest logic around the record handling to store only the clamped value
based on TENSOR_DUMP_MAX_FUNC_IDS, and make the serialization loop use that
sanitized dt.func_count consistently.
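The clamping suggested for `tensor_dump_collector.cpp` could look roughly like the sketch below. It is a minimal illustration, not the repository's code: `RawRecord`, `DumpedTensorSketch`, `ingest`, and `func_ids_json` are hypothetical names, and the value 3 for `TENSOR_DUMP_MAX_FUNC_IDS` is taken from the `func_ids[3]` wording in the PR description.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Assumption: 3 matches the "func_ids[3] + func_count" layout in the PR text.
constexpr std::uint32_t TENSOR_DUMP_MAX_FUNC_IDS = 3;

// Hypothetical, simplified stand-in for the record written by the device side.
struct RawRecord {
    std::uint8_t func_count;
    std::int32_t func_ids[TENSOR_DUMP_MAX_FUNC_IDS];
};

// Hypothetical, simplified stand-in for the host-side DumpedTensor.
struct DumpedTensorSketch {
    std::uint32_t func_count = 0;
    std::int32_t func_ids[TENSOR_DUMP_MAX_FUNC_IDS] = {};
};

// Ingest: store only the clamped count so later readers never index past func_ids.
DumpedTensorSketch ingest(const RawRecord &rec) {
    DumpedTensorSketch dt;
    dt.func_count = std::min<std::uint32_t>(rec.func_count, TENSOR_DUMP_MAX_FUNC_IDS);
    for (std::uint32_t i = 0; i < dt.func_count; ++i) {
        dt.func_ids[i] = rec.func_ids[i];
    }
    return dt;
}

// Serialization: loops over the sanitized dt.func_count, never the raw record value.
std::string func_ids_json(const DumpedTensorSketch &dt) {
    std::string out = "[";
    for (std::uint32_t i = 0; i < dt.func_count; ++i) {
        if (i != 0) out += ",";
        out += std::to_string(dt.func_ids[i]);
    }
    return out + "]";
}
```

Clamping once at ingest keeps every later consumer simple, since none of them has to re-validate the count.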
📒 Files selected for processing (100)
- docs/dfx/args-dump.md
- examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
- examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
- examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
- examples/a2a3/tensormap_and_ringbuffer/l3_l2_orch_comm_stream/l3_l2_orch_comm_stream.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
- examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
- examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
- examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
- examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
- examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
- examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
- examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
- examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
- examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
- examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
- examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
- examples/workers/l2/per_task_runtime_env/main.py
- examples/workers/l2/vector_add/main.py
- examples/workers/l3/all_to_all_distributed/main.py
- examples/workers/l3/allgather_distributed/main.py
- examples/workers/l3/allreduce_distributed/main.py
- examples/workers/l3/allreduce_ring_distributed/main.py
- examples/workers/l3/broadcast_distributed/main.py
- examples/workers/l3/child_memory/main.py
- examples/workers/l3/domain_rank_map/main.py
- examples/workers/l3/dual_domain_overlap/main.py
- examples/workers/l3/ep_dispatch_combine/main.py
- examples/workers/l3/ffn_tp_parallel/main.py
- examples/workers/l3/multi_chip_dispatch/main.py
- examples/workers/l3/per_task_runtime_env/main.py
- examples/workers/l3/reduce_scatter_distributed/main.py
- python/bindings/task_interface.cpp
- simpler_setup/scene_test.py
- src/a2a3/platform/include/common/tensor_dump.h
- src/a5/platform/include/common/tensor_dump.h
- src/common/platform/include/aicpu/tensor_dump_aicpu.h
- src/common/platform/include/host/tensor_dump_collector.h
- src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
- src/common/platform/shared/host/tensor_dump_collector.cpp
- src/common/task_interface/callable.h
- tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
- tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
- tests/st/a2a3/host_build_graph/matmul/test_matmul.py
- tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
- tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
- tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
- tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
- tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane_mixed.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
- tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
- tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py
- tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
- tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
- tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
- tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
- tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
- tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
- tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
- tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
- tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
- tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
- tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
- tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
- tests/st/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py
- tests/st/a5/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
- tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
- tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
- tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
- tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
- tests/st/aicore_op_timeout/test_aicore_op_timeout.py
- tests/ut/cpp/types/test_chip_callable_upload_immutable.cpp
- tests/ut/py/test_task_interface.py
💤 Files with no reviewable changes (79)
- tests/st/a2a3/host_build_graph/matmul/test_matmul.py
- examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
- tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
- examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
- tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
- examples/workers/l3/multi_chip_dispatch/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
- tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
- examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
- examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
- tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
- tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
- examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
- tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
- tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
- tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
- examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
- examples/workers/l3/reduce_scatter_distributed/main.py
- examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
- examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
- tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
- examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
- tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
- tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
- examples/workers/l3/domain_rank_map/main.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
- examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
- tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
- examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
- tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
- tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
- examples/workers/l3/all_to_all_distributed/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
- examples/workers/l3/allreduce_ring_distributed/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
- tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
- examples/workers/l3/dual_domain_overlap/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
- examples/workers/l3/ffn_tp_parallel/main.py
- examples/workers/l3/allgather_distributed/main.py
- examples/workers/l3/broadcast_distributed/main.py
- examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
- tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
- examples/workers/l3/per_task_runtime_env/main.py
- examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
- examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
- tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
- tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
- examples/workers/l2/per_task_runtime_env/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
- examples/workers/l2/vector_add/main.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
- tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
- tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
- tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
- examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
- tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
- tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
- examples/workers/l3/allreduce_distributed/main.py
Overall I agree with the direction of this refactor (positional mapping + func_id array, removing
1. [Robustness / medium] The positional dump relies on "the first active subtask is full-width" and will silently skip dumps
if (covered_count != pl.tensor_count && try_log_dump_args_layout_mismatch())
    LOG_WARN("... signature covers %d tensor slots but payload has %d; the rest are not dumped.");
The old mechanism iterated over every subtask and used each one's own
Suggested change: take the signature of the widest active subtask (pick
// Current:
if (sig_src == nullptr) {
    sig_src = reinterpret_cast<const CoreCallable *>(callable_addr);
}
// Suggested:
const CoreCallable *cand = reinterpret_cast<const CoreCallable *>(callable_addr);
if (sig_src == nullptr || cand->sig_count() > sig_src->sig_count()) {
    sig_src = cand;
}
Downstream walk /
2. [Test coverage / minor] No ST asserts the new
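The widest-signature selection suggested in item 1 can be sketched as a standalone function. This is only an illustration under stated assumptions: `CallableSketch` is a hypothetical stand-in for `CoreCallable` that exposes nothing but `sig_count()`, `pick_widest` is a made-up helper name, and null entries are assumed to model inactive subtask slots.

```cpp
#include <vector>

// Hypothetical stand-in for CoreCallable: only sig_count() matters here.
struct CallableSketch {
    int sig_entries;
    int sig_count() const { return sig_entries; }
};

// Returns the active callable with the largest signature, or nullptr if no
// slot is active. A nullptr entry models an inactive subtask slot (assumption).
const CallableSketch *pick_widest(const std::vector<const CallableSketch *> &slots) {
    const CallableSketch *sig_src = nullptr;
    for (const CallableSketch *cand : slots) {
        if (cand == nullptr) {
            continue;  // inactive slot, nothing to compare
        }
        // Strict '>' keeps the earliest slot when two signatures tie in width.
        if (sig_src == nullptr || cand->sig_count() > sig_src->sig_count()) {
            sig_src = cand;
        }
    }
    return sig_src;
}
```

With this selection the positional walk covers as many payload slots as any active subtask declares, so a narrow first subtask no longer truncates the dump.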
The args dump no longer needs a per-incore arg_index to map each declared tensor to a payload slot. Each incore declares its full (mix-task) signature and the dump maps signature entry i to payload slot i positionally; every record carries the task's active-subtask set as a func_id array (its mix membership) rather than a single scalar func_id.

This supersedes the hw-native-sys#1123 arg_index mechanism (and the hw-native-sys#1171 follow-up that backfilled arg_index for l3_l2_orch_comm), tailored for the upcoming l0_swimlane tool, which reconstructs and replays a whole mix task rather than one kernel.

- dump: a single positional walk over the first active subtask's signature; each payload tensor is emitted once, stamped with the func array (no per-subtask geometry duplication). Slots beyond the payload (a prefix-dispatched task) are skipped.
- record/info/DumpedTensor: func_id scalar -> func_ids[3] + func_count (reuses the existing pad, 128B record unchanged; TENSOR_DUMP_MAX_FUNC_IDS is tied to PTO2_SUBTASK_SLOT_COUNT by a static_assert).
- args_dump.json: func_id is now an array ([0,1,2] for a mix, [0] for a single-kernel task).
- CoreCallable.build / make_callable: drop the arg_index parameter, field, and accessor; scene_test and the binding stop threading it.
- migrate every CALLABLE incore / CoreCallable.build repo-wide to drop arg_index; complete the offset-mix signatures (mixed_example a2a3/a5, l2_swimlane_mixed) to full width so positional mapping covers payload.
- docs/dfx/args-dump.md updated for the func_id array + positional model.

Verified on a2a3 silicon (--dump-args 3, golden PASS, func_id arrays correct, each slot emitted once): mixed_example (offset mix, 108 records), l2_swimlane_mixed, spmd_basic (cooperative mix), l3_l2_orch_comm_stream (hw-native-sys#1171 site), dummy_task.
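The positional model described in this commit message can be sketched as follows. This is a simplified illustration, not the actual dump code: `RecordSketch` and `dump_positional` are hypothetical names, scalar signature entries are ignored entirely, the value 3 for both constants is an assumption based on the `func_ids[3]` wording, and the `static_assert` is assumed to be an equality check.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values: the text only says func_ids has 3 entries and that the two
// constants are tied together by a static_assert.
constexpr std::size_t PTO2_SUBTASK_SLOT_COUNT = 3;
constexpr std::size_t TENSOR_DUMP_MAX_FUNC_IDS = 3;
static_assert(TENSOR_DUMP_MAX_FUNC_IDS == PTO2_SUBTASK_SLOT_COUNT,
              "func_ids must hold one entry per subtask slot");

// Hypothetical record: one per dumped payload tensor.
struct RecordSketch {
    std::size_t slot = 0;                                        // payload slot index
    std::array<std::int32_t, TENSOR_DUMP_MAX_FUNC_IDS> func_ids{};  // mix membership
    std::uint8_t func_count = 0;
};

// Signature entry i maps to payload slot i. Entries beyond the payload
// (a prefix-dispatched task) are skipped, and each slot is emitted once,
// stamped with the task's active func_id set.
std::vector<RecordSketch> dump_positional(std::size_t sig_count,
                                          std::size_t payload_tensor_count,
                                          const std::vector<std::int32_t> &active_func_ids) {
    const std::size_t covered = std::min(sig_count, payload_tensor_count);
    const std::size_t n_funcs = std::min(active_func_ids.size(), TENSOR_DUMP_MAX_FUNC_IDS);
    std::vector<RecordSketch> out;
    out.reserve(covered);
    for (std::size_t i = 0; i < covered; ++i) {
        RecordSketch rec;
        rec.slot = i;
        rec.func_count = static_cast<std::uint8_t>(n_funcs);
        std::copy_n(active_func_ids.begin(), n_funcs, rec.func_ids.begin());
        out.push_back(rec);
    }
    return out;
}
```

For a mix task with active func_ids {0, 1, 2}, every emitted record carries all three ids, which corresponds to the `[0,1,2]` array in args_dump.json; a single-kernel task yields records with a one-element array.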
…3 skip after rebase

Rebased onto main (which now has hw-native-sys#1188, hw-native-sys#1189, hw-native-sys#1181):

- hw-native-sys#1181 removed CoreCallable.build's arg_index param. Update _build_chip_callable to the new build(signature, binary) signature (the async + require_sync_start + hang cases all build a CoreCallable).
- hw-native-sys#1189 made the tensor-data wait timeout 15 s on both arches (was 300 s on a2a3), so tensor_wait_timeout no longer needs to skip a2a3. Drop onboard_skip={"a2a3"} and the now-unused onboard_skip machinery; code 8 is now covered on both arches.

Verified: a2a3sim suite green; full a2a3 onboard suite (10 cases) green, including tensor_wait_timeout firing code 8 at 15 s with clean teardown.
Dump-driven tool that traces one task's intra-core AICore pipeline under the msprof op simulator (camodel), one level below an L2 task block.

- Capture a task's real args[] from a JSON-only args dump (hw-native-sys#1181 positional model), reconstruct them, and generate a combined replay workspace, with zero hand-written shapes or scalars.
- Mix-together replay: a whole mix task (AIC + AIV0 + AIV1) runs as one msprof op. Two same-source AIV members collapse to a single-AIV include (both lanes run it), covering SPMD mixes whose aiv0 == aiv1.
- Flags for the manual decisions: --func-id (member set), --set-arg (shrink a scalar / control-tensor loop count without distorting the per-iteration pipeline), --spmd-block-num, --case, --debug-line.
- docs/dfx/l0-swimlane-profiling.md: usage, fidelity rules, and a hw-native-sys#1181-suite coverage table with a runnable command per task shape.
- .claude/skills/l0-swimlane/SKILL.md: agent-facing operating procedure, centered on how to pick --set-arg / --func-id / --case.
The args dump no longer needs a per-incore arg_index to map each declared tensor to a payload slot. Each incore declares its full (mix-task) signature and the dump maps signature entry i to payload slot i positionally; every record carries the task's active-subtask set as a func_id array (its mix membership) rather than a single scalar func_id.
This supersedes the #1123 arg_index mechanism (and the #1171 follow-up that backfilled arg_index for l3_l2_orch_comm), tailored for the upcoming l0_swimlane tool, which reconstructs and replays a whole mix task rather than one kernel.
Verified on a2a3 silicon (--dump-args 3, golden PASS, func_id arrays correct, each slot emitted once): mixed_example (offset mix, 108 records), l2_swimlane_mixed, spmd_basic (cooperative mix), l3_l2_orch_comm_stream (#1171 site), dummy_task.