Refactor: replace arg_index with positional dump + func_id array by indigo1973 · Pull Request #1181 · hw-native-sys/simpler

indigo1973 · 2026-06-27T09:15:37Z

The args dump no longer needs a per-incore arg_index to map each declared tensor to a payload slot. Each incore declares its full (mix-task) signature and the dump maps signature entry i to payload slot i positionally; every record carries the task's active-subtask set as a func_id array (its mix membership) rather than a single scalar func_id.

This supersedes the #1123 arg_index mechanism (and the #1171 follow-up that backfilled arg_index for l3_l2_orch_comm) and is tailored for the upcoming l0_swimlane tool, which reconstructs and replays a whole mix task rather than one kernel.

- dump: a single positional walk over the first active subtask's signature; each payload tensor is emitted once, stamped with the func array (no per-subtask geometry duplication). Slots beyond the payload (a prefix-dispatched task) are skipped.
- record/info/DumpedTensor: func_id scalar -> func_ids[3] + func_count (reuses the existing pad, 128B record unchanged; TENSOR_DUMP_MAX_FUNC_IDS is tied to PTO2_SUBTASK_SLOT_COUNT by a static_assert).
- args_dump.json: func_id is now an array ([0,1,2] for a mix, [0] for a single-kernel task).
- CoreCallable.build / make_callable: drop the arg_index parameter, field, and accessor; scene_test and the binding stop threading it.
- migrate every CALLABLE incore / CoreCallable.build repo-wide to drop arg_index; complete the offset-mix signatures (mixed_example a2a3/a5, l2_swimlane_mixed) to full width so positional mapping covers the payload.
- docs/dfx/args-dump.md updated for the func_id array + positional model.

Verified on a2a3 silicon (--dump-args 3, golden PASS, func_id arrays correct, each slot emitted once): mixed_example (offset mix, 108 records), l2_swimlane_mixed, spmd_basic (cooperative mix), l3_l2_orch_comm_stream (#1171 site), dummy_task.
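For readers who want the positional model in concrete form, here is a minimal Python sketch of the walk described above. It is an illustration only, not the code in tensor_dump_aicpu.h: the names `DumpRecord` and `positional_dump` are hypothetical, the signature is modelled as a plain list of direction strings, and scalar handling is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Mirrors func_ids[3] from the PR description; the real constant is
# TENSOR_DUMP_MAX_FUNC_IDS in the C++ headers.
MAX_FUNC_IDS = 3


@dataclass
class DumpRecord:
    """Hypothetical stand-in for one dumped tensor record."""
    slot: int
    direction: str
    func_ids: List[int]


def positional_dump(signature: Sequence[str],
                    payload_tensor_count: int,
                    active_func_ids: Sequence[int]) -> List[DumpRecord]:
    """Map signature entry i to payload slot i, emitting each slot once."""
    if len(active_func_ids) > MAX_FUNC_IDS:
        raise ValueError("more active subtasks than the record can hold")
    records = []
    for i, direction in enumerate(signature):
        if i >= payload_tensor_count:
            # Slots beyond the payload (prefix-dispatched task) are skipped.
            break
        # Every record is stamped with the whole active-subtask set.
        records.append(DumpRecord(i, direction, list(active_func_ids)))
    return records
```

Under these assumptions, a three-entry signature dispatched with a two-tensor payload yields two records, both carrying the full func_id array.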

coderabbitai · 2026-06-27T09:15:57Z

Walkthrough

Callable construction now uses signature-only mappings without arg_index. Dump records now store multiple active subtask IDs and serialize them as arrays. The docs, runtime, examples, and tests were updated to match the new callable and dump shapes.

Changes

Callable and tensor-dump migration

Layer / File(s)	Summary
Leaf callable contract `src/common/task_interface/callable.h`, `python/bindings/task_interface.cpp`, `simpler_setup/scene_test.py`, `tests/ut/cpp/types/test_chip_callable_upload_immutable.cpp`, `tests/ut/py/test_task_interface.py`	`Callable<void, MaxSig, 0>` drops `arg_index`, `CoreCallable.build` no longer accepts it in Python, and helper tests build callables without it.
Dump metadata and collector `docs/dfx/args-dump.md`, `src/a2a3/platform/include/common/tensor_dump.h`, `src/a5/platform/include/common/tensor_dump.h`, `src/common/platform/include/aicpu/tensor_dump_aicpu.h`, `src/common/platform/include/host/tensor_dump_collector.h`, `src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp`, `src/common/platform/shared/host/tensor_dump_collector.cpp`, `tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py`, `tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane_mixed.py`, `tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py`, `tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py`, `tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py`, `tests/st/a5/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py`	Dump records now carry `func_ids` arrays and `func_count`, `dump_args_for_task` maps signature entries positionally, and the dump-focused docs/tests reflect that layout.
Build call sites `tests/st/aicore_op_timeout/test_aicore_op_timeout.py`, `examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py`, `examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py`, `examples/a2a3/tensormap_and_ringbuffer/l3_l2_orch_comm_stream/l3_l2_orch_comm_stream.py`, `examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py`, `examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py`, `examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py`, `examples/workers/l2/`, `examples/workers/l3/`, `tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py`, `tests/st/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py`	Examples and tests that call `CoreCallable.build` now omit `arg_index` and use the provided signatures for argument ordering.
A2A3 config updates `examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/`, `examples/a2a3/tensormap_and_ringbuffer/paged_attention`, `examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/`, `examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/`, `examples/a2a3/tensormap_and_ringbuffer/vector_example/*`, `tests/st/a2a3/tensormap_and_ringbuffer/...`	A2A3 tensormap_and_ringbuffer example and regression configs remove per-core `arg_index` fields and rely on signature ordering.
A5 config updates `examples/a5/tensormap_and_ringbuffer/bgemm/`, `examples/a5/tensormap_and_ringbuffer/paged_attention`, `examples/a5/tensormap_and_ringbuffer/vector_example/*`, `tests/st/a5/tensormap_and_ringbuffer/...`	A5 tensormap_and_ringbuffer example and regression configs remove per-core `arg_index` fields and rely on signature ordering.
Host-build-graph tests `tests/st/a2a3/host_build_graph/`, `tests/st/a5/host_build_graph/`	Host-build-graph callable configs remove per-core `arg_index` fields in the updated scenes.

Sequence Diagram(s)

sequenceDiagram
  participant dump_args_for_task
  participant dump_arg_record
  participant TensorDumpCollector
  participant export_dump_files

  dump_args_for_task->>dump_arg_record: map signature entry i to payload slot i
  dump_arg_record->>TensorDumpCollector: write func_count and func_ids[]
  TensorDumpCollector->>export_dump_files: emit func_id array in JSON

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/simpler#1123: Changes the same tensor-dump and argument-mapping path in the opposite direction by making arg_index mandatory again.
hw-native-sys/simpler#1171: Touches the same examples/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py call site and adjusts CoreCallable.build arguments there.
hw-native-sys/simpler#1100: Updates tensormap_and_ringbuffer signatures and mix-task callable shapes, which align with the positional signature coverage used here.

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: replacing arg_index with positional dumping and func_id arrays.
Description check	✅ Passed	The description matches the PR’s core refactor and supporting updates, including positional mapping, func_id arrays, and CoreCallable changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request simplifies the tensor dump interface by removing the explicit arg_index mapping and instead positionally mapping signature entries to payload slots. It also updates the tensor dump record to carry an array of active subtask kernel IDs (func_ids) to support cooperative mix tasks without duplicating records. The review feedback highlights critical safety and correctness issues: potential out-of-bounds reads in both tensor_dump_collector.cpp and tensor_dump_aicpu.cpp if the subtask count exceeds TENSOR_DUMP_MAX_FUNC_IDS, and a positional mapping bug in tensor_dump_aicpu.h where skipped SCALAR entries incorrectly offset the payload slot index for subsequent tensors.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/workers/l3/ep_dispatch_combine/main.py`:
- Around line 221-223: The child callables built in the dispatch/combine setup
still use compact signatures, which no longer line up with the task’s positional
payload after dropping arg_index. Update the CoreCallable.build usage for
sig_local_expert and sig_combine so they use the full positional child signature
shape expected by sig_orch, or add an explicit mapping layer for non-prefix
slots; keep the fix localized to the child signature construction in main.py.

In `@src/common/platform/shared/host/tensor_dump_collector.cpp`:
- Around line 227-230: Clamp rec.func_count before assigning it to dt.func_count
in tensor_dump_collector.cpp so later consumers like the serialization path in
the same file don’t trust an oversized count and read past dt.func_ids. Update
the ingest logic around the record handling to store only the clamped value
based on TENSOR_DUMP_MAX_FUNC_IDS, and make the serialization loop use that
sanitized dt.func_count consistently.
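
The clamp requested in the second comment can be sketched in a few lines. This is a Python illustration of the idea, not the C++ in tensor_dump_collector.cpp; `sanitize_func_ids` is a hypothetical helper name, and the constant value 3 is taken from the func_ids[3] layout in the PR description.

```python
from typing import List, Sequence

# Assumed to match TENSOR_DUMP_MAX_FUNC_IDS (func_ids[3] in the PR description).
TENSOR_DUMP_MAX_FUNC_IDS = 3


def sanitize_func_ids(raw_func_ids: Sequence[int], raw_func_count: int) -> List[int]:
    """Return only the func_ids that a trusted count allows.

    raw_func_count comes from device memory and is treated as untrusted:
    it is clamped to [0, TENSOR_DUMP_MAX_FUNC_IDS] and to the array length
    before anything indexes the array.
    """
    count = min(int(raw_func_count), TENSOR_DUMP_MAX_FUNC_IDS, len(raw_func_ids))
    count = max(0, count)
    return list(raw_func_ids[:count])
```

Storing the clamped count once at ingest, and having the serialization loop read only that stored value, means no later consumer has to repeat the bounds check.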

📥 Commits

Reviewing files that changed from the base of the PR and between 47a411c and 9dcdcda.

📒 Files selected for processing (100)

docs/dfx/args-dump.md
examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
examples/a2a3/tensormap_and_ringbuffer/l3_l2_orch_comm_stream/l3_l2_orch_comm_stream.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
examples/workers/l2/per_task_runtime_env/main.py
examples/workers/l2/vector_add/main.py
examples/workers/l3/all_to_all_distributed/main.py
examples/workers/l3/allgather_distributed/main.py
examples/workers/l3/allreduce_distributed/main.py
examples/workers/l3/allreduce_ring_distributed/main.py
examples/workers/l3/broadcast_distributed/main.py
examples/workers/l3/child_memory/main.py
examples/workers/l3/domain_rank_map/main.py
examples/workers/l3/dual_domain_overlap/main.py
examples/workers/l3/ep_dispatch_combine/main.py
examples/workers/l3/ffn_tp_parallel/main.py
examples/workers/l3/multi_chip_dispatch/main.py
examples/workers/l3/per_task_runtime_env/main.py
examples/workers/l3/reduce_scatter_distributed/main.py
python/bindings/task_interface.cpp
simpler_setup/scene_test.py
src/a2a3/platform/include/common/tensor_dump.h
src/a5/platform/include/common/tensor_dump.h
src/common/platform/include/aicpu/tensor_dump_aicpu.h
src/common/platform/include/host/tensor_dump_collector.h
src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
src/common/platform/shared/host/tensor_dump_collector.cpp
src/common/task_interface/callable.h
tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
tests/st/a2a3/host_build_graph/matmul/test_matmul.py
tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane_mixed.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py
tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
tests/st/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py
tests/st/a5/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
tests/st/aicore_op_timeout/test_aicore_op_timeout.py
tests/ut/cpp/types/test_chip_callable_upload_immutable.cpp
tests/ut/py/test_task_interface.py

💤 Files with no reviewable changes (79)

tests/st/a2a3/host_build_graph/matmul/test_matmul.py
examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
examples/workers/l3/multi_chip_dispatch/main.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
examples/workers/l3/reduce_scatter_distributed/main.py
examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
examples/workers/l3/domain_rank_map/main.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
examples/workers/l3/all_to_all_distributed/main.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
examples/workers/l3/allreduce_ring_distributed/main.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
examples/workers/l3/dual_domain_overlap/main.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
examples/workers/l3/ffn_tp_parallel/main.py
examples/workers/l3/allgather_distributed/main.py
examples/workers/l3/broadcast_distributed/main.py
examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
examples/workers/l3/per_task_runtime_env/main.py
examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
examples/workers/l2/per_task_runtime_env/main.py
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
examples/workers/l2/vector_add/main.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
examples/workers/l3/allreduce_distributed/main.py

ChaoZheng109 · 2026-06-29T01:48:53Z

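Overall I agree with the direction of this refactor (positional mapping plus the func_id array, removing the dual-array maintenance that arg_index required). The 128B record invariant is backed by a static_assert, and the host side clamps values read from untrusted device memory; both are solid. I suggest addressing the following three points before merging:

1. [Robustness / medium] The positional dump relies on "the first active subtask being full-width" and can silently miss tensors

dump_args_for_task now takes only the signature of the active subtask with the lowest slot index to drive the whole positional mapping, and assumes that signature is full-width. When coverage is incomplete there is only a single LOG_WARN; no error is raised and golden is unaffected: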

if (covered_count != pl.tensor_count && try_log_dump_args_layout_mismatch())
    LOG_WARN("... signature covers %d tensor slots but payload has %d; the rest are not dumped.");

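The old mechanism iterated over every subtask and took the union of the slots covered by each subtask's own arg_index, so it did not depend on any single subtask being full-width. This PR has manually widened every incore of the existing offset mixes (l2_swimlane_mixed, mixed_example) to full width, so existing scenarios lose nothing. But the interface semantics have degraded from "the machine guarantees complete coverage" to "a human must write the first-slot incore at full width"; a future mix added in the natural old offset-mix style will silently drop tensors.

I suggest taking the widest active subtask signature instead. The loop that selects sig_src already iterates over all active subtasks, so this adds no overhead, and it is a single-point change in the common layer that takes effect on both a2a3 and a5: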

// Current:
if (sig_src == nullptr) {
    sig_src = reinterpret_cast<const CoreCallable *>(callable_addr);
}
// Suggested:
const CoreCallable *cand = reinterpret_cast<const CoreCallable *>(callable_addr);
if (sig_src == nullptr || cand->sig_count() > sig_src->sig_count()) {
    sig_src = cand;
}

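The downstream walk, the active_fids collection, and covered[] all stay unchanged. Note that this only fixes missed slot coverage; role/capture-stage still reflects a single subtask's view of tensor direction. I suggest accepting that weakening as covered by the disclaimer in the doc; it is not worth a larger change. If you adopt this, remember to update the function header comment and the "driven by the first active subtask's signature" wording in docs/dfx/args-dump.md.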
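
To make the difference between the two selection rules concrete, here is a small Python sketch. It models each subtask's signature as a list (or None for an inactive slot); `pick_first`, `pick_widest`, and `covered_slots` are hypothetical names used only for this illustration, not functions in the repository.

```python
from typing import List, Optional, Sequence

Signature = Sequence[str]


def pick_first(sigs: Sequence[Optional[Signature]]) -> Optional[Signature]:
    """Current rule: the first active subtask's signature drives the walk."""
    for sig in sigs:
        if sig is not None:
            return sig
    return None


def pick_widest(sigs: Sequence[Optional[Signature]]) -> Optional[Signature]:
    """Suggested rule: the widest active signature; earliest wins on ties."""
    best: Optional[Signature] = None
    for sig in sigs:
        if sig is None:
            continue
        if best is None or len(sig) > len(best):
            best = sig
    return best


def covered_slots(sig: Optional[Signature], tensor_count: int) -> int:
    """Number of payload tensor slots a positional walk over sig would dump."""
    return 0 if sig is None else min(len(sig), tensor_count)


# An offset mix written the old way: subtask 0 declares only its own two
# tensors, subtask 2 happens to declare all four; slot 1 is inactive.
mix: List[Optional[Signature]] = [["IN", "OUT"], None, ["IN", "IN", "OUT", "OUT"]]
```

With a four-tensor payload, `covered_slots(pick_first(mix), 4)` is 2 while `covered_slots(pick_widest(mix), 4)` is 4. The widest rule still cannot help if no active subtask declares the full width, which is why the LOG_WARN remains useful.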

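2. [Test coverage / minor] No ST asserts the new `func_id` array

test_args_dump.py only verifies arg_index / arg_index_ambiguous and never reads the dumped func_id. In other words, the test would still PASS even if every emitted func_id array were wrong: the core new behavior of this PR has no regression guard (it currently relies on manual verification on silicon). I suggest adding an assertion that covers: slots of a cooperative mix have func_id == [0,1], a single-kernel task has func_id == [N], and each slot appears exactly once.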
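
A possible shape for that assertion, as a Python sketch. The record layout here is an assumption: it treats each entry loaded from args_dump.json as a dict with `task_id`, `slot`, and `func_id` keys, which may not match the real schema, and `check_func_id_arrays` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence


def check_func_id_arrays(records: Sequence[Mapping],
                         expected_by_task: Dict[int, List[int]]) -> None:
    """Assert func_id arrays match expectations and each slot is emitted once.

    records: entries already parsed from the dump JSON (assumed keys:
             "task_id", "slot", "func_id").
    expected_by_task: task_id -> expected func_id array, for example
             [0, 1] for a cooperative mix or [N] for a single-kernel task.
    """
    seen = Counter((r["task_id"], r["slot"]) for r in records)
    dupes = sorted(key for key, n in seen.items() if n > 1)
    assert not dupes, f"slots emitted more than once: {dupes}"

    for r in records:
        expected = expected_by_task[r["task_id"]]
        assert r["func_id"] == expected, (
            f"task {r['task_id']} slot {r['slot']}: "
            f"func_id {r['func_id']} != expected {expected}"
        )
```

A check like this would fail as soon as a record carried a scalar or wrong-length func_id, or a slot was duplicated per subtask, which are the regressions the comment is concerned about.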

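3. [doc-consistency / minor] 5 stale comments left behind

When arg_index was mechanically deleted, the comments describing it were left behind as context lines and now describe a mechanism that no longer exists: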

tests/st/a2a3/.../dfx/dep_gen/test_dep_gen_chain.py
tests/st/a2a3/.../dummy_task/test_dummy_task.py
tests/st/a2a3/.../fanin_lookup_perf/test_fanin_lookup_perf.py
tests/st/a2a3/.../spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
tests/st/a2a3/.../spmd_sync_start_aiv/test_spmd_sync_start_aiv.py

In every case it is the line "# arg_index maps it explicitly."; simply delete it (the preceding line, "# Single-AIC/AIV task with one INOUT tensor at payload slot 0;", is still correct).

…3 skip after rebase

Rebased onto main (which now has hw-native-sys#1188, hw-native-sys#1189, hw-native-sys#1181):

- hw-native-sys#1181 removed CoreCallable.build's arg_index param. Update _build_chip_callable to the new build(signature, binary) signature (the async + require_sync_start + hang cases all build a CoreCallable).
- hw-native-sys#1189 made the tensor-data wait timeout 15 s on both arches (was 300 s on a2a3), so tensor_wait_timeout no longer needs to skip a2a3. Drop onboard_skip={"a2a3"} and the now-unused onboard_skip machinery; code 8 is now covered on both arches.

Verified: a2a3sim suite green; full a2a3 onboard suite (10 cases) green, including tensor_wait_timeout firing code 8 at 15 s with clean teardown.

Dump-driven tool that traces one task's intra-core AICore pipeline under the msprof op simulator (camodel), one level below an L2 task block.

- Capture a task's real args[] from a JSON-only args dump (hw-native-sys#1181 positional model), reconstruct them, and generate a combined replay workspace: zero hand-written shapes or scalars.
- Mix-together replay: a whole mix task (AIC + AIV0 + AIV1) runs as one msprof op. Two same-source AIV members collapse to a single-AIV include (both lanes run it), covering SPMD mixes whose aiv0 == aiv1.
- Flags for the manual decisions: --func-id (member set), --set-arg (shrink a scalar / control-tensor loop count without distorting the per-iteration pipeline), --spmd-block-num, --case, --debug-line.
- docs/dfx/l0-swimlane-profiling.md: usage, fidelity rules, and a hw-native-sys#1181-suite coverage table with a runnable command per task shape.
- .claude/skills/l0-swimlane/SKILL.md: agent-facing operating procedure, centered on how to pick --set-arg / --func-id / --case.

gemini-code-assist Bot reviewed Jun 27, 2026

Comment thread src/common/platform/shared/host/tensor_dump_collector.cpp Outdated

Comment thread src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp

Comment thread src/common/platform/include/aicpu/tensor_dump_aicpu.h

coderabbitai Bot reviewed Jun 27, 2026

Comment thread examples/workers/l3/ep_dispatch_combine/main.py

Comment thread src/common/platform/shared/host/tensor_dump_collector.cpp Outdated

indigo1973 force-pushed the l0swim_0626 branch from 9dcdcda to 6c84974 Compare June 27, 2026 10:00

indigo1973 force-pushed the l0swim_0626 branch from 6c84974 to 139d87e Compare June 29, 2026 03:22

ChaoZheng109 approved these changes Jun 29, 2026

ChaoZheng109 merged commit b1e4bd2 into hw-native-sys:main Jun 29, 2026
30 of 31 checks passed

indigo1973 mentioned this pull request Jun 30, 2026

Support: add l0_swimlane intra-core pipeline profiler #1213

Open

coderabbitai Bot mentioned this pull request Jul 1, 2026

Support: reuse resident prebuilt runtime arenas #1234

Merged
