Refactor: replace arg_index with positional dump + func_id array #1181

Merged
ChaoZheng109 merged 1 commit into hw-native-sys:main from indigo1973:l0swim_0626 on Jun 29, 2026

Conversation

@indigo1973

Contributor

The args dump no longer needs a per-incore arg_index to map each declared tensor to a payload slot. Each incore declares its full (mix-task) signature and the dump maps signature entry i to payload slot i positionally; every record carries the task's active-subtask set as a func_id array (its mix membership) rather than a single scalar func_id.

This supersedes the #1123 arg_index mechanism (and the #1171 follow-up that backfilled arg_index for l3_l2_orch_comm), tailored for the upcoming l0_swimlane tool, which reconstructs and replays a whole mix task rather than one kernel.

  • dump: a single positional walk over the first active subtask's signature; each payload tensor is emitted once, stamped with the func array (no per-subtask geometry duplication). Slots beyond the payload (a prefix-dispatched task) are skipped.
  • record/info/DumpedTensor: func_id scalar -> func_ids[3] + func_count (reuses the existing pad, 128B record unchanged; TENSOR_DUMP_MAX_FUNC_IDS is tied to PTO2_SUBTASK_SLOT_COUNT by a static_assert).
  • args_dump.json: func_id is now an array ([0,1,2] for a mix, [0] for a single-kernel task).
  • CoreCallable.build / make_callable: drop the arg_index parameter, field, and accessor; scene_test and the binding stop threading it.
  • migrate every CALLABLE incore / CoreCallable.build repo-wide to drop arg_index; complete the offset-mix signatures (mixed_example a2a3/a5, l2_swimlane_mixed) to full width so positional mapping covers payload.
  • docs/dfx/args-dump.md updated for the func_id array + positional model.
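
The positional model in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DumpRecord`, `positional_dump`, and the string role labels are hypothetical names, the signature is assumed to contain tensor entries only, and the real walk lives in the C++ `dump_args_for_task`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DumpRecord:
    slot: int           # payload slot index (equal to the signature index)
    role: str           # direction label copied from the signature entry
    func_id: List[int]  # active-subtask set, i.e. the task's mix membership


def positional_dump(signature: List[str],
                    payload_tensor_count: int,
                    active_func_ids: List[int]) -> List[DumpRecord]:
    """Map signature entry i to payload slot i and emit each slot once."""
    records = []
    for i, role in enumerate(signature):
        if i >= payload_tensor_count:
            # The signature is wider than the payload (prefix-dispatched task):
            # the remaining entries have no payload slot, so they are skipped.
            break
        records.append(DumpRecord(slot=i, role=role, func_id=list(active_func_ids)))
    return records


# Hypothetical mix task: three-entry signature, two payload tensors.
records = positional_dump(["IN", "IN", "OUT"], payload_tensor_count=2,
                          active_func_ids=[0, 1, 2])
for rec in records:
    print(rec)
```

Each payload slot appears once and carries the whole func array, which is the property the description calls "no per-subtask geometry duplication".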

Verified on a2a3 silicon (--dump-args 3, golden PASS, func_id arrays correct, each slot emitted once): mixed_example (offset mix, 108 records), l2_swimlane_mixed, spmd_basic (cooperative mix), l3_l2_orch_comm_stream (#1171 site), dummy_task.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Walkthrough

Callable construction now uses signature-only mappings without arg_index. Dump records now store multiple active subtask IDs and serialize them as arrays. The docs, runtime, examples, and tests were updated to match the new callable and dump shapes.
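
For a consumer of the dump, the main visible change is that `func_id` is an array. A minimal sketch of a reader that tolerates both shapes follows; it assumes a tool might want to read dumps produced before and after this PR, and `normalize_func_id` plus the sample entries are hypothetical, not part of the repository.

```python
import json
from typing import Any, List


def normalize_func_id(value: Any) -> List[int]:
    """Return func_id as a list, accepting a legacy scalar or the new array."""
    if isinstance(value, bool):
        raise TypeError("func_id must be an int or a list of ints")
    if isinstance(value, int):
        return [value]  # legacy scalar form
    if isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    ):
        return list(value)  # array form introduced by this PR
    raise TypeError("func_id must be an int or a list of ints")


# Hypothetical entries; only the func_id field is taken from the PR text.
sample = json.loads('[{"func_id": [0, 1, 2]}, {"func_id": [0]}, {"func_id": 2}]')
for entry in sample:
    print(normalize_func_id(entry["func_id"]))
```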

Changes

Callable and tensor-dump migration

| Layer / File(s) | Summary |
| --- | --- |
| Leaf callable contract: src/common/task_interface/callable.h, python/bindings/task_interface.cpp, simpler_setup/scene_test.py, tests/ut/cpp/types/test_chip_callable_upload_immutable.cpp, tests/ut/py/test_task_interface.py | Callable<void, MaxSig, 0> drops arg_index, CoreCallable.build no longer accepts it in Python, and helper tests build callables without it. |
| Dump metadata and collector: docs/dfx/args-dump.md, src/a2a3/platform/include/common/tensor_dump.h, src/a5/platform/include/common/tensor_dump.h, src/common/platform/include/aicpu/tensor_dump_aicpu.h, src/common/platform/include/host/tensor_dump_collector.h, src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp, src/common/platform/shared/host/tensor_dump_collector.cpp, tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py, tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane_mixed.py, tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py, tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py, tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py, tests/st/a5/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py | Dump records now carry func_ids arrays and func_count, dump_args_for_task maps signature entries positionally, and the dump-focused docs/tests reflect that layout. |
| Build call sites: tests/st/aicore_op_timeout/test_aicore_op_timeout.py, examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py, examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py, examples/a2a3/tensormap_and_ringbuffer/l3_l2_orch_comm_stream/l3_l2_orch_comm_stream.py, examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py, examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py, examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py, examples/workers/l2/*, examples/workers/l3/*, tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py, tests/st/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py | Examples and tests that call CoreCallable.build now omit arg_index and use the provided signatures for argument ordering. |
| A2A3 config updates: examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/*, examples/a2a3/tensormap_and_ringbuffer/paged_attention*, examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/*, examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/*, examples/a2a3/tensormap_and_ringbuffer/vector_example/*, tests/st/a2a3/tensormap_and_ringbuffer/... | A2A3 tensormap_and_ringbuffer example and regression configs remove per-core arg_index fields and rely on signature ordering. |
| A5 config updates: examples/a5/tensormap_and_ringbuffer/bgemm/*, examples/a5/tensormap_and_ringbuffer/paged_attention*, examples/a5/tensormap_and_ringbuffer/vector_example/*, tests/st/a5/tensormap_and_ringbuffer/... | A5 tensormap_and_ringbuffer example and regression configs remove per-core arg_index fields and rely on signature ordering. |
| Host-build-graph tests: tests/st/a2a3/host_build_graph/*, tests/st/a5/host_build_graph/* | Host-build-graph callable configs remove per-core arg_index fields in the updated scenes. |

Sequence Diagram(s)

sequenceDiagram
  participant dump_args_for_task
  participant dump_arg_record
  participant TensorDumpCollector
  participant export_dump_files

  dump_args_for_task->>dump_arg_record: map signature entry i to payload slot i
  dump_arg_record->>TensorDumpCollector: write func_count and func_ids[]
  TensorDumpCollector->>export_dump_files: emit func_id array in JSON

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#1123: Changes the same tensor-dump and argument-mapping path in the opposite direction by making arg_index mandatory again.
  • hw-native-sys/simpler#1171: Touches the same examples/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py call site and adjusts CoreCallable.build arguments there.
  • hw-native-sys/simpler#1100: Updates tensormap_and_ringbuffer signatures and mix-task callable shapes, which align with the positional signature coverage used here.

Suggested labels

enhancement

Poem

I nibbled the old arg_index grass away,
then hopped by signature all day.
func_ids jingled in my pack,
and dump paths sparkled on the track.
(/)✨ hop-hop, the bytes now know their way.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing arg_index with positional dumping and func_id arrays. |
| Description check | ✅ Passed | The description matches the PR’s core refactor and supporting updates, including positional mapping, func_id arrays, and CoreCallable changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request simplifies the tensor dump interface by removing the explicit arg_index mapping and instead positionally mapping signature entries to payload slots. It also updates the tensor dump record to carry an array of active subtask kernel IDs (func_ids) to support cooperative mix tasks without duplicating records. The review feedback highlights critical safety and correctness issues: potential out-of-bounds reads in both tensor_dump_collector.cpp and tensor_dump_aicpu.cpp if the subtask count exceeds TENSOR_DUMP_MAX_FUNC_IDS, and a positional mapping bug in tensor_dump_aicpu.h where skipped SCALAR entries incorrectly offset the payload slot index for subsequent tensors.

Comment thread src/common/platform/shared/host/tensor_dump_collector.cpp Outdated
Comment thread src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
Comment thread src/common/platform/include/aicpu/tensor_dump_aicpu.h

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/workers/l3/ep_dispatch_combine/main.py`:
- Around line 221-223: The child callables built in the dispatch/combine setup
still use compact signatures, which no longer line up with the task’s positional
payload after dropping arg_index. Update the CoreCallable.build usage for
sig_local_expert and sig_combine so they use the full positional child signature
shape expected by sig_orch, or add an explicit mapping layer for non-prefix
slots; keep the fix localized to the child signature construction in main.py.

In `@src/common/platform/shared/host/tensor_dump_collector.cpp`:
- Around line 227-230: Clamp rec.func_count before assigning it to dt.func_count
in tensor_dump_collector.cpp so later consumers like the serialization path in
the same file don’t trust an oversized count and read past dt.func_ids. Update
the ingest logic around the record handling to store only the clamped value
based on TENSOR_DUMP_MAX_FUNC_IDS, and make the serialization loop use that
sanitized dt.func_count consistently.
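
The clamp requested in the second comment can be sketched as follows. This is an illustration in Python rather than the collector's C++, assuming the record exposes a fixed-size `func_ids` array and an untrusted `func_count`; `sanitize_func_ids` is a hypothetical helper, and the constant value 3 is taken from the `func_ids[3]` field named in the PR description.

```python
from typing import List, Sequence, Tuple

# Mirrors the constant named in the PR; 3 matches func_ids[3].
TENSOR_DUMP_MAX_FUNC_IDS = 3


def sanitize_func_ids(raw_func_ids: Sequence[int],
                      raw_func_count: int) -> Tuple[List[int], int]:
    """Clamp an untrusted count once, then use only the clamped value."""
    count = max(0, min(raw_func_count, TENSOR_DUMP_MAX_FUNC_IDS, len(raw_func_ids)))
    return list(raw_func_ids[:count]), count


# A corrupted record claiming 200 entries is cut back to the array size.
ids, count = sanitize_func_ids([0, 1, 2], 200)
print(ids, count)
```

Storing the clamped count at ingest time means later stages, such as the serialization loop, never need to re-validate it.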

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 47a411c and 9dcdcda.

📒 Files selected for processing (100)
  • docs/dfx/args-dump.md
  • examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
  • examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
  • examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
  • examples/a2a3/tensormap_and_ringbuffer/l3_l2_orch_comm_stream/l3_l2_orch_comm_stream.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
  • examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
  • examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
  • examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
  • examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
  • examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
  • examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
  • examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
  • examples/workers/l2/per_task_runtime_env/main.py
  • examples/workers/l2/vector_add/main.py
  • examples/workers/l3/all_to_all_distributed/main.py
  • examples/workers/l3/allgather_distributed/main.py
  • examples/workers/l3/allreduce_distributed/main.py
  • examples/workers/l3/allreduce_ring_distributed/main.py
  • examples/workers/l3/broadcast_distributed/main.py
  • examples/workers/l3/child_memory/main.py
  • examples/workers/l3/domain_rank_map/main.py
  • examples/workers/l3/dual_domain_overlap/main.py
  • examples/workers/l3/ep_dispatch_combine/main.py
  • examples/workers/l3/ffn_tp_parallel/main.py
  • examples/workers/l3/multi_chip_dispatch/main.py
  • examples/workers/l3/per_task_runtime_env/main.py
  • examples/workers/l3/reduce_scatter_distributed/main.py
  • python/bindings/task_interface.cpp
  • simpler_setup/scene_test.py
  • src/a2a3/platform/include/common/tensor_dump.h
  • src/a5/platform/include/common/tensor_dump.h
  • src/common/platform/include/aicpu/tensor_dump_aicpu.h
  • src/common/platform/include/host/tensor_dump_collector.h
  • src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
  • src/common/platform/shared/host/tensor_dump_collector.cpp
  • src/common/task_interface/callable.h
  • tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
  • tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
  • tests/st/a2a3/host_build_graph/matmul/test_matmul.py
  • tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
  • tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
  • tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
  • tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
  • tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane_mixed.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py
  • tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
  • tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
  • tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
  • tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
  • tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
  • tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
  • tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
  • tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
  • tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
  • tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
  • tests/st/a5/tensormap_and_ringbuffer/l3_l2_orch_comm/test_l3_l2_orch_comm.py
  • tests/st/a5/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py
  • tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
  • tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
  • tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
  • tests/st/aicore_op_timeout/test_aicore_op_timeout.py
  • tests/ut/cpp/types/test_chip_callable_upload_immutable.cpp
  • tests/ut/py/test_task_interface.py
💤 Files with no reviewable changes (79)
  • tests/st/a2a3/host_build_graph/matmul/test_matmul.py
  • examples/a2a3/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dummy_task/test_dummy_task.py
  • examples/a5/tensormap_and_ringbuffer/deferred_notify_demo/test_deferred_notify_demo.py
  • tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
  • examples/workers/l3/multi_chip_dispatch/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/pmu/test_pmu.py
  • tests/st/a5/host_build_graph/paged_attention/test_paged_attention.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start/test_spmd_sync_start.py
  • examples/a2a3/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
  • tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py
  • tests/st/a2a3/host_build_graph/bgemm/test_bgemm.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
  • tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py
  • tests/st/a5/tensormap_and_ringbuffer/simt_basic/test_simt_basic.py
  • tests/st/a2a3/tensormap_and_ringbuffer/alternating_matmul_add/test_alternating_matmul_add.py
  • examples/a5/tensormap_and_ringbuffer/vector_example/test_vector_example.py
  • examples/workers/l3/reduce_scatter_distributed/main.py
  • examples/a5/tensormap_and_ringbuffer/bgemm/test_bgemm.py
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
  • tests/st/a2a3/host_build_graph/vector_example/test_vector_example.py
  • examples/a2a3/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
  • tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/test_batch_paged_attention.py
  • tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
  • examples/workers/l3/domain_rank_map/main.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py
  • examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
  • tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
  • examples/a5/tensormap_and_ringbuffer/async_notify_demo/test_async_notify_demo.py
  • tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
  • tests/st/a2a3/tensormap_and_ringbuffer/fanin_lookup_perf/test_fanin_lookup_perf.py
  • examples/workers/l3/all_to_all_distributed/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
  • examples/workers/l3/allreduce_ring_distributed/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/test_spmd_sync_start_stress.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_aiv/test_spmd_sync_start_aiv.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
  • tests/st/a2a3/tensormap_and_ringbuffer/multi_round_paged_attention/test_multi_round_paged_attention.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
  • examples/workers/l3/dual_domain_overlap/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_sync_start_edge/test_spmd_sync_start_edge.py
  • examples/workers/l3/ffn_tp_parallel/main.py
  • examples/workers/l3/allgather_distributed/main.py
  • examples/workers/l3/broadcast_distributed/main.py
  • examples/a2a3/tensormap_and_ringbuffer/scalar_data_test/test_scalar_data.py
  • tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/test_spmd_multiblock_mix.py
  • examples/workers/l3/per_task_runtime_env/main.py
  • examples/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/test_benchmark_bgemm.py
  • examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/test_qwen3_14b_decode.py
  • tests/st/a5/host_build_graph/dump_tensor/test_dump_tensor_example.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py
  • tests/st/a5/tensormap_and_ringbuffer/spmd_starvation/test_spmd_starvation.py
  • examples/workers/l2/per_task_runtime_env/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py
  • examples/workers/l2/vector_add/main.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_batch_dispatch_oob/test_spmd_batch_dispatch_oob.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/args_dump/test_args_dump.py
  • tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/test_l2_swimlane.py
  • tests/st/a5/tensormap_and_ringbuffer/orch_so_cache/test_orch_so_cache.py
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/test_paged_attention.py
  • tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
  • examples/workers/l3/allreduce_distributed/main.py

Comment thread examples/workers/l3/ep_dispatch_combine/main.py
Comment thread src/common/platform/shared/host/tensor_dump_collector.cpp Outdated
@ChaoZheng109

Collaborator

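Overall I agree with the direction of this refactor (positional mapping + func_id array, removing the dual-array maintenance that arg_index required). The 128B record invariant is backed by a static_assert, and the host side clamps untrusted device memory; both are solid. Before merging, I suggest addressing the three points below: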

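1. [Robustness / medium] The positional dump relies on "the first active subtask is full width" and can silently miss tensors

dump_args_for_task now uses only the signature of the active subtask with the lowest slot index to drive the entire positional mapping, and assumes that signature is full width. When coverage is incomplete there is only a single LOG_WARN: no error is raised and the golden check is unaffected: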

if (covered_count != pl.tensor_count && try_log_dump_args_layout_mismatch())
    LOG_WARN("... signature covers %d tensor slots but payload has %d; the rest are not dumped.");

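The old mechanism walked every subtask and covered the payload with the union of their arg_index mappings, so it did not depend on any single subtask being full width. This PR manually widens every incore of the existing offset mixes (l2_swimlane_mixed, mixed_example) to full width, so the existing scenarios lose nothing. However, the interface semantics degrade from "complete coverage is guaranteed mechanically" to "someone must remember to write the first-slot incore at full width"; a future mix written in the natural old offset-mix style will silently drop tensors.

I suggest taking the signature of the widest active subtask instead (the loop that selects sig_src already visits every active subtask, so this adds no overhead, and it is a single-point change in the common layer that takes effect for both a2a3 and a5):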

// Current:
if (sig_src == nullptr) {
    sig_src = reinterpret_cast<const CoreCallable *>(callable_addr);
}
// Suggested:
const CoreCallable *cand = reinterpret_cast<const CoreCallable *>(callable_addr);
if (sig_src == nullptr || cand->sig_count() > sig_src->sig_count()) {
    sig_src = cand;
}

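The downstream walk, the active_fids collection, and covered[] all stay unchanged. Note that this only fixes missing slot coverage; role/capture-stage still reflects the directional view of a single subtask. I suggest accepting that weakening under the disclaimer already in the doc; it is not worth a large change. If you adopt this, remember to update the function header comment and the "driven by the first active subtask's signature" wording in docs/dfx/args-dump.md.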
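
The difference between the two selection rules can be illustrated with a small Python model. It is a sketch only: `Subtask`, `pick_first`, and `pick_widest` are hypothetical stand-ins for the C++ loop over active subtasks, and signature width is modeled as list length in place of `sig_count()`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtask:
    func_id: int
    signature: List[str]  # one entry per declared tensor


def pick_first(active: List[Subtask]) -> Optional[Subtask]:
    """Current rule: the first active subtask drives the walk."""
    return active[0] if active else None


def pick_widest(active: List[Subtask]) -> Optional[Subtask]:
    """Suggested rule: the widest signature wins; ties keep the earlier one."""
    best: Optional[Subtask] = None
    for cand in active:
        if best is None or len(cand.signature) > len(best.signature):
            best = cand
    return best


# Hypothetical offset mix where the first subtask declares only one tensor.
active = [
    Subtask(func_id=0, signature=["IN"]),
    Subtask(func_id=1, signature=["IN", "IN", "OUT"]),
    Subtask(func_id=2, signature=["IN", "OUT", "OUT"]),
]
print(len(pick_first(active).signature), len(pick_widest(active).signature))
```

With a three-tensor payload, the first rule would cover one slot and the widest rule all three, which is the silent gap the review describes.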

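2. [Test coverage / minor] No ST asserts on the new func_id array

test_args_dump.py only verifies arg_index / arg_index_ambiguous and never reads the dumped func_id. In other words, the test would still PASS even if every emitted func_id array were wrong: the core new behavior of this PR has no regression guard (it currently relies solely on manual verification on silicon). I suggest adding an assertion that covers: slot func_id == [0,1] for a cooperative mix, == [N] for a single-kernel task, and each slot appearing exactly once.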
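
One possible shape for such an assertion is sketched below. The record layout is an assumption: the keys `task_id` and `slot` are hypothetical, and only `func_id` being an array comes from the PR description, so the field names would need adapting to the actual args_dump.json schema.

```python
from collections import Counter
from typing import Dict, List


def check_func_id_records(records: List[Dict], expected_func_id: List[int]) -> None:
    """Assert each (task, slot) appears once and carries the expected func_id."""
    seen = Counter((r["task_id"], r["slot"]) for r in records)
    duplicates = sorted(key for key, n in seen.items() if n > 1)
    assert not duplicates, f"slots emitted more than once: {duplicates}"
    for r in records:
        assert r["func_id"] == expected_func_id, (
            f"task {r['task_id']} slot {r['slot']}: "
            f"func_id {r['func_id']} != {expected_func_id}"
        )


# Hypothetical cooperative-mix records: two slots, both stamped [0, 1].
mix_records = [
    {"task_id": 7, "slot": 0, "func_id": [0, 1]},
    {"task_id": 7, "slot": 1, "func_id": [0, 1]},
]
check_func_id_records(mix_records, [0, 1])
```

A single-kernel task would use the same helper with `expected_func_id=[N]`.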

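3. [Doc consistency / minor] Five stale comments remain

When arg_index was removed mechanically, the comments describing it were left behind as context lines and now describe a mechanism that no longer exists: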

  • tests/st/a2a3/.../dfx/dep_gen/test_dep_gen_chain.py
  • tests/st/a2a3/.../dummy_task/test_dummy_task.py
  • tests/st/a2a3/.../fanin_lookup_perf/test_fanin_lookup_perf.py
  • tests/st/a2a3/.../spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
  • tests/st/a2a3/.../spmd_sync_start_aiv/test_spmd_sync_start_aiv.py

In each case it is the line # arg_index maps it explicitly. Deleting it is enough (the preceding line, # Single-AIC/AIV task with one INOUT tensor at payload slot 0;, is still correct).
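
A quick way to confirm that no such line survives is a small scan like the one below. It is a sketch, assuming the stale comment sits on a line of its own as quoted above; `stale_comment_lines` and `scan_tree` are hypothetical helpers, not existing repository tooling.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

STALE_COMMENT = "# arg_index maps it explicitly."


def stale_comment_lines(text: str, needle: str = STALE_COMMENT) -> List[int]:
    """Return 1-based line numbers whose stripped content equals the needle."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if line.strip() == needle]


def scan_tree(root: str, needle: str = STALE_COMMENT) -> Iterator[Tuple[Path, int]]:
    """Yield (path, line number) for every stale comment under root."""
    for path in sorted(Path(root).rglob("*.py")):
        for lineno in stale_comment_lines(path.read_text(encoding="utf-8"), needle):
            yield path, lineno


sample = (
    "# Single-AIC/AIV task with one INOUT tensor at payload slot 0;\n"
    "    # arg_index maps it explicitly.\n"
    "x = 1\n"
)
print(stale_comment_lines(sample))
```

Running `scan_tree("tests/st")` from the repository root would list any remaining occurrences.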

@ChaoZheng109 ChaoZheng109 merged commit b1e4bd2 into hw-native-sys:main Jun 29, 2026
30 of 31 checks passed
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Jun 29, 2026
…3 skip after rebase

Rebased onto main (which now has hw-native-sys#1188, hw-native-sys#1189, hw-native-sys#1181):

- hw-native-sys#1181 removed CoreCallable.build's arg_index param. Update _build_chip_callable
  to the new build(signature, binary) signature (the async + require_sync_start +
  hang cases all build a CoreCallable).
- hw-native-sys#1189 made the tensor-data wait timeout 15 s on both arches (was 300 s on a2a3),
  so tensor_wait_timeout no longer needs to skip a2a3. Drop onboard_skip={"a2a3"}
  and the now-unused onboard_skip machinery; code 8 is now covered on both arches.

Verified: a2a3sim suite green; full a2a3 onboard suite (10 cases) green, including
tensor_wait_timeout firing code 8 at 15 s with clean teardown.
indigo1973 added a commit to indigo1973/simpler that referenced this pull request Jul 1, 2026
Dump-driven tool that traces one task's intra-core AICore pipeline under
the msprof op simulator (camodel), one level below an L2 task block.

- Capture a task's real args[] from a JSON-only args dump (hw-native-sys#1181 positional
  model), reconstruct them, and generate a combined replay workspace —
  zero hand-written shapes or scalars.
- Mix-together replay: a whole mix task (AIC + AIV0 + AIV1) runs as one
  msprof op. Two same-source AIV members collapse to a single-AIV include
  (both lanes run it), covering SPMD mixes whose aiv0 == aiv1.
- Flags for the manual decisions: --func-id (member set), --set-arg
  (shrink a scalar / control-tensor loop count without distorting the
  per-iteration pipeline), --spmd-block-num, --case, --debug-line.
- docs/dfx/l0-swimlane-profiling.md: usage, fidelity rules, and a
  hw-native-sys#1181-suite coverage table with a runnable command per task shape.
- .claude/skills/l0-swimlane/SKILL.md: agent-facing operating procedure,
  centered on how to pick --set-arg / --func-id / --case.
doraemonmj pushed a commit to doraemonmj/simpler_wc that referenced this pull request Jul 1, 2026
…native-sys#1181)