Add: RunTiming return from Worker.run / run_prepared by ChaoWao · Pull Request #790 · hw-native-sys/simpler

ChaoWao · 2026-05-16T00:37:13Z

Summary

Worker.run and run_prepared now return a RunTiming struct with host_wall / device_wall accessors so benchmarks can read per-run timing directly instead of scraping device logs. Builds on top of #802.

Why an always-on mailbox

Before this PR, host code that wanted device_wall_ns (the on-NPU wall of the most recent orchestrator phase) had to grep the device log for orch_start=… orch_end=… orch_cost=… lines. The natural fix — surface that timing via the runtime → host shared memory — needed a stable home that's allocated on every run. Two candidates existed:

AicpuPhaseHeader::orch_summary (the swimlane-side aggregate struct). Only allocated when enable_l2_swimlane is on. This is what an earlier draft of this PR used, and it's why scene_test could only collect device_wall on round 0 (swimlane is heavy, so rounds 1..N-1 had it off and reported 0).
PTO2SharedMemoryHeader (allocated unconditionally for every run). The right home — "is the runtime alive?" is the same gate as "can I read device_wall?".

#802 retired the swimlane-side aggregate entirely. This PR builds on that by adding two cycle-counter fields directly to PTO2SharedMemoryHeader and a small extern-C pto2_read_orch_wall_ns(Runtime *) helper in runtime_maker.cpp that piggybacks on the existing copy_from_device of the runtime header (the one that already pulls graph_output_ptr back to host). No new shared region, no new copy, no swimlane dependency.

Design

[AICPU, aicpu_executor.cpp post-orchestration, under PTO2_PROFILING]
    sm_header->orch_start_cycle.store(orch_cycle_start, relaxed)
    sm_header->orch_end_cycle.store(orch_cycle_end, relaxed)
    rt_orchestration_done(rt)  ← release on orchestrator_done provides
                                  the happens-before edge

[Host, runtime_maker.cpp]
    extern "C" pto2_read_orch_wall_ns(Runtime *runtime):
        copy_from_device(&host_header, sm_ptr, sizeof(PTO2SharedMemoryHeader))
        return cycles_to_ns(end - start)

[Host, pto_runtime_c_api.cpp run_prepared]
    out_timing->device_wall_ns = pto2_read_orch_wall_ns(runtime)

The mailbox is decoupled from L2PerfCollector / AicpuPhaseRecord — the swimlane data path stays as it is after #802 (per-event records only, swimlane-gated). RunTiming has its own three-line round-trip.
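The host-side read in the design above comes down to subtracting the two cycle counters and scaling by the counter frequency. A minimal Python sketch of that arithmetic follows. It assumes a hypothetical counter frequency (`COUNTER_HZ` is an illustrative value, not taken from the runtime) and assumes that an unset or inconsistent pair should read as 0; the real helper is the C++ `pto2_read_orch_wall_ns` and may differ on both points.

```python
NS_PER_SECOND = 1_000_000_000

# Hypothetical cycle-counter frequency; the real value is platform-specific
# and is not stated in this PR.
COUNTER_HZ = 50_000_000


def cycles_to_ns(cycles: int, counter_hz: int = COUNTER_HZ) -> int:
    """Convert a cycle count to nanoseconds using integer arithmetic."""
    return cycles * NS_PER_SECOND // counter_hz


def orch_wall_ns(start_cycle: int, end_cycle: int, counter_hz: int = COUNTER_HZ) -> int:
    """Device wall time for one run; 0 if the pair is unset or inconsistent (assumption)."""
    if end_cycle <= start_cycle:
        return 0
    return cycles_to_ns(end_cycle - start_cycle, counter_hz)
```

Multiplying before dividing keeps the conversion in integers, so no precision is lost for counters that do not divide evenly into nanoseconds.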

Changes

C ABI (src/common/worker/pto_runtime_c_api.h): add PtoRunTiming struct + trailing nullable out_timing param on run_prepared. Updated in lockstep across all 4 platform impls (src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).
Shared memory (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h): add orch_start_cycle + orch_end_cycle (std::atomic<uint64_t>) to PTO2SharedMemoryHeader. Always allocated.
AICPU writer (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp): store the two cycle counters once per run, under PTO2_PROFILING, before rt_orchestration_done(rt) so the release-store on orchestrator_done synchronizes them.
Host reader (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp): extern "C" uint64_t pto2_read_orch_wall_ns(Runtime *) that copies the header from device and returns ns.
C++ (src/common/worker/chip_worker.{h,cpp}): ChipWorker::run and IWorker::run return RunTiming. RunPreparedFn typedef extended.
Python (python/bindings/task_interface.cpp): nanobind RunTiming class with host_wall_us / device_wall_us / *_ns properties. Worker.run / run_prepared and the simpler.task_interface.ChipWorker wrapper return it.
Benchmark (simpler_setup/scene_test.py): collects per-round timings, prints a host/device wall table via new _log_round_timings. tools/benchmark_rounds.sh rewritten to parse the printed Avg lines instead of grepping device logs (drops awk parser, wait_for_new_log, ASCEND_WORK_PATH discovery).
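To make the Python-facing shape concrete, here is a pure-Python stand-in for the RunTiming object. It is a sketch only: the real class is a nanobind binding, the class name `RunTimingSketch` is hypothetical, and the attribute names `host_wall_ns` / `device_wall_ns` are assumed from the `*_ns` shorthand in the description. The sample values reuse the numbers reported in the test plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTimingSketch:
    """Pure-Python stand-in for the nanobind RunTiming class (illustrative only)."""

    host_wall_ns: int
    device_wall_ns: int = 0  # device time is optional, as in a host-only measurement

    @property
    def host_wall_us(self) -> float:
        """Host wall time in microseconds."""
        return self.host_wall_ns / 1000.0

    @property
    def device_wall_us(self) -> float:
        """Device wall time in microseconds."""
        return self.device_wall_ns / 1000.0


timing = RunTimingSketch(582_298_583, 4_000)
print(f"host_wall_us={timing.host_wall_us:.3f}, device_wall_us={timing.device_wall_us:.3f}")
```

Storing nanoseconds as integers and deriving microseconds on read avoids keeping two copies of the same measurement in sync.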

Test plan

Verified locally on macOS sim:

The regression this PR fixes: a5sim vector_add with enable_l2_swimlane=False returns host_wall_us=582298.583, device_wall_us=4.000 — pre-PR this would have been device_wall_us=0 because the old orch_summary lived in the swimlane shared region
a5sim vector_example default + --enable-l2-swimlane — PASSED
tests/ut/py/test_run_timing.py: 13/13 pass (nanobind RunTiming contract)
examples/workers/l2/vector_add/test_run_timing.py on a2a3sim hardware (left for CI; test is @pytest.mark.platforms(["a2a3sim", "a2a3"]))
Onboard hardware end-to-end (left for CI)

gemini-code-assist

Code Review

This pull request introduces a unified mechanism for capturing and reporting host and device wall-clock timings across the C++ platform, Python bindings, and test framework. Key changes include the addition of a RunTiming structure to return timing data from worker execution calls and the implementation of _log_round_timings in the test setup to display per-round performance metrics. Consequently, the benchmark_rounds.sh tool has been refactored to parse these reported averages instead of scraping device logs. Feedback indicates that profiling should be explicitly disabled when multiple rounds are requested via the CLI to maintain benchmarking consistency.

Worker.run and run_prepared now return a RunTiming struct (host_wall / device_wall in ns + µs accessors) so benchmarks can read timing directly instead of scraping device logs.

- C ABI: add PtoRunTiming struct + trailing nullable out_timing param on run_prepared in src/common/worker/pto_runtime_c_api.h. Updated in lockstep across all 4 platform impls (src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).
- host_wall_ns: steady_clock delta wrapping the dispatch.
- device_wall_ns: an always-on mailbox in PTO2SharedMemoryHeader (orch_start_cycle / orch_end_cycle, std::atomic<uint64_t>). AICPU's aicpu_executor publishes the two cycle counters once per run under PTO2_PROFILING (no swimlane gate); host reads them via the existing copy_from_device of the runtime header that runtime_maker already does for graph_output_ptr (same shared region, zero additional copies). A new extern "C" pto2_read_orch_wall_ns(Runtime*) in runtime_maker.cpp does the cycles → ns conversion; the C ABI in every platform impl calls it to fill out_timing->device_wall_ns. Independent of L2PerfCollector / AicpuPhaseRecord: the RunTiming feature owns its own data path so it stays available on every build with PTO2_PROFILING, regardless of swimlane / phase capture.
- C++: ChipWorker::run / run_prepared and IWorker::run now return RunTiming. RunPreparedFn typedef extended.
- Python: nanobind RunTiming class exposed as _task_interface.RunTiming with host_wall_us / device_wall_us / *_ns properties and an (host_ns, device_ns=0) constructor. Worker.run / run_prepared and the simpler.task_interface.ChipWorker wrapper return it.
- Benchmark: simpler_setup/scene_test.py collects per-round timings and prints a host/device wall table via new _log_round_timings. tools/benchmark_rounds.sh rewritten to parse the printed Avg lines instead of grepping device logs (drops awk parser, wait_for_new_log, ASCEND_WORK_PATH discovery).

device_wall_ns is now populated on every run that has PTO2_PROFILING compiled in (the default), regardless of enable_l2_swimlane. This matters for benchmarks: scene_test only keeps swimlane on for round 0 to avoid the JSON-dump overhead — before this PR, rounds 1..N-1 read device_wall = 0 because the old orch_summary lived in the L2 perf shared region which is only allocated when swimlane is on. The mailbox lives in PTO2SharedMemoryHeader instead, which is always allocated. Verified locally on a5sim with --enable-l2-swimlane=OFF: vector_add returns host_wall_us=582298.583, device_wall_us=4.000, both > 0, host > device. tests/ut/py/test_run_timing.py: 13/13 pass.
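The effect on a multi-round benchmark can be seen with a small averaging sketch. The per-round values below are made up for illustration and are not measurements from this PR; `avg_device_wall_us` is a hypothetical helper, not the project's `_log_round_timings`.

```python
from statistics import mean


def avg_device_wall_us(device_wall_ns_per_round):
    """Average device wall time over all rounds, in microseconds."""
    return mean(device_wall_ns_per_round) / 1000.0


# Illustrative values only: swimlane on for round 0, off afterwards.
before_pr = [4_000, 0, 0, 0]           # rounds 1..N-1 read 0 from the missing region
after_pr = [4_000, 4_100, 3_900, 4_000]  # every round reads the always-on mailbox

print(avg_device_wall_us(before_pr))  # 1.0
print(avg_device_wall_us(after_pr))   # 4.0
```

With zeros in rounds 1..N-1 the average understates device time by roughly a factor of N, which is why the round-0-only behaviour made the device column unusable for benchmarks.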

PR #790 rewrote the script (tools/benchmark_rounds.sh) to read Worker.run's RunTiming (host/device wall) and print Avg Host / Avg Device lines. That broke the benchmark workflow:

- The /benchmark skill (SKILL.md Step 6) parses `Trimmed Avg:` and `Orch Trimmed Avg:` and reports Elapsed / Sched / Orch columns. The RunTiming-based script no longer emitted those lines, so /benchmark reported no timing.
- Host wall (Python dispatch) and the coarse device wall are not the per-round Elapsed / Sched / Orch latencies the benchmark targets.

Restore the device-log parser: scrape the runtime's `Thread N: orch_start=… orch_end=…` / `sched_start=… sched_end=…` lines (emitted under compile-time PTO2_PROFILING, independent of the enable_l2_swimlane flag that --rounds > 1 forces off) and report the three per-round times the skill consumes. RunTiming itself is untouched.
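For readers unfamiliar with the `Trimmed Avg:` figure, a trimmed average discards the extreme samples before averaging so that a single slow round does not dominate. The sketch below assumes one sample is dropped from each end; the trimming rule actually used by benchmark_rounds.sh is not shown on this page, and `trimmed_avg` is a hypothetical helper name.

```python
def trimmed_avg(samples, trim=1):
    """Mean of the samples after dropping the `trim` smallest and largest values."""
    if len(samples) <= 2 * trim:
        raise ValueError("need more than 2 * trim samples")
    ordered = sorted(samples)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Illustrative per-round latencies in microseconds; 50.0 is an outlier round.
rounds_us = [10.0, 12.0, 11.0, 50.0, 9.0]
print(trimmed_avg(rounds_us))  # 11.0
```

The plain mean of the same samples is 18.4, so the outlier round would otherwise inflate the reported latency by more than half.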

gemini-code-assist Bot reviewed May 16, 2026

Comment thread simpler_setup/scene_test.py Outdated

ChaoWao force-pushed the worktree-unified-herding-lollipop branch 10 times, most recently from 5b018dc to ee5c4f1 Compare May 19, 2026 06:20

ChaoWao force-pushed the worktree-unified-herding-lollipop branch from ee5c4f1 to c4ec5db Compare May 19, 2026 06:58

ChaoWao merged commit f6ae05d into hw-native-sys:main May 19, 2026
15 checks passed

ChaoWao mentioned this pull request May 20, 2026

Fix: benchmark_rounds.sh parses device log instead of RunTiming #828

Merged

2 tasks

hw-native-sys-bot mentioned this pull request May 21, 2026

Update: benchmark_rounds.sh adds Host/Device, renames Elapsed to Total #832

Merged

4 tasks

ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:worktree-unified-herding-lollipop
