Add: RunTiming return from Worker.run / run_prepared #790

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:worktree-unified-herding-lollipop on May 19, 2026

@ChaoWao ChaoWao commented May 16, 2026

Summary

Worker.run and run_prepared now return a RunTiming struct with host_wall / device_wall accessors so benchmarks can read per-run timing directly instead of scraping device logs. Builds on top of #802.

Why an always-on mailbox

Before this PR, host code that wanted device_wall_ns (the on-NPU wall time of the most recent orchestrator phase) had to grep the device log for orch_start=… orch_end=… orch_cost=… lines. The natural fix, surfacing that timing through the runtime → host shared memory, needed a stable home that is allocated on every run. Two candidates existed:

  1. AicpuPhaseHeader::orch_summary (the swimlane-side aggregate struct). Only allocated when enable_l2_swimlane is on. This is what an earlier draft of this PR used, and it's why scene_test could only collect device_wall on round 0 (swimlane is heavy, so rounds 1..N-1 had it off and reported 0).
  2. PTO2SharedMemoryHeader (allocated unconditionally for every run). The right home — "is the runtime alive?" is the same gate as "can I read device_wall?".

#802 retired the swimlane-side aggregate entirely. This PR builds on that by adding two cycle-counter fields directly to PTO2SharedMemoryHeader and a small extern-C pto2_read_orch_wall_ns(Runtime *) helper in runtime_maker.cpp that piggybacks on the existing copy_from_device of the runtime header (the one that already pulls graph_output_ptr back to host). No new shared region, no new copy, no swimlane dependency.

Design

[AICPU, aicpu_executor.cpp post-orchestration, under PTO2_PROFILING]
    sm_header->orch_start_cycle.store(orch_cycle_start, relaxed)
    sm_header->orch_end_cycle.store(orch_cycle_end, relaxed)
    rt_orchestration_done(rt)  ← release on orchestrator_done provides
                                  the happens-before edge

[Host, runtime_maker.cpp]
    extern "C" uint64_t pto2_read_orch_wall_ns(Runtime *runtime):
        copy_from_device(&host_header, sm_ptr, sizeof(PTO2SharedMemoryHeader))
        return cycles_to_ns(host_header.orch_end_cycle - host_header.orch_start_cycle)

[Host, pto_runtime_c_api.cpp run_prepared]
    out_timing->device_wall_ns = pto2_read_orch_wall_ns(runtime)

The mailbox is decoupled from L2PerfCollector / AicpuPhaseRecord — the swimlane data path stays as it is after #802 (per-event records only, swimlane-gated). RunTiming has its own three-line round-trip.
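
As a rough illustration of the host-side half of that round-trip, the sketch below models the two mailbox fields and the cycles → ns conversion in plain Python. It is not the code in runtime_maker.cpp: `MailboxHeader`, `read_orch_wall_ns`, the 50 MHz counter frequency, and the "return 0 when nothing was published" behaviour are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: the device cycle counter ticks at a fixed, known frequency.
# 50 MHz is a placeholder for illustration, not a value taken from this PR.
ASSUMED_COUNTER_HZ = 50_000_000


@dataclass
class MailboxHeader:
    """Hypothetical host-side copy of the two mailbox fields."""
    orch_start_cycle: int = 0
    orch_end_cycle: int = 0


def cycles_to_ns(cycles: int, counter_hz: int = ASSUMED_COUNTER_HZ) -> int:
    """Convert a cycle delta to nanoseconds using integer arithmetic."""
    return cycles * 1_000_000_000 // counter_hz


def read_orch_wall_ns(header: MailboxHeader) -> int:
    """Return the orchestrator wall time in ns, or 0 if nothing was published.

    Assumption: an unpublished mailbox is all zeros, so end <= start means
    "no timing available" rather than a negative duration.
    """
    if header.orch_end_cycle <= header.orch_start_cycle:
        return 0
    return cycles_to_ns(header.orch_end_cycle - header.orch_start_cycle)
```

With the assumed 50 MHz counter, a 200-cycle delta converts to 4000 ns. The real conversion factor depends on the platform's counter frequency.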

Changes

  • C ABI (src/common/worker/pto_runtime_c_api.h): add PtoRunTiming struct + trailing nullable out_timing param on run_prepared. Updated in lockstep across all 4 platform impls (src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).
  • Shared memory (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h): add orch_start_cycle + orch_end_cycle (std::atomic<uint64_t>) to PTO2SharedMemoryHeader. Always allocated.
  • AICPU writer (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp): store the two cycle counters once per run, under PTO2_PROFILING, before rt_orchestration_done(rt) so the release-store on orchestrator_done synchronizes them.
  • Host reader (src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp): extern "C" uint64_t pto2_read_orch_wall_ns(Runtime *) that copies the header from device and returns ns.
  • C++ (src/common/worker/chip_worker.{h,cpp}): ChipWorker::run and IWorker::run return RunTiming. RunPreparedFn typedef extended.
  • Python (python/bindings/task_interface.cpp): nanobind RunTiming class with host_wall_us / device_wall_us / *_ns properties. Worker.run / run_prepared and the simpler.task_interface.ChipWorker wrapper return it.
  • Benchmark (simpler_setup/scene_test.py): collects per-round timings, prints a host/device wall table via new _log_round_timings. tools/benchmark_rounds.sh rewritten to parse the printed Avg lines instead of grepping device logs (drops awk parser, wait_for_new_log, ASCEND_WORK_PATH discovery).
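
To make the Python-facing contract concrete, here is a pure-Python stand-in for the shape described above: nanosecond fields, microsecond accessors, and a device value that defaults to 0. The name `RunTimingSketch` is hypothetical; the real class is the nanobind binding and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTimingSketch:
    """Illustrative stand-in for RunTiming; not the nanobind class itself."""
    host_wall_ns: int
    device_wall_ns: int = 0  # 0 is read here as "no device timing available"

    @property
    def host_wall_us(self) -> float:
        # Microseconds derived from the nanosecond field.
        return self.host_wall_ns / 1000.0

    @property
    def device_wall_us(self) -> float:
        return self.device_wall_ns / 1000.0


# Nanosecond inputs chosen so the accessors reproduce the figures reported
# in the test plan (host_wall_us=582298.583, device_wall_us=4.000).
timing = RunTimingSketch(582_298_583, 4_000)
```

Keeping nanoseconds as the stored unit and deriving microseconds on access avoids accumulating rounding error when timings are summed across rounds.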

Test plan

Verified locally on macOS sim:

  • The regression this PR fixes: a5sim vector_add with enable_l2_swimlane=False returns host_wall_us=582298.583, device_wall_us=4.000 — pre-PR this would have been device_wall_us=0 because the old orch_summary lived in the swimlane shared region
  • a5sim vector_example default + --enable-l2-swimlane — PASSED
  • tests/ut/py/test_run_timing.py: 13/13 pass (nanobind RunTiming contract)
  • examples/workers/l2/vector_add/test_run_timing.py on a2a3sim hardware (left for CI; test is @pytest.mark.platforms(["a2a3sim", "a2a3"]))
  • Onboard hardware end-to-end (left for CI)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a unified mechanism for capturing and reporting host and device wall-clock timings across the C++ platform, Python bindings, and test framework. Key changes include the addition of a RunTiming structure to return timing data from worker execution calls and the implementation of _log_round_timings in the test setup to display per-round performance metrics. Consequently, the benchmark_rounds.sh tool has been refactored to parse these reported averages instead of scraping device logs. Feedback indicates that profiling should be explicitly disabled when multiple rounds are requested via the CLI to maintain benchmarking consistency.

Comment thread simpler_setup/scene_test.py Outdated
ChaoWao force-pushed the worktree-unified-herding-lollipop branch 10 times, most recently from 5b018dc to ee5c4f1, May 19, 2026 06:20

Worker.run and run_prepared now return a RunTiming struct (host_wall /
device_wall in ns + µs accessors) so benchmarks can read timing
directly instead of scraping device logs.

- C ABI: add PtoRunTiming struct + trailing nullable out_timing param
  on run_prepared in src/common/worker/pto_runtime_c_api.h. Updated in
  lockstep across all 4 platform impls
  (src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).
- host_wall_ns: steady_clock delta wrapping the dispatch.
- device_wall_ns: an always-on mailbox in PTO2SharedMemoryHeader
  (orch_start_cycle / orch_end_cycle, std::atomic<uint64_t>). AICPU's
  aicpu_executor publishes the two cycle counters once per run under
  PTO2_PROFILING (no swimlane gate); host reads them via the existing
  copy_from_device of the runtime header that runtime_maker already
  does for graph_output_ptr — same shared region, zero additional
  copies. A new extern "C" pto2_read_orch_wall_ns(Runtime*) in
  runtime_maker.cpp does the cycles → ns conversion; the C ABI in
  every platform impl calls it to fill out_timing->device_wall_ns.
  Independent of L2PerfCollector / AicpuPhaseRecord — the RunTiming
  feature owns its own data path so it stays available on every build
  with PTO2_PROFILING, regardless of swimlane / phase capture.
- C++: ChipWorker::run / run_prepared and IWorker::run now return
  RunTiming. RunPreparedFn typedef extended.
- Python: nanobind RunTiming class exposed as _task_interface.RunTiming
  with host_wall_us / device_wall_us / *_ns properties and an
  (host_ns, device_ns=0) constructor. Worker.run / run_prepared and the
  simpler.task_interface.ChipWorker wrapper return it.
- Benchmark: simpler_setup/scene_test.py collects per-round timings and
  prints a host/device wall table via new _log_round_timings.
  tools/benchmark_rounds.sh rewritten to parse the printed Avg lines
  instead of grepping device logs (drops awk parser, wait_for_new_log,
  ASCEND_WORK_PATH discovery).
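
The host_wall_ns measurement described above is a monotonic-clock delta around the dispatch call. A minimal Python analogue is sketched below, using time.perf_counter_ns in place of C++ steady_clock. `timed_dispatch` is a hypothetical helper, not part of this PR.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_dispatch(dispatch: Callable[[], T]) -> Tuple[T, int]:
    """Run dispatch() and return (result, elapsed wall time in ns).

    perf_counter_ns is monotonic, so the delta cannot go negative if the
    system clock is adjusted mid-run.
    """
    start_ns = time.perf_counter_ns()
    result = dispatch()
    elapsed_ns = time.perf_counter_ns() - start_ns
    return result, elapsed_ns


# Usage: wrap any callable that performs the work to be timed.
result, host_wall_ns = timed_dispatch(lambda: sum(range(1000)))
```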

device_wall_ns is now populated on every run that has PTO2_PROFILING
compiled in (the default), regardless of enable_l2_swimlane. This
matters for benchmarks: scene_test only keeps swimlane on for round 0
to avoid the JSON-dump overhead — before this PR, rounds 1..N-1 read
device_wall = 0 because the old orch_summary lived in the L2 perf
shared region which is only allocated when swimlane is on. The mailbox
lives in PTO2SharedMemoryHeader instead, which is always allocated.
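
The effect on multi-round averages can be seen with a small sketch. `summarize_rounds` is a hypothetical helper, not the actual _log_round_timings, and the sample values are invented for illustration.

```python
from statistics import mean
from typing import List, Tuple


def summarize_rounds(rounds_ns: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Average (host_wall_ns, device_wall_ns) pairs and return both in µs."""
    if not rounds_ns:
        raise ValueError("need at least one round")
    host_avg_us = mean(h for h, _ in rounds_ns) / 1000.0
    device_avg_us = mean(d for _, d in rounds_ns) / 1000.0
    return host_avg_us, device_avg_us


# Invented sample: device timing present only on round 0 (the old behaviour)
# versus present on every round (the behaviour this change describes).
only_round_zero = [(2000, 4000), (2000, 0), (2000, 0), (2000, 0)]
every_round = [(2000, 4000), (2000, 4000), (2000, 4000), (2000, 4000)]
```

In the first case the zero readings drag the device average down to a quarter of the real value, which is why a per-run mailbox matters for benchmarks.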

Verified locally on a5sim with --enable-l2-swimlane=OFF: vector_add
returns host_wall_us=582298.583, device_wall_us=4.000, both > 0,
host > device. tests/ut/py/test_run_timing.py: 13/13 pass.

ChaoWao force-pushed the worktree-unified-herding-lollipop branch from ee5c4f1 to c4ec5db, May 19, 2026 06:58
ChaoWao merged commit f6ae05d into hw-native-sys:main on May 19, 2026
15 checks passed
ChaoWao added a commit that referenced this pull request on May 20, 2026:

PR #790 rewrote tools/benchmark_rounds.sh to read Worker.run's RunTiming
(host/device wall) and print Avg Host / Avg Device lines. That broke the
benchmark workflow:

- The /benchmark skill (SKILL.md Step 6) parses `Trimmed Avg:` and
  `Orch Trimmed Avg:` and reports Elapsed / Sched / Orch columns — lines
  the RunTiming-based script no longer emitted, so /benchmark reported
  no timing.
- Host wall (Python dispatch) and the coarse device wall are not the
  per-round Elapsed / Sched / Orch latencies the benchmark targets.

Restore the device-log parser: scrape the runtime's
`Thread N: orch_start=… orch_end=…` / `sched_start=… sched_end=…` lines
(emitted under compile-time PTO2_PROFILING, independent of the
enable_l2_swimlane flag that --rounds > 1 forces off) and report the
three per-round times the skill consumes. RunTiming itself is untouched.
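
For orientation, a device-log scrape of this kind might look like the Python sketch below. The actual script is a shell/awk parser, and several things here are assumptions: the exact line layout (decimal integers after `orch_start=` and `orch_end=`), the helper names, and the trimming policy (drop the single smallest and largest sample).

```python
import re
from statistics import mean
from typing import List

# Assumed line shape; the values in the real log are elided in the text above.
ORCH_LINE = re.compile(r"Thread (\d+): orch_start=(\d+) orch_end=(\d+)")


def orch_durations(log_text: str) -> List[int]:
    """Return end - start for every matching orchestrator line, in log order."""
    return [int(m.group(3)) - int(m.group(2)) for m in ORCH_LINE.finditer(log_text)]


def trimmed_avg(samples: List[int], trim: int = 1) -> float:
    """Mean after dropping `trim` samples from each end (assumed policy).

    Falls back to the plain mean when too few samples remain after trimming.
    """
    if len(samples) <= 2 * trim:
        return float(mean(samples))
    kept = sorted(samples)[trim:len(samples) - trim]
    return float(mean(kept))


sample_log = (
    "Thread 0: orch_start=100 orch_end=150\n"
    "unrelated line\n"
    "Thread 0: orch_start=200 orch_end=260\n"
    "Thread 0: orch_start=300 orch_end=400\n"
)
```

On the sample log the three durations are 50, 60, and 100, so trimming one sample from each end leaves 60.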