Add: RunTiming return from Worker.run / run_prepared#790
Merged
ChaoWao merged 1 commit intoMay 19, 2026
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a unified mechanism for capturing and reporting host and device wall-clock timings across the C++ platform, Python bindings, and test framework. Key changes include the addition of a RunTiming structure to return timing data from worker execution calls and the implementation of _log_round_timings in the test setup to display per-round performance metrics. Consequently, the benchmark_rounds.sh tool has been refactored to parse these reported averages instead of scraping device logs. Feedback indicates that profiling should be explicitly disabled when multiple rounds are requested via the CLI to maintain benchmarking consistency.
5b018dc to
ee5c4f1
Compare
Worker.run and run_prepared now return a RunTiming struct (host_wall /
device_wall in ns + µs accessors) so benchmarks can read timing
directly instead of scraping device logs.
- C ABI: add PtoRunTiming struct + trailing nullable out_timing param
on run_prepared in src/common/worker/pto_runtime_c_api.h. Updated in
lockstep across all 4 platform impls
(src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).
- host_wall_ns: steady_clock delta wrapping the dispatch.
- device_wall_ns: an always-on mailbox in PTO2SharedMemoryHeader
(orch_start_cycle / orch_end_cycle, std::atomic<uint64_t>). AICPU's
aicpu_executor publishes the two cycle counters once per run under
PTO2_PROFILING (no swimlane gate); host reads them via the existing
copy_from_device of the runtime header that runtime_maker already
does for graph_output_ptr — same shared region, zero additional
copies. A new extern "C" pto2_read_orch_wall_ns(Runtime*) in
runtime_maker.cpp does the cycles → ns conversion; the C ABI in
every platform impl calls it to fill out_timing->device_wall_ns.
Independent of L2PerfCollector / AicpuPhaseRecord — the RunTiming
feature owns its own data path so it stays available on every build
with PTO2_PROFILING, regardless of swimlane / phase capture.
- C++: ChipWorker::run / run_prepared and IWorker::run now return
RunTiming. RunPreparedFn typedef extended.
- Python: nanobind RunTiming class exposed as _task_interface.RunTiming
with host_wall_us / device_wall_us / *_ns properties and an
(host_ns, device_ns=0) constructor. Worker.run / run_prepared and the
simpler.task_interface.ChipWorker wrapper return it.
- Benchmark: simpler_setup/scene_test.py collects per-round timings and
prints a host/device wall table via new _log_round_timings.
tools/benchmark_rounds.sh rewritten to parse the printed Avg lines
instead of grepping device logs (drops awk parser, wait_for_new_log,
ASCEND_WORK_PATH discovery).
device_wall_ns is now populated on every run that has PTO2_PROFILING
compiled in (the default), regardless of enable_l2_swimlane. This
matters for benchmarks: scene_test only keeps swimlane on for round 0
to avoid the JSON-dump overhead — before this PR, rounds 1..N-1 read
device_wall = 0 because the old orch_summary lived in the L2 perf
shared region which is only allocated when swimlane is on. The mailbox
lives in PTO2SharedMemoryHeader instead, which is always allocated.
Verified locally on a5sim with --enable-l2-swimlane=OFF: vector_add
returns host_wall_us=582298.583, device_wall_us=4.000, both > 0,
host > device. tests/ut/py/test_run_timing.py: 13/13 pass.
ee5c4f1 to
c4ec5db
Compare
2 tasks
ChaoWao
added a commit
that referenced
this pull request
May 20, 2026
PR #790 rewrote the script to read Worker.run's RunTiming (host/device wall) and print Avg Host / Avg Device lines. That broke the benchmark workflow: - The /benchmark skill (SKILL.md Step 6) parses `Trimmed Avg:` and `Orch Trimmed Avg:` and reports Elapsed / Sched / Orch columns — lines the RunTiming-based script no longer emitted, so /benchmark reported no timing. - Host wall (Python dispatch) and the coarse device wall are not the per-round Elapsed / Sched / Orch latencies the benchmark targets. Restore the device-log parser: scrape the runtime's `Thread N: orch_start=… orch_end=…` / `sched_start=… sched_end=…` lines (emitted under compile-time PTO2_PROFILING, independent of the enable_l2_swimlane flag that --rounds > 1 forces off) and report the three per-round times the skill consumes. RunTiming itself is untouched.
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Worker.runandrun_preparednow return aRunTimingstruct withhost_wall/device_wallaccessors so benchmarks can read per-run timing directly instead of scraping device logs. Builds on top of #802.Why an always-on mailbox
Before this PR, host code that wanted
device_wall_ns(the on-NPU wall of the most recent orchestrator phase) had to grep the device log fororch_start=… orch_end=… orch_cost=…lines. The natural fix — surface that timing via the runtime → host shared memory — needed a stable home that's allocated on every run. Two candidates existed:AicpuPhaseHeader::orch_summary(the swimlane-side aggregate struct). Only allocated whenenable_l2_swimlaneis on. This is what an earlier draft of this PR used, and it's whyscene_testcould only collectdevice_wallon round 0 (swimlane is heavy, so rounds 1..N-1 had it off and reported 0).PTO2SharedMemoryHeader(allocated unconditionally for every run). The right home — "is the runtime alive?" is the same gate as "can I read device_wall?".#802 retired the swimlane-side aggregate entirely. This PR builds on that by adding two cycle-counter fields directly to
PTO2SharedMemoryHeaderand a small extern-Cpto2_read_orch_wall_ns(Runtime *)helper inruntime_maker.cppthat piggybacks on the existingcopy_from_deviceof the runtime header (the one that already pullsgraph_output_ptrback to host). No new shared region, no new copy, no swimlane dependency.Design
The mailbox is decoupled from
L2PerfCollector/AicpuPhaseRecord— the swimlane data path stays as it is after #802 (per-event records only, swimlane-gated). RunTiming has its own three-line round-trip.Changes
src/common/worker/pto_runtime_c_api.h): addPtoRunTimingstruct + trailing nullableout_timingparam onrun_prepared. Updated in lockstep across all 4 platform impls (src/{a2a3,a5}/platform/{onboard,sim}/host/pto_runtime_c_api.cpp).src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h): addorch_start_cycle+orch_end_cycle(std::atomic<uint64_t>) toPTO2SharedMemoryHeader. Always allocated.src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp): store the two cycle counters once per run, underPTO2_PROFILING, beforert_orchestration_done(rt)so the release-store onorchestrator_donesynchronizes them.src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp):extern "C" uint64_t pto2_read_orch_wall_ns(Runtime *)that copies the header from device and returns ns.src/common/worker/chip_worker.{h,cpp}):ChipWorker::runandIWorker::runreturnRunTiming.RunPreparedFntypedef extended.python/bindings/task_interface.cpp): nanobindRunTimingclass withhost_wall_us/device_wall_us/*_nsproperties.Worker.run/run_preparedand thesimpler.task_interface.ChipWorkerwrapper return it.simpler_setup/scene_test.py): collects per-round timings, prints a host/device wall table via new_log_round_timings.tools/benchmark_rounds.shrewritten to parse the printed Avg lines instead of grepping device logs (drops awk parser,wait_for_new_log,ASCEND_WORK_PATHdiscovery).Test plan
Verified locally on macOS sim:
a5simvector_add withenable_l2_swimlane=Falsereturnshost_wall_us=582298.583, device_wall_us=4.000— pre-PR this would have beendevice_wall_us=0because the oldorch_summarylived in the swimlane shared regiona5simvector_example default +--enable-l2-swimlane— PASSEDtests/ut/py/test_run_timing.py: 13/13 pass (nanobind RunTiming contract)examples/workers/l2/vector_add/test_run_timing.pyona2a3simhardware (left for CI; test is@pytest.mark.platforms(["a2a3sim", "a2a3"]))