Refactor: host-build trb runtime arena (a2a3 only) #846
Move the per-slot payload/task pointer assignments out of the RingSchedState::init() O(task_window_size) loop and into orch::prepare_task. Their value is per-slot constant (&task_payloads[slot] / &task_descriptors[slot]) but writing them at submit time, on the same 64B slot_state cache line prepare_task is already dirtying, is essentially free — while removing the only "scale-dependent" pointer assignments from the init path. ring_id stays in init (its value is per-ring constant, so rewriting it each submit would only add noise without removing a loop). Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and bind_buffers() (per-submit) to make the two call-site shapes explicit. Mirrored across both a2a3 and a5 trb runtimes.
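The init-time versus per-submit split described above can be sketched as follows. This is a minimal illustration, not the actual PTO2 code: `SlotState`, `Ring`, `TaskPayload`, and `TaskDescriptor` are simplified stand-ins, and only the `bind_ring` / `bind_buffers` / `prepare_task` names come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real payload/descriptor types.
struct TaskPayload { std::uint64_t args[4]; };
struct TaskDescriptor { std::int32_t task_id; };

// Simplified slot state; the real PTO2TaskSlotState has more fields.
struct SlotState {
    std::uint8_t ring_id = 0;
    TaskPayload* payload = nullptr;
    TaskDescriptor* task = nullptr;

    // Init-time: the value is constant per ring, so it is written once.
    void bind_ring(std::uint8_t id) { ring_id = id; }

    // Per-submit: the value is constant per slot, rewritten on each submit.
    void bind_buffers(TaskPayload* p, TaskDescriptor* t) {
        payload = p;
        task = t;
    }
};

struct Ring {
    std::vector<SlotState> slots;
    std::vector<TaskPayload> payloads;
    std::vector<TaskDescriptor> descriptors;

    Ring(std::uint8_t ring_id, std::size_t window)
        : slots(window), payloads(window), descriptors(window) {
        // Only the ring id is bound here; no buffer pointers are assigned.
        for (SlotState& s : slots) s.bind_ring(ring_id);
    }

    // Submit path: bind the slot's buffers while the slot is being written anyway.
    SlotState& prepare_task(std::int32_t task_id) {
        const std::size_t slot = static_cast<std::size_t>(task_id) % slots.size();
        SlotState& s = slots[slot];
        s.bind_buffers(&payloads[slot], &descriptors[slot]);
        s.task->task_id = task_id;
        return s;
    }
};
```

In this sketch a slot's `payload` and `task` pointers stay null until the first submit that lands on it, which is the observable consequence of deferring the bind.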
Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime, orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper, mailbox) on every device boot via runtime_create_from_sm. This commit moves layout + data init onto the host so the AICPU only does a cheap arena-internal pointer wire pass plus the SM reset that can't run off-device. Multi-run boots reuse the pooled prebuilt image with a single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs; finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it after the split.
- runtime/shared/pto_runtime2_init.cpp: new file holding the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay in place.

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size region (hbg passes 0). The prebuilt image lives in the same pooled backing allocation as gm_heap and SM, keeping worker lifetime to one rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes the pooled regions, runs init_data + wire, stashes prebuilt metadata into the rt image, rtMemcpys to device, and records base/offset on Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from base+off_runtime, wire arena-internal pointers, sm_handle->init (SM reset including the per-slot fields above), mailbox reset, finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so UTs can still exercise task_id near INT32_MAX without reading the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9 (--build with pto-isa commit pinned to CI).
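The reserve / init-data / wire phases can be illustrated with a toy arena. The sketch below is an assumption-heavy simplification: `Layout`, `RuntimeImage`, and the single queue region are hypothetical, a plain `std::memcpy` between two host buffers stands in for the rtMemcpy upload, and only the phase names mirror the commit message. It is meant to show why an image that stores offsets can be copied and re-wired cheaply, not how the real runtime lays out its arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical layout record: offsets only, so it is valid in any address space.
struct Layout {
    std::size_t off_runtime;
    std::size_t off_queue;
    std::size_t queue_capacity;
    std::size_t total;
};

// Hypothetical runtime image header living at the start of the arena.
struct RuntimeImage {
    Layout layout;            // stashed so the boot side can re-derive pointers
    std::uint32_t* queue;     // arena-internal pointer; meaningless until wired
    std::uint32_t queue_head;
};

inline std::size_t align_up(std::size_t v, std::size_t a) {
    return (v + a - 1) & ~(a - 1);  // a must be a power of two
}

// Phase 1: pure arithmetic, no memory is touched.
Layout reserve_layout(std::size_t queue_capacity) {
    Layout l{};
    l.off_runtime = 0;
    l.off_queue = align_up(sizeof(RuntimeImage), 64);
    l.queue_capacity = queue_capacity;
    l.total = align_up(l.off_queue + queue_capacity * sizeof(std::uint32_t), 64);
    return l;
}

// Phase 2: fill address-independent data into the host-side buffer.
void init_data_from_layout(unsigned char* base, const Layout& l) {
    RuntimeImage* rt = new (base + l.off_runtime) RuntimeImage{};
    rt->layout = l;
    rt->queue = nullptr;
    rt->queue_head = 0;
    std::memset(base + l.off_queue, 0, l.queue_capacity * sizeof(std::uint32_t));
}

// Phase 3: turn offsets into pointers relative to wherever the image now lives.
RuntimeImage* wire_arena_pointers(unsigned char* base, std::size_t off_runtime) {
    RuntimeImage* rt = reinterpret_cast<RuntimeImage*>(base + off_runtime);
    rt->queue = reinterpret_cast<std::uint32_t*>(base + rt->layout.off_queue);
    return rt;
}
```

Each simulated boot is then one copy of the pristine host image followed by a wire pass; state written during the previous run does not survive, and the host image is never modified.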
Code Review
This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.
Address review feedback from PR hw-native-sys#846:
- pto2_sm_layout::ring_task_descriptors_addr: take per-ring task_window_sizes[] array (mirroring PTO2SharedMemoryHandle's SM API) and assert ring_id range, so a future per-ring SM layout cannot silently disagree with the addresses the host bakes into the prebuilt image.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): return nullptr when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot resolve to base + SIZE_MAX. Failure is now loud and contained at the acquire boundary.
- DeviceArena::attach(): rewrite doc to match real behavior (region table is not repopulated after attach, reserve() asserts !committed_ so cannot replay, region_size() returns 0); promote the pre-alignment / non-null / power-of-two checks from plain assert() to an unconditional abort() so release builds still trap on contract violations.
- PTO2TensorMap: drop the dead `orch` back-pointer field (a2a3 never dereferences it), strip parent_orch parameter from wire_arena_pointers, and remove the now-unused PTO2OrchestratorState forward declaration.
- PTO2RingFlowControl::init(): add a coupling comment so future fc-initial-value or boot-order changes flag PTO2TaskAllocator::init's initial_local_task_id default in the same edit.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::init_data_from_layout: drop the task_window_size / dep_pool_capacity parameters that were never consumed (scheduler only needs SM base + ring index, both window-size-independent; orchestrator counterpart still takes task_window_size for ring_task_descriptors arithmetic). Updated all callsites (pto_runtime2_init.cpp + 4 cpput suites).
- PTO2Runtime::prebuilt_arena_base: removed the dead mirror field. The host Runtime's prebuilt_arena_base_ is the real source of truth (AICPU reads it to locate the pooled buffer *before* dereferencing the image); the PTO2Runtime image still carries prebuilt_layout, which the AICPU does consume.
cpput: 25/25 pass. a2a3sim trb: dummy_task / dynamic_register / L2 trb suite pass with --build.
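Two of the hardening items above, the SIZE_MAX sentinel guard and the assert()-to-abort() promotion, follow a common defensive pattern. The sketch below illustrates it under assumptions: `PooledArena`, `provision`, `attach_contract_ok`, and `attach_or_abort` are hypothetical names, not the DeviceRunner or DeviceArena API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical pooled-buffer owner; the real DeviceRunner manages more regions.
class PooledArena {
public:
    // Sentinel meaning "this region was never provisioned" (e.g. size 0 requested).
    static constexpr std::size_t kNotProvisioned = SIZE_MAX;

    void provision(unsigned char* base, std::size_t region_off) {
        base_ = base;
        region_off_ = region_off;
    }

    // Returns nullptr instead of computing base + SIZE_MAX when unprovisioned.
    unsigned char* acquire_runtime_region() const {
        if (base_ == nullptr || region_off_ == kNotProvisioned) return nullptr;
        return base_ + region_off_;
    }

private:
    unsigned char* base_ = nullptr;
    std::size_t region_off_ = kNotProvisioned;
};

// Contract for attaching to an external buffer: non-null base,
// power-of-two alignment, and base already aligned to it.
inline bool attach_contract_ok(const void* base, std::size_t alignment) {
    if (base == nullptr) return false;
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) return false;
    return reinterpret_cast<std::uintptr_t>(base) % alignment == 0;
}

// Unlike assert(), this check survives NDEBUG (release) builds.
inline void attach_or_abort(const void* base, std::size_t alignment) {
    if (!attach_contract_ok(base, alignment)) std::abort();
}
```

Keeping the predicate separate from the aborting wrapper is a design choice made here for testability; it lets unit tests cover the contract without killing the test process.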
Summary
Two-commit refactor on the trb runtime, both authored by @poursoul. The PR
bundles them because the second is built on top of the first; squash-merge
gives a single coherent landing.
fe5d662 — Refactor: defer slot_state payload/task bind to orch::prepare_task
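- Moves the per-slot payload/task pointer assignments out of RingSchedState::init into per-submit prepare_task, making startup independent of window size.
- Mirrored across both a2a3 and a5 (pto_orchestrator.cpp, pto_runtime2_types.h, pto_scheduler.cpp on each arch).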
d33daa5 — Refactor: host-build trb runtime arena, AICPU does only wire + SM reset ⚠ a2a3 only
- Moves layout + data init of runtime_create_from_sm onto the host. AICPU boot becomes a cheap arena-internal pointer wire pass + the SM reset that can't run off-device.
- The prebuilt image lives in the same pooled backing allocation as gm_heap and SM (one rtMalloc per worker), reused across all subsequent runs via a single rtMemcpy.
- src/a5/** is untouched in this commit (a5 keeps its current AICPU-side runtime_create_from_sm path). The plan is to mirror to a5 in a follow-up PR after this lands and stabilizes on a2a3 hardware/sim.
Mechanism (commit 2 / d33daa5)
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire; orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs. finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM device-side field addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it.
- runtime/shared/pto_runtime2_init.cpp holds the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay put.
- DeviceRunner::setup_static_arena now takes a third runtime_arena_size region (hbg passes 0, since hbg has no prebuilt runtime arena).
Why a5 is deliberately not touched in this PR
The host-build refactor is a non-trivial reshape of the runtime arena init
path. Keeping a5 on the old AICPU-side path until a2a3 has time on real
hardware lets us validate the new contract (layout/init/wire/finalize phases,
pooled image lifecycle, SM-reset boundary) without making a5 a moving target.
Once stable, the a5 mirror is a mechanical follow-up.
Test plan
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so the near-INT32_MAX corner case is still exercised without an SM dereference.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass; verifies the shared HostApi changes (3-arg setup_static_arena, new acquire_pooled_runtime_arena field) don't break hbg.
- a2a3 hardware: tests/st/.../paged_attention_unroll passes on device 9 (--build, pto-isa commit pinned to CI).
Post-review hardening (commit 75f2562)
Address feedback after two independent review passes:
- pto2_sm_layout::ring_task_descriptors_addr: now takes a per-ring task_window_sizes[] array (mirroring the SM API) instead of a single uniform value; adds a ring_id range assert. Structurally prevents the host-built image from silently disagreeing with the SM layout if anyone later introduces per-ring window sizes.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): now returns nullptr when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot resolve to base + SIZE_MAX.
- DeviceArena::attach(): documentation rewritten to match real behavior (region table is not repopulated; reserve() cannot replay; region_size() returns 0); the pre-alignment / non-null / power-of-two checks now std::abort() unconditionally instead of relying on assert() (which is stripped in release builds).
- PTO2TensorMap::orch: dead back-pointer field removed (a2a3 never dereferences it); wire_arena_pointers loses its parent_orch parameter; forward declaration of PTO2OrchestratorState removed.
- PTO2Runtime::prebuilt_arena_base: dead mirror field removed. The host Runtime::prebuilt_arena_base_ is the real source of truth (AICPU reads it to locate the pooled buffer before it can dereference the image); the image still carries prebuilt_layout, which is consumed.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::init_data_from_layout: the unused task_window_size / dep_pool_capacity parameters are dropped (scheduler only needs SM base + ring index); all callsites updated.
- PTO2RingFlowControl::init(): comment added pointing back at PTO2TaskAllocator::init's initial_local_task_id default, so future changes to fc initial value / boot ordering are flagged in the same edit.
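The per-ring address computation from the first hardening item can be sketched as pure offset arithmetic. Everything numeric here is an assumption made for illustration: `kMaxRings`, `kHeaderBytes`, `kDescriptorBytes`, and the "header followed by back-to-back descriptor arrays" layout are hypothetical, and the real SM layout may differ. Only the idea (a per-ring size array plus a ring_id range check, with no dereference of shared memory) is taken from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout constants; the real SM header and descriptor sizes differ.
constexpr int kMaxRings = 4;
constexpr std::uint64_t kHeaderBytes = 256;
constexpr std::uint64_t kDescriptorBytes = 64;

// Computes the device address of ring `ring_id`'s descriptor array from the
// SM base address alone. The function never reads through sm_base, so it can
// run on the host against a device address.
std::uint64_t ring_task_descriptors_addr(std::uint64_t sm_base,
                                         const std::uint32_t* task_window_sizes,
                                         int ring_id) {
    assert(ring_id >= 0 && ring_id < kMaxRings);
    std::uint64_t addr = sm_base + kHeaderBytes;
    // Skip the descriptor arrays of all preceding rings, each with its own size.
    for (int r = 0; r < ring_id; ++r) {
        addr += static_cast<std::uint64_t>(task_window_sizes[r]) * kDescriptorBytes;
    }
    return addr;
}
```

With a single uniform window size the same function would reduce to `sm_base + kHeaderBytes + ring_id * window * kDescriptorBytes`; taking the array keeps the host-side arithmetic correct if the sizes ever diverge per ring.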
Test plan (post-hardening)
- cpput: 25/25 pass.
- a2a3sim trb: dummy_task + dynamic_register + L2 trb suite pass with --build (forces host + AICPU recompile).