Refactor: extract pto_runtime_c_api shared glue into common (-303 lines)#928
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR consolidates duplicated C API logic across a2a3 and a5 on-board host runtimes into a new shared implementation file. ChangesC API Extraction and Consolidation
Sequence DiagramsequenceDiagram
participant ChipWorker as ChipWorker
participant c_api_shared as c_api_shared.cpp
participant DeviceRunnerBase as DeviceRunnerBase*
participant Runtime as Runtime
ChipWorker->>c_api_shared: simpler_init(ctx, binaries)
c_api_shared->>DeviceRunnerBase: attach_thread()
c_api_shared->>DeviceRunnerBase: ensure_device_initialized()
ChipWorker->>c_api_shared: prepare_callable(ctx, id, callable)
c_api_shared->>DeviceRunnerBase: prepare_callable_impl()
c_api_shared->>DeviceRunnerBase: register_callable()
ChipWorker->>c_api_shared: run_prepared(ctx, runtime, id, args, ...)
c_api_shared->>Runtime: new Runtime(host_apis, callbacks)
c_api_shared->>DeviceRunnerBase: bind_callable()
c_api_shared->>DeviceRunnerBase: run_task()
c_api_shared->>ChipWorker: return status & timing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the PTO Runtime C API by extracting the common, architecture-independent C API glue from the arch-specific pto_runtime_c_api.cpp files into a new shared file, c_api_shared.cpp. To facilitate this, DeviceRunnerBase is updated with a public virtual destructor and virtual entry points (run, finalize, and set_dep_gen_enabled), which are overridden in the concrete subclasses. The review feedback highlights an exception-safety issue in the shared run_prepared function, where the placement-new'ed Runtime object's destructor is called manually on each return path. If an exception is thrown, the destructor is bypassed, leading to potential resource leaks. Using RAIIScopeGuard is recommended to ensure exception-safe cleanup.
dbde828 to
65dddf4
Compare
Move the byte-identical c_api functions from a2a3 and a5 onboard
pto_runtime_c_api.cpp into a single shared
src/common/platform/onboard/host/c_api_shared.cpp linked into each
arch's libhost_runtime.so. Works through DeviceRunnerBase * and
dispatches arch-specific behavior through three new virtuals.
DeviceRunnerBase additions:
- Public virtual destructor (was protected non-virtual). Lets the
shared destroy_device_context delete polymorphically.
- virtual int run(Runtime&, int, int) = 0 — each arch already had run().
- virtual int finalize() = 0 — each arch already had finalize().
- virtual void set_dep_gen_enabled(bool) — default no-op for a5;
a2a3 overrides to wire enable_dep_gen_. Lets the shared
run_prepared call set_dep_gen_enabled unconditionally.
c_api_shared.cpp owns (12 dlsym'd + 7 static helpers + TSD glue):
- TSD glue (g_runner_key, pthread_once, current_runner)
- Static internal: device_malloc, device_free, copy_to_device,
copy_from_device, upload_chip_callable_buffer_wrapper,
setup_static_arena_wrapper, acquire_pooled_*_wrapper
- Public C ABI: destroy_device_context, get_runtime_size,
device_malloc_ctx, device_free_ctx, copy_to_device_ctx,
copy_from_device_ctx, finalize_device, simpler_init,
prepare_callable, run_prepared, unregister_callable,
get_aicpu_dlopen_count, get_host_dlopen_count
Each arch's pto_runtime_c_api.cpp now only carries arch-specific
entries (also dlsym'd by ChipWorker):
- create_device_context (must \`new\` the concrete subclass)
- a2a3: ensure_acl_ready_ctx + create_comm_stream_ctx +
destroy_comm_stream_ctx (real implementations through
DeviceRunner::{ensure_acl_ready, create_comm_stream, destroy_comm_stream})
- a5: same three as not-supported stubs plus all comm_* stubs
(returns NULL / -1, distributed runtime not yet on a5)
Both arches built clean. nm -D verifies all 26 ChipWorker dlsym
targets are exported on both libhost_runtime.so. a2a3 onboard smoke
(dummy_task, alternating_matmul_add, prepared_callable suite,
spmd_basic) — 9/9 passed in 19s.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/common/platform/onboard/host/device_runner_base.h (1)
67-71:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStale class-level docstring: destructor is now public virtual, not protected non-virtual.
Lines 67-71 still describe the destructor as "
protected" and "non-virtual", but lines 75-78 now declarevirtual ~DeviceRunnerBase() = default;underpublic. The comment should be updated to match the new design.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/platform/onboard/host/device_runner_base.h` around lines 67 - 71, The class-level comment is stale: it claims the destructor is protected and non-virtual but the code now declares "virtual ~DeviceRunnerBase() = default;" in public. Update the docstring for DeviceRunnerBase to state that the destructor is public and virtual, remove the claim that direct instantiation/delete through base pointer are compile errors, and adjust the rationale to explain that a public virtual destructor allows safe polymorphic deletion while destroy_device_context still operates on the arch subclass's DeviceRunner; reference DeviceRunnerBase, ~DeviceRunnerBase, and destroy_device_context in the updated comment.
🧹 Nitpick comments (1)
src/common/platform/onboard/host/c_api_shared.cpp (1)
68-68: 💤 Low value
pthread_key_createreturn value is unchecked.If
pthread_key_createfails (e.g.,EAGAINfrom exhausted keys), subsequentpthread_getspecific/pthread_setspecificcalls operate on an uninitialized key, causing undefined behavior. Consider checking the return and logging on failure.Defensive fix
-static void create_runner_key() { pthread_key_create(&g_runner_key, nullptr); } +static void create_runner_key() { + int rc = pthread_key_create(&g_runner_key, nullptr); + if (rc != 0) { + LOG_ERROR("pthread_key_create failed: %d", rc); + } +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/platform/onboard/host/c_api_shared.cpp` at line 68, The call to pthread_key_create in create_runner_key does not check its return value, so if it fails (e.g., EAGAIN) g_runner_key remains uninitialized; update create_runner_key to check the return of pthread_key_create, log an error (using the existing logging facility) when it fails, and take a safe recovery action (e.g., abort/exit or set a sentinel and avoid calls to pthread_getspecific/pthread_setspecific) so later uses of g_runner_key are guarded; reference the create_runner_key function, the g_runner_key variable, and the pthread_key_create call when applying this fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/common/platform/onboard/host/device_runner_base.h`:
- Around line 67-71: The class-level comment is stale: it claims the destructor
is protected and non-virtual but the code now declares "virtual
~DeviceRunnerBase() = default;" in public. Update the docstring for
DeviceRunnerBase to state that the destructor is public and virtual, remove the
claim that direct instantiation/delete through base pointer are compile errors,
and adjust the rationale to explain that a public virtual destructor allows safe
polymorphic deletion while destroy_device_context still operates on the arch
subclass's DeviceRunner; reference DeviceRunnerBase, ~DeviceRunnerBase, and
destroy_device_context in the updated comment.
---
Nitpick comments:
In `@src/common/platform/onboard/host/c_api_shared.cpp`:
- Line 68: The call to pthread_key_create in create_runner_key does not check
its return value, so if it fails (e.g., EAGAIN) g_runner_key remains
uninitialized; update create_runner_key to check the return of
pthread_key_create, log an error (using the existing logging facility) when it
fails, and take a safe recovery action (e.g., abort/exit or set a sentinel and
avoid calls to pthread_getspecific/pthread_setspecific) so later uses of
g_runner_key are guarded; reference the create_runner_key function, the
g_runner_key variable, and the pthread_key_create call when applying this fix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8386eb5c-922f-46cf-9851-6b8c5508809d
📒 Files selected for processing (8)
src/a2a3/platform/onboard/host/CMakeLists.txtsrc/a2a3/platform/onboard/host/device_runner.hsrc/a2a3/platform/onboard/host/pto_runtime_c_api.cppsrc/a5/platform/onboard/host/CMakeLists.txtsrc/a5/platform/onboard/host/device_runner.hsrc/a5/platform/onboard/host/pto_runtime_c_api.cppsrc/common/platform/onboard/host/c_api_shared.cppsrc/common/platform/onboard/host/device_runner_base.h
Mirror the onboard host refactor (hw-native-sys#880…hw-native-sys#928) on the sim path. Pulls ~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only the bits that genuinely differ between a2a3 and a5 sim: - aicore_execute signature (a5 has extra aicore_pmu_ring_addrs arg) - dlsym'd function-pointer table (a2a3 has dep_gen / pmu_reg_addrs / aicore_rotation_table setters; a5 doesn't) - init_* alloc strategy (a2a3 uses mem_alloc_ via captured lambdas; a5 uses std::malloc via prof_alloc_cb static functions — preserved as-is, no behavior change) - finalize() collector semantics (a2a3 releases shm to mem_alloc_ per-run; a5 stop()s per-run, full finalize at run-end via prof_free_cb — preserved as-is) - run() middle (dep_gen gating on a2a3 only; different SIM_REG_* constants) SimDeviceRunnerBase hosts the byte-identical methods + their state: setup_static_arena, acquire_pooled_*, create_thread, attach_current_thread, allocate_tensor / free_tensor / copy_*, register_callable[_host_orch], unregister_callable, has_callable, bind_callable_to_runtime, prepare_orch_so, upload_chip_callable_buffer, print_handshake_results, release_callable_state, ensure_device_initialized, set_*_enabled / output_prefix accessors, last_device_wall_ns, the shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable buffer pool / collector instances / kernel_args_ / device_wall ptr / log/dlopen counters. Mechanical-fix: setup_static_arena standardized to "release all arenas on any failure" (matches a2a3 sim + onboard PR hw-native-sys#922). a5 sim had been keeping earlier-committed peers alive on later-region failure; the new common impl drops that to match the onboard invariant. Latent-fix carried over from onboard hw-native-sys#928: c_api_shared's run_prepared wraps the placement-new'd Runtime in RAIIScopeGuard so its dtor fires on every exit path (manual r->~Runtime() in the prior sim c_api was bypassed by catch(...) on exception). Polymorphism via SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase (public), run(), finalize(), set_dep_gen_enabled() (default no-op, a2a3 overrides). c_api_shared.cpp works through SimDeviceRunnerBase * and dispatches via those virtuals — same pattern as the onboard hw-native-sys#928 split. Per-arch pto_runtime_c_api.cpp shrinks from ~420 lines to ~55 (just create_device_context + ACL stubs). memory_allocator.cpp was byte-identical, deleted from both arch subdirs and lives once in common/. Both arches built clean. nm -D verifies all 17 ChipWorker dlsym targets exported on both sim libhost_runtime.so. ST passes 38/38 on a2a3sim L1+L2 (devices 0,1) and 22/22 on a5sim L1+L2 (devices 0,1); examples spot-checked (scalar_data_test, vector_example, benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Mirror the onboard host refactor (#880…#928) on the sim path. Pulls ~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only the bits that genuinely differ between a2a3 and a5 sim: - aicore_execute signature (a5 has extra aicore_pmu_ring_addrs arg) - dlsym'd function-pointer table (a2a3 has dep_gen / pmu_reg_addrs / aicore_rotation_table setters; a5 doesn't) - init_* alloc strategy (a2a3 uses mem_alloc_ via captured lambdas; a5 uses std::malloc via prof_alloc_cb static functions — preserved as-is, no behavior change) - finalize() collector semantics (a2a3 releases shm to mem_alloc_ per-run; a5 stop()s per-run, full finalize at run-end via prof_free_cb — preserved as-is) - run() middle (dep_gen gating on a2a3 only; different SIM_REG_* constants) SimDeviceRunnerBase hosts the byte-identical methods + their state: setup_static_arena, acquire_pooled_*, create_thread, attach_current_thread, allocate_tensor / free_tensor / copy_*, register_callable[_host_orch], unregister_callable, has_callable, bind_callable_to_runtime, prepare_orch_so, upload_chip_callable_buffer, print_handshake_results, release_callable_state, ensure_device_initialized, set_*_enabled / output_prefix accessors, last_device_wall_ns, the shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable buffer pool / collector instances / kernel_args_ / device_wall ptr / log/dlopen counters. Mechanical-fix: setup_static_arena standardized to "release all arenas on any failure" (matches a2a3 sim + onboard PR #922). a5 sim had been keeping earlier-committed peers alive on later-region failure; the new common impl drops that to match the onboard invariant. Latent-fix carried over from onboard #928: c_api_shared's run_prepared wraps the placement-new'd Runtime in RAIIScopeGuard so its dtor fires on every exit path (manual r->~Runtime() in the prior sim c_api was bypassed by catch(...) on exception). Polymorphism via SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase (public), run(), finalize(), set_dep_gen_enabled() (default no-op, a2a3 overrides). c_api_shared.cpp works through SimDeviceRunnerBase * and dispatches via those virtuals — same pattern as the onboard #928 split. Per-arch pto_runtime_c_api.cpp shrinks from ~420 lines to ~55 (just create_device_context + ACL stubs). memory_allocator.cpp was byte-identical, deleted from both arch subdirs and lives once in common/. Both arches built clean. nm -D verifies all 17 ChipWorker dlsym targets exported on both sim libhost_runtime.so. ST passes 38/38 on a2a3sim L1+L2 (devices 0,1) and 22/22 on a5sim L1+L2 (devices 0,1); examples spot-checked (scalar_data_test, vector_example, benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Last big chunk of duplicated onboard host code:
pto_runtime_c_api.cppwas ~80% identical between a2a3 and a5. Extract the shared part intosrc/common/platform/onboard/host/c_api_shared.cpp, linked into each arch'slibhost_runtime.so. Works throughDeviceRunnerBase *. Net -303 lines.This is the first PR that introduces
virtualtoDeviceRunnerBase. Three new virtuals (run,finalize,set_dep_gen_enabled) plus a public virtual destructor. All three are methods each arch already defined;set_dep_gen_enabledhad only existed on a2a3, the base now provides a default no-op so the sharedrun_preparedcan call it unconditionally without splitting the c_api per-arch.DeviceRunnerBaseadditionsvirtual ~DeviceRunnerBase() = default;(public)destroy_device_contextdeletes through aDeviceRunnerBase *.virtual int run(Runtime&, int, int) = 0run()is too divergent to share; the c_api just calls through.virtual int finalize() = 0finalize()already exists; the c_api just calls through.virtual void set_dep_gen_enabled(bool) {}run_preparedarch-agnostic.c_api_shared.cppowns12
dlsym'd entries + 7 static helpers + TSD glue:g_runner_key,pthread_once,current_runnerdevice_malloc/device_free/copy_to_device/copy_from_device/upload_chip_callable_buffer_wrapper/setup_static_arena_wrapper/acquire_pooled_*_wrapperdestroy_device_context,get_runtime_size,device_malloc_ctx,device_free_ctx,copy_to_device_ctx,copy_from_device_ctx,finalize_device,simpler_init,prepare_callable,run_prepared,unregister_callable,get_aicpu_dlopen_count,get_host_dlopen_countEach arch's
pto_runtime_c_api.cppnow only carriescreate_device_contextnew a2a3::DeviceRunner()new a5::DeviceRunner()ensure_acl_ready_ctxcreate_comm_stream_ctxdestroy_comm_stream_ctxcomm_*(init/alloc/derive/barrier/destroy/...)comm_hccl.cpp)Verification
nm -Don bothlibhost_runtime.soconfirms all 26 ChipWorkerdlsymtargets are exported:create_device_context destroy_device_context device_malloc_ctx device_free_ctx copy_to_device_ctx copy_from_device_ctx get_runtime_size simpler_init prepare_callable run_prepared unregister_callable get_aicpu_dlopen_count get_host_dlopen_count finalize_device ensure_acl_ready_ctx create_comm_stream_ctx destroy_comm_stream_ctx comm_init comm_alloc_windows comm_get_local_window_base comm_get_window_size comm_derive_context comm_alloc_domain_windows comm_release_domain_windows comm_barrier comm_destroyTest plan