Refactor: extract pto_runtime_c_api shared glue into common (-303 lines) #928

Merged: ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:extract-c-api-common on May 31, 2026

Conversation

@hw-native-sys-bot
Collaborator

Summary

This is the last big chunk of duplicated onboard host code: pto_runtime_c_api.cpp was ~80% identical between a2a3 and a5. This PR extracts the shared part into src/common/platform/onboard/host/c_api_shared.cpp, which is linked into each arch's libhost_runtime.so and works through DeviceRunnerBase *. Net -303 lines.

This is the first PR that introduces virtual functions to DeviceRunnerBase: three new virtuals (run, finalize, set_dep_gen_enabled) plus a public virtual destructor. run and finalize are methods each arch already defined. set_dep_gen_enabled previously existed only on a2a3, so the base now provides a default no-op; the shared run_prepared can then call it unconditionally without splitting the c_api per arch.

DeviceRunnerBase additions

| Member | Why |
|---|---|
| `virtual ~DeviceRunnerBase() = default;` (public) | Shared destroy_device_context deletes through a DeviceRunnerBase *. |
| `virtual int run(Runtime&, int, int) = 0` | Each arch's run() is too divergent to share; the c_api just calls through. |
| `virtual int finalize() = 0` | Each arch's finalize() already exists; the c_api just calls through. |
| `virtual void set_dep_gen_enabled(bool) {}` | dep_gen is a2a3-only today; default no-op for a5 keeps the shared run_prepared arch-agnostic. |
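
As a rough illustration of the contract in the table above, the following is a minimal sketch and not the repository's actual header. `Runtime` is an empty stand-in, the two derived classes and `run_with_dep_gen` are hypothetical names, and the meaning of the two `int` parameters of `run` is not stated in this PR, so they are left as placeholders.

```cpp
#include <cassert>

// Hypothetical stand-in for the real Runtime type.
struct Runtime {};

// Sketch of the base-class contract listed in the table above.
class DeviceRunnerBase {
public:
    virtual ~DeviceRunnerBase() = default;            // allows delete through a base pointer
    virtual int run(Runtime &rt, int a, int b) = 0;   // parameter names are placeholders
    virtual int finalize() = 0;
    virtual void set_dep_gen_enabled(bool) {}         // default no-op (the a5 case)
};

// Illustrative a2a3-like runner: overrides the dep_gen hook.
class RunnerWithDepGen : public DeviceRunnerBase {
public:
    int run(Runtime &, int, int) override { return dep_gen_ ? 1 : 0; }
    int finalize() override { return 0; }
    void set_dep_gen_enabled(bool on) override { dep_gen_ = on; }

private:
    bool dep_gen_ = false;
};

// Illustrative a5-like runner: inherits the no-op hook.
class RunnerWithoutDepGen : public DeviceRunnerBase {
public:
    int run(Runtime &, int, int) override { return 0; }
    int finalize() override { return 0; }
};

// Shared code can call the hook unconditionally, then dispatch run().
int run_with_dep_gen(DeviceRunnerBase &runner, bool enable) {
    Runtime rt;
    runner.set_dep_gen_enabled(enable);
    return runner.run(rt, 0, 0);
}
```

The point of the default no-op is that `run_with_dep_gen` needs no per-arch branch: a runner without dep_gen support simply ignores the call.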

c_api_shared.cpp owns

13 dlsym'd entries + 7 static helpers + TSD glue:

  • TSD: g_runner_key, pthread_once, current_runner
  • Static: device_malloc / device_free / copy_to_device / copy_from_device / upload_chip_callable_buffer_wrapper / setup_static_arena_wrapper / acquire_pooled_*_wrapper
  • C ABI: destroy_device_context, get_runtime_size, device_malloc_ctx, device_free_ctx, copy_to_device_ctx, copy_from_device_ctx, finalize_device, simpler_init, prepare_callable, run_prepared, unregister_callable, get_aicpu_dlopen_count, get_host_dlopen_count
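
The TSD glue named above (g_runner_key, pthread_once, current_runner) could look roughly like the sketch below. This is an assumption about its shape, not the file's real contents: `bind_runner`, `g_runner_once`, and `g_runner_key_ok` are hypothetical names, and the key-creation check is added here for illustration.

```cpp
#include <cassert>
#include <pthread.h>

class DeviceRunnerBase;  // opaque here; only the pointer is stored

namespace {
pthread_key_t g_runner_key;
pthread_once_t g_runner_once = PTHREAD_ONCE_INIT;
bool g_runner_key_ok = false;  // hypothetical flag recording whether the key exists

// Runs exactly once per process, however many threads race to call it.
void create_runner_key() {
    g_runner_key_ok = (pthread_key_create(&g_runner_key, nullptr) == 0);
}
}  // namespace

// Hypothetical helper: bind a runner to the calling thread.
// Returns false if the key could not be created or set.
bool bind_runner(DeviceRunnerBase *runner) {
    pthread_once(&g_runner_once, create_runner_key);
    if (!g_runner_key_ok) return false;
    return pthread_setspecific(g_runner_key, runner) == 0;
}

// Returns the runner bound to the calling thread, or nullptr if none.
DeviceRunnerBase *current_runner() {
    pthread_once(&g_runner_once, create_runner_key);
    if (!g_runner_key_ok) return nullptr;
    return static_cast<DeviceRunnerBase *>(pthread_getspecific(g_runner_key));
}
```

Each thread sees only the pointer it bound itself, which is what lets context-free helpers such as device_malloc find the right runner without an explicit ctx argument.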

Each arch's pto_runtime_c_api.cpp now only carries

| | a2a3 | a5 |
|---|---|---|
| create_device_context | new a2a3::DeviceRunner() | new a5::DeviceRunner() |
| ensure_acl_ready_ctx | real | not-supported stub |
| create_comm_stream_ctx | real | NULL stub |
| destroy_comm_stream_ctx | real | 0-return stub |
| comm_* (init/alloc/derive/barrier/destroy/...) | (provided by comm_hccl.cpp) | not-supported stubs |
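
To make the split in the table concrete, here is a hedged sketch of an a5-style file next to the shared destroy path. All signatures are assumptions (the PR does not show them), `live_runners` is a hypothetical counter added only to make destruction observable, and -1 is used as an illustrative not-supported code.

```cpp
#include <cassert>

class DeviceRunnerBase {
public:
    virtual ~DeviceRunnerBase() = default;  // required for the shared delete below
};

namespace a5 {
int live_runners = 0;  // hypothetical counter, for illustration only

class DeviceRunner : public DeviceRunnerBase {
public:
    DeviceRunner() { ++live_runners; }
    ~DeviceRunner() override { --live_runners; }
};
}  // namespace a5

extern "C" {
// Per-arch: the only entry that must name the concrete subclass.
// Converting to the base pointer first keeps the void* round trip well defined.
void *create_device_context() {
    return static_cast<DeviceRunnerBase *>(new a5::DeviceRunner());
}

// Shared: deletes through the base pointer, relying on the virtual destructor.
void destroy_device_context(void *ctx) {
    delete static_cast<DeviceRunnerBase *>(ctx);
}

// a5-style stubs; argument lists are assumed.
int ensure_acl_ready_ctx(void *, int) { return -1; }          // not supported
void *create_comm_stream_ctx(void *) { return nullptr; }      // NULL stub
int destroy_comm_stream_ctx(void *, void *) { return 0; }     // 0-return stub
}
```

Without the virtual destructor, the `delete` in `destroy_device_context` would skip `a5::DeviceRunner`'s destructor, which is why the destructor change and the shared destroy path arrive in the same PR.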

Verification

  • Both arches built clean (onboard + sim, both runtimes).
  • nm -D on both libhost_runtime.so confirms all 26 ChipWorker dlsym targets are exported:
    create_device_context destroy_device_context device_malloc_ctx device_free_ctx copy_to_device_ctx copy_from_device_ctx get_runtime_size simpler_init prepare_callable run_prepared unregister_callable get_aicpu_dlopen_count get_host_dlopen_count finalize_device ensure_acl_ready_ctx create_comm_stream_ctx destroy_comm_stream_ctx comm_init comm_alloc_windows comm_get_local_window_base comm_get_window_size comm_derive_context comm_alloc_domain_windows comm_release_domain_windows comm_barrier comm_destroy
  • Local a2a3 onboard smoke (dummy_task, alternating_matmul_add, prepared_callable suite, spmd_basic) — 9/9 passed in 19s.

Test plan

  • CI st-sim-a2a3 / st-sim-a5
  • CI st-onboard-a2a3 / st-onboard-a5
  • CI ut-a2a3 / ut-a5

@coderabbitai

coderabbitai Bot commented May 31, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

This PR consolidates duplicated C API logic across a2a3 and a5 on-board host runtimes into a new shared implementation file. DeviceRunnerBase is updated with a public virtual destructor and pure virtual entry points to define the contract. Both platform-specific DeviceRunner classes mark their methods as overrides. New c_api_shared.cpp implements all common C ABI glue using thread-local binding. Platform-specific files are reduced to only entry points and ACL/comm placeholders, with duplicate implementations removed and source inclusion updated in CMake.

Changes

C API Extraction and Consolidation

| Layer / File(s) | Summary |
|---|---|
| Base Class Interface Contract (src/common/platform/onboard/host/device_runner_base.h) | DeviceRunnerBase gains a public virtual destructor and pure virtual run(...) and finalize() methods, plus a virtual set_dep_gen_enabled(bool) hook with default no-op. These declarations define the polymorphic contract the shared C API will call through. |
| Concrete DeviceRunner Override Declarations (src/a2a3/platform/onboard/host/device_runner.h, src/a5/platform/onboard/host/device_runner.h) | Both a2a3 and a5 DeviceRunner classes add override specifiers to run(...) and finalize(). The a2a3 class also marks set_dep_gen_enabled(bool) as an override. |
| Shared C API Implementation (src/common/platform/onboard/host/c_api_shared.cpp) | New 441-line shared implementation providing per-thread DeviceRunnerBase* binding via pthread TLS, internal helpers for device memory and arena operations, and all exported C API entry points: context lifecycle, tensor allocation/copy/free, device initialization with CANN logging setup and eager resource initialization, callable preparation with address repacking and dual registration paths, and runtime execution with feature toggle configuration, callback wiring, and timing collection. |
| Build Wiring and Platform Migration (src/a2a3/platform/onboard/host/CMakeLists.txt, src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp, src/a5/platform/onboard/host/CMakeLists.txt, src/a5/platform/onboard/host/pto_runtime_c_api.cpp) | Both CMakeLists add c_api_shared.cpp to HOST_RUNTIME_SOURCES. a2a3 removes destroy_device_context and get_runtime_size from its C API file, retaining only a2a3-specific and ACL/comm entries. a5 removes all device context, memory, and lifecycle implementations, keeping only create_device_context and ACL/comm placeholders. File headers updated to reflect shared logic location. |

Sequence Diagram

sequenceDiagram
    participant ChipWorker as ChipWorker
    participant c_api_shared as c_api_shared.cpp
    participant DeviceRunnerBase as DeviceRunnerBase*
    participant Runtime as Runtime
    ChipWorker->>c_api_shared: simpler_init(ctx, binaries)
    c_api_shared->>DeviceRunnerBase: attach_thread()
    c_api_shared->>DeviceRunnerBase: ensure_device_initialized()
    ChipWorker->>c_api_shared: prepare_callable(ctx, id, callable)
    c_api_shared->>DeviceRunnerBase: prepare_callable_impl()
    c_api_shared->>DeviceRunnerBase: register_callable()
    ChipWorker->>c_api_shared: run_prepared(ctx, runtime, id, args, ...)
    c_api_shared->>Runtime: new Runtime(host_apis, callbacks)
    c_api_shared->>DeviceRunnerBase: bind_callable()
    c_api_shared->>DeviceRunnerBase: run_task()
    c_api_shared->>ChipWorker: return status & timing

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#880: Introduces and moves the shared DeviceRunnerBase foundation and tensor/arena helpers that this PR builds upon with new virtual entry points and C API glue.
  • hw-native-sys/simpler#913: Refactors runner diagnostics, collectors, and enablement hooks on DeviceRunnerBase that the new c_api_shared.cpp calls to configure runner toggles during task execution.
  • hw-native-sys/simpler#909: Extracts lifecycle helpers (device initialization, thread binding, callable running) into DeviceRunnerBase that align with the virtual run and finalize hooks newly exposed in this PR's base interface.

Poem

🐰 A refactor hops through code with care,
Shared logic finds a home up there,
From base to glue, we weave the thread,
One API shines where two once spread!
The platforms dance in harmony—
Less duplication, more to see! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.53% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately describes the main refactoring effort: extracting shared C API implementation code from duplicate copies in a2a3 and a5, with the net result of removing 303 lines. |
| Description check | ✅ Passed | The PR description provides comprehensive context about the refactoring, including detailed explanations of what was extracted, why virtual methods were added, verification results, and a test plan. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the PTO Runtime C API by extracting the common, architecture-independent C API glue from the arch-specific pto_runtime_c_api.cpp files into a new shared file, c_api_shared.cpp. To facilitate this, DeviceRunnerBase is updated with a public virtual destructor and virtual entry points (run, finalize, and set_dep_gen_enabled), which are overridden in the concrete subclasses. The review feedback highlights an exception-safety issue in the shared run_prepared function, where the placement-new'ed Runtime object's destructor is called manually on each return path. If an exception is thrown, the destructor is bypassed, leading to potential resource leaks. Using RAIIScopeGuard is recommended to ensure exception-safe cleanup.
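
The exception-safety concern in this review can be illustrated with a small sketch. It is not the code under review: `ScopeGuard` is a hypothetical minimal guard (the repository's RAIIScopeGuard may have a different interface), `FakeRuntime` stands in for Runtime, and `run_in_buffer` is an invented function that mimics a C entry point converting exceptions into an error code.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>
#include <utility>

// Hypothetical minimal guard: runs the stored callable when it goes out of scope.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : f_(std::move(f)) {}
    ~ScopeGuard() { f_(); }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard &operator=(const ScopeGuard &) = delete;

private:
    F f_;
};

// Stand-in for Runtime that counts live instances.
struct FakeRuntime {
    static int live;
    FakeRuntime() { ++live; }
    ~FakeRuntime() { --live; }
};
int FakeRuntime::live = 0;

// Constructs a FakeRuntime in caller-provided storage, optionally throws,
// and converts any exception into an error code at the C boundary.
int run_in_buffer(void *storage, bool fail) {
    try {
        FakeRuntime *r = new (storage) FakeRuntime();
        auto destroy = [r] { r->~FakeRuntime(); };
        ScopeGuard<decltype(destroy)> guard(destroy);  // destructor now runs on every exit path
        if (fail) throw std::runtime_error("simulated failure");
        return 0;
    } catch (...) {
        return -1;
    }
}
```

With a manual `r->~FakeRuntime()` before each `return` instead of the guard, the throwing path would jump straight to the `catch` and leave the object alive in the buffer.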

Comment thread src/common/platform/onboard/host/c_api_shared.cpp Outdated
@ChaoWao ChaoWao force-pushed the extract-c-api-common branch from dbde828 to 65dddf4 Compare May 31, 2026 02:07
Move the byte-identical c_api functions from a2a3 and a5 onboard
pto_runtime_c_api.cpp into a single shared
src/common/platform/onboard/host/c_api_shared.cpp linked into each
arch's libhost_runtime.so. Works through DeviceRunnerBase * and
dispatches arch-specific behavior through three new virtuals.

DeviceRunnerBase additions:
- Public virtual destructor (was protected non-virtual). Lets the
  shared destroy_device_context delete polymorphically.
- virtual int run(Runtime&, int, int) = 0  — each arch already had run().
- virtual int finalize() = 0  — each arch already had finalize().
- virtual void set_dep_gen_enabled(bool) — default no-op for a5;
  a2a3 overrides to wire enable_dep_gen_. Lets the shared
  run_prepared call set_dep_gen_enabled unconditionally.

c_api_shared.cpp owns (13 dlsym'd + 7 static helpers + TSD glue):
- TSD glue (g_runner_key, pthread_once, current_runner)
- Static internal: device_malloc, device_free, copy_to_device,
  copy_from_device, upload_chip_callable_buffer_wrapper,
  setup_static_arena_wrapper, acquire_pooled_*_wrapper
- Public C ABI: destroy_device_context, get_runtime_size,
  device_malloc_ctx, device_free_ctx, copy_to_device_ctx,
  copy_from_device_ctx, finalize_device, simpler_init,
  prepare_callable, run_prepared, unregister_callable,
  get_aicpu_dlopen_count, get_host_dlopen_count

Each arch's pto_runtime_c_api.cpp now only carries arch-specific
entries (also dlsym'd by ChipWorker):
- create_device_context (must `new` the concrete subclass)
- a2a3: ensure_acl_ready_ctx + create_comm_stream_ctx +
  destroy_comm_stream_ctx (real implementations through
  DeviceRunner::{ensure_acl_ready, create_comm_stream, destroy_comm_stream})
- a5: same three as not-supported stubs plus all comm_* stubs
  (returns NULL / -1, distributed runtime not yet on a5)

Both arches built clean. nm -D verifies all 26 ChipWorker dlsym
targets are exported on both libhost_runtime.so. a2a3 onboard smoke
(dummy_task, alternating_matmul_add, prepared_callable suite,
spmd_basic) — 9/9 passed in 19s.

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/common/platform/onboard/host/device_runner_base.h (1)

67-71: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale class-level docstring: destructor is now public virtual, not protected non-virtual.

Lines 67-71 still describe the destructor as "protected" and "non-virtual", but lines 75-78 now declare virtual ~DeviceRunnerBase() = default; under public. The comment should be updated to match the new design.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/platform/onboard/host/device_runner_base.h` around lines 67 - 71,
The class-level comment is stale: it claims the destructor is protected and
non-virtual but the code now declares "virtual ~DeviceRunnerBase() = default;"
in public. Update the docstring for DeviceRunnerBase to state that the
destructor is public and virtual, remove the claim that direct
instantiation/delete through base pointer are compile errors, and adjust the
rationale to explain that a public virtual destructor allows safe polymorphic
deletion while destroy_device_context still operates on the arch subclass's
DeviceRunner; reference DeviceRunnerBase, ~DeviceRunnerBase, and
destroy_device_context in the updated comment.
🧹 Nitpick comments (1)
src/common/platform/onboard/host/c_api_shared.cpp (1)

68-68: 💤 Low value

pthread_key_create return value is unchecked.

If pthread_key_create fails (e.g., EAGAIN from exhausted keys), subsequent pthread_getspecific/pthread_setspecific calls operate on an uninitialized key, causing undefined behavior. Consider checking the return and logging on failure.

Defensive fix
-static void create_runner_key() { pthread_key_create(&g_runner_key, nullptr); }
+static void create_runner_key() {
+    int rc = pthread_key_create(&g_runner_key, nullptr);
+    if (rc != 0) {
+        LOG_ERROR("pthread_key_create failed: %d", rc);
+    }
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/platform/onboard/host/c_api_shared.cpp` at line 68, The call to
pthread_key_create in create_runner_key does not check its return value, so if
it fails (e.g., EAGAIN) g_runner_key remains uninitialized; update
create_runner_key to check the return of pthread_key_create, log an error (using
the existing logging facility) when it fails, and take a safe recovery action
(e.g., abort/exit or set a sentinel and avoid calls to
pthread_getspecific/pthread_setspecific) so later uses of g_runner_key are
guarded; reference the create_runner_key function, the g_runner_key variable,
and the pthread_key_create call when applying this fix.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 55c3d8b and dbde828.

📒 Files selected for processing (8)
  • src/a2a3/platform/onboard/host/CMakeLists.txt
  • src/a2a3/platform/onboard/host/device_runner.h
  • src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a5/platform/onboard/host/CMakeLists.txt
  • src/a5/platform/onboard/host/device_runner.h
  • src/a5/platform/onboard/host/pto_runtime_c_api.cpp
  • src/common/platform/onboard/host/c_api_shared.cpp
  • src/common/platform/onboard/host/device_runner_base.h

@ChaoWao ChaoWao merged commit 8199c70 into hw-native-sys:main May 31, 2026
29 of 31 checks passed
@ChaoWao ChaoWao deleted the extract-c-api-common branch May 31, 2026 02:51
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
Mirror the onboard host refactor (hw-native-sys#880hw-native-sys#928) on the sim path. Pulls
~2800 duplicate lines out of src/{a2a3,a5}/platform/sim/host/ into a
shared SimDeviceRunnerBase + c_api_shared.cpp + memory_allocator.cpp
under src/common/platform/sim/host/. Per-arch DeviceRunner keeps only
the bits that genuinely differ between a2a3 and a5 sim:

- aicore_execute signature (a5 has extra aicore_pmu_ring_addrs arg)
- dlsym'd function-pointer table (a2a3 has dep_gen / pmu_reg_addrs /
  aicore_rotation_table setters; a5 doesn't)
- init_* alloc strategy (a2a3 uses mem_alloc_ via captured lambdas;
  a5 uses std::malloc via prof_alloc_cb static functions — preserved
  as-is, no behavior change)
- finalize() collector semantics (a2a3 releases shm to mem_alloc_
  per-run; a5 stop()s per-run, full finalize at run-end via
  prof_free_cb — preserved as-is)
- run() middle (dep_gen gating on a2a3 only; different SIM_REG_*
  constants)

SimDeviceRunnerBase hosts the byte-identical methods + their state:
setup_static_arena, acquire_pooled_*, create_thread,
attach_current_thread, allocate_tensor / free_tensor / copy_*,
register_callable[_host_orch], unregister_callable, has_callable,
bind_callable_to_runtime, prepare_orch_so, upload_chip_callable_buffer,
print_handshake_results, release_callable_state, ensure_device_initialized,
set_*_enabled / output_prefix accessors, last_device_wall_ns, the
shared mem_alloc_ / gm_*_arena_ / callable maps / chip_callable
buffer pool / collector instances / kernel_args_ / device_wall ptr /
log/dlopen counters.

Mechanical-fix: setup_static_arena standardized to "release all
arenas on any failure" (matches a2a3 sim + onboard PR hw-native-sys#922). a5 sim
had been keeping earlier-committed peers alive on later-region
failure; the new common impl drops that to match the onboard
invariant.

Latent-fix carried over from onboard hw-native-sys#928: c_api_shared's
run_prepared wraps the placement-new'd Runtime in RAIIScopeGuard so
its dtor fires on every exit path (manual r->~Runtime() in the prior
sim c_api was bypassed by catch(...) on exception).

Polymorphism via SimDeviceRunnerBase virtuals: ~SimDeviceRunnerBase
(public), run(), finalize(), set_dep_gen_enabled() (default no-op,
a2a3 overrides). c_api_shared.cpp works through
SimDeviceRunnerBase * and dispatches via those virtuals — same
pattern as the onboard hw-native-sys#928 split.

Per-arch pto_runtime_c_api.cpp shrinks from ~420 lines to ~55 (just
create_device_context + ACL stubs). memory_allocator.cpp was
byte-identical, deleted from both arch subdirs and lives once in
common/.

Both arches built clean. nm -D verifies all 17 ChipWorker dlsym
targets exported on both sim libhost_runtime.so. ST passes 38/38 on
a2a3sim L1+L2 (devices 0,1) and 22/22 on a5sim L1+L2 (devices 0,1);
examples spot-checked (scalar_data_test, vector_example,
benchmark_bgemm on a2a3sim; vector_example, bgemm on a5sim).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
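
The "release all arenas on any failure" rule from the commit message above can be sketched as follows. This is only an illustration of the invariant, not the real setup_static_arena: `setup_arenas`, `AllocFn`, and `alloc_or_fail` are hypothetical names, and plain malloc/free stand in for whatever device allocator the runner actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical allocator hook so that a failure can be simulated.
using AllocFn = void *(*)(std::size_t);

// Reserves one region per requested size. On any failure, every region
// reserved so far is released and the output vector is left empty.
bool setup_arenas(const std::vector<std::size_t> &sizes, AllocFn alloc,
                  std::vector<void *> &out) {
    out.clear();
    for (std::size_t size : sizes) {
        void *p = alloc(size);
        if (p == nullptr) {
            for (void *q : out) std::free(q);  // roll back earlier regions
            out.clear();
            return false;
        }
        out.push_back(p);
    }
    return true;
}

// Illustrative allocator: treats a zero-sized request as a failure.
void *alloc_or_fail(std::size_t size) {
    return size == 0 ? nullptr : std::malloc(size);
}
```

The alternative behavior described for the old a5 sim, keeping earlier regions alive after a later one fails, would correspond to returning false without the rollback loop.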
ChaoWao added a commit that referenced this pull request May 31, 2026