Feat: prepared callable — register + run(cid) on a unified ABI #710
Open
poursoul wants to merge 28 commits into hw-native-sys:main from
Conversation
Code Review
This pull request implements a 'prepared callable' mechanism to reduce orchestration overhead by caching kernel binaries and SO handles across repeated launches. It transitions the system to use registered callable_id integers instead of raw pointers in submit_next_level and run calls. The changes span the Python API, C++ bindings, and the AICPU executor, which now maintains a per-ID orchestration table. Review feedback identifies a performance issue in the Python chip loop due to TaskArgs instantiation and points out missing bounds checks for callable_id in both the host-side registration and the AICPU-side dispatch logic.
poursoul added a commit to poursoul/simpler that referenced this pull request on May 6, 2026

- Apply clang-format on src/a5/platform/sim/host/pto_runtime_c_api.cpp and src/common/worker/chip_worker.h (pre-commit fix).
- register_prepared_callable: enforce callable_id in [0, 64) in both a2a3 onboard and sim DeviceRunner so an out-of-range id fails fast on host instead of OOB-indexing the AICPU orch_so_table_ later.
- aicpu_executor: reject negative callable_id values other than the legacy -1 sentinel (mirrors the upper-bound guard).
- tests/st/explicit_fatal: migrate to Stage 4 register + run(cid) API so the negative ST works under the unified run(cid) entry point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
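The two-sided guard in this commit can be modeled as a short sketch. This is an illustrative Python toy, not the C++ DeviceRunner/aicpu_executor code; the function names are invented, and the constant's value (64) is taken from the commit message:

```python
MAX_REGISTERED_CALLABLE_IDS = 64  # upper bound stated in the commit: [0, 64)

def check_register(callable_id: int) -> None:
    """Host-side guard: fail fast before the id ever indexes the device table."""
    if not (0 <= callable_id < MAX_REGISTERED_CALLABLE_IDS):
        raise ValueError(f"callable_id {callable_id} out of range [0, 64)")

def check_dispatch(callable_id: int) -> bool:
    """Device-side guard: -1 selects the legacy single-slot path;
    any other negative value is rejected, mirroring the upper-bound check."""
    if callable_id == -1:
        return False               # legacy path, no table lookup
    check_register(callable_id)    # rejects -2, 64, etc.
    return True                    # per-cid table lookup is safe
```

The point of duplicating the range check on the host is that a bad id produces a clean error at registration time instead of out-of-bounds indexing on the AICPU later.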
poursoul added a commit to poursoul/simpler that referenced this pull request on May 7, 2026

- python/bindings: add TaskArgs overload for ChipWorker.run() so chip child loops on variants without prepare_callable can dispatch via the legacy TaskArgs path (fixes a5 multi_chip_dispatch failures).
- a2a3 sim device_runner: initialize Worker.l2_perf_records_addr in the per-core init loop (matches onboard); uninitialized garbage was being treated as a valid pointer when the L2 swimlane bit happened to be set in enable_profiling_flag, causing AICore segfaults.
- a2a3 onboard host_regs: restore placeholder-address fallback for AicoreRegKind::Ctrl on halMemCtl failure (the dispatch path does not dereference these); Pmu kind continues to propagate failure so the caller can disable PMU collection cleanly.
- a2a3 l2_perf_collector.h: drop unused #include "runtime.h" so clang-tidy can lint the header without per-runtime include paths.
poursoul added a commit to poursoul/simpler that referenced this pull request on May 7, 2026

- python/bindings: add TaskArgs overload for ChipWorker.run() so chip child loops on variants without prepare_callable can dispatch via the legacy TaskArgs path (fixes a5 multi_chip_dispatch failures).
- a2a3 sim device_runner: in upload_kernel_binary, when func_id is cached, compare the new bytes against the cached binary and re-dlopen on mismatch. Stage 4 (hw-native-sys#710) wires multiple ChipCallables onto the same ChipWorker (and DeviceRunner) via prepare_callable, so different callables register distinct kernels under overlapping func_ids; the prior unconditional cache-hit handed the AICore the previous callable's kernel and segfaulted at dispatch.
- a2a3 sim device_runner: initialize Worker.l2_perf_records_addr in the per-core init loop (matches onboard); uninitialized garbage was being treated as a valid pointer when the L2 swimlane bit happened to be set in enable_profiling_flag, causing AICore segfaults.
- a2a3 onboard host_regs: restore placeholder-address fallback for AicoreRegKind::Ctrl on halMemCtl failure (the dispatch path does not dereference these); Pmu kind continues to propagate failure so the caller can disable PMU collection cleanly.
- a2a3 runtime aicpu_executor: replace stray DEV_ERROR (undefined in this branch's logging surface) with LOG_ERROR, and drop the spurious leading 0 argument on a LOG_INFO_V0 call (V0 is the verbosity-0 form, not LOG_INFO_V).
- a2a3 l2_perf_collector.h: drop unused #include "runtime.h" so clang-tidy can lint the header without per-runtime include paths.
51d2d8f to c500507
poursoul added a commit to poursoul/simpler that referenced this pull request on May 7, 2026

- python/bindings: add TaskArgs overload for ChipWorker.run() so chip child loops on variants without prepare_callable can dispatch via the legacy TaskArgs path (fixes a5 multi_chip_dispatch failures).
- a2a3 sim/onboard device_runner: in upload_kernel_binary, hash the incoming bytes and re-upload when a cached func_id entry holds a different binary. Stage 4 wires multiple ChipCallables onto the same ChipWorker (and DeviceRunner) via prepare_callable, so different callables register distinct kernels under overlapping func_ids; the prior unconditional cache hit handed the AICore the previous callable's kernel and segfaulted (sim) or hung the AICPU dispatch spin-wait (onboard) on the next run.
- a2a3 sim device_runner: initialize Worker.l2_perf_records_addr in the per-core init loop (matches onboard); uninitialized garbage was being treated as a valid pointer when the L2 swimlane bit happened to be set in enable_profiling_flag, causing AICore segfaults.
- a2a3 onboard host_regs: restore placeholder-address fallback for AicoreRegKind::Ctrl on halMemCtl failure (the dispatch path does not dereference these); Pmu kind continues to propagate failure so the caller can disable PMU collection cleanly.
- a2a3 runtime aicpu_executor: replace stray DEV_ERROR (undefined in this branch's logging surface) with LOG_ERROR, and drop the spurious leading 0 argument on a LOG_INFO_V0 call (V0 is the verbosity-0 form, not LOG_INFO_V).
- a2a3 l2_perf_collector.h: drop unused #include "runtime.h" so clang-tidy can lint the header without per-runtime include paths.
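The hash-guarded kernel cache described in the second bullet can be sketched as follows. This is a hedged Python model of the idea, not the C++ device_runner code; the cache variable and function signature are invented for illustration:

```python
import hashlib

# func_id -> sha256 digest of the bytes last uploaded for that id
_kernel_cache: dict[int, str] = {}

def upload_kernel_binary(func_id: int, blob: bytes) -> bool:
    """Return True when an upload (or re-upload) actually happens.

    A cached func_id entry only counts as a hit when the stored hash
    matches the incoming bytes; a different binary under the same
    func_id evicts the stale entry instead of being silently skipped.
    """
    digest = hashlib.sha256(blob).hexdigest()
    if _kernel_cache.get(func_id) == digest:
        return False                 # genuine hit: same bytes already resident
    _kernel_cache[func_id] = digest  # first sighting or stale entry
    return True
```

Under the old unconditional cache hit, the second callable's kernel would never replace the first one's under an overlapping func_id, which is exactly the segfault/hang scenario the bullet describes.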
poursoul added a commit to poursoul/simpler that referenced this pull request on May 8, 2026

… API
- vector_add: register chip_callable before init(), pass cid to worker.run
- child_memory: register before init(), pass cid to orch.submit_next_level
- Update vector_add README and docstring diagram to match the new flow

Resolves CI failures in st-sim-a2a3 (ubuntu/macos) on PR hw-native-sys#710.
poursoul added a commit to poursoul/simpler that referenced this pull request on May 8, 2026

Function grew to 104 statements (limit 100) after the callable refactor. The function is structured as a single dispatch loop over the bootstrap + control-mailbox protocol — splitting it would obscure the state machine, so add PLR0915 to the existing PLR0912 noqa. Resolves the pre-commit CI failure on PR hw-native-sys#710.
Foundation for the callable.md design: lift each-run dlclose+dlopen on AICPU (caused by alternating callables) to a one-time-per-callable_id load. Adds active_callable_id_/register_new_callable_id_ to the Runtime struct and a 64-slot orch_so_table_ on the AICPU executor. active_callable_id_ < 0 keeps the legacy single-slot path (governed by has_new_orch_so_) untouched, so existing run_runtime() callers and all six other variants continue to work without changes.

Verified:
- tests/ut/py/test_chip_worker.py: 12/12 pass on a2a3sim
- examples/.../vector_example: pass on a2a3sim
Implement Layer 3 of the per-callable_id dispatch protocol described in
docs/callable.md. Splits the legacy run_runtime path into a one-time
prepare phase (uploads orch SO + kernels, builds the per-cid metadata)
and a per-call run phase (binds cached state to a fresh Runtime, then
launches without re-uploading bytes).
- Extract prepare_callable_impl / bind_prepared_to_runtime_impl out of
init_runtime_impl in trb runtime_maker.cpp so the c_api layer can
drive the prepare/run split independently.
- DeviceRunner (onboard + sim) gains prepared_callables_ keyed by
callable_id, an orch_so_dedup_ table that refcounts identical SO
bytes by Build-ID hash, and aicpu_seen_callable_ids_ to drive
register_new_callable_id_ on first sighting per cid.
- prepare_orch_so resolves the active callable_id when present and
short-circuits the H2D upload to the cached buffer; legacy callers
with cid<0 still take the original pending_orch_so path.
- New ABI exported from pto_runtime_c_api.{h,cpp} on both platforms.
Variants without callable.md support (host_build_graph,
aicpu_build_graph) export stubs that return -1, gated by
RUNTIME_HAS_CALLABLE_ID defined only in the trb runtime.h, so the
shared device_runner.cpp compiles cleanly across all six variants.
… + Python

Layer 4 of the callable.md migration: drive the per-callable_id C ABI (introduced in fc721150) end-to-end through ChipWorker, the nanobind surface, and the Python wrapper, plus a sticky flag in DeviceRunner that keeps finalize's "kernel still cached" leak signal honest now that the prepared-callable path legitimately keeps kernels resident until finalize.

- ChipWorker (src/common/worker): dlsym the new symbols and add prepare_callable / run_prepared / unregister_callable methods with device-not-set guards. Stubs in non-trb variants surface the runtime rejection as a thrown error.
- nanobind: bind the three methods on _ChipWorker so the Python wrapper can drive them without a separate raw-pointer path.
- Python wrapper (simpler.task_interface.ChipWorker): thin pass-through that mirrors run()'s **kwargs config-override pattern.
- DeviceRunner.finalize: distinguish legacy-path "still-cached kernels" leaks from prepared-callable kernels that live until finalize by design. Uses a sticky prepared_callable_path_used_ flag set by register_prepared_callable (never cleared, so a post-unregister finalize still routes to DEBUG instead of ERROR).
- tests/ut/py/test_chip_worker.py: 3 new state-machine guards covering the new methods before set_device.
- tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable: new e2e test that prepares two callable_ids sharing the vector_example orch SO, runs each via run_prepared (cid=0 twice to hit the dedup path), then unregisters both.

Verified:
- tests/ut/py/test_chip_worker.py: 15/15 PASSED
- prepared_callable test: PASSED on a2a3sim
- paged_attention_unroll on a2a3 hardware (--device 9): PASSED
Stage 2 of docs/callable.md: make the prepare/run_prepared/unregister
ABI uniform across all 5 valid runtime variants (3 a2a3 + 2 a5) so
ChipWorker dlsym is independent of which variant is loaded.
- a5/platform/{onboard,sim}/host/pto_runtime_c_api.cpp: add
unconditional stubs (prepare_callable/run_prepared return -1 with
LOG_ERROR; unregister_callable returns 0). a5 has no
RUNTIME_HAS_CALLABLE_ID-aware path yet, so the stubs are the entire
surface; full support is deferred until a5 picks up the per-cid
orch SO dispatch.
- python/simpler/worker.py: add L2 facade methods on Worker that
forward to the underlying ChipWorker. The ST framework's
conftest.st_worker fixture wraps ChipWorker in Worker(level=2),
so prepared_callable e2e tests (and any future caller going through
Worker) need this thin pass-through. L3+ still raises
NotImplementedError pending Stage 3 (mailbox protocol switch to cid).
a2a3/{host_build_graph,aicpu_build_graph} required no source changes:
the platform code is shared across the three a2a3 variants and was
already gated by `#ifdef RUNTIME_HAS_CALLABLE_ID`, which only
tensormap_and_ringbuffer's runtime.h defines. The non-trb variants
fall through to the existing `#else` stub branch automatically.
Verified on sim only:
- 5 variants compile clean (a2a3sim x3, a5sim x2; a5 has no
aicpu_build_graph).
- UT test_chip_worker.py 15/15.
- a2a3sim ST sample: host_build_graph 4/4, aicpu_build_graph 3/3,
tensormap_and_ringbuffer 4/4 (incl. prepared_callable e2e).
- a5sim ST: host_build_graph 1/1, tensormap_and_ringbuffer 10/10.
…pre-warm

Replace the NEXT_LEVEL raw ChipCallable* pointer path with a unified callable_id (cid) protocol:

C++ core:
- Remove TaskSlotState::callable (uint64_t ptr) field; unify on callable_id
- Orchestrator::submit_next_level now takes int32_t callable_id
- dispatch_thread/dispatch_process write cid into mailbox for both NEXT_LEVEL and SUB worker types

Python runtime:
- Worker.register() accepts ChipCallable in addition to Python fns; returns cid from a single shared id space
- _chip_process_loop / _chip_process_loop_with_bootstrap: accept registry dict, read cid from mailbox, lazy-prepare + run_prepared
- New _CTRL_PREPARE (=4) control command for explicit pre-warming
- _start_hierarchical: after init(), pushes _CTRL_PREPARE to every chip child for each registered ChipCallable (fixes first-run latency spike)
- Orchestrator.submit_next_level raises TypeError on raw ChipCallable (migration guide: use Worker.register + pass cid)

Nanobind:
- _Orchestrator binding: submit_next_level takes int32_t callable_id
- _ChipWorker.run_prepared: add TaskArgs overload (chip child path)

Test infrastructure:
- conftest.py st_worker L3: register ChipCallable entries before init
- scene_test.py _create_standalone_worker: compile + register ChipCallable before init; CallableNamespace exposes cid (int) not ChipCallable
- Migrate 7 L3 examples/demos to register + cid pattern
- C++ UTs: submit_next_level(int32_t, ...) signatures

Verified: C++ UT 17/17, Python UT 70/70 (65+5), a2a3sim L3 ST 3/3, a5sim ST 10/10, prepared_callable L2 e2e 1/1.
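The chip-child mailbox loop this commit describes (cid in the mailbox, lazy prepare on first dispatch, explicit _CTRL_PREPARE pre-warm) can be reduced to a toy state machine. The command strings and structures below are illustrative stand-ins, not the real mailbox encoding:

```python
# Toy model: the mailbox now carries an integer cid, never a raw pointer.
prepared: set[int] = set()      # cids already prepared on this chip child
run_log: list[int] = []         # order in which tasks actually ran

def handle_mailbox(cmd: str, cid: int) -> None:
    if cmd == "CTRL_PREPARE":   # explicit pre-warm pushed after init()
        prepared.add(cid)
        return
    if cmd == "TASK_READY":
        if cid not in prepared: # lazy path: first dispatch prepares on demand
            prepared.add(cid)
        run_log.append(cid)     # run_prepared(cid)
```

Pre-warming every registered cid right after init() means the lazy branch rarely fires in practice, which is how the commit removes the first-run latency spike.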
Promote the Stage 3 cid contract to the L2 entry point so every level of the hierarchy speaks the same dispatch surface.

Worker (level=2):
- register() now also accepts ChipCallable; returns a cid from the unified id space (callable.md §3.4). May be called either before or after init() — L2 has no fork/COW constraint. Pre-init registrations are batched and prepared at the end of init(); post-init registrations prepare on the device immediately.
- run(cid, args, cfg) routes through _chip_worker.run_prepared.
- _l2_use_prepared probe: when the bound runtime variant lacks prepare_callable support (host_build_graph / aicpu_build_graph stub return -1 — see Stage 2), the first prepare attempt flips the flag and every subsequent run() falls back to the legacy _chip_worker.run lower-level binding silently.

Rollback knob:
- PTO2_DISABLE_PREPARED_CALLABLE=1 forces L2 onto the legacy lower-level binding (skips prepare at init, resolves cid back to its ChipCallable at run time). L3+ paths are unaffected — the cid mailbox protocol has no legacy fallback.

scene_test.py:
- _run_and_validate_l2 now register()s the compiled ChipCallable once per class (cached via _st_l2_cid) and calls Worker.run(cid, …).

Verified: Python UT 80/80 (15 chip + 65 worker), a2a3sim L2 host_build_graph 4/4 (auto fallback), aicpu_build_graph 3/3, trb spmd_sync_start (with and without PTO2_DISABLE_PREPARED_CALLABLE=1), prepared_callable e2e 1/1.
Expose a monotonic counter of distinct callable_ids the AICPU has been asked to dlopen for, so tests can assert per-cid registration eliminates redundant dlopens across repeated runs (callable.md §7 verification).

- DeviceRunner (a2a3 onboard + sim): track aicpu_dlopen_total_, bumped on first-sighting bind; not decremented by unregister so case D (unregister + re-prepare) reports +2
- C ABI: get_aicpu_dlopen_count exported by all 4 a2a3/a5 variants; a5 + non-trb a2a3 return 0 (no per-cid registration there)
- ChipWorker / nanobind / Python wrappers: aicpu_dlopen_count property on _ChipWorker, ChipWorker, and Worker (L2-only; non-L2 returns 0)
- tests/st prepared_callable: 4 new test methods asserting counter delta for same-cid repeat (1), two-cid interleaving (2), double prepare (RuntimeError), and unregister + re-prepare (2). Each test snapshots baseline on entry and unregisters on exit so the shared st_worker fixture stays clean between cases.
- Apply clang-format on src/a5/platform/sim/host/pto_runtime_c_api.cpp and src/common/worker/chip_worker.h (pre-commit fix).
- register_prepared_callable: enforce callable_id in [0, 64) in both a2a3 onboard and sim DeviceRunner so an out-of-range id fails fast on host instead of OOB-indexing the AICPU orch_so_table_ later.
- aicpu_executor: reject negative callable_id values other than the legacy -1 sentinel (mirrors the upper-bound guard).
- tests/st/explicit_fatal: migrate to Stage 4 register + run(cid) API so the negative ST works under the unified run(cid) entry point.
Previously the upper bound was hard-coded as `64` in three independent places (the a2a3 onboard/sim DeviceRunner host bounds checks and the AICPU executor's `orch_so_table_[]` declaration), with two different spellings (`kMaxCallableId` vs `MAX_REGISTERED_CALLABLE_IDS`). They are the same protocol constant — diverging would silently break the host↔AICPU contract.

Move the constant into a new `src/common/task_interface/callable_protocol.h` header (cstdint-only so the AICPU side can include it without dragging in `<vector>`/`<stdexcept>` from `callable.h`) and have all three call sites reference it.
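The single-source-of-truth idea is simple enough to show directly. A hedged Python analogue of the shared header (in the real change the constant lives in a C++ header; the names below follow the commit but the code is illustrative):

```python
# One protocol constant, referenced by both sides of the contract.
MAX_REGISTERED_CALLABLE_IDS = 64

def host_in_range(cid: int) -> bool:
    """Host-side bounds check, derived from the shared constant."""
    return 0 <= cid < MAX_REGISTERED_CALLABLE_IDS

# Device-side table sized from the same constant, so the host check and
# the table capacity cannot drift apart in separate edits.
orch_so_table = [None] * MAX_REGISTERED_CALLABLE_IDS
```

If either side hard-coded its own 64, a future bump to one of them would pass every local review yet corrupt the other side's indexing, which is exactly the drift the shared header prevents.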
…t lacks prepare_callable

a5/onboard's pto_runtime_c_api stubs `prepare_callable`/`run_prepared` to -1 (Stage 1 ABI port deferred the implementation), which hard-broke every L3+ test on a5/onboard once Stage 3 made the chip_process_loop go through `prepare_callable` + `run_prepared` unconditionally.

Detect the stub at the very first prepare attempt: if the call raises RuntimeError, set `prepared_unsupported` and route every subsequent TASK_READY through the legacy `cw.run(callable_obj, args, cfg)` path (callable_obj resolved from the COW-inherited registry by cid). This keeps the L3+ mailbox protocol cid-only as designed while letting variants that have not yet picked up per-cid orch SO dispatch keep working in the meantime. Once all variants implement the prepared path, the fallback shim and the legacy ChipWorker.run binding can go.

Mirror the same fallback in `_chip_process_loop_with_bootstrap` (distributed/HCCL chips).
The onboard `create_orch_so_file` named the staged SO `libdevice_orch_<pid>.so` based on the assumption that "only one runtime runs per device process, so pid uniqueness is sufficient" (in 7e071c1 / before stage 4). Stage 4 broke that assumption: per-callable_id dispatch keeps multiple orch SO images resident in the same AICPU process at once, one per cid in `orch_so_table_[]`.

The reload branch first creates `orch_so_table_[cid].handle` without unlinking any pre-existing on-disk file (the unlink only fires when *that same slot's* handle is non-null), so the second cid's `open(..., O_TRUNC)` silently truncated and rewrote cid=0's file image. The kernel still mapped the old inode for cid=0's dlopen'd code; the next launch on cid=0 jumped into bytes that now belonged to cid=1 and SIGBUS'd inside AICPU. The host saw it as `rtStreamSynchronize (AICPU) failed: 507018`.

Repro: examples/workers/l3/ffn_tp_parallel — two cids (ffn_local + allreduce) on a2a3/onboard. multi_chip_dispatch passed because it only register()'d a single ChipCallable.

Fix:
- create_orch_so_file gains a callable_id parameter. Onboard variants embed it in the file name (`libdevice_orch_<pid>_<cid>.so`) when cid >= 0; the legacy single-slot path (cid == -1) keeps pid-only naming so variants that never adopt per-cid dispatch see no change.
- Sim variants embed cid for log readability only — mkstemps already guarantees uniqueness — keeping the contract symmetrical across all four implementations.
- aicpu_executor.cpp at both a2a3 and a5 forwards the active cid (a5 passes -1 since it has no callable_id concept yet).

Regression test: tests/ut/cpp/common/test_orch_so_file.cpp asserts that distinct cids produce distinct paths and the legacy sentinel preserves pid-only naming. Compiles the a2a3 onboard implementation directly so the ut catches the bug on no-hw runners too.
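The naming fix reduces to one branch, shown here as a hedged Python sketch of the rule (the file-name patterns and the -1 sentinel come from the commit; the function signature is illustrative, and the real code also stages the file into a directory):

```python
def orch_so_filename(pid: int, cid: int) -> str:
    """Per-cid staged orch SO name; -1 keeps the legacy pid-only name."""
    if cid >= 0:
        # Two resident images (cid=0 and cid=1) now get distinct paths,
        # so one can no longer O_TRUNC the other's mapped inode.
        return f"libdevice_orch_{pid}_{cid}.so"
    return f"libdevice_orch_{pid}.so"    # legacy single-slot path, unchanged
```

Keeping the cid == -1 spelling byte-identical to the old name is what guarantees variants that never adopt per-cid dispatch observe no behavioural change at all.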
- python/bindings: add TaskArgs overload for ChipWorker.run() so chip child loops on variants without prepare_callable can dispatch via the legacy TaskArgs path (fixes a5 multi_chip_dispatch failures).
- a2a3 sim/onboard device_runner: in upload_kernel_binary, hash the incoming bytes and re-upload when a cached func_id entry holds a different binary. Stage 4 wires multiple ChipCallables onto the same ChipWorker (and DeviceRunner) via prepare_callable, so different callables register distinct kernels under overlapping func_ids; the prior unconditional cache hit handed the AICore the previous callable's kernel and segfaulted (sim) or hung the AICPU dispatch spin-wait (onboard) on the next run.
- a2a3 sim device_runner: initialize Worker.l2_perf_records_addr in the per-core init loop (matches onboard); uninitialized garbage was being treated as a valid pointer when the L2 swimlane bit happened to be set in enable_profiling_flag, causing AICore segfaults.
- a2a3 onboard host_regs: restore placeholder-address fallback for AicoreRegKind::Ctrl on halMemCtl failure (the dispatch path does not dereference these); Pmu kind continues to propagate failure so the caller can disable PMU collection cleanly.
- a2a3 runtime aicpu_executor: replace stray DEV_ERROR (undefined in this branch's logging surface) with LOG_ERROR, and drop the spurious leading 0 argument on a LOG_INFO_V0 call (V0 is the verbosity-0 form, not LOG_INFO_V).
- a2a3 l2_perf_collector.h: drop unused #include "runtime.h" so clang-tidy can lint the header without per-runtime include paths.
Add `active_callable_id_` and `register_new_callable_id_` fields plus
their setter/getter to the three runtime variants that lack them
(a2a3/host_build_graph, a5/tensormap_and_ringbuffer,
a5/host_build_graph). After this commit every runtime variant exposes
the same per-callable_id state shape that a2a3/tensormap_and_ringbuffer
already has — Phase 1+ wire AICPU and platform layers to read it.
Also gate a5/tensormap_and_ringbuffer with `#define
RUNTIME_HAS_CALLABLE_ID 1` so the shared a5 platform layer recognises
the protocol when compiled against this runtime; the macro is removed
once every variant implements the prepare/run_prepared path.
Behaviour is unchanged: the new fields are written but no caller reads
them yet. All four sim variants
(a2a3sim/{trb,hbg}, a5sim/{trb,hbg}) compile cleanly.
Mirrors the a2a3/tensormap_and_ringbuffer prepared_callable implementation onto a5: AICPU executor gains a per-cid orch_so_table_, host device runner gains register/unregister/has/bind methods + a hash-keyed orch SO buffer dedup, and runtime_maker.cpp is split into prepare_callable_impl + bind_prepared_to_runtime_impl with init_runtime_impl as a shim. The a5 platform layer (onboard + sim) is shared between trb and hbg, so callable-specific implementations are guarded by RUNTIME_HAS_CALLABLE_ID to keep hbg compiling until Phase 2 lands its prepare/bind impls.
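The per-cid dispatch table at the heart of this design can be modelled in a few lines (a toy sketch; the slot cap and error behaviour follow the PR text, while the class and method names are illustrative, not the AICPU executor's actual API):

```python
MAX_REGISTERED_CALLABLE_IDS = 64  # cap described in the PR summary


class OrchSoTable:
    """Toy model of the AICPU per-cid orchestration table: one slot
    per callable_id, registered once by the host, dispatched by id."""

    def __init__(self):
        self.slots = [None] * MAX_REGISTERED_CALLABLE_IDS

    def register(self, cid: int, so_image: bytes) -> None:
        # mirror the host-side fail-fast bounds check
        if not 0 <= cid < MAX_REGISTERED_CALLABLE_IDS:
            raise ValueError(f"callable_id {cid} out of range [0, 64)")
        if self.slots[cid] is not None:
            raise RuntimeError(f"cid {cid} already registered")
        self.slots[cid] = so_image

    def dispatch(self, cid: int) -> bytes:
        # run(cid) indexes the table unconditionally; an empty or
        # out-of-range slot is an error, not a fallback
        if not 0 <= cid < MAX_REGISTERED_CALLABLE_IDS or self.slots[cid] is None:
            raise ValueError(f"no callable registered for cid {cid}")
        return self.slots[cid]
```

Registering once and dispatching by integer id is what removes the per-launch dlclose+dlopen cost from the hot path.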
End-to-end coverage for prepare_callable / run_prepared / unregister_callable on a5/tensormap_and_ringbuffer, structurally identical to the a2a3 test: shared-orch double-cid run, same-cid repeat dlopen accounting, two-cid interleaved dlopen accounting, double-prepare rejection, and unregister + re-prepare counter monotonicity. Reuses the orch_so_cache single-task orchestration and mixed_example kernel_add_standalone so the test stays focused on the prepare/run ABI.
…cached host dlopen
- 4 hbg runtime.h (a2a3+a5): add RUNTIME_HAS_CALLABLE_ID + RUNTIME_HOST_ORCH
defines and pending_host_dlopen_handle_/pending_host_orch_func_ptr_ host
staging fields.
- 4 runtimes (trb+hbg): add replay_function_bin_addr(func_id, addr) — does
not record into registered_kernel_func_ids_, lets platform replay prepared
kernel bindings without triggering validate-time release. Unifies
func_id_to_addr_ access via member function.
- 2 hbg runtime_maker.cpp: split init_runtime_impl into prepare_callable_impl
(dlopen+dlsym → staging fields) and bind_prepared_to_runtime_impl (read
fn_ptr, call orch_func, build graph). Legacy init_runtime_impl is now a
shim (dlclose at end).
- 4 platform device_runner.{h,cpp} (a2a3/a5 × onboard/sim):
PreparedCallableState extended with host_dlopen_handle/host_orch_func_ptr;
new register_prepared_callable_host_orch + host_dlopen_count +
host_dlopen_total_; unregister_prepared_callable branches on
host_dlopen_handle (hbg → dlclose, trb → orch_so_dedup_ refcount);
bind_prepared_callable_to_runtime uses replay_function_bin_addr; host orch
fields restored under #ifdef RUNTIME_HOST_ORCH; prepare_orch_so early-
returns for hbg (zeroes dev_orch_so to skip AICPU counting).
- 4 pto_runtime_c_api.cpp: prepare_callable uses std::unique_ptr<Runtime>
(hbg Runtime holds 131072 Tasks ≈ tens of MB, too large for stack);
routes to register_prepared_callable_host_orch under #ifdef
RUNTIME_HOST_ORCH; exports get_host_dlopen_count.
- chip_worker.{h,cpp}: add host_dlopen_count() getter and dlsym binding.
- bindings/task_interface.cpp + python/simpler/{task_interface,worker}.py:
expose host_dlopen_count attribute.
Verified: 4 sim binaries compile, 4 variants × 5 prepared_callable ST tests
pass (20 total), tests/ut/py/test_chip_worker.py 15 pass, a2a3/hbg
vector_example regression passes.
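The prepare/bind split above can be sketched with Python's ctypes, where libm's `cos` stands in for the dlopened orchestration symbol (every name in this snippet is illustrative; the real split lives in runtime_maker.cpp and stages into the pending_host_* fields):

```python
import ctypes
import ctypes.util


def prepare_callable(lib_path: str, sym: str):
    """The 'prepare' half: dlopen the SO and dlsym the entry point
    into host-side staging (handle, fn_ptr)."""
    handle = ctypes.CDLL(lib_path)  # dlopen -> staged handle
    fn = getattr(handle, sym)       # dlsym  -> staged fn ptr
    fn.restype = ctypes.c_double
    fn.argtypes = [ctypes.c_double]
    return handle, fn


def bind_prepared_to_runtime(fn):
    """The 'bind' half: read the staged fn ptr and invoke it."""
    return fn(0.0)


# libm's cos is a convenient always-present stand-in for an orch fn
handle, fn = prepare_callable(ctypes.util.find_library("m"), "cos")
```

Keeping dlopen/dlsym in `prepare` and the invocation in `bind` is what lets a prepared callable be registered once and re-run many times without reloading.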
…ants
Mirror the trb prepared_callable ST suite to host_build_graph:
- tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
reuses a2a3 vector_example kernel for the 5 prepared_callable scenarios
(single-cid prepare→run, multi-cid alternation, repeated run, unregister,
host_dlopen_count assertions).
- tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
with self-contained dump_tensor-style kernels under kernels/{aiv,
orchestration}/.
Both assert host_dlopen_count == distinct_registered_cids and
aicpu_dlopen_count == 0 (hbg path does not trigger AICPU dlopen).
Verified: 5 tests pass on each variant under sim.
…E_HOST_ORCH macros
All four runtime variants (a2a3/{trb,hbg}, a5/{trb,hbg}) now implement
prepare_callable / run_prepared / unregister_callable end-to-end, so the
build-time guards that picked between the real implementation and stubs
or between trb/hbg staging fields are no longer load-bearing.
Unify the public Runtime API across variants so the platform layer can
branch at runtime instead:
- trb runtime.h (a2a3+a5): add pending_host_dlopen_handle_ /
pending_host_orch_func_ptr_ host-only fields (always nullptr on trb).
- hbg runtime.h (a2a3+a5): add device_orch_func_name_ /
device_orch_config_name_ + set/get accessors (always empty on hbg).
- 4 device_runner.cpp: bind_prepared_callable_to_runtime now writes
both host_dlopen and device_orch_func_name fields unconditionally;
whichever set was populated by the corresponding register_*
overload wins, the other stays at its default.
- 4 pto_runtime_c_api.cpp: prepare_callable picks the trb vs hbg path
by inspecting r->pending_host_dlopen_handle_ at runtime instead of
via #ifdef RUNTIME_HOST_ORCH.
Mechanical removals:
- 4 runtime.h: drop #define RUNTIME_HAS_CALLABLE_ID and (where present)
RUNTIME_HOST_ORCH.
- 8 platform files (.h/.cpp): unwrap every #ifdef RUNTIME_HAS_CALLABLE_ID
and RUNTIME_HOST_ORCH block, keeping the real implementation; delete
the dlsym-stub #else branches in the c_api files (no variant needs
them now).
Verified: 4 sim binaries compile, 4×5 prepared_callable ST tests pass
(20 total), tests/ut/py/test_chip_worker.py 15 pass.
Now that all four runtime variants implement prepare_callable / run_prepared end-to-end, Worker no longer needs a fallback to the legacy chip_worker.run(callable, args, cfg) lower-level binding when the runtime returned -1 from the C ABI stub.
Worker.py removals:
- _PREPARED_CALLABLE_DISABLED_ENV / _prepared_callable_disabled() and the PTO2_DISABLE_PREPARED_CALLABLE env-var rollback knob.
- _l2_use_prepared field, _l2_prepare() method, and the conditional prepare-then-fallback dance in register() / _init_level2() / run().
- prepared_unsupported flag and _run_legacy() in both _chip_process_loop and _chip_process_loop_with_bootstrap. Both helpers now have a simpler _ensure_prepared() that always prepares-or-raises.
Worker.run(L2) and the chip_process loops now always go through run_prepared. A registered ChipCallable that fails to prepare now surfaces the underlying RuntimeError instead of silently rerouting.
Verified: tests/ut/py/test_chip_worker.py 15 pass, tests/ut/py/test_worker/ 65 pass + 3 hardware skipped, hbg prepared_callable ST 5×2 pass, a2a3/trb vector_example regression passes.
…gacy ABI
Now that all four variants implement prepare_callable / run_prepared and
the Python fallback to the legacy callable-buffer path is gone, the
single-call C ABI it relied on is dead weight. ChipWorker::run becomes a
thin forwarder to run_prepared so the hierarchical IWorker contract is
preserved; the cid still arrives via worker_manager packing s.callable_id
into uint64.
C++ removals:
- 4 platform pto_runtime_c_api.cpp: drop run_runtime() definitions and the
init_runtime_impl forward decls.
- 4 runtime_maker.cpp: drop the init_runtime_impl compatibility shim that
bundled prepare_callable_impl + bind_prepared_to_runtime_impl.
- src/common/worker/pto_runtime_c_api.h: drop run_runtime declaration and
refresh the file-header dlsym list / call-site references.
- src/common/worker/chip_worker.{h,cpp}:
* IWorker::run(uint64_t, ...) now reinterprets the uint64 as cid and
delegates to run_prepared.
* Drop ChipWorker::run(const void*, const void*, ...) overload, the
RunRuntimeFn typedef, and run_runtime_fn_ dlsym.
Python removals:
- python/bindings/task_interface.cpp: remove the four legacy nanobind
overloads (run / run / run_raw / run_from_blob); keep run_prepared /
prepare_callable / unregister_callable.
- python/simpler/task_interface.py: drop ChipWorker.run wrapper; usage
doc updated to the prepare_callable + run_prepared idiom.
- tests/ut/py/test_chip_worker.py: drop test_run_before_set_device_raises
(test_run_prepared_before_set_device_raises already covers the same
state-machine guard).
Verified: 4 sim binaries compile, nanobind wheel rebuilds,
tests/ut/py/test_chip_worker.py 14 pass + tests/ut/py/test_worker/ 65
pass + 3 hardware skipped, 4 variants × 5 prepared_callable ST = 20 pass,
a2a3/trb vector_example + orch_so_cache regression pass.
…single slot
The single-slot orch SO cache and the callable_id==-1 fallback path
existed only to serve the now-deleted run_runtime() ABI. With every
caller routed through prepare_callable / run_prepared, callable_id is
always in [0, MAX_REGISTERED_CALLABLE_IDS) and AICPU dispatches via
orch_so_table_[callable_id] unconditionally.
Runtime structure:
- 4 runtime.h (a2a3+a5 × trb+hbg): drop has_new_orch_so_ field; simplify
set_dev_orch_so to (dev_addr, size).
- 2 trb shared/runtime.cpp: drop has_new_orch_so() implementation; drop
the dirty-flag init in reset.
- 4 platform device_runner.{h,cpp}: drop the third arg from every
set_dev_orch_so call (5 sites per platform); update doc-comments that
referenced has_new_orch_so_.
AICPU executor (2 trb aicpu_executor.cpp):
- Drop legacy single-slot fields (orch_so_handle_, orch_so_path_,
orch_func_, orch_bind_runtime_, orch_config_func_) along with the
destructor branch and deinit comment that preserved them.
- Replace the use_table-ternary fork with unconditional access into
orch_so_table_[callable_id]; reload is governed by
register_new_callable_id().
- Reject any callable_id outside [0, MAX_REGISTERED_CALLABLE_IDS) (the
-1 escape hatch is gone).
- The run() teardown branch that called orch_bind_runtime_(nullptr) now
reads the per-cid bind from the table.
Verified: 4 sim binaries compile, tests/ut/py/test_chip_worker.py 14
pass + tests/ut/py/test_worker/ 65 pass + 3 hardware skipped, 4 variants
× 5 prepared_callable ST = 20 pass.
… API
- vector_add: register chip_callable before init(), pass cid to worker.run
- child_memory: register before init(), pass cid to orch.submit_next_level
- Update vector_add README and docstring diagram to match the new flow
Resolves CI failures in st-sim-a2a3 (ubuntu/macos) on PR hw-native-sys#710.
Function grew to 104 statements (limit 100) after the callable refactor. The function is structured as a single dispatch loop over the bootstrap + control-mailbox protocol — splitting it would obscure the state machine, so add PLR0915 to the existing PLR0912 noqa. Resolves the pre-commit CI failure on PR hw-native-sys#710.
…n dedup
Root cause of CI a5 sim trb failures: tests/st/a5/.../prepared_callable used the vector_example orchestration (which dispatches func_ids 0/1/2) but only registered func_id=0. AICPU jumped to a NULL kernel address on func_id 1/2 and segfaulted, cascading through the pytest-xdist workers and dragging spmd_*/orch_so_cache/mixed_example down with it.
Test fix: align tests/st/a5/.../prepared_callable verbatim with the a2a3 sibling — register all three vector_example AIV kernels (add/add_scalar/mul), update the golden formula to match the orchestration's 5-task DAG.
Runtime parity (defensive — not exercised by current a5 CI but matches the 0715661 fix on a2a3 onboard, so future cross-callable func_id reuse on a5 does not regress):
- src/a5/platform/onboard: add func_id_to_hash_ map, reject cached entry on hash mismatch, evict + re-upload on changed binary. finalize() and remove_kernel_binary() clear the parallel map.
- src/a5/platform/sim: compare cached CoreCallable bytes via memcmp on each upload (mirrors a2a3 sim — no separate hash map needed because the MappedKernel cache already retains the original bytes).
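The hash-keyed upload cache can be sketched as follows (SHA-256 stands in for whatever digest the implementation uses; class and field names are illustrative, not the DeviceRunner's actual members):

```python
import hashlib


class KernelCache:
    """Sketch of the func_id upload cache after the fix: a cached entry
    is only a hit when the bytes match; a changed binary under the same
    func_id evicts the stale hash and re-uploads."""

    def __init__(self):
        self.func_id_to_hash = {}
        self.uploads = 0  # stands in for the actual device upload

    def upload(self, func_id: int, binary: bytes) -> None:
        digest = hashlib.sha256(binary).hexdigest()
        if self.func_id_to_hash.get(func_id) == digest:
            return  # genuine cache hit: identical bytes, skip re-upload
        # miss or stale entry: record the new hash and (re-)upload
        self.func_id_to_hash[func_id] = digest
        self.uploads += 1
```

The pre-fix behaviour was equivalent to hitting on `func_id in func_id_to_hash` alone, which handed the AICore the previous callable's kernel when two callables reused a func_id.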
…tion
Stage 3 (5796321) introduced `_read_args_from_mailbox` to rebuild a ChipStorageTaskArgs Python object from the mailbox blob in chip-child processes (replacing the legacy raw-bytes `run_from_blob` path). The unpacker read data/shapes/ndims/dtype but skipped the child_memory uint8 at offset 33, so every chip-child-side tensor came back with child_memory=False (the make() default).
For tensors that carry a chip-owned device pointer — HCCL window slots in allreduce_distributed, deferred_notify_demo, ffn_tp_parallel — the bind_prepared_to_runtime_impl host path then treats the device address as a host pointer, allocates a fresh device buffer, and H2D copies from the (device) source: AICPU dispatches a task whose tensors point at uninitialised allocations, so the task lands in ready_queue with a kernel mask that scheduler/dispatch never advance, surfacing as the "PTO2 timeout after 800001 idle iterations" hang we saw on a2a3 onboard.
multi_chip_dispatch passes because all of its tensors are host pointers (child_memory=False), so the missing byte happens to round-trip correctly. This is also why main is unaffected: there `run_from_blob` hands the mailbox bytes straight to C++ via reinterpret_cast on the 40B ContinuousTensor layout, which naturally preserves byte 33.
Read offset 33 explicitly and pass it through ContinuousTensor.make. Layout matches src/common/task_interface/tensor_arg.h (40B with child_memory at byte 33).
Verified on a2a3 onboard (devices 9,10):
- examples/workers/l3/allreduce_distributed: PASS (was hang)
- examples/a2a3/.../deferred_notify_demo: PASS (was hang)
- examples/workers/l3/multi_chip_dispatch: PASS (no regression)
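The dropped field can be illustrated with a minimal parser over the 40-byte record layout the commit describes (only the child_memory byte is modelled here; the other field offsets are not reproduced):

```python
import struct

TENSOR_SIZE = 40          # sizeof(ContinuousTensor) per the commit text
CHILD_MEMORY_OFFSET = 33  # the uint8 the original Python parser skipped


def read_child_memory(blob: bytes, tensor_index: int) -> bool:
    """Read the child_memory flag of the tensor_index-th 40-byte
    record in a mailbox blob."""
    base = tensor_index * TENSOR_SIZE
    (flag,) = struct.unpack_from("<B", blob, base + CHILD_MEMORY_OFFSET)
    return bool(flag)
```

Dropping this one byte is invisible for host-pointer tensors (the default False round-trips correctly) and fatal for chip-owned device pointers, which is exactly the asymmetry that let multi_chip_dispatch pass while allreduce_distributed hung.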
…rgs parsing in C++
Stage 3 (5796321) made chip-child loops re-deserialise the mailbox ChipStorageTaskArgs blob in Python via _read_args_from_mailbox before forwarding to cw.run_prepared. The hand-written Python parser dropped ContinuousTensor.child_memory at offset 33, which silently broke every tensor carrying a chip-owned device pointer (HCCL window slots in allreduce_distributed / deferred_notify_demo / ffn_tp_parallel) on a2a3 onboard — the runtime treated the device address as a host pointer, the submitted task stuck in ready_queue with kernel_id=-1 / state=0 forever, surfacing as 'PTO2 timeout after 800001 idle iterations' on st-onboard-a2a3. Root cause was duplicating the on-wire ContinuousTensor layout in Python.
Fix: keep the layout single-sourced in C++ and stop redoing it in Python.
- Add _ChipWorker.run_prepared_from_blob(cid, ptr, capacity, config) nanobind overload. Internally calls read_blob (already used by every C++ caller) for a zero-copy TaskArgsView, then forwards to the existing run_prepared(view, ...) path. No new C-ABI symbol — just a Python-side overload over an existing C++ entry point.
- chip-child mailbox loops (_chip_process_loop and _chip_process_loop_with_bootstrap) drop the args = _read_args_from_mailbox(buf) round-trip and call run_prepared_from_blob with the mailbox address directly. The args object was never inspected in Python, so the typed-object detour bought nothing and only added a place to lose fields.
- _read_args_from_mailbox is kept (still used by _sub_worker_loop and _child_worker_loop, where the destination is a Python callable) but its body collapses to a one-line delegation to the existing nanobind read_args_from_blob helper. The hand-rolled struct.unpack_from layout (which had to know sizeof(ContinuousTensor)==40 and per-field offsets) is gone.
Net effect on the chip-child hot path: one Python->C++ call instead of N+1 (per-tensor make() + add_tensor() + a final run_prepared()), and no intermediate Python TaskArgs / ContinuousTensor object construction. There is now exactly one place that knows the on-wire layout (src/common/task_interface via read_blob), so adding a field to ContinuousTensor cannot drop it on the chip-child path again.
Verified on a2a3 onboard (devices 9,10) and a2a3sim:
- examples/workers/l3/allreduce_distributed: PASS (was hang)
- examples/a2a3/.../deferred_notify_demo: PASS (was hang)
- examples/workers/l3/multi_chip_dispatch: PASS (no regression)
- examples/workers/l3/child_memory [a2a3sim]: PASS
- tests/ut/py/test_chip_worker: 14/14 pass
- hbg DeviceRunner::finalize() now dlcloses any host orch handles callers forgot to unregister; the host process previously leaked one dlopen handle per re-created Worker (visible in long-running pytest).
- AICPU executor unlinks the on-disk libdevice_orch_<pid>_<cid>.so immediately after dlopen, so chip/sub/next-level children that exit via os._exit(0) no longer leave stale .so files in /tmp.
- ChipWorker docstring usage example now uses real keyword names (callable_id=, callable=, args=, config=) so the snippet parses as valid Python.
- Drop "callable.md" / "Stage N (callable.md)" pointers from comments and docstrings; keep the semantic content but remove references to the un-archived design doc, per .claude/rules/codestyle.md item 1.
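The unlink-immediately-after-dlopen trick relies on standard POSIX behaviour: removing the directory entry does not destroy the inode while a handle to it stays open. A file-descriptor analogue of the same pattern:

```python
import os
import tempfile

# Open a temp file (stand-in for the staged orch SO), write to it,
# then unlink the path right away -- as the AICPU executor now does
# after dlopen.
fd, path = tempfile.mkstemp(suffix=".so")
os.write(fd, b"payload")
os.unlink(path)                 # directory entry gone: no stale file
                                # left behind even on os._exit(0)
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 7)           # the open fd still reads the inode
os.close(fd)                    # last reference: inode is reclaimed
```

The dlopen handle plays the role of the fd here: the mapped SO stays usable for the process lifetime, and the filesystem is clean no matter how the child exits.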
Address four review findings on the callable_id refactor:
- scene_test.py: L2 _create_standalone_worker returns (worker, {}, {})
to match the 3-tuple unpacking used by the L3 path; standalone L2
runners no longer fail with ValueError.
- sdma_async_completion_demo: register the ChipCallable before init()
and submit_next_level(chip_cid, ...). raw ChipCallable is rejected
by both register-after-init guards and Orchestrator._require_cid.
- prepared_callable ST: each of the 4 test classes now owns an isolated
L2 Worker via a directory-local conftest.py override so the cid table
is empty on entry; cid 0/1 are renamed _CID_PRIMARY/_CID_SECONDARY
to make the white-box intent explicit and a stale comment claiming
unregister decrements the dlopen counter is removed.
- Docs: worker.py module docstring, docs/getting-started.md, and the
L2/L3 example READMEs all show the full register -> cid -> run /
submit_next_level pattern, including the must-register-before-init()
rule for L>=3.
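The register -> cid -> run contract, including the must-register-before-init() rule for L>=3, can be modelled in a few lines (a toy sketch; the real Worker API surface is the PR's, not this class):

```python
class WorkerModel:
    """Toy model of the unified cid API: register() hands out integer
    ids, run() dispatches by id, and L>=3 forbids registration after
    init() so forked children inherit a complete registry via COW."""

    def __init__(self, level: int):
        self.level = level
        self.initialized = False
        self._callables = []

    def register(self, target) -> int:
        if self.initialized and self.level >= 3:
            raise RuntimeError("L>=3 forbids registration after init()")
        self._callables.append(target)
        return len(self._callables) - 1  # the cid

    def init(self) -> None:
        self.initialized = True  # L>=3: registry frozen from here on

    def run(self, cid: int, *args):
        return self._callables[cid](*args)
```

An L2 worker keeps the post-init escape hatch (it pre-warms on the spot), which is why the guard is conditioned on the level.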
Summary
Introduces the prepared callable dispatch path and unifies the L2 / L3+ API on `register()` + `run(cid)` / `submit_*(cid)`. Replaces per-launch `dlclose` + `dlopen` of the orch SO on the AICPU with a one-time-per-cid upload, then removes the legacy `run_runtime` ABI altogether.
- `src/common/task_interface/callable_protocol.h`: AICPU keeps a fixed `orch_so_table_[MAX_REGISTERED_CALLABLE_IDS]` (cap 64); host registers each callable once, AICPU dispatches by `callable_id`.
- `prepare_callable` / `run_prepared` / `unregister_callable` on every variant (a2a3 + a5, both `host_build_graph` and `tensormap_and_ringbuffer`); drops `run_runtime` / `init_runtime_impl` and the `RUNTIME_HAS_CALLABLE_ID` / `RUNTIME_HOST_ORCH` compile-time macros.
- Host side: `prepared_callables_` keyed by cid, an `orch_so_dedup_` table that refcounts identical SO bytes by Build-ID, and `aicpu_seen_callable_ids_` so the AICPU is registered once per cid.
- `Worker.register(target) -> cid` is the single entrypoint for sub-fn / orch-fn / `ChipCallable` at every level; `Worker.run`, `orch.submit_next_level`, `orch.submit_sub` now take `cid`. L3+ forbids post-`init()` registration so forked chip / sub children inherit the registry via COW; L2 still allows post-init register and pre-warms on the spot.
- `_chip_process_loop` consolidates args parsing in C++ and walks the raw blob path.
- Examples (`vector_add`, `child_memory`, `ffn_tp_parallel`, `multi_chip_dispatch`, `allreduce_distributed`, async-notify demos) migrated to the cid API. Getting-started doc updated.
- `prepared_callable` ST suite under all four `{a2a3, a5} × {host_build_graph, tensormap_and_ringbuffer}` variants, plus `tests/ut/cpp/common/test_orch_so_file.cpp` and an `aicpu_dlopen_count` getter to assert the one-load-per-cid invariant.
Backwards-compatibility shims and dual paths are removed in the same PR (Phases 3–4 commits), so there is no `--legacy` flag to maintain.
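The host-side dedup described above can be sketched as follows (a content hash stands in for the ELF Build-ID key; the class and field names are illustrative, not the DeviceRunner's actual members):

```python
import hashlib


class OrchSoDedup:
    """Sketch of the refcounted SO dedup table: cids registering
    identical SO bytes share a single staged buffer; the buffer is
    released when the last cid unregisters."""

    def __init__(self):
        self.table = {}  # key -> [refcount, so_bytes]

    def acquire(self, so_bytes: bytes) -> str:
        key = hashlib.sha256(so_bytes).hexdigest()
        entry = self.table.setdefault(key, [0, so_bytes])
        entry[0] += 1  # one more cid references this image
        return key

    def release(self, key: str) -> None:
        entry = self.table[key]
        entry[0] -= 1
        if entry[0] == 0:
            del self.table[key]  # last cid gone: drop the staged buffer
```

This is what keeps a shared-orch double-cid run at one uploaded SO image rather than two.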