Refactor: collapse ChipWorker init/set_device → init(device_id, bins)#723
Merged
ChaoWao merged 1 commit intoMay 11, 2026
Merged
Conversation
efc0058 to
5528c02
Compare
There was a problem hiding this comment.
Code Review
This pull request refactors the ChipWorker API by integrating NPU device attachment into the init method and removing the explicit set_device and reset_device calls. Device operations such as malloc, free, and copy now internally handle thread attachment to ensure safety across different calling threads. Documentation, examples, and tests have been updated to reflect these changes. Feedback identifies potential data races in the initialization and attachment logic, along with a resource leak in the free_tensor error path.
00889af to
f48e496
Compare
ChipWorker exposed init / set_device / reset_device / finalize as four public lifecycle methods even though set_device existed only to satisfy CANN's per-thread aclrtSetDevice attach requirement (PR hw-native-sys#715), and reset_device had zero call sites. The split was the source of one bug (hw-native-sys#715) and forced every caller through a redundant ceremony. Collapse into a single attach point: - ChipWorker::init now takes device_id and attaches the calling thread internally via simpler_init; ChipWorker::set_device and ::reset_device are deleted, and device_set_ / device_set() collapse into initialized_. - C-API set_device export is removed; simpler_init grows a device_id parameter and returns int. The per-thread attach is centralized inside DeviceRunner::attach_current_thread (sim DeviceRunner gains the same method, driving pto_cpu_sim_bind_device + idempotent acquire_device). All ChipWorker callers run on the same thread that called init, so device_id_ stays a plain int — no atomic / no CAS needed. - Python `ChipWorker.init` wrapper takes (device_id, bins) — device_id first since it's the simpler, more fundamental "where" parameter. bootstrap_context's device_id arg becomes a defensive consistency check rather than the attach point. - Callers updated: python/simpler/worker.py, simpler_setup/scene_test.py, tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_platform_comm.py and test_bootstrap_context_{sim,hw}.py. - Docs (chip-level-arch, dynamic-linking, getting-started) and the worker_malloc / hello_worker examples reflect the new flow. Drive-by: ProfilerBase's constructor/destructor are now private with `friend Derived` (bugprone-crtp-constructor-accessibility); newer clang-tidy versions on macOS would otherwise reject any TU that transitively includes profiler_base.h.
f48e496 to
9939bd3
Compare
6 tasks
ChaoWao
added a commit
to ChaoWao/simpler-fork
that referenced
this pull request
May 11, 2026
The C-ABI run_runtime() carried three parameters that are not actually
per-run state:
- aicpu_binary / aicore_binary: executor binaries pinned to the
(platform, runtime) tuple, read off disk once during ChipWorker::init
and never changed during the worker's lifetime.
- device_id: already bound to the DeviceRunner via attach_current_thread
(called from simpler_init at init time, see hw-native-sys#715 / hw-native-sys#723).
Every run_runtime() call paid for these by copying the AICPU + AICore byte
arrays into fresh std::vector<uint8_t>'s inside the dlsym'd SO and
forwarding device_id through prepare_run_context just to re-attach the
same thread the runner already knows about.
Layer A: collapse these three into init-time setup.
- pto_runtime_c_api.h: declare `int bind_executors(ctx, aicpu_ptr,
aicpu_size, aicore_ptr, aicore_size)`; shrink run_runtime() to drop
device_id and the four executor params. The file-header dlsym list and
run_runtime doxygen both note that executor / device are init-time-only.
- 4 platform pto_runtime_c_api.cpp (a2a3 + a5, onboard + sim): implement
bind_executors as a thin wrapper that constructs vectors from the byte
ranges and moves them into DeviceRunner::set_executors. run_runtime no
longer constructs per-call vectors, no longer passes device_id; it reads
runner->device_id() and calls runner->run(rt, block_dim, aicpu_thread_num).
- 4 DeviceRunner.{h,cpp}: add `set_executors(vector<uint8_t> aicpu,
vector<uint8_t> aicore)` (by-value + move), drop `device_id` /
`aicpu_so_binary` / `aicore_kernel_binary` from `DeviceRunner::run`,
`ensure_device_initialized`, `ensure_binaries_loaded`. Onboard already
had `aicore_kernel_binary_` as a member; add the parallel `aicpu_binary_`
to it and to all four DeviceRunner variants. Add a public
`device_id()` getter so the c_api can ask the runner what to attach to.
- chip_worker.h/cpp: add `BindExecutorsFn` typedef + `bind_executors_fn_`
member; drop `aicpu_binary_` / `aicore_binary_` members (the bytes are
now owned by the DeviceRunner). `init()` reads binaries into local
vectors and hands them off via `bind_executors_fn_(...)` right after
`simpler_init_fn_(...)`; the local vectors then die at the end of the
scope. `run()` passes neither device_id nor binaries down — they are
resolved by run_runtime() inside the SO. init's rollback path mirrors
finalize's teardown exactly minus finalize_device_fn_.
- docs/chip-level-arch.md, docs/dynamic-linking.md: refresh ABI listings
and the init / run flow diagrams to show bind_executors as a separate
init-time step and to remove the device_id + binary args from
run_runtime.
Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e . (all 4 host_runtime.so compile)
- pytest tests/ut/py (116 passed, 7 skipped)
- examples/workers/l2/{hello_worker,worker_malloc} on both sims
- tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed,
exercises init → bind_executors → run end-to-end through sim)
Onboard ut + st coverage runs in CI (Linux). This is a prerequisite for
later run_runtime decomposition; the API is now narrow enough that a
register/run split won't drag the executor/device args along.
ChaoWao
added a commit
to ChaoWao/simpler-fork
that referenced
this pull request
May 11, 2026
The C-ABI run_runtime() carried three parameters that are not actually
per-run state:
- aicpu_binary / aicore_binary: executor binaries pinned to the
(platform, runtime) tuple, read off disk once during ChipWorker::init
and never changed during the worker's lifetime.
- device_id: already bound to the DeviceRunner via attach_current_thread
(called from simpler_init at init time, see hw-native-sys#715 / hw-native-sys#723).
Every run_runtime() call paid for these by copying the AICPU + AICore byte
arrays into fresh std::vector<uint8_t>'s inside the dlsym'd SO and
forwarding device_id through prepare_run_context just to re-attach the
same thread the runner already knows about.
Layer A: collapse these three into init-time setup.
- pto_runtime_c_api.h: declare `int bind_executors(ctx, aicpu_ptr,
aicpu_size, aicore_ptr, aicore_size)`; shrink run_runtime() to drop
device_id and the four executor params. The file-header dlsym list and
run_runtime doxygen both note that executor / device are init-time-only.
- 4 platform pto_runtime_c_api.cpp (a2a3 + a5, onboard + sim): implement
bind_executors as a thin wrapper that constructs vectors from the byte
ranges and moves them into DeviceRunner::set_executors. run_runtime no
longer constructs per-call vectors, no longer passes device_id; it reads
runner->device_id() and calls runner->run(rt, block_dim, aicpu_thread_num).
- 4 DeviceRunner.{h,cpp}: add `set_executors(vector<uint8_t> aicpu,
vector<uint8_t> aicore)` (by-value + move), drop `device_id` /
`aicpu_so_binary` / `aicore_kernel_binary` from `DeviceRunner::run`,
`ensure_device_initialized`, `ensure_binaries_loaded`. Onboard already
had `aicore_kernel_binary_` as a member; add the parallel `aicpu_binary_`
to it and to all four DeviceRunner variants. Add a public
`device_id()` getter so the c_api can ask the runner what to attach to.
- chip_worker.h/cpp: add `BindExecutorsFn` typedef + `bind_executors_fn_`
member; drop `aicpu_binary_` / `aicore_binary_` members (the bytes are
now owned by the DeviceRunner). `init()` reads binaries into local
vectors and hands them off via `bind_executors_fn_(...)` right after
`simpler_init_fn_(...)`; the local vectors then die at the end of the
scope. `run()` passes neither device_id nor binaries down — they are
resolved by run_runtime() inside the SO. init's rollback path mirrors
finalize's teardown exactly minus finalize_device_fn_.
- docs/chip-level-arch.md, docs/dynamic-linking.md: refresh ABI listings
and the init / run flow diagrams to show bind_executors as a separate
init-time step and to remove the device_id + binary args from
run_runtime.
Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e . (all 4 host_runtime.so compile)
- pytest tests/ut/py (116 passed, 7 skipped)
- examples/workers/l2/{hello_worker,worker_malloc} on both sims
- tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed,
exercises init → bind_executors → run end-to-end through sim)
Onboard ut + st coverage runs in CI (Linux). This is a prerequisite for
later run_runtime decomposition; the API is now narrow enough that a
register/run split won't drag the executor/device args along.
ChaoWao
added a commit
to ChaoWao/simpler-fork
that referenced
this pull request
May 11, 2026
… device_init Continues the API-narrowing theme of hw-native-sys#723. Two strands woven together: (1) Logger ownership moves entirely to libsimpler_log.so. Before: simpler_init in host_runtime.so reached cross-SO into libsimpler_log for HostLogger setup, then host_runtime.so cached log_level / log_info_v on every DeviceRunner so run_runtime could later forward them to AICPU. Log state lived in three places (HostLogger, runner member, KernelArgs) all seeded off the same C-ABI argument. After: libsimpler_log.so exports its own simpler_log_init(level, info_v) C entry, called from ChipWorker::init BEFORE host_runtime.so is even dlopened. HostLogger gains level()/info_v() raw getters. Every consumer (host_runtime, AICPU forwarding, CANN dlog sync) reads from HostLogger::get_instance() directly. Log state is owned in exactly one place; no log argument ever travels through the host_runtime.so C ABI. (2) host_runtime.so's init surface collapses to one entry: device_init. Before: simpler_init(ctx, device_id, log_level, log_info_v) + bind_executors(ctx, aicpu_*, aicore_*) — two adjacent init-time entries always called back-to-back from ChipWorker::init. After: device_init(ctx, device_id, aicpu_*, aicore_*) — single entry that attaches the calling thread, takes ownership of executor binaries, and (onboard) syncs CANN dlog from HostLogger. Log args gone because (1) put them on a separate SO. ### Changes libsimpler_log.so: - HostLogger gains `int level() const` / `int info_v() const` raw getters. - New C export `simpler_log_init(int log_level, int log_info_v)` validates and forwards to HostLogger setters. host_runtime.so C ABI (`pto_runtime_c_api.h` + 4 platform impls): - `simpler_init` and `bind_executors` removed. - New `device_init(ctx, device_id, aicpu_ptr, aicpu_size, aicore_ptr, aicore_size)`: attach + executor takeover + (onboard) dlog sync. Onboard's dlog_setlevel reads HostLogger::get_instance().level() — no log arg. - Header doc updated; dlsym list updated. 4 × DeviceRunner.{h,cpp}: - Drop `log_level_` / `log_info_v_` members + `set_log_level` / `set_log_info_v` setters (no DeviceRunner-side log cache). - run() reads `HostLogger::get_instance().level()` / `.info_v()` directly when populating KernelArgs (onboard) or forwarding to the AICPU sim SO via dlsym (sim). ChipWorker: - ensure_simpler_log_loaded keeps `g_simpler_log_handle` (was already global) — ChipWorker now dlsym's `simpler_log_init` from it and calls it BEFORE host_runtime.so is opened, so any LOG_* macro firing during host_runtime.so's dlopen-time constructors already sees the right level. - `simpler_init_fn_` + `bind_executors_fn_` replaced by `device_init_fn_`. - init()'s rollback path collapsed: one rc check + one rollback instead of two. Docs: - `docs/chip-level-arch.md`, `docs/dynamic-linking.md`: ABI listings, lifecycle diagrams. - `docs/logging.md`, `docs/testing.md`: log configuration flow. - `python/simpler/__init__.py`, `python/simpler/_log.py`: module docstrings. - worker_malloc README. Verified locally on a2a3sim + a5sim: - pip install --no-build-isolation -e . (all 4 host_runtime.so + libsimpler_log compile) - pytest tests/ut/py (116 passed, 7 skipped) - examples/workers/l2/{hello_worker, worker_malloc} on both sims - tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed, exercises init → device_init → run end-to-end through sim) Onboard ut + st coverage runs in CI (Linux).
Merged
4 tasks
ChaoWao
added a commit
to ChaoWao/simpler-fork
that referenced
this pull request
May 12, 2026
…pWorker::init to Python Continues the API-narrowing theme of hw-native-sys#723 / hw-native-sys#735. ChipWorker::init was the last place in C++ doing process-wide SO bootstrap (dlopen libsimpler_log.so and, on sim, libcpu_sim_context.so with RTLD_GLOBAL, plus calling libsimpler_log.so's simpler_log_init to seed HostLogger). That work moves up into the Python `ChipWorker` wrapper, shrinking the C++ init signature from 8 args to 4. Before: void ChipWorker::init(host_lib, aicpu, aicore, simpler_log_lib, device_id, sim_context_lib = "", log_level = 1, log_info_v = 5); After: void ChipWorker::init(host_lib, aicpu, aicore, device_id); ### Why this is safe `_task_interface.so` (the nanobind module that contains chip_worker.cpp) has no undefined HostLogger / unified_log_* symbols — chip_worker.cpp reaches host_runtime.so purely via dlsym, and the binding code itself doesn't log. So the RTLD_GLOBAL preload only has to precede the `_ChipWorker.init` dlopen of host_runtime.so, not module import. The Python wrapper does exactly that: 1. ctypes.CDLL(bins.simpler_log_path, mode=RTLD_GLOBAL) # once per process 2. <handle>.simpler_log_init(log_level, log_info_v) # seed HostLogger 3. if bins.sim_context_path: # sim only ctypes.CDLL(bins.sim_context_path, mode=RTLD_GLOBAL) 4. self._impl.init(host_path, aicpu_path, aicore_path, device_id) A module-level `_preloaded_globals: dict[str, ctypes.CDLL]` makes the loads idempotent per path — the Python counterpart of the C++ side's old std::once_flag. ### Changes src/common/worker/chip_worker.{h,cpp}: - init() drops simpler_log_lib_path, sim_context_lib_path, log_level, log_info_v params. - Remove the g_simpler_log_* / g_sim_context_* file-scope globals, ensure_simpler_log_loaded(), ensure_sim_context_loaded(), the SimplerLogInitFn typedef + simpler_log_init_fn_ member, and the simpler_log_init call. Drop the now-unused <mutex> include. - init()'s body is just: dlopen host_runtime.so RTLD_LOCAL → dlsym → create device ctx → read executor binaries → simpler_init. python/bindings/task_interface.cpp: - `_ChipWorker.init` nanobind def: 4 args (host_lib_path, aicpu_path, aicore_path, device_id). python/simpler/task_interface.py: - New module-level `_preloaded_globals` registry + `_preload_global(path)` helper (ctypes.CDLL RTLD_GLOBAL, one per path). - ChipWorker.init: preload libsimpler_log.so + call simpler_log_init via the ctypes handle, preload libcpu_sim_context.so when bins.sim_context_path is set, then call the 4-arg _impl.init. Wrapper's public signature (device_id, bins, log_level=None, log_info_v=None) is unchanged, so no caller updates needed. tests/ut/py/test_chip_worker.py: - The three `_ChipWorker.init(...)` fault-path tests drop the `/nonexistent/libsimpler_log.so` argument (no longer a parameter). Docs (chip-level-arch, dynamic-linking, logging, python/simpler/__init__.py, python/simpler/_log.py): updated the init-flow ASCII art / load-order section / configuration-flow table to show the preload happening in the Python wrapper before the C++ _ChipWorker.init. Verified locally on a2a3sim + a5sim: - pip install --no-build-isolation -e . - pytest tests/ut/py (119 passed, 7 skipped; torch-missing tests excluded as before) - examples/workers/l2/{hello_worker, worker_malloc} on both sims Onboard ut + st coverage runs in CI (Linux).
ChaoWao
added a commit
that referenced
this pull request
May 12, 2026
…pWorker::init to Python (#746) Continues the API-narrowing theme of #723 / #735. ChipWorker::init was the last place in C++ doing process-wide SO bootstrap (dlopen libsimpler_log.so and, on sim, libcpu_sim_context.so with RTLD_GLOBAL, plus calling libsimpler_log.so's simpler_log_init to seed HostLogger). That work moves up into the Python `ChipWorker` wrapper, shrinking the C++ init signature from 8 args to 4. Before: void ChipWorker::init(host_lib, aicpu, aicore, simpler_log_lib, device_id, sim_context_lib = "", log_level = 1, log_info_v = 5); After: void ChipWorker::init(host_lib, aicpu, aicore, device_id); ### Why this is safe `_task_interface.so` (the nanobind module that contains chip_worker.cpp) has no undefined HostLogger / unified_log_* symbols — chip_worker.cpp reaches host_runtime.so purely via dlsym, and the binding code itself doesn't log. So the RTLD_GLOBAL preload only has to precede the `_ChipWorker.init` dlopen of host_runtime.so, not module import. The Python wrapper does exactly that: 1. ctypes.CDLL(bins.simpler_log_path, mode=RTLD_GLOBAL) # once per process 2. <handle>.simpler_log_init(log_level, log_info_v) # seed HostLogger 3. if bins.sim_context_path: # sim only ctypes.CDLL(bins.sim_context_path, mode=RTLD_GLOBAL) 4. self._impl.init(host_path, aicpu_path, aicore_path, device_id) A module-level `_preloaded_globals: dict[str, ctypes.CDLL]` makes the loads idempotent per path — the Python counterpart of the C++ side's old std::once_flag. ### Changes src/common/worker/chip_worker.{h,cpp}: - init() drops simpler_log_lib_path, sim_context_lib_path, log_level, log_info_v params. - Remove the g_simpler_log_* / g_sim_context_* file-scope globals, ensure_simpler_log_loaded(), ensure_sim_context_loaded(), the SimplerLogInitFn typedef + simpler_log_init_fn_ member, and the simpler_log_init call. Drop the now-unused <mutex> include. - init()'s body is just: dlopen host_runtime.so RTLD_LOCAL → dlsym → create device ctx → read executor binaries → simpler_init. python/bindings/task_interface.cpp: - `_ChipWorker.init` nanobind def: 4 args (host_lib_path, aicpu_path, aicore_path, device_id). python/simpler/task_interface.py: - New module-level `_preloaded_globals` registry + `_preload_global(path)` helper (ctypes.CDLL RTLD_GLOBAL, one per path). - ChipWorker.init: preload libsimpler_log.so + call simpler_log_init via the ctypes handle, preload libcpu_sim_context.so when bins.sim_context_path is set, then call the 4-arg _impl.init. Wrapper's public signature (device_id, bins, log_level=None, log_info_v=None) is unchanged, so no caller updates needed. tests/ut/py/test_chip_worker.py: - The three `_ChipWorker.init(...)` fault-path tests drop the `/nonexistent/libsimpler_log.so` argument (no longer a parameter). Docs (chip-level-arch, dynamic-linking, logging, python/simpler/__init__.py, python/simpler/_log.py): updated the init-flow ASCII art / load-order section / configuration-flow table to show the preload happening in the Python wrapper before the C++ _ChipWorker.init. Verified locally on a2a3sim + a5sim: - pip install --no-build-isolation -e . - pytest tests/ut/py (119 passed, 7 skipped; torch-missing tests excluded as before) - examples/workers/l2/{hello_worker, worker_malloc} on both sims Onboard ut + st coverage runs in CI (Linux).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ChipWorkerexposedinit/set_device/reset_device/finalizeas four public lifecycle methods even thoughset_deviceexisted only to satisfy CANN's per-threadaclrtSetDeviceattach requirement (#715), andreset_devicehad zero call sites. The split was already the source of one bug (#715) and forced every caller through a redundant ceremony.This PR collapses the lifecycle to a single attach point at
init.What changed
C++ (
src/common/worker/chip_worker.{h,cpp})ChipWorker::initnow takesdevice_idand attaches the calling thread internally viasimpler_initChipWorker::set_deviceand::reset_devicedeleteddevice_set_flag anddevice_set()getter collapse intoinitialized_finalize()simplified — callsfinalize_device_fn_directlyC-API (
pto_runtime_c_api.h+ a2a3/a5 onboard+sim implementations)set_deviceexport removedsimpler_initsignature extended to(ctx, device_id, log_level, log_info_v)and now returnsint; folds the per-thread attach in alongside log configDeviceRunnerattach surface (a2a3 + a5, onboard + sim)attach_current_thread(int)is the centralized binder. Onboard:rtSetDevice. Sim:pto_cpu_sim_bind_device + pto_cpu_sim_acquire_device.device_id_stays a plainint. All ChipWorker callers run on the same thread that calledinit(), so the standard thread-spawn happens-before edge is sufficient — no atomic, no CAS, no per-op re-attach.Python
_ChipWorker.init/ChipWorker.inittakedevice_id(and the wrapper takes it beforebins, since it's the simpler, more fundamental "where" parameter)_ChipWorker.set_device/.reset_devicebindings removed;device_setproperty removedbootstrap_contextno longer callsset_device— the device is attached atinittime, the param is now used only as a defensive consistency checkDrive-by fix:
ProfilerBase's constructor/destructor are now private withfriend Derived(bugprone-crtp-constructor-accessibility). Newer clang-tidy versions on macOS would otherwise reject any TU that transitively includesprofiler_base.h.Callers updated
python/simpler/worker.py,simpler_setup/scene_test.py,tests/ut/py/test_chip_worker.py,tests/ut/py/test_worker/test_platform_comm.py, andtest_bootstrap_context_{sim,hw}.pycollapseinit+set_deviceinto a singleinit(device_id, bins)call.Docs
docs/chip-level-arch.md,docs/dynamic-linking.md,docs/getting-started.md, plus theworker_malloc/hello_workerexamples reflect the new flow.Test plan