Refactor: collapse ChipWorker init/set_device → init(device_id, bins) by ChaoWao · Pull Request #723 · hw-native-sys/simpler

ChaoWao · 2026-05-08T09:21:08Z

Summary

ChipWorker exposed init / set_device / reset_device / finalize as four public lifecycle methods even though set_device existed only to satisfy CANN's per-thread aclrtSetDevice attach requirement (#715), and reset_device had zero call sites. The split was already the source of one bug (#715) and forced every caller through a redundant ceremony.

This PR collapses the lifecycle to a single attach point at init.

What changed

C++ (src/common/worker/chip_worker.{h,cpp})

ChipWorker::init now takes device_id and attaches the calling thread internally via simpler_init
ChipWorker::set_device and ::reset_device deleted
device_set_ flag and device_set() getter collapse into initialized_
finalize() simplified — calls finalize_device_fn_ directly

C-API (pto_runtime_c_api.h + a2a3/a5 onboard+sim implementations)

set_device export removed
simpler_init signature extended to (ctx, device_id, log_level, log_info_v) and now returns int; folds the per-thread attach in alongside log config

DeviceRunner attach surface (a2a3 + a5, onboard + sim)

attach_current_thread(int) is the centralized binder. Onboard: rtSetDevice. Sim: pto_cpu_sim_bind_device + pto_cpu_sim_acquire_device.
device_id_ stays a plain int. All ChipWorker callers run on the same thread that called init(), so the standard thread-spawn happens-before edge is sufficient — no atomic, no CAS, no per-op re-attach.
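
The per-thread nature of the attach can be illustrated with a minimal sketch. It assumes a thread-local registry as a stand-in for the real device binding; the names `_tls`, `current_device`, and `probe_from_new_thread` are hypothetical and do not come from this PR.

```python
import threading

# Hypothetical stand-in for the per-thread device binding that
# attach_current_thread(int) performs; this is not the real runtime.
_tls = threading.local()


def attach_current_thread(device_id: int) -> None:
    """Record the device on the calling thread only."""
    _tls.device_id = device_id


def current_device() -> int:
    """Return the device bound to the calling thread, or raise if unattached."""
    device_id = getattr(_tls, "device_id", None)
    if device_id is None:
        raise RuntimeError("calling thread is not attached to a device")
    return device_id


def probe_from_new_thread() -> str:
    """Call current_device() on a fresh thread and report what happened."""
    result = []

    def body() -> None:
        try:
            result.append(f"attached to {current_device()}")
        except RuntimeError as exc:
            result.append(f"error: {exc}")

    thread = threading.Thread(target=body)
    thread.start()
    thread.join()
    return result[0]


attach_current_thread(0)                 # the thread that calls init() attaches once
same_thread = current_device()           # later ops on that thread see the binding
other_thread = probe_from_new_thread()   # a thread that never attached does not
```

Because every caller stays on the thread that called init(), the binding is written once and only read afterwards on that same thread, which is why the sketch needs no lock or atomic.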

Python

_ChipWorker.init / ChipWorker.init take device_id (and the wrapper takes it before bins, since it's the simpler, more fundamental "where" parameter)
_ChipWorker.set_device / .reset_device bindings removed; device_set property removed
bootstrap_context no longer calls set_device — the device is attached at init time, the param is now used only as a defensive consistency check
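
A toy model of the collapsed Python-side flow is sketched below. `SketchChipWorker` and `Bins` are hypothetical stand-ins rather than the real classes, and raising `ValueError` on a device mismatch is an assumption about what the defensive consistency check might do.

```python
from dataclasses import dataclass


@dataclass
class Bins:
    """Hypothetical stand-in for the binaries bundle passed as `bins`."""
    host_path: str


class SketchChipWorker:
    """Toy model of the collapsed lifecycle; not the real ChipWorker."""

    def __init__(self) -> None:
        self.initialized = False
        self.device_id = -1
        self.bins = None

    def init(self, device_id: int, bins: Bins) -> None:
        if self.initialized:
            raise RuntimeError("already initialized")
        # Attach and load happen together; there is no separate set_device step.
        self.device_id = device_id
        self.bins = bins
        self.initialized = True

    def finalize(self) -> None:
        self.initialized = False
        self.device_id = -1
        self.bins = None


def bootstrap_context(worker: SketchChipWorker, device_id: int) -> None:
    """Cross-check device_id only; it no longer triggers an attach (assumed behavior)."""
    if not worker.initialized:
        raise RuntimeError("worker must be initialized first")
    if worker.device_id != device_id:
        raise ValueError(
            f"worker is attached to device {worker.device_id}, got {device_id}"
        )


worker = SketchChipWorker()
worker.init(0, Bins(host_path="/hypothetical/host_runtime.so"))
bootstrap_context(worker, 0)  # matches the device given at init, so it passes
```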

Drive-by fix: ProfilerBase's constructor/destructor are now private with friend Derived (bugprone-crtp-constructor-accessibility). Newer clang-tidy versions on macOS would otherwise reject any TU that transitively includes profiler_base.h.

Callers updated

python/simpler/worker.py, simpler_setup/scene_test.py, tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_platform_comm.py, and test_bootstrap_context_{sim,hw}.py collapse init + set_device into a single init(device_id, bins) call.

Docs

docs/chip-level-arch.md, docs/dynamic-linking.md, docs/getting-started.md, plus the worker_malloc / hello_worker examples reflect the new flow.

Test plan

`pip install --no-build-isolation -e .`: clean build (a2a3sim + a5sim) on macOS
`pytest tests/ut/py`: 116 passed, 7 skipped (the only failures are unrelated tests that need `torch`, which is not installed in the local env)
`python examples/workers/l2/{hello_worker,worker_malloc}/main.py -p {a2a3sim,a5sim} -d 0`: passes
Onboard CI: `ut-a2a3` / `ut-a5` / `st-onboard-*` (Linux, hardware), needs CI run
`ctest --test-dir build/ut_cpp` (Linux): needs CI run

gemini-code-assist

Code Review

This pull request refactors the ChipWorker API by integrating NPU device attachment into the init method and removing the explicit set_device and reset_device calls. Device operations such as malloc, free, and copy now internally handle thread attachment to ensure safety across different calling threads. Documentation, examples, and tests have been updated to reflect these changes. Feedback identifies potential data races in the initialization and attachment logic, along with a resource leak in the free_tensor error path.

ChipWorker exposed init / set_device / reset_device / finalize as four public lifecycle methods even though set_device existed only to satisfy CANN's per-thread aclrtSetDevice attach requirement (PR hw-native-sys#715), and reset_device had zero call sites. The split was the source of one bug (hw-native-sys#715) and forced every caller through a redundant ceremony.

Collapse into a single attach point:

- ChipWorker::init now takes device_id and attaches the calling thread internally via simpler_init; ChipWorker::set_device and ::reset_device are deleted, and device_set_ / device_set() collapse into initialized_.
- C-API set_device export is removed; simpler_init grows a device_id parameter and returns int. The per-thread attach is centralized inside DeviceRunner::attach_current_thread (sim DeviceRunner gains the same method, driving pto_cpu_sim_bind_device + idempotent acquire_device). All ChipWorker callers run on the same thread that called init, so device_id_ stays a plain int: no atomic / no CAS needed.
- Python `ChipWorker.init` wrapper takes (device_id, bins), with device_id first since it's the simpler, more fundamental "where" parameter. bootstrap_context's device_id arg becomes a defensive consistency check rather than the attach point.
- Callers updated: python/simpler/worker.py, simpler_setup/scene_test.py, tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_platform_comm.py and test_bootstrap_context_{sim,hw}.py.
- Docs (chip-level-arch, dynamic-linking, getting-started) and the worker_malloc / hello_worker examples reflect the new flow.

Drive-by: ProfilerBase's constructor/destructor are now private with `friend Derived` (bugprone-crtp-constructor-accessibility); newer clang-tidy versions on macOS would otherwise reject any TU that transitively includes profiler_base.h.

The C-ABI run_runtime() carried three parameters that are not actually per-run state:

- aicpu_binary / aicore_binary: executor binaries pinned to the (platform, runtime) tuple, read off disk once during ChipWorker::init and never changed during the worker's lifetime.
- device_id: already bound to the DeviceRunner via attach_current_thread (called from simpler_init at init time, see hw-native-sys#715 / hw-native-sys#723).

Every run_runtime() call paid for these by copying the AICPU + AICore byte arrays into fresh std::vector<uint8_t>'s inside the dlsym'd SO and forwarding device_id through prepare_run_context just to re-attach the same thread the runner already knows about.

Layer A: collapse these three into init-time setup.

- pto_runtime_c_api.h: declare `int bind_executors(ctx, aicpu_ptr, aicpu_size, aicore_ptr, aicore_size)`; shrink run_runtime() to drop device_id and the four executor params. The file-header dlsym list and run_runtime doxygen both note that executor / device are init-time-only.
- 4 platform pto_runtime_c_api.cpp (a2a3 + a5, onboard + sim): implement bind_executors as a thin wrapper that constructs vectors from the byte ranges and moves them into DeviceRunner::set_executors. run_runtime no longer constructs per-call vectors, no longer passes device_id; it reads runner->device_id() and calls runner->run(rt, block_dim, aicpu_thread_num).
- 4 DeviceRunner.{h,cpp}: add `set_executors(vector<uint8_t> aicpu, vector<uint8_t> aicore)` (by-value + move), drop `device_id` / `aicpu_so_binary` / `aicore_kernel_binary` from `DeviceRunner::run`, `ensure_device_initialized`, `ensure_binaries_loaded`. Onboard already had `aicore_kernel_binary_` as a member; add the parallel `aicpu_binary_` to it and to all four DeviceRunner variants. Add a public `device_id()` getter so the c_api can ask the runner what to attach to.
- chip_worker.h/cpp: add `BindExecutorsFn` typedef + `bind_executors_fn_` member; drop `aicpu_binary_` / `aicore_binary_` members (the bytes are now owned by the DeviceRunner). `init()` reads binaries into local vectors and hands them off via `bind_executors_fn_(...)` right after `simpler_init_fn_(...)`; the local vectors then die at the end of the scope. `run()` passes neither device_id nor binaries down; they are resolved by run_runtime() inside the SO. init's rollback path mirrors finalize's teardown exactly minus finalize_device_fn_.
- docs/chip-level-arch.md, docs/dynamic-linking.md: refresh ABI listings and the init / run flow diagrams to show bind_executors as a separate init-time step and to remove the device_id + binary args from run_runtime.

Verified locally on a2a3sim + a5sim:

- pip install --no-build-isolation -e . (all 4 host_runtime.so compile)
- pytest tests/ut/py (116 passed, 7 skipped)
- examples/workers/l2/{hello_worker,worker_malloc} on both sims
- tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed, exercises init → bind_executors → run end-to-end through sim)

Onboard ut + st coverage runs in CI (Linux).

This is a prerequisite for later run_runtime decomposition; the API is now narrow enough that a register/run split won't drag the executor/device args along.
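
The shift from per-run arguments to init-time state can be sketched as follows. `SketchDeviceRunner` and `init_worker` are hypothetical Python stand-ins for the C++ DeviceRunner and ChipWorker::init; the `bind_calls` counter exists only to make the sketch observable.

```python
class SketchDeviceRunner:
    """Toy model of a runner that owns its executors; not the real DeviceRunner."""

    def __init__(self) -> None:
        self._device_id = -1
        self._aicpu = b""
        self._aicore = b""
        self.bind_calls = 0  # instrumentation for this sketch only

    def attach_current_thread(self, device_id: int) -> None:
        self._device_id = device_id

    def set_executors(self, aicpu: bytes, aicore: bytes) -> None:
        # Both byte buffers are handed over once, at init time.
        self._aicpu = bytes(aicpu)
        self._aicore = bytes(aicore)
        self.bind_calls += 1

    def device_id(self) -> int:
        return self._device_id

    def run(self, block_dim: int, aicpu_thread_num: int) -> dict:
        # No device_id or binary arguments: both are read from runner state.
        return {
            "device_id": self._device_id,
            "aicpu_size": len(self._aicpu),
            "aicore_size": len(self._aicore),
            "block_dim": block_dim,
            "aicpu_thread_num": aicpu_thread_num,
        }


def init_worker(runner: SketchDeviceRunner, device_id: int,
                aicpu: bytes, aicore: bytes) -> None:
    """Init-time setup: attach first, then bind the executors."""
    runner.attach_current_thread(device_id)
    runner.set_executors(aicpu, aicore)


runner = SketchDeviceRunner()
init_worker(runner, 0, b"\x01\x02\x03\x04", b"\x05\x06")
results = [runner.run(block_dim=1, aicpu_thread_num=1) for _ in range(3)]
```

Three runs reuse the binaries bound at init, so the per-call cost no longer depends on executor size.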

… device_init

Continues the API-narrowing theme of hw-native-sys#723. Two strands woven together:

(1) Logger ownership moves entirely to libsimpler_log.so.

Before: simpler_init in host_runtime.so reached cross-SO into libsimpler_log for HostLogger setup, then host_runtime.so cached log_level / log_info_v on every DeviceRunner so run_runtime could later forward them to AICPU. Log state lived in three places (HostLogger, runner member, KernelArgs) all seeded off the same C-ABI argument.

After: libsimpler_log.so exports its own simpler_log_init(level, info_v) C entry, called from ChipWorker::init BEFORE host_runtime.so is even dlopened. HostLogger gains level()/info_v() raw getters. Every consumer (host_runtime, AICPU forwarding, CANN dlog sync) reads from HostLogger::get_instance() directly. Log state is owned in exactly one place; no log argument ever travels through the host_runtime.so C ABI.

(2) host_runtime.so's init surface collapses to one entry: device_init.

Before: simpler_init(ctx, device_id, log_level, log_info_v) + bind_executors(ctx, aicpu_*, aicore_*), two adjacent init-time entries always called back-to-back from ChipWorker::init.

After: device_init(ctx, device_id, aicpu_*, aicore_*), a single entry that attaches the calling thread, takes ownership of executor binaries, and (onboard) syncs CANN dlog from HostLogger. Log args gone because (1) put them on a separate SO.

### Changes

libsimpler_log.so:

- HostLogger gains `int level() const` / `int info_v() const` raw getters.
- New C export `simpler_log_init(int log_level, int log_info_v)` validates and forwards to HostLogger setters.

host_runtime.so C ABI (`pto_runtime_c_api.h` + 4 platform impls):

- `simpler_init` and `bind_executors` removed.
- New `device_init(ctx, device_id, aicpu_ptr, aicpu_size, aicore_ptr, aicore_size)`: attach + executor takeover + (onboard) dlog sync. Onboard's dlog_setlevel reads HostLogger::get_instance().level(), with no log arg.
- Header doc updated; dlsym list updated.

4 × DeviceRunner.{h,cpp}:

- Drop `log_level_` / `log_info_v_` members + `set_log_level` / `set_log_info_v` setters (no DeviceRunner-side log cache).
- run() reads `HostLogger::get_instance().level()` / `.info_v()` directly when populating KernelArgs (onboard) or forwarding to the AICPU sim SO via dlsym (sim).

ChipWorker:

- ensure_simpler_log_loaded keeps `g_simpler_log_handle` (was already global). ChipWorker now dlsym's `simpler_log_init` from it and calls it BEFORE host_runtime.so is opened, so any LOG_* macro firing during host_runtime.so's dlopen-time constructors already sees the right level.
- `simpler_init_fn_` + `bind_executors_fn_` replaced by `device_init_fn_`.
- init()'s rollback path collapsed: one rc check + one rollback instead of two.

Docs:

- `docs/chip-level-arch.md`, `docs/dynamic-linking.md`: ABI listings, lifecycle diagrams.
- `docs/logging.md`, `docs/testing.md`: log configuration flow.
- `python/simpler/__init__.py`, `python/simpler/_log.py`: module docstrings.
- worker_malloc README.

Verified locally on a2a3sim + a5sim:

- pip install --no-build-isolation -e . (all 4 host_runtime.so + libsimpler_log compile)
- pytest tests/ut/py (116 passed, 7 skipped)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims
- tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed, exercises init → device_init → run end-to-end through sim)

Onboard ut + st coverage runs in CI (Linux).
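
The single-owner arrangement for log state can be sketched in a few lines. `HostLoggerSketch` and `populate_kernel_args` are hypothetical stand-ins; the defaults of 1 and 5 mirror the C++ default arguments quoted elsewhere on this page, while the accepted ranges inside `simpler_log_init` are assumed purely for illustration.

```python
class HostLoggerSketch:
    """Toy singleton standing in for HostLogger; not the real class."""

    _instance = None

    def __init__(self) -> None:
        self._level = 1
        self._info_v = 5

    @classmethod
    def get_instance(cls) -> "HostLoggerSketch":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def set_level(self, level: int) -> None:
        self._level = level

    def set_info_v(self, info_v: int) -> None:
        self._info_v = info_v

    def level(self) -> int:
        return self._level

    def info_v(self) -> int:
        return self._info_v


def simpler_log_init(log_level: int, log_info_v: int) -> int:
    """Validate, then forward to the logger's setters. Returns 0 on success."""
    # The accepted ranges below are assumptions made for this sketch.
    if not (0 <= log_level <= 4) or not (0 <= log_info_v <= 9):
        return -1
    logger = HostLoggerSketch.get_instance()
    logger.set_level(log_level)
    logger.set_info_v(log_info_v)
    return 0


def populate_kernel_args() -> dict:
    """A consumer reads log state from the singleton; nothing is cached or passed in."""
    logger = HostLoggerSketch.get_instance()
    return {"log_level": logger.level(), "log_info_v": logger.info_v()}


rc = simpler_log_init(2, 7)
kernel_args = populate_kernel_args()
```

Since consumers hold no copy of the level, there is no second place for the value to go stale.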

…pWorker::init to Python

Continues the API-narrowing theme of hw-native-sys#723 / hw-native-sys#735. ChipWorker::init was the last place in C++ doing process-wide SO bootstrap (dlopen libsimpler_log.so and, on sim, libcpu_sim_context.so with RTLD_GLOBAL, plus calling libsimpler_log.so's simpler_log_init to seed HostLogger). That work moves up into the Python `ChipWorker` wrapper, shrinking the C++ init signature from 8 args to 4.

Before:

    void ChipWorker::init(host_lib, aicpu, aicore, simpler_log_lib, device_id, sim_context_lib = "", log_level = 1, log_info_v = 5);

After:

    void ChipWorker::init(host_lib, aicpu, aicore, device_id);

### Why this is safe

`_task_interface.so` (the nanobind module that contains chip_worker.cpp) has no undefined HostLogger / unified_log_* symbols: chip_worker.cpp reaches host_runtime.so purely via dlsym, and the binding code itself doesn't log. So the RTLD_GLOBAL preload only has to precede the `_ChipWorker.init` dlopen of host_runtime.so, not module import. The Python wrapper does exactly that:

    1. ctypes.CDLL(bins.simpler_log_path, mode=RTLD_GLOBAL)      # once per process
    2. <handle>.simpler_log_init(log_level, log_info_v)           # seed HostLogger
    3. if bins.sim_context_path:                                  # sim only
           ctypes.CDLL(bins.sim_context_path, mode=RTLD_GLOBAL)
    4. self._impl.init(host_path, aicpu_path, aicore_path, device_id)

A module-level `_preloaded_globals: dict[str, ctypes.CDLL]` makes the loads idempotent per path, the Python counterpart of the C++ side's old std::once_flag.

### Changes

src/common/worker/chip_worker.{h,cpp}:

- init() drops simpler_log_lib_path, sim_context_lib_path, log_level, log_info_v params.
- Remove the g_simpler_log_* / g_sim_context_* file-scope globals, ensure_simpler_log_loaded(), ensure_sim_context_loaded(), the SimplerLogInitFn typedef + simpler_log_init_fn_ member, and the simpler_log_init call. Drop the now-unused <mutex> include.
- init()'s body is just: dlopen host_runtime.so RTLD_LOCAL → dlsym → create device ctx → read executor binaries → simpler_init.

python/bindings/task_interface.cpp:

- `_ChipWorker.init` nanobind def: 4 args (host_lib_path, aicpu_path, aicore_path, device_id).

python/simpler/task_interface.py:

- New module-level `_preloaded_globals` registry + `_preload_global(path)` helper (ctypes.CDLL RTLD_GLOBAL, one per path).
- ChipWorker.init: preload libsimpler_log.so + call simpler_log_init via the ctypes handle, preload libcpu_sim_context.so when bins.sim_context_path is set, then call the 4-arg _impl.init. Wrapper's public signature (device_id, bins, log_level=None, log_info_v=None) is unchanged, so no caller updates needed.

tests/ut/py/test_chip_worker.py:

- The three `_ChipWorker.init(...)` fault-path tests drop the `/nonexistent/libsimpler_log.so` argument (no longer a parameter).

Docs (chip-level-arch, dynamic-linking, logging, python/simpler/__init__.py, python/simpler/_log.py): updated the init-flow ASCII art / load-order section / configuration-flow table to show the preload happening in the Python wrapper before the C++ _ChipWorker.init.

Verified locally on a2a3sim + a5sim:

- pip install --no-build-isolation -e .
- pytest tests/ut/py (119 passed, 7 skipped; torch-missing tests excluded as before)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims

Onboard ut + st coverage runs in CI (Linux).


ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch from efc0058 to 5528c02 Compare May 8, 2026 09:28

gemini-code-assist Bot reviewed May 8, 2026


ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch 2 times, most recently from 00889af to f48e496 Compare May 9, 2026 08:40

ChaoWao changed the title ~~Refactor: collapse ChipWorker init/set_device, fold per-thread attach into device-ops~~ Refactor: collapse ChipWorker init/set_device → init(device_id, bins) May 9, 2026

ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch from f48e496 to 9939bd3 Compare May 9, 2026 09:32

ChaoWao merged commit 359fc3d into hw-native-sys:main May 11, 2026
14 checks passed

ChaoWao deleted the refactor/chip-worker-fold-set-device branch May 11, 2026 03:38

ChaoWao mentioned this pull request May 11, 2026

Refactor: decouple logger SO + collapse simpler_init/bind_executors → device_init #735

Merged

6 tasks

ChaoWao mentioned this pull request May 12, 2026

Refactor: move RTLD_GLOBAL SO preload + simpler_log_init from C++ ChipWorker::init to Python #746

Merged

4 tasks

ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:refactor/chip-worker-fold-set-device
