Add: orch-driven dynamic CommDomain allocation (Option A)#817
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements dynamic CommDomain allocation and release, enabling collective window pool allocation for subsets of ranks at runtime. The changes span the Python API, C++ backend implementations for HCCL and simulation, and the hierarchical worker orchestration logic. Review feedback highlights opportunities to improve backend consistency, reduce code duplication in the bootstrap process, optimize worker lookup performance, and ensure the C API parameters align with their documented contracts.
75f3096 to
1009fdb
Compare
uv-xiao
left a comment
There was a problem hiding this comment.
Just glanced over the PR and leave small comments. I suggest the biggest problem is whether we should reuse the same multi-comm-domain path as PR #752 . This new PR seems to have created a completely new path. I'll use AI for more review.
1. P1:
|
c9d2bca to
681f331
Compare
|
Thanks for the thorough review — addressed in 681f331 along with deleting the entire static-path API (per the maintainer's call: P1#1 — use-after-free on P1#2 — buffer overflow validates after backend alloc: fixed by checking P2 — dynamic windows not zeroed: fixed in both backends. Coverage gap (no regression test): addressed implicitly via the static-path deletion — the test that used to need adding ("submit a task using dynamic device_ctx") is now the only path; Open question — lifetime model: settled as "release on with-exit, free after drain." The orch_fn observes the handle as released as soon as |
|
@uv-xiao Re your meta-question ("should reuse the same multi-comm-domain path as PR #752 ... seems to have created a completely new path"): you're right that we had two paths. Per the maintainer's call ( What's gone:
What's there now: So the path duplication is gone. Examples need to be re-written to the orch-only API in a follow-up — that's a bigger kernel-level migration than fits this PR's scope. |
|
@ChaoWao Will the sub-communicators be created every time if a worker runs multiple times, since the comm-domains are allocated inside |
|
Comment 1: P1 cleanup skipped when drain raises P1: dynamic domain cleanup is skipped if
self._orch._scope_end()
self._orch._drain()
self._execute_pending_domain_releases()
if self._live_domains:
self._release_all_live_domains()But the comment above this block says drain() rethrows the first dispatch failure. If a submitted chip task fails, or a worker/dispatch error is surfaced from drain, both That reintroduces a leak path for exactly the domains this PR is trying to make lifetime-safe:
Please wrap the drain and cleanup ordering so domain cleanup runs even when drain raises, while preserving the original exception. For example, use a nested try/finally around Comment 2: stale docstring Small consistency issue: this docstring is stale after the parent-side overflow validation was added. It still says: buffers are carved sequentially inside the
window in declaration order; their ``nbytes`` sum may exceed
``window_size`` — the chip-side carve will raise then.But _allocate_domain() now checks sum(b.nbytes) <= window_size before dispatching any chip-side allocation, which is the right behavior. Please update this text so users do not expect chip-side overflow handling here. Comment 3: test coverage The new tests cover allocate/release mechanics well, including sim interleaved subsets and one hardware smoke path, but they still do not exercise a real submitted chip task that consumes the returned |
cd2bfda to
5d83e7e
Compare
Adds a runtime path for allocating CommDomain windows from inside the
orch function, alongside the existing static `Worker(comm_plan=...)`
path. Dynamic allocation pays a collective cost (~hundreds of ms on
HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready
barrier) per orch.allocate_domain call, sized to actual need rather
than worst-case-pre-declared.
Public surface
with orch.allocate_domain(
name="tp",
workers=[0, 1],
window_size=4 * 1024 * 1024,
buffers=[CommBufferSpec("scratch", "float32", n, n*4)],
) as tp:
for chip_idx in tp.workers:
sub_args = TaskArgs()
sub_args.add_scalar(tp[chip_idx].device_ctx)
sub_args.add_tensor(..., data=tp[chip_idx].buffer_ptrs["scratch"])
orch.submit_next_level(cid, sub_args, cfg, worker=chip_idx)
# `with`-exit auto-releases across all participating chips.
Implementation
- C ABI (sim + HCCL): comm_alloc_domain_windows /
comm_release_domain_windows. HCCL impl mirrors alloc_windows_via_ipc
but on a fresh per-allocation buffer with every barrier / IPC announce
filename scoped by `allocation_id` so concurrent allocations don't
collide. Sim impl creates a fresh POSIX shm per allocation. P2P
routes (aclrtDeviceEnablePeerAccess) are inherited from the base
allocation; only IPC export/import + symmetric pool is per-call.
a5 backend ships not-supported stubs (-1).
- ChipWorker C++ wrappers + nanobind binding. Both methods return
`(device_ctx, local_window_base)` for the calling rank.
- Mailbox: CTRL_ALLOC_DOMAIN / CTRL_RELEASE_DOMAIN. Variable payloads
cross via per-call POSIX shms (request + reply); two shm names are
staged back-to-back at MAILBOX_OFF_ARGS. WorkerThread holds
mailbox_mu_ so they serialise with task dispatch.
- Orchestrator.allocate_domain / release_domain. Drives the collective
by fanning out per-chip control calls on threads (one per
participating chip) and joining; raises if any chip fails after all
join.
- Worker._live_domains: tracks active handles; auto-released in
Worker.run's `finally` on orch-fn exception and in Worker.close
in LIFO order. Each handle is a `CommDomainHandle` context manager.
Backward compatibility
- `Worker(comm_plan=...)` still works unchanged. comm_plan now
serves a second purpose: declare a minimal "membership" domain so
bootstrap_context establishes the HCCL/sim base communicator that
dynamic allocations need as their root.
Sim test coverage
tests/ut/py/test_worker/test_dynamic_alloc_sim.py (5 tests):
- allocate returns distinct per-chip contexts
- with-statement auto-releases on exit
- alloc-after-release reuses the same name; allocation_id
monotonically increments
- duplicate name while live raises
- orch_fn raises mid-DAG → Worker.run releases all live handles
before propagating
Local: pytest tests/ut/py/test_worker/ → 102 passed, 1 hw skip
Pre-commit clean: clang-format / clang-tidy / cpplint / ruff / pyright
Non-goals
- Async / Future-based allocation
- Cross-host domains
- Resize / grow existing domain
- Allocating outside an orch function
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5d83e7e to
1e71e3f
Compare
|
@puddingfjz Thanks — addressed in Comment 1 (cleanup skipped when drain raises) — fixed. self._orch._scope_end()
try:
self._orch._drain()
finally:
self._execute_pending_domain_releases()
if self._live_domains:
self._release_all_live_domains()So even when Comment 2 (stale docstring) — fixed. Comment 3 (real-task test coverage) — the migrated examples now exercise real submitted tasks consuming the dynamic Also added |
|
@uv-xiao No — the sub-communicators are not recreated per run. The base communicator is built lazily on the first What does happen per Documented in |
I see. What I actually want to discuss is the per-domain window, but I think it has been described in #824. I don't have more questions here. |
| ## 1. API | ||
|
|
||
| ```python | ||
| with orch.allocate_domain( | ||
| name="default", # local label (peers need not agree) | ||
| workers=[0, 1], # subset of the Worker's device_ids indices | ||
| window_size=4096, # per-rank symmetric window, bytes | ||
| buffers=[ # named slices carved from the window | ||
| CommBufferSpec(name="scratch", dtype="float32", count=1024, nbytes=4096), | ||
| ], | ||
| ) as handle: | ||
| for chip_idx in handle.workers: | ||
| domain = handle[chip_idx] # -> ChipDomainContext | ||
| ... | ||
| orch.submit_next_level(cid, args, cfg, worker=chip_idx) | ||
| ``` | ||
|
|
||
| `window_size` is validated on the orch thread **before** any chip-side | ||
| allocation: if `sum(b.nbytes) > window_size`, `allocate_domain` raises | ||
| `ValueError` immediately and no backend allocation is registered. |
There was a problem hiding this comment.
Should we make it clear that this code must exist in orch_fn?
Summary
Adds a runtime path for allocating CommDomain windows from inside the orch function, alongside the existing static
Worker(comm_plan=...)path. Implements the design discussed in the RFC sketch — every "Option A" decision is reflected in code.Each
orch.allocate_domain(...)pays a collective cost (~hundreds of ms on HCCL: aclrtMalloc + IPC import; sim: shm_open + ftruncate + ready barrier), sized to actual need rather than worst-case-pre-declared. Static topology (Worker(comm_plan=...)) is unchanged and remains the right choice when sizes are known at init.Public surface
Implementation layers
comm_alloc_domain_windows/comm_release_domain_windowsin sim, HCCL, a5 stub. HCCL mirrorsalloc_windows_via_ipcbut scopes every barrier / IPC announce byallocation_id. P2P routes inherited from base alloc.(device_ctx, local_window_base).CTRL_ALLOC_DOMAIN=7/CTRL_RELEASE_DOMAIN=8. Variable payload via per-call POSIX shms (request + reply); two shm names staged back-to-back atMAILBOX_OFF_ARGS. WorkerThread holdsmailbox_mu_._Workercontrol_alloc_domain(worker_id, …)/control_release_domain(…)expose the per-chip dispatch to Python.Orchestratorallocate_domain(...)/release_domain(...). Fans out per-chip control calls on threads (one per participating chip) and joins._live_domainstracks active handles. Auto-released inrun'sfinally(orch-fn exception) and inclose(LIFO).CommDomainHandleis a context manager.Backwards compatibility
Worker(comm_plan=...)works unchanged. It now also serves a second purpose: declare a minimal "membership" domain sobootstrap_contextestablishes the HCCL/sim base communicator that dynamic allocations need as their root.Sim test coverage (new file)
tests/ut/py/test_worker/test_dynamic_alloc_sim.py(5 tests):allocation_idmonotonically incrementsValueErrorWorker.runreleases all live handles before propagating; subsequentlive_domainsis emptyPlus all existing tests still pass.
Test plan
pip install --no-build-isolation -e .rebuilds cleanpytest tests/ut/py/test_worker/→ 102 passed, 1 hw skiporch.allocate_domainand submits PTO-ISA allreduce against the allocated domainNon-goals
Orchestratorexposes it for v1)