Commit 8ac3dfe

Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in 0cd0b63 (fork-child FD scrub) and fe540d0 (pidfd-cancellable wait). The remaining piece: the parent-channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable: parent's teardown closes the handler-task socket deterministically. But fork-based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield; no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend
- inline comments on both sites explain the mutual-wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround

With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
1 parent c20b05e commit 8ac3dfe

1 file changed

Lines changed: 27 additions & 1 deletion

tractor/runtime/_runtime.py

@@ -1216,6 +1216,23 @@ async def cancel(
 ipc_server.cancel()
 await ipc_server.wait_for_shutdown()
 
+# Break the shield on the parent-channel
+# `process_messages` loop (started with `shield=True`
+# in `async_main` above). Required to avoid a
+# deadlock during teardown of fork-spawned subactors:
+# without this cancel, the loop parks waiting for
+# EOF on the parent channel, but the parent is
+# blocked on `os.waitpid()` for THIS actor's exit
+# — mutual wait. For exec-spawn backends the EOF
+# arrives naturally when the parent closes its
+# handler-task socket during its own teardown, but
+# in fork backends the shared-process-image makes
+# that delivery racy / not guaranteed. Explicit
+# cancel here gives us deterministic unwinding
+# regardless of backend.
+if self._parent_chan_cs is not None:
+    self._parent_chan_cs.cancel()
+
 # cancel all rpc tasks permanently
 if self._service_tn:
     self._service_tn.cancel_scope.cancel()
@@ -1736,7 +1753,16 @@ async def async_main(
 # start processing parent requests until our channel
 # server is 100% up and running.
 if actor._parent_chan:
-    await root_tn.start(
+    # Capture the shielded `loop_cs` for the
+    # parent-channel `process_messages` task so
+    # `Actor.cancel()` has a handle to break the
+    # shield during teardown — without this, the
+    # shielded loop would park on the parent chan
+    # indefinitely waiting for EOF that only arrives
+    # after the PARENT tears down, which under
+    # fork-based backends (e.g. `subint_forkserver`)
+    # it waits on THIS actor's exit — deadlock.
+    actor._parent_chan_cs = await root_tn.start(
         partial(
             _rpc.process_messages,
             chan=actor._parent_chan,
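
The comments in the diff call EOF delivery "racy / not guaranteed" under fork backends without spelling out the mechanism. One general POSIX behaviour that can produce this kind of symptom, offered here as an assumption and not as the confirmed cause in `subint_forkserver`: a socket peer sees EOF only once every descriptor referring to the other endpoint has been closed, so a copy inherited across `fork()` can withhold it. The sketch below uses `os.dup()` to stand in for such an inherited copy; `eof_seen` and `demo` are hypothetical helper names, and the code assumes a POSIX platform.

```python
import os
import socket


def eof_seen(sock):
    # Non-blocking probe: an empty read means the peer side is fully closed.
    sock.setblocking(False)
    try:
        return sock.recv(1) == b""
    except BlockingIOError:
        return False  # nothing to read yet, and no EOF either


def demo():
    parent_end, child_end = socket.socketpair()
    # Simulate a descriptor copy surviving elsewhere, e.g. inherited via fork.
    leaked_fd = os.dup(parent_end.fileno())
    parent_end.close()
    before = eof_seen(child_end)  # the leaked copy keeps the endpoint open
    os.close(leaked_fd)
    after = eof_seen(child_end)  # last reference gone, so EOF is delivered
    child_end.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

Whatever the precise cause in a given backend, the explicit cancel in the diff sidesteps it by not depending on EOF at all.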
