
Commit ab86f76

Refine subint_forkserver cancel-cascade diag

Third diagnostic pass on `test_nested_multierrors[subint_forkserver]` hang. Two prior hypotheses ruled out + a new, more specific deadlock shape identified.

Ruled out:

- **capture-pipe fill** (`-s` flag changes test): retested explicitly — `test_nested_multierrors` hangs identically with and without `-s`. The earlier observation was likely a competing pytest process I had running in another session holding registry state.
- **stuck peer-chan recv that cancel can't break**: pivot from the prior pass. With `handle_stream_from_peer` instrumented at ENTER / `except trio.Cancelled:` / finally: 40 ENTERs, ZERO `trio.Cancelled` hits. Cancel never reaches those tasks at all — the recvs are fine, nothing is telling them to stop.

Actual deadlock shape: multi-level mutual wait.

    root        blocks on spawner.wait()
    spawner     blocks on grandchild.wait()
    grandchild  blocks on errorer.wait()
    errorer     Actor.cancel() ran, but proc never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in root + 2 direct spawners. Those 3 have peer handlers stuck because their own `Actor.cancel()` never runs, which only runs when the enclosing `tractor.open_nursery()` exits, which waits on `_ForkedProc.wait()` for the child pidfd to signal, which only signals when the child process fully exits.

Refined question: **why does an errorer process not exit after its `Actor.cancel()` completes?**

Three hypotheses (unverified):

1. `_parent_chan_cs.cancel()` fires but the shielded loop's recv is stuck in a way cancel still can't break
2. `async_main`'s post-cancel unwind has other tasks in `root_tn` awaiting something that never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because `_child_target` never returns

Next-session probes (priority order):

1. instrument `_worker`'s fork-child branch — confirm whether `child_target()` returns / `os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see which await in teardown doesn't complete
3. compare under `trio_proc` backend at the equivalent level to spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
1 parent 458a35c commit ab86f76

1 file changed

Lines changed: 108 additions & 19 deletions


ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md

@@ -431,25 +431,114 @@

Context (unchanged): Either way, the sync-close hypothesis is **ruled out**. Reverted the experiment, restored the skip-mark on the test.

Removed (the old aside, superseded by the retest below):

> ### Aside: `-s` flag changes behavior for peer-intensive tests
>
> While exploring, noticed `tests/test_context_stream_semantics.py` under `--spawn-backend=subint_forkserver` hangs with pytest's default `--capture=fd` but passes with `-s` (`--capture=no`). Hypothesis (unverified): fork children inherit pytest's capture pipe for stdout/stderr (fds 1,2 — we preserve these in `_close_inherited_fds`). When subactor logging is verbose, the capture pipe buffer fills, writes block, child can't progress, deadlock.
>
> If confirmed, fix direction: redirect subactor stdout/stderr to `/dev/null` (or a file) in `_actor_child_main` so subactors don't hold pytest's capture pipe open. Not a blocker on the main peer-chan-loop investigation; deserves its own mini-tracker.

Added:

### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior

Tested explicitly: both with and without `-s`, the test hangs identically. So the capture-pipe-fill hypothesis is **ruled out** for this test.

The earlier `test_context_stream_semantics.py` `-s` observation was most likely caused by a competing pytest run in my session (confirmed via process list — my leftover pytest was alive at that time and could have been holding state on the default registry port).
## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER

**New diagnostic run** instrumented `handle_stream_from_peer` at ENTER / `except trio.Cancelled:` / finally, plus `Actor.cancel()` just before `self._parent_chan_cs.cancel()`. Result:

- **40 `handle_stream_from_peer` ENTERs**.
- **0 `except trio.Cancelled:` hits** — cancel never fires on any peer-handler.
- **35 finally hits** — those handlers exit via peer-initiated EOF (normal return), NOT cancel.
- **5 handlers never reach finally** — stuck forever.
- **`Actor.cancel()` fired in 12 PIDs** — but the PIDs with peer handlers that DIDN'T fire Actor.cancel are exactly **root + 2 direct spawners**. These 3 actors have peer handlers (for their own subactors) that stay stuck because **`Actor.cancel()` at these levels never runs**.
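
For reference, a minimal sketch of the kind of probe that produces these tallies: a log line at ENTER, in the `except trio.Cancelled:` arm, and in `finally`. The handler body and the `chan` interface here are stand-ins, not tractor's actual signature.

```python
# Illustrative only: a peer-handler instrumented the way the diagnostic
# run instrumented `handle_stream_from_peer`; names and the recv loop
# are stand-ins, not tractor's real code.
import os
import trio

async def handle_stream_from_peer(chan) -> None:
    pid = os.getpid()
    print(f'[{pid}] ENTER handle_stream_from_peer', flush=True)
    try:
        async for msg in chan:       # peer-chan recv loop (stand-in)
            ...                      # normal message dispatch
    except trio.Cancelled:
        # 0 hits across 40 ENTERs in the run above: cancel never arrives
        print(f'[{pid}] CANCELLED handle_stream_from_peer', flush=True)
        raise
    finally:
        # 35/40 handlers got here via peer EOF; the other 5 never did
        print(f'[{pid}] FINALLY handle_stream_from_peer', flush=True)
```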
### The actual deadlock shape

`Actor.cancel()` lives in `open_root_actor.__aexit__` / `async_main` teardown. That only runs when the enclosing `async with tractor.open_nursery()` exits. The nursery's `__aexit__` calls the backend `*_proc` spawn target's teardown, which does `soft_kill() → _ForkedProc.wait()` on its child PID. That wait is trio-cancellable via pidfd now (good) — but nothing CANCELS it because the outer scope only cancels when `Actor.cancel()` runs, which only runs when the nursery completes, which waits on the child.
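
For concreteness, here is a minimal sketch of a pidfd-based, trio-cancellable child wait. This assumes `_ForkedProc.wait()` has roughly this shape (it is not the real implementation). The await only completes once the child process has fully exited, which is exactly the edge every level below is parked on.

```python
# Minimal sketch of a pidfd-based, trio-cancellable child wait (assumed
# shape of `_ForkedProc.wait()`, not the actual tractor code). Linux-only.
import os
import trio

async def wait_for_child(pid: int) -> int:
    pidfd = os.pidfd_open(pid)      # Python >= 3.9, Linux >= 5.3
    try:
        # trio can cancel this await, but in the hang nothing ever does:
        # the thing that would cancel it is itself waiting on this exit.
        await trio.lowlevel.wait_readable(pidfd)
    finally:
        os.close(pidfd)
    # pidfd went readable => the child exited, so this reap won't block
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```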
It's a **multi-level mutual wait**:

```
root        blocks on spawner.wait()
spawner     blocks on grandchild.wait()
grandchild  blocks on errorer.wait()
errorer     Actor.cancel() ran, but process
            may not have fully exited yet
            (something in root_tn holding on?)
```

Each level waits for the level below. The bottom level (errorer) reaches Actor.cancel(), but its process may not fully exit — meaning its pidfd doesn't go readable, meaning the grandchild's waitpid doesn't return, meaning the grandchild's nursery doesn't unwind, etc. all the way up.
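
A toy model of that chain (not tractor's nursery teardown; it reuses the hypothetical `wait_for_child` sketch above): every ancestor is parked in the same kind of wait, so a leaf process that never exits freezes the whole tree.

```python
# Toy model of the wait chain above (illustrative only).
# `wait_for_child` is the hypothetical pidfd sketch earlier in this section.
async def actor_level(name: str, child_pid: int | None) -> None:
    try:
        ...  # this level's normal work: peer handlers, spawning, etc.
    finally:
        if child_pid is not None:
            # root -> spawner, spawner -> grandchild, grandchild -> errorer:
            # if the errorer never exits, this await never returns at any level
            await wait_for_child(child_pid)
```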
### Refined question

**Why does an errorer process not exit after its `Actor.cancel()` completes?**

Possibilities:

1. `_parent_chan_cs.cancel()` fires (shielded parent-chan loop unshielded), but the task is stuck INSIDE the shielded loop's recv in a way that cancel still can't break.
2. After `Actor.cancel()` returns, `async_main` still has other tasks in `root_tn` waiting for something that never arrives (e.g. outbound IPC reply delivery).
3. The `os._exit(rc)` in `_worker` (at `_subint_forkserver.py`) doesn't run because `_child_target` never returns (rough shape of this branch sketched below).
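
For hypothesis 3, this is roughly the branch in question as a hedged sketch: the real `_worker` in `_subint_forkserver.py` is not reproduced here, and the function and variable names are assumptions. The point is only that if `child_target()` never returns, `os._exit(rc)` is never reached and the child lingers with its pidfd never signalling.

```python
# Assumed shape of a forkserver worker's fork-child branch (illustrative,
# not the real `_worker`): hypothesis 3 is that execution never gets past
# the `child_target(...)` call, so `os._exit(rc)` never runs.
import os

def run_fork_child(child_target, *args) -> None:
    pid = os.fork()
    if pid == 0:
        # fork child: run the actor entrypoint, then hard-exit
        rc = 1
        try:
            rc = child_target(*args) or 0   # hypothesis 3: does this return?
        finally:
            os._exit(rc)                    # unreachable if the call above hangs
    # parent: return to the forkserver's serving loop
```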

Next-session candidate probes (in priority order):

1. **Instrument `_worker`'s fork-child branch** to confirm whether `child_target()` returns (and thus `os._exit(rc)` is reached) for errorer PIDs. If yes → process should die; if no → trace back into `_actor_child_main` / `_trio_main` / `async_main` to find the stuck spot. (One possible probe shape is sketched below.)
2. **Instrument `async_main`'s final unwind** to see which await in the teardown doesn't complete.
3. **Compare under `trio_proc` backend** at the same `_worker`-equivalent level to see where the flows diverge.
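
One possible shape for probe (1), hedged: the trace-file path, helper name, and hook points are assumptions, not existing code. Bracketing `child_target()` with appends to a per-PID trace file sidesteps pytest's capture entirely and survives a process that never exits cleanly.

```python
# Hypothetical probe for (1): write markers to a per-PID file around the
# fork-child entrypoint so we can tell, for each errorer PID, whether
# `child_target()` returned and whether `os._exit(rc)` was reached.
import os

def _mark(tag: str) -> None:
    # append-only and independent of pytest's captured stdout/stderr
    with open(f'/tmp/subint-trace-{os.getpid()}.log', 'a') as f:
        f.write(tag + '\n')

def traced_fork_child_body(child_target, *args) -> None:
    _mark('child_target: enter')
    rc = child_target(*args) or 0
    _mark(f'child_target: returned rc={rc}')
    _mark('reaching os._exit')
    os._exit(rc)
```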
### Rule-out: NOT a stuck peer-chan recv

Earlier hypothesis was that the 5 stuck peer-chan loops were blocked on a socket recv that cancel couldn't interrupt. This pass revealed the real cause: cancel **never reaches those tasks** because their owning actor's `Actor.cancel()` never runs. The recvs are fine — they're just parked because nothing is telling them to stop.

Trailing context (unchanged): the `## Stopgap (landed)` section that follows.
