
main_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention #451

@goodboy

Description

Symptom: tests/test_advanced_streaming.py::test_dynamic_pub_sub
flakes in ~17% of runs when exercised under 3 parallel pytest streams ×
cpu_count - 2 (~22) forked subactors per test under
--spawn-backend=main_thread_forkserver. A single-stream run on an
otherwise-idle system reproduces at ~0%: the bug is contention-driven.

Failure signature when it fires

  • Test runtime stretches to ~12.13s (the trio.fail_after(12) cap
    landed in PR #447, "Working toward a 'subinterpreter-forkserver'
    spawning backend").
  • trio.TooSlowError is raised cleanly: no "Trio guest run got
    abandoned" warning, clean trio state, no GLOBAL_RUN_CONTEXT
    poison.
  • Parent-side faulthandler snapshot at t=9 shows trio's main thread
    parked in trio._core._io_epoll.get_events() at line 245
    (self._epoll.poll(timeout, max_events)). The cancel cascade has
    reached trio's I/O wait but epoll.poll never returns.

Diagnostic gap

Parent-side faulthandler dumps only show pytest's own threads (two:
the main thread plus an idle trio cache worker). The forked subactors
are separate
processes; their stacks aren't in the dump. To pin the root cause we
need a stackscope task-tree dump from BOTH the parent (to see WHICH
trio task is parked: subactor exit wait? IPC cancel ack?
_join_procs.wait()?) and each subactor (to see if the IPC cancel
arrived and where their teardown is parked).

Wiring is in place (enable_stack_on_sig=True in the test's
open_nursery() plus a side-monitor harness that broadcasts SIGUSR1 to
pytest and the subactors when wall-clock crosses 8s; a rough harness
sketch follows), but the hang doesn't reliably reproduce in light-CPU
conditions. The repro currently requires 3+ parallel pytest streams to
manifest, which is impractically heavy for ad-hoc debugging on a dev
machine.
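
For reference, a minimal sketch of the side-monitor harness shape.
Assumptions: the pytest pid is passed on the command line, subactors
are direct children discoverable via pgrep -P, and the 8s threshold
leaves margin under the fail_after(12) cap.

import os
import signal
import subprocess
import sys
import time

def broadcast_sigusr1(pytest_pid: int, after_s: float = 8.0) -> None:
    # wait until the suspected mid-cascade point, then signal everyone
    time.sleep(after_s)
    kids = subprocess.run(
        ['pgrep', '-P', str(pytest_pid)],
        capture_output=True, text=True,
    ).stdout.split()
    for pid in (pytest_pid, *map(int, kids)):
        try:
            # each process with enable_stack_on_sig wiring dumps its task tree
            os.kill(pid, signal.SIGUSR1)
        except ProcessLookupError:
            pass  # that subactor already exited

if __name__ == '__main__':
    broadcast_sigusr1(int(sys.argv[1]))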

Workaround in place (PR #447)

  • Per-test trio.fail_after(12) cap inside the test's main() (rough
    shape sketched after this list): when the cascade hangs, trio
    raises TooSlowError cleanly with no GLOBAL_RUN_CONTEXT poison.
  • The reap_subactors_per_test opt-in pytest fixture (also in PR
    #447) catches any leftover subactor zombies between tests so the
    failure stays attributed to the single test.
  • Net: Mode-A flakes are attributed to test_dynamic_pub_sub itself,
    with no cascade contamination of sibling tests in the module.
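
For context, the cap's shape is roughly the following. This is a
placeholder sketch, not the actual test_dynamic_pub_sub body; it
assumes the usual tractor.open_nursery() / start_actor() /
cancel_actor() flow.

import trio
import tractor

async def main():
    # hard deadline: a hung cancel-cascade surfaces as trio.TooSlowError
    # instead of wedging the whole pytest session
    with trio.fail_after(12):
        async with tractor.open_nursery() as an:
            portal = await an.start_actor(
                'sub',
                enable_modules=[__name__],
            )
            # ... exercise the dynamic pub/sub streams here ...
            await portal.cancel_actor()

def test_dynamic_pub_sub():
    trio.run(main)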

Likely investigation next steps

  • Build a contention-amplifier reproducer that doesn't require
    multiple pytest sessions (e.g. sleep-injecting inside
    _ForkedProc.wait() or in the parent's IPC cancel-broadcast path,
    or a synthetic fork-spawn-burst workload; a possible burst shape
    is sketched after this list).
  • With a reliable repro, capture stackscope task-trees from the
    parent and each subactor at the t=9 mid-cascade mark.
  • Confirm hypothesis: parent is parked on a pidfd_open via
    _ForkedProc.wait because some subactor's trio task didn't observe
    the parent's Portal.cancel_actor() IPC cancel.
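
A possible shape for the burst-style amplifier, entirely hypothetical:
worker counts, sleep durations and the forkserver context choice would
need tuning against the observed flake rate.

import multiprocessing as mp
import threading
import time

def _noop() -> None:
    # short-lived child: just enough work to keep the forkserver busy
    time.sleep(0.01)

def fork_burst(stop: threading.Event, workers: int = 16) -> None:
    ctx = mp.get_context('forkserver')
    while not stop.is_set():
        procs = [ctx.Process(target=_noop) for _ in range(workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

# usage: run fork_burst() in a daemon thread (e.g. from a pytest
# fixture) for the duration of the flaky test, then set the event to
# stop it.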

See also

  • PR #447 (the subint_forkserver_backend / main_thread_forkserver
    work where this Mode-A flake was first characterised).
  • Test workaround code:
    tests/test_advanced_streaming.py::test_dynamic_pub_sub
    module-level NOTE.
  • Sibling pre-existing hang issues: #449 (test_nested_multierrors
    cancel-cascade), #450 (msgspec PEP 684 follow-up).

Tracked from #379
(subint umbrella).

Update — additional evidence captured 2026-04-28

A new repro of this hang surfaced in
tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext]
during a TCP run (alongside an unrelated parallel UDS run for extra
system contention). New diagnostic data:

Native stack (via py-spy dump --pid <pytest> --native)

__internal_syscall_cancel  (libc)
__syscall_cancel
epoll_wait                 ← syscall confirmed
select_epoll_poll          (cpython)
get_events                 (trio/_core/_io_epoll.py:245)
run                        (trio/_core/_run.py:2415)
test_ext_types_over_ipc    (tests/msg/test_ext_types_msgspec.py:733)

Confirms the hang is in the epoll_wait syscall, the same trio I/O
loop park as the original test_dynamic_pub_sub capture. Now reproduced
across two distinct test files under the same backend, strengthening
the "Mode-A is real" signal.

ss -tnp shows 6 CLOSE_WAIT sockets to the registrar

CLOSE-WAIT 1 0 127.0.0.1:33750 127.0.0.1:8650 fd=14
CLOSE-WAIT 1 0 127.0.0.1:50614 127.0.0.1:8650 fd=26
CLOSE-WAIT 1 0 127.0.0.1:50698 127.0.0.1:8650 fd=29
CLOSE-WAIT 1 0 127.0.0.1:54884 127.0.0.1:8650 fd=17
CLOSE-WAIT 1 0 127.0.0.1:55004 127.0.0.1:8650 fd=20
CLOSE-WAIT 1 0 127.0.0.1:56252 127.0.0.1:8650 fd=23

:8650 is the test session's random reg_addr. All 6 are
client-side connections from pytest → registrar (server is also in
pytest), all in CLOSE_WAIT (peer sent FIN, local fd never closed).
Filed separately as
#452. Likely
contributes to but doesn't directly cause this hang — the trio loop
is parked on a different fd (probably awaiting a Started or Return
msg from the now-exited subactor).
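
For future triage runs, one way to spot the same leak programmatically
from inside the session. Assumes psutil is installed; the helper name
is hypothetical.

import psutil

def close_wait_conns(pid: int) -> list:
    # list this process's TCP connections stuck in CLOSE_WAIT
    proc = psutil.Process(pid)
    return [
        conn for conn in proc.connections(kind='tcp')
        if conn.status == psutil.CONN_CLOSE_WAIT
    ]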

Subactor is gone

pgrep -P <pytest-pid> -f tractor._child returned empty. The
subactor 'sub' that was spawned by an.start_actor('sub', ...) at
line 657 of the test has already exited — but the parent's trio task
is still parked awaiting an IPC message that will never come. Either:

  • subactor crashed mid-test before sending an awaited reply
  • subactor exited cleanly but _ForkedProc.wait (pidfd) didn't wake
    the parent's await
  • some other lifecycle race

Confirms the hypothesis from the original issue body: parent's trio
task is parked on something the subactor would have sent.
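
To make the pidfd hypothesis concrete, the wait mechanism in question
looks roughly like the following. This is an illustrative assumption,
not tractor's actual _ForkedProc.wait code: the parent opens a pidfd
for the forked child and parks in trio's I/O loop until the fd turns
readable on child exit. If that await never wakes even though the
child is gone, this is the path hiding the race.

import os
import trio

async def wait_for_child(pid: int) -> int:
    pidfd = os.pidfd_open(pid)  # Linux >= 5.3, CPython >= 3.9
    try:
        # the kernel marks a pidfd readable once the process exits
        await trio.lowlevel.wait_readable(pidfd)
    finally:
        os.close(pidfd)
    # reap the zombie and translate the wait status to an exit code
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)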

Stackscope gap remains

To pin down WHICH trio task is parked we still need a
stackscope-flavored task-tree dump. enable_stack_on_sig was not set
in this test (it's gated on debug_mode=True or an explicit
open_root_actor(enable_stack_on_sig=True)), so sending SIGUSR1 to the
hung pytest just kills it with rc=138 instead of producing a dump.
Workarounds being considered:

  • Run with --tpdb (debug_mode=True) — gives stackscope but also
    activates pdb machinery, may alter cancel timing.
  • Add a focused --enable-stackscope pytest CLI flag that installs
    the SIGUSR1 handler without full debug_mode side effects (rough
    conftest shape sketched after this list).
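
A rough conftest.py shape for the second option, using stdlib
faulthandler as a stand-in dumper (thread stacks only; swapping in
tractor's real stackscope handler is the actual follow-up, and the
flag name is an assumption).

import faulthandler
import signal

def pytest_addoption(parser):
    parser.addoption(
        '--enable-stackscope',
        action='store_true',
        help='dump stacks on SIGUSR1 instead of dying with rc=138',
    )

def pytest_configure(config):
    if config.getoption('--enable-stackscope'):
        # register a SIGUSR1 handler that writes all thread tracebacks
        # to stderr without enabling any debug_mode machinery
        faulthandler.register(signal.SIGUSR1, all_threads=True)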

See also

  • #452 — the
    CLOSE_WAIT fd leak observed alongside this hang.
