
main_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention #451

@goodboy

Description

Symptom: tests/test_advanced_streaming.py::test_dynamic_pub_sub
flakes in ~17% of runs when exercised under 3 parallel pytest streams ×
cpu_count - 2 (~22) forked subactors per test under
--spawn-backend=main_thread_forkserver. A single-stream run on an
otherwise-idle system reproduces at ~0%: the bug is contention-driven.

Failure signature when it fires

  • Test runtime stretches to ~12.13s (the trio.fail_after(12) cap
    landed in PR #447, "Working toward a 'subinterpreter-forkserver'
    spawning backend").
  • trio.TooSlowError is raised cleanly: no "Trio guest run got
    abandoned" warning, clean trio state, no GLOBAL_RUN_CONTEXT
    poison.
  • Parent-side faulthandler snapshot at t=9 shows trio's main thread
    parked in trio._core._io_epoll.get_events() at line 245
    (self._epoll.poll(timeout, max_events)). The cancel cascade has
    reached trio's I/O wait but epoll.poll never returns.

Diagnostic gap

Parent-side faulthandler dumps only show pytest's own threads (two:
the main thread plus an idle trio cache worker). The forked subactors
are separate
processes; their stacks aren't in the dump. To pin the root cause we
need a stackscope task-tree dump from BOTH the parent (to see WHICH
trio task is parked: subactor exit wait? IPC cancel ack?
_join_procs.wait()?) and each subactor (to see if the IPC cancel
arrived and where their teardown is parked).

Wiring is in place (enable_stack_on_sig=True in the test's
open_nursery() plus a side-monitor harness that broadcasts SIGUSR1 to
pytest and the subactors when wall-clock crosses 8s; a rough harness
sketch follows), but the hang doesn't reliably reproduce in light-CPU
conditions. The repro currently requires 3+ parallel pytest streams to
manifest, which is impractically heavy for ad-hoc debugging on a dev
machine.
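
For reference, a minimal sketch of the side-monitor harness shape.
Assumptions: the pytest pid is passed on the command line, subactors
are direct children discoverable via pgrep -P, and the 8s threshold
leaves margin under the fail_after(12) cap.

import os
import signal
import subprocess
import sys
import time

def broadcast_sigusr1(pytest_pid: int, after_s: float = 8.0) -> None:
    # wait until the suspected mid-cascade point, then signal everyone
    time.sleep(after_s)
    kids = subprocess.run(
        ['pgrep', '-P', str(pytest_pid)],
        capture_output=True, text=True,
    ).stdout.split()
    for pid in (pytest_pid, *map(int, kids)):
        try:
            # each process with enable_stack_on_sig wiring dumps its task tree
            os.kill(pid, signal.SIGUSR1)
        except ProcessLookupError:
            pass  # that subactor already exited

if __name__ == '__main__':
    broadcast_sigusr1(int(sys.argv[1]))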

Workaround in place (PR #447)

  • Per-test trio.fail_after(12) cap inside the test's main() (rough
    shape sketched after this list): when the cascade hangs, trio
    raises TooSlowError cleanly with no GLOBAL_RUN_CONTEXT poison.
  • The reap_subactors_per_test opt-in pytest fixture (also in PR
    #447) catches any leftover subactor zombies between tests so the
    failure stays attributed to the single test.
  • Net: Mode-A flakes are attributed to test_dynamic_pub_sub itself,
    with no cascade contamination of sibling tests in the module.
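
For context, the cap's shape is roughly the following. This is a
placeholder sketch, not the actual test_dynamic_pub_sub body; it
assumes the usual tractor.open_nursery() / start_actor() /
cancel_actor() flow.

import trio
import tractor

async def main():
    # hard deadline: a hung cancel-cascade surfaces as trio.TooSlowError
    # instead of wedging the whole pytest session
    with trio.fail_after(12):
        async with tractor.open_nursery() as an:
            portal = await an.start_actor(
                'sub',
                enable_modules=[__name__],
            )
            # ... exercise the dynamic pub/sub streams here ...
            await portal.cancel_actor()

def test_dynamic_pub_sub():
    trio.run(main)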

Likely investigation next steps

  • Build a contention-amplifier reproducer that doesn't require
    multiple pytest sessions (e.g. sleep-injecting inside
    _ForkedProc.wait() or in the parent's IPC cancel-broadcast path,
    or a synthetic fork-spawn-burst workload; a possible burst shape
    is sketched after this list).
  • With a reliable repro, capture stackscope task-trees from the
    parent and each subactor at the t=9 mid-cascade mark.
  • Confirm hypothesis: parent is parked on a pidfd_open via
    _ForkedProc.wait because some subactor's trio task didn't observe
    the parent's Portal.cancel_actor() IPC cancel.
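
A possible shape for the burst-style amplifier, entirely hypothetical:
worker counts, sleep durations and the forkserver context choice would
need tuning against the observed flake rate.

import multiprocessing as mp
import threading
import time

def _noop() -> None:
    # short-lived child: just enough work to keep the forkserver busy
    time.sleep(0.01)

def fork_burst(stop: threading.Event, workers: int = 16) -> None:
    ctx = mp.get_context('forkserver')
    while not stop.is_set():
        procs = [ctx.Process(target=_noop) for _ in range(workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

# usage: run fork_burst() in a daemon thread (e.g. from a pytest
# fixture) for the duration of the flaky test, then set the event to
# stop it.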

See also

  • PR #447 (the subint_forkserver_backend / main_thread_forkserver
    work where this Mode-A flake was first characterised).
  • Test workaround code:
    tests/test_advanced_streaming.py::test_dynamic_pub_sub
    module-level NOTE.
  • Sibling pre-existing hang issues: #449 (test_nested_multierrors
    cancel-cascade), #450 (msgspec PEP 684 follow-up).

Tracked from #379
(subint umbrella).

Update — additional evidence captured 2026-04-28

A new repro of this hang surfaced in
tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext]
during a TCP run (alongside an unrelated parallel UDS run for extra
system contention). New diagnostic data:

Native stack (via py-spy dump --pid <pytest> --native)

__internal_syscall_cancel  (libc)
__syscall_cancel
epoll_wait                 ← syscall confirmed
select_epoll_poll          (cpython)
get_events                 (trio/_core/_io_epoll.py:245)
run                        (trio/_core/_run.py:2415)
test_ext_types_over_ipc    (tests/msg/test_ext_types_msgspec.py:733)

Confirms the hang is in the epoll_wait syscall, the same trio I/O
loop park as the original test_dynamic_pub_sub capture. Now reproduced
across two distinct test files under the same backend, strengthening
the "Mode-A is real" signal.

ss -tnp shows 6 CLOSE_WAIT sockets to the registrar

CLOSE-WAIT 1 0 127.0.0.1:33750 127.0.0.1:8650 fd=14
CLOSE-WAIT 1 0 127.0.0.1:50614 127.0.0.1:8650 fd=26
CLOSE-WAIT 1 0 127.0.0.1:50698 127.0.0.1:8650 fd=29
CLOSE-WAIT 1 0 127.0.0.1:54884 127.0.0.1:8650 fd=17
CLOSE-WAIT 1 0 127.0.0.1:55004 127.0.0.1:8650 fd=20
CLOSE-WAIT 1 0 127.0.0.1:56252 127.0.0.1:8650 fd=23

:8650 is the test session's random reg_addr. All 6 are
client-side connections from pytest → registrar (server is also in
pytest), all in CLOSE_WAIT (peer sent FIN, local fd never closed).
Filed separately as
#452. Likely
contributes to but doesn't directly cause this hang — the trio loop
is parked on a different fd (probably awaiting a Started or Return
msg from the now-exited subactor).
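
For future triage runs, one way to spot the same leak programmatically
from inside the session. Assumes psutil is installed; the helper name
is hypothetical.

import psutil

def close_wait_conns(pid: int) -> list:
    # list this process's TCP connections stuck in CLOSE_WAIT
    proc = psutil.Process(pid)
    return [
        conn for conn in proc.connections(kind='tcp')
        if conn.status == psutil.CONN_CLOSE_WAIT
    ]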

Subactor is gone

pgrep -P <pytest-pid> -f tractor._child returned empty. The
subactor 'sub' that was spawned by an.start_actor('sub', ...) at
line 657 of the test has already exited — but the parent's trio task
is still parked awaiting an IPC message that will never come. Either:

  • subactor crashed mid-test before sending an awaited reply
  • subactor exited cleanly but _ForkedProc.wait (pidfd) didn't wake
    the parent's await
  • some other lifecycle race

Confirms the hypothesis from the original issue body: parent's trio
task is parked on something the subactor would have sent.
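
To make the pidfd hypothesis concrete, the wait mechanism in question
looks roughly like the following. This is an illustrative assumption,
not tractor's actual _ForkedProc.wait code: the parent opens a pidfd
for the forked child and parks in trio's I/O loop until the fd turns
readable on child exit. If that await never wakes even though the
child is gone, this is the path hiding the race.

import os
import trio

async def wait_for_child(pid: int) -> int:
    pidfd = os.pidfd_open(pid)  # Linux >= 5.3, CPython >= 3.9
    try:
        # the kernel marks a pidfd readable once the process exits
        await trio.lowlevel.wait_readable(pidfd)
    finally:
        os.close(pidfd)
    # reap the zombie and translate the wait status to an exit code
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)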

Stackscope gap remains

To pin down WHICH trio task is parked we still need a
stackscope-flavored task-tree dump. enable_stack_on_sig was not set
in this test (it's gated on debug_mode=True or an explicit
open_root_actor(enable_stack_on_sig=True)), so sending SIGUSR1 to the
hung pytest just kills it with rc=138 instead of producing a dump.
Workarounds being considered:

  • Run with --tpdb (debug_mode=True) — gives stackscope but also
    activates pdb machinery, may alter cancel timing.
  • Add a focused --enable-stackscope pytest CLI flag that installs
    the SIGUSR1 handler without full debug_mode side effects (rough
    conftest shape sketched after this list).
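
A rough conftest.py shape for the second option, using stdlib
faulthandler as a stand-in dumper (thread stacks only; swapping in
tractor's real stackscope handler is the actual follow-up, and the
flag name is an assumption).

import faulthandler
import signal

def pytest_addoption(parser):
    parser.addoption(
        '--enable-stackscope',
        action='store_true',
        help='dump stacks on SIGUSR1 instead of dying with rc=138',
    )

def pytest_configure(config):
    if config.getoption('--enable-stackscope'):
        # register a SIGUSR1 handler that writes all thread tracebacks
        # to stderr without enabling any debug_mode machinery
        faulthandler.register(signal.SIGUSR1, all_threads=True)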

See also

  • #452 — the
    CLOSE_WAIT fd leak observed alongside this hang.
