Symptom: tests/test_advanced_streaming.py::test_dynamic_pub_sub flakes in ~17% of runs when exercised under 3 parallel pytest streams × cpu_count - 2 (~22) forked subactors per test under --spawn-backend=main_thread_forkserver. A single-stream run on an otherwise-idle system reproduces ~0% of the time — the bug is contention-driven.
Failure signature when it fires
- trio.fail_after(12) cap (landed in PR #447, "Working toward a 'subinterpreter-forkserver' spawning backend") bounds the hang.
- trio.TooSlowError raised cleanly, no "Trio guest run got abandoned" warning (clean trio state, no GLOBAL_RUN_CONTEXT poison).
- Parent-side faulthandler snapshot at t=9 shows trio's main thread parked in trio._core._io_epoll.get_events() line 245 (= self._epoll.poll(timeout, max_events)). The cancel cascade has reached the trio I/O wait but never returns from epoll.poll.
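For reference, a minimal sketch of how such a timed parent-side snapshot can be armed with stdlib faulthandler alone — an assumption about the mechanism, not necessarily what the harness in PR #447 actually does:

```python
# Sketch only: arm a parent-side stack snapshot at t=9s using stdlib
# faulthandler; the real harness may capture its dump differently.
import faulthandler
import sys

def arm_hang_snapshot(timeout: float = 9.0) -> None:
    # If the process is still running `timeout` seconds from now, dump every
    # thread's Python stack to stderr (without exiting the process).
    faulthandler.dump_traceback_later(timeout, exit=False, file=sys.stderr)

def cancel_hang_snapshot() -> None:
    # Call on clean test completion to disarm the pending dump.
    faulthandler.cancel_dump_traceback_later()
```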
Diagnostic gap
Parent-side faulthandler dumps only show pytest's own threads (2 —
main + idle trio cache worker). The forked subactors are separate
processes; their stacks aren't in the dump. To pin the root cause we
need a stackscope task-tree dump from BOTH the parent (to see WHICH
trio task is parked: subactor exit wait? IPC cancel ack? _join_procs.wait()?) and each subactor (to see if the IPC cancel
arrived and where their teardown is parked).
Wiring is in place — enable_stack_on_sig=True in the test's open_nursery() + a side-monitor harness that broadcasts SIGUSR1 to
pytest + subactors when wall-clock crosses 8s — but the hang doesn't
reliably reproduce in light-CPU conditions. Repro currently requires
3+ parallel pytest streams to manifest, which is impractically heavy
for ad-hoc debugging on a dev machine.
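A rough sketch of the kind of side-monitor described above, assuming psutil for child-process discovery; the helper name and threshold are illustrative, not the actual harness API:

```python
# Hypothetical side-monitor: once wall-clock crosses `delay` seconds, broadcast
# SIGUSR1 to the pytest process and any live subactor children.
import os
import signal
import threading

import psutil  # assumption: psutil is available for child discovery

def broadcast_sigusr1_after(delay: float = 8.0) -> threading.Timer:
    def _fire() -> None:
        me = psutil.Process(os.getpid())
        for proc in [me] + me.children(recursive=True):
            try:
                proc.send_signal(signal.SIGUSR1)
            except psutil.NoSuchProcess:
                pass  # subactor already exited; nothing to dump
    timer = threading.Timer(delay, _fire)
    timer.daemon = True
    timer.start()
    return timer
```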
Workaround in place (PR #447)
- trio.fail_after(12) cap inside the test's main() — when the cascade hangs, trio raises TooSlowError cleanly, no GLOBAL_RUN_CONTEXT poison.
- reap_subactors_per_test opt-in pytest fixture (also in PR #447) catches any leftover subactor zombies between tests so the failure stays attributed to the single test.
- Net: Mode-A flakes are fail-attributed to test_dynamic_pub_sub itself, with no cascade contamination of sibling tests in the module.
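For context, a minimal sketch of the fail_after(12) cap pattern with a simplified placeholder body; the actual change lives in PR #447 and may differ in detail:

```python
import trio

async def main() -> None:
    # Cap the whole cascade: if teardown wedges, trio raises TooSlowError
    # cleanly instead of hanging the pytest session indefinitely.
    with trio.fail_after(12):
        async with trio.open_nursery() as n:
            # ... spawn subactors / run the dynamic pub-sub body here ...
            n.start_soon(trio.sleep, 0.1)  # placeholder for the real test body

if __name__ == "__main__":
    trio.run(main)
```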
Likely investigation next steps
- Build a contention-amplifier reproducer that doesn't require multiple pytest sessions (e.g. sleep-injecting inside _ForkedProc.wait() or in the parent's IPC cancel-broadcast path, or a synthetic fork-spawn-burst workload).
- With a reliable repro, capture stackscope task-trees from the parent and each subactor at the t=9 mid-cascade mark.
- Confirm the hypothesis: the parent is parked on a pidfd_open via _ForkedProc.wait because some subactor's trio task didn't observe the parent's Portal.cancel_actor() IPC cancel (see the sketch of that wait path after this list).
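A minimal sketch of the suspected wait path, assuming _ForkedProc.wait boils down to parking on a pidfd in trio's epoll loop — the actual tractor internals may differ; this only illustrates the mechanism under suspicion:

```python
# Hypothetical pidfd-based child wait: the parent task blocks in trio's epoll
# loop until the child's pidfd becomes readable (i.e. the child exits). If
# that wakeup is lost or raced, the parent sits in epoll_wait exactly as
# observed in the faulthandler snapshot.
import os
import trio

async def wait_for_child(pid: int) -> int:
    pidfd = os.pidfd_open(pid)  # Linux >= 5.3, CPython >= 3.9
    try:
        await trio.lowlevel.wait_readable(pidfd)
    finally:
        os.close(pidfd)
    # Reap the zombie and return its raw exit status.
    _, status = os.waitpid(pid, os.WNOHANG)
    return status
```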
See also
- PR #447 — the subint_forkserver_backend → main_thread_forkserver work where this Mode-A flake was first characterised.
- Test workaround code: tests/test_advanced_streaming.py::test_dynamic_pub_sub module-level NOTE.
- Sibling pre-existing hang issues: #449 (test_nested_multierrors cancel-cascade), #450 (msgspec PEP 684 follow-up).
- Tracked from #379 (subint umbrella).
Update — additional evidence captured 2026-04-28
A new repro of this hang surfaced in tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext] during a TCP run (alongside an unrelated parallel UDS run for extra system contention). New diagnostic data:
Native stack (via py-spy dump --pid <pytest> --native)
- Confirms the hang is in the epoll_wait syscall — the same trio I/O loop park as the original test_dynamic_pub_sub capture. Now reproduced across two distinct test files under the same backend, strengthening the "Mode-A is real" signal.
ss -tnp shows 6 CLOSE_WAIT sockets to the registrar
- :8650 is the test session's random reg_addr. All 6 are client-side connections from pytest → registrar (the server is also in pytest), all in CLOSE_WAIT (peer sent FIN, local fd never closed).
- Filed separately as #452. Likely contributes to but doesn't directly cause this hang — the trio loop is parked on a different fd (probably awaiting a Started or Return msg from the now-exited subactor).
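A hedged sketch of how such a leak could be asserted against between tests, assuming psutil is available and treating the registrar port (8650 in this run) as a parameter:

```python
# Illustrative only: count CLOSE_WAIT sockets from the pytest process to the
# registrar port. A real fixture would read reg_addr from the test session.
import psutil

def close_wait_conns_to(port: int) -> list:
    proc = psutil.Process()  # the pytest process itself
    return [
        c for c in proc.connections(kind="tcp")
        if c.status == psutil.CONN_CLOSE_WAIT
        and c.raddr and c.raddr.port == port
    ]

leaked = close_wait_conns_to(8650)
assert not leaked, f"{len(leaked)} CLOSE_WAIT sockets leaked to the registrar"
```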
Subactor is gone
- pgrep -P <pytest-pid> -f tractor._child returned empty. The subactor 'sub' that was spawned by an.start_actor('sub', ...) at line 657 of the test has already exited — but the parent's trio task is still parked awaiting an IPC message that will never come. Either:
  - the subactor crashed mid-test before sending an awaited reply,
  - the subactor exited cleanly but _ForkedProc.wait (pidfd) didn't wake the parent's await, or
  - some other lifecycle race.
- Confirms the hypothesis from the original issue body: the parent's trio task is parked on something the subactor would have sent.
Stackscope gap remains
To pin down WHICH trio task is parked we still need a stackscope-flavored task-tree dump. enable_stack_on_sig was not set in this test (it's gated on debug_mode=True or an explicit open_root_actor(enable_stack_on_sig=True)), so SIGUSR1 to the hung pytest just makes it exit with rc=138 (128 + SIGUSR1's default action) instead of dumping. Workarounds being considered (a conftest sketch of the second option follows this list):
- Run with --tpdb (debug_mode=True) — gives stackscope but also activates the pdb machinery, which may alter cancel timing.
- Add a focused --enable-stackscope pytest CLI flag that installs the SIGUSR1 handler without full debug_mode side effects.
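A rough conftest.py sketch of that second option; here the SIGUSR1 handler only dumps Python thread stacks via faulthandler — a real implementation would hook the stackscope/tractor task-tree dump instead:

```python
# Hypothetical --enable-stackscope flag: install a SIGUSR1 dump handler
# without turning on full debug_mode.
import faulthandler
import signal
import sys

def pytest_addoption(parser):
    parser.addoption(
        "--enable-stackscope",
        action="store_true",
        default=False,
        help="install a SIGUSR1 stack-dump handler without debug_mode",
    )

def pytest_configure(config):
    if not config.getoption("--enable-stackscope"):
        return

    def _dump(signum, frame):
        # Dump all thread stacks to stderr; safe from a signal handler.
        # A full version would emit a stackscope task tree here instead.
        faulthandler.dump_traceback(file=sys.stderr)

    signal.signal(signal.SIGUSR1, _dump)
```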
See also
#452 — the
CLOSE_WAIT fd leak observed alongside this hang.