Commit a72deef
Refine subint_forkserver orphan-SIGINT diagnosis
Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is
**not** "trio can't install a handler on a non-main thread" (the
original hypothesis from the `child_sigint` scaffold commit). On
py3.14:

- `threading.current_thread() is threading.main_thread()` IS True
  post-fork — CPython re-designates the fork-inheriting thread as
  "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the subactor
  (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread

But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism
(which turns SIGINT into an epoll-wake) isn't firing. So the
`except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.

Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
  (+385 LOC): full writeup — TL;DR, symptom reproducer, the
  "intentional cancel path" the bug defeats, diagnostic evidence
  (`faulthandler` output + `getsignal` probe), ruled-out hypotheses
  (non-main-thread issue, wakeup-fd inheritance,
  KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` +
  test docstring rewritten to match the refined understanding — old
  wording blamed the non-main-thread path, new wording points at
  the `epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's `child_sigint='trio'`
  bullet updated: now notes trio's handler is already correctly
  installed, so the flag may end up a no-op / doc-only mode once
  the real root cause is fixed

Closing the gap aligns with existing design intent (make the
already-designed "KBI-as-OS-cancel" behavior actually fire), not a
new feature.

(this patch was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
1 parent dcd5c1f commit a72deef

3 files changed: 422 additions & 22 deletions

New file: `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (385 additions & 0 deletions)

@@ -0,0 +1,385 @@
# `subint_forkserver` backend: orphaned-subactor SIGINT wedged in `epoll_wait`

Follow-up to the Phase C `subint_forkserver` spawn-backend
PR (see `tractor.spawn._subint_forkserver`, issue #379).
Surfaced by the xfail'd
`tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`.

Related-but-distinct from
`subint_cancel_delivery_hang_issue.md` (orphaned-channel
park AFTER subint teardown) and
`subint_sigint_starvation_issue.md` (GIL-starvation,
SIGINT never delivered): here the SIGINT IS delivered,
trio's handler IS installed, but trio's event loop never
wakes — so the KBI-at-checkpoint → `_trio_main` catch path
(which is the runtime's *intentional* OS-cancel design)
never fires.

## TL;DR

When a `subint_forkserver`-spawned subactor is orphaned
(parent `SIGKILL`'d, no IPC cancel path available) and then
externally `SIGINT`'d, the subactor hangs in
`trio/_core/_io_epoll.py::get_events` (`epoll_wait`)
indefinitely — even though:

1. `threading.current_thread() is threading.main_thread()`
   post-fork (CPython 3.14 re-designates correctly; a
   standalone check is sketched at the end of this section).
2. Trio's SIGINT handler IS installed in the subactor
   (`signal.getsignal(SIGINT)` returns
   `<function KIManager.install.<locals>.handler at 0x...>`).
3. The kernel does deliver SIGINT — the signal arrives at
   the only thread in the process (the fork-inherited
   worker, which IS now "main" per Python).

Yet `epoll_wait` does not return. Trio's wakeup-fd mechanism
— the machinery that turns SIGINT into an epoll-wake — is
somehow not firing the wakeup. Until that's fixed, the
intentional "KBI-as-OS-cancel" path in
`tractor/spawn/_entry.py::_trio_main:164` is unreachable
for forkserver-spawned subactors whose parent dies.

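A minimal standalone check of item (1), assuming nothing but
the stdlib (`_fork_from_thread` is an illustrative name):

```python
import os, threading

def _fork_from_thread() -> None:
    pid = os.fork()
    if pid == 0:
        # CPython's post-fork fixup re-designates the forking
        # thread as the child's main thread
        ok = threading.current_thread() is threading.main_thread()
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    assert os.waitstatus_to_exitcode(status) == 0

t = threading.Thread(target=_fork_from_thread, name='fork-worker')
t.start()
t.join()
```
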
## Symptom

Test: `tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`
(currently marked `@pytest.mark.xfail(strict=True)`).

1. Harness subprocess brings up a tractor root actor +
   one `run_in_actor(_sleep_forever)` subactor via
   `try_set_start_method('subint_forkserver')`.
2. Harness prints `CHILD_PID` (subactor) and
   `PARENT_READY` (root actor) markers to stdout.
3. Test does `os.kill(parent_pid, SIGKILL)` + `proc.wait()`
   to fully reap the root-actor harness.
4. Child (now reparented to pid 1) is still alive.
5. Test does `os.kill(child_pid, SIGINT)` and polls
   `os.kill(child_pid, 0)` for up to 10s (poll helper
   sketched below).
6. **Observed**: the child is still alive at deadline —
   SIGINT did not unwedge the trio loop.

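A hedged sketch of the step-(5) liveness poll
(`wait_for_exit` is an illustrative name, not a test-suite
helper):

```python
import os, time

def wait_for_exit(pid: int, timeout: float = 10.0) -> bool:
    # signal 0 probes liveness: ProcessLookupError once gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)
        except ProcessLookupError:
            return True   # exited
        time.sleep(0.1)
    return False          # still alive at deadline
```
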
## What the "intentional" cancel path IS

`tractor/spawn/_entry.py::_trio_main:157-186`

```python
try:
    if infect_asyncio:
        actor._infected_aio = True
        run_as_asyncio_guest(trio_main)
    else:
        trio.run(trio_main)

except KeyboardInterrupt:
    logmeth = log.cancel
    exit_status: str = (
        'Actor received KBI (aka an OS-cancel)\n'
        ...
    )
```

The "KBI == OS-cancel" mapping IS the runtime's
deliberate, documented design. An OS-level SIGINT should
flow as: kernel → trio handler → KBI at trio checkpoint
→ unwinds `async_main` → surfaces at `_trio_main`'s
`except KeyboardInterrupt:` → `log.cancel` + clean `rc=0`.

**So fixing this hang is not "add a new SIGINT behavior" —
it's "make the existing designed behavior actually fire in
this backend config".** That's why option (B) ("fix root
cause") is aligned with existing design intent, not a
scope expansion.

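A minimal sketch of that flow in plain trio (no tractor);
`signal.raise_signal` stands in for the external `kill -INT`:

```python
import signal, trio

async def main() -> None:
    # deliver SIGINT to this process; trio's KIManager handler
    # converts it into a KeyboardInterrupt in the main task
    signal.raise_signal(signal.SIGINT)
    await trio.sleep_forever()  # interrupted before/at the checkpoint

try:
    trio.run(main)
except KeyboardInterrupt:
    print('KBI surfaced outside trio.run(): the OS-cancel path')
```
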
## Evidence

### Positive control: standalone fork-from-worker + `trio.run(sleep_forever)` + SIGINT WORKS

```python
import os, signal, time, trio
from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread, wait_child,
)

def child_target() -> int:
    async def _main():
        try:
            await trio.sleep_forever()
        except KeyboardInterrupt:
            print('CHILD: caught KBI — trio SIGINT works!')
            return
    trio.run(_main)
    return 0

pid = fork_from_worker_thread(child_target, thread_name='trio-sigint-test')
time.sleep(1.0)
os.kill(pid, signal.SIGINT)
wait_child(pid)
```

Result: `CHILD: caught KBI — trio SIGINT works!` + clean
exit. So the fork-child + trio signal plumbing IS healthy
in isolation. The hang appears only with the full tractor
subactor runtime on top.

### Negative test: full tractor subactor + orphan-SIGINT

Equivalent to the xfail test. Traceback dump via
`faulthandler.register(SIGUSR1, all_threads=True)` at the
stuck moment:

```
Current thread 0x00007... [subint-forkserv] (most recent call first):
  File ".../trio/_core/_io_epoll.py", line 245 in get_events
  File ".../trio/_core/_run.py", line 2415 in run
  File "tractor/spawn/_entry.py", line 162 in _trio_main
  File "tractor/_child.py", line 72 in _actor_child_main
  File "tractor/spawn/_subint_forkserver.py", line 650 in _child_target
  File "tractor/spawn/_subint_forkserver.py", line 308 in _worker
  File ".../threading.py", line 1024 in run
```

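How such a dump can be captured (a sketch, not the harness
code): arm `faulthandler` early in the child, then send
`SIGUSR1` from a shell. `faulthandler` writes the traceback
from a C-level handler, which is why it fires even while the
Python-level SIGINT machinery stays wedged:

```python
import faulthandler, signal

# dump all thread stacks to stderr whenever SIGUSR1 arrives
faulthandler.register(signal.SIGUSR1, all_threads=True)
```
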
### Thread + signal-mask inventory of the stuck subactor

Single thread (`tid == pid`, comm `'subint-forkserv'`,
which IS `threading.main_thread()` post-fork):

```
SigBlk: 0000000000000000   # nothing blocked
SigIgn: 0000000001001000   # SIGPIPE etc (Python defaults)
SigCgt: 0000000108000202   # bit 1 = SIGINT caught
```

Bit 1 set in `SigCgt` → a SIGINT handler IS installed. So
trio's handler IS in place at the kernel level — not a
"handler missing" situation.

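A hedged helper for decoding that bitmask from
`/proc/<pid>/status` (`sigint_caught` is an illustrative
name, not harness code):

```python
import signal

def sigint_caught(pid: int) -> bool:
    # SigCgt is a hex mask; bit (signum - 1) is set when a
    # handler is installed for that signal (SIGINT == 2 -> bit 1)
    with open(f'/proc/{pid}/status') as f:
        for line in f:
            if line.startswith('SigCgt:'):
                mask = int(line.split()[1], 16)
                return bool(mask & (1 << (signal.SIGINT - 1)))
    return False
```
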
### Handler identity

Inside the subactor's RPC body, `signal.getsignal(SIGINT)`
returns `<function KIManager.install.<locals>.handler at
0x...>` — trio's own `KIManager` handler. tractor's only
SIGINT touches are `signal.getsignal()` *reads* (to stash
into `debug.DebugStatus._trio_handler`); nothing writes
over trio's handler outside the debug-REPL shielding path
(`devx/debug/_tty_lock.py::shield_sigint`), which isn't
engaged here (no debug_mode).

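The probe itself, as a re-runnable snippet (placement inside
an RPC body is assumed; stdlib only):

```python
import signal

h = signal.getsignal(signal.SIGINT)
# expect: 'KIManager.install.<locals>.handler'
assert 'KIManager' in getattr(h, '__qualname__', repr(h)), h
```
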
## Ruled out

- **GIL starvation / signal-pipe-full** (class A,
  `subint_sigint_starvation_issue.md`): the subactor runs
  on its own GIL (separate OS process), not sharing with
  the parent → no cross-process GIL contention. And the
  `/proc` signal-mask inventory above shows SIGINT IS
  caught, not left pending.
- **Orphaned channel park** (`subint_cancel_delivery_hang_issue.md`):
  different failure mode — that one has trio iterating
  normally and getting wedged on an orphaned
  `chan.recv()` AFTER teardown. Here trio's event loop
  itself never wakes.
- **Tractor explicitly catching + swallowing KBI**:
  greppable — the one `except KeyboardInterrupt:` in the
  runtime is the INTENTIONAL cancel-path catch at
  `_trio_main:164`. `async_main` uses `except Exception`
  (not `BaseException`), so KBI should propagate through
  cleanly if it ever fires.
- **Missing `signal.set_wakeup_fd` (main-thread
  restriction)**: post-fork, the fork-worker thread IS
  `threading.main_thread()`, so trio's main-thread check
  passes and its wakeup-fd install should succeed.

## Root cause hypothesis (unverified)

The SIGINT handler fires but trio's wakeup-fd write does
not wake `epoll_wait`. Candidate causes, ranked by
plausibility:

1. **Wakeup-fd lifecycle race around tractor IPC setup.**
   `async_main` spins up an IPC server + `process_messages`
   loops early. Somewhere in that path the wakeup-fd that
   trio registered with its epoll instance may be
   closed/replaced/clobbered, so subsequent SIGINT writes
   land on an fd that's no longer in the epoll set.
   Evidence needed: compare the
   `signal.set_wakeup_fd(-1)` return value inside a
   post-tractor-bringup RPC body vs. a pre-bringup
   equivalent (probe sketched after this list). If they
   differ, that's it.
2. **Shielded cancel scope around `process_messages`.**
   The RPC message loop is likely wrapped in a trio cancel
   scope; if that scope is `shield=True` at any outer
   layer, KBI scheduled at a checkpoint could be absorbed
   by the shield and never bubble out to `_trio_main`.
3. **Pre-fork wakeup-fd inheritance.** trio in the PARENT
   process registered a wakeup-fd with its own epoll. The
   child inherits the fd number but not the parent's
   epoll instance — if tractor/trio re-uses the parent's
   stale fd number anywhere, writes would go to a no-op
   fd. (This is the least likely — `trio.run()` on the
   child calls `KIManager.install`, which should install a
   fresh wakeup-fd from scratch.)

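A minimal sketch of that probe (`current_wakeup_fd` is an
illustrative name; `signal.set_wakeup_fd` is both the only
getter and the setter, so the probe must restore what it
reads, and it must run on the main thread):

```python
import signal

def current_wakeup_fd() -> int:
    # returns the previously-set wakeup fd (-1 if none)
    fd = signal.set_wakeup_fd(-1)
    if fd != -1:
        signal.set_wakeup_fd(fd)  # restore so trio keeps working
    return fd
```

Call it in a pre-bringup hook and again inside a
post-bringup RPC body; a changed (or `-1`) post-bringup
value would confirm the clobber.
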
## Cross-backend scope question

**Untested**: does the same orphan-SIGINT hang reproduce
against the `trio_proc` backend (stock subprocess + exec)?
If yes → pre-existing tractor bug, independent of
`subint_forkserver`. If no → something specific to the
fork-from-worker path (e.g. inherited fds, mid-epoll-setup
interference).

**Quick repro for trio_proc**:

```python
# save as /tmp/trio_proc_orphan_sigint_repro.py
import os, sys, signal, time
import subprocess as sp

SCRIPT = '''
import os, trio, tractor

async def _sleep_forever():
    print(f"CHILD_PID={os.getpid()}", flush=True)
    await trio.sleep_forever()

async def _main():
    async with (
        tractor.open_root_actor(registry_addrs=[("127.0.0.1", 12350)]),
        tractor.open_nursery() as an,
    ):
        await an.run_in_actor(_sleep_forever, name="sf-child")
        print(f"PARENT_READY={os.getpid()}", flush=True)
        await trio.sleep_forever()

trio.run(_main)
'''

proc = sp.Popen(
    [sys.executable, '-c', SCRIPT],
    stdout=sp.PIPE, stderr=sp.STDOUT,
)
# parse CHILD_PID + PARENT_READY off proc.stdout ...
# SIGKILL parent, SIGINT child, poll.
```

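A hedged completion of those last two comment steps,
mirroring the harness logic of the main reproducer below:

```python
child_pid = parent_pid = None
for raw in proc.stdout:
    line = raw.decode()
    if line.startswith('CHILD_PID='):
        child_pid = int(line.split('=', 1)[1])
    elif line.startswith('PARENT_READY='):
        parent_pid = int(line.split('=', 1)[1])
    if child_pid and parent_pid:
        break

os.kill(parent_pid, signal.SIGKILL)  # orphan the child
proc.wait()
os.kill(child_pid, signal.SIGINT)

for _ in range(100):  # ~10s poll
    try:
        os.kill(child_pid, 0)
        time.sleep(0.1)
    except ProcessLookupError:
        print('trio_proc child exited: hang is subint_forkserver-specific')
        sys.exit(0)
print('trio_proc child hung too: broader, backend-independent bug')
sys.exit(1)
```
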
If that hangs too, open a broader issue; if not, this is
`subint_forkserver`-specific (likely fd-inheritance-related).

## Why this is ours to fix (not CPython's)

- Signal IS delivered (the `SigCgt` bitmask confirms).
- Handler IS installed (trio's `KIManager`).
- Thread identity is correct post-fork.
- `_trio_main` already has the intentional KBI→clean-exit
  path waiting to fire.

Every CPython-level precondition is met. Something in
tractor's runtime, or in trio's integration with it, is
breaking the SIGINT→wakeup→event-loop-wake pipeline.

## Possible fix directions

1. **Audit the wakeup-fd across tractor's IPC bringup.**
   Add a trio startup hook that captures
   `signal.set_wakeup_fd(-1)` at `_trio_main` entry,
   after `async_main` enters, and periodically — assert
   it's unchanged. If it moves, track down the writer
   (sketch below).
2. **Explicit `signal.set_wakeup_fd` reset after IPC
   setup.** Brute force: re-install a fresh wakeup-fd
   mid-bringup. A band-aid, but fast to try.
3. **Ensure no `shield=True` cancel scope envelopes the
   RPC-message-loop / IPC-server task.** If one does,
   KBI-at-checkpoint never escapes.
4. **Once fixed, the `child_sigint='trio'` mode on
   `subint_forkserver_proc`** becomes effectively a no-op
   or a doc-only mode — trio's natural handler already
   does the right thing. The flag might be removed
   entirely if there's no behavioral difference between
   modes.

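A sketch of the direction-(1) audit hook as a background
trio task; `audit_wakeup_fd` and `current_wakeup_fd` are
illustrative names (the latter repeats the probe from the
root-cause section), and where the task gets spawned in the
subactor's nurseries is an open choice:

```python
import signal, trio

def current_wakeup_fd() -> int:
    # set_wakeup_fd is both the only getter and the setter:
    # read the current fd, then immediately restore it
    fd = signal.set_wakeup_fd(-1)
    if fd != -1:
        signal.set_wakeup_fd(fd)
    return fd

async def audit_wakeup_fd(period: float = 0.5) -> None:
    # baseline right after trio.run() has installed its fd
    baseline = current_wakeup_fd()
    while True:
        await trio.sleep(period)
        fd = current_wakeup_fd()
        assert fd == baseline, f'wakeup fd moved: {baseline} -> {fd}'
```
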
## Current workaround

None; `child_sigint` defaults to `'ipc'` (IPC cancel is
the only reliable cancel path today), and the xfail test
documents the gap. Operators hitting orphan-SIGINT get a
hung process that needs `SIGKILL`.

## Reproducer

Inline, standalone (no pytest):

```python
# save as /tmp/orphan_sigint_repro.py (py3.14+)
import os, sys, signal, time, glob, trio
import tractor
from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread,
)

async def _sleep_forever():
    print(f'SUBACTOR[{os.getpid()}]', flush=True)
    await trio.sleep_forever()

async def _main():
    async with (
        tractor.open_root_actor(
            registry_addrs=[('127.0.0.1', 12349)],
        ),
        tractor.open_nursery() as an,
    ):
        await an.run_in_actor(_sleep_forever, name='sf-child')
        await trio.sleep_forever()

def child_target() -> int:
    from tractor.spawn._spawn import try_set_start_method
    try_set_start_method('subint_forkserver')
    trio.run(_main)
    return 0

pid = fork_from_worker_thread(child_target, thread_name='repro')
time.sleep(3.0)

# find the subactor pid via /proc
children = []
for path in glob.glob(f'/proc/{pid}/task/*/children'):
    with open(path) as f:
        children.extend(int(x) for x in f.read().split() if x)
subactor_pid = children[0]

# SIGKILL root → orphan the subactor
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)
time.sleep(0.3)

# SIGINT the orphan — should cause clean trio exit
os.kill(subactor_pid, signal.SIGINT)

# poll for exit
for _ in range(100):
    try:
        os.kill(subactor_pid, 0)
        time.sleep(0.1)
    except ProcessLookupError:
        print('HARNESS: subactor exited cleanly ✔')
        sys.exit(0)
os.kill(subactor_pid, signal.SIGKILL)
print('HARNESS: subactor hung — reproduced')
sys.exit(1)
```

Expected (current): `HARNESS: subactor hung — reproduced`.

After fix: `HARNESS: subactor exited cleanly ✔`.

## References

- `tractor/spawn/_entry.py::_trio_main:157-186` — the
  intentional KBI→clean-exit path this bug makes
  unreachable.
- `tractor/spawn/_subint_forkserver` — the backend whose
  orphan cancel-robustness this blocks.
- `tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`
  — the xfail'd reproducer in the test suite.
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md` —
  sibling "orphaned channel park" hang (different class).
- `ai/conc-anal/subint_sigint_starvation_issue.md` —
  sibling "GIL starvation SIGINT drop" hang (different
  class).
- tractor issue #379 — subint backend tracking.
