# `subint_forkserver` backend: orphaned-subactor SIGINT wedged in `epoll_wait`

Follow-up to the Phase C `subint_forkserver` spawn-backend
PR (see `tractor.spawn._subint_forkserver`, issue #379).
Surfaced by the xfail'd
`tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`.

Related-but-distinct from
`subint_cancel_delivery_hang_issue.md` (orphaned-channel
park AFTER subint teardown) and
`subint_sigint_starvation_issue.md` (GIL starvation,
SIGINT never delivered): here the SIGINT IS delivered and
trio's handler IS installed, but trio's event loop never
wakes — so the KBI-at-checkpoint → `_trio_main` catch path
(which is the runtime's *intentional* OS-cancel design)
never fires.

## TL;DR

When a `subint_forkserver`-spawned subactor is orphaned
(parent `SIGKILL`'d, no IPC cancel path available) and then
externally `SIGINT`'d, the subactor hangs in
`trio/_core/_io_epoll.py::get_events` (`epoll_wait`)
indefinitely — even though:

1. `threading.current_thread() is threading.main_thread()`
   holds post-fork (CPython 3.14 re-designates correctly).
2. Trio's SIGINT handler IS installed in the subactor
   (`signal.getsignal(SIGINT)` returns
   `<function KIManager.install.<locals>.handler at 0x...>`).
3. The kernel does deliver SIGINT — the signal arrives at
   the only thread in the process (the fork-inherited
   worker, which IS now "main" per Python).

Yet `epoll_wait` does not return. Trio's wakeup-fd mechanism
— the machinery that turns a SIGINT into an epoll wake — is
somehow not firing. Until that's fixed, the intentional
"KBI-as-OS-cancel" path in
`tractor/spawn/_entry.py::_trio_main:164` is unreachable
for forkserver-spawned subactors whose parent dies.
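
For context, a minimal sketch (pure stdlib, Linux epoll, no
trio or tractor) of the wakeup-fd mechanism referenced
above: a signal handler alone can't interrupt a thread
blocked in `epoll_wait`; it's the byte that CPython's
C-level handler writes to the `signal.set_wakeup_fd()` fd,
registered with the epoll instance, that wakes the loop:

```python
import os
import select
import signal

# the write end must be non-blocking for set_wakeup_fd()
r, w = os.pipe()
os.set_blocking(w, False)
signal.set_wakeup_fd(w)  # signal arrival -> byte written to `w`
signal.signal(signal.SIGINT, lambda *a: None)  # no-op py handler

ep = select.epoll()
ep.register(r, select.EPOLLIN)  # the wakeup fd IS in the epoll set

os.kill(os.getpid(), signal.SIGINT)
print('woke on:', ep.poll(timeout=5))  # returns immediately;
# if `r` were NOT registered (this bug's suspected shape), the
# poll would sit the full 5s despite the delivered signal.
```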

## Symptom

Test: `tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`
(currently marked `@pytest.mark.xfail(strict=True)`).

1. A harness subprocess brings up a tractor root actor +
   one `run_in_actor(_sleep_forever)` subactor via
   `try_set_start_method('subint_forkserver')`.
2. The harness prints `CHILD_PID` (subactor) and
   `PARENT_READY` (root actor) markers to stdout.
3. The test does `os.kill(parent_pid, SIGKILL)` + `proc.wait()`
   to fully reap the root-actor harness.
4. The child (now reparented to pid 1) is still alive.
5. The test does `os.kill(child_pid, SIGINT)` and polls
   `os.kill(child_pid, 0)` for up to 10s.
6. **Observed**: the child is still alive at the deadline —
   SIGINT did not unwedge the trio loop.

## What the "intentional" cancel path IS

`tractor/spawn/_entry.py::_trio_main:157-186` —

```python
try:
    if infect_asyncio:
        actor._infected_aio = True
        run_as_asyncio_guest(trio_main)
    else:
        trio.run(trio_main)

except KeyboardInterrupt:
    logmeth = log.cancel
    exit_status: str = (
        'Actor received KBI (aka an OS-cancel)\n'
        ...
    )
```

The "KBI == OS-cancel" mapping IS the runtime's
deliberate, documented design. An OS-level SIGINT should
flow as: kernel → trio handler → KBI at trio checkpoint
→ unwinds `async_main` → surfaces at `_trio_main`'s
`except KeyboardInterrupt:` → `log.cancel` + clean `rc=0`.

**So fixing this hang is not "add a new SIGINT behavior" —
it's "make the existing designed behavior actually fire in
this backend config".** That's why option (B) ("fix root
cause") is aligned with existing design intent, not a
scope expansion.
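
A minimal stand-in for that flow (not tractor's real code;
`_actor_body` is a hypothetical placeholder for the work
`trio_main` does):

```python
import os
import signal
import trio

async def _actor_body():
    await trio.sleep(0.1)
    # stand-in for the external orphan-SIGINT
    os.kill(os.getpid(), signal.SIGINT)
    # trio's KIManager handler schedules a KeyboardInterrupt,
    # raised at this (next) checkpoint...
    await trio.sleep_forever()

try:
    trio.run(_actor_body)
except KeyboardInterrupt:
    # ...which unwinds out of trio.run() and lands here, just
    # like _trio_main's `except KeyboardInterrupt:` block
    print('KBI-as-OS-cancel path reached: clean exit')
```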

## Evidence

### Positive control: standalone fork-from-worker + `trio.run(sleep_forever)` + SIGINT WORKS

```python
import os, signal, time, trio
from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread, wait_child,
)

def child_target() -> int:
    # a bare trio loop in the fork child: no tractor runtime on top
    async def _main():
        try:
            await trio.sleep_forever()
        except KeyboardInterrupt:
            print('CHILD: caught KBI — trio SIGINT works!')
            return
    trio.run(_main)
    return 0

pid = fork_from_worker_thread(child_target, thread_name='trio-sigint-test')
time.sleep(1.0)  # let the child reach trio.run()
os.kill(pid, signal.SIGINT)
wait_child(pid)
```

Result: `CHILD: caught KBI — trio SIGINT works!` + clean
exit. So the fork-child + trio signal plumbing IS healthy
in isolation. The hang appears only with the full tractor
subactor runtime on top.

### Negative test: full tractor subactor + orphan-SIGINT

Equivalent to the xfail test. Traceback dump via
`faulthandler.register(SIGUSR1, all_threads=True)` at the
stuck moment:

```
Current thread 0x00007... [subint-forkserv] (most recent call first):
  File ".../trio/_core/_io_epoll.py", line 245 in get_events
  File ".../trio/_core/_run.py", line 2415 in run
  File "tractor/spawn/_entry.py", line 162 in _trio_main
  File "tractor/_child.py", line 72 in _actor_child_main
  File "tractor/spawn/_subint_forkserver.py", line 650 in _child_target
  File "tractor/spawn/_subint_forkserver.py", line 308 in _worker
  File ".../threading.py", line 1024 in run
```

### Thread + signal-mask inventory of the stuck subactor

Single thread (`tid == pid`, comm `'subint-forkserv'`,
which IS `threading.main_thread()` post-fork):

```
SigBlk: 0000000000000000  # nothing blocked
SigIgn: 0000000001001000  # SIGPIPE etc (Python defaults)
SigCgt: 0000000108000202  # bit 1 = SIGINT caught
```

Bit 1 set in `SigCgt` → SIGINT handler IS installed. So
trio's handler IS in place at the kernel level — not a
"handler missing" situation.

### Handler identity

Inside the subactor's RPC body, `signal.getsignal(SIGINT)`
returns `<function KIManager.install.<locals>.handler at
0x...>` — trio's own `KIManager` handler. tractor's only
SIGINT touches are `signal.getsignal()` *reads* (to stash
into `debug.DebugStatus._trio_handler`); nothing writes
over trio's handler outside the debug-REPL shielding path
(`devx/debug/_tty_lock.py::shield_sigint`) which isn't
engaged here (no debug_mode).
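
A quick identity check, runnable from any subactor RPC
body, to confirm nothing has re-pointed the handler:

```python
import signal

h = signal.getsignal(signal.SIGINT)
# expect: <function KIManager.install.<locals>.handler at 0x...>
assert 'KIManager' in getattr(h, '__qualname__', ''), h
```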

## Ruled out

- **GIL starvation / signal-pipe-full** (class A,
  `subint_sigint_starvation_issue.md`): the subactor runs
  under its own GIL (separate OS process), not sharing with
  the parent → no cross-process GIL contention. And the
  `/proc` signal-mask inventory above shows SIGINT IS
  caught, not left pending.
- **Orphaned channel park** (`subint_cancel_delivery_hang_issue.md`):
  different failure mode — that one has trio iterating
  normally and getting wedged on an orphaned
  `chan.recv()` AFTER teardown. Here trio's event loop
  itself never wakes.
- **Tractor explicitly catching + swallowing KBI**:
  greppable — the one `except KeyboardInterrupt:` in the
  runtime is the INTENTIONAL cancel-path catch at
  `_trio_main:164`. `async_main` uses `except Exception`
  (not `BaseException`), so KBI should propagate through
  cleanly if it ever fires.
- **Missing `signal.set_wakeup_fd` (main-thread
  restriction)**: post-fork, the fork-worker thread IS
  `threading.main_thread()`, so trio's main-thread check
  passes and its wakeup-fd install should succeed.
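
That last bullet is easy to verify in isolation:
`signal.set_wakeup_fd()` raises `ValueError` anywhere but
the main thread, so had CPython NOT re-designated the
fork-worker as main, trio couldn't even install its wakeup
fd in the child (the throwaway `try_install` helper is
ours):

```python
import os
import signal
import threading

def try_install():
    r, w = os.pipe()
    os.set_blocking(w, False)
    try:
        old = signal.set_wakeup_fd(w)
        signal.set_wakeup_fd(old)  # restore whatever was there
        print('install OK in', threading.current_thread().name)
    except ValueError as err:
        print('refused:', err)

try_install()  # main thread: OK
t = threading.Thread(target=try_install)
t.start(); t.join()  # worker thread: ValueError
```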

## Root cause hypothesis (unverified)

The SIGINT handler fires but trio's wakeup-fd write does
not wake `epoll_wait`. Candidate causes, ranked by
plausibility:

1. **Wakeup-fd lifecycle race around tractor IPC setup.**
   `async_main` spins up an IPC server + `process_messages`
   loops early. Somewhere in that path the wakeup-fd that
   trio registered with its epoll instance may be
   closed/replaced/clobbered, so subsequent SIGINT writes
   land on an fd that's no longer in the epoll set.
   Evidence needed: compare
   `signal.set_wakeup_fd(-1)` return value inside a
   post-tractor-bringup RPC body vs. a pre-bringup
   equivalent (a non-destructive probe sketch follows this
   list). If they differ, that's it.
2. **Shielded cancel scope around `process_messages`.**
   The RPC message loop is likely wrapped in a trio cancel
   scope; if that scope is `shield=True` at any outer
   layer, KBI scheduled at a checkpoint could be absorbed
   by the shield and never bubble out to `_trio_main`.
3. **Pre-fork wakeup-fd inheritance.** trio in the PARENT
   process registered a wakeup-fd with its own epoll. The
   child inherits the fd number but not the parent's
   epoll instance — if tractor/trio re-uses the parent's
   stale fd number anywhere, writes would go to a no-op
   fd. (This is the least likely — `trio.run()` in the
   child calls `KIManager.install`, which should install a
   fresh wakeup-fd from scratch.)
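
The probe that hypothesis (1) asks for, made
non-destructive: `set_wakeup_fd(-1)` disables the wakeup
fd but returns the previous one, so re-install it
immediately and just report what was there
(`probe_wakeup_fd` and its tags are hypothetical
scaffolding, not existing API):

```python
import os
import signal

def probe_wakeup_fd(tag: str) -> int:
    fd = signal.set_wakeup_fd(-1)  # read the current fd (disables it)
    if fd != -1:
        signal.set_wakeup_fd(fd)   # put it right back
    alive = False
    if fd != -1:
        try:
            os.fstat(fd)           # is the fd even still open?
            alive = True
        except OSError:
            pass
    print(f'wakeup-fd[{tag}] = {fd} (open={alive})')
    return fd

# call once right after trio.run() starts ("pre-bringup") and
# again from a post-bringup RPC body; a differing or dead fd
# confirms the clobber theory.
```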

## Cross-backend scope question

**Untested**: does the same orphan-SIGINT hang reproduce
against the `trio_proc` backend (stock subprocess + exec)?
If yes → pre-existing tractor bug, independent of
`subint_forkserver`. If no → something specific to the
fork-from-worker path (e.g. inherited fds, mid-epoll-setup
interference).

**Quick repro for trio_proc**:

```python
# save as /tmp/trio_proc_orphan_sigint_repro.py
import os, sys, signal, time
import subprocess as sp

SCRIPT = '''
import os, trio, tractor

async def _sleep_forever():
    print(f"CHILD_PID={os.getpid()}", flush=True)
    await trio.sleep_forever()

async def _main():
    async with (
        tractor.open_root_actor(registry_addrs=[("127.0.0.1", 12350)]),
        tractor.open_nursery() as an,
    ):
        await an.run_in_actor(_sleep_forever, name="sf-child")
        print(f"PARENT_READY={os.getpid()}", flush=True)
        await trio.sleep_forever()

trio.run(_main)
'''

proc = sp.Popen(
    [sys.executable, '-c', SCRIPT],
    stdout=sp.PIPE, stderr=sp.STDOUT, text=True,
)

# parse the CHILD_PID + PARENT_READY markers off proc.stdout
child_pid = None
for line in proc.stdout:
    if line.startswith('CHILD_PID='):
        child_pid = int(line.split('=', 1)[1])
    elif line.startswith('PARENT_READY='):
        break
assert child_pid is not None, 'never saw CHILD_PID marker'

# SIGKILL the root-actor harness, SIGINT the orphan, poll
os.kill(proc.pid, signal.SIGKILL)
proc.wait()
os.kill(child_pid, signal.SIGINT)
for _ in range(100):
    try:
        os.kill(child_pid, 0)
        time.sleep(0.1)
    except ProcessLookupError:
        print('trio_proc orphan exited cleanly on SIGINT')
        sys.exit(0)
os.kill(child_pid, signal.SIGKILL)
print('trio_proc orphan ALSO hung')
sys.exit(1)
```

If that hangs too, open a broader issue; if not, this is
`subint_forkserver`-specific (likely fd-inheritance-related).

## Why this is ours to fix (not CPython's)

- Signal IS delivered (`SigCgt` bitmask confirms).
- Handler IS installed (trio's `KIManager`).
- Thread identity is correct post-fork.
- `_trio_main` already has the intentional KBI→clean-exit
  path waiting to fire.

Every CPython-level precondition is met. Something in
tractor's runtime or trio's integration with it is
breaking the SIGINT→wakeup→event-loop-wake pipeline.

## Possible fix directions

1. **Audit the wakeup-fd across tractor's IPC bringup.**
   Add a trio startup hook that captures
   `signal.set_wakeup_fd(-1)` at `_trio_main` entry,
   after `async_main` enters, and periodically — assert
   it's unchanged. If it moves, track down the writer.
   (A sketch of such a hook follows this list.)
2. **Explicit `signal.set_wakeup_fd` reset after IPC
   setup.** Brute force: re-install a fresh wakeup-fd
   mid-bringup. A band-aid, but fast to try.
3. **Ensure no `shield=True` cancel scope envelopes the
   RPC-message-loop / IPC-server task.** If one does,
   KBI-at-checkpoint never escapes.
4. **Once fixed, the `child_sigint='trio'` mode on
   `subint_forkserver_proc`** becomes effectively a no-op
   or a doc-only mode — trio's natural handler already
   does the right thing. The flag might end up removed
   entirely if there's no behavioral difference between
   modes.
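
A sketch of the direction-(1) hook as a background trio
task; `watch_wakeup_fd` is hypothetical scaffolding, not
existing tractor API:

```python
import signal
import trio

async def watch_wakeup_fd(interval: float = 0.5) -> None:
    # snapshot the wakeup fd at task start, then scream the
    # moment anything moves it during IPC bringup
    def current() -> int:
        fd = signal.set_wakeup_fd(-1)  # read (disables)...
        if fd != -1:
            signal.set_wakeup_fd(fd)   # ...and restore
        return fd

    baseline = current()
    while True:
        await trio.sleep(interval)
        fd = current()
        assert fd == baseline, (
            f'wakeup fd clobbered during bringup: {baseline} -> {fd}'
        )

# usage: `nursery.start_soon(watch_wakeup_fd)` at the very top
# of the subactor's trio_main, before async_main's IPC server
# comes up.
```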

## Current workaround

None; `child_sigint` defaults to `'ipc'` (IPC cancel is
the only reliable cancel path today), and the xfail test
documents the gap. Operators hitting orphan-SIGINT get a
hung process that needs `SIGKILL`.

## Reproducer

Inline, standalone (no pytest):

```python
# save as /tmp/orphan_sigint_repro.py (py3.14+)
import os, sys, signal, time, glob, trio
import tractor
from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread,
)

async def _sleep_forever():
    print(f'SUBACTOR[{os.getpid()}]', flush=True)
    await trio.sleep_forever()

async def _main():
    async with (
        tractor.open_root_actor(
            registry_addrs=[('127.0.0.1', 12349)],
        ),
        tractor.open_nursery() as an,
    ):
        await an.run_in_actor(_sleep_forever, name='sf-child')
        await trio.sleep_forever()

def child_target() -> int:
    from tractor.spawn._spawn import try_set_start_method
    try_set_start_method('subint_forkserver')
    trio.run(_main)
    return 0

pid = fork_from_worker_thread(child_target, thread_name='repro')
time.sleep(3.0)

# find the subactor pid via /proc
children = []
for path in glob.glob(f'/proc/{pid}/task/*/children'):
    with open(path) as f:
        children.extend(int(x) for x in f.read().split() if x)
subactor_pid = children[0]

# SIGKILL root → orphan the subactor
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)
time.sleep(0.3)

# SIGINT the orphan — should cause clean trio exit
os.kill(subactor_pid, signal.SIGINT)

# poll for exit
for _ in range(100):
    try:
        os.kill(subactor_pid, 0)
        time.sleep(0.1)
    except ProcessLookupError:
        print('HARNESS: subactor exited cleanly ✔')
        sys.exit(0)
os.kill(subactor_pid, signal.SIGKILL)
print('HARNESS: subactor hung — reproduced')
sys.exit(1)
```

Expected (current): `HARNESS: subactor hung — reproduced`.

After fix: `HARNESS: subactor exited cleanly ✔`.

## References

- `tractor/spawn/_entry.py::_trio_main:157-186` — the
  intentional KBI→clean-exit path this bug makes
  unreachable.
- `tractor/spawn/_subint_forkserver` — the backend whose
  orphan cancel-robustness this blocks.
- `tests/spawn/test_subint_forkserver.py::test_orphaned_subactor_sigint_cleanup_DRAFT`
  — the xfail'd reproducer in the test suite.
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md` —
  sibling "orphaned channel park" hang (different class).
- `ai/conc-anal/subint_sigint_starvation_issue.md` —
  sibling "GIL starvation SIGINT drop" hang (different
  class).
- tractor issue #379 — subint backend tracking.