Draft
78 commits
de4f470
Add CPython-level `subint_fork` workaround smoketest
goodboy Apr 22, 2026
82332fb
Lift fork prims into `_subint_forkserver` mod
goodboy Apr 22, 2026
25e400d
Add trio-parent tests for `_subint_forkserver`
goodboy Apr 22, 2026
cf2e71d
Add `subint_forkserver` PEP 684 audit-plan doc
goodboy Apr 22, 2026
26914fd
Wire `subint_forkserver` as first-class backend
goodboy Apr 22, 2026
63ab7c9
Reset post-fork `_state` in forkserver child
goodboy Apr 22, 2026
7804a9f
Refactor `_runtime_vars` into pure get/set API
goodboy Apr 22, 2026
76605d5
Add DRAFT `subint_forkserver` orphan-SIGINT test
goodboy Apr 22, 2026
dcd5c1f
Scaffold `child_sigint` modes for forkserver
goodboy Apr 23, 2026
a72deef
Refine `subint_forkserver` orphan-SIGINT diagnosis
goodboy Apr 23, 2026
f5f37b6
Shorten some timeouts in `subint_forkserver` suites
goodboy Apr 23, 2026
5e85f18
Drop unneeded f-str prefixes
goodboy Apr 23, 2026
8bcbe73
Enable `debug_mode` for `subint_forkserver`
goodboy Apr 23, 2026
e31eb8d
Label forkserver child as `subint_forkserver`
goodboy Apr 23, 2026
1e357dc
Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg
goodboy Apr 23, 2026
d093c31
Add zombie-actor check to `run-tests` skill
goodboy Apr 23, 2026
e3f4f5a
Add `subint_forkserver` test-cancellation leak doc
goodboy Apr 23, 2026
1af2121
Wire `reg_addr` through leaky cancel tests
goodboy Apr 23, 2026
70d58c4
Use SIGINT-first ladder in `run-tests` cleanup
goodboy Apr 23, 2026
35da808
Refine `subint_forkserver` nested-cancel hang diagnosis
goodboy Apr 23, 2026
9993db0
Scrub inherited FDs in fork-child prelude
goodboy Apr 23, 2026
c20b05e
Use `pidfd` for cancellable `_ForkedProc.wait`
goodboy Apr 23, 2026
8ac3dfe
Break parent-chan shield during teardown
goodboy Apr 23, 2026
506617c
Skip-mark + narrow `subint_forkserver` cancel hang
goodboy Apr 23, 2026
76d1206
Claude-perms: ensure /commit-msg files can be written!
goodboy Apr 23, 2026
7cd47ef
Doc ruled-out fix + capture-pipe aside
goodboy Apr 23, 2026
458a35c
Surface silent failures in `_subint_forkserver`
goodboy Apr 23, 2026
ab86f76
Refine `subint_forkserver` cancel-cascade diag
goodboy Apr 24, 2026
4d05554
Narrow forkserver hang to `async_main` outer tn
goodboy Apr 24, 2026
e312a68
Bound peer-clear wait in `async_main` finally
goodboy Apr 24, 2026
eceed29
Pin forkserver hang to pytest `--capture=fd`
goodboy Apr 24, 2026
4106ba7
Codify capture-pipe hang lesson in skills
goodboy Apr 24, 2026
4c133ab
Default `pytest` to use `--capture=sys`
goodboy Apr 24, 2026
d6e70e9
Import-or-skip `.devx.` tests requiring `greenback`
goodboy Apr 24, 2026
b350aa0
Wire `reg_addr` through infected-asyncio tests
goodboy Apr 25, 2026
2ca0f41
Skip `test_loglevel_propagated_to_subactor` on subint forkserver too
goodboy Apr 25, 2026
44bdb16
Tighten orphan-SIGINT xfail to `strict=True`
goodboy Apr 25, 2026
eae478f
Add `_testing._reap` + auto-reap fixture
goodboy Apr 25, 2026
6d76b60
Add `tractor-reap` CLI + document auto-reap
goodboy Apr 26, 2026
c99d475
Document `SharedMemory` × `subint_forkserver` incompat
goodboy Apr 27, 2026
aa3e230
Fix `SharedMemory` under `subint_forkserver`
goodboy Apr 27, 2026
4f12d69
Add `--shm` orphan sweep to `tractor-reap`
goodboy Apr 27, 2026
65fcfbf
Bump `test_stale_entry_is_deleted`'s timeout to 30
goodboy Apr 27, 2026
9b05f65
Wire `test_dynamic_pub_sub` to standard fixtures
goodboy Apr 27, 2026
66f1941
Wire `reg_addr` into `test_context_stream_semantics`
goodboy Apr 27, 2026
5456195
Log subint bootstrap excs + cancel-leak state
goodboy Apr 27, 2026
3ab99d5
Doc `_subint_forkserver` design + fork semantics
goodboy Apr 27, 2026
4b5176e
Doc future-subint payoffs for `_subint_forkserver`
goodboy Apr 27, 2026
99dade0
Extract fork primitives into `_main_thread_forkserver`
goodboy Apr 27, 2026
57dae0e
Split forkserver backend into variant 1/2 mods
goodboy Apr 27, 2026
5e83881
Add `subint_forkserver_proc` stub, flip dispatch, prune
goodboy Apr 27, 2026
9f0709e
Migrate test/smoketest imports + rename test file
goodboy Apr 27, 2026
205382a
Sweep `subint_forkserver` → `main_thread_forkserver` in code
goodboy Apr 27, 2026
cbdf1eb
Guard `subint_forkserver` stub against re-alias
goodboy Apr 28, 2026
7c5dd4d
Fix `_testing.addr.get_rando_addr` cross-process collisions
goodboy Apr 28, 2026
b376eb0
Add opt-in `reap_subactors_per_test` fixture
goodboy Apr 28, 2026
530160f
Use `trio.fail_after` cap in `test_dynamic_pub_sub`
goodboy Apr 28, 2026
f8178df
Return parent `pid: int` from new `reap_subactors_per_test` fixture
goodboy Apr 28, 2026
3c366ca
Drop global `pytest-timeout` cap from `pyproject.toml`
goodboy Apr 28, 2026
060f7d2
Backend-aware timeout in `maybe_expect_raises`
goodboy Apr 29, 2026
383b0fd
Backend-aware `fail_after` in pub/sub test
goodboy Apr 29, 2026
5418f2d
Add `--enable-stackscope` pytest plugin flag
goodboy Apr 29, 2026
8c73019
Refine fork-survival docs + `EBADF` handling
goodboy Apr 29, 2026
2d4995e
Route `stackscope` SIGUSR1 onto trio loop
goodboy Apr 29, 2026
2917b74
Add todo for running `test_debugger` suite on forkserver spawner
goodboy Apr 29, 2026
532a983
Add posix-multithreaded-`fork()` explainer doc
goodboy Apr 29, 2026
22cdf15
Flip back to default `pytest` capture for CI
goodboy Apr 29, 2026
208e7c0
Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars
goodboy Apr 29, 2026
b7115fc
Drop test-local timeouts, +`sync_pause` to dev
goodboy Apr 29, 2026
fc5e80f
Drop subint-family gate from `main_thread_forkserver`
goodboy Apr 29, 2026
8bc304f
TOSQUASH 2d4995e0, fix _pformat -> devx.pformat..
goodboy Apr 29, 2026
486249d
Allow per-call `start_method`/`loglevel` overrides
goodboy Apr 30, 2026
1cdc7fb
Add UDS orphan-sweep helpers + reap fixtures to `_reap`
goodboy Apr 30, 2026
0996a83
Add `--uds`/`--uds-only` flags to `tractor-reap`
goodboy Apr 30, 2026
61d4525
Add `pytest_load_initial_conftests()` for `--capture=`
goodboy Apr 30, 2026
e2b790a
Fix `SIGUSR1` tree-dump ordering in `_stackscope`
goodboy Apr 30, 2026
4852335
Add `use_stackscope` runtime var for subactor init
goodboy May 1, 2026
fc2e298
Update `sync_bp` + tighten `test_pause_from_sync`
goodboy May 1, 2026
22 changes: 14 additions & 8 deletions .claude/settings.local.json
@@ -1,8 +1,16 @@
{
"permissions": {
"allow": [
"Bash(date *)",
"Bash(cp .claude/*)",
"Read(.claude/**)",
"Read(.claude/skills/run-tests/**)",
"Write(.claude/**/*commit_msg*)",
"Write(.claude/git_commit_msg_LATEST.md)",
"Skill(run-tests)",
"Skill(close-wkt)",
"Skill(open-wkt)",
"Skill(prompt-io)",
"Bash(date *)",
"Bash(git diff *)",
"Bash(git log *)",
"Bash(git status)",
@@ -23,14 +31,12 @@
"Bash(UV_PROJECT_ENVIRONMENT=py* uv sync:*)",
"Bash(UV_PROJECT_ENVIRONMENT=py* uv run:*)",
"Bash(echo EXIT:$?:*)",
"Write(.claude/*commit_msg*)",
"Write(.claude/git_commit_msg_LATEST.md)",
"Skill(run-tests)",
"Skill(close-wkt)",
"Skill(open-wkt)",
"Skill(prompt-io)"
"Bash(echo \"EXIT=$?\")",
"Read(//tmp/**)"
],
"deny": [],
"ask": []
}
},
"prefersReducedMotion": false,
"outputStyle": "default"
}
66 changes: 66 additions & 0 deletions .claude/skills/conc-anal/SKILL.md
@@ -229,3 +229,69 @@ Unlike asyncio, trio allows checkpoints in
that does `await` can itself be cancelled (e.g.
by nursery shutdown). Watch for cleanup code that
assumes it will run to completion.

### Unbounded waits in cleanup paths

Any `await <event>.wait()` in a teardown path is
a latent deadlock unless the event's setter is
GUARANTEED to fire. If the setter depends on
external state (peer disconnects, child process
exit, subsequent task completion) that itself
depends on the current task's progress, you have
a mutual wait.

Rule: **bound every `await X.wait()` in cleanup
paths with `trio.move_on_after()`** unless you
can prove the setter is unconditionally reachable
from the state at the await site. Concrete recent
example: `ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`,
"probe iteration 3"). It was unbounded, and when
one peer-handler was stuck the
wait-for-no-more-peers event never fired,
deadlocking the whole actor-tree teardown cascade.

### The capture-pipe-fill hang pattern (grep this first)

When investigating any hang in the test suite,
**especially under fork-based backends**, first
check whether the hang reproduces under
`pytest -s` (`--capture=no`). If `-s` makes it go
away, you're not looking at a trio concurrency bug;
you're looking at a Linux pipe-buffer fill.

Mechanism: pytest replaces fds 1,2 with pipe
write-ends. Fork-child subactors inherit those
fds. High-volume error-log tracebacks (cancel
cascade spew) fill the 64KB pipe buffer. Child
`write()` blocks. Child can't exit. Parent's
`waitpid`/pidfd wait blocks. Deadlock cascades up
the tree.
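
The buffer limit in that mechanism can be observed directly. The following is a sketch using only the standard library; `measure_pipe_capacity` is a hypothetical helper, and the write end is made non-blocking purely so the demo reports the limit instead of hanging the way a blocking child `write()` would:

```python
import os


def measure_pipe_capacity(chunk: bytes = b"x" * 4096) -> int:
    """Fill an unread pipe and return how many bytes fit before a
    write would block."""
    read_fd, write_fd = os.pipe()
    # Non-blocking, so a full pipe raises instead of hanging us.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                break  # the buffer is full and nobody is reading
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


capacity = measure_pipe_capacity()
print(f"pipe held {capacity} bytes before write() would block")
```

On a default Linux setup the reported figure is typically the 64KB mentioned above; the exact value depends on the platform and on any per-pipe size tuning.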

Pre-existing guards in `tests/conftest.py` encode
this knowledge; grep these BEFORE blaming
concurrency:

```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
    # XXX: too much logging will lock up the subproc (smh)
    loglevel: str = 'info'

# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```

Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`,
which also has the canonical reproduction. It cost
several investigation sessions before we caught it,
because the capture-pipe symptom was masked by
deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume, and the capture-pipe hang
finally surfaced. Grep-note for future-self: **if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second.**
267 changes: 267 additions & 0 deletions .claude/skills/run-tests/SKILL.md
@@ -205,6 +205,101 @@ python -m pytest tests/ -x -q --co 2>&1 | tail -5
If either fails, fix the import error before running
any actual tests.

### Step 4: zombie-actor / stale-registry check (MANDATORY)

The tractor runtime's default registry address is
**`127.0.0.1:1616`** (TCP) / `/tmp/registry@1616.sock`
(UDS). Whenever any prior test run (especially one
using a fork-based backend like `subint_forkserver`)
leaks a child actor process, that zombie keeps the
registry port bound and **every subsequent test
session fails to bind**, often presenting as 50+
unrelated failures ("all tests broken"!) across
backends.

**This has to be checked before the first run AND
after any cancelled/SIGINT'd run**, since signal failures
in the middle of a test can leave orphan children.

```sh
# 1. TCP registry: any listener on :1616? (primary signal)
ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 free'

# 2. leftover actor/forkserver procs, scoped to THIS
#    repo's python path so we don't false-flag legit
#    long-running tractor-using apps (e.g. `piker`,
#    downstream projects that embed tractor).
pgrep -af "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" \
  | grep -v 'grep\|pgrep' \
  || echo 'no leaked actor procs from this repo'

# 3. stale UDS registry sockets
ls -la /tmp/registry@*.sock 2>/dev/null \
  || echo 'no leaked UDS registry sockets'
```

**Interpretation:**

- **TCP :1616 free AND no stale sockets** → clean,
  proceed. The actor-procs probe is secondary: false
  positives are common (piker, any other
  tractor-embedding app); only clean up if `:1616` is
  bound or sockets linger.
- **TCP :1616 bound OR stale sockets present** →
  surface PIDs + cmdlines to the user, offer cleanup:

```sh
# 1. GRACEFUL FIRST (tractor is structured concurrent: it
#    catches SIGINT as an OS-cancel in `_trio_main` and
#    cascades Portal.cancel_actor via IPC to every descendant.
#    So always try SIGINT first with a bounded timeout; only
#    escalate to SIGKILL if graceful cleanup doesn't complete).
pkill -INT -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"

# 2. bounded wait for graceful teardown (usually sub-second).
#    Loop until the processes exit, or timeout. Keep the
#    bound tight: hung/abrupt-killed descendants usually
#    hang forever, so don't wait more than a few seconds.
for i in $(seq 1 10); do
    pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null || break
    sleep 0.3
done

# 3. ESCALATE TO SIGKILL only if graceful didn't finish.
if pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null; then
    echo 'graceful teardown timed out, escalating to SIGKILL'
    pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
fi

# 4. if a test zombie holds :1616 specifically and doesn't
#    match the above pattern, find its PID the hard way:
ss -tlnp 2>/dev/null | grep ':1616'  # prints `users:(("<name>",pid=NNNN,...))`
# then (same SIGINT-first ladder):
#   kill -INT <NNNN>; sleep 1; kill -9 <NNNN> 2>/dev/null

# 5. remove stale UDS sockets
rm -f /tmp/registry@*.sock

# 6. re-verify
ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 now free'
```
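
The same SIGINT-first ladder can be sketched in Python for a process you spawned yourself. This is an illustration only: `interrupt_then_kill` and `spawn_sleeper` are hypothetical names, the children are plain sleeping interpreters rather than tractor actors, and the grace values are arbitrary:

```python
import signal
import subprocess
import sys


def interrupt_then_kill(proc: subprocess.Popen, grace: float = 3.0) -> bool:
    """SIGINT first, SIGKILL only if `proc` outlives `grace` seconds.

    Returns True when escalation to SIGKILL was needed.
    """
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=grace)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap, so no zombie is left behind
        return True


def spawn_sleeper(ignore_sigint: bool) -> subprocess.Popen:
    """Start a sleeping child process (demo helper)."""
    disposition = "SIG_IGN" if ignore_sigint else "SIG_DFL"
    code = (
        "import signal, time; "
        f"signal.signal(signal.SIGINT, signal.{disposition}); "
        "print('ready', flush=True); time.sleep(60)"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    proc.stdout.readline()  # wait until the child finished its setup
    proc.stdout.close()
    return proc


polite = interrupt_then_kill(spawn_sleeper(ignore_sigint=False), grace=5.0)
stubborn = interrupt_then_kill(spawn_sleeper(ignore_sigint=True), grace=0.5)
print(f"polite escalated: {polite}; stubborn escalated: {stubborn}")
```

The boolean return value mirrors the shell ladder's decision point: a caller can log or alert only when the graceful phase was not enough.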

**Never ignore stale registry state.** If you see the
"all tests failing" pattern (especially
`trio.TooSlowError` / connection refused / address in
use on many unrelated tests), check the registry **before**
spelunking into test code. The failure signature will
be identical across backends because they're all
fighting for the same port.

**False-positive warning for step 2:** a plain
`pgrep -af '_actor_child_main'` will also match
legit long-running tractor-embedding apps (e.g.
`piker` at `~/repos/piker/py*/bin/python3 -m
tractor._child ...`). Always scope to the current
repo's python path, or only use step 1 (`:1616`) as
the authoritative signal.
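
For scripted pre-flight checks, the primary `:1616` probe can be approximated from Python with a plain TCP connect. This is a sketch, not part of the tractor harness: `has_listener` is a hypothetical helper, and unlike `ss -tlnp` it cannot report which PID owns the port:

```python
import socket


def has_listener(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-check against a throwaway listener on an OS-assigned port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    demo_port = srv.getsockname()[1]
    seen_while_open = has_listener("127.0.0.1", demo_port)
seen_after_close = has_listener("127.0.0.1", demo_port)
print(seen_while_open, seen_after_close)
```

Calling `has_listener("127.0.0.1", 1616)` before a session gives the same yes/no answer as probe 1; fall back to `ss` when you need the owning process.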

## 4. Run and report

- Run the constructed command.
@@ -356,3 +451,175 @@ by your changes - note them and move on.
**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky; skip it.

## 9. The pytest-capture hang pattern (CHECK THIS FIRST)

**Symptom:** a tractor test hangs indefinitely under
default `pytest` but passes instantly when you add
`-s` (`--capture=no`).

**Cause:** tractor subactors (especially under
fork-based backends) inherit pytest's stdout/stderr
capture pipes via fds 1,2. Under high-volume error
logging (e.g. multi-level cancel cascade, nested
`run_in_actor` failures, anything triggering
`RemoteActorError` + `ExceptionGroup` traceback
spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → can't
finish exit → parent's `waitpid`/pidfd wait blocks →
deadlock cascades up the tree.

**Pre-existing guards in the tractor harness** that
encode this same knowledge (grep these FIRST
before spelunking):

- `tests/conftest.py:258-260` (in the `daemon`
  fixture):
  `# XXX: too much logging will lock up the subproc (smh)`,
  which downgrades `trace`/`debug` loglevel to `info`
  to prevent the hang.
- `tests/conftest.py:316`:
  `# can lock up on the _io.BufferedReader and hang..`,
  noted on the `proc.stderr.read()` post-SIGINT.

**Debug recipe (in priority order):**

1. **Try `-s` first.** If the hang disappears with
`pytest -s`, you've confirmed it's capture-pipe
fill. Skip spelunking.
2. **Lower the loglevel.** Default `--ll=error` on
this project; if you've bumped it to `debug` /
`info`, try dropping back. Each log level
multiplies pipe-pressure under fault cascades.
3. **If you MUST use default capture + high log
volume**, redirect subactor stdout/stderr in the
child prelude (e.g.
`tractor.spawn._subint_forkserver._child_target`
post-`_close_inherited_fds`) to `/dev/null` or a
file.
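
Item 3 of the recipe could look roughly like the sketch below. It is an assumption about shape only: `redirect_std_fds` is a hypothetical helper, and nothing here is taken from the actual `_child_target` prelude:

```python
import os


def redirect_std_fds(target: str = os.devnull) -> None:
    """Point fds 1 and 2 at `target` instead of inherited pipes.

    Intended to be called early in a forked child, before any
    logging output is produced.
    """
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.dup2(fd, 1)  # stdout
        os.dup2(fd, 2)  # stderr
    finally:
        # The duplicates keep the file open; drop the extra handle
        # unless it happens to be one of the std fds itself.
        if fd > 2:
            os.close(fd)
```

Working at the fd level (rather than reassigning `sys.stdout`) matters here because the blocked `write()` happens on the inherited descriptors, regardless of which Python object wraps them. Passing a file path instead of the default keeps the log volume around for later inspection.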

**Signature tells you it's THIS bug (vs. a real
code hang):**

- Multi-actor test under fork-based backend
(`subint_forkserver`, eventually `trio_proc` too
under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via
  `pgrep -af subint-forkserv` or similar after the
  hang: they're alive but blocked on `write()` to an
  inherited stdout fd.

**Historical reference:** this deadlock cost a
multi-session investigation (4 genuine cascade
fixes landed along the way) that only surfaced the
capture-pipe issue AFTER the deeper fixes let the
tree actually tear down enough to produce pipe-
filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the
workaround before digging.

## 10. Reaping zombie subactors (`tractor-reap`)

**Symptom:** after a `pytest` run crashes, times out,
or is `Ctrl+C`'d, subactor forks (esp. under
`subint_forkserver`) can be reparented to `init`
(PPid==1) and linger. They hold onto ports, inherit
pytest's capture-pipe fds, and flakify later
sessions.

**Two layers of defense:**

### a) Session-scoped auto-fixture (always on)

`tractor/_testing/pytest.py::_reap_orphaned_subactors`
runs at pytest session teardown. It walks `/proc` for
direct descendants of the pytest pid, SIGINTs them,
waits up to 3s, then SIGKILLs survivors. SC-polite:
gives the subactor runtime a chance to run its trio
cancel shield + IPC teardown before escalation.

This is *autouse* and session-scoped, so you don't need
to do anything. It just runs.

### b) `scripts/tractor-reap` CLI (manual reap)

For the **pytest-died-mid-session** case (Ctrl+C, OOM
kill, hung process you had to `kill -9`), the fixture
never ran. Reach for the CLI:

```sh
# default: orphans (PPid==1, cwd==repo, cmd contains python)
scripts/tractor-reap

# descendant-mode: from a still-live supervisor
scripts/tractor-reap --parent <pytest-pid>

# see what would be reaped, don't signal
scripts/tractor-reap -n

# tune the SIGINT → SIGKILL grace window
scripts/tractor-reap --grace 5
```

Exit code: `0` if everyone exited on SIGINT, `1` if
it had to escalate to SIGKILL, so you can chain it in
CI health-checks (`scripts/tractor-reap || <alert>`).

**What it matches** (orphan-mode):
- `PPid == 1` (reparented to init → definitely
  orphaned, not just a currently-running child)
- `cwd == <repo-root>` (keeps the sweep scoped; won't
touch unrelated init-children elsewhere)
- `python` in cmdline

**What it does not do:** kill anything whose PPid is
still a live tractor parent. If the parent is alive
it's not an orphan; use `--parent <pid>` if you need
to force-reap under a still-live supervisor.

**When NOT to run it:** while a pytest session is
active in another terminal. It's safe (won't touch
that session's live children in orphan-mode) but can
race if the target session is mid-teardown.
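
The orphan-mode criteria listed above can be expressed as a small `/proc` scan. This is a Linux-only sketch written for illustration; the function names are hypothetical and it is not the `tractor-reap` implementation:

```python
import os
from pathlib import Path


def parse_ppid(status_text: str) -> int:
    """Extract the `PPid:` field from a `/proc/<pid>/status` dump."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    raise ValueError("no PPid field found")


def is_orphan_candidate(ppid: int, cwd: str, cmdline: str, repo_root: str) -> bool:
    """Apply the three orphan-mode criteria: init-parented, same cwd, python."""
    return ppid == 1 and cwd == repo_root and "python" in cmdline


def find_orphans(repo_root: str) -> list[int]:
    """Scan `/proc` for pids matching the criteria."""
    found = []
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue
        try:
            ppid = parse_ppid((entry / "status").read_text())
            cwd = os.readlink(entry / "cwd")
            # cmdline arguments are NUL-separated
            raw = (entry / "cmdline").read_bytes()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        except (OSError, ValueError):
            continue  # process vanished or is not readable by us
        if is_orphan_candidate(ppid, cwd, cmdline, repo_root):
            found.append(int(entry.name))
    return sorted(found)
```

Keeping the predicate separate from the scan makes the matching rule easy to check in isolation, and `find_orphans(os.getcwd())` only lists candidates; signalling them is left to the caller.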

### c) `--shm` / `--shm-only`: orphan-segment sweep

Because `tractor.ipc._mp_bs.disable_mantracker()`
turns off `mp.resource_tracker` (see
`ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`),
a hard-crashing actor can leave `/dev/shm/<key>`
segments behind that nothing else GCs.

```sh
# process reap THEN shm sweep
scripts/tractor-reap --shm

# shm sweep only (skip process phase)
scripts/tractor-reap --shm-only

# dry-run: list candidates, don't unlink
scripts/tractor-reap --shm -n
```

**Match criteria** (very conservative: this is a
shared-system path, can't be wrong):
- segment is a regular file under `/dev/shm`,
- owned by the **current uid** (`stat.st_uid`),
- AND **no live process holds it open**,
  enumerated by walking every readable
  `/proc/<pid>/maps` (post-mmap mappings) AND
  `/proc/<pid>/fd/*` (pre-mmap shm-opened fds).

The "nobody has it open" check is the
kernel-canonical "is this leaked?" test, the same
answer `lsof /dev/shm/<key>` would give. No
reliance on tractor-specific naming, so it works
for any tractor app. Critically, it WILL NOT touch
segments held by other apps you have running
(e.g. `piker`, `lttng-ust-*`, `aja-shm-*`;
verified locally with 81 in-use segments correctly
preserved).
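
A sketch of how that "nobody has it open" test might be computed on Linux, under the stated match criteria. All names are hypothetical, the code only lists candidates and never unlinks, and processes whose `/proc` entries are unreadable are simply skipped, as in the description above:

```python
import os
import stat
from pathlib import Path

DELETED = " (deleted)"  # suffix the kernel adds to unlinked-but-open files


def shm_paths_in_maps(maps_text: str, shm_dir: str = "/dev/shm") -> set[str]:
    """Collect shm paths mentioned in a `/proc/<pid>/maps` dump."""
    held = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) == 6 and parts[5].startswith(shm_dir + "/"):
            held.add(parts[5].strip().removesuffix(DELETED))
    return held


def held_shm_paths(shm_dir: str = "/dev/shm") -> set[str]:
    """Union of shm paths that any readable live process maps or has open."""
    held = set()
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue
        try:
            held |= shm_paths_in_maps(
                (entry / "maps").read_text(errors="replace"), shm_dir
            )
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # not readable, or the process just exited
        for fd in fds:
            try:
                target = os.readlink(fd)
            except OSError:
                continue
            if target.startswith(shm_dir + "/"):
                held.add(target.removesuffix(DELETED))
    return held


def leaked_candidates(shm_dir: str = "/dev/shm") -> list[str]:
    """Regular files owned by the current uid that no process holds."""
    held = held_shm_paths(shm_dir)
    uid = os.getuid()
    out = []
    for seg in Path(shm_dir).iterdir():
        st = seg.lstat()
        if stat.S_ISREG(st.st_mode) and st.st_uid == uid and str(seg) not in held:
            out.append(str(seg))
    return sorted(out)
```

Checking both `maps` and `fd` covers a segment that has been opened but not yet mapped, which is the window the original criteria call out.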