Discovery-client leaks CLOSE_WAIT fds when registrar server-side closes
Surfaced while investigating a Mode-A hang
(#451) in
tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext].
Distinct concern from the hang itself; filing separately so it can be
investigated independently.
Symptom
In a hung pytest process running the test session under
--spawn-backend=main_thread_forkserver --tpt-proto=tcp, ss -tnp | grep <pytest-pid> reports 6 sockets in CLOSE_WAIT state, all
client-side connections from random ephemeral ports →
127.0.0.1:8650 (the test session's reg_addr). Each has Recv-Q=1
(one byte of buffered FIN/EOF that nothing is reading).
6:CLOSE-WAIT 1 0 127.0.0.1:33750 127.0.0.1:8650 fd=14
7:CLOSE-WAIT 1 0 127.0.0.1:50614 127.0.0.1:8650 fd=26
8:CLOSE-WAIT 1 0 127.0.0.1:50698 127.0.0.1:8650 fd=29
9:CLOSE-WAIT 1 0 127.0.0.1:54884 127.0.0.1:8650 fd=17
10:CLOSE-WAIT 1 0 127.0.0.1:55004 127.0.0.1:8650 fd=20
11:CLOSE-WAIT 1 0 127.0.0.1:56252 127.0.0.1:8650 fd=23
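The same snapshot can be taken programmatically from inside the test process, which makes it easy to assert on later; a minimal sketch (assumes psutil, which is not currently a test dependency):

```python
import os
import psutil  # assumption: not a test dependency today

def close_wait_conns() -> list:
    '''
    Return this process's TCP connections currently stuck in
    CLOSE_WAIT -- a programmatic stand-in for the
    `ss -tnp | grep <pytest-pid>` snapshot above.
    '''
    proc = psutil.Process(os.getpid())
    return [
        c for c in proc.connections(kind='tcp')
        if c.status == psutil.CONN_CLOSE_WAIT
    ]

# e.g. from a REPL/debugger attached to the hung session:
for conn in close_wait_conns():
    print(conn.fd, conn.laddr, '->', conn.raddr, conn.status)
```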
Interpretation
CLOSE_WAIT means the remote peer sent FIN but the local
process hasn't called close() on its end. Since both endpoints
(registrar server + discovery client) live in the same pytest process
here, the registrar's server-side fd was successfully closed (it's no
longer in the fd table), but the matching client-side fd was
abandoned.
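For reference, the state is trivial to reproduce with plain sockets and no tractor involvement at all; a standalone sketch of the same shape (server side closes, client fd is abandoned):

```python
import socket
import time

# toy "registrar": accept one connection, then close only the server side
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()
server_side.close()      # registrar sends FIN and drops its fd ...
time.sleep(0.1)          # ... client kernel ACKs the FIN

# the unclosed `client` socket now sits in CLOSE_WAIT, exactly like the
# six entries above; `ss -tn state close-wait` will show it until:
client.close()
```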
The pattern is 6 leaked fds across a test session that spawned a
single subactor and exercised it through one IPC Context +
MsgStream — so each find_actor / wait_for_actor / similar
discovery roundtrip is leaking an fd. Worth confirming with a focused
reproducer (1 spawn → expected 1 leak; 100 spawns → expected 100
leaks).
Likely culprit
The discovery client path in tractor.discovery._registry (or
wherever find_actor / wait_for_actor opens a transient registrar
channel) doesn't appear to close() its end of the connection after
the registrar server-side response completes. Under TCP this
manifests as CLOSE_WAIT accumulation; under UDS the symptom would
be similar but with file-descriptor (vs. socket) names.
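If that reading is right, the fix is likely just guaranteeing the transient channel gets closed on every exit path of the lookup. A hedged sketch of the shape (open_registrar_channel and chan.aclose() are placeholders here, not the confirmed tractor API; the real names come out of the call-site trace below):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def transient_registrar_chan(reg_addr):
    '''
    Hypothetical wrapper for the discovery-client side: open a one-shot
    channel to the registrar and always close our end, even when the
    query errors or is cancelled; otherwise the client fd outlives the
    roundtrip and parks in CLOSE_WAIT once the registrar closes its side.
    '''
    chan = await open_registrar_channel(reg_addr)  # placeholder opener
    try:
        yield chan
    finally:
        await chan.aclose()  # the step that appears to be missing

# find_actor()-style lookups would then do their request/response
# entirely inside `async with transient_registrar_chan(reg_addr) as chan:`
```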
Side effects
- fd table pressure across long pytest sessions — eventually
EMFILE.
- Test-suite flakiness amplifier — contributes to the conditions
under which #451
(Mode-A cancel-cascade hang) reproduces, because the parent's trio
loop is registered with epoll on a growing set of zombie fds.
Whether trio is itself watching the leaked fds is the next
question to chase (see the /proc fdinfo sketch after this list).
- May also explain transient :8650 port-reuse failures across
parallel pytest sessions when the corresponding TIME_WAIT entries
pile up in the kernel.
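On the "is trio still watching these fds" question: the kernel already exposes the answer, since /proc/<pid>/fdinfo/<epoll-fd> lists every registered target fd as a tfd: line. A Linux-only sketch that cross-checks those against the leaked fd numbers from the ss output above:

```python
import os
import re
from pathlib import Path

def epoll_registered_fds(pid: int | None = None) -> dict:
    '''
    Map each epoll fd of `pid` to the set of target fds ("tfd:" lines)
    the kernel reports as registered on it; compare against the
    CLOSE_WAIT fd numbers (14, 26, 29, ...) from the `ss` output.
    '''
    pid = pid or os.getpid()
    proc = Path(f'/proc/{pid}')
    out = {}
    for entry in (proc / 'fd').iterdir():
        try:
            if 'eventpoll' not in os.readlink(entry):
                continue  # not an epoll instance
        except OSError:
            continue  # fd went away while scanning
        fdinfo = (proc / 'fdinfo' / entry.name).read_text()
        out[int(entry.name)] = {
            int(m) for m in re.findall(r'tfd:\s*(\d+)', fdinfo)
        }
    return out
```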
Investigation next steps
- Add a focused reproducer: open + close N actor nurseries in a loop
in a single trio.run, snapshot ss -tnp (or a psutil equivalent)
before/after, and assert no CLOSE_WAIT growth (rough sketch after
this list).
- Trace which call-site in
tractor.discovery opens the client end
of these connections (likely find_actor / wait_for_actor).
- Verify whether
tractor.ipc._chan.Channel (or the
discovery-specific equivalent) has an aclose path that's bypassed
in some discovery flows.
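A rough shape for that reproducer (the nursery/portal calls follow the public tractor API from memory and the CLOSE_WAIT count reuses the psutil helper idea from the Symptom section, so treat both as assumptions to adjust against the real test harness):

```python
import os
import psutil
import trio
import tractor

def count_close_wait() -> int:
    # same idea as the psutil sketch in the Symptom section
    return sum(
        1 for c in psutil.Process(os.getpid()).connections(kind='tcp')
        if c.status == psutil.CONN_CLOSE_WAIT
    )

async def spawn_and_reap(n: int) -> None:
    # each iteration should cost at least one discovery roundtrip to the
    # registrar, so a per-roundtrip leak shows up as ~n CLOSE_WAIT fds
    for i in range(n):
        async with tractor.open_nursery() as an:
            portal = await an.start_actor(f'leak_probe_{i}', enable_modules=[])
            await portal.cancel_actor()

def test_no_close_wait_growth():
    before = count_close_wait()
    trio.run(spawn_and_reap, 10)
    assert count_close_wait() == before
```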
See also
- #451 — Mode-A
cancel-cascade hang where this leak was first observed
- #379 — subint
umbrella
Tracked from #379
(subint umbrella).