
Discovery-client leaks CLOSE_WAIT fds when registrar server-side closes #452

@goodboy

Description

Surfaced while investigating a Mode-A hang
(#451) in
tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext].
Distinct concern from the hang itself; filing separately so it can be
investigated independently.

Symptom

In a hung pytest process running the test session under
--spawn-backend=main_thread_forkserver --tpt-proto=tcp, ss -tnp | grep <pytest-pid> reports 6 sockets stuck in CLOSE_WAIT, all of them
client-side connections from ephemeral ports → 127.0.0.1:8650 (the
test session's reg_addr). Each shows Recv-Q=1: one byte of buffered
FIN/EOF that nothing is reading.

    CLOSE-WAIT  1  0  127.0.0.1:33750  127.0.0.1:8650  fd=14
    CLOSE-WAIT  1  0  127.0.0.1:50614  127.0.0.1:8650  fd=26
    CLOSE-WAIT  1  0  127.0.0.1:50698  127.0.0.1:8650  fd=29
    CLOSE-WAIT  1  0  127.0.0.1:54884  127.0.0.1:8650  fd=17
    CLOSE-WAIT  1  0  127.0.0.1:55004  127.0.0.1:8650  fd=20
    CLOSE-WAIT  1  0  127.0.0.1:56252  127.0.0.1:8650  fd=23
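
The same snapshot can be grabbed from inside the test process itself,
which is handy for the reproducer idea below. A minimal sketch,
assuming psutil is available (it's not a tractor dependency) and using
a hypothetical helper name:

    # hypothetical helper: list this process's TCP connections stuck in
    # CLOSE_WAIT against the registrar port (assumes `psutil` is installed).
    import psutil

    def close_wait_to_registrar(reg_port: int = 8650) -> list:
        proc = psutil.Process()
        return [
            c for c in proc.connections(kind='tcp')
            if (
                c.status == psutil.CONN_CLOSE_WAIT
                and c.raddr
                and c.raddr.port == reg_port
            )
        ]

    if __name__ == '__main__':
        leaked = close_wait_to_registrar()
        print(f'{len(leaked)} CLOSE_WAIT fds -> registrar')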

Interpretation

CLOSE_WAIT means the remote peer sent FIN but the local
process hasn't called close()
on its end. Since both endpoints
(registrar server + discovery client) live in the same pytest process
here, the registrar's server-side fd was successfully closed (it's no
longer in the fd table), but the matching client-side fd was
abandoned.
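
To make the state machine concrete, here's a standalone, tractor-free
sketch of how a socket ends up parked in CLOSE_WAIT: the server side
closes (sending FIN), the client side ACKs it at the TCP level but
never calls close(), so its fd sits in CLOSE_WAIT indefinitely.

    # standalone CLOSE_WAIT demo (no tractor): the "server" end closes,
    # the "client" end never does, so `ss -tn` shows it stuck in CLOSE_WAIT.
    import socket
    import subprocess
    import time

    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    cli = socket.create_connection(('127.0.0.1', port))
    conn, _ = srv.accept()

    conn.close()      # server sends FIN -> client side enters CLOSE_WAIT
    time.sleep(0.1)   # let the FIN be delivered

    out = subprocess.run(
        ['ss', '-tn', 'state', 'close-wait'],
        capture_output=True, text=True,
    ).stdout
    print('\n'.join(l for l in out.splitlines() if f':{port}' in l))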

The pattern is 6 leaked fds across a test session that spawned a
single subactor and exercised it through one IPC Context +
MsgStream, which suggests each find_actor / wait_for_actor / similar
discovery roundtrip leaks an fd. Worth confirming with a focused
reproducer (1 spawn → expected 1 leak; 100 spawns → expected 100
leaks).
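
A minimal reproducer might look roughly like this; treat the API
details (the start_actor signature, whether find_actor is the leaking
path, registry-address wiring) as best-effort guesses rather than
confirmed, since this hasn't been run yet:

    # sketch: spawn/teardown N subactors, do a discovery roundtrip each
    # time, and assert the process's CLOSE_WAIT count doesn't grow.
    import psutil
    import trio
    import tractor

    N = 10

    def count_close_wait() -> int:
        return sum(
            1 for c in psutil.Process().connections(kind='tcp')
            if c.status == psutil.CONN_CLOSE_WAIT
        )

    async def main():
        before = count_close_wait()
        for i in range(N):
            async with tractor.open_nursery() as an:
                portal = await an.start_actor(f'leaky_{i}', enable_modules=[])

                # the kind of discovery roundtrip suspected of leaking
                async with tractor.find_actor(f'leaky_{i}') as found:
                    assert found is not None

                await portal.cancel_actor()

        after = count_close_wait()
        assert after == before, f'leaked {after - before} CLOSE_WAIT fds'

    if __name__ == '__main__':
        trio.run(main)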

Likely culprit

The discovery client path in tractor.discovery._registry (or
wherever find_actor / wait_for_actor opens a transient registrar
channel) doesn't appear to close() its end of the connection after
the registrar server-side response completes. Under TCP this
manifests as CLOSE_WAIT accumulation; under UDS the symptom would
be similar but with file-descriptor (vs. socket) names.
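
If that's confirmed, the fix presumably amounts to giving the
transient registrar channel an explicit lifetime. Something shaped
like the following, where Channel.from_addr / aclose are stand-ins
for whatever the discovery code actually calls (hypothetical, not the
current API):

    # hypothetical shape of the fix: guarantee the transient registrar
    # channel is closed even on error/cancel paths. `Channel.from_addr()`
    # and `.aclose()` are stand-ins for the real discovery-side calls.
    from contextlib import asynccontextmanager

    @asynccontextmanager
    async def open_registrar_chan(addr):
        chan = await Channel.from_addr(addr)  # stand-in ctor
        try:
            yield chan
        finally:
            # the call the leaking path is presumably skipping
            await chan.aclose()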

Side effects

  • fd table pressure across long pytest sessions, eventually hitting
    EMFILE (a cheap way to watch for this is sketched after this list).
  • Test-suite flakiness amplifier: it contributes to the conditions
    under which #451
    (Mode-A cancel-cascade hang) reproduces, because the parent's trio
    loop is registered with epoll on a growing set of zombie fds.
    Whether trio is itself watching the leaked fds is the next
    question to chase.
  • May also explain transient :8650 port-reuse failures across
    parallel pytest sessions when the kernel's half-closed counterpart
    states (FIN_WAIT_2 / TIME_WAIT) pile up on the registrar side.
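
Re the EMFILE risk, fd-table growth is cheap to watch from inside a
long session; a Linux-only sketch:

    # Linux-only: count this process's open fds by listing /proc/self/fd.
    import os

    def open_fd_count() -> int:
        return len(os.listdir('/proc/self/fd'))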

Investigation next steps

  • Add a focused reproducer: open + close N actor nurseries in a loop
    in a single trio.run, snapshot ss -tnp before/after, assert no
    CLOSE_WAIT growth (see the reproducer sketch above).
  • Trace which call-site in tractor.discovery opens the client end
    of these connections (likely find_actor / wait_for_actor); a rough
    tracing hook is sketched after this list.
  • Verify whether tractor.ipc._chan.Channel (or the
    discovery-specific equivalent) has an aclose path that's bypassed
    in some discovery flows.
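
For the call-site tracing, one cheap option is to wrap the connect
path and dump a stack whenever something dials the registrar port. A
rough sketch that monkeypatches trio.open_tcp_stream (which, as far
as I can tell, is what the TCP transport ultimately calls; this only
works if the transport resolves it via attribute lookup on the trio
module rather than a `from trio import ...`):

    # rough tracing hook: print a stack trace for every connection dialed
    # to the registrar port, to find the discovery call-site that opens
    # (and never closes) the client end.
    import traceback
    import trio

    REG_PORT = 8650
    _real_open_tcp_stream = trio.open_tcp_stream

    async def traced_open_tcp_stream(host, port, **kwargs):
        if port == REG_PORT:
            print(f'-> dialing registrar {host}:{port} from:')
            traceback.print_stack()
        return await _real_open_tcp_stream(host, port, **kwargs)

    # install before the test session starts (e.g. in a conftest fixture)
    trio.open_tcp_stream = traced_open_tcp_stream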

See also

  • #451 — Mode-A
    cancel-cascade hang where this leak was first observed
  • #379 — subint
    umbrella

Tracked from #379
(subint umbrella).
