
Discovery-client leaks CLOSE_WAIT fds when registrar server-side closes #452

@goodboy

Description

Surfaced while investigating a Mode-A hang
(#451) in
tests/msg/test_ext_types_msgspec.py::test_ext_types_over_ipc[use_codec_hooks-only_nsp_ext].
Distinct concern from the hang itself; filing separately so it can be
investigated independently.

Symptom

In a hung pytest process running the test session under
--spawn-backend=main_thread_forkserver --tpt-proto=tcp, ss -tnp | grep <pytest-pid> reports 6 sockets stuck in CLOSE_WAIT, all of them
client-side connections from ephemeral ports → 127.0.0.1:8650 (the
test session's reg_addr). Each shows Recv-Q=1: one byte of buffered
FIN/EOF that nothing is reading.

    CLOSE-WAIT  1  0  127.0.0.1:33750  127.0.0.1:8650  fd=14
    CLOSE-WAIT  1  0  127.0.0.1:50614  127.0.0.1:8650  fd=26
    CLOSE-WAIT  1  0  127.0.0.1:50698  127.0.0.1:8650  fd=29
    CLOSE-WAIT  1  0  127.0.0.1:54884  127.0.0.1:8650  fd=17
    CLOSE-WAIT  1  0  127.0.0.1:55004  127.0.0.1:8650  fd=20
    CLOSE-WAIT  1  0  127.0.0.1:56252  127.0.0.1:8650  fd=23
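
The same snapshot can be grabbed from inside the test process itself,
which is handy for the reproducer idea below. A minimal sketch,
assuming psutil is available (it's not a tractor dependency) and using
a hypothetical helper name:

    # hypothetical helper: list this process's TCP connections stuck in
    # CLOSE_WAIT against the registrar port (assumes `psutil` is installed).
    import psutil

    def close_wait_to_registrar(reg_port: int = 8650) -> list:
        proc = psutil.Process()
        return [
            c for c in proc.connections(kind='tcp')
            if (
                c.status == psutil.CONN_CLOSE_WAIT
                and c.raddr
                and c.raddr.port == reg_port
            )
        ]

    if __name__ == '__main__':
        leaked = close_wait_to_registrar()
        print(f'{len(leaked)} CLOSE_WAIT fds -> registrar')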

Interpretation

CLOSE_WAIT means the remote peer sent FIN but the local
process hasn't called close()
on its end. Since both endpoints
(registrar server + discovery client) live in the same pytest process
here, the registrar's server-side fd was successfully closed (it's no
longer in the fd table), but the matching client-side fd was
abandoned.
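
To make the state machine concrete, here's a standalone, tractor-free
sketch of how a socket ends up parked in CLOSE_WAIT: the server side
closes (sending FIN), the client side ACKs it at the TCP level but
never calls close(), so its fd sits in CLOSE_WAIT indefinitely.

    # standalone CLOSE_WAIT demo (no tractor): the "server" end closes,
    # the "client" end never does, so `ss -tn` shows it stuck in CLOSE_WAIT.
    import socket
    import subprocess
    import time

    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    cli = socket.create_connection(('127.0.0.1', port))
    conn, _ = srv.accept()

    conn.close()      # server sends FIN -> client side enters CLOSE_WAIT
    time.sleep(0.1)   # let the FIN be delivered

    out = subprocess.run(
        ['ss', '-tn', 'state', 'close-wait'],
        capture_output=True, text=True,
    ).stdout
    print('\n'.join(l for l in out.splitlines() if f':{port}' in l))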

The pattern is 6 leaked fds across a test session that spawned a
single subactor and exercised it through one IPC Context +
MsgStream, which suggests each find_actor / wait_for_actor / similar
discovery roundtrip leaks an fd. Worth confirming with a focused
reproducer (1 spawn → expected 1 leak; 100 spawns → expected 100
leaks).
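
A minimal reproducer might look roughly like this; treat the API
details (the start_actor signature, whether find_actor is the leaking
path, registry-address wiring) as best-effort guesses rather than
confirmed, since this hasn't been run yet:

    # sketch: spawn/teardown N subactors, do a discovery roundtrip each
    # time, and assert the process's CLOSE_WAIT count doesn't grow.
    import psutil
    import trio
    import tractor

    N = 10

    def count_close_wait() -> int:
        return sum(
            1 for c in psutil.Process().connections(kind='tcp')
            if c.status == psutil.CONN_CLOSE_WAIT
        )

    async def main():
        before = count_close_wait()
        for i in range(N):
            async with tractor.open_nursery() as an:
                portal = await an.start_actor(f'leaky_{i}', enable_modules=[])

                # the kind of discovery roundtrip suspected of leaking
                async with tractor.find_actor(f'leaky_{i}') as found:
                    assert found is not None

                await portal.cancel_actor()

        after = count_close_wait()
        assert after == before, f'leaked {after - before} CLOSE_WAIT fds'

    if __name__ == '__main__':
        trio.run(main)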

Likely culprit

The discovery client path in tractor.discovery._registry (or
wherever find_actor / wait_for_actor opens a transient registrar
channel) doesn't appear to close() its end of the connection after
the registrar server-side response completes. Under TCP this
manifests as CLOSE_WAIT accumulation; under UDS the symptom would
be similar but with file-descriptor (vs. socket) names.
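
If that's confirmed, the fix presumably amounts to giving the
transient registrar channel an explicit lifetime. Something shaped
like the following, where Channel.from_addr / aclose are stand-ins
for whatever the discovery code actually calls (hypothetical, not the
current API):

    # hypothetical shape of the fix: guarantee the transient registrar
    # channel is closed even on error/cancel paths. `Channel.from_addr()`
    # and `.aclose()` are stand-ins for the real discovery-side calls.
    from contextlib import asynccontextmanager

    @asynccontextmanager
    async def open_registrar_chan(addr):
        chan = await Channel.from_addr(addr)  # stand-in ctor
        try:
            yield chan
        finally:
            # the call the leaking path is presumably skipping
            await chan.aclose()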

Side effects

  • fd table pressure across long pytest sessions, eventually hitting
    EMFILE (a cheap way to watch for this is sketched after this list).
  • Test-suite flakiness amplifier: it contributes to the conditions
    under which #451
    (Mode-A cancel-cascade hang) reproduces, because the parent's trio
    loop is registered with epoll on a growing set of zombie fds.
    Whether trio is itself watching the leaked fds is the next
    question to chase.
  • May also explain transient :8650 port-reuse failures across
    parallel pytest sessions when the kernel's half-closed counterpart
    states (FIN_WAIT_2 / TIME_WAIT) pile up on the registrar side.
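
Re the EMFILE risk, fd-table growth is cheap to watch from inside a
long session; a Linux-only sketch:

    # Linux-only: count this process's open fds by listing /proc/self/fd.
    import os

    def open_fd_count() -> int:
        return len(os.listdir('/proc/self/fd'))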

Investigation next steps

  • Add a focused reproducer: open + close N actor nurseries in a loop
    in a single trio.run, snapshot ss -tnp before/after, assert no
    CLOSE_WAIT growth (see the reproducer sketch above).
  • Trace which call-site in tractor.discovery opens the client end
    of these connections (likely find_actor / wait_for_actor); a rough
    tracing hook is sketched after this list.
  • Verify whether tractor.ipc._chan.Channel (or the
    discovery-specific equivalent) has an aclose path that's bypassed
    in some discovery flows.
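
For the call-site tracing, one cheap option is to wrap the connect
path and dump a stack whenever something dials the registrar port. A
rough sketch that monkeypatches trio.open_tcp_stream (which, as far
as I can tell, is what the TCP transport ultimately calls; this only
works if the transport resolves it via attribute lookup on the trio
module rather than a `from trio import ...`):

    # rough tracing hook: print a stack trace for every connection dialed
    # to the registrar port, to find the discovery call-site that opens
    # (and never closes) the client end.
    import traceback
    import trio

    REG_PORT = 8650
    _real_open_tcp_stream = trio.open_tcp_stream

    async def traced_open_tcp_stream(host, port, **kwargs):
        if port == REG_PORT:
            print(f'-> dialing registrar {host}:{port} from:')
            traceback.print_stack()
        return await _real_open_tcp_stream(host, port, **kwargs)

    # install before the test session starts (e.g. in a conftest fixture)
    trio.open_tcp_stream = traced_open_tcp_stream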

See also

  • #451 — Mode-A
    cancel-cascade hang where this leak was first observed
  • #379 — subint
    umbrella

Tracked from #379
(subint umbrella).
