Skip to content

fix: reap orphaned fsmonitor daemons in wt remove internal sweep#2814

Merged
max-sixty merged 3 commits into
mainfrom
reap-orphan-fsmonitor
May 19, 2026
Merged

fix: reap orphaned fsmonitor daemons in wt remove internal sweep#2814
max-sixty merged 3 commits into
mainfrom
reap-orphan-fsmonitor

Conversation

@max-sixty
Copy link
Copy Markdown
Owner

Why

Companion to #2813. That PR stops the fsmonitor-daemon leak at its source (wt remove force-terminating a wedged daemon). This PR is defense-in-depth for daemons orphaned by paths that bypass wt remove entirely — plain git worktree remove, a manual rm -rf, or a crashed wt — which #2813 cannot cover. Background: a wedged git fsmonitor--daemon makes git status (and wt list) hang; orphans had accumulated ~80-deep on one machine.

What

Adds an orphan sweep (src/git/fsmonitor.rs: classify_orphans, reap_orphan_fsmonitor_daemons) hooked into the existing wt remove internal sweep op — no new always-on/timer mechanism, and deliberately not wired to wt list or any read-only command (killing processes as a side effect of a read-only command is out of bounds). It reaps only daemons whose IPC socket resolves to a worktree that no longer exists, SIGTERM → bounded wait → SIGKILL, socket-scoped via lsof -a -p <pid> (never broad process-name matching). A daemon serving a live worktree is never reaped — proven by unit tests, and the invariant holds even when the live-worktree set is unknowable: live_git_dirs returns Option, and an unknowable set disables resolved-socket reaping entirely rather than falling back to "reap everything".

Shares low-level fsmonitor mechanics with #2813 and both touch faq.md / troubleshooting.md / CHANGELOG.md; whichever of the two lands second needs a small dedup and doc reconciliation.

Testing

wt hook pre-merge --yes green (3757/3757). 11 deterministic unit tests cover the pure logic, including live_worktree_daemon_is_never_reaped and unknowable_live_set_spares_resolved_socket_daemons.

src/git/fsmonitor.rs is a new file, so it is entirely in patch scope: 94.5% line / 95.4% region. The uncovered lines are real-OS-signal and live daemon-enumeration paths only reachable by spawning real git fsmonitor--daemon processes; their pure logic is exhaustively unit-tested. A real-daemon end-to-end test was prototyped and dropped as flaky and net-negative (daemon-spawn races; risk of SIGKILLing unrelated dev-machine daemons). These uncovered OS-only lines are an accepted trade-off — if codecov/patch flags them it is a known false positive, not a gap to fill with a fragile test.

Defense-in-depth for git fsmonitor--daemon processes orphaned by paths
that bypass wt remove's synchronous daemon stop (plain git worktree
remove, manual rm -rf, crashed wt). New src/git/fsmonitor.rs identifies
provably-orphaned daemons via the IPC socket + live-worktree set and
escalates SIGTERM->bounded wait->SIGKILL. Hooked into the existing
repo-wide internal sweep op (run_internal_sweep) alongside the stale
trash sweep, after primary output. Unix only; no-op on Windows.

Finishing pass:
- Fix rustdoc -Dwarnings failures that broke the pre-merge doc step
  (public docs linked private REAP_KILL_DEADLINE / IPC_SOCKET_NAME;
  Repository::git_dir was actually WorkingTree::git_dir).
- Gate the escalate_terminate test scaffolding under #[cfg(unix)] so the
  Windows test build compiles under -D warnings: FakeSignaller, its two
  impls, its three tests, and the RefCell/HashMap imports were ungated
  while escalate_terminate itself is #[cfg(unix)].
- Strengthen the 'a live worktree's daemon is never reaped' invariant:
  live_git_dirs now returns Option; an unknowable live set (git worktree
  list failed) disables class-2 reaping entirely rather than being
  treated as an empty set that would reap every resolved-socket daemon
  under the repo. New unit test unknowable_live_set_spares_resolved_socket_daemons.
- Drop the structurally-dead catch-all arm in canonicalize_socket; both
  remaining arms are exercised by the existing symlink/gone test.

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@worktrunk-bot worktrunk-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question on the pgrep→lsof handoff in enumerate_daemons — see inline. Otherwise the data-safety contract reads carefully and the unit tests cover the classification well.

Comment thread src/git/fsmonitor.rs Outdated
`pgrep -f "git fsmonitor--daemon"` is a substring match on the full
command line, so a non-daemon process whose argv merely contains that
string — a debugger `gdb …/git fsmonitor--daemon`, a wrapper — is
enumerated as a candidate. Such a process holds no IPC socket, so
`parse_lsof_socket_path` returned None and `classify_orphans` treated
`socket == None` as class 1 (reap) and SIGTERM/SIGKILL'd the unrelated
process during the sweep.

Factor the per-PID lsof-stdout → `Option<DaemonProcess>` mapping into
`daemon_from_lsof_stdout` and guard it: if the lsof output does not
contain `IPC_SOCKET_NAME` at all, return None rather than misclassify.
A real orphaned daemon's lsof output always contains the socket name
(resolved path or bare), so class-1 behavior is unchanged for real
orphans. Three deterministic unit tests cover the pgrep-false-positive
skip, the resolved-socket pass-through, and the bare-name class-1
preservation.

Co-Authored-By: Claude <noreply@anthropic.com>
max-sixty added a commit that referenced this pull request May 19, 2026
…#2813)

## Why

`wt remove` already calls `git fsmonitor--daemon stop` before staging a
worktree to trash, but that stop is an IPC request to the daemon itself.
When the daemon is *wedged* — the common failure that makes `git status`
(and therefore `wt list`) hang — it can't answer the IPC, so the stop is
a silent no-op and the daemon leaks. In practice this accumulated ~80
orphaned `git fsmonitor--daemon` processes on one machine, one of which
wedged and hung `wt list` outright.

## What

Adds a canonical `stop_fsmonitor_daemon` helper: after the IPC stop, it
resolves the daemon's PID from its IPC socket and escalates SIGTERM →
bounded wait → SIGKILL, scoped strictly to the removed worktree's own
socket (never matched by process name, never another worktree's daemon).
All three removal paths (library, foreground, detached-background) route
through the one helper; the redundant `git ... fsmonitor--daemon stop`
shell fragment in the detached path is deleted so there is a single
canonical stop. Force-kill is Unix-only with a clean `#[cfg(not(unix))]`
no-op (Windows keeps the IPC stop, where the daemon uses a named pipe).

This is the source-level fix for the leak on the removal path. Companion
PR #2814 adds a defense-in-depth sweep for daemons orphaned by paths
that bypass `wt remove` (plain `git worktree remove`, manual `rm`, a
crashed `wt`). The two branches share low-level fsmonitor mechanics and
both touch `faq.md` / `troubleshooting.md` / `CHANGELOG.md`, so
whichever lands second will need a small dedup and doc reconciliation.

## Testing

Unit tests cover the deterministic logic: git-dir/socket resolution for
a linked worktree, the SIGTERM-honored fast path, and SIGTERM-ignored →
SIGKILL escalation (with a readiness handshake to avoid a
signal-vs-trap-install race). `wt hook pre-merge --yes` is green. The
`lsof`-failure and git-dir-error arms are single best-effort
`log::debug!`+return lines that fall through to the existing fail-open
behavior.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Resolves doc + CHANGELOG conflicts after #2813 landed:
- CHANGELOG.md: moves the orphan-sweep entry from `## 0.51.0 / Fixed`
  (where it was incorrectly placed when this branch first diffed against
  6fd2b70) into `## Unreleased / Fixed`, alongside #2813's
  wedged-daemon-on-removal entry.
- docs/content/faq.md: keeps both "Other cleanup" bullets — #2813's
  synchronous-on-removal force-kill and this PR's background
  orphan-sweep (distinct mechanisms, both worth documenting).
- skills/worktrunk/reference/troubleshooting.md: collapses two
  near-duplicate `pkill` snippets into one and combines #2813's
  wt-remove-reaps-even-wedged paragraph with this PR's
  residual-case-on-a-live-worktree paragraph.
- skills/worktrunk/reference/faq.md: regenerated by the doc-sync test
  from docs/content/faq.md.

src/main.rs / src/git/mod.rs / src/commands/process.rs auto-merged
cleanly; 18 fsmonitor unit tests pass on the merged tree, including
this PR's `daemon_from_lsof_*` tests and #2813's `terminate_pid` tests.

Co-Authored-By: Claude <noreply@anthropic.com>
@max-sixty max-sixty enabled auto-merge (squash) May 19, 2026 23:37
@max-sixty max-sixty merged commit ace32e0 into main May 19, 2026
32 of 34 checks passed
@max-sixty max-sixty deleted the reap-orphan-fsmonitor branch May 19, 2026 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants