fix: harden watcher singleton locking #71
Merged
Concurrent fm_lock_try_acquire could produce two lock holders: the stale-steal path destroyed and recreated the lock dir without serialization, so two racers could both reclaim it. That let two fm-watch.sh watchers run in one home, doubling every wake.

- fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling ".steal" mutex (its abandonment floor decoupled from FM_LOCK_STALE_AFTER, never below 2s) and the holder pid is re-verified dead immediately before the rmdir. Every claim writes its pid and reads it back; a stomped pid means we lost. mkdir is the atomic arbiter, so at most one acquirer ever returns 0.
- fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its pid; if another took over, it exits cleanly, so any transient duplicate self-resolves within one poll.
- fm-watch-arm.sh: safe re-arm. Default no-ops via the singleton; --restart signals only this home's recorded watcher pid, never a broad pkill that would kill sibling homes' watchers.
- AGENTS.md section 8: route re-arm through fm-watch-arm.sh and warn that pkill -f bin/fm-watch.sh kills other homes' watchers.
- Tests: concurrency single-winner, dead-pid steal, live-lock no-op, watcher self-eviction.
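The "mkdir is the atomic arbiter" plus "write the pid and read it back" idea can be illustrated with a minimal sketch. This is not the code in fm-wake-lib.sh: the function name `sketch_lock_try_acquire` and the `pid` file layout inside the lock dir are assumptions for illustration, and the stale-steal path and ".steal" mutex are deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the real fm_lock_try_acquire.
# Returns 0 only for the single caller that both creates the lock dir
# and still sees its own pid in it afterwards.
sketch_lock_try_acquire() {
  local lockdir=$1 me=${2:-$$} seen

  # mkdir either creates the directory or fails because it already
  # exists, so at most one racer gets past this line.
  mkdir "$lockdir" 2>/dev/null || return 1

  # Claim the lock by recording our pid.
  printf '%s\n' "$me" > "$lockdir/pid" || return 1

  # Read the pid back: if someone replaced it, we lost the race.
  seen=$(cat "$lockdir/pid" 2>/dev/null) || return 1
  [ "$seen" = "$me" ]
}
```

A caller would use it as `if sketch_lock_try_acquire "$lockdir"; then ...; fi`. Without a steal path, a crashed holder leaves the lock behind forever, which is the case the real reclaim logic exists to handle.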
This was referenced Jun 25, 2026: leo1oel added a commit to leo1oel/nemo that referenced this pull request.
…hardening (kunchenguid#71, kunchenguid#75) (#4)

Port the two upstream watcher-reliability commits onto the herdr backend. The watcher supervises the whole fleet, so these are reliability-critical.

kunchenguid#71 - race-proof singleton + home-scoped re-arm:
- fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling ".steal" mutex (floor >= 2s, decoupled from FM_LOCK_STALE_AFTER) and the holder pid is re-verified dead immediately before the rmdir; every claim writes its pid and reads it back, so concurrent acquirers yield exactly one winner.
- fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its pid; if another took over it exits cleanly, so any transient duplicate self-resolves within one poll. Records fm-home/watcher-path/pid-identity.
- fm-watch-arm.sh (new): safe re-arm; default no-ops via the singleton, --restart signals only this home's recorded watcher pid (never a broad pkill that would kill sibling secondmate homes' watchers).

kunchenguid#75 - honest self-verifying arm + prominent no-watcher guard banner:
- fm-watch-arm.sh forks the watcher as a tracked child, verifies a genuinely live watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest line - started/healthy/FAILED - exiting non-zero when none can be confirmed; never reports healthy off a stale beacon or dead/reused pid.
- fm-guard.sh leads the no-watcher case with a bordered banner so it cannot be skimmed past.

herdr adaptations: fm-wake-lib.sh/fm-guard.sh/fm-watch-arm.sh taken from upstream (near-identical to the pre-kunchenguid#71 base, keeping FM_HOME so the new arm script's home-scoping is intact and consistent); the kunchenguid#71 hunks applied to the herdr-rewritten fm-watch.sh by hand; the daemon's kunchenguid#71 changes were comment-only and left as the herdr daemon's existing (still-accurate) wording. AGENTS.md §8 and README updated to route re-arm through fm-watch-arm.sh and warn against fire-and-forget `&` / broad pkill.

The 19 new upstream tests (lock concurrency, self-eviction, arm started/healthy/FAILED, guard banner) ported and all pass; tests sandbox nothing new.
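The self-eviction rule ("each poll a watcher checks the lock still names its pid") can be sketched as below. The function names, the `pid` file inside the lock dir, and the use of `sleep` as a stand-in for one poll's work are all assumptions for illustration; the real fm-watch.sh loop is not shown in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of watcher self-eviction, not the real fm-watch.sh.

# Succeeds only while the lock dir still records our pid.
sketch_still_owner() {
  local lockdir=$1 me=${2:-$$} holder
  holder=$(cat "$lockdir/pid" 2>/dev/null) || return 1
  [ "$holder" = "$me" ]
}

# Poll until another process has taken over the lock (or it vanished),
# then return 0 so the caller can exit cleanly.
sketch_poll_until_evicted() {
  local lockdir=$1 me=${2:-$$} interval=${3:-1}
  while sketch_still_owner "$lockdir" "$me"; do
    sleep "$interval"   # stand-in for one poll's real work
  done
  return 0
}
```

Because ownership is re-checked on every iteration, a duplicate watcher started during a race lasts at most about one poll interval before it notices the takeover and stops.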
Intent
Make the firstmate watcher singleton race-proof so two fm-watch.sh watchers can never run in one home, and make re-arm safe across homes. Root cause: fm_lock_try_acquire's stale-steal path in bin/fm-wake-lib.sh destroyed and recreated the lock dir without serialization, so two racers seeing the same dead pid could both reclaim it (observed: two watchers in one home doubling every wake); and the supervision pattern of re-arming via 'pkill -f bin/fm-watch.sh' matched every home's watcher, killing secondmate homes' watchers.
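To make the difference from a broad `pkill -f bin/fm-watch.sh` concrete, here is a minimal sketch of home-scoped signalling. The function name is hypothetical; the `state/.watch.lock/pid` path is the one named in the review comments further down, and the sketch assumes it holds a single numeric pid. It does not guard against pid reuse, which the review below flags as a separate concern.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real fm-watch-arm.sh --restart path.
# Sends TERM only to the pid recorded in ONE home's watcher lock,
# so watchers belonging to other homes are never touched.
sketch_signal_home_watcher() {
  local home=$1 pid

  # No readable pid file means no recorded watcher for this home.
  pid=$(cat "$home/state/.watch.lock/pid" 2>/dev/null) || return 1

  # Refuse anything that is not a plain positive-looking number.
  case $pid in
    ''|*[!0-9]*) return 1 ;;
  esac

  kill -TERM "$pid" 2>/dev/null
}
```

Calling `sketch_signal_home_watcher "$FM_HOME"` affects at most one process, whereas a pattern-based pkill matches every process whose command line contains the script path, regardless of home.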
Changes and the deliberate decisions behind them:
Constraints honored: portable bash/POSIX, compatible with macOS (BSD) and Linux (GNU), no flock. All existing env knobs are preserved (FM_LOCK_STALE_AFTER, FM_WATCHER_STALE_GRACE, FM_GUARD_GRACE), and the watcher's exit/reason contract (signal/stale/check/heartbeat), durable wake queue, and heartbeat backoff are unchanged. Verified locally: shellcheck clean (CI-style across `bin/*.sh` and `tests/*.sh`), all 165 tests across 8 files pass, and a stress harness yields exactly one winner across 1000+ concurrent steal attempts, even at FM_LOCK_STALE_AFTER=0.
What Changed
fm-watch-arm.sh for home-scoped watcher arming and restart behavior, with watcher self-eviction when another process takes over its lock.
Risk Assessment
Testing
Captain, I exercised the focused watcher/wake-queue suite, the full CI behavior-test loop, and a reviewer-visible CLI transcript for the intended race and re-arm behavior; all checks passed and the working tree stayed clean.
Evidence: manual watcher singleton and re-arm transcript
Key lines: winner_count=1; second_arm_output=watcher: already running pid 33072; home_b_still_alive=yes; original_watcher_alive_after_poll=no.
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
🔧 **Review** - 2 issues found → auto-fixed (4) ✅
bin/fm-wake-lib.sh:92 - When FM_LOCK_STALE_AFTER=0, a just-created lock dir with no pid is immediately treated as stale. The mkdir winner can be preempted before fm_lock_claim, another process can remove and recreate the lock dir and return success, then the original winner can resume, write its pid into the recreated dir, and also return success. Give the mid-acquire empty/non-numeric pid case a nonzero minimum grace or another unstealable claim token.
bin/fm-watch-arm.sh:36 - --restart sends TERM to any live pid found in state/.watch.lock/pid. If the lock is stale and that pid has been reused, this can kill an unrelated process while trying to restart the watcher. Verify that the pid still belongs to this home's fm-watch.sh instance before signalling it, or store enough owner identity in the lock to reject reused pids.
🔧 Fix: Harden watcher lock ownership
1 error still open:
bin/fm-wake-lib.sh:73 - A process that wins `mkdir "$lockdir"` can still be paused longer than `mid_acquire_stale` before this write, letting another process reclaim the empty lock, write its pid, and return success. When the original process resumes, it writes into the newly-created lockdir, reads back its own pid, and also returns success, so the new single-winner guarantee is still breakable even with the 2s floor. Bind the claim to an unstealable owner token, or otherwise make a stale original winner unable to claim a recreated directory.
🔧 Fix: Captain, harden watcher lock claims
1 error still open:
bin/fm-wake-lib.sh:189 - The stale-steal mutex can be force-cleared after 2s without proving the stealer is dead. If stealer A creates `$lockdir.steal`, reads a dead `cur`, then is paused past `steal_stale`, stealer B can remove the mutex, remove/recreate the lock, and return success; when A resumes it can remove B's live lock at `fm_lock_remove_path`, recreate it, and also return success. Make the steal mutex itself owned/unstealable, or revalidate the exact lock owner immediately before removal so a live slow stealer cannot be displaced.
🔧 Fix: Captain, harden lock steal mutex ownership
1 error still open:
bin/fm-wake-lib.sh:216 - A paused mid-acquire owner can still win after the stale stealer has checked the empty pid but before this removal. In that interleaving, the original claimant writes its pid and returns success, then the stealer removes that now-live lock, creates a new one, and also returns success, so two callers can enter the protected section. Revalidate the exact primary owner/pid immediately at removal time, or make `fm_lock_claim` fail while the steal lock is held.
🔧 Fix: Captain, harden steal-aware lock claims
✅ Re-checked - no issues remain.
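The recurring advice in this review round, "revalidate the exact owner immediately at removal time", can be sketched as follows. This is an illustration under stated assumptions, not the fix that landed: the function name and the `pid` file layout are hypothetical, and the sketch only narrows the window between check and removal rather than closing it, which is why the PR additionally serializes stealers through the ".steal" mutex.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the real steal path in fm-wake-lib.sh.
# Removes a stale lock only if it STILL names the pid we judged dead.
sketch_steal_if_still_dead() {
  local lockdir=$1 expected=$2 now

  # Only a numeric pid can be checked for liveness; empty or garbled
  # values are left alone in this sketch.
  case $expected in
    ''|*[!0-9]*) return 1 ;;
  esac

  # Re-read the owner right before removal; abort if it changed.
  now=$(cat "$lockdir/pid" 2>/dev/null) || return 1
  [ "$now" = "$expected" ] || return 1

  # kill -0 delivers no signal; success means the holder is alive.
  # Caveat: it also fails when the process exists but we lack
  # permission, which this sketch would misread as "dead".
  if kill -0 "$expected" 2>/dev/null; then
    return 1
  fi

  rm -f "$lockdir/pid" && rmdir "$lockdir"
}
```

A stealer would read the pid once, decide it is stale, and then call `sketch_steal_if_still_dead "$lockdir" "$pid"`; a nonzero return means the lock changed hands or the holder is alive, so the steal must be abandoned.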
✅ **Test** - passed
✅ No issues found.
`FM_FLEET_SYNC_BOOTSTRAP_TIMEOUT=1 bin/fm-bootstrap.sh`
`bash tests/fm-wake-queue.test.sh`
`set -eu; for test_script in tests/*.test.sh; do "$test_script"; done`
Manual shell harness wrote `/var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KVXEQH3C9PT9ARAKSMFE59J4/watcher-singleton-rearm-transcript.txt` exercising stale-lock contention, live-holder refusal, duplicate arm no-op, home-scoped restart, and lock-takeover self-eviction.
✅ **Document** - passed
✅ No issues found.
✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.