Skip to content

fix: harden watcher singleton locking#71

Merged
kunchenguid merged 6 commits into
mainfrom
fm/fm-watch-singleton-h4
Jun 24, 2026
Merged

fix: harden watcher singleton locking#71
kunchenguid merged 6 commits into
mainfrom
fm/fm-watch-singleton-h4

Conversation

@kunchenguid

Copy link
Copy Markdown
Owner

Intent

Make the firstmate watcher singleton race-proof so two fm-watch.sh watchers can never run in one home, and make re-arm safe across homes. Root cause: fm_lock_try_acquire's stale-steal path in bin/fm-wake-lib.sh destroyed and recreated the lock dir without serialization, so two racers seeing the same dead pid could both reclaim it (observed: two watchers in one home doubling every wake); and the supervision pattern of re-arming via 'pkill -f bin/fm-watch.sh' matched every home's watcher, killing secondmate homes' watchers.

Changes and the deliberate decisions behind them:

  • bin/fm-wake-lib.sh: rewrote fm_lock_try_acquire to be single-winner. Reclaim of a stale lock is now serialized through a sibling '.steal' mkdir mutex so two stealers can never destroy-and-recreate concurrently; the holder pid is re-verified dead immediately before the rmdir; and a new fm_lock_claim helper writes the pid then reads it back so a stomped pid means we lost. mkdir is the atomic winner-selector, so at most one acquirer returns 0. Deliberate: the steal-mutex abandonment threshold (steal_stale) is DECOUPLED from FM_LOCK_STALE_AFTER and floored at 2s on purpose - tying it to that knob (which can be tuned to 0) would force-clear a live stealer's mutex and re-admit the double-winner race. Deliberate: fm_lock_claim reads its pid as ${BASHPID:-$$} directly rather than via $(fm_current_pid), because a command substitution forks a subshell whose pid would be recorded dead; this also keeps it consistent with fm_lock_release's own ${BASHPID:-$$} comparison. Removed the now-unused fm_lock_remove_stale (only ever called internally).
  • bin/fm-watch.sh: added per-poll self-eviction - each cycle the watcher verifies the lock pid still names it (compared against WATCHER_PID captured once at startup) and exits cleanly if another pid took over, so any transient duplicate self-resolves within one poll. The EXIT-trap release intentionally no-ops for a foreign pid so the survivor's lock is untouched.
  • bin/fm-watch-arm.sh (new): safe home-scoped re-arm. Default execs the watcher and relies on the singleton no-op; --restart signals ONLY this home's recorded watcher pid (from this home's state/.watch.lock) and waits for it to exit before relaunching - never a broad pkill.
  • AGENTS.md section 8: routes re-arm through bin/fm-watch-arm.sh, documents --restart as the home-scoped forced restart, and explicitly warns that 'pkill -f bin/fm-watch.sh' kills sibling homes' watchers since secondmate homes run the same script.
  • tests/fm-wake-queue.test.sh: added four hermetic tests - concurrency single-winner (40 contenders, exactly one wins), dead-pid stale lock stolen by a single acquirer, live-holder lock not stolen (no-op reporting FM_LOCK_HELD_PID), and watcher self-eviction on lock takeover.

Constraints honored: portable bash/POSIX, macOS (BSD) + Linux (GNU) compatible, no flock; all existing env knobs preserved (FM_LOCK_STALE_AFTER, FM_WATCHER_STALE_GRACE, FM_GUARD_GRACE) and the watcher's exit/reason contract (signal/stale/check/heartbeat), durable wake queue, and heartbeat backoff unchanged. Verified locally: shellcheck clean (CI-style across bin/.sh tests/.sh), all 165 tests across 8 files pass, and a stress harness yields exactly one winner across 1000+ concurrent steal attempts even at FM_LOCK_STALE_AFTER=0.

What Changed

  • Hardened wake lock acquisition so stale lock stealing is serialized, ownership is revalidated before takeover, and only one claimant can enter a protected section.
  • Added fm-watch-arm.sh for home-scoped watcher arming and restart behavior, with watcher self-eviction when another process takes over its lock.
  • Updated watcher documentation and added shell tests for lock contention, stale lock recovery, live-holder refusal, and watcher takeover handling.

Risk Assessment

⚠️ Medium: Captain, the final patch has no substantiated merge-blocking findings, but it rewrites a central shared lock primitive with subtle scheduler and stale-state behavior, so residual concurrency risk remains higher than a routine change.

Testing

Captain, I exercised the focused watcher/wake-queue suite, the full CI behavior-test loop, and a reviewer-visible CLI transcript for the intended race and re-arm behavior; all checks passed and the working tree stayed clean.

Evidence: manual watcher singleton and re-arm transcript

Key lines: winner_count=1; second_arm_output=watcher: already running pid 33072; home_b_still_alive=yes; original_watcher_alive_after_poll=no.

Manual evidence for watcher singleton and home-scoped re-arm
repo=/Users/kunchen/.no-mistakes/worktrees/016d88035d58/01KVXEQH3C9PT9ARAKSMFE59J4

1. Stale lock steal race with 40 concurrent contenders
dead_pid_before=999999
winner_count=1
winner_record=winner pid=31788
final_lock_pid=31788
PASS: exactly one contender reclaimed the stale lock.

2. Live holder lock is not stolen
rc=1 held=33046 pid_after=33046
PASS: live holder was reported and its pid file was preserved.

3. Duplicate watcher arm in one home no-ops
first_watcher_pid=33072
lock_pid=33072
first_watcher_alive_after_second_arm=yes
second_arm_output=watcher: already running pid 33072
PASS: second arm exited and reported the existing singleton.

4. --restart is scoped to the selected FM_HOME
home_a_old_pid=33172
home_a_new_pid=33287
home_a_lock_after_restart=33287
home_a_old_exited=yes
home_b_pid_before_and_after=33229/33229
home_a_new_alive=yes
home_b_still_alive=yes
PASS: restarting home A replaced only home A watcher; home B stayed alive.

5. Watcher self-evicts when its lock pid is taken over
original_watcher_pid=33403
takeover_pid=33495
original_watcher_alive_after_poll=no
lock_pid_after_self_eviction=33495
takeover_pid_alive=yes
PASS: displaced watcher exited and did not remove the survivor lock.

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

🔧 **Review** - 2 issues found → auto-fixed (4) ✅
  • 🚨 bin/fm-wake-lib.sh:92 - When FM_LOCK_STALE_AFTER=0, a just-created lock dir with no pid is immediately treated as stale. The mkdir winner can be preempted before fm_lock_claim, another process can remove and recreate the lock dir and return success, then the original winner can resume, write its pid into the recreated dir, and also return success. Give the mid-acquire empty/non-numeric pid case a nonzero minimum grace or another unstealable claim token.
  • ⚠️ bin/fm-watch-arm.sh:36 - --restart sends TERM to any live pid found in state/.watch.lock/pid. If the lock is stale and that pid has been reused, this can kill an unrelated process while trying to restart the watcher. Verify that the pid still belongs to this home's fm-watch.sh instance before signalling it, or store enough owner identity in the lock to reject reused pids.

🔧 Fix: Harden watcher lock ownership
1 error still open:

  • 🚨 bin/fm-wake-lib.sh:73 - A process that wins mkdir "$lockdir" can still be paused longer than mid_acquire_stale before this write, letting another process reclaim the empty lock, write its pid, and return success. When the original process resumes, it writes into the newly-created lockdir, reads back its own pid, and also returns success, so the new single-winner guarantee is still breakable even with the 2s floor. Bind the claim to an unstealable owner token, or otherwise make a stale original winner unable to claim a recreated directory.

🔧 Fix: Captain, harden watcher lock claims
1 error still open:

  • 🚨 bin/fm-wake-lib.sh:189 - The stale-steal mutex can be force-cleared after 2s without proving the stealer is dead. If stealer A creates $lockdir.steal, reads a dead cur, then is paused past steal_stale, stealer B can remove the mutex, remove/recreate the lock, and return success; when A resumes it can remove B's live lock at fm_lock_remove_path, recreate it, and also return success. Make the steal mutex itself owned/unstealable, or revalidate the exact lock owner immediately before removal so a live slow stealer cannot be displaced.

🔧 Fix: Captain, harden lock steal mutex ownership
1 error still open:

  • 🚨 bin/fm-wake-lib.sh:216 - A paused mid-acquire owner can still win after the stale stealer has checked the empty pid but before this removal. In that interleaving, the original claimant writes its pid and returns success, then the stealer removes that now-live lock, creates a new one, and also returns success, so two callers can enter the protected section. Revalidate the exact primary owner/pid immediately at removal time, or make fm_lock_claim fail while the steal lock is held.

🔧 Fix: Captain, harden steal-aware lock claims
✅ Re-checked - no issues remain.

✅ **Test** - passed

✅ No issues found.

  • FM_FLEET_SYNC_BOOTSTRAP_TIMEOUT=1 bin/fm-bootstrap.sh
  • bash tests/fm-wake-queue.test.sh
  • set -eu; for test_script in tests/*.test.sh; do "$test_script"; done
  • Manual shell harness wrote /var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KVXEQH3C9PT9ARAKSMFE59J4/watcher-singleton-rearm-transcript.txt exercising stale-lock contention, live-holder refusal, duplicate arm no-op, home-scoped restart, and lock-takeover self-eviction.
✅ **Document** - passed

✅ No issues found.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

Concurrent fm_lock_try_acquire could produce two lock holders: the
stale-steal path destroyed and recreated the lock dir without
serialization, so two racers could both reclaim it. That let two
fm-watch.sh watchers run in one home, doubling every wake.

- fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a
  sibling ".steal" mutex (its abandonment floor decoupled from
  FM_LOCK_STALE_AFTER, never below 2s) and the holder pid is re-verified
  dead immediately before the rmdir. Every claim writes its pid and reads
  it back; a stomped pid means we lost. mkdir is the atomic arbiter, so at
  most one acquirer ever returns 0.
- fm-watch.sh: self-eviction. Each poll a watcher checks the lock still
  names its pid; if another took over, it exits cleanly, so any transient
  duplicate self-resolves within one poll.
- fm-watch-arm.sh: safe re-arm. Default no-ops via the singleton;
  --restart signals only this home's recorded watcher pid, never a broad
  pkill that would kill sibling homes' watchers.
- AGENTS.md section 8: route re-arm through fm-watch-arm.sh and warn that
  pkill -f bin/fm-watch.sh kills other homes' watchers.
- Tests: concurrency single-winner, dead-pid steal, live-lock no-op,
  watcher self-eviction.
@kunchenguid kunchenguid merged commit 0b11b10 into main Jun 24, 2026
4 checks passed
@kunchenguid kunchenguid deleted the fm/fm-watch-singleton-h4 branch June 24, 2026 21:54
leo1oel added a commit to leo1oel/nemo that referenced this pull request Jun 25, 2026
…hardening (kunchenguid#71, kunchenguid#75) (#4)

Port the two upstream watcher-reliability commits onto the herdr backend. The
watcher supervises the whole fleet, so these are reliability-critical.

kunchenguid#71 - race-proof singleton + home-scoped re-arm:
- fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling
  ".steal" mutex (floor >= 2s, decoupled from FM_LOCK_STALE_AFTER) and the holder
  pid is re-verified dead immediately before the rmdir; every claim writes its pid
  and reads it back, so concurrent acquirers yield exactly one winner.
- fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its
  pid; if another took over it exits cleanly, so any transient duplicate
  self-resolves within one poll. Records fm-home/watcher-path/pid-identity.
- fm-watch-arm.sh (new): safe re-arm; default no-ops via the singleton, --restart
  signals only this home's recorded watcher pid (never a broad pkill that would
  kill sibling secondmate homes' watchers).

kunchenguid#75 - honest self-verifying arm + prominent no-watcher guard banner:
- fm-watch-arm.sh forks the watcher as a tracked child, verifies a genuinely live
  watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest line
  - started/healthy/FAILED - exiting non-zero when none can be confirmed; never
  reports healthy off a stale beacon or dead/reused pid.
- fm-guard.sh leads the no-watcher case with a bordered banner so it cannot be
  skimmed past.

herdr adaptations: fm-wake-lib.sh/fm-guard.sh/fm-watch-arm.sh taken from upstream
(near-identical to the pre-kunchenguid#71 base, keeping FM_HOME so the new arm script's
home-scoping is intact and consistent); the kunchenguid#71 hunks applied to the
herdr-rewritten fm-watch.sh by hand; the daemon's kunchenguid#71 changes were comment-only
and left as the herdr daemon's existing (still-accurate) wording. AGENTS.md §8 and
README updated to route re-arm through fm-watch-arm.sh and warn against
fire-and-forget `&` / broad pkill. The 19 new upstream tests (lock concurrency,
self-eviction, arm started/healthy/FAILED, guard banner) ported and all pass;
tests sandbox nothing new.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant