fix: verify watcher arming before reporting healthy#75
Merged
Conversation
…ard banner fm-watch-arm.sh now forks the watcher as a tracked child, verifies a live watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest line - started/healthy/FAILED - exiting non-zero when none can be confirmed. Never reports healthy off a stale beacon or dead/reused pid. fm-guard.sh leads the no-watcher case with a bordered ●-marked banner (in-flight count, beacon age, exact re-arm command) so it cannot be skimmed past; queued- wakes warning and grace-window quiet preserved. AGENTS.md section 8 codifies arming only via the harness's tracked background mechanism, never a fire-and-forget shell &. Tests: arm started/healthy/FAILED + never-healthy-off-dead-pid; guard banner- first and fresh-watcher-quiet. Restart test updated for the child topology.
This was referenced Jun 25, 2026
leo1oel
added a commit
to leo1oel/nemo
that referenced
this pull request
Jun 25, 2026
…hardening (kunchenguid#71, kunchenguid#75) (#4) Port the two upstream watcher-reliability commits onto the herdr backend. The watcher supervises the whole fleet, so these are reliability-critical. kunchenguid#71 - race-proof singleton + home-scoped re-arm: - fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling ".steal" mutex (floor >= 2s, decoupled from FM_LOCK_STALE_AFTER) and the holder pid is re-verified dead immediately before the rmdir; every claim writes its pid and reads it back, so concurrent acquirers yield exactly one winner. - fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its pid; if another took over it exits cleanly, so any transient duplicate self-resolves within one poll. Records fm-home/watcher-path/pid-identity. - fm-watch-arm.sh (new): safe re-arm; default no-ops via the singleton, --restart signals only this home's recorded watcher pid (never a broad pkill that would kill sibling secondmate homes' watchers). kunchenguid#75 - honest self-verifying arm + prominent no-watcher guard banner: - fm-watch-arm.sh forks the watcher as a tracked child, verifies a genuinely live watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest line - started/healthy/FAILED - exiting non-zero when none can be confirmed; never reports healthy off a stale beacon or dead/reused pid. - fm-guard.sh leads the no-watcher case with a bordered banner so it cannot be skimmed past. herdr adaptations: fm-wake-lib.sh/fm-guard.sh/fm-watch-arm.sh taken from upstream (near-identical to the pre-kunchenguid#71 base, keeping FM_HOME so the new arm script's home-scoping is intact and consistent); the kunchenguid#71 hunks applied to the herdr-rewritten fm-watch.sh by hand; the daemon's kunchenguid#71 changes were comment-only and left as the herdr daemon's existing (still-accurate) wording. AGENTS.md §8 and README updated to route re-arm through fm-watch-arm.sh and warn against fire-and-forget `&` / broad pkill. The 19 new upstream tests (lock concurrency, self-eviction, arm started/healthy/FAILED, guard banner) ported and all pass; tests sandbox nothing new.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Intent
Make firstmate's watcher arming honest and its no-watcher state impossible to miss, fixing an incident where a fire-and-forget 're-arm' was reaped on call return yet fm-watch-arm.sh reported 'already running pid X' off the dying process, so supervision was silently down ~30 min.
Three fixes: (1) bin/fm-watch-arm.sh is now self-verifying. Instead of exec-ing the watcher, it forks the one-shot watcher as a tracked child and waits on it (so killing the arm tears down the watcher and the watcher's wake exit propagates so the harness re-notifies). Before settling it VERIFIES the outcome: a live pid whose recorded identity matches THIS home's watcher AND a fresh liveness beacon (state/.last-watcher-beat within FM_GUARD_GRACE, reusing that env var as the single source of truth, not a hardcoded copy). It prints exactly one honest line - 'watcher: started pid=N (beacon fresh)', 'watcher: healthy pid=N (beacon s)', or 'watcher: FAILED - no live watcher with a fresh beacon' - and exits non-zero only on FAILED. It never reports healthy off a stale beacon or a dead/reused pid: a dead-pid lock self-heals via the existing steal path into a confirmed 'started', while a live-but-stale holder yields FAILED. Singleton/lock semantics (race-proof acquire, self-eviction, home-scoped --restart) are intentionally preserved.
(2) bin/fm-guard.sh leads the no-watcher case with a prominent bordered, dot-marked banner (in-flight task count, beacon age or 'never', exact one-line re-arm command) emitted FIRST so it cannot be skimmed past. The existing queued-wakes-pending warning and the grace-window-quiet behavior are preserved; banner text deliberately keeps the 'After draining queued wakes, re-arm the watcher' phrasing and drops the old 'Restart it NOW' phrasing to satisfy existing tests.
(3) AGENTS.md section 8 codifies the rule: arm/re-arm only via the harness's tracked background mechanism that notifies on exit, never a fire-and-forget shell '&', and treat the fm-watch-arm.sh status line as the source of truth for watcher state.
Deliberate decision: the process topology changed from exec to fork+wait, so the existing test_watch_restart_rejects_reused_pid was updated to assert the lock now names the watcher child (its real guarded properties - stale reused-pid lock replaced by a live watcher, unrelated pid not killed - are preserved) rather than the arm's own pid. Added six hermetic tests (no real tmux): arm started/healthy/FAILED and never-healthy-off-dead-pid, plus guard banner-first and fresh-watcher-quiet. Also corrected a now-stale README line that described the old exec model. Scope limited to bin/fm-watch-arm.sh, bin/fm-guard.sh, AGENTS.md, README.md, and tests; macOS(BSD)+Linux(GNU) compatible; shellcheck clean.
What Changed
fm-watch-arm.shself-verifying: it tracks the watcher child, confirms a live home-scoped watcher with a fresh beacon, propagates immediate wakes, and reports failed startup honestly.fm-guard.shto put the no-watcher warning first with a prominent re-arm banner while preserving queued-wake and grace-window behavior.Risk Assessment
Testing
No prior baseline command output was provided; I inspected the changed paths, ran the focused watcher/wake/guard shell suite, manually exercised the end-user CLI surfaces with isolated state, saved reviewer-visible CLI evidence, and confirmed the worktree stayed clean with no transient test artifacts left behind.
Evidence: Watcher arm and guard CLI transcript
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
🔧 **Review** - 1 issue found → auto-fixed (3) ✅
bin/fm-watch-arm.sh:159- If the watcher exits before the confirmation loop observes it as healthy, for example because a due*.check.shemits a wake on the first cycle, this branch falls through toFAILEDeven though the child successfully queued and printed a wake. The old exec path propagated that wake with status 0; the wrapper should wait for the child and propagate a valid wake instead of returning 1.🔧 Fix: Propagate immediate watcher wakes
1 warning still open:
bin/fm-watch-arm.sh:150- The cleanup trap only handles TERM and INT. If the harness or shell tears down the tracked arm with SIGHUP, the wrapper exits without running cleanup_child, so the watcher child and temp output file can survive untracked, undermining the new guarantee that killing the arm tears down the watcher. Trap HUP as well.🔧 Fix: Trap HUP during watcher arm cleanup
1 warning still open:
bin/fm-watch-arm.sh:180- A duplicate arm can falsely report FAILED during a startup race: if its child loses the singleton and exits withalready runningbefore the winning watcher has written its first fresh beacon, this path does one immediatehealthy_watchercheck and breaks instead of polling until the confirmation deadline. Keep waiting for a peer to become healthy unless the child emitted a real wake.🔧 Fix: Wait for peer watcher beacon
✅ Re-checked - no issues remain.
✅ **Test** - passed
✅ No issues found.
git diff --name-status 98b046162d3e64fdf8ce8dc4bed7bf85fdea08be..f33e142bf9729b7351f63389e021dfa954235cabtests/fm-wake-queue.test.shManual evidence check using isolatedFM_HOMEstates forbin/fm-watch-arm.shstarted, healthy, stale-lock failure, andbin/fm-guard.shbanner ordering; transcript saved at/var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KVY3BGBWX43CJ3TBN49SQBRE/watcher-arm-guard-transcript.txtgit status --shortfind . -maxdepth 3 -name '.watch-arm-output.*' -o -name 'fm-wake-tests.*'✅ **Document** - passed
✅ No issues found.
✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.