Skip to content

fix: verify watcher arming before reporting healthy#75

Merged
kunchenguid merged 6 commits into
mainfrom
fm/watch-arm-honest-j6
Jun 25, 2026
Merged

fix: verify watcher arming before reporting healthy#75
kunchenguid merged 6 commits into
mainfrom
fm/watch-arm-honest-j6

Conversation

@kunchenguid

Copy link
Copy Markdown
Owner

Intent

Make firstmate's watcher arming honest and its no-watcher state impossible to miss, fixing an incident where a fire-and-forget 're-arm' was reaped on call return yet fm-watch-arm.sh reported 'already running pid X' off the dying process, so supervision was silently down ~30 min.

Three fixes: (1) bin/fm-watch-arm.sh is now self-verifying. Instead of exec-ing the watcher, it forks the one-shot watcher as a tracked child and waits on it (so killing the arm tears down the watcher and the watcher's wake exit propagates so the harness re-notifies). Before settling it VERIFIES the outcome: a live pid whose recorded identity matches THIS home's watcher AND a fresh liveness beacon (state/.last-watcher-beat within FM_GUARD_GRACE, reusing that env var as the single source of truth, not a hardcoded copy). It prints exactly one honest line - 'watcher: started pid=N (beacon fresh)', 'watcher: healthy pid=N (beacon s)', or 'watcher: FAILED - no live watcher with a fresh beacon' - and exits non-zero only on FAILED. It never reports healthy off a stale beacon or a dead/reused pid: a dead-pid lock self-heals via the existing steal path into a confirmed 'started', while a live-but-stale holder yields FAILED. Singleton/lock semantics (race-proof acquire, self-eviction, home-scoped --restart) are intentionally preserved.

(2) bin/fm-guard.sh leads the no-watcher case with a prominent bordered, dot-marked banner (in-flight task count, beacon age or 'never', exact one-line re-arm command) emitted FIRST so it cannot be skimmed past. The existing queued-wakes-pending warning and the grace-window-quiet behavior are preserved; banner text deliberately keeps the 'After draining queued wakes, re-arm the watcher' phrasing and drops the old 'Restart it NOW' phrasing to satisfy existing tests.

(3) AGENTS.md section 8 codifies the rule: arm/re-arm only via the harness's tracked background mechanism that notifies on exit, never a fire-and-forget shell '&', and treat the fm-watch-arm.sh status line as the source of truth for watcher state.

Deliberate decision: the process topology changed from exec to fork+wait, so the existing test_watch_restart_rejects_reused_pid was updated to assert the lock now names the watcher child (its real guarded properties - stale reused-pid lock replaced by a live watcher, unrelated pid not killed - are preserved) rather than the arm's own pid. Added six hermetic tests (no real tmux): arm started/healthy/FAILED and never-healthy-off-dead-pid, plus guard banner-first and fresh-watcher-quiet. Also corrected a now-stale README line that described the old exec model. Scope limited to bin/fm-watch-arm.sh, bin/fm-guard.sh, AGENTS.md, README.md, and tests; macOS(BSD)+Linux(GNU) compatible; shellcheck clean.

What Changed

  • Made fm-watch-arm.sh self-verifying: it tracks the watcher child, confirms a live home-scoped watcher with a fresh beacon, propagates immediate wakes, and reports failed startup honestly.
  • Updated fm-guard.sh to put the no-watcher warning first with a prominent re-arm banner while preserving queued-wake and grace-window behavior.
  • Documented the tracked watcher-arm requirement and expanded watcher/guard tests for started, healthy, failed, dead-pid, banner-first, and quiet-fresh-watcher cases.

Risk Assessment

⚠️ Medium: Captain, risk is medium because this changes the critical watcher process topology and shell lifecycle, but the review found no substantiated merge-blocking issues in the changed code.

Testing

No prior baseline command output was provided; I inspected the changed paths, ran the focused watcher/wake/guard shell suite, manually exercised the end-user CLI surfaces with isolated state, saved reviewer-visible CLI evidence, and confirmed the worktree stayed clean with no transient test artifacts left behind.

Evidence: Watcher arm and guard CLI transcript
# Watcher arm and guard manual verification transcript
# repo: /Users/kunchen/.no-mistakes/worktrees/016d88035d58/01KVY3BGBWX43CJ3TBN49SQBRE

## Clean arm starts a confirmed watcher
$ FM_HOME=<isolated> bin/fm-watch-arm.sh
watcher: started pid=46871 (beacon fresh)
confirmed-lock-pid=46871 alive=yes beacon=present

## Second arm reports the live fresh watcher as healthy
$ FM_HOME=<same-isolated-home> bin/fm-watch-arm.sh
watcher: healthy pid=46871 (beacon 1s)

## Unconfirmable live lock fails loudly
$ FM_HOME=<isolated-stale-lock> FM_ARM_CONFIRM_TIMEOUT=1 bin/fm-watch-arm.sh
watcher: lock held by live pid 47040 but heartbeat is stale for 835635798s (>300s); inspect or stop that watcher before re-arming.
watcher: FAILED - no live watcher with a fresh beacon
exit-status=1 unrelated-holder-alive=yes

## Guard banner appears before queued-wake warning
$ FM_HOME=<isolated-with-task-and-queued-wake> bin/fm-guard.sh
●━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
●  WATCHER DOWN - SUPERVISION IS OFF
●  1 task(s) in flight, but no watcher has a fresh beacon (last beat: never, grace 300s).
●  Trust bin/fm-watch-arm.sh for the true state: it confirms a live watcher and a fresh beacon, or fails loudly.
●  After draining queued wakes, re-arm the watcher: run bin/fm-watch-arm.sh as the harness-tracked background task (never a shell & that gets reaped).
●━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WARNING: queued wakes pending - drain them with bin/fm-wake-drain.sh before anything else.

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

🔧 **Review** - 1 issue found → auto-fixed (3) ✅
  • 🚨 bin/fm-watch-arm.sh:159 - If the watcher exits before the confirmation loop observes it as healthy, for example because a due *.check.sh emits a wake on the first cycle, this branch falls through to FAILED even though the child successfully queued and printed a wake. The old exec path propagated that wake with status 0; the wrapper should wait for the child and propagate a valid wake instead of returning 1.

🔧 Fix: Propagate immediate watcher wakes
1 warning still open:

  • ⚠️ bin/fm-watch-arm.sh:150 - The cleanup trap only handles TERM and INT. If the harness or shell tears down the tracked arm with SIGHUP, the wrapper exits without running cleanup_child, so the watcher child and temp output file can survive untracked, undermining the new guarantee that killing the arm tears down the watcher. Trap HUP as well.

🔧 Fix: Trap HUP during watcher arm cleanup
1 warning still open:

  • ⚠️ bin/fm-watch-arm.sh:180 - A duplicate arm can falsely report FAILED during a startup race: if its child loses the singleton and exits with already running before the winning watcher has written its first fresh beacon, this path does one immediate healthy_watcher check and breaks instead of polling until the confirmation deadline. Keep waiting for a peer to become healthy unless the child emitted a real wake.

🔧 Fix: Wait for peer watcher beacon
✅ Re-checked - no issues remain.

✅ **Test** - passed

✅ No issues found.

  • git diff --name-status 98b046162d3e64fdf8ce8dc4bed7bf85fdea08be..f33e142bf9729b7351f63389e021dfa954235cab
  • tests/fm-wake-queue.test.sh
  • Manual evidence check using isolated FM_HOME states for bin/fm-watch-arm.sh started, healthy, stale-lock failure, and bin/fm-guard.sh banner ordering; transcript saved at /var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KVY3BGBWX43CJ3TBN49SQBRE/watcher-arm-guard-transcript.txt
  • git status --short
  • find . -maxdepth 3 -name '.watch-arm-output.*' -o -name 'fm-wake-tests.*'
✅ **Document** - passed

✅ No issues found.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

…ard banner

fm-watch-arm.sh now forks the watcher as a tracked child, verifies a live
watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest
line - started/healthy/FAILED - exiting non-zero when none can be confirmed.
Never reports healthy off a stale beacon or dead/reused pid.

fm-guard.sh leads the no-watcher case with a bordered ●-marked banner (in-flight
count, beacon age, exact re-arm command) so it cannot be skimmed past; queued-
wakes warning and grace-window quiet preserved.

AGENTS.md section 8 codifies arming only via the harness's tracked background
mechanism, never a fire-and-forget shell &.

Tests: arm started/healthy/FAILED + never-healthy-off-dead-pid; guard banner-
first and fresh-watcher-quiet. Restart test updated for the child topology.
@kunchenguid kunchenguid merged commit e4d236b into main Jun 25, 2026
4 checks passed
@kunchenguid kunchenguid deleted the fm/watch-arm-honest-j6 branch June 25, 2026 01:25
leo1oel added a commit to leo1oel/nemo that referenced this pull request Jun 25, 2026
…hardening (kunchenguid#71, kunchenguid#75) (#4)

Port the two upstream watcher-reliability commits onto the herdr backend. The
watcher supervises the whole fleet, so these are reliability-critical.

kunchenguid#71 - race-proof singleton + home-scoped re-arm:
- fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling
  ".steal" mutex (floor >= 2s, decoupled from FM_LOCK_STALE_AFTER) and the holder
  pid is re-verified dead immediately before the rmdir; every claim writes its pid
  and reads it back, so concurrent acquirers yield exactly one winner.
- fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its
  pid; if another took over it exits cleanly, so any transient duplicate
  self-resolves within one poll. Records fm-home/watcher-path/pid-identity.
- fm-watch-arm.sh (new): safe re-arm; default no-ops via the singleton, --restart
  signals only this home's recorded watcher pid (never a broad pkill that would
  kill sibling secondmate homes' watchers).

kunchenguid#75 - honest self-verifying arm + prominent no-watcher guard banner:
- fm-watch-arm.sh forks the watcher as a tracked child, verifies a genuinely live
  watcher with a fresh beacon (reusing FM_GUARD_GRACE), and prints one honest line
  - started/healthy/FAILED - exiting non-zero when none can be confirmed; never
  reports healthy off a stale beacon or dead/reused pid.
- fm-guard.sh leads the no-watcher case with a bordered banner so it cannot be
  skimmed past.

herdr adaptations: fm-wake-lib.sh/fm-guard.sh/fm-watch-arm.sh taken from upstream
(near-identical to the pre-kunchenguid#71 base, keeping FM_HOME so the new arm script's
home-scoping is intact and consistent); the kunchenguid#71 hunks applied to the
herdr-rewritten fm-watch.sh by hand; the daemon's kunchenguid#71 changes were comment-only
and left as the herdr daemon's existing (still-accurate) wording. AGENTS.md §8 and
README updated to route re-arm through fm-watch-arm.sh and warn against
fire-and-forget `&` / broad pkill. The 19 new upstream tests (lock concurrency,
self-eviction, arm started/healthy/FAILED, guard banner) ported and all pass;
tests sandbox nothing new.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant