v0.19.0 — gateway keep-awake hardening (throttle-proof + self-heal watchdog)

@hanfour hanfour released this 03 Jun 14:15
· 61 commits to main since this release

Why

The v0.18.0 demo smoke test uncovered a silent, multi-day gateway outage. On macOS, a backgrounded daemon was throttled by App Nap and sleep, which starved Slack Socket Mode's ping/pong exchange (5 s window). The socket reconnected endlessly but never stayed healthy. Meanwhile the process stayed alive and the heartbeat kept ticking, so nothing looked wrong even though the bot answered no one. This release makes that failure mode impossible by default and self-healing, and turns any residual unrecoverable state into a loud, alerting exit instead of a silent zombie.

What shipped

  • Throttle-proofing (keep-awake.ts): on macOS, pmk gateway start holds a caffeinate power assertion bound to its own pid (-w <pid>) for its whole lifetime. The default flags are -is (prevent idle and system sleep), not the heavier -dimsu (which also keeps the display awake); the heavier set stays available via PMK_GATEWAY_CAFFEINATE_FLAGS. The assertion is a no-op off macOS, and a spawn failure never blocks startup.
  • Self-heal watchdog: a pure SocketHealth tracker (fed by a pong-timeout tap logger plus the five real SDK connection-state events) drives a SocketWatchdog on 30 s ticks. A wedged socket (≥3 pong-timeouts within 60 s, or not connected for more than 60 s) triggers a forced in-process reconnect, time-boxed to 45 s. A reconnect counts as "failed" only if the socket goes unhealthy again before reaching 3 min of continuous health; after 3 confirmed failures, the next unhealthy tick does a loud exit.
  • Loud exit: the gateway emits gateway.offline (reason: watchdog-unhealthy, broadcast:false, so it is an operator alert rather than a stakeholder fan-out) and sends admin DMs (via conversations.open, which works over HTTP even with a dead socket) under a hard 15 s cap, then calls process.exit(1). The exit is guaranteed even if the alert hangs or rejects. With no admins configured, it falls back to the offline event plus a terminal log.
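
The watchdog rules above can be sketched as two small pure classes. This is an illustration under stated assumptions, not the shipped implementation: the class and method names (SocketHealthSketch, WatchdogSketch, tick) are hypothetical, only the thresholds come from these notes, and the 45 s reconnect time-box is left out. Two readings are assumed and marked in comments: the failure counter resets once a reconnect is confirmed healthy, and the tick that confirms the third failure is itself the exit tick.

```typescript
// Thresholds quoted from the release notes; every name below is hypothetical.
const PONG_TIMEOUT_LIMIT = 3;
const PONG_WINDOW_MS = 60_000;
const NOT_CONNECTED_LIMIT_MS = 60_000;
const STABLE_MS = 3 * 60_000;
const MAX_FAILED_RECONNECTS = 3;

type WatchdogAction = "none" | "reconnect" | "exit";

// Pure tracker: callers pass timestamps in, so there are no timers to mock.
class SocketHealthSketch {
  private pongTimeouts: number[] = [];
  private notConnectedSince: number | null;

  constructor(now: number) {
    this.notConnectedSince = now; // assumption: the socket starts out not connected
  }

  recordPongTimeout(now: number): void {
    this.pongTimeouts.push(now);
  }

  setConnected(connected: boolean, now: number): void {
    if (connected) {
      this.notConnectedSince = null;
    } else if (this.notConnectedSince === null) {
      this.notConnectedSince = now;
    }
  }

  isWedged(now: number): boolean {
    // Keep only pong-timeouts inside the sliding 60 s window.
    this.pongTimeouts = this.pongTimeouts.filter((t) => t > now - PONG_WINDOW_MS);
    const tooManyTimeouts = this.pongTimeouts.length >= PONG_TIMEOUT_LIMIT;
    const disconnectedTooLong =
      this.notConnectedSince !== null &&
      now - this.notConnectedSince > NOT_CONNECTED_LIMIT_MS;
    return tooManyTimeouts || disconnectedTooLong;
  }
}

// Decides what each 30 s tick should do, given the tracker's verdict.
class WatchdogSketch {
  private failedReconnects = 0;
  private healthySince: number | null = null;
  private awaitingConfirmation = false;

  tick(wedged: boolean, now: number): WatchdogAction {
    if (!wedged) {
      if (this.healthySince === null) this.healthySince = now;
      if (this.awaitingConfirmation && now - this.healthySince >= STABLE_MS) {
        // 3 min of continuous health: the last reconnect is confirmed good.
        this.awaitingConfirmation = false;
        this.failedReconnects = 0; // assumption: the counter resets here
      }
      return "none";
    }
    this.healthySince = null;
    // Unhealthy again before confirmation means the last reconnect failed.
    if (this.awaitingConfirmation) this.failedReconnects += 1;
    // Assumption: the tick that confirms the third failure exits right away.
    if (this.failedReconnects >= MAX_FAILED_RECONNECTS) return "exit";
    this.awaitingConfirmation = true;
    return "reconnect";
  }
}
```

Passing timestamps in, instead of reading the clock inside the classes, keeps both pieces deterministic, which is one plausible reason to keep the tracker "pure" as described above.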

Fixed

  • Removed a dead socket.on("reconnect", …) listener (the SDK emits reconnecting, never reconnect).

How it was built

Brainstorming → 4 rounds of spec review → writing-plans → subagent-driven development (7 tasks, each implemented, spec-reviewed, and quality-reviewed by fresh subagents) → a final holistic review. Bugs caught and fixed in review include a dropped log line when the pong tap throws, a presence-alert timer leak, and, most importantly, the loud exit being silently skipped if terminate() rejected (now guaranteed via try/finally plus a bounded alert). The highest-risk seam (whether socket listeners survive a watchdog disconnect() + start()) was verified safe against the installed SDK source.
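
The "try/finally plus a bounded alert" fix can be illustrated with a minimal sketch. Everything here is hypothetical: the names loudExit, withTimeout, and LoudExitDeps, the injected callbacks, and the order of alert before terminate are assumptions for illustration and do not reflect the actual code. The point is only that the exit call sits in a finally block, so neither a hung alert nor a rejecting terminate() can skip it.

```typescript
// Races a promise against a timer so a hung alert cannot stall shutdown.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical dependencies, injected so the flow can be exercised in tests.
interface LoudExitDeps {
  sendAlert: () => Promise<void>; // e.g. offline event plus admin DMs
  terminate: () => Promise<void>; // graceful shutdown, may reject
  exit: (code: number) => void;   // process.exit in production
  alertCapMs?: number;            // hard cap on the alert, 15 s by default
}

async function loudExit(deps: LoudExitDeps): Promise<void> {
  try {
    try {
      await withTimeout(deps.sendAlert(), deps.alertCapMs ?? 15_000);
    } catch {
      // A rejected alert must not prevent shutdown.
    }
    await deps.terminate();
  } catch {
    // A rejected terminate() is swallowed; the exit below still runs.
  } finally {
    deps.exit(1);
  }
}
```

In production the exit dependency would be process.exit; injecting it makes the "exit always happens" property checkable without killing the test runner.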

Test plan

  • @pmk/cli: 506 tests, 100% pass (npm --workspace packages/cli test). New suites: socket-health, keep-awake, socket-logger, socket-watchdog, socket-watchdog-alert.
  • Operator live-sanity (not unit-tested): pmk gateway start → confirm a caffeinate … -w <pid> child exists and Ctrl+C shuts down cleanly with no spurious watchdog reconnects.
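
For the live-sanity check, it may help to know the shape of the command line to look for. The sketch below builds the caffeinate invocation described under "What shipped". The function name buildKeepAwakeCommand and its parameters are hypothetical, and the way the PMK_GATEWAY_CAFFEINATE_FLAGS value is parsed (a whitespace-separated flag string) is an assumption rather than documented behavior.

```typescript
interface KeepAwakeCommand {
  command: "caffeinate";
  args: string[];
}

// Returns the command to spawn, or null where keep-awake is a no-op.
function buildKeepAwakeCommand(
  platform: string,       // e.g. process.platform
  pid: number,            // the gateway's own pid
  flagsOverride?: string, // assumed raw value of PMK_GATEWAY_CAFFEINATE_FLAGS
): KeepAwakeCommand | null {
  if (platform !== "darwin") return null; // no-op off macOS

  // Default -is: prevent idle sleep and system sleep.
  const raw = flagsOverride?.trim() || "-is";
  const flags = raw.split(/\s+/);

  // -w <pid> makes caffeinate hold the assertion until that process exits.
  return { command: "caffeinate", args: [...flags, "-w", String(pid)] };
}
```

With the defaults, a gateway with pid 4242 would yield `caffeinate -is -w 4242`. The result could be handed to spawn from node:child_process, with an "error" listener attached so that a failed spawn is logged and does not block startup.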

Deferred

launchd/systemd service + boot auto-start; a watchdogAlertChannelId for channel (vs admin-DM) alerts; adaptive thresholds; active health probes.