Why
The v0.18.0 demo smoke uncovered a silent multi-day gateway outage: a backgrounded daemon was App-Nap/sleep-throttled on macOS, starving Slack Socket-Mode's ping/pong (5 s window). The socket reconnected endlessly but never stayed healthy — while the process stayed alive and the heartbeat kept ticking, so nothing looked wrong and the bot answered no one. This release makes that failure mode impossible-by-default, self-healing, and — for any residual unrecoverable state — a loud, alerting exit instead of a silent zombie.
What shipped
- Throttle-proofing (`keep-awake.ts`): on macOS, `pmk gateway start` holds a `caffeinate` power assertion bound to its own pid (`-w <pid>`) for its whole lifetime. Default flags `-is` (idle + system sleep), not the heavier `-dimsu` (display), which stays available via `PMK_GATEWAY_CAFFEINATE_FLAGS`. No-op off macOS; spawn failure never blocks startup.
- Self-heal watchdog: a pure `SocketHealth` tracker (fed by a pong-timeout tap logger + the five real SDK conn-state events) drives a `SocketWatchdog` on 30 s ticks. Wedged socket (≥3 pong-timeouts/60 s, or not `connected` past 60 s) → forced in-process reconnect (time-boxed 45 s). A reconnect is "failed" only if it goes unhealthy again before 3 min of continuous health; after 3 confirmed failures the next unhealthy tick does a loud exit.
- Loud exit: `gateway.offline` (reason: `watchdog-unhealthy`, `broadcast:false`; operator alert, not stakeholder fan-out) + admin DMs (`conversations.open`, works over HTTP even with a dead socket) under a hard 15 s cap, then `process.exit(1)`, guaranteed even if the alert hangs/rejects. No admins → offline event + terminal log.
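The "wedged socket" rule above can be sketched as a small pure tracker. This is an illustration only, not the shipped `SocketHealth`: the class name `HealthSketch`, its method names, the boolean connected flag (standing in for the SDK's conn-state events), and the exact boundary comparisons are all assumptions. Only the thresholds (3 pong-timeouts per 60 s, not connected for more than 60 s) come from the description.

```typescript
// Hypothetical sketch; names and boundary handling are assumptions.
const PONG_WINDOW_MS = 60_000;        // window for counting pong timeouts
const PONG_TIMEOUT_LIMIT = 3;         // this many timeouts in the window means wedged
const NOT_CONNECTED_LIMIT_MS = 60_000; // tolerated time spent not connected

class HealthSketch {
  private pongTimeouts: number[] = [];
  private connected = false;
  private stateSince: number;

  constructor(now: number) {
    this.stateSince = now;
  }

  // Called by whatever observes a pong timeout (a log tap in the real code).
  recordPongTimeout(now: number): void {
    this.pongTimeouts.push(now);
  }

  // Called on connection-state changes; only a real change resets the clock.
  setConnected(connected: boolean, now: number): void {
    if (connected !== this.connected) {
      this.connected = connected;
      this.stateSince = now;
    }
  }

  // Drops timeouts older than the window, then applies both wedge rules.
  isWedged(now: number): boolean {
    this.pongTimeouts = this.pongTimeouts.filter((t) => now - t <= PONG_WINDOW_MS);
    if (this.pongTimeouts.length >= PONG_TIMEOUT_LIMIT) return true;
    return !this.connected && now - this.stateSince > NOT_CONNECTED_LIMIT_MS;
  }
}
```

Passing `now` in explicitly keeps the tracker free of timers and clocks, so a watchdog tick can be simulated in tests by calling `isWedged` with chosen timestamps.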
Fixed
- Removed a dead `socket.on("reconnect", …)` listener (the SDK emits `reconnecting`, never `reconnect`).
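Why such a listener goes unnoticed can be shown with a toy emitter. `TinyEmitter` below is a hypothetical stand-in written for this example, not the SDK's socket class; it only demonstrates that subscribing to a name that is never emitted fails silently.

```typescript
// Hypothetical minimal emitter, for illustration only.
type Listener = () => void;

class TinyEmitter {
  private listeners = new Map<string, Listener[]>();

  on(event: string, fn: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(fn);
    this.listeners.set(event, list);
  }

  // Returns true if at least one listener was registered for the event.
  emit(event: string): boolean {
    const list = this.listeners.get(event) ?? [];
    list.forEach((fn) => fn());
    return list.length > 0;
  }
}

const demoSocket = new TinyEmitter();
const seen: string[] = [];

demoSocket.on("reconnect", () => seen.push("reconnect"));       // never fires
demoSocket.on("reconnecting", () => seen.push("reconnecting")); // fires

demoSocket.emit("reconnecting");
// `seen` now holds only "reconnecting"; the "reconnect" handler stayed dead.
```

Nothing throws or warns when the first handler is registered, which is why a mismatch like this tends to surface only in review or by reading the emitter's source.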
How it was built
brainstorming → 4 rounds of spec review → writing-plans → subagent-driven development (7 tasks, each implemented + spec-reviewed + quality-reviewed by fresh subagents) → a final holistic review. Bugs caught and fixed in review include: a log-line drop when the pong tap throws, a presence-alert timer leak, and — most importantly — the loud exit being silently skipped if terminate() rejected (now guaranteed via try/finally + a bounded alert). The highest-risk seam (do socket listeners survive a watchdog disconnect()+start()?) was verified SAFE against the installed SDK source.
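The try/finally + bounded-alert pattern mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped code: `bounded`, `loudExit`, and the `ExitDeps` shape are hypothetical names, the ordering of alert and terminate is a guess, and bounding `terminate()` as well is this sketch's own choice. The 15 s default mirrors the cap quoted in the description.

```typescript
// Hypothetical sketch; names, ordering, and dependency shape are assumptions.
type Outcome = "done" | "failed" | "timed-out";

// Runs `work` and settles within `capMs`; never rejects, even on a sync throw.
function bounded(work: () => Promise<unknown>, capMs: number): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(() => resolve("timed-out"), capMs);
    Promise.resolve()
      .then(work)
      .then(
        () => { clearTimeout(timer); resolve("done"); },
        () => { clearTimeout(timer); resolve("failed"); },
      );
  });
}

interface ExitDeps {
  alert: () => Promise<void>;     // e.g. offline event + admin DMs
  terminate: () => Promise<void>; // e.g. tearing down the socket
  exit: (code: number) => void;   // process.exit in production, a stub in tests
}

async function loudExit(deps: ExitDeps, capMs = 15_000): Promise<void> {
  try {
    await bounded(deps.alert, capMs);
    await bounded(deps.terminate, capMs);
  } finally {
    // Reached whether the steps above finished, failed, or timed out.
    deps.exit(1);
  }
}
```

Injecting `exit` keeps the guarantee testable: a test can pass an alert that never settles and a `terminate` that rejects, then check that `exit(1)` was still called once.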
Test plan
- `@pmk/cli`: 506 tests, 100% pass (`npm --workspace packages/cli test`). New suites: `socket-health`, `keep-awake`, `socket-logger`, `socket-watchdog`, `socket-watchdog-alert`.
- Operator live-sanity (not unit-tested): `pmk gateway start` → confirm a `caffeinate … -w <pid>` child exists and Ctrl+C shuts down cleanly with no spurious watchdog reconnects.
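For the live-sanity check it helps to know roughly what command line to look for. The sketch below shows one way the arguments might be assembled. `caffeinateArgs` is a hypothetical helper, not the code in `keep-awake.ts`, and it assumes the override variable holds a whitespace-separated flag string (such as `-dimsu`) and that an empty value falls back to the default.

```typescript
// Hypothetical helper: returns the argv for `caffeinate`, or null when no
// power assertion should be held (any platform other than macOS).
function caffeinateArgs(
  platform: string,
  pid: number,
  env: Record<string, string | undefined>,
): string[] | null {
  if (platform !== "darwin") return null; // no-op off macOS

  // Assumption: an unset or blank override falls back to the default "-is".
  const override = (env["PMK_GATEWAY_CAFFEINATE_FLAGS"] ?? "").trim();
  const flags = override.length > 0 ? override.split(/\s+/) : ["-is"];

  // `-w <pid>` ties the assertion to the gateway process's lifetime.
  return flags.concat(["-w", String(pid)]);
}
```

With the defaults and a gateway pid of 4242, this yields `["-is", "-w", "4242"]`, so the child to look for in the process list would resemble `caffeinate -is -w 4242`.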
Deferred
launchd/systemd service + boot auto-start; a watchdogAlertChannelId for channel (vs admin-DM) alerts; adaptive thresholds; active health probes.