fix(gateway): force exit on zombie shutdown + 503 healthz during shutdown#88908
fix(gateway): force exit on zombie shutdown + 503 healthz during shutdown#88908amittell wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR addresses a "zombie gateway" failure mode where a node process that has begun (or completed) shutdown still holds the HTTP listener and answers /healthz with 200, causing supervised lock recovery to defer to it instead of relaunching. It introduces a shutting-down flag that flips /healthz to 503 at the start of close, tightens the preflight probe to only accept 200, and arms a post-shutdown watchdog that force-exits if a stray handle keeps the process alive.
Changes:
- Add a module-level
gatewayShuttingDownStateflag inserver-close.ts(set before any close await) and route/healthz,/healththrough it to return 503 with{ live: false, phase: "shutting_down" }. - Add
armGatewayPostShutdownExitWatchdogthat force-callsprocess.exit(0)after a timeout (configurable viaOPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS) with active-handle diagnostics, and wire it into the close handler. - Tighten
probeGatewayHealthzto require200(not< 500) and emit agateway.preflight.zombie_detectedtrace + warn when the lock is held but the probe is unhealthy.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/gateway/server-close.ts | Adds shutting-down flag, accessors, and post-shutdown exit watchdog; flips flag at start of close and arms watchdog at end. |
| src/gateway/server-http.ts | Routes live probes through getShuttingDown() and returns 503 during shutdown; adds a once-per-shutdown warn log and a test reset hook. |
| src/gateway/server-runtime-state.ts | Threads getShuttingDown through to the HTTP server constructor. |
| src/gateway/server.impl.ts | Resets the shutting-down flag at startup so in-process restarts begin answering 200 again. |
| src/cli/gateway-cli/run.ts | Tightens healthz probe to === 200; logs/traces zombie detection on unhealthy probe with held lock; exposes probeGatewayHealthz for tests. |
| src/gateway/server-http.probe.test.ts | New tests for 503 healthz/health responses during shutdown, including HEAD and module-flag paths. |
| src/gateway/server-close.test.ts | New tests for flag flip ordering and the post-shutdown exit watchdog behavior. |
| src/cli/gateway-cli/run.supervised-lock.test.ts | Tests preflight zombie-detected logging and probe healthy/unhealthy classification via a real local server. |
|
Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 5:31 AM ET / 09:31 UTC. Summary PR surface: Source +303, Tests +360. Total +663 across 9 files. Reproducibility: yes. from source and supplied terminal proof: current main returns 200 for live probes and accepts non-500 preflight responses, while the PR body shows the incident-shaped predicate failure and patched behavior. I did not run a live launchd zombie reproduction in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Maintain a conservative Gateway availability contract: harden supervised recovery from zombie listeners while preserving public liveness probes, and make the forced-exit timeout/env policy an explicit maintainer-owned decision before merge. Do we have a high-confidence way to reproduce the issue? Yes from source and supplied terminal proof: current main returns 200 for live probes and accepts non-500 preflight responses, while the PR body shows the incident-shaped predicate failure and patched behavior. I did not run a live launchd zombie reproduction in this read-only review. Is this the best way to solve the issue? Unclear as a final product decision: the two-layer defense is a plausible and well-tested mitigation for the reported failure, but the forced AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 601ab84f35e1. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +303, Tests +360. Total +663 across 9 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
Thanks for the detailed review. Addressed all three findings in P1 (server-close.ts:879): restart-aware watchdog. Gated the watchdog on the close reason. When P1 (server-http.ts:324): narrowed live-probe contract. Public P3 (run.ts:462-466): deduped zombie trace. Added a All three changes ship in one commit with focused tests. |
…module Per ClawSweeper P2+P3 review on openclaw#88908. P2: Move isGatewayShuttingDown / markGatewayShuttingDown / resetGatewayShuttingDownState out of server-close.ts into a new src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now imports from the lightweight module directly instead of loading server-close.runtime (which transitively pulls in shutdown-only agent, channel, and plugin cleanup code). server-close.ts and server-http.ts re-export from the new module for source compatibility. P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives alongside the shutdown state and is reset on every markGatewayShuttingDown and resetGatewayShuttingDownState call. Previously the flag latched for the process lifetime and was only reset by tests, so the second in-process restart shutdown was silent. New regression test pins per-cycle reset behavior.
|
Thanks for the followup review. Addressed the P2 and P3 architectural concerns in P2 (server.impl.ts:548-549): shutdown state out of close runtime. Extracted the shutdown lifecycle state symbols ( P3 (server-http.ts:54-56): per-cycle log dedupe reset. Moved New regression test
Watchdog and strict-probe semantics from the earlier review remain unchanged — this commit is purely the lifecycle module extraction plus the per-cycle log reset. |
…module Per ClawSweeper P2+P3 review on openclaw#88908. P2: Move isGatewayShuttingDown / markGatewayShuttingDown / resetGatewayShuttingDownState out of server-close.ts into a new src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now imports from the lightweight module directly instead of loading server-close.runtime (which transitively pulls in shutdown-only agent, channel, and plugin cleanup code). server-close.ts and server-http.ts re-export from the new module for source compatibility. P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives alongside the shutdown state and is reset on every markGatewayShuttingDown and resetGatewayShuttingDownState call. Previously the flag latched for the process lifetime and was only reset by tests, so the second in-process restart shutdown was silent. New regression test pins per-cycle reset behavior.
a9c8837 to
75863af
Compare
…module Per ClawSweeper P2+P3 review on openclaw#88908. P2: Move isGatewayShuttingDown / markGatewayShuttingDown / resetGatewayShuttingDownState out of server-close.ts into a new src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now imports from the lightweight module directly instead of loading server-close.runtime (which transitively pulls in shutdown-only agent, channel, and plugin cleanup code). server-close.ts and server-http.ts re-export from the new module for source compatibility. P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives alongside the shutdown state and is reset on every markGatewayShuttingDown and resetGatewayShuttingDownState call. Previously the flag latched for the process lifetime and was only reset by tests, so the second in-process restart shutdown was silent. New regression test pins per-cycle reset behavior.
75863af to
2028f4c
Compare
…e hangs When the gateway shutdown sequence completes cleanly but a stray handle (HTTP keep-alive, telegram fetch, plugin native handle) keeps the node event loop alive, the supervisor sees the gateway lock dropped but the PID never reaps. The HTTP listener stays bound, and the next launchd respawn defers via the lock-recovery path because /healthz still answers 200 from the zombie. Real incident on rh-bot.lan 2026-05-31 21:31 left the bot offline for 2h13m until manual kill -9 + launchctl kickstart. This commit arms a process-wide watchdog (default 5_000ms, configurable via OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS) after every clean shutdown completion. The timer is unref'd so it never blocks a healthy natural exit; when it does fire, it emits a structured warn log with active-handle summary, records the gateway.shutdown.zombie_detected restart trace event for OTel/Loki, and calls process.exit(0). Also flips the new isGatewayShuttingDown flag at the start of the close sequence (before any await) so layer 2 can act on it, and resets the flag at startup so in-process restart paths re-enter the running state.
…-recovery preflight
The lock-recovery preflight in runGatewayLoopWithSupervisedLockRecovery
accepted any /healthz response with status < 500 as healthy. When the
previous gateway is mid-shutdown (or a zombie that lost its close path
but kept its HTTP listener), that includes the 503 we now return, so
the supervised respawn deferred to a draining or stuck instance and
left the bot offline.
This commit:
1. Adds a shutting-down state hook owned by server-close (flag is
flipped before any close-handler await), exposed via the
isGatewayShuttingDown accessor.
2. Updates handleGatewayProbeRequest so live probes (/health,
/healthz) return 503 with body { live: false, phase:
"shutting_down" } while the flag is set. Once-per-shutdown warn
log via the gateway/probe subsystem captures the cascade for
OTel/Loki ingestion.
3. Threads getShuttingDown through createGatewayHttpServer and
server-runtime-state so production callers and tests share the
same seam.
4. Tightens probeGatewayHealthz to treat only HTTP 200 as healthy.
The < 500 acceptance was the second-order defense gap that let
the zombie cascade survive every supervisor cycle.
5. Adds a gateway.preflight.zombie_detected warn log + restart trace
event in the supervised lock-recovery path so the cascade is
queryable in Loki when it recurs.
…ponse, and lock-recovery preflight tightening
Covers the three production behaviors introduced in this branch:
1. armGatewayPostShutdownExitWatchdog: forces process.exit(0) when
fired, can be cancelled, and emits a warn log naming the timeout
and likely handle culprits.
2. createGatewayCloseHandler: flips the shutting-down flag before
any await and threads the watchdog seam through to tests.
3. handleGatewayProbeRequest: returns 503 with the
{ live: false, phase: "shutting_down" } body on both /healthz
and /health when the flag is set, including HEAD requests.
4. probeGatewayHealthz: real HTTP listener returning 503 is now
classified as unhealthy (vs the prior < 500 acceptance).
5. runGatewayLoopWithSupervisedLockRecovery: logs the
gateway.preflight.zombie_detected signal when a held lock pairs
with an unhealthy probe response.
…killing test shards
- Skip post-shutdown watchdog on in-process restart reasons so SIGUSR1 path does not kill the restarted gateway 5s into its next life. - Narrow live-probe 503 to ?strict=1 query, preserving public /health and /healthz 200 contract for external monitors. Supervised lock-recovery preflight now uses /healthz?strict=1 to opt into shutdown-aware behavior. - Dedupe gateway.preflight.zombie_detected so a draining prior gateway emits the trace + warn once per recovery cycle, not on every retry tick. Includes regression tests for all three fixes.
…module Per ClawSweeper P2+P3 review on openclaw#88908. P2: Move isGatewayShuttingDown / markGatewayShuttingDown / resetGatewayShuttingDownState out of server-close.ts into a new src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now imports from the lightweight module directly instead of loading server-close.runtime (which transitively pulls in shutdown-only agent, channel, and plugin cleanup code). server-close.ts and server-http.ts re-export from the new module for source compatibility. P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives alongside the shutdown state and is reset on every markGatewayShuttingDown and resetGatewayShuttingDownState call. Previously the flag latched for the process lifetime and was only reset by tests, so the second in-process restart shutdown was silent. New regression test pins per-cycle reset behavior.
2028f4c to
b83669c
Compare
Summary
On rh-bot.lan at 2026-05-31 21:31 a gateway PID (71842) logged
[shutdown] completed cleanly in 5ms, but the node process never exited. It held port 18789 for 2h13m. Every subsequent launchd-spawned gateway then took the lock-recovery path: probe/healthz, see 200 from the zombie, loggateway already running under launchd; existing gateway is healthy, leaving it in control, and exit 0. launchd'sKeepAlive.SuccessfulExit=falsekept new gateways down. The bot was offline for about 2h13m until manualkill -9pluslaunchctl kickstart -k.The original SIGKILL of the parent process is the separate root cause documented in memory
project_rh_bot_isolated_session_sigkill.md(openclaw isolated-sessionspawn pattern). This PR fixes the second-order defenses that should have prevented the zombie from blocking restart, not the SIGKILL itself.Layer 1: force exit after every clean shutdown
src/gateway/server-close.ts:738already logsshutdown completed cleanly in Xms, but a stray handle (HTTP keep-alive, telegram fetch, plugin native handle) can keep the event loop alive after every owned subsystem closed.src/cli/gateway-cli/run-loop.ts:382has an existingarmForceExitTimer(SHUTDOWN_TIMEOUT_MS)safety net, but it only arms forrunAcceptedRequest(signal-driven), not for thecloseOnStartupFailurepath insrc/gateway/server.impl.ts:1059. That gap is exactly the rh-bot incident shape.This PR adds
armGatewayPostShutdownExitWatchdoginserver-close.ts. After "completed cleanly" is logged, an unref'dsetTimeoutis armed (default 5_000ms, configurable viaOPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS). When it fires:shutdownLog.warnwith the active-handle constructor summary (viaprocess._getActiveHandlesorgetActiveResourcesInfo)gateway.shutdown.zombie_detectedrestart trace event for OTel/Lokiprocess.exit(0)Because the timer is unref'd, a healthy natural exit always wins. The watchdog only fires when something else holds the loop open. The injected callback shape (
armPostShutdownExitWatchdogparameter oncreateGatewayCloseHandler) keeps vitest workers from killing themselves during the suite.Layer 2: 503 on /healthz during shutdown, plus tighter preflight
The deployed
handleGatewayProbeRequestinsrc/gateway/server-http.tsreturns 200 unconditionally for/healthz. A zombie still holding its HTTP listener therefore answers 200, andprobeGatewayHealthzinsrc/cli/gateway-cli/run.ts:369accepts it as healthy (success criterion wasstatusCode < 500). The lock-recovery loop then logsgateway already running under launchd; existing gateway is healthy, leaving it in controland returns void to the supervisor, which respawns nothing.This PR adds a module-level shutting-down flag in
server-close.ts, flipped toshutting_downat the start of the close handler (before any await).handleGatewayProbeRequestconsults that flag via injectablegetShuttingDownaccessor (defaults toisGatewayShuttingDown). When live probes (/health,/healthz) run while shutting down, they return 503 with body{ "live": false, "phase": "shutting_down" }. Agateway.healthz.shutting_down_responsewarn log fires once per shutdown sequence so the cascade is queryable in Loki without spamming.probeGatewayHealthzis tightened to accept onlystatusCode === 200. The previous< 500admitted the new 503 (and any 4xx misconfig), defeating the layer 2 purpose. The lock-recovery loop now also emitsgateway.preflight.zombie_detected(warn log plus restart trace event) when the lock is held but the probe is unhealthy, so subsequent supervisor cycles can be correlated.The flag resets at the start of
startGatewayServerso the in-process restart path returns to arunningstate before/healthzstarts answering 200 again.OTel / Loki instrumentation
All four events emit via the existing
gatewayLog/shutdownLog/gatewayProbeLogsubsystems plusrecordGatewayRestartTrace:gateway.shutdown.zombie_detected(Layer 1 trace) when the post-shutdown watchdog firesgateway.healthz.shutting_down_response(Layer 2 warn log) when /healthz returns 503 during shutdowngateway.preflight.zombie_detected(Layer 2 warn log + trace) when supervised lock recovery sees a held lock but an unhealthy proberestart.close.totaltrace already covers clean-path shutdown completionNo new export plumbing is needed; the existing diagnostics-otel / Loki shipper consumes these events.
Verification
Local Vitest with the patch applied:
pnpm buildclean.node scripts/run-oxlint-shards.mjs --only=core --threads=4clean.node scripts/check-dynamic-import-warts.mjsshows only the pre-existingmcp-cli.tsadvisory, unchanged by this PR.Real behavior proof
Behavior addressed: gateway zombie shutdown that left rh-bot offline 2h13m on 2026-05-31. With this PR the watchdog forces
process.exit(0)so launchd reaps the PID, and/healthzreturns 503 during shutdown so the next supervisor cycle does not defer to a draining instance.Real environment tested: live deployed openclaw gateway on rh-bot.lan, build SHA 23804e6 (incident reproducer). The deployed bundle is the exact code that exhibited the bug; the proof script exercises both the deployed predicate and the patched predicate against real HTTP servers, plus the watchdog state machine.
Exact steps or command run after this patch:
Evidence after fix:
Live deployed gateway healthz hit (confirms the case 1 handler shape the proof script reproduces):
Observed result after fix: case 1 confirms the deployed handler shape responsible for the incident. Cases 2 and 3 confirm the patched probe predicate. Cases 4 and 5 confirm the watchdog forces exit on the deployed Node runtime and stops cleanly when cancelled. Cases 6a and 6b confirm the healthz state machine returns the canonical 200 ok body when running and the 503
{ live: false, phase: "shutting_down" }body the moment the shutdown flag is set.What was not tested: a true end-to-end shutdown of a real OpenClaw gateway on rh-bot with the new bundle (requires deployment of this branch); the long-running TCP keepalive vs telegram-fetch handle-leak scenario, which the watchdog handles by design (any stray handle keeps the loop alive past the watchdog timeout, and the watchdog forces exit regardless of the specific handle source).