
fix(gateway): force exit on zombie shutdown + 503 healthz during shutdown#88908

Open
amittell wants to merge 8 commits into
openclaw:mainfrom
amittell:fix/gateway-shutdown-zombie-exit

Conversation

@amittell
Contributor

@amittell amittell commented Jun 1, 2026

Summary

On rh-bot.lan at 2026-05-31 21:31, a gateway process (PID 71842) logged "[shutdown] completed cleanly in 5ms", but the node process never exited. It held port 18789 for 2h13m. Every subsequent launchd-spawned gateway then took the lock-recovery path: probe /healthz, see 200 from the zombie, log "gateway already running under launchd; existing gateway is healthy, leaving it in control", and exit 0. Because exit 0 counts as a successful exit, launchd's KeepAlive.SuccessfulExit=false did not relaunch the job, so no new gateway came up. The bot was offline for about 2h13m until a manual kill -9 plus launchctl kickstart -k.

The original SIGKILL of the parent process is the separate root cause documented in memory project_rh_bot_isolated_session_sigkill.md (openclaw isolated-session spawn pattern). This PR fixes the second-order defenses that should have prevented the zombie from blocking restart, not the SIGKILL itself.

Layer 1: force exit after every clean shutdown

src/gateway/server-close.ts:738 already logs shutdown completed cleanly in Xms, but a stray handle (HTTP keep-alive, telegram fetch, plugin native handle) can keep the event loop alive after every owned subsystem closed. src/cli/gateway-cli/run-loop.ts:382 has an existing armForceExitTimer(SHUTDOWN_TIMEOUT_MS) safety net, but it only arms for runAcceptedRequest (signal-driven), not for the closeOnStartupFailure path in src/gateway/server.impl.ts:1059. That gap is exactly the rh-bot incident shape.

This PR adds armGatewayPostShutdownExitWatchdog in server-close.ts. After "completed cleanly" is logged, an unref'd setTimeout is armed (default 5_000ms, configurable via OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS). When it fires:

  • emit a structured shutdownLog.warn with the active-handle constructor summary (via process._getActiveHandles or getActiveResourcesInfo)
  • record the gateway.shutdown.zombie_detected restart trace event for OTel/Loki
  • call process.exit(0)
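
As a rough illustration of how the timeout override might be resolved, here is a minimal sketch. Only the env variable name and the 5_000 ms default come from the description above; the function name and the fall-back-on-invalid-input behavior are assumptions made for the example, not taken from the diff.

```typescript
// Hypothetical helper. Only the default (5_000 ms) and the variable name
// OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS come from the PR description.
const DEFAULT_POST_SHUTDOWN_EXIT_TIMEOUT_MS = 5_000;

function resolvePostShutdownExitTimeoutMs(raw: string | undefined): number {
  // Unset or blank override: use the default.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_POST_SHUTDOWN_EXIT_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Assumption: non-numeric, non-finite, or non-positive values fall back
  // to the default instead of disabling the watchdog.
  if (!Number.isFinite(parsed) || parsed <= 0) {
    return DEFAULT_POST_SHUTDOWN_EXIT_TIMEOUT_MS;
  }
  return Math.floor(parsed);
}
```

A caller would pass process.env.OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS as the argument; keeping the parser pure makes it testable without touching the environment.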

Because the timer is unref'd, a healthy natural exit always wins. The watchdog only fires when something else holds the loop open. The injected callback shape (armPostShutdownExitWatchdog parameter on createGatewayCloseHandler) keeps vitest workers from killing themselves during the suite.
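
The watchdog shape described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the option names, the injectable scheduler, and the function name are all hypothetical. Only the unref'd timer, the injected exit callback, and the warn message wording (quoted from the proof output) come from this description.

```typescript
// Illustrative sketch only; option and function names are hypothetical.
interface ExitWatchdogOptions {
  timeoutMs: number;
  exitProcess: (code: number) => void; // injected so tests never kill the runner
  warn: (message: string) => void;
  schedule?: (fn: () => void, ms: number) => unknown; // defaults to setTimeout
  cancelScheduled?: (handle: unknown) => void; // defaults to clearTimeout
}

function armExitWatchdogSketch(opts: ExitWatchdogOptions): { cancel: () => void } {
  const schedule =
    opts.schedule ?? ((fn: () => void, ms: number) => setTimeout(fn, ms));
  const cancelScheduled =
    opts.cancelScheduled ??
    ((handle: unknown) => clearTimeout(handle as ReturnType<typeof setTimeout>));

  const handle = schedule(() => {
    opts.warn(
      `gateway shutdown completed but node process still alive after ${opts.timeoutMs}ms; forcing process.exit(0)`,
    );
    opts.exitProcess(0);
  }, opts.timeoutMs);

  // unref() so the pending timer alone never keeps an idle event loop alive;
  // a healthy natural exit therefore wins the race.
  (handle as { unref?: () => void } | null | undefined)?.unref?.();

  return { cancel: () => cancelScheduled(handle) };
}
```

In production wiring, exitProcess would be (code) => process.exit(code); in a test it can be a spy, which is the point of the injected shape.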

Layer 2: 503 on /healthz during shutdown, plus tighter preflight

The deployed handleGatewayProbeRequest in src/gateway/server-http.ts returns 200 unconditionally for /healthz. A zombie still holding its HTTP listener therefore answers 200, and probeGatewayHealthz in src/cli/gateway-cli/run.ts:369 accepts it as healthy (its success criterion was statusCode < 500). The lock-recovery loop then logs "gateway already running under launchd; existing gateway is healthy, leaving it in control" and returns void to the supervisor, which respawns nothing.

This PR adds a module-level shutting-down flag in server-close.ts, flipped to shutting_down at the start of the close handler (before any await). handleGatewayProbeRequest consults that flag via an injectable getShuttingDown accessor (which defaults to isGatewayShuttingDown). When live probes (/health, /healthz) run while the gateway is shutting down, they return 503 with body { "live": false, "phase": "shutting_down" }. A gateway.healthz.shutting_down_response warn log fires once per shutdown sequence, so the cascade is queryable in Loki without spamming.
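
A simplified model of this flag-plus-accessor arrangement might look like the sketch below. The function names are hypothetical stand-ins and the handler is reduced to a pure function; only the two response bodies (taken from the proof output in this PR) and the injectable-accessor idea come from the description. It models the behavior as stated in this summary, before the later review round narrowed the 503 to strict probes.

```typescript
// Hypothetical, simplified model of the live-probe branch; the real handler
// works on HTTP request/response objects rather than returning a value.
interface ProbeResult {
  statusCode: number;
  body: Record<string, unknown>;
}

// Module-level lifecycle flag (names are stand-ins for this sketch).
let gatewayShuttingDown = false;

function markGatewayShuttingDownSketch(): void {
  // Called synchronously at the start of close, before any await.
  gatewayShuttingDown = true;
}

function resetGatewayShuttingDownSketch(): void {
  // Called at startup so an in-process restart answers 200 again.
  gatewayShuttingDown = false;
}

function isGatewayShuttingDownSketch(): boolean {
  return gatewayShuttingDown;
}

function respondToLiveProbeSketch(
  getShuttingDown: () => boolean = isGatewayShuttingDownSketch,
): ProbeResult {
  if (getShuttingDown()) {
    return { statusCode: 503, body: { live: false, phase: "shutting_down" } };
  }
  return { statusCode: 200, body: { ok: true, status: "live" } };
}
```

Injecting the accessor lets a test force either branch without mutating module state.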

probeGatewayHealthz is tightened to accept only statusCode === 200. The previous < 500 check admitted any 3xx or 4xx response (for example from a misconfigured or misrouted responder). It already rejected 5xx, so the new 503 fails under either predicate; requiring exactly 200 closes the remaining gap and makes the Layer 2 contract explicit. The lock-recovery loop now also emits gateway.preflight.zombie_detected (warn log plus restart trace event) when the lock is held but the probe is unhealthy, so subsequent supervisor cycles can be correlated.
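
To make the predicate change concrete, the two acceptance rules can be compared side by side. The function names are hypothetical; the two conditions are the ones quoted in this PR. Note that a 503 already fails `< 500`, so the two rules differ on other non-200 codes such as 404.

```typescript
// Hypothetical names; the conditions are the ones quoted in the PR text.
const acceptedByOldPredicate = (statusCode: number): boolean => statusCode < 500;
const acceptedByNewPredicate = (statusCode: number): boolean => statusCode === 200;

// Status codes worth comparing: healthy, misrouted, and shutting down.
const comparison = [200, 404, 503].map((statusCode) => ({
  statusCode,
  old: acceptedByOldPredicate(statusCode),
  patched: acceptedByNewPredicate(statusCode),
}));
```

Under these definitions, 200 is accepted by both, 404 only by the old rule, and 503 by neither.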

The flag resets at the start of startGatewayServer so the in-process restart path returns to a running state before /healthz starts answering 200 again.

OTel / Loki instrumentation

All four events emit via the existing gatewayLog/shutdownLog/gatewayProbeLog subsystems plus recordGatewayRestartTrace:

  • gateway.shutdown.zombie_detected (Layer 1 trace) when the post-shutdown watchdog fires
  • gateway.healthz.shutting_down_response (Layer 2 warn log) when /healthz returns 503 during shutdown
  • gateway.preflight.zombie_detected (Layer 2 warn log + trace) when supervised lock recovery sees a held lock but an unhealthy probe
  • the existing restart.close.total trace already covers clean-path shutdown completion

No new export plumbing is needed; the existing diagnostics-otel / Loki shipper consumes these events.

Verification

Local Vitest with the patch applied:

$ node scripts/run-vitest.mjs run src/gateway/server-close.test.ts src/gateway/server-http.probe.test.ts src/gateway/probe.close-drain.test.ts src/gateway/server.preauth-hardening.test.ts src/cli/gateway-cli/run.supervised-lock.test.ts
Test Files  4 passed (4)
     Tests  59 passed (59)
Test Files  1 passed (1)
     Tests  10 passed (10)

pnpm build clean. node scripts/run-oxlint-shards.mjs --only=core --threads=4 clean. node scripts/check-dynamic-import-warts.mjs shows only the pre-existing mcp-cli.ts advisory, unchanged by this PR.

Real behavior proof

Behavior addressed: gateway zombie shutdown that left rh-bot offline 2h13m on 2026-05-31. With this PR the watchdog forces process.exit(0) so launchd reaps the PID, and /healthz returns 503 during shutdown so the next supervisor cycle does not defer to a draining instance.

Real environment tested: live deployed openclaw gateway on rh-bot.lan, build SHA 23804e6 (incident reproducer). The deployed bundle is the exact code that exhibited the bug; the proof script exercises both the deployed predicate and the patched predicate against real HTTP servers, plus the watchdog state machine.

Exact steps or command run after this patch:

scp /tmp/proof-zombie-amittell.mjs alexm@rh-bot.lan:/tmp/proof-zombie-amittell.mjs
ssh alexm@rh-bot.lan 'cd /tmp && node proof-zombie-amittell.mjs'
ssh alexm@rh-bot.lan 'curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:18789/healthz'

Evidence after fix:

== gateway zombie-shutdown fix proof ==
host: rh-bot.lan
node: v25.8.2
date: 2026-06-01T04:28:32.289Z

[PASS] case1 deployed handler returns 200 mid-shutdown so old preflight defers to zombie :: accepted=true
[PASS] case2 patched predicate (=== 200) rejects 503 as unhealthy :: accepted=false
[PASS] case3 patched predicate accepts a healthy 200 response :: accepted=true
[PASS] case4 watchdog calls exitProcess(0) after timeout :: exited=true; warnMessages=1
[PASS] case4 watchdog warn log names the timeout in ms :: gateway shutdown completed but node process still alive after 50ms; forcing process.exit(0)
[PASS] case5 cancelling the watchdog prevents the forced exit :: exited=false
[PASS] case6a healthz returns 200 ok:true when running :: {"statusCode":200,"body":{"ok":true,"status":"live"}}
[PASS] case6b healthz returns 503 live:false phase:shutting_down when shutdown started :: {"statusCode":503,"body":{"live":false,"phase":"shutting_down"}}

PROOF SUMMARY  passed=8  failed=0

Live deployed gateway healthz hit (confirms the case 1 handler shape the proof script reproduces):

$ curl -s -w "status=%{http_code}\n" http://127.0.0.1:18789/healthz
status=200
{"ok":true,"status":"live"}

Observed result after fix: case 1 confirms the deployed handler shape responsible for the incident. Cases 2 and 3 confirm the patched probe predicate. Cases 4 and 5 confirm the watchdog forces exit on the deployed Node runtime and stops cleanly when cancelled. Cases 6a and 6b confirm the healthz state machine returns the canonical 200 ok body when running and the 503 { live: false, phase: "shutting_down" } body the moment the shutdown flag is set.

What was not tested: a true end-to-end shutdown of a real OpenClaw gateway on rh-bot with the new bundle (requires deployment of this branch); the long-running TCP keepalive vs telegram-fetch handle-leak scenario, which the watchdog handles by design (any stray handle keeps the loop alive past the watchdog timeout, and the watchdog forces exit regardless of the specific handle source).

Copilot AI review requested due to automatic review settings June 1, 2026 04:29
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: M proof: supplied External PR includes structured after-fix real behavior proof. labels Jun 1, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses a "zombie gateway" failure mode where a node process that has begun (or completed) shutdown still holds the HTTP listener and answers /healthz with 200, causing supervised lock recovery to defer to it instead of relaunching. It introduces a shutting-down flag that flips /healthz to 503 at the start of close, tightens the preflight probe to only accept 200, and arms a post-shutdown watchdog that force-exits if a stray handle keeps the process alive.

Changes:

  • Add a module-level gatewayShuttingDownState flag in server-close.ts (set before any close await) and route /healthz, /health through it to return 503 with { live: false, phase: "shutting_down" }.
  • Add armGatewayPostShutdownExitWatchdog that force-calls process.exit(0) after a timeout (configurable via OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS) with active-handle diagnostics, and wire it into the close handler.
  • Tighten probeGatewayHealthz to require 200 (not < 500) and emit a gateway.preflight.zombie_detected trace + warn when the lock is held but the probe is unhealthy.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

File | Description
src/gateway/server-close.ts | Adds shutting-down flag, accessors, and post-shutdown exit watchdog; flips flag at start of close and arms watchdog at end.
src/gateway/server-http.ts | Routes live probes through getShuttingDown() and returns 503 during shutdown; adds a once-per-shutdown warn log and a test reset hook.
src/gateway/server-runtime-state.ts | Threads getShuttingDown through to the HTTP server constructor.
src/gateway/server.impl.ts | Resets the shutting-down flag at startup so in-process restarts begin answering 200 again.
src/cli/gateway-cli/run.ts | Tightens healthz probe to === 200; logs/traces zombie detection on unhealthy probe with held lock; exposes probeGatewayHealthz for tests.
src/gateway/server-http.probe.test.ts | New tests for 503 healthz/health responses during shutdown, including HEAD and module-flag paths.
src/gateway/server-close.test.ts | New tests for flag flip ordering and the post-shutdown exit watchdog behavior.
src/cli/gateway-cli/run.supervised-lock.test.ts | Tests preflight zombie-detected logging and probe healthy/unhealthy classification via a real local server.

Comment thread src/gateway/server.impl.ts Outdated
Comment thread src/cli/gateway-cli/run.ts Outdated
Comment thread src/gateway/server-close.ts Outdated
@clawsweeper
Contributor

clawsweeper Bot commented Jun 1, 2026

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 5:31 AM ET / 09:31 UTC.

Summary
The PR adds a post-clean-shutdown Gateway exit watchdog, shutdown lifecycle state for strict /healthz?strict=1 probing, supervised lock-recovery zombie telemetry, and regression coverage for those paths.

PR surface: Source +303, Tests +360. Total +663 across 9 files.

Reproducibility: yes, from source and supplied terminal proof. Current main returns 200 for live probes and accepts non-500 preflight responses, while the PR body shows the incident-shaped predicate failure and patched behavior. I did not run a live launchd zombie reproduction in this read-only review.

Review metrics: 1 noteworthy metric.

  • Env surface added: 1 added. The PR adds OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS, and repository policy treats new env/config surfaces as compatibility-sensitive before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Record maintainer acceptance or requested adjustment for the forced-exit watchdog and new timeout env surface before merge.

Risk before merge

  • [P1] The watchdog intentionally calls process.exit(0) after a clean close if residual handles keep the event loop alive; maintainers should explicitly accept sacrificing those handles as the availability recovery path.
  • [P1] Supervised lock recovery now treats non-200 strict /healthz?strict=1 responses as unhealthy, which is useful for zombies but changes supervisor behavior for draining, misrouted, or otherwise non-200 responders.
  • [P1] The PR adds OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS, a new operator-facing env surface that should be accepted or documented before merge under the repo's env-surface policy.

Maintainer options:

  1. Accept the new shutdown contract (recommended)
    Maintainers can explicitly accept the forced-exit watchdog, strict internal probe, and env timeout as the intended Gateway availability behavior, then land with the current tests and proof.
  2. Remove or document the env surface
    Before merge, either make the watchdog timeout a fixed internal constant or add the accepted env override to the relevant Gateway operator docs.
  3. Pause for a narrower lifecycle fix
    If maintainers do not want post-clean process.exit(0), pause this PR and pursue a narrower shutdown-drain or supervisor-exit-code design instead.

Next step before merge

  • [P2] No narrow automated repair remains; the next step is maintainer judgment on the forced-exit watchdog, strict internal probe behavior, and env timeout surface.

Security
Cleared: The diff does not add dependencies, workflow changes, credential handling, artifact downloads, or broader permissions; the security-sensitive aspect is availability-oriented process exit and is covered as merge risk.

Review details

Best possible solution:

Maintain a conservative Gateway availability contract: harden supervised recovery from zombie listeners while preserving public liveness probes, and make the forced-exit timeout/env policy an explicit maintainer-owned decision before merge.

Do we have a high-confidence way to reproduce the issue?

Yes, from source and supplied terminal proof: current main returns 200 for live probes and accepts non-500 preflight responses, while the PR body shows the incident-shaped predicate failure and patched behavior. I did not run a live launchd zombie reproduction in this read-only review.

Is this the best way to solve the issue?

Unclear as a final product decision: the two-layer defense is a plausible and well-tested mitigation for the reported failure, but the forced process.exit(0) watchdog and env override need maintainer sign-off because they define supervisor-visible shutdown semantics.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 601ab84f35e1.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Gateway host exercising the deployed predicate, patched strict probe predicate, watchdog state machine, and shutdown healthz state response; the stated remaining gap is deployment of this exact branch end to end.

Label justifications:

  • P1: The PR addresses a real supervised Gateway availability outage where a zombie process kept the bot offline, but the fix still needs maintainer acceptance before merge.
  • merge-risk: 🚨 compatibility: The diff changes supervisor lock-recovery health semantics and adds a new env timeout surface that can affect existing managed Gateway setups.
  • merge-risk: 🚨 availability: The diff intentionally changes process-exit behavior and health probing in the Gateway shutdown path, where mistakes can leave services stopped or flapping.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes after-fix terminal output from a real Gateway host exercising the deployed predicate, patched strict probe predicate, watchdog state machine, and shutdown healthz state response; the stated remaining gap is deployment of this exact branch end to end.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Gateway host exercising the deployed predicate, patched strict probe predicate, watchdog state machine, and shutdown healthz state response; the stated remaining gap is deployment of this exact branch end to end.
Evidence reviewed

PR surface:

Source +303, Tests +360. Total +663 across 9 files.

View PR surface stats
Area | Files | Added | Removed | Net
Source | 6 | 307 | 4 | +303
Tests | 3 | 363 | 3 | +360
Docs | 0 | 0 | 0 | 0
Config | 0 | 0 | 0 | 0
Generated | 0 | 0 | 0 | 0
Other | 0 | 0 | 0 | 0
Total | 9 | 670 | 7 | +663

What I checked:

  • Current main live probe behavior: On current main, handleGatewayProbeRequest returns HTTP 200 with { ok: true, status } for live /health and /healthz probes regardless of shutdown state, so the PR is not already implemented on main. (src/gateway/server-http.ts:316, 601ab84f35e1)
  • Current main supervised lock probe: On current main, probeGatewayHealthz calls plain /healthz and treats any numeric status below 500 as healthy, matching the failure mode described in the PR body. (src/cli/gateway-cli/run.ts:369, 601ab84f35e1)
  • PR strict internal health probe: The PR head changes the supervised preflight to /healthz?strict=1 and accepts only HTTP 200 as healthy, while preserving plain public /healthz behavior. (src/cli/gateway-cli/run.ts:370, b83669c43c32)
  • PR shutdown-aware probe state: The PR head adds strict live-probe shutdown handling that returns HTTP 503 with { live: false, phase: "shutting_down" } only when the strict query is present and the shutdown state is active. (src/gateway/server-http.ts:341, b83669c43c32)
  • PR post-shutdown watchdog: The PR head adds armGatewayPostShutdownExitWatchdog, which logs active handles, records gateway.shutdown.zombie_detected, and calls process.exit(0) if the process remains alive after the timeout; the close handler skips this watchdog for restart-shaped reasons. (src/gateway/server-close.ts:107, b83669c43c32)
  • Regression tests added: The PR head adds tests for strict 503 liveness, public 200 liveness compatibility, per-cycle log dedupe, watchdog arm/skip behavior, and supervised probe classification. (src/gateway/server-http.probe.test.ts:349, b83669c43c32)

Likely related people:

  • Peter Steinberger: The available history shows the highest recent commit volume across the central Gateway CLI/server files and recent release/runtime work adjacent to this startup and shutdown surface. (role: recent area contributor; confidence: medium; commits: e93216080aa1; files: src/gateway/server-close.ts, src/gateway/server-http.ts, src/cli/gateway-cli/run.ts)
  • Vincent Koc: History search shows Vincent introduced the /healthz and /readyz probe endpoints and later touched supervised Gateway hardening, both central to this PR's strict-probe behavior. (role: feature history owner; confidence: high; commits: eeb72097ba8e, beadd4c55306; files: src/gateway/server-http.ts, src/cli/gateway-cli/run.ts, docs/gateway/gateway-lock.md)
  • merlin: History search ties the current restart-startup failure policy that keeps the process alive after a failed in-process restart to commit 6740cdf, which is adjacent to the watchdog's process-exit semantics. (role: introduced adjacent lifecycle behavior; confidence: medium; commits: 6740cdf160a2; files: src/cli/gateway-cli/run-loop.ts)
  • WJzz1: Current git blame points the central Gateway files to a recent broad commit, but the commit title does not clearly indicate ownership of this behavior, so this is a weak routing signal only. (role: recent current-main toucher; confidence: low; commits: 6349af650240; files: src/gateway/server-close.ts, src/gateway/server-http.ts, src/cli/gateway-cli/run.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 1, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@amittell
Contributor Author

amittell commented Jun 1, 2026

Thanks for the detailed review. Addressed all three findings in 22aab0b926:

P1 (server-close.ts:879): restart-aware watchdog. Gated the watchdog on the close reason. When the reason matches /restart(ing|ed)?/i (the SIGUSR1 in-process restart path, e.g. "gateway restarting"), the watchdog is not armed at all. A new test, "skips the post-shutdown exit watchdog on in-process restart close reasons", pins this behavior, and a sibling test, "arms the watchdog for ordinary stop reasons", guards the inverse.
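
The gating rule can be sketched as a small predicate. The regular expression is the one quoted above; the function name and the treatment of a missing reason (arm the watchdog) are assumptions for illustration.

```typescript
// Pattern quoted from the comment above; the function name is hypothetical.
const RESTART_REASON_PATTERN = /restart(ing|ed)?/i;

function shouldArmExitWatchdogSketch(reason: string | undefined): boolean {
  // Assumption: a missing reason is treated as an ordinary stop, so the
  // watchdog is armed. Restart-shaped reasons skip it because the process is
  // expected to stay alive for the in-process restart.
  return !RESTART_REASON_PATTERN.test(reason ?? "");
}
```

Because the pattern has no global flag, repeated test() calls carry no state between invocations.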

P1 (server-http.ts:324): narrowed live-probe contract. Public /health and /healthz continue to return 200 during shutdown, preserving the legacy contract for external monitors and service managers. A new isStrictLiveProbeRequest helper checks for ?strict=1 (or ?strict=true). Only strict-mode probes get the shutdown-aware 503. The supervised lock-recovery preflight in probeGatewayHealthz now opts into strict mode (path: "/healthz?strict=1"). New regression tests pin both branches: "returns 200 on plain /healthz even when shutting down (public probe contract)" and "returns 503 on /healthz?strict=1 when the gateway is shutting down".
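
A minimal sketch of the strict-probe check, assuming the helper receives the request's path-and-query string. The function names and the exact-match handling of the query value are assumptions; only the ?strict=1 / ?strict=true opt-in and the 200-versus-503 outcome come from the comment above.

```typescript
// Hypothetical stand-in for the helper described above.
function isStrictLiveProbeSketch(requestUrl: string): boolean {
  // A base is required because server-side request URLs are path-only,
  // e.g. "/healthz?strict=1".
  const url = new URL(requestUrl, "http://localhost");
  const strict = url.searchParams.get("strict");
  // Assumption: only the exact values "1" and "true" opt in.
  return strict === "1" || strict === "true";
}

function liveProbeStatusSketch(requestUrl: string, shuttingDown: boolean): number {
  // Public probes keep the legacy 200; only strict probes see the 503.
  return shuttingDown && isStrictLiveProbeSketch(requestUrl) ? 503 : 200;
}
```

This keeps external monitors on the legacy contract while letting the supervised preflight distinguish a draining instance.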

P3 (run.ts:462-466): deduped zombie trace. Added a zombieDetectedThisCycle flag scoped to the supervised-recovery for-loop. The gateway.preflight.zombie_detected warn + restart-trace event now fires exactly once per recovery cycle, not on every retry tick. New assertion in the existing test verifies the count is 1.
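
The once-per-cycle behavior can be modeled with a loop-scoped flag. This sketch is deliberately synchronous and takes pre-computed probe outcomes, whereas the real recovery loop probes over HTTP; the function name, the outcome type, and the callback are hypothetical. Only the zombieDetectedThisCycle flag name and its scope come from the comment above.

```typescript
// Simplified, synchronous model; the real loop awaits an HTTP probe per tick.
type ProbeOutcome = "healthy" | "unhealthy";

function runRecoveryCycleSketch(
  outcomes: ProbeOutcome[], // one entry per retry tick, in order
  emitZombieDetected: () => void, // stands in for the warn log + trace event
): ProbeOutcome {
  let zombieDetectedThisCycle = false;
  for (const outcome of outcomes) {
    if (outcome === "healthy") {
      return "healthy";
    }
    // Emit on the first unhealthy tick only, not on every retry.
    if (!zombieDetectedThisCycle) {
      zombieDetectedThisCycle = true;
      emitZombieDetected();
    }
  }
  return "unhealthy";
}
```

Scoping the flag inside the function means each new recovery cycle starts with a fresh flag and can emit again.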

All three changes ship in one commit with focused tests. pnpm tsgo:test, oxlint core shard, and the three affected test files all pass locally.

@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 1, 2026
amittell added a commit to amittell/openclaw that referenced this pull request Jun 1, 2026
…module

Per ClawSweeper P2+P3 review on openclaw#88908.

P2: Move isGatewayShuttingDown / markGatewayShuttingDown /
resetGatewayShuttingDownState out of server-close.ts into a new
src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now
imports from the lightweight module directly instead of loading
server-close.runtime (which transitively pulls in shutdown-only agent,
channel, and plugin cleanup code). server-close.ts and server-http.ts
re-export from the new module for source compatibility.

P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives
alongside the shutdown state and is reset on every markGatewayShuttingDown
and resetGatewayShuttingDownState call. Previously the flag latched for the
process lifetime and was only reset by tests, so the second in-process
restart shutdown was silent. New regression test pins per-cycle reset
behavior.
@amittell
Contributor Author

amittell commented Jun 1, 2026

Thanks for the followup review. Addressed the P2 and P3 architectural concerns in a9c883739d:

P2 (server.impl.ts:548-549): shutdown state out of close runtime. Extracted the shutdown lifecycle state symbols (isGatewayShuttingDown, markGatewayShuttingDown, resetGatewayShuttingDownState) into a new src/gateway/gateway-shutdown-state.ts module. The startup reset path now imports directly from that lightweight module (3 named exports + 1 logger import) instead of loading server-close.runtime.js (which transitively pulls in shutdown-only agent, channel, and plugin cleanup code). server-close.ts and server-http.ts re-export from the new module for source compatibility so existing in-tree callers don't break.

P3 (server-http.ts:54-56): per-cycle log dedupe reset. Moved shuttingDownResponseLogged + noteShuttingDownProbeResponse into the same lifecycle-state module, and made markGatewayShuttingDown / resetGatewayShuttingDownState reset the dedupe flag. Each new shutdown cycle now emits the gateway.healthz.shutting_down_response signal exactly once instead of latching for the process lifetime. The existing resetGatewayHealthzShuttingDownLogForTest test hook stays available via a re-export.

The new regression test "resets the shutting-down probe log dedupe on each shutdown cycle" in server-http.probe.test.ts exercises the following sequence: the first strict probe in cycle 1 emits 503; the second probe in cycle 1 still emits 503 (with the log deduped within the cycle); resetGatewayShuttingDownForTest plus markGatewayShuttingDown start cycle 2; a fresh strict probe emits 503 again.
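
The per-cycle dedupe reset can be sketched as two pieces of module state that are always reset together. The function names here are hypothetical stand-ins for the lifecycle module's exports, and the boolean return value is added purely to make the sketch easy to check; the shuttingDownResponseLogged flag name and the event name come from this thread.

```typescript
// Hypothetical stand-ins for the lifecycle-state module described above.
let shuttingDownSketch = false;
let shuttingDownResponseLogged = false;

function markShuttingDownSketch(): void {
  shuttingDownSketch = true;
  shuttingDownResponseLogged = false; // each shutdown cycle may log once
}

function resetShuttingDownSketch(): void {
  shuttingDownSketch = false;
  shuttingDownResponseLogged = false;
}

// Returns true only when this call emitted the once-per-cycle warn log.
function noteShuttingDownProbeResponseSketch(warn: (event: string) => void): boolean {
  if (!shuttingDownSketch || shuttingDownResponseLogged) {
    return false;
  }
  shuttingDownResponseLogged = true;
  warn("gateway.healthz.shutting_down_response");
  return true;
}
```

Keeping the dedupe flag next to the lifecycle flag makes it hard to reset one without the other, which is the latch bug this commit addresses.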

pnpm tsgo:test, oxlint core shard, and all three affected test files (39+18+10 = 67 tests) pass locally. Build clean.

Watchdog and strict-probe semantics from the earlier review remain unchanged; this commit is purely the lifecycle module extraction plus the per-cycle log reset.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 1, 2026
@amittell amittell force-pushed the fix/gateway-shutdown-zombie-exit branch from a9c8837 to 75863af Compare June 1, 2026 22:33
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
@amittell amittell force-pushed the fix/gateway-shutdown-zombie-exit branch from 75863af to 2028f4c Compare June 1, 2026 22:34
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 1, 2026
amittell added 8 commits June 2, 2026 05:19
…e hangs

When the gateway shutdown sequence completes cleanly but a stray handle
(HTTP keep-alive, telegram fetch, plugin native handle) keeps the node
event loop alive, the supervisor sees the gateway lock dropped but the
PID never reaps. The HTTP listener stays bound, and the next launchd
respawn defers via the lock-recovery path because /healthz still answers
200 from the zombie. Real incident on rh-bot.lan 2026-05-31 21:31 left
the bot offline for 2h13m until manual kill -9 + launchctl kickstart.

This commit arms a process-wide watchdog (default 5_000ms, configurable
via OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS) after every clean
shutdown completion. The timer is unref'd so it never blocks a healthy
natural exit; when it does fire, it emits a structured warn log with
active-handle summary, records the gateway.shutdown.zombie_detected
restart trace event for OTel/Loki, and calls process.exit(0).

Also flips the new isGatewayShuttingDown flag at the start of the close
sequence (before any await) so layer 2 can act on it, and resets the
flag at startup so in-process restart paths re-enter the running state.
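
A minimal sketch of the watchdog pattern described in this commit message, not the PR's actual code. The names `WatchdogDeps`, `resolveExitTimeoutMs`, and `armExitWatchdog` are hypothetical, and the parsing rule for the timeout value is an assumption. The exit and log functions are injected so the behavior can be exercised without terminating the process.

```typescript
// Hypothetical sketch; names and signatures are illustrative, not the PR's code.
type WatchdogDeps = {
  timeoutMs: number;
  exit: (code: number) => void; // process.exit in production
  warn: (message: string) => void; // structured logger in production
};

// Assumed parsing rule: use the default unless the raw value is a
// non-negative finite number.
function resolveExitTimeoutMs(raw: string | undefined, fallbackMs = 5_000): number {
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallbackMs;
}

// Arms the timer and returns a cancel function.
function armExitWatchdog(deps: WatchdogDeps): () => void {
  const timer = setTimeout(() => {
    deps.warn(`event loop still alive ${deps.timeoutMs}ms after clean shutdown; forcing exit`);
    deps.exit(0);
  }, deps.timeoutMs);
  // In Node, unref() stops this timer from keeping the event loop alive on
  // its own, so a healthy process exits naturally before the timer fires.
  (timer as unknown as { unref?: () => void }).unref?.();
  return () => clearTimeout(timer);
}
```

In a real process the caller would presumably pass `process.exit` as `exit` and feed the value of OPENCLAW_GATEWAY_POST_SHUTDOWN_EXIT_TIMEOUT_MS into `resolveExitTimeoutMs`. The timer only fires if some other handle is still holding the event loop open, which is the zombie case.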
…-recovery preflight

The lock-recovery preflight in runGatewayLoopWithSupervisedLockRecovery
accepted any /healthz response with status < 500 as healthy. When the
previous gateway is mid-shutdown (or a zombie that lost its close path
but kept its HTTP listener), that includes the 503 we now return, so
the supervised respawn deferred to a draining or stuck instance and
left the bot offline.

This commit:

  1. Adds a shutting-down state hook owned by server-close (flag is
     flipped before any close-handler await), exposed via the
     isGatewayShuttingDown accessor.
  2. Updates handleGatewayProbeRequest so live probes (/health,
     /healthz) return 503 with body { live: false, phase:
     "shutting_down" } while the flag is set. Once-per-shutdown warn
     log via the gateway/probe subsystem captures the cascade for
     OTel/Loki ingestion.
  3. Threads getShuttingDown through createGatewayHttpServer and
     server-runtime-state so production callers and tests share the
     same seam.
  4. Tightens probeGatewayHealthz to treat only HTTP 200 as healthy.
     The < 500 acceptance was the second-order defense gap that let
     the zombie cascade survive every supervisor cycle.
  5. Adds a gateway.preflight.zombie_detected warn log + restart trace
     event in the supervised lock-recovery path so the cascade is
     queryable in Loki when it recurs.
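
The probe-side and preflight-side rules from items 2 and 4 can be sketched roughly as follows. This is an illustration under assumptions: `respondToLiveProbe` and `isHealthyProbeStatus` are hypothetical names, and the body returned in the healthy case is invented for the example. Only the 503 body shape comes from the commit message.

```typescript
// Hypothetical sketch; not the PR's handleGatewayProbeRequest or probeGatewayHealthz.
type ProbeResponse = { status: number; body: Record<string, unknown> };

// Live probe: report 503 while the shutting-down flag is set.
function respondToLiveProbe(isShuttingDown: () => boolean): ProbeResponse {
  if (isShuttingDown()) {
    return { status: 503, body: { live: false, phase: "shutting_down" } };
  }
  // Assumed healthy body; the real payload is not shown in the source.
  return { status: 200, body: { live: true } };
}

// Preflight classification: only an exact 200 counts as healthy.
// A looser "status < 500" rule would also accept, for example, 404.
function isHealthyProbeStatus(status: number): boolean {
  return status === 200;
}
```

Passing the flag as a getter mirrors the seam idea in item 3: production code and tests can supply different sources for the same boolean.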
…ponse, and lock-recovery preflight tightening

Covers the three production behaviors introduced in this branch:

  1. armGatewayPostShutdownExitWatchdog: forces process.exit(0) when
     fired, can be cancelled, and emits a warn log naming the timeout
     and likely handle culprits.
  2. createGatewayCloseHandler: flips the shutting-down flag before
     any await and threads the watchdog seam through to tests.
  3. handleGatewayProbeRequest: returns 503 with the
     { live: false, phase: "shutting_down" } body on both /healthz
     and /health when the flag is set, including HEAD requests.
  4. probeGatewayHealthz: real HTTP listener returning 503 is now
     classified as unhealthy (vs the prior < 500 acceptance).
  5. runGatewayLoopWithSupervisedLockRecovery: logs the
     gateway.preflight.zombie_detected signal when a held lock pairs
     with an unhealthy probe response.
- Skip the post-shutdown watchdog on in-process restart reasons so the SIGUSR1
  path does not kill the restarted gateway 5s into its next life.
- Narrow live-probe 503 to ?strict=1 query, preserving public /health and
  /healthz 200 contract for external monitors. Supervised lock-recovery
  preflight now uses /healthz?strict=1 to opt into shutdown-aware behavior.
- Dedupe gateway.preflight.zombie_detected so a draining prior gateway emits
  the trace + warn once per recovery cycle, not on every retry tick.

Includes regression tests for all three fixes.
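
One way to picture the `?strict=1` narrowing from the second bullet, as a sketch under assumptions: `isStrictProbe` and `probeStatusFor` are hypothetical helpers, and matching the exact value `1` is an assumed parsing rule.

```typescript
// Hypothetical sketch; not the PR's actual request handling.

// Request URLs arrive as paths, so a placeholder base is supplied for parsing.
function isStrictProbe(rawUrl: string): boolean {
  const url = new URL(rawUrl, "http://localhost");
  return url.searchParams.get("strict") === "1";
}

// Only strict callers see the shutdown-aware 503. Everyone else keeps the
// public 200 contract, even while the gateway is draining.
function probeStatusFor(rawUrl: string, shuttingDown: boolean): number {
  return shuttingDown && isStrictProbe(rawUrl) ? 503 : 200;
}
```

With this split, an external monitor polling /healthz sees no change, while the supervised preflight calling /healthz?strict=1 can tell a draining gateway from a healthy one.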
…module

Per ClawSweeper P2+P3 review on openclaw#88908.

P2: Move isGatewayShuttingDown / markGatewayShuttingDown /
resetGatewayShuttingDownState out of server-close.ts into a new
src/gateway/gateway-shutdown-state.ts module. server.impl.ts startup now
imports from the lightweight module directly instead of loading
server-close.runtime (which transitively pulls in shutdown-only agent,
channel, and plugin cleanup code). server-close.ts and server-http.ts
re-export from the new module for source compatibility.

P3: The shutting-down probe log dedupe (one-shot per shutdown) now lives
alongside the shutdown state and is reset on every markGatewayShuttingDown
and resetGatewayShuttingDownState call. Previously the flag latched for the
process lifetime and was only reset by tests, so the second in-process
restart shutdown was silent. New regression test pins per-cycle reset
behavior.
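
A rough sketch of what a lightweight shutdown-state module with per-cycle log dedupe could look like. The three state functions reuse the names from the commit message, but their bodies are assumptions, and `claimShuttingDownProbeLog` is a hypothetical helper added for illustration.

```typescript
// Hypothetical sketch of a dependency-free state module. In a real module
// these functions would be exported.
let shuttingDown = false;
let probeLogEmitted = false;

function isGatewayShuttingDown(): boolean {
  return shuttingDown;
}

// Entering shutdown also re-arms the one-shot probe log for this cycle.
function markGatewayShuttingDown(): void {
  shuttingDown = true;
  probeLogEmitted = false;
}

// Startup and in-process restart return to the running state.
function resetGatewayShuttingDownState(): void {
  shuttingDown = false;
  probeLogEmitted = false;
}

// Returns true for the first caller in a shutdown cycle, false afterwards,
// so the warn log is emitted once per shutdown rather than once per process.
function claimShuttingDownProbeLog(): boolean {
  if (probeLogEmitted) return false;
  probeLogEmitted = true;
  return true;
}
```

Keeping the dedupe flag next to the shutdown flag means both are cleared by the same calls, which is what lets a second in-process restart log again.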
@amittell amittell force-pushed the fix/gateway-shutdown-zombie-exit branch from 2028f4c to b83669c Compare June 2, 2026 09:22
@openclaw-barnacle openclaw-barnacle Bot removed the "proof: sufficient" label Jun 2, 2026
@clawsweeper clawsweeper Bot added the "proof: sufficient" label Jun 2, 2026

Labels

- cli (CLI command changes)
- gateway (Gateway runtime)
- merge-risk: 🚨 availability 🚨 (May cause crashes, hangs, restart loops, stalls, or process outages.)
- merge-risk: 🚨 compatibility 🚨 (May break existing users, config, migrations, defaults, or upgrade paths.)
- P1 (High-priority user-facing bug, regression, or broken workflow.)
- proof: sufficient (ClawSweeper judged the real behavior proof convincing.)
- proof: supplied (External PR includes structured after-fix real behavior proof.)
- rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
- size: L
- status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)


2 participants