fix(gateway): hold startup-gated requests at server until post-attach (closes #67160)#68146

Open
sparkeros wants to merge 1 commit into
openclaw:mainfrom
sparkeros:fix/gateway-startup-gate-server-side-queue

Conversation

@sparkeros

Summary

  • Problem: The openclaw-gateway startup gate introduced in #65365 ("fix(gateway): defer cron/heartbeat activation until sidecars ready (#65322)") responds UNAVAILABLE to chat.history and models.list for the 8–15 s window between [gateway] ready and post-attach sidecar registration. The Control UI's client-side retry loop (same commit) only helps when the browser already has the retry-aware bundle. Every time an update restarts the gateway, any browser tab left open is still running the pre-fix bundle: it fires one chat.history request, gets UNAVAILABLE, never retries, and pins the red "GatewayRequestError: chat.history unavailable during gateway startup" banner.
  • Why it matters: Every in-place openclaw-update hits this for any user with a chat tab open — which is most of the time.
  • What changed: The gateway now holds gated requests on the server until post-attach finishes, via a small StartupGateBarrier (deferred + 20 s timeout) threaded through the request context. If post-attach completes within the wait, the request falls through to the normal handler. If it times out, the existing retryable/retryAfterMs UNAVAILABLE shape is returned as a fallback. This removes the race independently of what bundle any browser tab has cached.
  • What did NOT change (scope boundary): the UI is untouched. The ui/src/ui/controllers/chat.ts retry and the ui/src/ui/controllers/models.ts silent-swallow both stay as belt-and-braces fallbacks. The UI-side retry in PR #67951 ("fix(ui): retry chat.history during gateway startup without retryable") is complementary if it later merges.
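
For readers who have not opened the diff, the "deferred + 20 s timeout" idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the names StartupGateBarrier and STARTUP_GATE_WAIT_MS come from the description above, while the method names (open, wait) and the class body are assumptions.

```typescript
// Hypothetical sketch. Only the class and constant names come from the PR text;
// the method names and bodies are assumed for illustration.
const STARTUP_GATE_WAIT_MS = 20_000;

class StartupGateBarrier {
  private opened = false;
  private release!: () => void;
  private readonly gate: Promise<void>;

  constructor() {
    // The executor runs synchronously, so `release` is set before the constructor returns.
    this.gate = new Promise<void>((resolve) => {
      this.release = resolve;
    });
  }

  /** Called once post-attach finishes. Safe to call more than once. */
  open(): void {
    if (this.opened) return;
    this.opened = true;
    this.release();
  }

  /** Resolves true if the barrier opened in time, false if the wait timed out. */
  wait(timeoutMs: number = STARTUP_GATE_WAIT_MS): Promise<boolean> {
    if (this.opened) return Promise.resolve(true);
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      void this.gate.then(() => {
        clearTimeout(timer); // do not leave a stray timer behind once the gate opens
        resolve(true);
      });
    });
  }
}
```

A caller would treat a `false` result as the signal to return the existing retryable UNAVAILABLE response.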

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
  • Target test or file: new src/gateway/server-methods.startup-gate.test.ts (unit) + extended assertions in src/gateway/server-startup-post-attach.test.ts (seam).
  • Scenario the test should lock in:
    1. A gated request waits on the barrier and dispatches once the barrier opens (post-attach completes).
    2. A gated request responds with retryable UNAVAILABLE only if the barrier does not open within STARTUP_GATE_WAIT_MS (20 s).
    3. Non-gated methods bypass the wait entirely.
    4. Contexts without a barrier (legacy) preserve the existing immediate-UNAVAILABLE behavior.
  • Why this is the smallest reliable guardrail: the dispatch behavior is exercised directly via handleGatewayRequest; the sidecar-boot seam is exercised via the existing startGatewayPostAttachRuntime harness.
  • Existing test that already covers this (if any): server-startup-post-attach.test.ts already covered clearing of unavailableGatewayMethods; this PR extends it to also assert the barrier opens at the same point.
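
The four scenarios above can be read against a simplified dispatch function. The sketch below is an assumption-laden stand-in for handleGatewayRequest: the GateContext, BarrierLike and GateResult shapes and the dispatchWithStartupGate name are hypothetical, and only unavailableGatewayMethods, STARTUP_GATE_WAIT_MS, retryable and retryAfterMs (500 on current main, per the review further down) appear in this thread.

```typescript
// Hypothetical shapes for illustration; the real request context has many more fields.
interface BarrierLike {
  wait(timeoutMs: number): Promise<boolean>;
}

interface GateContext {
  unavailableGatewayMethods: Set<string>;
  startupGateBarrier?: BarrierLike;
}

type GateResult<T> =
  | { ok: true; value: T }
  | { ok: false; code: "UNAVAILABLE"; retryable: true; retryAfterMs: number };

const STARTUP_GATE_WAIT_MS = 20_000;

async function dispatchWithStartupGate<T>(
  method: string,
  ctx: GateContext,
  handler: () => Promise<T>,
  waitMs: number = STARTUP_GATE_WAIT_MS,
): Promise<GateResult<T>> {
  const unavailable: GateResult<T> = {
    ok: false,
    code: "UNAVAILABLE",
    retryable: true,
    retryAfterMs: 500,
  };
  if (ctx.unavailableGatewayMethods.has(method)) {
    // Scenario 4: no barrier on the context, so keep the immediate fail-fast.
    if (!ctx.startupGateBarrier) return unavailable;
    // Scenarios 1 and 2: hold the request until the barrier opens or the wait expires.
    const opened = await ctx.startupGateBarrier.wait(waitMs);
    // Re-check the set in case the barrier opened without this method being cleared.
    if (!opened || ctx.unavailableGatewayMethods.has(method)) return unavailable;
  }
  // Scenario 3, and scenario 1 once the barrier has opened: run the real handler.
  return { ok: true, value: await handler() };
}
```

Passing the wait as a parameter is a choice made for this sketch so that a timeout test does not need to wait the full 20 s.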

Before / after evidence

Before (pre-fix journal from reporter, journalctl --user -u openclaw-gateway):

05:22:00        SIGTERM (openclaw-update triggered)
05:22:20.138    [gateway] ready (6 plugins; 15.3s)
05:22:20.184    [gateway] starting channels and sidecars...
05:22:24.767    [ws] webchat connected … client=openclaw-control-ui webchat v2026.4.11
05:22:24.783    [ws] ⇄ res ✗ chat.history  0ms errorCode=UNAVAILABLE
05:22:24.796    [ws] ⇄ res ✗ models.list   0ms errorCode=UNAVAILABLE
05:22:32.738    [plugins] embedded acpx runtime backend registered

Note: only one request per method, no retry attempts — the cached UI bundle lacked the retry code. Red banner stays until the user reloads.

After (this PR, same box, identical restart-while-tab-open flow):

07:19:26.520    [gateway] ready (10.0s)
07:19:26.580    [gateway] starting channels and sidecars...
07:19:30.844    [ws] webchat connected … client=openclaw-control-ui webchat v2026.4.16
07:19:42.024    [ws] ⇄ res ✓ chat.history 11172ms
07:19:42.027    [ws] ⇄ res ✓ models.list  11162ms
07:19:42.111    [plugins] embedded acpx runtime backend ready

The gated requests are held on the server for ~11.2 s (the remainder of the post-attach window after the tab reconnected) and then dispatched to the real handler. Zero ✗ UNAVAILABLE rows. No banner.
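
As a sanity check on the hold duration, the timestamps in the "after" log can be subtracted directly. The helper below is a throwaway sketch (toMs is a hypothetical name) that uses only timestamps copied from the log above.

```typescript
// Convert an "HH:MM:SS.mmm" journal timestamp to milliseconds since midnight.
function toMs(stamp: string): number {
  const [h, m, s] = stamp.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

const sidecarsStarted = toMs("07:19:26.580"); // [gateway] starting channels and sidecars...
const connected = toMs("07:19:30.844");       // [ws] webchat connected
const historyDone = toMs("07:19:42.024");     // [ws] res ✓ chat.history

console.log(historyDone - connected);       // 11180
console.log(historyDone - sidecarsStarted); // 15444
```

The 11180 ms gap between connect and response lines up with the 11172 ms the gateway logged for chat.history, which suggests the request arrived a few milliseconds after the websocket connected. The whole post-attach phase took about 15.4 s on this run.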

AI disclosure

  • AI-assisted (designed and implemented in Claude Code; code review and end-to-end verification performed by the human author on a running gateway).
  • Degree of testing: fully tested. Unit tests pass, pnpm build and pnpm check clean, pnpm test:changed against upstream/main is green apart from one pre-existing flake in server.chat.gateway-server-chat.test.ts → "agent.wait keeps lifecycle wait active while same-runId chat.send is active" that reproduces on pristine upstream HEAD and is left untouched per CONTRIBUTING.md L119.
  • Degree of human review: the human author reviewed and understands the code, and exercised the fix end-to-end against the running openclaw-gateway on the reporter's box before opening this PR.

The startup gate from openclaw#65365 responds UNAVAILABLE to chat.history and
models.list during the 8-15s window between [gateway] ready and sidecar
registration. The UI retry loop that landed in the same commit is ineffective
when the browser tab already has a cached pre-fix bundle, which is the case
every time openclaw-update restarts the gateway.

Add a StartupGateBarrier (deferred + timeout) threaded through the
gateway request context. Dispatch now awaits the barrier with a 20s
timeout instead of failing fast, removing the race independently of the
UI bundle. The existing retryable/retryAfterMs UNAVAILABLE response is
preserved as a fallback for timeout and for legacy contexts without a
barrier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels Apr 17, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR fixes a race where the gateway startup gate immediately returned UNAVAILABLE to chat.history/models.list during the 8–15 s post-attach window, permanently pinning the red banner in browser tabs running a pre-fix UI bundle. The fix introduces a StartupGateBarrier (deferred promise + 20 s timeout) that holds gated requests server-side until post-attach completes, eliminating the race independently of the cached client bundle.

Confidence Score: 5/5

Safe to merge — fix is well-scoped, ordering is correct, and all four test scenarios pass.

No P0/P1 issues. The barrier's open-after-delete ordering guarantees waiting requests re-check a clean set. The minimalTestGateway pre-open, the finally clearTimeout, and the fallback UNAVAILABLE on timeout are all handled correctly. Tests cover the full matrix (wait-and-dispatch, timeout, non-gated bypass, no-barrier legacy).
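
The open-after-delete ordering mentioned here can be illustrated with a small sketch. Everything in it is a hypothetical stand-in apart from the STARTUP_UNAVAILABLE_GATEWAY_METHODS name and the two gated method names, which appear elsewhere in this thread. The helper names createDeferred, finishPostAttach and demo are assumptions.

```typescript
// Hypothetical stand-ins; only the ordering inside finishPostAttach is the point.
const STARTUP_UNAVAILABLE_GATEWAY_METHODS = ["chat.history", "models.list"] as const;

function createDeferred(): { promise: Promise<void>; open: () => void } {
  let open!: () => void;
  const promise = new Promise<void>((resolve) => {
    open = resolve;
  });
  return { promise, open };
}

function finishPostAttach(unavailable: Set<string>, barrier: { open: () => void }): void {
  // 1. Clear the gated methods first...
  for (const method of STARTUP_UNAVAILABLE_GATEWAY_METHODS) unavailable.delete(method);
  // 2. ...then wake the waiters, so their re-check sees a clean set.
  barrier.open();
}

async function demo(): Promise<boolean> {
  const unavailable = new Set<string>(STARTUP_UNAVAILABLE_GATEWAY_METHODS);
  const barrier = createDeferred();
  // A held request: resume when the barrier opens, then re-check the gated set.
  const stillGated = barrier.promise.then(() => unavailable.has("chat.history"));
  finishPostAttach(unavailable, barrier);
  return stillGated; // false: the method is no longer gated when the waiter resumes
}
```

In this fully synchronous sketch the reverse order would also pass, because promise continuations only run after the current synchronous code finishes. The ordering starts to matter as soon as an await sits between the two steps.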

No files require special attention.

Reviews (1): Last reviewed commit: "fix(gateway): hold startup-gated request..."

@clawsweeper
Contributor

clawsweeper Bot commented Apr 27, 2026

Codex review: found issues before merge.

Summary
The PR adds a Gateway startup barrier that holds startup-gated requests until post-attach sidecars finish, wires it through request context/startup, and adds focused Gateway tests.

Reproducibility: no. No high-confidence current-main end-to-end reproduction was run. Source inspection shows current main still fail-fasts startup-gated RPCs while sidecars are pending, but newer client and websocket startup retry work changes which clients can still surface the original banner.

Real behavior proof
Sufficient (logs): The PR body includes after-fix Gateway journal logs from a real restart-while-tab-open flow showing held requests completing after post-attach readiness.

Next step before merge
This needs maintainer review of the bounded server-side request-holding semantics plus a rebase over current Gateway startup contracts, not a standalone cleanup close or automatic replacement PR.

Security
Cleared: The diff is limited to Gateway startup request gating, request-context wiring, and tests; it does not change secrets, dependencies, CI, install scripts, or supply-chain sensitive paths.

Review findings

  • [P2] Preserve startup retry details on timeout — src/gateway/server-methods.ts:124
  • [P2] Carry forward the descriptor-derived startup method set — src/gateway/server-startup-unavailable-methods.ts:3
Review details

Best possible solution:

Finish a Gateway-owned bounded server-side barrier or equivalent queue on top of the descriptor-derived startup method set, preserving the canonical startup-sidecars retry shape on timeout and validating restart behavior with real Gateway logs.

Do we have a high-confidence way to reproduce the issue?

No high-confidence current-main end-to-end reproduction was run. Source inspection shows current main still fail-fasts startup-gated RPCs while sidecars are pending, but newer client and websocket startup retry work changes which clients can still surface the original banner.

Is this the best way to solve the issue?

No, not as currently written. A server-side bounded barrier may be the durable fix for cached or older clients, but this branch must rebase over the descriptor-derived method set and preserve canonical startup retry details before it is mergeable.

Full review comments:

  • [P2] Preserve startup retry details on timeout — src/gateway/server-methods.ts:124
    Current retry-aware clients classify startup failures by details.reason === "startup-sidecars". This timeout fallback returns only { method }, so after rebase a barrier timeout would become a terminal UNAVAILABLE for clients that currently retry startup-sidecar races.
    Confidence: 0.88
  • [P2] Carry forward the descriptor-derived startup method set — src/gateway/server-startup-unavailable-methods.ts:3
    Current main derives startup-gated methods from core descriptors, but this branch keeps the old chat.history/models.list literal list. Rebasing this as-is would leave newer startup-protected RPCs outside the barrier and timeout fallback behavior.
    Confidence: 0.84
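
To make the first finding concrete, here is a rough sketch of a timeout fallback that keeps the startup reason in its details. The GatewayErrorPayload shape and both function names are hypothetical. Only the "startup-sidecars" reason, the { method } detail, and the retryable/retryAfterMs fields are taken from the review comments in this thread.

```typescript
// Hypothetical error shape; field names follow the review text, not the actual source.
interface GatewayErrorPayload {
  code: "UNAVAILABLE";
  retryable: boolean;
  retryAfterMs: number;
  details: { method: string; reason?: string };
}

// What the barrier-timeout fallback could return (assumed helper name).
function startupTimeoutError(method: string): GatewayErrorPayload {
  return {
    code: "UNAVAILABLE",
    retryable: true,
    retryAfterMs: 500,
    // Keeping the reason lets retry-aware clients recognise the startup race.
    details: { method, reason: "startup-sidecars" },
  };
}

// How a retry-aware client might classify the error (assumed logic).
function isStartupRace(err: GatewayErrorPayload): boolean {
  return err.code === "UNAVAILABLE" && err.details.reason === "startup-sidecars";
}
```

With details reduced to { method }, isStartupRace returns false, which is the terminal-UNAVAILABLE outcome the finding warns about.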

Overall correctness: patch is incorrect
Overall confidence: 0.87

Acceptance criteria:

  • node scripts/run-vitest.mjs src/gateway/server-methods.startup-gate.test.ts src/gateway/server-startup-post-attach.test.ts src/gateway/server-methods.control-plane-rate-limit.test.ts
  • node scripts/run-vitest.mjs src/gateway/client.test.ts src/tui/gateway-chat.test.ts ui/src/ui/gateway.node.test.ts ui/src/ui/controllers/chat.test.ts
  • node scripts/crabbox-wrapper.mjs run ... --shell -- "pnpm check:changed"

What I checked:

Likely related people:

  • lml2468: Merged PR #65365 ("fix(gateway): defer cron/heartbeat activation until sidecars ready (#65322)") introduced the startup-gated chat.history/models.list behavior and Control UI retry context this PR builds on. (role: introduced related behavior; confidence: high; commits: c7f6a670c36e, fb64c53bcf7f, 4fec8073b12b; files: src/gateway/server-methods.ts, src/gateway/server.impl.ts, src/gateway/server-startup-post-attach.ts)
  • scoootscooob: Authored the merged startup control-plane retry PR that current main now uses for descriptor-derived gated methods and canonical retry details. (role: recent adjacent contributor; confidence: high; commits: 17ae7de7deb3, 54cfea679b89, ccb847e46f1b; files: src/gateway/server-methods.ts, src/gateway/methods/core-descriptors.ts, src/gateway/server-methods.control-plane-rate-limit.test.ts)
  • shakkernerd: Authored the shipped Node Gateway client and TUI chat.history startup retry work that overlaps the user-visible startup failure mode but does not add a server-side hold. (role: recent adjacent contributor; confidence: high; commits: fd81edf805ab, d1fd6b8cb366, a164aacb8a3f; files: src/gateway/client.ts, src/gateway/client.test.ts, src/tui/gateway-chat.ts)
  • steipete: Current-main blame for the central startup gate and descriptor lines points at a recent broad Gateway refactor, and PR history also shows commits in the startup-gate merge path. (role: recent area contributor; confidence: medium; commits: 5aefc9dda47b, b9562f84f740, fb64c53bcf7f; files: src/gateway/server-methods.ts, src/gateway/methods/core-descriptors.ts, src/gateway/server-startup-post-attach.ts)

Remaining risk / open question:

  • The branch is stale against current main's descriptor-derived startup method list and canonical startup-sidecars retry details.
  • Holding startup-gated websocket requests for up to 20 seconds changes request lifetime during restart bursts, so Gateway owners should explicitly accept the bounded queue semantics before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 532759e1ab2e.

@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs maintainer review before merge.

What this changes:

This PR adds a Gateway startup barrier so chat.history and models.list wait for post-attach sidecars before dispatching, falling back to retryable UNAVAILABLE on timeout, with request-context wiring and focused gateway tests.

Maintainer follow-up before merge:

This is an open implementation PR with a plausible targeted fix; the next action is maintainer review and likely rebase/behavior validation, not an automated replacement PR from ClawSweeper.

Review details

Best possible solution:

Keep a gateway-owned server-side barrier or equivalent queue for startup-gated methods if maintainers want cached or older clients to recover without relying on client retries. The finished version should preserve the existing retryable timeout fallback and include coverage for dispatch-after-sidecars, timeout fallback, non-gated bypass, and legacy no-barrier contexts.

Acceptance criteria:

  • pnpm test src/gateway/server-methods.startup-gate.test.ts src/gateway/server-startup-post-attach.test.ts src/gateway/server-methods.control-plane-rate-limit.test.ts
  • pnpm test src/gateway/client.test.ts src/tui/gateway-chat.test.ts
  • pnpm check:changed in Testbox before merge

What I checked:

  • Current main fail-fasts gated methods: handleGatewayRequest checks context.unavailableGatewayMethods and immediately responds with UNAVAILABLE, retryable: true, and retryAfterMs: 500; there is no wait, queue, or barrier path before returning. (src/gateway/server-methods.ts:121, acae48b790fa)
  • Startup window still exists: The Gateway creates unavailableGatewayMethods before attaching websocket handlers and listening, then starts post-attach runtime afterward, leaving connected requests able to hit the startup-gated set. (src/gateway/server.impl.ts:836, acae48b790fa)
  • Gate clears only after sidecars: startGatewayPostAttachRuntime starts sidecars and only then deletes STARTUP_UNAVAILABLE_GATEWAY_METHODS from the set, matching the PR's described post-attach window. (src/gateway/server-startup-post-attach.ts:642, acae48b790fa)
  • Existing test locks immediate block behavior: The current test named "blocks startup-gated methods before dispatch" asserts models.list does not call its handler and returns retryable UNAVAILABLE while gated, which is the opposite of the proposed server-held dispatch behavior. (src/gateway/server-methods.control-plane-rate-limit.test.ts:134, acae48b790fa)
  • No server barrier symbols on main: Search found no StartupGateBarrier, startupGateBarrier, STARTUP_GATE_WAIT_MS, or waitWithTimeout; server-startup-unavailable-methods.ts still only exports the gated method list. (src/gateway/server-startup-unavailable-methods.ts:1, acae48b790fa)
  • Related shipped work is narrower: Current TUI code retries chat.history on retryable startup UNAVAILABLE, and changelog entries mention TUI/status startup retries, but this does not provide a server-side hold for models.list or older cached Control UI bundles. (src/tui/gateway-chat.ts:201, acae48b790fa)

Likely related people:

  • lml2468: Merged #65365 ("fix(gateway): defer cron/heartbeat activation until sidecars ready (#65322)") introduced the startup-gated chat.history/models.list behavior and Control UI retry context that this PR is trying to harden server-side. (role: introduced behavior; confidence: high; commits: fb64c53bcf7f; files: src/gateway/server-methods.ts, src/gateway/server.impl.ts, src/gateway/server-startup-post-attach.ts)
  • shakkernerd: Merged #69164 ("fix: retry TUI chat history during startup") recently preserved Gateway retry metadata and added TUI chat.history startup retry coverage, which is the closest shipped follow-up in this area. (role: recent adjacent maintainer; confidence: high; commits: 626ea3770248; files: src/gateway/client.ts, src/tui/gateway-chat.ts, src/gateway/client.test.ts)

Remaining risk / open question:

  • The branch should be rebased and re-reviewed against newer Gateway startup retry work, especially the later TUI/status startup retry changes referenced in the changelog.
  • Holding startup-gated websocket requests open for up to the timeout changes request lifetime during restart bursts, so maintainers should validate bounded concurrency and reconnect behavior before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against acae48b790fa.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 12, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026
Labels

gateway Gateway runtime size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Development

Successfully merging this pull request may close these issues.

TUI: chat.history unavailable during gateway startup — race condition on reconnect

1 participant