fix(gateway): hold startup-gated requests at server until post-attach (closes #67160) #68146
sparkeros wants to merge 1 commit into
The startup gate from openclaw#65365 responds UNAVAILABLE to chat.history and models.list during the 8-15s window between [gateway] ready and sidecar registration. The UI retry loop that landed in the same commit is ineffective when the browser tab already has a cached pre-fix bundle, which is the case every time openclaw-update restarts the gateway.

Add a StartupGateBarrier (deferred + timeout) threaded through the gateway request context. Dispatch now awaits the barrier with a 20s timeout instead of failing fast, which removes the race regardless of the UI bundle. The existing retryable/retryAfterMs UNAVAILABLE response is preserved as a fallback for timeouts and for legacy contexts without a barrier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
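The "deferred + timeout" barrier described in the commit message could look roughly like the following. This is a minimal sketch and not the PR's actual code: only the names StartupGateBarrier and STARTUP_GATE_WAIT_MS come from the PR text, while the open(), wait() and isOpen members are hypothetical.

```typescript
// Hypothetical sketch of a deferred-plus-timeout barrier.
// Only StartupGateBarrier and STARTUP_GATE_WAIT_MS are names from the PR;
// the members below are assumptions made for illustration.
const STARTUP_GATE_WAIT_MS = 20_000;

class StartupGateBarrier {
  private opened = false;
  private resolveOpen!: () => void;
  private readonly openPromise: Promise<void>;

  constructor() {
    // The executor runs synchronously, so resolveOpen is set before use.
    this.openPromise = new Promise<void>((resolve) => {
      this.resolveOpen = resolve;
    });
  }

  get isOpen(): boolean {
    return this.opened;
  }

  /** Releases every current and future waiter. Safe to call more than once. */
  open(): void {
    if (this.opened) return;
    this.opened = true;
    this.resolveOpen();
  }

  /** Resolves true if the barrier opened, false if the wait timed out. */
  async wait(timeoutMs: number = STARTUP_GATE_WAIT_MS): Promise<boolean> {
    if (this.opened) return true;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timedOut = new Promise<false>((resolve) => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });
    try {
      return await Promise.race([
        this.openPromise.then(() => true as const),
        timedOut,
      ]);
    } finally {
      // Always clear the timer so an early open does not leave it pending.
      if (timer !== undefined) clearTimeout(timer);
    }
  }
}
```

A caller would treat a false result from wait() as the signal to return the existing retryable UNAVAILABLE response. A pre-opened barrier (for example in a minimal test gateway) resolves immediately without arming a timer.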
Greptile Summary
This PR fixes a race where the gateway startup gate immediately returned
Confidence Score: 5/5
Safe to merge: the fix is well-scoped, ordering is correct, and all four test scenarios pass.
No P0/P1 issues. The barrier's open-after-delete ordering guarantees waiting requests re-check a clean set. The minimalTestGateway pre-open, the finally clearTimeout, and the fallback UNAVAILABLE on timeout are all handled correctly. Tests cover the full matrix (wait-and-dispatch, timeout, non-gated bypass, no-barrier legacy).
No files require special attention.
Reviews (1): Last reviewed commit: "fix(gateway): hold startup-gated request..."
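The open-after-delete ordering and the four covered paths can be illustrated with a small sketch. It is written under stated assumptions: unavailableGatewayMethods, retryable and retryAfterMs are names from the PR, while GateContext, passStartupGate, finishPostAttach, the code field and the 500 ms retry hint are hypothetical.

```typescript
// Hypothetical sketch of the gate check; not the PR's actual dispatch code.
type GateResult =
  | { ok: true }
  | { ok: false; code: "UNAVAILABLE"; retryable: true; retryAfterMs: number };

interface GateContext {
  unavailableGatewayMethods: Set<string>;
  // Resolves true when the barrier opens, false on timeout. Optional so
  // legacy contexts without a barrier keep the fail-fast behaviour.
  waitForStartupGate?: () => Promise<boolean>;
}

async function passStartupGate(
  ctx: GateContext,
  method: string,
): Promise<GateResult> {
  // Non-gated methods bypass the barrier entirely.
  if (!ctx.unavailableGatewayMethods.has(method)) return { ok: true };

  if (ctx.waitForStartupGate) {
    await ctx.waitForStartupGate();
    // Re-check after waking. This is only reliable because post-attach
    // removes the method from the set before it opens the barrier.
    if (!ctx.unavailableGatewayMethods.has(method)) return { ok: true };
  }

  // Timeout or legacy context: fall back to the retryable response shape.
  // The 500 ms value is an assumed placeholder.
  return { ok: false, code: "UNAVAILABLE", retryable: true, retryAfterMs: 500 };
}

function finishPostAttach(ctx: GateContext, openBarrier: () => void): void {
  ctx.unavailableGatewayMethods.clear(); // 1. un-gate the methods
  openBarrier(); // 2. only then release the waiters
}
```

If the two steps in finishPostAttach were swapped, a waiter could wake, re-check a set that still lists its method, and return UNAVAILABLE even though the sidecars had registered.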
Codex review: found issues before merge.
Summary
Reproducibility: no. No high-confidence current-main end-to-end reproduction was run. Source inspection shows current main still fails fast on startup-gated RPCs while sidecars are pending, but newer client and websocket startup retry work changes which clients can still surface the original banner.
Real behavior proof
Next step before merge
Security
Review findings
Review details
Best possible solution: Finish a Gateway-owned bounded server-side barrier or equivalent queue on top of the descriptor-derived startup method set, preserving the canonical startup-sidecars retry shape on timeout and validating restart behavior with real Gateway logs.
Do we have a high-confidence way to reproduce the issue? No high-confidence current-main end-to-end reproduction was run. Source inspection shows current main still fails fast on startup-gated RPCs while sidecars are pending, but newer client and websocket startup retry work changes which clients can still surface the original banner.
Is this the best way to solve the issue? No, not as currently written. A server-side bounded barrier may be the durable fix for cached or older clients, but this branch must rebase over the descriptor-derived method set and preserve canonical startup retry details before it is mergeable.
Full review comments:
Overall correctness: patch is incorrect
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 532759e1ab2e.
Codex review: needs maintainer review before merge.
What this changes: This PR adds a Gateway startup barrier so
Maintainer follow-up before merge: This is an open implementation PR with a plausible targeted fix; the next action is maintainer review and likely rebase/behavior validation, not an automated replacement PR from ClawSweeper.
Review details
Best possible solution: Keep a gateway-owned server-side barrier or equivalent queue for startup-gated methods if maintainers want cached or older clients to recover without relying on client retries. The finished version should preserve the existing retryable timeout fallback and include coverage for dispatch-after-sidecars, timeout fallback, non-gated bypass, and legacy no-barrier contexts.
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against acae48b790fa.
Summary
The startup gate responds UNAVAILABLE to chat.history and models.list for the 8–15 s window between [gateway] ready and post-attach sidecar registration. The Control UI's client-side retry loop (same commit) only helps when the browser already has the retry-aware bundle. Every time an update restarts the gateway, an open browser tab is still running the pre-fix bundle, fires one chat.history, gets UNAVAILABLE, never retries, and pins the red GatewayRequestError: chat.history unavailable during gateway startup banner.
openclaw-update hits this for any user with a chat tab open, which is most of the time.
Add a StartupGateBarrier (deferred + 20 s timeout) threaded through the request context. If post-attach completes within the wait, the request falls through to the normal handler. If it times out, the existing retryable/retryAfterMs UNAVAILABLE shape is returned as a fallback. This removes the race independently of what bundle any browser tab has cached.
The ui/src/ui/controllers/chat.ts retry and ui/src/ui/controllers/models.ts silent-swallow both stay as belt-and-braces fallbacks. The UI-side retry in PR #67951 (fix(ui): retry chat.history during gateway startup without retryable) is complementary if it later merges.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause
Failing fast (UNAVAILABLE at 0 ms) makes the race unrecoverable from the client side when the browser-cached UI bundle predates the retry fix in #65365 (fix(gateway): defer cron/heartbeat activation until sidecars ready (#65322)). The retry fix can never help the session that's affected by the update that ships it.
Regression Test Plan
src/gateway/server-methods.startup-gate.test.ts (unit) + extended assertions in src/gateway/server-startup-post-attach.test.ts (seam).
UNAVAILABLE only if the barrier does not open within STARTUP_GATE_WAIT_MS (20 s).
handleGatewayRequest; the sidecar-boot seam is exercised via the existing startGatewayPostAttachRuntime harness.
server-startup-post-attach.test.ts already covered clearing of unavailableGatewayMethods; this PR extends it to also assert the barrier opens at the same point.
Before / after evidence
Before (pre-fix journal from reporter, journalctl --user -u openclaw-gateway):
Note: only one request per method and no retry attempts, because the cached UI bundle lacked the retry code. The red banner stays until the user reloads.
After (this PR, same box, identical restart-while-tab-open flow):
The gated requests are held on the server for ~11.2 s (exactly the post-attach window) and then dispatched to the real handler. Zero ✗ UNAVAILABLE rows. No banner.
AI disclosure
pnpm build and pnpm check are clean; pnpm test:changed against upstream/main is green apart from one pre-existing flake in server.chat.gateway-server-chat.test.ts → "agent.wait keeps lifecycle wait active while same-runId chat.send is active", which reproduces on pristine upstream HEAD and is left untouched per CONTRIBUTING.md L119.