Skip to content

fix: gateway stability — OOM, failover 404, hook crash, handshake timeout#51362

Open
adrianwedd wants to merge 5 commits intoopenclaw:mainfrom
adrianwedd:fix/gateway-stability
Open

fix: gateway stability — OOM, failover 404, hook crash, handshake timeout#51362
adrianwedd wants to merge 5 commits intoopenclaw:mainfrom
adrianwedd:fix/gateway-stability

Conversation

@adrianwedd
Copy link

Summary

  • Problem: Four high-impact gateway stability issues causing OOM crashes, broken failover chains, startup crashes on bad hooks, and CLI connection failures on loaded gateways.
  • Why it matters: These affect daily operator experience — the gateway crash-loops, CLI commands fail silently, and model failover hangs for 10 minutes instead of cascading.
  • What changed:
    • loadSessionStore gains a readOnly option to skip unnecessary structuredClone calls (3→1 clones per load), used by loadCombinedSessionStoreForGateway
    • resolveFailoverReasonFromError now maps HTTP 404 → model_not_found, enabling proper failover cascade
    • resolveHooksConfig() wrapped in try/catch at startup — bad hook config disables hooks instead of crashing the gateway
    • Handshake timeout bumped 10s→15s, configurable via OPENCLAW_HANDSHAKE_TIMEOUT_MS env var, clamped at 120s max
  • What did NOT change (scope boundary): No changes to session store write paths, no changes to hook execution logic, no changes to WS connection protocol. Voice-call/ElevenLabs STT changes are on a separate branch.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Gateway no longer OOMs when loading session stores for 10+ agents
  • Model failover now cascades on HTTP 404 (model not found) instead of retrying the same broken provider
  • Invalid hook config no longer crashes the gateway — logs a warning and disables hooks
  • CLI connections to loaded gateways succeed more reliably (15s default timeout, configurable via OPENCLAW_HANDSHAKE_TIMEOUT_MS)

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Darwin 25.3.0)
  • Runtime: Node 22.22.1
  • Model/provider: N/A (infrastructure fixes)

Steps

  1. Configure 10+ agents with accumulated session stores → gateway OOMs on startup
  2. Set primary model to a non-existent model ID → failover hangs instead of cascading
  3. Add hooks.enabled: true without hooks.token → gateway crashes on startup
  4. Run CLI commands against a loaded gateway (600MB+ state) → handshake timeout

Expected

  • Gateway starts without OOM
  • Failover cascades to next provider within seconds
  • Gateway starts with hooks disabled and a warning log
  • CLI connects successfully with the extended timeout

Actual (before fix)

  • OOM crash
  • 10-minute hang retrying same provider
  • Unhandled MODULE_NOT_FOUND crash
  • Silent 1000 close on CLI WebSocket

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

71 tests pass across 5 test files covering all 4 fixes. QA review by 3 specialized agents (code-reviewer, silent-failure-hunter, pr-test-analyzer) plus Codex gpt-5.4 review — all findings addressed.

Human Verification (required)

  • Verified scenarios: All 4 fixes tested via Vitest with targeted unit tests
  • Edge cases checked: readOnly cache mutation safety, 404 coercion through full pipeline, env var priority/clamping/invalid values, hook error fallback to null
  • What you did not verify: Full gateway restart under production load (unit tests only)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? Yes — new optional env var OPENCLAW_HANDSHAKE_TIMEOUT_MS
  • Migration needed? No
  • Default handshake timeout changed from 10s to 15s (strictly more permissive)

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert individual commits; each fix is independent
  • Files/config to restore: None — no config migration
  • Known bad symptoms reviewers should watch for: If readOnly cache sharing causes stale data, sessions would show incorrect metadata; if 404 failover is too aggressive, operators may not notice misconfigured primary models

Risks and Mitigations

  • Risk: readOnly returns mutable reference to cache — future callers could corrupt cache
    • Mitigation: Only used in loadCombinedSessionStoreForGateway which creates new objects via spread; TypeScript types guide correct usage
  • Risk: HTTP 404 mapped to failover could mask endpoint misconfiguration
    • Mitigation: Failover attempts are logged; if all candidates fail, aggregated error surfaces to operator

🤖 Generated with Claude Code — AI-assisted, fully tested, QA reviewed by Codex gpt-5.4

…event OOM

Add readOnly option to loadSessionStore that skips the return-path clone.
loadCombinedSessionStoreForGateway now uses readOnly since it builds
new objects via spread and never mutates source entries.

Reduces peak memory from ~4x to ~2x per agent store load.

Closes openclaw#51264
resolveFailoverReasonFromError now returns 'model_not_found' for 404
status codes, enabling the fallback chain to cascade to the next
provider instead of retrying or throwing.

Closes openclaw#51209
…on startup

Wrap resolveHooksConfig() in try/catch so invalid hook transform paths
log an error and disable hooks instead of crashing the gateway process.

Closes openclaw#51266
…AKE_TIMEOUT_MS

Bump default from 10s to 15s and allow override via env var for loaded
gateways where the CLI handshake exceeds the default budget.

Closes openclaw#51274
… timeout upper bound

- server-runtime-config: switch console.error to console.warn to match
  the structured logging pattern in server-reload-handlers.ts
- server-constants: add MAX_HANDSHAKE_TIMEOUT_MS (120s) upper bound to
  prevent accidental no-op timeouts from absurd env var values
- Add tests for timeout clamping and env var priority
@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime agents Agent runtime and tooling size: M labels Mar 21, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR applies four targeted gateway stability fixes — OOM reduction via readOnly session store reads, HTTP 404 → model_not_found failover mapping, graceful hook-config error recovery, and a configurable handshake timeout increase. Each change is independent, well-scoped, and backed by new unit tests.

Key changes:

  • loadSessionStore gains a readOnly option that skips the return-path structuredClone, reducing allocations from 3 to 1 clone per loadCombinedSessionStoreForGateway call. The optimization is safe for the current caller (entries are spread, never mutated) but the return type remains Record<string, SessionEntry>, providing no compiler-level guard against accidental mutations by future callers.
  • resolveFailoverReasonFromError now maps HTTP 404 to model_not_found, enabling proper failover cascade instead of a 10-minute retry hang.
  • resolveHooksConfig is wrapped in try/catch at startup; bad hook config logs a warning and disables hooks rather than crashing the gateway.
  • Default handshake timeout is raised from 10 s to 15 s, configurable via OPENCLAW_HANDSHAKE_TIMEOUT_MS (clamped at 120 s). The getHandshakeTimeoutMs function correctly handles invalid strings, zero/negative values, and the Vitest test-override env var.

Confidence Score: 4/5

  • Safe to merge; all four fixes are well-contained with unit-test coverage, and the only open gap is a non-critical type-safety improvement for the readOnly option.
  • All four bugs are correctly addressed with minimal blast radius — no write paths, no protocol changes, no new network calls. The readOnly cache optimization is sound for the current usage but exposes a footgun for future callers because the return type doesn't reflect immutability. No critical logic or security issues found.
  • src/config/sessions/store.ts — the readOnly return type should be strengthened to Readonly<Record<string, Readonly<SessionEntry>>> when readOnly: true to prevent accidental cache corruption by future callers.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 190-207

Comment:
**`readOnly` not reflected in return type**

The `readOnly` flag signals that the caller promises not to mutate the returned store, but the return type is still `Record<string, SessionEntry>` — TypeScript won't catch a future caller who accidentally writes to the result. When `readOnly: true` and the cache is hot, `cached.store` is returned directly, so any mutation would silently corrupt the cache.

Consider using overloads so the compiler enforces the contract:

```ts
export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
  storePath: string,
  opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
  // ...existing implementation unchanged...
}
```

This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix(gateway): addres..."

Comment on lines 190 to +207
type LoadSessionStoreOptions = {
skipCache?: boolean;
/** Skip the return-path structuredClone — safe when the caller never mutates the result. */
readOnly?: boolean;
};

export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> {
// Check cache first if enabled
if (!opts.skipCache && isSessionStoreCacheEnabled()) {
const cached = SESSION_STORE_CACHE.get(storePath);
if (cached && isSessionStoreCacheValid(cached)) {
const currentMtimeMs = getFileMtimeMs(storePath);
if (currentMtimeMs === cached.mtimeMs) {
// Return a deep copy to prevent external mutations affecting cache
return structuredClone(cached.store);
return opts.readOnly ? cached.store : structuredClone(cached.store);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 readOnly not reflected in return type

The readOnly flag signals that the caller promises not to mutate the returned store, but the return type is still Record<string, SessionEntry> — TypeScript won't catch a future caller who accidentally writes to the result. When readOnly: true and the cache is hot, cached.store is returned directly, so any mutation would silently corrupt the cache.

Consider using overloads so the compiler enforces the contract:

export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
  storePath: string,
  opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
  // ...existing implementation unchanged...
}

This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 190-207

Comment:
**`readOnly` not reflected in return type**

The `readOnly` flag signals that the caller promises not to mutate the returned store, but the return type is still `Record<string, SessionEntry>` — TypeScript won't catch a future caller who accidentally writes to the result. When `readOnly: true` and the cache is hot, `cached.store` is returned directly, so any mutation would silently corrupt the cache.

Consider using overloads so the compiler enforces the contract:

```ts
export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
  storePath: string,
  opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
  storePath: string,
  opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
  // ...existing implementation unchanged...
}
```

This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.

How can I resolve this? If you propose a fix, please make it concise.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d016f13e7a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +114 to +118
try {
hooksConfig = resolveHooksConfig(params.cfg);
} catch (err) {
const message = err instanceof Error ? err.message : String(err);
// Match the structured logging pattern used by the hot-reload path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-throw startup hook config errors after logging

Swallowing every resolveHooksConfig() exception here leaves startup in a half-enabled state instead of failing fast. If hooks are enabled but invalid for reasons other than a missing token—e.g. hooks.path: "/" or bad session-prefix validation—runtimeConfig.hooksConfig becomes null, so server-http.ts rejects every hook request, but server-startup.ts still starts hook-adjacent services like startGmailWatcherWithLogs() from the raw config. In that scenario the gateway boots, Gmail watch registration can succeed, and inbound hook deliveries are silently dropped because no hook handler is mounted.

Useful? React with 👍 / 👎.

Comment on lines +163 to +164
if (status === 404) {
return "model_not_found";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid mapping every 404 response to model_not_found

A bare HTTP 404 is not specific enough to mean “missing model” in this codebase. src/agents/model-compat.ts already documents endpoint/config mistakes that return 404s (for example Anthropic .../v1/v1/messages), but this branch now coerces any 404 into model_not_found. runWithModelFallback() will then treat the error as retryable and advance to other candidates, and the embedded runner records it as a non-timeout auth-profile failure (src/agents/pi-embedded-runner/run.ts / src/agents/auth-profiles/usage.ts), so a bad base URL or route typo gets masked as a missing model and can incorrectly cool down healthy credentials.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime size: M

Projects

None yet

1 participant