fix: gateway stability — OOM, failover 404, hook crash, handshake timeout#51362
fix: gateway stability — OOM, failover 404, hook crash, handshake timeout#51362adrianwedd wants to merge 5 commits intoopenclaw:mainfrom
Conversation
…event OOM Add readOnly option to loadSessionStore that skips the return-path clone. loadCombinedSessionStoreForGateway now uses readOnly since it builds new objects via spread and never mutates source entries. Reduces peak memory from ~4x to ~2x per agent store load. Closes openclaw#51264
resolveFailoverReasonFromError now returns 'model_not_found' for 404 status codes, enabling the fallback chain to cascade to the next provider instead of retrying or throwing. Closes openclaw#51209
…on startup Wrap resolveHooksConfig() in try/catch so invalid hook transform paths log an error and disable hooks instead of crashing the gateway process. Closes openclaw#51266
…AKE_TIMEOUT_MS Bump default from 10s to 15s and allow override via env var for loaded gateways where the CLI handshake exceeds the default budget. Closes openclaw#51274
… timeout upper bound - server-runtime-config: switch console.error to console.warn to match the structured logging pattern in server-reload-handlers.ts - server-constants: add MAX_HANDSHAKE_TIMEOUT_MS (120s) upper bound to prevent accidental no-op timeouts from absurd env var values - Add tests for timeout clamping and env var priority
Greptile SummaryThis PR applies four targeted gateway stability fixes — OOM reduction via Key changes:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 190-207
Comment:
**`readOnly` not reflected in return type**
The `readOnly` flag signals that the caller promises not to mutate the returned store, but the return type is still `Record<string, SessionEntry>` — TypeScript won't catch a future caller who accidentally writes to the result. When `readOnly: true` and the cache is hot, `cached.store` is returned directly, so any mutation would silently corrupt the cache.
Consider using overloads so the compiler enforces the contract:
```ts
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
storePath: string,
opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
// ...existing implementation unchanged...
}
```
This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "fix(gateway): addres..." |
| type LoadSessionStoreOptions = { | ||
| skipCache?: boolean; | ||
| /** Skip the return-path structuredClone — safe when the caller never mutates the result. */ | ||
| readOnly?: boolean; | ||
| }; | ||
|
|
||
| export function loadSessionStore( | ||
| storePath: string, | ||
| opts: LoadSessionStoreOptions = {}, | ||
| ): Record<string, SessionEntry> { | ||
| // Check cache first if enabled | ||
| if (!opts.skipCache && isSessionStoreCacheEnabled()) { | ||
| const cached = SESSION_STORE_CACHE.get(storePath); | ||
| if (cached && isSessionStoreCacheValid(cached)) { | ||
| const currentMtimeMs = getFileMtimeMs(storePath); | ||
| if (currentMtimeMs === cached.mtimeMs) { | ||
| // Return a deep copy to prevent external mutations affecting cache | ||
| return structuredClone(cached.store); | ||
| return opts.readOnly ? cached.store : structuredClone(cached.store); |
There was a problem hiding this comment.
readOnly not reflected in return type
The readOnly flag signals that the caller promises not to mutate the returned store, but the return type is still Record<string, SessionEntry> — TypeScript won't catch a future caller who accidentally writes to the result. When readOnly: true and the cache is hot, cached.store is returned directly, so any mutation would silently corrupt the cache.
Consider using overloads so the compiler enforces the contract:
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
storePath: string,
opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
// ...existing implementation unchanged...
}This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 190-207
Comment:
**`readOnly` not reflected in return type**
The `readOnly` flag signals that the caller promises not to mutate the returned store, but the return type is still `Record<string, SessionEntry>` — TypeScript won't catch a future caller who accidentally writes to the result. When `readOnly: true` and the cache is hot, `cached.store` is returned directly, so any mutation would silently corrupt the cache.
Consider using overloads so the compiler enforces the contract:
```ts
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions & { readOnly: true },
): Readonly<Record<string, Readonly<SessionEntry>>>;
export function loadSessionStore(
storePath: string,
opts?: LoadSessionStoreOptions,
): Record<string, SessionEntry>;
export function loadSessionStore(
storePath: string,
opts: LoadSessionStoreOptions = {},
): Record<string, SessionEntry> | Readonly<Record<string, Readonly<SessionEntry>>> {
// ...existing implementation unchanged...
}
```
This keeps the optimization intact while making the immutability contract visible to every future caller — and would have caught any accidental write at compile time rather than as a runtime cache-corruption bug.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d016f13e7a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| try { | ||
| hooksConfig = resolveHooksConfig(params.cfg); | ||
| } catch (err) { | ||
| const message = err instanceof Error ? err.message : String(err); | ||
| // Match the structured logging pattern used by the hot-reload path |
There was a problem hiding this comment.
Re-throw startup hook config errors after logging
Swallowing every resolveHooksConfig() exception here leaves startup in a half-enabled state instead of failing fast. If hooks are enabled but invalid for reasons other than a missing token—e.g. hooks.path: "/" or bad session-prefix validation—runtimeConfig.hooksConfig becomes null, so server-http.ts rejects every hook request, but server-startup.ts still starts hook-adjacent services like startGmailWatcherWithLogs() from the raw config. In that scenario the gateway boots, Gmail watch registration can succeed, and inbound hook deliveries are silently dropped because no hook handler is mounted.
Useful? React with 👍 / 👎.
| if (status === 404) { | ||
| return "model_not_found"; |
There was a problem hiding this comment.
Avoid mapping every 404 response to model_not_found
A bare HTTP 404 is not specific enough to mean “missing model” in this codebase. src/agents/model-compat.ts already documents endpoint/config mistakes that return 404s (for example Anthropic .../v1/v1/messages), but this branch now coerces any 404 into model_not_found. runWithModelFallback() will then treat the error as retryable and advance to other candidates, and the embedded runner records it as a non-timeout auth-profile failure (src/agents/pi-embedded-runner/run.ts / src/agents/auth-profiles/usage.ts), so a bad base URL or route typo gets masked as a missing model and can incorrectly cool down healthy credentials.
Useful? React with 👍 / 👎.
Summary
loadSessionStoregains areadOnlyoption to skip unnecessarystructuredClonecalls (3→1 clones per load), used byloadCombinedSessionStoreForGatewayresolveFailoverReasonFromErrornow maps HTTP 404 →model_not_found, enabling proper failover cascaderesolveHooksConfig()wrapped in try/catch at startup — bad hook config disables hooks instead of crashing the gatewayOPENCLAW_HANDSHAKE_TIMEOUT_MSenv var, clamped at 120s maxChange Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
OPENCLAW_HANDSHAKE_TIMEOUT_MS)Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
Steps
hooks.enabled: truewithouthooks.token→ gateway crashes on startupExpected
Actual (before fix)
Evidence
71 tests pass across 5 test files covering all 4 fixes. QA review by 3 specialized agents (code-reviewer, silent-failure-hunter, pr-test-analyzer) plus Codex gpt-5.4 review — all findings addressed.
Human Verification (required)
Compatibility / Migration
YesYes— new optional env varOPENCLAW_HANDSHAKE_TIMEOUT_MSNoFailure Recovery (if this breaks)
readOnlycache sharing causes stale data, sessions would show incorrect metadata; if 404 failover is too aggressive, operators may not notice misconfigured primary modelsRisks and Mitigations
readOnlyreturns mutable reference to cache — future callers could corrupt cacheloadCombinedSessionStoreForGatewaywhich creates new objects via spread; TypeScript types guide correct usage🤖 Generated with Claude Code — AI-assisted, fully tested, QA reviewed by Codex gpt-5.4