
fix(gateway): allow cooldown probe for timeout failover reason#64130

Closed
openperf wants to merge 1 commit into openclaw:main from openperf:fix/timeout-cooldown-probe

Conversation

@openperf
Member

Summary

  • Problem: When the primary model fails due to a network timeout (e.g., ETIMEDOUT, ECONNRESET, or transient 5xx errors classified as timeout), the circuit breaker enters a cooldown state but never triggers a cooldown probe. The gateway then stays on the fallback model indefinitely, even after the network recovers, as reported in #63996.
  • Root Cause: In src/agents/failover-policy.ts, the functions shouldAllowCooldownProbeForReason and shouldUseTransientCooldownProbeSlot explicitly whitelist reasons like rate_limit, overloaded, and unknown, but omit timeout. As a result, when resolveCooldownDecision schedules a probe attempt for a timeout, the fallback runner silently drops the allowTransientCooldownProbe flag, resulting in a "ghost probe" that never actually fires.
  • Fix: Added "timeout" to the allowed reasons in both shouldAllowCooldownProbeForReason and shouldUseTransientCooldownProbeSlot. This ensures that timeout errors are treated as transient network issues (similar to rate_limit and overloaded), allowing the circuit breaker to periodically probe the primary model and recover automatically.
  • What changed:
    • src/agents/failover-policy.ts: Added "timeout" to allowed probe reasons.
    • src/agents/failover-policy.test.ts: Updated expected test assertions for "timeout".
    • src/agents/model-fallback.probe.test.ts: Added an integration test, "attempts non-primary fallbacks during timeout cooldown after primary probe failure", to prevent future regressions.
  • What did NOT change (scope boundary):
    • The core circuit breaker logic (resolveCooldownDecision) and the timeout classification logic (classifyFailoverReasonFromCode) remain untouched.
    • shouldAttemptDespiteCooldown in model-fallback.ts (lines 611–616) does not include "timeout" for non-primary candidates, so same-provider sibling models still skip timeout-cooldowned profiles during fallback. This is an independent, lower-priority issue; the current fix targets primary model probe recovery only, which is the scenario reported in #63996.
    • shouldPreserveTransientCooldownProbeSlot intentionally excludes "timeout" — a failed timeout probe should consume its slot to prevent repeated probing against an unreachable network. This is correct behavior and not a bug.
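The policy change described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of src/agents/failover-policy.ts: the FailoverReason type, the set-based implementation, the function signatures, and the "auth" reason are assumptions made for the example. Only the function names and the listed reasons (rate_limit, overloaded, unknown, billing, timeout) come from this conversation.

```typescript
// Hypothetical reason type; "auth" is an assumed example of a reason
// that is not eligible for cooldown probing.
type FailoverReason =
  | "rate_limit"
  | "overloaded"
  | "unknown"
  | "billing"
  | "timeout"
  | "auth";

// Reasons for which a cooldown probe against the primary is allowed.
// "timeout" is the entry this PR adds.
const PROBE_ALLOWED: ReadonlySet<FailoverReason> = new Set<FailoverReason>([
  "rate_limit",
  "overloaded",
  "unknown",
  "billing",
  "timeout",
]);

// Reasons that use the transient cooldown probe slot.
// "billing" is excluded here, as noted in the review; "timeout" is added.
const TRANSIENT_SLOT: ReadonlySet<FailoverReason> = new Set<FailoverReason>([
  "rate_limit",
  "overloaded",
  "unknown",
  "timeout",
]);

function shouldAllowCooldownProbeForReason(
  reason: FailoverReason | null | undefined,
): boolean {
  return reason != null && PROBE_ALLOWED.has(reason);
}

function shouldUseTransientCooldownProbeSlot(
  reason: FailoverReason | null | undefined,
): boolean {
  return reason != null && TRANSIENT_SLOT.has(reason);
}
```

With both predicates returning true for "timeout", a probe scheduled by resolveCooldownDecision would no longer be dropped by the fallback runner, assuming the runner consults these two functions as the root-cause description suggests.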

Reproduction

  1. Configure a primary model (e.g., OpenAI) and a fallback model (e.g., Anthropic).
  2. Disconnect the network or simulate a timeout (ETIMEDOUT) for the primary model.
  3. Observe the gateway failing over to the fallback model.
  4. Reconnect the network.
  5. Before: The gateway never probes the primary model and stays on the fallback indefinitely.
  6. After: The gateway periodically probes the primary model and successfully switches back once the network is restored.
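The before/after behavior in these steps can be modeled with a small simulation. This is a simplified sketch under stated assumptions: routeRequest, GatewayState, and simulate are hypothetical names, the model ignores probe throttling, and it treats every request during cooldown as a probe opportunity. It only illustrates why excluding "timeout" from the probe policy keeps the gateway on the fallback.

```typescript
type Reason = "timeout" | "rate_limit";
type Route = "primary" | "fallback";

// Hypothetical, minimal gateway state for the simulation.
interface GatewayState {
  primaryInCooldown: boolean;
  cooldownReason: Reason | null;
}

// Decide which model serves one request.
function routeRequest(
  state: GatewayState,
  primaryReachable: boolean,
  probeAllowed: (reason: Reason) => boolean,
): Route {
  if (!state.primaryInCooldown) {
    if (primaryReachable) return "primary";
    // Primary timed out: enter cooldown and fail over.
    state.primaryInCooldown = true;
    state.cooldownReason = "timeout";
    return "fallback";
  }
  // In cooldown: probe the primary only if policy allows it for this reason.
  if (
    state.cooldownReason !== null &&
    probeAllowed(state.cooldownReason) &&
    primaryReachable
  ) {
    state.primaryInCooldown = false;
    state.cooldownReason = null;
    return "primary";
  }
  return "fallback";
}

// Network timeline: up, down, down, up, up (one request per step).
function simulate(probeAllowed: (reason: Reason) => boolean): Route[] {
  const state: GatewayState = { primaryInCooldown: false, cooldownReason: null };
  return [true, false, false, true, true].map((up) =>
    routeRequest(state, up, probeAllowed),
  );
}

// Before the fix: "timeout" is not a probeable reason.
const before = simulate((r) => r === "rate_limit");
// After the fix: "timeout" is probeable.
const after = simulate((r) => r === "rate_limit" || r === "timeout");

console.log(before.join(",")); // primary,fallback,fallback,fallback,fallback
console.log(after.join(","));  // primary,fallback,fallback,primary,primary
```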

Risk / Mitigation

  • Risk: Probing too aggressively during a persistent network outage could waste resources or cause slight delays in fallback routing.
  • Mitigation: The existing probe throttling mechanism (e.g., 30s interval) in model-fallback.ts remains fully active. The new integration test in model-fallback.probe.test.ts verifies that timeout probes correctly respect the transient cooldown slot and fallback chain without introducing infinite loops.
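To illustrate the kind of throttling the mitigation relies on, here is a minimal interval-based sketch. The ProbeThrottle class and its tryAcquire method are hypothetical and are not taken from model-fallback.ts; the 30-second default mirrors the example interval mentioned above. The caller passes the current time in so the behavior is deterministic.

```typescript
// Hypothetical throttle: allows at most one probe per interval.
class ProbeThrottle {
  private lastProbeAtMs: number | null = null;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number = 30_000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns true and records the probe time if enough time has elapsed
  // since the last allowed probe; otherwise returns false.
  tryAcquire(nowMs: number): boolean {
    if (
      this.lastProbeAtMs !== null &&
      nowMs - this.lastProbeAtMs < this.minIntervalMs
    ) {
      return false;
    }
    this.lastProbeAtMs = nowMs;
    return true;
  }
}

const throttle = new ProbeThrottle();
const decisions = [0, 10_000, 30_000, 59_999].map((t) => throttle.tryAcquire(t));
console.log(decisions.join(",")); // true,false,true,false
```

Under this model, a persistent outage costs at most one probe attempt per interval, which is the bound the risk note is concerned with.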

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway

Linked Issue/PR

Fixes #63996

@openclaw-barnacle (bot) added the labels agents (Agent runtime and tooling), size: XS, and maintainer (Maintainer-authored PR) on Apr 10, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR fixes a circuit-breaker stall where a timeout failover reason caused the gateway to stay permanently on the fallback model. The fix adds "timeout" to shouldAllowCooldownProbeForReason and shouldUseTransientCooldownProbeSlot in failover-policy.ts, mirroring the existing treatment of rate_limit, overloaded, and unknown. The PR intentionally leaves shouldPreserveTransientCooldownProbeSlot unchanged (so a failed timeout probe consumes its slot) and leaves shouldAttemptDespiteCooldown for non-primary candidates untouched, both of which are correct per the PR's stated scope. Test coverage is well-aligned: existing unit-test assertions are updated and a new expectProbeFailureFallsBack integration case covers the timeout path end-to-end.

Confidence Score: 5/5

Safe to merge — minimal, well-scoped fix with aligned unit and integration test coverage and no unintended behavioral changes.

All changes are additive and consistent with the existing policy pattern. The asymmetry between shouldAllowCooldownProbeForReason (includes billing) and shouldUseTransientCooldownProbeSlot (excludes billing) is pre-existing and intentional, not introduced here. The deliberate exclusion of timeout from shouldPreserveTransientCooldownProbeSlot is correctly justified. No P0 or P1 findings.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(gateway): allow cooldown probe for t..."


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01fc1fec5d


Comment thread src/agents/model-fallback.probe.test.ts Outdated
@openperf openperf force-pushed the fix/timeout-cooldown-probe branch from 01fc1fe to 1258d97 Compare April 10, 2026 06:30
@openperf openperf closed this Apr 10, 2026

Labels

agents (Agent runtime and tooling), maintainer (Maintainer-authored PR), size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Circuit Breaker cooldown probe does not fire for 'timeout' reason — no auto-recovery after network outage

1 participant