fix(gateway): allow cooldown probe for timeout failover reason#64130
fix(gateway): allow cooldown probe for timeout failover reason#64130openperf wants to merge 1 commit into
Conversation
Greptile SummaryThis PR fixes a circuit-breaker stall where a Confidence Score: 5/5Safe to merge — minimal, well-scoped fix with aligned unit and integration test coverage and no unintended behavioral changes. All changes are additive and consistent with the existing policy pattern. The asymmetry between shouldAllowCooldownProbeForReason (includes billing) and shouldUseTransientCooldownProbeSlot (excludes billing) is pre-existing and intentional, not introduced here. The deliberate exclusion of timeout from shouldPreserveTransientCooldownProbeSlot is correctly justified. No P0 or P1 findings. No files require special attention. Reviews (1): Last reviewed commit: "fix(gateway): allow cooldown probe for t..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 01fc1fec5d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
01fc1fe to
1258d97
Compare
Summary
ETIMEDOUT,ECONNRESET, or transient 5xx errors classified astimeout), the circuit breaker enters a cooldown state but never triggers a cooldown probe. This causes the gateway to permanently stick to the fallback model even after the network recovers, as reported in Circuit Breaker cooldown probe does not fire for 'timeout' reason — no auto-recovery after network outage #63996.src/agents/failover-policy.ts, the functionsshouldAllowCooldownProbeForReasonandshouldUseTransientCooldownProbeSlotexplicitly whitelist reasons likerate_limit,overloaded, andunknown, but omittimeout. As a result, whenresolveCooldownDecisionschedules a probe attempt for a timeout, the fallback runner silently drops theallowTransientCooldownProbeflag, resulting in a "ghost probe" that never actually fires."timeout"to the allowed reasons in bothshouldAllowCooldownProbeForReasonandshouldUseTransientCooldownProbeSlot. This ensures that timeout errors are treated as transient network issues (similar torate_limitandoverloaded), allowing the circuit breaker to periodically probe the primary model and recover automatically.src/agents/failover-policy.ts: Added"timeout"to allowed probe reasons.src/agents/failover-policy.test.ts: Updated expected test assertions for"timeout".src/agents/model-fallback.probe.test.ts: Added an integration testattempts non-primary fallbacks during timeout cooldown after primary probe failureto prevent future regressions.resolveCooldownDecision) and the timeout classification logic (classifyFailoverReasonFromCode) remain untouched.shouldAttemptDespiteCooldowninmodel-fallback.ts(line 611–616) does not include"timeout"for non-primary candidates. This means same-provider sibling models still skip timeout-cooldowned profiles during fallback. This is an independent, lower-priority issue — the current fix targets primary model probe recovery only, which is the scenario reported in Circuit Breaker cooldown probe does not fire for 'timeout' reason — no auto-recovery after network outage #63996.shouldPreserveTransientCooldownProbeSlotintentionally excludes"timeout"— a failed timeout probe should consume its slot to prevent repeated probing against an unreachable network. This is correct behavior and not a bug.Reproduction
ETIMEDOUT) for the primary model.Risk / Mitigation
model-fallback.tsremains fully active. The new integration test inmodel-fallback.probe.test.tsverifies that timeout probes correctly respect the transient cooldown slot and fallback chain without introducing infinite loops.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #63996