fix: per-model cooldown scope, stepped backoff, and user-facing rate-limit message #49834
Conversation
**Greptile Summary**

This PR addresses three related cooldown pain points in the auth-profile system: a stepped backoff formula (30s → 1m → 5m), per-model cooldown scoping so a single 429 no longer blocks every model on a shared profile, and a structured fallback error that surfaces a user-facing countdown.
Confidence Score: 4/5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b5c264330
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Force-pushed 2b5c264 to 106d513.
src/agents/model-fallback.ts
Outdated
```ts
      allProfileIds.add(id);
    }
  }
  return getSoonestCooldownExpiry(authStore, [...allProfileIds]);
```
Recompute cooldown expiry from fresh store before summary
soonestCooldownExpiry is computed from the authStore snapshot captured before the fallback attempts run. In the embedded runner path, attempts can persist new cooldowns via markAuthProfileFailure, but this snapshot is never refreshed, so the summary countdown can be null/stale (for example after the first in-run 429), producing incorrect retry timing in the user-facing rate-limit message.
Acknowledged — this is a valid observation about the snapshot staleness.
However, this is existing architecture, not introduced by this PR. The authStore in model-fallback.ts is a clone created via ensureAuthProfileStore() → structuredClone() at the start of runWithModelFallback (line 555). The embedded runner in run.ts creates its own separate clone (line 423). Mutations from markAuthProfileFailure during attempts update the runner's clone and persist to disk via updateAuthProfileStoreWithLock, but the model-fallback.ts clone is never refreshed.
Impact is cosmetic only: The soonestCooldownExpiry computed at line 818 feeds into the user-facing error message countdown text (e.g. "retry in ~4m 30s"). Cooldown enforcement works correctly because each embedded runner call reads fresh state from disk via the lock-guarded store update. A stale countdown in the error message is a minor UX imperfection, not a correctness issue.
Fixing this properly would require refactoring the structuredClone snapshot pattern to either re-read the store after the attempt loop or use a shared mutable reference — both are broader architectural changes beyond this PR's scope. Happy to tackle that in a follow-up if maintainers prefer.
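The staleness is easy to reproduce in isolation (a minimal sketch, assuming only that the snapshot is a `structuredClone` taken before mutations, as described above; `Store` is an illustrative stand-in for the real auth-profile store shape):

```typescript
// Minimal reproduction of the snapshot-staleness pattern described above.
type Store = { cooldownUntil?: number };

const persisted: Store = {};
const snapshot: Store = structuredClone(persisted); // cloned before attempts run

// A mid-run failure records a cooldown on the persisted store...
persisted.cooldownUntil = Date.now() + 30_000;

// ...but the pre-attempt snapshot never observes it, so a countdown
// computed from `snapshot` is stale (here: no expiry at all).
console.log(snapshot.cooldownUntil); // undefined
```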
I found one concrete merge blocker and one follow-up nit. The blocker is that this branch changes the cooldown ladder, but `pnpm test -- src/agents/auth-profiles.markauthprofilefailure.test.ts` still fails against the old expected values.
Thanks for catching this, @altaywtf! You're right — the test suite was still asserting the old backoff values. Fixed in b2ccb3d.

Both test suites now pass. You also mentioned a follow-up nit — happy to address that if you could share the details!
Force-pushed 487deab to 0016af2.
**CI Failure Analysis — All Failures Pre-existing / Unrelated**

I've analyzed all 4 failing CI jobs on the latest run.
For reference, the same failures occur on the latest upstream `main`. All jobs that exercise code paths touched by this PR are passing.
Force-pushed 0016af2 to bd401f7.
**CI Failure Analysis — All Pre-Existing on `main`**
| Job | Root Cause | Our PR touches this? |
|---|---|---|
| check | TS errors in extensions/matrix/src/onboarding.ts, acp-spawn.test.ts, commands/channels/remove.ts, matrix-plugin-helper.test.ts | No — all in Matrix extension / channel commands |
| contracts | registry.contract.test.ts expects `['discord','feishu','matrix','telegram']` but gets `['discord','feishu','telegram']` (matrix missing) | No — channel contract registry |
| channels | Discord: `createAccountListHelpers is not a function`, `resolveThreadBindingIdleTimeoutMs is not a function`; WhatsApp: `updateLastRoute` setter mock failure | No — Discord/WhatsApp extension tests |
| install-smoke | Cannot find module '@vector-im/matrix-bot-sdk/package.json' | No — Matrix extension install (also fails on main c4a4050ce4) |
| windows shards | Same contracts + channel test failures as above | No |
**Our PR's failures are a strict subset of main's**

- Main (009a10bce2): 11 failing jobs (check, bun-test, channels, contracts, compat-node22, release-check, windows shards 1/3/4/5/6)
- Our PR (6d58d2f381): 7 failing jobs (check, channels, contracts, windows shards 1/4/5/6) — all present in main's failure list
- Our PR passes jobs that main fails: `bun test`, `compat-node22`, `release-check`
**Timeline**

- `main` was green at 0443ee82be (run #23284150176)
- f3097b4c09 (refactor: install optional channels for remove) introduced these test failures
- c4a4050ce4 (fix(macos): align exec command parity) fixed CI infrastructure but did not touch the failing test files — it passed CI only because `changed-scope` skipped all test shards (no `src/` files changed)
- The Install Smoke workflow fails on all recent `main` commits including c4a4050ce4
**This PR's scope**

Our changes are limited to:

- `src/agents/auth-profiles/` (types, usage logic, tests)
- `src/agents/model-fallback.ts`
- `src/agents/pi-embedded-runner/run.ts`
- `src/auto-reply/reply/agent-runner-execution.ts`
- `changelog/fragments/cooldown-per-model-stepped-backoff.md`

None of the failing tests are in or related to these files. The test shard 2/2, extensions, protocol, build-smoke, bun test, and all boundary checks pass.
**Production Validation — 4 Days of A/B Comparison**

I've been running this PR's code in a Docker deployment alongside production (stock build).

**Copilot Failure Rates**

¹ Mar 16 was the PR instance's first day — high failure rate due to initial cooldown cascade before the stepped backoff had a chance to stabilize.

**What the Numbers Show**

The PR instance (this code) uses per-model cooldown scoping + stepped backoff (30s → 1m → 5m cap). Gateway logs show the error counter consistently resetting between rate-limit windows: every rate-limit event starts fresh at the 30s base instead of escalating.

**Model Isolation Working**

A 429 on one model no longer blocks the other models on the same profile.

**External Watchdog Interventions**

Production's "0 cascade alerts" is misleading — the watchdog doesn't trigger because fallback providers absorb the failures. But Copilot is locked out 100% of the time, routing all traffic to fallback providers unnecessarily. The PR instance's 2 early cascade alerts (Mar 17-18) were during initial tuning; by Mar 19 the system is self-correcting with zero external intervention needed.

**CI Note**

All 6 failing CI jobs on this PR are pre-existing failures on `main`.
Force-pushed 16c4fa3 to 6d58d2f.
Force-pushed d1764f1 to fd0e16c.
@greptileai review
Force-pushed fd0e16c to 27b7573.
@greptileai review
In `src/auto-reply/reply/agent-runner-execution.ts` (lines 87-93):

```ts
if (typeof expiry === "number" && expiry > Date.now()) {
  const secsLeft = Math.ceil((expiry - Date.now()) / 1000);
  if (secsLeft <= 60) {
    return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
  }
  const minsLeft = Math.ceil(secsLeft / 60);
  return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
```
**Two `Date.now()` calls leave a window for `secsLeft` to be zero or negative**

`expiry > Date.now()` and `expiry - Date.now()` are evaluated at separate instants. Under normal conditions the gap is negligible, but under a very loaded event loop the second call can return a value past `expiry`. `Math.ceil` rounds small negative fractions to `0`, so the user would see "ready in ~0s" or, in extreme cases, a negative countdown like "ready in ~-1s".
Capture a single timestamp at the top of the function and reuse it:
Suggested change:

```ts
const expiry = err.soonestCooldownExpiry;
const now = Date.now();
if (typeof expiry === "number" && expiry > now) {
  const secsLeft = Math.ceil((expiry - now) / 1000);
  if (secsLeft <= 60) {
    return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
  }
  const minsLeft = Math.ceil(secsLeft / 60);
  return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
}
```
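For clarity, the single-timestamp pattern can be run in isolation as a self-contained helper (hedged sketch: `buildCooldownMessage` and the injectable `now` default are illustrative names, not the PR's exact helper):

```typescript
// Single-timestamp variant: capture `now` once so the comparison and the
// subtraction can never disagree, eliminating the ~0s / negative countdown.
function buildCooldownMessage(
  expiry: unknown,
  now: number = Date.now(),
): string | undefined {
  if (typeof expiry !== "number" || expiry <= now) {
    return undefined; // no active cooldown to report
  }
  const secsLeft = Math.ceil((expiry - now) / 1000);
  if (secsLeft <= 60) {
    return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
  }
  const minsLeft = Math.ceil(secsLeft / 60);
  return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
}
```

Accepting `now` as a parameter also makes the boundary cases trivially testable without mocking the clock.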
In `src/agents/auth-profiles/usage.ts` (lines 508-519):

```ts
if (
  params.existing.cooldownModel &&
  params.modelId &&
  params.existing.cooldownModel !== params.modelId
) {
  updatedStats.cooldownModel = undefined;
} else if (params.reason !== "rate_limit") {
  // Non-rate-limit failures are profile-wide — clear model scope even
  // when the same model fails, so that no model can bypass.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
```
**`cooldownModel` preserved when `modelId` is absent on a `rate_limit` during an active window**
In the `existingCooldownActive` branch the three conditions are:

1. Different model → widen to `undefined`
2. Non-rate-limit reason → widen to `undefined`
3. Else → keep `params.existing.cooldownModel`

Condition 3 is reached when `params.reason === "rate_limit"` **and** either `params.modelId` or `params.existing.cooldownModel` is falsy. When `params.modelId` is `undefined` (i.e. the caller didn't pass a model) and `params.existing.cooldownModel` is already set to `"model-A"`, the existing model-scoped value is silently preserved — even though the new failure came from an unknown model and should arguably widen the scope.

All current call sites in this PR correctly thread `modelId`, so this is not a live bug today. But adding an explicit guard prevents a silent regression if a future call-site omits `modelId`:
```ts
} else if (params.reason === "rate_limit" && !params.modelId && params.existing.cooldownModel) {
  // Unknown originating model — conservatively widen scope so no model bypasses.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
}
```
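The widening rules under discussion, including the proposed guard, can be condensed into a runnable decision table (`nextCooldownModel` is a hypothetical helper written for illustration, not a function in the PR):

```typescript
// Illustrative decision table for cooldownModel during an active window,
// mirroring the review discussion: widen on a different model, widen on any
// non-rate-limit failure, widen when the originating model is unknown, and
// otherwise preserve the same-model scope.
function nextCooldownModel(
  existingModel: string | undefined,
  reason: string,
  modelId: string | undefined,
): string | undefined {
  if (existingModel && modelId && existingModel !== modelId) {
    return undefined; // different model failed: widen to profile-wide
  }
  if (reason !== "rate_limit") {
    return undefined; // non-rate-limit failures are always profile-wide
  }
  if (!modelId && existingModel) {
    return undefined; // unknown originating model: conservatively widen
  }
  return existingModel; // same-model rate limit: preserve the scope
}
```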
Hello @kiranvk-2011, I pushed some commits; would you mind having a look and re-testing in your setup? I can proceed with the merge once you give the green light.
**Green Light — Production Validation Complete**

@altaywtf I've rebuilt the test instance from source using your latest commits.

**Stepped Backoff — Confirmed Working (Today's Live Data)**

**The 10-Day A/B Comparison (Mar 16–25)**

Both instances share the same Copilot token:

- Production (stock build)
- PR #49834 (stepped 30s/1m/5m + per-model scope)

The key metric: production's exponential formula locked Copilot out permanently after the first cascade (Mar 17), while the stepped formula recovered and continued using Copilot whenever the rate limit cleared.

**Fallback Chain — Working Correctly**

5 successful fallback events today.

**Build**

Source-built using the upstream monorepo Dockerfile (not sed patches). Container created today at 15:44 UTC with all your commits.

**Merge Conflict**

The PR currently shows a merge conflict and needs a rebase.

**Minor Notes (Non-Blocking)**

Per Greptile's review, all remaining notes are cosmetic.

**TL;DR:** 10 days of production A/B data confirm the stepped backoff eliminates the self-reinforcing cascade loop. Copilot utilization went from 4% (stock) to 35% (PR) on the same rate-limited token. Zero user-facing failures on both. Green light from me — just needs the rebase.
Force-pushed aa1111b to 1f84da1.
…mit message

Combines ideas from PRs openclaw#45113, openclaw#31962, and openclaw#45763 to address three cooldown-related issues:

1. Stepped cooldown (30s → 1m → 5m cap) replaces the aggressive exponential formula (1m → 5m → 25m → 1h) that locked out providers for far longer than the actual API rate-limit window.
2. Per-model cooldown scoping: rate_limit cooldowns now record which model triggered them. When a different model on the same auth profile is requested, the cooldown is bypassed — so one model hitting a 429 no longer blocks all other models on the same provider.
3. FallbackSummaryError with soonest-expiry countdown: when all candidates are exhausted, the user sees a clear message like '⚠️ Rate-limited — ready in ~28s' instead of a generic failure.

Files changed:
- types.ts: add cooldownReason/cooldownModel to ProfileUsageStats
- usage.ts: stepped formula, model-aware isProfileInCooldown, modelId threading through computeNextProfileUsageStats/markAuthProfileFailure
- model-fallback.ts: FallbackSummaryError class, model-aware availability check, soonestCooldownExpiry computation
- pi-embedded-runner/run.ts: thread modelId into failure recording
- agent-runner-execution.ts: buildCopilotCooldownMessage helper, rate-limit detection branch in error handler
- usage.test.ts: update expected cooldown value (60s → 30s)
…ormatting

- Update markAuthProfileCooldown JSDoc to reflect new stepped backoff (30s/1m/5m)
- Merge duplicate isFallbackSummaryError import into single import statement
- Run oxfmt on all changed files to fix formatting CI failure
…ndow

- When model A is cooling down and model B also fails, set cooldownModel to undefined so neither model bypasses via per-model scope
- Same-model retries preserve the original cooldownModel
- Add 8 new tests for per-model cooldown behavior: model-scoped bypass, profile-wide cooldown, billing-disable guard, scope-widening, same-model retry preservation
- Update .some() comment to document intentional design choice for mixed fallback failure reasons
… checks

- Add curly braces to single-line if/for bodies in usage.ts and model-fallback.ts to satisfy oxlint eslint(curly) rule
- Thread modelId into all 3 isProfileInCooldown calls in pi-embedded-runner/run.ts (lines 719, 746, 767) so the inner profile loop respects per-model cooldown scope — fixes Codex P1 review comment about outer gate passing model-B while inner loop rejects it without model context
…ope for non-rate-limit
…own ladder

Update the auth-profiles.markauthprofilefailure test suite to match the new stepped cooldown formula (30s → 1m → 5m cap) introduced in the first commit. The test was still asserting the old exponential backoff values (1m → 5m → 25m → 1h cap).

Changes:
- calculateAuthProfileCooldownMs assertions: 60s→30s, 5m→1m, 25m→5m, 1h→5m cap
- 'resets error count when previous cooldown has expired' test: upper bound adjusted from 120s to 60s to match 30s base cooldown
- Comments updated to reflect the stepped ladder

Resolves merge-blocker review from @altaywtf.
Force-pushed 1f84da1 to 7c488c0.
Merged via squash.
Thanks @kiranvk-2011!
Summary
This PR addresses three related cooldown issues that cause disproportionate service disruption when a single model on a shared auth profile (e.g. GitHub Copilot) hits a rate limit:
- Replaces the `1m -> 5m -> 25m -> 1h` escalation with a capped `30s -> 1m -> 5m` ladder that better matches actual API rate-limit windows
- Scopes rate-limit cooldowns to the model that triggered them, so one model's 429 no longer blocks the rest of the profile
- A structured `FallbackSummaryError` with countdown replaces the generic "Agent failed" text

Combines ideas from #45113 (per-model cooldown metadata), #31962 (flat/stepped backoff), and #45763 (structured fallback error with UX).
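The stepped ladder can be sketched as a simple lookup keyed by consecutive failures (a hedged sketch: `steppedCooldownMs` and the constant name are illustrative, not the project's actual `calculateAuthProfileCooldownMs` implementation):

```typescript
// Illustrative sketch of the capped stepped-backoff ladder (30s -> 1m -> 5m).
const COOLDOWN_STEPS_MS = [30_000, 60_000, 300_000];

function steppedCooldownMs(consecutiveFailures: number): number {
  // First failure uses the 30s base; later failures walk the ladder and
  // clamp at the 5m cap instead of escalating exponentially.
  const step = Math.min(
    Math.max(consecutiveFailures - 1, 0),
    COOLDOWN_STEPS_MS.length - 1,
  );
  return COOLDOWN_STEPS_MS[step];
}
```

Unlike the old exponential formula (1m, 5m, 25m, 1h), a long burst of 429s never locks the profile out for more than five minutes.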
Problem
When a single GitHub Copilot model (e.g. `gpt-4.1`) returns HTTP 429, the current code puts the entire auth profile into an exponential cooldown, blocking every other model (`claude-sonnet-4.6`, `gpt-4.1-mini`, etc.) on the same profile.

In multi-model deployments with fallback chains, this creates cascading failures where one model's rate limit locks out the entire provider for up to an hour.
Changes
**src/agents/auth-profiles/types.ts**
- Add `cooldownReason` and `cooldownModel` fields to `ProfileUsageStats`

**src/agents/auth-profiles/usage.ts**
- `calculateAuthProfileCooldownMs()` — stepped formula: `30s -> 1m -> 5m` (cap)
- `isProfileInCooldown()` — new `forModel` parameter; bypasses cooldown when the recorded `cooldownModel` differs from the requested model (rate_limit only — billing/auth failures remain profile-wide)
- `computeNextProfileUsageStats()` — records `cooldownReason` and `cooldownModel` metadata; preserves existing metadata during active cooldown windows
- `markAuthProfileFailure()` — accepts and threads `modelId` parameter
- `clearExpiredCooldowns()` / `resetUsageStats()` — clear the new metadata fields

**src/agents/model-fallback.ts**
- `FallbackSummaryError` — structured error class with `attempts[]` and `soonestCooldownExpiry` timestamp
- `throwFallbackFailureSummary()` — computes soonest expiry across all candidate profiles and throws `FallbackSummaryError`
- `runWithModelFallback()` — passes `candidate.model` into `isProfileInCooldown()` for model-aware availability check

**src/agents/pi-embedded-runner/run.ts**
- Thread `modelId` through `maybeMarkAuthProfileFailure` and its two call sites
- Pass `modelId` to all `isProfileInCooldown()` calls in the inner profile loop so model-scoped cooldowns are respected consistently

**src/auto-reply/reply/agent-runner-execution.ts**
- `buildCopilotCooldownMessage()` — produces messages like "Rate-limited — ready in ~28s"

**src/agents/auth-profiles/usage.test.ts**
- Update expected cooldown from `60_000` to `30_000` to match the new stepped formula

**Cooldown Behavior After This PR**
Other models on the same auth profile remain fully available during any model-scoped rate-limit cooldown.
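This model-scoped bypass can be illustrated with a self-contained sketch (hedged: the interface and function body here are illustrative, not the exact code in `usage.ts`):

```typescript
// Illustrative sketch of the model-aware cooldown check. Only rate_limit
// cooldowns carry a model scope; billing/auth cooldowns stay profile-wide.
interface ProfileUsageStats {
  cooldownUntil?: number;
  cooldownReason?: "rate_limit" | "billing" | "auth";
  cooldownModel?: string;
}

function isProfileInCooldown(
  stats: ProfileUsageStats,
  now: number,
  forModel?: string,
): boolean {
  if (!stats.cooldownUntil || stats.cooldownUntil <= now) {
    return false; // no active cooldown window
  }
  if (
    stats.cooldownReason === "rate_limit" &&
    stats.cooldownModel &&
    forModel &&
    stats.cooldownModel !== forModel
  ) {
    return false; // rate-limit cooldown scoped to a different model: bypass
  }
  return true;
}
```

A 429 recorded for `gpt-4.1` leaves `claude-sonnet-4.6` requests on the same profile unblocked, while a billing failure still blocks both.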
Testing
- `usage.test.ts` tests pass (1 expectation updated for new formula, 8 new per-model tests)
- `model-fallback.test.ts` tests pass (1 pre-existing ANSI escape failure, unrelated)
- `pi-embedded-runner/run` tests pass
- `agent-runner-execution` tests pass
- `oxlint --type-aware` passes with 0 errors
- `oxfmt` formatting passes
Several CI jobs (
contracts,channels,extensions,extension-fast,install-smoke) fail on this PR — these are pre-existing failures that also occur onmain. See #49848 for full analysis. All CI jobs that cover our changed files (format:check,build-smoke,check,node test 1/2,node test 2/2,changed-scope,protocol:check,secrets) pass.Related Issues / PRs