
fix: per-model cooldown scope, stepped backoff, and user-facing rate-limit message #49834

Merged
altaywtf merged 13 commits into openclaw:main from kiranvk-2011:fix/per-model-cooldown-stepped-backoff
Mar 25, 2026
Conversation


@kiranvk-2011 kiranvk-2011 commented Mar 18, 2026

Summary

This PR addresses three related cooldown issues that cause disproportionate service disruption when a single model on a shared auth profile (e.g. GitHub Copilot) hits a rate limit:

  1. Stepped cooldown formula — replaces the exponential 1m -> 5m -> 25m -> 1h escalation with a capped 30s -> 1m -> 5m ladder that better matches actual API rate-limit windows
  2. Per-model cooldown scoping — rate-limit cooldowns now record which model triggered them; other models on the same auth profile are allowed through
  3. User-facing rate-limit message — a structured FallbackSummaryError with countdown replaces the generic "Agent failed" text

Combines ideas from #45113 (per-model cooldown metadata), #31962 (flat/stepped backoff), and #45763 (structured fallback error with UX).

Problem

When a single GitHub Copilot model (e.g. gpt-4.1) returns HTTP 429, the current code:

  • Puts the entire auth profile into cooldown — blocking all other models (claude-sonnet-4.6, gpt-4.1-mini, etc.) on the same profile
  • Escalates aggressively: 3 consecutive errors -> 25-minute cooldown, 4+ -> 1 hour
  • Shows a generic "Agent failed before reply" message with no actionable information

In multi-model deployments with fallback chains, this creates cascading failures where one model's rate limit locks out the entire provider for up to an hour.

Changes

src/agents/auth-profiles/types.ts

  • Add cooldownReason and cooldownModel fields to ProfileUsageStats

src/agents/auth-profiles/usage.ts

  • calculateAuthProfileCooldownMs() — stepped formula: 30s -> 1m -> 5m (cap)
  • isProfileInCooldown() — new forModel parameter; bypasses cooldown when the recorded cooldownModel differs from the requested model (rate_limit only — billing/auth failures remain profile-wide)
  • computeNextProfileUsageStats() — records cooldownReason and cooldownModel metadata; preserves existing metadata during active cooldown windows
  • markAuthProfileFailure() — accepts and threads modelId parameter
  • clearExpiredCooldowns() / resetUsageStats() — clear the new metadata fields
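
The model-scoped bypass described above can be sketched roughly as follows. This is a simplified assumption of the shapes involved, not the repo's actual code: the real `ProfileUsageStats` has more fields, and billing/auth lockouts go through a separate `disabledUntil` mechanism that this sketch folds into `cooldownReason`.

```typescript
// Hypothetical, simplified shapes based on the PR description.
interface ProfileUsageStats {
  cooldownUntil?: number; // epoch ms; absent or past means no active cooldown
  cooldownReason?: "rate_limit" | "billing" | "auth_permanent";
  cooldownModel?: string; // set only for model-scoped rate-limit cooldowns
}

// Sketch of the forModel bypass: a rate-limit cooldown recorded for one
// model does not block requests for a different model on the same profile.
function isProfileInCooldown(
  stats: ProfileUsageStats,
  now: number,
  forModel?: string,
): boolean {
  if (!stats.cooldownUntil || stats.cooldownUntil <= now) {
    return false; // no cooldown, or it has expired
  }
  // Billing/auth failures stay profile-wide: never bypassed.
  if (stats.cooldownReason !== "rate_limit") {
    return true;
  }
  // Model-scoped rate limit: only the recorded model is held back.
  if (stats.cooldownModel && forModel && stats.cooldownModel !== forModel) {
    return false;
  }
  return true;
}
```

With this shape, a 429 recorded for `gpt-4.1` leaves `claude-sonnet-4.6` requests on the same profile untouched, while a billing cooldown still blocks both.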

src/agents/model-fallback.ts

  • FallbackSummaryError — structured error class with attempts[] and soonestCooldownExpiry timestamp
  • throwFallbackFailureSummary() — computes soonest expiry across all candidate profiles and throws FallbackSummaryError
  • runWithModelFallback() — passes candidate.model into isProfileInCooldown() for model-aware availability check
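
A minimal sketch of what the structured error might look like. The names `FallbackSummaryError`, `attempts`, `soonestCooldownExpiry`, and `isFallbackSummaryError` come from this PR's description and review comments; the constructor shape and the `FallbackAttempt` record are illustrative assumptions.

```typescript
// Hypothetical attempt record; the real type likely carries more detail.
interface FallbackAttempt {
  provider: string;
  model: string;
  reason: string; // e.g. "rate_limit", "overloaded"
}

class FallbackSummaryError extends Error {
  constructor(
    readonly attempts: FallbackAttempt[],
    readonly soonestCooldownExpiry: number | null, // epoch ms, or null if unknown
  ) {
    super(`model fallback exhausted after ${attempts.length} attempt(s)`);
    this.name = "FallbackSummaryError";
  }
}

// Downstream handlers branch on the error type instead of matching message strings.
function isFallbackSummaryError(err: unknown): err is FallbackSummaryError {
  return err instanceof FallbackSummaryError;
}
```

The point of the class is that the reply layer can read `soonestCooldownExpiry` to render a countdown instead of parsing free-form error text.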

src/agents/pi-embedded-runner/run.ts

  • Thread modelId through maybeMarkAuthProfileFailure and its two call sites
  • Pass modelId to all isProfileInCooldown() calls in the inner profile loop so model-scoped cooldowns are respected consistently

src/auto-reply/reply/agent-runner-execution.ts

  • buildCopilotCooldownMessage() — produces messages like "Rate-limited — ready in ~28s"
  • Rate-limit detection branch added to the error handler (before the generic fallback)

src/agents/auth-profiles/usage.test.ts

  • Updated expected cooldown value from 60_000 to 30_000 to match the new stepped formula
  • 8 new per-model cooldown tests covering model-scoped bypass, profile-wide billing fallback, scope widening, and expiry calculation

Cooldown Behavior After This PR

Error count                 Cooldown           Scope
1st rate_limit              30 seconds         Model-scoped
2nd rate_limit              1 minute           Model-scoped
3rd+ rate_limit             5 minutes          Model-scoped
billing / auth_permanent    Existing behavior  Profile-wide

Other models on the same auth profile remain fully available during any model-scoped rate-limit cooldown.
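
The ladder can be sketched as a small function, with the old escalation formula (min(1h, 1m × 5^(n-1)), as described elsewhere in this thread) alongside for comparison. Constants are taken from this PR's description; the real function signature may differ.

```typescript
// New stepped ladder: 30s -> 1m -> 5m (capped).
function calculateAuthProfileCooldownMs(errorCount: number): number {
  if (errorCount <= 1) return 30_000; // 1st rate limit: 30 seconds
  if (errorCount === 2) return 60_000; // 2nd: 1 minute
  return 300_000; // 3rd+: capped at 5 minutes
}

// Old behavior for comparison: min(1h, 1m × 5^(n-1)),
// i.e. 1m -> 5m -> 25m -> 1h.
function oldExponentialCooldownMs(errorCount: number): number {
  return Math.min(3_600_000, 60_000 * 5 ** (errorCount - 1));
}
```

The cap matters: under the old formula a third consecutive 429 cost 25 minutes of lockout; under the new one it costs 5.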

Testing

  • All 45 usage.test.ts tests pass (1 expectation updated for new formula, 8 new per-model tests)
  • 59 model-fallback.test.ts tests pass (the 1 failure is a pre-existing ANSI escape issue, unrelated to this change)
  • All 149 pi-embedded-runner/run tests pass
  • All 26 agent-runner-execution tests pass
  • oxlint --type-aware passes with 0 errors
  • oxfmt formatting passes
  • TypeScript compilation clean (no errors in modified files)

CI Note

Several CI jobs (contracts, channels, extensions, extension-fast, install-smoke) fail on this PR — these are pre-existing failures that also occur on main. See #49848 for full analysis. All CI jobs that cover our changed files (format:check, build-smoke, check, node test 1/2, node test 2/2, changed-scope, protocol:check, secrets) pass.


@kiranvk-2011 kiranvk-2011 requested a review from a team as a code owner March 18, 2026 13:51
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 18, 2026
greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR addresses three related cooldown pain points in the auth-profile system: a stepped backoff formula (30 s → 1 min → 5 min), per-model cooldown scoping so a single 429 no longer blocks every model on a shared profile, and a structured FallbackSummaryError with a user-facing countdown. The approach is well-motivated and the implementation integrates cleanly with the existing auth-profile machinery.

Key changes reviewed:

  • calculateAuthProfileCooldownMs now uses a simple three-tier ladder instead of the previous 5^n exponential — tests updated accordingly.
  • isProfileInCooldown gains a forModel parameter with a guard that prevents the model-bypass from short-circuiting an active disabledUntil (billing/auth) — this correctly addresses the previous reviewer concern.
  • computeNextProfileUsageStats records cooldownReason/cooldownModel and widens the scope to profile-wide when a second model fails or when the reason is not rate_limit.
  • FallbackSummaryError / isPureTransientRateLimitSummary ensure the countdown UX fires only when all attempts are concretely rate_limit or overloaded — mixed-cause exhaustion correctly falls through to the generic error message.
  • Minor items flagged: buildCopilotCooldownMessage calls Date.now() twice (could produce a "~0s" countdown at the boundary); resolveFallbackSoonestCooldownExpiry lacks a try/catch around its synchronous file I/O; and an edge case in computeNextProfileUsageStats where an absent modelId during an active rate-limit window would preserve the existing model scope rather than widening it.

Confidence Score: 4/5

  • Safe to merge — all three stated goals are correctly implemented and well-tested; the remaining comments are minor style/robustness observations that do not affect correctness under normal operating conditions.
  • Core logic is sound: the model-scoped bypass, disabledUntil guard, and mixed-cause FallbackSummaryError filtering all work correctly. Previous reviewer issues have been addressed. The three flagged items are cosmetic (double Date.now()) and low-probability reliability concerns (loadAuthProfileStoreForRuntime not guarded, implicit cooldownModel preservation when modelId is absent), none of which affect the primary use-case. Test coverage is comprehensive (45 + 8 new tests).
  • src/agents/model-fallback.ts (the resolveFallbackSoonestCooldownExpiry I/O path) and src/auto-reply/reply/agent-runner-execution.ts (double Date.now() in buildCopilotCooldownMessage) are worth a second look before merge.

Comments Outside Diff (1)

  1. src/agents/model-fallback.ts, line 546-575 (link)

    P2 resolveFallbackSoonestCooldownExpiry can throw, silently bypassing FallbackSummaryError

    loadAuthProfileStoreForRuntime does synchronous file I/O. If the store file is absent, malformed, or temporarily locked by another process, it can throw. Because resolveFallbackSoonestCooldownExpiry is called inline as a parameter to throwFallbackFailureSummary, any such exception propagates out of runWithModelFallback before FallbackSummaryError is thrown. The downstream handler in agent-runner-execution.ts would then receive an unexpected I/O error: isFallbackSummaryError returns false, isRateLimitErrorMessage also returns false, and the user sees a raw filesystem error message in the "Agent failed before reply" text.

    Wrapping the call in a try/catch (returning null on failure) makes the fallback behaviour gracefully degrade to the generic rate-limit message without a countdown rather than surfacing an unrelated I/O error:

    ```ts
    function resolveFallbackSoonestCooldownExpiry(params: { ... }): number | null {
      if (!params.authStore) {
        return null;
      }
      try {
        const refreshedStore = loadAuthProfileStoreForRuntime(params.agentDir, {
          readOnly: true,
          allowKeychainPrompt: false,
        });
        // ... existing logic
        return getSoonestCooldownExpiry(refreshedStore, [...allProfileIds]);
      } catch {
        return null;
      }
    }
    ```
Inline comment on src/agents/auth-profiles/usage.ts, lines 508-519:
**`cooldownModel` preserved when `modelId` is absent on a `rate_limit` during an active window**

In the `existingCooldownActive` branch the three conditions are:
1. Different model → widen to `undefined`
2. Non-rate-limit reason → widen to `undefined`
3. Else → keep `params.existing.cooldownModel`

Condition 3 is reached when `params.reason === "rate_limit"` **and** either `params.modelId` or `params.existing.cooldownModel` is falsy. When `params.modelId` is `undefined` (i.e. the caller didn't pass a model) and `params.existing.cooldownModel` is already set to `"model-A"`, the existing model-scoped value is silently preserved — even though the new failure came from an unknown model and should arguably widen the scope.

All current call sites in this PR correctly thread `modelId`, so this is not a live bug today. But adding an explicit guard prevents a silent regression if a future call-site omits `modelId`:

```ts
} else if (params.reason === "rate_limit" && !params.modelId && params.existing.cooldownModel) {
  // Unknown originating model — conservatively widen scope so no model bypasses.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
}
```


Reviews (3): Last reviewed commit: "fix(agents): scope cooldowns per model"

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b5c264330

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 2b5c264 to 106d513 Compare March 18, 2026 14:08
@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 106d513919 (no suggestions beyond the standard review)

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 9451a27678

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 36ac8951f0

```ts
allProfileIds.add(id);
}
}
return getSoonestCooldownExpiry(authStore, [...allProfileIds]);
```


P2: Recompute cooldown expiry from fresh store before summary

soonestCooldownExpiry is computed from the authStore snapshot captured before the fallback attempts run. In the embedded runner path, attempts can persist new cooldowns via markAuthProfileFailure, but this snapshot is never refreshed, so the summary countdown can be null/stale (for example after the first in-run 429), producing incorrect retry timing in the user-facing rate-limit message.


@kiranvk-2011 (Contributor Author) replied:

Acknowledged — this is a valid observation about the snapshot staleness.

However, this is existing architecture, not introduced by this PR. The authStore in model-fallback.ts is a clone created via structuredClone() inside ensureAuthProfileStore() at the start of runWithModelFallback (line 555). The embedded runner in run.ts creates its own separate clone (line 423). Mutations from markAuthProfileFailure during attempts update the runner's clone and persist to disk via updateAuthProfileStoreWithLock, but the model-fallback.ts clone is never refreshed.

Impact is cosmetic only: The soonestCooldownExpiry computed at line 818 feeds into the user-facing error message countdown text (e.g. "retry in ~4m 30s"). Cooldown enforcement works correctly because each embedded runner call reads fresh state from disk via the lock-guarded store update. A stale countdown in the error message is a minor UX imperfection, not a correctness issue.

Fixing this properly would require refactoring the structuredClone snapshot pattern to either re-read the store after the attempt loop or use a shared mutable reference — both are broader architectural changes beyond this PR's scope. Happy to tackle that in a follow-up if maintainers prefer.

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 971693935a

@altaywtf altaywtf self-assigned this Mar 18, 2026

altaywtf commented Mar 18, 2026

I found one concrete merge blocker and one follow-up nit.

The blocker is that this branch changes calculateAuthProfileCooldownMs() to the stepped 30s -> 1m -> 5m ladder, but src/agents/auth-profiles.markauthprofilefailure.test.ts still asserts the old exponential contract. I verified it on the PR head with:

pnpm test -- src/agents/auth-profiles.markauthprofilefailure.test.ts

That fails at src/agents/auth-profiles.markauthprofilefailure.test.ts:271 with expected 30000 to be 60000, so the production change and targeted auth-profile test suite are currently out of sync.

@kiranvk-2011 (Contributor Author):

Thanks for catching this, @altaywtf! You're right — the markauthprofilefailure test suite was still asserting the old exponential 1m → 5m → 25m → 1h contract.

Fixed in b2ccb3d:

  • calculateAuthProfileCooldownMs assertions updated to the new stepped ladder: 30s → 1m → 5m (cap)
  • The "resets error count when previous cooldown has expired" test's upper bound tightened from 120_000 to 60_000 to match calculateAuthProfileCooldownMs(1) = 30_000
  • Comments updated throughout

Both test suites now pass:

pnpm test -- src/agents/auth-profiles.markauthprofilefailure.test.ts  # 9/9 passed
pnpm test -- src/agents/auth-profiles/usage.test.ts                  # 47/47 passed

You also mentioned a follow-up nit — happy to address that if you could share the details!

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 487deab3a5

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 487deab to 0016af2 Compare March 18, 2026 20:00
@kiranvk-2011 (Contributor Author):

CI Failure Analysis — All Failures Pre-existing / Unrelated

I've analyzed all 4 failing CI jobs on the latest run (23264393916). None involve files modified by this PR (auth-profiles/usage.ts, auth-profiles/types.ts, model-fallback.ts, pi-embedded-runner/run.ts, agent-runner-execution.ts, or their test files).

Job                                 Failure                                                                                              Related to PR?
checks (node, test, 1, 2)           Tlon submodule TypeScript errors (TS1360/TS2339/TS2307 in channelContentConfig.ts, postContent.ts, debug.ts, groupTemplates.ts)    No
checks (node, extensions)           llm-task-tool.test.ts — 2 schema validation test failures                                            No
checks-windows (node, test, 5, 6)   windows-acl.test.ts — 48/48 Windows ACL tests failed                                                 No
checks-windows (node, test, 3, 6)   bundle-mcp.test.ts — 2 path shortname mismatches (RUNNER~1 vs runneradmin)                           No

For reference, the latest upstream main CI runs (23264812040, 23264809135) are all concluding success. The failures on this branch may be due to submodule version drift between our rebase point and latest main, or transient CI runner environment issues.

All jobs that exercise code paths touched by this PR are passing: check (lint), checks (node, test, 2, 2), checks (node, contracts), checks (bun, test), build-smoke, install-smoke, and checks-windows (node, test, 4, 6).

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 0016af2 to bd401f7 Compare March 19, 2026 11:15
@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit bd401f7876


kiranvk-2011 commented Mar 19, 2026

CI Failure Analysis — All Pre-Existing on main

Rebased onto latest green main (c4a4050ce4, main CI run #23293514090). All CI failures on this PR are identical to failures on main itself — verified by comparing against main CI run #23287263443 on 009a10bce2 (the most recent main commit that ran the full test suite).

Failure Breakdown

  • check: TS errors in extensions/matrix/src/onboarding.ts, acp-spawn.test.ts, commands/channels/remove.ts, matrix-plugin-helper.test.ts. Our PR touches this? No; all in Matrix extension / channel commands
  • contracts: registry.contract.test.ts expects ['discord','feishu','matrix','telegram'] but gets ['discord','feishu','telegram'] (matrix missing). No; channel contract registry
  • channels: Discord: createAccountListHelpers is not a function, resolveThreadBindingIdleTimeoutMs is not a function; WhatsApp: updateLastRoute setter mock failure. No; Discord/WhatsApp extension tests
  • install-smoke: Cannot find module '@vector-im/matrix-bot-sdk/package.json'. No; Matrix extension install (also fails on main c4a4050ce4)
  • windows shards: same contracts + channel test failures as above. No

Our PR's failures are a strict subset of main's

  • Main (009a10bce2): 11 failing jobs (check, bun-test, channels, contracts, compat-node22, release-check, windows shards 1/3/4/5/6)
  • Our PR (6d58d2f381): 7 failing jobs (check, channels, contracts, windows shards 1/4/5/6) — all present in main's failure list
  • Our PR passes jobs that main fails: bun test, compat-node22, release-check

Timeline

  • main was green at 0443ee82be (run #23284150176)
  • f3097b4c09 (refactor: install optional channels for remove) introduced these test failures
  • c4a4050ce4 (fix(macos): align exec command parity) fixed CI infrastructure but did not touch the failing test files — it passed CI only because changed-scope skipped all test shards (no src/ files changed)
  • The Install Smoke workflow fails on all recent main commits including c4a4050ce4

This PR's scope

Our changes are limited to:

  • src/agents/auth-profiles/ (types, usage logic, tests)
  • src/agents/model-fallback.ts
  • src/agents/pi-embedded-runner/run.ts
  • src/auto-reply/reply/agent-runner-execution.ts
  • changelog/fragments/cooldown-per-model-stepped-backoff.md

None of the failing tests are in or related to these files. The test shard 2/2, extensions, protocol, build-smoke, bun test, and all boundary checks pass.

@kiranvk-2011 (Contributor Author):

Production Validation — 4 Days of A/B Comparison

I've been running this PR's code in a Docker deployment alongside production (stock main) since Mar 16. Both instances share the same GitHub Copilot token (same rate-limit pool) and serve real Telegram users. Here are the results:

Copilot Failure Rates

Date          Production (main)    PR #49834
2026-03-16    0% (6 calls)         92.8% (69 calls)¹
2026-03-17    100% (100 calls)     27.9% (43 calls)
2026-03-18    100% (16 calls)      42.1% (19 calls)
2026-03-19    100% (4 calls)       9.5% (42 calls)

¹ Mar 16 was the PR instance's first day — high failure rate due to initial cooldown cascade before the stepped backoff had a chance to stabilize.

What the Numbers Show

Production (main) enters a provider-wide cooldown cascade on the first 429 and stays locked out for 25-60 minutes (escalation formula: min(1h, 1m × 5^(n-1))). Once in this state, every subsequent request fails, each failure compounds the error counter, and the provider stays in cooldown until external intervention (watchdog restart). Result: 100% Copilot failure rate for 3 consecutive days.

PR instance (this code) uses per-model cooldown scoping + stepped backoff (30s → 1m → 5m cap). Gateway logs show the error counter consistently resetting between rate-limit windows:

2026-03-18T20:44:53 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-18T20:45:41 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-18T21:09:08 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-19T06:52:56 | errors: 0→1 | cooldown: 30s | reason: rate_limit

Every rate-limit event starts fresh at errorCount: 0→1 with a 30-second cooldown — the cascade never compounds. After 30 seconds the model is retried and usually succeeds.

Model Isolation Working

A 429 on claude-haiku-4.5 (cron job) does not block claude-opus-4.6 (user request). The cooldownModel field ensures cooldown is scoped to the specific model that was rate-limited, not the entire provider. This is the core fix for issue #24158.

External Watchdog Interventions

Instance      Cascade alerts (4 days)    Proactive cooldown cleanups
Production    0 full cascades            4 today (stale cooldowns on disk)
PR #49834     2 cascade alerts           Normal — cooldowns created and expired naturally

Production's "0 cascade alerts" is misleading — the watchdog doesn't trigger because fallback providers absorb the failures. But Copilot is locked out 100% of the time, routing all traffic to fallback providers unnecessarily. The PR instance's 2 early cascade alerts (Mar 17-18) were during initial tuning; by Mar 19 the system is self-correcting with zero external intervention needed.

CI Note

All 6 failing CI jobs on this PR are pre-existing failures on main itself — verified by comparing against main CI run #23287263443 at the same commit (009a10bce2). The test shards containing our changes (test 1/2, test 2/2, bun) all pass ✅. See detailed CI analysis comment above.

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 16c4fa3 to 6d58d2f Compare March 19, 2026 11:55
@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch 2 times, most recently from d1764f1 to fd0e16c Compare March 24, 2026 18:10
@altaywtf (Member):

@greptileai review
@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit fd0e16c08a

@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from fd0e16c to 27b7573 Compare March 24, 2026 19:11
@openclaw-barnacle openclaw-barnacle bot removed the channel: discord Channel integration: discord label Mar 25, 2026
@altaywtf
Member

@greptileai review
@codex review

Comment on lines +87 to +93
```ts
  if (typeof expiry === "number" && expiry > Date.now()) {
    const secsLeft = Math.ceil((expiry - Date.now()) / 1000);
    if (secsLeft <= 60) {
      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
    }
    const minsLeft = Math.ceil(secsLeft / 60);
    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
```
Contributor


P2 Two Date.now() calls leave a window for secsLeft to be zero or negative

expiry > Date.now() and expiry - Date.now() are evaluated at separate instants. Under normal conditions the gap is negligible, but under a very loaded event loop the second call can return a value past expiry. Math.ceil rounds small negative fractions to 0, so the user would see "ready in ~0s" or, in extreme cases, a negative countdown like "ready in ~-1s".

Capture a single timestamp at the top of the function and reuse it:

Suggested change:

```diff
-  if (typeof expiry === "number" && expiry > Date.now()) {
-    const secsLeft = Math.ceil((expiry - Date.now()) / 1000);
-    if (secsLeft <= 60) {
-      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
-    }
-    const minsLeft = Math.ceil(secsLeft / 60);
-    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
+  const expiry = err.soonestCooldownExpiry;
+  const now = Date.now();
+  if (typeof expiry === "number" && expiry > now) {
+    const secsLeft = Math.ceil((expiry - now) / 1000);
+    if (secsLeft <= 60) {
+      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
+    }
+    const minsLeft = Math.ceil(secsLeft / 60);
+    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
+  }
```

Comment on lines +508 to +519
```ts
if (
  params.existing.cooldownModel &&
  params.modelId &&
  params.existing.cooldownModel !== params.modelId
) {
  updatedStats.cooldownModel = undefined;
} else if (params.reason !== "rate_limit") {
  // Non-rate-limit failures are profile-wide — clear model scope even
  // when the same model fails, so that no model can bypass.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
```
Contributor


P2 cooldownModel preserved when modelId is absent on a rate_limit during an active window

In the existingCooldownActive branch the three conditions are:

  1. Different model → widen to undefined
  2. Non-rate-limit reason → widen to undefined
  3. Else → keep params.existing.cooldownModel

Condition 3 is reached when params.reason === "rate_limit" and either params.modelId or params.existing.cooldownModel is falsy. When params.modelId is undefined (i.e. the caller didn't pass a model) and params.existing.cooldownModel is already set to "model-A", the existing model-scoped value is silently preserved — even though the new failure came from an unknown model and should arguably widen the scope.

All current call sites in this PR correctly thread modelId, so this is not a live bug today. But adding an explicit guard prevents a silent regression if a future call-site omits modelId:

```ts
} else if (params.reason === "rate_limit" && !params.modelId && params.existing.cooldownModel) {
  // Unknown originating model — conservatively widen scope so no model bypasses.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
}
```


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bece623e3d


@altaywtf
Member

hello @kiranvk-2011 pushed some commits, would you mind having a look and re-testing in your setup? I can proceed with the merge once you give the green-light.

@kiranvk-2011
Contributor Author

Green Light — Production Validation Complete

@altaywtf I've rebuilt the test instance from source using your latest commits (up to aa1111b, "fix(agents): filter cooldown eta by model") and have 10 days of A/B data. Ready for merge from my side.

Stepped Backoff — Confirmed Working (Today's Live Data)

6 auth_profile_failure_state_updated events from today's gateway log on the source-built image:

| Time (UTC) | errors | cooldown | expected |
| --- | --- | --- | --- |
| 15:49:03 | 0→1 | 30s | 30s ✅ |
| 15:51:17 | 1→2 | 60s | 60s ✅ |
| 15:53:27 | 2→1 | 30s | 30s (reset, restarted) ✅ |
| 15:56:35 | 1→2 | 60s | 60s ✅ |
| 16:00:06 | 2→3 | 300s | 300s (5m cap) ✅ |
| 16:04:29 | 0→1 | 30s | 30s (reset) ✅ |

The 0→1 resets prove the error counter clears correctly between cooldown windows. No escalation beyond 5 minutes observed.
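For reference, the ladder these events exercise can be sketched as below. This is an illustrative reimplementation of the behavior observed in the log, not the exact code in `usage.ts`:

```typescript
// Stepped cooldown ladder: 1st error → 30s, 2nd → 60s, 3rd+ → 5m cap.
const COOLDOWN_LADDER_MS = [30_000, 60_000, 300_000];

function calculateCooldownMs(consecutiveErrors: number): number {
  // Clamp to the last rung so escalation never exceeds the 5-minute cap.
  const idx = Math.min(
    Math.max(consecutiveErrors, 1) - 1,
    COOLDOWN_LADDER_MS.length - 1,
  );
  return COOLDOWN_LADDER_MS[idx];
}
```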

10-Day A/B Comparison (Mar 16–25)

Both instances share the same Copilot ghu_ token (same rate-limit pool):

Production (stock main — exponential min(1h, 60s × 5^n)):

  • Copilot failure rate: 100% for 9 of 10 days (locked out by cascade)
  • Copilot calls: 146 total, 6 succeeded (4.1% success rate)
  • All traffic routed to bailian/qwen3.5-plus fallback
  • total_runs_failed = 0 (fallbacks absorbed everything)

PR #49834 (stepped 30s/1m/5m + per-model scope):

  • Copilot failure rate: variable, recovers between windows
  • Copilot calls: 316 total, 112 succeeded (35.4% success rate)
  • Copilot recovers to 0 errors after each 30s window expires
  • Mixed traffic: Copilot when available, bailian/glm-5 fallback when rate-limited
  • total_runs_failed = 0 every single day — zero user-facing failures

The key metric: production's exponential formula locked Copilot out permanently after the first cascade (Mar 17), while the stepped formula recovered and continued using Copilot whenever the rate limit cleared.

Fallback Chain — Working Correctly

5 successful fallback events today: github-copilot/claude-opus-4.6 → 429 → bailian/glm-5 → success. Per-model scoping preserved access to non-rate-limited models (e.g., haiku not blocked by opus 429s).
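The structured error the fallback chain surfaces might look roughly like this — a minimal sketch assuming the `soonestCooldownExpiry` field name from the review snippets above; the actual class in `model-fallback.ts` likely carries more detail:

```typescript
// Thrown when every candidate in the fallback chain is exhausted, so the
// caller can render a countdown instead of a generic "Agent failed" message.
class FallbackSummaryError extends Error {
  constructor(
    message: string,
    // Epoch ms at which the earliest active cooldown expires, if known.
    public readonly soonestCooldownExpiry?: number,
  ) {
    super(message);
    this.name = "FallbackSummaryError";
  }
}

// Type guard used by the error handler to pick the rate-limit branch.
function isFallbackSummaryError(err: unknown): err is FallbackSummaryError {
  return err instanceof FallbackSummaryError;
}
```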

Build

Source-built using the upstream monorepo Dockerfile (not sed patches). Container created today at 15:44 UTC with all your commits through aa1111b. Node 24.14.0, runtime v24.

Merge Conflict

The PR currently shows mergeStateStatus: DIRTY — needs a rebase onto latest main before merging.

Minor Notes (Non-Blocking)

Per Greptile's review (all cosmetic):

  • buildCopilotCooldownMessage calls Date.now() twice — could show "~0s" at a boundary. Cosmetic.
  • resolveFallbackSoonestCooldownExpiry — no try/catch around sync file I/O. Low probability.

TL;DR: 10 days of production A/B data confirm the stepped backoff eliminates the self-reinforcing cascade loop. Copilot utilization went from 4% (stock) to 35% (PR) on the same rate-limited token. Zero user-facing failures on both. Green light from me — just needs the rebase.

@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from aa1111b to 1f84da1 Compare March 25, 2026 18:27
kiranvk-2011 and others added 13 commits March 25, 2026 21:46
…mit message

Combines ideas from PRs openclaw#45113, openclaw#31962, and openclaw#45763 to address three
cooldown-related issues:

1. Stepped cooldown (30s → 1m → 5m cap) replaces the aggressive
   exponential formula (1m → 5m → 25m → 1h) that locked out providers
   for far longer than the actual API rate-limit window.

2. Per-model cooldown scoping: rate_limit cooldowns now record which
   model triggered them. When a different model on the same auth profile
   is requested, the cooldown is bypassed — so one model hitting a 429
   no longer blocks all other models on the same provider.

3. FallbackSummaryError with soonest-expiry countdown: when all
   candidates are exhausted, the user sees a clear message like
   '⚠️ Rate-limited — ready in ~28s' instead of a generic failure.

Files changed:
- types.ts: add cooldownReason/cooldownModel to ProfileUsageStats
- usage.ts: stepped formula, model-aware isProfileInCooldown, modelId
  threading through computeNextProfileUsageStats/markAuthProfileFailure
- model-fallback.ts: FallbackSummaryError class, model-aware availability
  check, soonestCooldownExpiry computation
- pi-embedded-runner/run.ts: thread modelId into failure recording
- agent-runner-execution.ts: buildCopilotCooldownMessage helper, rate-limit
  detection branch in error handler
- usage.test.ts: update expected cooldown value (60s → 30s)
…ormatting

- Update markAuthProfileCooldown JSDoc to reflect new stepped backoff (30s/1m/5m)
- Merge duplicate isFallbackSummaryError import into single import statement
- Run oxfmt on all changed files to fix formatting CI failure
…ndow

- When model A is cooling down and model B also fails, set cooldownModel
  to undefined so neither model bypasses via per-model scope
- Same-model retries preserve the original cooldownModel
- Add 8 new tests for per-model cooldown behavior: model-scoped bypass,
  profile-wide cooldown, billing-disable guard, scope-widening, same-model
  retry preservation
- Update .some() comment to document intentional design choice for mixed
  fallback failure reasons
… checks

- Add curly braces to single-line if/for bodies in usage.ts and
  model-fallback.ts to satisfy oxlint eslint(curly) rule
- Thread modelId into all 3 isProfileInCooldown calls in
  pi-embedded-runner/run.ts (lines 719, 746, 767) so the inner
  profile loop respects per-model cooldown scope — fixes Codex P1
  review comment about outer gate passing model-B while inner loop
  rejects it without model context
…own ladder

Update the auth-profiles.markauthprofilefailure test suite to match the
new stepped cooldown formula (30s → 1m → 5m cap) introduced in the
first commit. The test was still asserting the old exponential backoff
values (1m → 5m → 25m → 1h cap).

Changes:
- calculateAuthProfileCooldownMs assertions: 60s→30s, 5m→1m, 25m→5m,
  1h→5m cap
- 'resets error count when previous cooldown has expired' test: upper
  bound adjusted from 120s to 60s to match 30s base cooldown
- Comments updated to reflect the stepped ladder

Resolves merge-blocker review from @altaywtf.
@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from 1f84da1 to 7c488c0 Compare March 25, 2026 18:47
@altaywtf altaywtf merged commit 8440122 into openclaw:main Mar 25, 2026
40 checks passed
@altaywtf
Member

Merged via squash.

Thanks @kiranvk-2011!


Labels

agents (Agent runtime and tooling), size: L
