
fix: per-model cooldown scope, stepped backoff, and user-facing rate-limit message #49834

Merged
altaywtf merged 13 commits into openclaw:main from kiranvk-2011:fix/per-model-cooldown-stepped-backoff
Mar 25, 2026
Conversation


@kiranvk-2011 kiranvk-2011 commented Mar 18, 2026

Summary

This PR addresses three related cooldown issues that cause disproportionate service disruption when a single model on a shared auth profile (e.g. GitHub Copilot) hits a rate limit:

  1. Stepped cooldown formula — replaces the exponential 1m -> 5m -> 25m -> 1h escalation with a capped 30s -> 1m -> 5m ladder that better matches actual API rate-limit windows
  2. Per-model cooldown scoping — rate-limit cooldowns now record which model triggered them; other models on the same auth profile are allowed through
  3. User-facing rate-limit message — a structured FallbackSummaryError with countdown replaces the generic "Agent failed" text

Combines ideas from #45113 (per-model cooldown metadata), #31962 (flat/stepped backoff), and #45763 (structured fallback error with UX).

Problem

When a single GitHub Copilot model (e.g. gpt-4.1) returns HTTP 429, the current code:

  • Puts the entire auth profile into cooldown — blocking all other models (claude-sonnet-4.6, gpt-4.1-mini, etc.) on the same profile
  • Escalates aggressively: 3 consecutive errors -> 25-minute cooldown, 4+ -> 1 hour
  • Shows a generic "Agent failed before reply" message with no actionable information

In multi-model deployments with fallback chains, this creates cascading failures where one model's rate limit locks out the entire provider for up to an hour.

Changes

src/agents/auth-profiles/types.ts

  • Add cooldownReason and cooldownModel fields to ProfileUsageStats

src/agents/auth-profiles/usage.ts

  • calculateAuthProfileCooldownMs() — stepped formula: 30s -> 1m -> 5m (cap)
  • isProfileInCooldown() — new forModel parameter; bypasses cooldown when the recorded cooldownModel differs from the requested model (rate_limit only — billing/auth failures remain profile-wide)
  • computeNextProfileUsageStats() — records cooldownReason and cooldownModel metadata; preserves existing metadata during active cooldown windows
  • markAuthProfileFailure() — accepts and threads modelId parameter
  • clearExpiredCooldowns() / resetUsageStats() — clear the new metadata fields
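
The model-scoped bypass described above can be sketched roughly as follows. This is a simplified assumption of the shapes involved, not the repo's actual code: the real `ProfileUsageStats` has more fields, and billing/auth lockouts go through a separate `disabledUntil` mechanism that this sketch folds into `cooldownReason`.

```typescript
// Hypothetical, simplified shapes based on the PR description.
interface ProfileUsageStats {
  cooldownUntil?: number; // epoch ms; absent or past means no active cooldown
  cooldownReason?: "rate_limit" | "billing" | "auth_permanent";
  cooldownModel?: string; // set only for model-scoped rate-limit cooldowns
}

// Sketch of the forModel bypass: a rate-limit cooldown recorded for one
// model does not block requests for a different model on the same profile.
function isProfileInCooldown(
  stats: ProfileUsageStats,
  now: number,
  forModel?: string,
): boolean {
  if (!stats.cooldownUntil || stats.cooldownUntil <= now) {
    return false; // no cooldown, or it has expired
  }
  // Billing/auth failures stay profile-wide: never bypassed.
  if (stats.cooldownReason !== "rate_limit") {
    return true;
  }
  // Model-scoped rate limit: only the recorded model is held back.
  if (stats.cooldownModel && forModel && stats.cooldownModel !== forModel) {
    return false;
  }
  return true;
}
```

With this shape, a 429 recorded for `gpt-4.1` leaves `claude-sonnet-4.6` requests on the same profile untouched, while a billing cooldown still blocks both.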

src/agents/model-fallback.ts

  • FallbackSummaryError — structured error class with attempts[] and soonestCooldownExpiry timestamp
  • throwFallbackFailureSummary() — computes soonest expiry across all candidate profiles and throws FallbackSummaryError
  • runWithModelFallback() — passes candidate.model into isProfileInCooldown() for model-aware availability check
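
A minimal sketch of what the structured error might look like. The names `FallbackSummaryError`, `attempts`, `soonestCooldownExpiry`, and `isFallbackSummaryError` come from this PR's description and review comments; the constructor shape and the `FallbackAttempt` record are illustrative assumptions.

```typescript
// Hypothetical attempt record; the real type likely carries more detail.
interface FallbackAttempt {
  provider: string;
  model: string;
  reason: string; // e.g. "rate_limit", "overloaded"
}

class FallbackSummaryError extends Error {
  constructor(
    readonly attempts: FallbackAttempt[],
    readonly soonestCooldownExpiry: number | null, // epoch ms, or null if unknown
  ) {
    super(`model fallback exhausted after ${attempts.length} attempt(s)`);
    this.name = "FallbackSummaryError";
  }
}

// Downstream handlers branch on the error type instead of matching message strings.
function isFallbackSummaryError(err: unknown): err is FallbackSummaryError {
  return err instanceof FallbackSummaryError;
}
```

The point of the class is that the reply layer can read `soonestCooldownExpiry` to render a countdown instead of parsing free-form error text.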

src/agents/pi-embedded-runner/run.ts

  • Thread modelId through maybeMarkAuthProfileFailure and its two call sites
  • Pass modelId to all isProfileInCooldown() calls in the inner profile loop so model-scoped cooldowns are respected consistently

src/auto-reply/reply/agent-runner-execution.ts

  • buildCopilotCooldownMessage() — produces messages like "Rate-limited — ready in ~28s"
  • Rate-limit detection branch added to the error handler (before the generic fallback)

src/agents/auth-profiles/usage.test.ts

  • Updated expected cooldown value from 60_000 to 30_000 to match the new stepped formula
  • 8 new per-model cooldown tests covering model-scoped bypass, profile-wide billing fallback, scope widening, and expiry calculation

Cooldown Behavior After This PR

Error count                 Cooldown           Scope
1st rate_limit              30 seconds         Model-scoped
2nd rate_limit              1 minute           Model-scoped
3rd+ rate_limit             5 minutes          Model-scoped
billing / auth_permanent    Existing behavior  Profile-wide

Other models on the same auth profile remain fully available during any model-scoped rate-limit cooldown.
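
The ladder can be sketched as a small function, with the old escalation formula (min(1h, 1m × 5^(n-1)), as described elsewhere in this thread) alongside for comparison. Constants are taken from this PR's description; the real function signature may differ.

```typescript
// New stepped ladder: 30s -> 1m -> 5m (capped).
function calculateAuthProfileCooldownMs(errorCount: number): number {
  if (errorCount <= 1) return 30_000; // 1st rate limit: 30 seconds
  if (errorCount === 2) return 60_000; // 2nd: 1 minute
  return 300_000; // 3rd+: capped at 5 minutes
}

// Old behavior for comparison: min(1h, 1m × 5^(n-1)),
// i.e. 1m -> 5m -> 25m -> 1h.
function oldExponentialCooldownMs(errorCount: number): number {
  return Math.min(3_600_000, 60_000 * 5 ** (errorCount - 1));
}
```

The cap matters: under the old formula a third consecutive 429 cost 25 minutes of lockout; under the new one it costs 5.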

Testing

  • All 45 usage.test.ts tests pass (1 expectation updated for new formula, 8 new per-model tests)
  • 59 model-fallback.test.ts tests pass (the 1 failure is a pre-existing ANSI escape issue, unrelated to this change)
  • All 149 pi-embedded-runner/run tests pass
  • All 26 agent-runner-execution tests pass
  • oxlint --type-aware passes with 0 errors
  • oxfmt formatting passes
  • TypeScript compilation clean (no errors in modified files)

CI Note

Several CI jobs (contracts, channels, extensions, extension-fast, install-smoke) fail on this PR — these are pre-existing failures that also occur on main. See #49848 for full analysis. All CI jobs that cover our changed files (format:check, build-smoke, check, node test 1/2, node test 2/2, changed-scope, protocol:check, secrets) pass.


@kiranvk-2011 kiranvk-2011 requested a review from a team as a code owner March 18, 2026 13:51
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 18, 2026
greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR addresses three related cooldown pain points in the auth-profile system: a stepped backoff formula (30 s → 1 min → 5 min), per-model cooldown scoping so a single 429 no longer blocks every model on a shared profile, and a structured FallbackSummaryError with a user-facing countdown. The approach is well-motivated and the implementation integrates cleanly with the existing auth-profile machinery.

Key changes reviewed:

  • calculateAuthProfileCooldownMs now uses a simple three-tier ladder instead of the previous 5^n exponential — tests updated accordingly.
  • isProfileInCooldown gains a forModel parameter with a guard that prevents the model-bypass from short-circuiting an active disabledUntil (billing/auth) — this correctly addresses the previous reviewer concern.
  • computeNextProfileUsageStats records cooldownReason/cooldownModel and widens the scope to profile-wide when a second model fails or when the reason is not rate_limit.
  • FallbackSummaryError / isPureTransientRateLimitSummary ensure the countdown UX fires only when all attempts are concretely rate_limit or overloaded — mixed-cause exhaustion correctly falls through to the generic error message.
  • Minor items flagged: buildCopilotCooldownMessage calls Date.now() twice (could produce a "~0s" countdown at the boundary); resolveFallbackSoonestCooldownExpiry lacks a try/catch around its synchronous file I/O; and an edge case in computeNextProfileUsageStats where an absent modelId during an active rate-limit window would preserve the existing model scope rather than widening it.

Confidence Score: 4/5

  • Safe to merge — all three stated goals are correctly implemented and well-tested; the remaining comments are minor style/robustness observations that do not affect correctness under normal operating conditions.
  • Core logic is sound: the model-scoped bypass, disabledUntil guard, and mixed-cause FallbackSummaryError filtering all work correctly. Previous reviewer issues have been addressed. The three flagged items are cosmetic (double Date.now()) and low-probability reliability concerns (loadAuthProfileStoreForRuntime not guarded, implicit cooldownModel preservation when modelId is absent), none of which affect the primary use-case. Test coverage is comprehensive (45 + 8 new tests).
  • src/agents/model-fallback.ts (the resolveFallbackSoonestCooldownExpiry I/O path) and src/auto-reply/reply/agent-runner-execution.ts (double Date.now() in buildCopilotCooldownMessage) are worth a second look before merge.

Comments Outside Diff (1)

  1. src/agents/model-fallback.ts, line 546-575 (link)

    P2 resolveFallbackSoonestCooldownExpiry can throw, silently bypassing FallbackSummaryError

    loadAuthProfileStoreForRuntime does synchronous file I/O. If the store file is absent, malformed, or temporarily locked by another process, it can throw. Because resolveFallbackSoonestCooldownExpiry is called inline as a parameter to throwFallbackFailureSummary, any such exception propagates out of runWithModelFallback before FallbackSummaryError is thrown. The downstream handler in agent-runner-execution.ts would then receive an unexpected I/O error: isFallbackSummaryError returns false, isRateLimitErrorMessage also returns false, and the user sees a raw filesystem error message in the "Agent failed before reply" text.

    Wrapping the call in a try/catch (returning null on failure) makes the fallback behaviour gracefully degrade to the generic rate-limit message without a countdown rather than surfacing an unrelated I/O error:

    ```ts
    function resolveFallbackSoonestCooldownExpiry(params: { ... }): number | null {
      if (!params.authStore) {
        return null;
      }
      try {
        const refreshedStore = loadAuthProfileStoreForRuntime(params.agentDir, {
          readOnly: true,
          allowKeychainPrompt: false,
        });
        // ... existing logic
        return getSoonestCooldownExpiry(refreshedStore, [...allProfileIds]);
      } catch {
        return null;
      }
    }
    ```
Inline comment on src/agents/auth-profiles/usage.ts, lines 508-519:
**`cooldownModel` preserved when `modelId` is absent on a `rate_limit` during an active window**

In the `existingCooldownActive` branch the three conditions are:
1. Different model → widen to `undefined`
2. Non-rate-limit reason → widen to `undefined`
3. Else → keep `params.existing.cooldownModel`

Condition 3 is reached when `params.reason === "rate_limit"` **and** either `params.modelId` or `params.existing.cooldownModel` is falsy. When `params.modelId` is `undefined` (i.e. the caller didn't pass a model) and `params.existing.cooldownModel` is already set to `"model-A"`, the existing model-scoped value is silently preserved — even though the new failure came from an unknown model and should arguably widen the scope.

All current call sites in this PR correctly thread `modelId`, so this is not a live bug today. But adding an explicit guard prevents a silent regression if a future call-site omits `modelId`:

```ts
} else if (params.reason === "rate_limit" && !params.modelId && params.existing.cooldownModel) {
  // Unknown originating model — conservatively widen scope so no model bypasses.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
}
```


Reviews (3): Last reviewed commit: "fix(agents): scope cooldowns per model"

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b5c264330

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 2b5c264 to 106d513 Compare March 18, 2026 14:08
@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 106d513919 (no suggestions beyond the standard review)

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 9451a27678

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 36ac8951f0

```ts
allProfileIds.add(id);
}
}
return getSoonestCooldownExpiry(authStore, [...allProfileIds]);
```


P2: Recompute cooldown expiry from fresh store before summary

soonestCooldownExpiry is computed from the authStore snapshot captured before the fallback attempts run. In the embedded runner path, attempts can persist new cooldowns via markAuthProfileFailure, but this snapshot is never refreshed, so the summary countdown can be null/stale (for example after the first in-run 429), producing incorrect retry timing in the user-facing rate-limit message.


@kiranvk-2011 (Contributor Author) replied:

Acknowledged — this is a valid observation about the snapshot staleness.

However, this is existing architecture, not introduced by this PR. The authStore in model-fallback.ts is a clone created via structuredClone() inside ensureAuthProfileStore() at the start of runWithModelFallback (line 555). The embedded runner in run.ts creates its own separate clone (line 423). Mutations from markAuthProfileFailure during attempts update the runner's clone and persist to disk via updateAuthProfileStoreWithLock, but the model-fallback.ts clone is never refreshed.

Impact is cosmetic only: The soonestCooldownExpiry computed at line 818 feeds into the user-facing error message countdown text (e.g. "retry in ~4m 30s"). Cooldown enforcement works correctly because each embedded runner call reads fresh state from disk via the lock-guarded store update. A stale countdown in the error message is a minor UX imperfection, not a correctness issue.

Fixing this properly would require refactoring the structuredClone snapshot pattern to either re-read the store after the attempt loop or use a shared mutable reference — both are broader architectural changes beyond this PR's scope. Happy to tackle that in a follow-up if maintainers prefer.

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 971693935a

@altaywtf altaywtf self-assigned this Mar 18, 2026

altaywtf commented Mar 18, 2026

I found one concrete merge blocker and one follow-up nit.

The blocker is that this branch changes calculateAuthProfileCooldownMs() to the stepped 30s -> 1m -> 5m ladder, but src/agents/auth-profiles.markauthprofilefailure.test.ts still asserts the old exponential contract. I verified it on the PR head with:

pnpm test -- src/agents/auth-profiles.markauthprofilefailure.test.ts

That fails at src/agents/auth-profiles.markauthprofilefailure.test.ts:271 with expected 30000 to be 60000, so the production change and targeted auth-profile test suite are currently out of sync.

@kiranvk-2011 (Contributor Author):

Thanks for catching this, @altaywtf! You're right — the markauthprofilefailure test suite was still asserting the old exponential 1m → 5m → 25m → 1h contract.

Fixed in b2ccb3d:

  • calculateAuthProfileCooldownMs assertions updated to the new stepped ladder: 30s → 1m → 5m (cap)
  • The "resets error count when previous cooldown has expired" test's upper bound tightened from 120_000 to 60_000 to match calculateAuthProfileCooldownMs(1) = 30_000
  • Comments updated throughout

Both test suites now pass:

pnpm test -- src/agents/auth-profiles.markauthprofilefailure.test.ts  # 9/9 passed
pnpm test -- src/agents/auth-profiles/usage.test.ts                  # 47/47 passed

You also mentioned a follow-up nit — happy to address that if you could share the details!

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 487deab3a5

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 487deab to 0016af2 Compare March 18, 2026 20:00
@kiranvk-2011 (Contributor Author):

CI Failure Analysis — All Failures Pre-existing / Unrelated

I've analyzed all 4 failing CI jobs on the latest run (23264393916). None involve files modified by this PR (auth-profiles/usage.ts, auth-profiles/types.ts, model-fallback.ts, pi-embedded-runner/run.ts, agent-runner-execution.ts, or their test files).

Job                                 Failure                                                                                              Related to PR?
checks (node, test, 1, 2)           Tlon submodule TypeScript errors (TS1360/TS2339/TS2307 in channelContentConfig.ts, postContent.ts, debug.ts, groupTemplates.ts)    No
checks (node, extensions)           llm-task-tool.test.ts — 2 schema validation test failures                                            No
checks-windows (node, test, 5, 6)   windows-acl.test.ts — 48/48 Windows ACL tests failed                                                 No
checks-windows (node, test, 3, 6)   bundle-mcp.test.ts — 2 path shortname mismatches (RUNNER~1 vs runneradmin)                           No

For reference, the latest upstream main CI runs (23264812040, 23264809135) are all concluding success. The failures on this branch may be due to submodule version drift between our rebase point and latest main, or transient CI runner environment issues.

All jobs that exercise code paths touched by this PR are passing: check (lint), checks (node, test, 2, 2), checks (node, contracts), checks (bun, test), build-smoke, install-smoke, and checks-windows (node, test, 4, 6).

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 0016af2 to bd401f7 Compare March 19, 2026 11:15
@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit bd401f7876


kiranvk-2011 commented Mar 19, 2026

CI Failure Analysis — All Pre-Existing on main

Rebased onto latest green main (c4a4050ce4, main CI run #23293514090). All CI failures on this PR are identical to failures on main itself — verified by comparing against main CI run #23287263443 on 009a10bce2 (the most recent main commit that ran the full test suite).

Failure Breakdown

  • check: TS errors in extensions/matrix/src/onboarding.ts, acp-spawn.test.ts, commands/channels/remove.ts, matrix-plugin-helper.test.ts. Our PR touches this? No; all in Matrix extension / channel commands
  • contracts: registry.contract.test.ts expects ['discord','feishu','matrix','telegram'] but gets ['discord','feishu','telegram'] (matrix missing). No; channel contract registry
  • channels: Discord: createAccountListHelpers is not a function, resolveThreadBindingIdleTimeoutMs is not a function; WhatsApp: updateLastRoute setter mock failure. No; Discord/WhatsApp extension tests
  • install-smoke: Cannot find module '@vector-im/matrix-bot-sdk/package.json'. No; Matrix extension install (also fails on main c4a4050ce4)
  • windows shards: same contracts + channel test failures as above. No

Our PR's failures are a strict subset of main's

  • Main (009a10bce2): 11 failing jobs (check, bun-test, channels, contracts, compat-node22, release-check, windows shards 1/3/4/5/6)
  • Our PR (6d58d2f381): 7 failing jobs (check, channels, contracts, windows shards 1/4/5/6) — all present in main's failure list
  • Our PR passes jobs that main fails: bun test, compat-node22, release-check

Timeline

  • main was green at 0443ee82be (run #23284150176)
  • f3097b4c09 (refactor: install optional channels for remove) introduced these test failures
  • c4a4050ce4 (fix(macos): align exec command parity) fixed CI infrastructure but did not touch the failing test files — it passed CI only because changed-scope skipped all test shards (no src/ files changed)
  • The Install Smoke workflow fails on all recent main commits including c4a4050ce4

This PR's scope

Our changes are limited to:

  • src/agents/auth-profiles/ (types, usage logic, tests)
  • src/agents/model-fallback.ts
  • src/agents/pi-embedded-runner/run.ts
  • src/auto-reply/reply/agent-runner-execution.ts
  • changelog/fragments/cooldown-per-model-stepped-backoff.md

None of the failing tests are in or related to these files. The test shard 2/2, extensions, protocol, build-smoke, bun test, and all boundary checks pass.

@kiranvk-2011 (Contributor Author):

Production Validation — 4 Days of A/B Comparison

I've been running this PR's code in a Docker deployment alongside production (stock main) since Mar 16. Both instances share the same GitHub Copilot token (same rate-limit pool) and serve real Telegram users. Here are the results:

Copilot Failure Rates

Date          Production (main)    PR #49834
2026-03-16    0% (6 calls)         92.8% (69 calls)¹
2026-03-17    100% (100 calls)     27.9% (43 calls)
2026-03-18    100% (16 calls)      42.1% (19 calls)
2026-03-19    100% (4 calls)       9.5% (42 calls)

¹ Mar 16 was the PR instance's first day — high failure rate due to initial cooldown cascade before the stepped backoff had a chance to stabilize.

What the Numbers Show

Production (main) enters a provider-wide cooldown cascade on the first 429 and stays locked out for 25-60 minutes (escalation formula: min(1h, 1m × 5^(n-1))). Once in this state, every subsequent request fails, each failure compounds the error counter, and the provider stays in cooldown until external intervention (watchdog restart). Result: 100% Copilot failure rate for 3 consecutive days.

PR instance (this code) uses per-model cooldown scoping + stepped backoff (30s → 1m → 5m cap). Gateway logs show the error counter consistently resetting between rate-limit windows:

2026-03-18T20:44:53 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-18T20:45:41 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-18T21:09:08 | errors: 0→1 | cooldown: 30s | reason: rate_limit
2026-03-19T06:52:56 | errors: 0→1 | cooldown: 30s | reason: rate_limit

Every rate-limit event starts fresh at errorCount: 0→1 with a 30-second cooldown — the cascade never compounds. After 30 seconds the model is retried and usually succeeds.

Model Isolation Working

A 429 on claude-haiku-4.5 (cron job) does not block claude-opus-4.6 (user request). The cooldownModel field ensures cooldown is scoped to the specific model that was rate-limited, not the entire provider. This is the core fix for issue #24158.

External Watchdog Interventions

Instance      Cascade alerts (4 days)    Proactive cooldown cleanups
Production    0 full cascades            4 today (stale cooldowns on disk)
PR #49834     2 cascade alerts           Normal — cooldowns created and expired naturally

Production's "0 cascade alerts" is misleading — the watchdog doesn't trigger because fallback providers absorb the failures. But Copilot is locked out 100% of the time, routing all traffic to fallback providers unnecessarily. The PR instance's 2 early cascade alerts (Mar 17-18) were during initial tuning; by Mar 19 the system is self-correcting with zero external intervention needed.

CI Note

All 6 failing CI jobs on this PR are pre-existing failures on main itself — verified by comparing against main CI run #23287263443 at the same commit (009a10bce2). The test shards containing our changes (test 1/2, test 2/2, bun) all pass ✅. See detailed CI analysis comment above.

@kiranvk-2011 kiranvk-2011 force-pushed the fix/per-model-cooldown-stepped-backoff branch from 16c4fa3 to 6d58d2f Compare March 19, 2026 11:55
@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch 2 times, most recently from d1764f1 to fd0e16c Compare March 24, 2026 18:10
@altaywtf (Member):

@greptileai review
@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit fd0e16c08a

@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from fd0e16c to 27b7573 Compare March 24, 2026 19:11
@openclaw-barnacle openclaw-barnacle bot removed the channel: discord Channel integration: discord label Mar 25, 2026
@altaywtf
Member

@greptileai review
@codex review

Comment on lines +87 to +93
```ts
  if (typeof expiry === "number" && expiry > Date.now()) {
    const secsLeft = Math.ceil((expiry - Date.now()) / 1000);
    if (secsLeft <= 60) {
      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
    }
    const minsLeft = Math.ceil(secsLeft / 60);
    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
```
Contributor


P2 Two Date.now() calls leave a window for secsLeft to be zero or negative

expiry > Date.now() and expiry - Date.now() are evaluated at separate instants. Under normal conditions the gap is negligible, but under a very loaded event loop the second call can return a value past expiry. Math.ceil rounds small negative fractions to 0, so the user would see "ready in ~0s" or, in extreme cases, a negative countdown like "ready in ~-1s".

Capture a single timestamp at the top of the function and reuse it:

Suggested change:

```diff
-  if (typeof expiry === "number" && expiry > Date.now()) {
-    const secsLeft = Math.ceil((expiry - Date.now()) / 1000);
-    if (secsLeft <= 60) {
-      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
-    }
-    const minsLeft = Math.ceil(secsLeft / 60);
-    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
+  const expiry = err.soonestCooldownExpiry;
+  const now = Date.now();
+  if (typeof expiry === "number" && expiry > now) {
+    const secsLeft = Math.ceil((expiry - now) / 1000);
+    if (secsLeft <= 60) {
+      return `⚠️ Rate-limited — ready in ~${secsLeft}s. Please wait a moment.`;
+    }
+    const minsLeft = Math.ceil(secsLeft / 60);
+    return `⚠️ Rate-limited — ready in ~${minsLeft} min. Please try again shortly.`;
+  }
```

Comment on lines +508 to +519
```ts
if (
  params.existing.cooldownModel &&
  params.modelId &&
  params.existing.cooldownModel !== params.modelId
) {
  updatedStats.cooldownModel = undefined;
} else if (params.reason !== "rate_limit") {
  // Non-rate-limit failures are profile-wide — clear model scope even
  // when the same model fails, so that no model can bypass.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
```
Contributor


P2 cooldownModel preserved when modelId is absent on a rate_limit during an active window

In the existingCooldownActive branch the three conditions are:

  1. Different model → widen to undefined
  2. Non-rate-limit reason → widen to undefined
  3. Else → keep params.existing.cooldownModel

Condition 3 is reached when params.reason === "rate_limit" and either params.modelId or params.existing.cooldownModel is falsy. When params.modelId is undefined (i.e. the caller didn't pass a model) and params.existing.cooldownModel is already set to "model-A", the existing model-scoped value is silently preserved — even though the new failure came from an unknown model and should arguably widen the scope.

All current call sites in this PR correctly thread modelId, so this is not a live bug today. But adding an explicit guard prevents a silent regression if a future call-site omits modelId:

```ts
} else if (params.reason === "rate_limit" && !params.modelId && params.existing.cooldownModel) {
  // Unknown originating model — conservatively widen scope so no model bypasses.
  updatedStats.cooldownModel = undefined;
} else {
  updatedStats.cooldownModel = params.existing.cooldownModel;
}
```


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bece623e3d


@altaywtf
Member

hello @kiranvk-2011 pushed some commits, would you mind having a look and re-testing in your setup? I can proceed with the merge once you give the green-light.

@kiranvk-2011
Contributor Author

Green Light — Production Validation Complete

@altaywtf I've rebuilt the test instance from source using your latest commits (up to aa1111b, "fix(agents): filter cooldown eta by model") and have 10 days of A/B data. Ready for merge from my side.

Stepped Backoff — Confirmed Working (Today's Live Data)

6 auth_profile_failure_state_updated events from today's gateway log on the source-built image:

| Time (UTC) | errors | cooldown | expected |
| --- | --- | --- | --- |
| 15:49:03 | 0→1 | 30s | 30s ✅ |
| 15:51:17 | 1→2 | 60s | 60s ✅ |
| 15:53:27 | 2→1 | 30s | 30s (reset, restarted) ✅ |
| 15:56:35 | 1→2 | 60s | 60s ✅ |
| 16:00:06 | 2→3 | 300s | 300s (5m cap) ✅ |
| 16:04:29 | 0→1 | 30s | 30s (reset) ✅ |

The 0→1 resets prove the error counter clears correctly between cooldown windows. No escalation beyond 5 minutes observed.
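For reference, the ladder these events exercise can be sketched as below. This is an illustrative reimplementation of the behavior observed in the log, not the exact code in `usage.ts`:

```typescript
// Stepped cooldown ladder: 1st error → 30s, 2nd → 60s, 3rd+ → 5m cap.
const COOLDOWN_LADDER_MS = [30_000, 60_000, 300_000];

function calculateCooldownMs(consecutiveErrors: number): number {
  // Clamp to the last rung so escalation never exceeds the 5-minute cap.
  const idx = Math.min(
    Math.max(consecutiveErrors, 1) - 1,
    COOLDOWN_LADDER_MS.length - 1,
  );
  return COOLDOWN_LADDER_MS[idx];
}
```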

10-Day A/B Comparison (Mar 16–25)

Both instances share the same Copilot ghu_ token (same rate-limit pool):

Production (stock main — exponential min(1h, 60s × 5^n)):

  • Copilot failure rate: 100% for 9 of 10 days (locked out by cascade)
  • Copilot calls: 146 total, 6 succeeded (4.1% success rate)
  • All traffic routed to bailian/qwen3.5-plus fallback
  • total_runs_failed = 0 (fallbacks absorbed everything)

PR #49834 (stepped 30s/1m/5m + per-model scope):

  • Copilot failure rate: variable, recovers between windows
  • Copilot calls: 316 total, 112 succeeded (35.4% success rate)
  • Copilot recovers to 0 errors after each 30s window expires
  • Mixed traffic: Copilot when available, bailian/glm-5 fallback when rate-limited
  • total_runs_failed = 0 every single day — zero user-facing failures

The key metric: production's exponential formula locked Copilot out permanently after the first cascade (Mar 17), while the stepped formula recovered and continued using Copilot whenever the rate limit cleared.

Fallback Chain — Working Correctly

5 successful fallback events today: github-copilot/claude-opus-4.6 → 429 → bailian/glm-5 → success. Per-model scoping preserved access to non-rate-limited models (e.g., haiku not blocked by opus 429s).
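The structured error the fallback chain surfaces might look roughly like this — a minimal sketch assuming the `soonestCooldownExpiry` field name from the review snippets above; the actual class in `model-fallback.ts` likely carries more detail:

```typescript
// Thrown when every candidate in the fallback chain is exhausted, so the
// caller can render a countdown instead of a generic "Agent failed" message.
class FallbackSummaryError extends Error {
  constructor(
    message: string,
    // Epoch ms at which the earliest active cooldown expires, if known.
    public readonly soonestCooldownExpiry?: number,
  ) {
    super(message);
    this.name = "FallbackSummaryError";
  }
}

// Type guard used by the error handler to pick the rate-limit branch.
function isFallbackSummaryError(err: unknown): err is FallbackSummaryError {
  return err instanceof FallbackSummaryError;
}
```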

Build

Source-built using the upstream monorepo Dockerfile (not sed patches). Container created today at 15:44 UTC with all your commits through aa1111b. Node 24.14.0, runtime v24.

Merge Conflict

The PR currently shows mergeStateStatus: DIRTY — needs a rebase onto latest main before merging.

Minor Notes (Non-Blocking)

Per Greptile's review (all cosmetic):

  • buildCopilotCooldownMessage calls Date.now() twice — could show "~0s" at a boundary. Cosmetic.
  • resolveFallbackSoonestCooldownExpiry — no try/catch around sync file I/O. Low probability.

TL;DR: 10 days of production A/B data confirm the stepped backoff eliminates the self-reinforcing cascade loop. Copilot utilization went from 4% (stock) to 35% (PR) on the same rate-limited token. Zero user-facing failures on both. Green light from me — just needs the rebase.

@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from aa1111b to 1f84da1 Compare March 25, 2026 18:27
kiranvk-2011 and others added 13 commits March 25, 2026 21:46
…mit message

Combines ideas from PRs openclaw#45113, openclaw#31962, and openclaw#45763 to address three
cooldown-related issues:

1. Stepped cooldown (30s → 1m → 5m cap) replaces the aggressive
   exponential formula (1m → 5m → 25m → 1h) that locked out providers
   for far longer than the actual API rate-limit window.

2. Per-model cooldown scoping: rate_limit cooldowns now record which
   model triggered them. When a different model on the same auth profile
   is requested, the cooldown is bypassed — so one model hitting a 429
   no longer blocks all other models on the same provider.

3. FallbackSummaryError with soonest-expiry countdown: when all
   candidates are exhausted, the user sees a clear message like
   '⚠️ Rate-limited — ready in ~28s' instead of a generic failure.

Files changed:
- types.ts: add cooldownReason/cooldownModel to ProfileUsageStats
- usage.ts: stepped formula, model-aware isProfileInCooldown, modelId
  threading through computeNextProfileUsageStats/markAuthProfileFailure
- model-fallback.ts: FallbackSummaryError class, model-aware availability
  check, soonestCooldownExpiry computation
- pi-embedded-runner/run.ts: thread modelId into failure recording
- agent-runner-execution.ts: buildCopilotCooldownMessage helper, rate-limit
  detection branch in error handler
- usage.test.ts: update expected cooldown value (60s → 30s)
…ormatting

- Update markAuthProfileCooldown JSDoc to reflect new stepped backoff (30s/1m/5m)
- Merge duplicate isFallbackSummaryError import into single import statement
- Run oxfmt on all changed files to fix formatting CI failure
…ndow

- When model A is cooling down and model B also fails, set cooldownModel
  to undefined so neither model bypasses via per-model scope
- Same-model retries preserve the original cooldownModel
- Add 8 new tests for per-model cooldown behavior: model-scoped bypass,
  profile-wide cooldown, billing-disable guard, scope-widening, same-model
  retry preservation
- Update .some() comment to document intentional design choice for mixed
  fallback failure reasons
… checks

- Add curly braces to single-line if/for bodies in usage.ts and
  model-fallback.ts to satisfy oxlint eslint(curly) rule
- Thread modelId into all 3 isProfileInCooldown calls in
  pi-embedded-runner/run.ts (lines 719, 746, 767) so the inner
  profile loop respects per-model cooldown scope — fixes Codex P1
  review comment about outer gate passing model-B while inner loop
  rejects it without model context
…own ladder

Update the auth-profiles.markauthprofilefailure test suite to match the
new stepped cooldown formula (30s → 1m → 5m cap) introduced in the
first commit. The test was still asserting the old exponential backoff
values (1m → 5m → 25m → 1h cap).

Changes:
- calculateAuthProfileCooldownMs assertions: 60s→30s, 5m→1m, 25m→5m,
  1h→5m cap
- 'resets error count when previous cooldown has expired' test: upper
  bound adjusted from 120s to 60s to match 30s base cooldown
- Comments updated to reflect the stepped ladder

Resolves merge-blocker review from @altaywtf.
@altaywtf altaywtf force-pushed the fix/per-model-cooldown-stepped-backoff branch from 1f84da1 to 7c488c0 Compare March 25, 2026 18:47
@altaywtf altaywtf merged commit 8440122 into openclaw:main Mar 25, 2026
40 checks passed
@altaywtf
Member

Merged via squash.

Thanks @kiranvk-2011!


Labels

agents (Agent runtime and tooling), size: L
