fix(agents): announce auto model fallback transitions by steipete · Pull Request #82676 · openclaw/openclaw

steipete · 2026-05-16T17:21:43Z

Summary

Fixes #82544 (Session model pinning persists indefinitely after fallback, never reconciles with configured primary; confirmed across two independent occurrences, v2026.4.15 and v2026.5.6) by making auto model fallback state changes visible in chat: users now see one notice when OpenClaw falls back from the selected model and one notice when it returns to the selected model.
Keeps the current-main primary retry/probe behavior, but changes the cooldown to five minutes so a bad default model does not slow every turn forever while still recovering quickly.
Keeps fallback notices out of ACP transcript/TTS/pending assistant content paths; the notices are delivery/status messages, not model output.
Updates the model failover docs, model selection docs, and changelog. This supersedes #73790 (fix(agent): do not persist fallback model as session runtime model) and replaces the duplicated idea from #82661 (fix(agents): retry primary after auto model fallback).
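
The ACP transcript/TTS exclusion described in the summary can be pictured as a filter over reply payloads. This is a minimal sketch under assumptions, not the PR's code: the `ReplyPayload` shape and the `payloadsForTranscript` helper are hypothetical, and only the `isFallbackNotice` flag name appears elsewhere in this thread.

```typescript
// Hypothetical payload shape; only `isFallbackNotice` is named in this thread.
interface ReplyPayload {
  text: string;
  isFallbackNotice?: boolean;
}

// Return only model-authored payloads, i.e. what a transcript or TTS path
// would consume. Status notices are still delivered to chat by the caller.
function payloadsForTranscript(payloads: readonly ReplyPayload[]): ReplyPayload[] {
  return payloads.filter((payload) => payload.isFallbackNotice !== true);
}

const delivered: ReplyPayload[] = [
  { text: "↪️ Model Fallback: <fallback> (selected <primary>; <reason>)", isFallbackNotice: true },
  { text: "Here is the answer." },
];

// The chat still receives both payloads; the transcript sees only model output.
const transcript = payloadsForTranscript(delivered);
console.log(transcript.map((payload) => payload.text));
```

Filtering at the consumer keeps the notice a delivery concern: the payload list stays intact for chat delivery while transcript and speech paths opt out.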

Behavior addressed

Auto fallback was effectively sticky and mostly invisible: after a selected/default model failed, OpenClaw could keep using the fallback model without a clear per-session chat notice. The PR makes fallback/recovery visible and probes the selected model again every five minutes while auto fallback is pinned.
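
The five-minute probe cadence can be sketched as a simple cooldown check. The names below (`AutoFallbackPin`, `isPrimaryProbeDue`, `lastProbeAt`) are illustrative stand-ins, and treating a missing timestamp as "probe now" is an assumption; the real helper in the codebase may differ.

```typescript
// Five-minute cooldown, as stated in the PR description.
const PRIMARY_PROBE_COOLDOWN_MS = 5 * 60 * 1000;

// Hypothetical slice of session state while auto fallback is pinned.
interface AutoFallbackPin {
  pinned: boolean;
  lastProbeAt?: number; // epoch milliseconds of the last primary attempt
}

// True when the selected (primary) model should be tried again this turn.
function isPrimaryProbeDue(pin: AutoFallbackPin, now: number): boolean {
  if (!pin.pinned) return false; // no fallback pin, nothing to recover from
  if (pin.lastProbeAt === undefined) return true; // assumption: never probed means probe now
  return now - pin.lastProbeAt >= PRIMARY_PROBE_COOLDOWN_MS;
}

const pin: AutoFallbackPin = { pinned: true, lastProbeAt: 0 };
console.log(isPrimaryProbeDue(pin, 4 * 60 * 1000)); // false: still inside the cooldown
console.log(isPrimaryProbeDue(pin, 5 * 60 * 1000)); // true: cooldown elapsed
```

Between probes, turns go straight to the pinned fallback, which is what keeps a persistently bad default model from slowing every turn.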

Real environment tested

Local OpenClaw source checkout on macOS, rebased onto current origin/main.

Exact steps or command run after this patch

pnpm docs:list
OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=180000 OPENCLAW_TEST_PROJECTS_SERIAL=1 node scripts/run-vitest.mjs src/agents/agent-scope.test.ts src/auto-reply/reply/reply-utils.test.ts src/auto-reply/reply/dispatch-acp-delivery.test.ts src/auto-reply/reply/dispatch-from-config.test.ts
OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=180000 OPENCLAW_TEST_PROJECTS_SERIAL=1 node scripts/run-vitest.mjs run --config test/vitest/vitest.e2e.config.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts
pnpm tsgo:core
git diff --check origin/main...HEAD
codex-review --mode branch --base origin/main

Evidence after fix

Focused reply/model fallback tests passed: 5 files, 282 tests.
Agent runner fallback E2E passed: 1 file, 49 tests.
ACP delivery regression passed again after the transcript exclusion fix: 1 file, 24 tests.
Core typecheck passed.
Codex review reported no actionable correctness issues after the ACP transcript fix.

Observed result after fix

When auto fallback changes model state, OpenClaw emits visible status notices once per transition:

fallback: ↪️ Model Fallback: <fallback> (selected <primary>; <reason>)
recovery: ↪️ Model Fallback cleared: <primary> (was <fallback>)

The fallback notice bypasses source-reply suppression, does not consume the first assistant reply slot, and is excluded from ACP transcript accumulation and TTS.
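
As a rough illustration of "once per transition", the sketch below derives a transition from the previous and current active model and renders the two templates quoted above. The `FallbackTransition` type, both function names, and the model identifiers are hypothetical; treating a fallback-to-fallback change as a new fallback notice is also an assumption.

```typescript
type FallbackTransition =
  | { kind: "fallback"; primary: string; fallback: string; reason: string }
  | { kind: "recovery"; primary: string; fallback: string };

// Render the two notice templates quoted in the PR description.
function formatFallbackNotice(t: FallbackTransition): string {
  return t.kind === "fallback"
    ? `↪️ Model Fallback: ${t.fallback} (selected ${t.primary}; ${t.reason})`
    : `↪️ Model Fallback cleared: ${t.primary} (was ${t.fallback})`;
}

// Produce a transition only when the active model changed between turns,
// so steady-state turns on the fallback model stay silent.
function detectTransition(
  primary: string,
  previousActive: string,
  currentActive: string,
  reason: string,
): FallbackTransition | undefined {
  if (previousActive === currentActive) return undefined; // no state change, no notice
  if (currentActive === primary) {
    return { kind: "recovery", primary, fallback: previousActive };
  }
  // Assumption: any move to a non-primary model is announced as a fallback.
  return { kind: "fallback", primary, fallback: currentActive, reason };
}

const entered = detectTransition("p/main", "p/main", "p/backup", "rate limit");
const steady = detectTransition("p/main", "p/backup", "p/backup", "rate limit");
const cleared = detectTransition("p/main", "p/backup", "p/main", "");
console.log(entered && formatFallbackNotice(entered));
console.log(steady); // undefined: no notice while the state is unchanged
console.log(cleared && formatFallbackNotice(cleared));
```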

What was not tested

No live provider outage was forced against production credentials; the behavior is covered by focused unit/E2E tests with simulated model failure and fallback recovery.

clawsweeper · 2026-05-16T17:24:45Z

Codex review: needs changes before merge.

Summary
The PR makes auto model fallback and recovery transitions visible as status reply payloads, changes the primary probe interval to five minutes, excludes fallback notices from TTS/ACP transcript paths, and updates tests, docs, and changelog.

Reproducibility: yes, from source. The PR head builds fallback notices only in runReplyAgent, while createFollowupRunner still performs model fallback and sends followup payloads without any fallback notice construction. I did not run a failing test in this read-only review.

Real behavior proof
Override: A maintainer applied proof: override for this PR.

Next step before merge
A focused repair can extend fallback/recovery notices to queued followup delivery and add regression coverage; maintainers still decide promotion and merge because the PR carries a protected label.

Security
Cleared: The diff changes TypeScript reply/model-selection behavior, tests, docs, and changelog only; I found no concrete dependency, CI, secret-handling, install, or supply-chain concern.

Review findings

[P2] Cover queued followup fallback notices — src/auto-reply/reply/agent-runner.ts:1659

Review details

Best possible solution:

Keep the merged recovery behavior from #82707 and land fallback/recovery notices after the same notice contract covers normal replies, side-effect-only replies, ACP/TTS exclusions, and queued followup delivery.

Do we have a high-confidence way to reproduce the issue?

Yes, from source: the PR head builds fallback notices only in runReplyAgent, while createFollowupRunner still performs model fallback and sends followup payloads without any fallback notice construction. I did not run a failing test in this read-only review.

Is this the best way to solve the issue?

No, not as patched. The notice approach fits the merged primary-probe recovery baseline, but it should be shared with or mirrored in the queued followup path before merge.

Full review comments:

[P2] Cover queued followup fallback notices — src/auto-reply/reply/agent-runner.ts:1659
The new fallback/recovery notices are only constructed in runReplyAgent. Queued followup turns still call runWithModelFallback in createFollowupRunner and send finalPayloads directly, so a fallback or recovery during a queued followup remains silent even though the docs now promise one notice per state change. Share this notice construction with the followup path or mirror it there before merging.
Confidence: 0.84
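
One way to read "share this notice construction" is a single helper that both delivery paths call. The sketch below is hypothetical throughout: `withFallbackNotice`, `OutboundPayload`, and the two `deliver*` stand-ins are not taken from the codebase, and the real runReplyAgent and createFollowupRunner do far more than this.

```typescript
interface OutboundPayload {
  text: string;
  isFallbackNotice?: boolean;
}

// Hypothetical shared helper: prepend a status notice, if any, to the
// payloads a turn is about to send. Returns a new array in both cases.
function withFallbackNotice(
  finalPayloads: readonly OutboundPayload[],
  noticeText: string | undefined,
): OutboundPayload[] {
  if (noticeText === undefined) return [...finalPayloads];
  return [{ text: noticeText, isFallbackNotice: true }, ...finalPayloads];
}

// Stand-ins for the two paths named in the review. Routing both through the
// same helper means a queued followup cannot stay silent on a transition.
function deliverDirectReply(payloads: readonly OutboundPayload[], notice?: string): OutboundPayload[] {
  return withFallbackNotice(payloads, notice);
}

function deliverQueuedFollowup(payloads: readonly OutboundPayload[], notice?: string): OutboundPayload[] {
  return withFallbackNotice(payloads, notice);
}

const reply: OutboundPayload[] = [{ text: "done" }];
const notice = "↪️ Model Fallback cleared: <primary> (was <fallback>)";
console.log(deliverDirectReply(reply, notice).length);
console.log(deliverQueuedFollowup(reply, notice).length);
```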

Overall correctness: patch is incorrect
Overall confidence: 0.82

Acceptance criteria:

node scripts/run-vitest.mjs src/auto-reply/reply/followup-runner.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts src/auto-reply/reply/reply-utils.test.ts src/auto-reply/reply/dispatch-acp-delivery.test.ts src/auto-reply/reply/dispatch-from-config.test.ts
pnpm tsgo:core
git diff --check

What I checked:

Protected live PR context: Provided GitHub context shows this PR is open at head 0441b24 with protected maintainer label, proof override, Telegram-visible-proof label, and P1 priority. (0441b24ae9e3)
Current main baseline: Current main already contains the merged primary-probe recovery implementation from #82707 (fix(agents): probe primary after auto fallback pin). (src/agents/agent-scope.ts:49, c8782d18ebd4)
PR direct reply notice implementation: The PR constructs fallback notice payloads inside runReplyAgent and prepends them to the normal reply payload list before threading/dedupe handling. (src/auto-reply/reply/agent-runner.ts:1659, 0441b24ae9e3)
Queued followup gap: At the PR head, createFollowupRunner still runs model fallback and then sends finalPayloads directly without resolving fallback transitions or prepending isFallbackNotice payloads. (src/auto-reply/reply/followup-runner.ts:356, 0441b24ae9e3)
Telegram review context: Scoped maintainer notes say Telegram-visible behavior changes need real Telegram proof; this PR is marked for Telegram-visible proof and also has proof override. (.agents/maintainer-notes/telegram.md:44, 532e42213d03)

Likely related people:

@steipete: Authored the merged current-main primary-probe recovery work that this PR builds on and is the current PR author with prior merged history in the same files. (role: recent feature contributor; confidence: high; commits: c8782d18ebd4; files: src/agents/agent-scope.ts, src/auto-reply/reply/followup-runner.ts)
@vincentkoc: Blame on the current fallback transition/notice helpers and runReplyAgent fallback notice state points to recent work in this area. (role: recent area contributor; confidence: medium; commits: e06782d5e747; files: src/auto-reply/reply/agent-runner.ts, src/auto-reply/fallback-state.ts)

Remaining risk / open question:

I did not run tests in this read-only review; the blocker is from source inspection of the PR head.
The PR has proof override, so no live provider outage or Telegram recording was required for this review.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 532e42213d03.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0302eba4f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T17:35:23Z

+  const shouldClearDirectAutoFallbackOverride =
+    hasDirectAutoFallbackOverride &&
+    params.clearDirectAutoFallbackOverride !== false &&
+    !hasOneTurnModelOverride;


Preserve read-only status lookups from clearing auto overrides

createModelSelectionState now clears direct auto fallback overrides unless callers explicitly pass clearDirectAutoFallbackOverride: false. That default is safe for real reply runs, but read-only callers like resolveDirectStatusReplyForSession (in src/plugin-sdk/command-status.runtime.ts) call this helper without that flag, so a /status-style inspection can silently persist a session mutation and drop the active fallback model. This changes a read operation into a write and can unexpectedly switch subsequent turns back to primary even when no reply turn occurred.
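
The effect of the default can be seen by evaluating the quoted condition for a reply run and for a status lookup. This is a sketch only: in the diff the first and last operands are local variables, whereas here they are passed in through a hypothetical `SelectionParams` object to keep the example self-contained.

```typescript
// Hypothetical parameter bag; only `clearDirectAutoFallbackOverride`
// is a real parameter name from the quoted diff.
interface SelectionParams {
  hasDirectAutoFallbackOverride: boolean;
  hasOneTurnModelOverride: boolean;
  clearDirectAutoFallbackOverride?: boolean;
}

// Mirrors the boolean expression in the quoted diff: clearing is the
// default, and callers must pass `false` explicitly to opt out.
function shouldClearOverride(params: SelectionParams): boolean {
  return (
    params.hasDirectAutoFallbackOverride &&
    params.clearDirectAutoFallbackOverride !== false &&
    !params.hasOneTurnModelOverride
  );
}

// A real reply run omits the flag, so the override is cleared.
const replyRun = shouldClearOverride({
  hasDirectAutoFallbackOverride: true,
  hasOneTurnModelOverride: false,
});

// A read-only status lookup would need to opt out to avoid mutating the session.
const statusLookup = shouldClearOverride({
  hasDirectAutoFallbackOverride: true,
  hasOneTurnModelOverride: false,
  clearDirectAutoFallbackOverride: false,
});

console.log({ replyRun, statusLookup });
```

Because the opt-out is explicit, any read-only caller that forgets the flag inherits the write behavior, which is the hazard the review comment describes.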

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d75195817f

chatgpt-codex-connector · 2026-05-16T18:42:38Z

    modelOverrideFallbackOriginProvider: params.originProvider,
    modelOverrideFallbackOriginModel: params.originModel,
+    modelOverrideFallbackLastProbeAt: params.now,
    authProfileOverride: params.authProfileId,


Keep primary-probe timestamp tied to actual primary attempts

modelOverrideFallbackLastProbeAt is set unconditionally whenever fallback persistence writes a new auto override, even when the run started from an already-fallback model and only moved to a different fallback candidate. In that case resolveFallbackSelectionOrigin keeps the original primary in modelOverrideFallbackOrigin*, but this new timestamp still refreshes the cooldown used by shouldProbeAutoFallbackPrimary, so repeated fallback-to-fallback transitions can keep postponing primary recovery probes without ever retrying the primary. This makes primary recovery slower or effectively starved under unstable fallback chains.
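
A possible shape for the suggested fix is to refresh the probe timestamp only when the run actually attempted the primary. The sketch below uses shortened, hypothetical names (`FallbackPinState`, `persistFallback`, `attemptedPrimary`) rather than the real session fields, and is not the change that landed.

```typescript
// Hypothetical, shortened stand-in for the persisted fallback fields.
interface FallbackPinState {
  originModel: string; // the primary the session originally fell back from
  activeModel: string; // the fallback currently in use
  lastProbeAt: number | undefined; // epoch ms of the last real primary attempt
}

interface PersistParams {
  originModel: string;
  fallbackModel: string;
  attemptedPrimary: boolean; // did this run try the primary before falling back?
  now: number;
}

// Persist a fallback pin. The origin is kept across fallback-to-fallback
// moves, and the cooldown clock restarts only after a real primary attempt.
function persistFallback(prev: FallbackPinState | undefined, params: PersistParams): FallbackPinState {
  return {
    originModel: prev?.originModel ?? params.originModel,
    activeModel: params.fallbackModel,
    lastProbeAt: params.attemptedPrimary ? params.now : prev?.lastProbeAt,
  };
}

// First fallback: the primary was tried and failed at t=1000.
const first = persistFallback(undefined, {
  originModel: "p/main", fallbackModel: "p/backup-a", attemptedPrimary: true, now: 1000,
});

// Later move between fallbacks: the primary was not tried, so the clock is untouched.
const second = persistFallback(first, {
  originModel: "p/backup-a", fallbackModel: "p/backup-b", attemptedPrimary: false, now: 2000,
});

console.log(second);
```

With the timestamp tied to primary attempts, an unstable fallback chain can no longer keep pushing the next primary probe into the future.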

steipete · 2026-05-16T20:57:37Z

Landed on main as 4b0f16d.

Gate:

pnpm docs:list
focused fallback/reply Vitest: 5 files, 282 tests
agent-runner fallback E2E: 1 file, 49 tests
ACP delivery regression rerun: 1 file, 24 tests
pnpm tsgo:core
git diff --check origin/main...HEAD
codex-review --mode branch --base origin/main
GitHub PR checks green, including Real behavior proof

openclaw-barnacle Bot added the docs (Improvements or additions to documentation), agents (Agent runtime and tooling), size: M, and maintainer (Maintainer-authored PR) labels May 16, 2026

This was referenced May 16, 2026

fix(agent): do not persist fallback model as session runtime model #73790

Closed

fix(agents): retry primary after auto model fallback #82661

Closed

clawsweeper Bot added the P1 label (High-priority user-facing bug, regression, or broken workflow) May 16, 2026

steipete force-pushed the fix/session-auto-fallback-retry-primary branch from 6711684 to 0302eba May 16, 2026 17:31

steipete added the proof: override label (Maintainer override for the external PR real behavior proof gate) May 16, 2026

chatgpt-codex-connector Bot reviewed May 16, 2026

openclaw-barnacle Bot added size: L and removed size: M labels May 16, 2026

steipete changed the title ~~fix(agents): retry primary after auto model fallback~~ fix(agents): probe primary after auto model fallback May 16, 2026

chatgpt-codex-connector Bot reviewed May 16, 2026

openclaw-barnacle Bot added size: XL and removed size: L labels May 16, 2026

clawsweeper Bot added the mantis: telegram-visible-proof label (Mantis should capture Telegram visible proof) May 16, 2026

steipete added 4 commits May 16, 2026 21:49

fix(agents): announce model fallback transitions

5d17817

docs(agents): explain model fallback notices

447e01e

# Conflicts: # docs/concepts/model-failover.md

fix(agents): use five minute fallback probe cadence

b976856

fix(agents): keep fallback notices out of ACP transcripts

0441b24

steipete force-pushed the fix/session-auto-fallback-retry-primary branch from 1a9b17a to 0441b24 May 16, 2026 20:49

openclaw-barnacle Bot added size: L and removed size: XL labels May 16, 2026

steipete changed the title ~~fix(agents): probe primary after auto model fallback~~ fix(agents): announce auto model fallback transitions May 16, 2026

steipete merged commit 4b0f16d into main May 16, 2026
121 of 125 checks passed

steipete deleted the fix/session-auto-fallback-retry-primary branch May 16, 2026 20:56

steipete mentioned this pull request May 16, 2026

Session model pinning persists indefinitely after fallback, never reconciles with configured primary — confirmed across two independent occurrences (v2026.4.15 and v2026.5.6) #82544

Closed

clawsweeper Bot mentioned this pull request May 16, 2026

[Feature]: Bidirectional fallback notifications — alert on switch AND on recovery, per-session/per-topic #44941

Closed

github-actions Bot mentioned this pull request May 16, 2026

📡 Upstream Digest — 2026-05-16 22:37 UTC curtismercier/openclaw-mods#876

Open

clawsweeper Bot mentioned this pull request May 17, 2026

Feature Request: Proactive user notification on LLM failover/API errors #8130

Closed

steipete merged 4 commits into main from fix/session-auto-fallback-retry-primary
