fix(agents): announce auto model fallback transitions#82676

Merged
steipete merged 4 commits into main from fix/session-auto-fallback-retry-primary
May 16, 2026

Conversation

@steipete
Contributor

@steipete steipete commented May 16, 2026

Summary

Behavior addressed

Auto fallback was effectively sticky and mostly invisible: after a selected/default model failed, OpenClaw could keep using the fallback model without a clear per-session chat notice. The PR makes fallback/recovery visible and probes the selected model again every five minutes while auto fallback is pinned.

Real environment tested

Local OpenClaw source checkout on macOS, rebased onto current origin/main.

Exact steps or command run after this patch

  • pnpm docs:list
  • OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=180000 OPENCLAW_TEST_PROJECTS_SERIAL=1 node scripts/run-vitest.mjs src/agents/agent-scope.test.ts src/auto-reply/reply/reply-utils.test.ts src/auto-reply/reply/dispatch-acp-delivery.test.ts src/auto-reply/reply/dispatch-from-config.test.ts
  • OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=180000 OPENCLAW_TEST_PROJECTS_SERIAL=1 node scripts/run-vitest.mjs run --config test/vitest/vitest.e2e.config.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts
  • pnpm tsgo:core
  • git diff --check origin/main...HEAD
  • codex-review --mode branch --base origin/main

Evidence after fix

  • Focused reply/model fallback tests passed: 5 files, 282 tests.
  • Agent runner fallback E2E passed: 1 file, 49 tests.
  • ACP delivery regression passed again after the transcript exclusion fix: 1 file, 24 tests.
  • Core typecheck passed.
  • Codex review reported no actionable correctness issues after the ACP transcript fix.

Observed result after fix

When auto fallback changes model state, OpenClaw emits visible status notices once per transition:

  • fallback: ↪️ Model Fallback: <fallback> (selected <primary>; <reason>)
  • recovery: ↪️ Model Fallback cleared: <primary> (was <fallback>)

The fallback notice bypasses source-reply suppression, does not consume the first assistant reply slot, and is excluded from ACP transcript accumulation and TTS.
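The "once per transition" rule can be sketched as a pure function over the previous and current model state. This is only an illustration under stated assumptions: the `FallbackState` type, the `buildFallbackNotice` helper, and the `"unknown reason"` default are hypothetical names introduced here, not OpenClaw's actual API. Only the two notice formats come from the description above.

```typescript
// Hypothetical shape; field names are assumptions for illustration only.
type FallbackState = {
  selectedModel: string; // the selected/default (primary) model
  activeModel: string;   // the model actually used for this turn
  reason?: string;       // short failure reason, if known
};

// Returns a notice only when the fallback state changes, otherwise undefined.
function buildFallbackNotice(
  prev: FallbackState | undefined,
  next: FallbackState,
): string | undefined {
  // Name of the fallback model in each state, or undefined when on the primary.
  const prevFallback =
    prev && prev.activeModel !== prev.selectedModel ? prev.activeModel : undefined;
  const nextFallback =
    next.activeModel !== next.selectedModel ? next.activeModel : undefined;

  if (nextFallback !== undefined && nextFallback !== prevFallback) {
    // Entered fallback, or moved to a different fallback candidate.
    const reason = next.reason ?? "unknown reason"; // assumed default
    return `↪️ Model Fallback: ${nextFallback} (selected ${next.selectedModel}; ${reason})`;
  }
  if (nextFallback === undefined && prevFallback !== undefined) {
    // Recovered: the primary is active again.
    return `↪️ Model Fallback cleared: ${next.activeModel} (was ${prevFallback})`;
  }
  return undefined; // No state change, so no notice.
}
```

Because the helper is a pure function of the two states, it could in principle be called from more than one reply path without duplicating the formatting logic.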

What was not tested

No live provider outage was forced against production credentials; the behavior is covered by focused unit/E2E tests with simulated model failure and fallback recovery.

@openclaw-barnacle openclaw-barnacle Bot added the docs (Improvements or additions to documentation), agents (Agent runtime and tooling), size: M, and maintainer (Maintainer-authored PR) labels May 16, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 16, 2026

Codex review: needs changes before merge.

Summary
The PR makes auto model fallback and recovery transitions visible as status reply payloads, changes the primary probe interval to five minutes, excludes fallback notices from TTS/ACP transcript paths, and updates tests, docs, and changelog.

Reproducibility: yes, from source: the PR head builds fallback notices only in runReplyAgent, while createFollowupRunner still performs model fallback and sends followup payloads without any fallback notice construction. I did not run a failing test in this read-only review.

Real behavior proof
Override: A maintainer applied proof: override for this PR.

Next step before merge
A focused repair can extend fallback/recovery notices to queued followup delivery and add regression coverage; maintainers still decide promotion and merge because the PR carries a protected label.

Security
Cleared: The diff changes TypeScript reply/model-selection behavior, tests, docs, and changelog only; I found no concrete dependency, CI, secret-handling, install, or supply-chain concern.

Review findings

  • [P2] Cover queued followup fallback notices — src/auto-reply/reply/agent-runner.ts:1659
Review details

Best possible solution:

Keep the merged recovery behavior from #82707 and land fallback/recovery notices after the same notice contract covers normal replies, side-effect-only replies, ACP/TTS exclusions, and queued followup delivery.

Do we have a high-confidence way to reproduce the issue?

Yes, from source: the PR head builds fallback notices only in runReplyAgent, while createFollowupRunner still performs model fallback and sends followup payloads without any fallback notice construction. I did not run a failing test in this read-only review.

Is this the best way to solve the issue?

No, not as patched. The notice approach fits the merged primary-probe recovery baseline, but it should be shared with or mirrored in the queued followup path before merge.

Full review comments:

  • [P2] Cover queued followup fallback notices — src/auto-reply/reply/agent-runner.ts:1659
    The new fallback/recovery notices are only constructed in runReplyAgent. Queued followup turns still call runWithModelFallback in createFollowupRunner and send finalPayloads directly, so a fallback or recovery during a queued followup remains silent even though the docs now promise one notice per state change. Share this notice construction with the followup path or mirror it there before merging.
    Confidence: 0.84

Overall correctness: patch is incorrect
Overall confidence: 0.82

Acceptance criteria:

  • node scripts/run-vitest.mjs src/auto-reply/reply/followup-runner.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts src/auto-reply/reply/reply-utils.test.ts src/auto-reply/reply/dispatch-acp-delivery.test.ts src/auto-reply/reply/dispatch-from-config.test.ts
  • pnpm tsgo:core
  • git diff --check

What I checked:

Likely related people:

  • @steipete: Authored the merged current-main primary-probe recovery work that this PR builds on and is the current PR author with prior merged history in the same files. (role: recent feature contributor; confidence: high; commits: c8782d18ebd4; files: src/agents/agent-scope.ts, src/auto-reply/reply/followup-runner.ts)
  • @vincentkoc: Blame on the current fallback transition/notice helpers and runReplyAgent fallback notice state points to recent work in this area. (role: recent area contributor; confidence: medium; commits: e06782d5e747; files: src/auto-reply/reply/agent-runner.ts, src/auto-reply/fallback-state.ts)

Remaining risk / open question:

  • I did not run tests in this read-only review; the blocker is from source inspection of the PR head.
  • The PR has proof override, so no live provider outage or Telegram recording was required for this review.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 532e42213d03.

@clawsweeper clawsweeper Bot added the P1 High-priority user-facing bug, regression, or broken workflow. label May 16, 2026
@steipete steipete force-pushed the fix/session-auto-fallback-retry-primary branch from 6711684 to 0302eba May 16, 2026 17:31
@steipete steipete added the proof: override Maintainer override for the external PR real behavior proof gate. label May 16, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0302eba4f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/auto-reply/reply/model-selection.ts Outdated
Comment on lines +185 to +188
const shouldClearDirectAutoFallbackOverride =
  hasDirectAutoFallbackOverride &&
  params.clearDirectAutoFallbackOverride !== false &&
  !hasOneTurnModelOverride;

[P1] Preserve read-only status lookups from clearing auto overrides

createModelSelectionState now clears direct auto fallback overrides unless callers explicitly pass clearDirectAutoFallbackOverride: false. That default is safe for real reply runs, but read-only callers like resolveDirectStatusReplyForSession (in src/plugin-sdk/command-status.runtime.ts) call this helper without that flag, so a /status-style inspection can silently persist a session mutation and drop the active fallback model. This changes a read operation into a write and can unexpectedly switch subsequent turns back to primary even when no reply turn occurred.
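The read-versus-write concern can be illustrated with a simplified sketch. Everything here is an assumption for illustration: the `SessionEntry` shape and the `selectModel` helper are hypothetical stand-ins, and the one-turn override check from the diff is omitted. Only the `clearDirectAutoFallbackOverride` option name is taken from the snippet under review.

```typescript
// Hypothetical session shape; not the real OpenClaw session entry.
type SessionEntry = {
  modelOverride?: string;
  modelOverrideSource?: "auto" | "user";
};

type SelectOptions = { clearDirectAutoFallbackOverride?: boolean };

// Resolves the model for a session. By default an auto fallback override is
// cleared (a write); callers that only inspect state opt out explicitly.
function selectModel(
  session: SessionEntry,
  primary: string,
  opts: SelectOptions = {},
): string {
  const hasAutoOverride =
    session.modelOverrideSource === "auto" && session.modelOverride !== undefined;
  const shouldClear =
    hasAutoOverride && opts.clearDirectAutoFallbackOverride !== false;

  if (shouldClear) {
    // Write path: drop the pinned fallback so this turn uses the primary.
    delete session.modelOverride;
    delete session.modelOverrideSource;
    return primary;
  }
  return session.modelOverride ?? primary;
}

// A reply run uses the default and mutates the session.
const replySession: SessionEntry = { modelOverride: "backup", modelOverrideSource: "auto" };
const replyModel = selectModel(replySession, "primary");

// A status-style lookup passes the flag and leaves the session untouched.
const statusSession: SessionEntry = { modelOverride: "backup", modelOverrideSource: "auto" };
const statusModel = selectModel(statusSession, "primary", {
  clearDirectAutoFallbackOverride: false,
});
```

In this sketch the default is the mutating behavior, which is why a read-only caller that forgets the flag would silently drop the fallback, as the comment describes.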

@steipete steipete changed the title from "fix(agents): retry primary after auto model fallback" to "fix(agents): probe primary after auto model fallback" May 16, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d75195817f

Comment on lines 258 to 261
modelOverrideFallbackOriginProvider: params.originProvider,
modelOverrideFallbackOriginModel: params.originModel,
modelOverrideFallbackLastProbeAt: params.now,
authProfileOverride: params.authProfileId,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Keep primary-probe timestamp tied to actual primary attempts

modelOverrideFallbackLastProbeAt is set unconditionally whenever fallback persistence writes a new auto override, even when the run started from an already-fallback model and only moved to a different fallback candidate. In that case resolveFallbackSelectionOrigin keeps the original primary in modelOverrideFallbackOrigin*, but this new timestamp still refreshes the cooldown used by shouldProbeAutoFallbackPrimary, so repeated fallback-to-fallback transitions can keep postponing primary recovery probes without ever retrying the primary. This makes primary recovery slower or effectively starved under unstable fallback chains.
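One way to picture the suggested behavior is to refresh the probe timestamp only when the primary was actually attempted. The sketch below is a minimal illustration, not the merged fix: `AutoFallbackPin`, `persistFallback`, `shouldProbePrimary`, and the `primaryAttempted` flag are hypothetical names, and the five-minute interval is taken from the PR summary.

```typescript
// Five-minute probe interval, as described in the PR summary.
const PRIMARY_PROBE_INTERVAL_MS = 5 * 60 * 1000;

// Hypothetical persisted state for a pinned auto fallback.
type AutoFallbackPin = {
  originModel: string;   // the primary that originally failed
  fallbackModel: string; // the fallback currently in use
  lastProbeAt: number;   // epoch ms of the last real primary attempt
};

type PersistParams = {
  originModel: string;
  fallbackModel: string;
  now: number;
  primaryAttempted: boolean; // assumed flag: did this run try the primary?
};

function shouldProbePrimary(pin: AutoFallbackPin, now: number): boolean {
  return now - pin.lastProbeAt >= PRIMARY_PROBE_INTERVAL_MS;
}

function persistFallback(
  pin: AutoFallbackPin | undefined,
  params: PersistParams,
): AutoFallbackPin {
  // Refresh the cooldown only for a first pin or a real primary attempt, so
  // fallback-to-fallback moves cannot keep postponing the next probe.
  const lastProbeAt =
    params.primaryAttempted || pin === undefined ? params.now : pin.lastProbeAt;
  return {
    originModel: pin?.originModel ?? params.originModel, // keep original primary
    fallbackModel: params.fallbackModel,
    lastProbeAt,
  };
}
```

With this shape, a run that starts on one fallback and moves to another keeps the earlier `lastProbeAt`, so the primary still becomes eligible for a probe five minutes after it was last tried.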

@clawsweeper clawsweeper Bot added the mantis: telegram-visible-proof Mantis should capture Telegram visible proof. label May 16, 2026
@steipete steipete force-pushed the fix/session-auto-fallback-retry-primary branch from 1a9b17a to 0441b24 May 16, 2026 20:49
@steipete steipete changed the title from "fix(agents): probe primary after auto model fallback" to "fix(agents): announce auto model fallback transitions" May 16, 2026
@steipete steipete merged commit 4b0f16d into main May 16, 2026
121 of 125 checks passed
@steipete steipete deleted the fix/session-auto-fallback-retry-primary branch May 16, 2026 20:56
@steipete
Contributor Author

Landed on main as 4b0f16d.

Gate:

  • pnpm docs:list
  • focused fallback/reply Vitest: 5 files, 282 tests
  • agent-runner fallback E2E: 1 file, 49 tests
  • ACP delivery regression rerun: 1 file, 24 tests
  • pnpm tsgo:core
  • git diff --check origin/main...HEAD
  • codex-review --mode branch --base origin/main
  • GitHub PR checks green, including Real behavior proof

Labels

  • agents: Agent runtime and tooling
  • docs: Improvements or additions to documentation
  • maintainer: Maintainer-authored PR
  • mantis: telegram-visible-proof: Mantis should capture Telegram visible proof.
  • P1: High-priority user-facing bug, regression, or broken workflow.
  • proof: override: Maintainer override for the external PR real behavior proof gate.
  • size: L

1 participant