fix(model-selection): clear auto-failover overrides so primary is retried on each turn #69365

Merged
steipete merged 3 commits into openclaw:main from Chevron7Locked:fix/auto-failover-override-persists
Apr 21, 2026

Conversation

@Chevron7Locked
Contributor

Summary

  • Problem: When runWithModelFallback falls back to a secondary provider it writes providerOverride/modelOverride/modelOverrideSource:"auto" to the session. On subsequent turns createModelSelectionState re-applies the stored override and passes the fallback provider directly to runWithModelFallback — the configured primary is never tried again. The session is permanently pinned to the fallback even after the primary recovers.
  • Why it matters: Any transient primary failure (network blip, restart, OOM) permanently routes all future turns through a fallback provider until the user manually runs /model default. For local/self-hosted primaries that recover in seconds, this silently burns paid API quota on fallbacks for the rest of the session.
  • What changed: In createModelSelectionState, when the direct session override has modelOverrideSource: "auto" (set by a previous automatic fallback, not a user /model command), the override is cleared and the configured primary is retried. If the primary is still down, runWithModelFallback falls back and re-sets the override for that turn. Once the primary recovers, the override stays clear.
  • What did NOT change: User-selected overrides (modelOverrideSource: "user") and legacy overrides (no source field, backward-compat treated as user) are preserved unchanged. Parent-session auto-overrides are applied to children as before.

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Root cause trace

Turn N:   model-selection.ts → provider=primary → runWithModelFallback (primary fails) → fallback wins
          agent-runner-execution.ts:applyFallbackCandidateSelectionToEntry → sets modelOverrideSource:"auto"

Turn N+1: model-selection.ts reads stored override → provider=fallback
          runWithModelFallback(provider=fallback) → fallback succeeds first try
          applyFallbackCandidateSelectionToEntry: provider==run.provider → no update → override never cleared

Turn N+2: same as N+1 — primary never retried
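
The trace above can be reproduced with a small simulation. This is a minimal sketch, not the project's code: the SessionEntry shape, the runTurn helper, and the "primary"/"fallback" provider names are assumptions for illustration, and runTurn only stands in for runWithModelFallback plus the entry update described in the trace.

```typescript
// Hypothetical, simplified session entry; field names follow the PR text.
interface SessionEntry {
  providerOverride?: string;
  modelOverrideSource?: "auto" | "user";
}

const PRIMARY = "primary";
const FALLBACK = "fallback";

// Stand-in for one turn before the fix: the stored override is always re-applied.
function runTurn(entry: SessionEntry, isUp: (provider: string) => boolean): string {
  const start = entry.providerOverride ?? PRIMARY;
  // When the turn starts on the fallback, the primary is not among the candidates.
  const candidates = start === PRIMARY ? [PRIMARY, FALLBACK] : [start];
  const winner = candidates.find(isUp);
  if (winner === undefined) throw new Error("no provider available");
  // As in the trace: the entry is only written when the winner differs from the start.
  if (winner !== start) {
    entry.providerOverride = winner;
    entry.modelOverrideSource = "auto";
  }
  return winner;
}

function simulatePinning(): string[] {
  const entry: SessionEntry = {};
  return [
    runTurn(entry, (p) => p !== PRIMARY), // Turn N: primary is down
    runTurn(entry, () => true), // Turn N+1: primary has recovered
    runTurn(entry, () => true), // Turn N+2: still recovered
  ];
}
```

In this model every turn after the first failure runs on the fallback, because nothing ever removes the stored override.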

Fix

// src/auto-reply/reply/model-selection.ts
const isAutoSessionOverride =
  storedOverride?.source === "session" &&
  sessionEntry?.modelOverrideSource === "auto";
if (isAutoSessionOverride && sessionEntry && sessionStore && sessionKey && !resetModelOverride) {
  const { updated } = applyModelOverrideToSessionEntry({
    entry: sessionEntry,
    selection: { provider: defaultProvider, model: defaultModel, isDefault: true },
  });
  // persist + set resetModelOverride = true
}
if (storedOverride?.model && !skipStoredOverride && !isAutoSessionOverride) {
  // existing apply logic
}
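
To make the intended behaviour concrete, here is a self-contained sketch of the decision as a function over a simplified session entry. The selectForTurn name and the Selection and SessionEntry shapes are assumptions for illustration; the real code goes through applyModelOverrideToSessionEntry and the session store, which are not modelled here.

```typescript
interface Selection {
  provider: string;
  model: string;
}

// Hypothetical, simplified session entry; field names follow the PR text.
interface SessionEntry {
  providerOverride?: string;
  modelOverride?: string;
  modelOverrideSource?: "auto" | "user"; // undefined on legacy entries
}

// Returns the selection the turn should start with, clearing auto overrides in place.
function selectForTurn(entry: SessionEntry, primary: Selection): Selection {
  const { providerOverride, modelOverride, modelOverrideSource } = entry;
  if (modelOverride === undefined) return primary;

  if (modelOverrideSource === "auto") {
    // Set by a previous automatic fallback: drop it and retry the primary.
    delete entry.providerOverride;
    delete entry.modelOverride;
    delete entry.modelOverrideSource;
    return primary;
  }

  // "user" or legacy (no source field): keep the stored override.
  return { provider: providerOverride ?? primary.provider, model: modelOverride };
}
```

If the primary is still down, the fallback loop would write the auto override again after the turn, so the retry costs one failed attempt per turn while the outage lasts.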

Tests

Four new cases in model-selection.test.ts covering the happy path, user-override preservation, legacy backward-compat, and parent-session auto-override pass-through.

✓ clears auto-failover override and retries the configured primary
✓ preserves a user-selected override across turns
✓ preserves a legacy override with no modelOverrideSource (treated as user)
✓ does not touch an auto-failover override inherited from a parent session

@greptile-apps
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR fixes a session-pinning bug where a transient primary-model failure would permanently route all subsequent turns through the fallback provider until the user manually ran /model default. The fix adds a check in createModelSelectionState for modelOverrideSource: "auto" on a direct session override; when found, it clears the override so the primary is retried each turn, letting runWithModelFallback re-set it only when the primary is still down. Four targeted tests cover the new behaviour, user-override preservation, legacy backward-compat, and parent-session passthrough.

Confidence Score: 5/5

Safe to merge — the logic is correct, well-guarded, and follows existing patterns in the codebase.

The fix correctly targets only auto-sourced direct session overrides (source === 'session' && modelOverrideSource === 'auto'), preserves user and legacy overrides unchanged, and leaves parent-session overrides untouched. The auth-profile clearing that occurs via applyModelOverrideToSessionEntry with isDefault: true is benign because applyFallbackCandidateSelectionToEntry already replaces or deletes any user auth-profile override when the fallback fires. The persistent-store update follows the identical pattern used in the existing allowedModelKeys reset block. All four new test cases exercise the discriminating conditions.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(model-selection): clear auto-failove..."

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7006763d11

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/auto-reply/reply/model-selection.ts Outdated
@altaywtf altaywtf self-assigned this Apr 20, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa9ce6a43b

Comment thread src/auto-reply/reply/model-selection.ts Outdated
@altaywtf altaywtf force-pushed the fix/auto-failover-override-persists branch from aa9ce6a to 08ac24f Compare April 20, 2026 16:14
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08ac24ffd8

Comment thread src/auto-reply/reply/model-selection.ts Outdated
@altaywtf altaywtf force-pushed the fix/auto-failover-override-persists branch from 08ac24f to 71fe22a Compare April 20, 2026 16:30
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1622c8ab02

Comment thread src/auto-reply/reply/model-selection.ts Outdated
@altaywtf altaywtf force-pushed the fix/auto-failover-override-persists branch from fe1b1f8 to bbb1c7c Compare April 20, 2026 17:01
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbb1c7c34a

Comment thread src/auto-reply/reply/model-selection.ts Outdated
Comment on lines +404 to +405
provider = autoHealProvider;
model = autoHealModel;

P2 Badge Use self-healed model for status surfaces in same request

This branch now rewrites the in-memory selection to autoHealProvider/autoHealModel after clearing an auto failover override, but the /status path still snapshots selectedProvider/selectedModel before calling createModelSelectionState and renders those stale values (src/plugin-sdk/command-status.runtime.ts:51-57 and :105-114). In sessions with modelOverrideSource: "auto", the first /status after recovery will report the fallback model even though this code has already cleared the override in session storage, so operators get incorrect model state for that turn.

@Chevron7Locked Chevron7Locked force-pushed the fix/auto-failover-override-persists branch from bbb1c7c to 9605e45 Compare April 20, 2026 17:11
@Chevron7Locked
Contributor Author

Thanks for the P2 feedback — all valid. Addressed in 9605e45:

  1. resetModelOverride flag — removed from the auto-heal path. That flag is for allowlist violations; auto-heal is not a violation. The wrong system event will no longer fire.

  2. In-memory provider/model not updated — added provider = defaultProvider; model = defaultModel after clearing session state, so the current turn retries the primary rather than still executing on the fallback. Added an explicit test case that simulates the get-reply-directives.ts preload path (caller passes fallback values as params) and verifies the returned state reflects the primary.

  3. Channel model override edge case — acknowledged in a code comment. Channel overrides are skipped on the recovery turn because hasSessionModelOverride was true at preload time; they resume on the following turn once the session state is clear. Fixing this cleanly requires changes to the get-reply-directives preload flow (re-evaluating channel override when the session override was auto and got cleared). Happy to follow up in a separate PR if the maintainers want that addressed here.
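
Point 2 can be illustrated with a short sketch in which the caller has already preloaded the fallback values, as the get-reply-directives.ts preload path is described as doing. The createSelectionState function and its parameter shape are hypothetical simplifications, not the signature of createModelSelectionState.

```typescript
// Hypothetical, simplified session entry; field names follow the PR text.
interface SessionEntry {
  providerOverride?: string;
  modelOverride?: string;
  modelOverrideSource?: "auto" | "user";
}

interface SelectionParams {
  provider: string; // preloaded by the caller, may already be the fallback
  model: string;
  defaultProvider: string;
  defaultModel: string;
  entry: SessionEntry;
}

function createSelectionState(params: SelectionParams): { provider: string; model: string } {
  let { provider, model } = params;
  const { entry } = params;

  if (entry.modelOverride !== undefined && entry.modelOverrideSource === "auto") {
    // Clear the persisted auto override...
    delete entry.providerOverride;
    delete entry.modelOverride;
    delete entry.modelOverrideSource;
    // ...and also reset the in-memory values, otherwise this turn would
    // still run on the fallback the caller preloaded.
    provider = params.defaultProvider;
    model = params.defaultModel;
  }

  return { provider, model };
}
```

Without the two in-memory assignments, the entry would be clean for the next turn but the current turn would still return the preloaded fallback.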

@altaywtf
Member

altaywtf commented Apr 20, 2026

@Chevron7Locked I was preparing the MR for the merge, doing some refactoring.
since you have also pushed, leaving it for now.
please let me know when you're ready for a review. 🙏

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9605e45098

Comment thread src/auto-reply/reply/model-selection.ts Outdated
Comment on lines +389 to +390
if (isAutoSessionOverride && sessionEntry && sessionStore && sessionKey && !resetModelOverride) {
  const { updated } = applyModelOverrideToSessionEntry({

P2 Badge Skip auto-heal reset when heartbeat model override is active

This branch runs even when hasResolvedHeartbeatModelOverride is true (skipStoredOverride), and it forces provider/model back to defaultProvider/defaultModel. In heartbeat turns with an explicit heartbeat.model, that means the first run after an auto-failover override no longer uses the configured heartbeat model and instead executes on the default model, which contradicts the explicit heartbeat selection path and can change cost/behavior for that run.
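
One possible guard for this case is sketched below. It is an assumption about how the concern could be addressed, not necessarily what was merged; resolveTurnModel and its input shape are hypothetical, and only the flag names isAutoSessionOverride and skipStoredOverride come from the review text.

```typescript
interface TurnInputs {
  currentModel: string; // whatever the caller resolved, e.g. an explicit heartbeat model
  defaultModel: string;
  isAutoSessionOverride: boolean;
  skipStoredOverride: boolean; // true when a heartbeat model override was resolved
}

// Only fall back to the default model when no explicit heartbeat model is in play.
function resolveTurnModel(inputs: TurnInputs): string {
  if (inputs.isAutoSessionOverride && !inputs.skipStoredOverride) {
    return inputs.defaultModel;
  }
  return inputs.currentModel;
}
```

The sketch covers only the in-memory selection for the turn; whether the stored auto override should still be cleared on a heartbeat turn is a separate decision that it does not model.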

Comment thread src/auto-reply/reply/model-selection.ts Outdated
Comment on lines +390 to +393
const { updated } = applyModelOverrideToSessionEntry({
  entry: sessionEntry,
  selection: { provider: defaultProvider, model: defaultModel, isDefault: true },
});

P1 Badge Keep user auth profile when clearing auto model override

Calling applyModelOverrideToSessionEntry here without profileOverride clears authProfileOverride fields as part of its default-reset behavior. If failover only switched models within the same provider, fallback persistence intentionally keeps the user's scoped auth profile; this auto-heal path then drops it on the next turn, causing an unintended credentials/profile reset after a transient fallback.
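
A sketch of a narrower clear that leaves the auth profile alone is shown here. The clearAutoModelOverride helper and the single authProfileOverride string field are assumptions for illustration; the actual entry may carry more auth-profile fields, and the merged change may solve this differently (for example by passing profileOverride through).

```typescript
// Hypothetical, simplified session entry.
interface SessionEntry {
  providerOverride?: string;
  modelOverride?: string;
  modelOverrideSource?: "auto" | "user";
  authProfileOverride?: string;
}

// Removes only the model-override fields of an auto override.
// Returns true when the entry was changed and should be persisted.
function clearAutoModelOverride(entry: SessionEntry): boolean {
  if (entry.modelOverrideSource !== "auto") return false;
  delete entry.providerOverride;
  delete entry.modelOverride;
  delete entry.modelOverrideSource;
  // authProfileOverride is deliberately left untouched.
  return true;
}
```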

@steipete steipete force-pushed the fix/auto-failover-override-persists branch from 9605e45 to 9bd50c1 Compare April 21, 2026 01:55
steipete added a commit to Chevron7Locked/openclaw that referenced this pull request Apr 21, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9bd50c1944

Comment on lines +404 to +406
const storedOverride = hadDirectAutoSessionOverride
  ? undefined
  : resolveStoredModelOverride({

P2 Badge Reapply parent override after clearing direct auto override

When a child session has a direct modelOverrideSource: "auto", this branch forces storedOverride to undefined after clearing the direct override, so the same turn cannot fall back to a parent-session override. In get-reply.ts, child sessions can inherit a parent /model selection, but after transient failover the next turn now runs on defaultProvider/defaultModel instead of the parent-selected model because resolveStoredModelOverride is bypassed here; that causes at least one request on the wrong model and temporarily ignores the user’s parent override.
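
The resolution order this comment asks for can be sketched as follows. It is a hypothetical simplification: resolveOverrideAfterHeal, the StoredOverride shape, and the "parent" source value are illustrative and do not reflect the real resolveStoredModelOverride signature.

```typescript
// Hypothetical, simplified session entry.
interface SessionEntry {
  modelOverride?: string;
  modelOverrideSource?: "auto" | "user";
}

interface StoredOverride {
  source: "session" | "parent";
  model: string;
}

// Ignore a direct auto override, but still consult the parent session afterwards.
function resolveOverrideAfterHeal(child: SessionEntry, parent?: SessionEntry): StoredOverride | undefined {
  const direct = child.modelOverride;
  if (direct !== undefined && child.modelOverrideSource !== "auto") {
    return { source: "session", model: direct };
  }
  // Parent overrides are applied to children regardless of their source.
  const inherited = parent?.modelOverride;
  if (inherited !== undefined) {
    return { source: "parent", model: inherited };
  }
  return undefined;
}
```

With this ordering, a child that just recovered from a transient failover would run on the parent's selected model on the same turn rather than on the default.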

steipete added a commit to Chevron7Locked/openclaw that referenced this pull request Apr 21, 2026
@steipete steipete force-pushed the fix/auto-failover-override-persists branch from 9bd50c1 to f7ea39b Compare April 21, 2026 02:14
steipete added a commit to Chevron7Locked/openclaw that referenced this pull request Apr 21, 2026
@steipete steipete force-pushed the fix/auto-failover-override-persists branch from f7ea39b to d70cc44 Compare April 21, 2026 02:21
…ried on each turn

When runWithModelFallback falls back to a secondary provider it writes
providerOverride/modelOverride/modelOverrideSource:"auto" to the session.
On subsequent turns createModelSelectionState read this stored override and
passed the fallback provider directly to runWithModelFallback, so the
configured primary was never retried — the session was permanently pinned to
the fallback even after the primary recovered.

Fix: at model-selection ingress, when the direct session override has
modelOverrideSource "auto" (set by a previous automatic fallback, not a user
/model command), clear the override and retry the configured primary. If the
primary is still down runWithModelFallback will fall back and re-set the auto
override for that turn. Once the primary recovers the override stays clear.

User-selected overrides (modelOverrideSource "user" or legacy undefined+model)
are preserved unchanged.

Covered by four new unit tests in model-selection.test.ts:
- auto-failover override cleared and primary retried
- user-selected override preserved
- legacy override without source field preserved
- parent-session auto-override applied to child (not cleared by child logic)
…verride clearing

Three corrections to the auto-failover self-healing introduced in the prior commit:

1. Reset in-memory provider/model to configured primary after clearing auto override.
   get-reply-directives.ts preloads provider/model from the stored override before
   calling createModelSelectionState, so clearing only session state still ran the
   current turn on the fallback. Now provider/model are reset to defaultProvider/
   defaultModel so this turn retries the primary immediately, not on the next turn.

2. Remove resetModelOverride = true from the auto-heal path. That flag triggers a
   "Model override not allowed for this agent" system event in
   applyInlineDirectiveOverrides, which is incorrect: the override was valid and set
   by the fallback loop — it just expired once the primary recovered. Auto-heal is
   not an allowlist violation.

3. Add a test case that verifies the in-memory reset when the caller pre-loads the
   fallback provider/model (simulating the get-reply-directives.ts preload path).

Known limitation (noted in comment): channel model overrides (channels.modelByChannel)
are skipped on the recovery turn because hasSessionModelOverride was true when they
were evaluated at preload time. They resume on the following turn once session state
is clear. Fixing this cleanly requires changes to the get-reply-directives preload
flow and is out of scope for this PR.
@steipete steipete force-pushed the fix/auto-failover-override-persists branch from d70cc44 to fcbf830 Compare April 21, 2026 02:29
@steipete steipete merged commit 215d5fb into openclaw:main Apr 21, 2026
91 checks passed
@steipete
Contributor

Landed after maintainer review and focused regression coverage.

  • Gate: pnpm test src/auto-reply/reply/model-selection.test.ts; pnpm check:changed; GitHub CI green
  • Source SHA: fcbf830
  • Merge SHA: 215d5fb

Thanks @Chevron7Locked!

3 participants