fix: surface_error failover now throws FailoverError to prevent UI hang #64817

Closed
Ricardo-M-L wants to merge 1 commit into openclaw:main from Ricardo-M-L:fix/agent-timeout-ui-error-surfacing

@Ricardo-M-L
Contributor

Summary

Fixes #64793

When an LLM request times out and the failover policy decides to surface_error, the handleAssistantFailover function previously fell through to return { action: "continue_normal" }, silently swallowing the error. This caused the WebSocket connection to abort before any final or error event could be broadcast to the UI, leaving the client spinner hanging indefinitely.

  • assistant-failover.ts: surface_error decisions now return { action: "throw" } with a properly constructed FailoverError, ensuring the error propagates through the promise chain to broadcastChatError
  • chat.ts: Added a terminalEventSent safety net in the .finally() handler — if neither .then() nor .catch() managed to broadcast a terminal event (e.g. due to a connection abort race), a fallback error event is emitted
  • Tests: Added 9 regression tests covering timeout, billing, rate-limit, auth, generic failure, idle-timeout retry, and continue_normal preservation scenarios; extended failover-policy.test.ts with 2 timeout surface_error policy assertions
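A minimal sketch of the first change, for readers who do not have the diff open. The types, the `FailoverError` constructor shape, and the helper names `handleDecision` and `runOnce` are hypothetical stand-ins; the real definitions in assistant-failover.ts and run.ts may differ.

```typescript
// Hypothetical, simplified shapes; not the project's actual types.
type FailoverDecision =
  | { action: "surface_error"; reason: string }
  | { action: "continue_normal" };

type FailoverOutcome =
  | { action: "continue_normal" }
  | { action: "throw"; error: FailoverError };

// Stand-in for the real FailoverError class (constructor shape is assumed).
class FailoverError extends Error {
  readonly reason: string;
  constructor(message: string, reason: string) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
  }
}

// surface_error is mapped to an explicit "throw" outcome instead of
// falling through to "continue_normal".
function handleDecision(decision: FailoverDecision): FailoverOutcome {
  if (decision.action === "surface_error") {
    return {
      action: "throw",
      error: new FailoverError(`LLM request failed: ${decision.reason}`, decision.reason),
    };
  }
  return { action: "continue_normal" };
}

// The caller re-throws, so the error reaches whatever .catch() sits
// further up the promise chain.
function runOnce(decision: FailoverDecision): string {
  const outcome = handleDecision(decision);
  if (outcome.action === "throw") {
    throw outcome.error;
  }
  return "ok";
}
```

The point of returning an outcome rather than throwing inside the handler is that the caller stays in control of where the error is raised.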

Test plan

  • assistant-failover.surface-error-throws.test.ts — 9 tests pass
  • failover-policy.test.ts — all assertions pass with new timeout cases
  • Manual: deploy with a slow LLM provider (e.g. minimax-m2.5 via NVIDIA API), trigger timeout, verify UI displays error instead of hanging
  • Verify existing retry/fallback behavior is preserved (idle timeout retry, profile rotation, model fallback)

🤖 Generated with Claude Code

…ent UI hang

When an LLM request times out and the failover policy decides to surface
the error to the user, the assistant-failover handler previously fell
through to return "continue_normal". This silently swallowed the error,
causing the WebSocket connection to abort before the final event reached
the UI -- leaving the client spinner hanging indefinitely.

Three fixes:
1. handleAssistantFailover now returns a "throw" action with a proper
   FailoverError when the decision is "surface_error", ensuring the error
   propagates through the promise chain to broadcastChatError.
2. The chat.send handler tracks whether a terminal event (final/error)
   was broadcast and uses a .finally() safety net to emit one if neither
   the .then() nor .catch() branch succeeded.
3. Added regression tests for the surface_error -> throw path covering
   timeout, billing, rate-limit, auth, and generic failure scenarios.

Fixes openclaw#64793

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openclaw-barnacle (bot) added the app: web-ui, gateway, agents, and size: M labels on Apr 11, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 11, 2026

Greptile Summary

This PR fixes a silent error-swallowing bug where surface_error failover decisions in handleAssistantFailover fell through to return { action: "continue_normal" }, preventing any terminal event from reaching the UI and leaving the client spinner hanging. The fix makes surface_error return { action: "throw", error: FailoverError } — which the existing caller in run.ts already handles by re-throwing — and adds a terminalEventSent safety net in chat.ts to emit a fallback error broadcast if a connection-abort race prevented either branch from sending a terminal event.
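The safety-net idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in chat.ts: `ChatEvent`, `sendWithSafetyNet`, and the `broadcast` callback are hypothetical names, and the real handler may track the flag differently.

```typescript
// Hypothetical event shape; the real chat events carry more fields.
type ChatEvent = { state: "final" | "error"; message: string };

function sendWithSafetyNet(
  run: () => Promise<string>,
  broadcast: (event: ChatEvent) => void,
): Promise<void> {
  let terminalEventSent = false;

  // Mark the flag only after the broadcast actually succeeded.
  const tryEmit = (event: ChatEvent): void => {
    try {
      broadcast(event);
      terminalEventSent = true;
    } catch {
      // e.g. the connection was aborted mid-send; leave the flag unset
    }
  };

  return run()
    .then((text) => tryEmit({ state: "final", message: text }))
    .catch((err: unknown) =>
      tryEmit({ state: "error", message: err instanceof Error ? err.message : String(err) }),
    )
    .finally(() => {
      // Neither branch delivered a terminal event: emit a fallback error
      // so the client is not left waiting.
      if (!terminalEventSent) {
        tryEmit({ state: "error", message: "run ended without a terminal event" });
      }
    });
}
```

In this sketch the flag is set only after a successful broadcast, so a failed send in `.then()` still leaves the `.finally()` fallback armed.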

Confidence Score: 5/5

Safe to merge — the fix is well-scoped, the caller already handled the throw action, and 9 regression tests cover the key paths.

Only finding is a P2 test name mismatch with no impact on correctness or runtime behavior. Core logic, error propagation, and the safety net are all sound.

No files require special attention; the P2 comment on failover-policy.test.ts is cosmetic only.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/failover-policy.test.ts
Line: 133-147

Comment:
**Misleading test name contradicts the fixture**

The test description says "not rotated" but the fixture sets `profileRotated: true`. In `resolveRunFailoverDecision`, `profileRotated: true` means the rotation already happened; passing `false` here would yield `rotate_profile`, not `surface_error`. The comment inside the body ("With profileRotated=true") acknowledges this but the name still misleads readers.

```suggestion
  it("surfaces error for timeout when no fallback is configured and rotation already exhausted", () => {
    // When timedOut=true and not during compaction, shouldRotateAssistant returns true.
    // With profileRotated=true and no fallback, we get surface_error.
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: surface_error failover decision now..."

Comment on lines +133 to +147
it("surfaces error for timeout when no fallback is configured and not rotated", () => {
  // When timedOut=true and not during compaction, shouldRotateAssistant returns true.
  // With profileRotated=true and no fallback, we get surface_error.
  const decision = resolveRunFailoverDecision({
    stage: "assistant",
    aborted: false,
    fallbackConfigured: false,
    failoverFailure: false,
    failoverReason: null,
    timedOut: true,
    timedOutDuringCompaction: false,
    profileRotated: true,
  });
  expect(decision.action).toBe("surface_error");
});

P2 Misleading test name contradicts the fixture

The test description says "not rotated" but the fixture sets profileRotated: true. In resolveRunFailoverDecision, profileRotated: true means the rotation already happened; passing false here would yield rotate_profile, not surface_error. The comment inside the body ("With profileRotated=true") acknowledges this but the name still misleads readers.

Suggested change
- it("surfaces error for timeout when no fallback is configured and not rotated", () => {
-   // When timedOut=true and not during compaction, shouldRotateAssistant returns true.
-   // With profileRotated=true and no fallback, we get surface_error.
-   const decision = resolveRunFailoverDecision({
-     stage: "assistant",
-     aborted: false,
-     fallbackConfigured: false,
-     failoverFailure: false,
-     failoverReason: null,
-     timedOut: true,
-     timedOutDuringCompaction: false,
-     profileRotated: true,
-   });
-   expect(decision.action).toBe("surface_error");
- });
+ it("surfaces error for timeout when no fallback is configured and rotation already exhausted", () => {
+   // When timedOut=true and not during compaction, shouldRotateAssistant returns true.
+   // With profileRotated=true and no fallback, we get surface_error.
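To make the `profileRotated` semantics in the comment above concrete, here is a rough sketch of the timeout branch only. It is not `resolveRunFailoverDecision` itself: the function name `sketchTimeoutDecision`, the `fallback_model` action, and the `continue_normal` result for non-rotating cases are assumptions added for illustration.

```typescript
// Hypothetical subset of the decision input used in the test fixture.
type TimeoutInput = {
  timedOut: boolean;
  timedOutDuringCompaction: boolean;
  profileRotated: boolean;
  fallbackConfigured: boolean;
};

type Action = "rotate_profile" | "fallback_model" | "surface_error" | "continue_normal";

function sketchTimeoutDecision(input: TimeoutInput): Action {
  // Per the review comment: a timeout outside compaction asks for rotation.
  const shouldRotate = input.timedOut && !input.timedOutDuringCompaction;
  if (!shouldRotate) {
    return "continue_normal"; // assumption: no failover needed in this sketch
  }
  // profileRotated=false means rotation has not happened yet, so try it first.
  if (!input.profileRotated) {
    return "rotate_profile";
  }
  // Rotation already happened; fall back to another model if one exists.
  if (input.fallbackConfigured) {
    return "fallback_model"; // assumption: action name is illustrative
  }
  // Nothing left to try: surface the error to the user.
  return "surface_error";
}
```

Under these assumptions, the fixture in the test (`profileRotated: true`, `fallbackConfigured: false`) lands on `surface_error`, while flipping `profileRotated` to `false` yields `rotate_profile`, which is why the "not rotated" wording in the test name is misleading.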

@steipete
Contributor

Closing this as implemented after Codex review.

Current main already covers the underlying timeout/UI-hang problem and the broader surface_error propagation path, with a more precise implementation than this PR.

What I checked:

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review notes: reviewed against fd65caf4b0ba; fix evidence: commit fd65caf4b0ba.

@steipete closed this on Apr 25, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Agent timeout does not surface error to UI, UI hangs indefinitely
