fix: surface_error failover now throws FailoverError to prevent UI hang#64817
fix: surface_error failover now throws FailoverError to prevent UI hang#64817Ricardo-M-L wants to merge 1 commit into
Conversation
…ent UI hang When an LLM request times out and the failover policy decides to surface the error to the user, the assistant-failover handler previously fell through to return "continue_normal". This silently swallowed the error, causing the WebSocket connection to abort before the final event reached the UI -- leaving the client spinner hanging indefinitely. Three fixes: 1. handleAssistantFailover now returns a "throw" action with a proper FailoverError when the decision is "surface_error", ensuring the error propagates through the promise chain to broadcastChatError. 2. The chat.send handler tracks whether a terminal event (final/error) was broadcast and uses a .finally() safety net to emit one if neither the .then() nor .catch() branch succeeded. 3. Added regression tests for the surface_error -> throw path covering timeout, billing, rate-limit, auth, and generic failure scenarios. Fixes openclaw#64793 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR fixes a silent error-swallowing bug where Confidence Score: 5/5Safe to merge — the fix is well-scoped, the caller already handled the throw action, and 9 regression tests cover the key paths. Only finding is a P2 test name mismatch with no impact on correctness or runtime behavior. Core logic, error propagation, and the safety net are all sound. No files require special attention; the P2 comment on failover-policy.test.ts is cosmetic only. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/failover-policy.test.ts
Line: 133-147
Comment:
**Misleading test name contradicts the fixture**
The test description says "not rotated" but the fixture sets `profileRotated: true`. In `resolveRunFailoverDecision`, `profileRotated: true` means the rotation already happened; passing `false` here would yield `rotate_profile`, not `surface_error`. The comment inside the body ("With profileRotated=true") acknowledges this but the name still misleads readers.
```suggestion
it("surfaces error for timeout when no fallback is configured and rotation already exhausted", () => {
// When timedOut=true and not during compaction, shouldRotateAssistant returns true.
// With profileRotated=true and no fallback, we get surface_error.
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: surface_error failover decision now..." | Re-trigger Greptile |
| it("surfaces error for timeout when no fallback is configured and not rotated", () => { | ||
| // When timedOut=true and not during compaction, shouldRotateAssistant returns true. | ||
| // With profileRotated=true and no fallback, we get surface_error. | ||
| const decision = resolveRunFailoverDecision({ | ||
| stage: "assistant", | ||
| aborted: false, | ||
| fallbackConfigured: false, | ||
| failoverFailure: false, | ||
| failoverReason: null, | ||
| timedOut: true, | ||
| timedOutDuringCompaction: false, | ||
| profileRotated: true, | ||
| }); | ||
| expect(decision.action).toBe("surface_error"); | ||
| }); |
There was a problem hiding this comment.
Misleading test name contradicts the fixture
The test description says "not rotated" but the fixture sets profileRotated: true. In resolveRunFailoverDecision, profileRotated: true means the rotation already happened; passing false here would yield rotate_profile, not surface_error. The comment inside the body ("With profileRotated=true") acknowledges this but the name still misleads readers.
| it("surfaces error for timeout when no fallback is configured and not rotated", () => { | |
| // When timedOut=true and not during compaction, shouldRotateAssistant returns true. | |
| // With profileRotated=true and no fallback, we get surface_error. | |
| const decision = resolveRunFailoverDecision({ | |
| stage: "assistant", | |
| aborted: false, | |
| fallbackConfigured: false, | |
| failoverFailure: false, | |
| failoverReason: null, | |
| timedOut: true, | |
| timedOutDuringCompaction: false, | |
| profileRotated: true, | |
| }); | |
| expect(decision.action).toBe("surface_error"); | |
| }); | |
| it("surfaces error for timeout when no fallback is configured and rotation already exhausted", () => { | |
| // When timedOut=true and not during compaction, shouldRotateAssistant returns true. | |
| // With profileRotated=true and no fallback, we get surface_error. |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/failover-policy.test.ts
Line: 133-147
Comment:
**Misleading test name contradicts the fixture**
The test description says "not rotated" but the fixture sets `profileRotated: true`. In `resolveRunFailoverDecision`, `profileRotated: true` means the rotation already happened; passing `false` here would yield `rotate_profile`, not `surface_error`. The comment inside the body ("With profileRotated=true") acknowledges this but the name still misleads readers.
```suggestion
it("surfaces error for timeout when no fallback is configured and rotation already exhausted", () => {
// When timedOut=true and not during compaction, shouldRotateAssistant returns true.
// With profileRotated=true and no fallback, we get surface_error.
```
How can I resolve this? If you propose a fix, please make it concise.|
Closing this as implemented after Codex review. Current What I checked:
So I’m closing this as already implemented rather than keeping a duplicate issue open. Review notes: reviewed against fd65caf4b0ba; fix evidence: commit fd65caf4b0ba. |
Summary
Fixes #64793
When an LLM request times out and the failover policy decides to
surface_error, thehandleAssistantFailoverfunction previously fell through to return{ action: "continue_normal" }, silently swallowing the error. This caused the WebSocket connection to abort before anyfinalorerrorevent could be broadcast to the UI, leaving the client spinner hanging indefinitely.assistant-failover.ts:surface_errordecisions now return{ action: "throw" }with a properly constructedFailoverError, ensuring the error propagates through the promise chain tobroadcastChatErrorchat.ts: Added aterminalEventSentsafety net in the.finally()handler — if neither.then()nor.catch()managed to broadcast a terminal event (e.g. due to a connection abort race), a fallback error event is emittedfailover-policy.test.tswith 2 timeout surface_error policy assertionsTest plan
assistant-failover.surface-error-throws.test.ts— 9 tests passfailover-policy.test.ts— all assertions pass with new timeout cases🤖 Generated with Claude Code