Commit 983909f

aaron-he-zhu, Aaron Zhu, and altaywtf authored
fix(agents): classify generic provider errors for failover (#59325)
* fix(agents): classify generic provider errors for failover

  Anthropic returns bare 'An unknown error occurred' during API instability and OpenRouter wraps upstream failures as 'Provider returned error'. Neither message was recognized by the failover classifier, so the error surfaced directly to users instead of triggering the configured fallback chain.

  Add both patterns to the serverError classifier so they are classified as transient server errors (timeout) and trigger model failover.

  Closes #49706
  Closes #45834

* fix(agents): scope unknown-error failover by provider

* docs(changelog): note provider-scoped unknown-error failover

---------

Co-authored-by: Aaron Zhu <aaron@Aarons-MacBook-Air.local>
Co-authored-by: Altay <altay@uinaf.dev>
1 parent 8a6da9d commit 983909f
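
The provider-scoped behavior described in the commit message can be sketched as follows. This is a simplified illustration, not the project's implementation: `classifyUnknownError` is a hypothetical name, and it models only the Anthropic-specific branch. Per the tests in this commit, the final behavior matches `An unknown error occurred` only when the provider is Anthropic, and leaves `Provider returned error` unclassified.

```typescript
// Simplified sketch of provider-scoped matching.
// `classifyUnknownError` is a hypothetical name used for illustration only.
type FailoverReason = "timeout";

// Mirrors the helper added in errors.ts: a substring match on the provider id.
function isAnthropicProvider(provider?: string): boolean {
  const normalized = provider?.trim().toLowerCase();
  return Boolean(normalized && normalized.includes("anthropic"));
}

// Assumption: only the unknown-error branch is modeled here. The real
// classifier checks many other message patterns before and after this one.
function classifyUnknownError(
  raw: string,
  opts?: { provider?: string },
): FailoverReason | null {
  const matchesText = raw.toLowerCase().includes("an unknown error occurred");
  return matchesText && isAnthropicProvider(opts?.provider) ? "timeout" : null;
}

// Usage: the same text is retryable only when the provider is Anthropic.
const scoped = classifyUnknownError("An unknown error occurred", { provider: "anthropic" });
const unscoped = classifyUnknownError("An unknown error occurred");
console.log(scoped, unscoped);
```

Scoping by provider keeps generic internal messages, which may share the same wording, from being treated as transient upstream failures.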

8 files changed, +103 -11 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@ Docs: https://docs.openclaw.ai
 - Gateway/device auth: reuse cached device-token scopes only for cached-token reconnects, while keeping explicit `deviceToken` scope requests and empty-cache fallbacks intact so reconnects preserve `operator.read` without breaking explicit auth flows. (#46032) Thanks @caicongyang.
 - Google Gemini CLI auth: improve OAuth credential discovery across Windows nvm and Homebrew libexec installs, and align Code Assist metadata so Gemini login stops failing on packaged CLI layouts. (#40729) Thanks @hughcube.
 - Mattermost/config schema: accept `groups.*.requireMention` again so existing Mattermost configs no longer fail strict validation after upgrade. (#58271) Thanks @MoerAI.
+- Agents/failover: scope Anthropic `An unknown error occurred` failover matching by provider so generic internal unknown-error text no longer triggers retryable timeout fallback. (#59325) Thanks @aaron-he-zhu.
 - Providers/OpenRouter failover: classify `403 "Key limit exceeded"` spending-limit responses as billing so model fallback continues instead of stopping on generic auth. (#59892) Thanks @rockcent.
 - Device pairing/security: keep non-operator device scope checks bound to the requested role prefix so bootstrap verification cannot redeem `operator.*` scopes through `node` auth. (#57258) Thanks @jlapenna.
 - Gateway/device pairing: require non-admin paired-device sessions to manage only their own device for token rotate/revoke and paired-device removal, blocking cross-device token theft inside pairing-scoped sessions. (#50627) Thanks @coygeek.

src/agents/cli-runner.ts

Lines changed: 2 additions & 2 deletions
@@ -73,8 +73,8 @@ export async function runCliAgent(params: RunCliAgentParams): Promise<EmbeddedPi
       throw err;
     }
     const message = err instanceof Error ? err.message : String(err);
-    if (isFailoverErrorMessage(message)) {
-      const reason = classifyFailoverReason(message) ?? "unknown";
+    if (isFailoverErrorMessage(message, { provider: params.provider })) {
+      const reason = classifyFailoverReason(message, { provider: params.provider }) ?? "unknown";
       const status = resolveFailoverStatus(reason);
       throw new FailoverError(message, {
         reason,

src/agents/failover-error.test.ts

Lines changed: 32 additions & 0 deletions
@@ -196,6 +196,38 @@ describe("failover-error", () => {
     ).toBe("overloaded");
   });

+  it("classifies Anthropic bare 'unknown error' as timeout for failover (#49706)", () => {
+    expect(
+      resolveFailoverReasonFromError({
+        provider: "anthropic",
+        message: "An unknown error occurred",
+      }),
+    ).toBe("timeout");
+  });
+
+  it("does not classify generic internal unknown-error text as failover timeout", () => {
+    expect(
+      resolveFailoverReasonFromError({
+        message: "LLM request failed with an unknown error.",
+      }),
+    ).toBeNull();
+    expect(
+      resolveFailoverReasonFromError({
+        message: "An unknown error occurred",
+      }),
+    ).toBeNull();
+    expect(
+      resolveFailoverReasonFromError({
+        provider: "openrouter",
+        message: "An unknown error occurred",
+      }),
+    ).toBeNull();
+    expect(
+      resolveFailoverReasonFromError({
+        message: "Provider returned error",
+      }),
+    ).toBeNull();
+  });
   it("treats 400 insufficient_quota payloads as billing instead of format", () => {
     expect(
       resolveFailoverReasonFromError({

src/agents/failover-error.ts

Lines changed: 17 additions & 0 deletions
@@ -132,6 +132,22 @@ function getErrorCode(err: unknown): string | undefined {
   return findErrorProperty(err, readDirectErrorCode);
 }

+function readDirectProvider(err: unknown): string | undefined {
+  if (!err || typeof err !== "object") {
+    return undefined;
+  }
+  const provider = (err as { provider?: unknown }).provider;
+  if (typeof provider !== "string") {
+    return undefined;
+  }
+  const trimmed = provider.trim();
+  return trimmed || undefined;
+}
+
+function getProvider(err: unknown): string | undefined {
+  return findErrorProperty(err, readDirectProvider);
+}
+
 function readDirectErrorMessage(err: unknown): string | undefined {
   if (err instanceof Error) {
     return err.message || undefined;
@@ -207,6 +223,7 @@ function normalizeErrorSignal(err: unknown): FailoverSignal {
     status: getStatusCode(err),
     code: getErrorCode(err),
     message: message || undefined,
+    provider: getProvider(err),
   };
 }
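
The diff above reads `provider` through `findErrorProperty`, whose implementation is not shown in this commit. The sketch below shows one way such a lookup could work, assuming it walks the error's `cause` chain. `findInCauseChain`, the `cause` traversal, and the depth limit are all assumptions made for illustration. Only `readDirectProvider` is taken from the diff.

```typescript
// Hypothetical stand-in for `findErrorProperty`. The real helper is not part
// of this diff, so the `cause` traversal and depth limit are assumptions.
function findInCauseChain<T>(
  err: unknown,
  read: (candidate: unknown) => T | undefined,
  maxDepth = 5,
): T | undefined {
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth; depth++) {
    if (!current || typeof current !== "object") {
      return undefined;
    }
    const value = read(current);
    if (value !== undefined) {
      return value;
    }
    // Move one level down the chain of wrapped errors.
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

// Same logic as `readDirectProvider` in the diff: accept only a non-empty,
// trimmed string.
function readDirectProvider(err: unknown): string | undefined {
  if (!err || typeof err !== "object") {
    return undefined;
  }
  const provider = (err as { provider?: unknown }).provider;
  if (typeof provider !== "string") {
    return undefined;
  }
  const trimmed = provider.trim();
  return trimmed || undefined;
}

// Usage: the provider sits on a wrapped cause and carries stray whitespace.
const wrapped = { message: "request failed", cause: { provider: " anthropic " } };
const found = findInCauseChain(wrapped, readDirectProvider);
console.log(found);
```

Returning `undefined` for empty or non-string values means a missing provider never matches the Anthropic-scoped rule, which is the conservative default.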

src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts

Lines changed: 15 additions & 0 deletions
@@ -638,6 +638,21 @@ describe("classifyFailoverReason", () => {
       ),
     ).toBeNull();
   });
+  it("classifies Anthropic bare 'unknown error' as timeout for failover", () => {
+    expect(classifyFailoverReason("An unknown error occurred", { provider: "anthropic" })).toBe(
+      "timeout",
+    );
+  });
+
+  it("does not classify generic internal unknown-error text as timeout", () => {
+    expect(classifyFailoverReason("An unknown error occurred")).toBeNull();
+    expect(
+      classifyFailoverReason("An unknown error occurred", { provider: "openrouter" }),
+    ).toBeNull();
+    expect(classifyFailoverReason("Provider returned error")).toBeNull();
+    expect(classifyFailoverReason("Unknown error")).toBeNull();
+    expect(classifyFailoverReason("LLM request failed with an unknown error.")).toBeNull();
+  });
 });

 describe("isFailoverErrorMessage", () => {

src/agents/pi-embedded-helpers/errors.ts

Lines changed: 26 additions & 6 deletions
@@ -371,6 +371,7 @@ export type FailoverSignal = {
   status?: number;
   code?: string;
   message?: string;
+  provider?: string;
 };

 export type FailoverClassification =
@@ -629,7 +630,19 @@ function classifyFailoverReasonFromCode(raw: string | undefined): FailoverReason
   }
 }

-function classifyFailoverClassificationFromMessage(raw: string): FailoverClassification | null {
+function isAnthropicProvider(provider?: string): boolean {
+  const normalized = provider?.trim().toLowerCase();
+  return Boolean(normalized && normalized.includes("anthropic"));
+}
+
+function isAnthropicGenericUnknownError(raw: string, provider?: string): boolean {
+  return isAnthropicProvider(provider) && raw.toLowerCase().includes("an unknown error occurred");
+}
+
+function classifyFailoverClassificationFromMessage(
+  raw: string,
+  provider?: string,
+): FailoverClassification | null {
   if (isImageDimensionErrorMessage(raw)) {
     return null;
   }
@@ -677,6 +690,9 @@ function classifyFailoverClassificationFromMessage(raw: string): FailoverClassif
   if (isAuthErrorMessage(raw)) {
     return toReasonClassification("auth");
   }
+  if (isAnthropicGenericUnknownError(raw, provider)) {
+    return toReasonClassification("timeout");
+  }
   if (isServerErrorMessage(raw)) {
     return toReasonClassification("timeout");
   }
@@ -703,7 +719,7 @@ export function classifyFailoverSignal(signal: FailoverSignal): FailoverClassifi
       ? signal.status
       : extractLeadingHttpStatus(signal.message?.trim() ?? "")?.code;
   const messageClassification = signal.message
-    ? classifyFailoverClassificationFromMessage(signal.message)
+    ? classifyFailoverClassificationFromMessage(signal.message, signal.provider)
     : null;
   const statusClassification = classifyFailoverClassificationFromHttpStatus(
     inferredStatus,
@@ -1207,24 +1223,28 @@ function isCliSessionExpiredErrorMessage(raw: string): boolean {
   );
 }

-export function classifyFailoverReason(raw: string): FailoverReason | null {
+export function classifyFailoverReason(
+  raw: string,
+  opts?: { provider?: string },
+): FailoverReason | null {
   const trimmed = raw.trim();
   const leadingStatus = extractLeadingHttpStatus(trimmed);
   return failoverReasonFromClassification(
     classifyFailoverSignal({
       status: leadingStatus?.code,
       message: raw,
+      provider: opts?.provider,
     }),
   );
 }

-export function isFailoverErrorMessage(raw: string): boolean {
-  return classifyFailoverReason(raw) !== null;
+export function isFailoverErrorMessage(raw: string, opts?: { provider?: string }): boolean {
+  return classifyFailoverReason(raw, opts) !== null;
 }

 export function isFailoverAssistantError(msg: AssistantMessage | undefined): boolean {
   if (!msg || msg.stopReason !== "error") {
     return false;
   }
-  return isFailoverErrorMessage(msg.errorMessage ?? "");
+  return isFailoverErrorMessage(msg.errorMessage ?? "", { provider: msg.provider });
 }

src/agents/pi-embedded-runner/run.ts

Lines changed: 7 additions & 2 deletions
@@ -1048,7 +1048,7 @@ export async function runEmbeddedPiAgent(
             };
           }
           const promptFailoverReason =
-            promptErrorDetails.reason ?? classifyFailoverReason(errorText);
+            promptErrorDetails.reason ?? classifyFailoverReason(errorText, { provider });
           const promptProfileFailureReason =
             resolveAuthProfileFailureReason(promptFailoverReason);
           await maybeMarkAuthProfileFailure({
@@ -1161,7 +1161,12 @@ export async function runEmbeddedPiAgent(
         const rateLimitFailure = isRateLimitAssistantError(lastAssistant);
         const billingFailure = isBillingAssistantError(lastAssistant);
         const failoverFailure = isFailoverAssistantError(lastAssistant);
-        const assistantFailoverReason = classifyFailoverReason(lastAssistant?.errorMessage ?? "");
+        const assistantFailoverReason = classifyFailoverReason(
+          lastAssistant?.errorMessage ?? "",
+          {
+            provider: lastAssistant?.provider,
+          },
+        );
         const assistantProfileFailureReason =
           resolveAuthProfileFailureReason(assistantFailoverReason);
         const cloudCodeAssistFormatError = attempt.cloudCodeAssistFormatError;

src/agents/pi-embedded-subscribe.handlers.lifecycle.ts

Lines changed: 3 additions & 1 deletion
@@ -47,7 +47,9 @@ export function handleAgentEnd(ctx: EmbeddedPiSubscribeContext) {
       model: lastAssistant.model,
     });
     const rawError = lastAssistant.errorMessage?.trim();
-    const failoverReason = classifyFailoverReason(rawError ?? "");
+    const failoverReason = classifyFailoverReason(rawError ?? "", {
+      provider: lastAssistant.provider,
+    });
     const errorText = (friendlyError || lastAssistant.errorMessage || "LLM request failed.").trim();
     const observedError = buildApiErrorObservationFields(rawError);
     const safeErrorText =
