Skip to content

fix(agents): propagate provider errorMessage in transport stream throws#69091

Open
CoKeFish wants to merge 1 commit intoopenclaw:mainfrom
CoKeFish:fix/propagate-transport-error-message
Open

fix(agents): propagate provider errorMessage in transport stream throws#69091
CoKeFish wants to merge 1 commit intoopenclaw:mainfrom
CoKeFish:fix/propagate-transport-error-message

Conversation

@CoKeFish
Copy link
Copy Markdown

Summary

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: the three throws in transport-stream-shared.ts and openai-transport-stream.ts hard-coded "An unknown error occurred" even when the stream had already populated output.errorMessage with the provider signal. The outer catch then overwrites output.errorMessage with the thrown string, so the real cause is lost end-to-end.
  • Missing detection / guardrail: no unit test asserted that finalizeTransportStream propagates output.errorMessage on the error branch.
  • Contributing context: Gemini finishReason (SAFETY, RECITATION, MALFORMED_FUNCTION_CALL) swallowed as generic error #1914 fixed the analogous case inside @mariozechner/pi-ai, but the openclaw transport layer kept its own generic throws.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/transport-stream-shared.test.ts
  • Scenario the test should lock in: finalizeTransportStream with stopReason: "error" throws the real output.errorMessage; with errorMessage absent, falls back to the generic string.
  • Why this is the smallest reliable guardrail: the OpenAI variants duplicate the shared logic; locking the shared helper pins the contract.
  • Existing test that already covers this (if any): None.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Error logs and UI that surface stream errors now show the provider-reported reason (e.g. Provider finish_reason: MALFORMED_FUNCTION_CALL) instead of the generic "An unknown error occurred". No config/default changes.

Diagram (if applicable)

Before:
[provider stream sets output.errorMessage = "Provider finish_reason: MALFORMED_FUNCTION_CALL"]
  -> [finalize detects stopReason === "error"]
  -> [throw new Error("An unknown error occurred")]
  -> [outer catch overwrites output.errorMessage with the thrown string]
  -> [caller logs "An unknown error occurred"]  // real cause lost

After:
[provider stream sets output.errorMessage = "Provider finish_reason: MALFORMED_FUNCTION_CALL"]
  -> [finalize detects stopReason === "error"]
  -> [throw new Error(output.errorMessage || "An unknown error occurred")]
  -> [outer catch reassigns output.errorMessage = error.message, which already is the real cause]
  -> [caller logs "Provider finish_reason: MALFORMED_FUNCTION_CALL"]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Note: output.errorMessage carries provider error strings that may contain request identifiers or truncated payload snippets. This is the same surface pi-embedded-runner already logs via output.errorMessage on the non-throwing paths, so no new sensitive-data exposure is introduced.

Repro + Verification

Environment (original bug)

  • OS: Ubuntu 22.04 x64 (production VPS)
  • Runtime: Node 22 + openclaw 2026.4.14 (323493f), systemd-managed gateway
  • Provider: Gemini via google-genai transport
  • Symptom: mid-flow invoice ingestion (2026-04-19 ~19:58 UTC) failed with output.errorMessage = "Provider finish_reason: MALFORMED_FUNCTION_CALL", but the pi-embedded-runner log surfaced only "An unknown error occurred". Root-caused by tracing the stream to the three throw sites in transport-stream-shared.ts and openai-transport-stream.ts.

Environment (fix verification)

  • OS: Windows 11 + Node 22.22 + pnpm 10.33 (fresh checkout of main)
  • pnpm exec vitest run src/agents/transport-stream-shared.test.ts → 6/6 pass
  • pnpm exec vitest run src/agents/openai-transport-stream.test.ts → 65/65 pass
  • New unit tests fail against pre-patch main (assert the real errorMessage survives the throw), pass after the change.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • finalizeTransportStream error branch with concrete errorMessage → throws the real message (new unit test)
    • finalizeTransportStream error branch without errorMessage → throws the generic fallback (new unit test)
    • Existing transport-stream-shared.test.ts suite still green (4 pre-existing tests)
    • Existing openai-transport-stream.test.ts suite still green (65 tests, covers the two updated call sites)
  • Edge cases checked:
    • errorMessage set to empty string → falls back to generic (|| short-circuits)
    • stopReason === "aborted" path unchanged (no errorMessage populated; still surfaces generic)
  • What I did not verify:
    • Full pnpm test run on Windows (unrelated platform-specific flakiness in other suites)
    • Live Gemini repro against the production VPS after patching (the observed-bug environment runs a node_modules hotpatch; the upstream fix is equivalent)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: a provider errorMessage may contain characters or patterns that downstream log parsers didn't expect (they used to always see the literal "An unknown error occurred").
    • Mitigation: the same strings are already surfaced via output.errorMessage on the non-throwing paths (failTransportStream), so any parser that reads error events already tolerates them. No new surface.
  • Risk: classifyFailoverReason behavior could shift if it ever starts reading the thrown Error.message.
    • Mitigation: current implementation in pi-embedded-helpers/errors.ts:1272 reads msg.errorMessage, not the thrown string — unchanged.

AI-assisted

  • AI-assisted (Claude Opus 4.7)
  • Degree of testing: lightly tested (targeted unit suites green; did not run full pnpm test on Windows due to unrelated platform flakiness).
  • Confirmed I understand what the code does: yes — three identical call sites swap a hardcoded string for output.errorMessage || "An unknown error occurred"; added unit tests lock in both branches in the shared helper that the OpenAI variants duplicate.
  • Codex review: not run (no local access).

When a provider stream sets `output.stopReason = "error"` with a concrete
`output.errorMessage` (e.g. `Provider finish_reason: MALFORMED_FUNCTION_CALL`),
the subsequent `throw new Error("An unknown error occurred")` discarded that
message. Callers like `pi-embedded-runner` only logged the generic string,
making post-mortems blind to the real provider signal.

Thread the real `output.errorMessage` into the throw, falling back to the
generic string when it is absent. No flow-control change — only the thrown
message.

Refs openclaw#1914 (root cause acknowledged in pi-ai, still swallowed here),
     openclaw#59524 (model fallback treats finish_reason:error as success),
     openclaw#60473 (sub-agent reports success on stopReason:error).

AI-assisted: Claude Opus 4.7 · lightly tested (new unit tests + existing
transport-stream-shared + openai-transport-stream suites green locally).
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: XS labels Apr 19, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 19, 2026

Greptile Summary

This PR fixes a bug where finalizeTransportStream and the two OpenAI Responses transport variants discarded the provider-supplied output.errorMessage (e.g. "Provider finish_reason: MALFORMED_FUNCTION_CALL") by hardcoding "An unknown error occurred" in their throw statements. The fix is a minimal one-liner per call site (output.errorMessage || "An unknown error occurred"), and two new unit tests lock in both branches of the shared helper.

Confidence Score: 5/5

Safe to merge — minimal, well-tested change with no P0/P1 findings.

The fix is a targeted one-liner at three identical call sites, unit tests cover both the propagation and fallback branches, and existing test suites remain green. No logic regressions or security concerns identified.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(agents): propagate provider errorMes..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant