
fix(azure): Drain split provider stream frames #80927

Merged
galiniliev merged 5 commits into openclaw:main from galiniliev:dev/galini/stalledsession
May 12, 2026

Conversation

@galiniliev (Contributor) commented May 12, 2026

Summary

  • Drain split provider response chunks inside the sanitizer until a parser-visible SSE event or JSON body is emitted.
  • Preserve fallback JSON body handling for non-SSE provider responses.
  • Add regression tests for split SSE frames and split JSON bodies in src/agents/provider-transport-fetch.test.ts.

This fixes the stalled-session path where the OpenAI SDK waited indefinitely because ReadableStream.pull() returned before the sanitizer emitted a complete event.
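To make the failure mode concrete, here is a minimal sketch of the drain-until-parser-visible pattern the fix applies. The name wrapProviderBody and the blank-line boundary handling are illustrative assumptions, not the actual code in src/agents/provider-transport-fetch.ts:

    // Hypothetical sketch only; the real sanitizer is more involved.
    function wrapProviderBody(upstream: ReadableStream<Uint8Array>): ReadableStream<Uint8Array> {
      const reader = upstream.getReader();
      const decoder = new TextDecoder();
      const encoder = new TextEncoder();
      let buffer = "";
      return new ReadableStream<Uint8Array>({
        async pull(controller) {
          // Loop instead of returning after one read: pull() only resolves
          // once a complete SSE event (terminated by a blank line) has been
          // enqueued, or the upstream closes.
          while (true) {
            const { value, done } = await reader.read();
            if (done) {
              if (buffer.length > 0) controller.enqueue(encoder.encode(buffer));
              controller.close();
              return;
            }
            buffer += decoder.decode(value, { stream: true });
            const boundary = buffer.lastIndexOf("\n\n");
            if (boundary !== -1) {
              controller.enqueue(encoder.encode(buffer.slice(0, boundary + 2)));
              buffer = buffer.slice(boundary + 2);
              return; // something parser-visible was emitted
            }
            // No complete event yet: keep draining rather than returning early.
          }
        },
        cancel(reason) {
          return reader.cancel(reason);
        },
      });
    }

Without the loop, pull() can resolve after buffering only a partial frame, so the downstream SSE parser never sees a complete event and the session appears stalled.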

Real behavior proof

  • Behavior or issue addressed: Provider SSE/JSON streams could stall when the sanitizer received split chunks and returned from ReadableStream.pull() before emitting a complete parser-visible event.
  • Real environment tested: Local OpenClaw gateway process on Node 22 with an isolated state directory under /tmp/openclaw-stalled-proof.lin3kf, gateway port 19023, and provider endpoint http://127.0.0.1:19024/v1 served by scripts/e2e/mock-openai-server.mjs.
  • Exact steps or command run after the patch: Started scripts/e2e/mock-openai-server.mjs on port 19024, started an isolated pnpm openclaw gateway on port 19023, configured openai/gpt-5.5 to use the local provider endpoint, then sent a direct callGateway RPC agent request.
  • Evidence after fix: Runtime log excerpt from the local OpenClaw gateway proof:

        [model-fetch] start
        status=200 contentType=text/event-stream
        [responses] first_event
        [responses] stream_done
        gateway/ws assistant text gateway-ok
        lifecycle end
        RPC result: status=ok payload text=gateway-ok

  • Observed result after fix: The gateway request completed instead of stalling; the direct callGateway RPC returned status: ok and assistant payload text gateway-ok.
  • What was not tested: Live Azure/OpenAI credentials were not used because the available OpenAI key returned 401 in this environment. The proof used the local OpenClaw gateway against a local provider endpoint to exercise the same streaming transport path without external credentials.

Before the fix, I restored src/agents/provider-transport-fetch.ts to origin/main and ran the split-SSE regression with --testTimeout=1500; the test timed out at 1500 ms, reproducing the stalled stream behavior. After the fix, the same split-SSE regression passed in about 100 ms, and the split JSON fallback regression passed in about 121 ms.
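The regression shape is roughly the following (a hedged sketch assuming the repo's vitest harness; the real test bodies in src/agents/provider-transport-fetch.test.ts differ, and wrapProviderBody stands in for the actual sanitizer entry point):

    import { describe, expect, it } from "vitest";

    // Sketch: one logical SSE event deliberately split across two chunks.
    function splitSseBody(): ReadableStream<Uint8Array> {
      const enc = new TextEncoder();
      return new ReadableStream<Uint8Array>({
        start(controller) {
          controller.enqueue(enc.encode('data: {"type":"response.outp'));
          controller.enqueue(enc.encode('ut_item.added"}\n\n'));
          controller.close();
        },
      });
    }

    describe("provider stream sanitizer", () => {
      it("continues reading until split SSE frames form a complete event", async () => {
        const reader = wrapProviderBody(splitSseBody()).getReader();
        const { value } = await reader.read();
        // The first emitted chunk must already be a complete, parseable event.
        expect(new TextDecoder().decode(value)).toContain(
          'data: {"type":"response.output_item.added"}',
        );
      });
    });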

Verification

  • pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose - 33 passed
  • pnpm check:changed - passed
  • pnpm test:changed - passed, 2 shards, 5 files, 65 tests
  • pnpm build - passed
  • git diff --check - passed

openclaw-barnacle (Bot) added labels extensions: memory-core, size: XS, and maintainer (Maintainer-authored PR) on May 12, 2026
@clawsweeper (Bot) commented May 12, 2026

Codex review: needs maintainer review before merge.

Summary
The branch drains split provider SSE/JSON chunks in the provider response sanitizer, flattens memory-core corpus schemas, adds an Azure Responses first-event timeout, and updates tests plus changelog.

Reproducibility: yes. Source inspection shows current main can buffer split SSE/JSON chunks without emitting parser-visible output from a sanitizer pull, and the PR supplies before-timeout plus after-pass proof for the focused split-frame regressions.

Real behavior proof
Sufficient (logs): The PR body and follow-up comment provide copied after-fix gateway/RPC logs showing the real OpenClaw gateway path completed with status=ok and assistant text gateway-ok against a local provider endpoint.

Next step before merge
Protected maintainer labeling and the overlapping Azure stall PR require maintainer consolidation; there is no narrow automated repair to queue.

Security
Cleared: The diff touches provider stream sanitization, memory tool schemas, tests, and changelog only; I found no dependency, workflow, permission, secret-handling, or package-resolution change.

Review details

Best possible solution:

Land or fold the split-frame draining fix with focused regressions while consolidating the shared memory enum and Azure first-event timeout pieces with #81015.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main can buffer split SSE/JSON chunks without emitting parser-visible output from a sanitizer pull, and the PR supplies before-timeout plus after-pass proof for the focused split-frame regressions.

Is this the best way to solve the issue?

Yes. Draining inside the existing provider response sanitizer until an SSE event, JSON body, or stream close is the narrowest maintainable fix; the shared memory schema and Azure timeout pieces should be coordinated with the related branch rather than treated as a correctness flaw here.
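As an illustration of those exit conditions, the completeness checks could look like this (illustrative helpers, not the repo's actual ones):

    // Illustrative only: when has the buffer become parser-visible?
    function hasCompleteSseEvent(buffer: string): boolean {
      // SSE events end at a blank line (either newline convention).
      return buffer.includes("\n\n") || buffer.includes("\r\n\r\n");
    }

    function isCompleteJsonBody(buffer: string): boolean {
      // The non-SSE fallback buffers the whole body; it is complete
      // once it parses as JSON.
      try {
        JSON.parse(buffer);
        return true;
      } catch {
        return false; // still waiting for more chunks
      }
    }

The third condition, stream close, falls out of the read loop's done flag, as in the pull() sketch earlier in this thread.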

What I checked:

Likely related people:

  • steipete: GitHub commit metadata and prior review context show repeated recent OpenAI/provider transport fixes plus the shared flat enum helper work on the touched surfaces. (role: recent provider transport and schema helper contributor; confidence: high; commits: 25e513e0782a, 1725eebe62d9, c2e3b6e6f819; files: src/agents/provider-transport-fetch.ts, src/agents/openai-transport-stream.ts, src/agents/schema/string-enum.ts)
  • Takhoffman: GitHub commit metadata shows recent merged memory-core tool-context work adjacent to the memory tool schema surface changed by this PR. (role: recent memory-core adjacent contributor; confidence: medium; commits: f74983e44220; files: extensions/memory-core/src/tools.shared.ts, extensions/memory-core/src/tools.test.ts)
  • Shakker: Local blame in this checkout attributes the current provider transport, OpenAI stream, memory schema, and string enum lines to the same recent current-main commit, so they are a routing candidate for this checked-out snapshot. (role: current checkout line-history contributor; confidence: low; commits: 7d208f3a5daf; files: src/agents/provider-transport-fetch.ts, src/agents/openai-transport-stream.ts, extensions/memory-core/src/tools.shared.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5681cfd83984.

@galiniliev (Contributor, Author) commented:
Updated branch proof after adding the provider transport fix.

Before / After Proof

Before

Running the rebased branch's regression test with the origin/main implementation restored for src/agents/provider-transport-fetch.ts:

git restore --source=origin/main --worktree src/agents/provider-transport-fetch.ts
pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading until split SSE frames" --testTimeout=1500

Result: failed by timeout at 1500 ms. The test reproduced the stall: the response wrapper had read partial chunks but had not emitted a complete SSE event to Stream.fromSSEResponse().

After

Same focused regression with the PR implementation restored:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading until split SSE frames" --testTimeout=1500

Result: passed in 100 ms.

Companion split-JSON fallback regression:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading split JSON bodies" --testTimeout=1500

Result: passed in 121 ms.

Full provider transport file:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose

Result: 33 tests passed.

Gateway Proof

I started an isolated dev gateway and pointed openai/gpt-5.5 at the repo mock OpenAI server because the local OPENAI_API_KEY is a dummy key and live OpenAI returns 401. The gateway run used the real gateway/RPC/agent path and the OpenAI Responses stream shape from scripts/e2e/mock-openai-server.mjs; a hypothetical sketch of the split-frame SSE shape this exercises follows the log bullets below.

Observed proof from gateway logs:

  • provider-transport-fetch: [model-fetch] start provider=openai api=openai-responses model=gpt-5.5 method=POST url=http://127.0.0.1:19024/v1/responses
  • provider-transport-fetch: [model-fetch] response ... status=200 ... contentType=text/event-stream
  • openai-transport: [responses] first_event ... type=response.output_item.added
  • openai-transport: [responses] stream_done ... events=3 ... stopReason=stop
  • gateway/ws: assistant stream event contained text=gateway-ok
  • RPC result: status: ok, payload text gateway-ok
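For readers without the repo handy, a split-frame Responses-style SSE handler of the kind this proof exercises might look like the following; the port, payloads, and timing are assumptions, not the contents of scripts/e2e/mock-openai-server.mjs:

    import { createServer } from "node:http";

    // Hypothetical mock: emit one SSE event split across two writes so the
    // sanitizer's drain-until-complete path is exercised end to end.
    const server = createServer((_req, res) => {
      res.writeHead(200, { "content-type": "text/event-stream" });
      res.write('data: {"type":"response.outp'); // partial frame first
      setTimeout(() => {
        res.write('ut_item.added"}\n\n'); // frame completes on a later chunk
        res.write('data: {"type":"response.completed"}\n\n');
        res.end();
      }, 10);
    });

    server.listen(19024, "127.0.0.1");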

Verification

  • pnpm test extensions/memory-core/src/tools.test.ts -- --reporter=verbose
  • pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose
  • pnpm check:changed
  • pnpm test:changed
  • pnpm build
  • git diff --check

galiniliev changed the title from "Use flat enums for memory tool corpus schemas" to "Drain split provider stream frames" on May 12, 2026
clawsweeper (Bot) added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing) on May 12, 2026
galiniliev changed the title from "Drain split provider stream frames" to "fix(azure): Drain split provider stream frames" on May 12, 2026
galiniliev requested a review from a team as a code owner on May 12, 2026 16:21
openclaw-barnacle (Bot) added labels channel: googlechat, channel: line, channel: matrix, channel: nostr, channel: signal, channel: telegram, channel: tlon, channel: voice-call, channel: whatsapp-web, app: web-ui, gateway, scripts, commands, docker, and channel: feishu on May 12, 2026
openclaw-barnacle (Bot) added size: M and removed labels channel: whatsapp-web, app: web-ui, gateway, scripts, commands, docker, channel: feishu, extensions: openai, extensions: minimax, channel: qqbot, extensions: qa-lab, extensions: codex, plugin: bonjour, channel: synology-chat, and size: XL on May 12, 2026
galiniliev force-pushed the dev/galini/stalledsession branch 3 times, most recently from 3e57a41 to 2a5b223, on May 12, 2026 16:47
galiniliev force-pushed the dev/galini/stalledsession branch from 2a5b223 to 03a7e1f on May 12, 2026 16:49
galiniliev merged commit 4b28312 into openclaw:main on May 12, 2026 (17 of 18 checks passed)
@galiniliev (Contributor, Author) commented:

Merged via squash.

Thanks @galiniliev!

galiniliev deleted the dev/galini/stalledsession branch on May 12, 2026 16:49

steipete pushed a commit that referenced this pull request on May 12, 2026:

    Merged via squash.
    Prepared head SHA: 03a7e1f
    Co-authored-by: galiniliev <5711535+galiniliev@users.noreply.github.com>
    Reviewed-by: @galiniliev

eleqtrizit pushed a commit to eleqtrizit/openclaw that referenced this pull request on May 14, 2026:

    Merged via squash.
    Prepared head SHA: 03a7e1f
    Co-authored-by: galiniliev <5711535+galiniliev@users.noreply.github.com>
    Reviewed-by: @galiniliev

Labels

  • agents (Agent runtime and tooling)
  • extensions: memory-core (Extension: memory-core)
  • maintainer (Maintainer-authored PR)
  • proof: sufficient (ClawSweeper judged the real behavior proof convincing)
  • size: M
