
fix(azure): Drain split provider stream frames #80927

Merged
galiniliev merged 5 commits into openclaw:main from galiniliev:dev/galini/stalledsession
May 12, 2026

Conversation

@galiniliev (Contributor) commented May 12, 2026

Summary

  • Drain split provider response chunks inside the sanitizer until a parser-visible SSE event or JSON body is emitted.
  • Preserve fallback JSON body handling for non-SSE provider responses.
  • Add regression tests for split SSE frames and split JSON bodies in src/agents/provider-transport-fetch.test.ts.

This fixes the stalled-session path where the OpenAI SDK waited indefinitely because ReadableStream.pull() returned before the sanitizer emitted a complete event.
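To make the failure mode concrete, here is a minimal sketch of the drain-until-parser-visible pattern the fix applies. The name wrapProviderBody and the blank-line boundary handling are illustrative assumptions, not the actual code in src/agents/provider-transport-fetch.ts:

    // Hypothetical sketch only; the real sanitizer is more involved.
    function wrapProviderBody(upstream: ReadableStream<Uint8Array>): ReadableStream<Uint8Array> {
      const reader = upstream.getReader();
      const decoder = new TextDecoder();
      const encoder = new TextEncoder();
      let buffer = "";
      return new ReadableStream<Uint8Array>({
        async pull(controller) {
          // Loop instead of returning after one read: pull() only resolves
          // once a complete SSE event (terminated by a blank line) has been
          // enqueued, or the upstream closes.
          while (true) {
            const { value, done } = await reader.read();
            if (done) {
              if (buffer.length > 0) controller.enqueue(encoder.encode(buffer));
              controller.close();
              return;
            }
            buffer += decoder.decode(value, { stream: true });
            const boundary = buffer.lastIndexOf("\n\n");
            if (boundary !== -1) {
              controller.enqueue(encoder.encode(buffer.slice(0, boundary + 2)));
              buffer = buffer.slice(boundary + 2);
              return; // something parser-visible was emitted
            }
            // No complete event yet: keep draining rather than returning early.
          }
        },
        cancel(reason) {
          return reader.cancel(reason);
        },
      });
    }

Without the loop, pull() can resolve after buffering only a partial frame, so the downstream SSE parser never sees a complete event and the session appears stalled.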

Real behavior proof

  • Behavior or issue addressed: Provider SSE/JSON streams could stall when the sanitizer received split chunks and returned from ReadableStream.pull() before emitting a complete parser-visible event.
  • Real environment tested: Local OpenClaw gateway process on Node 22 with an isolated state directory under /tmp/openclaw-stalled-proof.lin3kf, gateway port 19023, and provider endpoint http://127.0.0.1:19024/v1 served by scripts/e2e/mock-openai-server.mjs.
  • Exact steps or command run after the patch: Started scripts/e2e/mock-openai-server.mjs on port 19024, started an isolated pnpm openclaw gateway on port 19023, configured openai/gpt-5.5 to use the local provider endpoint, then sent a direct callGateway RPC agent request.
  • Evidence after fix: Runtime log excerpt from the local OpenClaw gateway proof:

        [model-fetch] start
        status=200 contentType=text/event-stream
        [responses] first_event
        [responses] stream_done
        gateway/ws assistant text gateway-ok
        lifecycle end
        RPC result: status=ok payload text=gateway-ok

  • Observed result after fix: The gateway request completed instead of stalling; the direct callGateway RPC returned status: ok and assistant payload text gateway-ok.
  • What was not tested: Live Azure/OpenAI credentials were not used because the available OpenAI key returned 401 in this environment. The proof used the local OpenClaw gateway against a local provider endpoint to exercise the same streaming transport path without external credentials.

Before the fix, I restored src/agents/provider-transport-fetch.ts to origin/main and ran the split-SSE regression with --testTimeout=1500; the test timed out at 1500 ms, reproducing the stalled stream behavior. After the fix, the same split-SSE regression passed in about 100 ms, and the split JSON fallback regression passed in about 121 ms.
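The regression shape is roughly the following (a hedged sketch assuming the repo's vitest harness; the real test bodies in src/agents/provider-transport-fetch.test.ts differ, and wrapProviderBody stands in for the actual sanitizer entry point):

    import { describe, expect, it } from "vitest";

    // Sketch: one logical SSE event deliberately split across two chunks.
    function splitSseBody(): ReadableStream<Uint8Array> {
      const enc = new TextEncoder();
      return new ReadableStream<Uint8Array>({
        start(controller) {
          controller.enqueue(enc.encode('data: {"type":"response.outp'));
          controller.enqueue(enc.encode('ut_item.added"}\n\n'));
          controller.close();
        },
      });
    }

    describe("provider stream sanitizer", () => {
      it("continues reading until split SSE frames form a complete event", async () => {
        const reader = wrapProviderBody(splitSseBody()).getReader();
        const { value } = await reader.read();
        // The first emitted chunk must already be a complete, parseable event.
        expect(new TextDecoder().decode(value)).toContain(
          'data: {"type":"response.output_item.added"}',
        );
      });
    });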

Verification

  • pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose - 33 passed
  • pnpm check:changed - passed
  • pnpm test:changed - passed, 2 shards, 5 files, 65 tests
  • pnpm build - passed
  • git diff --check - passed

openclaw-barnacle (Bot) added labels extensions: memory-core, size: XS, and maintainer (Maintainer-authored PR) on May 12, 2026
@clawsweeper (Bot) commented May 12, 2026

Codex review: needs maintainer review before merge.

Summary
The branch drains split provider SSE/JSON chunks in the provider response sanitizer, flattens memory-core corpus schemas, adds an Azure Responses first-event timeout, and updates tests plus changelog.

Reproducibility: yes. Source inspection shows current main can buffer split SSE/JSON chunks without emitting parser-visible output from a sanitizer pull, and the PR supplies before-timeout plus after-pass proof for the focused split-frame regressions.

Real behavior proof
Sufficient (logs): The PR body and follow-up comment provide copied after-fix gateway/RPC logs showing the real OpenClaw gateway path completed with status=ok and assistant text gateway-ok against a local provider endpoint.

Next step before merge
Protected maintainer labeling and the overlapping Azure stall PR require maintainer consolidation; there is no narrow automated repair to queue.

Security
Cleared: The diff touches provider stream sanitization, memory tool schemas, tests, and changelog only; I found no dependency, workflow, permission, secret-handling, or package-resolution change.

Review details

Best possible solution:

Land or fold the split-frame draining fix with focused regressions while consolidating the shared memory enum and Azure first-event timeout pieces with #81015.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main can buffer split SSE/JSON chunks without emitting parser-visible output from a sanitizer pull, and the PR supplies before-timeout plus after-pass proof for the focused split-frame regressions.

Is this the best way to solve the issue?

Yes. Draining inside the existing provider response sanitizer until an SSE event, JSON body, or stream close is the narrowest maintainable fix; the shared memory schema and Azure timeout pieces should be coordinated with the related branch rather than treated as a correctness flaw here.
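As an illustration of those exit conditions, the completeness checks could look like this (illustrative helpers, not the repo's actual ones):

    // Illustrative only: when has the buffer become parser-visible?
    function hasCompleteSseEvent(buffer: string): boolean {
      // SSE events end at a blank line (either newline convention).
      return buffer.includes("\n\n") || buffer.includes("\r\n\r\n");
    }

    function isCompleteJsonBody(buffer: string): boolean {
      // The non-SSE fallback buffers the whole body; it is complete
      // once it parses as JSON.
      try {
        JSON.parse(buffer);
        return true;
      } catch {
        return false; // still waiting for more chunks
      }
    }

The third condition, stream close, falls out of the read loop's done flag, as in the pull() sketch earlier in this thread.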

What I checked:

Likely related people:

  • steipete: GitHub commit metadata and prior review context show repeated recent OpenAI/provider transport fixes plus the shared flat enum helper work on the touched surfaces. (role: recent provider transport and schema helper contributor; confidence: high; commits: 25e513e0782a, 1725eebe62d9, c2e3b6e6f819; files: src/agents/provider-transport-fetch.ts, src/agents/openai-transport-stream.ts, src/agents/schema/string-enum.ts)
  • Takhoffman: GitHub commit metadata shows recent merged memory-core tool-context work adjacent to the memory tool schema surface changed by this PR. (role: recent memory-core adjacent contributor; confidence: medium; commits: f74983e44220; files: extensions/memory-core/src/tools.shared.ts, extensions/memory-core/src/tools.test.ts)
  • Shakker: Local blame in this checkout attributes the current provider transport, OpenAI stream, memory schema, and string enum lines to the same recent current-main commit, so they are a routing candidate for this checked-out snapshot. (role: current checkout line-history contributor; confidence: low; commits: 7d208f3a5daf; files: src/agents/provider-transport-fetch.ts, src/agents/openai-transport-stream.ts, extensions/memory-core/src/tools.shared.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5681cfd83984.

@galiniliev (Contributor, Author) commented:
Updated branch proof after adding the provider transport fix.

Before / After Proof

Before

Running the rebased branch's regression test with the origin/main implementation restored for src/agents/provider-transport-fetch.ts:

git restore --source=origin/main --worktree src/agents/provider-transport-fetch.ts
pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading until split SSE frames" --testTimeout=1500

Result: failed by timeout at 1500 ms. The test reproduced the stall: the response wrapper had read partial chunks but had not emitted a complete SSE event to Stream.fromSSEResponse().

After

Same focused regression with the PR implementation restored:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading until split SSE frames" --testTimeout=1500

Result: passed in 100 ms.

Companion split-JSON fallback regression:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose -t "continues reading split JSON bodies" --testTimeout=1500

Result: passed in 121 ms.

Full provider transport file:

pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose

Result: 33 tests passed.

Gateway Proof

I started an isolated dev gateway and pointed openai/gpt-5.5 at the repo mock OpenAI server because the local OPENAI_API_KEY is a dummy key and live OpenAI returns 401. The gateway run used the real gateway/RPC/agent path and the OpenAI Responses stream shape from scripts/e2e/mock-openai-server.mjs; a hypothetical sketch of the split-frame SSE shape this exercises follows the log bullets below.

Observed proof from gateway logs:

  • provider-transport-fetch: [model-fetch] start provider=openai api=openai-responses model=gpt-5.5 method=POST url=http://127.0.0.1:19024/v1/responses
  • provider-transport-fetch: [model-fetch] response ... status=200 ... contentType=text/event-stream
  • openai-transport: [responses] first_event ... type=response.output_item.added
  • openai-transport: [responses] stream_done ... events=3 ... stopReason=stop
  • gateway/ws: assistant stream event contained text=gateway-ok
  • RPC result: status: ok, payload text gateway-ok
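For readers without the repo handy, a split-frame Responses-style SSE handler of the kind this proof exercises might look like the following; the port, payloads, and timing are assumptions, not the contents of scripts/e2e/mock-openai-server.mjs:

    import { createServer } from "node:http";

    // Hypothetical mock: emit one SSE event split across two writes so the
    // sanitizer's drain-until-complete path is exercised end to end.
    const server = createServer((_req, res) => {
      res.writeHead(200, { "content-type": "text/event-stream" });
      res.write('data: {"type":"response.outp'); // partial frame first
      setTimeout(() => {
        res.write('ut_item.added"}\n\n'); // frame completes on a later chunk
        res.write('data: {"type":"response.completed"}\n\n');
        res.end();
      }, 10);
    });

    server.listen(19024, "127.0.0.1");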

Verification

  • pnpm test extensions/memory-core/src/tools.test.ts -- --reporter=verbose
  • pnpm test src/agents/provider-transport-fetch.test.ts -- --reporter=verbose
  • pnpm check:changed
  • pnpm test:changed
  • pnpm build
  • git diff --check

galiniliev changed the title from "Use flat enums for memory tool corpus schemas" to "Drain split provider stream frames" on May 12, 2026
clawsweeper (Bot) added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing) on May 12, 2026
galiniliev changed the title from "Drain split provider stream frames" to "fix(azure): Drain split provider stream frames" on May 12, 2026
galiniliev requested a review from a team as a code owner on May 12, 2026 16:21
openclaw-barnacle (Bot) added labels channel: googlechat, channel: line, channel: matrix, channel: nostr, channel: signal, channel: telegram, channel: tlon, channel: voice-call, channel: whatsapp-web, app: web-ui, gateway, scripts, commands, docker, and channel: feishu on May 12, 2026
openclaw-barnacle (Bot) added size: M and removed labels channel: whatsapp-web, app: web-ui, gateway, scripts, commands, docker, channel: feishu, extensions: openai, extensions: minimax, channel: qqbot, extensions: qa-lab, extensions: codex, plugin: bonjour, channel: synology-chat, and size: XL on May 12, 2026
galiniliev force-pushed the dev/galini/stalledsession branch 3 times, most recently from 3e57a41 to 2a5b223, on May 12, 2026 16:47
galiniliev force-pushed the dev/galini/stalledsession branch from 2a5b223 to 03a7e1f on May 12, 2026 16:49
galiniliev merged commit 4b28312 into openclaw:main on May 12, 2026 (17 of 18 checks passed)
@galiniliev (Contributor, Author) commented:

Merged via squash.

Thanks @galiniliev!

galiniliev deleted the dev/galini/stalledsession branch on May 12, 2026 16:49

steipete pushed a commit that referenced this pull request on May 12, 2026:

    Merged via squash.
    Prepared head SHA: 03a7e1f
    Co-authored-by: galiniliev <5711535+galiniliev@users.noreply.github.com>
    Reviewed-by: @galiniliev

eleqtrizit pushed a commit to eleqtrizit/openclaw that referenced this pull request on May 14, 2026:

    Merged via squash.
    Prepared head SHA: 03a7e1f
    Co-authored-by: galiniliev <5711535+galiniliev@users.noreply.github.com>
    Reviewed-by: @galiniliev

Labels

  • agents (Agent runtime and tooling)
  • extensions: memory-core (Extension: memory-core)
  • maintainer (Maintainer-authored PR)
  • proof: sufficient (ClawSweeper judged the real behavior proof convincing)
  • size: M
