fix(gateway): yield during embedded agent prep by sahilsatralkar · Pull Request #78958 · openclaw/openclaw

sahilsatralkar · 2026-05-07T13:40:24Z

Summary

Problem: embedded agent preparation can run heavy synchronous prep blocks in the Gateway process before model
execution.
Why it matters: under load, cheap Gateway/WebSocket requests can become coupled to long agent prep work,
contributing to poor responsiveness like the symptoms reported in [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861.
What changed: added a private cooperative setImmediate prep-yield helper and inserted safe yield checkpoints
after core tool construction, bootstrap context, and bundled tool preparation.
What did NOT change (scope boundary): no worker threads, priority queue, caching, lazy loading, protocol/config
changes, plugin/channel behavior, or public API changes.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: Gateway responsiveness during embedded agent preparation.
Real environment tested: Local macOS development checkout; no live reporter-like Windows/provider setup was
available.
Exact steps or command run after this patch: pnpm test ..., pnpm build, targeted formatter, and changed-lane
gate attempt.
Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked
artifact, or copied live output): terminal output from targeted tests and build showing passing results.
Observed result after fix: cheap Gateway request handling is covered by a regression test proving it is not
coupled to agent completion after accepted dispatch.
What was not tested: reporter’s Windows 10 / Node 25.9.0 / private provider config and live WebSocket latency
under real load.
Before evidence (optional but encouraged): [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861 issue logs.

Root Cause (if applicable)

Root cause: embedded agent prep performs several potentially expensive preparation phases inside the Gateway event
loop before model execution, with limited cooperative scheduling between phases.
Missing detection / guardrail: no focused regression test proving cheap Gateway requests remain serviceable while
an accepted agent run is still pending.
Contributing context (if known): [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861 reports long prep-stage timings in core-plugin-tools, bootstrap-context, bundle-tools, and system-prompt.
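
The coupling described in the root cause can be reproduced in isolation with a minimal sketch, assuming a Node.js runtime. The names `blockFor`, `measureTimerLag`, and `lagWhileBlocked` are hypothetical and do not appear in this PR; the busy loop only stands in for a synchronous prep phase.

```typescript
// Hypothetical sketch for illustration; not code from this PR.

// Busy-waits for roughly `ms` milliseconds, standing in for synchronous prep work.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else on this thread can run
  }
}

// Resolves with how many milliseconds late a timer fired.
function measureTimerLag(delayMs: number): Promise<number> {
  const scheduledAt = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => {
      resolve(Math.max(0, Date.now() - scheduledAt - delayMs));
    }, delayMs);
  });
}

// Starts a 1 ms probe timer, then blocks the loop, and reports the probe's lag.
async function lagWhileBlocked(blockMs: number): Promise<number> {
  const lag = measureTimerLag(1);
  blockFor(blockMs);
  return lag;
}
```

The probe timer cannot fire until the synchronous block returns, so its lag grows with the length of the block. A cheap request queued behind uninterrupted prep work would be delayed in the same way.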

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/gateway/server-methods/agent.test.ts, src/agents/pi-embedded-runner/run/attempt.test.ts, src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
Scenario the test should lock in: accepted agent dispatch yields before heavy work, cheap Gateway requests can be
served while the agent result is pending, and prep checkpoint continuations await the injected yield before model
execution proceeds.
Why this is the smallest reliable guardrail: it verifies the exact scheduling seam without starting a real
Gateway, provider, or channel runtime.
Existing test that already covers this (if any): existing accepted-ack yield coverage in src/gateway/server-methods/agent.test.ts.
If no new test is added, why not: N/A; new tests were added.
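
The scheduling seam in this plan can be exercised without a real Gateway by injecting the yield as a promise the test controls. The following is a minimal sketch under that assumption; `createDeferred`, `runAttempt`, and `checkpointAwaitsInjectedYield` are hypothetical names, and `runAttempt` is a stand-in rather than the PR's `runEmbeddedAttempt`.

```typescript
// Hypothetical sketch for illustration; not the PR's test code.
type Deferred = { promise: Promise<void>; resolve: () => void };

// A promise whose resolution is controlled by the caller.
function createDeferred(): Deferred {
  let resolve!: () => void;
  const promise = new Promise<void>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

// Stand-in for the attempt path: one prep checkpoint, then model execution.
async function runAttempt(
  yieldAfterCheckpoint: () => Promise<void>,
  events: string[],
): Promise<void> {
  events.push("prep");
  await yieldAfterCheckpoint();
  events.push("model-execution");
}

// Records the order of events when the injected yield is held open by the test.
async function checkpointAwaitsInjectedYield(): Promise<string[]> {
  const events: string[] = [];
  const gate = createDeferred();
  const run = runAttempt(() => {
    events.push("yield-requested");
    return gate.promise;
  }, events);

  // Drain pending microtasks; model execution must still be waiting on the gate.
  await Promise.resolve();
  await Promise.resolve();
  events.push("cheap-request-served");

  gate.resolve();
  await run;
  return events;
}
```

Because the test holds the gate, it can assert that model execution has not started while other work is served, which is the property the plan aims to lock in.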

User-visible / Behavior Changes

Gateway/Control UI responsiveness should improve during embedded agent prep because the runner now cooperatively
yields between major prep phases. No config or API changes.

Diagram (if applicable)

Before:
[agent accepted] -> [core tools + bootstrap + bundle tools prep in one long run] -> [model execution] -> [cheap gateway request waits for event loop opportunity]

After:
[agent accepted] -> [core tools] -> [yield] -> [bootstrap] -> [yield] -> [bundle tools] -> [yield] -> [model execution] -> [cheap gateway request can run between prep phases]
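
The two flows above can be compared with a minimal sketch, assuming a Node.js runtime. All names here (`yieldToEventLoop`, `busyWork`, `runPrep`, `demo`) are hypothetical and are not the helper added by this PR. The cheap request is simulated with `setImmediate` to keep the ordering deterministic; real Gateway requests arrive as I/O events.

```typescript
// Hypothetical sketch for illustration; not the helper added by this PR.

// Resolves on a later event-loop turn, after already-queued immediates have run.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setImmediate(resolve));
}

// Stand-in for a synchronous prep phase.
function busyWork(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Runs three prep phases, optionally yielding after each one.
async function runPrep(log: string[], cooperative: boolean): Promise<void> {
  for (const phase of ["core-tools", "bootstrap", "bundle-tools"]) {
    busyWork(5);
    log.push(phase);
    if (cooperative) {
      await yieldToEventLoop();
    }
  }
  log.push("model-execution");
}

// Queues a simulated cheap request, then starts prep, and returns the event order.
async function demo(cooperative: boolean): Promise<string[]> {
  const log: string[] = [];
  const cheap = new Promise<void>((resolve) =>
    setImmediate(() => {
      log.push("cheap-request");
      resolve();
    }),
  );
  await Promise.all([runPrep(log, cooperative), cheap]);
  return log;
}
```

With `cooperative` set to `false`, the simulated request is logged only after `model-execution`. With `true`, it is logged right after `core-tools`, at the first yield.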

Security Impact (required)

New permissions/capabilities? (Yes/No) No
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) No
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: macOS local development environment
Runtime/container: local Node/pnpm workspace
Model/provider: mocked/local tests only
Integration/channel (if any): none
Relevant config (redacted): default test config/mocks

Steps

Run targeted Gateway and agent runner tests.
Run targeted formatter and git diff --check.
Run pnpm build.
Run pnpm changed:lanes --json and pnpm check:changed.

Expected

Targeted tests pass.
Build passes.
Changed gate either passes or reports only unrelated actionable failures.

Actual

Targeted tests passed.
pnpm build passed.
pnpm check:changed failed in unrelated existing core test typecheck fixtures outside this diff.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Passing after:

pnpm test src/gateway/server-methods/agent.test.ts
pnpm test src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
pnpm test src/agents/pi-embedded-runner/run/attempt.test.ts
pnpm test src/process/command-queue.test.ts src/gateway/server/event-loop-health.test.ts src/gateway/server-methods/agent.test.ts
pnpm test src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
pnpm exec oxfmt --check --threads=1 src/gateway/server-methods/agent.test.ts src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
git diff --check
pnpm build

Known gate caveat:

pnpm check:changed failed after pnpm install in unrelated existing test fixture type errors:
- src/agents/openai-transport-stream.test.ts
- src/agents/pi-embedded-runner/openai-stream-wrappers.test.ts

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: accepted agent dispatch still responds first; cheap Gateway request can be handled while agent
result is pending; prep yield helper budgets and resets correctly; prep checkpoint awaits the injected yield
before model execution continuation.
Edge cases checked: helper does not yield before budget; helper yields after budget; helper reset restarts
accounting.
What you did not verify: live reporter environment, Windows-specific behavior, real provider latency, real
WebSocket latency under high-load production config.
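
The budget and reset edge cases listed above could be checked against a helper shaped roughly like the following. This is a sketch under assumptions: the PR text does not define the budget, so a time budget in milliseconds with an injected clock is assumed here, and `createPrepYieldController`, `checkpoint`, `reset`, and `setImmediateYield` are hypothetical names.

```typescript
// Hypothetical sketch; the PR's actual helper and its budget semantics may differ.
type PrepYieldDeps = {
  budgetMs: number; // assumed: minimum elapsed time before a checkpoint yields
  now: () => number; // injected clock, so tests can control time
  yieldFn: () => Promise<void>; // injected yield, so tests can observe calls
};

function createPrepYieldController(deps: PrepYieldDeps) {
  let windowStart = deps.now();
  return {
    // Yields only when at least budgetMs has elapsed since the last yield or reset.
    async checkpoint(): Promise<boolean> {
      if (deps.now() - windowStart < deps.budgetMs) {
        return false;
      }
      await deps.yieldFn();
      windowStart = deps.now();
      return true;
    },
    // Restarts accounting without yielding.
    reset(): void {
      windowStart = deps.now();
    },
  };
}

// A setImmediate-backed yield that could be passed as yieldFn outside tests.
const setImmediateYield = (): Promise<void> =>
  new Promise<void>((resolve) => setImmediate(resolve));
```

Injecting the clock and the yield function keeps the three edge cases deterministic: no yield before the budget, a yield once it is exceeded, and fresh accounting after `reset`.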

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review
conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: cooperative yields can slightly increase best-case prep latency by adding event-loop turns.
- Mitigation: yields are limited to three major prep boundaries before locks, transcript mutation, provider
  stream handling, and model/tool execution.
Risk: this does not fully solve the broader architecture requested in [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861.
- Mitigation: PR is intentionally scoped as a small responsiveness mitigation; worker isolation, priority
  scheduling, caching, and lazy loading remain separate maintainer-level work.

Built with Codex

clawsweeper · 2026-05-07T13:43:52Z

Codex review: needs real behavior proof before merge.

Summary
Adds a setImmediate-backed embedded-agent prep-yield controller, awaits it after core-tool, bootstrap-context, and bundle-tool prep checkpoints, and adds focused runner/Gateway tests.

Reproducibility: no, not for the exact 15-100s WebSocket and multi-minute agent timings. Source inspection does confirm the relevant current-main prep stages, in-process queue model, and absence of this PR's prep-yield checkpoints.

Real behavior proof
Needs real behavior proof before merge: the PR lists mocked/local tests and build commands, but no redacted live Gateway/WebSocket output, terminal capture, recording, linked artifact, or runtime log showing the after-fix behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Human-only: the remaining blocker is contributor-owned real behavior proof and maintainer judgment on a partial performance mitigation, not a narrow repair automation can provide.

Security
Cleared: The diff adds an internal timer-based yield helper and tests without changing dependencies, workflows, permissions, secrets, package resolution, or network access.

Review details

Best possible solution:

Keep the PR open until the contributor adds redacted real behavior proof showing cheap Gateway/WebSocket requests remain responsive during embedded prep; then maintainers can decide whether this narrow yield mitigation should land while the broader tracker stays open.

Do we have a high-confidence way to reproduce the issue?

No, not for the exact 15-100s WebSocket and multi-minute agent timings. Source inspection does confirm the relevant current-main prep stages, in-process queue model, and absence of this PR's prep-yield checkpoints.

Is this the best way to solve the issue?

Unclear: cooperative yielding is a reasonable narrow mitigation, but it is not proven as the best or sufficient fix for the related Gateway starvation report. A safer merge path needs real behavior proof plus maintainer judgment on the partial scope.

What I checked:

PR patch surface: The PR adds attempt-prep-yield.ts, imports it into runEmbeddedAttempt, and awaits yieldAfterAttemptPrepCheckpoint after the core-plugin-tools, bootstrap-context, and bundle-tools prep stages. (src/agents/pi-embedded-runner/run/attempt.ts:926, 5e35693d683d)
Current main baseline: Current main still marks core-plugin-tools, bootstrap-context, bundle-tools, and system-prompt in the embedded attempt path, and the search found no AttemptPrepYield/yieldAfterAttemptPrepCheckpoint helper on main. (src/agents/pi-embedded-runner/run/attempt.ts:1137, ea16a5e9e10c)
Existing accepted-ack mitigation: Current main already responds with the accepted agent payload and yields once before dispatching the runner, but the comment scopes that to flushing the accepted frame and immediate agent.wait calls, not to interleaving later embedded prep stages. (src/gateway/server-methods/agent.ts:1359, ea16a5e9e10c)
Queue architecture context: The queue docs still describe process-wide lanes and explicitly state there are no background worker threads, which matches the broader architecture concern related to this PR. Public docs: docs/concepts/queue.md. (docs/concepts/queue.md:108, ea16a5e9e10c)
Related issue remains open: The related Gateway event-loop report is still open and asks for broader worker-thread, priority scheduling, caching, lazy-loading, and degradation work beyond this PR's stated scope.
Real behavior proof gate: Live PR metadata shows the triage: needs-real-behavior-proof label, a failed Real behavior proof check on the PR head, and the PR body only cites targeted tests/build rather than live Gateway/WebSocket output or redacted runtime logs. (5e35693d683d)

Likely related people:

vincentkoc: Authored the accepted-agent dispatch deferral and nearby dispatch timer tests that are directly adjacent to this PR's responsiveness goal. (role: recent gateway scheduling contributor; confidence: high; commits: 985000026e35, b8f9137d31fb, e2eb5649d1b9; files: src/gateway/server-methods/agent.ts, src/gateway/server-methods/agent.test.ts)
steipete: Recent history includes stable system-prompt prep caching and slow embedded-run startup-stage tracing, both adjacent to the reported prep-stage latency. (role: embedded-run timing and prompt-prep contributor; confidence: medium; commits: 0f16edf329a9, 20e21173715e; files: src/agents/system-prompt.ts, src/agents/pi-embedded-runner/run/attempt-stage-timing.ts, src/agents/pi-embedded-runner/run/attempt.ts)
shakkernerd: Recent merged work changed current snapshot handling for embedded runs in the same runner area touched by this PR. (role: recent embedded-runner contributor; confidence: medium; commits: 5655c2b0666d; files: src/agents/pi-embedded-runner/run/attempt.ts)
jalehman: Recent merged prompt/runtime context work overlaps the embedded-runner prep stages discussed by the PR and related issue. (role: adjacent embedded-runner prompt contributor; confidence: medium; commits: 6dae3c273de6; files: src/agents/pi-embedded-runner/run/attempt.ts)

Remaining risk / open question:

No redacted live Gateway/WebSocket output, runtime log, recording, or linked artifact shows the after-fix behavior in a real setup.
The mitigation intentionally covers three prep checkpoints and does not address model-resolution, system-prompt, worker isolation, priority scheduling, caching, lazy-loading, or overload policy from the related open report.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ea16a5e9e10c.

fix(gateway): yield during embedded agent prep

5e35693

openclaw-barnacle Bot added the labels gateway (Gateway runtime), agents (Agent runtime and tooling), size: M, and triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.) on May 7, 2026

clawsweeper Bot mentioned this pull request May 14, 2026

[CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861

Open

sahilsatralkar wants to merge 1 commit into openclaw:main from sahilsatralkar:feature/issue-78861-gateway-responsiveness
