
fix(gateway): yield during embedded agent prep #78958

Open
sahilsatralkar wants to merge 1 commit into openclaw:main from sahilsatralkar:feature/issue-78861-gateway-responsiveness

Conversation

@sahilsatralkar
Contributor

Summary

  • Problem: embedded agent preparation can run heavy synchronous prep blocks in the Gateway process before model
    execution.
  • Why it matters: under load, cheap Gateway/WebSocket requests can become coupled to long agent prep work,
    contributing to poor responsiveness like the symptoms reported in [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861.
  • What changed: added a private cooperative setImmediate prep-yield helper and inserted safe yield checkpoints
    after core tool construction, bootstrap context, and bundled tool preparation.
  • What did NOT change (scope boundary): no worker threads, priority queue, caching, lazy loading, protocol/config
    changes, plugin/channel behavior, or public API changes.
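
To make the "cooperative setImmediate prep-yield helper" concrete, here is a minimal sketch of what a budgeted yield controller could look like. It is not the code from this PR: `createPrepYieldController`, `checkpoint`, `reset`, and the budget option are hypothetical names chosen for illustration, and the only assumption taken from the description is that the helper yields via `setImmediate`, tracks a time budget, and can be reset.

```typescript
// Hypothetical sketch: names and options are illustrative, not taken from the PR diff.
type YieldFn = () => Promise<void>;

// Resolve from a setImmediate callback so other queued event-loop work can run
// before the awaiting continuation resumes.
const yieldToEventLoop: YieldFn = () =>
  new Promise<void>((resolve) => setImmediate(resolve));

function createPrepYieldController(options: {
  budgetMs: number;
  now?: () => number; // injectable clock, useful in tests
  yieldFn?: YieldFn; // injectable yield, useful in tests
}) {
  const now = options.now ?? (() => Date.now());
  const yieldFn = options.yieldFn ?? yieldToEventLoop;
  let sliceStart = now();

  return {
    // Restart budget accounting, for example at the start of a new attempt.
    reset(): void {
      sliceStart = now();
    },
    // Yield only when at least budgetMs has elapsed since the last yield or reset.
    // Returns true if a yield actually happened.
    async checkpoint(): Promise<boolean> {
      if (now() - sliceStart < options.budgetMs) return false;
      await yieldFn();
      sliceStart = now();
      return true;
    },
  };
}
```

Under these assumptions a runner would call `await controller.checkpoint()` after each heavy prep phase. Injecting the clock and the yield function keeps the helper testable without real timers.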

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Gateway responsiveness during embedded agent preparation.
  • Real environment tested: Local macOS development checkout; no live reporter-like Windows/provider setup was
    available.
  • Exact steps or command run after this patch: pnpm test ..., pnpm build, targeted formatter, and changed-lane
    gate attempt.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked
    artifact, or copied live output): terminal output from targeted tests and build showing passing results.
  • Observed result after fix: cheap Gateway request handling is covered by a regression test proving it is not
    coupled to agent completion after accepted dispatch.
  • What was not tested: reporter’s Windows 10 / Node 25.9.0 / private provider config and live WebSocket latency
    under real load.
  • Before evidence (optional but encouraged): [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861 issue logs.

Root Cause (if applicable)

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/agent.test.ts, src/agents/pi-embedded-runner/run/attempt.test.ts, src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
  • Scenario the test should lock in: accepted agent dispatch yields before heavy work, cheap Gateway requests can be
    served while the agent result is pending, and prep checkpoint continuations await the injected yield before model
    execution proceeds.
  • Why this is the smallest reliable guardrail: it verifies the exact scheduling seam without starting a real
    Gateway, provider, or channel runtime.
  • Existing test that already covers this (if any): existing accepted-ack yield coverage in src/gateway/server-methods/agent.test.ts.
  • If no new test is added, why not: N/A; new tests were added.
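
The "prep checkpoint continuations await the injected yield" scenario can be sketched roughly as follows. This is an illustration of the testing idea only, not the PR's test: `runAttemptSketch`, `deferred`, and `demoInjectedYield` are hypothetical stand-ins, and the stage names are borrowed from the review text. The yield function is injected, and a deferred promise lets the test hold the last checkpoint open to confirm that model execution does not start until the yield resolves.

```typescript
// Hypothetical stand-in for the runner; only the stage names come from the PR discussion.
async function runAttemptSketch(
  yieldAfterCheckpoint: (stage: string) => Promise<void>,
  log: string[],
): Promise<void> {
  for (const stage of ["core-plugin-tools", "bootstrap-context", "bundle-tools"]) {
    log.push(`prep:${stage}`);
    await yieldAfterCheckpoint(stage);
  }
  log.push("model-execution");
}

// A deferred promise lets the test decide when the injected yield completes.
function deferred(): { promise: Promise<void>; resolve: () => void } {
  let resolve!: () => void;
  const promise = new Promise<void>((r) => (resolve = r));
  return { promise, resolve };
}

async function demoInjectedYield(): Promise<string[]> {
  const log: string[] = [];
  const gate = deferred();
  const done = runAttemptSketch(async (stage) => {
    // Hold only the last checkpoint open so we can observe that model execution waits.
    if (stage === "bundle-tools") await gate.promise;
  }, log);

  // Give the attempt a full event-loop turn; it should now be parked at the gate.
  await new Promise<void>((resolve) => setImmediate(resolve));
  log.push("test:gate-still-closed");

  gate.resolve();
  await done;
  return log;
}
```

In this sketch the log entry `test:gate-still-closed` lands before `model-execution`, which is the ordering such a regression test would lock in.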

User-visible / Behavior Changes

Gateway/Control UI responsiveness should improve during embedded agent prep because the runner now cooperatively
yields between major prep phases. No config or API changes.

Diagram (if applicable)

Before:
[agent accepted] -> [core tools + bootstrap + bundle tools prep in one long run] -> [model execution]
(cheap gateway request waits for event loop opportunity)

After:
[agent accepted] -> [core tools] -> [yield] -> [bootstrap] -> [yield] -> [bundle tools] -> [yield] -> [model execution]
(cheap gateway request can run between prep phases)
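
The before/after difference can be reproduced in a small standalone sketch. It is a toy model, not the Gateway: `busyWork` and `runPrep` are hypothetical, the 5 ms busy loop stands in for synchronous prep, and a `setImmediate` callback stands in for a cheap Gateway request that is already queued when prep starts.

```typescript
// Toy model of the diagram; all names here are illustrative assumptions.
function busyWork(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Spin to simulate synchronous prep that blocks the event loop.
  }
}

async function runPrep(withYield: boolean): Promise<string[]> {
  const order: string[] = [];
  // Simulates a cheap request that is queued while prep is running.
  setImmediate(() => order.push("cheap-request"));

  for (const phase of ["core-tools", "bootstrap", "bundle-tools"]) {
    busyWork(5);
    order.push(phase);
    if (withYield) {
      // Cooperative yield: lets queued callbacks run before the next phase.
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
  order.push("model-execution");
  return order;
}
```

Without yields, the prep phases and `model-execution` are all recorded before the queued callback gets a turn. With yields, `cheap-request` is recorded right after the first phase. The total prep work is unchanged; only the interleaving differs, which matches the stated scope of the change.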

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS local development environment
  • Runtime/container: local Node/pnpm workspace
  • Model/provider: mocked/local tests only
  • Integration/channel (if any): none
  • Relevant config (redacted): default test config/mocks

Steps

  1. Run targeted Gateway and agent runner tests.
  2. Run targeted formatter and git diff --check.
  3. Run pnpm build.
  4. Run pnpm changed:lanes --json and pnpm check:changed.

Expected

  • Targeted tests pass.
  • Build passes.
  • Changed gate either passes or reports only unrelated actionable failures.

Actual

  • Targeted tests passed.
  • pnpm build passed.
  • pnpm check:changed failed in unrelated existing core test typecheck fixtures outside this diff.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Passing after:

  • pnpm test src/gateway/server-methods/agent.test.ts
  • pnpm test src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
  • pnpm test src/agents/pi-embedded-runner/run/attempt.test.ts
  • pnpm test src/process/command-queue.test.ts src/gateway/server/event-loop-health.test.ts src/gateway/server-methods/agent.test.ts
  • pnpm test src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-methods/agent.test.ts src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
  • git diff --check
  • pnpm build

Known gate caveat:

  • pnpm check:changed, run after pnpm install, failed on pre-existing type errors in unrelated test fixtures:
    • src/agents/openai-transport-stream.test.ts
    • src/agents/pi-embedded-runner/openai-stream-wrappers.test.ts

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: accepted agent dispatch still responds first; cheap Gateway request can be handled while agent
    result is pending; prep yield helper budgets and resets correctly; prep checkpoint awaits the injected yield
    before model execution continuation.
  • Edge cases checked: helper does not yield before budget; helper yields after budget; helper reset restarts
    accounting.
  • What you did not verify: live reporter environment, Windows-specific behavior, real provider latency, real
    WebSocket latency under high-load production config.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review
conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

Built with Codex

@openclaw-barnacle openclaw-barnacle Bot added labels on May 7, 2026: gateway (Gateway runtime), agents (Agent runtime and tooling), size: M, triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup).
@clawsweeper
Contributor

clawsweeper Bot commented May 7, 2026

Codex review: needs real behavior proof before merge.

Summary
Adds a setImmediate-backed embedded-agent prep-yield controller, awaits it after core-tool, bootstrap-context, and bundle-tool prep checkpoints, and adds focused runner/Gateway tests.

Reproducibility: No, not for the exact 15-100s WebSocket and multi-minute agent timings. Source inspection does confirm the relevant current-main prep stages, in-process queue model, and absence of this PR's prep-yield checkpoints.

Real behavior proof
Needs real behavior proof before merge: the PR lists mocked/local tests and build commands, but no redacted live Gateway/WebSocket output, terminal capture, recording, linked artifact, or runtime log showing the after-fix behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Human-only: the remaining blocker is contributor-owned real behavior proof and maintainer judgment on a partial performance mitigation, not a narrow repair automation can provide.

Security
Cleared: The diff adds an internal timer-based yield helper and tests without changing dependencies, workflows, permissions, secrets, package resolution, or network access.

Review details

Best possible solution:

Keep the PR open until the contributor adds redacted real behavior proof showing cheap Gateway/WebSocket requests remain responsive during embedded prep; then maintainers can decide whether this narrow yield mitigation should land while the broader tracker stays open.

Do we have a high-confidence way to reproduce the issue?

No, not for the exact 15-100s WebSocket and multi-minute agent timings. Source inspection does confirm the relevant current-main prep stages, in-process queue model, and absence of this PR's prep-yield checkpoints.

Is this the best way to solve the issue?

Unclear: cooperative yielding is a reasonable narrow mitigation, but it is not proven as the best or sufficient fix for the related Gateway starvation report. A safer merge path needs real behavior proof plus maintainer judgment on the partial scope.

What I checked:

  • PR patch surface: The PR adds attempt-prep-yield.ts, imports it into runEmbeddedAttempt, and awaits yieldAfterAttemptPrepCheckpoint after the core-plugin-tools, bootstrap-context, and bundle-tools prep stages. (src/agents/pi-embedded-runner/run/attempt.ts:926, 5e35693d683d)
  • Current main baseline: Current main still marks core-plugin-tools, bootstrap-context, bundle-tools, and system-prompt in the embedded attempt path, and the search found no AttemptPrepYield/yieldAfterAttemptPrepCheckpoint helper on main. (src/agents/pi-embedded-runner/run/attempt.ts:1137, ea16a5e9e10c)
  • Existing accepted-ack mitigation: Current main already responds with the accepted agent payload and yields once before dispatching the runner, but the comment scopes that to flushing the accepted frame and immediate agent.wait calls, not to interleaving later embedded prep stages. (src/gateway/server-methods/agent.ts:1359, ea16a5e9e10c)
  • Queue architecture context: The queue docs still describe process-wide lanes and explicitly state there are no background worker threads, which matches the broader architecture concern related to this PR. Public docs: docs/concepts/queue.md. (docs/concepts/queue.md:108, ea16a5e9e10c)
  • Related issue remains open: The related Gateway event-loop report is still open and asks for broader worker-thread, priority scheduling, caching, lazy-loading, and degradation work beyond this PR's stated scope.
  • Real behavior proof gate: Live PR metadata shows the triage: needs-real-behavior-proof label, a failed Real behavior proof check on the PR head, and the PR body only cites targeted tests/build rather than live Gateway/WebSocket output or redacted runtime logs. (5e35693d683d)

Likely related people:

  • vincentkoc: Authored the accepted-agent dispatch deferral and nearby dispatch timer tests that are directly adjacent to this PR's responsiveness goal. (role: recent gateway scheduling contributor; confidence: high; commits: 985000026e35, b8f9137d31fb, e2eb5649d1b9; files: src/gateway/server-methods/agent.ts, src/gateway/server-methods/agent.test.ts)
  • steipete: Recent history includes stable system-prompt prep caching and slow embedded-run startup-stage tracing, both adjacent to the reported prep-stage latency. (role: embedded-run timing and prompt-prep contributor; confidence: medium; commits: 0f16edf329a9, 20e21173715e; files: src/agents/system-prompt.ts, src/agents/pi-embedded-runner/run/attempt-stage-timing.ts, src/agents/pi-embedded-runner/run/attempt.ts)
  • shakkernerd: Recent merged work changed current snapshot handling for embedded runs in the same runner area touched by this PR. (role: recent embedded-runner contributor; confidence: medium; commits: 5655c2b0666d; files: src/agents/pi-embedded-runner/run/attempt.ts)
  • jalehman: Recent merged prompt/runtime context work overlaps the embedded-runner prep stages discussed by the PR and related issue. (role: adjacent embedded-runner prompt contributor; confidence: medium; commits: 6dae3c273de6; files: src/agents/pi-embedded-runner/run/attempt.ts)

Remaining risk / open question:

  • No redacted live Gateway/WebSocket output, runtime log, recording, or linked artifact shows the after-fix behavior in a real setup.
  • The mitigation intentionally covers three prep checkpoints and does not address model-resolution, system-prompt, worker isolation, priority scheduling, caching, lazy-loading, or overload policy from the related open report.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ea16a5e9e10c.

