fix(gateway): yield during embedded agent prep #78958
Codex review: needs real behavior proof before merge.
Summary
Reproducibility: no, not for the exact 15-100s WebSocket and multi-minute agent timings. Source inspection does confirm the relevant current-main prep stages, the in-process queue model, and the absence of this PR's prep-yield checkpoints.
Real behavior proof
Next step before merge
Security
Review details
Best possible solution: Keep the PR open until the contributor adds redacted real behavior proof showing that cheap Gateway/WebSocket requests remain responsive during embedded prep. Maintainers can then decide whether this narrow yield mitigation should land while the broader tracker stays open.
Do we have a high-confidence way to reproduce the issue? No, for the same reasons given under Reproducibility above.
Is this the best way to solve the issue? Unclear. Cooperative yielding is a reasonable narrow mitigation, but it has not been shown to be the best or a sufficient fix for the related Gateway starvation report. A safer merge path needs real behavior proof plus maintainer judgment on the partial scope.
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against ea16a5e9e10c.
Summary
execution.
contributing to poor responsiveness like the symptoms reported in [CRITICAL] Single-threaded Event Loop Bottleneck — 100s WS Response Times, 3min Agent Tasks Even With Minimal Config #78861.
setImmediate prep-yield helper and inserted safe yield checkpoints after core tool construction, bootstrap context, and bundled tool preparation.
changes, plugin/channel behavior, or public API changes.
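For readers unfamiliar with the pattern, here is a minimal TypeScript sketch of a budgeted setImmediate yield helper. The names createPrepYield, budgetMs, now, and yieldFn are hypothetical and are not taken from the PR's attempt-prep-yield.ts. The sketch assumes the budget is a wall-clock slice that resets after each yield, and that both the clock and the yield are injectable so tests can control them.

```typescript
// Hypothetical sketch: names and options are illustrative, not the PR's API.
type YieldFn = () => Promise<void>;

interface PrepYieldOptions {
  budgetMs: number; // max synchronous time allowed before a checkpoint yields
  now?: () => number; // injectable clock (defaults to Date.now)
  yieldFn?: YieldFn; // injectable yield (defaults to a setImmediate macrotask)
}

// Defer the continuation to the event loop's check phase instead of the
// microtask queue, so other queued event-loop work gets a chance to run.
const yieldToEventLoop: YieldFn = () =>
  new Promise<void>((resolve) => setImmediate(() => resolve()));

function createPrepYield(options: PrepYieldOptions) {
  const now = options.now ?? Date.now;
  const doYield = options.yieldFn ?? yieldToEventLoop;
  let sliceStart = now(); // start of the current synchronous slice
  let yields = 0;

  return {
    // Yield only once the slice has used up its budget; otherwise continue synchronously.
    async checkpoint(): Promise<boolean> {
      if (now() - sliceStart < options.budgetMs) return false;
      await doYield();
      yields += 1;
      sliceStart = now(); // reset the budget after the event loop has had a turn
      return true;
    },
    get yieldCount(): number {
      return yields;
    },
  };
}
```

In use, a runner would create one helper per attempt and await checkpoint() between phases such as core tool construction, bootstrap context, and bundled tool preparation. A budget of 0 makes every checkpoint yield, which trades a little throughput for the most frequent scheduling opportunities.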
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
available.
pnpm test ..., pnpm build, targeted formatter, and changed-lane gate attempt.
artifact, or copied live output): terminal output from targeted tests and build showing passing results.
coupled to agent completion after accepted dispatch.
under real load.
Root Cause (if applicable)
loop before model execution, with limited cooperative scheduling between phases.
an accepted agent run is still pending.
core-plugin-tools, bootstrap-context, bundle-tools, and system-prompt.
Regression Test Plan (if applicable)
src/gateway/server-methods/agent.test.ts, src/agents/pi-embedded-runner/run/attempt.test.ts, src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
served while the agent result is pending, and prep checkpoint continuations await the injected yield before model execution proceeds.
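The ordering property in the test plan can be illustrated with a self-contained sketch that uses node:assert instead of the repository's test runner. runPrepThenModel, the phase list, and the deferred helper are hypothetical stand-ins, not code from attempt.test.ts or attempt-prep-yield.test.ts. The point is that with a test-controlled yield, model execution cannot start until every checkpoint's yield has resolved.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical stand-in for an embedded prep runner; not the PR's code.
type PrepPhase = { name: string; run: () => void };

async function runPrepThenModel(
  phases: PrepPhase[],
  yieldFn: () => Promise<void>,
  startModel: () => void,
): Promise<void> {
  for (const phase of phases) {
    phase.run(); // synchronous prep work for this phase
    await yieldFn(); // checkpoint: continuation waits for the injected yield
  }
  startModel();
}

// A promise the test resolves by hand, standing in for one injected yield.
function deferred(): { promise: Promise<void>; resolve: () => void } {
  let resolve: () => void = () => {};
  const promise = new Promise<void>((r) => {
    resolve = () => r();
  });
  return { promise, resolve };
}

async function main(): Promise<void> {
  const log: string[] = [];
  const gates: Array<{ promise: Promise<void>; resolve: () => void }> = [];
  const injectedYield = () => {
    const gate = deferred();
    gates.push(gate);
    return gate.promise;
  };
  const phases: PrepPhase[] = ["core-plugin-tools", "bootstrap-context", "bundle-tools"].map(
    (name) => ({ name, run: () => { log.push(name); } }),
  );

  const done = runPrepThenModel(phases, injectedYield, () => { log.push("model"); });

  // The first phase ran synchronously; the runner is now parked on the first yield.
  assert.deepEqual(log, ["core-plugin-tools"]);
  assert.equal(gates.length, 1);

  for (let i = 0; i < phases.length; i++) {
    assert.ok(!log.includes("model"), "model started before all yields resolved");
    gates[i].resolve();
    await new Promise<void>((r) => setImmediate(() => r())); // let the continuation run
  }

  await done;
  assert.deepEqual(log, [...phases.map((p) => p.name), "model"]);
  console.log("prep checkpoints gate model execution: ok");
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```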
Gateway, provider, or channel runtime.
src/gateway/server-methods/agent.test.ts.
User-visible / Behavior Changes
Gateway/Control UI responsiveness should improve during embedded agent prep because the runner now cooperatively yields between major prep phases. No config or API changes.
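To see why yielding between phases can help, here is a small Node simulation in TypeScript. blockFor stands in for CPU-bound prep work and a setImmediate callback stands in for a cheap Gateway request. Both, along with runPrep, simulatePrep, and the phase timings, are assumptions for illustration, not measurements of the real Gateway.

```typescript
// Hypothetical simulation; blockFor stands in for synchronous prep work.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait: nothing else on this thread can run meanwhile
  }
}

const yieldToLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setImmediate(() => resolve()));

async function runPrep(
  phaseCount: number,
  phaseMs: number,
  cooperative: boolean,
  events: string[],
): Promise<void> {
  for (let i = 0; i < phaseCount; i++) {
    blockFor(phaseMs);
    events.push(`phase-${i}`);
    if (cooperative) await yieldToLoop(); // checkpoint between phases
  }
  events.push("prep-done");
}

async function simulatePrep(cooperative: boolean): Promise<string[]> {
  const events: string[] = [];
  // A cheap "request" callback that is already queued when prep starts.
  setImmediate(() => events.push("cheap-request"));
  await runPrep(4, 5, cooperative, events);
  await yieldToLoop(); // drain, so the request is recorded in both modes
  return events;
}
```

With cooperative set to false, the recorded order is phase-0, phase-1, phase-2, phase-3, prep-done, cheap-request: the request waits for the whole prep. With cooperative set to true, it is phase-0, cheap-request, phase-1, phase-2, phase-3, prep-done, because the first checkpoint lets the already-queued callback run. The real latency gain depends on how long each prep phase blocks, which this sketch does not measure.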
Diagram (if applicable)
Security Impact (required)
Repro + Verification
Environment
Steps
Expected
Actual
Evidence
Attach at least one:
Passing after:
methods/agent.test.ts
yield.test.ts
attempt.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.ts src/agents/pi-embedded-runner/run/attempt-prep-yield.test.ts
Known gate caveat:
Human Verification (required)
What you personally verified (not just CI), and how:
result is pending; the prep yield helper budgets and resets correctly; and the prep checkpoint awaits the injected yield before model execution continues.
accounting.
WebSocket latency under high-load production config.
Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Risks and Mitigations
stream handling, and model/tool execution.
scheduling, caching, and lazy loading remain separate maintainer-level work.
Built with Codex