fix(gateway): soft-reset pre-turn gap recovery (v0.5.38)#143
Conversation
Codex (gpt-5.1) recommended Alternative D: shared reactive overflow recovery helper, deferred-retry UX, fallback to pendingCompact=emergency. Full design + open-question resolutions in docs/design/v0.5.38-soft-reset-pre-turn-gap.md. Implements next: classify-in-catch + agent.softReset() + state write + fallback when softReset is unavailable/fails.
…mpt catch (v0.5.38)
Closes the soft-reset pre-turn gap: when an idle session has already
grown past the provider context limit (e.g. via background cron/boot/
sub-agent activity that didn't trip soft/hard pressure), the next user
turn's agent.prompt() / agent.promptStream() throws 'prompt is too long'
before any pre-turn pressure handler can fire. v0.5.29-v0.5.32 only
recovered when *compact itself* overflowed; this catch path was
unprotected, so the gateway just posted the raw provider error and the
loop continued.
Changes:
- Extract recoverFromContextOverflow() into src/agents/shared/overflow-
recovery.ts (file-private attemptSoftResetRecovery in lifecycle.ts is
removed; lifecycle now imports the shared helper).
- Add src/gateway/overflow.ts with recoverFromAgentTurnOverflow():
classify -> softReset -> persist memory state (forceInjectReason or
pendingCompact='emergency' fallback when softReset is unavailable
but compact isn't) -> append telemetry (level='gateway-overflow')
-> post deferred-retry hint.
- Wire it into Gateway.handleAgentTurn's catch. Plumb turnSource
('user'|'boot'|'subagent') through so background turns get distinct
copy ('Original work was not retried.' instead of 'please resend').
- handleStreaming() now returns hadVisibleText so the catch can
distinguish 'please resend' from 'response was interrupted' on
streaming overflows after partial text.
- export appendCompactLog from lifecycle so the gateway-side recovery
uses the same compact-timing.jsonl schema.
UX: deferred retry only. No transparent same-turn replay because a
streamed turn that already emitted text or executed tools would
duplicate output / side effects on retry.
Design: docs/design/v0.5.38-soft-reset-pre-turn-gap.md (codex-cli
Alternative D).
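The recovery flow described above (classify, softReset, persist state, fall back to pendingCompact='emergency') can be sketched roughly as follows. Only recoverFromAgentTurnOverflow, forceInjectReason, pendingCompact, the turnSource union, and the user-facing copy come from this PR; AgentLike, MemoryState, and the isContextOverflow internals are illustrative assumptions, not the project's actual shapes.

```typescript
type TurnSource = "user" | "boot" | "subagent";

// Assumed shapes; the real project state and adapter interfaces differ.
interface MemoryState {
  forceInjectReason?: string;
  pendingCompact?: "emergency";
}

interface AgentLike {
  softReset?: () => Promise<{ reset: boolean; reason?: string }>;
  compact?: () => Promise<unknown>;
}

// Walk the .cause chain so wrapped provider errors still classify
// (the v0.5.30 regression covered by the tests below).
function isContextOverflow(err: unknown): boolean {
  let e: unknown = err;
  while (e instanceof Error) {
    if (/prompt is too long/i.test(e.message)) return true;
    e = (e as { cause?: unknown }).cause;
  }
  return false;
}

async function recoverFromAgentTurnOverflow(
  err: unknown,
  agent: AgentLike,
  state: MemoryState,
  source: TurnSource,
): Promise<string | undefined> {
  if (!isContextOverflow(err)) return undefined; // not ours: caller posts sanitized error

  if (agent.softReset) {
    try {
      const result = await agent.softReset();
      if (result.reset) {
        state.forceInjectReason = "after-soft-reset";
        delete state.pendingCompact;
        return source === "user"
          ? "✅ Recovered. Please resend your last message."
          : "✅ Recovered. Original work was not retried.";
      }
    } catch {
      // fall through to the pendingCompact fallback below
    }
  }

  // softReset unavailable or failed: arm an emergency compact only if the
  // adapter can actually honor it on the next turn.
  if (agent.compact) {
    state.pendingCompact = "emergency";
    return "♻️ Recovery armed. Please resend your last message.";
  }
  return undefined;
}
```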
19 new tests across two files (584 total, +19 net):

test/overflow-recovery.test.ts (8 tests) — direct unit tests on the extracted recoverFromContextOverflow helper:
- non-overflow returns not-overflow without calling softReset
- adapter without softReset returns unsupported
- softReset success returns recovered with report + ✅ progress
- softReset reset:false returns noop with reason + ⚠️ progress
- softReset throws returns failed with message + ❌ progress
- softReset throws non-Error (string) — String() coerced, no TypeError (regression for v0.5.32 fix #4)
- overflow buried in .cause chain still classifies (v0.5.30 regression)
- onProgress optional — undefined callback handled

test/gateway-overflow-recovery.test.ts (11 tests) — gateway-level integration tests on recoverFromAgentTurnOverflow with fake adapter + fake thread, covering the brief's required surface:
- overflow during non-streaming prompt + softReset success → recovered hint + state.forceInjectReason='after-soft-reset' + cleared pendingCompact
- overflow + softReset undefined + no compact → sanitized error, no arming
- overflow + softReset throws + has compact → pendingCompact='emergency' armed + 'Recovery armed (failed: …)' hint
- overflow + softReset reset:false + has compact → pendingCompact armed + 'Recovery armed (noop: <reason>)' hint
- overflow + softReset fails + no compact → sanitized error, no arming (don't arm a flag the next turn can't honor)
- non-overflow error → sanitized error, no recovery, no state mutation
- streaming overflow before any text → '✅ Recovered. Please resend…'
- streaming overflow after partial text → '♻️ Response was interrupted'
- background turns (boot/subagent/cron) → distinct copy, no resend ask
- thread.post throws → recovery still updates state, doesn't propagate
- overflow in cause chain → still triggers recovery

All tests pass: 584 / 584.
…, v0.5.38)

pi-ai's streaming surfaces provider errors as model_error EVENTS, not thrown exceptions. The v0.5.38 catch in Gateway.handleAgentTurn only saw synchronous throws from agent.prompt()/promptStream(), so a streamed 'prompt is too long' bypassed recoverFromAgentTurnOverflow entirely: the model_error case posted the raw error inline, the for-await continued to turn_end, and the function returned normally. On Telegram (streaming by default), the v0.5.38 fix was effectively a no-op.

Per the codex-cli design (see ~/.roundhouse/workspace/softreset-f1-codex-design.md): classify the model_error message in streaming.ts via isContextOverflowError. Non-overflow keeps the existing inline post + continue-loop. Overflow flushes, suppresses the inline raw post, and throws a typed StreamModelOverflowError carrying the provider message. The gateway catch's existing recoverFromAgentTurnOverflow handles it identically to synchronous-throw overflow — single recovery surface, no duplicate posts, no flag plumbed through StreamResult.

7 new streaming-overflow tests:
- model_error overflow throws + suppresses raw post
- model_error non-overflow keeps inline post + continues loop (regression)
- text_delta then overflow throws with overflow message
- non-overflow inline post format unchanged
- end-to-end: overflow → recovery → 'please resend' wording
- end-to-end: partial text → recovery → 'interrupted' wording
- end-to-end: non-overflow → no recovery, raw post stands alone

591 tests passing (584 + 7).
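A minimal sketch of the streamed-error routing described in this commit. StreamModelOverflowError and isContextOverflowError are named in the PR; the event shape and the handleModelError helper are assumptions made here for illustration.

```typescript
// Typed error so the gateway catch can treat a streamed overflow exactly
// like a synchronous-throw overflow (single recovery surface).
class StreamModelOverflowError extends Error {
  constructor(providerMessage: string) {
    super(providerMessage);
    this.name = "StreamModelOverflowError";
  }
}

function isContextOverflowError(message: string): boolean {
  return /prompt is too long/i.test(message);
}

// Assumed event shape; pi-ai's real stream events differ.
interface StreamEvent {
  type: "text_delta" | "model_error" | "turn_end";
  text?: string;
  message?: string;
}

// Inside the streaming for-await loop, a model_error is either posted
// inline (non-overflow, existing behavior) or re-thrown so the gateway
// catch owns recovery.
function handleModelError(ev: StreamEvent, postInline: (msg: string) => void): void {
  const msg = ev.message ?? "";
  if (isContextOverflowError(msg)) {
    // Suppress the raw inline post; the thrown error carries the message.
    throw new StreamModelOverflowError(msg);
  }
  postInline(`⚠️ ${msg}`); // non-overflow keeps the inline post + continue-loop
}
```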
…rflow hint (F2+F3)
F2: TurnSource included 'cron' but cron jobs run via cron/runner.ts in
their own session and never invoke Gateway.handleAgentTurn. Verified
via grep on this.handleAgentTurn — only user-message-handler,
fireBootTurn, and handleSubagentCompletion reach it. Removed 'cron'
from the union. If a future patch adds cron→main injection, add it
back at that point (YAGNI).
F3: pickUserMessage's 'unsupported' branch (no softReset on adapter,
no armed pending) returned the raw sanitized provider error
('⚠️ Error: prompt is too long: N > M'). User saw a wall-of-numbers
with no indication that recovery was unavailable. Now returns:
'⚠️ Session full — adapter doesn't support automatic recovery.
Run /compact manually or restart session.'
Updated:
- src/gateway/overflow.ts — TurnSource type, file header comment,
pickUserMessage unsupported branch.
- test/gateway-overflow-recovery.test.ts — drop 'cron' from
background-turn iteration; rewrite the 'unsupported-no-compact' test
to assert the new clear-guidance copy and verify the raw provider
error does NOT leak through.
- docs/design/v0.5.38-soft-reset-pre-turn-gap.md — match TurnSource shape.
591 tests passing.
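The F3 change could look roughly like this sketch of pickUserMessage. The Outcome union here is a simplified, hypothetical subset of the real outcomes; only the unsupported wording, the 'Recovery armed (failed: …)' shape, and MAX_FAILURE_REASON_PREVIEW come from this PR.

```typescript
// Hypothetical simplified outcome union for illustration.
type Outcome =
  | { kind: "recovered" }
  | { kind: "unsupported" }
  | { kind: "failed"; reason: string };

// UI message is bracketed by other copy, so this preview is shorter than
// a standalone error preview would be.
const MAX_FAILURE_REASON_PREVIEW = 100;

function pickUserMessage(outcome: Outcome): string {
  switch (outcome.kind) {
    case "recovered":
      return "✅ Recovered. Please resend your last message.";
    case "unsupported":
      // Was: raw sanitized provider error ('⚠️ Error: prompt is too long: N > M').
      return "⚠️ Session full — adapter doesn't support automatic recovery. Run /compact manually or restart session.";
    case "failed":
      return `♻️ Recovery armed (failed: ${outcome.reason.slice(0, MAX_FAILURE_REASON_PREVIEW)})`;
  }
}
```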
F4 — appendCompactLog cross-domain import. src/gateway/overflow.ts imported appendCompactLog from src/memory/lifecycle, coupling gateway → memory/lifecycle (flagged as follow-up in the v0.5.38 design doc). Extracted appendCompactLog + CompactLogEntry to a new src/memory/telemetry.ts (~43 LOC). Both lifecycle.ts and gateway/overflow.ts now import from telemetry. lifecycle.ts re-exports the symbol for backwards compatibility (no external callers found via grep).

F5 — MAX_ERROR_PREVIEW duplication. The constant existed in both src/gateway/gateway.ts and src/gateway/overflow.ts. gateway.ts had only the declaration (its sole consumer moved to overflow.ts with the v0.5.38 catch refactor) — deleted as dead code. overflow.ts keeps its single definition.

F6 — slice(0, 100) magic number. pickUserMessage's failed-outcome reason truncation used a bare 100. Added named constant MAX_FAILURE_REASON_PREVIEW = 100 with a comment explaining why it's shorter than MAX_ERROR_PREVIEW (a UI message bracketed by other copy, not a standalone error preview).

591 tests passing.
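A minimal sketch, under assumptions, of what the extracted src/memory/telemetry.ts might contain. appendCompactLog and CompactLogEntry are the PR's names; the field set beyond level is illustrative, and the serializeCompactLogEntry helper is hypothetical, added here only to keep the example testable.

```typescript
import { appendFileSync } from "node:fs";

// Assumed fields; only the level discriminator is attested in the PR
// (e.g. level='gateway-overflow' for gateway-side recovery).
export interface CompactLogEntry {
  level: string;
  timestamp: string; // ISO-8601
  outcome?: string;
  durationMs?: number;
}

// One JSON object per line, so lifecycle.ts and gateway/overflow.ts share
// the same compact-timing.jsonl schema.
export function serializeCompactLogEntry(entry: CompactLogEntry): string {
  return JSON.stringify(entry) + "\n";
}

export function appendCompactLog(path: string, entry: CompactLogEntry): void {
  appendFileSync(path, serializeCompactLogEntry(entry));
}
```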
Reflects the post-review changes:
- F1: streaming model_error overflow path now routed through the same recovery surface as synchronous-throw overflow (was the primary Telegram-path bug; pi-ai surfaces provider errors as events, not exceptions).
- F2: TurnSource cron removal.
- F3: clearer unsupported-overflow user message.
- F4–F6: telemetry extraction, MAX_ERROR_PREVIEW dedup, named constant for failure-reason preview length.
- Test count corrected to 591 (+26 net).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9797640771
const streamResult = await this.handleStreaming(thread, agent.promptStream(agentThreadId, agentMessage), verboseThreads.has(agentThreadId), ac.signal);
turnUsedTools = streamResult.usedTools;
streamHadVisibleText = streamResult.hadVisibleText;
Preserve visible-stream state when overflow aborts streaming
streamHadVisibleText is only updated from streamResult after handleStreaming(...) returns, but handleStreaming now throws StreamModelOverflowError on overflow model_error events. In the common case where some text_delta was already emitted and then overflow occurs, this assignment never runs, so the catch path always receives hadVisibleText: false and posts the "Please resend your last message" copy. That is misleading after partial output and can prompt duplicate side effects if the user resends a message whose tools already started running.
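One possible fix for this finding, sketched under assumptions rather than taken from the project: carry the visible-text flag on the thrown error itself, so the catch path no longer depends on an assignment that never runs when handleStreaming throws.

```typescript
// Hypothetical variant: the streaming loop records whether any text_delta
// was flushed and attaches that to the typed overflow error it throws.
class StreamModelOverflowError extends Error {
  constructor(
    message: string,
    public readonly hadVisibleText: boolean,
  ) {
    super(message);
    this.name = "StreamModelOverflowError";
  }
}

// Illustrative catch-side wording selection: partial output must not ask
// the user to resend (tools may already have run).
function wordingFor(err: unknown, recovered: boolean): string {
  if (!recovered) return "⚠️ Recovery unavailable.";
  if (err instanceof StreamModelOverflowError && err.hadVisibleText) {
    return "♻️ Response was interrupted.";
  }
  return "✅ Recovered. Please resend your last message.";
}
```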
Summary
Closes the v0.5.38 "soft-reset pre-turn gap": idle sessions that grow past the model's context window via background cron/boot/sub-agent work now recover transparently when the user's next message would overflow.
Root cause
Soft-reset infrastructure (v0.5.28–v0.5.32) only fired inside flushMemoryThenCompact's catch — i.e. when compact itself overflowed. If a session crossed the limit without ever tripping proactive soft/hard pressure thresholds, the next agent.prompt() threw 'prompt is too long'. The gateway catch posted the raw provider error verbatim — no classification, no softReset call, no state update. The loop persisted indefinitely.

Design (codex-led)
Codex evaluated 4 alternatives and recommended D: a shared reactive recoverFromContextOverflow helper used from both flushMemoryThenCompact (existing path) and the new gateway catch. Deferred-retry UX (no transparent re-prompt — unsafe with streaming + tools). pendingCompact="emergency" fallback only when agent.compact exists.

Full design doc: docs/design/v0.5.38-soft-reset-pre-turn-gap.md

Code review fixes (F1–F6)
After initial implementation, code review surfaced 6 findings; all addressed in commits 662f032 / 4df2d3d / 71c674e:
- F1: model_error events from pi-ai streaming bypassed the gateway catch. Fix (codex-designed): classify overflow in streaming.ts and throw StreamModelOverflowError, routing into the existing gateway catch. +7 tests for streaming overflow scenarios.
- F2: removed 'cron' from the TurnSource union (cron jobs run via cron/runner.ts, never handleAgentTurn).
- F3: the unsupported outcome now posts a clear "adapter doesn't support recovery" hint instead of the raw provider error.
- F4: extracted appendCompactLog to memory/telemetry.ts (kills cross-domain coupling).
- F5: deduplicated MAX_ERROR_PREVIEW=200 between gateway.ts + overflow.ts.
- F6: added a MAX_FAILURE_REASON_PREVIEW=100 constant for UI message slicing.

Tests
591 passing (+7 from F1 streaming integration tests).
Verification
Bug currently live on Roy's main session (2.8MB, 211k tokens) for end-to-end validation post-merge. After ship + npm publish + gateway restart, send "hi" → expect ♻️ + ✅ recovery flow, not raw error.
- src/agents/shared/overflow-recovery.ts (helper, no state side effects)
- src/gateway/overflow.ts (gateway catch path, free function for testability)
- src/memory/telemetry.ts (extracted from lifecycle.ts)
- docs/design/v0.5.38-soft-reset-pre-turn-gap.md
- test/overflow-recovery.test.ts, test/gateway-overflow-recovery.test.ts
- gateway.ts (catch wired to helper, turnSource plumbed for boot/subagent)
- streaming.ts (model_error overflow → throw), lifecycle.ts (uses shared helper), CHANGELOG.md

~430 LOC src + ~520 LOC tests, 7 commits.