fix(gateway): soft-reset pre-turn gap recovery (v0.5.38)#143
Conversation
Codex (gpt-5.1) recommended Alternative D: shared reactive overflow recovery helper, deferred-retry UX, fallback to pendingCompact=emergency. Full design + open-question resolutions in docs/design/v0.5.38-soft-reset-pre-turn-gap.md. Implements next: classify-in-catch + agent.softReset() + state write + fallback when softReset is unavailable/fails.
…mpt catch (v0.5.38)
Closes the soft-reset pre-turn gap: when an idle session has already
grown past the provider context limit (e.g. via background cron/boot/
sub-agent activity that didn't trip soft/hard pressure), the next user
turn's agent.prompt() / agent.promptStream() throws 'prompt is too long'
before any pre-turn pressure handler can fire. v0.5.29-v0.5.32 only
recovered when *compact itself* overflowed; this catch path was
unprotected, so the gateway just posted the raw provider error and the
loop continued.
Changes:
- Extract recoverFromContextOverflow() into src/agents/shared/overflow-
recovery.ts (file-private attemptSoftResetRecovery in lifecycle.ts is
removed; lifecycle now imports the shared helper).
- Add src/gateway/overflow.ts with recoverFromAgentTurnOverflow():
classify -> softReset -> persist memory state (forceInjectReason or
pendingCompact='emergency' fallback when softReset is unavailable
but compact isn't) -> append telemetry (level='gateway-overflow')
-> post deferred-retry hint.
- Wire it into Gateway.handleAgentTurn's catch. Plumb turnSource
('user'|'boot'|'subagent') through so background turns get distinct
copy ('Original work was not retried.' instead of 'please resend').
- handleStreaming() now returns hadVisibleText so the catch can
distinguish 'please resend' from 'response was interrupted' on
streaming overflows after partial text.
- export appendCompactLog from lifecycle so the gateway-side recovery
uses the same compact-timing.jsonl schema.
UX: deferred retry only. No transparent same-turn replay because a
streamed turn that already emitted text or executed tools would
duplicate output / side effects on retry.
Design: docs/design/v0.5.38-soft-reset-pre-turn-gap.md (codex-cli
Alternative D).
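The recovery flow described above (classify, softReset, persist state, fall back to pendingCompact='emergency') can be sketched roughly as follows. Only recoverFromAgentTurnOverflow, forceInjectReason, pendingCompact, the turnSource union, and the user-facing copy come from this PR; AgentLike, MemoryState, and the isContextOverflow internals are illustrative assumptions, not the project's actual shapes.

```typescript
type TurnSource = "user" | "boot" | "subagent";

// Assumed shapes; the real project state and adapter interfaces differ.
interface MemoryState {
  forceInjectReason?: string;
  pendingCompact?: "emergency";
}

interface AgentLike {
  softReset?: () => Promise<{ reset: boolean; reason?: string }>;
  compact?: () => Promise<unknown>;
}

// Walk the .cause chain so wrapped provider errors still classify
// (the v0.5.30 regression covered by the tests below).
function isContextOverflow(err: unknown): boolean {
  let e: unknown = err;
  while (e instanceof Error) {
    if (/prompt is too long/i.test(e.message)) return true;
    e = (e as { cause?: unknown }).cause;
  }
  return false;
}

async function recoverFromAgentTurnOverflow(
  err: unknown,
  agent: AgentLike,
  state: MemoryState,
  source: TurnSource,
): Promise<string | undefined> {
  if (!isContextOverflow(err)) return undefined; // not ours: caller posts sanitized error

  if (agent.softReset) {
    try {
      const result = await agent.softReset();
      if (result.reset) {
        state.forceInjectReason = "after-soft-reset";
        delete state.pendingCompact;
        return source === "user"
          ? "✅ Recovered. Please resend your last message."
          : "✅ Recovered. Original work was not retried.";
      }
    } catch {
      // fall through to the pendingCompact fallback below
    }
  }

  // softReset unavailable or failed: arm an emergency compact only if the
  // adapter can actually honor it on the next turn.
  if (agent.compact) {
    state.pendingCompact = "emergency";
    return "♻️ Recovery armed. Please resend your last message.";
  }
  return undefined;
}
```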
19 new tests across two files (584 total, +19 net):

test/overflow-recovery.test.ts (8 tests) — direct unit tests on the extracted recoverFromContextOverflow helper:
- non-overflow returns not-overflow without calling softReset
- adapter without softReset returns unsupported
- softReset success returns recovered with report + ✅ progress
- softReset reset:false returns noop with reason + ⚠️ progress
- softReset throws returns failed with message + ❌ progress
- softReset throws non-Error (string) — String() coerced, no TypeError (regression for v0.5.32 fix #4)
- overflow buried in .cause chain still classifies (v0.5.30 regression)
- onProgress optional — undefined callback handled

test/gateway-overflow-recovery.test.ts (11 tests) — gateway-level integration tests on recoverFromAgentTurnOverflow with fake adapter + fake thread, covering the brief's required surface:
- overflow during non-streaming prompt + softReset success → recovered hint + state.forceInjectReason='after-soft-reset' + cleared pendingCompact
- overflow + softReset undefined + no compact → sanitized error, no arming
- overflow + softReset throws + has compact → pendingCompact='emergency' armed + 'Recovery armed (failed: …)' hint
- overflow + softReset reset:false + has compact → pendingCompact armed + 'Recovery armed (noop: <reason>)' hint
- overflow + softReset fails + no compact → sanitized error, no arming (don't arm a flag the next turn can't honor)
- non-overflow error → sanitized error, no recovery, no state mutation
- streaming overflow before any text → '✅ Recovered. Please resend…'
- streaming overflow after partial text → '♻️ Response was interrupted'
- background turns (boot/subagent/cron) → distinct copy, no resend ask
- thread.post throws → recovery still updates state, doesn't propagate
- overflow in cause chain → still triggers recovery

All tests pass: 584 / 584.
…, v0.5.38)

pi-ai's streaming surfaces provider errors as model_error EVENTS, not thrown exceptions. The v0.5.38 catch in Gateway.handleAgentTurn only saw synchronous throws from agent.prompt()/promptStream(), so a streamed 'prompt is too long' bypassed recoverFromAgentTurnOverflow entirely: the model_error case posted the raw error inline, the for-await continued to turn_end, and the function returned normally. On Telegram (streaming by default), the v0.5.38 fix was effectively a no-op.

Per the codex-cli design (see ~/.roundhouse/workspace/softreset-f1-codex-design.md): classify the model_error message in streaming.ts via isContextOverflowError. Non-overflow keeps the existing inline post + continue-loop. Overflow flushes, suppresses the inline raw post, and throws a typed StreamModelOverflowError carrying the provider message. The gateway catch's existing recoverFromAgentTurnOverflow handles it identically to synchronous-throw overflow — single recovery surface, no duplicate posts, no flag plumbed through StreamResult.

7 new streaming-overflow tests:
- model_error overflow throws + suppresses raw post
- model_error non-overflow keeps inline post + continues loop (regression)
- text_delta then overflow throws with overflow message
- non-overflow inline post format unchanged
- end-to-end: overflow → recovery → 'please resend' wording
- end-to-end: partial text → recovery → 'interrupted' wording
- end-to-end: non-overflow → no recovery, raw post stands alone

591 tests passing (584 + 7).
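A minimal sketch of the streamed-error routing described in this commit. StreamModelOverflowError and isContextOverflowError are named in the PR; the event shape and the handleModelError helper are assumptions made here for illustration.

```typescript
// Typed error so the gateway catch can treat a streamed overflow exactly
// like a synchronous-throw overflow (single recovery surface).
class StreamModelOverflowError extends Error {
  constructor(providerMessage: string) {
    super(providerMessage);
    this.name = "StreamModelOverflowError";
  }
}

function isContextOverflowError(message: string): boolean {
  return /prompt is too long/i.test(message);
}

// Assumed event shape; pi-ai's real stream events differ.
interface StreamEvent {
  type: "text_delta" | "model_error" | "turn_end";
  text?: string;
  message?: string;
}

// Inside the streaming for-await loop, a model_error is either posted
// inline (non-overflow, existing behavior) or re-thrown so the gateway
// catch owns recovery.
function handleModelError(ev: StreamEvent, postInline: (msg: string) => void): void {
  const msg = ev.message ?? "";
  if (isContextOverflowError(msg)) {
    // Suppress the raw inline post; the thrown error carries the message.
    throw new StreamModelOverflowError(msg);
  }
  postInline(`⚠️ ${msg}`); // non-overflow keeps the inline post + continue-loop
}
```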
…rflow hint (F2+F3)
F2: TurnSource included 'cron' but cron jobs run via cron/runner.ts in
their own session and never invoke Gateway.handleAgentTurn. Verified
via grep on this.handleAgentTurn — only user-message-handler,
fireBootTurn, and handleSubagentCompletion reach it. Removed 'cron'
from the union. If a future patch adds cron→main injection, add it
back at that point (YAGNI).
F3: pickUserMessage's 'unsupported' branch (no softReset on adapter,
no armed pending) returned the raw sanitized provider error
('⚠️ Error: prompt is too long: N > M'). User saw a wall-of-numbers
with no indication that recovery was unavailable. Now returns:
'⚠️ Session full — adapter doesn't support automatic recovery.
Run /compact manually or restart session.'
Updated:
- src/gateway/overflow.ts — TurnSource type, file header comment,
pickUserMessage unsupported branch.
- test/gateway-overflow-recovery.test.ts — drop 'cron' from
background-turn iteration; rewrite the 'unsupported-no-compact' test
to assert the new clear-guidance copy and verify the raw provider
error does NOT leak through.
- docs/design/v0.5.38-soft-reset-pre-turn-gap.md — match TurnSource shape.
591 tests passing.
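The F3 change could look roughly like this sketch of pickUserMessage. The Outcome union here is a simplified, hypothetical subset of the real outcomes; only the unsupported wording, the 'Recovery armed (failed: …)' shape, and MAX_FAILURE_REASON_PREVIEW come from this PR.

```typescript
// Hypothetical simplified outcome union for illustration.
type Outcome =
  | { kind: "recovered" }
  | { kind: "unsupported" }
  | { kind: "failed"; reason: string };

// UI message is bracketed by other copy, so this preview is shorter than
// a standalone error preview would be.
const MAX_FAILURE_REASON_PREVIEW = 100;

function pickUserMessage(outcome: Outcome): string {
  switch (outcome.kind) {
    case "recovered":
      return "✅ Recovered. Please resend your last message.";
    case "unsupported":
      // Was: raw sanitized provider error ('⚠️ Error: prompt is too long: N > M').
      return "⚠️ Session full — adapter doesn't support automatic recovery. Run /compact manually or restart session.";
    case "failed":
      return `♻️ Recovery armed (failed: ${outcome.reason.slice(0, MAX_FAILURE_REASON_PREVIEW)})`;
  }
}
```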
F4 — appendCompactLog cross-domain import. src/gateway/overflow.ts imported appendCompactLog from src/memory/lifecycle, coupling gateway → memory/lifecycle (flagged as follow-up in the v0.5.38 design doc). Extracted appendCompactLog + CompactLogEntry to a new src/memory/telemetry.ts (~43 LOC). Both lifecycle.ts and gateway/overflow.ts now import from telemetry. lifecycle.ts re-exports the symbol for backwards compatibility (no external callers found via grep).

F5 — MAX_ERROR_PREVIEW duplication. The constant existed in both src/gateway/gateway.ts and src/gateway/overflow.ts. gateway.ts had only the declaration (its sole consumer moved to overflow.ts with the v0.5.38 catch refactor) — deleted as dead code. overflow.ts keeps its single definition.

F6 — slice(0, 100) magic number. pickUserMessage's failed-outcome reason truncation used a bare 100. Added named constant MAX_FAILURE_REASON_PREVIEW = 100 with a comment explaining why it's shorter than MAX_ERROR_PREVIEW (a UI message bracketed by other copy, not a standalone error preview).

591 tests passing.
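A minimal sketch, under assumptions, of what the extracted src/memory/telemetry.ts might contain. appendCompactLog and CompactLogEntry are the PR's names; the field set beyond level is illustrative, and the serializeCompactLogEntry helper is hypothetical, added here only to keep the example testable.

```typescript
import { appendFileSync } from "node:fs";

// Assumed fields; only the level discriminator is attested in the PR
// (e.g. level='gateway-overflow' for gateway-side recovery).
export interface CompactLogEntry {
  level: string;
  timestamp: string; // ISO-8601
  outcome?: string;
  durationMs?: number;
}

// One JSON object per line, so lifecycle.ts and gateway/overflow.ts share
// the same compact-timing.jsonl schema.
export function serializeCompactLogEntry(entry: CompactLogEntry): string {
  return JSON.stringify(entry) + "\n";
}

export function appendCompactLog(path: string, entry: CompactLogEntry): void {
  appendFileSync(path, serializeCompactLogEntry(entry));
}
```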
Reflects the post-review changes:
- F1: streaming model_error overflow path now routed through the same recovery surface as synchronous-throw overflow (was the primary Telegram-path bug; pi-ai surfaces provider errors as events, not exceptions).
- F2: TurnSource cron removal.
- F3: clearer unsupported-overflow user message.
- F4–F6: telemetry extraction, MAX_ERROR_PREVIEW dedup, named constant for failure-reason preview length.
- Test count corrected to 591 (+26 net).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9797640771
const streamResult = await this.handleStreaming(thread, agent.promptStream(agentThreadId, agentMessage), verboseThreads.has(agentThreadId), ac.signal);
turnUsedTools = streamResult.usedTools;
streamHadVisibleText = streamResult.hadVisibleText;
Preserve visible-stream state when overflow aborts streaming
streamHadVisibleText is only updated from streamResult after handleStreaming(...) returns, but handleStreaming now throws StreamModelOverflowError on overflow model_error events. In the common case where some text_delta was already emitted and then overflow occurs, this assignment never runs, so the catch path always receives hadVisibleText: false and posts the "Please resend your last message" copy. That is misleading after partial output and can prompt duplicate side effects if the user resends a message whose tools already started running.
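One possible fix for this finding, sketched under assumptions rather than taken from the project: carry the visible-text flag on the thrown error itself, so the catch path no longer depends on an assignment that never runs when handleStreaming throws.

```typescript
// Hypothetical variant: the streaming loop records whether any text_delta
// was flushed and attaches that to the typed overflow error it throws.
class StreamModelOverflowError extends Error {
  constructor(
    message: string,
    public readonly hadVisibleText: boolean,
  ) {
    super(message);
    this.name = "StreamModelOverflowError";
  }
}

// Illustrative catch-side wording selection: partial output must not ask
// the user to resend (tools may already have run).
function wordingFor(err: unknown, recovered: boolean): string {
  if (!recovered) return "⚠️ Recovery unavailable.";
  if (err instanceof StreamModelOverflowError && err.hadVisibleText) {
    return "♻️ Response was interrupted.";
  }
  return "✅ Recovered. Please resend your last message.";
}
```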
Summary
Closes the v0.5.38 "soft-reset pre-turn gap": idle sessions that grow past the model's context window via background cron/boot/sub-agent work now recover transparently when the user's next message would overflow.
Root cause
Soft-reset infrastructure (v0.5.28–v0.5.32) only fired inside flushMemoryThenCompact's catch — i.e. when compact itself overflowed. If a session crossed the limit without ever tripping proactive soft/hard pressure thresholds, the next agent.prompt() threw 'prompt is too long'. The gateway catch posted the raw provider error verbatim — no classification, no softReset call, no state update. The loop persisted indefinitely.

Design (codex-led)
Codex evaluated 4 alternatives and recommended D: a shared reactive recoverFromContextOverflow helper used from both flushMemoryThenCompact (existing path) and the new gateway catch. Deferred-retry UX (no transparent re-prompt — unsafe with streaming + tools). pendingCompact="emergency" fallback only when agent.compact exists.

Full design doc: docs/design/v0.5.38-soft-reset-pre-turn-gap.md

Code review fixes (F1–F6)
After initial implementation, code review surfaced 6 findings; all addressed in commits 662f032 / 4df2d3d / 71c674e:
- F1: model_error events from pi-ai streaming bypassed the gateway catch. Fix (codex-designed): classify overflow in streaming.ts and throw StreamModelOverflowError, routing into the existing gateway catch. +7 tests for streaming overflow scenarios.
- F2: removed 'cron' from the TurnSource union (cron jobs run via cron/runner.ts, never handleAgentTurn).
- F3: the unsupported outcome now posts a clear "adapter doesn't support recovery" hint instead of the raw provider error.
- F4: extracted appendCompactLog to memory/telemetry.ts (kills cross-domain coupling).
- F5: deduplicated MAX_ERROR_PREVIEW=200 between gateway.ts + overflow.ts.
- F6: added a MAX_FAILURE_REASON_PREVIEW=100 constant for UI message slicing.

Tests
591 passing (+7 from F1 streaming integration tests).
Verification
Bug currently live on Roy's main session (2.8MB, 211k tokens) for end-to-end validation post-merge. After ship + npm publish + gateway restart, send "hi" → expect ♻️ + ✅ recovery flow, not raw error.
- src/agents/shared/overflow-recovery.ts (helper, no state side effects)
- src/gateway/overflow.ts (gateway catch path, free function for testability)
- src/memory/telemetry.ts (extracted from lifecycle.ts)
- docs/design/v0.5.38-soft-reset-pre-turn-gap.md
- test/overflow-recovery.test.ts, test/gateway-overflow-recovery.test.ts
- gateway.ts (catch wired to helper, turnSource plumbed for boot/subagent)
- streaming.ts (model_error overflow → throw), lifecycle.ts (uses shared helper), CHANGELOG.md

~430 LOC src + ~520 LOC tests, 7 commits.