Skip to content

feat: compaction stability overhaul — loop prevention, stream rejection fixes, benchmark tasks#94

Merged
minpeter merged 90 commits intomainfrom
feat/minimal-agent-speculative-compaction
Apr 6, 2026
Merged

feat: compaction stability overhaul — loop prevention, stream rejection fixes, benchmark tasks#94
minpeter merged 90 commits intomainfrom
feat/minimal-agent-speculative-compaction

Conversation

@minpeter
Copy link
Copy Markdown
Owner

@minpeter minpeter commented Apr 2, 2026

Summary

Large-scope stability release for the compaction subsystem. Resolves infinite compaction loops in small-context scenarios, eliminates process crashes caused by unhandled stream rejections, and lands numerous harness/TUI stability fixes. Also ships minimal-agent speculative compaction tuning and new benchmark tasks (the work that originally opened this PR).

Key fixes

🔁 Infinite compaction loop prevention (harness)

  • Per-turn cap (maxAcceptedCompactionsPerTurn, default 10): combines accepted + ineffective compactions. When the cap is hit, no further compaction runs that turn.
  • Relaxed acceptance gate: only fitsBudget failures reject a compaction attempt. belowTriggerThreshold/meetsMinSavings are observability-only signals.
  • Turn boundaries tracked via notifyNewUserTurn() wired from TUI and headless runtime.
  • Task-aware 2-step compaction (opt-in, enabled in CEA): extracts the current user turn's task intent before summarizing history, injecting it into the compacted user-turn content so the agent doesn't lose the work context.
  • Fixes `CompactionCircuitBreaker.getState()` consistency (cooldown-triggered reset no longer mixes pre/post-reset state in the snapshot).

💥 Unhandled rejection crash fixes

Crashes with `Unhandled rejection: NoOutputGeneratedError` when the provider returns an empty stream (or the network fails):

  • harness/agent.ts: `createAgent.stream()` eagerly invokes four promise getters (`finishReason`, `response`, `usage`, `totalUsage`). Production consumers only awaited three, leaving `totalUsage` as a floating rejected promise. Added no-op rejection silencers on all four original promise instances while still returning them unchanged, so awaiters receive rejections normally.
  • cea/entrypoints/main.ts: `buildAgentStreamWithTodoContinuation` wrapper creates new promise chains via IIFEs and `.then()` that fan out from the base stream. Added matching silencers on `continuationDecision`, `response`, and `finishReason` as defense-in-depth for partial-await call patterns.

🔧 Harness compaction stability (numerous commits from earlier work)

  • `updateActualUsage` no longer misattributes `totalTokens` as `promptTokens` in compaction trigger calculations.
  • Tool-result truncation happens immediately after `updateActualUsage` so context-overflow enforcement is accurate.
  • Tool-result truncation preserves output structure and uses consistent inner-field text extraction.
  • Truncation ceiling raised to 90% of context limit to prevent garbled context.
  • `actualUsage` is never nulled; `refreshEstimatedUsage` runs after every message mutation.
  • Circuit breaker auto-resets on cooldown expiry and tracks non-benign failures.
  • Tool-result truncation on `addModelMessages` context-budget overflow.
  • Fallback to message token estimation when usage reports zero.
  • Post-compact restoration uses boundary-based label matching.
  • `BackgroundMemoryExtractor` reused across agent rebuilds with tuned restoration budget.

📊 minimal-agent speculative compaction (original PR scope)

Enable speculative compaction in minimal-agent with thresholds tuned for 30-turn chatbot conversations within a 4096-token context window:

  • Context limit: 2000 → 4096
  • speculativeStartRatio: 0.8 (background compaction at ~2130 tokens)
  • thresholdRatio: 0.65 (blocking compaction at ~2662 tokens)
  • reserveTokens: 512
  • keepRecentTokens: 800 (~8-10 turns preserved)
  • 30-turn benchmark suite with memory-probe questions every 5 turns.

🎨 TUI & headless

  • `compactBeforeNextTurnIfNeeded` always runs regardless of probe success.
  • Context-usage notices use estimated tokens to avoid stale actualUsage races.
  • Compaction callbacks passed via `CompactionOrchestratorOptions.callbacks`.
  • Context suggestions in footer disabled by default (opt-in via config option).
  • Headless uses a local default for ATIF output path and `mkdir -p` before write.

📦 Benchmark additions

  • Compaction stress benchmark tasks.
  • Search-heavy compaction stress task (32K context).
  • Trajectory analysis scorer.
  • Harbor v1.1 format restructure.

Version bumps (via changesets)

Package Bump From → To
`plugsuits` (CEA) minor 2.2.1 → 2.3.0
`@ai-sdk-tool/harness` patch 1.2.0 → 1.2.1
`@ai-sdk-tool/tui` patch 3.0.0 → 3.0.1
`@ai-sdk-tool/headless` patch 3.0.0 → 3.0.1
`@plugsuits/minimal-agent` patch 0.2.3 → 0.2.4

CEA gets the minor bump because the user-facing feature set (task-aware compaction, loop prevention, crash fixes) surfaces through CEA's runtime. Harness/tui/headless changes are internal enhancements that preserve their public API contracts.

Benchmark results (minimal-agent, 4096 context)

Context Compaction Cycles Memory Retention Speculative Verified
2000 2+ 53% (9/17) ✅ spec started → applied → compacted
3000 2 71-76% (12-13/17) ✅ spec started → applied → compacted
4096 0 76-82% (13-14/17) threshold set, no trigger needed

At 4096, all 30 turns fit without compaction (~2300 tokens peak) — probe scores reflect model response quality, not information loss.

Testing

  • 1075+ tests pass across all workspaces (515 harness + 523 cea + 14 tui + 23 headless + minimal-agent)
  • Per-field isolation regression tests for promise silencers (mutation-verified: each guard is independently necessary)
  • Compaction loop prevention tests: 21 tests covering gate rejection, per-turn cap, infinite-loop scenarios
  • Compaction integration tests: 7 scenarios simulating 32k-context investigations, verbose usage, multi-turn flows — all produce 0 blocking compactions

Related

Tune compaction config for 2000-token context to support speculative
compaction during multi-turn chatbot conversations:

- Add speculativeStartRatio (0.75) to trigger background compaction at 750 tokens
- Set explicit thresholdRatio (0.5) for blocking compaction at 1000 tokens
- Lower reserveTokens from 500 to 400 (chatbot responses are shorter)
- Lower keepRecentTokens from 500 to 350 (~3-4 turns preserved)
- Add contextLimit to compaction config directly
- Add benchmark script (benchmark.ts) for automated 30-turn memory
  retention testing with probe questions every 5 turns, metrics table,
  and ASCII context usage chart
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Important

Review skipped

Auto reviews are disabled on this repository. To trigger a review, include @crb review in the PR description. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ca3cb33-2c4d-4a05-afb7-52db70680401

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/minimal-agent-speculative-compaction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new benchmark script for testing compaction strategies in the minimal agent, along with updated configuration parameters for compaction tuning. The review identified redundant calls to setContextLimit in both the agent and the benchmark script, as this value is already configured during CheckpointHistory instantiation. Additionally, it was noted that compaction parameters are currently duplicated between the agent and the benchmark script, and should be centralized to ensure consistency.

Comment thread packages/minimal-agent/index.ts Outdated
Comment thread packages/minimal-agent/benchmark.ts Outdated
Comment thread packages/minimal-agent/benchmark.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 915d862903

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/minimal-agent/benchmark.ts
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/benchmark.ts">

<violation number="1" location="packages/minimal-agent/benchmark.ts:456">
P3: These compaction thresholds (`thresholdRatio`, `speculativeRatio`, `reserveTokens`, `keepRecentTokens`) are duplicated from `index.ts`. Since this benchmark exists to validate the agent's compaction behavior, the values should be imported from a shared location — otherwise a tuning change in `index.ts` silently makes this benchmark test the wrong config.</violation>

<violation number="2" location="packages/minimal-agent/benchmark.ts:603">
P2: Validate `--context-limit` before use. Small/invalid values can make chart `repeat()` counts negative and crash the benchmark.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/benchmark.ts Outdated
Comment thread packages/minimal-agent/benchmark.ts Outdated
minpeter added 3 commits April 1, 2026 19:34
Increase context budget from 2000 to 4096 tokens and retune thresholds
to minimize compaction cycles:

- contextLimit: 2000 → 4096
- reserveTokens: 400 → 512
- keepRecentTokens: 350 → 800 (~8-10 turns preserved)
- thresholdRatio: 0.5 → 0.65 (blocking at 2662 tokens)
- speculativeStartRatio: 0.75 → 0.8 (speculative at 2130 tokens)

Benchmark results (30-turn chatbot, GLM-5):
- Memory retention: 41% → 82% (14/17 probes passed)
- Compaction cycles: 3 → 0 (all 30 turns fit in context)
- All targeted probes (turns 5-20, 30) pass at 100%
- Turn 25 comprehensive probe: 3/6 (model response quality, not context loss)
…esults

Set temperature to 0 in generateText calls to eliminate model
nondeterminism. Verified 82% (14/17) retention at 4096 context
is reproducible across runs.
Add --output flag to benchmark for JSON result export.
Add visualize.py (matplotlib) that generates 4 charts from JSON results:
- retention_curve: Memory retention % vs context size
- token_usage: Context token usage over 30 turns
- probe_heatmap: Per-probe recall scores
- summary: 3-panel overview

Usage: python3 visualize.py results/*.json --output charts/
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 2, 2026

Benchmark Visualization — 5-Point Context Size Sweep

Ran the benchmark at 5 context sizes (1500, 2000, 2500, 3000, 4096) with temperature: 0. Charts generated via visualize.py.

How to reproduce

# Run benchmarks
for ctx in 1500 2000 2500 3000 4096; do
  pnpm --filter @plugsuits/minimal-agent benchmark --context-limit $ctx --output results/$ctx.json
done

# Generate charts
python3 packages/minimal-agent/visualize.py results/*.json --output charts/

Raw Data

Context Retention Compaction Cycles Peak Tokens Peak Utilization
1500 65% (11/17) 7 3245 216%
2000 53% (9/17) 1 1775 89%
2500 59% (10/17) 0 1998 80%
3000 82% (14/17) 0 1978 66%
4096 82% (14/17) 0 2412 59%

Key Insights

  1. Sweet spot at 3000+ tokens: Memory retention plateaus at 82% for context ≥ 3000. Going from 3000 → 4096 doesn't improve retention but provides more headroom for longer conversations.

  2. The "valley of death" at 2000: 2000 tokens sits at a boundary where compaction triggers once but the summary loses too much information. 1500 actually performs better (65%) despite 7 compaction cycles because the repeated summarization creates a stable compressed memory.

  3. Turn 25 is the universal bottleneck: The comprehensive recall probe ("list everything about me") scores poorly (0-3/6) across ALL context sizes, even when all information is in context. This is a model response quality issue, not a compaction issue.

  4. Improvement opportunities: The 82% ceiling is not from compaction — it's from the model's ability to enumerate facts when asked. Potential improvements:

    • System prompt engineering: "When asked to recall, list ALL known facts"
    • Structured user profile tracking
    • Better probe question design

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/visualize.py">

<violation number="1" location="packages/minimal-agent/visualize.py:284">
P2: Validate that `--output` has a following directory argument before indexing, otherwise this crashes with `IndexError` on malformed CLI input.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/visualize.py
…+ friendli)

Add --provider flag to benchmark supporting 'anthropic' and 'friendli'.
Refactor callModel to accept LanguageModel interface for provider-agnostic
benchmarking. Add @ai-sdk/anthropic dependency.

Opus benchmark result at 4096 context: 94% (16/17) vs GLM-5's 82% (14/17).
Key difference: Turn 25 comprehensive recall 6/6 (vs 3/6 with GLM-5),
confirming the 82% ceiling was model response quality, not compaction.
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 2, 2026

Model Comparison: GLM-5 vs Claude Opus 4

Ran the benchmark with --provider anthropic --model claude-opus-4-20250514 at 4096 context to test whether the 82% retention ceiling is a compaction issue or a model quality issue.

Results

Model Retention Turn 25 (comprehensive) Turn 30 (first+recent) Compaction
GLM-5 82% (14/17) 3/6 3/3 0 cycles
Claude Opus 4 94% (16/17) 6/6 2/3 0 cycles

Analysis

The 82% ceiling was indeed model quality, not compaction. With the same 4096-token context and same compaction config:

  • Turn 25 (comprehensive recall): Opus scores 6/6 vs GLM-5's 3/6. When asked "list everything you remember about me", Opus enumerates ALL facts (Alice, Italian, Max, Rust, San Francisco, hiking). GLM-5 omits several.
  • Turns 5-20 (targeted probes): Both models score 100% — confirming the compaction tuning preserves all information.
  • Turn 30: Opus scores 2/3 (missed "Italian" — likely because compaction happened at turn 27 and the summary compressed that detail). GLM-5 scored 3/3 because it had no compaction.

Key Insight

Opus's longer, more thorough responses (~2483 peak tokens) actually triggered speculative compaction at turn 25, while GLM-5's shorter responses (~2291 peak) never triggered it. Despite this compaction handicap, Opus still scored 12%p higher — proving that model capability is the dominant factor in memory retention, not context management.

Benchmark command

pnpm --filter @plugsuits/minimal-agent benchmark \
  --provider anthropic --model claude-opus-4-20250514 \
  --output results/opus-4096.json

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/benchmark.ts">

<violation number="1" location="packages/minimal-agent/benchmark.ts:653">
P2: Validate and normalize `--provider` before branching; invalid values currently default silently to Friendli.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/benchmark.ts
… chatbot

Replace generic system prompt with fact-retention-aware version that
instructs the model to remember personal information and list ALL
known facts when asked to recall.

Replace code-agent-oriented compaction prompt (Files & Changes,
Technical Discoveries) with chatbot-specific version that prioritizes:
- User Profile extraction (all personal details as bullet points)
- Conversation Highlights (topics, advice, decisions)
- Current Topic (for continuity)

Impact on memory retention (GLM-5):
- 2000 tokens: 53% → 71% (+18pp, compaction preserves user facts)
- 4096 tokens: 82% → 82% (no change, compaction not triggered)
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 2, 2026

System Prompt + Compaction Prompt Optimization

Changes

System prompt — added fact-retention instructions:

When the user shares personal information (name, preferences, pets, job, hobbies, etc.), remember it carefully.
When asked to recall information, list ALL known facts — do not omit any details.

Compaction prompt — replaced code-agent template (Files & Changes, Technical Discoveries) with chatbot-specific template:

  • User Profile: extract ALL personal details as bullet points
  • Conversation Highlights: topics, advice, decisions (brief)
  • Current Topic: what the user was last discussing

Results

Context Before (old prompts) After (new prompts) Delta
2000 53% (9/17) 71% (12/17) +18pp
4096 82% (14/17) 82% (14/17) +0pp

Analysis

The +18pp improvement at 2000 tokens is significant — it proves the chatbot compaction prompt preserves user facts much better than the code-agent default. The compaction summary now produces a "User Profile" section that explicitly lists all personal details.

At 4096, no change because compaction never triggers (all 30 turns fit in context). The improvement only manifests when compaction actually runs.

Cumulative improvement path

Change 2000 tokens 4096 tokens
Baseline (original) 53% n/a
+ Context 4096 82%
+ Chatbot prompts 71% 82%
+ Claude Opus model 94%

minpeter added 2 commits April 1, 2026 22:24
…rison

Extend conversation from 30 to 50 turns (10 probes total) to force
compaction at 4096 context. Add --baseline flag to benchmark for
A/B testing against the default code-agent compaction prompt.

Key result at 4096 context, 50 turns:
- Chatbot prompt: Turn 35 (post-compaction) scores 4/4 on pet recall
- Baseline prompt: Turn 35 scores 0/4 — pet info lost in compaction
- Overall: 54% vs 51% (chatbot vs baseline)
…nalysis

Upgrade CHATBOT_COMPACTION_PROMPT with techniques learned from Claude Code:

1. Analysis scratchpad: <analysis> block for think-before-summarize
   (harness already strips it, only <summary> content is kept)
2. All User Messages list: explicit section preserving user intent trail
3. Previous-summary fact preservation: carry forward ALL facts from
   prior compaction, never drop information across cycles
4. Partial compact awareness: focus summary on older messages since
   recent ones are preserved separately via keepRecentTokens

50-turn benchmark at 4096 context (GLM-5):
- Before: 54% (20/37)
- After:  62% (23/37) — +8pp improvement
- Turn 40 (post-compaction recall): 2/4 → 4/4 (perfect)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/index.ts">

<violation number="1" location="packages/minimal-agent/index.ts:30">
P2: Requiring an `<analysis>` section wastes the summarizer’s limited output budget and can reduce the amount of useful summary content retained.</violation>

<violation number="2" location="packages/minimal-agent/index.ts:44">
P1: The prompt asks to list every user message, but compaction instructions are injected as a user turn, so internal control text can be incorrectly persisted into summaries.</violation>
</file>

<file name="packages/minimal-agent/benchmark.ts">

<violation number="1" location="packages/minimal-agent/benchmark.ts:35">
P2: Exclude the internal compaction instruction from the “All User Messages” list; otherwise the model is likely to summarize the compaction prompt itself (it is sent as a user message), which bloats and pollutes the summary.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/index.ts
Comment thread packages/minimal-agent/index.ts
Comment thread packages/minimal-agent/benchmark.ts
Three new shared compaction modules inspired by Claude Code's context
management architecture:

1. CompactionCircuitBreaker (compaction-circuit-breaker.ts)
   - Tracks consecutive compaction failures, opens after 3 (configurable)
   - Auto-closes after cooldown period (default 60s)
   - Prevents infinite retry loops on irrecoverable context overflow

2. microCompactMessages (micro-compact.ts)
   - Pre-compaction step that shrinks old long assistant responses
   - Protects recent messages (configurable token window)
   - Preserves user messages and summary messages
   - Immutable: returns new array without modifying input

3. SessionMemoryTracker (session-memory.ts)
   - Structured key-value memory persisting across compaction cycles
   - Categorized facts: identity, preferences, relationships, context
   - getStructuredState() callback for CompactionConfig integration
   - extractFactsFromSummary() to parse User Profile from summaries
   - JSON serialization for persistence (toJSON/fromJSON)

All modules exported from @ai-sdk-tool/harness.
21 test files, 399 tests passing.
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 issues found across 12 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/charts/1500.json">

<violation number="1" location="packages/minimal-agent/charts/1500.json:255">
P2: This benchmark artifact is mislabeled or generated with the wrong context setting: `peakTokens` and per-turn `contextAfter` values exceed the declared `contextLimit` (1500), so the chart data is unreliable.</violation>
</file>

<file name="packages/harness/src/session-memory.ts">

<violation number="1" location="packages/harness/src/session-memory.ts:286">
P2: Guard non-object entries before normalizing imported facts to prevent runtime crashes when JSON contains `null` or other invalid items.</violation>
</file>

<file name="packages/harness/src/compaction-circuit-breaker.ts">

<violation number="1" location="packages/harness/src/compaction-circuit-breaker.ts:28">
P2: `maxConsecutiveFailures` can become `0` for fractional inputs (e.g. `0.5`), which breaks threshold semantics and can open the circuit prematurely.</violation>
</file>

<file name="packages/minimal-agent/charts/2000.json">

<violation number="1" location="packages/minimal-agent/charts/2000.json:158">
P2: The compaction delta text is inconsistent with the recorded context values (`1775` to `123`), so this chart entry reports an incorrect compaction magnitude.</violation>
</file>

<file name="packages/minimal-agent/charts/3000.json">

<violation number="1" location="packages/minimal-agent/charts/3000.json:254">
P2: `summary.compactionCycles` is inconsistent with the recorded compaction events/token reset and should not be 0.</violation>
</file>

<file name="packages/minimal-agent/charts/2500.json">

<violation number="1" location="packages/minimal-agent/charts/2500.json:172">
P2: The compaction delta annotation is inconsistent with the recorded context totals (`1991 -> 206`), so the `(-285)` value is incorrect/misleading.</violation>

<violation number="2" location="packages/minimal-agent/charts/2500.json:254">
P2: `summary.compactionCycles` is inconsistent with the recorded compaction events and should not be `0`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/charts/1500.json
Comment thread packages/harness/src/session-memory.ts
Comment thread packages/harness/src/compaction-circuit-breaker.ts
Comment thread packages/minimal-agent/charts/2000.json
Comment thread packages/minimal-agent/charts/3000.json
Comment thread packages/minimal-agent/charts/2500.json
Comment thread packages/minimal-agent/charts/2500.json
Connect all 3 Phase 1 modules into the compaction pipeline:

Circuit Breaker → CompactionOrchestrator:
- New optional circuitBreaker param in constructor
- checkAndCompact() skips when circuit is open
- recordSuccess/recordFailure on compaction outcome
- manualCompact() ignores circuit breaker (user intent)
- getState() exposes circuitBreakerOpen status

MicroCompact → CheckpointHistory:
- New microCompact option in CompactionConfig (boolean or options)
- Pre-compaction step: shrinks old assistant responses before summarization
- Reduces summarizer input tokens → better summary quality
- COMPACTION_DEBUG logging for tokensSaved/messagesModified

Session Memory → minimal-agent:
- SessionMemoryTracker instance wired via getStructuredState callback
- extractFactsFromSummary called on compaction completion
- Structured user profile injected into every compaction prompt

TUI + Headless compactionCallbacks:
- Both runners now accept compactionCallbacks in config
- Chains external callbacks with internal ones (both fire)
- Enables minimal-agent to hook into compaction lifecycle

Adaptive Thresholds → harness (from CEA):
- Moved computeAdaptiveThresholdRatio, computeCompactionMaxTokens,
  computeSpeculativeStartRatio from CEA to harness/compaction-policy
- CEA now imports from harness (backwards-compatible re-exports)
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 2, 2026

Phase 3 benchmarks complete. 2k: 57% (21/37, 4 compactions). 4k: 62% (23/37). 8k: 62% (23/37). Circuit Breaker + MicroCompact + Session Memory all verified working.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/index.ts">

<violation number="1" location="packages/minimal-agent/index.ts:160">
P2: Reset the SessionMemoryTracker on "new-session"; otherwise facts extracted from a previous session will keep being injected into compaction prompts for the new session.</violation>
</file>

<file name="packages/harness/src/compaction-policy.ts">

<violation number="1" location="packages/harness/src/compaction-policy.ts:147">
P2: Cap the compaction max tokens to the usable budget; the current 1024 minimum can exceed small context lengths or over-reserved budgets.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/index.ts
Comment thread packages/harness/src/compaction-policy.ts
minpeter added 2 commits April 1, 2026 23:58
Enable all 3 harness compaction features in minimal-agent:
- CircuitBreaker: passed to TUI/headless via new config option
- MicroCompact: enabled via microCompact: true in compaction config
- SessionMemory: already wired (getStructuredState + extractFactsFromSummary)

Also expose circuitBreaker option in TUI and headless runner configs
so any consuming agent can pass one through.
…tool result clearing

BackgroundMemoryExtractor system (Claude Code's #1 missing feature):
- BackgroundMemoryExtractor class: periodic LLM-based memory extraction
  with configurable thresholds (token growth + turn count), single-flight
  guard, and getStructuredState() for compaction integration
- MemoryStore interface with InMemoryStore and FileMemoryStore impls
- Two built-in presets: CHAT_MEMORY_PRESET (user facts) and
  CODE_MEMORY_PRESET (Claude Code-style session notes)

Tool Result MicroCompact extension:
- clearToolResults option to replace old tool_result content
- keepRecentToolResults to preserve N most recent results
- clearableToolNames filter for selective clearing
- Complements existing assistant text shrinking

24 test files, 418 tests passing.
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/harness/src/micro-compact.ts">

<violation number="1" location="packages/harness/src/micro-compact.ts:243">
P2: Tool result clearing ignores `protectRecentTokens`, so recent tool outputs can be wiped even though they’re supposed to be preserved. Consider scoping tool-result clearing to the same protected window used for assistant compaction.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/harness/src/micro-compact.ts
…agents

Integration of new harness modules into consuming packages:

minimal-agent:
- Replace SessionMemoryTracker with BackgroundMemoryExtractor (chat preset)
- Aggressive thresholds for small context (300 tokens, 2 turns)
- Fire-and-forget onTurnComplete for non-blocking extraction

benchmark:
- Same BME integration as agent for fair comparison
- Each turn triggers extraction check

CEA:
- Enable tool result clearing: microCompact.clearToolResults = true
- Keep 5 most recent tool results intact

TUI + headless:
- New onTurnComplete callback in config interface
- Called after each model turn with messages + usage
- Non-blocking: doesn't delay main agent loop
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/index.ts">

<violation number="1" location="packages/minimal-agent/index.ts:168">
P2: `/new-session` only resets the session ID, but the new BackgroundMemoryExtractor + InMemoryStore are process‑global. That keeps previous user facts in structured state and can leak memory into the next session. Reset or recreate the extractor/store when a new session starts.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/minimal-agent/index.ts Outdated
Fix BackgroundMemoryExtractor returning template text via getStructuredState
before any extraction has occurred. This wasted tokens in small contexts
(2000 tokens: 65% → 46% regression).

Changes:
- getStructuredState returns undefined until first successful extraction
- Raise default thresholds: minTokenGrowth 300→500, minTurns 2→5
- Cap maxExtractionTokens at 500 for chat preset
- Update test expectation for pre-extraction state
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/harness/src/background-memory-extractor.ts">

<violation number="1" location="packages/harness/src/background-memory-extractor.ts:89">
P2: `getStructuredState()` now hides previously persisted memory after startup because it depends on a flag that is only set by new extractions.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/harness/src/background-memory-extractor.ts
@cubic-dev-ai
Copy link
Copy Markdown

cubic-dev-ai Bot commented Apr 5, 2026

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 5, 2026

@codex review

…iles

ultracite formatter requires trailing newlines on JSON files; CI lint
was failing on 10 chart files missing them.
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 5, 2026

@crb review

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

18 issues found across 105 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/minimal-agent/charts/3000.json">

<violation number="1" location="packages/minimal-agent/charts/3000.json:193">
P2: `spec applied` is attached to the wrong turn: the context reset happens at turn 26, but the event is logged at turn 27, which makes this benchmark trace inconsistent for downstream analysis/visualization.</violation>
</file>

<file name="packages/headless/src/stream-processor.ts">

<violation number="1" location="packages/headless/src/stream-processor.ts:232">
P1: `tool-call` events can be lost because calls are only recorded when a pending entry already exists.</violation>
</file>

<file name="packages/minimal-agent/package.json">

<violation number="1" location="packages/minimal-agent/package.json:18">
P3: `@ai-sdk/anthropic` is benchmark-only usage but was added to runtime `dependencies`; move it to `devDependencies` to avoid unnecessary production dependency footprint.</violation>
</file>

<file name="packages/cea/benchmark/tasks/compaction-stress-long/tests/test.sh">

<violation number="1" location="packages/cea/benchmark/tasks/compaction-stress-long/tests/test.sh:2">
P2: Enable `pipefail` so failures in piped commands are not silently ignored during verification.</violation>
</file>

<file name="packages/cea/benchmark/cea-memory-bench.sh">

<violation number="1" location="packages/cea/benchmark/cea-memory-bench.sh:44">
P2: Do not swallow benchmark execution failures with `|| true`; it masks real run errors and can produce invalid summary data.</violation>
</file>

<file name="packages/minimal-agent/benchmark.ts">

<violation number="1" location="packages/minimal-agent/benchmark.ts:1035">
P2: Use the package env schema instead of reading `process.env` directly; this file currently bypasses the repository’s required validated env pattern.</violation>
</file>

<file name="packages/minimal-agent/BENCHMARK-RESULTS.md">

<violation number="1" location="packages/minimal-agent/BENCHMARK-RESULTS.md:7">
P3: The benchmark configuration is internally inconsistent: it says 16 probes, but reported results use 17 probe/scoring items. Align this count so readers can correctly reproduce and interpret retention metrics.</violation>
</file>

<file name="packages/minimal-agent/visualize.py">

<violation number="1" location="packages/minimal-agent/visualize.py:286">
P2: Validate that `--output` has a following directory argument before reading `args[idx + 1]` to avoid an unhandled IndexError.</violation>
</file>

<file name="packages/harness/src/compaction-orchestrator.ts">

<violation number="1" location="packages/harness/src/compaction-orchestrator.ts:86">
P3: The JSDoc default value is incorrect (`3`), but runtime default is `10`.</violation>

<violation number="2" location="packages/harness/src/compaction-orchestrator.ts:437">
P2: `maxAcceptedCompactionsPerTurn` can become 0 for fractional inputs, which immediately blocks all auto/speculative compactions.</violation>
</file>

<file name="packages/harness/src/compaction-prompts.ts">

<violation number="1" location="packages/harness/src/compaction-prompts.ts:269">
P2: Task-intent extraction sends unbounded user history to the model, which can overflow context and silently disable task-aware compaction on long conversations.</violation>

<violation number="2" location="packages/harness/src/compaction-prompts.ts:281">
P2: Task-intent extraction is scoped to all historical user messages instead of the latest user turn, which can reintroduce stale goals into the summary prompt.</violation>
</file>

<file name="packages/cea/benchmark/harbor_agent.py">

<violation number="1" location="packages/cea/benchmark/harbor_agent.py:24">
P2: `populate_context_post_run` looks for `trajectory.json` in the wrong location relative to where the run command writes it, so metrics extraction can silently stop working.</violation>
</file>

<file name="packages/headless/src/runner.ts">

<violation number="1" location="packages/headless/src/runner.ts:740">
P2: Synchronous throws from `onTurnComplete` bypass the `.catch()` handler and crash the runner. `Promise.resolve(expr)` evaluates `expr` eagerly — a sync throw propagates before the promise is created. Use `Promise.resolve().then(...)` to defer evaluation into a microtask, making both sync throws and async rejections non-fatal.</violation>
</file>

<file name="packages/harness/src/compaction-types.ts">

<violation number="1" location="packages/harness/src/compaction-types.ts:261">
P3: `CompactionRejectionReason` includes unreachable variants that the current acceptance logic can never produce.</violation>
</file>

<file name="packages/minimal-agent/charts/2500.json">

<violation number="1" location="packages/minimal-agent/charts/2500.json:172">
P2: The compaction delta annotation is incorrect for turn 24 (`-285` does not match the `contextAfter` drop from 1991 to 206).</violation>
</file>

<file name="packages/harness/src/compaction-circuit-breaker.ts">

<violation number="1" location="packages/harness/src/compaction-circuit-breaker.ts:27">
P2: `maxConsecutiveFailures` validation allows values like `0.5`, which are floored to `0` and produce an invalid failure threshold.</violation>
</file>

<file name="packages/cea/benchmark/install-agent.sh.j2">

<violation number="1" location="packages/cea/benchmark/install-agent.sh.j2:8">
P2: Validate Node major version, not just presence, so unsupported preinstalled versions are upgraded.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/headless/src/stream-processor.ts Outdated
Comment thread packages/minimal-agent/charts/3000.json
Comment thread packages/cea/benchmark/tasks/compaction-stress-long/tests/test.sh
Comment thread packages/cea/benchmark/cea-memory-bench.sh Outdated
Comment thread packages/minimal-agent/benchmark.ts Outdated
Comment thread packages/cea/benchmark/install-agent.sh.j2 Outdated
Comment thread packages/minimal-agent/package.json Outdated
Comment thread packages/minimal-agent/BENCHMARK-RESULTS.md Outdated
Comment thread packages/harness/src/compaction-orchestrator.ts Outdated
Comment thread packages/harness/src/compaction-types.ts
…xture markers

The compaction-stress-search benchmark task seeds a fake codebase for
the agent to explore. Two placeholder credentials looked realistic
enough to trigger GitGuardian's secret scanner:

- JWT_SECRET = "super-secret-jwt-key-2024-prod"
- ADMIN_DEFAULT_PASSWORD = "admin123!@#"

Replace with BENCHMARK_FIXTURE_FAKE_* markers that make the fixture
nature obvious to secret scanners and future readers.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd79b397c4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/headless/src/runner.ts
Comment thread packages/headless/src/stream-processor.ts Outdated
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 5, 2026

CKPT, 잘 동작함

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9e8f55f2f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cea/src/commands/compact.ts Outdated
Comment thread packages/cea/src/entrypoints/main.ts Outdated
…add input validation

- Remove redundant setContextLimit() calls in minimal-agent index.ts and benchmark.ts
  (CheckpointHistory constructor already sets contextLimit from compaction config)
- Extract shared compaction constants into compaction-config.ts to prevent silent drift
- Add --context-limit positive integer validation in benchmark CLI
- Add --provider allowlist validation (friendli | anthropic) in benchmark CLI
- Normalize thresholdRatio in computeContextBudget() to guard against invalid values
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9612710fd8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/minimal-agent/benchmark.ts
Comment thread packages/cea/benchmark/scorer.py Outdated
…y fixes

- install-agent.sh.j2: replace curl pipe with download-then-execute to avoid masking failures
- compact.ts: wrap compact() in try/catch to propagate failure status
- main.ts: use logical OR for ATIF_OUTPUT_PATH to handle empty strings
- cea-memory-bench.sh: remove || true that swallows benchmark failures, fix grep double-zero
- package.json: move @ai-sdk/anthropic to devDependencies (benchmark-only)
- BENCHMARK-RESULTS.md: fix probe count (16 → 17)
- compaction-orchestrator.ts: fix JSDoc default (3 → 10)
- compaction-types.ts: document currently-unused rejection reason variants
- benchmark/AGENTS.md: fix uniq -c expected output format
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13fde5ff83

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cea/src/commands/compact.ts
Comment thread packages/cea/src/entrypoints/main.ts
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 5, 2026

@cubic-dev-ai review

@cubic-dev-ai
Copy link
Copy Markdown

cubic-dev-ai Bot commented Apr 5, 2026

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 107 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/harness/src/compaction-types.ts">

<violation number="1" location="packages/harness/src/compaction-types.ts:244">
P3: The acceptance criteria comment contradicts the actual behavior (only `fitsBudget` is enforced). Update the doc so it doesn’t imply belowTriggerThreshold/meetsMinSavings are hard gates.</violation>
</file>

<file name="packages/cea/src/context/system-prompt.ts">

<violation number="1" location="packages/cea/src/context/system-prompt.ts:113">
P2: This new rule contradicts the earlier “preserve path format / never change absolute vs. relative” guidance. If a task specifies a relative path, the prompt now simultaneously requires keeping it relative and converting it to absolute, which is inconsistent.</violation>
</file>

<file name="packages/minimal-agent/index.ts">

<violation number="1" location="packages/minimal-agent/index.ts:90">
P2: Avoid defaulting FRIENDLI_TOKEN to an empty string; fail fast or require a real token so missing credentials are caught early.</violation>

<violation number="2" location="packages/minimal-agent/index.ts:185">
P2: Handle history compaction shrinking the messages list before slicing; otherwise new user messages added in the same turn after compaction can be skipped and their facts never extracted.</violation>
</file>

<file name="packages/minimal-agent/env.ts">

<violation number="1" location="packages/minimal-agent/env.ts:9">
P2: Make FRIENDLI_TOKEN required in the schema so the agent fails fast with a validation error instead of running with an empty API key.</violation>
</file>

<file name="packages/harness/src/checkpoint-history.ts">

<violation number="1" location="packages/harness/src/checkpoint-history.ts:1459">
P1: `collectToolResultEntries` iterates over all messages (`this.messages`) but the token budget is calculated from active messages only. Truncating a pre-summary tool result decreases `remaining` without actually reducing active-context token usage, so the loop may exit early while active tokens still exceed `hardCeiling`.

Scope the scan to active messages, e.g. by computing an active-message start offset and skipping earlier indices.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/harness/src/checkpoint-history.ts Outdated
Comment thread packages/cea/src/context/system-prompt.ts Outdated
Comment thread packages/minimal-agent/index.ts
Comment thread packages/minimal-agent/index.ts Outdated
Comment thread packages/minimal-agent/env.ts
Comment thread packages/harness/src/compaction-types.ts Outdated
minpeter added 2 commits April 5, 2026 16:24
- checkpoint-history: scope collectToolResultEntries to active messages only
- compaction-types: fix CompactionEffectiveness doc to match actual behavior
- system-prompt: clarify path rule exception for generated scripts
- minimal-agent: guard onTurnComplete slice against compaction-shrunk history
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27f67f2a71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cea/src/entrypoints/main.ts Outdated
Comment thread packages/tui/src/agent-tui.ts
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 5, 2026

@cubic-dev-ai review

@cubic-dev-ai
Copy link
Copy Markdown

cubic-dev-ai Bot commented Apr 5, 2026

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 107 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/cea/benchmark/scorer.py">

<violation number="1" location="packages/cea/benchmark/scorer.py:40">
P2: Guard against `total_prompt_tokens` being 0 before dividing; otherwise this can crash on empty/error trajectories.</violation>
</file>

<file name="packages/harness/src/background-memory-extractor.ts">

<violation number="1" location="packages/harness/src/background-memory-extractor.ts:311">
P1: In incremental mode, when the LLM response contains no `<update>` or `<memory>` tags, `resolveExtractedMemory` falls through to `parseMemoryFromResponse` which accepts the raw LLM text. This overwrites the structured notes store with unstructured conversational text (e.g., "No updates needed."), corrupting the memory. In the incremental path, return `undefined` when no recognized tags are present to preserve the existing notes.</violation>
</file>

<file name="packages/minimal-agent/benchmark.ts">

<violation number="1" location="packages/minimal-agent/benchmark.ts:907">
P2: Don’t record zero-token usage when the provider omits usage data — it forces actual usage to 0 and makes compaction/metrics ignore the real context size. Guard updateActualUsage so it only runs when usage is present (or provide an estimated fallback).</violation>
</file>

<file name="packages/cea/benchmark/tasks/compaction-stress-search/instruction.md">

<violation number="1" location="packages/cea/benchmark/tasks/compaction-stress-search/instruction.md:49">
P2: The output file path in the instructions doesn’t match the verifier. The test expects `/agent/work/audit_report.txt`, but the instructions tell the agent to write to `/work/audit_report.txt`, which will fail validation.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/harness/src/background-memory-extractor.ts
Comment thread packages/cea/benchmark/scorer.py
Comment thread packages/minimal-agent/benchmark.ts
- scorer.py: guard against divide-by-zero when total_prompt_tokens is 0
- env.ts: add ATIF_OUTPUT_PATH to validated env schema
- main.ts: read ATIF_OUTPUT_PATH from env instead of process.env directly
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 219c7923c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

f"logs_dir contents: {list(self.logs_dir.glob('**/*')) if self.logs_dir.exists() else 'dir not found'}"
)
d = self.logs_dir
path = Path(os.getenv("ATIF_OUTPUT_PATH", str(d / "trajectory.json")))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Read trajectory from the generated agent subdirectory

populate_context_post_run defaults to logs_dir/trajectory.json, but the run command writes trajectories to /logs/agent/trajectory.json via ATIF_OUTPUT_PATH. Since that env var is only set on the executed agent command, it may not be present when this post-run Python method executes, and there is no fallback to logs_dir/agent/trajectory.json; in that common layout the file is missed and benchmark token metrics are silently reported as zero.

Useful? React with 👍 / 👎.

Comment on lines +740 to +744
Promise.resolve()
.then(() =>
config.onTurnComplete?.(
config.messageHistory.getAll(),
normalizedUsage
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Await turn-complete hook before triggering compaction

This callback is launched in a detached promise and execution immediately continues into speculative/blocking compaction. In this PR, CEA wires onTurnComplete to collect read-file restoration items used by compaction-complete restoration; when context pressure triggers immediate compaction, that race lets compaction run before tracking finishes, so the freshly read context is omitted from restoration and memory continuity degrades.

Useful? React with 👍 / 👎.

@minpeter minpeter merged commit 2f62589 into main Apr 6, 2026
6 checks passed
@minpeter
Copy link
Copy Markdown
Owner Author

minpeter commented Apr 6, 2026

Post-merge follow-up items

Tracked in #97 — test coverage gaps, code quality cleanup, and architecture concerns identified during the final merge review (Oracle + automated agents).

None are blocking; all CI green, 100/100 threads resolved. See issue for full context and checklist.

@minpeter minpeter deleted the feat/minimal-agent-speculative-compaction branch April 22, 2026 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant