Conversation
Tune compaction config for 2000-token context to support speculative compaction during multi-turn chatbot conversations:

- Add speculativeStartRatio (0.75) to trigger background compaction at 750 tokens
- Set explicit thresholdRatio (0.5) for blocking compaction at 1000 tokens
- Lower reserveTokens from 500 to 400 (chatbot responses are shorter)
- Lower keepRecentTokens from 500 to 350 (~3-4 turns preserved)
- Add contextLimit to the compaction config directly
- Add benchmark script (benchmark.ts) for automated 30-turn memory retention testing with probe questions every 5 turns, a metrics table, and an ASCII context usage chart
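The tuning above can be sketched as a plain config object; the field names mirror the commit message, but the interface shape is an assumption, not the harness's actual `CompactionConfig` type. Note that the speculative trigger (750) is a fraction of the blocking threshold (1000), not of the full context limit.

```typescript
// Illustrative shape only — field names follow the commit message,
// not necessarily the real CompactionConfig type in the harness.
interface CompactionTuning {
  contextLimit: number;
  thresholdRatio: number; // fraction of contextLimit for blocking compaction
  speculativeStartRatio: number; // fraction of the blocking threshold
  reserveTokens: number;
  keepRecentTokens: number;
}

const tuning: CompactionTuning = {
  contextLimit: 2000,
  thresholdRatio: 0.5,
  speculativeStartRatio: 0.75,
  reserveTokens: 400,
  keepRecentTokens: 350,
};

// Derived trigger points: blocking at 0.5 * 2000 = 1000 tokens,
// speculative at 0.75 * 1000 = 750 tokens.
const blockingAt = tuning.thresholdRatio * tuning.contextLimit;
const speculativeAt = tuning.speculativeStartRatio * blockingAt;
```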
Important: Review skipped. Auto reviews are disabled on this repository. Configuration used: Organization UI Review profile (CHILL), Plan: Pro.
Code Review
This pull request introduces a new benchmark script for testing compaction strategies in the minimal agent, along with updated configuration parameters for compaction tuning. The review identified redundant calls to setContextLimit in both the agent and the benchmark script, as this value is already configured during CheckpointHistory instantiation. Additionally, it was noted that compaction parameters are currently duplicated between the agent and the benchmark script, and should be centralized to ensure consistency.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 915d862903
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2 issues found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/benchmark.ts">
<violation number="1" location="packages/minimal-agent/benchmark.ts:456">
P3: These compaction thresholds (`thresholdRatio`, `speculativeRatio`, `reserveTokens`, `keepRecentTokens`) are duplicated from `index.ts`. Since this benchmark exists to validate the agent's compaction behavior, the values should be imported from a shared location — otherwise a tuning change in `index.ts` silently makes this benchmark test the wrong config.</violation>
<violation number="2" location="packages/minimal-agent/benchmark.ts:603">
P2: Validate `--context-limit` before use. Small/invalid values can make chart `repeat()` counts negative and crash the benchmark.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
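Violation 2 above (small or invalid `--context-limit` values making chart `repeat()` counts negative) can be addressed with a guard before the value reaches any layout math. A minimal sketch — `parseContextLimit` is a hypothetical helper, not code from benchmark.ts:

```typescript
// Hypothetical CLI guard: reject anything that is not a positive integer
// before it can feed negative counts into String.prototype.repeat().
function parseContextLimit(raw: string): number {
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`--context-limit must be a positive integer, got "${raw}"`);
  }
  return value;
}
```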
Increase context budget from 2000 to 4096 tokens and retune thresholds to minimize compaction cycles:

- contextLimit: 2000 → 4096
- reserveTokens: 400 → 512
- keepRecentTokens: 350 → 800 (~8-10 turns preserved)
- thresholdRatio: 0.5 → 0.65 (blocking at 2662 tokens)
- speculativeStartRatio: 0.75 → 0.8 (speculative at 2130 tokens)

Benchmark results (30-turn chatbot, GLM-5):

- Memory retention: 41% → 82% (14/17 probes passed)
- Compaction cycles: 3 → 0 (all 30 turns fit in context)
- All targeted probes (turns 5-20, 30) pass at 100%
- Turn 25 comprehensive probe: 3/6 (model response quality, not context loss)
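The quoted trigger points check out arithmetically, assuming (as in the earlier tuning) that the speculative ratio applies to the blocking threshold rather than to the raw context limit. A worked check, with illustrative names:

```typescript
// Worked arithmetic for the retuned 4096-token config.
const contextLimit = 4096;
const thresholdRatio = 0.65;
const speculativeStartRatio = 0.8;

// Blocking compaction: 0.65 * 4096 ≈ 2662 tokens.
const blockingTokens = Math.round(thresholdRatio * contextLimit);

// Speculative compaction: 0.8 * 2662 ≈ 2130 tokens — a fraction of the
// blocking threshold, not of the full context limit.
const speculativeTokens = Math.round(speculativeStartRatio * blockingTokens);
```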
…esults

Set temperature to 0 in generateText calls to eliminate model nondeterminism. Verified that 82% (14/17) retention at 4096 context is reproducible across runs.
Add --output flag to benchmark for JSON result export. Add visualize.py (matplotlib) that generates 4 charts from JSON results:

- retention_curve: memory retention % vs context size
- token_usage: context token usage over 30 turns
- probe_heatmap: per-probe recall scores
- summary: 3-panel overview

Usage: python3 visualize.py results/*.json --output charts/
Benchmark Visualization — 5-Point Context Size Sweep

Ran the benchmark at 5 context sizes (1500, 2000, 2500, 3000, 4096).

How to reproduce:

```sh
# Run benchmarks
for ctx in 1500 2000 2500 3000 4096; do
  pnpm --filter @plugsuits/minimal-agent benchmark --context-limit $ctx --output results/$ctx.json
done

# Generate charts
python3 packages/minimal-agent/visualize.py results/*.json --output charts/
```

Raw Data
Key Insights
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/visualize.py">
<violation number="1" location="packages/minimal-agent/visualize.py:284">
P2: Validate that `--output` has a following directory argument before indexing, otherwise this crashes with `IndexError` on malformed CLI input.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…+ friendli)

Add --provider flag to benchmark supporting 'anthropic' and 'friendli'. Refactor callModel to accept the LanguageModel interface for provider-agnostic benchmarking. Add @ai-sdk/anthropic dependency.

Opus benchmark result at 4096 context: 94% (16/17) vs GLM-5's 82% (14/17). Key difference: turn 25 comprehensive recall is 6/6 (vs 3/6 with GLM-5), confirming the 82% ceiling was model response quality, not compaction.
Model Comparison: GLM-5 vs Claude Opus 4

Results

Analysis

The 82% ceiling was indeed model quality, not compaction. With the same 4096-token context and same compaction config:

Key Insight

Opus's longer, more thorough responses (~2483 peak tokens) actually triggered speculative compaction at turn 25, while GLM-5's shorter responses (~2291 peak) never triggered it. Despite this compaction handicap, Opus still scored 12pp higher — proving that model capability, not context management, is the dominant factor in memory retention.

Benchmark command:

```sh
pnpm --filter @plugsuits/minimal-agent benchmark \
  --provider anthropic --model claude-opus-4-20250514 \
  --output results/opus-4096.json
```
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/benchmark.ts">
<violation number="1" location="packages/minimal-agent/benchmark.ts:653">
P2: Validate and normalize `--provider` before branching; invalid values currently default silently to Friendli.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
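The violation above (invalid `--provider` values silently defaulting to Friendli) suggests an allowlist check before branching. A sketch under the assumption that only 'friendli' and 'anthropic' are supported; names are illustrative, not the benchmark's actual code:

```typescript
// Hypothetical allowlist guard for the --provider CLI flag.
const SUPPORTED_PROVIDERS = ["friendli", "anthropic"] as const;
type Provider = (typeof SUPPORTED_PROVIDERS)[number];

function parseProvider(raw: string): Provider {
  const normalized = raw.trim().toLowerCase();
  if (!(SUPPORTED_PROVIDERS as readonly string[]).includes(normalized)) {
    // Fail loudly instead of silently falling back to a default provider.
    throw new Error(
      `--provider must be one of ${SUPPORTED_PROVIDERS.join(", ")}, got "${raw}"`,
    );
  }
  return normalized as Provider;
}
```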
… chatbot

Replace the generic system prompt with a fact-retention-aware version that instructs the model to remember personal information and list ALL known facts when asked to recall.

Replace the code-agent-oriented compaction prompt (Files & Changes, Technical Discoveries) with a chatbot-specific version that prioritizes:

- User Profile extraction (all personal details as bullet points)
- Conversation Highlights (topics, advice, decisions)
- Current Topic (for continuity)

Impact on memory retention (GLM-5):

- 2000 tokens: 53% → 71% (+18pp, compaction preserves user facts)
- 4096 tokens: 82% → 82% (no change, compaction not triggered)
System Prompt + Compaction Prompt Optimization

Changes

System prompt — added fact-retention instructions. Compaction prompt — replaced the code-agent template (Files & Changes, Technical Discoveries) with a chatbot-specific template.

Results

Analysis

The +18pp improvement at 2000 tokens is significant — it proves the chatbot compaction prompt preserves user facts much better than the code-agent default. The compaction summary now produces a "User Profile" section that explicitly lists all personal details. At 4096 tokens there is no change because compaction never triggers (all 30 turns fit in context). The improvement only manifests when compaction actually runs.

Cumulative improvement path
…rison

Extend the conversation from 30 to 50 turns (10 probes total) to force compaction at 4096 context. Add --baseline flag to benchmark for A/B testing against the default code-agent compaction prompt.

Key result at 4096 context, 50 turns:

- Chatbot prompt: turn 35 (post-compaction) scores 4/4 on pet recall
- Baseline prompt: turn 35 scores 0/4 — pet info lost in compaction
- Overall: 54% vs 51% (chatbot vs baseline)
…nalysis

Upgrade CHATBOT_COMPACTION_PROMPT with techniques learned from Claude Code:

1. Analysis scratchpad: `<analysis>` block for think-before-summarize (the harness already strips it; only `<summary>` content is kept)
2. All User Messages list: explicit section preserving the user intent trail
3. Previous-summary fact preservation: carry forward ALL facts from the prior compaction; never drop information across cycles
4. Partial-compact awareness: focus the summary on older messages, since recent ones are preserved separately via keepRecentTokens

50-turn benchmark at 4096 context (GLM-5):

- Before: 54% (20/37)
- After: 62% (23/37) — +8pp improvement
- Turn 40 (post-compaction recall): 2/4 → 4/4 (perfect)
3 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/index.ts">
<violation number="1" location="packages/minimal-agent/index.ts:30">
P2: Requiring an `<analysis>` section wastes the summarizer’s limited output budget and can reduce the amount of useful summary content retained.</violation>
<violation number="2" location="packages/minimal-agent/index.ts:44">
P1: The prompt asks to list every user message, but compaction instructions are injected as a user turn, so internal control text can be incorrectly persisted into summaries.</violation>
</file>
<file name="packages/minimal-agent/benchmark.ts">
<violation number="1" location="packages/minimal-agent/benchmark.ts:35">
P2: Exclude the internal compaction instruction from the “All User Messages” list; otherwise the model is likely to summarize the compaction prompt itself (it is sent as a user message), which bloats and pollutes the summary.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Three new shared compaction modules inspired by Claude Code's context management architecture:

1. CompactionCircuitBreaker (compaction-circuit-breaker.ts)
   - Tracks consecutive compaction failures; opens after 3 (configurable)
   - Auto-closes after a cooldown period (default 60s)
   - Prevents infinite retry loops on irrecoverable context overflow
2. microCompactMessages (micro-compact.ts)
   - Pre-compaction step that shrinks old long assistant responses
   - Protects recent messages (configurable token window)
   - Preserves user messages and summary messages
   - Immutable: returns a new array without modifying the input
3. SessionMemoryTracker (session-memory.ts)
   - Structured key-value memory persisting across compaction cycles
   - Categorized facts: identity, preferences, relationships, context
   - getStructuredState() callback for CompactionConfig integration
   - extractFactsFromSummary() to parse the User Profile from summaries
   - JSON serialization for persistence (toJSON/fromJSON)

All modules exported from @ai-sdk-tool/harness. 21 test files, 399 tests passing.
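The circuit-breaker behavior described in item 1 — open after N consecutive failures, auto-close after a cooldown — can be sketched as follows. This is a minimal illustration, not the real `CompactionCircuitBreaker` class; the injectable clock is an assumption added to make the cooldown testable.

```typescript
// Minimal sketch of the described behavior: opens after N consecutive
// failures, auto-closes after a cooldown. Not the harness implementation.
class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxConsecutiveFailures = 3,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = Date.now, // injectable clock
  ) {}

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: auto-close and reset the failure count.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxConsecutiveFailures) {
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `isOpen()` before attempting compaction and skip the attempt while the circuit is open, preventing the infinite retry loop the commit message mentions.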
7 issues found across 12 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/charts/1500.json">
<violation number="1" location="packages/minimal-agent/charts/1500.json:255">
P2: This benchmark artifact is mislabeled or generated with the wrong context setting: `peakTokens` and per-turn `contextAfter` values exceed the declared `contextLimit` (1500), so the chart data is unreliable.</violation>
</file>
<file name="packages/harness/src/session-memory.ts">
<violation number="1" location="packages/harness/src/session-memory.ts:286">
P2: Guard non-object entries before normalizing imported facts to prevent runtime crashes when JSON contains `null` or other invalid items.</violation>
</file>
<file name="packages/harness/src/compaction-circuit-breaker.ts">
<violation number="1" location="packages/harness/src/compaction-circuit-breaker.ts:28">
P2: `maxConsecutiveFailures` can become `0` for fractional inputs (e.g. `0.5`), which breaks threshold semantics and can open the circuit prematurely.</violation>
</file>
<file name="packages/minimal-agent/charts/2000.json">
<violation number="1" location="packages/minimal-agent/charts/2000.json:158">
P2: The compaction delta text is inconsistent with the recorded context values (`1775` to `123`), so this chart entry reports an incorrect compaction magnitude.</violation>
</file>
<file name="packages/minimal-agent/charts/3000.json">
<violation number="1" location="packages/minimal-agent/charts/3000.json:254">
P2: `summary.compactionCycles` is inconsistent with the recorded compaction events/token reset and should not be 0.</violation>
</file>
<file name="packages/minimal-agent/charts/2500.json">
<violation number="1" location="packages/minimal-agent/charts/2500.json:172">
P2: The compaction delta annotation is inconsistent with the recorded context totals (`1991 -> 206`), so the `(-285)` value is incorrect/misleading.</violation>
<violation number="2" location="packages/minimal-agent/charts/2500.json:254">
P2: `summary.compactionCycles` is inconsistent with the recorded compaction events and should not be `0`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Connect all 3 Phase 1 modules into the compaction pipeline:

Circuit Breaker → CompactionOrchestrator:
- New optional circuitBreaker param in the constructor
- checkAndCompact() skips when the circuit is open
- recordSuccess/recordFailure on compaction outcome
- manualCompact() ignores the circuit breaker (user intent)
- getState() exposes circuitBreakerOpen status

MicroCompact → CheckpointHistory:
- New microCompact option in CompactionConfig (boolean or options)
- Pre-compaction step: shrinks old assistant responses before summarization
- Reduces summarizer input tokens → better summary quality
- COMPACTION_DEBUG logging for tokensSaved/messagesModified

Session Memory → minimal-agent:
- SessionMemoryTracker instance wired via the getStructuredState callback
- extractFactsFromSummary called on compaction completion
- Structured user profile injected into every compaction prompt

TUI + Headless compactionCallbacks:
- Both runners now accept compactionCallbacks in config
- Chains external callbacks with internal ones (both fire)
- Enables minimal-agent to hook into the compaction lifecycle

Adaptive Thresholds → harness (from CEA):
- Moved computeAdaptiveThresholdRatio, computeCompactionMaxTokens, computeSpeculativeStartRatio from CEA to harness/compaction-policy
- CEA now imports from harness (backwards-compatible re-exports)
Phase 3 benchmarks complete. 2k: 57% (21/37, 4 compactions). 4k: 62% (23/37). 8k: 62% (23/37). Circuit Breaker + MicroCompact + Session Memory all verified working.
2 issues found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/index.ts">
<violation number="1" location="packages/minimal-agent/index.ts:160">
P2: Reset the SessionMemoryTracker on "new-session"; otherwise facts extracted from a previous session will keep being injected into compaction prompts for the new session.</violation>
</file>
<file name="packages/harness/src/compaction-policy.ts">
<violation number="1" location="packages/harness/src/compaction-policy.ts:147">
P2: Cap the compaction max tokens to the usable budget; the current 1024 minimum can exceed small context lengths or over-reserved budgets.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
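The compaction-policy violation above suggests capping the compaction max tokens at the usable budget so a fixed 1024-token floor cannot exceed small or over-reserved contexts. A sketch of that cap — the function name, the halving heuristic, and the 1024 floor are illustrative assumptions, not the harness's actual `computeCompactionMaxTokens` logic:

```typescript
// Hypothetical cap: keep a 1024-token floor for normal contexts, but
// never allow the result to exceed what the budget can actually hold.
function cappedCompactionMaxTokens(
  contextLimit: number,
  reserveTokens: number,
): number {
  const usableBudget = Math.max(0, contextLimit - reserveTokens);
  const floored = Math.max(1024, Math.floor(usableBudget / 2));
  return Math.min(floored, usableBudget); // the cap the review asks for
}
```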
Enable all 3 harness compaction features in minimal-agent:

- CircuitBreaker: passed to TUI/headless via a new config option
- MicroCompact: enabled via microCompact: true in the compaction config
- SessionMemory: already wired (getStructuredState + extractFactsFromSummary)

Also expose the circuitBreaker option in the TUI and headless runner configs so any consuming agent can pass one through.
…tool result clearing

BackgroundMemoryExtractor system (Claude Code's #1 missing feature):

- BackgroundMemoryExtractor class: periodic LLM-based memory extraction with configurable thresholds (token growth + turn count), a single-flight guard, and getStructuredState() for compaction integration
- MemoryStore interface with InMemoryStore and FileMemoryStore impls
- Two built-in presets: CHAT_MEMORY_PRESET (user facts) and CODE_MEMORY_PRESET (Claude Code-style session notes)

Tool Result MicroCompact extension:

- clearToolResults option to replace old tool_result content
- keepRecentToolResults to preserve the N most recent results
- clearableToolNames filter for selective clearing
- Complements the existing assistant text shrinking

24 test files, 418 tests passing.
1 issue found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/harness/src/micro-compact.ts">
<violation number="1" location="packages/harness/src/micro-compact.ts:243">
P2: Tool result clearing ignores `protectRecentTokens`, so recent tool outputs can be wiped even though they’re supposed to be preserved. Consider scoping tool-result clearing to the same protected window used for assistant compaction.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…agents

Integration of new harness modules into consuming packages:

minimal-agent:
- Replace SessionMemoryTracker with BackgroundMemoryExtractor (chat preset)
- Aggressive thresholds for small contexts (300 tokens, 2 turns)
- Fire-and-forget onTurnComplete for non-blocking extraction

benchmark:
- Same BME integration as the agent for fair comparison
- Each turn triggers an extraction check

CEA:
- Enable tool result clearing: microCompact.clearToolResults = true
- Keep the 5 most recent tool results intact

TUI + headless:
- New onTurnComplete callback in the config interface
- Called after each model turn with messages + usage
- Non-blocking: doesn't delay the main agent loop
1 issue found across 5 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/index.ts">
<violation number="1" location="packages/minimal-agent/index.ts:168">
P2: `/new-session` only resets the session ID, but the new BackgroundMemoryExtractor + InMemoryStore are process‑global. That keeps previous user facts in structured state and can leak memory into the next session. Reset or recreate the extractor/store when a new session starts.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Fix BackgroundMemoryExtractor returning template text via getStructuredState before any extraction has occurred. This wasted tokens in small contexts (2000 tokens: 65% → 46% regression).

Changes:
- getStructuredState returns undefined until the first successful extraction
- Raise default thresholds: minTokenGrowth 300 → 500, minTurns 2 → 5
- Cap maxExtractionTokens at 500 for the chat preset
- Update test expectation for the pre-extraction state
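The core of the fix — report no structured state until something has actually been extracted — can be sketched in a few lines. Class and method names here are illustrative, not the real `BackgroundMemoryExtractor` API:

```typescript
// Minimal sketch of the undefined-until-first-extraction guard.
class ExtractorSketch {
  private lastExtraction: string | undefined;

  // Returning undefined (rather than an empty template) keeps template
  // boilerplate out of compaction prompts, which is what caused the
  // 65% → 46% regression at 2000 tokens.
  getStructuredState(): string | undefined {
    return this.lastExtraction;
  }

  recordExtraction(summary: string): void {
    if (summary.trim().length > 0) this.lastExtraction = summary;
  }
}
```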
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/harness/src/background-memory-extractor.ts">
<violation number="1" location="packages/harness/src/background-memory-extractor.ts:89">
P2: `getStructuredState()` now hides previously persisted memory after startup because it depends on a flag that is only set by new extractions.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
@minpeter I have started the AI code review. It will take a few minutes to complete.
@codex review
…iles

The ultracite formatter requires trailing newlines on JSON files; CI lint was failing on 10 chart files missing them.
@crb review |
18 issues found across 105 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/minimal-agent/charts/3000.json">
<violation number="1" location="packages/minimal-agent/charts/3000.json:193">
P2: `spec applied` is attached to the wrong turn: the context reset happens at turn 26, but the event is logged at turn 27, which makes this benchmark trace inconsistent for downstream analysis/visualization.</violation>
</file>
<file name="packages/headless/src/stream-processor.ts">
<violation number="1" location="packages/headless/src/stream-processor.ts:232">
P1: `tool-call` events can be lost because calls are only recorded when a pending entry already exists.</violation>
</file>
<file name="packages/minimal-agent/package.json">
<violation number="1" location="packages/minimal-agent/package.json:18">
P3: `@ai-sdk/anthropic` is benchmark-only usage but was added to runtime `dependencies`; move it to `devDependencies` to avoid unnecessary production dependency footprint.</violation>
</file>
<file name="packages/cea/benchmark/tasks/compaction-stress-long/tests/test.sh">
<violation number="1" location="packages/cea/benchmark/tasks/compaction-stress-long/tests/test.sh:2">
P2: Enable `pipefail` so failures in piped commands are not silently ignored during verification.</violation>
</file>
<file name="packages/cea/benchmark/cea-memory-bench.sh">
<violation number="1" location="packages/cea/benchmark/cea-memory-bench.sh:44">
P2: Do not swallow benchmark execution failures with `|| true`; it masks real run errors and can produce invalid summary data.</violation>
</file>
<file name="packages/minimal-agent/benchmark.ts">
<violation number="1" location="packages/minimal-agent/benchmark.ts:1035">
P2: Use the package env schema instead of reading `process.env` directly; this file currently bypasses the repository’s required validated env pattern.</violation>
</file>
<file name="packages/minimal-agent/BENCHMARK-RESULTS.md">
<violation number="1" location="packages/minimal-agent/BENCHMARK-RESULTS.md:7">
P3: The benchmark configuration is internally inconsistent: it says 16 probes, but reported results use 17 probe/scoring items. Align this count so readers can correctly reproduce and interpret retention metrics.</violation>
</file>
<file name="packages/minimal-agent/visualize.py">
<violation number="1" location="packages/minimal-agent/visualize.py:286">
P2: Validate that `--output` has a following directory argument before reading `args[idx + 1]` to avoid an unhandled IndexError.</violation>
</file>
<file name="packages/harness/src/compaction-orchestrator.ts">
<violation number="1" location="packages/harness/src/compaction-orchestrator.ts:86">
P3: The JSDoc default value is incorrect (`3`), but runtime default is `10`.</violation>
<violation number="2" location="packages/harness/src/compaction-orchestrator.ts:437">
P2: `maxAcceptedCompactionsPerTurn` can become 0 for fractional inputs, which immediately blocks all auto/speculative compactions.</violation>
</file>
<file name="packages/harness/src/compaction-prompts.ts">
<violation number="1" location="packages/harness/src/compaction-prompts.ts:269">
P2: Task-intent extraction sends unbounded user history to the model, which can overflow context and silently disable task-aware compaction on long conversations.</violation>
<violation number="2" location="packages/harness/src/compaction-prompts.ts:281">
P2: Task-intent extraction is scoped to all historical user messages instead of the latest user turn, which can reintroduce stale goals into the summary prompt.</violation>
</file>
<file name="packages/cea/benchmark/harbor_agent.py">
<violation number="1" location="packages/cea/benchmark/harbor_agent.py:24">
P2: `populate_context_post_run` looks for `trajectory.json` in the wrong location relative to where the run command writes it, so metrics extraction can silently stop working.</violation>
</file>
<file name="packages/headless/src/runner.ts">
<violation number="1" location="packages/headless/src/runner.ts:740">
P2: Synchronous throws from `onTurnComplete` bypass the `.catch()` handler and crash the runner. `Promise.resolve(expr)` evaluates `expr` eagerly — a sync throw propagates before the promise is created. Use `Promise.resolve().then(...)` to defer evaluation into a microtask, making both sync throws and async rejections non-fatal.</violation>
</file>
<file name="packages/harness/src/compaction-types.ts">
<violation number="1" location="packages/harness/src/compaction-types.ts:261">
P3: `CompactionRejectionReason` includes unreachable variants that the current acceptance logic can never produce.</violation>
</file>
<file name="packages/minimal-agent/charts/2500.json">
<violation number="1" location="packages/minimal-agent/charts/2500.json:172">
P2: The compaction delta annotation is incorrect for turn 24 (`-285` does not match the `contextAfter` drop from 1991 to 206).</violation>
</file>
<file name="packages/harness/src/compaction-circuit-breaker.ts">
<violation number="1" location="packages/harness/src/compaction-circuit-breaker.ts:27">
P2: `maxConsecutiveFailures` validation allows values like `0.5`, which are floored to `0` and produce an invalid failure threshold.</violation>
</file>
<file name="packages/cea/benchmark/install-agent.sh.j2">
<violation number="1" location="packages/cea/benchmark/install-agent.sh.j2:8">
P2: Validate Node major version, not just presence, so unsupported preinstalled versions are upgraded.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
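The `runner.ts` violation above pins down a subtle pattern: `Promise.resolve(expr)` evaluates `expr` eagerly, so a synchronous throw escapes before any `.catch()` attaches, while `Promise.resolve().then(...)` defers evaluation into a microtask where both sync throws and async rejections become rejections. A sketch of the recommended shape — `fireAndForget` is a hypothetical helper, not the runner's actual code:

```typescript
type TurnCallback = () => void | Promise<void>;

// Deferred invocation: the callback runs inside .then(), so a synchronous
// throw becomes a rejection and lands in .catch() instead of crashing
// the caller.
function fireAndForget(
  cb: TurnCallback,
  onError: (err: unknown) => void,
): void {
  Promise.resolve()
    .then(() => cb())
    .catch(onError);
}
```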
…xture markers

The compaction-stress-search benchmark task seeds a fake codebase for the agent to explore. Two placeholder credentials looked realistic enough to trigger GitGuardian's secret scanner:

- JWT_SECRET = "super-secret-jwt-key-2024-prod"
- ADMIN_DEFAULT_PASSWORD = "admin123!@#"

Replace them with BENCHMARK_FIXTURE_FAKE_* markers that make the fixture nature obvious to secret scanners and future readers.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd79b397c4
CKPT, works well.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9e8f55f2f
…add input validation

- Remove redundant setContextLimit() calls in minimal-agent index.ts and benchmark.ts (the CheckpointHistory constructor already sets contextLimit from the compaction config)
- Extract shared compaction constants into compaction-config.ts to prevent silent drift
- Add --context-limit positive-integer validation in the benchmark CLI
- Add --provider allowlist validation (friendli | anthropic) in the benchmark CLI
- Normalize thresholdRatio in computeContextBudget() to guard against invalid values
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9612710fd8
…y fixes

- install-agent.sh.j2: replace curl pipe with download-then-execute to avoid masking failures
- compact.ts: wrap compact() in try/catch to propagate failure status
- main.ts: use logical OR for ATIF_OUTPUT_PATH to handle empty strings
- cea-memory-bench.sh: remove the || true that swallows benchmark failures; fix grep double-zero
- package.json: move @ai-sdk/anthropic to devDependencies (benchmark-only)
- BENCHMARK-RESULTS.md: fix probe count (16 → 17)
- compaction-orchestrator.ts: fix JSDoc default (3 → 10)
- compaction-types.ts: document currently-unused rejection reason variants
- benchmark/AGENTS.md: fix uniq -c expected output format
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13fde5ff83
@cubic-dev-ai review |
@minpeter I have started the AI code review. It will take a few minutes to complete.
6 issues found across 107 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/harness/src/compaction-types.ts">
<violation number="1" location="packages/harness/src/compaction-types.ts:244">
P3: The acceptance criteria comment contradicts the actual behavior (only `fitsBudget` is enforced). Update the doc so it doesn’t imply belowTriggerThreshold/meetsMinSavings are hard gates.</violation>
</file>
<file name="packages/cea/src/context/system-prompt.ts">
<violation number="1" location="packages/cea/src/context/system-prompt.ts:113">
P2: This new rule contradicts the earlier “preserve path format / never change absolute vs. relative” guidance. If a task specifies a relative path, the prompt now simultaneously requires keeping it relative and converting it to absolute, which is inconsistent.</violation>
</file>
<file name="packages/minimal-agent/index.ts">
<violation number="1" location="packages/minimal-agent/index.ts:90">
P2: Avoid defaulting FRIENDLI_TOKEN to an empty string; fail fast or require a real token so missing credentials are caught early.</violation>
<violation number="2" location="packages/minimal-agent/index.ts:185">
P2: Handle history compaction shrinking the messages list before slicing; otherwise new user messages added in the same turn after compaction can be skipped and their facts never extracted.</violation>
</file>
<file name="packages/minimal-agent/env.ts">
<violation number="1" location="packages/minimal-agent/env.ts:9">
P2: Make FRIENDLI_TOKEN required in the schema so the agent fails fast with a validation error instead of running with an empty API key.</violation>
</file>
<file name="packages/harness/src/checkpoint-history.ts">
<violation number="1" location="packages/harness/src/checkpoint-history.ts:1459">
P1: `collectToolResultEntries` iterates over all messages (`this.messages`) but the token budget is calculated from active messages only. Truncating a pre-summary tool result decreases `remaining` without actually reducing active-context token usage, so the loop may exit early while active tokens still exceed `hardCeiling`.
Scope the scan to active messages, e.g. by computing an active-message start offset and skipping earlier indices.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
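The P1 scoping issue in `checkpoint-history.ts` can be illustrated with a small sketch. The message shape, the token heuristic, and all names except the scan-scoping idea itself are hypothetical; the point is that the scan starts at the active-window offset instead of index 0:

```typescript
// Hypothetical sketch of the suggested fix: collect tool-result entries
// only from the active (post-summary) message window, so truncation
// savings are measured against the same messages as the token budget.
interface Message {
  role: string;
  content: string;
}

interface ToolResultEntry {
  index: number;
  tokens: number;
}

// Assumed rough estimate: ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function collectActiveToolResultEntries(
  messages: Message[],
  activeStartIndex: number, // first index of the active context window
): ToolResultEntry[] {
  const entries: ToolResultEntry[] = [];
  // Skip pre-summary messages entirely; truncating them would shrink
  // `remaining` without reducing active-context token usage.
  for (let i = activeStartIndex; i < messages.length; i++) {
    const msg = messages[i];
    if (msg.role === "tool") {
      entries.push({ index: i, tokens: estimateTokens(msg.content) });
    }
  }
  return entries;
}
```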
- checkpoint-history: scope collectToolResultEntries to active messages only
- compaction-types: fix CompactionEffectiveness doc to match actual behavior
- system-prompt: clarify path rule exception for generated scripts
- minimal-agent: guard onTurnComplete slice against compaction-shrunk history
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27f67f2a71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.
4 issues found across 107 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/cea/benchmark/scorer.py">
<violation number="1" location="packages/cea/benchmark/scorer.py:40">
P2: Guard against `total_prompt_tokens` being 0 before dividing; otherwise this can crash on empty/error trajectories.</violation>
</file>
<file name="packages/harness/src/background-memory-extractor.ts">
<violation number="1" location="packages/harness/src/background-memory-extractor.ts:311">
P1: In incremental mode, when the LLM response contains no `<update>` or `<memory>` tags, `resolveExtractedMemory` falls through to `parseMemoryFromResponse` which accepts the raw LLM text. This overwrites the structured notes store with unstructured conversational text (e.g., "No updates needed."), corrupting the memory. In the incremental path, return `undefined` when no recognized tags are present to preserve the existing notes.</violation>
</file>
<file name="packages/minimal-agent/benchmark.ts">
<violation number="1" location="packages/minimal-agent/benchmark.ts:907">
P2: Don’t record zero-token usage when the provider omits usage data — it forces actual usage to 0 and makes compaction/metrics ignore the real context size. Guard updateActualUsage so it only runs when usage is present (or provide an estimated fallback).</violation>
</file>
<file name="packages/cea/benchmark/tasks/compaction-stress-search/instruction.md">
<violation number="1" location="packages/cea/benchmark/tasks/compaction-stress-search/instruction.md:49">
P2: The output file path in the instructions doesn’t match the verifier. The test expects `/agent/work/audit_report.txt`, but the instructions tell the agent to write to `/work/audit_report.txt`, which will fail validation.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
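The P1 in `background-memory-extractor.ts` amounts to refusing untagged responses in incremental mode rather than falling through to the raw-text parser. A minimal sketch, with the `<update>`/`<memory>` tag format taken from the review comment and everything else assumed:

```typescript
// Hypothetical sketch of the suggested fix: in incremental mode, only
// accept responses carrying a recognized tag; otherwise return undefined
// so the caller preserves the existing structured notes.
function resolveIncrementalMemory(response: string): string | undefined {
  const update = response.match(/<update>([\s\S]*?)<\/update>/);
  if (update) return update[1].trim();

  const memory = response.match(/<memory>([\s\S]*?)<\/memory>/);
  if (memory) return memory[1].trim();

  // No recognized tags: do NOT accept the raw LLM text, which would
  // overwrite the notes store with filler like "No updates needed."
  return undefined;
}
```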
- scorer.py: guard against divide-by-zero when total_prompt_tokens is 0
- env.ts: add ATIF_OUTPUT_PATH to validated env schema
- main.ts: read ATIF_OUTPUT_PATH from env instead of process.env directly
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 219c7923c9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| f"logs_dir contents: {list(self.logs_dir.glob('**/*')) if self.logs_dir.exists() else 'dir not found'}" | ||
| ) | ||
| d = self.logs_dir | ||
| path = Path(os.getenv("ATIF_OUTPUT_PATH", str(d / "trajectory.json"))) |
Read trajectory from the generated agent subdirectory
populate_context_post_run defaults to logs_dir/trajectory.json, but the run command writes trajectories to /logs/agent/trajectory.json via ATIF_OUTPUT_PATH. Since that env var is only set on the executed agent command, it may not be present when this post-run Python method executes, and there is no fallback to logs_dir/agent/trajectory.json; in that common layout the file is missed and benchmark token metrics are silently reported as zero.
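The missing fallback Codex describes could be handled by trying candidate locations in order. A sketch of that logic (the function name, candidate ordering, and injected `exists` predicate are assumptions; the actual hook lives in Python, but the idea is the same):

```typescript
// Hypothetical sketch: resolve the trajectory file by preferring
// ATIF_OUTPUT_PATH, then logs_dir/trajectory.json, then the agent
// subdirectory layout the run command actually produces.
// `exists` is injected so the logic is testable without a filesystem.
function resolveTrajectoryPath(
  logsDir: string,
  envOutputPath: string | undefined,
  exists: (path: string) => boolean,
): string | undefined {
  const candidates: string[] = [];
  if (envOutputPath) candidates.push(envOutputPath);
  candidates.push(`${logsDir}/trajectory.json`);
  candidates.push(`${logsDir}/agent/trajectory.json`); // layout written via ATIF_OUTPUT_PATH
  for (const candidate of candidates) {
    if (exists(candidate)) return candidate;
  }
  return undefined; // caller can then report metrics as missing, not zero
}
```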
```typescript
Promise.resolve()
  .then(() =>
    config.onTurnComplete?.(
      config.messageHistory.getAll(),
      normalizedUsage
```
Await turn-complete hook before triggering compaction
This callback is launched in a detached promise and execution immediately continues into speculative/blocking compaction. In this PR, CEA wires onTurnComplete to collect read-file restoration items used by compaction-complete restoration; when context pressure triggers immediate compaction, that race lets compaction run before tracking finishes, so the freshly read context is omitted from restoration and memory continuity degrades.
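The ordering fix Codex suggests is simply to await the hook before any compaction decision runs. A sketch under assumed types (only `onTurnComplete` comes from the PR; the surrounding shape is hypothetical):

```typescript
// Hypothetical sketch: await the turn-complete hook before compaction so
// restoration tracking finishes before compaction can discard freshly
// read context.
type Usage = { promptTokens: number; completionTokens: number };

interface TurnConfig {
  onTurnComplete?: (messages: string[], usage: Usage) => Promise<void> | void;
}

async function finishTurn(
  config: TurnConfig,
  messages: string[],
  usage: Usage,
  maybeCompact: () => Promise<void>,
): Promise<void> {
  // Previously launched as a detached promise; awaiting removes the race.
  await config.onTurnComplete?.(messages, usage);
  await maybeCompact();
}
```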
Post-merge follow-up items

Tracked in #97: test coverage gaps, code quality cleanup, and architecture concerns identified during the final merge review (Oracle + automated agents). None are blocking; all CI green, 100/100 threads resolved. See issue for full context and checklist.
Summary
Large-scope stability release for the compaction subsystem. Resolves infinite compaction loops in small-context scenarios, eliminates process crashes caused by unhandled stream rejections, and lands numerous harness/TUI stability fixes. Also ships minimal-agent speculative compaction tuning and new benchmark tasks (the work that originally opened this PR).
Key fixes
🔁 Infinite compaction loop prevention (harness)
- Per-turn compaction cap (`maxAcceptedCompactionsPerTurn`, default 10): combines accepted + ineffective compactions. When the cap is hit, no further compaction runs that turn.
- Only `fitsBudget` failures reject a compaction attempt; `belowTriggerThreshold`/`meetsMinSavings` are observability-only signals.
- `notifyNewUserTurn()` wired from TUI and headless runtime.
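The cap behavior can be sketched with a simple counter; only `maxAcceptedCompactionsPerTurn` and `notifyNewUserTurn()` come from the PR, the class shape is an assumption:

```typescript
// Minimal sketch of per-turn compaction loop prevention.
class CompactionBudget {
  private attemptsThisTurn = 0;

  constructor(private readonly maxAcceptedCompactionsPerTurn = 10) {}

  // Accepted and ineffective compactions both count against the cap.
  recordAttempt(): void {
    this.attemptsThisTurn++;
  }

  canCompact(): boolean {
    return this.attemptsThisTurn < this.maxAcceptedCompactionsPerTurn;
  }

  // Called from the TUI / headless runtime when a new user turn starts.
  notifyNewUserTurn(): void {
    this.attemptsThisTurn = 0;
  }
}
```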
Crashes with `Unhandled rejection: NoOutputGeneratedError` when the provider returns an empty stream (or the network fails):
🔧 Harness compaction stability (numerous commits from earlier work)
📊 minimal-agent speculative compaction (original PR scope)
Enable speculative compaction in minimal-agent with thresholds tuned for 30-turn chatbot conversations within a 4096-token context window:
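As a sketch of the config shape, the values below come from an earlier tuning commit in this PR that targeted a 2000-token context (the final benchmark ran with a 4096 window); the interface shape, and the reading that `speculativeStartRatio` composes against the blocking threshold, are assumptions inferred from the 750 / 1000 trigger points in that commit:

```typescript
// Sketch of the tuned compaction config; field names come from the PR.
interface CompactionConfig {
  contextLimit: number;
  speculativeStartRatio: number; // background trigger, relative to blocking threshold (assumed)
  thresholdRatio: number; // blocking trigger, relative to context limit
  reserveTokens: number;
  keepRecentTokens: number;
}

const compaction: CompactionConfig = {
  contextLimit: 2000, // earlier tuning pass; final benchmark used 4096
  speculativeStartRatio: 0.75,
  thresholdRatio: 0.5,
  reserveTokens: 400, // chatbot responses are short
  keepRecentTokens: 350, // roughly 3-4 recent turns preserved
};

// Derived trigger points.
const blockingAt = compaction.contextLimit * compaction.thresholdRatio;
const speculativeAt = blockingAt * compaction.speculativeStartRatio;
```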
🎨 TUI & headless
📦 Benchmark additions
Version bumps (via changesets)
CEA gets the minor bump because the user-facing feature set (task-aware compaction, loop prevention, crash fixes) surfaces through CEA's runtime. Harness/tui/headless changes are internal enhancements that preserve their public API contracts.
Benchmark results (minimal-agent, 4096 context)
At 4096, all 30 turns fit without compaction (~2300 tokens peak) — probe scores reflect model response quality, not information loss.
Testing
Related