Add slice-v1 retrieval and benchmark tooling#144
Conversation
Add opt-in slice-v1 retrieval, richer sketch semantics, real-workspace benchmark helpers, and calibration/diagnostic coverage for the post-v0.21 payoff layer. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📝 WalkthroughWalkthroughThis PR adds an opt-in task-conditioned retrieval mode "slice-v1" that anchors and traverses a KnowledgeGraph to produce ordered slice node lists and metadata, exposes retrieval_strategy/slice through CLI/stdio/command paths with validation, enriches sketch output with deterministic env/side-effect hints, and introduces real-workspace benchmark tooling (edge-counts, multi-workspace runner/summarizer, probe calibration) plus tests and docs. ChangesTask-Conditioned Slice-v1 Retrieval Strategy
Real-Workspace Benchmark Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/infrastructure/context-pack-command.ts (1)
221-244:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
--retrieval-strategyis ignored for review packs.The new flag is forwarded for explain/impact only.
runContextPackCommand()returns early fortask === 'review', sographify-ts pack --task review --retrieval-strategy slice-v1succeeds but has no effect. Either reject that combination or thread it into the review bundle path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/infrastructure/context-pack-command.ts` around lines 221 - 244, runContextPackCommand() currently ignores options.retrievalStrategy for task === 'review' because it returns early; update the review path to either validate and reject the combination or pass retrievalStrategy into the review bundle retrieval call: locate runContextPackCommand(), the early-return branch that handles task === 'review', and the code that builds retrieval options for dependencies.retrieveContext (and any call sites of retrieveContext or buildCommunityLabels/compactImpactResult used for review bundles) and ensure options.retrievalStrategy (and options.retrievalLevel if relevant) are forwarded into the retrieval options for review processing, or add an explicit validation that throws a clear error when task === 'review' and retrievalStrategy is set so callers know the flag is unsupported.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs`:
- Line 11: Wrap the JSON.parse call for graph (currently: const graph =
JSON.parse(readFileSync(graphPath, 'utf8'))) in a try/catch so malformed JSON
doesn't crash the CLI: read the file into a string first (using
readFileSync(graphPath, 'utf8')), then try to JSON.parse that string, catch
SyntaxError (or any error), print a clear error via console.error including
graphPath and the error.message, and exit with process.exit(1); ensure the
caught error path does not allow further processing of the invalid graph
variable.
In `@docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs`:
- Around line 148-161: The classification logic mislabels non-neutral cases:
change the hurts/expands branch condition in the block that constructs note
(uses tokenDelta, qualityDelta, labelDelta, summary) from (tokenDelta > 0 &&
qualityDelta <= 0) to use OR so it catches either token expansions or quality
regressions; specifically, update the condition guarding
summary.hurts_or_expands.push(note) to (tokenDelta > 0 || qualityDelta < 0) so
quality regressions with token savings and token expansions with quality gains
are correctly classified.
In `@docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh`:
- Around line 8-22: The current run_workspace function and PROMPTS_FILE
assignment can allow typos in GRAPHIFY_BENCH_REAL_PROMPTS,
GRAPHIFY_BENCH_BACKEND, or GRAPHIFY_BENCH_MONOREPO to silently pass through;
update run_workspace (and the PROMPTS_FILE resolution) to validate inputs and
fail fast: check that PROMPTS_FILE exists (-f) and that the workspace_path and
any required backend/monorepo paths exist (-d or -f as appropriate), print a
clear error message naming the missing variable (e.g.,
GRAPHIFY_BENCH_REAL_PROMPTS, GRAPHIFY_BENCH_BACKEND, GRAPHIFY_BENCH_MONOREPO)
and return non-zero/exit so run.sh is not invoked with invalid paths; keep
checks adjacent to the PROMPTS_FILE assignment and at the top of run_workspace
for quick failure.
In `@docs/benchmarks/2026-05-11-spi-vs-legacy/summarize-real-workspaces.mjs`:
- Line 26: The JSON.parse call inside workspaceNames.map can throw on malformed
summary files; wrap the readFileSync/JSON.parse logic used in workspaceNames.map
so each iteration uses a try-catch that captures errors reading/parsing
join(bundleDir, name, 'summary.json') and rethrows or logs a new Error that
includes the workspace name and the file path (e.g., mention name and
'summary.json') so you can identify which workspace failed; update the mapping
to return either the parsed object or propagate the enriched error for later
handling.
In `@src/runtime/context-pack-resolution.ts`:
- Around line 379-385: The side-effect inference currently uses behaviorEdges
(which includes structural labels like 'contains' and 'method') causing false
positives; instead build an execution-only edge set (e.g., execBehaviorEdges =
relationLabels(node, relationIndex, 'outgoing', ['calls']) or another allowlist
of execution-only labels) and pass that execBehaviorEdges into sideEffectHints()
(replace the current sideEffectHints(behaviorEdges) call) so side-effect tags
(queue_event, db_write, etc.) are derived only from executable/calls edges.
In `@src/runtime/retrieve.ts`:
- Around line 1835-1867: The semantic/rerank path currently rebuilds candidates
from all eligible nodes and ignores the lexical slice; update
retrieveContextAsync (and the code paths that call
buildRetrieveResultFromOrderedCandidates) so that when options.retrievalStrategy
=== 'slice-v1' you detect and propagate the lexical slice (lexicalResult.slice
or the sliced.ordered_ids + sliced.metadata returned by
sliceCandidatesForRetrieve) into the semantic/rerank flow and use that
ordered_ids list to constrain the candidate pool before reranking; ensure
sliced.metadata is forwarded into buildRetrieveResultFromOrderedCandidates and
that any fallback to scoredNodeFromGraph only occurs for IDs within the slice so
the slice-v1 constraint and its metadata are honored end-to-end.
In `@src/runtime/retrieve/slicing.ts`:
- Around line 182-185: The traversal code currently drops any neighbor not
already present in scoredCandidates by calling scoredById.get(neighborId) and
continuing when missing; change this so the slice can expand beyond the initial
candidate pool by performing a graph-backed neighbor lookup when scoredById
lacks neighborId (e.g., query the graph index / adjacency map for neighborId,
construct a candidate object for that node, compute its score via the existing
scoring routine, insert it into scoredCandidates/scoredById, and then proceed
with traversal); apply the same change to the other traversal branch that uses
scoredById (the block flagged around the second occurrence) so both paths can
pull in graph-adjacent evidence rather than treating scoredCandidates as a hard
boundary.
In `@src/runtime/stdio/tools.ts`:
- Around line 888-891: The code currently parses retrieval_strategy via
parseRetrievalStrategyParam and returns failure when invalid, but the review
flow (task: 'review') bypasses this check so context_pack can silently accept
unsupported values; update the review branch to validate retrieval_strategy
early: call parseRetrievalStrategyParam(helpers, toolArguments) (or reuse the
existing contextPackStrategy check) inside the review handling path and if it
returns 'invalid' call helpers.failure(id, helpers.jsonrpcInvalidParams,
'retrieval_strategy must be one of default, slice-v1'), or alternatively pass
the validated contextPackStrategy into the review flow so review honors
slice-v1. Ensure you reference parseRetrievalStrategyParam, contextPackStrategy,
helpers.failure and task: 'review' when making the change.
---
Outside diff comments:
In `@src/infrastructure/context-pack-command.ts`:
- Around line 221-244: runContextPackCommand() currently ignores
options.retrievalStrategy for task === 'review' because it returns early; update
the review path to either validate and reject the combination or pass
retrievalStrategy into the review bundle retrieval call: locate
runContextPackCommand(), the early-return branch that handles task === 'review',
and the code that builds retrieval options for dependencies.retrieveContext (and
any call sites of retrieveContext or buildCommunityLabels/compactImpactResult
used for review bundles) and ensure options.retrievalStrategy (and
options.retrievalLevel if relevant) are forwarded into the retrieval options for
review processing, or add an explicit validation that throws a clear error when
task === 'review' and retrievalStrategy is set so callers know the flag is
unsupported.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 70fe1576-27ea-49e9-abf2-0305a581fe5a
📒 Files selected for processing (24)
CHANGELOG.mdREADME.mddocs/benchmarks/2026-05-11-spi-vs-legacy/README.mddocs/benchmarks/2026-05-11-spi-vs-legacy/REAL_WORKSPACE_REPORT_TEMPLATE.mddocs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjsdocs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjsdocs/benchmarks/2026-05-11-spi-vs-legacy/prompts.real-workspace.example.jsondocs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.shdocs/benchmarks/2026-05-11-spi-vs-legacy/run.shdocs/benchmarks/2026-05-11-spi-vs-legacy/summarize-real-workspaces.mjssrc/cli/parser.tssrc/contracts/context-pack.tssrc/infrastructure/context-pack-command.tssrc/runtime/context-pack-resolution.tssrc/runtime/retrieve.tssrc/runtime/retrieve/slicing.tssrc/runtime/stdio/definitions.tssrc/runtime/stdio/tools.tstests/unit/benchmark-graph-stats.test.tstests/unit/benchmark-real-workspace.test.tstests/unit/context-pack-resolution-sketch.test.tstests/unit/retrieve-slice-surface.test.tstests/unit/retrieve-slice-v1.test.tstests/unit/stdio-slice-surface.test.ts
Tighten benchmark script failures, keep slice-v1 constrained through async retrieval, reject unsupported review retrieval_strategy usage, and limit sketch side-effect inference to execution edges. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/unit/benchmark-graph-stats.test.ts (1)
33-43: ⚡ Quick winAdd an explicit non-array
edgesregression test.You already cover the “missing
edges” fallback. Adding a case whereedgesexists but is not an array would harden this contract and prevent silent regressions.Proposed test addition
+ it('falls back to zero edges when graph.json has a non-array edges value', () => { + withTempGraph({ + nodes: [{ id: 'a' }, { id: 'b' }], + edges: { source: 'a', target: 'b' }, + }, (graphPath) => { + const output = execFileSync('node', [ + 'docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs', + graphPath, + ], { cwd: process.cwd(), encoding: 'utf8' }) + expect(JSON.parse(output)).toEqual({ node_count: 2, edge_count: 0 }) + }) + })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/benchmark-graph-stats.test.ts` around lines 33 - 43, Add a new unit test (next to the existing "falls back to zero edges when graph.json omits the edges array") that uses withTempGraph to write a graph JSON where edges is present but not an array (e.g., edges: null or edges: {} or a string) and then calls the same script ('docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs') via execFileSync; assert JSON.parse(output) equals { node_count: 2, edge_count: 0 } to ensure the graph-stats.mjs code path that reads edges treats non-array values the same as missing arrays.src/runtime/context-pack-resolution.ts (1)
382-383: ⚡ Quick winStabilize env/config label ordering for deterministic sketch output.
relationLabels(...)preserves relationship insertion order, so output can drift if upstream edge order varies. SortingreadsEnvandconfigbefore rendering will make this deterministic.Proposed patch
- const config = relationLabels(node, relationIndex, 'outgoing', ['uses_config']) - const readsEnv = relationLabels(node, relationIndex, 'outgoing', ['reads_env']) + const config = relationLabels(node, relationIndex, 'outgoing', ['uses_config']).sort((a, b) => a.localeCompare(b)) + const readsEnv = relationLabels(node, relationIndex, 'outgoing', ['reads_env']).sort((a, b) => a.localeCompare(b))Also applies to: 402-407
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/runtime/context-pack-resolution.ts` around lines 382 - 383, The output can vary because relationLabels(node, relationIndex, 'outgoing', ['uses_config']) and relationLabels(node, relationIndex, 'outgoing', ['reads_env']) preserve insertion order; to make sketch output deterministic, sort the arrays returned by relationLabels (e.g., sort config and readsEnv) before any rendering/serialization. Locate places where config and readsEnv are used (the calls to relationLabels and subsequent rendering logic around the variables named config and readsEnv, plus the similar block around lines 402-407) and replace direct use with a stableSortedConfig and stableSortedReadsEnv (or sort in-place) so downstream code always consumes a consistently ordered list.tests/unit/benchmark-real-workspace.test.ts (1)
135-137: ⚡ Quick winReplace deprecated
toThrowErrormatcher withtoThrowacross test files.Vitest 4 treats
toThrowErroras a deprecated alias. Replace withtoThrowto keep matcher usage current.Also applies to
tests/unit/benchmark-graph-stats.test.ts:53Suggested diff
tests/unit/benchmark-real-workspace.test.ts: - })).toThrowError(expect.objectContaining({ + })).toThrow(expect.objectContaining({ stderr: expect.stringContaining('GRAPHIFY_BENCH_REAL_PROMPTS'), })) ... - })).toThrowError(expect.objectContaining({ + })).toThrow(expect.objectContaining({ stderr: expect.stringContaining('GRAPHIFY_BENCH_BACKEND'), })) ... - })).toThrowError(expect.objectContaining({ + })).toThrow(expect.objectContaining({ stderr: expect.stringContaining('monorepo'), })) tests/unit/benchmark-graph-stats.test.ts: - ], { cwd: process.cwd(), encoding: 'utf8', stdio: 'pipe' })).toThrowError( + ], { cwd: process.cwd(), encoding: 'utf8', stdio: 'pipe' })).toThrow(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/benchmark-real-workspace.test.ts` around lines 135 - 137, Replace the deprecated matcher call expect(...).toThrowError(...) with expect(...).toThrow(...) in the failing tests: locate the occurrences of expect(...).toThrowError(expect.objectContaining({ stderr: expect.stringContaining('GRAPHIFY_BENCH_REAL_PROMPTS') })) in the benchmark-real-workspace test and the similar expect(...).toThrowError(...) in benchmark-graph-stats.test.ts (the exact invocation uses expect.objectContaining and expect.stringContaining) and swap toThrowError to toThrow while keeping the objectContaining argument unchanged so the assertion semantics remain the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/runtime/context-pack-resolution.ts`:
- Around line 382-383: The output can vary because relationLabels(node,
relationIndex, 'outgoing', ['uses_config']) and relationLabels(node,
relationIndex, 'outgoing', ['reads_env']) preserve insertion order; to make
sketch output deterministic, sort the arrays returned by relationLabels (e.g.,
sort config and readsEnv) before any rendering/serialization. Locate places
where config and readsEnv are used (the calls to relationLabels and subsequent
rendering logic around the variables named config and readsEnv, plus the similar
block around lines 402-407) and replace direct use with a stableSortedConfig and
stableSortedReadsEnv (or sort in-place) so downstream code always consumes a
consistently ordered list.
In `@tests/unit/benchmark-graph-stats.test.ts`:
- Around line 33-43: Add a new unit test (next to the existing "falls back to
zero edges when graph.json omits the edges array") that uses withTempGraph to
write a graph JSON where edges is present but not an array (e.g., edges: null or
edges: {} or a string) and then calls the same script
('docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs') via execFileSync;
assert JSON.parse(output) equals { node_count: 2, edge_count: 0 } to ensure the
graph-stats.mjs code path that reads edges treats non-array values the same as
missing arrays.
In `@tests/unit/benchmark-real-workspace.test.ts`:
- Around line 135-137: Replace the deprecated matcher call
expect(...).toThrowError(...) with expect(...).toThrow(...) in the failing
tests: locate the occurrences of
expect(...).toThrowError(expect.objectContaining({ stderr:
expect.stringContaining('GRAPHIFY_BENCH_REAL_PROMPTS') })) in the
benchmark-real-workspace test and the similar expect(...).toThrowError(...) in
benchmark-graph-stats.test.ts (the exact invocation uses expect.objectContaining
and expect.stringContaining) and swap toThrowError to toThrow while keeping the
objectContaining argument unchanged so the assertion semantics remain the same.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 76a15b07-9bac-422f-b668-9af5e8bf3c51
📒 Files selected for processing (18)
docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjsdocs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjsdocs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.shdocs/benchmarks/2026-05-11-spi-vs-legacy/summarize-real-workspaces.mjssrc/infrastructure/context-pack-command.tssrc/runtime/benchmark/probe-calibration.tssrc/runtime/context-pack-resolution.tssrc/runtime/retrieve.tssrc/runtime/retrieve/slicing.tssrc/runtime/stdio/tools.tstests/unit/benchmark-graph-stats.test.tstests/unit/benchmark-probe-calibration.test.tstests/unit/benchmark-real-workspace.test.tstests/unit/context-pack-resolution-sketch.test.tstests/unit/retrieve-semantic.test.tstests/unit/retrieve-slice-surface.test.tstests/unit/retrieve-slice-v1.test.tstests/unit/stdio-slice-surface.test.ts
✅ Files skipped from review due to trivial changes (2)
- tests/unit/benchmark-probe-calibration.test.ts
- tests/unit/retrieve-slice-surface.test.ts
🚧 Files skipped from review as they are similar to previous changes (7)
- docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs
- docs/benchmarks/2026-05-11-spi-vs-legacy/summarize-real-workspaces.mjs
- docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs
- tests/unit/retrieve-slice-v1.test.ts
- docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh
- src/runtime/retrieve/slicing.ts
- src/runtime/retrieve.ts
Summary
slice-v1retrieval plus MCP/CLI surface support and regression coverageresolution: 'sketch'with deterministic env/config and side-effect hintsVerification
npm run typechecknpm run buildnpm run test:runbash docs/benchmarks/2026-05-11-spi-vs-legacy/run.shBenchmark notes
947vs SPI cold100267,254bytes ->45,589bytes0helps /7no material change /0hurts-or-expandsNotes
slice-v1is opt-in and experimentalSummary by CodeRabbit
New Features
Documentation