v0.10.1 — honest numbers + lean MCP profile (#28)
Ship a lean MCP tool surface by default. The Claude Code session-start
overhead from advertising all 21 tools writes ~16-22K tokens to
cache_creation_input_tokens (priced at 1.25x input rate) on a fresh
session. Most agents only call retrieve, impact, call_chain,
community_overview, pr_impact, and graph_stats — gating the other 15
behind GRAPHIFY_TOOL_PROFILE=full reclaims the cache budget.
- Add CORE_TOOL_NAMES, McpToolProfile, activeMcpTools, and
resolveToolProfileFromEnv to runtime/stdio/definitions.ts.
- Wire tools/list and tools/call through the active profile in
runtime/stdio-server.ts; non-core calls in core mode return
JSONRPC_METHOD_NOT_FOUND with a documented hint pointing at
GRAPHIFY_TOOL_PROFILE=full in .mcp.json.
- Default the generated .mcp.json (Claude / Cursor / VS Code Copilot)
to env: { GRAPHIFY_TOOL_PROFILE: 'core' } via installMcpServer.
- Tests: stdio-tool-profile.test.ts covers profile selection and
gating end-to-end; install.test.ts asserts the env block; the
existing stdio-server.test.ts opts into 'full' for behavior tests.
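The profile plumbing above can be sketched roughly as follows. The names mirror the commit (`CORE_TOOL_NAMES`, `resolveToolProfileFromEnv`, `activeMcpTools`), but the bodies are illustrative guesses, not the shipped `runtime/stdio/definitions.ts` implementation.

```typescript
// Illustrative sketch only — signatures and bodies are assumptions.
type McpToolProfile = 'core' | 'full';

const CORE_TOOL_NAMES = new Set<string>([
  'retrieve',
  'impact',
  'call_chain',
  'community_overview',
  'pr_impact',
  'graph_stats',
]);

// Anything other than an explicit 'full' falls back to the lean default.
function resolveToolProfileFromEnv(
  env: Record<string, string | undefined>,
): McpToolProfile {
  return env.GRAPHIFY_TOOL_PROFILE === 'full' ? 'full' : 'core';
}

function activeMcpTools<T extends { name: string }>(
  allTools: readonly T[],
  profile: McpToolProfile,
): T[] {
  return profile === 'full'
    ? [...allTools]
    : allTools.filter((tool) => CORE_TOOL_NAMES.has(tool.name));
}

// 'rename_symbol' is a made-up non-core tool name for this demo.
const demoTools = [
  { name: 'retrieve' },
  { name: 'graph_stats' },
  { name: 'rename_symbol' },
];
console.log(activeMcpTools(demoTools, resolveToolProfileFromEnv({})).map((t) => t.name));
// → ['retrieve', 'graph_stats']
```

With this shape, `tools/list` in core mode advertises only the six core tools, and a `tools/call` for a filtered name can be rejected before dispatch.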
The previously-published 384x retrieve-compression headline was computed
against an internal-only baseline_mode='full' prompt that no real agent
ever sends. The credible measurement is the 2026-04-30 native_agent
comparison against a production NestJS+Next.js codebase: 3x fewer
tool-call turns, ~2.8x faster end-to-end latency, and 2.6x fewer total
input tokens as billed by Anthropic. The cold-start cost premium is
disclosed honestly (~+13% on a single-question session, amortizing on
multi-question sessions); v0.10.1 also flips the default to the lean
6-tool 'core' profile so cold-start cost approaches parity.
- src/infrastructure/install.ts: replace the RETRIEVE_FIRST_MESSAGE
string baked into the .claude PreToolUse / .gemini BeforeTool /
.codex hook payloads with the measured copy ('3x fewer turns,
~2.8x faster on a real production codebase'). The base64 encoding
is regenerated automatically at install time.
- README.md: replace the Benchmarks section with the measured table
and the cold-start cost honesty disclosure; cite the public
artifact at docs/benchmarks/2026-04-30-govalidate/ that lands in
the next commit.
- examples/why-graphify.md: replace the headline efficiency section
and the Benchmark Summary with the same measured numbers; drop
the stale '17 MCP tools' line and document the core/full profile
split.
- Tests: install-templates.test.ts and why-graphify-doc.test.ts
enforce no '384x'/'397x'/'897x' substrings in the public copy
and the install hook payload.
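The substring guard those tests enforce can be sketched as below; the helper name and assertion style are assumptions, not the actual test code.

```typescript
// Hypothetical guard mirroring install-templates.test.ts /
// why-graphify-doc.test.ts: public copy must not contain stale multipliers.
const STALE_CLAIMS = ['384x', '397x', '897x'] as const;

function findStaleClaims(copy: string): string[] {
  return STALE_CLAIMS.filter((claim) => copy.includes(claim));
}

const measuredCopy = '3x fewer turns, ~2.8x faster on a real production codebase';
console.log(findStaleClaims(measuredCopy)); // → []
console.log(findStaleClaims('384x smaller prompts!')); // → ['384x']
```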
The existing 'full' and 'bounded' baseline modes both build synthetic
baseline prompts from the project corpus, which no real agent ever sends.
Their reduction_ratio is a cl100k_base estimate, not an Anthropic-billed
measurement. native_agent runs the user's --exec command twice — once
with graphify-out/graph.json, .mcp.json, CLAUDE.md, and .claude/
snapshot-renamed out of the working directory (baseline), once with them
restored (graphify) — and reports the usage blocks from
'claude --output-format json' verbatim.
- src/infrastructure/compare.ts: extend CompareBaselineMode with
  'native_agent' alongside the existing modes. Add
  executeNativeAgentCompare, parseAnthropicResultEvent, and the
  NativeAgentCompareReport / NativeAgentRunner types. Atomic rename /
  try-finally restore guarantees no project state is left behind even
  if the baseline runner crashes; a probe test verifies the snapshot
  doesn't hide graphify-out/compare/<ts>/. Add an explicit
  synthetic-baseline disclosure line to formatCompareSummary for
  full/bounded so a reader cannot mistake the synthetic ratio for an
  Anthropic-billed measurement.
- src/cli/parser.ts and src/cli/main.ts: accept and document the new
  mode. Default stays 'full' for backward compat.
- tests/fixtures/mock-claude-runner.mjs: deterministic mock that emits
  the Anthropic JSON shape so the smoke test runs without a real model
  call.
- tests/unit/compare-native-agent.test.ts: covers success path, crash
  safety, bare-project absent state, exec_command redaction,
  runner_error fallback, snapshot scope (compare/<ts> stays writable),
  and stream-json parsing.
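The snapshot-rename / try-finally restore flow can be sketched as follows. The real logic is executeNativeAgentCompare in src/infrastructure/compare.ts; `withArtifactsHidden` and `runBaseline` are illustrative names, not the shipped API.

```typescript
import { existsSync, mkdtempSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Sketch only: hide graphify artifacts for the baseline run, restore
// them even if the runner crashes.
const ARTIFACTS = ['graphify-out/graph.json', '.mcp.json', 'CLAUDE.md', '.claude'];

async function withArtifactsHidden<T>(
  projectDir: string,
  runBaseline: () => Promise<T>,
): Promise<T> {
  const hidden: Array<[src: string, dst: string]> = [];
  try {
    for (const rel of ARTIFACTS) {
      const src = join(projectDir, rel);
      if (!existsSync(src)) continue; // bare project: nothing to hide
      const dst = `${src}.graphify-snapshot`;
      renameSync(src, dst); // atomic rename within one filesystem
      hidden.push([src, dst]);
    }
    return await runBaseline();
  } finally {
    // Runs even if the baseline runner throws, so no project state leaks.
    for (const [src, dst] of hidden) renameSync(dst, src);
  }
}

// Usage sketch: the artifact disappears for the baseline run, then returns.
const demoDir = mkdtempSync(join(tmpdir(), 'graphify-demo-'));
writeFileSync(join(demoDir, '.mcp.json'), '{}');
await withArtifactsHidden(demoDir, async () => {
  console.log('hidden during baseline:', !existsSync(join(demoDir, '.mcp.json')));
});
console.log('restored after run:', existsSync(join(demoDir, '.mcp.json')));
```

The try/finally placement is the whole point: restore happens on both the success and the crash path, matching the crash-safety test described above.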
The headline numbers in the README, examples/why-graphify.md, and the
install hook payload all reference the 2026-04-30 native_agent
measurement against the GoValidate codebase. Commit the raw evidence so
anyone can reproduce the totals from the same files graphify-ts ships in
its package directory.
- docs/benchmarks/2026-04-30-govalidate/baseline-session.json: the
  'claude --output-format json' result event from the no-graphify run,
  with the answer body redacted (the question is internal).
- docs/benchmarks/2026-04-30-govalidate/graphify-session.json: the
  graphify-enabled run, same shape.
- docs/benchmarks/2026-04-30-govalidate/verify.sh: bash + node
  reproducer that prints the per-file usage blocks and computes the
  headline reductions (3x turns, 2.77x latency, 2.63x input tokens).
  Uses $DIR rather than absolute paths so the script reproduces from
  any checkout.
- docs/benchmarks/2026-04-30-govalidate/README.md: narrative, setup,
  headline table, sum-of-fields explanation, and reproduction recipe
  for running the same comparison on a reader's own codebase.
- tests/unit/benchmark-artifact.test.ts: ensures the README cites
  numbers that are computable from the committed JSON, asserts no stale
  '384x'/'897x' marketing claims in the README, and runs verify.sh
  end-to-end (skipped only when jq is missing).
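The sum-of-fields arithmetic that verify.sh reproduces can be sketched with the usage totals quoted in the artifact README's worked example. This is a sketch, not the shipped script; the field values are copied from the committed session JSONs.

```typescript
// Total billed input = input + cache writes + cache reads, per session.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

const totalInput = (u: Usage) =>
  u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;

// Values from the committed baseline-session.json / graphify-session.json.
const baseline: Usage = {
  input_tokens: 14,
  cache_creation_input_tokens: 40_648,
  cache_read_input_tokens: 574_528,
};
const graphify: Usage = {
  input_tokens: 13,
  cache_creation_input_tokens: 92_833,
  cache_read_input_tokens: 140_662,
};

console.log(totalInput(baseline)); // → 615190
console.log(totalInput(graphify)); // → 233508
console.log((totalInput(baseline) / totalInput(graphify)).toFixed(2)); // → "2.63"
```

This is the same 615190/233508 pair verify.sh prints, and the 2.63× ratio the headline cites.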
Bump package.json from 0.10.0 to 0.10.1 and document the changes that
shipped in this PR:
Added: GRAPHIFY_TOOL_PROFILE env var (core profile = 6 tools);
compare --baseline-mode native_agent with Anthropic-reported
usage; public benchmark artifact under
docs/benchmarks/2026-04-30-govalidate/.
Changed: replace 384x/897x strawman headline with measured 3x/2.8x
numbers everywhere (README, examples/why-graphify.md,
install hook payload); compare summary now labels synthetic
full/bounded ratios explicitly.
Fixed: cold-start cost regression (cache_creation overhead from the
full 21-tool MCP surface) by shipping core as the default.
Eval gate stays green (Recall 100%, MRR 1.000, Snippet coverage 100%).
> Caution: Review failed — the pull request was closed or merged during review.

📝 Walkthrough

Adds a tool-profile system (…)
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as "CLI\n(compare --baseline-mode native_agent)"
    participant FS as "Filesystem\n(artifact snapshot/restore)"
    participant Baseline as "Baseline Env\n(hidden artifacts)"
    participant Graphify as "Graphify Env\n(restored artifacts)"
    participant API as "Anthropic API"
    participant Reporter as "Reporter / Formatter"
    User->>CLI: run compare --baseline-mode native_agent --exec <cmd>
    CLI->>FS: snapshot graphify artifacts
    CLI->>FS: hide/remove artifacts
    CLI->>Baseline: run user --exec (baseline)
    Baseline->>API: invoke model (no MCP tools)
    API-->>Baseline: response + trailing JSON usage
    Baseline->>CLI: emit/parse Anthropic usage
    CLI->>FS: restore artifacts
    CLI->>Graphify: run user --exec (graphify)
    Graphify->>API: invoke model (with MCP tools)
    API-->>Graphify: response + trailing JSON usage
    Graphify->>CLI: emit/parse Anthropic usage
    CLI->>Reporter: compute reductions (turns, tokens, cost)
    Reporter-->>User: display native-agent compare summary
```
Final acceptance check

All ship-readiness criteria from the plan are satisfied:
Note: the public artifact uses
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/fixtures/mock-claude-runner.mjs (1)
23-57: ⚡ Quick win — reduce benchmark-number drift by sourcing fixture payloads from artifact JSON.

The baseline/graphify numeric blocks are duplicated from
docs/benchmarks/2026-04-30-govalidate/*.json. Loading those files here would keep smoke fixtures aligned automatically as artifacts evolve.

Proposed refactor sketch:

```diff
 import { existsSync, readFileSync } from 'node:fs'
+import { dirname, resolve } from 'node:path'
+import { fileURLToPath } from 'node:url'
@@
-const baseline = {
-  ...
-}
-
-const graphify = {
-  ...
-}
+const here = dirname(fileURLToPath(import.meta.url))
+const benchmarkDir = resolve(here, '../../docs/benchmarks/2026-04-30-govalidate')
+const baselineArtifact = JSON.parse(readFileSync(resolve(benchmarkDir, 'baseline-session.json'), 'utf8'))
+const graphifyArtifact = JSON.parse(readFileSync(resolve(benchmarkDir, 'graphify-session.json'), 'utf8'))
+
+const baseline = {
+  ...baselineArtifact,
+  result: `mock baseline answer for prompt of length ${prompt.length}`,
+}
+const graphify = {
+  ...graphifyArtifact,
+  result: `mock graphify answer for prompt of length ${prompt.length}`,
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fixtures/mock-claude-runner.mjs` around lines 23 - 57, Replace the duplicated hard-coded numeric fixture blocks by loading the corresponding artifact JSON(s) at runtime and mapping their fields into the existing baseline and graphify objects: read and parse the benchmark artifact JSON(s) referenced in docs/benchmarks/2026-04-30-govalidate, extract fields for duration_ms, duration_api_ms, num_turns, result (use prompt.length when composing result strings), session_id, total_cost_usd and the usage subfields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens, output_tokens), and assign them into the existing baseline and graphify variables (keeping their keys unchanged); add a small fallback/default behavior if the artifact file is missing or malformed so tests still run.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/benchmarks/2026-04-30-govalidate/README.md`:
- Around line 26-29: The fenced code block containing the token calculations
(the lines starting with "baseline_total_input_tokens = 14 + 40,648 + 574,528 =
615,190" and "graphify_total_input_tokens = 13 + 92,833 + 140,662 = 233,508")
needs a language tag to satisfy markdownlint MD040; edit the README.md fenced
block to start with ```text instead of ``` so the block is explicitly marked as
plain text.
- Around line 57-60: Update the README reproduction steps to use the current CLI
flow: replace the invalid command string "graphify-ts claude install --project
/path/to/your/repo" with instructions to cd into the target repository and run
"graphify-ts claude install" (i.e., run the install command from the target repo
directory rather than passing a --project flag); update the two-line snippet
under the graph generation steps so it reflects "graphify-ts generate
/path/to/your/repo" followed by the corrected install invocation.
In `@examples/why-graphify.md`:
- Line 22: Update the paragraph that tells users to set
GRAPHIFY_TOOL_PROFILE=full in .mcp.json to also mention alternative MCP config
locations for non-Claude installs: explicitly list .cursor/mcp.json and
.vscode/mcp.json so Cursor and Copilot users are directed to the correct file,
and ensure the sentence that names Cursor and Copilot references these
alternative paths (e.g., "set GRAPHIFY_TOOL_PROFILE=full in .mcp.json (or
.cursor/mcp.json / .vscode/mcp.json for Cursor/Copilot installs)").
In `@src/infrastructure/install.ts`:
- Around line 499-505: Existing server config env is always overwritten with env
= { GRAPHIFY_TOOL_PROFILE: 'core' } which drops user custom entries and silently
downgrades a preexisting GRAPHIFY_TOOL_PROFILE='full'; instead, detect and merge
with any existing env block before writing: read the current server config's env
(e.g., existingEnv), build mergedEnv = { GRAPHIFY_TOOL_PROFILE: 'core',
...existingEnv } so existing keys (including a user-set GRAPHIFY_TOOL_PROFILE)
override the default, then use mergedEnv for serverConfig in place of the
hardcoded env; apply this in the code that creates serverConfig (the block
referencing env, serverConfig, isVscode, npxCommand, npxArgs).
In `@src/runtime/stdio-server.ts`:
- Around line 553-558: Update the error text returned in the branch that checks
if a tool is disabled (the block that uses toolName, isCoreToolName(profile),
failure(id, JSONRPC_METHOD_NOT_FOUND, ...)) to use a generic, profile-agnostic
message; replace the hard-coded reference to "'core' profile" and ".mcp.json"
with something like "Tool '<toolName>' is not enabled in the active profile."
and optionally suggest checking the application's MCP/configuration, ensuring
the change is made where isCoreToolName is evaluated so all callers (Cursor,
Copilot, etc.) get the generic message.
---
Nitpick comments:
In `@tests/fixtures/mock-claude-runner.mjs`:
- Around line 23-57: Replace the duplicated hard-coded numeric fixture blocks by
loading the corresponding artifact JSON(s) at runtime and mapping their fields
into the existing baseline and graphify objects: read and parse the benchmark
artifact JSON(s) referenced in docs/benchmarks/2026-04-30-govalidate, extract
fields for duration_ms, duration_api_ms, num_turns, result (use prompt.length
when composing result strings), session_id, total_cost_usd and the usage
subfields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens,
output_tokens), and assign them into the existing baseline and graphify
variables (keeping their keys unchanged); add a small fallback/default behavior
if the artifact file is missing or malformed so tests still run.
📒 Files selected for processing (23): `CHANGELOG.md`, `README.md`, `docs/benchmarks/2026-04-30-govalidate/README.md`, `docs/benchmarks/2026-04-30-govalidate/baseline-session.json`, `docs/benchmarks/2026-04-30-govalidate/graphify-session.json`, `docs/benchmarks/2026-04-30-govalidate/verify.sh`, `examples/why-graphify.md`, `package.json`, `src/cli/main.ts`, `src/cli/parser.ts`, `src/infrastructure/compare.ts`, `src/infrastructure/install.ts`, `src/runtime/stdio-server.ts`, `src/runtime/stdio/definitions.ts`, `tests/fixtures/mock-claude-runner.mjs`, `tests/unit/benchmark-artifact.test.ts`, `tests/unit/cli.test.ts`, `tests/unit/compare-native-agent.test.ts`, `tests/unit/install-templates.test.ts`, `tests/unit/install.test.ts`, `tests/unit/stdio-server.test.ts`, `tests/unit/stdio-tool-profile.test.ts`, `tests/unit/why-graphify-doc.test.ts`
- src/infrastructure/install.ts: real bug — installMcpServer was
overwriting the env block on re-install, silently downgrading a
user-customized GRAPHIFY_TOOL_PROFILE=full back to 'core' and
dropping unrelated user-set env keys (e.g. HTTP_PROXY). Now reads
the existing server config's env (if any) and merges with the
defaults so user values win. Test in install.test.ts covers the
reinstall round-trip.
- src/runtime/stdio-server.ts: gating error message no longer
hardcodes "'core' profile" / ".mcp.json" — it's now profile- and
client-agnostic and lists the three supported MCP config locations
(.mcp.json / .cursor/mcp.json / .vscode/mcp.json) so Cursor and
Copilot users see the right path.
- examples/why-graphify.md: extend the GRAPHIFY_TOOL_PROFILE=full
pointer to mention .cursor/mcp.json and .vscode/mcp.json
alongside .mcp.json.
- docs/benchmarks/2026-04-30-govalidate/README.md: tag the math
fenced block as 'text' (markdownlint MD040); rewrite the
reproduction recipe to use the actual CLI flow ('cd /path/to/repo
&& graphify-ts claude install') instead of the invalid
'--project /path/to/repo' flag.
- tests/fixtures/mock-claude-runner.mjs: load duration_ms,
num_turns, total_cost_usd, and the usage block from
docs/benchmarks/2026-04-30-govalidate/{baseline,graphify}-session.json
at runtime so the smoke fixture and the public artifact stay in
sync. Falls back to inline defaults when the artifact is missing.
All 1236 tests pass; native_agent smoke test still emits 3x/2.77x/2.63x
matching the public artifact's verify.sh output.
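The env-merge fix described in the first bullet can be sketched as follows; the function name is illustrative, and the real change lives in installMcpServer in src/infrastructure/install.ts.

```typescript
// Defaults first, existing entries spread last, so any user-set value
// (including GRAPHIFY_TOOL_PROFILE=full or HTTP_PROXY) wins on reinstall.
type ServerEnv = Record<string, string>;

function mergeServerEnv(existingEnv: ServerEnv | undefined): ServerEnv {
  return { GRAPHIFY_TOOL_PROFILE: 'core', ...existingEnv };
}

console.log(mergeServerEnv({ GRAPHIFY_TOOL_PROFILE: 'full', HTTP_PROXY: 'http://proxy:8080' }));
// → { GRAPHIFY_TOOL_PROFILE: 'full', HTTP_PROXY: 'http://proxy:8080' }
console.log(mergeServerEnv(undefined));
// → { GRAPHIFY_TOOL_PROFILE: 'core' }
```

Object spread order does the work here: a fresh install gets the lean default, while a reinstall preserves whatever the user already configured.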
Addressed CodeRabbit feedback in 811adc5
Skipping: the docstring-coverage warning (3.13% vs CodeRabbit's 80% target) — the codebase doesn't follow that convention and adding boilerplate JSDoc to every function would be noise. Verification re-run:
@coderabbitai review
✅ Actions performed: review triggered.
Summary
Single PR landing five connected changes for v0.10.1:
- `GRAPHIFY_TOOL_PROFILE` env var — defaults to `core` (6 tools) instead of advertising all 21. Cuts ~16–22K of `cache_creation_input_tokens` per Claude Code session.
- Replaces the `384×`/`897×` strawman headline (computed against an internal-only baseline) with the measured 2026-04-30 native_agent comparison.
- `compare --baseline-mode native_agent` — runs the user's `--exec` twice (with and without graphify config files snapshot-renamed) and reports Anthropic-billed `usage` blocks verbatim. Atomic try/finally restore guarantees no project state is left behind even if the runner crashes.
- Public benchmark artifact at `docs/benchmarks/2026-04-30-govalidate/` with both raw `claude --output-format json` outputs and a `verify.sh` reproducer.
- Bumps `package.json` to `0.10.1` and dates the CHANGELOG entry.

Headline numbers (measured 2026-04-30 against the GoValidate codebase)
All numbers come from `claude --output-format json` `usage` fields, not local prompt-token estimates. Reproduce with `bash docs/benchmarks/2026-04-30-govalidate/verify.sh`.

CHANGELOG entry
Added
- `GRAPHIFY_TOOL_PROFILE` env var: defaults to `core` (6 tools — `retrieve`, `impact`, `call_chain`, `community_overview`, `pr_impact`, `graph_stats`); set to `full` to opt into the legacy 21-tool surface. The Claude / Cursor / VS Code Copilot install templates now write `env: { GRAPHIFY_TOOL_PROFILE: "core" }` into the generated `.mcp.json`. Reduces `cache_creation_input_tokens` by roughly 16–22K per Claude Code session start.
- `compare --baseline-mode native_agent`: runs `--exec` twice (snapshot-renamed and restored) and reports Anthropic-billed `usage` blocks verbatim. Atomic try/finally restore.
- Public benchmark artifact at `docs/benchmarks/2026-04-30-govalidate/` with both raw `claude --output-format json` outputs and a `verify.sh` reproducer.

Changed
- Replace `384×`/`397×`/`897×` strawman headlines with measured 3× fewer turns, ~2.8× faster, 2.6× fewer total input tokens in the README, `examples/why-graphify.md`, and the `claude install` PreToolUse hook payload.

Fixed
- `core` MCP tool profile by default cuts `cache_creation_input_tokens` overhead by ~16–22K tokens per fresh session.

Test plan
- `npm run clean && npm run build` — clean
- `npm run typecheck` — clean
- `npm run test:run` — 1234/1234 passing across 64 test files
- `examples/demo-repo` — Recall 100%, MRR 1.000, Snippet coverage 100% (CI thresholds ≥95/≥0.95/≥95)
- `compare --baseline-mode native_agent` smoke test — emits 3× turns / 2.77× faster / 2.63× tokens with both Anthropic-reported usage blocks; report.json shape matches the spec
- Profile tool counts — `core`: 6, `full`: 21
- Install hook payload — contains "3x fewer turns", no stale `384x`/`897x`
- `docs/benchmarks/2026-04-30-govalidate/verify.sh` — exits 0, prints `615190`/`233508` totals
- `npm pack --dry-run` — `mohammednagy-graphify-ts-0.10.1.tgz` builds cleanly

What this PR does NOT do
- Does not remove `baseline_mode: 'full'` or `'bounded'` (adds `'native_agent'` alongside; default stays `'full'` for one minor).
- Does not bump `MCP_PROTOCOL_VERSION`.
- Does not bump `EXTRACTOR_CACHE_VERSION` (extraction shape unchanged).
- Does not publish to npm from this PR (the `v0.10.1` tag handles publish).