v0.10.1 — honest numbers + lean MCP profile #28

Merged

mohanagy merged 6 commits into main from feat/v0.10.1-honest-numbers-lean-profile on May 1, 2026
Conversation

mohanagy (Owner) commented May 1, 2026

Summary

Single PR landing five connected changes for v0.10.1:

  1. GRAPHIFY_TOOL_PROFILE env var — defaults to core (6 tools) instead of advertising all 21. Cuts ~16–22K of cache_creation_input_tokens per Claude Code session (a config sketch follows this list).
  2. Honest benchmark numbers — replaces the discredited 384× / 897× strawman headline (computed against an internal-only baseline) with the measured 2026-04-30 native_agent comparison.
  3. compare --baseline-mode native_agent — runs the user's --exec twice (with and without graphify config files snapshot-renamed) and reports Anthropic-billed usage blocks verbatim. Atomic try/finally restore guarantees no project state is left behind even if the runner crashes.
  4. Public benchmark artifact — committed docs/benchmarks/2026-04-30-govalidate/ with both raw claude --output-format json outputs and a verify.sh reproducer.
  5. Release — bumps package.json to 0.10.1 and dates the CHANGELOG entry.
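For reference, a minimal sketch of the generated .mcp.json entry with the new default. The env block is the part this PR adds; the server name, command, and args here are illustrative assumptions, not the exact template output:

```json
{
  "mcpServers": {
    "graphify": {
      "command": "npx",
      "args": ["@mohammednagy/graphify-ts", "mcp"],
      "env": { "GRAPHIFY_TOOL_PROFILE": "core" }
    }
  }
}
```

Setting the value to "full" opts back into the legacy 21-tool surface.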

Headline numbers (measured 2026-04-30 against the GoValidate codebase)

| Metric | Baseline (no graphify) | Graphify (core profile) | Δ |
| --- | --- | --- | --- |
| Tool-call turns | 9 | 3 | 3× fewer |
| Latency | 96,368 ms | 34,744 ms | ~2.77× faster |
| Total input tokens (Anthropic-reported) | 615,190 | 233,508 | 2.63× less |
| Cost per session | $0.62 | $0.70 | +13% on cold start; amortizes on multi-question sessions |
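The cost delta is straight arithmetic from the table: $0.70 / $0.62 ≈ 1.13, so a single-question cold start costs roughly 13% more even though the graphify run bills 2.63× fewer total input tokens. The premium comes from the MCP tool-definition cache writes, which are billed at the higher cache-creation rate.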

All numbers come from claude --output-format json usage fields, not local prompt-token estimates. Reproduce with bash docs/benchmarks/2026-04-30-govalidate/verify.sh.
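If you'd rather recompute the totals directly, here is a minimal TypeScript sketch of the same sum verify.sh performs. Field names follow the committed session JSON; treat the exact nesting as an assumption checked against the artifacts:

```ts
import { readFileSync } from 'node:fs'

interface SessionResult {
  num_turns: number
  duration_ms: number
  total_cost_usd: number
  usage: {
    input_tokens: number
    cache_creation_input_tokens: number
    cache_read_input_tokens: number
    output_tokens: number
  }
}

const load = (path: string): SessionResult => JSON.parse(readFileSync(path, 'utf8'))

// Total input = the three input-side usage fields summed, as billed by Anthropic.
const totalInput = (s: SessionResult): number =>
  s.usage.input_tokens + s.usage.cache_creation_input_tokens + s.usage.cache_read_input_tokens

const dir = 'docs/benchmarks/2026-04-30-govalidate'
const baseline = load(`${dir}/baseline-session.json`)
const graphify = load(`${dir}/graphify-session.json`)

console.log('baseline_total_input_tokens :', totalInput(baseline)) // expect 615190
console.log('graphify_total_input_tokens :', totalInput(graphify)) // expect 233508
console.log('input_token_reduction       :', `${(totalInput(baseline) / totalInput(graphify)).toFixed(2)}x`)
```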

CHANGELOG entry

Added

  • GRAPHIFY_TOOL_PROFILE env var: defaults to core (6 tools — retrieve, impact, call_chain, community_overview, pr_impact, graph_stats); set to full to opt into the legacy 21-tool surface. The Claude / Cursor / VS Code Copilot install templates now write env: { GRAPHIFY_TOOL_PROFILE: "core" } into the generated .mcp.json. Cuts cache_creation_input_tokens by roughly 16–22K on a fresh Claude Code session start.
  • compare --baseline-mode native_agent: runs --exec twice (snapshot-renamed and restored) and reports Anthropic-billed usage blocks verbatim. Atomic try/finally restore.
  • Public benchmark artifact: docs/benchmarks/2026-04-30-govalidate/ with both raw claude --output-format json outputs and a verify.sh reproducer.

Changed

  • Honest benchmark numbers: replaced 384× / 397× / 897× strawman headlines with measured 3× fewer turns, ~2.8× faster, 2.6× fewer total input tokens in the README, examples/why-graphify.md, and the claude install PreToolUse hook payload.
  • Compare summary framing: synthetic full/bounded baselines are now explicitly disclosed as "synthetic prompt-token estimate (cl100k_base)".

Fixed

  • Cold-start cost regression: lean core MCP tool profile by default cuts cache_creation_input_tokens overhead by ~16–22K tokens per fresh session.

Test plan

  • npm run clean && npm run build — clean
  • npm run typecheck — clean
  • npm run test:run — 1234/1234 passing across 64 test files
  • Eval gate against examples/demo-repo — Recall 100%, MRR 1.000, Snippet coverage 100% (CI thresholds ≥95 / ≥0.95 / ≥95)
  • compare --baseline-mode native_agent smoke test — emits 3× turns / 2.77× faster / 2.63× tokens with both Anthropic-reported usage blocks; report.json shape matches the spec
  • Tool profile count — core: 6, full: 21
  • Hook payload — decoded base64 contains "3x fewer turns", no stale 384x/897x
  • docs/benchmarks/2026-04-30-govalidate/verify.sh — exits 0, prints 615190 / 233508 totals
  • npm pack --dry-run — mohammednagy-graphify-ts-0.10.1.tgz builds cleanly

What this PR does NOT do

  • Does not delete baseline_mode: 'full' or 'bounded' ('native_agent' is added alongside; the default stays 'full' for one more minor release).
  • Does not change MCP_PROTOCOL_VERSION.
  • Does not bump EXTRACTOR_CACHE_VERSION (extraction shape unchanged).
  • Does not lower the eval CI thresholds.
  • Does not auto-publish to npm (the release workflow on the v0.10.1 tag handles publish).

Summary by CodeRabbit

  • New Features

    • Added GRAPHIFY_TOOL_PROFILE env var (defaults to "core") to control MCP tool availability.
    • Added compare --baseline-mode native_agent for real A/B runs that report Anthropic usage.
  • Documentation

    • Updated benchmarks and examples with reproducible baseline vs Graphify “core” metrics (turns, latency, tokens, cost).
    • Published public benchmark artifacts and a verify.sh reproducer.

mohanagy added 5 commits May 1, 2026 06:46
Ship a lean MCP tool surface by default. The Claude Code session-start
overhead from advertising all 21 tools writes ~16-22K tokens to
cache_creation_input_tokens (priced at 1.25x input rate) on a fresh
session. Most agents only call retrieve, impact, call_chain,
community_overview, pr_impact, and graph_stats — gating the other 15
behind GRAPHIFY_TOOL_PROFILE=full reclaims the cache budget.

- Add CORE_TOOL_NAMES, McpToolProfile, activeMcpTools, and
  resolveToolProfileFromEnv to runtime/stdio/definitions.ts.
- Wire tools/list and tools/call through the active profile in
  runtime/stdio-server.ts; non-core calls in core mode return
  JSONRPC_METHOD_NOT_FOUND with a documented hint pointing at
  GRAPHIFY_TOOL_PROFILE=full in .mcp.json.
- Default the generated .mcp.json (Claude / Cursor / VS Code Copilot)
  to env: { GRAPHIFY_TOOL_PROFILE: 'core' } via installMcpServer.
- Tests: stdio-tool-profile.test.ts covers profile selection and
  gating end-to-end; install.test.ts asserts the env block; the
  existing stdio-server.test.ts opts into 'full' for behavior tests.
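A compact sketch of how the pieces named in this commit could fit together (the exported names match the commit message; the exact signatures, the fallback behavior, and the tool-table shape are assumptions):

```ts
export type McpToolProfile = 'core' | 'full'

export const CORE_TOOL_NAMES: ReadonlySet<string> = new Set([
  'retrieve', 'impact', 'call_chain', 'community_overview', 'pr_impact', 'graph_stats',
])

export const isCoreToolName = (name: string): boolean => CORE_TOOL_NAMES.has(name)

// Anything other than an explicit 'full' opts into the lean default (assumed fallback).
export function resolveToolProfileFromEnv(env = process.env): McpToolProfile {
  return env.GRAPHIFY_TOOL_PROFILE === 'full' ? 'full' : 'core'
}

// tools/list advertises only the active profile's tools, so a fresh session
// never pays cache_creation tokens for the 15 gated definitions.
export function activeMcpTools<T extends { name: string }>(
  allTools: readonly T[],
  profile: McpToolProfile,
): T[] {
  return profile === 'full' ? [...allTools] : allTools.filter((t) => isCoreToolName(t.name))
}
```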

The previously-published 384x retrieve-compression headline was computed
against an internal-only baseline_mode='full' prompt that no real agent
ever sends. The credible measurement is the 2026-04-30 native_agent
comparison against a production NestJS+Next.js codebase: 3x fewer
tool-call turns, ~2.8x faster end-to-end latency, and 2.6x fewer total
input tokens as billed by Anthropic. The cold-start cost premium is
disclosed honestly (~+13% on a single-question session, amortizing on
multi-question sessions); v0.10.1 also flips the default to the lean
6-tool 'core' profile so cold-start cost approaches parity.

- src/infrastructure/install.ts: replace the RETRIEVE_FIRST_MESSAGE
  string baked into the .claude PreToolUse / .gemini BeforeTool /
  .codex hook payloads with the measured copy ('3x fewer turns,
  ~2.8x faster on a real production codebase'). The base64 encoding
  is regenerated automatically at install time.
- README.md: replace the Benchmarks section with the measured table
  and the cold-start cost honesty disclosure; cite the public
  artifact at docs/benchmarks/2026-04-30-govalidate/ that lands in
  the next commit.
- examples/why-graphify.md: replace the headline efficiency section
  and the Benchmark Summary with the same measured numbers; drop
  the stale '17 MCP tools' line and document the core/full profile
  split.
- Tests: install-templates.test.ts and why-graphify-doc.test.ts
  enforce no '384x'/'397x'/'897x' substrings in the public copy
  and the install hook payload.

The existing 'full' and 'bounded' baseline modes both build synthetic
baseline prompts from the project corpus, which no real agent ever
sends. Their reduction_ratio is a cl100k_base estimate, not an
Anthropic-billed measurement. native_agent runs the user's --exec
command twice — once with graphify-out/graph.json, .mcp.json, CLAUDE.md,
and .claude/ snapshot-renamed out of the working directory (baseline),
once with them restored (graphify) — and reports the usage blocks from
'claude --output-format json' verbatim.

- src/infrastructure/compare.ts: extend CompareBaselineMode with
  'native_agent' alongside the existing modes. Add
  executeNativeAgentCompare, parseAnthropicResultEvent, and the
  NativeAgentCompareReport / NativeAgentRunner types. Atomic
  rename / try-finally restore guarantees no project state is left
  behind even if the baseline runner crashes; a probe test verifies
  the snapshot doesn't hide graphify-out/compare/<ts>/. Add an
  explicit synthetic-baseline disclosure line to formatCompareSummary
  for full/bounded so a reader cannot mistake the synthetic ratio
  for an Anthropic-billed measurement.
- src/cli/parser.ts and src/cli/main.ts: accept and document the
  new mode. Default stays 'full' for backward compat.
- tests/fixtures/mock-claude-runner.mjs: deterministic mock that
  emits the Anthropic JSON shape so the smoke test runs without a
  real model call.
- tests/unit/compare-native-agent.test.ts: covers success path,
  crash safety, bare-project absent state, exec_command redaction,
  runner_error fallback, snapshot scope (compare/<ts> stays
  writable), and stream-json parsing.
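A minimal sketch of the crash-safe hide/restore loop described above, assuming Node's synchronous fs renames; the helper name and snapshot suffix are illustrative, and the real logic lives in executeNativeAgentCompare:

```ts
import { existsSync, renameSync } from 'node:fs'

// The graphify artifacts the baseline run must not see.
const ARTIFACTS = ['graphify-out/graph.json', '.mcp.json', 'CLAUDE.md', '.claude']

// Hypothetical helper: snapshot-rename artifacts away, run the baseline,
// and restore them even if the runner throws mid-flight.
export async function withArtifactsHidden<T>(run: () => Promise<T>): Promise<T> {
  const restores: Array<[snapshot: string, original: string]> = []
  try {
    for (const original of ARTIFACTS) {
      if (!existsSync(original)) continue // bare projects: nothing to hide, nothing to restore
      const snapshot = `${original}.graphify-baseline-snapshot`
      renameSync(original, snapshot)
      restores.push([snapshot, original])
    }
    return await run()
  } finally {
    // finally runs whether run() resolved or rejected, so no project state is left behind.
    for (const [snapshot, original] of restores.reverse()) renameSync(snapshot, original)
  }
}
```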

The headline numbers in the README, examples/why-graphify.md, and the
install hook payload all reference the 2026-04-30 native_agent
measurement against the GoValidate codebase. Commit the raw evidence
so anyone can reproduce the totals from the same files graphify-ts
ships in its package directory.

- docs/benchmarks/2026-04-30-govalidate/baseline-session.json:
  the 'claude --output-format json' result event from the no-graphify
  run, with the answer body redacted (the question is internal).
- docs/benchmarks/2026-04-30-govalidate/graphify-session.json: the
  graphify-enabled run, same shape.
- docs/benchmarks/2026-04-30-govalidate/verify.sh: bash + node
  reproducer that prints the per-file usage blocks and computes
  the headline reductions (3x turns, 2.77x latency, 2.63x input
  tokens). Uses $DIR rather than absolute paths so the script
  reproduces from any checkout.
- docs/benchmarks/2026-04-30-govalidate/README.md: narrative,
  setup, headline table, sum-of-fields explanation, and
  reproduction recipe for running the same comparison on a
  reader's own codebase.
- tests/unit/benchmark-artifact.test.ts: ensures the README cites
  numbers that are computable from the committed JSON, asserts no
  stale '384x'/'897x' marketing claims in the README, and runs
  verify.sh end-to-end (skipped only when jq is missing).

Bump package.json from 0.10.0 to 0.10.1 and document the changes that
shipped in this PR:

  Added: GRAPHIFY_TOOL_PROFILE env var (core profile = 6 tools);
         compare --baseline-mode native_agent with Anthropic-reported
         usage; public benchmark artifact under
         docs/benchmarks/2026-04-30-govalidate/.

  Changed: replace 384x/897x strawman headline with measured 3x/2.8x
           numbers everywhere (README, examples/why-graphify.md,
           install hook payload); compare summary now labels synthetic
           full/bounded ratios explicitly.

  Fixed:   cold-start cost regression (cache_creation overhead from the
           full 21-tool MCP surface) by shipping core as the default.

Eval gate stays green (Recall 100%, MRR 1.000, Snippet coverage 100%).
coderabbitai (bot) commented May 1, 2026

Caution: Review failed (pull request was closed or merged during review).

📝 Walkthrough

Adds a tool-profile system (core/full), a new compare --baseline-mode native_agent flow that captures real Anthropic usage by running the user command twice (baseline with artifacts hidden, then graphify), commits benchmark artifacts and a verifier, and updates docs/tests and installer to default GRAPHIFY_TOOL_PROFILE=core.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Version & Changelog**<br>CHANGELOG.md, package.json | Bump package to v0.10.1 and add changelog entry documenting GRAPHIFY_TOOL_PROFILE, native_agent baseline mode, and new benchmark workflow. |
| **Documentation & Benchmarks**<br>README.md, examples/why-graphify.md, docs/benchmarks/2026-04-30-govalidate/* | Add reproducible A/B benchmark README, raw JSON artifacts, verify.sh, update README/examples with measured metrics and cold-start disclosure. |
| **CLI**<br>src/cli/main.ts, src/cli/parser.ts | Extend baselineMode type and CLI help to accept/document 'native_agent' (executes --exec twice and reports Anthropic usage). |
| **Compare implementation**<br>src/infrastructure/compare.ts | Add native_agent baseline mode, artifact snapshot/restore logic, parsing of trailing Anthropic JSON events, new types/helpers (parseAnthropicResultEvent, executeNativeAgentCompare, formatNativeAgentCompareSummary), and native-agent summary formatting. |
| **Installer**<br>src/infrastructure/install.ts | Install templates now inject `env: { GRAPHIFY_TOOL_PROFILE: "core" }` into generated MCP server configs and update pre-tool instructions. |
| **MCP tool profiles & runtime gating**<br>src/runtime/stdio/definitions.ts, src/runtime/stdio-server.ts | Introduce McpToolProfile, CORE_TOOL_NAMES, activeMcpTools(), resolveToolProfileFromEnv(), isCoreToolName(); tools/list and tools/call now gate available tools by profile and return JSON-RPC not-found for disabled tools. |
| **Tests & Fixtures**<br>tests/fixtures/mock-claude-runner.mjs, tests/unit/* | Add mock claude runner and multiple tests: native-agent compare tests, benchmark artifact verifier tests, stdio tool-profile/unit tests, install/template tests, CLI help update test, and suite setup to control GRAPHIFY_TOOL_PROFILE. |

Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as "CLI\n(compare --baseline-mode native_agent)"
    participant FS as "Filesystem\n(artifact snapshot/restore)"
    participant Baseline as "Baseline Env\n(hidden artifacts)"
    participant Graphify as "Graphify Env\n(restored artifacts)"
    participant API as "Anthropic API"
    participant Reporter as "Reporter / Formatter"

    User->>CLI: run compare --baseline-mode native_agent --exec <cmd>
    CLI->>FS: snapshot graphify artifacts
    CLI->>FS: hide/remove artifacts
    CLI->>Baseline: run user --exec (baseline)
    Baseline->>API: invoke model (no MCP tools)
    API-->>Baseline: response + trailing JSON usage
    Baseline->>CLI: emit/parse Anthropic usage
    CLI->>FS: restore artifacts
    CLI->>Graphify: run user --exec (graphify)
    Graphify->>API: invoke model (with MCP tools)
    API-->>Graphify: response + trailing JSON usage
    Graphify->>CLI: emit/parse Anthropic usage
    CLI->>Reporter: compute reductions (turns, tokens, cost)
    Reporter-->>User: display native-agent compare summary
```
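The "emit/parse Anthropic usage" steps in the diagram scan the runner's stdout for the trailing JSON result event. Here is a sketch of that parse, assuming the event shape shown in the committed session artifacts; the real helper is parseAnthropicResultEvent in src/infrastructure/compare.ts, and this simplified stand-in deliberately uses a different name:

```ts
interface AnthropicUsage {
  input_tokens: number
  cache_creation_input_tokens: number
  cache_read_input_tokens: number
  output_tokens: number
}

interface AnthropicResultEvent {
  num_turns: number
  duration_ms: number
  total_cost_usd: number
  usage: AnthropicUsage
}

// Walk stdout from the last line backwards and return the first JSON object
// carrying a usage block; undefined lets the caller fall back to a runner_error report.
export function parseTrailingResultEvent(stdout: string): AnthropicResultEvent | undefined {
  for (const line of stdout.trim().split('\n').reverse()) {
    try {
      const event: unknown = JSON.parse(line)
      if (typeof event === 'object' && event !== null && 'usage' in event) {
        return event as AnthropicResultEvent
      }
    } catch {
      // not a JSON line; keep scanning upwards
    }
  }
  return undefined
}
```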

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

  • feat: support Gemini compare usage capture #19: Extends compare logic to parse runner stdout and capture provider-reported usage — closely related to the native_agent usage-capture additions in src/infrastructure/compare.ts.
  • Feature/competitive roadmap #9: Modifies MCP tool definitions and stdio handling — relevant because this PR adds tool-profile gating that filters/controls those tools.
  • feat: ship workflow hardening #25: Updates MCP tool-selection logic and definitions — overlaps with the new CORE_TOOL_NAMES, activeMcpTools, and stdio-server gating.

Poem

🐰 I hid the graph, then ran it twice,
measured tokens, turns, and price.
Six tools light, or full to call,
snapshots dance and benchmarks fall.
A hop for truth — metrics aligned! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 3.03%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'v0.10.1 — honest numbers + lean MCP profile' directly summarizes the main changes: version bump, honest benchmark replacement, and lean MCP profile introduction. |
| Description check | ✅ Passed | The PR description fully covers all required template sections with concrete details on changes, testing procedures, and results, plus comprehensive context on implementation and validation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |





mohanagy (Owner, Author) commented May 1, 2026

Final acceptance check

```text
=== EVAL GATE ===
Recall:           100.0%
MRR:              1.000
Snippet coverage: 100.0%
- backend: recall 100%, MRR 1.000, snippets 100%, grounded 100%
- general: recall 100%, MRR 1.000, snippets 100%, grounded 100%

=== TOOL PROFILE COUNT ===
core: 6 full: 21

=== HOOK PAYLOAD CHECK ===
OK: hook contains "3x fewer turns", no stale 384x/897x claims

=== COMPARE NATIVE_AGENT SMOKE ===
[graphify compare] completed 1 native_agent question(s)
- "What is the cluster module?"
    num_turns: baseline 9 → graphify 3 (3x fewer)
    latency:   baseline 96368ms → graphify 34744ms (2.77x faster)
    input_tokens (Anthropic-reported): baseline 615190 → graphify 233508 (2.63x less)

=== verify.sh (against committed evidence) ===
baseline_total_input_tokens : 615190
graphify_total_input_tokens : 233508
input_token_reduction       : 2.63x
num_turns_reduction         : 3x
latency_reduction           : 2.77x
baseline_total_cost_usd     : $0.62
graphify_total_cost_usd     : $0.70

=== package.json + npm pack ===
version: 0.10.1
filename: mohammednagy-graphify-ts-0.10.1.tgz
total files: 302  package size: 335.0 kB
```

All ship-readiness criteria from the plan are satisfied:

  • ✅ Eval gate green (Recall ≥95, MRR ≥0.95, Snippet coverage ≥95)
  • ✅ core profile returns exactly 6 tools, full returns 21
  • ✅ Decoded hook contains 3x fewer turns, contains no 384x / 897x
  • ✅ compare --baseline-mode native_agent produces a report with both Anthropic-reported usage blocks and computed reductions matching the public artifact
  • ✅ docs/benchmarks/2026-04-30-govalidate/verify.sh exits 0 on the committed evidence and prints totals matching the README
  • ✅ package.json is 0.10.1, CHANGELOG.md has the dated ## [0.10.1] - 2026-05-01 entry
  • ✅ npm pack --dry-run succeeds

Note: the public artifact uses 233,508 total input tokens for the graphify run (the exact sum of 13 + 92,833 + 140,662 from the committed usage block) rather than the original spec's headline 234,308. This was a discrepancy in the spec — using the JSON-derived sum makes the README, why-graphify.md, the CHANGELOG, and verify.sh self-consistent.

coderabbitai (bot) left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
tests/fixtures/mock-claude-runner.mjs (1)

23-57: ⚡ Quick win

Reduce benchmark-number drift by sourcing fixture payloads from artifact JSON.

The baseline/graphify numeric blocks are duplicated from docs/benchmarks/2026-04-30-govalidate/*.json. Loading those files here would keep smoke fixtures aligned automatically as artifacts evolve.

Proposed refactor sketch

```diff
 import { existsSync, readFileSync } from 'node:fs'
+import { dirname, resolve } from 'node:path'
+import { fileURLToPath } from 'node:url'
@@
-const baseline = {
-  ...
-}
-
-const graphify = {
-  ...
-}
+const here = dirname(fileURLToPath(import.meta.url))
+const benchmarkDir = resolve(here, '../../docs/benchmarks/2026-04-30-govalidate')
+const baselineArtifact = JSON.parse(readFileSync(resolve(benchmarkDir, 'baseline-session.json'), 'utf8'))
+const graphifyArtifact = JSON.parse(readFileSync(resolve(benchmarkDir, 'graphify-session.json'), 'utf8'))
+
+const baseline = {
+  ...baselineArtifact,
+  result: `mock baseline answer for prompt of length ${prompt.length}`,
+}
+const graphify = {
+  ...graphifyArtifact,
+  result: `mock graphify answer for prompt of length ${prompt.length}`,
+}
```
🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/mock-claude-runner.mjs` around lines 23 - 57, Replace the
duplicated hard-coded numeric fixture blocks by loading the corresponding
artifact JSON(s) at runtime and mapping their fields into the existing baseline
and graphify objects: read and parse the benchmark artifact JSON(s) referenced
in docs/benchmarks/2026-04-30-govalidate, extract fields for duration_ms,
duration_api_ms, num_turns, result (use prompt.length when composing result
strings), session_id, total_cost_usd and the usage subfields (input_tokens,
cache_creation_input_tokens, cache_read_input_tokens, output_tokens), and assign
them into the existing baseline and graphify variables (keeping their keys
unchanged); add a small fallback/default behavior if the artifact file is
missing or malformed so tests still run.
```
🤖 Prompt for all review comments with AI agents

```text
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/benchmarks/2026-04-30-govalidate/README.md`:
- Around line 26-29: The fenced code block containing the token calculations
(the lines starting with "baseline_total_input_tokens = 14 + 40,648 + 574,528 =
615,190" and "graphify_total_input_tokens = 13 + 92,833 + 140,662 = 233,508")
needs a language tag to satisfy markdownlint MD040; edit the README.md fenced
block to start with ```text instead of ``` so the block is explicitly marked as
plain text.
- Around line 57-60: Update the README reproduction steps to use the current CLI
flow: replace the invalid command string "graphify-ts claude install --project
/path/to/your/repo" with instructions to cd into the target repository and run
"graphify-ts claude install" (i.e., run the install command from the target repo
directory rather than passing a --project flag); update the two-line snippet
under the graph generation steps so it reflects "graphify-ts generate
/path/to/your/repo" followed by the corrected install invocation.

In `@examples/why-graphify.md`:
- Line 22: Update the paragraph that tells users to set
GRAPHIFY_TOOL_PROFILE=full in .mcp.json to also mention alternative MCP config
locations for non-Claude installs: explicitly list .cursor/mcp.json and
.vscode/mcp.json so Cursor and Copilot users are directed to the correct file,
and ensure the sentence that names Cursor and Copilot references these
alternative paths (e.g., "set GRAPHIFY_TOOL_PROFILE=full in .mcp.json (or
.cursor/mcp.json / .vscode/mcp.json for Cursor/Copilot installs)").

In `@src/infrastructure/install.ts`:
- Around line 499-505: Existing server config env is always overwritten with env
= { GRAPHIFY_TOOL_PROFILE: 'core' } which drops user custom entries and silently
downgrades a preexisting GRAPHIFY_TOOL_PROFILE='full'; instead, detect and merge
with any existing env block before writing: read the current server config's env
(e.g., existingEnv), build mergedEnv = { GRAPHIFY_TOOL_PROFILE: 'core',
...existingEnv } so existing keys (including a user-set GRAPHIFY_TOOL_PROFILE)
override the default, then use mergedEnv for serverConfig in place of the
hardcoded env; apply this in the code that creates serverConfig (the block
referencing env, serverConfig, isVscode, npxCommand, npxArgs).

In `@src/runtime/stdio-server.ts`:
- Around line 553-558: Update the error text returned in the branch that checks
if a tool is disabled (the block that uses toolName, isCoreToolName(profile),
failure(id, JSONRPC_METHOD_NOT_FOUND, ...)) to use a generic, profile-agnostic
message; replace the hard-coded reference to "'core' profile" and ".mcp.json"
with something like "Tool '<toolName>' is not enabled in the active profile."
and optionally suggest checking the application's MCP/configuration, ensuring
the change is made where isCoreToolName is evaluated so all callers (Cursor,
Copilot, etc.) get the generic message.
```
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f7b4a1d-7a20-4f03-9ea5-3365b6703b88

📥 Commits

Reviewing files that changed from the base of the PR and between b607208 and 298d142.

📒 Files selected for processing (23)
  • CHANGELOG.md
  • README.md
  • docs/benchmarks/2026-04-30-govalidate/README.md
  • docs/benchmarks/2026-04-30-govalidate/baseline-session.json
  • docs/benchmarks/2026-04-30-govalidate/graphify-session.json
  • docs/benchmarks/2026-04-30-govalidate/verify.sh
  • examples/why-graphify.md
  • package.json
  • src/cli/main.ts
  • src/cli/parser.ts
  • src/infrastructure/compare.ts
  • src/infrastructure/install.ts
  • src/runtime/stdio-server.ts
  • src/runtime/stdio/definitions.ts
  • tests/fixtures/mock-claude-runner.mjs
  • tests/unit/benchmark-artifact.test.ts
  • tests/unit/cli.test.ts
  • tests/unit/compare-native-agent.test.ts
  • tests/unit/install-templates.test.ts
  • tests/unit/install.test.ts
  • tests/unit/stdio-server.test.ts
  • tests/unit/stdio-tool-profile.test.ts
  • tests/unit/why-graphify-doc.test.ts

Comment threads: docs/benchmarks/2026-04-30-govalidate/README.md (outdated, ×2), examples/why-graphify.md (outdated), src/infrastructure/install.ts, src/runtime/stdio-server.ts.

mohanagy added 1 commit May 1, 2026
- src/infrastructure/install.ts: real bug — installMcpServer was
  overwriting the env block on re-install, silently downgrading a
  user-customized GRAPHIFY_TOOL_PROFILE=full back to 'core' and
  dropping unrelated user-set env keys (e.g. HTTP_PROXY). Now reads
  the existing server config's env (if any) and merges with the
  defaults so user values win. Test in install.test.ts covers the
  reinstall round-trip.
- src/runtime/stdio-server.ts: gating error message no longer
  hardcodes "'core' profile" / ".mcp.json" — it's now profile- and
  client-agnostic and lists the three supported MCP config locations
  (.mcp.json / .cursor/mcp.json / .vscode/mcp.json) so Cursor and
  Copilot users see the right path.
- examples/why-graphify.md: extend the GRAPHIFY_TOOL_PROFILE=full
  pointer to mention .cursor/mcp.json and .vscode/mcp.json
  alongside .mcp.json.
- docs/benchmarks/2026-04-30-govalidate/README.md: tag the math
  fenced block as 'text' (markdownlint MD040); rewrite the
  reproduction recipe to use the actual CLI flow ('cd /path/to/repo
  && graphify-ts claude install') instead of the invalid
  '--project /path/to/repo' flag.
- tests/fixtures/mock-claude-runner.mjs: load duration_ms,
  num_turns, total_cost_usd, and the usage block from
  docs/benchmarks/2026-04-30-govalidate/{baseline,graphify}-session.json
  at runtime so the smoke fixture and the public artifact stay in
  sync. Falls back to inline defaults when the artifact is missing.

All 1236 tests pass; native_agent smoke test still emits 3x/2.77x/2.63x
matching the public artifact's verify.sh output.
mohanagy (Owner, Author) commented May 1, 2026

Addressed CodeRabbit feedback in 811adc5

| # | Comment | Fix |
| --- | --- | --- |
| 1 | docs/benchmarks/.../README.md: MD040 — math fenced block missing language tag | Tagged the fenced block as `text` |
| 2 | docs/benchmarks/.../README.md: reproduction step uses the non-existent --project flag | Rewrote to `cd /path/to/your/repo && graphify-ts generate . && graphify-ts claude install` |
| 3 | examples/why-graphify.md only mentions .mcp.json for opting into the full profile | Now lists .mcp.json / .cursor/mcp.json / .vscode/mcp.json |
| 4 | src/infrastructure/install.ts overwrote the env block on re-install, silently downgrading a user-set GRAPHIFY_TOOL_PROFILE=full and dropping other user-set env keys | Real bug — fixed via merge: `env = { GRAPHIFY_TOOL_PROFILE: 'core', ...existingEnv }`. Test added in install.test.ts covers the round-trip (sketch below) |
| 5 | src/runtime/stdio-server.ts gating error hardcodes 'core' profile and .mcp.json | Generic message naming the active profile; lists all three supported MCP config locations |
| 6 | (Nitpick) tests/fixtures/mock-claude-runner.mjs duplicated benchmark numbers | Now loads from docs/benchmarks/2026-04-30-govalidate/*.json at runtime; falls back to inline defaults if the artifact is missing |
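The spread order in the fix for comment 4 is what makes user values win. A minimal sketch, with the function name and type alias as illustrative stand-ins:

```ts
type McpServerEnv = Record<string, string>

// Defaults first, existing entries last: a user-set GRAPHIFY_TOOL_PROFILE=full
// (or an unrelated key like HTTP_PROXY) survives a re-install.
export function mergeServerEnv(existingEnv?: McpServerEnv): McpServerEnv {
  return { GRAPHIFY_TOOL_PROFILE: 'core', ...existingEnv }
}

// mergeServerEnv()                                    => { GRAPHIFY_TOOL_PROFILE: 'core' }
// mergeServerEnv({ GRAPHIFY_TOOL_PROFILE: 'full' })   => { GRAPHIFY_TOOL_PROFILE: 'full' }
// mergeServerEnv({ HTTP_PROXY: 'http://proxy:8080' }) => both keys preserved
```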

Skipping: the docstring-coverage warning (3.13% vs CodeRabbit's 80% target) — the codebase doesn't follow that convention and adding boilerplate JSDoc to every function would be noise.

Verification re-run:

  • npm run typecheck — clean
  • npm run test:run — 1236/1236 passing (added 2 new tests for the env-merge fix)
  • Smoke test still emits 3x turns / 2.77x faster / 2.63x tokens with the artifact-loaded fixtures

mohanagy (Owner, Author) commented May 1, 2026

@coderabbitai review

coderabbitai (bot) commented May 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant