feat(PF-691): codegraph diff <oldDb> <newDb> DB-vs-DB structural primitive#39
Conversation
…imitive First consumer of the PF-690 fingerprint columns (PR #38). Compares two `.codegraph/codegraph.db` files and reports added/removed/changed nodes and edges, with `changedFields` enumerating which tracked column drifted on each change (astHash → body change, sigHash/signature → contract change, qualifiedName → in-file move, etc.). ## Design (council RFC) CodeGraph stays VCS-agnostic — the diff operates on already-built DB files, the caller handles git checkouts. Calling `git stash` / `git checkout` from the indexer would be destructive (agy: "massive anti-pattern"). Both council members (Codex + agy) converged on DB-vs-DB; this is that primitive. - Nodes matched by stable `id` (deterministic from filePath + kind + name + line via `generateNodeId`). Renames and file moves surface as removed-old + added-new pairs. - Edges matched by canonical identity `(source, target, kind, line, col)` per the PF-625 UNIQUE INDEX, with NULL line/col folding to -1. ## Read-only safety (Codex round 1 BLOCKER, fixed in round 2) `diffDatabases` opens both DBs via `new DatabaseSync(\`file:${escaped}?immutable=1\`, { readOnly: true })`, not the production `DatabaseConnection.open` path. The latter sets `journal_mode = WAL` and runs forward migrations — both mutate the file on disk, which is unacceptable for a tool whose purpose is comparing historical snapshots. `immutable=1` tells SQLite the file will never change, skipping all locking and -shm/-wal sidecar creation. The regression test `does not mutate either DB or create WAL sidecars` snapshots size + mtime of `db`, `db-wal`, `db-shm`, `db-journal` before diff and asserts they all match after. ## v5 schema fallback `loadNodes` probes `PRAGMA table_info('nodes')` and projects missing fingerprint columns as `NULL AS <col>`, so diffing a pre-PR-#38 v5 backup against a v6 DB works cleanly — fingerprint fields just don't appear in `changedFields` for legacy rows. Regression test: `handles legacy schema (no fingerprint columns) without failing`. ## Surface - `codegraph diff <oldDb> <newDb> [-j|--json]` CLI subcommand - `diffDatabases(oldPath, newPath): DiffResult` library export from `src/diff.ts` - `schemas/cli/diff.json` JSON Schema (envelope.tool enum widened to include diff/duplicates/explain for PRs #39-#41) - 9 unit tests covering identical/added/removed/body-change/signature- change/summary-counts/missing-DB/read-only-no-mutation/v5-fallback ## Validation - `npx tsc --noEmit`: clean - `npx vitest run`: 1040 passed | 2 skipped | 60 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 ## Reviewer trail - Codex pass 1 → BLOCKER on read-only safety - Round 2 closure → "closed — proceed to PR" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63bc19da1e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const oldDb = openReadOnly(oldDbPath); | ||
| const newDb = openReadOnly(newDbPath); |
There was a problem hiding this comment.
Ensure old DB is closed if new DB open fails
Open oldDb and newDb before entering the try means an exception from openReadOnly(newDbPath) (e.g., unreadable path or SQLite open error) skips the finally block entirely, leaving the already-open old handle unclosed. In long-lived consumers of diffDatabases this can leak file descriptors and retain SQLite file locks (notably preventing DB file operations on Windows) until process exit.
Useful? React with 👍 / 👎.
…rint coverage Council double-Codex review (passes A + B) flagged real gaps in PR #39's scope. Addressed BLOCKERs + top REVIEWs: ## BLOCKER fixes - **File-level diff missing**: a file added/removed with zero indexed symbols silently produced empty output. Added `addedFiles`, `removedFiles`, `changedFiles` arrays via the `files` table. changedFiles surfaces `contentHash`, `size`, `nodeCount` drift. - **Non-tree-sitter NULL fingerprints**: Liquid/Vue/Svelte/YAML extractors emit nodes with NULL `ast_hash`, so body-level changes in those file types silently report "unchanged". Added `fingerprintCoverage: { old, new }` counter so consumers can detect the gap. CLI prints a yellow warning when coverage is partial. - **Line-sensitive node IDs**: documented in source as a known pre-existing CodeGraph design decision (not introduced by this PR); downstream consumers wanting move-stable matching should diff by (filePath, qualifiedName) after consuming this output. ## REVIEW fixes - **Edge metadata never compared**: edges with the same canonical identity but different `metadata`/`provenance` were silently matched as "unchanged", even though `codegraph explain` surfaces exactly those fields. Added `changedEdges: EdgeChange[]` with `changedFields` listing which of metadata/provenance differs. - **`-1` sentinel leaked into JSON output**: internal Map keys still use -1 as the NULL bucket (matches PF-625 UNIQUE INDEX), but `edgeIdentityForOutput` now preserves `null` in the public `EdgeIdentity` shape. - **Read-only validation deferred**: `DatabaseSync` opens lazily, so a non-SQLite `.db` file would error at first query with a raw SQLITE_NOTADB message. `openReadOnly` now runs `SELECT 1` to force validation, then wraps as `Invalid CodeGraph database at <path>`. - **`codegraph diff` missing from CLI top-of-file usage block**. Added. - **Text output truncation silent**: now prints "+N more (use --json for full output)" footer. ## Schema + tests - `schemas/cli/diff.json` widened: requires the 4 new fields, declares `nullable line/col` correctly, includes `fingerprintCoverage` and `fileChange`/`edgeChange` definitions. - 5 new tests: null-instead-of-`-1`, file-level add/remove/change, edge metadata drift via hand-built fixture, corrupt-DB wrapper, expanded summary-count invariant. ## Validation - `tsc --noEmit`: clean - `vitest run`: 1045 passed | 2 skipped | 60 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third consumer of the CLI envelope/cli-tools family, completing the trio with diff (PR #39) and duplicates (PR #40). Surfaces every persisted provenance breadcrumb for a single edge: - Edge identity (source, target, kind, line, col) - Source + target node snapshots - Extractor provenance: `edges.provenance` column (tree-sitter / scip / heuristic) - Resolver provenance: `metadata.resolvedBy` strategy tag (import / framework / qualified-name / exact-match / instance-method / file-path / fuzzy) + `metadata.confidence` - Full raw metadata JSON for forward-compat Stacked on PR #40 (which stacks on PR #39). ## Surface ``` codegraph explain <edgeId> # happy path codegraph explain --source <id> --target <id> \ # rebuild-stable canonical --kind <k> [--line N] [--col N] codegraph explain <edgeId> --json # full envelope ``` Default text output uses `formatExplainNarrative` for a concise multi-line trace; `--json` emits the full payload via `cliJsonEnvelope('explain', …)`. ## Council RFC outcome (Codex) | Fork | Decision | |---|---| | A: identifier surface | **Both** — positional `<edgeId>` + canonical flags | | B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through | | C: output shape | **JSON + text narrative** (terminal use case dominates) | | D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode | | E: ambiguous canonical lookup | **Error with `--line N --col N` hint** | ## Edge.id exposure `Edge` gains an optional `id?: number` field with a docstring noting it's index-local (resets on rebuild). `rowToEdge` propagates it from the DB row. `collectGraphRelations` includes `edgeId` in the caller/callee JSON entries when present. `schemas/cli/callers.json` and `schemas/cli/callees.json` document the new optional field. ## Read-only safety `pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays bit-for-bit unchanged. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal. ## Schema gate No v6 requirement — `provenance`/`metadata` columns existed in v5. `assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes` tables via a single `WHERE name IN (…)` probe. Missing resolver metadata is tolerated: `resolvedBy: null`, `confidence: null`, `metadata: null` — explain explains what's there. ## Tests (10 unit + 1 schema validation) - Integer-id lookup surfaces resolver provenance - Canonical lookup by (source, target, kind) - Ambiguous canonical → throws with `--line` hint - No-match canonical → throws with full identifier in message - Missing edge id → throws "no edge with id N" - Non-positive / non-integer edge id → rejected before DB open - Narrative format contains expected keywords - Non-CodeGraph SQLite file → "not a CodeGraph database" - Read-only invariant: DB + sidecars unchanged after explain - Missing DB path → "not found" ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER) - `--kind` should be required for canonical lookup - `assertCodeGraphDb` should check both tables - `confidence` should be clamped to [0, 1] - Document `edgeId` in callers/callees schemas - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 62 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codex PR review P2: putting both `openReadOnly` calls before the `try` block leaked the old DB's SQLite handle when the second open failed (unreadable path, corrupt file, etc.). Wrap newDb open in a nested try/catch that closes oldDb before re-throwing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third consumer of the CLI envelope/cli-tools family, completing the trio with diff (PR #39) and duplicates (PR #40). Surfaces every persisted provenance breadcrumb for a single edge: - Edge identity (source, target, kind, line, col) - Source + target node snapshots - Extractor provenance: `edges.provenance` column (tree-sitter / scip / heuristic) - Resolver provenance: `metadata.resolvedBy` strategy tag (import / framework / qualified-name / exact-match / instance-method / file-path / fuzzy) + `metadata.confidence` - Full raw metadata JSON for forward-compat Stacked on PR #40 (which stacks on PR #39). ## Surface ``` codegraph explain <edgeId> # happy path codegraph explain --source <id> --target <id> \ # rebuild-stable canonical --kind <k> [--line N] [--col N] codegraph explain <edgeId> --json # full envelope ``` Default text output uses `formatExplainNarrative` for a concise multi-line trace; `--json` emits the full payload via `cliJsonEnvelope('explain', …)`. ## Council RFC outcome (Codex) | Fork | Decision | |---|---| | A: identifier surface | **Both** — positional `<edgeId>` + canonical flags | | B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through | | C: output shape | **JSON + text narrative** (terminal use case dominates) | | D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode | | E: ambiguous canonical lookup | **Error with `--line N --col N` hint** | ## Edge.id exposure `Edge` gains an optional `id?: number` field with a docstring noting it's index-local (resets on rebuild). `rowToEdge` propagates it from the DB row. `collectGraphRelations` includes `edgeId` in the caller/callee JSON entries when present. `schemas/cli/callers.json` and `schemas/cli/callees.json` document the new optional field. ## Read-only safety `pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays bit-for-bit unchanged. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal. ## Schema gate No v6 requirement — `provenance`/`metadata` columns existed in v5. `assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes` tables via a single `WHERE name IN (…)` probe. Missing resolver metadata is tolerated: `resolvedBy: null`, `confidence: null`, `metadata: null` — explain explains what's there. ## Tests (10 unit + 1 schema validation) - Integer-id lookup surfaces resolver provenance - Canonical lookup by (source, target, kind) - Ambiguous canonical → throws with `--line` hint - No-match canonical → throws with full identifier in message - Missing edge id → throws "no edge with id N" - Non-positive / non-integer edge id → rejected before DB open - Narrative format contains expected keywords - Non-CodeGraph SQLite file → "not a CodeGraph database" - Read-only invariant: DB + sidecars unchanged after explain - Missing DB path → "not found" ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER) - `--kind` should be required for canonical lookup - `assertCodeGraphDb` should check both tables - `confidence` should be clamped to [0, 1] - Document `edgeId` in callers/callees schemas - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 62 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second consumer of PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Replaces auto-closed PR #40 (closed when PR #39's base branch was deleted on merge). Combines the original feat commit + round 2 fixes (fileCount, coveredByExactGroup, fingerprint coverage gate, human-meaningful sort) + the Codex-PR-review fixes (shapeNodes double-count + unfingerprintable-kind error message). ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` ## Read-only safety Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39. ## Schema gate - v5 DBs (no fingerprint columns): "duplicates requires schema v6+". - Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0): "Run `codegraph index` to refresh fingerprints" — but ONLY when the DB has no fingerprints anywhere. If other kinds have fingerprints but the requested kinds don't, surface "framework-extractor nodes aren't fingerprinted; try --kind=function,method". ## Output shape - `groups[]`: each group has `kind`, `fingerprint`, `members[]`, `fileCount`, `coveredByExactGroup`. - `summary`: includes `fingerprintCoverage: { eligible, withAstHash }` so consumers can detect partial coverage. - Sort: member count DESC → max line span DESC → first member filePath ASC → startLine ASC → fingerprint ASC (deterministic, human-meaningful). - `shapeNodes` counts ONLY shape-only members (not double-counted with `exactNodes`). ## Tests 19 unit tests + 1 schema validation test covering: defaults, exact + shape group detection, min-lines filter, sort (primary through tertiary), suppression rule, empty --kind error, v5 schema rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage exposure, fileCount, coveredByExactGroup, unfingerprintable-kind distinction, shapeNodes double-count fix, read-only invariant. ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 + round 2 closure (subdir resolution, --min-lines strict parsing, comment wording, URI escaping, sort coverage) - Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes double-count, unfingerprintable-kind error message) - Council double-Codex deep review → confirmed all BLOCKERs fixed ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 61 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Third consumer of PF-690 fingerprint columns + first consumer of edge resolver provenance. Replaces auto-closed PR #41 (closed when its base branch was deleted on PR #39/#42 merge); same content rebased onto main. ## Surface ``` codegraph explain <edgeId> # happy path codegraph explain --source <id> --target <id> --kind <k> # canonical fallback [--line N] [--col N] # disambiguators codegraph explain <edgeId> --json ``` ## Honest scope `traceAvailable: false` is locked at false in the output — the resolver in `src/resolution/` discards loser strategy attempts, so only the winning resolver tag + confidence survive. The tool surfaces persisted breadcrumbs, NOT a causal explanation. A future PR will persist `edge_resolution_traces` and flip this to `true`. ## Round-trip stability `callers --json` / `callees --json` now expose `edge: { kind, source, target, line, col }` per relation, so users can pipe output into `codegraph explain --source/--target/--kind` even across rebuilds (when edgeId is invalidated). ## Read-only safety Same `pathToFileURL + '?immutable=1'` URI pattern as PRs #39/#42. ## Schema gate No v6 requirement — provenance/metadata columns existed in v5. `assertCodeGraphDb` checks both `edges` AND `nodes` tables in a single query; rejects with `not a CodeGraph database` if either is missing. ## Output fields - Edge identity (source, target, kind, line, col) - Source + target node snapshots (qualifiedName, filePath, startLine) - `extractorProvenance` (tree-sitter / scip / heuristic) - `resolvedBy` strategy tag + `confidence` ∈ [0, 1] - `metadata` (object form) + `rawMetadata` (string form for arrays/scalars/malformed — PR review fix so non-object metadata isn't silently dropped) - `traceAvailable` (always `false` today; honest signal) ## Edge.id exposure `Edge` gains optional `id?: number` field (index-local, resets on rebuild). `rowToEdge` propagates it. `collectGraphRelations` adds `edgeId` + `edge` object to JSON output. ## Tests 12 unit tests + 1 schema validation test covering: integer-id lookup with provenance assertions, canonical lookup, ambiguous canonical with `--line` disambiguation (hand-built fixture, no skip-on-precondition), no-match, missing-id, invalid-id rejection, narrative format, non-CodeGraph DB rejection, read-only invariant, missing DB, `traceAvailable: false` locked, non-object metadata exposed via `rawMetadata`. ## Reviewer trail - Codex RFC → locked 5 design forks (positional+canonical identifier, edge.id exposure, JSON+text, skip --rerun, ambiguity error) - Codex pass 1 + round 2 closure (kind required, dbCheck both tables, confidence clamping, edgeId schema docs) - Council double-Codex deep review → 2 BLOCKERs addressed (canonical round-trip needed edge object in callers JSON; traceAvailable honesty) - Codex PR review on round 2 → no findings ## Validation - `tsc --noEmit`: clean - `vitest run`: 1078 passed | 2 skipped | 62 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
… v0.10.0 (#44) Closes the MCP gap from PRs #39/#42/#43: those landed CLI commands + library exports but didn't register the three new tools with the MCP server. Agents using Claude Code / Cursor / opencode etc. via MCP saw only the original 9 codegraph_* tools. This PR exposes all 12. ## What's added - **3 MCP tool descriptors** with input schemas + agent-facing descriptions: `codegraph_diff`, `codegraph_duplicates`, `codegraph_explain`. Each description explains scope, defaults, and round-trip flow (e.g., "get edgeId from codegraph_callers JSON output, pass it here"). - **3 dispatcher cases** in `ToolHandler.execute()`. - **3 handlers** (`handleDiff`, `handleDuplicates`, `handleExplain`) with their own markdown formatters: `formatDiffResult`, `formatDuplicatesResult`, `formatExplainMcp`. - **`resolveDbPathReadOnly(projectPath)`** helper — runs the project-access policy + walks up to find `.codegraph/` WITHOUT opening the DB in WAL mode (which would mutate it). Critical for diff: opening via the production `DatabaseConnection.open` path would defeat the immutable-snapshot guarantee. ## Honest-scope signals carried through - `codegraph_explain` MCP output always renders a "Scope note" block calling out `traceAvailable: false` — the resolver still discards loser strategy attempts. - `codegraph_diff` MCP output renders the `fingerprintCoverage` warning when Liquid/Vue/Svelte gaps make body-level diffs blind. - `codegraph_duplicates` MCP output includes coverage + kinds + min-lines context in every response (including zero-result). ## Codex review fixes applied - **BLOCKER**: `resolveDbPathReadOnly` skipped the access check for non-existent paths AND never re-checked `resolvedRoot`. A caller passing `/disallowed/repo/nonexistent-child` could `findNearestCodeGraphRoot` walk UP to `/disallowed/repo` and return its DB — bypassing the allowlist. Fix: explicit `projectAccess.check(resolvedRoot)` after root resolution, matching the post-cache logic in `getCodeGraph()`. Regression test installs an allowlist excluding the fixture root and passes a non-existent child of it; expects access denial. - **REVIEW**: `Number(args.maxChangedNodes) || 20` treated `maxChangedNodes: 0` as the default. Switched to explicit `undefined ? default : Number(...)` so 0 means "render no detail rows" as the formatter supports. - **NITPICK**: tightened `traceAvailable: false` test to assert the dedicated `Scope note ... traceAvailable: false` block rather than any mention. Codex round-2 closure hit a sandbox issue (stale `/tmp/papersflow-pf-694` workspace from another session caused it to review unrelated convex code). The two fixes are mechanically obvious; tests pin both. ## Version bump 0.9.4 → **0.10.0** — minor bump for the new MCP surface + CLI commands from PRs #39/#42/#43. No breaking changes. ## End-to-end verification ``` echo '{"jsonrpc":"2.0","id":1,"method":"initialize",...} {"jsonrpc":"2.0","id":2,"method":"tools/list",...}' \ | codegraph serve --mcp --path /tmp/mcp-smoke ``` returns 12 tools, and a live `codegraph_duplicates` call against a real fixture returns a 1-group exact clone with markdown formatting. ## Tests 18 new MCP integration tests: - 5 tool descriptor + schema-shape contract tests - 3 codegraph_diff handler tests (happy path + missing arg + uninitialized project) - 3 codegraph_duplicates handler tests (real duplicates + malformed inputs + zero-result context) - 5 codegraph_explain handler tests (id lookup with traceAvailable, mixed-mode rejection, empty-input rejection, kind-required, non-positive id) - 1 dispatcher unknown-tool test - 1 BLOCKER regression test (allowlist + non-existent child → access denial on resolved root) ## Validation - `tsc --noEmit`: clean - `vitest run`: **1096 passed | 2 skipped | 63 files** - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 - MCP stdio smoke test: tools/list returns 12 tools incl. the new three; tools/call → codegraph_duplicates returns a real exact clone group from a fixture Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
First consumer of the PF-690 fingerprint columns (PR #38, schema v6). Compares two
.codegraph/codegraph.dbfiles and reports structural deltas at the node + edge level. Designed as a primitive: PRs #40 (codegraph duplicates) and #41 (codegraph explain) will share the same fingerprint-column substrate.id(deterministic from filePath + kind + name + line viagenerateNodeId). Renames + file moves naturally surface as removed-old + added-new pairs.(source, target, kind, line, col)per the PF-625 UNIQUE INDEX, with NULL line/col folding to -1.changedFieldsenumerates which tracked column drifted:astHash→ body change,sigHash/signature→ contract change,qualifiedName→ in-file move, etc. Downstream drift-detection tools key off these field names.Design — council RFC outcome
CodeGraph stays VCS-agnostic. The diff operates on already-built DB files; the caller handles git checkouts. Both Codex and agy converged on DB-vs-DB; calling
git stash/git checkoutfrom the indexer would be destructive (agy: "massive anti-pattern"). This PR ships that primitive.Read-only safety (Codex round 1 BLOCKER → fixed)
Round 1 flagged:
diffDatabasesgoing throughDatabaseConnection.openranjournal_mode = WAL+ forward migrations, both mutating the file. Fix: a dedicatedopenReadOnly()helper using SQLite'sfile:…?immutable=1URI:immutable=1tells SQLite the file will never change, skipping all locking and-shm/-walsidecar creation that would otherwise happen even for read-only access of a WAL-mode DB. Path escaping handles%/?/#. Regression testdoes not mutate either DB or create WAL sidecarssnapshots size + mtime ofdb,db-wal,db-shm,db-journalbefore diff and asserts they all match after.v5 schema fallback
Backups indexed before PR #38 lack the fingerprint columns.
loadNodesprobesPRAGMA table_info('nodes')and dynamically projects missing columns asNULL AS <col>, so v5 backups diff cleanly — fingerprint fields just don't appear inchangedFieldsfor legacy rows. Regression test:handles legacy schema (no fingerprint columns) without failinghand-builds two v5-shape SQLite files and diffs them.Files
src/diff.ts—diffDatabases(oldPath, newPath): DiffResultlibrary exportsrc/bin/codegraph.ts—codegraph diff <oldDb> <newDb> [-j]CLI subcommandschemas/cli/envelope.json—toolenum widened to includediff,duplicates,explain(PRs feat(PF-691): codegraph diff <oldDb> <newDb> DB-vs-DB structural primitive #39–feat(PF-693): codegraph explain single-edge resolution trace #41)schemas/cli/diff.json— JSON Schema for the diff envelope__tests__/diff.test.ts— 9 unit tests__tests__/cli-json-schemas.test.ts— adds the diff schema-validation testReviewer trail
BLOCKERdiffDatabasesnot read-only — opens viaDatabaseConnection.openwhich sets WAL + runs migrationsfile:…?immutable=1URI; added size+mtime+sidecar regression testREVIEWDatabaseConnectionno longer imported in diff; URI parameters honored by node:sqlite; v5 fallback test meaningfulTest plan
npx tsc --noEmit— cleannpx vitest run— 1040 passed | 2 skipped | 60 files, 0 failuresnpm run build— clean (Node 24)npm run test:eval:structural— required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0🤖 Generated with Claude Code