Skip to content

feat(PF-690): schema v6 + per-symbol fingerprint columns (duplicate/drift/diff foundation)#38

Merged
mbenhamd merged 2 commits into
mainfrom
feature/pf-690-fingerprint-columns
May 24, 2026
Merged

feat(PF-690): schema v6 + per-symbol fingerprint columns (duplicate/drift/diff foundation)#38
mbenhamd merged 2 commits into
mainfrom
feature/pf-690-fingerprint-columns

Conversation

@mbenhamd
Copy link
Copy Markdown
Owner

Summary

First slice of the trace/duplicate/drift roadmap that Codex + agy debated in the design RFC. Pure data infrastructure — no new CLI/MCP surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates), PR #41 (codegraph_explain) will consume these columns.

What's in this PR

src/extraction/fingerprints.ts (new, ~245 lines):

  • astHash (Type-1): normalized AST token stream, identifiers + literals preserved exactly. Detects "same code, only whitespace/comments differ".
  • astShapeHash (Type-2): identifier leaves in non-semantic positions replaced by _ID. Detects "same code, renamed locals". Member-access targets, callees, kwarg names, type names, import names preserved by parent-context check.
  • sigHash: SHA-256 of the signature string. Null when no signature.
  • Comment + whitespace stripped; trivia tokens excluded via namedChild walk.

Schema v6 migration + partial indexes on WHERE NOT NULL so duplicate-detection sweeps are O(log N) lookups. callPatternHash reserved for post-resolution population by a later PR.

Extraction wiring (tree-sitter.ts:createNode): hashes computed from the in-memory tree-sitter subtree. agy's RFC point — AST already in memory, hashing is microseconds — verified: 2.4s with vs 2.9s without on a 107-file corpus (overhead below run-to-run variance, well under Codex's ≤15% budget).

v1 contract (locked by 14 tests)

  • Detects: Type-1 (whitespace/comment insensitive), Type-2 (renamed-locals).
  • Does NOT detect: Type-3 (statement reorder), Type-4 (semantic equivalence).
  • Literal sensitivity: kept. 1024 vs 2048 is NOT a clone. Council explicitly accepted "miss literal-only differs" as the strongest counterpoint for security/config code.

Bug-pin trail

  • Codex pass 1 BLOCKER: tree-sitter-python parses obj.start() as attribute(identifier "obj", identifier "start") — type-only rename would conflate obj.start() with obj.stop().
    • Fixed via shouldPreserveIdentifier(node, parent) with rules for attribute / call.function field / keyword_argument / type_* / import_* parent contexts.
  • Codex round 2 REVIEW: Python kwarg g(start=1) was still conflating with g(stop=1).
    • Fixed by adding keyword_argument to SEMANTIC_PARENT_TYPES.
  • Codex round 3 NITPICK: comment misrepresented the kwarg over-preservation trade-off (preserves value-side identifiers too).
    • Reworded with explicit acknowledgment: tighter field-specific check deferred to follow-up.

Test plan

  • npx tsc --noEmit — clean
  • npm test1026 passed | 2 skipped (was 1012 on main; +14 fingerprint tests)
  • npm run test:eval:structural — 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs main baseline)
  • Index-time benchmark: 107-file corpus, 2.4s with fingerprints vs 2.9s without (within run-to-run noise, well under ≤15% target)
  • Bug-pin verification: each Codex finding has a corresponding regression test

Tests added (__tests__/fingerprints.test.ts, 14 cases)

  1. sigHash determinism + null on missing signature
  2. Determinism: same input → same hex
  3. Type-1: whitespace/comment edits preserve astHash
  4. Type-2: renamed locals share astShapeHash, NOT astHash
  5. Member rename diverges (TS property_identifier)
  6. Literal change diverges (security sensitivity pinned)
  7. Control-flow reorder diverges (Type-3 NOT detected, pinned)
  8. Python: obj.start() vs obj.stop() diverge (BLOCKER regression)
  9. Python bare callee: start() vs stop() diverge
  10. Python kwarg: g(start=1) vs g(stop=1) diverge (round-2 regression)
  11. Python param rename: same astShapeHash, different astHash
  12. Cross-language: TS ≠ Python even when semantically equivalent

Reviewer trail

  • Codex pass 1: 1 BLOCKER + 1 REVIEW + 1 NITPICK — all addressed.
  • Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python kwarg) + NITPICK addressed.
  • Codex round 3: Both round-2 findings CLOSED. Final NITPICK on doc comment fixed. Codex final call: "ship".
  • CodeRabbit CLI: free-tier limit hit earlier; relying on the GitHub app review after merge.

Merge handoff

Ready-to-merge. Branch feature/pf-690-fingerprint-columns. Schema migration is additive only (4 nullable columns, partial indexes). Existing databases auto-migrate on next open via runMigrations. New nodes get hashed at extract time; existing nodes get NULL until next reindex.

🤖 Generated with Claude Code

…e/drift/diff infrastructure

First slice of the trace/duplicate/drift roadmap that Codex + agy
debated in the design RFC. Pure data infrastructure — no new CLI/MCP
surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates),
and PR #41 (codegraph_explain) will consume these columns.

## What changed

- `src/extraction/fingerprints.ts` (new, ~245 lines): SHA-256 hashes
  computed from the in-memory tree-sitter subtree.
  - `astHash` (Type-1): normalized token stream with identifiers +
    literals preserved exactly. Detects "same code, only
    whitespace/comments differ".
  - `astShapeHash` (Type-2): identifier leaves in non-semantic
    positions replaced by `_ID`. Detects "same code, renamed
    locals". Property/field/type identifiers preserved by type;
    member-access targets, callees, kwarg names, type names,
    import names preserved by parent-context check.
  - `sigHash`: SHA-256 of the signature string. Null when no
    signature was extracted.
  - Comment + whitespace stripped; trivia tokens (commas,
    semicolons, braces) excluded via `namedChild` walk.

- Schema v6 migration (`src/db/migrations.ts:93-119`,
  `src/db/schema.sql`): adds 4 nullable columns to `nodes` table —
  `ast_hash`, `ast_shape_hash`, `sig_hash`, `call_pattern_hash`.
  Partial indexes on the two body hashes (`WHERE NOT NULL`) so
  duplicate-detection sweeps are O(log N) lookups instead of full
  scans. `callPatternHash` is reserved for post-resolution
  population by a later PR.

- Extraction wiring (`src/extraction/tree-sitter.ts:425-460`):
  computes the three body hashes from the already-parsed tree-sitter
  subtree inside `createNode`. agy's RFC point — that the AST is
  already in memory and hashing is microseconds — verified on a
  107-file codegraph src/ corpus: 2.4s with vs 2.9s without
  (overhead below run-to-run variance, well under Codex's ≤15%
  budget).

- `Node` interface (`src/types.ts:165-198`): adds nullable
  `astHash`, `astShapeHash`, `sigHash`, `callPatternHash` fields
  with provenance docstrings.

- `queries.ts` insertNode / updateNode / rowToNode: round-trip the
  fingerprint columns nullably so framework-extractor synthesized
  route nodes (no body) keep `null` fingerprints — downstream
  consumers filter with `WHERE ast_hash IS NOT NULL`.

## v1 contract (Council RFC, locked by tests)

- Detects: Type-1 (whitespace/comment-insensitive clones),
  Type-2 (renamed-locals clones).
- Does NOT detect: Type-3 (statement reorder), Type-4 (semantic
  equivalence).
- Literal values preserved → security/config code where the literal
  matters does not falsely conflate. Strongest counterpoint the
  council named ("miss literal-only differs") explicitly accepted.

## Bug-pin verified during review

Codex pass 1 caught a real BLOCKER: `tree-sitter-python` parses
`obj.start()` as `attribute(identifier "obj", identifier "start")`
(both children are plain `identifier`), so a type-only rename rule
would have conflated `obj.start()` with `obj.stop()`. Fix: parent-
context check (`shouldPreserveIdentifier`) preserves identifiers
in semantic positions — `attribute` children, `call.function`
field, `keyword_argument` children, types, imports. Codex round 2
caught a follow-on: Python kwargs (`g(start=1)`) — added
`keyword_argument` to the semantic-parent set.

## Tests (`__tests__/fingerprints.test.ts`, 14 cases)

- sigHash determinism + null on missing signature.
- Determinism: same input → same hex.
- Type-1: whitespace/comment edits preserve astHash.
- Type-2: renamed locals share astShapeHash, NOT astHash.
- Member rename diverges (TS `property_identifier` path).
- Literal change diverges (security sensitivity pinned).
- Control-flow reorder diverges (Type-3 NOT detected, pinned).
- Python regression: `obj.start()` vs `obj.stop()` diverge
  (member preserved despite both being `identifier`).
- Python bare callee: `start()` vs `stop()` diverge.
- Python kwarg: `g(start=1)` vs `g(stop=1)` diverge.
- Python param rename: same astShapeHash, different astHash.
- Cross-language: TS body ≠ Python body even when semantically
  equivalent.

## Reviewer trail

- Codex pass 1: 1 BLOCKER (Python member conflation) + 1 REVIEW
  (missing Python tests) + 1 NITPICK (stale comment).
- Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python
  kwarg conflation) + NITPICK (header repeated stale claim).
- Codex round 3: Both round-2 findings CLOSED. Last NITPICK
  (kwarg comment misrepresented the over-preservation trade-off).
  Codex authorized "iterate for the comment fix, then ship".
- Doc comment now accurately describes the trade-off: kwarg
  set-membership preserves any direct identifier leaf including
  value-side identifiers; tighter field-specific check deferred
  to a follow-up.

## Verification

- tsc --noEmit clean
- npm test: 1026 passed | 2 skipped (was 1012 on main; +14
  fingerprint tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00
  fp=0 (no regression vs main baseline)
- Index-time delta on 107-file corpus: 2.4s with vs 2.9s without —
  below run-to-run variance, well under ≤15% target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c869917e5f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +196 to +199
for (let i = 0; i < n.namedChildCount; i++) {
const child = n.namedChild(i);
if (child) walk(child, n);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include non-trivia anonymous tokens in fingerprint stream

emitTokenStream descends with namedChild() only, so anonymous tokens are dropped from both hashes. In tree-sitter, many semantic tokens (for example + vs -, == vs !=, and/or in some grammars) are anonymous children, which means code with different behavior can still produce identical astHash/astShapeHash. This creates false positives in the duplicate/drift foundation this commit introduces (e.g., return x + 1 and return x - 1 can collide), so the traversal needs to include non-trivia anonymous tokens rather than excluding all anonymous children.

Useful? React with 👍 / 👎.

Comment on lines +435 to +436
const astHash = computeAstHash(node, this.source);
const astShapeHash = computeAstShapeHash(node, this.source);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Hash declaration subtree instead of whichever node is passed

Fingerprints are computed from the raw node argument, but many existing createNode call sites pass identifier/name children (for example Lua and generic variable paths) rather than the full declaration subtree. For those symbols, astHash captures mostly the name token and astShapeHash often collapses to the same _ID pattern, causing unrelated symbols to cluster as duplicates and weakening drift/diff signals. The hash input should be canonicalized to the declaration/body node (or call sites must be normalized) before hashing.

Useful? React with 👍 / 👎.

Comment on lines +56 to +61
const COMMENT_NODE_TYPES = new Set<string>([
'comment',
'line_comment',
'block_comment',
'doc_comment',
'shebang_line',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat documentation_comment nodes as hash-trivia

The comment-trivia allowlist omits documentation_comment, so grammars that emit that node type will include doc-comment text in the token stream and change fingerprints on comment-only edits. That breaks the Type-1 contract introduced in this commit (whitespace/comment-insensitive hashing) and can fragment duplicate groups after purely documentation updates.

Useful? React with 👍 / 👎.

Comment on lines +435 to +436
const astHash = computeAstHash(node, this.source);
const astShapeHash = computeAstShapeHash(node, this.source);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply fingerprinting beyond the TreeSitter createNode path

Fingerprint population is wired only inside TreeSitterExtractor.createNode, so symbols emitted by non-tree-sitter extractors (the same extractFromSource pipeline also routes to custom extractors) are persisted with ast_*/sig_hash left null even when they carry signatures. This creates inconsistent coverage for the new duplicate/drift foundation because query behavior now depends on extractor path rather than symbol content.

Useful? React with 👍 / 👎.

…tion (Codex round 4)

Codex round 4 deep sweep verified via real `tree-sitter-wasms` parses
that the v1 fingerprint rule conflated semantically different code
in Ruby, Java, C#, and Rust. The original rule only handled Python's
`attribute` shape + the `call.function` field. Each of these four
languages emits plain `identifier` for member/callee positions but
under DIFFERENT parent node types:

  - Ruby:    `user.start` -> call(identifier "user", identifier "start"),
             method field carries the member name (not `function`).
  - Java:    `obj.start()` -> method_invocation(identifier, identifier).
  - C#:      `obj.Start()` -> invocation_expression > member_access_expression(identifier, identifier).
  - Rust:    `Router::new()` -> call_expression > scoped_identifier(identifier "Router", identifier "new").

Fix: extend `SEMANTIC_PARENT_TYPES` with `method_invocation`,
`member_access_expression`, `invocation_expression`,
`scoped_identifier`, `scoped_call_expression`, `field_expression`.
Add `call.method` field check to `shouldPreserveIdentifier` to
cover Ruby's dual-purpose `call` type. Same set-membership v1
trade-off applies (accepts false negative on receiver names rather
than risk semantic-name conflation).

Plus Codex round 4 REVIEW: migration v6 was not idempotent under
concurrent-open race. Two processes hitting a v5 database could
both read version 5, both enter migration, and the second's
`ALTER TABLE ADD COLUMN` would fail with duplicate-column even
though the resulting schema is fine. Fixed via `PRAGMA table_info`
pre-check per column so already-applied additions become no-ops.
`CREATE INDEX IF NOT EXISTS` was already idempotent.

Tests added (4 cross-language regressions):
- Ruby `user.start` vs `user.stop` -> different astShapeHash
- Java `obj.start()` vs `obj.stop()` -> different astShapeHash
- C# `obj.Start()` vs `obj.Stop()` -> different astShapeHash
- Rust `Router::new()` vs `Router::default()` -> different astShapeHash

Each pins the specific cross-language failure mode Codex verified.

Reviewer trail:
- Codex round 4 (deep sweep, 6 attack vectors): found 1 BLOCKER
  (cross-language conflation) + 1 REVIEW (migration race). Both
  fixed; remaining 4 vectors confirmed clean (hash determinism,
  persistence completeness, ERROR/MISSING handling, createNode
  hook coverage).
- CodeRabbit CLI: ran against the same diff, no findings.
- Claude Explore subagent: returned 7 findings; 3 already covered
  here (cross-language tests, migration safety), 4 deferred as
  documentation/contract clarifications (line-ending CRLF
  normalization, downstream kwarg trade-off doc, callPatternHash
  contract clarity, SQLite version compat — node:sqlite ships
  SQLite 3.42+ which fully supports partial indexes).

Verification:
- tsc --noEmit clean
- npm test: 1030 passed | 2 skipped (was 1026 last commit;
  +4 cross-language tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00
  fp=0 (no regression vs baseline)
- All 18 fingerprint tests pass deterministically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd merged commit 5fbc909 into main May 24, 2026
@mbenhamd mbenhamd deleted the feature/pf-690-fingerprint-columns branch May 24, 2026 12:18
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0be6c9ab23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/db/migrations.ts
Comment on lines +113 to +116
const cols = db
.prepare("PRAGMA table_info('nodes')")
.all() as Array<{ name: string }>;
const hasCol = (name: string): boolean => cols.some((c) => c.name === name);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Refresh column checks under write lock before ALTERs

The v6 migration computes cols once, before any ALTER TABLE, and then reuses that snapshot for all hasCol checks. With two processes opening a v5 DB concurrently, both transactions can run PRAGMA table_info('nodes') first (the adapter uses BEGIN/deferred transactions), then one process adds the columns and commits while the second still executes ALTER TABLE ... ADD COLUMN ast_hash from stale metadata, raising a duplicate-column error and aborting startup. To make this idempotent under concurrency, acquire a write lock before the check (e.g., immediate transaction) or re-check/catch duplicate-column failures per ALTER.

Useful? React with 👍 / 👎.

mbenhamd added a commit that referenced this pull request May 24, 2026
Second consumer of the PF-690 fingerprint columns (PR #38). Reports
Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups
under council-locked defaults: function+method kinds, ≥10 lines, shape
groups whose members already form an exact group are suppressed.

Stacked on top of PR #39 because both consume the same envelope-enum
widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)`
slot without duplicating the schema edit.

## Surface

```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

- `[path]` resolves through `resolveProjectPath` (matches `status`/
  `index`/`sync`), so running from a subdirectory walks up to the
  project root.
- `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would
  happily accept `10abc` / `1.5` / `+10`, hiding typos.

## Defaults locked by council RFC (Codex)

| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both |
| `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) |
| `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) |
| Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |

## Read-only safety (Codex round 1 BLOCKER pattern from PR #39)

Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical
Node API for filesystem-path → file:// URL conversion, correctly
escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved
URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar
creation even when the DB is in WAL mode.

## v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints`
probes `PRAGMA table_info('nodes')` and throws
`duplicates requires schema v6+ (PR #38 fingerprint columns).`
instead of silently returning empty groups.

## Tests (13 unit + 1 schema validation)

- Defaults exposed as constants for regression pinning
- Same-named functions → exact group (Type-1)
- Different-named identical bodies → shape group (Type-2)
- Single-symbol fingerprint → no group (HAVING > 1)
- `--min-lines` filters one-liners with default, surfaces with `--min-lines 1`
- Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary
- Shape group with member set ≡ exact group → suppressed
- Empty `--kind` list → clear error
- v5 schema → "requires schema v6+" error
- Missing DB path → "not found" error
- Read-only invariant: snapshot DB + sidecars before/after, assert equal

## Reviewer trail

- Codex RFC debate → locked 5 design forks
- Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema)
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mbenhamd added a commit that referenced this pull request May 24, 2026
Second consumer of the PF-690 fingerprint columns (PR #38). Reports
Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups
under council-locked defaults: function+method kinds, ≥10 lines, shape
groups whose members already form an exact group are suppressed.

Stacked on top of PR #39 because both consume the same envelope-enum
widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)`
slot without duplicating the schema edit.

## Surface

```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

- `[path]` resolves through `resolveProjectPath` (matches `status`/
  `index`/`sync`), so running from a subdirectory walks up to the
  project root.
- `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would
  happily accept `10abc` / `1.5` / `+10`, hiding typos.

## Defaults locked by council RFC (Codex)

| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both |
| `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) |
| `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) |
| Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |

## Read-only safety (Codex round 1 BLOCKER pattern from PR #39)

Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical
Node API for filesystem-path → file:// URL conversion, correctly
escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved
URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar
creation even when the DB is in WAL mode.

## v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints`
probes `PRAGMA table_info('nodes')` and throws
`duplicates requires schema v6+ (PR #38 fingerprint columns).`
instead of silently returning empty groups.

## Tests (13 unit + 1 schema validation)

- Defaults exposed as constants for regression pinning
- Same-named functions → exact group (Type-1)
- Different-named identical bodies → shape group (Type-2)
- Single-symbol fingerprint → no group (HAVING > 1)
- `--min-lines` filters one-liners with default, surfaces with `--min-lines 1`
- Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary
- Shape group with member set ≡ exact group → suppressed
- Empty `--kind` list → clear error
- v5 schema → "requires schema v6+" error
- Missing DB path → "not found" error
- Read-only invariant: snapshot DB + sidecars before/after, assert equal

## Reviewer trail

- Codex RFC debate → locked 5 design forks
- Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema)
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mbenhamd added a commit that referenced this pull request May 24, 2026
…itive (#39)

* feat(PF-691): `codegraph diff <oldDb> <newDb>` DB-vs-DB structural primitive

First consumer of the PF-690 fingerprint columns (PR #38). Compares two
`.codegraph/codegraph.db` files and reports added/removed/changed nodes
and edges, with `changedFields` enumerating which tracked column drifted
on each change (astHash → body change, sigHash/signature → contract
change, qualifiedName → in-file move, etc.).

## Design (council RFC)

CodeGraph stays VCS-agnostic — the diff operates on already-built DB
files, the caller handles git checkouts. Calling `git stash` / `git
checkout` from the indexer would be destructive (agy: "massive
anti-pattern"). Both council members (Codex + agy) converged on
DB-vs-DB; this is that primitive.

- Nodes matched by stable `id` (deterministic from filePath + kind +
  name + line via `generateNodeId`). Renames and file moves surface as
  removed-old + added-new pairs.
- Edges matched by canonical identity `(source, target, kind, line,
  col)` per the PF-625 UNIQUE INDEX, with NULL line/col folding to -1.

## Read-only safety (Codex round 1 BLOCKER, fixed in round 2)

`diffDatabases` opens both DBs via `new DatabaseSync(\`file:${escaped}?immutable=1\`, { readOnly: true })`,
not the production `DatabaseConnection.open` path. The latter sets
`journal_mode = WAL` and runs forward migrations — both mutate the file
on disk, which is unacceptable for a tool whose purpose is comparing
historical snapshots. `immutable=1` tells SQLite the file will never
change, skipping all locking and -shm/-wal sidecar creation. The
regression test `does not mutate either DB or create WAL sidecars`
snapshots size + mtime of `db`, `db-wal`, `db-shm`, `db-journal` before
diff and asserts they all match after.

## v5 schema fallback

`loadNodes` probes `PRAGMA table_info('nodes')` and projects missing
fingerprint columns as `NULL AS <col>`, so diffing a pre-PR-#38 v5
backup against a v6 DB works cleanly — fingerprint fields just don't
appear in `changedFields` for legacy rows. Regression test:
`handles legacy schema (no fingerprint columns) without failing`.

## Surface

- `codegraph diff <oldDb> <newDb> [-j|--json]` CLI subcommand
- `diffDatabases(oldPath, newPath): DiffResult` library export from
  `src/diff.ts`
- `schemas/cli/diff.json` JSON Schema (envelope.tool enum widened to
  include diff/duplicates/explain for PRs #39-#41)
- 9 unit tests covering identical/added/removed/body-change/signature-
  change/summary-counts/missing-DB/read-only-no-mutation/v5-fallback

## Validation

- `npx tsc --noEmit`: clean
- `npx vitest run`: 1040 passed | 2 skipped | 60 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  blocking=0, recall=1.00, precision=1.00, fp=0

## Reviewer trail

- Codex pass 1 → BLOCKER on read-only safety
- Round 2 closure → "closed — proceed to PR"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PF-691): PR #39 round 3 — file-level diff, edge metadata, fingerprint coverage

Council double-Codex review (passes A + B) flagged real gaps in PR #39's
scope. Addressed BLOCKERs + top REVIEWs:

## BLOCKER fixes

- **File-level diff missing**: a file added/removed with zero indexed
  symbols silently produced empty output. Added `addedFiles`,
  `removedFiles`, `changedFiles` arrays via the `files` table.
  changedFiles surfaces `contentHash`, `size`, `nodeCount` drift.

- **Non-tree-sitter NULL fingerprints**: Liquid/Vue/Svelte/YAML
  extractors emit nodes with NULL `ast_hash`, so body-level changes
  in those file types silently report "unchanged". Added
  `fingerprintCoverage: { old, new }` counter so consumers can detect
  the gap. CLI prints a yellow warning when coverage is partial.

- **Line-sensitive node IDs**: documented in source as a known
  pre-existing CodeGraph design decision (not introduced by this PR);
  downstream consumers wanting move-stable matching should diff by
  (filePath, qualifiedName) after consuming this output.

## REVIEW fixes

- **Edge metadata never compared**: edges with the same canonical
  identity but different `metadata`/`provenance` were silently
  matched as "unchanged", even though `codegraph explain` surfaces
  exactly those fields. Added `changedEdges: EdgeChange[]` with
  `changedFields` listing which of metadata/provenance differs.

- **`-1` sentinel leaked into JSON output**: internal Map keys still
  use -1 as the NULL bucket (matches PF-625 UNIQUE INDEX), but
  `edgeIdentityForOutput` now preserves `null` in the public
  `EdgeIdentity` shape.

- **Read-only validation deferred**: `DatabaseSync` opens lazily, so
  a non-SQLite `.db` file would error at first query with a raw
  SQLITE_NOTADB message. `openReadOnly` now runs `SELECT 1` to force
  validation, then wraps as `Invalid CodeGraph database at <path>`.

- **`codegraph diff` missing from CLI top-of-file usage block**.
  Added.

- **Text output truncation silent**: now prints "+N more (use --json
  for full output)" footer.

## Schema + tests

- `schemas/cli/diff.json` widened: requires the 4 new fields, declares
  `nullable line/col` correctly, includes `fingerprintCoverage` and
  `fileChange`/`edgeChange` definitions.
- 5 new tests: null-instead-of-`-1`, file-level add/remove/change,
  edge metadata drift via hand-built fixture, corrupt-DB wrapper,
  expanded summary-count invariant.

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1045 passed | 2 skipped | 60 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PF-691): close oldDb if openReadOnly(newDb) throws

Codex PR review P2: putting both `openReadOnly` calls before the
`try` block leaked the old DB's SQLite handle when the second open
failed (unreadable path, corrupt file, etc.). Wrap newDb open in a
nested try/catch that closes oldDb before re-throwing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mbenhamd added a commit that referenced this pull request May 24, 2026
Second consumer of PF-690 fingerprint columns (PR #38, schema v6).
Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone
groups under council-locked defaults: function+method kinds, ≥10
lines, shape groups whose members already form an exact group are
suppressed.

Replaces auto-closed PR #40 (closed when PR #39's base branch was
deleted on merge). Combines the original feat commit + round 2 fixes
(fileCount, coveredByExactGroup, fingerprint coverage gate,
human-meaningful sort) + the Codex-PR-review fixes (shapeNodes
double-count + unfingerprintable-kind error message).

## Surface
```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

## Read-only safety
Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39.

## Schema gate
- v5 DBs (no fingerprint columns): "duplicates requires schema v6+".
- Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0):
  "Run `codegraph index` to refresh fingerprints" — but ONLY when
  the DB has no fingerprints anywhere. If other kinds have
  fingerprints but the requested kinds don't, surface
  "framework-extractor nodes aren't fingerprinted; try
  --kind=function,method".

## Output shape
- `groups[]`: each group has `kind`, `fingerprint`, `members[]`,
  `fileCount`, `coveredByExactGroup`.
- `summary`: includes `fingerprintCoverage: { eligible, withAstHash }`
  so consumers can detect partial coverage.
- Sort: member count DESC → max line span DESC → first member
  filePath ASC → startLine ASC → fingerprint ASC (deterministic,
  human-meaningful).
- `shapeNodes` counts ONLY shape-only members (not double-counted
  with `exactNodes`).

## Tests
19 unit tests + 1 schema validation test covering: defaults,
exact + shape group detection, min-lines filter, sort (primary
through tertiary), suppression rule, empty --kind error, v5 schema
rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage
exposure, fileCount, coveredByExactGroup, unfingerprintable-kind
distinction, shapeNodes double-count fix, read-only invariant.

## Reviewer trail
- Codex RFC → locked 5 design forks
- Codex pass 1 + round 2 closure (subdir resolution, --min-lines
  strict parsing, comment wording, URI escaping, sort coverage)
- Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes
  double-count, unfingerprintable-kind error message)
- Council double-Codex deep review → confirmed all BLOCKERs fixed

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant