feat(PF-690): schema v6 + per-symbol fingerprint columns (duplicate/drift/diff foundation)#38
Conversation
…e/drift/diff infrastructure First slice of the trace/duplicate/drift roadmap that Codex + agy debated in the design RFC. Pure data infrastructure — no new CLI/MCP surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates), and PR #41 (codegraph_explain) will consume these columns. ## What changed - `src/extraction/fingerprints.ts` (new, ~245 lines): SHA-256 hashes computed from the in-memory tree-sitter subtree. - `astHash` (Type-1): normalized token stream with identifiers + literals preserved exactly. Detects "same code, only whitespace/comments differ". - `astShapeHash` (Type-2): identifier leaves in non-semantic positions replaced by `_ID`. Detects "same code, renamed locals". Property/field/type identifiers preserved by type; member-access targets, callees, kwarg names, type names, import names preserved by parent-context check. - `sigHash`: SHA-256 of the signature string. Null when no signature was extracted. - Comment + whitespace stripped; trivia tokens (commas, semicolons, braces) excluded via `namedChild` walk. - Schema v6 migration (`src/db/migrations.ts:93-119`, `src/db/schema.sql`): adds 4 nullable columns to `nodes` table — `ast_hash`, `ast_shape_hash`, `sig_hash`, `call_pattern_hash`. Partial indexes on the two body hashes (`WHERE NOT NULL`) so duplicate-detection sweeps are O(log N) lookups instead of full scans. `callPatternHash` is reserved for post-resolution population by a later PR. - Extraction wiring (`src/extraction/tree-sitter.ts:425-460`): computes the three body hashes from the already-parsed tree-sitter subtree inside `createNode`. agy's RFC point — that the AST is already in memory and hashing is microseconds — verified on a 107-file codegraph src/ corpus: 2.4s with vs 2.9s without (overhead below run-to-run variance, well under Codex's ≤15% budget). - `Node` interface (`src/types.ts:165-198`): adds nullable `astHash`, `astShapeHash`, `sigHash`, `callPatternHash` fields with provenance docstrings. - `queries.ts` insertNode / updateNode / rowToNode: round-trip the fingerprint columns nullably so framework-extractor synthesized route nodes (no body) keep `null` fingerprints — downstream consumers filter with `WHERE ast_hash IS NOT NULL`. ## v1 contract (Council RFC, locked by tests) - Detects: Type-1 (whitespace/comment-insensitive clones), Type-2 (renamed-locals clones). - Does NOT detect: Type-3 (statement reorder), Type-4 (semantic equivalence). - Literal values preserved → security/config code where the literal matters does not falsely conflate. Strongest counterpoint the council named ("miss literal-only differs") explicitly accepted. ## Bug-pin verified during review Codex pass 1 caught a real BLOCKER: `tree-sitter-python` parses `obj.start()` as `attribute(identifier "obj", identifier "start")` (both children are plain `identifier`), so a type-only rename rule would have conflated `obj.start()` with `obj.stop()`. Fix: parent- context check (`shouldPreserveIdentifier`) preserves identifiers in semantic positions — `attribute` children, `call.function` field, `keyword_argument` children, types, imports. Codex round 2 caught a follow-on: Python kwargs (`g(start=1)`) — added `keyword_argument` to the semantic-parent set. ## Tests (`__tests__/fingerprints.test.ts`, 14 cases) - sigHash determinism + null on missing signature. - Determinism: same input → same hex. - Type-1: whitespace/comment edits preserve astHash. - Type-2: renamed locals share astShapeHash, NOT astHash. - Member rename diverges (TS `property_identifier` path). - Literal change diverges (security sensitivity pinned). - Control-flow reorder diverges (Type-3 NOT detected, pinned). - Python regression: `obj.start()` vs `obj.stop()` diverge (member preserved despite both being `identifier`). - Python bare callee: `start()` vs `stop()` diverge. - Python kwarg: `g(start=1)` vs `g(stop=1)` diverge. - Python param rename: same astShapeHash, different astHash. - Cross-language: TS body ≠ Python body even when semantically equivalent. ## Reviewer trail - Codex pass 1: 1 BLOCKER (Python member conflation) + 1 REVIEW (missing Python tests) + 1 NITPICK (stale comment). - Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python kwarg conflation) + NITPICK (header repeated stale claim). - Codex round 3: Both round-2 findings CLOSED. Last NITPICK (kwarg comment misrepresented the over-preservation trade-off). Codex authorized "iterate for the comment fix, then ship". - Doc comment now accurately describes the trade-off: kwarg set-membership preserves any direct identifier leaf including value-side identifiers; tighter field-specific check deferred to a follow-up. ## Verification - tsc --noEmit clean - npm test: 1026 passed | 2 skipped (was 1012 on main; +14 fingerprint tests) - npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs main baseline) - Index-time delta on 107-file corpus: 2.4s with vs 2.9s without — below run-to-run variance, well under ≤15% target. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c869917e5f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (let i = 0; i < n.namedChildCount; i++) { | ||
| const child = n.namedChild(i); | ||
| if (child) walk(child, n); | ||
| } |
There was a problem hiding this comment.
Include non-trivia anonymous tokens in fingerprint stream
emitTokenStream descends with namedChild() only, so anonymous tokens are dropped from both hashes. In tree-sitter, many semantic tokens (for example + vs -, == vs !=, and/or in some grammars) are anonymous children, which means code with different behavior can still produce identical astHash/astShapeHash. This creates false positives in the duplicate/drift foundation this commit introduces (e.g., return x + 1 and return x - 1 can collide), so the traversal needs to include non-trivia anonymous tokens rather than excluding all anonymous children.
Useful? React with 👍 / 👎.
| const astHash = computeAstHash(node, this.source); | ||
| const astShapeHash = computeAstShapeHash(node, this.source); |
There was a problem hiding this comment.
Hash declaration subtree instead of whichever node is passed
Fingerprints are computed from the raw node argument, but many existing createNode call sites pass identifier/name children (for example Lua and generic variable paths) rather than the full declaration subtree. For those symbols, astHash captures mostly the name token and astShapeHash often collapses to the same _ID pattern, causing unrelated symbols to cluster as duplicates and weakening drift/diff signals. The hash input should be canonicalized to the declaration/body node (or call sites must be normalized) before hashing.
Useful? React with 👍 / 👎.
| const COMMENT_NODE_TYPES = new Set<string>([ | ||
| 'comment', | ||
| 'line_comment', | ||
| 'block_comment', | ||
| 'doc_comment', | ||
| 'shebang_line', |
There was a problem hiding this comment.
Treat documentation_comment nodes as hash-trivia
The comment-trivia allowlist omits documentation_comment, so grammars that emit that node type will include doc-comment text in the token stream and change fingerprints on comment-only edits. That breaks the Type-1 contract introduced in this commit (whitespace/comment-insensitive hashing) and can fragment duplicate groups after purely documentation updates.
Useful? React with 👍 / 👎.
| const astHash = computeAstHash(node, this.source); | ||
| const astShapeHash = computeAstShapeHash(node, this.source); |
There was a problem hiding this comment.
Apply fingerprinting beyond the TreeSitter createNode path
Fingerprint population is wired only inside TreeSitterExtractor.createNode, so symbols emitted by non-tree-sitter extractors (the same extractFromSource pipeline also routes to custom extractors) are persisted with ast_*/sig_hash left null even when they carry signatures. This creates inconsistent coverage for the new duplicate/drift foundation because query behavior now depends on extractor path rather than symbol content.
Useful? React with 👍 / 👎.
…tion (Codex round 4)
Codex round 4 deep sweep verified via real `tree-sitter-wasms` parses
that the v1 fingerprint rule conflated semantically different code
in Ruby, Java, C#, and Rust. The original rule only handled Python's
`attribute` shape + the `call.function` field. Each of these four
languages emits plain `identifier` for member/callee positions but
under DIFFERENT parent node types:
- Ruby: `user.start` -> call(identifier "user", identifier "start"),
method field carries the member name (not `function`).
- Java: `obj.start()` -> method_invocation(identifier, identifier).
- C#: `obj.Start()` -> invocation_expression > member_access_expression(identifier, identifier).
- Rust: `Router::new()` -> call_expression > scoped_identifier(identifier "Router", identifier "new").
Fix: extend `SEMANTIC_PARENT_TYPES` with `method_invocation`,
`member_access_expression`, `invocation_expression`,
`scoped_identifier`, `scoped_call_expression`, `field_expression`.
Add `call.method` field check to `shouldPreserveIdentifier` to
cover Ruby's dual-purpose `call` type. Same set-membership v1
trade-off applies (accepts false negative on receiver names rather
than risk semantic-name conflation).
Plus Codex round 4 REVIEW: migration v6 was not idempotent under
concurrent-open race. Two processes hitting a v5 database could
both read version 5, both enter migration, and the second's
`ALTER TABLE ADD COLUMN` would fail with duplicate-column even
though the resulting schema is fine. Fixed via `PRAGMA table_info`
pre-check per column so already-applied additions become no-ops.
`CREATE INDEX IF NOT EXISTS` was already idempotent.
Tests added (4 cross-language regressions):
- Ruby `user.start` vs `user.stop` -> different astShapeHash
- Java `obj.start()` vs `obj.stop()` -> different astShapeHash
- C# `obj.Start()` vs `obj.Stop()` -> different astShapeHash
- Rust `Router::new()` vs `Router::default()` -> different astShapeHash
Each pins the specific cross-language failure mode Codex verified.
Reviewer trail:
- Codex round 4 (deep sweep, 6 attack vectors): found 1 BLOCKER
(cross-language conflation) + 1 REVIEW (migration race). Both
fixed; remaining 4 vectors confirmed clean (hash determinism,
persistence completeness, ERROR/MISSING handling, createNode
hook coverage).
- CodeRabbit CLI: ran against the same diff, no findings.
- Claude Explore subagent: returned 7 findings; 3 already covered
here (cross-language tests, migration safety), 4 deferred as
documentation/contract clarifications (line-ending CRLF
normalization, downstream kwarg trade-off doc, callPatternHash
contract clarity, SQLite version compat — node:sqlite ships
SQLite 3.42+ which fully supports partial indexes).
Verification:
- tsc --noEmit clean
- npm test: 1030 passed | 2 skipped (was 1026 last commit;
+4 cross-language tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00
fp=0 (no regression vs baseline)
- All 18 fingerprint tests pass deterministically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0be6c9ab23
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const cols = db | ||
| .prepare("PRAGMA table_info('nodes')") | ||
| .all() as Array<{ name: string }>; | ||
| const hasCol = (name: string): boolean => cols.some((c) => c.name === name); |
There was a problem hiding this comment.
Refresh column checks under write lock before ALTERs
The v6 migration computes cols once, before any ALTER TABLE, and then reuses that snapshot for all hasCol checks. With two processes opening a v5 DB concurrently, both transactions can run PRAGMA table_info('nodes') first (the adapter uses BEGIN/deferred transactions), then one process adds the columns and commits while the second still executes ALTER TABLE ... ADD COLUMN ast_hash from stale metadata, raising a duplicate-column error and aborting startup. To make this idempotent under concurrency, acquire a write lock before the check (e.g., immediate transaction) or re-check/catch duplicate-column failures per ALTER.
Useful? React with 👍 / 👎.
Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…itive (#39) * feat(PF-691): `codegraph diff <oldDb> <newDb>` DB-vs-DB structural primitive First consumer of the PF-690 fingerprint columns (PR #38). Compares two `.codegraph/codegraph.db` files and reports added/removed/changed nodes and edges, with `changedFields` enumerating which tracked column drifted on each change (astHash → body change, sigHash/signature → contract change, qualifiedName → in-file move, etc.). ## Design (council RFC) CodeGraph stays VCS-agnostic — the diff operates on already-built DB files, the caller handles git checkouts. Calling `git stash` / `git checkout` from the indexer would be destructive (agy: "massive anti-pattern"). Both council members (Codex + agy) converged on DB-vs-DB; this is that primitive. - Nodes matched by stable `id` (deterministic from filePath + kind + name + line via `generateNodeId`). Renames and file moves surface as removed-old + added-new pairs. - Edges matched by canonical identity `(source, target, kind, line, col)` per the PF-625 UNIQUE INDEX, with NULL line/col folding to -1. ## Read-only safety (Codex round 1 BLOCKER, fixed in round 2) `diffDatabases` opens both DBs via `new DatabaseSync(\`file:${escaped}?immutable=1\`, { readOnly: true })`, not the production `DatabaseConnection.open` path. The latter sets `journal_mode = WAL` and runs forward migrations — both mutate the file on disk, which is unacceptable for a tool whose purpose is comparing historical snapshots. `immutable=1` tells SQLite the file will never change, skipping all locking and -shm/-wal sidecar creation. The regression test `does not mutate either DB or create WAL sidecars` snapshots size + mtime of `db`, `db-wal`, `db-shm`, `db-journal` before diff and asserts they all match after. ## v5 schema fallback `loadNodes` probes `PRAGMA table_info('nodes')` and projects missing fingerprint columns as `NULL AS <col>`, so diffing a pre-PR-#38 v5 backup against a v6 DB works cleanly — fingerprint fields just don't appear in `changedFields` for legacy rows. Regression test: `handles legacy schema (no fingerprint columns) without failing`. ## Surface - `codegraph diff <oldDb> <newDb> [-j|--json]` CLI subcommand - `diffDatabases(oldPath, newPath): DiffResult` library export from `src/diff.ts` - `schemas/cli/diff.json` JSON Schema (envelope.tool enum widened to include diff/duplicates/explain for PRs #39-#41) - 9 unit tests covering identical/added/removed/body-change/signature- change/summary-counts/missing-DB/read-only-no-mutation/v5-fallback ## Validation - `npx tsc --noEmit`: clean - `npx vitest run`: 1040 passed | 2 skipped | 60 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 ## Reviewer trail - Codex pass 1 → BLOCKER on read-only safety - Round 2 closure → "closed — proceed to PR" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(PF-691): PR #39 round 3 — file-level diff, edge metadata, fingerprint coverage Council double-Codex review (passes A + B) flagged real gaps in PR #39's scope. Addressed BLOCKERs + top REVIEWs: ## BLOCKER fixes - **File-level diff missing**: a file added/removed with zero indexed symbols silently produced empty output. Added `addedFiles`, `removedFiles`, `changedFiles` arrays via the `files` table. changedFiles surfaces `contentHash`, `size`, `nodeCount` drift. - **Non-tree-sitter NULL fingerprints**: Liquid/Vue/Svelte/YAML extractors emit nodes with NULL `ast_hash`, so body-level changes in those file types silently report "unchanged". Added `fingerprintCoverage: { old, new }` counter so consumers can detect the gap. CLI prints a yellow warning when coverage is partial. - **Line-sensitive node IDs**: documented in source as a known pre-existing CodeGraph design decision (not introduced by this PR); downstream consumers wanting move-stable matching should diff by (filePath, qualifiedName) after consuming this output. ## REVIEW fixes - **Edge metadata never compared**: edges with the same canonical identity but different `metadata`/`provenance` were silently matched as "unchanged", even though `codegraph explain` surfaces exactly those fields. Added `changedEdges: EdgeChange[]` with `changedFields` listing which of metadata/provenance differs. - **`-1` sentinel leaked into JSON output**: internal Map keys still use -1 as the NULL bucket (matches PF-625 UNIQUE INDEX), but `edgeIdentityForOutput` now preserves `null` in the public `EdgeIdentity` shape. - **Read-only validation deferred**: `DatabaseSync` opens lazily, so a non-SQLite `.db` file would error at first query with a raw SQLITE_NOTADB message. `openReadOnly` now runs `SELECT 1` to force validation, then wraps as `Invalid CodeGraph database at <path>`. - **`codegraph diff` missing from CLI top-of-file usage block**. Added. - **Text output truncation silent**: now prints "+N more (use --json for full output)" footer. ## Schema + tests - `schemas/cli/diff.json` widened: requires the 4 new fields, declares `nullable line/col` correctly, includes `fingerprintCoverage` and `fileChange`/`edgeChange` definitions. - 5 new tests: null-instead-of-`-1`, file-level add/remove/change, edge metadata drift via hand-built fixture, corrupt-DB wrapper, expanded summary-count invariant. ## Validation - `tsc --noEmit`: clean - `vitest run`: 1045 passed | 2 skipped | 60 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(PF-691): close oldDb if openReadOnly(newDb) throws Codex PR review P2: putting both `openReadOnly` calls before the `try` block leaked the old DB's SQLite handle when the second open failed (unreadable path, corrupt file, etc.). Wrap newDb open in a nested try/catch that closes oldDb before re-throwing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Second consumer of PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Replaces auto-closed PR #40 (closed when PR #39's base branch was deleted on merge). Combines the original feat commit + round 2 fixes (fileCount, coveredByExactGroup, fingerprint coverage gate, human-meaningful sort) + the Codex-PR-review fixes (shapeNodes double-count + unfingerprintable-kind error message). ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` ## Read-only safety Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39. ## Schema gate - v5 DBs (no fingerprint columns): "duplicates requires schema v6+". - Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0): "Run `codegraph index` to refresh fingerprints" — but ONLY when the DB has no fingerprints anywhere. If other kinds have fingerprints but the requested kinds don't, surface "framework-extractor nodes aren't fingerprinted; try --kind=function,method". ## Output shape - `groups[]`: each group has `kind`, `fingerprint`, `members[]`, `fileCount`, `coveredByExactGroup`. - `summary`: includes `fingerprintCoverage: { eligible, withAstHash }` so consumers can detect partial coverage. - Sort: member count DESC → max line span DESC → first member filePath ASC → startLine ASC → fingerprint ASC (deterministic, human-meaningful). - `shapeNodes` counts ONLY shape-only members (not double-counted with `exactNodes`). ## Tests 19 unit tests + 1 schema validation test covering: defaults, exact + shape group detection, min-lines filter, sort (primary through tertiary), suppression rule, empty --kind error, v5 schema rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage exposure, fileCount, coveredByExactGroup, unfingerprintable-kind distinction, shapeNodes double-count fix, read-only invariant. ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 + round 2 closure (subdir resolution, --min-lines strict parsing, comment wording, URI escaping, sort coverage) - Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes double-count, unfingerprintable-kind error message) - Council double-Codex deep review → confirmed all BLOCKERs fixed ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 61 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
First slice of the trace/duplicate/drift roadmap that Codex + agy debated in the design RFC. Pure data infrastructure — no new CLI/MCP surface yet. PR #39 (
codegraph_diff), PR #40 (codegraph_duplicates), PR #41 (codegraph_explain) will consume these columns.What's in this PR
src/extraction/fingerprints.ts(new, ~245 lines):astHash(Type-1): normalized AST token stream, identifiers + literals preserved exactly. Detects "same code, only whitespace/comments differ".astShapeHash(Type-2): identifier leaves in non-semantic positions replaced by_ID. Detects "same code, renamed locals". Member-access targets, callees, kwarg names, type names, import names preserved by parent-context check.sigHash: SHA-256 of the signature string. Null when no signature.namedChildwalk.Schema v6 migration + partial indexes on
WHERE NOT NULLso duplicate-detection sweeps are O(log N) lookups.callPatternHashreserved for post-resolution population by a later PR.Extraction wiring (
tree-sitter.ts:createNode): hashes computed from the in-memory tree-sitter subtree. agy's RFC point — AST already in memory, hashing is microseconds — verified: 2.4s with vs 2.9s without on a 107-file corpus (overhead below run-to-run variance, well under Codex's ≤15% budget).v1 contract (locked by 14 tests)
1024vs2048is NOT a clone. Council explicitly accepted "miss literal-only differs" as the strongest counterpoint for security/config code.Bug-pin trail
tree-sitter-pythonparsesobj.start()asattribute(identifier "obj", identifier "start")— type-only rename would conflateobj.start()withobj.stop().shouldPreserveIdentifier(node, parent)with rules forattribute/call.functionfield /keyword_argument/type_*/import_*parent contexts.g(start=1)was still conflating withg(stop=1).keyword_argumentto SEMANTIC_PARENT_TYPES.Test plan
npx tsc --noEmit— cleannpm test— 1026 passed | 2 skipped (was 1012 on main; +14 fingerprint tests)npm run test:eval:structural— 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs main baseline)Tests added (
__tests__/fingerprints.test.ts, 14 cases)property_identifier)obj.start()vsobj.stop()diverge (BLOCKER regression)start()vsstop()divergeg(start=1)vsg(stop=1)diverge (round-2 regression)Reviewer trail
Merge handoff
Ready-to-merge. Branch
feature/pf-690-fingerprint-columns. Schema migration is additive only (4 nullable columns, partial indexes). Existing databases auto-migrate on next open viarunMigrations. New nodes get hashed at extract time; existing nodes get NULL until next reindex.🤖 Generated with Claude Code