feat(PF-692): codegraph duplicates clone-detection CLI by mbenhamd · Pull Request #40 · mbenhamd/codegraph

mbenhamd · 2026-05-24T13:03:28Z

Summary

Second consumer of the PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (ast_hash) and Type-2 (ast_shape_hash) clone groups in the indexed graph.

Stacked on feature/pf-691-codegraph-diff (PR #39). Both consume the same envelope-enum widening; PR #40 inherits the duplicates enum slot without duplicating that schema edit. Once PR #39 lands, this rebases to main.

codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]

Design — council RFC outcome (Codex)

Fork	Decision	Rationale
Type-2 dedup vs Type-1	Suppress shape groups whose member set EQUALS an exact group	Type-1 implies Type-2; reporting both inflates the summary
`--min-lines` floor	10	Standard CPD/jscpd default; filters one-liner accessors
`--kind` default	`function,method`	Most useful clone targets; class-level often framework-shaped noise
Body-line source	`endLine - startLine + 1`	Language-agnostic, no source rescan; blank/comment lines count toward threshold (accepted trade-off)
Sort order	member count DESC → max line span DESC → fingerprint ASC	Biggest clone-set first, deterministic tie-breakers

Real fingerprint behavior pinned by tests

The ast_hash includes the function DECLARATION name (rename-locals only normalizes LOCAL identifiers, not the function's own name). Two same-bodied functions with different declaration names produce DIFFERENT ast_hash but IDENTICAL ast_shape_hash. Tests use this distinction:

Same-named handler/handler across two files → exact (Type-1) group
adder/alsoAdder with identical bodies → shape-only (Type-2) group

Read-only safety (same Codex BLOCKER pattern from PR #39)

openReadOnly uses pathToFileURL(dbPath).href + '?immutable=1' — canonical Node API for filesystem-path → file:// URL, correctly handles spaces, non-ASCII, Windows drive letters, all SQLite-reserved URI delimiters. immutable=1 prevents -shm/-wal sidecar creation even on WAL-mode DBs. Regression test snapshots size + mtime of db/-wal/-shm/-journal before/after and asserts equal.

v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. assertSchemaSupportsFingerprints probes PRAGMA table_info('nodes') and throws duplicates requires schema v6+ (PR #38 fingerprint columns). instead of silently returning empty groups.

Reviewer trail

Pass	Severity	Finding	Resolution
Codex pass 1	`REVIEW`	`duplicates [path]` didn't use `resolveProjectPath`	Now does — matches `status`/`index`/`sync`
Codex pass 1	`REVIEW`	`--min-lines` parseInt permissive (`10abc`, `1.5`)	Added strict `/^[1-9]\d*$/` regex validation
Codex pass 1	`NITPICK`	Comment said "subset" but impl is "equals"	Comment fixed
Codex pass 1	`NITPICK`	Hand-rolled URI escape missed spaces/non-ASCII	Switched to `pathToFileURL`
Codex pass 1	`NITPICK`	Sort tests only covered primary key	Added secondary (span) + tertiary (fingerprint) coverage
Codex round 2	—	All findings verified addressed	"closed — proceed to PR"

Files

src/duplicates.ts — findDuplicates(dbPath, opts): DuplicatesResult library export (~330 lines)
src/bin/codegraph.ts — codegraph duplicates [path] CLI subcommand
schemas/cli/duplicates.json — JSON Schema (envelope.tool=duplicates, members minItems:2)
__tests__/duplicates.test.ts — 13 unit tests
__tests__/cli-json-schemas.test.ts — JSON envelope validation test

Test plan

npx tsc --noEmit — clean
npx vitest run — 1054 passed | 2 skipped | 61 files, 0 failures
npm run build — clean (Node 24)
npm run test:eval:structural — required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0
Codex pass 1 + round 2 closure
CodeRabbit app review on PR

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a79047061

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T13:09:53Z

+    for (const [fingerprint, members] of shape) {
+      groups.push({ kind: 'shape', fingerprint, members });
+      shapeNodes += members.length;
+    }


Exclude exact members when reporting shape-only node totals

findDuplicates currently computes shapeNodes as the sum of all members in retained shape groups, but shape groups can legitimately include symbols that are also in exact groups (the suppression only removes exact-equal member sets). In that common case (e.g., two exact clones plus one renamed clone sharing shape), the summary overcounts and the CLI’s “Shape-only duplicate nodes” value is incorrect, which can mislead users comparing Type-1 vs Type-2 impact.

Useful? React with 👍 / 👎.

Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…overedByExactGroup Council double-Codex review flagged real gaps. Addressed BLOCKER + top REVIEWs: ## BLOCKER fix - **Migrated-but-not-reindexed v6 DBs silently returned 0 groups**: the schema check verified columns exist but didn't verify they're populated. A user upgrading from v5 sees "no duplicates" and thinks their code is clone-free, when really the index is blind. Fix: new `fingerprintCoverage(db, kinds, minLines)` query counts eligible rows vs fingerprinted rows; `assertFingerprintCoverage` throws "0 of N eligible nodes have fingerprints. Run `codegraph index`" when coverage is zero on a non-empty eligible set. ## REVIEW fixes - **No file-count or scope per group**: consumers couldn't distinguish "same function in two files" (refactor) from "accessor pattern repeated in the same class" (often legitimate) without post-processing. Added `fileCount: number` per `DuplicateGroup`. - **Shape superset hides exact subset overlap**: a shape group whose members include an exact subgroup is informative when the user knows about the overlap. Added `coveredByExactGroup: boolean` — true on shape groups when at least one member is also in an exact group; always false on exact groups. - **Sort tertiary was SHA-256 hash (human-meaningless)**: now tertiary is first-member filePath ASC, quaternary is first-member startLine ASC, fingerprint only as final fallback. Same clone reproducibly appears at the same output position across rebuilds AND humans can read the order. - **Member ORDER BY tied on (file_path, start_line)**: overloads / generated rows could share both. Appended `id` to the SQL ORDER BY for fully deterministic ordering. - **Text-mode truncation silently capped at 20 groups / 5 members**: CLI now prints "Output truncated: N more group(s) and M more member(s) hidden. Use --json for full output." footer when capped. - **Zero-result message blamed only min size**: includes selected kinds, min-lines, and fingerprint coverage so users can tell whether the empty result is real or an artefact. - **Schema-validation test only exercised empty output**: rewritten with a real `function shared` duplicate across two files; asserts `exactGroups >= 1`, `fileCount === 2`, `coveredByExactGroup === false`. ## Schema - `summary.fingerprintCoverage: { eligible, withAstHash }` required. - `group.fileCount: integer minimum 1` required. - `group.coveredByExactGroup: boolean` required. ## Validation - `tsc --noEmit`: clean - `vitest run`: 1063 passed | 2 skipped | 61 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Third consumer of the CLI envelope/cli-tools family, completing the trio with diff (PR #39) and duplicates (PR #40). Surfaces every persisted provenance breadcrumb for a single edge: - Edge identity (source, target, kind, line, col) - Source + target node snapshots - Extractor provenance: `edges.provenance` column (tree-sitter / scip / heuristic) - Resolver provenance: `metadata.resolvedBy` strategy tag (import / framework / qualified-name / exact-match / instance-method / file-path / fuzzy) + `metadata.confidence` - Full raw metadata JSON for forward-compat Stacked on PR #40 (which stacks on PR #39). ## Surface ``` codegraph explain <edgeId> # happy path codegraph explain --source <id> --target <id> \ # rebuild-stable canonical --kind <k> [--line N] [--col N] codegraph explain <edgeId> --json # full envelope ``` Default text output uses `formatExplainNarrative` for a concise multi-line trace; `--json` emits the full payload via `cliJsonEnvelope('explain', …)`. ## Council RFC outcome (Codex) | Fork | Decision | |---|---| | A: identifier surface | **Both** — positional `<edgeId>` + canonical flags | | B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through | | C: output shape | **JSON + text narrative** (terminal use case dominates) | | D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode | | E: ambiguous canonical lookup | **Error with `--line N --col N` hint** | ## Edge.id exposure `Edge` gains an optional `id?: number` field with a docstring noting it's index-local (resets on rebuild). `rowToEdge` propagates it from the DB row. `collectGraphRelations` includes `edgeId` in the caller/callee JSON entries when present. `schemas/cli/callers.json` and `schemas/cli/callees.json` document the new optional field. ## Read-only safety `pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays bit-for-bit unchanged. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal. ## Schema gate No v6 requirement — `provenance`/`metadata` columns existed in v5. `assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes` tables via a single `WHERE name IN (…)` probe. Missing resolver metadata is tolerated: `resolvedBy: null`, `confidence: null`, `metadata: null` — explain explains what's there. ## Tests (10 unit + 1 schema validation) - Integer-id lookup surfaces resolver provenance - Canonical lookup by (source, target, kind) - Ambiguous canonical → throws with `--line` hint - No-match canonical → throws with full identifier in message - Missing edge id → throws "no edge with id N" - Non-positive / non-integer edge id → rejected before DB open - Narrative format contains expected keywords - Non-CodeGraph SQLite file → "not a CodeGraph database" - Read-only invariant: DB + sidecars unchanged after explain - Missing DB path → "not found" ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER) - `--kind` should be required for canonical lookup - `assertCodeGraphDb` should check both tables - `confidence` should be clamped to [0, 1] - Document `edgeId` in callers/callees schemas - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 62 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39f02ded4a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T13:46:18Z

+  if (cov.withAstHash === 0) {
+    throw new Error(
+      `duplicates: 0 of ${cov.eligible} eligible nodes have fingerprints. ` +
+        `This usually means the DB was migrated to v6 but not re-indexed. ` +
+        `Run \`codegraph index\` to refresh fingerprints.`,
+    );


Handle unfingerprinted kinds without throwing reindex error

assertFingerprintCoverage unconditionally throws when eligible > 0 and withAstHash === 0, but some kinds are intentionally stored without fingerprints (for example, synthetic component nodes created by the Svelte/Vue extractors omit astHash). In a healthy, fully indexed DB, running codegraph duplicates --kind component will therefore fail with a misleading “migrated to v6 but not re-indexed” error instead of returning an empty result or a coverage warning, so valid user queries are rejected.

Useful? React with 👍 / 👎.

Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit. ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` - `[path]` resolves through `resolveProjectPath` (matches `status`/ `index`/`sync`), so running from a subdirectory walks up to the project root. - `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos. ## Defaults locked by council RFC (Codex) | Fork | Decision | |---|---| | Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both | | `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) | | `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) | | Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) | | Sort order | member count DESC → max line span DESC → fingerprint ASC | ## Read-only safety (Codex round 1 BLOCKER pattern from PR #39) Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode. ## v5 schema rejection Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups. ## Tests (13 unit + 1 schema validation) - Defaults exposed as constants for regression pinning - Same-named functions → exact group (Type-1) - Different-named identical bodies → shape group (Type-2) - Single-symbol fingerprint → no group (HAVING > 1) - `--min-lines` filters one-liners with default, surfaces with `--min-lines 1` - Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary - Shape group with member set ≡ exact group → suppressed - Empty `--kind` list → clear error - v5 schema → "requires schema v6+" error - Missing DB path → "not found" error - Read-only invariant: snapshot DB + sidecars before/after, assert equal ## Reviewer trail - Codex RFC debate → locked 5 design forks - Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema) - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…overedByExactGroup Council double-Codex review flagged real gaps. Addressed BLOCKER + top REVIEWs: ## BLOCKER fix - **Migrated-but-not-reindexed v6 DBs silently returned 0 groups**: the schema check verified columns exist but didn't verify they're populated. A user upgrading from v5 sees "no duplicates" and thinks their code is clone-free, when really the index is blind. Fix: new `fingerprintCoverage(db, kinds, minLines)` query counts eligible rows vs fingerprinted rows; `assertFingerprintCoverage` throws "0 of N eligible nodes have fingerprints. Run `codegraph index`" when coverage is zero on a non-empty eligible set. ## REVIEW fixes - **No file-count or scope per group**: consumers couldn't distinguish "same function in two files" (refactor) from "accessor pattern repeated in the same class" (often legitimate) without post-processing. Added `fileCount: number` per `DuplicateGroup`. - **Shape superset hides exact subset overlap**: a shape group whose members include an exact subgroup is informative when the user knows about the overlap. Added `coveredByExactGroup: boolean` — true on shape groups when at least one member is also in an exact group; always false on exact groups. - **Sort tertiary was SHA-256 hash (human-meaningless)**: now tertiary is first-member filePath ASC, quaternary is first-member startLine ASC, fingerprint only as final fallback. Same clone reproducibly appears at the same output position across rebuilds AND humans can read the order. - **Member ORDER BY tied on (file_path, start_line)**: overloads / generated rows could share both. Appended `id` to the SQL ORDER BY for fully deterministic ordering. - **Text-mode truncation silently capped at 20 groups / 5 members**: CLI now prints "Output truncated: N more group(s) and M more member(s) hidden. Use --json for full output." footer when capped. - **Zero-result message blamed only min size**: includes selected kinds, min-lines, and fingerprint coverage so users can tell whether the empty result is real or an artefact. - **Schema-validation test only exercised empty output**: rewritten with a real `function shared` duplicate across two files; asserts `exactGroups >= 1`, `fileCount === 2`, `coveredByExactGroup === false`. ## Schema - `summary.fingerprintCoverage: { eligible, withAstHash }` required. - `group.fileCount: integer minimum 1` required. - `group.coveredByExactGroup: boolean` required. ## Validation - `tsc --noEmit`: clean - `vitest run`: 1063 passed | 2 skipped | 61 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…essage Codex PR review P2 findings on `feature/pf-692-codegraph-duplicates`: 1. **shapeNodes inflated by exact members**: when a shape group is a strict superset of an exact group (e.g. {A.f, B.f} exact + A.g shape-only → shape group {A.f, B.f, A.g}), `shapeNodes` was summing all 3, double-counting A.f and B.f which already appear in `exactNodes`. Fix: only increment `shapeNodes` for members not already in any exact group. 2. **`assertFingerprintCoverage` misdiagnoses unfingerprintable kinds**: the previous unconditional "run codegraph index" error was wrong when the user requested a kind that's intentionally not fingerprinted (framework-extractor `component`/`route`/`vue` nodes bypass `createNode`'s hash path). Fix: probe `dbHasAnyFingerprint` first — if any node in the DB has a fingerprint but the requested kinds don't, surface "these kinds aren't fingerprinted" instead. Tests: 2 new regression tests pinning both behaviors via hand-built fixtures. ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 61 files - `npm run build`: clean - `npm run test:eval:structural`: required=5/5, all-pass=8/8 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Third consumer of the CLI envelope/cli-tools family, completing the trio with diff (PR #39) and duplicates (PR #40). Surfaces every persisted provenance breadcrumb for a single edge: - Edge identity (source, target, kind, line, col) - Source + target node snapshots - Extractor provenance: `edges.provenance` column (tree-sitter / scip / heuristic) - Resolver provenance: `metadata.resolvedBy` strategy tag (import / framework / qualified-name / exact-match / instance-method / file-path / fuzzy) + `metadata.confidence` - Full raw metadata JSON for forward-compat Stacked on PR #40 (which stacks on PR #39). ## Surface ``` codegraph explain <edgeId> # happy path codegraph explain --source <id> --target <id> \ # rebuild-stable canonical --kind <k> [--line N] [--col N] codegraph explain <edgeId> --json # full envelope ``` Default text output uses `formatExplainNarrative` for a concise multi-line trace; `--json` emits the full payload via `cliJsonEnvelope('explain', …)`. ## Council RFC outcome (Codex) | Fork | Decision | |---|---| | A: identifier surface | **Both** — positional `<edgeId>` + canonical flags | | B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through | | C: output shape | **JSON + text narrative** (terminal use case dominates) | | D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode | | E: ambiguous canonical lookup | **Error with `--line N --col N` hint** | ## Edge.id exposure `Edge` gains an optional `id?: number` field with a docstring noting it's index-local (resets on rebuild). `rowToEdge` propagates it from the DB row. `collectGraphRelations` includes `edgeId` in the caller/callee JSON entries when present. `schemas/cli/callers.json` and `schemas/cli/callees.json` document the new optional field. ## Read-only safety `pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays bit-for-bit unchanged. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal. ## Schema gate No v6 requirement — `provenance`/`metadata` columns existed in v5. `assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes` tables via a single `WHERE name IN (…)` probe. Missing resolver metadata is tolerated: `resolvedBy: null`, `confidence: null`, `metadata: null` — explain explains what's there. ## Tests (10 unit + 1 schema validation) - Integer-id lookup surfaces resolver provenance - Canonical lookup by (source, target, kind) - Ambiguous canonical → throws with `--line` hint - No-match canonical → throws with full identifier in message - Missing edge id → throws "no edge with id N" - Non-positive / non-integer edge id → rejected before DB open - Narrative format contains expected keywords - Non-CodeGraph SQLite file → "not a CodeGraph database" - Read-only invariant: DB + sidecars unchanged after explain - Missing DB path → "not found" ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER) - `--kind` should be required for canonical lookup - `assertCodeGraphDb` should check both tables - `confidence` should be clamped to [0, 1] - Document `edgeId` in callers/callees schemas - Codex round 2 → "closed — proceed to PR" ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 62 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Second consumer of PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed. Replaces auto-closed PR #40 (closed when PR #39's base branch was deleted on merge). Combines the original feat commit + round 2 fixes (fileCount, coveredByExactGroup, fingerprint coverage gate, human-meaningful sort) + the Codex-PR-review fixes (shapeNodes double-count + unfingerprintable-kind error message). ## Surface ``` codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json] ``` ## Read-only safety Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39. ## Schema gate - v5 DBs (no fingerprint columns): "duplicates requires schema v6+". - Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0): "Run `codegraph index` to refresh fingerprints" — but ONLY when the DB has no fingerprints anywhere. If other kinds have fingerprints but the requested kinds don't, surface "framework-extractor nodes aren't fingerprinted; try --kind=function,method". ## Output shape - `groups[]`: each group has `kind`, `fingerprint`, `members[]`, `fileCount`, `coveredByExactGroup`. - `summary`: includes `fingerprintCoverage: { eligible, withAstHash }` so consumers can detect partial coverage. - Sort: member count DESC → max line span DESC → first member filePath ASC → startLine ASC → fingerprint ASC (deterministic, human-meaningful). - `shapeNodes` counts ONLY shape-only members (not double-counted with `exactNodes`). ## Tests 19 unit tests + 1 schema validation test covering: defaults, exact + shape group detection, min-lines filter, sort (primary through tertiary), suppression rule, empty --kind error, v5 schema rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage exposure, fileCount, coveredByExactGroup, unfingerprintable-kind distinction, shapeNodes double-count fix, read-only invariant. ## Reviewer trail - Codex RFC → locked 5 design forks - Codex pass 1 + round 2 closure (subdir resolution, --min-lines strict parsing, comment wording, URI escaping, sort coverage) - Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes double-count, unfingerprintable-kind error message) - Council double-Codex deep review → confirmed all BLOCKERs fixed ## Validation - `tsc --noEmit`: clean - `vitest run`: 1065 passed | 2 skipped | 61 files - `npm run build`: clean (Node 24) - `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed May 24, 2026

View reviewed changes

mbenhamd mentioned this pull request May 24, 2026

feat(PF-693): codegraph explain single-edge resolution trace #41

Closed

6 tasks

mbenhamd force-pushed the feature/pf-692-codegraph-duplicates branch from 3a79047 to 39f02de Compare May 24, 2026 13:40

chatgpt-codex-connector Bot reviewed May 24, 2026

View reviewed changes

mbenhamd and others added 3 commits May 24, 2026 15:48

mbenhamd force-pushed the feature/pf-692-codegraph-duplicates branch from 39f02de to 2a713c1 Compare May 24, 2026 13:50

mbenhamd mentioned this pull request May 24, 2026

feat(PF-691): codegraph diff <oldDb> <newDb> DB-vs-DB structural primitive #39

Merged

6 tasks

mbenhamd deleted the branch feature/pf-691-codegraph-diff May 24, 2026 13:51

mbenhamd closed this May 24, 2026

mbenhamd mentioned this pull request May 24, 2026

feat(PF-692): codegraph duplicates clone-detection CLI #42

Merged

mbenhamd deleted the feature/pf-692-codegraph-duplicates branch May 24, 2026 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(PF-692): codegraph duplicates clone-detection CLI#40

feat(PF-692): codegraph duplicates clone-detection CLI#40
mbenhamd wants to merge 3 commits into
feature/pf-691-codegraph-difffrom
feature/pf-692-codegraph-duplicates

mbenhamd commented May 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mbenhamd commented May 24, 2026

Summary

Design — council RFC outcome (Codex)

Real fingerprint behavior pinned by tests

Read-only safety (same Codex BLOCKER pattern from PR #39)

v5 schema rejection

Reviewer trail

Files

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant