feat(PF-692): codegraph duplicates clone-detection CLI #40
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a79047061
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    for (const [fingerprint, members] of shape) {
      groups.push({ kind: 'shape', fingerprint, members });
      shapeNodes += members.length;
    }
Exclude exact members when reporting shape-only node totals
findDuplicates currently computes shapeNodes as the sum of all members in retained shape groups, but shape groups can legitimately include symbols that are also in exact groups (the suppression only removes exact-equal member sets). In that common case (e.g., two exact clones plus one renamed clone sharing shape), the summary overcounts and the CLI’s “Shape-only duplicate nodes” value is incorrect, which can mislead users comparing Type-1 vs Type-2 impact.
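The overcount described above can be reproduced and avoided with a small in-memory sketch. The `Member` and `Group` shapes and the `countNodes` helper are hypothetical stand-ins for illustration; the real types in `findDuplicates` may differ.

```typescript
// Hypothetical shapes for illustration only; not the project's actual types.
interface Member { id: number; name: string }
interface Group { kind: 'exact' | 'shape'; fingerprint: string; members: Member[] }

// Count exact members once, and count a shape member only when it does not
// already belong to some exact group.
function countNodes(groups: Group[]): { exactNodes: number; shapeNodes: number } {
  const exactIds = new Set<number>();
  for (const g of groups) {
    if (g.kind !== 'exact') continue;
    for (const m of g.members) exactIds.add(m.id);
  }
  const shapeOnlyIds = new Set<number>();
  for (const g of groups) {
    if (g.kind !== 'shape') continue;
    for (const m of g.members) {
      if (!exactIds.has(m.id)) shapeOnlyIds.add(m.id);
    }
  }
  return { exactNodes: exactIds.size, shapeNodes: shapeOnlyIds.size };
}

// Two exact clones plus one renamed clone that shares their shape.
const sampleGroups: Group[] = [
  { kind: 'exact', fingerprint: 'h1', members: [{ id: 1, name: 'A.f' }, { id: 2, name: 'B.f' }] },
  {
    kind: 'shape',
    fingerprint: 's1',
    members: [{ id: 1, name: 'A.f' }, { id: 2, name: 'B.f' }, { id: 3, name: 'A.g' }],
  },
];

const counts = countNodes(sampleGroups);
console.log(counts); // { exactNodes: 2, shapeNodes: 1 }
```

Summing `members.length` over the shape group would report 3 here; only `A.g` is shape-only.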
Second consumer of the PF-690 fingerprint columns (PR #38). Reports Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed.

Stacked on top of PR #39 because both consume the same envelope-enum widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)` slot without duplicating the schema edit.
## Surface
```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```
- `[path]` resolves through `resolveProjectPath` (matches `status`/`index`/`sync`), so running from a subdirectory walks up to the project root.
- `--min-lines` validates with `/^[1-9]\d*$/`; `parseInt` would happily accept `10abc` / `1.5` / `+10`, hiding typos.
## Defaults locked by council RFC (Codex)
| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both |
| `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) |
| `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) |
| Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |
## Read-only safety (Codex round 1 BLOCKER pattern from PR #39)
Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'`: the canonical Node API for filesystem-path → file:// URL conversion, correctly escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved URI delimiters. The `immutable=1` flag prevents `-shm`/`-wal` sidecar creation even when the DB is in WAL mode.
## v5 schema rejection
Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups.
## Tests (13 unit + 1 schema validation)
- Defaults exposed as constants for regression pinning
- Same-named functions → exact group (Type-1)
- Different-named identical bodies → shape group (Type-2)
- Single-symbol fingerprint → no group (HAVING > 1)
- `--min-lines` filters one-liners with default, surfaces with `--min-lines 1`
- Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary
- Shape group with member set ≡ exact group → suppressed
- Empty `--kind` list → clear error
- v5 schema → "requires schema v6+" error
- Missing DB path → "not found" error
- Read-only invariant: snapshot DB + sidecars before/after, assert equal
## Reviewer trail
- Codex RFC debate → locked 5 design forks
- Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK
- Codex round 2 → "closed — proceed to PR"
## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema)
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
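To illustrate why the commit message prefers the `/^[1-9]\d*$/` check over `parseInt` for `--min-lines`, here is a minimal sketch. The `parseMinLines` name and its error wording are assumptions made for this example; only the regex and the sample inputs come from the text above.

```typescript
// Hypothetical helper: strict positive-integer parsing for a CLI flag value.
function parseMinLines(raw: string): number {
  if (!/^[1-9]\d*$/.test(raw)) {
    throw new Error(`--min-lines must be a positive integer, got "${raw}"`);
  }
  return Number(raw);
}

// Returns true when the strict parser rejects the input.
function isRejected(raw: string): boolean {
  try {
    parseMinLines(raw);
    return false;
  } catch {
    return true;
  }
}

// parseInt silently truncates or tolerates each of these:
console.log(parseInt('10abc', 10)); // 10
console.log(parseInt('1.5', 10)); // 1
console.log(parseInt('+10', 10)); // 10

// The strict parser rejects all three and accepts a plain integer.
console.log(['10abc', '1.5', '+10'].map(isRejected)); // [ true, true, true ]
console.log(parseMinLines('10')); // 10
```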
…overedByExactGroup
Council double-Codex review flagged real gaps. Addressed BLOCKER +
top REVIEWs:
## BLOCKER fix
- **Migrated-but-not-reindexed v6 DBs silently returned 0 groups**:
the schema check verified columns exist but didn't verify they're
populated. A user upgrading from v5 sees "no duplicates" and thinks
their code is clone-free, when really the index is blind. Fix: new
`fingerprintCoverage(db, kinds, minLines)` query counts eligible
rows vs fingerprinted rows; `assertFingerprintCoverage` throws
"0 of N eligible nodes have fingerprints. Run `codegraph index`"
when coverage is zero on a non-empty eligible set.
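The coverage gate described above can be sketched over in-memory rows. This is an illustration under stated assumptions, not the project's code: the real `fingerprintCoverage` takes a DB handle and runs a query, whereas the hypothetical `coverageOf` and `assertCoverage` below take plain arrays, and the `NodeRow` shape is invented for the example.

```typescript
// Hypothetical row shape; the real `nodes` table columns may differ.
interface NodeRow { kind: string; startLine: number; endLine: number; astHash: string | null }
interface Coverage { eligible: number; withAstHash: number }

// Count rows that pass the kind and min-lines filters, and how many of those
// carry a fingerprint.
function coverageOf(rows: NodeRow[], kinds: string[], minLines: number): Coverage {
  let eligible = 0;
  let withAstHash = 0;
  for (const r of rows) {
    const lines = r.endLine - r.startLine + 1;
    if (!kinds.includes(r.kind) || lines < minLines) continue;
    eligible++;
    if (r.astHash !== null) withAstHash++;
  }
  return { eligible, withAstHash };
}

// Throw only when there is something to fingerprint and nothing is fingerprinted.
function assertCoverage(cov: Coverage): void {
  if (cov.eligible > 0 && cov.withAstHash === 0) {
    throw new Error(
      `0 of ${cov.eligible} eligible nodes have fingerprints. Run \`codegraph index\``,
    );
  }
}

// A migrated-but-not-reindexed DB: rows exist, hashes are all null.
const staleRows: NodeRow[] = [
  { kind: 'function', startLine: 1, endLine: 20, astHash: null },
  { kind: 'method', startLine: 30, endLine: 45, astHash: null },
  { kind: 'function', startLine: 50, endLine: 52, astHash: null }, // under 10 lines
];
const staleCoverage = coverageOf(staleRows, ['function', 'method'], 10);
console.log(staleCoverage); // { eligible: 2, withAstHash: 0 }
```

An empty eligible set passes the gate, so a project with no qualifying functions still gets a genuine "no duplicates" result.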
## REVIEW fixes
- **No file-count or scope per group**: consumers couldn't distinguish
"same function in two files" (refactor) from "accessor pattern
repeated in the same class" (often legitimate) without
post-processing. Added `fileCount: number` per `DuplicateGroup`.
- **Shape superset hides exact subset overlap**: a shape group whose
members include an exact subgroup is informative when the user
knows about the overlap. Added `coveredByExactGroup: boolean` —
true on shape groups when at least one member is also in an exact
group; always false on exact groups.
- **Sort tertiary was SHA-256 hash (human-meaningless)**: now tertiary
is first-member filePath ASC, quaternary is first-member startLine
ASC, fingerprint only as final fallback. Same clone reproducibly
appears at the same output position across rebuilds AND humans can
read the order.
- **Member ORDER BY tied on (file_path, start_line)**: overloads /
generated rows could share both. Appended `id` to the SQL ORDER BY
for fully deterministic ordering.
- **Text-mode truncation silently capped at 20 groups / 5 members**:
CLI now prints "Output truncated: N more group(s) and M more
member(s) hidden. Use --json for full output." footer when capped.
- **Zero-result message blamed only min size**: includes selected
kinds, min-lines, and fingerprint coverage so users can tell
whether the empty result is real or an artefact.
- **Schema-validation test only exercised empty output**: rewritten
with a real `function shared` duplicate across two files; asserts
`exactGroups >= 1`, `fileCount === 2`, `coveredByExactGroup === false`.
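The revised sort order can be sketched as a comparator chain. The `Member` and `Group` shapes and the `compareGroups` name are hypothetical, and the sketch assumes each group's members are already ordered by (file_path, start_line, id) so that the first member is well defined.

```typescript
// Hypothetical shapes for illustration only.
interface Member { filePath: string; startLine: number; endLine: number }
interface Group { fingerprint: string; members: Member[] }

const cmpStr = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

const maxSpan = (g: Group): number =>
  Math.max(...g.members.map((m) => m.endLine - m.startLine + 1));

function firstMember(g: Group): Member {
  const m = g.members[0];
  if (m === undefined) throw new Error('group has no members');
  return m;
}

// member count DESC → max line span DESC → first filePath ASC
// → first startLine ASC → fingerprint ASC as the final fallback.
function compareGroups(a: Group, b: Group): number {
  const fa = firstMember(a);
  const fb = firstMember(b);
  return (
    b.members.length - a.members.length ||
    maxSpan(b) - maxSpan(a) ||
    cmpStr(fa.filePath, fb.filePath) ||
    fa.startLine - fb.startLine ||
    cmpStr(a.fingerprint, b.fingerprint)
  );
}

// Build a group of `count` members in one file, each spanning `span` lines.
const mk = (fingerprint: string, filePath: string, count: number, span: number): Group => ({
  fingerprint,
  members: Array.from({ length: count }, (_, i) => ({
    filePath,
    startLine: 1 + i * 100,
    endLine: i * 100 + span,
  })),
});

const unsorted = [mk('mm', 'src/b.ts', 2, 20), mk('aa', 'src/c.ts', 3, 10), mk('zz', 'src/a.ts', 2, 20)];
const order = unsorted.slice().sort(compareGroups).map((g) => g.fingerprint);
console.log(order); // [ 'aa', 'zz', 'mm' ]
```

In the sample, `zz` sorts ahead of `mm` because its first member lives in `src/a.ts`; the fingerprint is only consulted when every earlier key ties.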
## Schema
- `summary.fingerprintCoverage: { eligible, withAstHash }` required.
- `group.fileCount: integer minimum 1` required.
- `group.coveredByExactGroup: boolean` required.
## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1063 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
recall=1.00, precision=1.00, fp=0
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3a79047 to
39f02de
Third consumer of the CLI envelope/cli-tools family, completing the trio with diff (PR #39) and duplicates (PR #40). Surfaces every persisted provenance breadcrumb for a single edge:
- Edge identity (source, target, kind, line, col)
- Source + target node snapshots
- Extractor provenance: `edges.provenance` column (tree-sitter / scip / heuristic)
- Resolver provenance: `metadata.resolvedBy` strategy tag (import / framework / qualified-name / exact-match / instance-method / file-path / fuzzy) + `metadata.confidence`
- Full raw metadata JSON for forward-compat

Stacked on PR #40 (which stacks on PR #39).
## Surface
```
codegraph explain <edgeId>                         # happy path
codegraph explain --source <id> --target <id> \
                  --kind <k> [--line N] [--col N]  # rebuild-stable canonical
codegraph explain <edgeId> --json                  # full envelope
```
Default text output uses `formatExplainNarrative` for a concise multi-line trace; `--json` emits the full payload via `cliJsonEnvelope('explain', …)`.
## Council RFC outcome (Codex)
| Fork | Decision |
|---|---|
| A: identifier surface | **Both**: positional `<edgeId>` + canonical flags |
| B: expose `edgeId` in callers/callees JSON | **Yes**: small `rowToEdge` patch plumbs it through |
| C: output shape | **JSON + text narrative** (terminal use case dominates) |
| D: `--rerun` | **Skip**: re-resolution is a different diagnostic mode |
| E: ambiguous canonical lookup | **Error with `--line N --col N` hint** |
## Edge.id exposure
`Edge` gains an optional `id?: number` field with a docstring noting it's index-local (resets on rebuild). `rowToEdge` propagates it from the DB row. `collectGraphRelations` includes `edgeId` in the caller/callee JSON entries when present. `schemas/cli/callers.json` and `schemas/cli/callees.json` document the new optional field.
## Read-only safety
`pathToFileURL(dbPath).href + '?immutable=1'`, the same pattern as PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays bit-for-bit unchanged. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal.
## Schema gate
No v6 requirement: `provenance`/`metadata` columns existed in v5. `assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes` tables via a single `WHERE name IN (…)` probe. Missing resolver metadata is tolerated: `resolvedBy: null`, `confidence: null`, `metadata: null`. Explain reports what's there.
## Tests (10 unit + 1 schema validation)
- Integer-id lookup surfaces resolver provenance
- Canonical lookup by (source, target, kind)
- Ambiguous canonical → throws with `--line` hint
- No-match canonical → throws with full identifier in message
- Missing edge id → throws "no edge with id N"
- Non-positive / non-integer edge id → rejected before DB open
- Narrative format contains expected keywords
- Non-CodeGraph SQLite file → "not a CodeGraph database"
- Read-only invariant: DB + sidecars unchanged after explain
- Missing DB path → "not found"
## Reviewer trail
- Codex RFC → locked 5 design forks
- Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER)
  - `--kind` should be required for canonical lookup
  - `assertCodeGraphDb` should check both tables
  - `confidence` should be clamped to [0, 1]
  - Document `edgeId` in callers/callees schemas
- Codex round 2 → "closed — proceed to PR"
## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 62 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
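The canonical lookup behaviour (fork E: error with a `--line N --col N` hint when ambiguous) can be sketched over an in-memory edge list. Everything named here is hypothetical: the `EdgeRow` shape, the `findCanonicalEdge` helper, and the message wording are assumptions for illustration, and the real command queries the SQLite `edges` table instead.

```typescript
// Hypothetical shapes for illustration only.
interface EdgeRow {
  id: number;
  source: string;
  target: string;
  kind: string;
  line: number | null;
  col: number | null;
}
interface CanonicalQuery { source: string; target: string; kind: string; line?: number; col?: number }

// Resolve exactly one edge from (source, target, kind), optionally narrowed
// by line and column. Zero or multiple matches are errors.
function findCanonicalEdge(edges: EdgeRow[], q: CanonicalQuery): EdgeRow {
  const matches = edges.filter(
    (e) =>
      e.source === q.source &&
      e.target === q.target &&
      e.kind === q.kind &&
      (q.line === undefined || e.line === q.line) &&
      (q.col === undefined || e.col === q.col),
  );
  const [only, ...rest] = matches;
  if (only === undefined) {
    throw new Error(`no edge matching ${q.source} -> ${q.target} (${q.kind})`);
  }
  if (rest.length > 0) {
    throw new Error(
      `${matches.length} edges match ${q.source} -> ${q.target} (${q.kind}); ` +
        `disambiguate with --line N --col N`,
    );
  }
  return only;
}

// Two call sites between the same pair of nodes.
const sampleEdges: EdgeRow[] = [
  { id: 1, source: 'a', target: 'b', kind: 'calls', line: 3, col: 4 },
  { id: 2, source: 'a', target: 'b', kind: 'calls', line: 9, col: 2 },
];

console.log(findCanonicalEdge(sampleEdges, { source: 'a', target: 'b', kind: 'calls', line: 9 }).id); // 2
```

Because the integer edge id resets on rebuild, the canonical tuple is the form that remains usable across re-indexing, at the cost of needing `--line`/`--col` when several edges share source, target, and kind.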
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39f02ded4a
    if (cov.withAstHash === 0) {
      throw new Error(
        `duplicates: 0 of ${cov.eligible} eligible nodes have fingerprints. ` +
          `This usually means the DB was migrated to v6 but not re-indexed. ` +
          `Run \`codegraph index\` to refresh fingerprints.`,
      );
Handle unfingerprinted kinds without throwing reindex error
assertFingerprintCoverage unconditionally throws when eligible > 0 and withAstHash === 0, but some kinds are intentionally stored without fingerprints (for example, synthetic component nodes created by the Svelte/Vue extractors omit astHash). In a healthy, fully indexed DB, running codegraph duplicates --kind component will therefore fail with a misleading “migrated to v6 but not re-indexed” error instead of returning an empty result or a coverage warning, so valid user queries are rejected.
…essage
Codex PR review P2 findings on `feature/pf-692-codegraph-duplicates`:
1. **shapeNodes inflated by exact members**: when a shape group is a
strict superset of an exact group (e.g. {A.f, B.f} exact + A.g
shape-only → shape group {A.f, B.f, A.g}), `shapeNodes` was
summing all 3, double-counting A.f and B.f which already appear
in `exactNodes`. Fix: only increment `shapeNodes` for members
not already in any exact group.
2. **`assertFingerprintCoverage` misdiagnoses unfingerprintable kinds**:
the previous unconditional "run codegraph index" error was wrong
when the user requested a kind that's intentionally not
fingerprinted (framework-extractor `component`/`route`/`vue`
nodes bypass `createNode`'s hash path). Fix: probe `dbHasAnyFingerprint`
first — if any node in the DB has a fingerprint but the requested
kinds don't, surface "these kinds aren't fingerprinted" instead.
Tests: 2 new regression tests pinning both behaviors via hand-built
fixtures.
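The second fix boils down to a three-way decision, sketched below. The `diagnoseCoverage` function and its result labels are hypothetical names for this example; `dbHasAnyFingerprint` is taken from the commit message but is modelled here as a plain boolean rather than a DB probe.

```typescript
// Hypothetical result labels for illustration only.
type CoverageDiagnosis = 'ok' | 'needs-reindex' | 'kinds-not-fingerprinted';

interface Coverage { eligible: number; withAstHash: number }

// Zero coverage only means "re-index" when the whole DB lacks fingerprints.
// If other nodes are fingerprinted, the requested kinds are simply ones the
// extractors never hash.
function diagnoseCoverage(cov: Coverage, dbHasAnyFingerprint: boolean): CoverageDiagnosis {
  if (cov.eligible === 0 || cov.withAstHash > 0) return 'ok';
  return dbHasAnyFingerprint ? 'kinds-not-fingerprinted' : 'needs-reindex';
}

// Healthy DB, `--kind component`: eligible rows exist but none are hashed.
console.log(diagnoseCoverage({ eligible: 4, withAstHash: 0 }, true)); // kinds-not-fingerprinted

// Migrated v6 DB that was never re-indexed.
console.log(diagnoseCoverage({ eligible: 4, withAstHash: 0 }, false)); // needs-reindex
```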
## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean
- `npm run test:eval:structural`: required=5/5, all-pass=8/8
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
39f02de to
2a713c1
Second consumer of PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone groups under council-locked defaults: function+method kinds, ≥10 lines, shape groups whose members already form an exact group are suppressed.

Replaces auto-closed PR #40 (closed when PR #39's base branch was deleted on merge). Combines the original feat commit + round 2 fixes (fileCount, coveredByExactGroup, fingerprint coverage gate, human-meaningful sort) + the Codex-PR-review fixes (shapeNodes double-count + unfingerprintable-kind error message).
## Surface
```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```
## Read-only safety
Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39.
## Schema gate
- v5 DBs (no fingerprint columns): "duplicates requires schema v6+".
- Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0): "Run `codegraph index` to refresh fingerprints", but ONLY when the DB has no fingerprints anywhere. If other kinds have fingerprints but the requested kinds don't, surface "framework-extractor nodes aren't fingerprinted; try --kind=function,method".
## Output shape
- `groups[]`: each group has `kind`, `fingerprint`, `members[]`, `fileCount`, `coveredByExactGroup`.
- `summary`: includes `fingerprintCoverage: { eligible, withAstHash }` so consumers can detect partial coverage.
- Sort: member count DESC → max line span DESC → first member filePath ASC → startLine ASC → fingerprint ASC (deterministic, human-meaningful).
- `shapeNodes` counts ONLY shape-only members (not double-counted with `exactNodes`).
## Tests
19 unit tests + 1 schema validation test covering: defaults, exact + shape group detection, min-lines filter, sort (primary through tertiary), suppression rule, empty --kind error, v5 schema rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage exposure, fileCount, coveredByExactGroup, unfingerprintable-kind distinction, shapeNodes double-count fix, read-only invariant.
## Reviewer trail
- Codex RFC → locked 5 design forks
- Codex pass 1 + round 2 closure (subdir resolution, --min-lines strict parsing, comment wording, URI escaping, sort coverage)
- Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes double-count, unfingerprintable-kind error message)
- Council double-Codex deep review → confirmed all BLOCKERs fixed
## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
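The read-only URI pattern referenced above can be sketched with Node's `pathToFileURL`. The `immutableDbUri` helper name is an assumption for this example, and how the resulting URI is handed to the SQLite driver is not shown in the source, so that step is left out.

```typescript
import { pathToFileURL } from 'node:url';

// Hypothetical helper: build a file: URI carrying SQLite's immutable=1 query
// parameter. pathToFileURL percent-encodes characters such as spaces, so the
// appended "?" is the only query delimiter in the result.
function immutableDbUri(dbPath: string): string {
  return pathToFileURL(dbPath).href + '?immutable=1';
}

// A path containing a space, which naive string concatenation would leave raw.
const uri = immutableDbUri('/tmp/my project/codegraph.db');
console.log(uri);
```

String concatenation such as `'file://' + dbPath` would leave the space unescaped, which is the class of bug the pattern is meant to avoid.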
## Summary
Second consumer of the PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone groups in the indexed graph.

Stacked on `feature/pf-691-codegraph-diff` (PR #39). Both consume the same envelope-enum widening; PR #40 inherits the `duplicates` enum slot without duplicating that schema edit. Once PR #39 lands, this rebases to `main`.
## Design: council RFC outcome (Codex)
| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set equals an exact group |
| `--min-lines` floor | 10 |
| `--kind` default | `function,method` |
| Body-line counting | `endLine - startLine + 1` |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |
## Real fingerprint behavior pinned by tests
The `ast_hash` includes the function DECLARATION name (rename-locals only normalizes LOCAL identifiers, not the function's own name). Two same-bodied functions with different declaration names produce DIFFERENT `ast_hash` but IDENTICAL `ast_shape_hash`. Tests use this distinction:
- `handler`/`handler` across two files → exact (Type-1) group
- `adder`/`alsoAdder` with identical bodies → shape-only (Type-2) group
## Read-only safety (same Codex BLOCKER pattern from PR #39)
`openReadOnly` uses `pathToFileURL(dbPath).href + '?immutable=1'`: the canonical Node API for filesystem-path → file:// URL, which correctly handles spaces, non-ASCII, Windows drive letters, and all SQLite-reserved URI delimiters. `immutable=1` prevents `-shm`/`-wal` sidecar creation even on WAL-mode DBs. Regression test snapshots size + mtime of `db`/`-wal`/`-shm`/`-journal` before/after and asserts equal.
## v5 schema rejection
Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints` probes `PRAGMA table_info('nodes')` and throws `duplicates requires schema v6+ (PR #38 fingerprint columns).` instead of silently returning empty groups.
## Reviewer trail
- REVIEW: `duplicates [path]` didn't use `resolveProjectPath`; it now resolves the same way as `status`/`index`/`sync`
- REVIEW: `--min-lines` `parseInt` permissive (`10abc`, `1.5`); replaced with `/^[1-9]\d*$/` regex validation
- NITPICK: comment wording
- NITPICK: URI escaping (`pathToFileURL`)
- NITPICK: sort coverage
## Files
- `src/duplicates.ts`: `findDuplicates(dbPath, opts): DuplicatesResult` library export (~330 lines)
- `src/bin/codegraph.ts`: `codegraph duplicates [path]` CLI subcommand
- `schemas/cli/duplicates.json`: JSON Schema (envelope.tool=duplicates, members minItems:2)
- `__tests__/duplicates.test.ts`: 13 unit tests
- `__tests__/cli-json-schemas.test.ts`: JSON envelope validation test
## Test plan
- `npx tsc --noEmit`: clean
- `npx vitest run`: 1054 passed | 2 skipped | 61 files, 0 failures
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0

🤖 Generated with Claude Code
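The Type-1 versus Type-2 distinction and the suppression rule can be sketched together over an in-memory symbol list. This is an illustration only: the `Sym` shape, the `groupBy` and `findGroups` helpers, and the placeholder hash strings are all hypothetical, and the real implementation groups rows in SQL.

```typescript
// Hypothetical symbol shape; hash values below are placeholders, not real digests.
interface Sym { id: number; name: string; astHash: string; astShapeHash: string }

// Bucket symbols by a fingerprint and keep only buckets with 2+ members.
function groupBy(symbols: Sym[], key: (s: Sym) => string): Array<[string, Sym[]]> {
  const buckets = new Map<string, Sym[]>();
  for (const s of symbols) {
    const list = buckets.get(key(s)) ?? [];
    list.push(s);
    buckets.set(key(s), list);
  }
  return Array.from(buckets.entries()).filter(([, members]) => members.length > 1);
}

// Order-independent identity for a member set.
const memberKey = (members: Sym[]): string =>
  members.map((m) => m.id).sort((a, b) => a - b).join(',');

// Exact groups come from astHash; shape groups come from astShapeHash, minus
// any shape group whose member set equals an exact group's member set.
function findGroups(symbols: Sym[]): { exact: Array<[string, Sym[]]>; shape: Array<[string, Sym[]]> } {
  const exact = groupBy(symbols, (s) => s.astHash);
  const exactSets = new Set(exact.map(([, members]) => memberKey(members)));
  const shape = groupBy(symbols, (s) => s.astShapeHash).filter(
    ([, members]) => !exactSets.has(memberKey(members)),
  );
  return { exact, shape };
}

const symbols: Sym[] = [
  { id: 1, name: 'handler', astHash: 'h1', astShapeHash: 's1' },
  { id: 2, name: 'handler', astHash: 'h1', astShapeHash: 's1' },
  { id: 3, name: 'adder', astHash: 'h2', astShapeHash: 's2' },
  { id: 4, name: 'alsoAdder', astHash: 'h3', astShapeHash: 's2' },
];

const found = findGroups(symbols);
console.log(found.exact.map(([fp]) => fp)); // [ 'h1' ]
console.log(found.shape.map(([fp]) => fp)); // [ 's2' ]
```

The two `handler` copies share both hashes, so their shape group duplicates the exact group and is suppressed; `adder`/`alsoAdder` differ in `astHash` and surface only as a shape group.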