Skip to content

feat(PF-692): codegraph duplicates clone-detection CLI#40

Closed
mbenhamd wants to merge 3 commits into
feature/pf-691-codegraph-difffrom
feature/pf-692-codegraph-duplicates
Closed

feat(PF-692): codegraph duplicates clone-detection CLI#40
mbenhamd wants to merge 3 commits into
feature/pf-691-codegraph-difffrom
feature/pf-692-codegraph-duplicates

Conversation

@mbenhamd
Copy link
Copy Markdown
Owner

Summary

Second consumer of the PF-690 fingerprint columns (PR #38, schema v6). Reports Type-1 (ast_hash) and Type-2 (ast_shape_hash) clone groups in the indexed graph.

Stacked on feature/pf-691-codegraph-diff (PR #39). Both consume the same envelope-enum widening; PR #40 inherits the duplicates enum slot without duplicating that schema edit. Once PR #39 lands, this rebases to main.

codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]

Design — council RFC outcome (Codex)

Fork Decision Rationale
Type-2 dedup vs Type-1 Suppress shape groups whose member set EQUALS an exact group Type-1 implies Type-2; reporting both inflates the summary
--min-lines floor 10 Standard CPD/jscpd default; filters one-liner accessors
--kind default function,method Most useful clone targets; class-level often framework-shaped noise
Body-line source endLine - startLine + 1 Language-agnostic, no source rescan; blank/comment lines count toward threshold (accepted trade-off)
Sort order member count DESC → max line span DESC → fingerprint ASC Biggest clone-set first, deterministic tie-breakers

Real fingerprint behavior pinned by tests

The ast_hash includes the function DECLARATION name (rename-locals only normalizes LOCAL identifiers, not the function's own name). Two same-bodied functions with different declaration names produce DIFFERENT ast_hash but IDENTICAL ast_shape_hash. Tests use this distinction:

  • Same-named handler/handler across two files → exact (Type-1) group
  • adder/alsoAdder with identical bodies → shape-only (Type-2) group

Read-only safety (same Codex BLOCKER pattern from PR #39)

openReadOnly uses pathToFileURL(dbPath).href + '?immutable=1' — canonical Node API for filesystem-path → file:// URL, correctly handles spaces, non-ASCII, Windows drive letters, all SQLite-reserved URI delimiters. immutable=1 prevents -shm/-wal sidecar creation even on WAL-mode DBs. Regression test snapshots size + mtime of db/-wal/-shm/-journal before/after and asserts equal.

v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. assertSchemaSupportsFingerprints probes PRAGMA table_info('nodes') and throws duplicates requires schema v6+ (PR #38 fingerprint columns). instead of silently returning empty groups.

Reviewer trail

Pass Severity Finding Resolution
Codex pass 1 REVIEW duplicates [path] didn't use resolveProjectPath Now does — matches status/index/sync
Codex pass 1 REVIEW --min-lines parseInt permissive (10abc, 1.5) Added strict /^[1-9]\d*$/ regex validation
Codex pass 1 NITPICK Comment said "subset" but impl is "equals" Comment fixed
Codex pass 1 NITPICK Hand-rolled URI escape missed spaces/non-ASCII Switched to pathToFileURL
Codex pass 1 NITPICK Sort tests only covered primary key Added secondary (span) + tertiary (fingerprint) coverage
Codex round 2 All findings verified addressed "closed — proceed to PR"

Files

  • src/duplicates.tsfindDuplicates(dbPath, opts): DuplicatesResult library export (~330 lines)
  • src/bin/codegraph.tscodegraph duplicates [path] CLI subcommand
  • schemas/cli/duplicates.json — JSON Schema (envelope.tool=duplicates, members minItems:2)
  • __tests__/duplicates.test.ts — 13 unit tests
  • __tests__/cli-json-schemas.test.ts — JSON envelope validation test

Test plan

  • npx tsc --noEmit — clean
  • npx vitest run1054 passed | 2 skipped | 61 files, 0 failures
  • npm run build — clean (Node 24)
  • npm run test:eval:structural — required=5/5, all-pass=8/8, blocking=0, recall=1.00, precision=1.00, fp=0
  • Codex pass 1 + round 2 closure
  • CodeRabbit app review on PR

🤖 Generated with Claude Code

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a79047061

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/duplicates.ts
Comment on lines +311 to +314
for (const [fingerprint, members] of shape) {
groups.push({ kind: 'shape', fingerprint, members });
shapeNodes += members.length;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude exact members when reporting shape-only node totals

findDuplicates currently computes shapeNodes as the sum of all members in retained shape groups, but shape groups can legitimately include symbols that are also in exact groups (the suppression only removes exact-equal member sets). In that common case (e.g., two exact clones plus one renamed clone sharing shape), the summary overcounts and the CLI’s “Shape-only duplicate nodes” value is incorrect, which can mislead users comparing Type-1 vs Type-2 impact.

Useful? React with 👍 / 👎.

mbenhamd added a commit that referenced this pull request May 24, 2026
Second consumer of the PF-690 fingerprint columns (PR #38). Reports
Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups
under council-locked defaults: function+method kinds, ≥10 lines, shape
groups whose members already form an exact group are suppressed.

Stacked on top of PR #39 because both consume the same envelope-enum
widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)`
slot without duplicating the schema edit.

## Surface

```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

- `[path]` resolves through `resolveProjectPath` (matches `status`/
  `index`/`sync`), so running from a subdirectory walks up to the
  project root.
- `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would
  happily accept `10abc` / `1.5` / `+10`, hiding typos.

## Defaults locked by council RFC (Codex)

| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both |
| `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) |
| `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) |
| Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |

## Read-only safety (Codex round 1 BLOCKER pattern from PR #39)

Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical
Node API for filesystem-path → file:// URL conversion, correctly
escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved
URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar
creation even when the DB is in WAL mode.

## v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints`
probes `PRAGMA table_info('nodes')` and throws
`duplicates requires schema v6+ (PR #38 fingerprint columns).`
instead of silently returning empty groups.

## Tests (13 unit + 1 schema validation)

- Defaults exposed as constants for regression pinning
- Same-named functions → exact group (Type-1)
- Different-named identical bodies → shape group (Type-2)
- Single-symbol fingerprint → no group (HAVING > 1)
- `--min-lines` filters one-liners with default, surfaces with `--min-lines 1`
- Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary
- Shape group with member set ≡ exact group → suppressed
- Empty `--kind` list → clear error
- v5 schema → "requires schema v6+" error
- Missing DB path → "not found" error
- Read-only invariant: snapshot DB + sidecars before/after, assert equal

## Reviewer trail

- Codex RFC debate → locked 5 design forks
- Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema)
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mbenhamd added a commit that referenced this pull request May 24, 2026
…overedByExactGroup

Council double-Codex review flagged real gaps. Addressed BLOCKER +
top REVIEWs:

## BLOCKER fix

- **Migrated-but-not-reindexed v6 DBs silently returned 0 groups**:
  the schema check verified columns exist but didn't verify they're
  populated. A user upgrading from v5 sees "no duplicates" and thinks
  their code is clone-free, when really the index is blind. Fix: new
  `fingerprintCoverage(db, kinds, minLines)` query counts eligible
  rows vs fingerprinted rows; `assertFingerprintCoverage` throws
  "0 of N eligible nodes have fingerprints. Run `codegraph index`"
  when coverage is zero on a non-empty eligible set.

## REVIEW fixes

- **No file-count or scope per group**: consumers couldn't distinguish
  "same function in two files" (refactor) from "accessor pattern
  repeated in the same class" (often legitimate) without
  post-processing. Added `fileCount: number` per `DuplicateGroup`.

- **Shape superset hides exact subset overlap**: a shape group whose
  members include an exact subgroup is informative when the user
  knows about the overlap. Added `coveredByExactGroup: boolean` —
  true on shape groups when at least one member is also in an exact
  group; always false on exact groups.

- **Sort tertiary was SHA-256 hash (human-meaningless)**: now tertiary
  is first-member filePath ASC, quaternary is first-member startLine
  ASC, fingerprint only as final fallback. Same clone reproducibly
  appears at the same output position across rebuilds AND humans can
  read the order.

- **Member ORDER BY tied on (file_path, start_line)**: overloads /
  generated rows could share both. Appended `id` to the SQL ORDER BY
  for fully deterministic ordering.

- **Text-mode truncation silently capped at 20 groups / 5 members**:
  CLI now prints "Output truncated: N more group(s) and M more
  member(s) hidden. Use --json for full output." footer when capped.

- **Zero-result message blamed only min size**: includes selected
  kinds, min-lines, and fingerprint coverage so users can tell
  whether the empty result is real or an artefact.

- **Schema-validation test only exercised empty output**: rewritten
  with a real `function shared` duplicate across two files; asserts
  `exactGroups >= 1`, `fileCount === 2`, `coveredByExactGroup === false`.

## Schema

- `summary.fingerprintCoverage: { eligible, withAstHash }` required.
- `group.fileCount: integer minimum 1` required.
- `group.coveredByExactGroup: boolean` required.

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1063 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd force-pushed the feature/pf-692-codegraph-duplicates branch from 3a79047 to 39f02de Compare May 24, 2026 13:40
mbenhamd added a commit that referenced this pull request May 24, 2026
Third consumer of the CLI envelope/cli-tools family, completing the
trio with diff (PR #39) and duplicates (PR #40). Surfaces every
persisted provenance breadcrumb for a single edge:

- Edge identity (source, target, kind, line, col)
- Source + target node snapshots
- Extractor provenance: `edges.provenance` column (tree-sitter / scip
  / heuristic)
- Resolver provenance: `metadata.resolvedBy` strategy tag (import /
  framework / qualified-name / exact-match / instance-method /
  file-path / fuzzy) + `metadata.confidence`
- Full raw metadata JSON for forward-compat

Stacked on PR #40 (which stacks on PR #39).

## Surface

```
codegraph explain <edgeId>                          # happy path
codegraph explain --source <id> --target <id> \     # rebuild-stable canonical
                  --kind <k> [--line N] [--col N]
codegraph explain <edgeId> --json                   # full envelope
```

Default text output uses `formatExplainNarrative` for a concise
multi-line trace; `--json` emits the full payload via
`cliJsonEnvelope('explain', …)`.

## Council RFC outcome (Codex)

| Fork | Decision |
|---|---|
| A: identifier surface | **Both** — positional `<edgeId>` + canonical flags |
| B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through |
| C: output shape | **JSON + text narrative** (terminal use case dominates) |
| D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode |
| E: ambiguous canonical lookup | **Error with `--line N --col N` hint** |

## Edge.id exposure

`Edge` gains an optional `id?: number` field with a docstring noting
it's index-local (resets on rebuild). `rowToEdge` propagates it from
the DB row. `collectGraphRelations` includes `edgeId` in the
caller/callee JSON entries when present. `schemas/cli/callers.json`
and `schemas/cli/callees.json` document the new optional field.

## Read-only safety

`pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as
PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays
bit-for-bit unchanged. Regression test snapshots size + mtime of
`db`/`-wal`/`-shm`/`-journal` before/after and asserts equal.

## Schema gate

No v6 requirement — `provenance`/`metadata` columns existed in v5.
`assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes`
tables via a single `WHERE name IN (…)` probe. Missing resolver
metadata is tolerated: `resolvedBy: null`, `confidence: null`,
`metadata: null` — explain explains what's there.

## Tests (10 unit + 1 schema validation)

- Integer-id lookup surfaces resolver provenance
- Canonical lookup by (source, target, kind)
- Ambiguous canonical → throws with `--line` hint
- No-match canonical → throws with full identifier in message
- Missing edge id → throws "no edge with id N"
- Non-positive / non-integer edge id → rejected before DB open
- Narrative format contains expected keywords
- Non-CodeGraph SQLite file → "not a CodeGraph database"
- Read-only invariant: DB + sidecars unchanged after explain
- Missing DB path → "not found"

## Reviewer trail

- Codex RFC → locked 5 design forks
- Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER)
  - `--kind` should be required for canonical lookup
  - `assertCodeGraphDb` should check both tables
  - `confidence` should be clamped to [0, 1]
  - Document `edgeId` in callers/callees schemas
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 62 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  blocking=0, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39f02ded4a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/duplicates.ts Outdated
Comment on lines +224 to +229
if (cov.withAstHash === 0) {
throw new Error(
`duplicates: 0 of ${cov.eligible} eligible nodes have fingerprints. ` +
`This usually means the DB was migrated to v6 but not re-indexed. ` +
`Run \`codegraph index\` to refresh fingerprints.`,
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle unfingerprinted kinds without throwing reindex error

assertFingerprintCoverage unconditionally throws when eligible > 0 and withAstHash === 0, but some kinds are intentionally stored without fingerprints (for example, synthetic component nodes created by the Svelte/Vue extractors omit astHash). In a healthy, fully indexed DB, running codegraph duplicates --kind component will therefore fail with a misleading “migrated to v6 but not re-indexed” error instead of returning an empty result or a coverage warning, so valid user queries are rejected.

Useful? React with 👍 / 👎.

mbenhamd and others added 3 commits May 24, 2026 15:48
Second consumer of the PF-690 fingerprint columns (PR #38). Reports
Type-1 (exact `ast_hash`) and Type-2 (`ast_shape_hash`) clone groups
under council-locked defaults: function+method kinds, ≥10 lines, shape
groups whose members already form an exact group are suppressed.

Stacked on top of PR #39 because both consume the same envelope-enum
widening; PR #40 inherits diff's `cliJsonEnvelope('duplicates', …)`
slot without duplicating the schema edit.

## Surface

```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

- `[path]` resolves through `resolveProjectPath` (matches `status`/
  `index`/`sync`), so running from a subdirectory walks up to the
  project root.
- `--min-lines` validates with `/^[1-9]\d*$/` — `parseInt` would
  happily accept `10abc` / `1.5` / `+10`, hiding typos.

## Defaults locked by council RFC (Codex)

| Fork | Decision |
|---|---|
| Type-2 dedup against Type-1 | Suppress shape groups whose member set EQUALS an exact group; a Type-1 clone is by definition a Type-2 clone, no value in reporting both |
| `--min-lines` floor | 10 (standard CPD/jscpd default; filters one-liner accessors that would flood output) |
| `--kind` default | `function,method` (most useful clone targets; class-level often framework-shaped noise) |
| Body-line counting | `endLine - startLine + 1` from the nodes row (coarse but language-agnostic; no source rescan needed) |
| Sort order | member count DESC → max line span DESC → fingerprint ASC |

## Read-only safety (Codex round 1 BLOCKER pattern from PR #39)

Opens DB via `pathToFileURL(dbPath).href + '?immutable=1'` — canonical
Node API for filesystem-path → file:// URL conversion, correctly
escaping spaces, non-ASCII, Windows drive letters, and SQLite-reserved
URI delimiters. `immutable=1` flag prevents `-shm`/`-wal` sidecar
creation even when the DB is in WAL mode.

## v5 schema rejection

Pre-PR-#38 DBs lack fingerprint columns. `assertSchemaSupportsFingerprints`
probes `PRAGMA table_info('nodes')` and throws
`duplicates requires schema v6+ (PR #38 fingerprint columns).`
instead of silently returning empty groups.

## Tests (13 unit + 1 schema validation)

- Defaults exposed as constants for regression pinning
- Same-named functions → exact group (Type-1)
- Different-named identical bodies → shape group (Type-2)
- Single-symbol fingerprint → no group (HAVING > 1)
- `--min-lines` filters one-liners with default, surfaces with `--min-lines 1`
- Sort: member count DESC primary, line span DESC secondary, fingerprint ASC tertiary
- Shape group with member set ≡ exact group → suppressed
- Empty `--kind` list → clear error
- v5 schema → "requires schema v6+" error
- Missing DB path → "not found" error
- Read-only invariant: snapshot DB + sidecars before/after, assert equal

## Reviewer trail

- Codex RFC debate → locked 5 design forks
- Codex pass 1 → 2 REVIEW (subdir resolution, --min-lines parsing) + 3 NITPICK
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1054 passed | 2 skipped | 61 files (PR #39: 1042 → +13 duplicates + 1 schema)
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…overedByExactGroup

Council double-Codex review flagged real gaps. Addressed BLOCKER +
top REVIEWs:

## BLOCKER fix

- **Migrated-but-not-reindexed v6 DBs silently returned 0 groups**:
  the schema check verified columns exist but didn't verify they're
  populated. A user upgrading from v5 sees "no duplicates" and thinks
  their code is clone-free, when really the index is blind. Fix: new
  `fingerprintCoverage(db, kinds, minLines)` query counts eligible
  rows vs fingerprinted rows; `assertFingerprintCoverage` throws
  "0 of N eligible nodes have fingerprints. Run `codegraph index`"
  when coverage is zero on a non-empty eligible set.

## REVIEW fixes

- **No file-count or scope per group**: consumers couldn't distinguish
  "same function in two files" (refactor) from "accessor pattern
  repeated in the same class" (often legitimate) without
  post-processing. Added `fileCount: number` per `DuplicateGroup`.

- **Shape superset hides exact subset overlap**: a shape group whose
  members include an exact subgroup is informative when the user
  knows about the overlap. Added `coveredByExactGroup: boolean` —
  true on shape groups when at least one member is also in an exact
  group; always false on exact groups.

- **Sort tertiary was SHA-256 hash (human-meaningless)**: now tertiary
  is first-member filePath ASC, quaternary is first-member startLine
  ASC, fingerprint only as final fallback. Same clone reproducibly
  appears at the same output position across rebuilds AND humans can
  read the order.

- **Member ORDER BY tied on (file_path, start_line)**: overloads /
  generated rows could share both. Appended `id` to the SQL ORDER BY
  for fully deterministic ordering.

- **Text-mode truncation silently capped at 20 groups / 5 members**:
  CLI now prints "Output truncated: N more group(s) and M more
  member(s) hidden. Use --json for full output." footer when capped.

- **Zero-result message blamed only min size**: includes selected
  kinds, min-lines, and fingerprint coverage so users can tell
  whether the empty result is real or an artefact.

- **Schema-validation test only exercised empty output**: rewritten
  with a real `function shared` duplicate across two files; asserts
  `exactGroups >= 1`, `fileCount === 2`, `coveredByExactGroup === false`.

## Schema

- `summary.fingerprintCoverage: { eligible, withAstHash }` required.
- `group.fileCount: integer minimum 1` required.
- `group.coveredByExactGroup: boolean` required.

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1063 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…essage

Codex PR review P2 findings on `feature/pf-692-codegraph-duplicates`:

1. **shapeNodes inflated by exact members**: when a shape group is a
   strict superset of an exact group (e.g. {A.f, B.f} exact + A.g
   shape-only → shape group {A.f, B.f, A.g}), `shapeNodes` was
   summing all 3, double-counting A.f and B.f which already appear
   in `exactNodes`. Fix: only increment `shapeNodes` for members
   not already in any exact group.

2. **`assertFingerprintCoverage` misdiagnoses unfingerprintable kinds**:
   the previous unconditional "run codegraph index" error was wrong
   when the user requested a kind that's intentionally not
   fingerprinted (framework-extractor `component`/`route`/`vue`
   nodes bypass `createNode`'s hash path). Fix: probe `dbHasAnyFingerprint`
   first — if any node in the DB has a fingerprint but the requested
   kinds don't, surface "these kinds aren't fingerprinted" instead.

Tests: 2 new regression tests pinning both behaviors via hand-built
fixtures.

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean
- `npm run test:eval:structural`: required=5/5, all-pass=8/8

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd force-pushed the feature/pf-692-codegraph-duplicates branch from 39f02de to 2a713c1 Compare May 24, 2026 13:50
mbenhamd added a commit that referenced this pull request May 24, 2026
Third consumer of the CLI envelope/cli-tools family, completing the
trio with diff (PR #39) and duplicates (PR #40). Surfaces every
persisted provenance breadcrumb for a single edge:

- Edge identity (source, target, kind, line, col)
- Source + target node snapshots
- Extractor provenance: `edges.provenance` column (tree-sitter / scip
  / heuristic)
- Resolver provenance: `metadata.resolvedBy` strategy tag (import /
  framework / qualified-name / exact-match / instance-method /
  file-path / fuzzy) + `metadata.confidence`
- Full raw metadata JSON for forward-compat

Stacked on PR #40 (which stacks on PR #39).

## Surface

```
codegraph explain <edgeId>                          # happy path
codegraph explain --source <id> --target <id> \     # rebuild-stable canonical
                  --kind <k> [--line N] [--col N]
codegraph explain <edgeId> --json                   # full envelope
```

Default text output uses `formatExplainNarrative` for a concise
multi-line trace; `--json` emits the full payload via
`cliJsonEnvelope('explain', …)`.

## Council RFC outcome (Codex)

| Fork | Decision |
|---|---|
| A: identifier surface | **Both** — positional `<edgeId>` + canonical flags |
| B: expose `edgeId` in callers/callees JSON | **Yes** — small `rowToEdge` patch plumbs it through |
| C: output shape | **JSON + text narrative** (terminal use case dominates) |
| D: `--rerun` | **Skip** — re-resolution is a different diagnostic mode |
| E: ambiguous canonical lookup | **Error with `--line N --col N` hint** |

## Edge.id exposure

`Edge` gains an optional `id?: number` field with a docstring noting
it's index-local (resets on rebuild). `rowToEdge` propagates it from
the DB row. `collectGraphRelations` includes `edgeId` in the
caller/callee JSON entries when present. `schemas/cli/callers.json`
and `schemas/cli/callees.json` document the new optional field.

## Read-only safety

`pathToFileURL(dbPath).href + '?immutable=1'` — same pattern as
PRs #39 and #40. No `-shm`/`-wal` sidecar creation; the DB stays
bit-for-bit unchanged. Regression test snapshots size + mtime of
`db`/`-wal`/`-shm`/`-journal` before/after and asserts equal.

## Schema gate

No v6 requirement — `provenance`/`metadata` columns existed in v5.
`assertCodeGraphDb` rejects DBs missing BOTH `edges` AND `nodes`
tables via a single `WHERE name IN (…)` probe. Missing resolver
metadata is tolerated: `resolvedBy: null`, `confidence: null`,
`metadata: null` — explain explains what's there.

## Tests (10 unit + 1 schema validation)

- Integer-id lookup surfaces resolver provenance
- Canonical lookup by (source, target, kind)
- Ambiguous canonical → throws with `--line` hint
- No-match canonical → throws with full identifier in message
- Missing edge id → throws "no edge with id N"
- Non-positive / non-integer edge id → rejected before DB open
- Narrative format contains expected keywords
- Non-CodeGraph SQLite file → "not a CodeGraph database"
- Read-only invariant: DB + sidecars unchanged after explain
- Missing DB path → "not found"

## Reviewer trail

- Codex RFC → locked 5 design forks
- Codex pass 1 → 3 REVIEW + 5 NITPICK (no BLOCKER)
  - `--kind` should be required for canonical lookup
  - `assertCodeGraphDb` should check both tables
  - `confidence` should be clamped to [0, 1]
  - Document `edgeId` in callers/callees schemas
- Codex round 2 → "closed — proceed to PR"

## Validation

- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 62 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  blocking=0, recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd deleted the branch feature/pf-691-codegraph-diff May 24, 2026 13:51
@mbenhamd mbenhamd closed this May 24, 2026
mbenhamd added a commit that referenced this pull request May 24, 2026
Second consumer of PF-690 fingerprint columns (PR #38, schema v6).
Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone
groups under council-locked defaults: function+method kinds, ≥10
lines, shape groups whose members already form an exact group are
suppressed.

Replaces auto-closed PR #40 (closed when PR #39's base branch was
deleted on merge). Combines the original feat commit + round 2 fixes
(fileCount, coveredByExactGroup, fingerprint coverage gate,
human-meaningful sort) + the Codex-PR-review fixes (shapeNodes
double-count + unfingerprintable-kind error message).

## Surface
```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

## Read-only safety
Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39.

## Schema gate
- v5 DBs (no fingerprint columns): "duplicates requires schema v6+".
- Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0):
  "Run `codegraph index` to refresh fingerprints" — but ONLY when
  the DB has no fingerprints anywhere. If other kinds have
  fingerprints but the requested kinds don't, surface
  "framework-extractor nodes aren't fingerprinted; try
  --kind=function,method".

## Output shape
- `groups[]`: each group has `kind`, `fingerprint`, `members[]`,
  `fileCount`, `coveredByExactGroup`.
- `summary`: includes `fingerprintCoverage: { eligible, withAstHash }`
  so consumers can detect partial coverage.
- Sort: member count DESC → max line span DESC → first member
  filePath ASC → startLine ASC → fingerprint ASC (deterministic,
  human-meaningful).
- `shapeNodes` counts ONLY shape-only members (not double-counted
  with `exactNodes`).

## Tests
19 unit tests + 1 schema validation test covering: defaults,
exact + shape group detection, min-lines filter, sort (primary
through tertiary), suppression rule, empty --kind error, v5 schema
rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage
exposure, fileCount, coveredByExactGroup, unfingerprintable-kind
distinction, shapeNodes double-count fix, read-only invariant.

## Reviewer trail
- Codex RFC → locked 5 design forks
- Codex pass 1 + round 2 closure (subdir resolution, --min-lines
  strict parsing, comment wording, URI escaping, sort coverage)
- Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes
  double-count, unfingerprintable-kind error message)
- Council double-Codex deep review → confirmed all BLOCKERs fixed

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd deleted the feature/pf-692-codegraph-duplicates branch May 24, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant