
feat(PF-692): codegraph duplicates clone-detection CLI #42

Merged
mbenhamd merged 1 commit into `main` from `feature/pf-692-codegraph-duplicates-v2`
May 24, 2026

Conversation

@mbenhamd
Owner

Replaces auto-closed PR #40 (closed when PR #39's base branch was deleted on merge). Same content, rebased onto main. All council BLOCKERs, REVIEWs, and Codex PR review findings are already addressed.

Surface

`codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]`

Validation

  • 1065 tests pass, eval 8/8, tsc + build clean

Reviewer trail (carried forward from closed PR #40)

  • Codex RFC + pass 1 + round 2 closure
  • Council double-Codex deep review (oversimplification + missing code + pass-but-wrong)
  • Codex GitHub PR review (2 P2 findings addressed: shapeNodes double-count, unfingerprintable-kind error)

See merged PR #39 for the diff primitive this builds on.

🤖 Generated with Claude Code

Second consumer of PF-690 fingerprint columns (PR #38, schema v6).
Reports Type-1 (`ast_hash`) and Type-2 (`ast_shape_hash`) clone
groups under council-locked defaults: function+method kinds, ≥10
lines, shape groups whose members already form an exact group are
suppressed.
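
A minimal sketch of the grouping and suppression rule described above, not the shipped implementation. The `FnNode` shape and the helper names are hypothetical; `astHash` and `astShapeHash` stand in for the `ast_hash` and `ast_shape_hash` columns. The sketch assumes that two nodes with the same `ast_hash` always share an `ast_shape_hash`, so a shape group whose members all carry one `ast_hash` is the same set as an exact group.

```typescript
// Hypothetical row shape for a fingerprinted function or method node.
interface FnNode {
  id: number;
  astHash: string;      // stands in for the ast_hash column (Type-1)
  astShapeHash: string; // stands in for the ast_shape_hash column (Type-2)
}

// Bucket nodes by an arbitrary string key.
function groupBy(nodes: FnNode[], key: (n: FnNode) => string): Map<string, FnNode[]> {
  const out = new Map<string, FnNode[]>();
  for (const n of nodes) {
    const k = key(n);
    const bucket = out.get(k);
    if (bucket) bucket.push(n);
    else out.set(k, [n]);
  }
  return out;
}

// Only buckets with two or more members count as clone groups.
function cloneGroups(buckets: Map<string, FnNode[]>): FnNode[][] {
  return Array.from(buckets.values()).filter((g) => g.length >= 2);
}

function findCloneGroups(nodes: FnNode[]): { exact: FnNode[][]; shape: FnNode[][] } {
  const exact = cloneGroups(groupBy(nodes, (n) => n.astHash));
  // Suppress a shape group when all of its members share a single ast_hash,
  // because that member set is already reported as an exact group.
  const shape = cloneGroups(groupBy(nodes, (n) => n.astShapeHash)).filter(
    (g) => new Set(g.map((n) => n.astHash)).size > 1
  );
  return { exact, shape };
}
```

Under these assumptions, a pair of byte-identical functions shows up once as an exact group, while a pair that differs only in identifiers shows up once as a shape group.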

Replaces auto-closed PR #40 (closed when PR #39's base branch was
deleted on merge). Combines the original feat commit + round 2 fixes
(fileCount, coveredByExactGroup, fingerprint coverage gate,
human-meaningful sort) + the Codex-PR-review fixes (shapeNodes
double-count + unfingerprintable-kind error message).

## Surface
```
codegraph duplicates [path] [--kind function,method] [--min-lines 10] [-j|--json]
```

## Read-only safety
Same `pathToFileURL + '?immutable=1'` URI pattern locked in PR #39.
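
As an illustration of that URI pattern, here is a minimal sketch. The helper name is hypothetical and the PR does not show which SQLite driver consumes the URI; the sketch only builds the string. `pathToFileURL` is the real `node:url` function, and `immutable=1` is a standard SQLite URI parameter.

```typescript
import { pathToFileURL } from "node:url";

// Hypothetical helper: turn a filesystem path into a SQLite URI that requests
// an immutable open. pathToFileURL percent-encodes characters such as spaces,
// so only the query string appended here is read as open options.
function toImmutableSqliteUri(dbPath: string): string {
  return `${pathToFileURL(dbPath).href}?immutable=1`;
}
```

Whatever driver opens this string must have URI filename handling enabled; otherwise SQLite treats the whole string as a literal file name.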

## Schema gate
- v5 DBs (no fingerprint columns): "duplicates requires schema v6+".
- Migrated-not-reindexed v6 DBs (eligible > 0, withAstHash = 0):
  "Run `codegraph index` to refresh fingerprints" — but ONLY when
  the DB has no fingerprints anywhere. If other kinds have
  fingerprints but the requested kinds don't, surface
  "framework-extractor nodes aren't fingerprinted; try
  --kind=function,method".

## Output shape
- `groups[]`: each group has `kind`, `fingerprint`, `members[]`,
  `fileCount`, `coveredByExactGroup`.
- `summary`: includes `fingerprintCoverage: { eligible, withAstHash }`
  so consumers can detect partial coverage.
- Sort: member count DESC → max line span DESC → first member
  filePath ASC → startLine ASC → fingerprint ASC (deterministic,
  human-meaningful).
- `shapeNodes` counts ONLY shape-only members (not double-counted
  with `exactNodes`).
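
The sort order listed above can be written as a single comparator. This is a sketch under assumptions: the `Member` and `Group` types are trimmed-down stand-ins, `endLine` is a hypothetical field used to compute the line span, and `startLine ASC` is read as the first member's start line.

```typescript
// Hypothetical, trimmed-down shapes; endLine is an assumed field.
interface Member { filePath: string; startLine: number; endLine: number; }
interface Group { fingerprint: string; members: Member[]; }

// Plain code-unit string comparison keeps the order locale-independent.
function cmpStr(x: string, y: string): number {
  return x < y ? -1 : x > y ? 1 : 0;
}

// Longest member of a group, in lines.
function maxSpan(g: Group): number {
  return Math.max(...g.members.map((m) => m.endLine - m.startLine + 1));
}

// member count DESC, max line span DESC, first member filePath ASC,
// first member startLine ASC, fingerprint ASC.
function compareGroups(a: Group, b: Group): number {
  const fa = a.members[0];
  const fb = b.members[0];
  return (
    b.members.length - a.members.length ||
    maxSpan(b) - maxSpan(a) ||
    cmpStr(fa.filePath, fb.filePath) ||
    fa.startLine - fb.startLine ||
    cmpStr(a.fingerprint, b.fingerprint)
  );
}
```

Ending the chain on the fingerprint means two groups never compare as equal unless their fingerprints match, which is what makes the order deterministic.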

## Tests
19 unit tests + 1 schema validation test covering: defaults,
exact + shape group detection, min-lines filter, sort (primary
through tertiary), suppression rule, empty --kind error, v5 schema
rejection, migrated-not-reindexed BLOCKER, fingerprintCoverage
exposure, fileCount, coveredByExactGroup, unfingerprintable-kind
distinction, shapeNodes double-count fix, read-only invariant.

## Reviewer trail
- Codex RFC → locked 5 design forks
- Codex pass 1 + round 2 closure (subdir resolution, --min-lines
  strict parsing, comment wording, URI escaping, sort coverage)
- Codex PR review on round 2 → 2 P2 findings addressed (shapeNodes
  double-count, unfingerprintable-kind error message)
- Council double-Codex deep review → confirmed all BLOCKERs fixed

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1065 passed | 2 skipped | 61 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mbenhamd mbenhamd merged commit bd581c2 into main May 24, 2026
@mbenhamd mbenhamd deleted the feature/pf-692-codegraph-duplicates-v2 branch May 24, 2026 13:56
mbenhamd added a commit that referenced this pull request May 24, 2026
Third consumer of PF-690 fingerprint columns + first consumer of
edge resolver provenance. Replaces auto-closed PR #41 (closed when
its base branch was deleted on PR #39/#42 merge); same content
rebased onto main.

## Surface
```
codegraph explain <edgeId>                                  # happy path
codegraph explain --source <id> --target <id> --kind <k>    # canonical fallback
                  [--line N] [--col N]                      # disambiguators
codegraph explain <edgeId> --json
```

## Honest scope
`traceAvailable` is locked at `false` in the output: the
resolver in `src/resolution/` discards loser strategy attempts, so
only the winning resolver tag + confidence survive. The tool
surfaces persisted breadcrumbs, NOT a causal explanation. A future
PR will persist `edge_resolution_traces` and flip this to `true`.

## Round-trip stability
`callers --json` / `callees --json` now expose `edge: { kind,
source, target, line, col }` per relation, so users can pipe
output into `codegraph explain --source/--target/--kind` even
across rebuilds (when edgeId is invalidated).
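
A minimal sketch of that round trip: take one `edge` object from `callers --json` output and build the canonical `explain` arguments. The `EdgeRef` type and `explainArgs` helper are hypothetical, as is the assumption that node ids are strings; the flag names are the ones listed under Surface.

```typescript
// Hypothetical view of the `edge` object attached to each relation.
interface EdgeRef {
  kind: string;
  source: string; // node id (assumed to be a string here)
  target: string;
  line?: number;
  col?: number;
}

// Build argv for `codegraph explain` using the canonical (non-edgeId) form.
function explainArgs(edge: EdgeRef): string[] {
  const args = [
    "explain",
    "--source", edge.source,
    "--target", edge.target,
    "--kind", edge.kind,
  ];
  // line/col are optional disambiguators; pass them only when present.
  if (edge.line !== undefined) args.push("--line", String(edge.line));
  if (edge.col !== undefined) args.push("--col", String(edge.col));
  return args;
}
```

Because these arguments avoid the index-local `edgeId`, the same invocation can be replayed after a rebuild.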

## Read-only safety
Same `pathToFileURL + '?immutable=1'` URI pattern as PRs #39/#42.

## Schema gate
No v6 requirement — provenance/metadata columns existed in v5.
`assertCodeGraphDb` checks both `edges` AND `nodes` tables in a
single query; rejects with `not a CodeGraph database` if either
is missing.
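
One way such a single-query check could look, sketched with an injected query hook so it does not depend on any particular SQLite driver. `QueryRows` and `assertCodeGraphDbSketch` are hypothetical names; `sqlite_master` is SQLite's built-in schema table.

```typescript
// Hypothetical query hook: runs SQL and returns rows. It stands in for
// whatever SQLite driver the project actually uses.
type QueryRows = (sql: string) => Array<{ name: string }>;

const REQUIRED_TABLES = ["edges", "nodes"];

function assertCodeGraphDbSketch(query: QueryRows): void {
  // One round trip covers both tables.
  const rows = query(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name IN ('edges', 'nodes')"
  );
  const found = new Set(rows.map((r) => r.name));
  if (!REQUIRED_TABLES.every((t) => found.has(t))) {
    throw new Error("not a CodeGraph database");
  }
}
```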

## Output fields
- Edge identity (source, target, kind, line, col)
- Source + target node snapshots (qualifiedName, filePath,
  startLine)
- `extractorProvenance` (tree-sitter / scip / heuristic)
- `resolvedBy` strategy tag + `confidence` ∈ [0, 1]
- `metadata` (object form) + `rawMetadata` (string form for
  arrays/scalars/malformed — PR review fix so non-object metadata
  isn't silently dropped)
- `traceAvailable` (always `false` today; honest signal)
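
The `metadata` / `rawMetadata` split and the confidence range can be sketched as follows. This is an illustration only: `splitMetadata`, `MetadataView`, and `clampConfidence` are hypothetical names, and the sketch assumes metadata is stored as a JSON string.

```typescript
interface MetadataView {
  metadata?: Record<string, unknown>; // set only for JSON objects
  rawMetadata?: string;               // set for arrays, scalars, malformed JSON
}

function splitMetadata(raw: string | null): MetadataView {
  if (raw === null || raw === "") return {};
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return { metadata: parsed as Record<string, unknown> };
    }
  } catch {
    // Malformed JSON falls through to the raw form below.
  }
  // Keep the original string so non-object metadata is not silently dropped.
  return { rawMetadata: raw };
}

// Clamp a stored confidence into the documented [0, 1] range.
function clampConfidence(c: number): number {
  return Math.min(1, Math.max(0, c));
}
```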

## Edge.id exposure
`Edge` gains optional `id?: number` field (index-local, resets on
rebuild). `rowToEdge` propagates it. `collectGraphRelations` adds
`edgeId` + `edge` object to JSON output.

## Tests
12 unit tests + 1 schema validation test covering: integer-id
lookup with provenance assertions, canonical lookup, ambiguous
canonical with `--line` disambiguation (hand-built fixture, no
skip-on-precondition), no-match, missing-id, invalid-id rejection,
narrative format, non-CodeGraph DB rejection, read-only invariant,
missing DB, `traceAvailable: false` locked, non-object metadata
exposed via `rawMetadata`.

## Reviewer trail
- Codex RFC → locked 5 design forks (positional+canonical
  identifier, edge.id exposure, JSON+text, skip --rerun, ambiguity
  error)
- Codex pass 1 + round 2 closure (kind required, dbCheck both
  tables, confidence clamping, edgeId schema docs)
- Council double-Codex deep review → 2 BLOCKERs addressed (canonical
  round-trip needed edge object in callers JSON; traceAvailable
  honesty)
- Codex PR review on round 2 → no findings

## Validation
- `tsc --noEmit`: clean
- `vitest run`: 1078 passed | 2 skipped | 62 files
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mbenhamd added a commit that referenced this pull request May 24, 2026
… v0.10.0 (#44)

Closes the MCP gap from PRs #39/#42/#43: those landed CLI commands +
library exports but didn't register the three new tools with the MCP
server. Agents using Claude Code / Cursor / opencode etc. via MCP saw
only the original 9 codegraph_* tools. This PR exposes all 12.

## What's added

- **3 MCP tool descriptors** with input schemas + agent-facing
  descriptions: `codegraph_diff`, `codegraph_duplicates`,
  `codegraph_explain`. Each description explains scope, defaults,
  and round-trip flow (e.g., "get edgeId from codegraph_callers
  JSON output, pass it here").
- **3 dispatcher cases** in `ToolHandler.execute()`.
- **3 handlers** (`handleDiff`, `handleDuplicates`, `handleExplain`)
  with their own markdown formatters: `formatDiffResult`,
  `formatDuplicatesResult`, `formatExplainMcp`.
- **`resolveDbPathReadOnly(projectPath)`** helper — runs the
  project-access policy + walks up to find `.codegraph/` WITHOUT
  opening the DB in WAL mode (which would mutate it). Critical for
  diff: opening via the production `DatabaseConnection.open` path
  would defeat the immutable-snapshot guarantee.
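
The walk-up part of `resolveDbPathReadOnly` can be sketched without touching the database at all. Everything below is hypothetical scaffolding: `AccessCheck` stands in for the project-access policy, `DirExists` is injected so the sketch stays free of filesystem calls, and the function names are not the real ones. The sketch checks the policy on both the requested path and the resolved root, since walking up may land on an ancestor of what was requested.

```typescript
import * as path from "node:path";

type AccessCheck = (absPath: string) => boolean; // hypothetical policy hook
type DirExists = (absPath: string) => boolean;   // injected directory probe

// Walk from `start` toward the filesystem root, looking for `.codegraph/`.
function findNearestRoot(start: string, hasDir: DirExists): string | null {
  let dir = path.resolve(start);
  for (;;) {
    if (hasDir(path.join(dir, ".codegraph"))) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

function resolveRootReadOnly(requested: string, allowed: AccessCheck, hasDir: DirExists): string {
  const abs = path.resolve(requested);
  if (!allowed(abs)) throw new Error(`access denied: ${abs}`);
  const root = findNearestRoot(abs, hasDir);
  if (root === null) throw new Error("no .codegraph directory found");
  // The walk may have moved to an ancestor, so the policy is applied again.
  if (!allowed(root)) throw new Error(`access denied: ${root}`);
  return root;
}
```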

## Honest-scope signals carried through

- `codegraph_explain` MCP output always renders a "Scope note"
  block calling out `traceAvailable: false` — the resolver still
  discards loser strategy attempts.
- `codegraph_diff` MCP output renders the `fingerprintCoverage`
  warning when Liquid/Vue/Svelte gaps make body-level diffs blind.
- `codegraph_duplicates` MCP output includes coverage + kinds +
  min-lines context in every response (including zero-result).

## Codex review fixes applied

- **BLOCKER**: `resolveDbPathReadOnly` skipped the access check
  for non-existent paths AND never re-checked `resolvedRoot`. A
  caller passing `/disallowed/repo/nonexistent-child` could
  `findNearestCodeGraphRoot` walk UP to `/disallowed/repo` and
  return its DB — bypassing the allowlist. Fix: explicit
  `projectAccess.check(resolvedRoot)` after root resolution,
  matching the post-cache logic in `getCodeGraph()`. Regression
  test installs an allowlist excluding the fixture root and
  passes a non-existent child of it; expects access denial.
- **REVIEW**: `Number(args.maxChangedNodes) || 20` treated
  `maxChangedNodes: 0` as the default. Switched to explicit
  `undefined ? default : Number(...)` so 0 means "render no
  detail rows" as the formatter supports.
- **NITPICK**: tightened `traceAvailable: false` test to assert
  the dedicated `Scope note ... traceAvailable: false` block
  rather than any mention.
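
The `maxChangedNodes` finding comes down to how `||` treats `0`. A minimal before/after sketch, with hypothetical function names and the default of 20 taken from the text above:

```typescript
const DEFAULT_MAX_CHANGED_NODES = 20;

// Before: 0 is falsy, so an explicit 0 silently becomes the default.
function maxChangedNodesBuggy(arg: unknown): number {
  return Number(arg) || DEFAULT_MAX_CHANGED_NODES;
}

// After: only a missing argument selects the default; 0 is passed through.
function maxChangedNodesFixed(arg: unknown): number {
  return arg === undefined ? DEFAULT_MAX_CHANGED_NODES : Number(arg);
}
```

With the fixed form, `maxChangedNodes: 0` reaches the formatter unchanged and can mean "render no detail rows".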

Codex round-2 closure hit a sandbox issue (stale `/tmp/papersflow-pf-694`
workspace from another session caused it to review unrelated convex
code). The two fixes are mechanically obvious; tests pin both.

## Version bump

0.9.4 → **0.10.0** — minor bump for the new MCP surface + CLI
commands from PRs #39/#42/#43. No breaking changes.

## End-to-end verification

```
echo '{"jsonrpc":"2.0","id":1,"method":"initialize",...}
{"jsonrpc":"2.0","id":2,"method":"tools/list",...}' \
  | codegraph serve --mcp --path /tmp/mcp-smoke
```
returns 12 tools, and a live `codegraph_duplicates` call against a
real fixture returns a 1-group exact clone with markdown
formatting.

## Tests

18 new MCP integration tests:
- 5 tool descriptor + schema-shape contract tests
- 3 codegraph_diff handler tests (happy path + missing arg +
  uninitialized project)
- 3 codegraph_duplicates handler tests (real duplicates +
  malformed inputs + zero-result context)
- 5 codegraph_explain handler tests (id lookup with
  traceAvailable, mixed-mode rejection, empty-input rejection,
  kind-required, non-positive id)
- 1 dispatcher unknown-tool test
- 1 BLOCKER regression test (allowlist + non-existent child →
  access denial on resolved root)

## Validation

- `tsc --noEmit`: clean
- `vitest run`: **1096 passed | 2 skipped | 63 files**
- `npm run build`: clean (Node 24)
- `npm run test:eval:structural`: required=5/5, all-pass=8/8,
  recall=1.00, precision=1.00, fp=0
- MCP stdio smoke test: tools/list returns 12 tools incl. the new
  three; tools/call → codegraph_duplicates returns a real exact
  clone group from a fixture

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>