perf: optimize cycles and stats for large codebases#781
Conversation
findCycles: build adjacency list directly from DB rows and run Tarjan inline — eliminates intermediate CodeGraph construction, toEdgeArray serialization, and redundant label CodeGraph rebuild. statsData/moduleMapData: replace O(V×E) correlated subqueries in findHotspots and moduleMapData with pre-aggregated LEFT JOINs. Replace full-table-load + JS filtering in countNodesByKind, countEdgesByKind, and countRoles with SQL WHERE clauses using testFilterSQL. Remove buildTestFileIds (no longer needed). Closes #763
|
Claude finished @carlos-alm's task —— View job PR Review: Performance Optimization for Cycles and StatsTodo List
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance timeouts in cycles and stats commands on large codebases (#763) Why This Approach: The optimizations target specific algorithmic bottlenecks:
Risk Assessment: ✅ Low Risk - These are pure performance optimizations that maintain identical functionality. The approach directly addresses computational complexity bottlenecks rather than masking symptoms. Phase 2: Configuration Change Detection ✅Status: ✅ No Configuration Changes
Phase 3: Test Integrity ✅Status: ✅ Tests Remain Robust
Phase 4: Code Quality Review ✅Excellent Optimizations:
Implementation Quality:
No Issues Found: No bugs, security concerns, or performance regressions detected. Phase 5: Backlog Compliance ✅
Final Recommendation
Summary: This PR demonstrates excellent performance engineering - targeting algorithmic bottlenecks with surgical precision while maintaining full API compatibility. The optimizations are mathematically sound (O(V×E) → O(E)) and implementation is clean. Strong approve. |
Greptile SummaryThis PR is a targeted performance optimization for two query-heavy code paths: cycle detection ( Key changes:
The two previously-raised concerns (edge deduplication and Confidence Score: 5/5Safe to merge — no P0/P1 issues remain; prior review concerns about edge deduplication and hotspot edge-kind filtering are both resolved in this head commit. All remaining findings are P2 or lower. The core correctness risks raised in prior rounds (missing deduplication, wrong edge-kind inclusion in hotspot subqueries) are addressed. The algorithmic changes are semantically equivalent to the old paths, the test suite passes, and the SQL rewrites follow established patterns already used elsewhere in the codebase. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI
participant findCycles
participant DB
participant tarjanFromEdges
participant native
CLI->>findCycles: findCycles(db, {fileLevel, noTests})
findCycles->>DB: getFileNodesAll() or getCallableNodes()
DB-->>findCycles: node rows
findCycles->>DB: getImportEdges() or getCallEdges()
DB-->>findCycles: edge rows
Note over findCycles: Deduplicate via Set<key><br/>Build label→label edge list
findCycles->>native: detectCycles(edges) [if available]
native-->>findCycles: string[][]
findCycles->>tarjanFromEdges: tarjanFromEdges(edges) [JS fallback]
Note over tarjanFromEdges: Build adj Map<br/>Run Tarjan SCC<br/>Filter SCCs length > 1
tarjanFromEdges-->>findCycles: string[][]
findCycles-->>CLI: cycles
Reviews (2): Last reviewed commit: "fix: deduplicate edges in findCycles and..." | Re-trigger Greptile |
| for (const e of getImportEdges(db)) { | ||
| if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue; | ||
| if (e.source_id === e.target_id) continue; | ||
| const src = idToFile.get(e.source_id)!; | ||
| const tgt = idToFile.get(e.target_id)!; | ||
| edges.push({ source: src, target: tgt }); | ||
| } | ||
| } else { | ||
| let nodes = getCallableNodes(db); | ||
| if (noTests) nodes = nodes.filter((n) => !isTestFile(n.file)); | ||
| const nodeIds = new Set<number>(); | ||
| const idToLabel = new Map<number, string>(); | ||
| for (const n of nodes) { | ||
| nodeIds.add(n.id); | ||
| idToLabel.set(n.id, `${n.name}|${n.file}`); | ||
| } | ||
| for (const e of getCallEdges(db)) { | ||
| if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue; | ||
| if (e.source_id === e.target_id) continue; | ||
| const src = idToLabel.get(e.source_id)!; | ||
| const tgt = idToLabel.get(e.target_id)!; | ||
| edges.push({ source: src, target: tgt }); | ||
| } |
There was a problem hiding this comment.
Duplicate edges not deduplicated — regression vs. old
buildDependencyGraph
The old buildDependencyGraph in src/graph/builders/dependency.ts explicitly deduplicated edges in both the file-level and function-level branches via an if (!graph.hasEdge(src, tgt)) guard before adding to the CodeGraph. This meant multiple import statements from file A to file B, or multiple call sites from function A to function B, were collapsed to a single edge.
The new code pushes raw DB rows directly into edges without any deduplication. A function that calls another function from ten distinct call sites produces ten identical { source, target } pairs in the array.
Impact on the native path: native.detectCycles(edges) now receives a larger and potentially duplicate-heavy list compared to what it previously received. There is no guarantee the native implementation handles duplicate edges the same way as the JS Tarjan path, creating an observable divergence between the two engines.
Impact on the JS path: tarjanFromEdges remains correct — the indices.has(w) guard prevents re-processing — but redundant adjacency-list entries add unnecessary work, which runs counter to this PR's own performance goals.
The fix is to track seen (source, target) pairs in a Set and skip duplicates before pushing into edges, mirroring the hasEdge guard that existed in buildDependencyGraph.
There was a problem hiding this comment.
Fixed in a511c82. Added a Set<string> to track seen (source, target) pairs using a null-byte-separated key. Both the file-level and function-level branches now skip duplicate edges before pushing into the array, restoring the deduplication guarantee that existed in the old buildDependencyGraph path. This ensures the native engine receives the same unique edge set as the JS Tarjan path.
| LEFT JOIN ( | ||
| SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id | ||
| ) fi ON fi.target_id = n.id | ||
| LEFT JOIN ( | ||
| SELECT source_id, COUNT(*) AS cnt FROM edges GROUP BY source_id | ||
| ) fo ON fo.source_id = n.id | ||
| WHERE n.kind = 'file' ${testFilter} | ||
| ORDER BY (SELECT COUNT(*) FROM edges WHERE target_id = n.id) | ||
| + (SELECT COUNT(*) FROM edges WHERE source_id = n.id) DESC | ||
| ORDER BY COALESCE(fi.cnt, 0) + COALESCE(fo.cnt, 0) DESC | ||
| LIMIT ? | ||
| `) |
There was a problem hiding this comment.
Pre-aggregated fan-in/fan-out subqueries count ALL edge types, including edges between non-file nodes
The two LEFT JOIN subqueries aggregate over the full edges table:
LEFT JOIN (
SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id
) fi ON fi.target_id = n.idThis counts every edge whose target_id equals the file node's ID — including contains, calls, imports, parameter_of, etc. The old correlated-subquery approach (SELECT COUNT(*) FROM edges WHERE target_id = n.id) had identical semantics, so there is no correctness regression here.
However, unlike moduleMapData (which explicitly excludes contains, parameter_of, receiver in its equivalent subqueries), findHotspots counts all edge kinds. This means a file's fan_in will be dominated by contains edges (one per function/method in the file) rather than by meaningful import or call fan-in. If hotspot ranking by inter-file coupling is the intent, it may be worth adding a WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filter to the subqueries to match moduleMapData's approach — or at minimum confirm this is intentional.
There was a problem hiding this comment.
Fixed in a511c82. Added WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filters to both the fan-in and fan-out subqueries in findHotspots, matching the filtering already used by moduleMapData. Hotspot ranking now reflects inter-file coupling (imports, calls) rather than being dominated by internal structure edges.
…spots (#781) Restore edge deduplication that existed in the old buildDependencyGraph path — without it, duplicate (source, target) pairs from multiple call sites inflated the edge array passed to both native and JS engines. Also align findHotspots fan-in/fan-out subqueries with moduleMapData by excluding contains, parameter_of, and receiver edges so hotspot ranking reflects inter-file coupling rather than internal structure edges.
…m/optave/ops-codegraph-tool into perf/fix-763-cycles-stats-timeout
Summary
findCycles: Build adjacency list directly from DB rows and run Tarjan inline — eliminates intermediateCodeGraphconstruction,toEdgeArray()serialization, and redundant labelCodeGraphrebuildfindHotspots/moduleMapData: Replace O(V×E) correlated subqueries with pre-aggregatedLEFT JOINs (edge counts grouped once, joined by ID)countNodesByKind/countEdgesByKind/countRoles: Replace full-table-load + JS.filter()with SQLWHEREclauses viatestFilterSQL; removebuildTestFileIds(no longer needed)Test plan
tests/graph/— 271 tests pass (cycles, algorithms, builders)tests/integration/queries.test.ts— 99 tests passtests/integration/roles.test.ts— 12 tests passbiome check)Closes #763