perf: optimize cycles and stats for large codebases by carlos-alm · Pull Request #781 · optave/ops-codegraph-tool

carlos-alm · 2026-04-03T03:54:33Z

Summary

findCycles: Build adjacency list directly from DB rows and run Tarjan inline — eliminates intermediate CodeGraph construction, toEdgeArray() serialization, and redundant label CodeGraph rebuild
findHotspots / moduleMapData: Replace O(V×E) correlated subqueries with pre-aggregated LEFT JOINs (edge counts grouped once, joined by ID)
countNodesByKind / countEdgesByKind / countRoles: Replace full-table-load + JS .filter() with SQL WHERE clauses via testFilterSQL; remove buildTestFileIds (no longer needed)

Test plan

tests/graph/ — 271 tests pass (cycles, algorithms, builders)
tests/integration/queries.test.ts — 99 tests pass
tests/integration/roles.test.ts — 12 tests pass
Lint clean (biome check)

Closes #763

findCycles: build adjacency list directly from DB rows and run Tarjan inline — eliminates intermediate CodeGraph construction, toEdgeArray serialization, and redundant label CodeGraph rebuild. statsData/moduleMapData: replace O(V×E) correlated subqueries in findHotspots and moduleMapData with pre-aggregated LEFT JOINs. Replace full-table-load + JS filtering in countNodesByKind, countEdgesByKind, and countRoles with SQL WHERE clauses using testFilterSQL. Remove buildTestFileIds (no longer needed). Closes #763

claude · 2026-04-03T03:54:44Z

PR Review: Performance Optimization for Cycles and Stats

Todo List

Phase 1: Root cause analysis - understand the timeout problem and solution approach
Phase 2: Configuration change detection - check for any rule relaxations
Phase 3: Test integrity - verify tests remain robust
Phase 4: Code quality review - analyze the optimization implementation
Phase 5: Backlog compliance assessment - rate against project criteria
Final Assessment: Provide rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Performance timeouts in cycles and stats commands on large codebases (#763)

Why This Approach: The optimizations target specific algorithmic bottlenecks:

findCycles: Eliminates O(V+E) overhead from CodeGraph construction + serialization
findHotspots/moduleMapData: Replaces O(V×E) correlated subqueries with O(E) pre-aggregated JOINs
Stats count functions: Replaces full-table JavaScript filtering with SQL WHERE clauses

Risk Assessment: ✅ Low Risk - These are pure performance optimizations that maintain identical functionality. The approach directly addresses computational complexity bottlenecks rather than masking symptoms.

Phase 2: Configuration Change Detection ✅

Status: ✅ No Configuration Changes

No changes to biome.json, package.json, or test timeouts
Only 2 source files modified: cycles.ts and module-map.ts
No rule relaxations or test expectations lowered
buildTestFileIds function removed (no longer needed due to SQL optimization)

Phase 3: Test Integrity ✅

Status: ✅ Tests Remain Robust

Existing test suite (tests/graph/cycles.test.ts) validates behavior via public API
Tests create DB directly and call findCycles(db) - optimizations are transparent
No test modifications needed since public API unchanged
PR claims all 271 graph tests + 99 integration tests + 12 roles tests pass

Phase 4: Code Quality Review ✅

Excellent Optimizations:

findCycles optimization (cycles.ts:12-61):
- Before: buildDependencyGraph() → CodeGraph → toEdgeArray() → rebuild CodeGraph for labels
- After: Direct adjacency list from DB rows → inline Tarjan algorithm
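- Impact: Eliminates three O(V+E) passes (CodeGraph construction, toEdgeArray() serialization, and the label CodeGraph rebuild) along with their intermediate allocations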
findHotspots optimization (module-map.ts:111-135):
- Before: Correlated subqueries (SELECT COUNT(*) FROM edges WHERE target_id = n.id) - O(V×E)
- After: Pre-aggregated LEFT JOINs with grouped edge counts - O(E)
- Impact: Edge counts are grouped once and joined by ID instead of being recomputed for every node, which matters most on large graphs
SQL filtering replacement:
- Before: buildTestFileIds() loads all files into memory, filters in JavaScript
- After: SQL WHERE clauses via testFilterSQL()
- Impact: The database does the filtering, so excluded rows are never loaded into JavaScript
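
The complexity change described for findHotspots can be illustrated with a minimal in-memory sketch. This is not the PR's SQL: the `Edge` shape and both function names are hypothetical, and the per-node scan assumes the correlated subqueries run without a helpful index. The first function rescans every edge for every node, the second groups edge counts once and looks them up by ID.

```typescript
// Hypothetical in-memory model of the edges table; not the project's schema.
interface Edge {
  sourceId: number;
  targetId: number;
}

// Analogue of the correlated subqueries: rescan all edges for each node, O(V×E).
function degreePerNodeScan(nodeIds: number[], edges: Edge[]): Map<number, number> {
  const degree = new Map<number, number>();
  for (const id of nodeIds) {
    let count = 0;
    for (const e of edges) {
      if (e.targetId === id) count++; // fan-in
      if (e.sourceId === id) count++; // fan-out
    }
    degree.set(id, count);
  }
  return degree;
}

// Analogue of the pre-aggregated LEFT JOINs: group once, then join by ID, O(V+E).
function degreePreAggregated(nodeIds: number[], edges: Edge[]): Map<number, number> {
  const fanIn = new Map<number, number>();
  const fanOut = new Map<number, number>();
  for (const e of edges) {
    fanIn.set(e.targetId, (fanIn.get(e.targetId) ?? 0) + 1);
    fanOut.set(e.sourceId, (fanOut.get(e.sourceId) ?? 0) + 1);
  }
  const degree = new Map<number, number>();
  for (const id of nodeIds) {
    // Mirrors COALESCE(cnt, 0): nodes without edges get a count of 0.
    degree.set(id, (fanIn.get(id) ?? 0) + (fanOut.get(id) ?? 0));
  }
  return degree;
}
```

Both functions return the same counts; only the number of passes over the edge list differs.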

Implementation Quality:

✅ Maintains identical public APIs
✅ Preserves native/WASM fallback patterns
✅ Clean SQL with proper parameterization
✅ Efficient adjacency list construction
✅ Correct Tarjan implementation (existing algorithm, relocated)

No Issues Found: No bugs, security concerns, or performance regressions detected.

Phase 5: Backlog Compliance ✅

Zero-dep: ✓ No new runtime dependencies
Foundation-aligned: ✓ Fully compliant - improves core analysis without violating principles
Problem-fit: 5/5 - Directly addresses "AI agents timeout on large codebases" core problem
Breaking: No - Pure internal optimizations, identical public APIs
Tier: Tier 1 (zero-dep ✓ + foundation-aligned ✓)

Final Recommendation

Rating: ⭐⭐⭐⭐⭐ (5/5)
Action: APPROVE
Reasoning: Exceptional performance engineering that directly solves the core timeout problem. Clean implementation with no breaking changes, proper fallbacks maintained, and transparent API preservation. This is exactly the kind of optimization needed for large codebases.

Summary: This PR demonstrates excellent performance engineering - targeting algorithmic bottlenecks with surgical precision while maintaining full API compatibility. The optimizations are mathematically sound (O(V×E) → O(E)) and implementation is clean. Strong approve.

greptile-apps · 2026-04-03T04:00:10Z

Greptile Summary

This PR is a targeted performance optimization for two query-heavy code paths: cycle detection (findCycles) and stats/module-map aggregation (statsData, moduleMapData, findHotspots). The changes eliminate intermediate CodeGraph object construction and replace O(V×E) correlated subqueries with pre-aggregated LEFT JOIN subqueries.

Key changes:

findCycles: Builds a label-keyed adjacency list directly from DB rows and runs Tarjan SCC inline (tarjanFromEdges), skipping buildDependencyGraph, CodeGraph, and toEdgeArray(). A Set<string> with null-byte-separated keys deduplicates edges before they enter the algorithm.
findHotspots: Swaps correlated per-row SELECT COUNT(*) for two pre-aggregated LEFT JOIN subqueries (with kind NOT IN ('contains', 'parameter_of', 'receiver') filters, added per prior review feedback). The SQL LIMIT ? replaces the JS .slice().
moduleMapData: Same pre-aggregated JOIN pattern for out_edges/in_edges; selects only needed columns (n.file) instead of n.*.
countNodesByKind / countEdgesByKind / countRoles: Replace full-table load + JS .filter() with testFilterSQL-generated WHERE clauses; buildTestFileIds removed.

The two previously-raised concerns (edge deduplication and contains/parameter_of/receiver exclusion in hotspot subqueries) are both addressed in the current head commit.
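
For readers unfamiliar with the JS fallback path, the following is a minimal sketch of Tarjan's SCC algorithm over a label-to-label edge list, keeping only components with more than one node as the sequence diagram below describes. It is not the PR's tarjanFromEdges: the name `sccCyclesFromEdges` and the `LabelEdge` type are hypothetical, and the sketch is recursive for brevity.

```typescript
// Hypothetical edge shape: endpoints are already resolved to string labels.
interface LabelEdge {
  source: string;
  target: string;
}

function sccCyclesFromEdges(edges: LabelEdge[]): string[][] {
  // Build the adjacency map; every endpoint gets an entry.
  const adj = new Map<string, string[]>();
  for (const { source, target } of edges) {
    if (!adj.has(source)) adj.set(source, []);
    if (!adj.has(target)) adj.set(target, []);
    adj.get(source)!.push(target);
  }

  let nextIndex = 0;
  const index = new Map<string, number>();
  const lowlink = new Map<string, number>();
  const stack: string[] = [];
  const onStack = new Set<string>();
  const cycles: string[][] = [];

  function visit(v: string): void {
    index.set(v, nextIndex);
    lowlink.set(v, nextIndex);
    nextIndex++;
    stack.push(v);
    onStack.add(v);

    for (const w of adj.get(v) ?? []) {
      if (!index.has(w)) {
        // Tree edge: recurse, then propagate the child's lowlink.
        visit(w);
        lowlink.set(v, Math.min(lowlink.get(v)!, lowlink.get(w)!));
      } else if (onStack.has(w)) {
        // Back edge into the current component.
        lowlink.set(v, Math.min(lowlink.get(v)!, index.get(w)!));
      }
    }

    // v is the root of a component: pop the stack down to v.
    if (lowlink.get(v) === index.get(v)) {
      const component: string[] = [];
      let popped = "";
      do {
        popped = stack.pop()!;
        onStack.delete(popped);
        component.push(popped);
      } while (popped !== v);
      // Single-node components are not cycles (self-loops are assumed filtered earlier).
      if (component.length > 1) cycles.push(component);
    }
  }

  for (const v of adj.keys()) {
    if (!index.has(v)) visit(v);
  }
  return cycles;
}
```

A recursive visit can exhaust the call stack on very deep graphs, so a production version would likely use an explicit stack.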

Confidence Score: 5/5

Safe to merge — no P0/P1 issues remain; prior review concerns about edge deduplication and hotspot edge-kind filtering are both resolved in this head commit.

All remaining findings are P2 or lower. The core correctness risks raised in prior rounds (missing deduplication, wrong edge-kind inclusion in hotspot subqueries) are addressed. The algorithmic changes are semantically equivalent to the old paths, the test suite passes, and the SQL rewrites follow established patterns already used elsewhere in the codebase.

No files require special attention.

Important Files Changed

Filename	Overview
src/domain/graph/cycles.ts	Replaces intermediate CodeGraph construction with a direct DB-row adjacency list and inline Tarjan SCC; adds deduplication via a null-byte-separated key Set, matching the old buildDependencyGraph guard. Logic is correct.
src/domain/analysis/module-map.ts	Replaces O(V×E) correlated subqueries with pre-aggregated LEFT JOINs for findHotspots and moduleMapData; replaces full-table JS filters with SQL WHERE clauses for countNodesByKind, countEdgesByKind, and countRoles; edge-kind exclusion filters correctly added to hotspot subqueries per prior review feedback.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant findCycles
    participant DB
    participant tarjanFromEdges
    participant native

    CLI->>findCycles: findCycles(db, {fileLevel, noTests})
    findCycles->>DB: getFileNodesAll() or getCallableNodes()
    DB-->>findCycles: node rows
    findCycles->>DB: getImportEdges() or getCallEdges()
    DB-->>findCycles: edge rows
    Note over findCycles: Deduplicate via Set<key><br/>Build label→label edge list
    findCycles->>native: detectCycles(edges) [if available]
    native-->>findCycles: string[][]
    findCycles->>tarjanFromEdges: tarjanFromEdges(edges) [JS fallback]
    Note over tarjanFromEdges: Build adj Map<br/>Run Tarjan SCC<br/>Filter SCCs length > 1
    tarjanFromEdges-->>findCycles: string[][]
    findCycles-->>CLI: cycles

Reviews (2): Last reviewed commit: "fix: deduplicate edges in findCycles and..."

greptile-apps · 2026-04-03T04:00:14Z

src/domain/graph/cycles.ts

+    for (const e of getImportEdges(db)) {
+      if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue;
+      if (e.source_id === e.target_id) continue;
+      const src = idToFile.get(e.source_id)!;
+      const tgt = idToFile.get(e.target_id)!;
+      edges.push({ source: src, target: tgt });
+    }
+  } else {
+    let nodes = getCallableNodes(db);
+    if (noTests) nodes = nodes.filter((n) => !isTestFile(n.file));
+    const nodeIds = new Set<number>();
+    const idToLabel = new Map<number, string>();
+    for (const n of nodes) {
+      nodeIds.add(n.id);
+      idToLabel.set(n.id, `${n.name}|${n.file}`);
+    }
+    for (const e of getCallEdges(db)) {
+      if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue;
+      if (e.source_id === e.target_id) continue;
+      const src = idToLabel.get(e.source_id)!;
+      const tgt = idToLabel.get(e.target_id)!;
+      edges.push({ source: src, target: tgt });
    }


Duplicate edges not deduplicated — regression vs. old buildDependencyGraph

The old buildDependencyGraph in src/graph/builders/dependency.ts explicitly deduplicated edges in both the file-level and function-level branches via an if (!graph.hasEdge(src, tgt)) guard before adding to the CodeGraph. This meant multiple import statements from file A to file B, or multiple call sites from function A to function B, were collapsed to a single edge.

The new code pushes raw DB rows directly into edges without any deduplication. A function that calls another function from ten distinct call sites produces ten identical { source, target } pairs in the array.

Impact on the native path: native.detectCycles(edges) now receives a larger and potentially duplicate-heavy list compared to what it previously received. There is no guarantee the native implementation handles duplicate edges the same way as the JS Tarjan path, creating an observable divergence between the two engines.

Impact on the JS path: tarjanFromEdges remains correct — the indices.has(w) guard prevents re-processing — but redundant adjacency-list entries add unnecessary work, which runs counter to this PR's own performance goals.

The fix is to track seen (source, target) pairs in a Set and skip duplicates before pushing into edges, mirroring the hasEdge guard that existed in buildDependencyGraph.

carlos-alm · 2026-04-03

Fixed in a511c82. Added a Set<string> to track seen (source, target) pairs using a null-byte-separated key. Both the file-level and function-level branches now skip duplicate edges before pushing into the array, restoring the deduplication guarantee that existed in the old buildDependencyGraph path. This ensures the native engine receives the same unique edge set as the JS Tarjan path.
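
The deduplication described in this reply can be sketched as follows. This is an illustration under assumptions, not the code from a511c82: `dedupeEdges` and `LabelEdge` are hypothetical names. The null byte is a reasonable separator because function-level labels have the form `name|file`, so a printable separator such as `|` could make two different pairs produce the same key.

```typescript
// Hypothetical edge shape: endpoints are string labels such as "name|file".
interface LabelEdge {
  source: string;
  target: string;
}

// Keep the first occurrence of each (source, target) pair, preserving order.
function dedupeEdges(edges: LabelEdge[]): LabelEdge[] {
  const seen = new Set<string>();
  const unique: LabelEdge[] = [];
  for (const e of edges) {
    // "\0" is assumed never to appear inside a label, so the key is unambiguous.
    const key = `${e.source}\0${e.target}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(e);
  }
  return unique;
}
```

Direction still matters: (A, B) and (B, A) are distinct pairs and both survive, which is what cycle detection needs.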

greptile-apps · 2026-04-03T04:00:15Z

src/domain/analysis/module-map.ts

+      LEFT JOIN (
+        SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id
+      ) fi ON fi.target_id = n.id
+      LEFT JOIN (
+        SELECT source_id, COUNT(*) AS cnt FROM edges GROUP BY source_id
+      ) fo ON fo.source_id = n.id
      WHERE n.kind = 'file' ${testFilter}
-      ORDER BY (SELECT COUNT(*) FROM edges WHERE target_id = n.id)
-             + (SELECT COUNT(*) FROM edges WHERE source_id = n.id) DESC
+      ORDER BY COALESCE(fi.cnt, 0) + COALESCE(fo.cnt, 0) DESC
+      LIMIT ?
    `)


Pre-aggregated fan-in/fan-out subqueries count ALL edge types, including edges between non-file nodes

The two LEFT JOIN subqueries aggregate over the full edges table:

LEFT JOIN ( SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id ) fi ON fi.target_id = n.id

This counts every edge whose target_id equals the file node's ID — including contains, calls, imports, parameter_of, etc. The old correlated-subquery approach (SELECT COUNT(*) FROM edges WHERE target_id = n.id) had identical semantics, so there is no correctness regression here.

However, unlike moduleMapData (which explicitly excludes contains, parameter_of, receiver in its equivalent subqueries), findHotspots counts all edge kinds. This means a file's fan_in will be dominated by contains edges (one per function/method in the file) rather than by meaningful import or call fan-in. If hotspot ranking by inter-file coupling is the intent, it may be worth adding a WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filter to the subqueries to match moduleMapData's approach — or at minimum confirm this is intentional.

carlos-alm · 2026-04-03

Fixed in a511c82. Added WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filters to both the fan-in and fan-out subqueries in findHotspots, matching the filtering already used by moduleMapData. Hotspot ranking now reflects inter-file coupling (imports, calls) rather than being dominated by internal structure edges.
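
To see why the edge-kind filter changes the ranking, here is a small in-memory sketch rather than the actual query. The `KindEdge` shape, `rankHotspots`, and the sample IDs in the usage note are hypothetical, and the direction of `contains` edges (file to symbol) is an assumption; the combined fan-in plus fan-out score does not depend on it.

```typescript
// Hypothetical in-memory edge row; not the project's schema.
interface KindEdge {
  sourceId: number;
  targetId: number;
  kind: string;
}

// The three structural kinds named in the review thread.
const STRUCTURAL_KINDS: ReadonlySet<string> = new Set(["contains", "parameter_of", "receiver"]);

interface Hotspot {
  id: number;
  fanIn: number;
  fanOut: number;
}

function rankHotspots(
  fileIds: number[],
  edges: KindEdge[],
  excluded: ReadonlySet<string>,
  limit: number,
): Hotspot[] {
  const fanIn = new Map<number, number>();
  const fanOut = new Map<number, number>();
  for (const e of edges) {
    if (excluded.has(e.kind)) continue; // analogue of WHERE kind NOT IN (...)
    fanIn.set(e.targetId, (fanIn.get(e.targetId) ?? 0) + 1);
    fanOut.set(e.sourceId, (fanOut.get(e.sourceId) ?? 0) + 1);
  }
  return fileIds
    .map((id) => ({ id, fanIn: fanIn.get(id) ?? 0, fanOut: fanOut.get(id) ?? 0 }))
    .sort((a, b) => b.fanIn + b.fanOut - (a.fanIn + a.fanOut)) // ORDER BY ... DESC
    .slice(0, limit); // analogue of LIMIT ?
}
```

With an empty exclusion set, a file that merely defines many functions can outrank a file that many others import; passing `STRUCTURAL_KINDS` removes that bias.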

…spots (#781) Restore edge deduplication that existed in the old buildDependencyGraph path — without it, duplicate (source, target) pairs from multiple call sites inflated the edge array passed to both native and JS engines. Also align findHotspots fan-in/fan-out subqueries with moduleMapData by excluding contains, parameter_of, and receiver edges so hotspot ranking reflects inter-file coupling rather than internal structure edges.

carlos-alm · 2026-04-03T04:52:42Z

@greptileai

greptile-apps bot reviewed Apr 3, 2026

carlos-alm added 4 commits April 2, 2026 22:23

Merge branch 'main' into perf/fix-763-cycles-stats-timeout

a54bb7a

Merge branch 'main' into perf/fix-763-cycles-stats-timeout

40e7dbb

Merge branch 'main' into perf/fix-763-cycles-stats-timeout

594f31c

carlos-alm added 4 commits April 2, 2026 22:58

Merge branch 'main' into perf/fix-763-cycles-stats-timeout

226473d

Merge remote-tracking branch 'origin/main' into perf/fix-763-cycles-stats-timeout

84c5f8d

Merge branch 'perf/fix-763-cycles-stats-timeout' of https://github.com/optave/ops-codegraph-tool into perf/fix-763-cycles-stats-timeout

2e74b84

Merge branch 'main' into perf/fix-763-cycles-stats-timeout

ebce492

carlos-alm merged commit 203d670 into main Apr 3, 2026
12 checks passed

carlos-alm deleted the perf/fix-763-cycles-stats-timeout branch April 3, 2026 06:15

github-actions bot locked and limited conversation to collaborators Apr 3, 2026

carlos-alm merged 9 commits into main from perf/fix-763-cycles-stats-timeout
