
perf: optimize cycles and stats for large codebases#781

Merged
carlos-alm merged 9 commits into main from perf/fix-763-cycles-stats-timeout
Apr 3, 2026
Conversation

@carlos-alm
Contributor

Summary

  • findCycles: Build adjacency list directly from DB rows and run Tarjan inline — eliminates intermediate CodeGraph construction, toEdgeArray() serialization, and redundant label CodeGraph rebuild
  • findHotspots / moduleMapData: Replace O(V×E) correlated subqueries with pre-aggregated LEFT JOINs (edge counts grouped once, joined by ID)
  • countNodesByKind / countEdgesByKind / countRoles: Replace full-table-load + JS .filter() with SQL WHERE clauses via testFilterSQL; remove buildTestFileIds (no longer needed)
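
The last bullet moves test-file filtering out of JavaScript and into the query itself. A minimal sketch of that idea follows. The helper name `testFileFilterClause`, the `nodes` table name, and the path patterns are all assumptions made for illustration; the signature and rules of the project's `testFilterSQL` are not shown in this PR.

```typescript
// Hypothetical sketch: build a SQL fragment that excludes test files, so the
// database filters rows instead of JavaScript. The patterns below are
// illustrative assumptions, not the project's real test-file rules.
const ASSUMED_TEST_PATTERNS = ["%.test.%", "%.spec.%", "%/tests/%"];

interface SqlFragment {
  sql: string; // text to append after an existing WHERE condition
  params: string[]; // values bound to the "?" placeholders, in order
}

// `column` must be a trusted constant (it is interpolated into the SQL text).
function testFileFilterClause(column: string, noTests: boolean): SqlFragment {
  if (!noTests) return { sql: "", params: [] };
  // One "NOT LIKE ?" per pattern; pattern values are bound, never interpolated.
  const sql = ASSUMED_TEST_PATTERNS.map(() => ` AND ${column} NOT LIKE ?`).join("");
  return { sql, params: [...ASSUMED_TEST_PATTERNS] };
}

const filter = testFileFilterClause("n.file", true);
// The table name "nodes" is assumed for this sketch.
const query =
  `SELECT n.kind, COUNT(*) AS cnt FROM nodes n WHERE 1 = 1${filter.sql} GROUP BY n.kind`;
console.log(query, filter.params);
```

With a fragment like this, a count query returns only the aggregated rows, rather than loading every row and discarding test files afterwards with `.filter()`.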

Test plan

  • tests/graph/ — 271 tests pass (cycles, algorithms, builders)
  • tests/integration/queries.test.ts — 99 tests pass
  • tests/integration/roles.test.ts — 12 tests pass
  • Lint clean (biome check)

Closes #763

findCycles: build adjacency list directly from DB rows and run Tarjan
inline — eliminates intermediate CodeGraph construction, toEdgeArray
serialization, and redundant label CodeGraph rebuild.

statsData/moduleMapData: replace O(V×E) correlated subqueries in
findHotspots and moduleMapData with pre-aggregated LEFT JOINs.
Replace full-table-load + JS filtering in countNodesByKind,
countEdgesByKind, and countRoles with SQL WHERE clauses using
testFilterSQL. Remove buildTestFileIds (no longer needed).

Closes #763
@claude

claude bot commented Apr 3, 2026

Claude finished @carlos-alm's task


PR Review: Performance Optimization for Cycles and Stats

Todo List

  • Phase 1: Root cause analysis - understand the timeout problem and solution approach
  • Phase 2: Configuration change detection - check for any rule relaxations
  • Phase 3: Test integrity - verify tests remain robust
  • Phase 4: Code quality review - analyze the optimization implementation
  • Phase 5: Backlog compliance assessment - rate against project criteria
  • Final Assessment: Provide rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Performance timeouts in cycles and stats commands on large codebases (#763)

Why This Approach: The optimizations target specific algorithmic bottlenecks:

  • findCycles: Eliminates O(V+E) overhead from CodeGraph construction + serialization
  • findHotspots/moduleMapData: Replaces O(V×E) correlated subqueries with O(E) pre-aggregated JOINs
  • Stats count functions: Replaces full-table JavaScript filtering with SQL WHERE clauses

Risk Assessment: ✅ Low Risk - These are pure performance optimizations that maintain identical functionality. The approach directly addresses computational complexity bottlenecks rather than masking symptoms.

Phase 2: Configuration Change Detection ✅

Status: ✅ No Configuration Changes

  • No changes to biome.json, package.json, or test timeouts
  • Only 2 source files modified: cycles.ts and module-map.ts
  • No rule relaxations or test expectations lowered
  • buildTestFileIds function removed (no longer needed due to SQL optimization)

Phase 3: Test Integrity ✅

Status: ✅ Tests Remain Robust

  • Existing test suite (tests/graph/cycles.test.ts) validates behavior via public API
  • Tests create DB directly and call findCycles(db) - optimizations are transparent
  • No test modifications needed since public API unchanged
  • PR claims all 271 graph tests + 99 integration tests + 12 roles tests pass

Phase 4: Code Quality Review ✅

Excellent Optimizations:

  1. findCycles optimization (cycles.ts:12-61):

    • Before: buildDependencyGraph() → CodeGraph → toEdgeArray() → rebuild CodeGraph for labels
    • After: Direct adjacency list from DB rows → inline Tarjan algorithm
    • Impact: Eliminates 3 O(V+E) operations, huge memory savings
  2. findHotspots optimization (module-map.ts:111-135):

    • Before: Correlated subqueries (SELECT COUNT(*) FROM edges WHERE target_id = n.id) - O(V×E)
    • After: Pre-aggregated LEFT JOINs with grouped edge counts - O(E)
    • Impact: Massive improvement for large graphs
  3. SQL filtering replacement:

    • Before: buildTestFileIds() loads all files into memory, filters in JavaScript
    • After: SQL WHERE clauses via testFilterSQL()
    • Impact: Database does the filtering, much faster
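
The second item above can be illustrated outside SQL. The sketch below is an in-memory analogue written for this discussion, not code from the PR: it contrasts rescanning every edge for every node with grouping the counts once and looking them up by ID. The function names and sample data are hypothetical.

```typescript
// In-memory analogue of the SQL change: count edges per node once, then look
// counts up by ID, instead of rescanning all edges for every node.
interface CountEdge {
  source_id: number;
  target_id: number;
}

// Correlated-subquery style: one full scan of `edges` per node.
function fanInPerNodeScan(nodeIds: number[], edges: CountEdge[]): Map<number, number> {
  const result = new Map<number, number>();
  for (const id of nodeIds) {
    result.set(id, edges.filter((e) => e.target_id === id).length);
  }
  return result;
}

// Pre-aggregated style: a single pass groups counts by target_id (the
// GROUP BY), then each node reads its count (the LEFT JOIN plus COALESCE).
function fanInPreAggregated(nodeIds: number[], edges: CountEdge[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const e of edges) counts.set(e.target_id, (counts.get(e.target_id) ?? 0) + 1);
  const result = new Map<number, number>();
  for (const id of nodeIds) result.set(id, counts.get(id) ?? 0);
  return result;
}

const sampleNodes = [1, 2, 3];
const sampleEdges: CountEdge[] = [
  { source_id: 1, target_id: 2 },
  { source_id: 3, target_id: 2 },
  { source_id: 2, target_id: 3 },
];
const slow = fanInPerNodeScan(sampleNodes, sampleEdges);
const fast = fanInPreAggregated(sampleNodes, sampleEdges);
console.log([...fast.entries()]); // [[1, 0], [2, 2], [3, 1]]
```

Both functions return the same counts. The second touches each edge once and each node once, which is the property the pre-aggregated JOINs rely on.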

Implementation Quality:

  • ✅ Maintains identical public APIs
  • ✅ Preserves native/WASM fallback patterns
  • ✅ Clean SQL with proper parameterization
  • ✅ Efficient adjacency list construction
  • ✅ Correct Tarjan implementation (existing algorithm, relocated)

No Issues Found: No bugs, security concerns, or performance regressions detected.

Phase 5: Backlog Compliance ✅

  • Zero-dep: ✓ No new runtime dependencies
  • Foundation-aligned: ✓ Fully compliant - improves core analysis without violating principles
  • Problem-fit: 5/5 - Directly addresses "AI agents timeout on large codebases" core problem
  • Breaking: No - Pure internal optimizations, identical public APIs
  • Tier: Tier 1 (zero-dep ✓ + foundation-aligned ✓)

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: Exceptional performance engineering that directly solves the core timeout problem. Clean implementation with no breaking changes, proper fallbacks maintained, and transparent API preservation. This is exactly the kind of optimization needed for large codebases.

Summary: This PR demonstrates excellent performance engineering - targeting algorithmic bottlenecks with surgical precision while maintaining full API compatibility. The optimizations are mathematically sound (O(V×E) → O(E)) and implementation is clean. Strong approve.


@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR is a targeted performance optimization for two query-heavy code paths: cycle detection (findCycles) and stats/module-map aggregation (statsData, moduleMapData, findHotspots). The changes eliminate intermediate CodeGraph object construction and replace O(V×E) correlated subqueries with pre-aggregated LEFT JOIN subqueries.

Key changes:

  • findCycles: Builds a label-keyed adjacency list directly from DB rows and runs Tarjan SCC inline (tarjanFromEdges), skipping buildDependencyGraph, CodeGraph, and toEdgeArray(). A Set<string> with null-byte-separated keys deduplicates edges before they enter the algorithm.
  • findHotspots: Swaps correlated per-row SELECT COUNT(*) for two pre-aggregated LEFT JOIN subqueries (with kind NOT IN ('contains', 'parameter_of', 'receiver') filters, added per prior review feedback). The SQL LIMIT ? replaces the JS .slice().
  • moduleMapData: Same pre-aggregated JOIN pattern for out_edges/in_edges; selects only needed columns (n.file) instead of n.*.
  • countNodesByKind / countEdgesByKind / countRoles: Replace full-table load + JS .filter() with testFilterSQL-generated WHERE clauses; buildTestFileIds removed.
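
For readers unfamiliar with the first bullet, here is a minimal sketch of Tarjan's SCC algorithm run directly over a label-keyed edge list. It is an illustration only: `tarjanSketch` is a hypothetical name, not the project's `tarjanFromEdges`, and this recursive form could exhaust the call stack on very deep graphs.

```typescript
// Sketch of Tarjan's SCC algorithm over a label-keyed edge list.
interface LabelEdge {
  source: string;
  target: string;
}

function tarjanSketch(edges: LabelEdge[]): string[][] {
  // Build the adjacency map directly from the edge list.
  const adj = new Map<string, string[]>();
  for (const { source, target } of edges) {
    if (!adj.has(source)) adj.set(source, []);
    if (!adj.has(target)) adj.set(target, []);
    adj.get(source)!.push(target);
  }

  let counter = 0;
  const index = new Map<string, number>();
  const lowlink = new Map<string, number>();
  const stack: string[] = [];
  const onStack = new Set<string>();
  const cycles: string[][] = [];

  function visit(v: string): void {
    index.set(v, counter);
    lowlink.set(v, counter);
    counter++;
    stack.push(v);
    onStack.add(v);
    for (const w of adj.get(v)!) {
      if (!index.has(w)) {
        visit(w); // tree edge: recurse, then propagate the lowlink
        lowlink.set(v, Math.min(lowlink.get(v)!, lowlink.get(w)!));
      } else if (onStack.has(w)) {
        lowlink.set(v, Math.min(lowlink.get(v)!, index.get(w)!));
      }
    }
    // v is the root of an SCC: pop the whole component off the stack.
    if (lowlink.get(v) === index.get(v)) {
      const scc: string[] = [];
      let member = "";
      do {
        member = stack.pop()!;
        onStack.delete(member);
        scc.push(member);
      } while (member !== v);
      if (scc.length > 1) cycles.push(scc); // single nodes are not cycles
    }
  }

  for (const v of adj.keys()) if (!index.has(v)) visit(v);
  return cycles;
}

const found = tarjanSketch([
  { source: "a.ts", target: "b.ts" },
  { source: "b.ts", target: "c.ts" },
  { source: "c.ts", target: "a.ts" },
  { source: "c.ts", target: "d.ts" },
]);
console.log(found); // one cycle containing a.ts, b.ts, and c.ts
```

Because the input is already an edge list, no intermediate graph object is needed: the adjacency map is the only structure built before the traversal.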

The two previously-raised concerns (edge deduplication and contains/parameter_of/receiver exclusion in hotspot subqueries) are both addressed in the current head commit.

Confidence Score: 5/5

Safe to merge — no P0/P1 issues remain; prior review concerns about edge deduplication and hotspot edge-kind filtering are both resolved in this head commit.

All remaining findings are P2 or lower. The core correctness risks raised in prior rounds (missing deduplication, wrong edge-kind inclusion in hotspot subqueries) are addressed. The algorithmic changes are semantically equivalent to the old paths, the test suite passes, and the SQL rewrites follow established patterns already used elsewhere in the codebase.

No files require special attention.

Important Files Changed

Filename | Overview
src/domain/graph/cycles.ts | Replaces intermediate CodeGraph construction with a direct DB-row adjacency list and inline Tarjan SCC; adds deduplication via a null-byte-separated key Set, matching the old buildDependencyGraph guard. Logic is correct.
src/domain/analysis/module-map.ts | Replaces O(V×E) correlated subqueries with pre-aggregated LEFT JOINs for findHotspots and moduleMapData; replaces full-table JS filters with SQL WHERE clauses for countNodesByKind, countEdgesByKind, and countRoles; edge-kind exclusion filters correctly added to hotspot subqueries per prior review feedback.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant findCycles
    participant DB
    participant tarjanFromEdges
    participant native

    CLI->>findCycles: findCycles(db, {fileLevel, noTests})
    findCycles->>DB: getFileNodesAll() or getCallableNodes()
    DB-->>findCycles: node rows
    findCycles->>DB: getImportEdges() or getCallEdges()
    DB-->>findCycles: edge rows
    Note over findCycles: Deduplicate via Set<key><br/>Build label→label edge list
    findCycles->>native: detectCycles(edges) [if available]
    native-->>findCycles: string[][]
    findCycles->>tarjanFromEdges: tarjanFromEdges(edges) [JS fallback]
    Note over tarjanFromEdges: Build adj Map<br/>Run Tarjan SCC<br/>Filter SCCs length > 1
    tarjanFromEdges-->>findCycles: string[][]
    findCycles-->>CLI: cycles

Reviews (2): Last reviewed commit: "fix: deduplicate edges in findCycles and..."

Comment on lines +30 to 52
    for (const e of getImportEdges(db)) {
      if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue;
      if (e.source_id === e.target_id) continue;
      const src = idToFile.get(e.source_id)!;
      const tgt = idToFile.get(e.target_id)!;
      edges.push({ source: src, target: tgt });
    }
  } else {
    let nodes = getCallableNodes(db);
    if (noTests) nodes = nodes.filter((n) => !isTestFile(n.file));
    const nodeIds = new Set<number>();
    const idToLabel = new Map<number, string>();
    for (const n of nodes) {
      nodeIds.add(n.id);
      idToLabel.set(n.id, `${n.name}|${n.file}`);
    }
    for (const e of getCallEdges(db)) {
      if (!nodeIds.has(e.source_id) || !nodeIds.has(e.target_id)) continue;
      if (e.source_id === e.target_id) continue;
      const src = idToLabel.get(e.source_id)!;
      const tgt = idToLabel.get(e.target_id)!;
      edges.push({ source: src, target: tgt });
    }
Contributor

P1 Duplicate edges not deduplicated — regression vs. old buildDependencyGraph

The old buildDependencyGraph in src/graph/builders/dependency.ts explicitly deduplicated edges in both the file-level and function-level branches via an if (!graph.hasEdge(src, tgt)) guard before adding to the CodeGraph. This meant multiple import statements from file A to file B, or multiple call sites from function A to function B, were collapsed to a single edge.

The new code pushes raw DB rows directly into edges without any deduplication. A function that calls another function from ten distinct call sites produces ten identical { source, target } pairs in the array.

Impact on the native path: native.detectCycles(edges) now receives a larger and potentially duplicate-heavy list compared to what it previously received. There is no guarantee the native implementation handles duplicate edges the same way as the JS Tarjan path, creating an observable divergence between the two engines.

Impact on the JS path: tarjanFromEdges remains correct — the indices.has(w) guard prevents re-processing — but redundant adjacency-list entries add unnecessary work, which runs counter to this PR's own performance goals.

The fix is to track seen (source, target) pairs in a Set and skip duplicates before pushing into edges, mirroring the hasEdge guard that existed in buildDependencyGraph.

Contributor Author

Fixed in a511c82. Added a Set<string> to track seen (source, target) pairs using a null-byte-separated key. Both the file-level and function-level branches now skip duplicate edges before pushing into the array, restoring the deduplication guarantee that existed in the old buildDependencyGraph path. This ensures the native engine receives the same unique edge set as the JS Tarjan path.
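
A minimal sketch of the deduplication described above, written for illustration rather than copied from the commit. `dedupeEdges` is a hypothetical helper name; in the PR the check is said to run inline in each branch of `findCycles`.

```typescript
// Sketch: skip (source, target) pairs that have already been seen.
interface RawEdge {
  source: string;
  target: string;
}

function dedupeEdges(rows: RawEdge[]): RawEdge[] {
  const seen = new Set<string>();
  const unique: RawEdge[] = [];
  for (const { source, target } of rows) {
    // "\0" is not expected inside file paths or identifiers, so the combined
    // key stays unambiguous. A separator such as "|" could collide, because
    // function labels here already have the form `${name}|${file}`.
    const key = `${source}\0${target}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push({ source, target });
  }
  return unique;
}

const deduped = dedupeEdges([
  { source: "run|a.ts", target: "log|b.ts" },
  { source: "run|a.ts", target: "log|b.ts" }, // second call site, same pair
  { source: "run|a.ts", target: "log|b.ts" }, // third call site, same pair
  { source: "log|b.ts", target: "run|a.ts" }, // reverse direction is distinct
]);
console.log(deduped.length); // 2
```

Insertion order is preserved, so the first occurrence of each pair is the one kept.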

Comment on lines +119 to 128
  LEFT JOIN (
    SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id
  ) fi ON fi.target_id = n.id
  LEFT JOIN (
    SELECT source_id, COUNT(*) AS cnt FROM edges GROUP BY source_id
  ) fo ON fo.source_id = n.id
  WHERE n.kind = 'file' ${testFilter}
- ORDER BY (SELECT COUNT(*) FROM edges WHERE target_id = n.id)
-   + (SELECT COUNT(*) FROM edges WHERE source_id = n.id) DESC
+ ORDER BY COALESCE(fi.cnt, 0) + COALESCE(fo.cnt, 0) DESC
  LIMIT ?
`)
Contributor

P2 Pre-aggregated fan-in/fan-out subqueries count ALL edge types, including edges between non-file nodes

The two LEFT JOIN subqueries aggregate over the full edges table:

LEFT JOIN (
  SELECT target_id, COUNT(*) AS cnt FROM edges GROUP BY target_id
) fi ON fi.target_id = n.id

This counts every edge whose target_id equals the file node's ID — including contains, calls, imports, parameter_of, etc. The old correlated-subquery approach (SELECT COUNT(*) FROM edges WHERE target_id = n.id) had identical semantics, so there is no correctness regression here.

However, unlike moduleMapData (which explicitly excludes contains, parameter_of, receiver in its equivalent subqueries), findHotspots counts all edge kinds. This means a file's fan_in will be dominated by contains edges (one per function/method in the file) rather than by meaningful import or call fan-in. If hotspot ranking by inter-file coupling is the intent, it may be worth adding a WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filter to the subqueries to match moduleMapData's approach — or at minimum confirm this is intentional.
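
To make the concern concrete, here is a small in-memory illustration. The data is invented for this sketch, and the direction of `contains` edges (file to symbol) is an assumption. It computes the same fan-in plus fan-out total the ORDER BY uses, with and without the structural edge kinds.

```typescript
// In-memory illustration of how structural edge kinds can dominate a
// hotspot score. Sample data and edge directions are assumptions.
interface KindedEdge {
  source_id: number;
  target_id: number;
  kind: string;
}

const STRUCTURAL_KINDS = new Set(["contains", "parameter_of", "receiver"]);

// Mirrors COALESCE(fi.cnt, 0) + COALESCE(fo.cnt, 0) for one node.
function hotspotScore(nodeId: number, edges: KindedEdge[], excludeStructural: boolean): number {
  const counted = excludeStructural
    ? edges.filter((e) => !STRUCTURAL_KINDS.has(e.kind))
    : edges;
  const fanIn = counted.filter((e) => e.target_id === nodeId).length;
  const fanOut = counted.filter((e) => e.source_id === nodeId).length;
  return fanIn + fanOut;
}

// Node 1 is a file holding three functions (10, 11, 12) and importing file 2.
const kindedEdges: KindedEdge[] = [
  { source_id: 1, target_id: 10, kind: "contains" },
  { source_id: 1, target_id: 11, kind: "contains" },
  { source_id: 1, target_id: 12, kind: "contains" },
  { source_id: 1, target_id: 2, kind: "imports" },
];

const unfilteredScore = hotspotScore(1, kindedEdges, false);
const filteredScore = hotspotScore(1, kindedEdges, true);
console.log(unfilteredScore, filteredScore); // 4 1
```

In this sample, three of the four edges counted for the file are `contains` edges, so the unfiltered score mostly measures how many symbols the file holds rather than how it couples to other files.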

Contributor Author

Fixed in a511c82. Added WHERE kind NOT IN ('contains', 'parameter_of', 'receiver') filters to both the fan-in and fan-out subqueries in findHotspots, matching the filtering already used by moduleMapData. Hotspot ranking now reflects inter-file coupling (imports, calls) rather than being dominated by internal structure edges.

…spots (#781)

Restore edge deduplication that existed in the old buildDependencyGraph
path — without it, duplicate (source, target) pairs from multiple call
sites inflated the edge array passed to both native and JS engines.

Also align findHotspots fan-in/fan-out subqueries with moduleMapData by
excluding contains, parameter_of, and receiver edges so hotspot ranking
reflects inter-file coupling rather than internal structure edges.
@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 203d670 into main Apr 3, 2026
12 checks passed
@carlos-alm carlos-alm deleted the perf/fix-763-cycles-stats-timeout branch April 3, 2026 06:15
@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2026
Development

Successfully merging this pull request may close these issues.

codegraph: cycles and stats commands timeout on large codebases
