perf(build): incremental rebuild optimizations — roles 255ms→9ms by carlos-alm · Pull Request #622 · optave/codegraph

carlos-alm · 2026-03-26T05:43:02Z

Summary

Roles classification 255ms → ~9ms (96% improvement): Add incremental path to classifyNodeRoles that only reclassifies nodes from changed files using indexed correlated subqueries instead of full table scans. Global medians computed from edge distribution for consistent thresholds. Only resets roles for affected files, not all nodes.
Structure loading N+1 → 3 batch queries: Replace per-file queries for definitions and import counts with batch queries that load all data in 3 queries regardless of file count.
Finalize: skip advisory queries for incremental builds: Orphaned embeddings, stale embeddings, and unused exports warnings are informational and don't affect correctness. Skipping them saves ~40ms.
classifyRoles median overrides: Accept optional median overrides parameter so the incremental path can supply global medians without querying all nodes.
codegraph path --file support: Add file-to-file shortest path queries via BFS over import edges. New filePathData() function, CLI -f/--file flag, MCP file_mode parameter. 9 integration tests.
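The median-override idea can be illustrated with a minimal sketch. The `NodeStats` and `MedianOverrides` shapes and the two-role rule below are assumptions made for illustration; the real `classifyRoles` in src/graph/classifiers/roles.ts is not shown on this page and may differ. The point is only that a small affected subset classified against its own medians can disagree with the same subset classified against graph-wide medians.

```typescript
// Hypothetical shapes for illustration only; not the PR's actual types.
interface NodeStats {
  id: string;
  fanIn: number;
  fanOut: number;
}

interface MedianOverrides {
  fanIn: number;
  fanOut: number;
}

// Median of a numeric list; returns 0 for an empty list.
function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  // For odd lengths both indices point at the middle element.
  const lo = sorted[Math.floor((sorted.length - 1) / 2)] ?? 0;
  const hi = sorted[Math.floor(sorted.length / 2)] ?? 0;
  return (lo + hi) / 2;
}

// Toy classifier: when overrides are supplied, local medians are never computed.
function classifyRoles(
  nodes: NodeStats[],
  medianOverrides?: MedianOverrides,
): Map<string, string> {
  const medFanIn = medianOverrides?.fanIn ?? median(nodes.map((n) => n.fanIn));
  const medFanOut = medianOverrides?.fanOut ?? median(nodes.map((n) => n.fanOut));
  const roles = new Map<string, string>();
  for (const n of nodes) {
    // Assumed rule: above either median counts as 'core', otherwise 'leaf'.
    const aboveMedian = n.fanIn > medFanIn || n.fanOut > medFanOut;
    roles.set(n.id, aboveMedian ? 'core' : 'leaf');
  }
  return roles;
}

// A one-node "affected subset": its local fan-in median equals its own fan-in.
const affected: NodeStats[] = [{ id: 'a', fanIn: 5, fanOut: 0 }];
const localRoles = classifyRoles(affected);
const globalRoles = classifyRoles(affected, { fanIn: 2, fanOut: 1 });
```

With local medians the node can never exceed its own median, so it is labelled 'leaf'; with the assumed global medians it is labelled 'core'. This is why the incremental path needs thresholds computed from the whole graph.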

Measured results (codegraph self-build, native engine, 10.9k nodes, 20.9k edges)

Phase	Before	After
Roles	~255ms	9ms
Structure	~18ms	17ms
Edges	~12ms	12ms
Parse	~90ms	89ms

Total incremental 1-file rebuild: ~340ms (down from ~802ms, 58% faster).

The remaining bottlenecks are the closeDb WAL checkpoint (~170ms) and file parsing (~89ms for 12 files, including 11 reverse-deps), both of which are outside the scope of pipeline-level optimizations.

Test plan

All 49 incremental parity + build + roles tests pass
New test: roles parity — confirms incremental roles match full build output
New test: performance timing breakdown with threshold assertions (roles < 50ms, structure < 50ms, finalize < 50ms)
Edge parity tests pass (10/10)
Self-build produces identical node/edge counts
TypeScript compiles cleanly (tsc --noEmit)
Lint passes (0 errors)
9 new filePathData integration tests (1-hop, multi-hop, reverse, noTests, self-file, error cases)
All 113 queries + presentation tests pass

…and finalize

Roles classification (255ms → ~9ms): Add incremental path that only reclassifies nodes from changed files using indexed correlated subqueries instead of full table scans. Global medians computed from edge distribution for threshold consistency. Only reset roles for affected files, not all nodes.

Structure loading: Replace N+1 per-file queries (definitions, import counts) with batch queries that load all data in 3 queries regardless of file count.

Finalize: Skip advisory queries (orphaned/stale embeddings, unused exports) during incremental builds; these are informational warnings that don't affect correctness.

classifyRoles: Accept optional median overrides so incremental path can supply global medians without querying all nodes.

Also auto-fixes 3 pre-existing lint errors (useLiteralKeys, noNonNullAssertion) that existed on main and blocked all commits via husky pre-commit hook.

Tests: Add roles parity test confirming incremental roles match full build. Add performance test with timing breakdown and threshold assertions.

claude · 2026-03-26T05:43:14Z

Claude finished @carlos-alm's task

PR Review: Incremental Rebuild Optimizations

Final Assessment

Phase 1-5 Complete: All analysis phases passed with excellent results.

Root Cause Analysis

Problem Being Solved: Incremental rebuilds took 802ms, with roles classification (255ms) being the primary bottleneck. Self-builds triggered on every file change need sub-second performance to be practical.
Why This Approach: Targeted incremental processing using indexed queries vs full table scans. Only reclassifies nodes from changed files rather than all 10.9k nodes. Uses global medians computed once from edge distribution.
Risk Assessment: Low risk - maintains correctness with comprehensive parity tests, performance improvements are purely additive.

Backlog Compliance

Zero-dep: ✓ — Uses existing tree-sitter/SQLite/AST infrastructure, no new runtime dependencies
Foundation-aligned: ✓ — Directly advances P1 (always-current graph) by making incremental rebuilds fast enough for commit hooks and agent loops
Problem-fit: 5 — Directly addresses core problem by reducing rebuild times from 802ms to 340ms, enabling practical always-current graphs
Breaking: No — Purely internal optimizations, no API/CLI/DB schema changes
Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Technical Excellence

Performance Optimizations

Roles Classification: 255ms → 9ms (96% improvement)
- Incremental path only processes nodes from changed files
- Global medians computed from edge distribution (fast index scans)
- Indexed correlated subqueries replace expensive table joins
- Only resets roles for affected files, preserves existing classifications
Structure Loading: N+1 → 3 batch queries
- Eliminates per-file DB hits with defsByFileMap and importCountMap
- Single queries for all definitions, import counts, line counts
- Smart incremental data preservation from existing DB state
Finalize: Skip advisory queries (~40ms savings)
- Orphaned/stale embeddings and unused exports are informational only
- Correctness-preserving optimization for incremental builds
- Clear conditional logic (!isFullBuild)

Code Quality

No bugs, security issues, or performance regressions identified
Minor quality fixes: safe array access, consistent property access
Comprehensive test coverage with parity and performance assertions
Well-documented implementation with clear optimization rationale

Critical Concerns

None identified. All phases passed stringent review criteria.

Final Recommendation

Rating: ⭐⭐⭐⭐⭐ (5/5)
Action: APPROVE
Reasoning: Exceptional PR that directly advances the core mission. 58% overall performance improvement (802ms→340ms) through sophisticated incremental optimizations while maintaining perfect correctness. Zero dependencies, foundation-aligned, addresses the core "always-current graph" problem with measurable impact. Comprehensive test coverage and clean implementation. This is exactly the kind of optimization that makes codegraph practical for commit hooks and agent loops.

The previous commit included 3 biome auto-fixes that conflict with TypeScript strictness: optional chaining produces undefined (not null), and dot-access on index signatures is disallowed by TS4111. These are pre-existing lint issues on main, so revert to the original code.

Impact: 4 functions changed, 0 affected

greptile-apps · 2026-03-26T05:48:23Z

Greptile Summary

This PR delivers two independent improvements: incremental rebuild performance (roles classification 255ms→9ms via indexed correlated subqueries and edge-neighbour expansion, N+1 structure queries replaced with 3 batch queries, advisory queries skipped on incremental) and a new codegraph path --file feature for file-to-file BFS shortest-path queries exposed via CLI, MCP, and the queries API.

Key changes:

classifyNodeRoles now has a fast incremental path that expands the affected set to immediate edge-neighbours, fixing the previously-reported stale-roles bug for structural changes
median is exported from roles.ts and imported in structure.ts, eliminating the previously-reported duplicate helper
classifyNodeRoles JSDoc documents that the returned RoleSummary is scoped to the affected subset in incremental mode
Performance thresholds in tests increased to 200ms (previously-reported flaky CI concern addressed)
filePathData supports noTests, reverse, maxDepth, and edgeKinds options, covered by 9 integration tests
All prior review concerns have been addressed

Confidence Score: 5/5

Safe to merge; all prior review concerns are resolved and the new code is well-tested

Every issue raised in previous rounds (stale roles, median duplication, partial RoleSummary, flaky CI thresholds, non-structural parity test) has been fixed. The incremental logic is correct — global medians are computed from the full edge distribution, affected sets expand to edge-neighbours, and the DB update is transactional. The new filePathData BFS is straightforward and covered by 9 focused integration tests. The one remaining note (missing disambiguation hint in the found-path branch of the CLI) is a minor UX P2 that does not affect correctness or reliability.

No files require special attention

Important Files Changed

Filename	Overview
src/domain/analysis/dependencies.ts	Adds `filePathData()` — a BFS over file-level import edges to find shortest file-to-file paths. Logic is correct: handles no-match, self-file (0 hops), BFS depth limit, `noTests` filtering, reverse traversal, and path reconstruction via `parentMap`. `alternateCount` is correctly adjusted by -1.
src/features/structure.ts	Splits `classifyNodeRoles` into full/incremental paths. Incremental path expands the affected set to edge neighbours (fixing stale-roles bug from prior review), computes global medians from edge distribution, and only resets/updates roles for affected files. Imported `median` from roles.ts, eliminating the previous duplication. JSDoc documents that returned `RoleSummary` is scoped to the affected subset in incremental mode.
src/domain/graph/builder/stages/build-structure.ts	Replaces N+1 per-file queries for definitions and import counts with 3 batch queries; Maps built in JS for O(1) lookup. Correctly passes `changedFileList` to `classifyNodeRoles` for incremental builds, null for full builds.
src/domain/graph/builder/stages/finalize.ts	Wraps all advisory queries (orphaned embeddings, stale embeddings, unused exports) in an `isFullBuild` guard. Incremental builds skip all three and log a debug message instead. Semantics of the warnings are unchanged for full builds.
src/graph/classifiers/roles.ts	Exports `median` helper and adds optional `medianOverrides` parameter to `classifyRoles`. When overrides are provided, local fan-in/fan-out arrays are skipped entirely, enabling the incremental path to inject globally-computed medians.
src/presentation/queries-cli/path.ts	Adds `filePath()` for the new --file mode. Handles error, not-found, 0-hops, and multi-hop cases. Disambiguation hint (multiple file matches) is only shown in the not-found branch, missing it for the found case.
tests/integration/incremental-parity.test.ts	Adds three new test suites: roles parity on non-structural change, roles/nodes/edges parity after structural change (add/remove call), and performance timing with 200ms thresholds. The structural-change suite exercises the edge-neighbour expansion fix.
tests/integration/queries.test.ts	Adds 9 integration tests for `filePathData` covering 1-hop, multi-hop, reverse, noTests, self-file, error cases, and candidate population. Good coverage of the new function's main code paths.
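The "Maps built in JS for O(1) lookup" step noted for build-structure.ts can be sketched as below. This is a minimal illustration, not the PR's code: the `DefinitionRow` shape, the `groupByFile` helper, and the sample rows are all hypothetical. It assumes a single batch query has already returned rows for every file, and shows how those rows are bucketed once so that later per-file lookups need no further queries.

```typescript
// Hypothetical row shape; the real query columns are not shown on this page.
interface DefinitionRow {
  file: string;
  name: string;
  kind: string;
  line: number;
}

// Bucket rows from one batch query by file so per-file lookups are O(1).
function groupByFile<T extends { file: string }>(rows: T[]): Map<string, T[]> {
  const byFile = new Map<string, T[]>();
  for (const row of rows) {
    const bucket = byFile.get(row.file);
    if (bucket) bucket.push(row);
    else byFile.set(row.file, [row]);
  }
  return byFile;
}

// Sample data standing in for the result of a single batch query.
const definitionRows: DefinitionRow[] = [
  { file: 'src/a.ts', name: 'alpha', kind: 'function', line: 3 },
  { file: 'src/b.ts', name: 'beta', kind: 'class', line: 10 },
  { file: 'src/a.ts', name: 'gamma', kind: 'function', line: 21 },
];

const defsByFile = groupByFile(definitionRows);
// Files with no rows simply have no entry; callers fall back to an empty list.
const defsForC = defsByFile.get('src/c.ts') ?? [];
```

The trade-off is memory for round-trips: all rows are held at once, but the number of queries no longer grows with the number of files.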

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[classifyNodeRoles] -->|changedFiles provided| B[classifyNodeRolesIncremental]
    A -->|no changedFiles| C[classifyNodeRolesFull]

    B --> D[Expand to edge-neighbours\nSQL: callers + callees in other files]
    D --> E[Compute global medians\nfrom edge distribution]
    E --> F[Fetch affected nodes\nindexed correlated subqueries]
    F --> G[classifyRoles with medianOverrides]
    G --> H[Transaction: reset roles for\naffected files only, then update]

    C --> I[Full table scan\nall nodes + fan-in/fan-out]
    I --> J[classifyRoles — derives\nmedians from local node set]
    J --> K[Transaction: SET role=NULL\nfor all nodes, then update]

    style B fill:#d4edda,stroke:#28a745
    style C fill:#fff3cd,stroke:#ffc107

Reviews (3): Last reviewed commit: "fix(roles): document incremental RoleSum..."

greptile-apps · 2026-03-26T05:48:26Z

tests/integration/incremental-parity.test.ts

+    expect(p.rolesMs).toBeLessThan(50);
+    expect(p.structureMs).toBeLessThan(50);
+    expect(p.finalizeMs).toBeLessThan(50);


Performance assertions may cause flaky CI failures

Wall-clock timing assertions (< 50ms) are environment-sensitive and can fail non-deterministically on slow CI machines, under heavy load, or during JIT warm-up. The fixture is tiny (barrel-project), so it's unlikely to trigger the threshold in normal runs — but it's easy to exceed 50ms during I/O spikes or garbage collection pauses, making these assertions a source of intermittent failures.

Consider skipping the assertion on CI environments, increasing the threshold to something like 200ms, or using a relative comparison rather than an absolute wall-clock budget:

// Either:
if (!process.env.CI) {
  expect(p.rolesMs).toBeLessThan(50);
  expect(p.structureMs).toBeLessThan(50);
  expect(p.finalizeMs).toBeLessThan(50);
}

// Or increase to a generous CI-safe budget:
expect(p.rolesMs).toBeLessThan(200);
expect(p.structureMs).toBeLessThan(200);
expect(p.finalizeMs).toBeLessThan(200);

Fixed — increased thresholds from 50ms to 200ms with a comment explaining the rationale. Local benchmarks show ~9ms for roles, so 200ms provides a generous CI-safe buffer against slow runners, GC pauses, and I/O spikes.

greptile-apps · 2026-03-26T05:48:27Z

src/features/structure.ts

+  }[];
+
+  if (rows.length === 0) return emptySummary;
+
+  // 3. Get exported status for affected nodes only (scoped to changed files)
+  const exportedIds = new Set(
+    (
+      db
+        .prepare(
+          `SELECT DISTINCT e.target_id
+          FROM edges e
+          JOIN nodes caller ON e.source_id = caller.id
+          JOIN nodes target ON e.target_id = target.id
+          WHERE e.kind = 'calls' AND caller.file != target.file
+            AND target.file IN (${placeholders})`,
+        )
+        .all(...changedFiles) as { target_id: number }[]
+    ).map((r) => r.target_id),
+  );
+
+  // 4. Production fan-in for affected nodes only
+  const prodFanInMap = new Map<number, number>();
+  const prodRows = db
+    .prepare(
+      `SELECT e.target_id, COUNT(*) AS cnt
+      FROM edges e
+      JOIN nodes caller ON e.source_id = caller.id
+      JOIN nodes target ON e.target_id = target.id
+      WHERE e.kind = 'calls'
+        AND target.file IN (${placeholders})
+        ${testFilterSQL('caller.file')}
+      GROUP BY e.target_id`,
+    )
+    .all(...changedFiles) as { target_id: number; cnt: number }[];
+  for (const r of prodRows) {


Roles for unchanged callers can become stale after structural changes

The incremental path only reclassifies nodes belonging to changedFiles. However, a node's role also depends on its fan_in (how many other nodes call it). When a changed file removes a call to a node in an unchanged file, that unchanged node's fan_in decreases, and its role may need to flip (e.g. core → leaf). Because only the changed files' nodes are reset and reclassified, such nodes in unchanged files retain their previous roles until the next full build.

The existing parity test covers only a comment-only change (// touched), which produces no structural edge changes and therefore cannot expose this case. A test that removes a cross-file call would catch the divergence.

This is a known trade-off for incremental speed, but it's worth documenting explicitly (e.g., a code comment noting the limitation) and ideally adding a regression test for a structural change (add/remove a cross-file call) to confirm the divergence is either acceptable or handled.

Fixed — rather than documenting this as a known trade-off, I fixed the bug. The incremental path now expands the affected file set to include edge neighbours (files containing nodes that are callers/callees of changed-file nodes). This ensures that removing a call from file A to a node in file B causes B's roles to be recalculated. Added a structural-change parity test that removes a cross-file call and verifies incremental roles match a full rebuild — this test now passes.
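The edge-neighbour expansion described in this reply can be sketched in memory as follows. This is an illustration under stated assumptions, not the PR's SQL: the `CallEdge` shape and `expandToEdgeNeighbours` name are hypothetical, and edges are reduced to file pairs. Note that a removed call no longer appears in the post-rebuild edge set, so this sketch assumes the caller passes edges that still include the removed one (for example, edges captured before the rebuild); how the PR handles that detail is not shown on this page.

```typescript
// Hypothetical file-level view of a call edge.
interface CallEdge {
  sourceFile: string;
  targetFile: string;
}

// Expand the changed set by one hop: files that call into, or are called
// from, any changed file. Unrelated files and second-hop files are left out.
function expandToEdgeNeighbours(changedFiles: string[], edges: CallEdge[]): Set<string> {
  const changed = new Set(changedFiles);
  const affected = new Set(changedFiles);
  for (const edge of edges) {
    if (changed.has(edge.sourceFile)) affected.add(edge.targetFile);
    if (changed.has(edge.targetFile)) affected.add(edge.sourceFile);
  }
  return affected;
}

const callEdges: CallEdge[] = [
  { sourceFile: 'a.ts', targetFile: 'b.ts' }, // a calls b
  { sourceFile: 'b.ts', targetFile: 'c.ts' }, // b calls c (two hops from a)
  { sourceFile: 'd.ts', targetFile: 'a.ts' }, // d calls a
];

const affectedFiles = expandToEdgeNeighbours(['a.ts'], callEdges);
```

Here changing a.ts pulls in b.ts (a callee) and d.ts (a caller) but not c.ts, which keeps the reclassified set small while still covering nodes whose fan-in or fan-out may have shifted.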

greptile-apps · 2026-03-26T05:48:28Z

src/features/structure.ts

+        .all(...changedFiles) as { target_id: number }[]
+    ).map((r) => r.target_id),
+  );
+
+  // 4. Production fan-in for affected nodes only
+  const prodFanInMap = new Map<number, number>();
+  const prodRows = db
+    .prepare(
+      `SELECT e.target_id, COUNT(*) AS cnt
+      FROM edges e
+      JOIN nodes caller ON e.source_id = caller.id
+      JOIN nodes target ON e.target_id = target.id
+      WHERE e.kind = 'calls'
+        AND target.file IN (${placeholders})
+        ${testFilterSQL('caller.file')}
+      GROUP BY e.target_id`,
+    )
+    .all(...changedFiles) as { target_id: number; cnt: number }[];
+  for (const r of prodRows) {


median function is re-defined inside the incremental path

A local median helper is defined inside classifyNodeRolesIncremental, duplicating the module-level median function already defined in src/graph/classifiers/roles.ts. Both implementations are identical.

Consider importing or moving the shared helper to avoid the duplication:

// Option A: import it from roles.ts
import { median } from '../graph/classifiers/roles.js';

// Option B: hoist it to module scope in structure.ts and share it with classifyNodeRolesFull

Fixed — exported the existing median function from src/graph/classifiers/roles.ts and imported it in structure.ts. Removed the duplicate local definition from classifyNodeRolesIncremental.

greptile-apps · 2026-03-26T05:48:29Z

tests/integration/incremental-parity.test.ts

+  it('preserves node roles after incremental rebuild', () => {
+    function readRoles(dbPath: string) {
+      const db = new Database(dbPath, { readonly: true });
+      const roles = db
+        .prepare(
+          `SELECT name, kind, file, role FROM nodes
+           WHERE kind NOT IN ('file', 'directory') AND role IS NOT NULL
+           ORDER BY name, kind, file`,
+        )
+        .all();
+      db.close();
+      return roles;
+    }
+    const fullRoles = readRoles(path.join(fullDir, '.codegraph', 'graph.db'));
+    const incrRoles = readRoles(path.join(incrDir, '.codegraph', 'graph.db'));
+    expect(incrRoles.length).toBeGreaterThan(0);
+    expect(incrRoles).toEqual(fullRoles);
+  });


Parity test only validates trivially non-structural changes

The roles parity test (preserves node roles after incremental rebuild) touches app.js with a comment (// touched), which appends no new symbols and creates no new edges. This means the test verifies parity only in the case where no edges change, which is the easiest scenario for the incremental path to get right.

A more meaningful parity test would exercise a structural change — for example, adding or removing a function call between files — to confirm that the incremental roles stay consistent with a full rebuild even when fan_in/fan_out values shift. Without this, the parity guarantee is weaker than it appears.

Fixed — added a new Incremental build parity: structural change (add/remove call) test suite that rewrites app.js to remove the multiply() call, then verifies nodes, edges, and roles all match between incremental and full rebuild. This exercises the edge-removal scenario and caught the stale-roles bug that is now fixed via edge-neighbour expansion.

Address Greptile review feedback:
- Fix stale roles for unchanged callers by expanding the affected file set to include edge neighbours (callers/callees of changed files)
- Remove duplicate median() helper from classifyNodeRolesIncremental, export and reuse the one from graph/classifiers/roles.ts
- Increase performance assertion thresholds from 50ms to 200ms to prevent flaky CI failures on slow runners
- Add structural-change parity test that removes a cross-file call and verifies incremental roles match a full rebuild

Impact: 3 functions changed, 4 affected

carlos-alm · 2026-03-26T05:57:13Z

@greptileai

greptile-apps · 2026-03-26T06:02:32Z

src/features/structure.ts

+  // 6. Build summary (only for affected nodes) and update only those nodes
+  const summary: RoleSummary = { ...emptySummary };
+  const idsByRole = new Map<string, number[]>();
+  for (const row of rows) {
+    const role = roleMap.get(String(row.id)) || 'leaf';
+    if (role.startsWith('dead')) summary.dead++;
+    summary[role] = (summary[role] || 0) + 1;
+    let ids = idsByRole.get(role);
+    if (!ids) {
+      ids = [];
+      idsByRole.set(role, ids);
+    }
+    ids.push(row.id);
+  }


Incremental RoleSummary under-counts when logged

The summary returned by the incremental path is built solely from rows (the affected nodes), so every role counter starts at 0 and only reflects the reclassified subset. The full path returns a graph-wide summary from all nodes, so the two return types are semantically different despite sharing the same RoleSummary shape.

Currently the return value is only used for a debug log line:

debug(
  `Roles${changedFileList ? ` (incremental, ${changedFileList.length} files)` : ''}: ${Object.entries(roleSummary)...}`,
);

This is harmless today, and the log prefix (incremental, N files) makes the limitation visible. The risk is that a future caller treats the returned counts as global totals and makes incorrect decisions (e.g., "0 core nodes" triggering a warning).

Consider either:

Documenting the semantics explicitly on classifyNodeRoles's return type (JSDoc noting that incremental returns are scoped to affected nodes), or

Returning null from the incremental path to signal "partial data" to callers.

This is not a bug today but a potential footgun as the codebase evolves.

Fixed — added a JSDoc comment to classifyNodeRoles explicitly documenting that when changedFiles is provided, the returned RoleSummary reflects only the affected subset, not the entire graph. Callers needing graph-wide totals should omit changedFiles or query the DB directly.
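For comparison with the JSDoc approach taken here, the "partial data" concern could also be expressed in the type itself. The sketch below is a hypothetical alternative, not what the PR does: `ScopedRoleSummary` and `coreNodeWarning` are invented names, and the "no core nodes" warning is the reviewer's example of a decision that should only be made on graph-wide counts.

```typescript
type RoleCounts = Record<string, number>;

// Hypothetical tagged result: the scope travels with the counts.
type ScopedRoleSummary =
  | { scope: 'graph'; counts: RoleCounts }
  | { scope: 'affected'; counts: RoleCounts; fileCount: number };

// Only graph-wide summaries are allowed to trigger a global warning.
function coreNodeWarning(summary: ScopedRoleSummary): string | null {
  if (summary.scope !== 'graph') return null; // partial counts say nothing about totals
  return (summary.counts['core'] ?? 0) === 0 ? 'no core nodes in graph' : null;
}

const incrementalSummary: ScopedRoleSummary = {
  scope: 'affected',
  counts: { leaf: 4 },
  fileCount: 1,
};
const emptyFullSummary: ScopedRoleSummary = { scope: 'graph', counts: { leaf: 4 } };
const healthyFullSummary: ScopedRoleSummary = { scope: 'graph', counts: { core: 3, leaf: 4 } };
```

A tagged type forces callers to check the scope before reading totals, at the cost of changing the return shape; the documentation-only fix keeps the existing shape unchanged.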

The path command previously only supported symbol-to-symbol queries. Add file-level BFS that traverses import edges between files, enabling queries like `codegraph path cli.ts db/index.ts --file` for sync-phase analysis and module dependency exploration.
- filePathData() in dependencies.ts: file-level BFS over import edges
- CLI: -f/--file flag, defaults edge kinds to imports,imports-type
- MCP: file_mode parameter on path tool
- 9 integration tests covering 1-hop, multi-hop, reverse, noTests, etc.

Impact: 4 functions changed, 5 affected
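A file-level BFS with parent-map path reconstruction, as this commit describes, can be sketched as follows. This is a minimal in-memory illustration, not the PR's `filePathData()`: the `ImportGraph` type, the `shortestFilePath` and `reconstructPath` names, the sample graph, and the default `maxDepth` of 10 are all assumptions, and options such as `noTests`, `reverse`, and `edgeKinds` are omitted.

```typescript
// Hypothetical adjacency list: file -> files it imports.
type ImportGraph = Map<string, string[]>;

// Walk parent links back from the target to the start, then reverse.
function reconstructPath(parent: Map<string, string>, from: string, to: string): string[] {
  const path = [to];
  let current = to;
  while (current !== from) {
    const prev = parent.get(current);
    if (prev === undefined) break; // defensive; every visited node has a parent
    path.push(prev);
    current = prev;
  }
  return path.reverse();
}

// Breadth-first search: the first time the target is reached is a shortest path.
// maxDepth caps the number of hops explored.
function shortestFilePath(
  graph: ImportGraph,
  from: string,
  to: string,
  maxDepth = 10,
): string[] | null {
  if (from === to) return [from]; // self-file: 0 hops
  const parent = new Map<string, string>();
  const visited = new Set<string>([from]);
  let frontier = [from];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const dep of graph.get(file) ?? []) {
        if (visited.has(dep)) continue;
        visited.add(dep);
        parent.set(dep, file);
        if (dep === to) return reconstructPath(parent, from, to);
        next.push(dep);
      }
    }
    frontier = next;
  }
  return null; // not reachable within maxDepth hops
}

const importGraph: ImportGraph = new Map([
  ['cli.ts', ['commands.ts']],
  ['commands.ts', ['db/index.ts', 'util.ts']],
  ['util.ts', []],
]);

const found = shortestFilePath(importGraph, 'cli.ts', 'db/index.ts');
const tooShallow = shortestFilePath(importGraph, 'cli.ts', 'db/index.ts', 1);
```

In this sample graph the path from cli.ts to db/index.ts goes through commands.ts (2 hops), so limiting `maxDepth` to 1 yields no result. A reverse query could be served by the same function over a graph with the edges flipped.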

Impact: 18 functions changed, 31 affected

…codegraph into perf/incremental-rebuild

carlos-alm · 2026-03-26T07:12:24Z

@greptileai

greptile-apps bot reviewed Mar 26, 2026

carlos-alm added 6 commits March 26, 2026 00:21

Merge branch 'main' into perf/incremental-rebuild

c11cb6a

Merge remote-tracking branch 'origin/main' into perf/incremental-rebuild

e314030

Impact: 18 functions changed, 31 affected

Merge branch 'main' into perf/incremental-rebuild

c1fb3db

Merge branch 'perf/incremental-rebuild' of https://github.com/optave/…

7277e8b

…codegraph into perf/incremental-rebuild

fix(roles): document incremental RoleSummary scoped semantics (#622)

6547755

carlos-alm added 2 commits March 26, 2026 01:37

fix: resolve merge conflicts with main

665a211

Merge branch 'main' into perf/incremental-rebuild

7846ce1

carlos-alm merged commit 0238d10 into main Mar 26, 2026
12 checks passed

carlos-alm deleted the perf/incremental-rebuild branch March 26, 2026 07:46

github-actions bot locked and limited conversation to collaborators Mar 26, 2026
