Skip to content

perf(globs): memoize compiled include/exclude globs per build#1005

Merged
carlos-alm merged 5 commits intomainfrom
fix/memoize-globs
Apr 23, 2026
Merged

perf(globs): memoize compiled include/exclude globs per build#1005
carlos-alm merged 5 commits intomainfrom
fix/memoize-globs

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Memoize compiled include/exclude globs in both the TS host (src/shared/globs.ts) and the Rust native engine (crates/codegraph-core/src/file_collector.rs) so long-running processes (watch mode, MCP server) don't recompile pattern lists on every buildGraph call.
  • Keyed on exact pattern content; FIFO cache capped at 32 entries with a clear() hook for tests and future config-reload scenarios.
  • TS side widens collectFiles internal signatures to readonly RegExp[] so frozen cached arrays flow through unchanged.

Why

Prior behavior: every buildGraph invocation re-parsed the same include/exclude patterns into fresh RegExp[] / GlobSet. Cheap on a single run, but watch mode and the MCP server issue many rebuilds against the same config and were paying the compile cost repeatedly.

Closes #1000 (deferred follow-up from PR #994 review).

Test plan

  • npx vitest run tests/unit/globs.test.ts — 7/7 pass (new file)
  • cargo test -p codegraph-core file_collector — 8/8 pass (includes 2 new memoization tests)
  • npx biome check on changed TS files — clean
  • Manual sanity: watch mode still hot-reloads on config edit (caller must clearGlobCache() if reloading config at runtime)

Long-running hosts (watch mode, MCP server) invoke buildGraph repeatedly
with the same config. Previously each invocation recompiled the
include/exclude patterns from scratch. Memoize by pattern content so the
compiled regex list (TS) / GlobSet (Rust) is reused across calls.

Both sides use a FIFO cache capped at 32 entries with a clear() hook for
tests and config-reload scenarios.

Closes #1000
@claude
Copy link
Copy Markdown

claude Bot commented Apr 23, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: perf(globs): memoize compiled include/exclude globs per build

Todo List

  • Phase 1: Root Cause Analysis - understand the performance problem being solved
  • Phase 2: Configuration Change Detection - check for rule relaxations
  • Phase 3: Test Integrity - verify test quality and coverage
  • Phase 4: Code Quality Review - examine implementation details
  • Phase 5: Backlog Compliance Assessment - rate against BACKLOG.md criteria
  • Provide final assessment and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Long-running processes (watch mode, MCP server) were recompiling the same include/exclude glob patterns on every buildGraph call. This was wasteful for processes that rebuild repeatedly with identical configuration.

Why This Approach: Memoization is the correct solution - cache compiled RegExp arrays (TS) and GlobSet instances (Rust) keyed by pattern content. The approach is sensible with FIFO eviction and reasonable cache limits.

Risk Assessment: Low risk. This is a pure performance optimization that doesn't change semantics - same inputs produce same outputs, just cached.

Phase 2: Configuration Change Detection ✅

No problematic configuration changes detected:

  • No Biome rules disabled or relaxed
  • No test timeouts increased or coverage lowered
  • No TypeScript strict mode changes
  • No build configuration modifications
  • This is purely additive performance optimization code

Phase 3: Test Integrity ✅

Strong test coverage:

  • TypeScript side: 7 comprehensive tests covering edge cases, memoization behavior, cache isolation, frozen arrays, and cache clearing
  • Rust side: 2 new memoization tests using Arc::ptr_eq to verify identical compiled instances are returned
  • No weakened assertions - tests properly verify the performance optimization using identity comparisons (toBe() for same object reference)
  • Proper isolation - uses afterEach(clearGlobCache) to prevent test pollution
  • Edge case coverage - tests empty inputs, pattern ordering sensitivity, cache eviction

Phase 4: Code Quality Review ✅

Implementation Quality:

  • Cache key strategy: Uses \x1f (Unit Separator) to join patterns - robust choice that won't appear in file paths
  • FIFO eviction: Simple and appropriate for stable config pattern lists
  • Cache size limit: 32 entries is reasonable for config-driven pattern lists
  • Thread safety (Rust): Uses Arc<GlobSet> and Mutex<GlobCache> correctly
  • Type safety improvement: Widens signatures to readonly RegExp[] - good immutability hygiene
  • Error handling: Gracefully handles malformed patterns (existing behavior preserved)
  • Memory management: Frozen arrays prevent accidental mutation of cached results

Potential Issues Found:

  • Cache key collision risk: Patterns like ["a", "bc"] and ["ab", "c"] produce the same key (a\x1fc vs ab\x1c). However, this is acceptable since \x1f cannot appear in glob patterns (filesystem-safe chars only).

Security: No security implications - pure performance optimization.

Phase 5: Backlog Compliance Assessment

Root Cause Analysis

Problem Being Solved: Performance optimization for long-running processes that repeatedly compile identical glob patterns
Why This Approach: Direct memoization of expensive compilation step is the correct solution
Risk Assessment: Very low - pure performance optimization with identical semantics

Backlog Compliance

  • Zero-dep: ✓ — Uses only existing infrastructure (Map cache, Object.freeze, Arc/Mutex patterns)
  • Foundation-aligned: ✓ — Pure performance optimization, no architectural changes, maintains embeddable API
  • Problem-fit: 3 — Useful optimization but tangential to core "AI orientation" problem
  • Breaking: No — Purely additive, maintains exact API compatibility
  • Tier: 1

Critical Concerns

None identified. This is a clean, well-tested performance optimization.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation of a legitimate performance optimization. Well-tested with comprehensive coverage of edge cases. Code follows existing patterns and maintains backward compatibility. The only reason it's not 5 stars is that it's not a high-impact feature - just a solid incremental improvement for long-running processes.

Strengths:
✅ Addresses a real performance issue in watch mode and MCP server
✅ Comprehensive test suite with proper isolation
✅ Clean dual-engine implementation (TS + Rust)
✅ Proper cache management with FIFO eviction
✅ Type safety improvements (readonly arrays)
✅ Zero breaking changes

Minor considerations:

  • Cache key collision theoretical edge case (acceptable risk)
  • 32-entry cache limit seems reasonable but untested under stress

This PR demonstrates good engineering practices and solves a genuine performance issue without compromising code quality or introducing risk.


Job run

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR memoizes compiled glob patterns in both the TypeScript layer (src/shared/globs.ts) and the Rust native engine (file_collector.rs), eliminating redundant recompilation in long-running hosts (watch mode, MCP server). Both sides use a content-keyed FIFO cache capped at 32 entries; the prior review concerns (cache key collision via \x1f, O(n) Vec::remove(0)) have been fully resolved in this iteration.

Confidence Score: 5/5

Safe to merge — all previously raised concerns are resolved and no new correctness issues were found.

Both blocking issues from the prior review round (cache key collision and O(n) eviction) are fixed. The Rust TOCTOU gap between the cache-check lock release and the cache-insert lock acquire is benign (worst case is two threads each compile the same patterns once, with the second insert being a no-op update). All remaining findings are P2 style observations. Tests cover the key invariants well.

No files require special attention.

Important Files Changed

Filename Overview
src/shared/globs.ts Adds module-level FIFO memoization cache (capped at 32) for compiled glob→RegExp arrays; returns frozen arrays; uses JSON.stringify key (prior collision issue resolved); exports clearGlobCache() for config-reload scenarios.
crates/codegraph-core/src/file_collector.rs Adds GlobCache backed by HashMap + VecDeque (O(1) pop_front — prior O(n) issue resolved); build_glob_set now returns Option<Arc>; uses OnceLock<Mutex> global; as_deref() call-sites updated correctly; clear_glob_cache is cfg(test)-only.
src/domain/graph/builder/helpers.ts Widens _includeRegexes/_excludeRegexes parameters from RegExp[] to readonly RegExp[] across all three overloads — backward-compatible and required for frozen cached arrays to flow through.
tests/unit/globs.test.ts New test file: 7 tests covering empty input, pattern compilation, memoization identity, distinct keys, order sensitivity, frozen output, and cache clear isolation. afterEach clearGlobCache ensures isolation.

Sequence Diagram

sequenceDiagram
    participant Caller as buildGraph / collect_files
    participant CG as compileGlobs (TS) / build_glob_set (Rust)
    participant Cache as compileCache / GlobCache
    participant Compiler as globToRegex / GlobSetBuilder

    Caller->>CG: patterns[]
    CG->>Cache: get(JSON.stringify(patterns) / Vec<String>)
    alt Cache hit
        Cache-->>CG: Arc<GlobSet> / readonly RegExp[]
        CG-->>Caller: cached result (no recompilation)
    else Cache miss
        Cache-->>CG: None / undefined
        CG->>Compiler: compile each pattern
        Compiler-->>CG: GlobSet / RegExp[]
        CG->>CG: Object.freeze(out) / Arc::new(set)
        CG->>Cache: insert(key, frozen result)
        Note over Cache: FIFO eviction if size >= 32
        CG-->>Caller: fresh compiled result
    end
Loading

Reviews (2): Last reviewed commit: "fix(globs): use VecDeque for O(1) FIFO e..." | Re-trigger Greptile

Comment thread src/shared/globs.ts Outdated
Comment on lines +61 to +65
const GLOB_KEY_SEP = '\x1f';
const compileCache = new Map<string, readonly RegExp[]>();

function buildCacheKey(patterns: readonly string[]): string {
return patterns.join(GLOB_KEY_SEP);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Cache key collision on control-character separator

buildCacheKey joins patterns with \x1f (ASCII unit separator). If a user's config file ever contains a pattern with that character (e.g. from copy-paste of a binary-encoded string), two distinct pattern lists can produce the same cache key and the second lookup will silently return the wrong compiled regexes. JSON.stringify(patterns) is an unambiguous, zero-overhead alternative.

Suggested change
const GLOB_KEY_SEP = '\x1f';
const compileCache = new Map<string, readonly RegExp[]>();
function buildCacheKey(patterns: readonly string[]): string {
return patterns.join(GLOB_KEY_SEP);
function buildCacheKey(patterns: readonly string[]): string {
return JSON.stringify(patterns);
}

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e83741d — swapped the \x1f separator for JSON.stringify(patterns). Each pattern is now quoted and comma-delimited in the key, so two distinct pattern lists can never collide regardless of what characters appear in the patterns themselves. No perf impact on the hot path (cache hit is still a single Map lookup).

Comment on lines +73 to +87
self.map.get(key).cloned()
}

fn insert(&mut self, key: Vec<String>, value: Arc<GlobSet>) {
if self.map.contains_key(&key) {
self.map.insert(key, value);
return;
}
if self.map.len() >= COMPILE_CACHE_MAX && !self.order.is_empty() {
let oldest = self.order.remove(0);
self.map.remove(&oldest);
}
self.order.push(key.clone());
self.map.insert(key, value);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 FIFO eviction with Vec::remove(0) is O(n)

self.order.remove(0) shifts every remaining element left on each eviction. At cap=32 the cost is negligible, but a VecDeque would give O(1) pop_front at no extra complexity and makes the intent clearer.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8922af4 — switched order from Vec<Vec<String>> to VecDeque<Vec<String>> and replaced remove(0)/push with pop_front/push_back. O(1) eviction and the intent reads more clearly. cargo test -p codegraph-core file_collector still passes all 8 tests (including the 2 memoization tests).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

Codegraph Impact Analysis

15 functions changed17 callers affected across 5 files

  • GlobCache.new in crates/codegraph-core/src/file_collector.rs:67 (0 transitive callers)
  • GlobCache.get in crates/codegraph-core/src/file_collector.rs:74 (0 transitive callers)
  • GlobCache.insert in crates/codegraph-core/src/file_collector.rs:78 (0 transitive callers)
  • GlobCache.clear in crates/codegraph-core/src/file_collector.rs:93 (0 transitive callers)
  • glob_cache in crates/codegraph-core/src/file_collector.rs:99 (12 transitive callers)
  • clear_glob_cache in crates/codegraph-core/src/file_collector.rs:108 (2 transitive callers)
  • build_glob_set in crates/codegraph-core/src/file_collector.rs:120 (10 transitive callers)
  • collect_files in crates/codegraph-core/src/file_collector.rs:192 (4 transitive callers)
  • try_fast_collect in crates/codegraph-core/src/file_collector.rs:289 (2 transitive callers)
  • build_glob_set_memoizes_identical_pattern_lists in crates/codegraph-core/src/file_collector.rs:463 (0 transitive callers)
  • build_glob_set_cache_distinguishes_different_lists in crates/codegraph-core/src/file_collector.rs:478 (0 transitive callers)
  • collectFiles in src/domain/graph/builder/helpers.ts:106 (0 transitive callers)
  • buildCacheKey in src/shared/globs.ts:63 (4 transitive callers)
  • compileGlobs in src/shared/globs.ts:77 (4 transitive callers)
  • clearGlobCache in src/shared/globs.ts:107 (0 transitive callers)

Greptile flagged that joining patterns with \x1f could alias distinct lists
whose contents contain that separator. JSON.stringify is unambiguous and
equally cheap — each pattern is quoted and comma-delimited so no two lists
share a key.

Impact: 1 functions changed, 4 affected
Greptile noted that Vec::remove(0) shifts every remaining element on each
eviction. VecDeque::pop_front is O(1) and better communicates FIFO intent.
At cap=32 the practical difference is negligible, but the cleaner semantics
are worth the trivial change.

Impact: 2 functions changed, 0 affected
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 8f6f828 into main Apr 23, 2026
11 checks passed
@carlos-alm carlos-alm deleted the fix/memoize-globs branch April 23, 2026 18:33
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 23, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

follow-up: memoize compiled include/exclude globs per build

1 participant