perf(globs): memoize compiled include/exclude globs per build #1005
carlos-alm merged 5 commits into main
Long-running hosts (watch mode, MCP server) invoke buildGraph repeatedly with the same config. Previously each invocation recompiled the include/exclude patterns from scratch. Memoize by pattern content so the compiled regex list (TS) / GlobSet (Rust) is reused across calls. Both sides use a FIFO cache capped at 32 entries with a clear() hook for tests and config-reload scenarios. Closes #1000
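A minimal TypeScript sketch of the memoization described above, under stated assumptions. The names `compileGlobs`, `clearGlobCache`, `compileCache`, and `COMPILE_CACHE_MAX` mirror identifiers mentioned in this PR, but the bodies are illustrative only; `globToRegex` here is a deliberately simplified stand-in (it handles just `*`, `**`, and `?`), not the project's converter. The sketch relies on `Map` iterating keys in insertion order, which gives FIFO eviction without a separate queue.

```typescript
const COMPILE_CACHE_MAX = 32;
const compileCache = new Map<string, readonly RegExp[]>();

// Simplified stand-in for the real glob converter (assumption, not the PR's code).
function globToRegex(glob: string): RegExp {
  let src = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === '*') {
      if (glob.charAt(i + 1) === '*') {
        src += '.*'; // `**` may cross directory separators
        i++;
      } else {
        src += '[^/]*'; // `*` stays within one path segment
      }
    } else if (ch === '?') {
      src += '[^/]'; // `?` matches one non-separator character
    } else {
      src += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
    }
  }
  return new RegExp(`^${src}$`);
}

function compileGlobs(patterns: readonly string[]): readonly RegExp[] {
  const key = JSON.stringify(patterns); // unambiguous content-based key
  const hit = compileCache.get(key);
  if (hit !== undefined) return hit; // cache hit: no recompilation

  const compiled: readonly RegExp[] = Object.freeze(patterns.map((p) => globToRegex(p)));
  if (compileCache.size >= COMPILE_CACHE_MAX) {
    // Map preserves insertion order, so the first key is the oldest entry.
    const oldest = compileCache.keys().next().value;
    if (oldest !== undefined) compileCache.delete(oldest);
  }
  compileCache.set(key, compiled);
  return compiled;
}

// Hook for tests and config-reload scenarios.
function clearGlobCache(): void {
  compileCache.clear();
}
```

Two calls with equal pattern lists return the same frozen array, so a caller can check for a cache hit with reference equality. After 32 further distinct lists are inserted, the oldest entry is evicted and the next call for it compiles a fresh array.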
Greptile Summary
This PR memoizes compiled glob patterns in both the TypeScript layer (
Confidence Score: 5/5
Safe to merge: all previously raised concerns are resolved and no new correctness issues were found.
Both blocking issues from the prior review round (cache key collision and O(n) eviction) are fixed. The Rust TOCTOU gap between the cache-check lock release and the cache-insert lock acquire is benign (worst case is two threads each compile the same patterns once, with the second insert being a no-op update). All remaining findings are P2 style observations. Tests cover the key invariants well.
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant Caller as buildGraph / collect_files
    participant CG as compileGlobs (TS) / build_glob_set (Rust)
    participant Cache as compileCache / GlobCache
    participant Compiler as globToRegex / GlobSetBuilder
    Caller->>CG: patterns[]
    CG->>Cache: get(JSON.stringify(patterns) / Vec<String>)
    alt Cache hit
        Cache-->>CG: Arc<GlobSet> / readonly RegExp[]
        CG-->>Caller: cached result (no recompilation)
    else Cache miss
        Cache-->>CG: None / undefined
        CG->>Compiler: compile each pattern
        Compiler-->>CG: GlobSet / RegExp[]
        CG->>CG: Object.freeze(out) / Arc::new(set)
        CG->>Cache: insert(key, frozen result)
        Note over Cache: FIFO eviction if size >= 32
        CG-->>Caller: fresh compiled result
    end
Reviews (2): Last reviewed commit: "fix(globs): use VecDeque for O(1) FIFO e..."
const GLOB_KEY_SEP = '\x1f';
const compileCache = new Map<string, readonly RegExp[]>();

function buildCacheKey(patterns: readonly string[]): string {
  return patterns.join(GLOB_KEY_SEP);
Cache key collision on control-character separator
buildCacheKey joins patterns with \x1f (ASCII unit separator). If a user's config file ever contains a pattern with that character (e.g. from copy-paste of a binary-encoded string), two distinct pattern lists can produce the same cache key and the second lookup will silently return the wrong compiled regexes. JSON.stringify(patterns) is an unambiguous, zero-overhead alternative.
const GLOB_KEY_SEP = '\x1f';
const compileCache = new Map<string, readonly RegExp[]>();
function buildCacheKey(patterns: readonly string[]): string {
  return patterns.join(GLOB_KEY_SEP);

function buildCacheKey(patterns: readonly string[]): string {
  return JSON.stringify(patterns);
}
Fixed in e83741d — swapped the \x1f separator for JSON.stringify(patterns). Each pattern is now quoted and comma-delimited in the key, so two distinct pattern lists can never collide regardless of what characters appear in the patterns themselves. No perf impact on the hot path (cache hit is still a single Map lookup).
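The collision can be reproduced in a few lines. This is an illustrative sketch, not code from the PR: `joinKey` and `jsonKey` are hypothetical names standing in for the old and new `buildCacheKey` bodies, and the sample patterns are made up.

```typescript
const GLOB_KEY_SEP = '\x1f';

// Old scheme: join with the ASCII unit separator.
function joinKey(patterns: readonly string[]): string {
  return patterns.join(GLOB_KEY_SEP);
}

// New scheme: each pattern is quoted and escaped, so list boundaries are explicit.
function jsonKey(patterns: readonly string[]): string {
  return JSON.stringify(patterns);
}

// Two distinct lists: one pattern that happens to contain the separator
// versus two separate patterns.
const single = [`src/**${GLOB_KEY_SEP}lib/**`];
const pair = ['src/**', 'lib/**'];

const joinCollides = joinKey(single) === joinKey(pair);
const jsonCollides = jsonKey(single) === jsonKey(pair);

console.log({ joinCollides, jsonCollides });
// prints { joinCollides: true, jsonCollides: false }
```

With the joined key, the second list would be served the regexes compiled for the first. `JSON.stringify` escapes the control character inside the string and delimits each element, so the two keys differ.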
        self.map.get(key).cloned()
    }

    fn insert(&mut self, key: Vec<String>, value: Arc<GlobSet>) {
        if self.map.contains_key(&key) {
            self.map.insert(key, value);
            return;
        }
        if self.map.len() >= COMPILE_CACHE_MAX && !self.order.is_empty() {
            let oldest = self.order.remove(0);
            self.map.remove(&oldest);
        }
        self.order.push(key.clone());
        self.map.insert(key, value);
    }
Fixed in 8922af4 — switched order from Vec<Vec<String>> to VecDeque<Vec<String>> and replaced remove(0)/push with pop_front/push_back. O(1) eviction and the intent reads more clearly. cargo test -p codegraph-core file_collector still passes all 8 tests (including the 2 memoization tests).
Codegraph Impact Analysis
15 functions changed → 17 callers affected across 5 files
Greptile flagged that joining patterns with \x1f could alias distinct lists whose contents contain that separator. JSON.stringify is unambiguous and equally cheap: each pattern is quoted and comma-delimited, so no two lists share a key.
Impact: 1 functions changed, 4 affected
Greptile noted that Vec::remove(0) shifts every remaining element on each eviction. VecDeque::pop_front is O(1) and better communicates FIFO intent. At cap=32 the practical difference is negligible, but the cleaner semantics are worth the trivial change.
Impact: 2 functions changed, 0 affected

Summary
- [...] src/shared/globs.ts) and the Rust native engine (crates/codegraph-core/src/file_collector.rs) so long-running processes (watch mode, MCP server) don't recompile pattern lists on every buildGraph call.
- [...] clear() hook for tests and future config-reload scenarios.
- [...] collectFiles internal signatures to readonly RegExp[] so frozen cached arrays flow through unchanged.

Why
Prior behavior: every buildGraph invocation re-parsed the same include/exclude patterns into fresh RegExp[] / GlobSet. Cheap on a single run, but watch mode and the MCP server issue many rebuilds against the same config and were paying the compile cost repeatedly.
Closes #1000 (deferred follow-up from PR #994 review).

Test plan
- npx vitest run tests/unit/globs.test.ts: 7/7 pass (new file)
- cargo test -p codegraph-core file_collector: 8/8 pass (includes 2 new memoization tests)
- npx biome check on changed TS files: clean
- [...] clearGlobCache() if reloading config at runtime)
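To illustrate why the internal signatures take `readonly RegExp[]`: `Object.freeze` on an array yields a readonly array type, which TypeScript will not pass to a parameter typed as a mutable `RegExp[]`. The following is a hedged sketch; `matchesAny` is a hypothetical consumer, not a function from this PR, and the regexes are placeholders.

```typescript
// Hypothetical consumer: it accepts a readonly list, so a frozen cached
// array can be passed straight through without copying.
function matchesAny(path: string, regexes: readonly RegExp[]): boolean {
  return regexes.some((re) => re.test(path));
}

// Stand-in for an array returned from the compile cache.
const cached: readonly RegExp[] = Object.freeze([/\.ts$/, /\.rs$/]);

// If the parameter were typed `RegExp[]`, this call would be a compile
// error and callers would need a copy such as `[...cached]` on every call.
const tsMatch = matchesAny('src/shared/globs.ts', cached);

// Freezing also protects the shared entry at runtime: forcing a mutation
// through a cast throws instead of corrupting the cached array.
let mutationRejected = false;
try {
  (cached as RegExp[]).push(/\.js$/);
} catch {
  mutationRejected = true;
}
```

Because every caller receives the same array instance on a cache hit, the readonly type and the runtime freeze together keep one caller from altering the patterns another caller sees.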