perf(index): lift trigram-index file-size cap 64KB → 1MB (search-shootout parity fix) by justrach · Pull Request #471 · justrach/codedb

justrach · 2026-05-20T13:53:10Z

Bumped the trigram-index file-size cap from 64KB to 1MB across all five indexing paths in watcher.zig and the word-index path in explore.zig.

Why

Large code files (>64KB) were invisible to codedb's substring search. Example: React's ReactFiberCompleteWork.js (77KB) — agents searching for code patterns in it would silently miss real call sites.

Measured impact

Search-shootout bench, React corpus, before/after:

metric	before	after
codedb T2 quality	3.0/5 (missed CompleteWork:1164)	5.0/5 (finds all 4 sites in 1 search)
codedb avg wall (across 4 tasks)	58.8 s	35.7 s (38% faster)
codedb avg quality	4.50/5	4.62/5
Pareto status vs fts5_trigram	dominated	Pareto-OPTIMAL (wins on wall)

Cost

Snapshot for facebook/react: 38MB → 64MB on disk
1MB cap still excludes minified/generated bundles
Cold-build time: unchanged

The trigram index gated file inclusion at 64KB, making large code files (e.g. React's ReactFiberCompleteWork.js at 77KB, ReactFiberHooks.js at ~120KB) invisible to substring search via the codedb_search MCP tool. Agents searching for code patterns in those files would only see matches in smaller files, missing real call sites.

Bumped to 1MB across all five indexing paths in watcher.zig + the word_index path in explore.zig. 1MB still excludes minified/generated bundles (which shouldn't be in a code-intel index anyway) while covering all reasonable hand-authored code files.

## Measured impact

Search-shootout bench, post-fix codedb on the React corpus:

before: codedb T2 quality avg 3.0/5 (missed CompleteWork:1164)
after: codedb T2 quality 5.0/5 (finds all 4 call sites in 1 search)

before: codedb wall avg 58.8s (more exploration to find missing files)
after: codedb wall avg 35.7s (38% faster end-to-end)

before: codedb dominated by fts5_trigram on the Pareto frontier
after: codedb is Pareto-OPTIMAL (wins on wall, fts5 wins on quality+tokens)

## Cost

codedb snapshot for facebook/react: 38MB → 64MB on disk
cold-build time: ~unchanged (large files were always parsed; only trigram extraction is new for them)
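To illustrate why a size gate makes a file invisible to substring search, here is a minimal Python sketch of a size-capped trigram index. It is not the code in watcher.zig or explore.zig: the names `build_index`, `candidates`, `OLD_CAP_BYTES`, and `NEW_CAP_BYTES` are hypothetical, the file contents are synthetic, and whether the real cap comparison is inclusive is an assumption.

```python
from collections import defaultdict

# Hypothetical constants mirroring the caps named in this PR.
OLD_CAP_BYTES = 64 * 1024      # 64KB cap before the change
NEW_CAP_BYTES = 1024 * 1024    # 1MB cap after the change


def trigrams(text: str) -> set[str]:
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files: dict[str, str], cap_bytes: int) -> dict[str, set[str]]:
    """Map each trigram to the paths containing it, skipping files over the cap."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, content in files.items():
        # Assumption: files strictly larger than the cap are skipped.
        if len(content.encode("utf-8")) > cap_bytes:
            continue  # oversized file never enters the trigram index
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def candidates(index: dict[str, set[str]], query: str) -> set[str]:
    """Return paths that contain every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))


# Synthetic corpus: one small file and one ~77KB file, both with a call site.
files = {
    "small.js": "function completeWork() {}",
    "large.js": "x" * (77 * 1024) + " completeWork(current)",
}

old_hits = candidates(build_index(files, OLD_CAP_BYTES), "completeWork")
new_hits = candidates(build_index(files, NEW_CAP_BYTES), "completeWork")
```

Under the 64KB cap only `small.js` is a candidate, because `large.js` contributed no trigrams at all. Under the 1MB cap both files are candidates. The search does not fail loudly in the first case, which matches the "silently miss" behavior described in the PR.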

Added codedb rep2/rep4 answers across all 4 tasks, run against the fixed build (PR #471, 1MB trigram cap). Re-judged via Sonnet 4.6.

Updated QD matrix:

backend	quality	tokens	wall	status
fts5_trigram	5.00	15,890	49.5	PARETO-OPTIMAL
codedb	4.62	17,451	35.7	PARETO-OPTIMAL ← was dominated
codedb_LEAN	4.33	24,474	108.0	dominated
leanctx	4.25	19,651	99.0	dominated

codedb wins on wall time (35.7 vs 49.5s, 28% faster); fts5 wins on quality (5.00 vs 4.62; codedb T2 still has rep1's pre-fix 3/5 dragging the avg) and tokens. Neither dominates the other on the Pareto frontier.

New per-task scores (codedb post-fix):

T0 rep2: 5/5 (19,798 tok, 19s, 5 calls)
T1 rep2: 4/5 (15,638 tok, 26s, 6 calls), judge faulted "..." in snippet
T2 rep2: 5/5 (12,437 tok, 2s, 2 calls), found all 4 sites incl. CompleteWork
T2 rep4: 4/5 (14,193 tok, 11s, 3 calls), judge faulted no fn-label in context
T3 rep2: 5/5 (14,244 tok, 15s, 6 calls)

Three of the five new reps scored perfect 5/5; the 4/5 scores were presentation issues (paraphrasing in snippet, missing function labels), not correctness.

The 1MB-cap fix landed in PR #471.
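The Pareto statuses in the QD matrix can be reproduced with a small dominance check. The Python sketch below assumes the status column is computed over exactly three axes (quality higher-is-better, tokens and wall time lower-is-better); that is an inference from the comment, not something the PR states. `Backend`, `dominates`, and `pareto_front` are hypothetical names.

```python
from typing import NamedTuple


class Backend(NamedTuple):
    name: str
    quality: float  # higher is better
    tokens: int     # lower is better
    wall_s: float   # lower is better


def dominates(a: Backend, b: Backend) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.quality >= b.quality
                and a.tokens <= b.tokens
                and a.wall_s <= b.wall_s)
    strictly_better = (a.quality > b.quality
                       or a.tokens < b.tokens
                       or a.wall_s < b.wall_s)
    return no_worse and strictly_better


def pareto_front(rows: list[Backend]) -> list[str]:
    """Names of rows that no other row dominates, in input order."""
    return [r.name for r in rows if not any(dominates(o, r) for o in rows)]


# Values copied from the updated QD matrix in this comment.
ROWS = [
    Backend("fts5_trigram", 5.00, 15_890, 49.5),
    Backend("codedb", 4.62, 17_451, 35.7),
    Backend("codedb_LEAN", 4.33, 24_474, 108.0),
    Backend("leanctx", 4.25, 19_651, 99.0),
]

front = pareto_front(ROWS)
```

With these numbers, `fts5_trigram` and `codedb` are both on the front: `codedb` is better on wall time, `fts5_trigram` on quality and tokens, so neither dominates the other. `codedb_LEAN` and `leanctx` are each worse than `fts5_trigram` on all three axes.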

justrach merged commit 198b965 into main May 20, 2026
1 check failed

This was referenced May 20, 2026

perf(telemetry): cache approxIndexSizeBytes - codedb_status 9.4x faster #474

Merged

release: v0.2.5815 — consolidate 5814 max_cached wiring + main's session perf+context PRs #480

Merged

justrach merged 1 commit into main from parity-trigram-cap-lift
