
perf(index): lift trigram-index file-size cap 64KB → 1MB (search-shootout parity fix)#471

Merged: justrach merged 1 commit into main from parity-trigram-cap-lift on May 20, 2026

@justrach (Owner)

Bumped the trigram-index file-size cap from 64KB to 1MB across all five indexing paths in watcher.zig and the word-index path in explore.zig.

Why

Large code files (>64KB) were invisible to codedb's substring search. Example: React's ReactFiberCompleteWork.js (77KB) — agents searching for code patterns in it would silently miss real call sites.
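To make the failure mode concrete, here is a minimal Python sketch of a size-gated trigram index. It is not the code in watcher.zig or explore.zig: the function names, the file names, and the choice of a strict `>` comparison against the cap are all assumptions made for illustration. It only shows why a file over the cap never appears in any posting list, so a substring search cannot return it.

```python
from collections import defaultdict

OLD_CAP = 64 * 1024       # 64KB cap before this PR
NEW_CAP = 1024 * 1024     # 1MB cap after this PR


def build_trigram_index(files, max_bytes):
    """Map every 3-byte sequence to the set of paths that contain it.

    Files larger than max_bytes are skipped, so they contribute no
    trigrams at all (hypothetical gate; the real check may differ).
    """
    index = defaultdict(set)
    for path, content in files.items():
        if len(content) > max_bytes:
            continue  # over the cap: invisible to substring search
        for i in range(len(content) - 2):
            index[content[i:i + 3]].add(path)
    return index


def search(index, files, query):
    """Return sorted paths whose content contains query (3+ bytes)."""
    if len(query) < 3:
        raise ValueError("query must be at least 3 bytes long")
    trigrams = [query[i:i + 3] for i in range(len(query) - 2)]
    # A match must contain every trigram of the query.
    candidates = set.intersection(*(index.get(t, set()) for t in trigrams))
    # Trigram hits are only candidates; confirm with a real substring test.
    return sorted(p for p in candidates if query in files[p])


# Hypothetical corpus: one small file and one file of roughly 77KB.
FILES = {
    "small.js": b"completeWork(current);\n",
    "large.js": b"// filler line\n" * 5300 + b"completeWork(workInProgress);\n",
}
QUERY = b"completeWork("

print(search(build_trigram_index(FILES, OLD_CAP), FILES, QUERY))
# ['small.js']
print(search(build_trigram_index(FILES, NEW_CAP), FILES, QUERY))
# ['large.js', 'small.js']
```

With the 64KB cap the large file is silently absent from the results; with the 1MB cap both call sites are found in a single search.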

Measured impact

Search-shootout bench, React corpus, before/after:

| metric | before | after |
| --- | --- | --- |
| codedb T2 quality | 3.0/5 (missed CompleteWork:1164) | 5.0/5 (finds all 4 sites in 1 search) |
| codedb avg wall (across 4 tasks) | 58.8 s | 35.7 s (about 39% faster) |
| codedb avg quality | 4.50/5 | 4.62/5 |
| Pareto status vs fts5_trigram | dominated | Pareto-OPTIMAL (wins on wall) |

Cost

  • Snapshot for facebook/react: 38MB → 64MB on disk
  • 1MB cap still excludes minified/generated bundles
  • Cold-build time: unchanged
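The benefit and the cost can be put on the same footing as relative changes. The short Python sketch below uses only the rounded figures quoted in this description; the helper name `percent_change` is introduced here for illustration.

```python
def percent_change(before, after):
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


wall = percent_change(58.8, 35.7)    # avg wall time in seconds, 4 tasks
snapshot = percent_change(38, 64)    # facebook/react snapshot size in MB

print(f"wall time:     {wall:+.1f}%")
print(f"snapshot size: {snapshot:+.1f}%")
```

From the rounded averages, wall time drops by about 39% while the on-disk snapshot grows by about 68%.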

The trigram index gated file inclusion at 64KB, making large code files
(e.g. React's ReactFiberCompleteWork.js at 77KB, ReactFiberHooks.js at
~120KB) invisible to substring search via the codedb_search MCP tool.
Agents searching for code patterns in those files would only see
matches in smaller files, missing real call sites.

Bumped to 1MB across all five indexing paths in watcher.zig + the
word_index path in explore.zig. 1MB still excludes minified/generated
bundles (which shouldn't be in a code-intel index anyway) while
covering all reasonable hand-authored code files.

## Measured impact

Search-shootout bench, codedb on the React corpus, before and after the fix:

  before: codedb T2 quality avg 3.0/5 (missed CompleteWork:1164)
  after:  codedb T2 quality 5.0/5 (finds all 4 call sites in 1 search)

  before: codedb wall avg 58.8s (more exploration to find missing files)
  after:  codedb wall avg 35.7s (about 39% faster end-to-end)

  before: codedb dominated by fts5_trigram on the Pareto frontier
  after:  codedb is Pareto-OPTIMAL (wins on wall, fts5 wins on quality+tokens)

## Cost

  codedb snapshot for facebook/react: 38MB → 64MB on disk
  cold-build time: ~unchanged (large files were always parsed; only
  trigram extraction is new for them)
justrach added a commit that referenced this pull request on May 20, 2026:
Added codedb rep2/rep4 answers across all 4 tasks, run against the
fixed build (PR #471, 1MB trigram cap). Re-judged via Sonnet 4.6.

Updated QD matrix:

  backend           quality  tokens   wall (s)  status
  fts5_trigram      5.00     15,890   49.5      PARETO-OPTIMAL
  codedb            4.62     17,451   35.7      PARETO-OPTIMAL ← was dominated
  codedb_LEAN       4.33     24,474   108.0     dominated
  leanctx           4.25     19,651   99.0      dominated

codedb wins on wall time (35.7 vs 49.5s — 28% faster); fts5 wins on
quality (5.00 vs 4.62 — codedb T2 still has rep1's pre-fix 3/5 dragging
the avg) and tokens. Neither dominates the other on the Pareto frontier.
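The status column can be checked with a small dominance test. The Python sketch below assumes the frontier is computed over exactly the three columns shown (quality maximized, tokens and wall time minimized); the bench's actual scoring code is not part of this page, so the `Backend` type and function names are illustrative.

```python
from typing import NamedTuple


class Backend(NamedTuple):
    name: str
    quality: float  # judge score out of 5, higher is better
    tokens: int     # tokens used, lower is better
    wall: float     # wall time in seconds, lower is better


def dominates(a, b):
    """True if a is no worse than b on every axis and better on one."""
    no_worse = (a.quality >= b.quality and a.tokens <= b.tokens
                and a.wall <= b.wall)
    better = (a.quality > b.quality or a.tokens < b.tokens
              or a.wall < b.wall)
    return no_worse and better


def pareto_front(backends):
    """Names of backends that no other backend dominates."""
    return [b.name for b in backends
            if not any(dominates(other, b) for other in backends)]


# Values copied from the QD matrix above.
MATRIX = [
    Backend("fts5_trigram", 5.00, 15890, 49.5),
    Backend("codedb", 4.62, 17451, 35.7),
    Backend("codedb_LEAN", 4.33, 24474, 108.0),
    Backend("leanctx", 4.25, 19651, 99.0),
]

print(pareto_front(MATRIX))
# ['fts5_trigram', 'codedb']
```

Under these assumptions fts5_trigram dominates both codedb_LEAN and leanctx, while codedb stays on the frontier only because of its lower wall time.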

New per-task scores (codedb post-fix):
  T0 rep2: 5/5  (19,798 tok, 19s,  5 calls)
  T1 rep2: 4/5  (15,638 tok, 26s,  6 calls)  — judge faulted "..." in snippet
  T2 rep2: 5/5  (12,437 tok,  2s,  2 calls)  — found all 4 sites incl. CompleteWork
  T2 rep4: 4/5  (14,193 tok, 11s,  3 calls)  — judge faulted no fn-label in context
  T3 rep2: 5/5  (14,244 tok, 15s,  6 calls)

Three of the five new reps scored a perfect 5/5; the two 4/5 scores
reflect presentation issues (paraphrasing in snippet, missing function
labels), not correctness errors. The 1MB-cap fix landed in PR #471.
justrach merged commit 198b965 into main on May 20, 2026
1 check failed