perf(lexical/search): read facets from DocValues, not the stored document (#597) by mosuka · Pull Request #777 · mosuka/laurus

mosuka · 2026-06-03T14:01:41Z

Summary

FacetCollector::collect_doc (facet.rs) fetched reader.document(doc_id) for every collected hit, decoding the variable-width rkyv stored-fields blob and cloning a Document just to read the facet field values via document.get(field).

The right data structure for facet aggregation — DocValues — already exists end to end:

the writer stores every field into DocValues (InvertedIndexWriter::upsert_analyzed_document),
the reader exposes get_doc_value(field, doc_id) / has_doc_values(field) (both overridden by the segment and multi-segment readers),
and pub type FieldValue = crate::data::DataValue;, so get_doc_value returns exactly the DataValue the collector already matches on.

Since DocValues and the stored document derive from the same stored_fields, facet counts are identical.
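As a rough illustration of the reader surface described above, the sketch below uses simplified stand-ins rather than the actual laurus definitions. Only the names `get_doc_value`, `has_doc_values`, `FieldValue` and `DataValue` come from this description; the trait name `DocValuesReader`, the two-variant enum, the `Option` return type, and the `ColumnReader` / `LegacyReader` types are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `crate::data::DataValue` (variants are assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum DataValue {
    Text(String),
    Int64(i64),
}

/// Mirrors the alias quoted in the description.
pub type FieldValue = DataValue;

/// Hypothetical trait name; the method names follow the description.
pub trait DocValuesReader {
    /// Defaults to `false`, so readers without DocValues keep the old path.
    fn has_doc_values(&self, _field: &str) -> bool {
        false
    }

    /// Defaults to `None` for readers that do not override it.
    fn get_doc_value(&self, _field: &str, _doc_id: u64) -> Option<FieldValue> {
        None
    }
}

/// Toy column-oriented reader: one `doc_id -> value` map per field.
#[derive(Default)]
pub struct ColumnReader {
    columns: HashMap<String, HashMap<u64, FieldValue>>,
}

impl ColumnReader {
    pub fn insert(&mut self, field: &str, doc_id: u64, value: FieldValue) {
        self.columns
            .entry(field.to_string())
            .or_default()
            .insert(doc_id, value);
    }
}

impl DocValuesReader for ColumnReader {
    fn has_doc_values(&self, field: &str) -> bool {
        self.columns.contains_key(field)
    }

    fn get_doc_value(&self, field: &str, doc_id: u64) -> Option<FieldValue> {
        self.columns.get(field)?.get(&doc_id).cloned()
    }
}

/// A reader that overrides nothing, to show the defaults.
pub struct LegacyReader;

impl DocValuesReader for LegacyReader {}
```

Because a single-field lookup returns the same value type the collector already matches on, the collector can switch its source without changing how it interprets values.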

Change

collect_doc now reads each facet field from DocValues and skips reader.document() entirely when every facet field has a DocValues column (the common case). A field that lacks DocValues transparently falls back to the stored document (the synthetic error fallback is preserved on that path), so behaviour and counts are unchanged; readers that don't implement DocValues (has_doc_values defaults to false) keep the exact pre-existing path.
The value → facet-path extraction is factored into a shared push_path_components helper used by both paths.
Per-field DocValues availability is doc-independent, so it is resolved once and cached on the collector (field_has_dv) rather than re-probing the lock-guarded reader per hit. Phase 2 (interning + counter bumps) is unchanged.
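The control flow described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the code in facet.rs: the `Reader` trait, the `MapReader` toy reader, the `Document` alias, the `Option`-based signatures, and the '/'-separated path convention inside `push_path_components` are all hypothetical, and the synthetic error fallback is omitted.

```rust
use std::cell::Cell;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum DataValue {
    Text(String),
    Int64(i64),
}

/// Stand-in for the stored document: field name -> value.
pub type Document = HashMap<String, DataValue>;

/// Hypothetical reader surface (signatures are assumptions).
pub trait Reader {
    fn has_doc_values(&self, _field: &str) -> bool { false }
    fn get_doc_value(&self, _field: &str, _doc_id: u64) -> Option<DataValue> { None }
    fn document(&self, doc_id: u64) -> Option<Document>;
}

/// Shared value -> facet-path helper used by both paths.
/// Assumption: text values are '/'-separated hierarchical paths.
fn push_path_components(field: &str, value: &DataValue, out: &mut Vec<String>) {
    match value {
        DataValue::Text(s) => {
            let mut path = field.to_string();
            for part in s.split('/').filter(|p| !p.is_empty()) {
                path.push('/');
                path.push_str(part);
                out.push(path.clone());
            }
        }
        DataValue::Int64(n) => out.push(format!("{field}/{n}")),
    }
}

pub struct FacetCollector {
    fields: Vec<String>,
    /// Resolved on the first hit; availability does not depend on the doc.
    field_has_dv: Option<Vec<bool>>,
    pub counts: HashMap<String, u64>,
}

impl FacetCollector {
    pub fn new(fields: &[&str]) -> Self {
        Self {
            fields: fields.iter().map(|f| f.to_string()).collect(),
            field_has_dv: None,
            counts: HashMap::new(),
        }
    }

    pub fn collect_doc<R: Reader>(&mut self, reader: &R, doc_id: u64) {
        let fields = &self.fields;
        let has_dv = self
            .field_has_dv
            .get_or_insert_with(|| fields.iter().map(|f| reader.has_doc_values(f)).collect());

        let mut paths = Vec::new();
        let mut stored: Option<Document> = None;
        let mut fetched = false; // the stored document is fetched at most once
        for (field, &dv) in fields.iter().zip(has_dv.iter()) {
            if dv {
                if let Some(v) = reader.get_doc_value(field, doc_id) {
                    push_path_components(field, &v, &mut paths);
                }
            } else {
                if !fetched {
                    stored = reader.document(doc_id);
                    fetched = true;
                }
                if let Some(v) = stored.as_ref().and_then(|d| d.get(field)) {
                    push_path_components(field, v, &mut paths);
                }
            }
        }
        // Counter bumps (phase 2 in the description), simplified here.
        for p in paths {
            *self.counts.entry(p).or_insert(0) += 1;
        }
    }
}

/// Toy reader that counts `document()` calls so the skip can be observed.
pub struct MapReader {
    pub docs: HashMap<u64, Document>,
    pub dv_fields: Vec<String>,
    pub document_calls: Cell<usize>,
}

impl Reader for MapReader {
    fn has_doc_values(&self, field: &str) -> bool {
        self.dv_fields.iter().any(|f| f.as_str() == field)
    }
    fn get_doc_value(&self, field: &str, doc_id: u64) -> Option<DataValue> {
        if !self.has_doc_values(field) {
            return None;
        }
        self.docs.get(&doc_id)?.get(field).cloned()
    }
    fn document(&self, doc_id: u64) -> Option<Document> {
        self.document_calls.set(self.document_calls.get() + 1);
        self.docs.get(&doc_id).cloned()
    }
}
```

In this sketch, a collector whose fields all appear in `dv_fields` never increments `document_calls`, while a field missing from `dv_fields` triggers a single `document()` call per hit.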

Out of scope (follow-up): emitting per-field facet-ord columns at index time + dense Vec<u64> ordinal counters (the issue's larger "proposed" rewrite, which stacks with the columnar DocValues work in #547 / #688).

Tests

New unit tests in facet.rs (deterministic, via a configurable mock reader):

facet_docvalues_counts_match_stored_doc — DocValues-path counts equal the stored-document-path counts for the same corpus (flat + hierarchical).
facet_docvalues_skips_document_fetch — when all facet fields have DocValues, document() is never called (the mock panics if it is).
facet_falls_back_to_document_without_docvalues — fields without DocValues are still counted via the stored document.
facet_mixed_docvalues_and_stored — one field from DocValues, one from the stored document; both counted.
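The testing approach in the list above can be sketched as follows. The mock here is hypothetical and much simpler than whatever facet.rs uses: `MockReader`, its `panic_on_document` flag, the string-only values, and the `count_facet` helper are all illustrative names, chosen only to show how a panicking `document()` proves the DocValues path never touches the stored document.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical configurable mock; values are plain strings for brevity.
pub struct MockReader {
    docs: HashMap<u64, HashMap<String, String>>,
    dv_fields: HashSet<String>,
    panic_on_document: bool,
}

impl MockReader {
    pub fn new(dv_fields: &[&str], panic_on_document: bool) -> Self {
        Self {
            docs: HashMap::new(),
            dv_fields: dv_fields.iter().map(|f| f.to_string()).collect(),
            panic_on_document,
        }
    }

    pub fn add(&mut self, doc_id: u64, field: &str, value: &str) {
        self.docs
            .entry(doc_id)
            .or_default()
            .insert(field.to_string(), value.to_string());
    }

    pub fn has_doc_values(&self, field: &str) -> bool {
        self.dv_fields.contains(field)
    }

    pub fn get_doc_value(&self, field: &str, doc_id: u64) -> Option<&String> {
        if !self.has_doc_values(field) {
            return None;
        }
        self.docs.get(&doc_id)?.get(field)
    }

    pub fn document(&self, doc_id: u64) -> Option<&HashMap<String, String>> {
        // Fails the test loudly if the stored-document path is taken.
        assert!(!self.panic_on_document, "document() called on the DocValues path");
        self.docs.get(&doc_id)
    }
}

/// Counts one facet field over `doc_ids`, preferring DocValues.
pub fn count_facet(reader: &MockReader, field: &str, doc_ids: &[u64]) -> HashMap<String, u64> {
    let use_dv = reader.has_doc_values(field); // resolved once, not per hit
    let mut counts = HashMap::new();
    for &id in doc_ids {
        let value = if use_dv {
            reader.get_doc_value(field, id)
        } else {
            reader.document(id).and_then(|d| d.get(field))
        };
        if let Some(v) = value {
            *counts.entry(v.clone()).or_insert(0) += 1;
        }
    }
    counts
}
```

Running `count_facet` over the same corpus with one mock configured for DocValues (and set to panic on `document()`) and one configured without DocValues covers both the equality check and the skip check in a single comparison.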

Verification

cargo clippy -p laurus --all-targets -- -D warnings — clean
cargo fmt --check — clean
cargo test -p laurus --lib — 1108 passed (+4)
markdownlint-cli2 (en/ja faceting.md) — 0 errors

Benchmark (`facet_bench`)

The bench mock now exposes DocValues and each document carries non-facet payload fields (title/body), modelling the asymmetry between a whole-document decode and a single-field DocValues read. Measured with git stash isolating facet.rs (before/after):

case	1k	10k	100k
flat_single	−42.5%	−29.2%	−17.7%
multi_field	−27.1%	−35.7%	−13.2%
hierarchical	−19.7%	−31.9%	−25.2%

All statistically significant (p < 0.05). The in-memory mock only models the Document clone avoidance; in production document() also decodes the variable-width rkyv blob (every stored field, not just facet fields) plus I/O, so the real-world win is expected to be larger.

Docs: added a "Performance" section to laurus/faceting.md (en/ja).

Closes #597

…ment (#597)

FacetCollector::collect_doc fetched reader.document(doc_id) per collected hit, decoding and cloning the entire stored-fields blob just to read the facet field values.

DocValues already hold those values (the writer stores every field into DocValues, the reader exposes get_doc_value/has_doc_values, and FieldValue is a type alias for DataValue), so the collector now reads facet values directly from DocValues and skips the whole-document decode entirely when every facet field has a DocValues column. Fields without DocValues fall back to the stored document, so counts are unchanged. Per-field DocValues availability is resolved once (cached on the collector) rather than re-probed per hit.

facet_bench (fat documents) shows -13% to -42% across all flat/multi/hierarchical sizes.

Closes #597

mosuka merged commit fcb538d into main Jun 3, 2026
22 checks passed

mosuka deleted the perf/597-facet-docvalues branch June 3, 2026 14:29

mosuka merged 1 commit into main from perf/597-facet-docvalues
