perf(lexical/search): read facets from DocValues, not the stored document (#597) #777
Merged
…ment (#597)

FacetCollector::collect_doc fetched reader.document(doc_id) per collected hit, decoding and cloning the entire stored-fields blob just to read the facet field values. DocValues already hold those values (the writer stores every field into DocValues, the reader exposes get_doc_value/has_doc_values, and FieldValue is a type alias for DataValue), so the collector now reads facet values directly from DocValues and skips the whole-document decode entirely when every facet field has a DocValues column. Fields without DocValues fall back to the stored document, so counts are unchanged.

Per-field DocValues availability is resolved once (cached on the collector) rather than re-probed per hit.

facet_bench (fat documents) shows -13% to -42% across all flat/multi/hierarchical sizes.

Closes #597
Summary

`FacetCollector::collect_doc` (`facet.rs`) fetched `reader.document(doc_id)` for every collected hit, decoding the variable-width rkyv stored-fields blob and cloning a `Document` just to read the facet field values via `document.get(field)`.

The right data structure for facet aggregation, DocValues, already exists end to end:

- the writer stores every field into DocValues (`InvertedIndexWriter::upsert_analyzed_document`),
- the reader exposes `get_doc_value(field, doc_id)` / `has_doc_values(field)` (both overridden by the segment and multi-segment readers),
- `FieldValue` is a type alias, `pub type FieldValue = crate::data::DataValue;`, so `get_doc_value` returns exactly the `DataValue` the collector already matches on.

Since DocValues and the stored document derive from the same `stored_fields`, facet counts are identical.

Change
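As a rough illustration of the read path this section describes (DocValues first, stored document only as a fallback, availability resolved once), here is a minimal sketch. It is not the code in `facet.rs`: the types below are simplified stand-ins, only the names `DataValue`, `FieldValue`, `document`, `get_doc_value`, `has_doc_values`, `collect_doc` and `field_has_dv` come from the PR text, and their real signatures may differ. Resolving availability in the constructor is an assumption made to keep the sketch short.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the crate's types (assumed shapes, not the real ones).
#[derive(Clone, Debug, PartialEq)]
pub enum DataValue {
    Text(String),
    Int(i64),
}
pub type FieldValue = DataValue;
pub type Document = HashMap<String, DataValue>;

pub trait Reader {
    /// Whole-document fetch: the expensive path the change tries to avoid.
    fn document(&self, doc_id: u64) -> Option<Document>;
    /// Readers that do not implement DocValues keep these defaults.
    fn has_doc_values(&self, _field: &str) -> bool {
        false
    }
    fn get_doc_value(&self, _field: &str, _doc_id: u64) -> Option<FieldValue> {
        None
    }
}

pub struct FacetCollector {
    fields: Vec<String>,
    // One flag per facet field, probed once instead of once per hit.
    field_has_dv: Vec<bool>,
    pub counts: HashMap<(String, String), u64>,
}

impl FacetCollector {
    pub fn new(fields: &[&str], reader: &dyn Reader) -> Self {
        let fields: Vec<String> = fields.iter().map(|f| f.to_string()).collect();
        let field_has_dv = fields.iter().map(|f| reader.has_doc_values(f)).collect();
        Self { fields, field_has_dv, counts: HashMap::new() }
    }

    pub fn collect_doc(&mut self, reader: &dyn Reader, doc_id: u64) {
        // The stored document is fetched lazily, at most once per hit,
        // and only if at least one facet field lacks DocValues.
        let mut stored: Option<Option<Document>> = None;
        for (field, &has_dv) in self.fields.iter().zip(&self.field_has_dv) {
            let value = if has_dv {
                reader.get_doc_value(field, doc_id)
            } else {
                stored
                    .get_or_insert_with(|| reader.document(doc_id))
                    .as_ref()
                    .and_then(|doc| doc.get(field).cloned())
            };
            // Only text values are counted in this sketch.
            if let Some(DataValue::Text(v)) = value {
                *self.counts.entry((field.clone(), v)).or_insert(0) += 1;
            }
        }
    }
}
```

When every facet field has a DocValues column, `stored` stays `None` for the whole hit and `document()` is never invoked, which is the case the change optimises for.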
- `collect_doc` now reads each facet field from DocValues and skips `reader.document()` entirely when every facet field has a DocValues column (the common case). A field that lacks DocValues transparently falls back to the stored document (the synthetic error fallback is preserved on that path), so behaviour and counts are unchanged; readers that don't implement DocValues (`has_doc_values` defaults to `false`) keep the exact pre-existing path.
- … `push_path_components` helper used by both paths.
- Per-field DocValues availability is resolved once (cached on the collector, `field_has_dv`) rather than re-probing the lock-guarded reader per hit. Phase 2 (interning + counter bumps) is unchanged.

Out of scope (follow-up): emitting per-field facet-ord columns at index time + dense `Vec<u64>` ordinal counters (the issue's larger "proposed" rewrite, which stacks with the columnar DocValues work in #547 / #688).

Tests
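To illustrate how a configurable mock reader can make the "document fetch is skipped" check deterministic (the idea behind the unit tests listed in this section), here is a minimal sketch. `MockReader`, its fields, and `facet_counts` are hypothetical names invented for this example, and the reader trait is a simplified stand-in using `String` values rather than the crate's real types.

```rust
use std::collections::{HashMap, HashSet};

// Simplified, assumed reader interface; the real trait may look different.
pub trait Reader {
    fn document(&self, doc_id: usize) -> Option<HashMap<String, String>>;
    fn has_doc_values(&self, field: &str) -> bool;
    fn get_doc_value(&self, field: &str, doc_id: usize) -> Option<String>;
}

/// Hypothetical mock whose DocValues coverage is configurable per field.
pub struct MockReader {
    pub docs: Vec<HashMap<String, String>>,
    pub dv_fields: HashSet<String>,
    /// When true, any whole-document fetch fails the test immediately.
    pub panic_on_document: bool,
}

impl Reader for MockReader {
    fn document(&self, doc_id: usize) -> Option<HashMap<String, String>> {
        assert!(
            !self.panic_on_document,
            "document() called although DocValues were expected to suffice"
        );
        self.docs.get(doc_id).cloned()
    }

    fn has_doc_values(&self, field: &str) -> bool {
        self.dv_fields.contains(field)
    }

    fn get_doc_value(&self, field: &str, doc_id: usize) -> Option<String> {
        if !self.has_doc_values(field) {
            return None;
        }
        self.docs.get(doc_id)?.get(field).cloned()
    }
}

/// Stand-in for the collector: DocValues first, stored document as fallback.
pub fn facet_counts(reader: &dyn Reader, field: &str, num_docs: usize) -> HashMap<String, u64> {
    let has_dv = reader.has_doc_values(field);
    let mut counts = HashMap::new();
    for doc_id in 0..num_docs {
        let value = if has_dv {
            reader.get_doc_value(field, doc_id)
        } else {
            reader.document(doc_id).and_then(|d| d.get(field).cloned())
        };
        if let Some(v) = value {
            *counts.entry(v).or_insert(0) += 1;
        }
    }
    counts
}
```

Running `facet_counts` against one mock with `panic_on_document: true` and full DocValues coverage, and against another with no DocValues, then comparing the two maps, covers both the "skips document fetch" and the "counts match" properties without any I/O.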
New unit tests in `facet.rs` (deterministic, via a configurable mock reader):

- `facet_docvalues_counts_match_stored_doc`: DocValues-path counts equal the stored-document-path counts for the same corpus (flat + hierarchical).
- `facet_docvalues_skips_document_fetch`: when all facet fields have DocValues, `document()` is never called (the mock panics if it is).
- `facet_falls_back_to_document_without_docvalues`: fields without DocValues are still counted via the stored document.
- `facet_mixed_docvalues_and_stored`: one field from DocValues, one from the stored document; both counted.

Verification
- `cargo clippy -p laurus --all-targets -- -D warnings`: clean
- `cargo fmt --check`: clean
- `cargo test -p laurus --lib`: 1108 passed (+4)
- `markdownlint-cli2` (en/ja `faceting.md`): 0 errors

Benchmark (`facet_bench`)

The bench mock now exposes DocValues and each document carries non-facet payload fields (title/body), modelling the asymmetry between a whole-document decode and a single-field DocValues read. Measured with
`git stash` isolating `facet.rs` (before/after), `facet_bench` (fat documents) shows -13% to -42% across all flat/multi/hierarchical sizes.

All results are statistically significant (p < 0.05). The in-memory mock only models the `Document` clone avoidance; in production `document()` also decodes the variable-width rkyv blob (every stored field, not just facet fields) plus I/O, so the real-world win is expected to be larger.

Docs: added a "Performance" section to `laurus/faceting.md` (en/ja).

Closes #597