perf(lexical/search): read facets from DocValues, not the stored document (#597) #777

Merged
mosuka merged 1 commit into main from perf/597-facet-docvalues on Jun 3, 2026

@mosuka mosuka commented Jun 3, 2026

Summary

FacetCollector::collect_doc (facet.rs) fetched reader.document(doc_id) for every collected hit, decoding the variable-width rkyv stored-fields blob and cloning a Document just to read the facet field values via document.get(field).

The right data structure for facet aggregation — DocValues — already exists end to end:

  • the writer stores every field into DocValues (InvertedIndexWriter::upsert_analyzed_document),
  • the reader exposes get_doc_value(field, doc_id) / has_doc_values(field) (both overridden by the segment and multi-segment readers),
  • and FieldValue is a type alias for DataValue (pub type FieldValue = crate::data::DataValue;), so get_doc_value returns exactly the DataValue the collector already matches on.

Since DocValues and the stored document derive from the same stored_fields, facet counts are identical.
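
As a rough illustration of the reader surface described above, here is a minimal sketch using simplified stand-ins. Only the method names document, has_doc_values and get_doc_value come from this description; the trait name Reader, the one-variant DataValue, the HashMap-based Document, both reader structs and the facet_value helper are assumptions made for the example, not the actual laurus types.

```rust
use std::collections::HashMap;

// Simplified stand-in (assumption): the real DataValue has more variants.
#[derive(Debug, Clone, PartialEq)]
enum DataValue {
    Text(String),
}

// Simplified stand-in (assumption): a stored document as field name -> value.
type Document = HashMap<String, DataValue>;

// Hypothetical reader trait exposing the three calls named in the summary.
trait Reader {
    // Returns a clone of the whole stored document (the expensive path).
    fn document(&self, doc_id: u64) -> Option<Document>;

    // Readers that do not implement DocValues keep these defaults.
    fn has_doc_values(&self, _field: &str) -> bool {
        false
    }
    fn get_doc_value(&self, _field: &str, _doc_id: u64) -> Option<DataValue> {
        None
    }
}

// Reader with stored documents only; it inherits the DocValues defaults.
struct StoredOnlyReader {
    docs: Vec<Document>,
}

impl Reader for StoredOnlyReader {
    fn document(&self, doc_id: u64) -> Option<Document> {
        self.docs.get(doc_id as usize).cloned()
    }
}

// Reader that also keeps one column per field, built from the same documents.
struct ColumnarReader {
    docs: Vec<Document>,
    columns: HashMap<String, Vec<Option<DataValue>>>,
}

impl ColumnarReader {
    fn new(docs: Vec<Document>) -> Self {
        let mut columns: HashMap<String, Vec<Option<DataValue>>> = HashMap::new();
        for (i, doc) in docs.iter().enumerate() {
            for (field, value) in doc {
                let col = columns
                    .entry(field.clone())
                    .or_insert_with(|| vec![None; docs.len()]);
                col[i] = Some(value.clone());
            }
        }
        Self { docs, columns }
    }
}

impl Reader for ColumnarReader {
    fn document(&self, doc_id: u64) -> Option<Document> {
        self.docs.get(doc_id as usize).cloned()
    }
    fn has_doc_values(&self, field: &str) -> bool {
        self.columns.contains_key(field)
    }
    fn get_doc_value(&self, field: &str, doc_id: u64) -> Option<DataValue> {
        self.columns.get(field)?.get(doc_id as usize)?.clone()
    }
}

// Reads one field, preferring the single-field DocValues lookup.
fn facet_value(reader: &dyn Reader, field: &str, doc_id: u64) -> Option<DataValue> {
    if reader.has_doc_values(field) {
        reader.get_doc_value(field, doc_id)
    } else {
        let doc = reader.document(doc_id)?;
        doc.get(field).cloned()
    }
}
```

Because both readers are built from the same documents in this sketch, facet_value returns the same value on either path, which mirrors the argument that the counts cannot change.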

Change

  • collect_doc now reads each facet field from DocValues and skips reader.document() entirely when every facet field has a DocValues column (the common case). A field that lacks DocValues transparently falls back to the stored document (the synthetic error fallback is preserved on that path), so behaviour and counts are unchanged; readers that don't implement DocValues (has_doc_values defaults to false) keep the exact pre-existing path.
  • The value → facet-path extraction is factored into a shared push_path_components helper used by both paths.
  • Per-field DocValues availability is doc-independent, so it is resolved once and cached on the collector (field_has_dv) rather than re-probing the lock-guarded reader per hit. Phase 2 (interning + counter bumps) is unchanged.
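
The three points above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in facet.rs: the names collect_doc, push_path_components and field_has_dv come from this description, while the Reader trait shape, the one-variant DataValue, the '/'-separated path convention, the counts map keyed by (field, path), and the MockReader with its document_calls counter are all hypothetical. The synthetic error fallback mentioned above is not modelled.

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Simplified stand-ins (assumptions), not the real laurus types.
#[derive(Debug, Clone, PartialEq)]
enum DataValue {
    Text(String),
}
type Document = HashMap<String, DataValue>;

trait Reader {
    fn document(&self, doc_id: u64) -> Option<Document>;
    fn has_doc_values(&self, _field: &str) -> bool {
        false
    }
    fn get_doc_value(&self, _field: &str, _doc_id: u64) -> Option<DataValue> {
        None
    }
}

// Shared by both paths. Assumption: hierarchical facets are '/'-separated text.
fn push_path_components(value: &DataValue, out: &mut Vec<String>) {
    match value {
        DataValue::Text(s) => {
            out.extend(s.split('/').filter(|p| !p.is_empty()).map(str::to_string))
        }
    }
}

struct FacetCollector {
    fields: Vec<String>,
    // Resolved on the first hit, then reused for every later hit.
    field_has_dv: Option<Vec<bool>>,
    counts: HashMap<(String, Vec<String>), u64>,
}

impl FacetCollector {
    fn new(fields: &[&str]) -> Self {
        Self {
            fields: fields.iter().map(|f| f.to_string()).collect(),
            field_has_dv: None,
            counts: HashMap::new(),
        }
    }

    fn collect_doc(&mut self, reader: &dyn Reader, doc_id: u64) {
        if self.field_has_dv.is_none() {
            let probed: Vec<bool> = self
                .fields
                .iter()
                .map(|f| reader.has_doc_values(f))
                .collect();
            self.field_has_dv = Some(probed);
        }
        let has_dv = self.field_has_dv.as_deref().unwrap_or(&[]);
        // The stored document is fetched lazily, at most once per hit,
        // and only if some facet field lacks a DocValues column.
        let mut stored: Option<Option<Document>> = None;
        for (field, &dv) in self.fields.iter().zip(has_dv) {
            let value = if dv {
                reader.get_doc_value(field, doc_id)
            } else {
                let doc = stored.get_or_insert_with(|| reader.document(doc_id));
                doc.as_ref().and_then(|d| d.get(field).cloned())
            };
            if let Some(value) = value {
                let mut path = Vec::new();
                push_path_components(&value, &mut path);
                *self.counts.entry((field.clone(), path)).or_insert(0) += 1;
            }
        }
    }
}

// Hypothetical mock reader that counts whole-document fetches.
struct MockReader {
    docs: Vec<Document>,
    dv_fields: Vec<&'static str>,
    document_calls: Cell<usize>,
}

impl Reader for MockReader {
    fn document(&self, doc_id: u64) -> Option<Document> {
        self.document_calls.set(self.document_calls.get() + 1);
        self.docs.get(doc_id as usize).cloned()
    }
    fn has_doc_values(&self, field: &str) -> bool {
        self.dv_fields.iter().any(|f| *f == field)
    }
    fn get_doc_value(&self, field: &str, doc_id: u64) -> Option<DataValue> {
        self.docs.get(doc_id as usize)?.get(field).cloned()
    }
}
```

With every facet field listed in dv_fields, document_calls stays at zero; dropping one field from dv_fields triggers exactly one document() call per hit while the resulting counts stay the same.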

Out of scope (follow-up): emitting per-field facet-ord columns at index time + dense Vec<u64> ordinal counters (the issue's larger "proposed" rewrite, which stacks with the columnar DocValues work in #547 / #688).
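
For orientation only, the follow-up idea of facet-ord columns with dense Vec<u64> counters could look something like the sketch below. Nothing here is part of this PR or of laurus: OrdDictionary, build_ord_column and count_hits are hypothetical names, and the single-valued, flat-facet layout is an assumption chosen to keep the example short.

```rust
use std::collections::HashMap;

// Hypothetical per-field dictionary assigning a dense ordinal to each
// distinct facet value, in first-seen order.
#[derive(Default)]
struct OrdDictionary {
    ords: HashMap<String, u32>,
    values: Vec<String>,
}

impl OrdDictionary {
    fn intern(&mut self, value: &str) -> u32 {
        if let Some(&ord) = self.ords.get(value) {
            return ord;
        }
        let ord = self.values.len() as u32;
        self.ords.insert(value.to_string(), ord);
        self.values.push(value.to_string());
        ord
    }
}

// Index time: one ordinal per document (single-valued field assumed).
fn build_ord_column(dict: &mut OrdDictionary, doc_values: &[&str]) -> Vec<u32> {
    doc_values.iter().map(|v| dict.intern(v)).collect()
}

// Query time: counting is an array increment per hit, with no string
// hashing or interning on the hot path.
fn count_hits(column: &[u32], num_ords: usize, hits: &[usize]) -> Vec<u64> {
    let mut counts = vec![0u64; num_ords];
    for &doc_id in hits {
        counts[column[doc_id] as usize] += 1;
    }
    counts
}
```

The counts vector is indexed by ordinal, so labels are resolved through the dictionary's values only when the final facet results are rendered.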

Tests

New unit tests in facet.rs (deterministic, via a configurable mock reader):

  • facet_docvalues_counts_match_stored_doc — DocValues-path counts equal the stored-document-path counts for the same corpus (flat + hierarchical).
  • facet_docvalues_skips_document_fetch — when all facet fields have DocValues, document() is never called (the mock panics if it is).
  • facet_falls_back_to_document_without_docvalues — fields without DocValues are still counted via the stored document.
  • facet_mixed_docvalues_and_stored — one field from DocValues, one from the stored document; both counted.

Verification

  • cargo clippy -p laurus --all-targets -- -D warnings — clean
  • cargo fmt --check — clean
  • cargo test -p laurus --lib: 1108 passed (+4)
  • markdownlint-cli2 (en/ja faceting.md) — 0 errors

Benchmark (facet_bench)

The bench mock now exposes DocValues and each document carries non-facet payload fields (title/body), modelling the asymmetry between a whole-document decode and a single-field DocValues read. Measured with git stash isolating facet.rs (before/after):

| case         | 1k     | 10k    | 100k   |
|--------------|--------|--------|--------|
| flat_single  | −42.5% | −29.2% | −17.7% |
| multi_field  | −27.1% | −35.7% | −13.2% |
| hierarchical | −19.7% | −31.9% | −25.2% |

All statistically significant (p < 0.05). The in-memory mock only models the Document clone avoidance; in production document() also decodes the variable-width rkyv blob (every stored field, not just facet fields) plus I/O, so the real-world win is expected to be larger.

Docs: added a "Performance" section to laurus/faceting.md (en/ja).

Closes #597

…ment (#597)

FacetCollector::collect_doc fetched reader.document(doc_id) per collected
hit, decoding and cloning the entire stored-fields blob just to read the
facet field values. DocValues already hold those values (the writer stores
every field into DocValues, the reader exposes get_doc_value/has_doc_values,
and FieldValue is a type alias for DataValue), so the collector now reads
facet values directly from DocValues and skips the whole-document decode
entirely when every facet field has a DocValues column. Fields without
DocValues fall back to the stored document, so counts are unchanged.

Per-field DocValues availability is resolved once (cached on the collector)
rather than re-probed per hit. facet_bench (fat documents) shows -13% to
-42% across all flat/multi/hierarchical sizes.

Closes #597
@mosuka mosuka merged commit fcb538d into main Jun 3, 2026
22 checks passed
@mosuka mosuka deleted the perf/597-facet-docvalues branch June 3, 2026 14:29

Linked issue

perf(lexical/search): faceting reads full stored document per facet collection — even when the field has docvalues