perf(index): fast-path total rows and cache index_statistics output #6582

Open
justinrmiller wants to merge 2 commits into lance-format:main from justinrmiller:perf/index-stats-fast-path

Conversation

@justinrmiller
Contributor

Summary

Two optimizations to Dataset::index_statistics that together turn a ~5 ms call into a ~22 µs one for the common repeat-call pattern on a 10 M-row / 10 K-fragment dataset, with an exact-behavior fallback for legacy datasets.

P1 — skip count_rows(None) when the manifest can answer

gather_fragment_statistics was calling Dataset::count_rows(None) purely to derive num_unindexed_rows = total - indexed. On modern datasets that's an in-memory fan-out; on legacy fragments without physical_rows / writer_version it opens the first data file of every fragment from object storage. A new manifest_total_rows helper sums Fragment::num_rows() in memory and falls through to the existing count_rows path only when any fragment is missing that metadata.
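The shape of the helper can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the Fragment struct here is a minimal mock, and the real helper operates on the manifest's fragment list.

```rust
// Illustrative sketch of the P1 fast path. `Fragment` here is a mock
// with just the field that matters; the real type lives in the lance
// crate and `num_rows()` reads the manifest's physical_rows metadata.
struct Fragment {
    physical_rows: Option<usize>,
}

impl Fragment {
    fn num_rows(&self) -> Option<usize> {
        self.physical_rows
    }
}

/// Sum per-fragment row counts entirely in memory. Returns None as
/// soon as any fragment lacks the metadata, so the caller can fall
/// through to the existing count_rows(None) path.
fn manifest_total_rows(fragments: &[Fragment]) -> Option<usize> {
    // Iterator::sum over Option short-circuits on the first None.
    fragments.iter().map(|f| f.num_rows()).sum()
}

fn main() {
    let modern = vec![
        Fragment { physical_rows: Some(100) },
        Fragment { physical_rows: Some(42) },
    ];
    assert_eq!(manifest_total_rows(&modern), Some(142));

    // One legacy fragment without metadata forces the fallback.
    let legacy = vec![
        Fragment { physical_rows: Some(100) },
        Fragment { physical_rows: None },
    ];
    assert_eq!(manifest_total_rows(&legacy), None);
}
```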

P2 — cache the JSON keyed on manifest version

index_statistics is a pure function of (dataset URI, manifest version, index name). The DSIndexCache already scopes entries by dataset URI; manifest version bumps on every operation that can change the answer (append, delete, compact, index create/optimize/drop), so keying on it gives automatic invalidation with no write-path coupling. New IndexStatisticsKey alongside the existing IndexMetadataKey / ScalarIndexDetailsKey; index_statistics wraps the existing dispatch with a get/insert.

Errors don't poison the cache. No manifest format change, no public API change.
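The caching scheme can be sketched like this. A plain HashMap stands in for the moka cache inside DSIndexCache, and the key/field names are illustrative, but the invariants match the description above: the key includes the manifest version, and errors never populate the cache.

```rust
use std::collections::HashMap;

// Illustrative key: in the real code the cache is already scoped by
// dataset URI, so (manifest_version, index_name) suffices here.
#[derive(Hash, PartialEq, Eq, Clone)]
struct IndexStatisticsKey {
    manifest_version: u64,
    index_name: String,
}

struct StatsCache {
    entries: HashMap<IndexStatisticsKey, String>, // cached JSON output
}

impl StatsCache {
    fn get_or_compute(
        &mut self,
        key: IndexStatisticsKey,
        compute: impl FnOnce() -> Result<String, String>,
    ) -> Result<String, String> {
        if let Some(json) = self.entries.get(&key) {
            return Ok(json.clone()); // hit: lookup + string clone
        }
        let json = compute()?; // errors propagate, cache stays empty
        self.entries.insert(key, json.clone());
        Ok(json)
    }
}

fn main() {
    let mut cache = StatsCache { entries: HashMap::new() };
    let key = IndexStatisticsKey { manifest_version: 7, index_name: "btree_idx".into() };

    let mut computes = 0;
    for _ in 0..2 {
        let json = cache
            .get_or_compute(key.clone(), || { computes += 1; Ok("{\"num_indices\":1}".into()) })
            .unwrap();
        assert_eq!(json, "{\"num_indices\":1}");
    }
    assert_eq!(computes, 1); // second call was a pure cache hit

    // Any manifest bump is a different key, so the answer is recomputed.
    let bumped = IndexStatisticsKey { manifest_version: 8, ..key };
    cache.get_or_compute(bumped, || { computes += 1; Ok("{}".into()) }).unwrap();
    assert_eq!(computes, 2);
}
```

Because the manifest version bumps on every mutating operation, no write path ever needs to invalidate entries explicitly; stale keys simply stop being asked for and age out of the bounded cache.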

Benchmarks

A new benchmark, benches/index_stats.rs, runs four paths in a single process on a shared fixture:

  • count_rows_baseline — cost of the count_rows(None) fan-out alone.
  • legacy_cold — pre-P1/P2 behavior, exercised via #[doc(hidden)] bench_legacy_index_statistics.
  • cold — current behavior, fresh session per iter.
  • cached — current behavior, warm cache.

A startup parity check asserts the legacy and current paths produce identical counters (num_indexed_rows, num_unindexed_rows, num_indexed_fragments, num_unindexed_fragments, num_indices) before measurement begins.
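The parity check amounts to an equality assertion over the five counters; a minimal sketch (field and type names are stand-ins, not the crate's actual statistics type):

```rust
// Stand-in for the counters compared by the benchmark's startup
// parity check; the real values are parsed from the JSON output of
// the legacy and current index_statistics paths.
#[derive(Debug, PartialEq)]
struct StatsCounters {
    num_indexed_rows: u64,
    num_unindexed_rows: u64,
    num_indexed_fragments: u64,
    num_unindexed_fragments: u64,
    num_indices: u64,
}

fn assert_parity(legacy: &StatsCounters, current: &StatsCounters) {
    assert_eq!(legacy, current, "legacy and current index_statistics diverged");
}

fn main() {
    let a = StatsCounters {
        num_indexed_rows: 9_000_000,
        num_unindexed_rows: 1_000_000,
        num_indexed_fragments: 9_000,
        num_unindexed_fragments: 1_000,
        num_indices: 1,
    };
    let b = StatsCounters { ..a };
    assert_parity(&a, &b);
}
```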

BTree index on Int32, local SSD, Apple Silicon, --sample-size 10 --measurement-time 2s.

Scaling across fixture size

  Fixture                    count_rows_baseline   legacy_cold   cold (P1)        cached (P1 + P2)
  256 K rows / 256 frags     58 µs                 538 µs        466 µs           1.4 µs
  1 M rows / 1 024 frags     190 µs                834 µs        643 µs           2.9 µs
  10 M rows / 10 000 frags   1.82 ms               5.33 ms       3.56 ms (−33%)   21.7 µs (~245×)

Observations:

  • count_rows cost scales linearly with fragment count (~180–200 ns/fragment in-memory). On legacy-format fragments or slow object stores it grows by orders of magnitude — each fragment may require a HEAD plus a range read to recover physical_rows.
  • P1's absolute savings scale with fragment count (72 µs → 191 µs → 1 770 µs). The cold path converges toward legacy_cold − count_rows_baseline, which is the theoretical floor.
  • P2's cache-hit cost is roughly constant in fixture size (dominated by the moka lookup + string clone, not JSON size).
  • Together, P1 + P2 convert a 5 ms call into a 22 µs call at 10 K fragments.

Test plan

  • cargo test -p lance --lib -- index::tests:: — 68 passing, including 5 new:
    • test_index_statistics_row_counts_match_count_rows (multi-fragment + partial index + deletes + append; asserts indexed + unindexed == count_rows(None))
    • test_manifest_total_rows_missing_metadata_returns_none (forces the count_rows fallback)
    • test_index_statistics_cache_hit_avoids_io (second call does zero reads per io_tracker)
    • test_index_statistics_cache_invalidates_on_manifest_bump (append + optimize_indices both return fresh numbers)
    • test_index_statistics_cache_distinguishes_index_names (two indices on the same dataset don't collide)
  • cargo clippy --all --tests --benches -- -D warnings
  • cargo fmt --all -- --check
  • cargo bench -p lance --bench index_stats — parity check passes, numbers above.

Not in this change

Left for follow-up, in rough priority order:

  • Single-pass indexed_fragments (currently O(deltas × fragments) with Fragment clones). Cheap rewrite, ~5–10× on many-delta indices.
  • load_statistics for BTree / Inverted / BloomFilter / LabelList — only Bitmap has it today; others fall back to open_generic_index, which materializes the full index file. 20–50× potential on cold-cache calls, medium-effort per-plugin work.
  • Persisted num_indexed_rows on IndexMetadata — manifest format change; probably unnecessary after P1 + P2.
  • Related production signal: ENT-547 and #4620 ("index_statistics on LABEL_LIST index is very slow") flagged slow index_stats; #4483 ("test(java): re-construct tests for merge operation") added a Phalanx-side timeout as mitigation.

`Dataset::index_statistics` called `count_rows(None)` on every
invocation purely to compute `num_unindexed_rows = total - indexed`.
On large or legacy datasets `count_rows` fans out per-fragment and
can open fragment data files, making it the dominant cost. In
addition, repeat calls at the same manifest version recomputed the
full JSON from scratch even though the answer is deterministic.

P1: introduce `manifest_total_rows` which sums `Fragment::num_rows()`
from the manifest. If every fragment has `physical_rows`, the total
is resolved in memory. If any is missing, fall through to the
existing `count_rows(None)` path — no correctness regression for
legacy datasets.

P2: wrap `index_statistics` with a `DSIndexCache` get/insert keyed
on `(manifest_version, index_name)`. The cache already scopes entries
by dataset URI; manifest version is monotonic and bumps on every
operation that can change the result (append, delete, compact,
index create/optimize/drop), so invalidation is automatic — no write
path ever needs to touch the cache.

Both changes are additive. Failed calls don't populate the cache.
No public API or manifest format change.

Benchmarks (10M rows, 10K fragments, local SSD, Apple Silicon):

  count_rows_baseline   1.82 ms
  legacy_cold           5.33 ms   (pre-P1/P2)
  cold                  3.56 ms   (P1 only;  -33% / -1.77 ms)
  cached                21.7 µs   (P1 + P2;  ~245x vs legacy_cold)

P1 savings scale linearly with fragment count (~200 ns/fragment
in-memory; orders of magnitude more on legacy formats or slow
object stores where `count_rows` reads data files). P2 collapses
repeat calls to a moka lookup plus a string clone.

Tests added:
  - test_index_statistics_row_counts_match_count_rows
  - test_manifest_total_rows_missing_metadata_returns_none
  - test_index_statistics_cache_hit_avoids_io
  - test_index_statistics_cache_invalidates_on_manifest_bump
  - test_index_statistics_cache_distinguishes_index_names

Benchmark added:
  - benches/index_stats.rs (legacy vs cold vs cached, parity check)

@claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 81.60000% with 46 lines in your changes missing coverage. Please review.

  Files with missing lines   Patch %   Lines
  rust/lance/src/index.rs    81.14%    45 Missing and 1 partial ⚠️

