perf: new layout for positions and new algo for phrase query #6203
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
PR Review: varint for doc/freq

This PR introduces varint-delta encoding for posting list tails and a new shared position stream layout (Lucene-style packed deltas), replacing the fixed-size encoding. Good improvements for space efficiency, with solid backward-compat handling via codec metadata detection.

Issues:
- P1: Compat test no longer covers the old FTS index format.
- P1: Duplicated metadata construction in
  fn build_index_metadata(&self, partitions: &[u64]) -> Result<HashMap<String, String>> { ... }
- P1: Duplicate varint encoder functions in
- Minor:

Overall, the encoding/decoding logic looks correct, and the backward-compat codec detection is handled well.
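For readers unfamiliar with the technique the review refers to, varint-delta encoding of a sorted posting-list tail can be sketched as below. This is a minimal LEB128-style illustration under assumptions; the function names (write_varint, encode_doc_id_tail) are hypothetical and not the PR's actual API.

```rust
/// Append `value` to `buf` as a little-endian base-128 varint:
/// 7 payload bits per byte, high bit set on every byte but the last.
fn write_varint(buf: &mut Vec<u8>, mut value: u32) {
    while value >= 0x80 {
        buf.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    buf.push(value as u8);
}

/// Read one varint starting at `pos`; returns (value, next position).
fn read_varint(buf: &[u8], mut pos: usize) -> (u32, usize) {
    let mut value = 0u32;
    let mut shift = 0;
    loop {
        let byte = buf[pos];
        pos += 1;
        value |= ((byte & 0x7F) as u32) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (value, pos)
}

/// Delta-encode a sorted doc-id tail: store gaps instead of absolute
/// ids, so most values fit in a single varint byte.
fn encode_doc_id_tail(doc_ids: &[u32]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        write_varint(&mut buf, id - prev);
        prev = id;
    }
    buf
}
```

The space win comes from the delta step: for dense posting lists the gaps are small, so most tail entries shrink from a fixed 4 bytes to 1 byte.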
let compressed = list.value(self.idx);
let positions = decompress_positions(compressed.as_binary());
Box::new(positions.into_iter()) as Box<dyn Iterator<Item = u32>>
I'm not sure if I understand this correctly. It looks like we have cached the entire POSITION_COL in LegacyPerDoc:
PositionsLayout::LegacyPerDoc => {
let batch = self
.reader
.read_range(self.posting_list_range(token_id), Some(&[POSITION_COL]))
.await
.map_err(|e| match e {
Error::Schema { .. } => Error::invalid_input("position is not found but required for phrase queries, try recreating the index with position".to_owned()),
e => e,
})?;
CompressedPositionStorage::LegacyPerDoc(batch[POSITION_COL].as_list::<i32>().clone())

Do we still need to do this?
It caches only the rows that were hit, not the entire column.
These lines read the cached rows and decompress them.
Xuanwo left a comment:
I'm good with this PR once all CI passes.
Hold it because of a compatibility issue.
Looks like the #[serde(rename = "memory_limit", skip_serializing, ...)] memory limit is still not being respected and falls back to 2 GiB.
Skipping serialization of this field is expected, because we don't want to add it to the index params; it should be a one-time param used only for building. I just tested this param and it works.
Just leaving for release notes: env var
This stores all positions of a single posting list in a flat array, and applies compression to the positions:
- for positions of a single document, store delta positions
- divide the positions of a single posting list into blocks (128 by default) and apply bitpacking to each block
- store the remaining positions as varints

This also stores the remaining doc ids / frequencies as varints.

These are breaking changes.

On a 1M random dataset:
- build is 9.35x faster (34.269s -> 3.664s)
- index is 8.87x smaller (955,002,412 bytes -> 107,603,319 bytes)
- phrase query avg latency is 3.41x faster (21.631ms -> 6.334ms) and p95 latency is 3.17x faster (63.735ms -> 20.100ms)

The improvement can be more significant on larger datasets.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
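The three compression steps in the commit message (per-document deltas, bit-packed blocks of 128, varint remainder) can be sketched as follows. This is a minimal illustration under assumptions: the names (compress_positions, pack_block) and the exact byte layout are hypothetical, not Lance's actual on-disk format.

```rust
const BLOCK_LEN: usize = 128;

/// Minimal number of bits needed to represent `v` (at least 1).
fn bit_width(v: u32) -> u32 {
    if v == 0 { 1 } else { 32 - v.leading_zeros() }
}

/// Little-endian base-128 varint, used for the trailing deltas.
fn write_varint(buf: &mut Vec<u8>, mut v: u32) {
    while v >= 0x80 {
        buf.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

/// Bit-pack one full block at the block's minimal width; the width
/// is written first so a reader knows how to unpack the block.
fn pack_block(deltas: &[u32], out: &mut Vec<u8>) {
    let width = deltas.iter().map(|&d| bit_width(d)).max().unwrap() as u64;
    out.push(width as u8);
    let (mut acc, mut nbits): (u64, u64) = (0, 0);
    for &d in deltas {
        acc |= (d as u64) << nbits;
        nbits += width;
        while nbits >= 8 {
            out.push(acc as u8); // flush the low byte
            acc >>= 8;
            nbits -= 8;
        }
    }
    if nbits > 0 {
        out.push(acc as u8); // flush any leftover bits
    }
}

/// Compress the flat position array of one posting list:
/// delta-encode within each document, bit-pack full 128-delta
/// blocks, and store the remaining deltas as varints.
fn compress_positions(per_doc_positions: &[Vec<u32>]) -> Vec<u8> {
    // 1. delta-encode within each document, then flatten
    let mut deltas = Vec::new();
    for positions in per_doc_positions {
        let mut prev = 0;
        for &p in positions {
            deltas.push(p - prev);
            prev = p;
        }
    }
    // 2. bit-pack full blocks, varint the remainder
    let mut out = Vec::new();
    let mut chunks = deltas.chunks_exact(BLOCK_LEN);
    for block in &mut chunks {
        pack_block(block, &mut out);
    }
    for &d in chunks.remainder() {
        write_varint(&mut out, d);
    }
    out
}
```

The block size of 128 matches the default the commit message mentions; picking one bit width per block keeps unpacking branch-free while letting dense blocks (small deltas) shrink to one or two bits per position.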