Skip to content

Conversation

@BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Apr 22, 2025

This introduces:

  • compressing the posting lists and positions (4x smaller than before)
  • cache compressed posting lists in memory instead of plain posting lists
  • search over compressed posting lists, and decompress only needed blocks
  • new indexing algo, now it builds the index by partitioning by doc, and finally merge by size
  • the new index format has been changed so this is a breaking change
  • new search algo BlockMaxWAND

This also changes the default indexing params:

  • with_position -> false
  • stem -> true
  • remove_stop_words -> true
  • ascii_folding -> true

this also fix a bug in LanceIndexStore::copy(), which causes error if the src store is not the same as dest store (one on local disk but the other one on cloud)

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR delivers major performance improvements for full‐text search by introducing a new compression algorithm for posting lists and a redesigned indexing strategy based on document partitioning and merging. It also updates the default indexing parameters, replaces the legacy TokenizerConfig with InvertedIndexParams, and fixes a bug in LanceIndexStore::copy() when source and destination stores differ.

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
rust/lance/src/index.rs Reordering and refinement of imports and test setup for inverted index configuration.
rust/lance/src/dataset.rs Updated tests verifying index scanning and improved assertion messages.
rust/lance-index/src/scalar/inverted/wand.rs Refactored posting iterator and WAND-related logic for better performance and clarity.
rust/lance-index/src/scalar/inverted/tokenizer.rs Replaced TokenizerConfig with InvertedIndexParams and adjusted configuration methods.
python/src/dataset.rs, python/python/tests Updated Python API usage and adjusted test expectations according to new defaults.
Other files Refinements in compression, benchmark setup, and container API changes to support the new index format.

@BubbleCal BubbleCal marked this pull request as ready for review May 13, 2025 12:07
BubbleCal added 2 commits May 15, 2025 14:57
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal mentioned this pull request May 16, 2025
@github-actions github-actions bot added the java label May 19, 2025
BubbleCal added 2 commits May 19, 2025 17:59
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal force-pushed the fts-indexing-perf branch from b7588a7 to 5235236 Compare May 20, 2025 05:06
BubbleCal added 7 commits May 20, 2025 13:41
…xing-perf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…date evaluation

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something of a YOLO. I've followed the basic changes I think. Overall it looks like a great addition. I appreciate the additional logging which will make it easier to debug. The compression looks good. Thanks for figuring this out. It looks like quite a bit of optimization, profiling, and debugging work!

len: 0,
limit,
}
pub fn with_capacity_limit(mut self, limit: u16) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user sets the capacity limit to something small after they've added items to the list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i just realized this name confuses, it should be with_max_expand_cap, this effects only the new node cap

self.write_tokens(dest_store).await?;
self.write_docs(dest_store).await?;
async fn write_metadata(&self, dest_store: &dyn IndexStore, partitions: &[u64]) -> Result<()> {
let metadata = HashMap::from_iter(vec![
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR but maybe in the future we could just allow the index store to read & write JSON files.


/// decompress the posting list from a LargeBinaryArray
/// returns a vector of (row_id, frequency) tuples
#[allow(dead_code)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still dead code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we decompress only needed block so it's never used, but leave it for convenient testing

Comment on lines 111 to 114
// skip to a doc id that is greater than or equal to the given doc id
// pub fn skip_to(&mut self, doc_id: u32) {
// let
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

impl Iterator for TokenIterator<'_> {
type Item = (String, u32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little odd this is an iterator of String and not &'a str?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is, unfortunately the FST lib produces &[u8] and can't convert it to &str because of lifetime issue, we need str::from_raw_parts but it's still unstable

Self::schema(self)
}

fn size(&self) -> u64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you use this size?

I'm not entirely sure this is the exact size of the file (I think you would need to add a few more bytes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like I added them for some purpose but finally i'm not using it, remove it for now

.new_index_file(name, Arc::new(reader.schema().into()))
.await?;

for offset in (0..reader.num_rows()).step_by(4096) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We coudll maybe make this 4096 a parameter of the copy_index_file method, but that can be in the future when we hit a problem I suppose.

@westonpace
Copy link
Member

Do we need to add a backwards compatibility test? To make sure that an old index is still readable by the new code? Also, to make sure we can read an old index, update the index, and then have a new index that still works?

BubbleCal added 2 commits May 23, 2025 10:22
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
… files for version 0.27.0

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal
Copy link
Contributor Author

Do we need to add a backwards compatibility test? To make sure that an old index is still readable by the new code? Also, to make sure we can read an old index, update the index, and then have a new index that still works?

added

…ndling

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal merged commit aa45c5d into lance-format:main May 26, 2025
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants