-
Notifications
You must be signed in to change notification settings - Fork 502
perf!: introduce compression and new indexing algo for FTS #3720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR delivers major performance improvements for full‐text search by introducing a new compression algorithm for posting lists and a redesigned indexing strategy based on document partitioning and merging. It also updates the default indexing parameters, replaces the legacy TokenizerConfig with InvertedIndexParams, and fixes a bug in LanceIndexStore::copy() when source and destination stores differ.
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/lance/src/index.rs | Reordering and refinement of imports and test setup for inverted index configuration. |
| rust/lance/src/dataset.rs | Updated tests verifying index scanning and improved assertion messages. |
| rust/lance-index/src/scalar/inverted/wand.rs | Refactored posting iterator and WAND-related logic for better performance and clarity. |
| rust/lance-index/src/scalar/inverted/tokenizer.rs | Replaced TokenizerConfig with InvertedIndexParams and adjusted configuration methods. |
| python/src/dataset.rs, python/python/tests | Updated Python API usage and adjusted test expectations according to new defaults. |
| Other files | Refinements in compression, benchmark setup, and container API changes to support the new index format. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
b7588a7 to
5235236
Compare
…xing-perf Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…xing-perf Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…date evaluation Signed-off-by: BubbleCal <bubble-cal@outlook.com>
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something of a YOLO. I've followed the basic changes I think. Overall it looks like a great addition. I appreciate the additional logging which will make it easier to debug. The compression looks good. Thanks for figuring this out. It looks like quite a bit of optimization, profiling, and debugging work!
| len: 0, | ||
| limit, | ||
| } | ||
| pub fn with_capacity_limit(mut self, limit: u16) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user sets the capacity limit to something small after they've added items to the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i just realized this name confuses, it should be with_max_expand_cap, this effects only the new node cap
| self.write_tokens(dest_store).await?; | ||
| self.write_docs(dest_store).await?; | ||
| async fn write_metadata(&self, dest_store: &dyn IndexStore, partitions: &[u64]) -> Result<()> { | ||
| let metadata = HashMap::from_iter(vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR but maybe in the future we could just allow the index store to read & write JSON files.
|
|
||
| /// decompress the posting list from a LargeBinaryArray | ||
| /// returns a vector of (row_id, frequency) tuples | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still dead code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we decompress only needed block so it's never used, but leave it for convenient testing
| // skip to a doc id that is greater than or equal to the given doc id | ||
| // pub fn skip_to(&mut self, doc_id: u32) { | ||
| // let | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| } | ||
|
|
||
| impl Iterator for TokenIterator<'_> { | ||
| type Item = (String, u32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little odd this is an iterator of String and not &'a str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is, unfortunately the FST lib produces &[u8] and can't convert it to &str because of lifetime issue, we need str::from_raw_parts but it's still unstable
| Self::schema(self) | ||
| } | ||
|
|
||
| fn size(&self) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you use this size?
I'm not entirely sure this is the exact size of the file (I think you would need to add a few more bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like I added them for some purpose but finally i'm not using it, remove it for now
| .new_index_file(name, Arc::new(reader.schema().into())) | ||
| .await?; | ||
|
|
||
| for offset in (0..reader.num_rows()).step_by(4096) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We coudll maybe make this 4096 a parameter of the copy_index_file method, but that can be in the future when we hit a problem I suppose.
|
Do we need to add a backwards compatibility test? To make sure that an old index is still readable by the new code? Also, to make sure we can read an old index, update the index, and then have a new index that still works? |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
… files for version 0.27.0 Signed-off-by: BubbleCal <bubble-cal@outlook.com>
added |
…ndling Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This introduces:
This also changes the default indexing params:
this also fix a bug in
LanceIndexStore::copy(), which causes error if the src store is not the same as dest store (one on local disk but the other one on cloud)