Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: skip documents with WAND #2632

Merged
merged 7 commits into from
Jul 25, 2024
Merged

perf: skip documents with WAND #2632

merged 7 commits into from
Jul 25, 2024

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Jul 23, 2024

This improves the full text search 3x faster without recall loss.

With WAND, we don't need to calculate the score for all matched documents, it would skip the documents that are impossible to be in the results.

ref: https://www.researchgate.net/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process

This also adds a method mask() for PreFilter trait, to get the RowIdMask, because it's hard to use the filter_row_ids method in the WAND implementation

with WAND, we don't need to calculate the score for all matched documents,
it would skip the documents that are impossible to be in the results.

ref: https://www.researchgate.net/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal mentioned this pull request Jul 23, 2024
7 tasks
@codecov-commenter
Copy link

codecov-commenter commented Jul 23, 2024

Codecov Report

Attention: Patch coverage is 84.45596% with 30 lines in your changes missing coverage. Please review.

Project coverage is 79.15%. Comparing base (e571229) to head (d38c228).
Report is 78 commits behind head on main.

Files Patch % Lines
rust/lance-index/src/scalar/inverted/wand.rs 81.75% 26 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/inverted.rs 91.17% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   78.99%   79.15%   +0.16%     
==========================================
  Files         215      219       +4     
  Lines       62904    63741     +837     
  Branches    62904    63741     +837     
==========================================
+ Hits        49689    50455     +766     
- Misses      10293    10342      +49     
- Partials     2922     2944      +22     
Flag Coverage Δ
unittests 79.15% <84.45%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A few questions but nothing concerning.

Comment on lines +39 to +40
/// Get the row id mask for this prefilter
fn mask(&self) -> Arc<RowIdMask>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably document that this cannot be called before wait_for_ready

}

impl<'a> PostingIterator<'a> {
pub(crate) fn new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this maybe_new or something like that to signify it returns an option?

Comment on lines +80 to +97
fn next(&mut self, least_id: u64) -> Option<(u64, usize)> {
let block_size = ((self.list.len() - self.index) as f32).sqrt().ceil() as usize;
// skip blocks
while self.index + block_size < self.list.len()
&& self.list.row_ids[self.index + block_size] < least_id
{
self.index += block_size;
}
// linear search
while self.index < self.list.len() {
let row_id = self.list.row_ids[self.index];
if row_id >= least_id && self.mask.selected(row_id) {
return Some((row_id, self.index));
}
self.index += 1;
}
None
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this sort of like a binary search? Why not just use a binary search?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can first use binary search to find the last element element with row_id < least_id and then do linear search remaining elements.
this is for the coming disk-based implementation, that won't load the entire posting list into memory so can't do binary search on it

token_id: u32,
list: &'a PostingList,
index: usize,
mask: Arc<RowIdMask>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: can we make this mask: &'a RowIdMask and save some Arc copies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doable, will do it in the next PR cause the method has been modified in that

} else if score > self.threshold {
self.candidates.pop();
self.candidates.push(Reverse(OrderedDoc::new(doc, score)));
self.threshold = self.candidates.peek().unwrap().0.score.0 * self.factor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is "we want the top K results but can ignore results if they are significantly worst than the top 1 result"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct: "we want the top K results but can ignore results if they are significantly worst than the top K result"
here the candidates is min-heap so the peek() returns the one with smallest score

@BubbleCal
Copy link
Contributor Author

will fix the comments in the next PR cause many code modified

@BubbleCal BubbleCal merged commit a52e703 into lancedb:main Jul 25, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants