-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: skip documents with WAND #2632
Conversation
with WAND, we don't need to calculate the score for all matched documents, it would skip the documents that are impossible to be in the results. ref: https://www.researchgate.net/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2632 +/- ##
==========================================
+ Coverage 78.99% 79.15% +0.16%
==========================================
Files 215 219 +4
Lines 62904 63741 +837
Branches 62904 63741 +837
==========================================
+ Hits 49689 50455 +766
- Misses 10293 10342 +49
- Partials 2922 2944 +22
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A few questions but nothing concerning.
/// Get the row id mask for this prefilter | ||
fn mask(&self) -> Arc<RowIdMask>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably document that this cannot be called before wait_for_ready
} | ||
|
||
impl<'a> PostingIterator<'a> { | ||
pub(crate) fn new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this maybe_new
or something like that to signify it returns an option?
fn next(&mut self, least_id: u64) -> Option<(u64, usize)> { | ||
let block_size = ((self.list.len() - self.index) as f32).sqrt().ceil() as usize; | ||
// skip blocks | ||
while self.index + block_size < self.list.len() | ||
&& self.list.row_ids[self.index + block_size] < least_id | ||
{ | ||
self.index += block_size; | ||
} | ||
// linear search | ||
while self.index < self.list.len() { | ||
let row_id = self.list.row_ids[self.index]; | ||
if row_id >= least_id && self.mask.selected(row_id) { | ||
return Some((row_id, self.index)); | ||
} | ||
self.index += 1; | ||
} | ||
None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is this sort of like a binary search? Why not just use a binary search?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can first use binary search to find the last element element with row_id < least_id
and then do linear search remaining elements.
this is for the coming disk-based implementation, that won't load the entire posting list into memory so can't do binary search on it
token_id: u32, | ||
list: &'a PostingList, | ||
index: usize, | ||
mask: Arc<RowIdMask>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: can we make this mask: &'a RowIdMask
and save some Arc copies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doable, will do it in the next PR cause the method has been modified in that
} else if score > self.threshold { | ||
self.candidates.pop(); | ||
self.candidates.push(Reverse(OrderedDoc::new(doc, score))); | ||
self.threshold = self.candidates.peek().unwrap().0.score.0 * self.factor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is "we want the top K results but can ignore results if they are significantly worst than the top 1 result"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct: "we want the top K results but can ignore results if they are significantly worst than the top K result"
here the candidates is min-heap so the peek()
returns the one with smallest score
will fix the comments in the next PR cause many code modified |
This improves the full text search 3x faster without recall loss.
With WAND, we don't need to calculate the score for all matched documents, it would skip the documents that are impossible to be in the results.
ref: https://www.researchgate.net/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process
This also adds a method
mask()
forPreFilter
trait, to get theRowIdMask
, because it's hard to use thefilter_row_ids
method in the WAND implementation