perf: skip documents with WAND #2632

BubbleCal · 2024-07-23T09:19:44Z

This makes full text search 3x faster without recall loss.

With WAND, we don't need to calculate the score for every matched document: it skips the documents that cannot make it into the results.

ref: https://www.researchgate.net/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process
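
For readers unfamiliar with WAND, the skipping idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TermCursor`, `find_pivot`, and the `max_score` field are hypothetical names, and each query term is assumed to carry a precomputed upper bound on its score contribution.

```rust
/// Hypothetical cursor over one term's posting list (not the PR's type).
struct TermCursor {
    /// Row id the cursor currently points at.
    current_doc: u64,
    /// Assumed upper bound on this term's contribution to any document score.
    max_score: f32,
}

/// Sorts the cursors by current doc id and returns the pivot doc: the first
/// doc at which the accumulated upper bounds exceed `threshold`.
/// Any doc before the pivot can only contain terms whose bounds sum to at
/// most `threshold`, so it can be skipped without computing its score.
fn find_pivot(cursors: &mut [TermCursor], threshold: f32) -> Option<u64> {
    cursors.sort_by_key(|c| c.current_doc);
    let mut upper_bound = 0.0_f32;
    for cursor in cursors.iter() {
        upper_bound += cursor.max_score;
        if upper_bound > threshold {
            return Some(cursor.current_doc);
        }
    }
    // Even all terms together cannot beat the threshold: nothing left to score.
    None
}

fn main() {
    let mut cursors = vec![
        TermCursor { current_doc: 12, max_score: 1.5 },
        TermCursor { current_doc: 3, max_score: 0.5 },
        TermCursor { current_doc: 7, max_score: 1.0 },
    ];
    // Sorted by doc id: 3 (0.5), 7 (1.0), 12 (1.5).
    // 0.5 alone does not exceed 1.2, but 0.5 + 1.0 does, so the pivot is doc 7.
    println!("{:?}", find_pivot(&mut cursors, 1.2)); // prints "Some(7)"
}
```

In this sketch doc 3 is never scored, because the only term that can match it contributes at most 0.5.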

This also adds a mask() method to the PreFilter trait to get the RowIdMask, because the filter_row_ids method is hard to use in the WAND implementation.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov-commenter · 2024-07-23T14:54:11Z

Codecov Report

Attention: Patch coverage is 84.45596% with 30 lines in your changes missing coverage. Please review.

Project coverage is 79.15%. Comparing base (e571229) to head (d38c228).
Report is 78 commits behind head on main.

Files	Patch %	Lines
rust/lance-index/src/scalar/inverted/wand.rs	81.75%	26 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/inverted.rs	91.17%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   78.99%   79.15%   +0.16%     
==========================================
  Files         215      219       +4     
  Lines       62904    63741     +837     
  Branches    62904    63741     +837     
==========================================
+ Hits        49689    50455     +766     
- Misses      10293    10342      +49     
- Partials     2922     2944      +22

Flag	Coverage Δ
unittests	`79.15% <84.45%> (+0.16%)`	⬆️

westonpace

Looks good. A few questions but nothing concerning.

westonpace · 2024-07-25T13:57:13Z

rust/lance-index/src/prefilter.rs

+    /// Get the row id mask for this prefilter
+    fn mask(&self) -> Arc<RowIdMask>;


We should probably document that this cannot be called before wait_for_ready

westonpace · 2024-07-25T14:00:31Z

rust/lance-index/src/scalar/inverted/wand.rs

+}
+
+impl<'a> PostingIterator<'a> {
+    pub(crate) fn new(


Can we call this maybe_new or something like that to signify it returns an option?

westonpace · 2024-07-25T14:04:56Z

rust/lance-index/src/scalar/inverted/wand.rs

+    fn next(&mut self, least_id: u64) -> Option<(u64, usize)> {
+        let block_size = ((self.list.len() - self.index) as f32).sqrt().ceil() as usize;
+        // skip blocks
+        while self.index + block_size < self.list.len()
+            && self.list.row_ids[self.index + block_size] < least_id
+        {
+            self.index += block_size;
+        }
+        // linear search
+        while self.index < self.list.len() {
+            let row_id = self.list.row_ids[self.index];
+            if row_id >= least_id && self.mask.selected(row_id) {
+                return Some((row_id, self.index));
+            }
+            self.index += 1;
+        }
+        None
+    }


So is this sort of like a binary search? Why not just use a binary search?

Yes, we could first use binary search to find the last element with row_id < least_id and then do a linear search over the remaining elements.
This is for the upcoming disk-based implementation, which won't load the entire posting list into memory, so it can't do binary search on it.
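
The binary-search variant described in this reply could look roughly like the sketch below. It is an illustration only: `next_selected` is a hypothetical free function, the posting list is assumed to be a sorted in-memory `&[u64]`, and a closure stands in for `RowIdMask::selected`.

```rust
/// Finds the first row id at or after position `start` that is `>= least_id`
/// and accepted by `selected`. Returns the row id and its position.
/// `row_ids` must be sorted in ascending order.
fn next_selected(
    row_ids: &[u64],
    start: usize,
    least_id: u64,
    selected: impl Fn(u64) -> bool,
) -> Option<(u64, usize)> {
    let start = start.min(row_ids.len());
    // Binary search: number of remaining elements that are < least_id.
    let skipped = row_ids[start..].partition_point(|&id| id < least_id);
    let begin = start + skipped;
    // Linear scan over the rest, because the mask may reject any element.
    row_ids[begin..]
        .iter()
        .position(|&id| selected(id))
        .map(|i| (row_ids[begin + i], begin + i))
}

fn main() {
    let row_ids = [1, 4, 6, 9, 12, 15];
    // Pretend the mask only selects odd row ids.
    let hit = next_selected(&row_ids, 0, 5, |id| id % 2 == 1);
    println!("{:?}", hit); // prints "Some((9, 3))"
}
```

As the reply points out, this only works while the whole posting list is in memory; the block-skipping loop in the diff avoids that requirement.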

westonpace · 2024-07-25T14:06:36Z

rust/lance-index/src/scalar/inverted/wand.rs

+    token_id: u32,
+    list: &'a PostingList,
+    index: usize,
+    mask: Arc<RowIdMask>,


Minor nit: can we make this mask: &'a RowIdMask and save some Arc copies?

Doable. I'll do it in the next PR, because the method has been modified there.

westonpace · 2024-07-25T14:08:38Z

rust/lance-index/src/scalar/inverted/wand.rs

+            } else if score > self.threshold {
+                self.candidates.pop();
+                self.candidates.push(Reverse(OrderedDoc::new(doc, score)));
+                self.threshold = self.candidates.peek().unwrap().0.score.0 * self.factor;


So the idea is "we want the top K results but can ignore results if they are significantly worse than the top 1 result"?

Correct, with one change: "we want the top K results but can ignore results if they are significantly worse than the K-th best result".
Here candidates is a min-heap, so peek() returns the one with the smallest score.
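
The top-K bookkeeping discussed in this thread can be sketched with a standard-library min-heap. This is a simplified stand-in, not the PR's code: `TopK`, `Scored`, `offer`, and `into_sorted_docs` are hypothetical names, scores are ordered with `f32::total_cmp`, and `factor` is assumed to be at least 1.0 (1.0 keeps the exact top K).

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// A scored document with a total order on (score, doc).
struct Scored {
    score: f32,
    doc: u64,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score
            .total_cmp(&other.score)
            .then_with(|| self.doc.cmp(&other.doc))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

struct TopK {
    k: usize,
    factor: f32,
    threshold: f32,
    /// `Reverse` turns the max-heap into a min-heap, so `peek()` is the
    /// lowest-scoring candidate currently kept.
    candidates: BinaryHeap<Reverse<Scored>>,
}

impl TopK {
    fn new(k: usize, factor: f32) -> Self {
        assert!(k > 0, "k must be positive");
        Self { k, factor, threshold: 0.0, candidates: BinaryHeap::new() }
    }

    /// Keeps the document if the heap is not full yet or if its score beats
    /// the current threshold; otherwise drops it.
    fn offer(&mut self, doc: u64, score: f32) {
        if self.candidates.len() < self.k {
            self.candidates.push(Reverse(Scored { score, doc }));
        } else if score > self.threshold {
            self.candidates.pop(); // evict the current K-th best
            self.candidates.push(Reverse(Scored { score, doc }));
        } else {
            return;
        }
        if self.candidates.len() == self.k {
            // The heap minimum is the K-th best score seen so far.
            self.threshold = self.candidates.peek().unwrap().0.score * self.factor;
        }
    }

    /// Doc ids ordered from highest to lowest score.
    fn into_sorted_docs(self) -> Vec<u64> {
        self.candidates.into_sorted_vec().into_iter().map(|r| r.0.doc).collect()
    }
}

fn main() {
    let mut top = TopK::new(2, 1.0);
    for (doc, score) in [(1, 0.3), (2, 0.9), (3, 0.1), (4, 0.5)] {
        top.offer(doc, score);
    }
    println!("{:?}", top.into_sorted_docs()); // prints "[2, 4]"
}
```

With `factor` above 1.0 the threshold sits above the K-th best score, so more documents are rejected at the cost of exactness.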

BubbleCal · 2024-07-25T14:56:36Z

Will fix the comments in the next PR, because a lot of the code has been modified.

github-actions bot added the performance label Jul 23, 2024

BubbleCal added 3 commits July 23, 2024 17:20

add wand file

3616389

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

add license

5058ea1

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

f08bc21

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested review from eddyxu, westonpace and wjones127 July 23, 2024 09:35

fmt

f9cc84c

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal mentioned this pull request Jul 23, 2024

Full text search (FTS) indices #1195

Open

7 tasks

BubbleCal added 2 commits July 23, 2024 23:08

add more comments

d93dac7

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remove score column

d38c228

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace approved these changes Jul 25, 2024

BubbleCal merged commit a52e703 into lancedb:main Jul 25, 2024
22 checks passed
