feat: add inverted index #2526

Open

BubbleCal wants to merge 10 commits into lancedb:main from BubbleCal:fts

Contributor

BubbleCal commented Jun 25, 2024

No description provided.


          feat: add inverted index

6aaa40f

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the enhancement label

BubbleCal added 3 commits

June 25, 2024 16:20


          add file

33679fd

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


          Merge branch 'main' of https://github.com/lancedb/lance into fts

e4f72d3


          finish

17e3bc8

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested review from westonpace and eddyxu

June 28, 2024 12:34


          rename

77bc8ff

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov-commenter commented Jun 28, 2024 •

Codecov Report

Attention: Patch coverage is 80.68460% with 79 lines in your changes missing coverage. Please review.

Project coverage is 79.99%. Comparing base (f8c5f4d) to head (12335e4).
Report is 15 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/inverted.rs | 83.12% | 31 Missing and 36 partials ⚠️ |
| rust/lance-index/src/scalar.rs | 0.00% | 4 Missing ⚠️ |
| rust/lance-index/src/scalar/btree.rs | 0.00% | 0 Missing and 4 partials ⚠️ |
| rust/lance-index/src/scalar/flat.rs | 0.00% | 0 Missing and 4 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2526      +/-   ##
==========================================
+ Coverage   79.81%   79.99%   +0.18%     
==========================================
  Files         207      210       +3     
  Lines       59569    60256     +687     
  Branches    59569    60256     +687     
==========================================
+ Hits        47544    48204     +660     
+ Misses       9243     9178      -65     
- Partials     2782     2874      +92

| Flag | Coverage Δ |
|---|---|
| unittests | `79.99% <80.68%> (+0.18%)` ⬆️ |

westonpace reviewed

Contributor

westonpace left a comment

This seems more complex than it needs to be?

It looks like you are creating a BTreeMap<String, Vec<u64>>. The search then looks up each keyword in the map and combines the row ids. I don't know why you need three files to store one map?

Also, how many tokens are there? This structure will take GB of RAM when there are billions of rows.

You could store the file in partitions and then keep a top-level BTreeMap<String, u32> where u32 is the partition id. This is what the btree index does today.
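
To make the two layouts in this comment concrete, here is a minimal sketch, not code from this PR. `InvertedIndex` is the single in-memory map described above, and `PartitionDirectory` is one possible shape for the top-level `BTreeMap<String, u32>`. All type and method names are hypothetical, whitespace splitting stands in for a real tokenizer, and keying the directory by the first token of each partition is an assumption about how the lookup would work.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// In-memory inverted index: token -> row ids of the rows containing it.
#[derive(Default)]
pub struct InvertedIndex {
    postings: BTreeMap<String, Vec<u64>>,
}

impl InvertedIndex {
    /// Index one row. Whitespace splitting is a stand-in for a real tokenizer.
    pub fn add_row(&mut self, row_id: u64, text: &str) {
        for token in text.split_whitespace() {
            let list = self.postings.entry(token.to_lowercase()).or_default();
            // A token repeated within the same row is recorded once.
            if list.last() != Some(&row_id) {
                list.push(row_id);
            }
        }
    }

    /// Row ids for `token`, or an empty slice if the token was never seen.
    pub fn lookup(&self, token: &str) -> &[u64] {
        self.postings.get(token).map(Vec::as_slice).unwrap_or(&[])
    }
}

/// Top-level directory for a partitioned layout (assumption: each entry maps
/// the first token stored in a partition to that partition's id).
pub struct PartitionDirectory {
    first_token: BTreeMap<String, u32>,
}

impl PartitionDirectory {
    pub fn new(first_tokens: impl IntoIterator<Item = (String, u32)>) -> Self {
        Self {
            first_token: first_tokens.into_iter().collect(),
        }
    }

    /// The partition that could contain `token`: the entry with the greatest
    /// first token that is <= `token`. Only that partition has to be loaded.
    pub fn partition_for(&self, token: &str) -> Option<u32> {
        self.first_token
            .range::<str, _>((Bound::Unbounded, Bound::Included(token)))
            .next_back()
            .map(|(_, id)| *id)
    }
}
```

With this split, only the small directory stays resident in memory and a query reads the one partition that `partition_for` returns, which is the point of the RAM concern raised above.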

rust/lance-index/src/scalar/inverted.rs

Comment on lines +35 to +37

+                  tokens: TokenSet,
+                  invert_list: InvertedList,
+                  docs: DocSet,

Contributor

westonpace Jul 2, 2024

Can you describe a little what these things are? Either here or in the structs?

Contributor Author

BubbleCal Jul 4, 2024

added

rust/lance-index/src/scalar/inverted.rs

Comment on lines +195 to +199

+              struct TokenSet {
+                  tokens: Vec<String>,
+                  ids: Vec<u32>,
+                  frequencies: Vec<u64>,
+              }

Contributor

westonpace Jul 2, 2024

Could maybe implement this with BTreeMap<String, (u32, u64)>
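
A rough sketch of what that suggestion could look like, assuming the tuple holds `(token id, frequency)` in the same order as the `ids` and `frequencies` fields above. The `next_id` counter and the method names are hypothetical and are not taken from the PR.

```rust
use std::collections::BTreeMap;

/// Token dictionary: token text -> (token id, frequency).
#[derive(Default)]
pub struct TokenSet {
    tokens: BTreeMap<String, (u32, u64)>,
    // Hypothetical counter used to hand out ids in first-seen order.
    next_id: u32,
}

impl TokenSet {
    /// Record one occurrence of `token` and return its id,
    /// assigning a fresh id the first time the token is seen.
    pub fn add(&mut self, token: &str) -> u32 {
        if let Some((id, freq)) = self.tokens.get_mut(token) {
            *freq += 1;
            return *id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.tokens.insert(token.to_owned(), (id, 1));
        id
    }

    /// Look up the (id, frequency) pair for `token`, if present.
    pub fn get(&self, token: &str) -> Option<(u32, u64)> {
        self.tokens.get(token).copied()
    }
}
```

A single map keeps the token, its id, and its frequency together, so the three parallel vectors cannot drift out of sync.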

rust/lance-index/src/scalar/inverted.rs Outdated

+                      })
+                  }
+                  fn add(&mut self, token_id: u32, row_id: u64, frequency: u64) {

Contributor

westonpace Jul 2, 2024

I think frequency is always 1? Also, frequency is never used?

Contributor Author

BubbleCal Jul 3, 2024

right, fixed

rust/lance-index/src/scalar/inverted.rs Outdated

+                  async fn search(&self, query: &ScalarQuery) -> Result<UInt64Array> {
+                      let row_ids = match query {
+                          ScalarQuery::FullTextSearch(texts) => {
+                              let tokens = self.map(texts);

Contributor

westonpace Jul 2, 2024

Do we need to tokenize texts with tantivy?

Contributor Author

BubbleCal Jul 3, 2024

It depends. The variable was badly named here; I've changed it to tokens.

rust/lance-index/src/scalar/inverted.rs Outdated

+                              let row_ids = tokens
+                                  .iter()
+                                  .filter_map(|token| self.invert_list.retrieve(*token))
+                                  .flat_map(|(row_ids, _)| row_ids.iter().cloned())

Contributor

westonpace Jul 2, 2024

Can this lead to duplicate row ids? E.g. if row 0 contains "a b" and texts is ["a", "b"] then will row 0 appear twice in the results?
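
One way to rule this out, sketched under the assumption that the posting lists are a plain `BTreeMap<String, Vec<u64>>`, is to collect the flattened row ids into a set before returning them. `search_any` is a hypothetical helper, not the PR's `search` method.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Union the posting lists of all query tokens, returning each row id once,
/// in ascending order.
pub fn search_any(postings: &BTreeMap<String, Vec<u64>>, tokens: &[&str]) -> Vec<u64> {
    let unique: BTreeSet<u64> = tokens
        .iter()
        // Tokens that are not in the index contribute nothing.
        .filter_map(|token| postings.get(*token))
        .flat_map(|row_ids| row_ids.iter().copied())
        // Collecting into a set drops row ids matched by more than one token.
        .collect();
    unique.into_iter().collect()
}
```

With the example above (row 0 contains "a b" and the query is ["a", "b"]), row 0 is yielded by both posting lists but appears once in the result.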

Contributor Author

BubbleCal commented Jul 3, 2024 •

> This seems more complex than it needs to be?
>
> It looks like you are creating a BTreeMap<String, Vec<u64>>. The search then looks up each keyword in the map and combines the row ids. I don't know why you need three files to store one map?
>
> Also, how many tokens are there? This structure will take GB of RAM when there are billions of rows.
>
> You could store the file in partitions and then keep a top-level BTreeMap<String, u32> where u32 is the partition id. This is what the btree index does today.

DocSet is needed only when we need to sort the results by BM25 scores.
The inverted list has at most as many entries as there are English words, so it won't be too large.
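
For context on why per-document data is needed for ranking, here is a sketch of the textbook Okapi BM25 term score, using the non-negative IDF variant popularized by Lucene. It is not the scoring code from this PR: `DocStats`, its fields, and the constants `K1 = 1.2` and `B = 0.75` are assumptions chosen for illustration.

```rust
/// Collection-level statistics that BM25 needs beyond the posting lists
/// (hypothetical struct; the PR's DocSet may store something different).
pub struct DocStats {
    pub num_docs: u64,
    pub avg_doc_len: f32,
}

// Commonly used default parameters, assumed here.
pub const K1: f32 = 1.2;
pub const B: f32 = 0.75;

/// Score contribution of one query term for one document.
/// `doc_freq` is the number of documents containing the term,
/// `term_freq` is how often the term occurs in this document,
/// `doc_len` is this document's length in tokens.
pub fn bm25_term(stats: &DocStats, doc_freq: u64, term_freq: u32, doc_len: u32) -> f32 {
    let n = stats.num_docs as f32;
    let df = doc_freq as f32;
    // Rare terms get a larger inverse document frequency.
    let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
    let tf = term_freq as f32;
    // Length normalization: longer-than-average documents are penalized.
    let norm = 1.0 - B + B * doc_len as f32 / stats.avg_doc_len;
    idf * tf * (K1 + 1.0) / (tf + K1 * norm)
}
```

The score depends on each document's length and on the collection average, neither of which can be recovered from the token-to-row-id lists alone.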

BubbleCal added 5 commits

July 3, 2024 19:53


          fix comments

1b26359

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


          rename

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


          fix ut

6df8f5a

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


          optimize

3fc7c8d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>


          bm25

12335e4

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal marked this pull request as ready for review

July 4, 2024 15:27

BubbleCal requested a review from westonpace

July 4, 2024 15:27

westonpace approved these changes
