Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add inverted index #2526

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

feat: add inverted index #2526

wants to merge 10 commits into from

Conversation

BubbleCal
Copy link
Contributor

No description provided.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Jun 25, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 80.68460% with 79 lines in your changes missing coverage. Please review.

Project coverage is 79.99%. Comparing base (f8c5f4d) to head (12335e4).
Report is 15 commits behind head on main.

Files Patch % Lines
rust/lance-index/src/scalar/inverted.rs 83.12% 31 Missing and 36 partials ⚠️
rust/lance-index/src/scalar.rs 0.00% 4 Missing ⚠️
rust/lance-index/src/scalar/btree.rs 0.00% 0 Missing and 4 partials ⚠️
rust/lance-index/src/scalar/flat.rs 0.00% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2526      +/-   ##
==========================================
+ Coverage   79.81%   79.99%   +0.18%     
==========================================
  Files         207      210       +3     
  Lines       59569    60256     +687     
  Branches    59569    60256     +687     
==========================================
+ Hits        47544    48204     +660     
+ Misses       9243     9178      -65     
- Partials     2782     2874      +92     
Flag Coverage Δ
unittests 79.99% <80.68%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more complex than it needs to be?

It looks like you are creating a BTreeMap<String, Vec<u64>>. The search then looks up each keyword in the map and combines the row ids. I don't know why you need three files to store one map?

Also, how many tokens are there? This structure will take GB of RAM when there are billions of rows.

You could store the file in partitions and then keep a top-level BTreeMap<String, u32> where u32 is the partition id. This is what the btree index does today.

Comment on lines +35 to +37
tokens: TokenSet,
invert_list: InvertedList,
docs: DocSet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe a little what these things are? Either here or in the structs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines +195 to +199
struct TokenSet {
tokens: Vec<String>,
ids: Vec<u32>,
frequencies: Vec<u64>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe implement this with BTreeMap<String, (u32, u64)>

})
}

fn add(&mut self, token_id: u32, row_id: u64, frequency: u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think frequency is always 1? Also, frequency is never used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, fixed

async fn search(&self, query: &ScalarQuery) -> Result<UInt64Array> {
let row_ids = match query {
ScalarQuery::FullTextSearch(texts) => {
let tokens = self.map(texts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to tokenize texts with tantivy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it depends, here i got a bad name, change it to tokens

let row_ids = tokens
.iter()
.filter_map(|token| self.invert_list.retrieve(*token))
.flat_map(|(row_ids, _)| row_ids.iter().cloned())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this lead to duplicate row ids? E.g. if row 0 contains "a b" and texts is ["a", "b"] then will row 0 appear twice in the results?

@BubbleCal
Copy link
Contributor Author

BubbleCal commented Jul 3, 2024

This seems more complex than it needs to be?

It looks like you are creating a BTreeMap<String, Vec<u64>>. The search then looks up each keyword in the map and combines the row ids. I don't know why you need three files to store one map?

Also, how many tokens are there? This structure will take GB of RAM when there are billions of rows.

You could store the file in partitions and then keep a top-level BTreeMap<String, u32> where u32 is the partition id. This is what the btree index does today.

DocSet is needed only when we need to sort the results by bm25 scores.
The inverted list is at most with the number of english words, it won't be too many.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review July 4, 2024 15:27
@BubbleCal BubbleCal requested a review from westonpace July 4, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants