feat: add inverted index #2526
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2526      +/-   ##
==========================================
+ Coverage   79.81%   79.99%   +0.18%
==========================================
  Files         207      210       +3
  Lines       59569    60256     +687
  Branches    59569    60256     +687
==========================================
+ Hits        47544    48204     +660
+ Misses       9243     9178      -65
- Partials     2782     2874      +92
This seems more complex than it needs to be?
It looks like you are creating a BTreeMap<String, Vec<u64>>. The search then looks up each keyword in the map and combines the row ids. I don't know why you need three files to store one map?
Also, how many tokens are there? This structure will take GB of RAM when there are billions of rows.
You could store the file in partitions and then keep a top-level BTreeMap<String, u32> where u32 is the partition id. This is what the btree index does today.
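The partitioning suggestion above could be sketched roughly as follows. This is only an illustration, not the layout of the existing btree index: the names `Partition`, `PartitionedIndex`, `build`, and `search` are hypothetical, the partitions are kept in memory instead of being loaded from storage, and keying the top-level map by each partition's largest token is an assumption made for the example.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical in-memory stand-in for one stored partition:
/// a sorted map from token to the row ids that contain it.
type Partition = BTreeMap<String, Vec<u64>>;

/// Top-level lookup plus the partitions it points to.
struct PartitionedIndex {
    /// Assumption: key = largest token in a partition, value = partition id.
    lookup: BTreeMap<String, u32>,
    /// A real index would load these lazily from storage.
    partitions: Vec<Partition>,
}

impl PartitionedIndex {
    /// Splits sorted postings into partitions of at most `tokens_per_partition` tokens.
    fn build(postings: BTreeMap<String, Vec<u64>>, tokens_per_partition: usize) -> Self {
        assert!(tokens_per_partition > 0);
        let mut lookup = BTreeMap::new();
        let mut partitions: Vec<Partition> = Vec::new();
        let mut current = Partition::new();
        // BTreeMap iterates in key order, so each partition covers a token range.
        for (token, rows) in postings {
            current.insert(token, rows);
            if current.len() == tokens_per_partition {
                let max_token = current.keys().next_back().unwrap().clone();
                lookup.insert(max_token, partitions.len() as u32);
                partitions.push(std::mem::take(&mut current));
            }
        }
        if !current.is_empty() {
            let max_token = current.keys().next_back().unwrap().clone();
            lookup.insert(max_token, partitions.len() as u32);
            partitions.push(current);
        }
        Self { lookup, partitions }
    }

    /// Finds the one partition that can hold `token`, then looks inside it.
    fn search(&self, token: &str) -> Option<&Vec<u64>> {
        // First partition whose largest token is >= the query token.
        let bounds = (Bound::Included(token), Bound::Unbounded);
        let (_, &partition_id) = self.lookup.range::<str, _>(bounds).next()?;
        self.partitions[partition_id as usize].get(token)
    }
}
```

With this shape only the small top-level map has to stay resident; each query touches a single partition per token.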
tokens: TokenSet,
invert_list: InvertedList,
docs: DocSet,
Can you describe a little what these things are? Either here or in the structs?
added
struct TokenSet {
    tokens: Vec<String>,
    ids: Vec<u32>,
    frequencies: Vec<u64>,
}
Could maybe implement this with BTreeMap<String, (u32, u64)>
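A minimal sketch of that suggestion, assuming the tuple holds (token id, frequency) in the same order as the `ids` and `frequencies` fields. The `next_id` counter and the method names `add`, `get`, and `frequency` are hypothetical and not taken from the PR.

```rust
use std::collections::BTreeMap;

/// Token set backed by one map: token -> (token id, frequency).
#[derive(Default)]
struct TokenSet {
    tokens: BTreeMap<String, (u32, u64)>,
    /// Hypothetical counter used to hand out ids in insertion order.
    next_id: u32,
}

impl TokenSet {
    /// Records one occurrence of `token` and returns its id.
    fn add(&mut self, token: &str) -> u32 {
        if let Some((id, frequency)) = self.tokens.get_mut(token) {
            *frequency += 1;
            return *id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.tokens.insert(token.to_owned(), (id, 1));
        id
    }

    /// Returns the id of `token`, if it has been seen.
    fn get(&self, token: &str) -> Option<u32> {
        self.tokens.get(token).map(|(id, _)| *id)
    }

    /// Returns how many times `token` was added (0 if never seen).
    fn frequency(&self, token: &str) -> u64 {
        self.tokens.get(token).map_or(0, |(_, frequency)| *frequency)
    }
}
```

Keeping id and frequency in one entry means the three parallel vectors cannot drift out of sync, at the cost of losing positional access by token id.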
    })
}

fn add(&mut self, token_id: u32, row_id: u64, frequency: u64) {
I think frequency is always 1? Also, frequency is never used?
right, fixed
async fn search(&self, query: &ScalarQuery) -> Result<UInt64Array> {
    let row_ids = match query {
        ScalarQuery::FullTextSearch(texts) => {
            let tokens = self.map(texts);
Do we need to tokenize texts with tantivy?
It depends. The variable was badly named here; I've changed it to tokens.
let row_ids = tokens
    .iter()
    .filter_map(|token| self.invert_list.retrieve(*token))
    .flat_map(|(row_ids, _)| row_ids.iter().cloned())
Can this lead to duplicate row ids? E.g. if row 0 contains "a b" and texts is ["a", "b"] then will row 0 appear twice in the results?
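One possible way to avoid the duplicates is to collect the flattened row ids into a set before building the result. The sketch below is an assumption about how that might look, not the PR's code: `PostingLists` and `search_row_ids` are hypothetical names, and a plain `HashMap` stands in for the inverted list (the real `retrieve` returns a tuple, which is not modelled here).

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for the inverted list: token id -> row ids.
type PostingLists = HashMap<u32, Vec<u64>>;

/// Returns the sorted, de-duplicated row ids matching any of the token ids.
fn search_row_ids(postings: &PostingLists, token_ids: &[u32]) -> Vec<u64> {
    let unique: BTreeSet<u64> = token_ids
        .iter()
        // Skip tokens that have no posting list.
        .filter_map(|token_id| postings.get(token_id))
        .flat_map(|row_ids| row_ids.iter().copied())
        // Collecting into a set drops rows matched by more than one token.
        .collect();
    unique.into_iter().collect()
}
```

For the example in the comment, row 0 is listed under both "a" and "b" but is returned once. Sorting the collected vector and calling `dedup` would give the same result without the intermediate set.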