feat: support phrase query for full text search #2751

Merged
merged 6 commits into from
Aug 22, 2024

Conversation


@BubbleCal BubbleCal commented Aug 19, 2024

Existing indices continue to work, but they do not support phrase queries.
This PR stores a new kind of data, positions: the position of each term in each document. Position data can be very large, so it is read only when the query is a phrase query.

  • pass a projection to IndexReader so the positions column can be skipped when it is not needed
  • build the FTS index with positions
  • return an error telling users to re-create their FTS index when they run a phrase query against an index built by an older version
  • add an algorithm similar to WAND to quickly check the position requirement of a phrase query
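
As a rough illustration of the position requirement in the last bullet, here is a minimal sketch, not the code in this PR. It assumes each phrase term comes with a sorted list of token positions for one document, and it advances one forward-only cursor per term, skipping ahead to the next candidate start whenever a term cannot line up. The function name `phrase_match` and the `Vec<u32>` representation are hypothetical.

```rust
/// Returns true if the phrase terms occur consecutively in one document.
/// `positions[i]` holds the sorted token positions of the i-th phrase term
/// in that document (an assumed layout, chosen for illustration only).
fn phrase_match(positions: &[Vec<u32>]) -> bool {
    if positions.is_empty() || positions.iter().any(|p| p.is_empty()) {
        return false;
    }
    // One cursor per term; cursors only ever move forward.
    let mut cursors = vec![0usize; positions.len()];
    // Candidate position at which the phrase could start.
    let mut start = positions[0][0];
    loop {
        let mut advanced = false;
        for (i, list) in positions.iter().enumerate() {
            let want = start + i as u32;
            // Skip positions too small to take part in a match at `start`.
            while cursors[i] < list.len() && list[cursors[i]] < want {
                cursors[i] += 1;
            }
            if cursors[i] == list.len() {
                return false; // term i has no positions left
            }
            let found = list[cursors[i]];
            if found > want {
                // Term i next occurs at `found`, so the phrase cannot
                // start before `found - i`; jump the candidate forward.
                start = found - i as u32;
                advanced = true;
                break;
            }
        }
        if !advanced {
            return true; // every term i sits exactly at start + i
        }
    }
}
```

Because the candidate start only increases and every cursor only moves forward, each position list is scanned at most once per document.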

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Aug 19, 2024
@BubbleCal BubbleCal mentioned this pull request Aug 19, 2024
13 tasks

codecov-commenter commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 78.65169% with 57 lines in your changes missing coverage. Please review.

Project coverage is 79.24%. Comparing base (f7cc676) to head (00fcddf).
Report is 1 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/inverted/builder.rs | 55.73% | 25 Missing and 2 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 78.35% | 11 Missing and 10 partials ⚠️ |
| rust/lance-index/src/scalar/lance_format.rs | 65.21% | 6 Missing and 2 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/wand.rs | 98.82% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2751      +/-   ##
==========================================
+ Coverage   79.18%   79.24%   +0.05%     
==========================================
  Files         227      227              
  Lines       67818    68093     +275     
  Branches    67818    68093     +275     
==========================================
+ Hits        53703    53958     +255     
- Misses      10995    11007      +12     
- Partials     3120     3128       +8     
Flag Coverage Δ
unittests 79.24% <78.65%> (+0.05%) ⬆️

@BubbleCal BubbleCal marked this pull request as ready for review August 20, 2024 04:26

@westonpace westonpace left a comment

Looks ok to me. We will need to document the query syntax at some point for users but that can be in a future PR.

Comment on lines 435 to 444
// the positions column may not exist for old indices
// in that case, phrase query is not supported
// let positions = if is_phrase_query {
// Some(batch
// .column_by_name(POSITION_COL)
// .ok_or(Error::Index { message: format!("the index was built with old version which doesn't support phrase query, please re-create the index"), location: location!() })?
// .as_list::<i32>().clone())
// } else {
// None
// };

Contributor:
?

Contributor Author:
I forgot to remove this; it is removed now.

pub fn new(
row_ids: ScalarBuffer<u64>,
frequencies: ScalarBuffer<f32>,
positions: Option<ListArray>,

Contributor:
Why is this Option? Is it because older versions of the index might not have this data?

Contributor Author:
Yes. I don't want to introduce a breaking change, so I made it optional. Old indices still work; they just don't support phrase queries, and the error message guides users to re-create the index.
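
For readers following along, the optional-positions idea can be sketched as below. This is a simplified illustration, not the PR's code: plain `Vec`s stand in for `ScalarBuffer` and `ListArray`, a `String` stands in for the crate's error type, and the accessor name `positions_for_phrase` is hypothetical. The error text is taken from the snippet quoted earlier in this review.

```rust
/// Simplified stand-in for a posting list; the real type in the PR uses
/// Arrow buffers, which are replaced by plain Vecs here (an assumption).
#[allow(dead_code)]
struct PostingList {
    row_ids: Vec<u64>,
    frequencies: Vec<f32>,
    /// `None` for indices built before positions were stored.
    positions: Option<Vec<Vec<i32>>>,
}

impl PostingList {
    /// Hypothetical accessor: phrase queries need positions, so an index
    /// without them yields an error asking the user to re-create it.
    fn positions_for_phrase(&self) -> Result<&[Vec<i32>], String> {
        self.positions.as_deref().ok_or_else(|| {
            "the index was built with old version which doesn't support \
             phrase query, please re-create the index"
                .to_string()
        })
    }
}
```

Non-phrase queries never call the accessor, so an old index keeps serving them unchanged.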

Comment on lines +111 to +115
async fn read_range(
&self,
range: std::ops::Range<usize>,
projection: Option<&[&str]>,
) -> Result<RecordBatch>;

Contributor:
This change seems fine. Do you want to just briefly document whether this is expected to handle nested references (e.g. read_range(0..30, Some(&["x.y"])))?

Probably easiest if it doesn't for now and we can change it in the future.

Contributor Author:
I just added documentation stating that nested columns are not supported.
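
To make the agreed semantics concrete, here is a small in-memory sketch; it is an assumption-laden illustration and not the `IndexReader` implementation. `MemReader`, its `i64` columns, and the `String` errors are hypothetical. `None` reads every column, and each projected name is matched literally against top-level column names, so a dotted name such as `x.y` is not resolved as a nested reference.

```rust
use std::ops::Range;

/// Hypothetical in-memory reader: named top-level columns of i64 values.
struct MemReader {
    columns: Vec<(String, Vec<i64>)>,
}

impl MemReader {
    /// Slice one top-level column; names are compared literally.
    fn select(&self, name: &str, range: &Range<usize>) -> Result<(String, Vec<i64>), String> {
        let (col_name, values) = self
            .columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .ok_or_else(|| format!("column {name} not found (nested columns are not supported)"))?;
        let slice = values
            .get(range.clone())
            .ok_or_else(|| format!("range {range:?} is out of bounds"))?;
        Ok((col_name.clone(), slice.to_vec()))
    }

    /// `projection = None` reads all columns; `Some(names)` reads only those.
    fn read_range(
        &self,
        range: Range<usize>,
        projection: Option<&[&str]>,
    ) -> Result<Vec<(String, Vec<i64>)>, String> {
        match projection {
            Some(names) => names.iter().map(|name| self.select(*name, &range)).collect(),
            None => self
                .columns
                .iter()
                .map(|(n, _)| self.select(n.as_str(), &range))
                .collect(),
        }
    }
}
```

With this shape, a caller that does not need positions passes a projection that omits that column and never pays for reading it.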

Comment on lines 363 to 369
let row_ids = invert_index
.search(&SargableQuery::FullTextSearch(
FullTextSearchQuery::new("\"database lance\"".to_owned()).limit(Some(3)),
))
.await
.unwrap();
assert_eq!(row_ids.len(), Some(0));

Contributor:
Can you add one more case for database lance (no phrase query) and make sure it returns 2 results?

Contributor Author:
Good point! I just added a test that queries "lance database" and asserts that 4 documents are hit (because the terms are combined with OR).
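
The difference between the two query forms discussed here can be sketched as follows. This is a naive illustration under stated assumptions, not the index's query path: it assumes a query wrapped in double quotes is a phrase query and anything else matches documents containing any of its terms (OR), with whitespace tokenization and lowercasing. The names `Query`, `parse_query`, `matches`, and `count_hits` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Query {
    /// All terms must appear consecutively, in order.
    Phrase(Vec<String>),
    /// A document matches if it contains at least one term (OR).
    AnyTerm(Vec<String>),
}

/// Lowercase, whitespace-separated tokens (an assumed tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_lowercase).collect()
}

/// Treats a query wrapped in double quotes as a phrase query.
fn parse_query(raw: &str) -> Query {
    let trimmed = raw.trim();
    match trimmed.strip_prefix('"').and_then(|s| s.strip_suffix('"')) {
        Some(inner) => Query::Phrase(tokenize(inner)),
        None => Query::AnyTerm(tokenize(trimmed)),
    }
}

fn matches(query: &Query, doc: &str) -> bool {
    let doc_tokens = tokenize(doc);
    match query {
        Query::AnyTerm(terms) => terms.iter().any(|t| doc_tokens.contains(t)),
        Query::Phrase(terms) => {
            // `windows(0)` would panic, so an empty phrase matches nothing.
            !terms.is_empty()
                && doc_tokens
                    .windows(terms.len())
                    .any(|window| window == terms.as_slice())
        }
    }
}

fn count_hits(query: &Query, docs: &[&str]) -> usize {
    docs.iter().filter(|doc| matches(query, doc)).count()
}
```

On a made-up corpus where the two words only ever appear in the order "lance database", the quoted query "database lance" hits nothing while the unquoted one hits every document containing either word, which mirrors the contrast the two test cases are meant to cover.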

@BubbleCal BubbleCal merged commit ef4632f into lancedb:main Aug 22, 2024
22 checks passed