-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support phrase query for full text search #2751
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2751 +/- ##
==========================================
+ Coverage 79.18% 79.24% +0.05%
==========================================
Files 227 227
Lines 67818 68093 +275
Branches 67818 68093 +275
==========================================
+ Hits 53703 53958 +255
- Misses 10995 11007 +12
- Partials 3120 3128 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me. We will need to document the query syntax at some point for users but that can be in a future PR.
// the positions column may not exist for old indices | ||
// in that case, phrase query is not supported | ||
// let positions = if is_phrase_query { | ||
// Some(batch | ||
// .column_by_name(POSITION_COL) | ||
// .ok_or(Error::Index { message: format!("the index was built with old version which doesn't support phrase query, please re-create the index"), location: location!() })? | ||
// .as_list::<i32>().clone()) | ||
// } else { | ||
// None | ||
// }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to remove this, just removed
pub fn new( | ||
row_ids: ScalarBuffer<u64>, | ||
frequencies: ScalarBuffer<f32>, | ||
positions: Option<ListArray>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this Option
? Is it because older versions of the index might not have this data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't want to introduce a breaking change, so make it optional, so that old index can still work, just doesn't support phrase query, the error message would guide users to re-create index
async fn read_range( | ||
&self, | ||
range: std::ops::Range<usize>, | ||
projection: Option<&[&str]>, | ||
) -> Result<RecordBatch>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems fine. Do you want to just briefly document whether this is expected to handle nested references (e.g. read_range(0..30, Some(&["x.y"]))
)?
Probably easiest if it doesn't for now and we can change it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just added document to say nested column not supported
let row_ids = invert_index | ||
.search(&SargableQuery::FullTextSearch( | ||
FullTextSearchQuery::new("\"database lance\"".to_owned()).limit(Some(3)), | ||
)) | ||
.await | ||
.unwrap(); | ||
assert_eq!(row_ids.len(), Some(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add one more case for database lance
(no phrase query) and make sure it returns 2 results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! just added a test that query "lance database" and asserts there should be 4 documents hit (because it's OR)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
The old indices can still work, but don't support phrase query.
This introduces new data to store: positions, the positions of each term in each doc, the positions data can be very huge, so we won't read it if the query isn't a phrase query.
IndexReader
so we can skip position column if no need