-
Notifications
You must be signed in to change notification settings - Fork 82
Fix phrase search containing stop words #664
Conversation
94957dd
to
b34fbcf
Compare
@ManyTheFish I have completed all the steps and I think it's implemented correctly, but that test ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is not the easiest one,
I suggest starting by fixing resolve_phrase, it should help you with the other steps!
@@ -473,7 +477,7 @@ fn resolve_plane_sweep_candidates( | |||
} | |||
Phrase(words) => { | |||
let mut groups_positions = Vec::with_capacity(words.len()); | |||
for word in words { | |||
for word in words.iter().filter_map(|w| w.as_ref()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a bit tricky, the call to the below plane_sweep
can't be set as true
if at least 1 stop word is surrounded by words, otherwise we set it to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-iter my request, I'm pretty sure that it wont return any document if we don't conditionally change the line 485
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented it: 4705368
It's a bit ugly though. I couldn't find a better way to do it.
Also, how do I test this? The existing test already has a stop word surrounded by words but it doesn't seem to fail without this. Am I missing something?
milli/src/search/query_tree.rs
Outdated
let words = words | ||
.into_iter() | ||
.filter_map(|w| w) | ||
.map(|w| MatchingWord::new(w.to_string(), 0, false)) | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit afraid of this one, let's come back later on it
@@ -473,7 +477,7 @@ fn resolve_plane_sweep_candidates( | |||
} | |||
Phrase(words) => { | |||
let mut groups_positions = Vec::with_capacity(words.len()); | |||
for word in words { | |||
for word in words.iter().filter_map(|w| w.as_ref()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-iter my request, I'm pretty sure that it wont return any document if we don't conditionally change the line 485
Just to update, I'm still working on this. I couldn't make any progress because of other commitments, but I have picked it up again now. I'll update the PR in the next few hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Samyak2,
Nice Job! I requested a last change, we should be able to merge your PR after that!
let mut consecutive = true; | ||
let mut was_last_word_a_stop_word = false; | ||
for word in words.iter() { | ||
if let Some(word) = word { | ||
let positions = match words_positions.get(word) { | ||
Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(), | ||
None => return Ok(vec![]), | ||
}; | ||
groups_positions.push(positions); | ||
|
||
if was_last_word_a_stop_word { | ||
consecutive = false; | ||
} | ||
was_last_word_a_stop_word = false; | ||
} else { | ||
if !was_last_word_a_stop_word { | ||
consecutive = false; | ||
} | ||
|
||
was_last_word_a_stop_word = true; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block could be refactored like:
let mut consecutive = true; | |
let mut was_last_word_a_stop_word = false; | |
for word in words.iter() { | |
if let Some(word) = word { | |
let positions = match words_positions.get(word) { | |
Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(), | |
None => return Ok(vec![]), | |
}; | |
groups_positions.push(positions); | |
if was_last_word_a_stop_word { | |
consecutive = false; | |
} | |
was_last_word_a_stop_word = false; | |
} else { | |
if !was_last_word_a_stop_word { | |
consecutive = false; | |
} | |
was_last_word_a_stop_word = true; | |
} | |
} | |
let mut consecutive = true; | |
let mut contains_stop_word = false; | |
for word in words.iter().skip_while(Option::is_none) { | |
match word { | |
Some(word) => { | |
let positions = match words_positions.get(word) { | |
Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(), | |
None => return Ok(vec![]), | |
}; | |
groups_positions.push(positions); | |
// if there is at least one stop word between words, | |
// then words are not considered consecutives. | |
consecutive = !contains_stop_word; | |
}, | |
None => contains_stop_word = true, | |
} | |
} |
However, we could probably be clever, by using slice_group_by
let mut consecutive = true; | |
let mut was_last_word_a_stop_word = false; | |
for word in words.iter() { | |
if let Some(word) = word { | |
let positions = match words_positions.get(word) { | |
Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(), | |
None => return Ok(vec![]), | |
}; | |
groups_positions.push(positions); | |
if was_last_word_a_stop_word { | |
consecutive = false; | |
} | |
was_last_word_a_stop_word = false; | |
} else { | |
if !was_last_word_a_stop_word { | |
consecutive = false; | |
} | |
was_last_word_a_stop_word = true; | |
} | |
} | |
// group stop_words together. | |
for words in words.linear_group_by_key(Option::is_none) { | |
// skip if it's a group of stop words. | |
if words.first().flatten().is_none() { | |
continue; | |
} | |
// make a consecutive plane-sweep on the subgroup of words. | |
let mut subgroup = Vec::with_capacity(words.len()); | |
for word in words { | |
let positions = match words_positions.get(word) { | |
Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(), | |
None => return Ok(vec![]), | |
}; | |
subgroup.push(positions); | |
} | |
groups_positions.push(plane_sweep(subgroup, true)?); | |
} | |
// then make a non-consecutive plane-sweep on groups of words separated by stop words. | |
plane_sweep(groups_positions, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it's panicking at:
milli/milli/src/search/criteria/proximity.rs
Line 422 in 488d31e
let q = current[1]; |
Should I make it such that it only calls plane_sweep
when there is >= 2 subgroups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a match should do the job, something like:
match groups_positions.len() {
0 => Ok(vec![]),
1 => Ok(groups_positions.pop().unwrap()),
_ => plane_sweep(groups_positions, false),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Added this in d35afa0. Although I had to change a few things.
@Samyak2, thanks for your PR! |
Originally written by ManyTheFish here: https://gist.github.com/ManyTheFish/f840e37cb2d2e029ce05396b4d540762 Co-authored-by: ManyTheFish <many@meilisearch.com>
Moved the actual test into a separate function used by both the existing test and the new test.
c642cd9
to
488d31e
Compare
^ rebased on main and fixed conflicts |
bors try |
tryBuild failed: |
Co-authored-by: ManyTheFish <many@meilisearch.com>
bors try |
tryBuild succeeded: |
None => return Ok(vec![]), | ||
} | ||
} | ||
groups_positions.push(plane_sweep(subgroup, true)?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A subgroup could contain only one word, isn't it an issue to call a plane-sweep with only one word?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're right. Should I ignore such subgroups or should I push it to groups_positions
directly instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Samyak2,
It looks good to me, I let bors run the tests and if everything goes well, your PR will be merged automatically!
Thank you for your contribution,
bors merge
Build succeeded: |
This message is sent automatically Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest, and you would like to receive some gift from Meilisearch too, please complete this form. |
Thank you for your guidance! This was a really nice contributing experience :) |
Pull Request
This a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll be completing this in a few hours from opening this.
Related issue
Fixes #661 and towards fixing meilisearch/meilisearch#2905
What does this PR do?
Vec<Option<String>>
instead ofVec<String>
whereNone
corresponds to a stop wordresolve_phrase
create_primitive_query
?PR checklist
Please check if your PR fulfills the following requirements: