This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Fix phrase search containing stop words #664

Merged
merged 17 commits into from
Oct 29, 2022

Conversation

Samyak2
Contributor

@Samyak2 Samyak2 commented Oct 13, 2022

Pull Request

This is a WIP draft PR that I created to let other potential contributors know that I'm working on this issue. I'll complete it within a few hours of opening it.

Related issue

Fixes #661 and works towards fixing meilisearch/meilisearch#2905

What does this PR do?

  • Change the Phrase operation to use a Vec<Option<String>> instead of a Vec<String>, where None corresponds to a stop word
  • Update all other uses of phrase operation
  • Update resolve_phrase
  • Update create_primitive_query?
  • Add test
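
To illustrate the first item in the list above, here is a minimal sketch of a phrase that keeps a `None` slot for each stop word, so that the position of every remaining word is preserved. The `Operation` enum and the `build_phrase` helper are simplified, hypothetical stand-ins written for this example; they are not milli's actual types or functions.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the query-tree phrase operation.
#[derive(Debug, PartialEq)]
enum Operation {
    /// `None` marks a slot where a stop word was removed.
    Phrase(Vec<Option<String>>),
}

/// Build a phrase from tokens, replacing each stop word with `None`
/// instead of dropping it, so the slot count matches the token count.
fn build_phrase(tokens: &[&str], stop_words: &HashSet<&str>) -> Operation {
    let words = tokens
        .iter()
        .map(|&t| if stop_words.contains(t) { None } else { Some(t.to_string()) })
        .collect();
    Operation::Phrase(words)
}
```

With "of" and "the" as stop words, the tokens "lord of the rings" would become a phrase of four slots: a word, two `None` entries, and a word.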

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

@Samyak2 Samyak2 force-pushed the fix-phrase-search-stop-words branch from 94957dd to b34fbcf Compare October 13, 2022 13:45
@Samyak2
Contributor Author

Samyak2 commented Oct 13, 2022

@ManyTheFish I have completed all the steps and I think it's implemented correctly, but that test (cargo test test_phrase_search_with_stop_words) still fails with 0 matches instead of 1. What am I missing here?

Member

@ManyTheFish ManyTheFish left a comment

This PR is not the easiest one.
I suggest starting by fixing resolve_phrase; it should help you with the other steps!

milli/src/search/criteria/exactness.rs Outdated Show resolved Hide resolved
milli/src/search/criteria/mod.rs Outdated Show resolved Hide resolved
milli/src/search/criteria/proximity.rs Outdated Show resolved Hide resolved
@@ -473,7 +477,7 @@ fn resolve_plane_sweep_candidates(
             }
             Phrase(words) => {
                 let mut groups_positions = Vec::with_capacity(words.len());
-                for word in words {
+                for word in words.iter().filter_map(|w| w.as_ref()) {
Member

This one is a bit tricky: the flag passed to the plane_sweep call below can't be set to true if at least one stop word is surrounded by words; otherwise, we set it to true.
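
As a rough illustration of the condition described in this comment, the sketch below checks whether any stop-word slot (`None`) sits between two real words. The helper name `has_inner_stop_word` is hypothetical and does not come from the PR; the flag for the plane sweep would then be the negation of its result.

```rust
/// Returns true when at least one `None` (stop word) sits between two real words.
/// Leading and trailing stop words do not count as "surrounded".
fn has_inner_stop_word(words: &[Option<String>]) -> bool {
    // Index of the first and last real word, if any.
    let first = words.iter().position(Option::is_some);
    let last = words.iter().rposition(Option::is_some);
    match (first, last) {
        // Look for a stop-word slot strictly inside the span of real words.
        (Some(first), Some(last)) => words[first..=last].iter().any(Option::is_none),
        // No real word at all: nothing can be surrounded.
        _ => false,
    }
}
```

Under this assumption, a phrase such as `[Some, None, Some]` would need a non-consecutive sweep, while `[None, Some, Some, None]` could still be treated as consecutive.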

Member

I reiterate my request: I'm pretty sure that it won't return any documents if we don't conditionally change line 485.

Contributor Author

I have implemented it: 4705368

It's a bit ugly though. I couldn't find a better way to do it.

Also, how do I test this? The existing test already has a stop word surrounded by words but it doesn't seem to fail without this. Am I missing something?

milli/src/search/criteria/typo.rs Outdated Show resolved Hide resolved
milli/src/search/query_tree.rs Show resolved Hide resolved
Comment on lines 594 to 596
let words = words
    .into_iter()
    .filter_map(|w| w)
    .map(|w| MatchingWord::new(w.to_string(), 0, false))
    .collect();
Member

I am a bit wary of this one; let's come back to it later.

milli/src/search/query_tree.rs Outdated Show resolved Hide resolved
milli/tests/search/phrase_search.rs Outdated Show resolved Hide resolved
milli/src/search/criteria/exactness.rs Outdated Show resolved Hide resolved
@ManyTheFish ManyTheFish added the no breaking The related changes are not breaking (DB nor API) label Oct 17, 2022
milli/tests/search/phrase_search.rs Outdated Show resolved Hide resolved
@Samyak2
Contributor Author

Samyak2 commented Oct 20, 2022

Just to update, I'm still working on this. I couldn't make any progress because of other commitments, but I have picked it up again now. I'll update the PR in the next few hours.

@Samyak2 Samyak2 marked this pull request as ready for review October 20, 2022 13:38
@Samyak2 Samyak2 changed the title [WIP] Fix phrase search containing stop words Fix phrase search containing stop words Oct 21, 2022
Member

@ManyTheFish ManyTheFish left a comment

Hello @Samyak2,
Nice job! I requested one last change; we should be able to merge your PR after that!

Comment on lines 481 to 502
let mut consecutive = true;
let mut was_last_word_a_stop_word = false;
for word in words.iter() {
    if let Some(word) = word {
        let positions = match words_positions.get(word) {
            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
            None => return Ok(vec![]),
        };
        groups_positions.push(positions);

        if was_last_word_a_stop_word {
            consecutive = false;
        }
        was_last_word_a_stop_word = false;
    } else {
        if !was_last_word_a_stop_word {
            consecutive = false;
        }

        was_last_word_a_stop_word = true;
    }
}
Member

This code block could be refactored like:

Suggested change
let mut consecutive = true;
let mut contains_stop_word = false;
for word in words.iter().skip_while(Option::is_none) {
    match word {
        Some(word) => {
            let positions = match words_positions.get(word) {
                Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
                None => return Ok(vec![]),
            };
            groups_positions.push(positions);
            // if there is at least one stop word between words,
            // then words are not considered consecutive.
            consecutive = !contains_stop_word;
        },
        None => contains_stop_word = true,
    }
}

However, we could probably be more clever by using slice_group_by:

Suggested change
// group stop_words together.
for words in words.linear_group_by_key(Option::is_none) {
    // skip if it's a group of stop words.
    if words.first().flatten().is_none() {
        continue;
    }
    // make a consecutive plane-sweep on the subgroup of words.
    let mut subgroup = Vec::with_capacity(words.len());
    for word in words {
        let positions = match words_positions.get(word) {
            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
            None => return Ok(vec![]),
        };
        subgroup.push(positions);
    }
    groups_positions.push(plane_sweep(subgroup, true)?);
}
// then make a non-consecutive plane-sweep on groups of words separated by stop words.
plane_sweep(groups_positions, false)
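
The grouping step in the suggestion above can also be sketched without the slice_group_by crate, using the standard library's `slice::split`. This is only an illustration of how a phrase breaks into runs of real words separated by stop-word slots; `word_groups` is a hypothetical helper, and the position lookups and plane sweeps of the real code are left out.

```rust
/// Split a phrase into runs of consecutive real words.
/// Each `None` (stop word) acts as a separator; runs of several stop words
/// produce empty slices, which are filtered out.
fn word_groups(words: &[Option<String>]) -> Vec<Vec<&str>> {
    words
        .split(|w| w.is_none())
        .filter(|group| !group.is_empty())
        // Every slot left in a group is `Some`, so `flatten` just unwraps it.
        .map(|group| group.iter().flatten().map(String::as_str).collect())
        .collect()
}
```

Each inner vector would correspond to one subgroup that gets a consecutive sweep, and the outer vector to the groups that are then combined non-consecutively.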

Contributor Author

I tried this, but it's panicking at:

let q = current[1];

Should I make it so that it only calls plane_sweep when there are >= 2 subgroups?

Member

Yes, a match should do the job, something like:

match groups_positions.len() {
    0 => Ok(vec![]),
    1 => Ok(groups_positions.pop().unwrap()),
    _ => plane_sweep(groups_positions, false),
}
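
A self-contained version of the same guard might look like the sketch below. It is generic on purpose: `resolve_groups` and the `combine` callback are hypothetical names, with `combine` standing in for a merge step such as a plane sweep that needs at least two groups to work on.

```rust
/// Combine groups without handing fewer than two of them to `combine`.
/// `combine` is a stand-in for a multi-group merge such as a plane sweep.
fn resolve_groups<T>(
    mut groups: Vec<Vec<T>>,
    combine: impl FnOnce(Vec<Vec<T>>) -> Vec<T>,
) -> Vec<T> {
    match groups.len() {
        // Nothing to combine: no candidates.
        0 => Vec::new(),
        // A single group is already the answer; the length check makes `pop` safe.
        1 => groups.pop().unwrap(),
        // Two or more groups: delegate to the real merge.
        _ => combine(groups),
    }
}
```

The point of the design is that the merge step never has to index into a second group that does not exist, which is the kind of panic reported earlier in this thread.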

Contributor Author

Thank you! I added this in d35afa0, although I had to change a few things.

@curquiza
Member

@Samyak2, thanks for your PR!
Can you also fix the git conflicts please? 😊

@Samyak2 Samyak2 force-pushed the fix-phrase-search-stop-words branch from c642cd9 to 488d31e Compare October 26, 2022 13:40
@Samyak2
Contributor Author

Samyak2 commented Oct 26, 2022

^ rebased on main and fixed conflicts

@curquiza
Member

bors try

bors bot added a commit that referenced this pull request Oct 26, 2022
@bors
Contributor

bors bot commented Oct 26, 2022

try

Build failed:

@curquiza
Member

bors try

bors bot added a commit that referenced this pull request Oct 27, 2022
@bors
Contributor

bors bot commented Oct 27, 2022

None => return Ok(vec![]),
}
}
groups_positions.push(plane_sweep(subgroup, true)?);
Member

A subgroup could contain only one word; isn't it an issue to call plane_sweep with only one word?

Contributor Author

Oh, you're right. Should I ignore such subgroups, or should I push them to groups_positions directly instead?

Member

Yes!

Contributor Author

03eb5d8

Is this what you meant?

Member

@ManyTheFish ManyTheFish left a comment

Hello @Samyak2,
It looks good to me. I'll let bors run the tests, and if everything goes well, your PR will be merged automatically!
Thank you for your contribution!

bors merge

@bors
Contributor

bors bot commented Oct 29, 2022

@bors bors bot merged commit c965200 into meilisearch:main Oct 29, 2022
@meili-bot
Contributor

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest, and you would like to receive some gift from Meilisearch too, please complete this form.

@Samyak2
Contributor Author

Samyak2 commented Oct 29, 2022

Thank you for your guidance! This was a really nice contributing experience :)

Labels
no breaking The related changes are not breaking (DB nor API)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Phrase search containing stop words never retrieve any documents
4 participants