Fix phrase search containing stop words #664

Samyak2 · 2022-10-13T12:41:55Z

Pull Request

This is a WIP draft PR that I created to let other potential contributors know that I'm working on this issue. I'll complete it within a few hours of opening it.

Related issue

Fixes #661 and works towards fixing meilisearch/meilisearch#2905

What does this PR do?

Change Phrase Operation to use a Vec<Option<String>> instead of Vec<String> where None corresponds to a stop word
Update all other uses of phrase operation
Update resolve_phrase
Update create_primitive_query?
Add test
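
To illustrate the first item, here is a minimal sketch of a phrase in which stop words become `None`. The helper name `phrase_with_stop_words` and the use of a `HashSet` for the stop words are assumptions made for this example; they are not milli's actual code.

```rust
use std::collections::HashSet;

/// Hypothetical helper: turn the tokens of a quoted phrase into the
/// `Vec<Option<String>>` shape described above, where `None` marks a stop word.
fn phrase_with_stop_words(tokens: &[&str], stop_words: &HashSet<&str>) -> Vec<Option<String>> {
    tokens
        .iter()
        .map(|token| {
            if stop_words.contains(token) {
                // Keep the slot so the relative positions of the other words survive.
                None
            } else {
                Some(token.to_string())
            }
        })
        .collect()
}
```

Keeping a `None` placeholder instead of dropping the stop word means later steps can still tell that two words were separated by something in the original phrase.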

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

Samyak2 · 2022-10-13T13:46:50Z

@ManyTheFish I have completed all the steps and I think it's implemented correctly, but that test (cargo test test_phrase_search_with_stop_words) still fails with 0 matches instead of 1. What am I missing here?

milli/src/search/criteria/exactness.rs

milli/src/search/criteria/typo.rs

ManyTheFish

This PR is not the easiest one.
I suggest starting by fixing resolve_phrase; it should help you with the other steps!

milli/src/search/criteria/exactness.rs

milli/src/search/criteria/mod.rs

milli/src/search/criteria/proximity.rs

ManyTheFish · 2022-10-13T14:53:42Z

milli/src/search/criteria/proximity.rs

@@ -473,7 +477,7 @@ fn resolve_plane_sweep_candidates(
            }
            Phrase(words) => {
                let mut groups_positions = Vec::with_capacity(words.len());
-                for word in words {
+                for word in words.iter().filter_map(|w| w.as_ref()) {


This one is a bit tricky: the boolean passed to the plane_sweep call below can't be true if at least one stop word is surrounded by words; otherwise, we set it to true.
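
The rule stated in this comment can be sketched as a small standalone check. The name `is_consecutive` is hypothetical, and this is only one reading of "a stop word surrounded by words" (a `None` somewhere between the first and last real word), not the code that was merged.

```rust
/// Hypothetical helper: returns `true` when no stop word (`None`) sits
/// between two real words of the phrase.
fn is_consecutive(words: &[Option<String>]) -> bool {
    // Index of the first and last real word, if any.
    let first = words.iter().position(Option::is_some);
    let last = words.iter().rposition(Option::is_some);
    match (first, last) {
        // Only the span between the first and last real word matters:
        // leading and trailing stop words are not surrounded by words.
        (Some(first), Some(last)) => words[first..=last].iter().all(Option::is_some),
        // A phrase made only of stop words has nothing to separate.
        _ => true,
    }
}
```

Under this reading, a phrase such as `[Some("how"), None, Some("train")]` is not consecutive, while `[None, Some("how"), Some("train")]` still is.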

I reiterate my request: I'm pretty sure it won't return any documents unless we conditionally change line 485.

I have implemented it: 4705368

It's a bit ugly though. I couldn't find a better way to do it.

Also, how do I test this? The existing test already has a stop word surrounded by words but it doesn't seem to fail without this. Am I missing something?

milli/src/search/criteria/typo.rs

milli/src/search/query_tree.rs

ManyTheFish · 2022-10-13T15:27:35Z

milli/src/search/query_tree.rs

+                let words = words
+                    .into_iter()
+                    .filter_map(|w| w)
+                    .map(|w| MatchingWord::new(w.to_string(), 0, false))
+                    .collect();


I am a bit afraid of this one; let's come back to it later.

milli/src/search/query_tree.rs

milli/tests/search/phrase_search.rs

milli/src/search/criteria/exactness.rs

ManyTheFish · 2022-10-20T07:24:35Z

milli/tests/search/phrase_search.rs

Samyak2 · 2022-10-20T12:56:29Z

Just to update, I'm still working on this. I couldn't make any progress because of other commitments, but I have picked it up again now. I'll update the PR in the next few hours.

ManyTheFish

Hello @Samyak2,
Nice job! I requested one last change; we should be able to merge your PR after that!

ManyTheFish · 2022-10-25T15:59:59Z

milli/src/search/criteria/proximity.rs

+                let mut consecutive = true;
+                let mut was_last_word_a_stop_word = false;
+                for word in words.iter() {
+                    if let Some(word) = word {
+                        let positions = match words_positions.get(word) {
+                            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
+                            None => return Ok(vec![]),
+                        };
+                        groups_positions.push(positions);
+
+                        if was_last_word_a_stop_word {
+                            consecutive = false;
+                        }
+                        was_last_word_a_stop_word = false;
+                    } else {
+                        if !was_last_word_a_stop_word {
+                            consecutive = false;
+                        }
+
+                        was_last_word_a_stop_word = true;
+                    }
                }


This code block could be refactored like:

Suggested change

-                let mut consecutive = true;
-                let mut was_last_word_a_stop_word = false;
-                for word in words.iter() {
-                    if let Some(word) = word {
-                        let positions = match words_positions.get(word) {
-                            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
-                            None => return Ok(vec![]),
-                        };
-                        groups_positions.push(positions);
-                        if was_last_word_a_stop_word {
-                            consecutive = false;
-                        }
-                        was_last_word_a_stop_word = false;
-                    } else {
-                        if !was_last_word_a_stop_word {
-                            consecutive = false;
-                        }
-                        was_last_word_a_stop_word = true;
-                    }
-                }
+                let mut consecutive = true;
+                let mut contains_stop_word = false;
+                for word in words.iter().skip_while(Option::is_none) {
+                    match word {
+                        Some(word) => {
+                            let positions = match words_positions.get(word) {
+                                Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
+                                None => return Ok(vec![]),
+                            };
+                            groups_positions.push(positions);
+                            // if there is at least one stop word between words,
+                            // then words are not considered consecutives.
+                            consecutive = !contains_stop_word;
+                        },
+                        None => contains_stop_word = true,
+                    }
+                }

However, we could probably be cleverer by using slice_group_by:

Suggested change

-                let mut consecutive = true;
-                let mut was_last_word_a_stop_word = false;
-                for word in words.iter() {
-                    if let Some(word) = word {
-                        let positions = match words_positions.get(word) {
-                            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
-                            None => return Ok(vec![]),
-                        };
-                        groups_positions.push(positions);
-                        if was_last_word_a_stop_word {
-                            consecutive = false;
-                        }
-                        was_last_word_a_stop_word = false;
-                    } else {
-                        if !was_last_word_a_stop_word {
-                            consecutive = false;
-                        }
-                        was_last_word_a_stop_word = true;
-                    }
-                }
+                // group stop_words together.
+                for words in words.linear_group_by_key(Option::is_none) {
+                    // skip if it's a group of stop words.
+                    if words.first().flatten().is_none() {
+                        continue;
+                    }
+                    // make a consecutive plane-sweep on the subgroup of words.
+                    let mut subgroup = Vec::with_capacity(words.len());
+                    for word in words {
+                        let positions = match words_positions.get(word) {
+                            Some(positions) => positions.iter().map(|p| (p, 0, p)).collect(),
+                            None => return Ok(vec![]),
+                        };
+                        subgroup.push(positions);
+                    }
+                    groups_positions.push(plane_sweep(subgroup, true)?);
+                }
+                // then make a non-consecutive plane-sweep on groups of words separated by stop words.
+                plane_sweep(groups_positions, false)
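
The grouping idea in this suggestion can be tried in isolation. The sketch below swaps in the standard library's `slice::split` in place of the `slice_group_by` crate so that it has no dependencies; the helper name `word_groups` is hypothetical, and the position lookup and plane-sweep steps are left out.

```rust
/// Hypothetical helper: split a phrase into runs of consecutive real words,
/// using stop words (`None`) as separators.
fn word_groups(words: &[Option<String>]) -> Vec<Vec<&str>> {
    words
        // `slice::split` cuts the slice at every element matching the predicate.
        .split(|word| word.is_none())
        // Adjacent, leading, or trailing stop words produce empty runs; drop them.
        .filter(|group| !group.is_empty())
        // Every remaining element is `Some`, so `flatten` only unwraps.
        .map(|group| group.iter().flatten().map(String::as_str).collect())
        .collect()
}
```

Each returned run corresponds to a subgroup that would be swept with `consecutive = true`, and the runs themselves would then be combined non-consecutively, as the suggestion describes.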

I tried this, but it's panicking at line 422 of milli/milli/src/search/criteria/proximity.rs (in 488d31e):

                let q = current[1];

Should I make it call plane_sweep only when there are 2 or more subgroups?

Yes, a match should do the job, something like:

                match groups_positions.len() {
                    0 => Ok(vec![]),
                    1 => Ok(groups_positions.pop().unwrap()),
                    _ => plane_sweep(groups_positions, false),
                }
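
The guard above can be exercised on its own. In this sketch, `combine_groups` is a hypothetical wrapper and `sweep` is a stand-in closure for `plane_sweep`, whose real signature is not reproduced here.

```rust
/// Hypothetical wrapper: only call `sweep` (a stand-in for `plane_sweep`)
/// when there are at least two groups to combine.
fn combine_groups<T>(
    mut groups: Vec<Vec<T>>,
    sweep: impl FnOnce(Vec<Vec<T>>) -> Vec<T>,
) -> Vec<T> {
    match groups.len() {
        // No real word at all: nothing can match.
        0 => Vec::new(),
        // A single group needs no sweep; return it as-is.
        1 => groups.pop().unwrap(),
        // Two or more groups: safe to hand over to the sweep.
        _ => sweep(groups),
    }
}
```

With this shape, a sweep that indexes its second group (as `current[1]` does in the panic reported above) is never reached with fewer than two groups.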

Thank you! Added this in d35afa0. Although I had to change a few things.

curquiza · 2022-10-25T19:55:50Z

@Samyak2, thanks for your PR!
Can you also fix the git conflicts please? 😊

Samyak2 · 2022-10-26T13:40:50Z

^ rebased on main and fixed conflicts

curquiza · 2022-10-26T15:34:25Z

bors try

bors · 2022-10-26T15:45:04Z

try

Build failed:

Tests on windows-latest

curquiza · 2022-10-27T07:23:15Z

bors try

bors · 2022-10-27T07:55:13Z

try

Build succeeded:

ManyTheFish · 2022-10-27T09:07:07Z

milli/src/search/criteria/proximity.rs

+                            None => return Ok(vec![]),
+                        }
+                    }
+                    groups_positions.push(plane_sweep(subgroup, true)?);


A subgroup could contain only one word; isn't it a problem to call plane_sweep with only one word?

Oh you're right. Should I ignore such subgroups or should I push it to groups_positions directly instead?

03eb5d8

Is this what you meant?

ManyTheFish

Hello @Samyak2,
It looks good to me. I'll let bors run the tests, and if everything goes well, your PR will be merged automatically!
Thank you for your contribution.

bors merge

bors · 2022-10-29T14:15:37Z

Build succeeded:

meili-bot · 2022-10-29T14:15:42Z

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest, and you would like to receive some gift from Meilisearch too, please complete this form.

Samyak2 · 2022-10-29T15:27:42Z

Thank you for your guidance! This was a really nice contributing experience :)

Samyak2 force-pushed the fix-phrase-search-stop-words branch from 94957dd to b34fbcf Compare October 13, 2022 13:45

ManyTheFish suggested changes Oct 13, 2022

ManyTheFish added the no breaking The related changes are not breaking (DB nor API) label Oct 17, 2022

ManyTheFish suggested changes Oct 20, 2022

Samyak2 requested a review from ManyTheFish October 20, 2022 13:38

Samyak2 marked this pull request as ready for review October 20, 2022 13:38

Samyak2 changed the title ~~[WIP] Fix phrase search containing stop words~~ Fix phrase search containing stop words Oct 21, 2022

ManyTheFish suggested changes Oct 25, 2022

Samyak2 and others added 13 commits October 26, 2022 19:08

[WIP] Fix phrase search containing stop words

62816dd

Fixes meilisearch#661 and meilisearch/meilisearch#2905

Add test for phrase search with stop words

6a10b67

Originally written by ManyTheFish here: https://gist.github.com/ManyTheFish/f840e37cb2d2e029ce05396b4d540762 Co-authored-by: ManyTheFish <many@meilisearch.com>

Perform filter after enumerate to keep origin indices

ef13c6a

Increment position even when it's a stop word in exactness criteria

709ab3c

Search for closest non-stop words in proximity criteria

3e19050

Use resolve_phrase in exactness and typo criteria

c8c666c

Fix snapshots to use new phrase type

d187b32

Run cargo fmt

bb9ce3c

Fix panic when phrase contains only one stop word and nothing else

2aa11af

Simplify stop word checking in create_primitive_query

77f1ff0

Add test for phrase search with stop words and all criteria at once

f1da623

Moved the actual test into a separate function used by both the existing test and the new test.

Consecutive is false when at least 1 stop word is surrounded by words

af33d22

Run cargo fmt

488d31e

Samyak2 force-pushed the fix-phrase-search-stop-words branch from c642cd9 to 488d31e Compare October 26, 2022 13:40

bors bot added a commit that referenced this pull request Oct 26, 2022

Try #664:

32742da

Samyak2 and others added 2 commits October 26, 2022 23:07

Update phrase search to use new execute method

752d031

Change consecutive phrase search grouping logic

d35afa0

Co-authored-by: ManyTheFish <many@meilisearch.com>

bors bot added a commit that referenced this pull request Oct 27, 2022

Try #664:

0a7c660

ManyTheFish reviewed Oct 27, 2022

Samyak2 added 2 commits October 28, 2022 19:32

Only call plane_sweep on subgroups when 2 or more are present

03eb5d8

Run cargo fmt

ecb8814

ManyTheFish approved these changes Oct 29, 2022

bors bot merged commit c965200 into meilisearch:main Oct 29, 2022
