Remove the "iterative" versions of the search algorithms #3356

Closed
loiclec opened this issue Dec 21, 2022 · 3 comments

loiclec commented Dec 21, 2022

The ranking rules proximity, sort, and attribute have two different implementation strategies. The first one (set-based) queries milli's databases and performs set operations on roaring bitmaps to find buckets of document ids. The second one (iterative) iterates over each candidate document and analyses its contents in order to sort the candidates.
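
For illustration only, here is a rough sketch (not milli's actual code; `set_based_bucket`, `iterative_bucket`, and `rank_of` are hypothetical names) of how the two strategies differ, using the `roaring` crate:

```rust
use roaring::RoaringBitmap;

/// Set-based sketch: the next bucket is computed purely with set operations on
/// bitmaps of document ids read from the databases; document contents are
/// never loaded.
fn set_based_bucket(candidates: &RoaringBitmap, posting_list: &RoaringBitmap) -> RoaringBitmap {
    candidates & posting_list
}

/// Iterative sketch: every candidate document is visited and ranked from its
/// content, so the cost grows with the number of candidates and with the size
/// of each document. `rank_of` stands in for the content analysis.
fn iterative_bucket(candidates: &RoaringBitmap, rank_of: impl Fn(u32) -> u64) -> Vec<u32> {
    let mut ranked: Vec<(u64, u32)> = candidates
        .iter()
        .map(|docid| (rank_of(docid), docid))
        .collect();
    ranked.sort_by_key(|&(rank, _)| rank);
    ranked.into_iter().map(|(_, docid)| docid).collect()
}
```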

Currently, we switch between the set-based and iterative implementation strategy based on the number of candidate documents that need to be sorted. In the proximity criterion, this is done with this constant:

/// Threshold on the number of candidates that will make
/// the system choose between one algorithm or another.
const CANDIDATES_THRESHOLD: u64 = 1000;
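
Concretely, the switch amounts to something like the following sketch (the `Strategy` enum and `choose_strategy` function are hypothetical, shown only to make the dynamic selection explicit):

```rust
use roaring::RoaringBitmap;

const CANDIDATES_THRESHOLD: u64 = 1000;

/// Which implementation ends up being used for a given bucket.
enum Strategy {
    SetBased,
    Iterative,
}

/// Dynamic switch: with few candidates, iterate over the documents themselves;
/// otherwise stay set-based and operate on the databases' bitmaps.
fn choose_strategy(candidates: &RoaringBitmap) -> Strategy {
    if candidates.len() <= CANDIDATES_THRESHOLD {
        Strategy::Iterative
    } else {
        Strategy::SetBased
    }
}
```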

There are, however, a few problems with this approach:

  1. The CANDIDATES_THRESHOLD will always be arbitrary and suboptimal depending on the kind of data that was indexed. Maybe a value of 1000 is the best choice for small documents containing just a few dozen words, but for people with documents that weigh >500 kB, we may opt into the iterative approach too soon and take a heavy performance penalty.

  2. We have to maintain two different implementations and update them both whenever we change the behaviour of a ranking rule, which is difficult. It is also difficult to ensure that both implementations are equivalent. In fact, some ranking rules already behave differently depending on the implementation strategy that was chosen. For example, in proximity, a difference only occurs in some specific cases (e.g. when we have documents/queries with consecutive identical words), which is okay. But for attribute, it appears that there is a large difference between the two implementations.

  3. It is harder to benchmark search requests correctly. We might make a change in the iterative or set-based version of the algorithm, and then misjudge its impact because the alternative implementation is used instead. (This is partly fixed by Add a "Criterion implementation strategy" parameter to Search milli#742.)

  4. It is also harder to detect bugs in the implementation of the ranking rules, for the same reason as in (3).

Ideally, when we refactor the search algorithms, we should aim to make the set-based strategy fast enough that it is reasonable to use it even when sorting only two candidate documents. This would allow us to reduce the size of the code base and make performance/correctness problems more visible.

Additionally, if we remove the iterative versions of the proximity and attribute ranking rules, we can also remove the docid_word_positions database, which will reduce the size of the index.

@loiclec loiclec added maintenance Issue about maintenance (CI, tests, refacto...) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption labels Dec 21, 2022
@loiclec loiclec changed the title Remove "iterative" version of some search algorithms Remove the "iterative" versions of the search algorithms Dec 21, 2022
bors bot referenced this issue in meilisearch/milli Dec 21, 2022
742: Add a "Criterion implementation strategy" parameter to Search r=irevoire a=loiclec

Add a parameter to search requests which determines the implementation strategy of the criteria. This can be either `set-based`, `iterative`, or `dynamic` (i.e. choosing between set-based or iterative at search time). See https://github.com/meilisearch/milli/issues/755 for more context about this change.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
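
As a reference point, the parameter introduced by that PR can be pictured as an enum along these lines (a sketch with approximate names; see milli#742 for the actual API):

```rust
/// Sketch of the strategy parameter added to Search by milli#742
/// (variant names are approximate).
#[derive(Debug, Clone, Copy, Default)]
pub enum CriterionImplementationStrategy {
    /// Always use the set-based implementation of the criteria.
    OnlySetBased,
    /// Always use the iterative implementation of the criteria.
    OnlyIterative,
    /// Pick set-based or iterative at search time, per criterion,
    /// based on the number of candidates (the previous default behaviour).
    #[default]
    Dynamic,
}
```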
@curquiza curquiza added the milli Related to the milli workspace label Jan 16, 2023
@curquiza curquiza transferred this issue from meilisearch/milli Jan 18, 2023

loiclec commented Jan 19, 2023

A data point on the performance of the iterative search algorithms when dealing with mid-size documents (containing a few thousand words):
Searching for `I love` in a dataset I created with random-looking data took 150ms to process (on an M1 MacBook), all of it spent in the iterative versions of proximity and attribute.


loiclec commented Jan 25, 2023

Another data point, this time from a user:

> Hello! I'm doing some experiments with Meilisearch v0.30.4 and am surprised with how long queries are taking (~500ms to search through 10K documents)... any pointers for how to go about debugging this?

> What sort of data is it? What queries are you running, are they big/lots of text? What server specs is it on

> Thanks for the reply! Running in Docker on a box with an 8 core Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz with 64GB of RAM and NVMe disks. The docs themselves do have a fair amount of text, I'd say 64K characters?.... /stats reports the databaseSize of 11427844096 [bytes]

> what sort of queries are producing the latencies?

> any simple one or two word queries, also returning a couple facets


loiclec commented May 15, 2023

Fixed by #3542

@loiclec loiclec closed this as completed May 15, 2023
@curquiza curquiza added this to the v1.2.0 milestone May 15, 2023
@meili-bot meili-bot added the v1.2.0 PRs/issues solved in v1.2.0 released on 2023-06-05 label Jun 13, 2023