Refactor of the search algorithms #3542

loiclec · 2023-02-27T09:57:38Z

This PR refactors a large part of the search logic (related to #3547)

The "query tree" is replaced by a "query graph", which describes the different ways in which the search query can be interpreted and precomputes the word derivations for each query term. Example:
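As a rough illustration of the idea (a minimal sketch, not the `QueryGraph` type introduced by this PR; every name below is hypothetical), a query such as `sun flower` can be modelled as a small graph in which each path from the start node to the end node is one interpretation, and each term node carries its precomputed derivations:

```rust
/// A node of the toy query graph (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Start,
    /// A query term together with its precomputed word derivations.
    Term { text: String, derivations: Vec<String> },
    End,
}

struct QueryGraph {
    /// By convention in this sketch, node 0 is `Start`.
    nodes: Vec<Node>,
    /// `successors[i]` lists the nodes reachable from node `i`.
    successors: Vec<Vec<usize>>,
}

impl QueryGraph {
    /// Enumerates every Start -> End path as a list of term texts.
    fn interpretations(&self) -> Vec<Vec<String>> {
        let mut found = Vec::new();
        let mut stack: Vec<(usize, Vec<String>)> = vec![(0, Vec::new())];
        while let Some((idx, path)) = stack.pop() {
            if self.nodes[idx] == Node::End {
                found.push(path);
                continue;
            }
            for &next in &self.successors[idx] {
                let mut extended = path.clone();
                if let Node::Term { text, .. } = &self.nodes[next] {
                    extended.push(text.clone());
                }
                stack.push((next, extended));
            }
        }
        found.sort();
        found
    }

    /// Returns the derivations stored for a term, or an empty slice.
    fn derivations_of(&self, term: &str) -> &[String] {
        for node in &self.nodes {
            if let Node::Term { text, derivations } = node {
                if text.as_str() == term {
                    return derivations;
                }
            }
        }
        &[]
    }
}

fn term(text: &str, derivations: &[&str]) -> Node {
    Node::Term {
        text: text.to_string(),
        derivations: derivations.iter().map(|d| d.to_string()).collect(),
    }
}

/// Graph for the query "sun flower": either two words, or the ngram "sunflower".
/// The derivation lists are made up for the example.
fn sun_flower_graph() -> QueryGraph {
    QueryGraph {
        nodes: vec![
            Node::Start,
            term("sun", &["sun", "sin", "sunny"]),
            term("flower", &["flower", "flowers"]),
            term("sunflower", &["sunflower", "sunflowers"]),
            Node::End,
        ],
        successors: vec![vec![1, 3], vec![2], vec![4], vec![4], vec![]],
    }
}
```

Calling `sun_flower_graph().interpretations()` yields the two readings `["sun", "flower"]` and `["sunflower"]`. The derivations are computed once per node and shared by every path that goes through it.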

The control flow between the ~~criterions~~ ranking rules is managed in a single place instead of being independently implemented by each ranking rule.
The set of document candidates is determined greedily from the beginning. It is often referred to as the "universe" in the code.
The ranking rules proximity, attribute, typo, and (maybe) exactness are or will be implemented using a K-shortest path graph algorithm. This minimises the number of database and bitmap operations we need to do to compute each ranking rule bucket. It also simplifies the code a lot since a lot of ranking rules will share a large part of their implementation.
Pointers to database values are stored in a cache to avoid searching in the LMDB databases needlessly.
The results of some roaring bitmap operations are also stored in a cache, although we'll need to measure the memory pressure this puts on the system and maybe deactivate this cache later on.
Search requests can be visually logged and debugged in tests.
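To make the "single place" for control flow more concrete, here is a minimal sketch of a bucket sort over a universe of candidates. It is not the code in `bucket_sort.rs`: the `RankingRule` trait, the `ByKey` rule and the use of `BTreeSet` instead of roaring bitmaps are all assumptions made to keep the example self-contained.

```rust
use std::collections::{BTreeMap, BTreeSet};

type DocId = u32;
/// The "universe": the set of candidate documents still in play.
type Universe = BTreeSet<DocId>;

/// Hypothetical, simplified ranking rule: it only knows how to split
/// a universe into ordered buckets, not how to drive the search.
trait RankingRule {
    fn buckets(&self, universe: &Universe) -> Vec<Universe>;
}

/// Toy rule that groups documents by a key, lowest key first.
struct ByKey<F: Fn(DocId) -> u32>(F);

impl<F: Fn(DocId) -> u32> RankingRule for ByKey<F> {
    fn buckets(&self, universe: &Universe) -> Vec<Universe> {
        let mut groups: BTreeMap<u32, Universe> = BTreeMap::new();
        for &doc in universe {
            groups.entry((self.0)(doc)).or_default().insert(doc);
        }
        groups.into_values().collect()
    }
}

/// The control flow lives here only: each bucket of a rule becomes the
/// universe of the next rule, until `limit` documents are collected.
fn bucket_sort(
    rules: &[Box<dyn RankingRule>],
    universe: &Universe,
    limit: usize,
    out: &mut Vec<DocId>,
) {
    if universe.is_empty() || out.len() >= limit {
        return;
    }
    match rules.split_first() {
        // No rule left: the remaining documents are tied.
        None => {
            let remaining = limit - out.len();
            out.extend(universe.iter().copied().take(remaining));
        }
        Some((rule, children)) => {
            for bucket in rule.buckets(universe) {
                bucket_sort(children, &bucket, limit, out);
                if out.len() >= limit {
                    return;
                }
            }
        }
    }
}

fn search(rules: &[Box<dyn RankingRule>], universe: &Universe, limit: usize) -> Vec<DocId> {
    let mut out = Vec::new();
    bucket_sort(rules, universe, limit, &mut out);
    out
}
```

With a first rule keyed on `d % 2` and a second keyed on `10 - d`, searching the universe `1..=6` with a limit of 4 returns `[6, 4, 2, 5]`: the second rule only ever sees the bucket handed down by the first.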

TODO:

The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.

dureuill · 2023-05-02T13:01:42Z

milli/src/search/new/query_term/parse_query.rs

+    let min_len_one_typo = ctx.index.min_word_len_one_typo(ctx.txn)?;
+    let min_len_two_typos = ctx.index.min_word_len_two_typos(ctx.txn)?;
+
+    // TODO: should `exact_words` also disable prefix search, ngrams, split words, or synonyms?


nitpick: @loiclec what should we do with this comment?

dureuill · 2023-05-02T13:14:25Z

milli/src/search/new/query_term/parse_query.rs

+    let ngram_str_interned = ctx.word_interner.insert(ngram_str.clone());
+
+    let max_nbr_typos =
+        number_of_typos_allowed(ngram_str.as_str()).saturating_sub(terms.len() as u8 - 1);


relevancy change: @loiclec does that mean that trigrams now always have 0 allowed typos? If so, @ManyTheFish reports that this is a relevancy change, as both trigrams and digrams would sometimes allow 1 typo in the previous implementation.
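The arithmetic behind the question can be checked with a small sketch. The `number_of_typos_allowed` below is a hypothetical stand-in for the one in the diff (assumed: a word never gets more than 2 typos, with length thresholds of 5 and 9), and `max_typos_for_ngram` mirrors the `saturating_sub` expression quoted above.

```rust
/// Hypothetical stand-in: 0, 1 or 2 typos depending on the word length.
/// The thresholds are passed in; 5 and 9 are assumed values here.
fn number_of_typos_allowed(word: &str, min_len_one_typo: u8, min_len_two_typos: u8) -> u8 {
    let len = word.chars().count();
    if len >= min_len_two_typos as usize {
        2
    } else if len >= min_len_one_typo as usize {
        1
    } else {
        0
    }
}

/// Same shape as the expression in the diff: each extra term merged
/// into the ngram costs one allowed typo.
fn max_typos_for_ngram(ngram: &str, nbr_terms: usize) -> u8 {
    // `saturating_sub(1)` guards the sketch against `nbr_terms == 0`.
    let merged_terms = (nbr_terms as u8).saturating_sub(1);
    number_of_typos_allowed(ngram, 5, 9).saturating_sub(merged_terms)
}
```

Under these assumptions a digram can keep at most 1 typo (`2 - 1`) and a trigram always ends up with 0 (`2 - 2`), whatever its length.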

dureuill · 2023-05-02T13:58:55Z

milli/src/search/new/bucket_sort.rs

+    /// Finish iterating over the current ranking rule, yielding
+    /// control to the parent (or finishing the search if not possible).
+    /// Update the universes accordingly and inform the logger.
+    macro_rules! back {


nitpick: @Kerollmops replace with ControlFlow?
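For reference, a minimal sketch of what the `ControlFlow` suggestion could look like. `SearchState` and its fields are hypothetical and much simpler than the real state handled by `back!`; the point is only that a function returning `std::ops::ControlFlow` lets the caller decide whether to break out of the loop, which a macro would otherwise do implicitly.

```rust
use std::ops::ControlFlow;

/// Hypothetical, simplified search state.
struct SearchState {
    /// Index of the ranking rule currently being iterated.
    depth: usize,
    /// Stand-in for the logger mentioned in the doc comment.
    log: Vec<String>,
}

/// Function equivalent of a `back!`-style macro: yield control to the
/// parent rule, or signal that the search is finished.
fn back(state: &mut SearchState) -> ControlFlow<()> {
    if state.depth == 0 {
        state.log.push("search finished".to_string());
        return ControlFlow::Break(());
    }
    state.depth -= 1;
    state.log.push(format!("back to rule {}", state.depth));
    ControlFlow::Continue(())
}

/// The caller makes the loop exit explicit.
fn unwind(state: &mut SearchState) -> usize {
    let mut steps = 0;
    loop {
        match back(state) {
            ControlFlow::Break(()) => break,
            ControlFlow::Continue(()) => steps += 1,
        }
    }
    steps
}
```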

dureuill · 2023-05-02T14:02:28Z

milli/src/search/new/bucket_sort.rs

+    let mut valid_docids = vec![];
+    let mut cur_offset = 0usize;
+
+    macro_rules! maybe_add_to_results {


nitpick: Replace macro with inline calls to the inner function?

dureuill · 2023-05-02T15:03:48Z

milli/src/search/new/exact_attribute.rs

+
+/// A ranking rule that produces 3 disjoint buckets:
+///
+/// 1. Documents from the universe whose value is exactly the query.


relevancy change: in the previous implementation, the first 2 heuristics were only run on the initial query with all the words, whereas in the new implementation, they will run on queries modified by a prior invocation of the words ranking rule, e.g. removing words at the end of the query.

If the invocation left a "hole" in the query (e.g. frequency removal), then the 2 heuristics will not discriminate between documents. Otherwise, however, they will run again, so that:

Query: Batman the dark knight rises

Word iter 1: Batman the dark knight rises

Exactness iter 1: Batman the dark knight rises

Word iter 2: Batman the dark knight

Exactness iter 2: Batman the dark knight

Exactness iter 3: Batman the dark knight crashes

dureuill · 2023-05-02T15:19:21Z

milli/src/search/new/ranking_rule_graph/typo/mod.rs

@@ -0,0 +1,78 @@
+use roaring::RoaringBitmap;


relevancy changes:

a split word is now considered as having 1 typo

a digram can have split words too: whit ehorse (FIXME: consider allowing for split words in trigrams e.g. white ehor se)

Kerollmops · 2023-05-02T15:44:48Z

milli/src/search/new/ranking_rule_graph/position/mod.rs

@@ -0,0 +1,133 @@
+use fxhash::{FxHashMap, FxHashSet};


Query words that match far into a field will be bucketed together if their positions differ only slightly. Query words that match near the front of the field will remain precise enough (see cost_from_position).

Should we change the way we aggregate the positions?
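To illustrate the bucketing effect described here, a minimal sketch of a position-to-cost mapping. This is not the actual `cost_from_position`: `bucketed_cost` and its formula are assumptions chosen only to show exact costs near the front of a field and logarithmic buckets further in.

```rust
/// Hypothetical cost function: exact for the first positions, then
/// one bucket per power of two.
fn bucketed_cost(position: u32) -> u32 {
    if position < 4 {
        // Near the front of the field, every position has its own cost.
        position
    } else {
        // floor(log2(position)), valid because position >= 4 here.
        let log2 = 31 - position.leading_zeros();
        2 + log2
    }
}
```

With this mapping positions 0 to 3 stay distinct, 4 to 7 share cost 4, and positions 1000 and 1010 both get cost 11, so two documents matching that far into the field land in the same bucket.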

Conflicts | resolution
----------|-----------
Cargo.lock | added mimalloc
Cargo.toml | took origin/main version
milli/src/search/criteria/exactness.rs | deleted after checking it was only clippy changes
milli/src/search/query_tree.rs | deleted after checking it was only clippy changes

dureuill

bors merge

meili-bors · 2023-05-03T14:30:01Z

Build succeeded:

loiclec added the performance and search relevancy labels Feb 27, 2023

loiclec mentioned this pull request Feb 28, 2023

Search Relevancy & Performance Improvements #3547

Closed

loiclec force-pushed the search-refactor branch 5 times, most recently from ac8687b to 2b6814d Compare March 13, 2023 16:22

loiclec force-pushed the search-refactor branch 4 times, most recently from 78f7fe9 to 7b160b2 Compare March 16, 2023 11:26

loiclec added 18 commits March 20, 2023 09:41

Use MiMalloc in milli tests

6c659dc

Temporarily remove codegen-units - 1

1d937f8

Remove unused term matching strategies

2d88089

Introduce a new search module, eventually meant to replace the old one

79e0a6d

The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.

Introduce structure to represent search queries as graphs

a83007c

Introduce a DatabaseCache to memorize the addresses of LMDB values

5065d8b

Introduce a common way to manage the coordination between ranking rules

ce0d1e0

Implement a function to find a QueryGraph's docids

46249ea

Introduce a structure to implement ranking rules with graph algorithms

c9bf6bb

Introduce a structure to represent a set of graph paths efficiently

864f641

Introduce cache structures used with ranking rule graphs

23bf572

Introduce a function to find the docids of a set of paths in a graph

48aae76

Introduce a function to find the K shortest paths in a graph

a70ab8b

Introduce a generic graph-based ranking rule

c645853

Introduce the proximity ranking rule as a graph-based ranking rule

89d696c

Introduce the words ranking rule working with the new search structures

345c99d

Introduce the sort ranking rule working with the new search structures

1321913

Add some documentation and use bitmaps instead of hashmaps when possible

66d0c63

dureuill reviewed May 2, 2023

Kerollmops reviewed May 2, 2023

dureuill and others added 13 commits May 2, 2023 18:53

rename located_query_terms_from_string -> located_query_terms_from_tokens

7b8cc25

Remove too many arguments on resolve_maximally_reduced_query_graph

75819bc

Use MultiOps for resolve_query_graph

fdc1763

Remove self.iterating from words

b60840e

revamp the test to use execute_iterative_and_rtree_returns_the_same

c470b67

deserialize the rtree only when its needed, and keep it in memory once it has been deserialized

8875d24

make the descendent geosort fast

c85392c

geosort: Remove rtree unwrap

342c4ff

Cargo fmt

1aaf24c

fix nb of dbs

d3e5b10

Increase map size for tests following charabia camelCase tokenization

3a408e8

Update exactness tests following charabia camelCase tokenization

f8f190c

dureuill marked this pull request as ready for review May 3, 2023 13:19

dureuill approved these changes May 3, 2023

meili-bors bot merged commit 1afde4f into main May 3, 2023
7 checks passed

meili-bors bot deleted the search-refactor branch May 3, 2023 14:30

meili-bot added the v1.2.0 PRs/issues solved in v1.2.0 released on 2023-06-05 label Jun 13, 2023
