-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor of the search algorithms #3542
Conversation
ac8687b
to
2b6814d
Compare
78f7fe9
to
7b160b2
Compare
The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.
let min_len_one_typo = ctx.index.min_word_len_one_typo(ctx.txn)?; | ||
let min_len_two_typos = ctx.index.min_word_len_two_typos(ctx.txn)?; | ||
|
||
// TODO: should `exact_words` also disable prefix search, ngrams, split words, or synonyms? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: @loiclec what should we do with this comment
let ngram_str_interned = ctx.word_interner.insert(ngram_str.clone()); | ||
|
||
let max_nbr_typos = | ||
number_of_typos_allowed(ngram_str.as_str()).saturating_sub(terms.len() as u8 - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relevancy change: @loiclec does that mean that trigrams now always have 0 allowed typos? If so @ManyTheFish is reporting this is a relevancy change as both trigrams and digrams would sometimes allow for 1 typo in the previous impl
/// Finish iterating over the current ranking rule, yielding | ||
/// control to the parent (or finishing the search if not possible). | ||
/// Update the universes accordingly and inform the logger. | ||
macro_rules! back { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: @Kerollmops replace with ControlFlow?
let mut valid_docids = vec![]; | ||
let mut cur_offset = 0usize; | ||
|
||
macro_rules! maybe_add_to_results { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Replace macro by inline calls to the inner function?
|
||
/// A ranking rule that produces 3 disjoint buckets: | ||
/// | ||
/// 1. Documents from the universe whose value is exactly the query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relevancy change: in the previous implementation, the first 2 heuristics were only run on the initial query with all the words, whereas in the new implementation, they will run on queries modified by a prior invocation of the words
ranking rule, e.g. removing words at the end of the query.
If the invocation resulted in "hole" in the query (e.g. frequency removal), then the 2 heuristics will not discriminate documents. Otherwise, however, it will run again so that:
- Query: Batman the dark knight rises
- Word iter 1: Batman the dark knight rises
- Exactness iter 1: Batman the dark knight rises
- Exactness iter 1: Batman the dark knight rises
- Word iter 2: Batman the dark knight
- Exactness iter 2: Batman the dark knight
- Exactness iter 3: Batman the dark knight crashes
@@ -0,0 +1,78 @@ | |||
use roaring::RoaringBitmap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relevancy changes:
- a split word is now considered as having 1 typo
- a digram can have split words too:
whit ehorse
(FIXME: consider allowing for split words in trigrams e.g.white ehor se
)
@@ -0,0 +1,133 @@ | |||
use fxhash::{FxHashMap, FxHashSet}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query words that match far in a field will be bucketed together if they slightly differ. Query words that match near the front of the field will remain precise enough (see cost_from_position
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the way we change the way we aggregate the positions?
…e it has been deserialized
Conflicts | resolution ----------|----------- Cargo.lock | added mimalloc Cargo.toml | took origin/main version milli/src/search/criteria/exactness.rs | deleted after checking it was only clippy changes milli/src/search/query_tree.rs | deleted after checking it was only clippy changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors merge
This PR refactors a large part of the search logic (related to #3547)
The control flow between the
criterionsranking rules is managed in a single place instead of being independently implemented by each ranking rule.The set of document candidates is determined greedily from the beginning. It is often referred as the "universe" in the code.
The ranking rules
proximity
,attribute
,typo
, and (maybe)exactness
are or will be implemented using a K-shortest path graph algorithm. This minimises the number of database and bitmap operations we need to do to compute each ranking rule bucket. It also simplifies the code a lot since a lot of ranking rules will share a large part of their implementation.Pointers to database values are stored in a cache to avoid searching in the LMDB databases needlessly.
The result of some roaring bitmap operations are also stored in a cache, although we'll need to measure the memory pressure this puts on the system and maybe deactivate this cache later on.
Search requests can be visually logged and debugged in tests.
TODO:
disableOnWords
anddisableOnAttributes
settings of typo toleranceattribute
ranking ruleword_fid_docids
andword_position_docids
(with bucketed position)exactness
ranking ruleWords
geosort
ranking ruledisableOnWords
/disableOnAttributes
distinct
attribute
,proximity
,typo
, orexactness
is placed beforewords
whiteh orse
->"white horse"
)typo
ranking rulesort
ranking ruleSearch
interface to use the new search algorithms