Refactor of the search algorithms #3542

Merged
merged 243 commits into from
May 3, 2023

@loiclec loiclec commented Feb 27, 2023

This PR refactors a large part of the search logic (related to #3547).

  • The "query tree" is replaced by a "query graph", which describes the different ways in which the search query can be interpreted and precomputes the word derivations for each query term. Example:

(Screenshot of an example query graph, 2023-02-27 at 10:26:50)

  • The control flow between the ranking rules is managed in a single place instead of being implemented independently by each ranking rule.

  • The set of document candidates is determined greedily from the beginning. It is often referred to as the "universe" in the code.

  • The ranking rules proximity, attribute, typo, and (maybe) exactness are or will be implemented using a k-shortest-paths graph algorithm. This minimises the number of database and bitmap operations needed to compute each ranking rule bucket. It also simplifies the code considerably, since many ranking rules will share a large part of their implementation.

  • Pointers to database values are stored in a cache to avoid searching in the LMDB databases needlessly.

  • The results of some roaring bitmap operations are also stored in a cache, although we'll need to measure the memory pressure this puts on the system and maybe deactivate this cache later on.

  • Search requests can be visually logged and debugged in tests.
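To illustrate the centralised control flow and the "universe" described above, here is a minimal sketch, not the code from this PR. The `RankingRule` trait, the `ByKey` rule and `bucket_sort` are hypothetical names, and a `BTreeSet<u32>` stands in for a roaring bitmap. Each rule only splits a universe into ordered buckets; a single driver function decides when to descend into the next rule and when to stop.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for a set of document ids (assumption: the real code uses roaring bitmaps).
type DocIds = BTreeSet<u32>;

/// Hypothetical, simplified ranking rule: it only knows how to split a
/// universe into ordered buckets (best first). It does not drive the iteration.
trait RankingRule {
    fn buckets(&self, universe: &DocIds) -> Vec<DocIds>;
}

/// Illustrative rule: buckets documents by a numeric key, lowest key first.
struct ByKey(fn(u32) -> u32);

impl RankingRule for ByKey {
    fn buckets(&self, universe: &DocIds) -> Vec<DocIds> {
        let mut by_key: BTreeMap<u32, DocIds> = BTreeMap::new();
        for &doc in universe {
            by_key.entry((self.0)(doc)).or_default().insert(doc);
        }
        // BTreeMap iterates in ascending key order, so buckets come out sorted.
        by_key.into_values().collect()
    }
}

/// The single place where control flow lives: each bucket of rule `i`
/// becomes the universe of rule `i + 1`, until `limit` documents are found.
fn bucket_sort(rules: &[&dyn RankingRule], universe: &DocIds, limit: usize, out: &mut Vec<u32>) {
    if out.len() >= limit || universe.is_empty() {
        return;
    }
    match rules.split_first() {
        // No rule left: emit the remaining universe in document id order.
        None => out.extend(universe.iter().copied().take(limit - out.len())),
        Some((rule, rest)) => {
            for bucket in rule.buckets(universe) {
                bucket_sort(rest, &bucket, limit, out);
                if out.len() >= limit {
                    return;
                }
            }
        }
    }
}
```

With two rules (for example a parity key followed by a "tens" key), the driver explores the best bucket of the first rule completely before touching the second bucket, and stops as soon as `limit` documents have been collected.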

TODO:

  • Reintroduce search benchmarks
  • Implement disableOnWords and disableOnAttributes settings of typo tolerance
  • Implement "exhaustive number of hits"
  • Implement attribute ranking rule
    • Indexing changes: split into word_fid_docids and word_position_docids (with bucketed position)
    • Ranking rule implementations
  • Implement exactness ranking rule
    • Initial implementation
    • Correct implementation when followed by Words
  • Implement geosort ranking rule
  • Add tests
    • Typo tolerance disableOnWords/disableOnAttributes
    • Geosort
    • Exactness
    • Attribute/Position
    • Interactions between ranking rules:
      • Typo/Proximity/Attribute not preceded by Words
      • Exactness not preceded by Words
      • Exactness -> Words (+ check universe correctness)
      • Exactness -> Typo, etc.
      • Sort -> Words (performance tests)
      • Attribute/Position -> Typo
      • Attribute/Position -> Proximity
      • Typo -> Exactness
      • Typo -> Proximity
      • Proximity -> Typo
    • Words
    • Typo
    • Proximity
    • Sort
    • Ngrams
    • Split words
    • Ngram + Split Words
    • Term matching strategy
    • Distinct attribute
    • Phrase Search
    • Placeholder search
    • Highlighter
  • Limit the number of word derivations in a search query
  • Compute the initial universe correctly according to the terms matching strategy
  • Implement placeholder search
  • Get the list of ranking rules from the settings
  • Implement distinct
  • Determine what to do when one of attribute, proximity, typo, or exactness is placed before words
  • Make sure the correct number of allowed typos is used for each word, including the prefix one
  • Make sure stop words are treated correctly (e.g. correct position in query graph), including in phrases
  • Support phrases correctly
  • Support synonyms
  • Support split words
  • Support combination of ngram + split-words (e.g. whiteh orse -> "white horse")
  • Implement typo ranking rule
  • Implement sort ranking rule
  • Use existing Search interface to use the new search algorithms
  • Remove old code

@loiclec added the performance and search relevancy labels on Feb 27, 2023
@loiclec force-pushed the search-refactor branch 5 times, most recently from ac8687b to 2b6814d, on March 13, 2023 16:22
@loiclec force-pushed the search-refactor branch 4 times, most recently from 78f7fe9 to 7b160b2, on March 16, 2023 11:26
let min_len_one_typo = ctx.index.min_word_len_one_typo(ctx.txn)?;
let min_len_two_typos = ctx.index.min_word_len_two_typos(ctx.txn)?;

// TODO: should `exact_words` also disable prefix search, ngrams, split words, or synonyms?

nitpick: @loiclec what should we do with this comment?

let ngram_str_interned = ctx.word_interner.insert(ngram_str.clone());

let max_nbr_typos =
number_of_typos_allowed(ngram_str.as_str()).saturating_sub(terms.len() as u8 - 1);

relevancy change: @loiclec does that mean that trigrams now always have 0 allowed typos? If so, @ManyTheFish reports that this is a relevancy change, as both trigrams and digrams would sometimes allow 1 typo in the previous implementation.
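A small worked sketch of the arithmetic in question. The function below is a hypothetical stand-in for the `number_of_typos_allowed` closure in the diff; it assumes the closure returns 0, 1 or 2 depending on word length, and that length is counted in characters. The thresholds are passed as parameters, and the values 5 and 9 in the usage note are sample values only.

```rust
/// Hypothetical stand-in for `number_of_typos_allowed` from the diff.
/// Assumption: 0, 1 or 2 typos depending on the length of the word in chars.
fn number_of_typos_allowed(word: &str, min_len_one_typo: usize, min_len_two_typos: usize) -> u8 {
    let len = word.chars().count();
    if len < min_len_one_typo {
        0
    } else if len < min_len_two_typos {
        1
    } else {
        2
    }
}

/// Follows the expression under review: an ngram built from `n_terms`
/// query words loses one allowed typo per joined word beyond the first.
/// `saturating_sub` is used on both sides here so that `n_terms == 0` cannot underflow.
fn max_typos_for_ngram(ngram: &str, n_terms: usize, one: usize, two: usize) -> u8 {
    number_of_typos_allowed(ngram, one, two).saturating_sub((n_terms as u8).saturating_sub(1))
}
```

Under these assumptions, `max_typos_for_ngram("thedarkknight", 3, 5, 9)` is 0 and `max_typos_for_ngram("darkknight", 2, 5, 9)` is 1. Since the base allowance never exceeds 2, a trigram (which subtracts 2) can never keep a typo, while a digram keeps at most 1.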

/// Finish iterating over the current ranking rule, yielding
/// control to the parent (or finishing the search if not possible).
/// Update the universes accordingly and inform the logger.
macro_rules! back {

nitpick: @Kerollmops replace with ControlFlow?

let mut valid_docids = vec![];
let mut cur_offset = 0usize;

macro_rules! maybe_add_to_results {

nitpick: Replace the macro with inline calls to the inner function?


/// A ranking rule that produces 3 disjoint buckets:
///
/// 1. Documents from the universe whose value is exactly the query.

relevancy change: in the previous implementation, the first 2 heuristics were only run on the initial query with all the words, whereas in the new implementation, they will run on queries modified by a prior invocation of the words ranking rule, e.g. removing words at the end of the query.

If the invocation resulted in a "hole" in the query (e.g. frequency removal), then the 2 heuristics will not discriminate documents. Otherwise, however, they will run again, so that:

  • Query: Batman the dark knight rises
  • Word iter 1: Batman the dark knight rises
  • Exactness iter 1: Batman the dark knight rises
  • Exactness iter 1: Batman the dark knight rises
  • Word iter 2: Batman the dark knight
  • Exactness iter 2: Batman the dark knight
  • Exactness iter 3: Batman the dark knight crashes

@@ -0,0 +1,78 @@
use roaring::RoaringBitmap;

relevancy changes:

  • a split word is now considered as having 1 typo
  • a digram can have split words too: whit ehorse (FIXME: consider allowing for split words in trigrams e.g. white ehor se)
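As a rough illustration of the two points above, here is a sketch of split-word derivation under simplifying assumptions. `Derivation`, `split_candidates` and `split_word_derivations` are hypothetical names, and the dictionary is a plain `HashSet`. This sketch keeps every split whose two halves are known words; the PR's actual selection logic may differ.

```rust
use std::collections::HashSet;

/// Hypothetical result type: the words produced and their typo cost.
#[derive(Debug, PartialEq)]
struct Derivation {
    words: Vec<String>,
    typos: u8,
}

/// All ways to cut `word` into two non-empty parts, on char boundaries.
fn split_candidates(word: &str) -> Vec<(&str, &str)> {
    word.char_indices().skip(1).map(|(i, _)| word.split_at(i)).collect()
}

/// Keeps the splits whose halves both exist in `dictionary`.
/// Each split word is counted as 1 typo, as stated in the comment above.
fn split_word_derivations(word: &str, dictionary: &HashSet<&str>) -> Vec<Derivation> {
    split_candidates(word)
        .into_iter()
        .filter(|(left, right)| dictionary.contains(left) && dictionary.contains(right))
        .map(|(left, right)| Derivation {
            words: vec![left.to_string(), right.to_string()],
            typos: 1,
        })
        .collect()
}
```

For the digram case, the two query words `whit` and `ehorse` would first be joined into `whitehorse`, which this sketch then splits into `white` and `horse` with a typo cost of 1.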

@@ -0,0 +1,133 @@
use fxhash::{FxHashMap, FxHashSet};

Query words that match far in a field will be bucketed together if they slightly differ. Query words that match near the front of the field will remain precise enough (see cost_from_position).
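To make the bucketing idea concrete, here is a sketch of one possible scheme. It is an assumption for illustration, not the `cost_from_position` function referenced above: positions below a small limit are kept exact, and later positions are rounded down to the previous power of two, so buckets get wider the further a match is from the start of the field.

```rust
/// Hypothetical bucketing scheme (not the PR's `cost_from_position`):
/// exact for the first 16 positions, then one bucket per power of two.
fn bucket_position_sketch(pos: u16) -> u16 {
    const EXACT_LIMIT: u16 = 16;
    if pos < EXACT_LIMIT {
        pos
    } else {
        // Index of the highest set bit; `pos >= 16` so this is at least 4.
        let highest_bit = 15 - pos.leading_zeros();
        1 << highest_bit
    }
}
```

With this scheme, positions 3 and 4 stay distinct, while positions 17 and 20 both fall into bucket 16 and position 40 falls into bucket 32.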

@Kerollmops Kerollmops May 2, 2023

Should we change the way we aggregate the positions?

@dureuill dureuill marked this pull request as ready for review May 3, 2023 13:19
@dureuill dureuill left a comment

bors merge

meili-bors bot commented May 3, 2023

Labels
  • performance: Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
  • search relevancy: Related to the relevancy of the search results
  • v1.2.0: PRs/issues solved in v1.2.0 released on 2023-06-05

7 participants