This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Refactor the Facets databases to enable incremental indexing #619

Merged: 60 commits, Oct 26, 2022

Conversation

@loiclec (Contributor) commented Sep 1, 2022

Pull Request

What does this PR do?

Partly fixes #605 by making the indexing of the facet databases (i.e. facet_id_f64_docids and facet_id_string_docids) incremental. It also closes #327 and meilisearch/meilisearch#2820. Two more untracked bugs were also fixed:

  1. The facet distribution algorithm did not respect the maxFacetValues parameter when there were only a few candidate document ids.
  2. The structure of the levels > 0 of the facet databases was not updated following the deletion of documents.

How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:

  1. the new structure of the databases
  2. the algorithms for sort, facet distribution, and range search
  3. the new/removed heed codecs

Then, weigh in on the following concerns:

  1. adding fuzzcheck as a fuzz-only dependency may add too much complexity for the benefits it provides
  2. the ByteSliceRef and StrRefCodec are misnamed or should not exist
  3. the new behaviour of facet distributions can be considered incorrect
  4. incremental deletion is useless given that documents are always deleted in bulk

What's left for me to do

  1. Re-read everything once to make sure I haven't forgotten anything
  2. Wait for the results of the benchmarks and see whether (1) they provide enough information and (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
  3. Test whether the info/http-ui crates survived the refactor

Old structure of the facet_id_f64_docids and facet_id_string_docids databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where the first line is the key of the database, consisting of:

  • the field id
  • the level height
  • the left and right bound of the group

and the second line is the value of the database, consisting of:

  • a bitmap of all the docids that have a facet value within the bounds

The facet_id_string_docids had a similar structure:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where, at level 0, the key is:

  • the normalised facet value (string)

and the value is:

  • the original facet value (string)
  • a bitmap of all the docids that have this normalised string facet value

At level 1, the key is:

  • the left bound of the range as an index in level 0
  • the right bound of the range as an index in level 0

and the value is:

  • the left bound of the range as a normalised string
  • the right bound of the range as a normalised string
  • a bitmap of all the docids that have a string facet value within the bounds

At level > 1, the key is:

  • the left bound of the range as an index in level 0
  • the right bound of the range as an index in level 0

and the value is:

  • a bitmap of all the docids that have a string facet value within the bounds

New structure of the facet_id_f64_docids and facet_id_string_docids databases

Now both the facet_id_f64_docids and facet_id_string_docids databases have the exact same structure:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where for all levels, the key is a FacetGroupKey<T> containing:

  • the field id (u16)
  • the level height (u8)
  • the left bound of the range (T)

and the value is a FacetGroupValue containing:

  • the number of elements from the level below that are part of the range (u8, =0 for level 0)
  • a bitmap of all the docids that have a facet value within the bounds (RoaringBitmap)

The right bound of the range is now implicit: it is equal to Excluded(next_left_bound).

In the code, the key is always encoded using FacetGroupKeyCodec<C> where C is the codec used to encode the facet value (either OrderedF64Codec or StrRefCodec) and the value is encoded with FacetGroupValueCodec.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. FacetGroupKey<&[u8]> encoded as FacetGroupKeyCodec<ByteSliceRef>). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the facet_id_string_docids used to store indices to level 0 in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id z and facet value gap, then we want to add/modify the elements highlighted below in pink:
Screenshot 2022-09-12 at 10 14 54

which results in:
Screenshot 2022-09-12 at 10 23 29

  • one element was added in level 0
  • one key/value was modified in level 1
  • one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the group_size part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (max_group_size), then we split the node into two. For example, let's imagine that max_group_size is 4 and we add the docid y with facet value gas. First, we add it in level 0:
Screenshot 2022-09-12 at 10 30 40
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
Screenshot 2022-09-12 at 10 33 16
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
Screenshot 2022-09-12 at 10 34 42

We also have two other parameters:

  • group_size is the default group size when building the database from scratch
  • min_level_size is the minimum number of elements that a level should contain

When the highest level size is greater than group_size * min_level_size, then we create an additional level above it.
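The two balancing rules above can be written down as predicates. This is a hedged sketch, not milli's actual code, and the parameter values in main are illustrative only:

```rust
// A group is split in two when it reaches `max_group_size`; a new top level is
// created when the highest level grows beyond `group_size * min_level_size`.
fn should_split(group_len: usize, max_group_size: usize) -> bool {
    group_len >= max_group_size
}

fn should_add_level(top_level_len: usize, group_size: usize, min_level_size: usize) -> bool {
    top_level_len > group_size * min_level_size
}

fn main() {
    // The walkthrough above: with max_group_size = 4, the parent of the new
    // "gas" entry reaches 4 children and must be split.
    assert!(should_split(4, 4));
    assert!(!should_split(3, 4));
    // Illustrative values: group_size = 4 and min_level_size = 5 give a
    // threshold of 20 groups before a new level is created above.
    assert!(should_add_level(21, 4, 5));
    assert!(!should_add_level(20, 4, 5));
    println!("ok");
}
```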

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting "aa" with the docid w would change the database to:
Screenshot 2022-09-12 at 10 41 56

The root of the code for incremental indexing is the FacetUpdateIncremental builder.

Incrementally removing a facet value

TODO: the algorithm was implemented and works, but its current API is: fn delete(self, facet_value, single_docid). It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.

Bulk insertion

While it's faster to incrementally add a single facet value to the database, it is sometimes slower to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The FacetUpdateBulk builder provides a way to do so. It works by:

  1. clearing all levels > 0 from the DB
  2. adding all new elements in level 0
  3. rebuilding the higher levels from scratch

The algorithm for bulk insertion is the same as the previous one.

Choosing between incremental and bulk insertion

On my computer, I measured that it is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose between incremental insertion and bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.
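A hypothetical version of that choice is sketched below. The 50x factor is the measurement quoted above; the exact comparison milli uses may differ, so treat the threshold as illustrative.

```rust
#[derive(Debug, PartialEq)]
enum InsertionMethod {
    Incremental,
    Bulk,
}

fn choose_method(existing_level0_len: u64, new_facet_values: u64) -> InsertionMethod {
    // Incremental insertion of N values costs roughly 50x a rebuild, whose cost
    // is proportional to the total size; bulk wins once the new values are a
    // large enough fraction of the resulting database.
    if new_facet_values * 50 >= existing_level0_len + new_facet_values {
        InsertionMethod::Bulk
    } else {
        InsertionMethod::Incremental
    }
}

fn main() {
    // Initial indexing: empty database, everything goes through bulk insertion.
    assert_eq!(choose_method(0, 10_000), InsertionMethod::Bulk);
    // A handful of new values against millions of existing ones: incremental.
    assert_eq!(choose_method(5_000_000, 100), InsertionMethod::Incremental);
    println!("ok");
}
```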

Fuzz-testing

Potentially controversial:
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession; each operation is processed with different parameters for group_size, max_group_size, and min_level_size. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage achieved by the fuzz-test. It covered 100% of the relevant code except for unreachable/panic statements and errors returned by heed.

The fuzz-test and the fuzzcheck dependency are only compiled when cargo fuzzcheck is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it.

Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:

  1. Sort, ascending
  2. Sort, descending
  3. Facet distribution
  4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): FacetNumberRange, FacetNumberRevRange, FacetNumberIter (with a reversed and reducing/non-reducing option), FacetStringGroupRange, FacetStringGroupRevRange, FacetStringLevel0Range, FacetStringLevel0RevRange, and FacetStringIter (reversed + reducing/non-reducing).

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:

  1. ascending_facet_sort in search/facet/facet_sort_ascending.rs
  2. descending_facet_sort in search/facet/facet_sort_descending.rs
  3. iterate_over_facet_distribution in search/facet/facet_distribution_iter.rs
  4. find_docids_of_facet_within_bounds in search/facet/facet_range_search.rs

I have tried to test them with some snapshot tests but more testing could still be done. I don't think that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.

Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in level 0 of facet_id_string_docids. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in field_id_docid_facet_strings. As a consequence, the string value returned in the facet distribution may not appear in any of the candidates. For example:

{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }

Facet distribution for the colour field among the candidates [1]:

{ "RED": 1 }

Here, "RED" was given as the original facet value even though it does not appear in the document id 1.

Heed codecs

A number of heed codecs related to the facet databases were removed:

  • FacetLevelValueF64Codec
  • FacetLevelValueU32Codec
  • FacetStringLevelZeroCodec
  • StringValueCodec
  • FacetStringZeroBoundsValueCodec
  • FacetValueStringCodec
  • FieldDocIdFacetStringCodec
  • FieldDocIdFacetF64Codec

They were replaced by:

  • FacetGroupKeyCodec<C> (replaces all key codecs for the facet databases)
  • FacetGroupValueCodec (replaces all value codecs for the facet databases)
  • FieldDocIdFacetCodec<C> (replaces FieldDocIdFacetStringCodec and FieldDocIdFacetF64Codec)

Since the associated encoded item of FacetGroupKeyCodec<C> is FacetKey<T>, and we often work with FacetKey<&[u8]> and FacetKey<&str>, we need codecs that encode values of type &str and &[u8]. The existing ByteSlice and Str codecs do not work for that purpose (their EItem are [u8] and str), so I also created two new codecs:

  • ByteSliceRef is a codec with EItem = DItem = &[u8]
  • StrRefCodec is a codec with EItem = DItem = &str

I have also factored out the code used to encode an ordered f64 into its own OrderedF64Codec.

@loiclec loiclec added the following labels on Sep 1, 2022: indexing (Related to the documents/settings indexing algorithms.), querying (Related to the searching/fetch data algorithms.), DB breaking (The related changes break the DB.), performance (Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption.)
@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 527c381 to c61386f on September 12, 2022
@irevoire irevoire added the API breaking (The related changes break the milli API.) label on Sep 12, 2022
@irevoire (Member) commented:

Adding API-breaking for this part:

Change of behaviour for facet distributions

@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 838ea66 to ae81e63 on September 21, 2022
@loiclec loiclec force-pushed the facet-levels-refactor branch 3 times, most recently from 5d9c1be to bf83884 on October 17, 2022
@loiclec loiclec marked this pull request as ready for review October 19, 2022 14:00
@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 150e19a to 52a7e10 on October 25, 2022
@Kerollmops (Member) left a comment:


  • Why do http-ui/src/main.rs and infos/src/main.rs exist?

@loiclec (Contributor, Author) commented Oct 26, 2022

  • Why do http-ui/src/main.rs and infos/src/main.rs exist?

aftermath of a rebase, they're gone now :)

Kerollmops previously approved these changes Oct 26, 2022
@Kerollmops (Member) left a comment:


Thank you very much! This is the last day and hour of this magnificent PR 👏
It will improve the facet insertion and clean up a lot of code from milli ❤️

bors merge

bors bot commented Oct 26, 2022

Merge conflict.

@Kerollmops (Member) left a comment:


bors merge

bors bot added a commit that referenced this pull request Oct 26, 2022
619: Refactor the Facets databases to enable incremental indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Party fixes #605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and meilisearch/meilisearch#2820 . Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases were not updated following the deletion of documents

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where the first line is the key of the database, consisting of :
- the field id
- the level height
- the left and right bound of the group 

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:
```                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit, it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`., then we want to add/modify the elements highlighted below in pink:
<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:
<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:
<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:
<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value
TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.  

## Bulk insertion
While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch 

The algorithm for bulk insertion is the same as the previous one.

## Choosing between incremental and bulk insertion
On my computer, I measured that is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.

## Fuzz-testing

**Potentially controversial:**
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession, each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. 

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). 

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`
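The traversal these functions share can be illustrated with a minimal in-memory sketch of ascending sort. The `ascending_sort_sketch` name and the `Vec`-based representation are hypothetical, and mapping parent `i` to children `[i * group_size, (i + 1) * group_size)` is a simplification (real groups can vary in size); the real implementation iterates over heed ranges and roaring bitmaps.

```rust
// Walk a level left to right; descend only into groups whose docids intersect
// the candidates; at level 0, emit the per-facet-value intersections in order.
fn ascending_sort_sketch(
    levels: &[Vec<(u32, Vec<u32>)>], // levels[0] is level 0
    group_size: usize,
    candidates: &[u32],
    level: usize,
    range: std::ops::Range<usize>,
    out: &mut Vec<(u32, Vec<u32>)>,
) {
    for i in range {
        if let Some((key, docids)) = levels[level].get(i) {
            let hits: Vec<u32> = docids
                .iter()
                .copied()
                .filter(|d| candidates.contains(d))
                .collect();
            if hits.is_empty() {
                continue; // no candidate below this group: skip the whole subtree
            }
            if level == 0 {
                out.push((*key, hits));
            } else {
                // descend into this group's children on the level below
                ascending_sort_sketch(
                    levels, group_size, candidates,
                    level - 1, i * group_size..(i + 1) * group_size, out,
                );
            }
        }
    }
}
```

Descending sort is the same walk right to left, and the distribution and range-search functions reuse the same "intersect, then descend" pruning.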

I have tried to test them with some snapshot tests, but more testing could still be done. I don't *think* that the performance of these algorithms has regressed, but that will need to be confirmed by benchmarks.

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in level 0 of `facet_id_string_docids`. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id that contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, the string value returned in the facet distribution may not appear in any of the candidates. For example,
```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```
Facet distribution for the `colour` field among the candidates `[1]`:
```
{ "RED": 1 }
```
Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetKey<T>` and we often work with `FacetKey<&[u8]>` and `FacetKey<&str>`, we need codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), so I created two new codecs:
* `ByteSliceRef` is a codec with `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with `EItem = DItem = &str`
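To show why the item type matters, here is a sketch of `StrRefCodec` written against simplified stand-ins for heed's `BytesEncode`/`BytesDecode` traits (the trait definitions below are illustrative, not heed's real ones): both the encoded and decoded items are `&str`, borrowing from the caller or from the database rather than owning the data.

```rust
use std::borrow::Cow;

// Simplified stand-ins for heed's encode/decode traits.
trait BytesEncode<'a> {
    type EItem: 'a;
    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>>;
}

trait BytesDecode<'a> {
    type DItem: 'a;
    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem>;
}

struct StrRefCodec;

impl<'a> BytesEncode<'a> for StrRefCodec {
    // The item is `&str`, not `str`: this is what lets us encode a
    // `FacetKey<&str>` without copying the string.
    type EItem = &'a str;
    fn bytes_encode(item: &'a &'a str) -> Option<Cow<'a, [u8]>> {
        Some(Cow::Borrowed(item.as_bytes()))
    }
}

impl<'a> BytesDecode<'a> for StrRefCodec {
    type DItem = &'a str;
    fn bytes_decode(bytes: &'a [u8]) -> Option<&'a str> {
        std::str::from_utf8(bytes).ok()
    }
}
```

`ByteSliceRef` is the same idea with `&[u8]` in place of `&str` (and no UTF-8 check on decode).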

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.
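For reference, here is the standard order-preserving trick such a codec relies on (a sketch only: `OrderedF64Codec`'s actual byte layout may differ). The goal is that comparing the big-endian encoded bytes lexicographically gives the same result as comparing the floats numerically, which is what lets the facet keys sort correctly inside LMDB.

```rust
// Encode an f64 so that bytewise comparison of the big-endian bytes matches
// numeric order: flip the sign bit of non-negative values so they sort after
// negatives, and invert all bits of negative values so that more-negative
// numbers sort first.
fn encode_ordered_f64(f: f64) -> [u8; 8] {
    let bits = f.to_bits();
    let ordered = if bits & (1 << 63) != 0 {
        !bits // negative
    } else {
        bits | (1 << 63) // zero or positive
    };
    ordered.to_be_bytes()
}

// Reverse the transformation to recover the original f64.
fn decode_ordered_f64(bytes: [u8; 8]) -> f64 {
    let ordered = u64::from_be_bytes(bytes);
    let bits = if ordered & (1 << 63) != 0 {
        ordered & !(1 << 63) // was zero or positive
    } else {
        !ordered // was negative
    };
    f64::from_bits(bits)
}
```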


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
@bors (bot) commented on Oct 26, 2022:

> Build failed:

@curquiza:

> bors merge

@bors (bot) commented on Oct 26, 2022:

> Build succeeded:

bors bot merged commit 2e53924 into main on Oct 26, 2022 and deleted the facet-levels-refactor branch on October 26, 2022 at 15:25.
Labels: API breaking (the related changes break the milli API), DB breaking (the related changes break the DB), indexing (related to the documents/settings indexing algorithms), performance (related to performance in terms of search/indexation speed or RAM/CPU/disk consumption), querying (related to the searching/fetch data algorithms).
Merging this pull request may close these issues: "Improve speed of incremental indexing" and "Separate the original facet string values from the docids".