This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Refactor the Facets databases to enable incremental indexing #619

Merged: 60 commits, Oct 26, 2022

Conversation

@loiclec (Contributor) commented Sep 1, 2022

Pull Request

What does this PR do?

Partly fixes #605 by making the indexing of the facet databases (i.e. facet_id_f64_docids and facet_id_string_docids) incremental. It also closes #327 and meilisearch/meilisearch#2820. Two more untracked bugs were also fixed:

  1. The facet distribution algorithm did not respect the maxFacetValues parameter when there were only a few candidate document ids.
  2. The structure of the levels > 0 of the facet databases was not updated following the deletion of documents.

How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:

  1. the new structure of the databases
  2. the algorithms for sort, facet distribution, and range search
  3. the new/removed heed codecs

Then, weigh in on the following concerns:

  1. adding fuzzcheck as a fuzz-only dependency may add too much complexity for the benefits it provides
  2. the ByteSliceRef and StrRefCodec are misnamed or should not exist
  3. the new behaviour of facet distributions can be considered incorrect
  4. incremental deletion is useless given that documents are always deleted in bulk

What's left for me to do

  1. Re-read everything once to make sure I haven't forgotten anything
  2. Wait for the results of the benchmarks and see whether (1) they provide enough information and (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
  3. Test whether the info/http-ui crates survived the refactor

Old structure of the facet_id_f64_docids and facet_id_string_docids databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where the first line is the key of the database, consisting of:

  • the field id
  • the level height
  • the left and right bound of the group

and the second line is the value of the database, consisting of:

  • a bitmap of all the docids that have a facet value within the bounds

The facet_id_string_docids had a similar structure:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where, at level 0, the key is:

  • the normalised facet value (string)

and the value is:

  • the original facet value (string)
  • a bitmap of all the docids that have this normalised string facet value

At level 1, the key is:

  • the left bound of the range as an index in level 0
  • the right bound of the range as an index in level 0

and the value is:

  • the left bound of the range as a normalised string
  • the right bound of the range as a normalised string
  • a bitmap of all the docids that have a string facet value within the bounds

At level > 1, the key is:

  • the left bound of the range as an index in level 0
  • the right bound of the range as an index in level 0

and the value is:

  • a bitmap of all the docids that have a string facet value within the bounds

New structure of the facet_id_f64_docids and facet_id_string_docids databases

Now both the facet_id_f64_docids and facet_id_string_docids databases have the exact same structure:

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

where for all levels, the key is a FacetGroupKey<T> containing:

  • the field id (u16)
  • the level height (u8)
  • the left bound of the range (T)

and the value is a FacetGroupValue containing:

  • the number of elements from the level below that are part of the range (u8, =0 for level 0)
  • a bitmap of all the docids that have a facet value within the bounds (RoaringBitmap)

The right bound of the range is now implicit: it is equal to Excluded(next_left_bound).

In the code, the key is always encoded using FacetGroupKeyCodec<C> where C is the codec used to encode the facet value (either OrderedF64Codec or StrRefCodec) and the value is encoded with FacetGroupValueCodec.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. FacetGroupKey<&[u8]> encoded as FacetGroupKeyCodec<ByteSliceRef>). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the facet_id_string_docids used to store indices to level 0 in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id z and facet value gap, then we want to add/modify the elements highlighted below in pink:
Screenshot 2022-09-12 at 10 14 54

which results in:
Screenshot 2022-09-12 at 10 23 29

  • one element was added in level 0
  • one key/value was modified in level 1
  • one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the group_size part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (max_group_size), then we split the node into two. For example, let's imagine that max_group_size is 4 and we add the docid y with facet value gas. First, we add it in level 0:
Screenshot 2022-09-12 at 10 30 40
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
Screenshot 2022-09-12 at 10 33 16
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
Screenshot 2022-09-12 at 10 34 42

We also have two other parameters:

  • group_size is the default group size when building the database from scratch
  • min_level_size is the minimum number of elements that a level should contain

When the highest level size is greater than group_size * min_level_size, then we create an additional level above it.
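The two balancing rules above can be written down as predicates. This is a hedged sketch, not milli's actual code, and the parameter values in main are illustrative only:

```rust
// A group is split in two when it reaches `max_group_size`; a new top level is
// created when the highest level grows beyond `group_size * min_level_size`.
fn should_split(group_len: usize, max_group_size: usize) -> bool {
    group_len >= max_group_size
}

fn should_add_level(top_level_len: usize, group_size: usize, min_level_size: usize) -> bool {
    top_level_len > group_size * min_level_size
}

fn main() {
    // The walkthrough above: with max_group_size = 4, the parent of the new
    // "gas" entry reaches 4 children and must be split.
    assert!(should_split(4, 4));
    assert!(!should_split(3, 4));
    // Illustrative values: group_size = 4 and min_level_size = 5 give a
    // threshold of 20 groups before a new level is created above.
    assert!(should_add_level(21, 4, 5));
    assert!(!should_add_level(20, 4, 5));
    println!("ok");
}
```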

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting "aa" with the docid w would change the database to:
Screenshot 2022-09-12 at 10 41 56

The root of the code for incremental indexing is the FacetUpdateIncremental builder.

Incrementally removing a facet value

TODO: the algorithm was implemented and works, but its current API is: fn delete(self, facet_value, single_docid). It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.

Bulk insertion

While it's faster to incrementally add a single facet value to the database, it is sometimes slower to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The FacetUpdateBulk builder provides a way to do so. It works by:

  1. clearing all levels > 0 from the DB
  2. adding all new elements in level 0
  3. rebuilding the higher levels from scratch

The algorithm for bulk insertion is the same as the previous one.

Choosing between incremental and bulk insertion

On my computer, I measured that it is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose between incremental insertion and bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.
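A hypothetical version of that choice is sketched below. The 50x factor is the measurement quoted above; the exact comparison milli uses may differ, so treat the threshold as illustrative.

```rust
#[derive(Debug, PartialEq)]
enum InsertionMethod {
    Incremental,
    Bulk,
}

fn choose_method(existing_level0_len: u64, new_facet_values: u64) -> InsertionMethod {
    // Incremental insertion of N values costs roughly 50x a rebuild, whose cost
    // is proportional to the total size; bulk wins once the new values are a
    // large enough fraction of the resulting database.
    if new_facet_values * 50 >= existing_level0_len + new_facet_values {
        InsertionMethod::Bulk
    } else {
        InsertionMethod::Incremental
    }
}

fn main() {
    // Initial indexing: empty database, everything goes through bulk insertion.
    assert_eq!(choose_method(0, 10_000), InsertionMethod::Bulk);
    // A handful of new values against millions of existing ones: incremental.
    assert_eq!(choose_method(5_000_000, 100), InsertionMethod::Incremental);
    println!("ok");
}
```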

Fuzz-testing

Potentially controversial:
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession; each operation is processed with different parameters for group_size, max_group_size, and min_level_size. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage achieved by the fuzz-test. It covered 100% of the relevant code except for unreachable/panic statements and errors returned by heed.

The fuzz-test and the fuzzcheck dependency are only compiled when cargo fuzzcheck is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it.

Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:

  1. Sort, ascending
  2. Sort, descending
  3. Facet distribution
  4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): FacetNumberRange, FacetNumberRevRange, FacetNumberIter (with a reversed and reducing/non-reducing option), FacetStringGroupRange, FacetStringGroupRevRange, FacetStringLevel0Range, FacetStringLevel0RevRange, and FacetStringIter (reversed + reducing/non-reducing).

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:

  1. ascending_facet_sort in search/facet/facet_sort_ascending.rs
  2. descending_facet_sort in search/facet/facet_sort_descending.rs
  3. iterate_over_facet_distribution in search/facet/facet_distribution_iter.rs
  4. find_docids_of_facet_within_bounds in search/facet/facet_range_search.rs

I have tried to test them with some snapshot tests but more testing could still be done. I don't think that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.

Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in level 0 of facet_id_string_docids. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in field_id_docid_facet_strings. As a consequence, the string value returned in the facet distribution may not appear in any of the candidates. For example:

{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }

Facet distribution for the colour field among the candidates [1]:

{ "RED": 1 }

Here, "RED" was given as the original facet value even though it does not appear in the document id 1.

Heed codecs

A number of heed codecs related to the facet databases were removed:

  • FacetLevelValueF64Codec
  • FacetLevelValueU32Codec
  • FacetStringLevelZeroCodec
  • StringValueCodec
  • FacetStringZeroBoundsValueCodec
  • FacetValueStringCodec
  • FieldDocIdFacetStringCodec
  • FieldDocIdFacetF64Codec

They were replaced by:

  • FacetGroupKeyCodec<C> (replaces all key codecs for the facet databases)
  • FacetGroupValueCodec (replaces all value codecs for the facet databases)
  • FieldDocIdFacetCodec<C> (replaces FieldDocIdFacetStringCodec and FieldDocIdFacetF64Codec)

Since the associated encoded item of FacetGroupKeyCodec<C> is FacetKey<T>, and we often work with FacetKey<&[u8]> and FacetKey<&str>, we need codecs that encode values of type &str and &[u8]. The existing ByteSlice and Str codecs do not work for that purpose (their EItem are [u8] and str), so I also created two new codecs:

  • ByteSliceRef is a codec with EItem = DItem = &[u8]
  • StrRefCodec is a codec with EItem = DItem = &str

I have also factored out the code used to encode an ordered f64 into its own OrderedF64Codec.

@loiclec loiclec added the following labels on Sep 1, 2022: indexing (Related to the documents/settings indexing algorithms.), querying (Related to the searching/fetch data algorithms.), DB breaking (The related changes break the DB.), performance (Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption.)
@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 527c381 to c61386f on September 12, 2022
@irevoire irevoire added the API breaking (The related changes break the milli API.) label on Sep 12, 2022
@irevoire (Member) commented:

Adding API-breaking for this part:

Change of behaviour for facet distributions

@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 838ea66 to ae81e63 on September 21, 2022
@loiclec loiclec force-pushed the facet-levels-refactor branch 3 times, most recently from 5d9c1be to bf83884 on October 17, 2022
@loiclec loiclec marked this pull request as ready for review October 19, 2022 14:00
@loiclec loiclec force-pushed the facet-levels-refactor branch 2 times, most recently from 150e19a to 52a7e10 on October 25, 2022
@Kerollmops (Member) left a comment:


  • Why do http-ui/src/main.rs and infos/src/main.rs exist?

@loiclec (Contributor, Author) commented Oct 26, 2022

  • Why do http-ui/src/main.rs and infos/src/main.rs exist?

aftermath of a rebase, they're gone now :)

Kerollmops previously approved these changes Oct 26, 2022
@Kerollmops (Member) left a comment:


Thank you very much! This is the last day and hour of this magnificent PR 👏
It will improve the facet insertion and clean up a lot of code from milli ❤️

bors merge

bors bot commented Oct 26, 2022

Merge conflict.

@Kerollmops (Member) left a comment:


bors merge

bors bot added a commit that referenced this pull request Oct 26, 2022
619: Refactor the Facets databases to enable incremental indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Party fixes #605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and meilisearch/meilisearch#2820 . Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases were not updated following the deletion of documents

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where the first line is the key of the database, consisting of :
- the field id
- the level height
- the left and right bound of the group 

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:
```                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit, it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`., then we want to add/modify the elements highlighted below in pink:
<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:
<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:
<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:
<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value
TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.  

## Bulk insertion
While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch 

The algorithm for bulk insertion is the same as the previous one.

## Choosing between incremental and bulk insertion
On my computer, I measured that is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.

## Fuzz-testing

**Potentially controversial:**
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession, each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. 

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). 

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`
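The traversal these functions share can be illustrated with a minimal in-memory sketch of ascending sort. The `ascending_sort_sketch` name and the `Vec`-based representation are hypothetical, and mapping parent `i` to children `[i * group_size, (i + 1) * group_size)` is a simplification (real groups can vary in size); the real implementation iterates over heed ranges and roaring bitmaps.

```rust
// Walk a level left to right; descend only into groups whose docids intersect
// the candidates; at level 0, emit the per-facet-value intersections in order.
fn ascending_sort_sketch(
    levels: &[Vec<(u32, Vec<u32>)>], // levels[0] is level 0
    group_size: usize,
    candidates: &[u32],
    level: usize,
    range: std::ops::Range<usize>,
    out: &mut Vec<(u32, Vec<u32>)>,
) {
    for i in range {
        if let Some((key, docids)) = levels[level].get(i) {
            let hits: Vec<u32> = docids
                .iter()
                .copied()
                .filter(|d| candidates.contains(d))
                .collect();
            if hits.is_empty() {
                continue; // no candidate below this group: skip the whole subtree
            }
            if level == 0 {
                out.push((*key, hits));
            } else {
                // descend into this group's children on the level below
                ascending_sort_sketch(
                    levels, group_size, candidates,
                    level - 1, i * group_size..(i + 1) * group_size, out,
                );
            }
        }
    }
}
```

Descending sort is the same walk right to left, and the distribution and range-search functions reuse the same "intersect, then descend" pruning.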

I have tried to test them with some snapshot tests, but more testing could still be done. I don't *think* that the performance of these algorithms has regressed, but that will need to be confirmed by benchmarks.

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in level 0 of `facet_id_string_docids`. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id that contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, the string value returned in the facet distribution may not appear in any of the candidates. For example,
```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```
Facet distribution for the `colour` field among the candidates `[1]`:
```
{ "RED": 1 }
```
Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetKey<T>` and we often work with `FacetKey<&[u8]>` and `FacetKey<&str>`, we need codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), so I created two new codecs:
* `ByteSliceRef` is a codec with `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with `EItem = DItem = &str`
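To show why the item type matters, here is a sketch of `StrRefCodec` written against simplified stand-ins for heed's `BytesEncode`/`BytesDecode` traits (the trait definitions below are illustrative, not heed's real ones): both the encoded and decoded items are `&str`, borrowing from the caller or from the database rather than owning the data.

```rust
use std::borrow::Cow;

// Simplified stand-ins for heed's encode/decode traits.
trait BytesEncode<'a> {
    type EItem: 'a;
    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>>;
}

trait BytesDecode<'a> {
    type DItem: 'a;
    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem>;
}

struct StrRefCodec;

impl<'a> BytesEncode<'a> for StrRefCodec {
    // The item is `&str`, not `str`: this is what lets us encode a
    // `FacetKey<&str>` without copying the string.
    type EItem = &'a str;
    fn bytes_encode(item: &'a &'a str) -> Option<Cow<'a, [u8]>> {
        Some(Cow::Borrowed(item.as_bytes()))
    }
}

impl<'a> BytesDecode<'a> for StrRefCodec {
    type DItem = &'a str;
    fn bytes_decode(bytes: &'a [u8]) -> Option<&'a str> {
        std::str::from_utf8(bytes).ok()
    }
}
```

`ByteSliceRef` is the same idea with `&[u8]` in place of `&str` (and no UTF-8 check on decode).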

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.
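For reference, here is the standard order-preserving trick such a codec relies on (a sketch only: `OrderedF64Codec`'s actual byte layout may differ). The goal is that comparing the big-endian encoded bytes lexicographically gives the same result as comparing the floats numerically, which is what lets the facet keys sort correctly inside LMDB.

```rust
// Encode an f64 so that bytewise comparison of the big-endian bytes matches
// numeric order: flip the sign bit of non-negative values so they sort after
// negatives, and invert all bits of negative values so that more-negative
// numbers sort first.
fn encode_ordered_f64(f: f64) -> [u8; 8] {
    let bits = f.to_bits();
    let ordered = if bits & (1 << 63) != 0 {
        !bits // negative
    } else {
        bits | (1 << 63) // zero or positive
    };
    ordered.to_be_bytes()
}

// Reverse the transformation to recover the original f64.
fn decode_ordered_f64(bytes: [u8; 8]) -> f64 {
    let ordered = u64::from_be_bytes(bytes);
    let bits = if ordered & (1 << 63) != 0 {
        ordered & !(1 << 63) // was zero or positive
    } else {
        !ordered // was negative
    };
    f64::from_bits(bits)
}
```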


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
@bors (bot) commented on Oct 26, 2022:

> Build failed:

@curquiza:

> bors merge

@bors (bot) commented on Oct 26, 2022:

> Build succeeded:

bors bot merged commit 2e53924 into main on Oct 26, 2022 and deleted the facet-levels-refactor branch on October 26, 2022 at 15:25.
Labels: API breaking (the related changes break the milli API), DB breaking (the related changes break the DB), indexing (related to the documents/settings indexing algorithms), performance (related to performance in terms of search/indexation speed or RAM/CPU/disk consumption), querying (related to the searching/fetch data algorithms).
Merging this pull request may close these issues: "Improve speed of incremental indexing" and "Separate the original facet string values from the docids".