This repository has been archived by the owner on Apr 4, 2023. It is now read-only.
Refactor the Facets databases to enable incremental indexing #619
Merged
Conversation
loiclec added the following labels on Sep 1, 2022:
- indexing: Related to the documents/settings indexing algorithms.
- querying: Related to the searching/fetch data algorithms.
- DB breaking: The related changes break the DB
- performance: Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
loiclec force-pushed the `facet-levels-refactor` branch 2 times, most recently from 527c381 to c61386f, on September 12, 2022 07:45
Adding API-breaking for this part;
irevoire reviewed on Sep 12, 2022
loiclec force-pushed the `facet-levels-refactor` branch 2 times, most recently from 838ea66 to ae81e63, on September 21, 2022 15:16
loiclec force-pushed the `facet-levels-refactor` branch 3 times, most recently from 5d9c1be to bf83884, on October 17, 2022 11:04
loiclec force-pushed the `facet-levels-refactor` branch 2 times, most recently from 150e19a to 52a7e10, on October 25, 2022 09:05
Prepare refactor of facets database
By deleting multiple docids at once instead of one-by-one
+ update deletion snapshots to the new database format
Where the docid that is used to get the original facet string value definitely belongs to the candidates
loiclec force-pushed the `facet-levels-refactor` branch from 52a7e10 to b7f2428 on October 26, 2022 11:49
Kerollmops suggested changes on Oct 26, 2022
- Why do `http-ui/src/main.rs` and `infos/src/main.rs` exist?
Aftermath of a rebase, they're gone now :)
Kerollmops previously approved these changes on Oct 26, 2022
Thank you very much! This is the last day and hour of this magnificent PR 👏
It will improve the facet insertion and clean up a lot of code from milli ❤️
bors merge
Merge conflict.
Kerollmops approved these changes on Oct 26, 2022
bors merge
bors bot added a commit that referenced this pull request on Oct 26, 2022
619: Refactor the Facets databases to enable incremental indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?

Partly fixes #605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and meilisearch/meilisearch#2820.

Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases was not updated following the deletion of documents.

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures.

For each field id, the facet number database had the following format:

```
          ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐ │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│ │                               │                               │               │
└───────┘ │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
          ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐ │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│ │               │               │               │               │               │
└───────┘ │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
          ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐ │  1.2  │  1.3  │  1.6  │   2   │  3.4  │  12   │ 12.3  │  100  │  102  │  104  │
│Level 0│ │       │       │       │       │       │       │       │       │       │       │
└───────┘ │ a, b  │ d, z  │ b, f  │ a, f  │ c, d  │   g   │   e   │ e, f  │   y   │   u   │
          └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```

where the first line is the key of the database, consisting of:
- the field id
- the level height
- the left and right bound of the group

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:

```
          ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐ │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│ │                               │                               │               │
└───────┘ │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
          ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐ │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│ │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘ │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
          ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐ │ "ab"  │ "ac"  │ "ba"  │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │ "zz"  │
│Level 0│ │ "AB"  │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │ "ZZ"  │
└───────┘ │ a, b  │ d, z  │ b, f  │ a, f  │ c, d  │   g   │   e   │ e, f  │   y   │   u   │
          └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```

where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:

```
          ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐ │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│ │                               │                               │               │
└───────┘ │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
          ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐ │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│ │               │               │               │               │               │
└───────┘ │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
          ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐ │ "ab"  │ "ac"  │ "ba"  │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │ "zz"  │
│Level 0│ │       │       │       │       │       │       │       │       │       │       │
└───────┘ │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
          └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```

where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit: it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`, then we want to add/modify the elements highlighted below in pink:

<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:

<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two.

For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:

<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">

Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:

<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">

and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:

<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:

<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value

TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.

## Bulk insertion

While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go.

The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch

The algorithm for bulk insertion is the same as the previous one.

## Choosing between incremental and bulk insertion

On my computer, I measured that it is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.

## Fuzz-testing

**Potentially controversial:**

I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession; each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it.

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing).

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`

I have tried to test them with some snapshot tests but more testing could still be done. I don't *think* that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in the level 0 of `facet_id_string_docids`. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, it may be that the string value returned in the facet distribution does not appear in any of the candidates. For example,

```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```

Facet distribution for the `colour` field among the candidates `[1]`:

```
{ "RED": 1 }
```

Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetGroupKey<T>` and we often work with `FacetGroupKey<&[u8]>` and `FacetGroupKey<&str>`, we need codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), so I have created two new codecs:
* `ByteSliceRef` is a codec with a `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with a `EItem = DItem = &str`

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
Build failed:
bors merge
Build succeeded:
Labels
API breaking
The related changes break the milli API
DB breaking
The related changes break the DB
indexing
Related to the documents/settings indexing algorithms.
performance
Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
querying
Related to the searching/fetch data algorithms.
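# Pull Request

## What does this PR do?

Partly fixes #605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and meilisearch/meilisearch#2820.

Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases was not updated following the deletion of documents.

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures.

For each field id, the facet number database had the following format:

where the first line is the key of the database, consisting of:
- the field id
- the level height
- the left and right bound of the group

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:

where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds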
## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:

where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit: it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.
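To make the key layout concrete, here is a minimal sketch of the two types and of a byte encoding for the key. The field names, the `encode_key`/`decode_key` helpers, and the exact byte layout (big-endian field id, then the level, then the raw left bound) are assumptions made for illustration, not the actual `FacetGroupKeyCodec`; `BTreeSet<u32>` stands in for `RoaringBitmap` so that the sketch has no dependencies.

```rust
use std::collections::BTreeSet;

/// Key of a facet group: (field id, level, left bound).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FacetGroupKey<T> {
    pub field_id: u16,
    pub level: u8,
    pub left_bound: T,
}

/// Value of a facet group. `BTreeSet<u32>` is a stand-in for `RoaringBitmap`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FacetGroupValue {
    /// Number of elements of the level below covered by the group (0 at level 0).
    pub size: u8,
    pub docids: BTreeSet<u32>,
}

/// Assumed layout: big-endian field id, level, then the raw bytes of the bound.
/// With this layout, comparing encoded keys byte by byte orders them by
/// field id, then by level, then by left bound.
pub fn encode_key(key: &FacetGroupKey<&[u8]>) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(3 + key.left_bound.len());
    bytes.extend_from_slice(&key.field_id.to_be_bytes());
    bytes.push(key.level);
    bytes.extend_from_slice(key.left_bound);
    bytes
}

/// Inverse of `encode_key`; returns `None` if the header is truncated.
pub fn decode_key(bytes: &[u8]) -> Option<FacetGroupKey<&[u8]>> {
    if bytes.len() < 3 {
        return None;
    }
    Some(FacetGroupKey {
        field_id: u16::from_be_bytes([bytes[0], bytes[1]]),
        level: bytes[2],
        left_bound: &bytes[3..],
    })
}
```

Because all entries of one level of one field are contiguous and sorted by left bound under such a layout, the implicit right bound of a group is simply the left bound of the next key in iteration order.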
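## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`, then we want to add/modify the elements highlighted below in pink:

<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:

<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two.

For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:

<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">

Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:

<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">

and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:

<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">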
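The walk-through above can be sketched on a simplified in-memory model. Everything here is hypothetical and only illustrates the idea: `TwoLevels` keeps just levels 0 and 1 in `BTreeMap`s (the real code works on an LMDB database through heed and propagates changes to every level), docids are plain `u32`s in a `BTreeSet` instead of a `RoaringBitmap`, and a split simply cuts the group in half.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for `RoaringBitmap`, to keep the sketch dependency-free.
pub type DocIds = BTreeSet<u32>;

/// Hypothetical in-memory model with only levels 0 and 1.
#[derive(Debug, Default)]
pub struct TwoLevels {
    /// Level 0: facet value -> docids.
    pub level0: BTreeMap<String, DocIds>,
    /// Level 1: left bound -> (group size, union of the children's docids).
    pub level1: BTreeMap<String, (usize, DocIds)>,
}

impl TwoLevels {
    pub fn insert(&mut self, value: &str, docid: u32, max_group_size: usize) {
        assert!(max_group_size >= 2, "a split needs two non-empty halves");

        // 1. Add the docid in level 0, remembering whether the value is new.
        let is_new = !self.level0.contains_key(value);
        self.level0.entry(value.to_string()).or_default().insert(docid);

        // 2. The parent is the group with the greatest left bound <= value,
        //    or the first group when the value is smaller than every bound.
        let parent = self
            .level1
            .range(..=value.to_string())
            .next_back()
            .map(|(bound, _)| bound.clone())
            .or_else(|| self.level1.keys().next().cloned());
        let Some(mut parent) = parent else {
            // Level 1 is empty: create its first group.
            self.level1.insert(value.to_string(), (1, DocIds::from([docid])));
            return;
        };

        // Edge case: the new value becomes the left bound of the first group.
        if value < parent.as_str() {
            let group = self.level1.remove(&parent).expect("parent exists");
            parent = value.to_string();
            self.level1.insert(parent.clone(), group);
        }

        let group = self.level1.get_mut(&parent).expect("parent exists");
        group.1.insert(docid);
        if is_new {
            group.0 += 1;
        }
        let size = group.0;

        // 3. Split the parent in two once it reaches `max_group_size`.
        if size >= max_group_size {
            let children: Vec<(String, DocIds)> = self
                .level0
                .range(parent..)
                .take(size)
                .map(|(k, d)| (k.clone(), d.clone()))
                .collect();
            let (left, right) = children.split_at(size / 2);
            for half in [left, right] {
                let docids: DocIds =
                    half.iter().flat_map(|(_, d)| d.iter().copied()).collect();
                // The left half overwrites the old parent, the right half is new.
                self.level1.insert(half[0].0.clone(), (half.len(), docids));
            }
        }
    }
}
```

With `max_group_size = 4`, inserting `gaf`, `gal` and `gap` grows a single level-1 group, and inserting `gas` splits it into a group starting at `gaf` and a group starting at `gap`, mirroring the screenshots.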
We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:

<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value

TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.

## Bulk insertion

While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go.

The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch

The algorithm for bulk insertion is the same as the previous one.
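Step 3 of the list above (rebuilding the higher levels from level 0) might look like the following sketch. It is not the `FacetUpdateBulk` implementation: `Group`, `build_upper_levels`, and the in-memory `BTreeMap` are hypothetical, `BTreeSet<u32>` replaces `RoaringBitmap`, and the stopping rule reuses the `group_size * min_level_size` condition quoted in the insertion section, which is an assumption for the bulk path.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for `RoaringBitmap`.
pub type DocIds = BTreeSet<u32>;

/// One group of a level: left bound, number of children, union of docids.
#[derive(Debug, Clone, PartialEq)]
pub struct Group {
    pub left_bound: String,
    pub size: usize,
    pub docids: DocIds,
}

/// Rebuilds every level above level 0. `result[0]` is level 1, `result[1]`
/// is level 2, and so on. A new level is added while the current highest
/// level holds more than `group_size * min_level_size` elements.
pub fn build_upper_levels(
    level0: &BTreeMap<String, DocIds>,
    group_size: usize,
    min_level_size: usize,
) -> Vec<Vec<Group>> {
    assert!(group_size >= 2 && min_level_size >= 1);

    // View level 0 as groups of size 0, as in the database.
    let mut below: Vec<Group> = level0
        .iter()
        .map(|(value, docids)| Group {
            left_bound: value.clone(),
            size: 0,
            docids: docids.clone(),
        })
        .collect();

    let mut levels = Vec::new();
    while below.len() > group_size * min_level_size {
        // Each chunk of `group_size` children becomes one parent group.
        let level: Vec<Group> = below
            .chunks(group_size)
            .map(|children| Group {
                left_bound: children[0].left_bound.clone(),
                size: children.len(),
                docids: children
                    .iter()
                    .flat_map(|child| child.docids.iter().copied())
                    .collect(),
            })
            .collect();
        levels.push(level.clone());
        below = level;
    }
    levels
}
```

With ten level-0 values, `group_size = 2` and `min_level_size = 2`, this produces five groups in level 1 and three groups in level 2, the last of which has size 1.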
## Choosing between incremental and bulk insertion

On my computer, I measured that it is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.
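One way to turn these two counts into a decision is a rough cost comparison. The sketch below is only an illustration under an assumed cost model (a bulk rebuild costs one unit per level-0 element, an incremental insertion costs `slowdown` units per new value); the `Strategy` enum, the `choose_strategy` function, and the model itself are hypothetical and are not taken from the PR.

```rust
/// Which insertion method to use for a batch of new facet values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Incremental,
    Bulk,
}

/// Hypothetical heuristic. `slowdown` is the measured relative cost of
/// inserting one value incrementally (about 50 in the measurement above).
pub fn choose_strategy(existing_level0: u64, new_values: u64, slowdown: u64) -> Strategy {
    // A bulk rebuild touches every level-0 element, old and new.
    let bulk_cost = existing_level0.saturating_add(new_values);
    // Incremental insertion only touches the new values, but at a higher price.
    let incremental_cost = new_values.saturating_mul(slowdown);
    if incremental_cost < bulk_cost {
        Strategy::Incremental
    } else {
        Strategy::Bulk
    }
}
```

Under this model, adding 10 values to a database of a million elements is done incrementally, while initial indexing (an empty database) always goes through the bulk path.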
## Fuzz-testing

**Potentially controversial:**

I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession; each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it.

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing).

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`

I have tried to test them with some snapshot tests but more testing could still be done. I don't *think* that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.
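To illustrate why the levels help, here is a minimal sketch of a range search over such a structure. It is not the code of `find_docids_of_facet_within_bounds`: the `Group` type, the in-memory `levels` slice, the `u32` facet values, and the inclusive `min`/`max` bounds are simplifying assumptions. The idea is to start at the highest level, take the whole bitmap of any group that lies entirely within the bounds, skip groups entirely outside, and descend only into groups that straddle a bound.

```rust
use std::collections::BTreeSet;

/// Stand-in for `RoaringBitmap`.
pub type DocIds = BTreeSet<u32>;

/// A group whose right bound is implicit: it is the next group's left bound.
#[derive(Debug, Clone)]
pub struct Group {
    pub left_bound: u32,
    pub docids: DocIds,
}

/// `levels[0]` is level 0 (one entry per facet value); the last entry is the
/// highest level. Returns the docids whose facet value is in `min..=max`.
pub fn docids_within_bounds(levels: &[Vec<Group>], min: u32, max: u32) -> DocIds {
    let mut out = DocIds::new();
    if let Some(top) = levels.len().checked_sub(1) {
        visit(levels, top, 0, levels[top].len(), min, max, &mut out);
    }
    out
}

fn visit(
    levels: &[Vec<Group>],
    level: usize,
    start: usize,
    end: usize,
    min: u32,
    max: u32,
    out: &mut DocIds,
) {
    let groups = &levels[level];
    for i in start..end {
        let left = groups[i].left_bound;
        if left > max {
            break; // this group and all the following ones are above the range
        }
        // Implicit right bound: Excluded(left bound of the next group).
        let next_left = groups.get(i + 1).map(|g| g.left_bound);
        if level == 0 {
            if left >= min {
                out.extend(groups[i].docids.iter().copied());
            }
            continue;
        }
        if next_left.map_or(false, |n| n <= min) {
            continue; // every value of the group is below the range
        }
        if left >= min && next_left.map_or(false, |n| n <= max) {
            // The group is fully inside the range: take its whole bitmap.
            out.extend(groups[i].docids.iter().copied());
            continue;
        }
        // Partial overlap: only look at the children of this group.
        let below = &levels[level - 1];
        let child_start = below.partition_point(|g| g.left_bound < left);
        let child_end =
            next_left.map_or(below.len(), |n| below.partition_point(|g| g.left_bound < n));
        visit(levels, level - 1, child_start, child_end, min, max, out);
    }
}
```

For a wide range, most of the result comes from a few high-level bitmaps, and level 0 is only read near the two ends of the range.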
## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in the level 0 of `facet_id_string_docids`. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, it may be that the string value returned in the facet distribution does not appear in any of the candidates.

For example, given the documents `{ "id": 0, "colour": "RED" }` and `{ "id": 1, "colour": "red" }`, the facet distribution for the `colour` field among the candidates `[1]` is `{ "RED": 1 }`. Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetGroupKey<T>` and we often work with `FacetGroupKey<&[u8]>` and `FacetGroupKey<&str>`, we need codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), so I have created two new codecs:
* `ByteSliceRef` is a codec with a `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with a `EItem = DItem = &str`

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.
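For readers unfamiliar with the idea behind an "ordered f64" encoding: the database compares keys byte by byte, so numbers must be encoded in a way that preserves their numeric order. The sketch below shows one well-known order-preserving transform (flip the sign bit of non-negative numbers, flip every bit of negative ones, store big-endian). It is an assumption made for illustration; the byte layout actually produced by `OrderedF64Codec` may differ, and the function names are hypothetical.

```rust
/// Encodes a non-NaN `f64` so that byte-wise comparison of the outputs
/// matches numeric comparison of the inputs. Returns `None` for NaN,
/// which has no place in a total order.
pub fn encode_ordered_f64(x: f64) -> Option<[u8; 8]> {
    if x.is_nan() {
        return None;
    }
    let bits = x.to_bits();
    let ordered = if bits >> 63 == 1 {
        !bits // negative: flip everything so larger magnitudes sort first
    } else {
        bits | (1 << 63) // non-negative: set the sign bit to sort after negatives
    };
    Some(ordered.to_be_bytes())
}

/// Inverse of `encode_ordered_f64`.
pub fn decode_ordered_f64(bytes: [u8; 8]) -> f64 {
    let ordered = u64::from_be_bytes(bytes);
    let bits = if ordered >> 63 == 1 {
        ordered & !(1 << 63) // was non-negative: clear the sign bit again
    } else {
        !ordered // was negative: undo the full flip
    };
    f64::from_bits(bits)
}
```

With such an encoding, the number database can share the byte-slice based implementation with the string database, since both only rely on the byte order of the left bounds.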