Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Criteria lazy data preparation #358

Closed
Kerollmops opened this issue Nov 30, 2019 · 2 comments
Closed

Criteria lazy data preparation #358

Kerollmops opened this issue Nov 30, 2019 · 2 comments
Assignees
Milestone

Comments

@Kerollmops
Copy link
Member

@Kerollmops Kerollmops commented Nov 30, 2019

Currently, when a query is submitted, we retrieve the postings lists corresponding to each query word and store them in the same memory allocation. Once all of the postings lists have been retrieved (25.63ms):

  • we sort all of those accumulated TmpMath by DocumentId and word_index (20.03ms)
  • rewrite them for potential multi-words (i.e. "searchengine" -> "search" "engine") (11.41ms)
  • sort by DocumentId the previously separated highlights (10.32ms)
  • and we finally construct the RawDocument list from that data (15.06ms)

All of these sorting and allocating actions (a total of 82.41ms) are done for the whole accumulated TmpMatch (275196), this is a colossal and, sadly, useless work for most of the criteria especially if we take into account the fact that the main algorithm, the bucket sort, only take 10.54ms. This is roughly 13% of the total time spent for a search, 87% of the time is spent "preparing the data" for the bucket sort...

To give some context about the current criteria running in the engine and their needs in terms of data setup:

  • SumOfTypos needs the Levenshtein distance of each query words (6936)
  • NumberOfWords needs the number of query words matching (6936)
  • WordsProximity needs the query words proximity in the document (1564)
  • SumOfWordsAttribute needs the attribute where query words matched (1122)
  • SumOfWordsPosition needs the position of query words in the attribute (1122)
  • Exact needs the Levenshtein distance and some more about query words (1122)
  • DocumentId needs each document id (1, 1, 1, 1)

For information, 83470 documents were nominated candidate for the bucket sort, or 81.8% of the total number of indexed documents. The numbers written in parentheses after each of the criteria above indicates the number of documents to be given to the next criteria, like filtering. In other words, the first criteria filtered 91.6% of the candidates documents, keeping 8.3%.

I think that all of the preparation work before the bucket sort can just be made when necessary, avoiding the multi-words rewrite, because only the WordsProximity criteria needed this information (so only 6936 documents would have been rewritten not 83470).
I will also keep the raw data when possible and avoid taking too much time processing it, keeping the TmpMatch data as it is. Probably formatting it for other criterion and sorting it differently for others.

So the general idea behind this issue is to introduce a Criterion::init/prepare function to format the data that each RawDocument exposes. This function would be called on any documents group about to be processed by a criterion.

I would like to see a minimum of a 70% query time reduction. Moving from 103.24ms to something like 30.9ms for this example.

@Kerollmops

This comment has been minimized.

Copy link
Member Author

@Kerollmops Kerollmops commented Dec 6, 2019

I have made some progress on this issue, I achieved something between a 27x to a 3.5x search query time reduction 🎉 But I am still worried of some long running search requests.

It is related to the Proximity criterion that needs to prepare the documents query matches. In the first query it takes 47.02ms and 31.53ms to evaluate 26998 documents. In the second there is only 5021 documents to evaluate and therefore the preparation and evaluation only take 11.22ms and 2.71ms respectively.

I need to work on that, I don't really have any idea currently but I will probably try to reuse the processed_distances RawDocument field if possible.

Detail for request "sneakers f"

Searching for: sneakers f
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] words fst took 38.52µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] fst search took 477.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "sneakers" gives 3 words
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] fst search took 416.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "f" gives 2514 words
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] fst search took 903.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "sneaker" gives 3 words
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] fst search took 457.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "s" gives 1 words
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] fst search took 356.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "sneakersf" gives 3 words
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] stream next took 1.47ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] postings lists fetching took 8.67ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] dfa creation took 186.13µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] bare matches (355568) retrieved in 12.56ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] sort by documents ids took 13.61ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] creating 84717 candidates documents took 5.62ms
[meilisearch-core/src/bucket_sort.rs:70] mem::size_of::<BareMatch>() = 16
[meilisearch-core/src/bucket_sort.rs:71] mem::size_of::<SimpleMatch>() = 8
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "typo" preparation took 5.25ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "typo" evaluation took 2.99ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "typo" produced a group of size 26998
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words" preparation took 72.89µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words" evaluation took 201.65µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words" produced a group of size 26998
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "proximity" preparation took 47.02ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "proximity" evaluation took 31.53ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "proximity" produced a group of size 5517
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "attribute" preparation took 20.52µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "attribute" evaluation took 1.72ms
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "attribute" produced a group of size 589
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words position" preparation took 2.03µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words position" evaluation took 92.17µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "words position" produced a group of size 528
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "exact" preparation took 124.88µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "exact" evaluation took 60.31µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "exact" produced a group of size 48
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" preparation took 93.00ns
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" evaluation took 3.40µs
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:38:44Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
raw-id: DocumentId(56086092224330608)
skuId: 1A2VZ9
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description:  chaussure de tennis classique, ce sneaker en cuir de veau lisse est o
matching in: ["en_GB_description", "fr_FR_description", "zh_CN_commercialName", "zh_HK_commercialName", "zh_HK_description", "en_GB_commercialName", "es_ES_commercialName", "fr_FR_commercialName"]

raw-id: DocumentId(333254671128962245)
skuId: 1A2TK2
fr_FR_commercialName: Sneaker Fastlane
fr_FR_description: Sneaker en néoprène et maille filet. Détails réfléchissants et semelle
matching in: ["zh_HK_commercialName", "en_GB_description", "fr_FR_description", "es_ES_description", "zh_CN_description", "fr_FR_commercialName", "zh_CN_commercialName", "es_ES_commercialName", "en_GB_commercialName"]

raw-id: DocumentId(1408815464919350922)
skuId: 1A38ST
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description:  la chaussure de tennis classique, sneaker au design épuré signé d'un
matching in: ["zh_CN_commercialName", "fr_FR_description", "es_ES_commercialName", "en_GB_description", "fr_FR_commercialName", "en_GB_commercialName", "zh_HK_description", "zh_HK_commercialName"]

raw-id: DocumentId(1720793029438713960)
skuId: 1A38T2
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description:  la chaussure de tennis classique, sneaker au design épuré signé d'un
matching in: ["fr_FR_commercialName", "en_GB_commercialName", "zh_CN_commercialName", "en_GB_description", "es_ES_commercialName", "zh_HK_description", "fr_FR_description", "zh_HK_commercialName"]

whole documents fields retrieve took 73.52µs
===== Found 4 results in 125.99ms =====

Detail for request "sneaker f"

Searching for: sneaker f
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] words fst took 25.74µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] fst search took 281.00ns
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "sneaker" gives 3 words
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] fst search took 258.00ns
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "f" gives 2514 words
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] fst search took 618.00ns
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "sneakerf" gives 2 words
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] stream next took 907.24µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] postings lists fetching took 6.88ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] dfa creation took 35.86µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] bare matches (309933) retrieved in 9.29ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] sort by documents ids took 12.90ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] creating 82261 candidates documents took 1.43ms
[meilisearch-core/src/bucket_sort.rs:70] mem::size_of::<BareMatch>() = 16
[meilisearch-core/src/bucket_sort.rs:71] mem::size_of::<SimpleMatch>() = 8
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "typo" preparation took 4.31ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "typo" evaluation took 2.85ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "typo" produced a group of size 5021
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words" preparation took 14.86µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words" evaluation took 90.83µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words" produced a group of size 5021
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "proximity" preparation took 11.22ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "proximity" evaluation took 2.71ms
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "proximity" produced a group of size 5021
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "attribute" preparation took 13.83µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "attribute" evaluation took 919.57µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "attribute" produced a group of size 754
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words position" preparation took 1.30µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words position" evaluation took 88.59µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "words position" produced a group of size 671
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "exact" preparation took 139.57µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "exact" evaluation took 42.11µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "exact" produced a group of size 671
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" preparation took 99.00ns
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" evaluation took 51.25µs
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
[2019-12-06T12:39:16Z DEBUG meilisearch_core::bucket_sort] "stable document id" produced a group of size 1
raw-id: DocumentId(56086092224330608)
skuId: 1A2VZ9
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description:  chaussure de tennis classique, ce sneaker en cuir de veau lisse est o
matching in: ["en_GB_commercialName", "zh_CN_commercialName", "es_ES_commercialName", "fr_FR_description", "zh_HK_commercialName", "zh_HK_description", "fr_FR_commercialName", "en_GB_description"]

raw-id: DocumentId(87240445382249886)
skuId: 474388
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description: <BR>
matching in: ["en_GB_commercialName", "zh_HK_commercialName", "fr_FR_commercialName", "es_ES_commercialName", "zh_CN_commercialName", "en_GB_description"]

raw-id: DocumentId(116310161627608778)
skuId: 1A2VZD
fr_FR_commercialName: Sneaker Frontrow
fr_FR_description:  chaussure de tennis classique, ce sneaker en cuir de veau lisse est o
matching in: ["zh_CN_commercialName", "zh_HK_commercialName", "en_GB_commercialName", "zh_HK_description", "fr_FR_commercialName", "en_GB_description", "es_ES_commercialName", "fr_FR_description"]

raw-id: DocumentId(122675483337993298)
skuId: 1A2VW7
fr_FR_commercialName: Sneaker Fastlane
fr_FR_description: Ce sneaker moderne en néoprène et maille filet, épouse parfaitement le
matching in: ["fr_FR_description", "zh_HK_commercialName", "en_GB_commercialName", "zh_CN_commercialName", "fr_FR_commercialName", "zh_HK_description", "zh_CN_description", "en_GB_description", "es_ES_commercialName"]

whole documents fields retrieve took 75.46µs
===== Found 4 results in 50.15ms =====


For your information "sneaker f" and "sneakers f" take 399.38ms and 398.01ms on the old system respectively and 125.99ms and 50.15ms on the new one.

@bidoubiwa

This comment has been minimized.

Copy link
Member

@bidoubiwa bidoubiwa commented Jan 7, 2020

This PR changes the name of the ranking rules the following way :

- "_sum_of_typos" => builder.push(SumOfTypos),
- "_number_of_words" => builder.push(NumberOfWords),
- "_word_proximity" => builder.push(WordsProximity),
- "_sum_of_words_attribute" => builder.push(SumOfWordsAttribute),
- "_sum_of_words_position" => builder.push(SumOfWordsPosition),
+ "_typo" => builder.push(Typo),
+ "_words" => builder.push(Words),
+ "_proximity" => builder.push(Proximity),
+ "_attribute" => builder.push(Attribute),
+ "_words_position" => builder.push(WordsPosition),
@qdequele qdequele added this to the v0.9.0 milestone Jan 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

3 participants
You can’t perform that action at this time.