
Indexing documents with a lot of text is too slow #3714

Closed

loiclec opened this issue May 1, 2023 · 4 comments
Labels

  • bug: Something isn't working as expected
  • indexing
  • performance: Related to performance in terms of search/indexation speed or RAM/CPU/Disk consumption
  • spike: A spike is where we need to do investigation before we can estimate the effort to fix

Comments

@loiclec
Contributor

loiclec commented May 1, 2023

Meilisearch v1.1 takes too long to index documents that contain tens of thousands of words. For example, one user reported that indexing a small number of books could take multiple hours and that indexing 2000 books would seemingly never succeed.
In their case, the logs showed:

[2023-04-20T09:50:30Z DEBUG TimerFinished] WordPrefixPositionDocids::execute(), Elapsed=6220.5601475s
[2023-04-20T09:50:30Z DEBUG TimerFinished] IndexDocuments::execute_prefix_databases(), Elapsed=9582.5155799s
[2023-04-20T09:50:30Z DEBUG TimerFinished] IndexDocuments::execute_raw(), Elapsed=16657.9294701s

which indicates that a significant chunk of the time is spent in WordPrefixPositionDocids.
In Meilisearch v1.1, we save the exact position of each word in a document. This partly explains why WordPrefixPositionDocids is so slow, and why the index would grow to an unacceptable size.
In Meilisearch v1.2, we will "bucket" the positions of words in a document, which will significantly speed up this operation and reduce the size of the index.
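For illustration, here is a minimal sketch of what such position bucketing can look like. The function name, threshold, and bucket width below are assumptions for this example, not the exact scheme shipped in v1.2; the point is only that many distinct positions collapse into one key, which shrinks the position databases:

```rust
// A minimal sketch of position bucketing; the threshold and bucket width
// are assumed values for illustration, not the exact v1.2 scheme.
fn bucketed_position(position: u32) -> u32 {
    const EXACT_LIMIT: u32 = 16; // keep early positions exact (assumed value)
    const BUCKET_WIDTH: u32 = 8; // group later positions (assumed value)
    if position < EXACT_LIMIT {
        position
    } else {
        EXACT_LIMIT + (position - EXACT_LIMIT) / BUCKET_WIDTH
    }
}

fn main() {
    // Positions 100 and 103 land in the same bucket, so they share a
    // single database key instead of two.
    assert_eq!(bucketed_position(100), bucketed_position(103));
}
```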

But we probably need to do more to make indexing performance acceptable for this use case. A few additional suggestions:

  • Get rid of the docid_word_positions database, which still keeps the exact position of each word in Meilisearch v1.2.
  • Reduce the maximum proximity of word pairs in word_pair_proximity_docids from 7 to ... maybe 3? We know this database becomes very big when documents contain a lot of text, and reducing the maximum proximity would shrink it significantly.
  • Change the format of the word_pair_proximity_docids database from (prox, word1, word2) -> docids to (word1, word2) -> Vec<(prox, docids)> to reduce its size further (see the sketch after this list).
  • Move forward with this refactor: Refactor of the data extractors used during indexing milli#656
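To make the last suggestion concrete, here is a rough sketch of the proposed key layout change, using simplified stand-in types (the real database stores byte-encoded LMDB keys and roaring bitmaps, not Strings and Vecs):

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real on-disk types.
type Word = String;
type Proximity = u8;
type DocIds = Vec<u32>;

// Current layout: one entry per (proximity, word1, word2) triple,
// so each word pair is repeated once per proximity level (up to 7 times).
type CurrentDb = BTreeMap<(Proximity, Word, Word), DocIds>;

// Proposed layout: one entry per word pair, with the proximities folded
// into the value, so each word pair is stored only once.
type ProposedDb = BTreeMap<(Word, Word), Vec<(Proximity, DocIds)>>;

fn convert(current: &CurrentDb) -> ProposedDb {
    let mut proposed = ProposedDb::new();
    for ((prox, w1, w2), docids) in current {
        proposed
            .entry((w1.clone(), w2.clone()))
            .or_default()
            .push((*prox, docids.clone()));
    }
    proposed
}

fn main() {
    let mut current = CurrentDb::new();
    current.insert((1, "big".into(), "text".into()), vec![1, 2]);
    current.insert((3, "big".into(), "text".into()), vec![3]);
    let proposed = convert(&current);
    // After conversion, ("big", "text") appears once, holding both proximities.
    let key = ("big".to_string(), "text".to_string());
    assert_eq!(proposed[&key].len(), 2);
}
```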
@loiclec added the bug, performance, and indexing labels on May 1, 2023
@dureuill
Contributor

Ping from triage:

  • High severity: prevents indexing books.
  • Size: a very small first step could be to retry with v1.2+, since we have now implemented some of the optimizations described in the issue description. If more than that is required, the size is likely to be large, though.

@curquiza added the spike label on Sep 7, 2023
@curquiza
Member

curquiza commented Jan 3, 2024

@ManyTheFish, following the changes in v1.6.0 with diff indexing, how accurate is this issue given the current state of the code base?

@dureuill
Contributor

dureuill commented Jan 4, 2024

Not @ManyTheFish, but I guess at a minimum we'd want to re-run the reported use case (indexing 20 books) and see whether it is still problematic?

@ManyTheFish
Member

ManyTheFish commented Feb 23, 2024

This issue has been partially fixed through several improvements:

Some possible improvements from this issue remain open:

Closing in favor of the two other issues.
Thanks
