
Indexing documents with a lot of text is too slow #3714

Closed

loiclec opened this issue May 1, 2023 · 4 comments
Labels

  • bug: Something isn't working as expected
  • indexing
  • performance: Related to performance in terms of search/indexation speed or RAM/CPU/Disk consumption
  • spike: A spike is where we need to do investigation before we can estimate the effort to fix

Comments

@loiclec
Contributor

loiclec commented May 1, 2023

Meilisearch v1.1 takes too long to index documents that contain tens of thousands of words. For example, one user reported that indexing a small number of books could take multiple hours and that indexing 2000 books would seemingly never succeed.
In their case, the logs showed:

[2023-04-20T09:50:30Z DEBUG TimerFinished] WordPrefixPositionDocids::execute(), Elapsed=6220.5601475s
[2023-04-20T09:50:30Z DEBUG TimerFinished] IndexDocuments::execute_prefix_databases(), Elapsed=9582.5155799s
[2023-04-20T09:50:30Z DEBUG TimerFinished] IndexDocuments::execute_raw(), Elapsed=16657.9294701s

which indicates that a significant chunk of the time is spent in WordPrefixPositionDocids.
In Meilisearch v1.1, we save the exact position of each word in a document. This partly explains why WordPrefixPositionDocids is so slow, and why the index would grow to an unacceptable size.
In Meilisearch v1.2, we will "bucket" the positions of words in a document, which will significantly speed up this operation and reduce the size of the index.
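For illustration, here is a minimal sketch of what such position bucketing can look like. The function name, threshold, and bucket width below are assumptions for this example, not the exact scheme shipped in v1.2; the point is only that many distinct positions collapse into one key, which shrinks the position databases:

```rust
// A minimal sketch of position bucketing; the threshold and bucket width
// are assumed values for illustration, not the exact v1.2 scheme.
fn bucketed_position(position: u32) -> u32 {
    const EXACT_LIMIT: u32 = 16; // keep early positions exact (assumed value)
    const BUCKET_WIDTH: u32 = 8; // group later positions (assumed value)
    if position < EXACT_LIMIT {
        position
    } else {
        EXACT_LIMIT + (position - EXACT_LIMIT) / BUCKET_WIDTH
    }
}

fn main() {
    // Positions 100 and 103 land in the same bucket, so they share a
    // single database key instead of two.
    assert_eq!(bucketed_position(100), bucketed_position(103));
}
```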

But we probably need to do more to make indexing performance acceptable for this use case. A few additional suggestions:

  • Get rid of the docid_word_positions database, which still keeps the exact position of each word in Meilisearch v1.2.
  • Reduce the maximum proximity of word pairs in word_pair_proximity_docids from 7 to ... maybe 3? We know this database becomes very big when documents contain a lot of text, and reducing the maximum proximity would shrink it significantly.
  • Change the format of the word_pair_proximity_docids database from (prox, word1, word2) -> docids to (word1, word2) -> Vec<(prox, docids)> to reduce its size further (see the sketch after this list).
  • Move forward with this refactor: Refactor of the data extractors used during indexing milli#656
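To make the last suggestion concrete, here is a rough sketch of the proposed key layout change, using simplified stand-in types (the real database stores byte-encoded LMDB keys and roaring bitmaps, not Strings and Vecs):

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real on-disk types.
type Word = String;
type Proximity = u8;
type DocIds = Vec<u32>;

// Current layout: one entry per (proximity, word1, word2) triple,
// so each word pair is repeated once per proximity level (up to 7 times).
type CurrentDb = BTreeMap<(Proximity, Word, Word), DocIds>;

// Proposed layout: one entry per word pair, with the proximities folded
// into the value, so each word pair is stored only once.
type ProposedDb = BTreeMap<(Word, Word), Vec<(Proximity, DocIds)>>;

fn convert(current: &CurrentDb) -> ProposedDb {
    let mut proposed = ProposedDb::new();
    for ((prox, w1, w2), docids) in current {
        proposed
            .entry((w1.clone(), w2.clone()))
            .or_default()
            .push((*prox, docids.clone()));
    }
    proposed
}

fn main() {
    let mut current = CurrentDb::new();
    current.insert((1, "big".into(), "text".into()), vec![1, 2]);
    current.insert((3, "big".into(), "text".into()), vec![3]);
    let proposed = convert(&current);
    // After conversion, ("big", "text") appears once, holding both proximities.
    let key = ("big".to_string(), "text".to_string());
    assert_eq!(proposed[&key].len(), 2);
}
```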
@loiclec added the bug, performance, and indexing labels on May 1, 2023
@dureuill
Contributor

Ping from triage:

  • High severity: prevents indexing books.
  • Size: a very small first step could be to retry with v1.2+, since we have now implemented some of the optimizations described in the issue description. If more than that is required, the size is likely to be large, though.

@curquiza added the spike label on Sep 7, 2023
@curquiza
Member

curquiza commented Jan 3, 2024

@ManyTheFish, following the changes in v1.6.0 with diff indexing, how accurate is this issue given the current state of the code base?

@dureuill
Contributor

dureuill commented Jan 4, 2024

Not @ManyTheFish, but I guess at a minimum we'd want to re-run the reported use case (indexing 20 books) and see whether it is still problematic?

@ManyTheFish
Member

ManyTheFish commented Feb 23, 2024

This issue has been partially fixed through several improvements:

Some possible improvements from this issue remain open:

Closing in favor of the two other issues.
Thanks
