Indexing documents with a lot of text is too slow #3714
Labels: `bug`, `indexing`, `performance`, `spike`
Meilisearch v1.1 takes too long to index documents that contain tens of thousands of words. For example, one user reported that indexing a small number of books could take multiple hours, and that indexing 2000 books would seemingly never succeed.
In their case, the logs indicated that a significant chunk of the time was spent in `WordPrefixPositionDocids`.

In Meilisearch v1.1, we save the exact position of each word in a document. This partly explains why `WordPrefixPositionDocids` is so slow, and why the index grows to an unacceptable size. In Meilisearch v1.2, we will "bucket" the positions of words in a document, which will significantly speed up this operation and reduce the size of the index.
But we probably need to do more to make indexing performance acceptable for this use case. A few additional suggestions:

- Remove the `docid_word_positions` database, which still keeps the exact position of each word in Meilisearch v1.2.
- Reduce the maximum proximity in `word_pair_proximity_docids` from 7 to ... maybe 3? We know this database becomes very big when documents contain a lot of text, and reducing the maximum proximity would reduce its size significantly.
- Change the `word_pair_proximity_docids` database from `(prox, word1, word2) -> docids` to `(word1, word2) -> Vec<(prox, docids)>` to reduce its size further.