Improve indexation time when inserting documents #2203

Closed
3 tasks done
curquiza opened this issue Mar 2, 2022 · 11 comments

Labels: enhancement, milli, performance, tracking issue, v0.27.0

@curquiza
Member

curquiza commented Mar 2, 2022

Thanks to @Kerollmops's work on the milli side, we managed to improve the indexing time when inserting documents into an already existing database.

See the base PR:

And the fixes to improve it

Steps:

  • @Kerollmops shares some metrics so that we can communicate about them (before/after the improvement)
  • Release milli with these changes
  • Update the milli dependency in Meilisearch

Edit by @curquiza 16/03/2022

Another PR, meilisearch/milli#467 by @MarinPostma, also improved the indexing speed.

@curquiza added the enhancement and performance labels Mar 2, 2022
@curquiza added this to the v0.27.0 milestone Mar 2, 2022
@aariacarterweir

I have meilisearch deployed on GKE and can only see it using a max of one core also.

@curquiza
Member Author

Hello @aariacarterweir
This issue only tracks the specific improvement to document insertion made by @Kerollmops across the different PRs.

You will probably be more interested in the open discussion dedicated to indexing performance.
The goal of that discussion is to centralize issues and help us target the main bottlenecks.
Don't hesitate to test the potential fixes described in the post and, if they don't work, to write a comment explaining your issue.

We will keep working on indexing time for as long as it is not satisfying for our users.

Thanks for your interest in Meilisearch!

@Kerollmops
Member

Kerollmops commented Mar 15, 2022

Here are the benchmarks before/after my improvements to the time spent in the prefix databases. These databases and data structures are used by the engine to reduce the time spent searching for all the words (or pairs of words) that start with a given prefix. Computing them can take time, so we now use the difference between the previous and the newly created word prefix FSTs instead.
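
To illustrate what taking the difference between two prefix sets can look like, here is a minimal sketch. It uses plain Python sets rather than FSTs and is not milli's actual code: the function name `diff_prefixes` and the sample prefixes are hypothetical, and the idea that only the added and removed prefixes need extra work is an assumption drawn from the description above.

```python
def diff_prefixes(old_prefixes, new_prefixes):
    """Split prefixes into three groups: kept, newly added, and removed.

    Hypothetical helper: a real engine would stream over sorted FSTs
    instead of materializing Python sets.
    """
    old, new = set(old_prefixes), set(new_prefixes)
    common = sorted(old & new)   # present before and after the update
    added = sorted(new - old)    # only in the newly created prefix set
    removed = sorted(old - new)  # only in the previous prefix set
    return common, added, removed


# Made-up prefix sets before and after inserting a small batch of documents.
old = ["a", "al", "ar", "t"]
new = ["a", "al", "b", "t", "ti"]

common, added, removed = diff_prefixes(old, new)
print("common: ", common)
print("added:  ", added)
print("removed:", removed)
```

With a small batch of new documents, the added and removed groups are expected to be tiny compared to the common one, which is where the time saving would come from under this assumption.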

In this experiment, we indexed 80 million songs and then sent 10 batches of 10 documents that are not known by the engine.

Settings

{
    "searchableAttributes":
    [
        "title",
        "album",
        "artist"
    ],
    "displayedAttributes":
    [
        "id",
        "title",
        "album",
        "artist",
        "genre",
        "country",
        "released",
        "duration"
    ],
    "criteria":
    [
        "words",
        "typo",
        "proximity",
        "attribute",
        "released-timestamp:desc"
    ],
    "filterableAttributes":
    [
        "released-timestamp",
        "duration-float",
        "genre",
        "country",
        "artist"
    ]
}

Before meilisearch/milli@45f5262

Here are the times taken by the updates, in seconds, from the most recent to the least recent. We first sent the whole 80 million documents (2932s) and then sent the new documents 10 by 10.

868.89 + 875.24 + 867.27 + 887.89 + 872.97 + 880.95 + 810.81 + 806.35 + 814.53 + 800.37 + 2932.98 = 11418.25
11418.25 / 60 / 60 = 3.17

It takes 3 hours and 10 minutes to index with the previous version of the engine.

After meilisearch/milli@25d7ed8

Here are the times taken by the updates, in seconds, from the most recent to the least recent. We first sent the whole 80 million documents (3090s) and then sent the new documents 10 by 10.

104.67 + 103.29 + 103.59 + 102.12 + 103.18 + 105.11 + 103.02 + 103.37 + 102.90 + 103.39 + 3090.65 = 4125.29
4125.29 / 60 / 60 = 1.146

It takes 1 hour and 8 minutes to index with the newly patched version of the engine.
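
The totals hide where the gain comes from: the initial 80 million document import takes about the same time in both runs, while each 10-document batch gets about 8 times faster. The short sketch below redoes the arithmetic from the timings listed above; the variable and function names are only illustrative.

```python
# Timings in seconds, copied from the two benchmark runs above.
before_batches = [868.89, 875.24, 867.27, 887.89, 872.97,
                  880.95, 810.81, 806.35, 814.53, 800.37]
after_batches = [104.67, 103.29, 103.59, 102.12, 103.18,
                 105.11, 103.02, 103.37, 102.90, 103.39]
before_initial = 2932.98  # initial 80 million document import, before
after_initial = 3090.65   # initial 80 million document import, after


def total_hours(initial, batches):
    """Total indexing time (initial import plus all batches) in hours."""
    return (initial + sum(batches)) / 3600


mean_before = sum(before_batches) / len(before_batches)
mean_after = sum(after_batches) / len(after_batches)
batch_speedup = mean_before / mean_after

print(f"total before: {total_hours(before_initial, before_batches):.2f} h")
print(f"total after:  {total_hours(after_initial, after_batches):.2f} h")
print(f"mean batch time: {mean_before:.1f} s -> {mean_after:.1f} s")
print(f"per-batch speedup: {batch_speedup:.1f}x")
```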

@curquiza
Member Author

curquiza commented Mar 15, 2022

@meilisearch/devrel-team so that you can follow the issue, it might interest you for your communication (v0.27.0, not now)

@MarinPostma
Contributor

Adding to this issue: meilisearch/milli#467 should roughly halve the times reported by @Kerollmops! 🏎️

@curquiza added the milli and tracking issue labels Mar 16, 2022
@curquiza
Member Author

Milli was bumped in #2244 to v0.24.0, which contains the current improvements to indexing speed.
I am not closing this issue yet since other improvements might be made during the sprint.
Plus, final metrics might need to be shared again :)

@curquiza
Member Author

@Kerollmops you need to provide new metrics using milli v0.26.3 🚀

@Kerollmops
Member

Hey @curquiza,

I ran the same benchmarks as in #2203 (comment) and the engine is much faster. Note that I ran them on v0.27.0 (meilisearch/milli@2aae19d, the latest commit).

67.93 + 68.24 + 67.88 + 68.05 + 69.02 + 67.84 + 68.28 + 67.68 + 74.42 + 69.24 + 2387.98 = 3076.56
3076.56 / 60 / 60 = 0.85

It now takes 51 minutes to index 80 million documents plus 100 new ones! Good job @meilisearch/core-team 🎉
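
For a side-by-side view of the three runs reported in this thread, here is a small sketch that converts each total from seconds to hours and minutes and compares it with the first run. The totals are copied from the comments above; the helper name `format_duration` and the dictionary labels are illustrative.

```python
def format_duration(seconds):
    """Format a duration in seconds as hours and whole minutes (truncated)."""
    minutes = int(seconds // 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours} h {minutes:02d} min"


# Total indexing times in seconds, as reported in this thread.
totals = {
    "before (milli@45f5262)": 11418.25,
    "after (milli@25d7ed8)": 4125.29,
    "v0.27.0 (milli@2aae19d)": 3076.56,
}

baseline = totals["before (milli@45f5262)"]
for label, seconds in totals.items():
    ratio = baseline / seconds
    print(f"{label:<26} {format_duration(seconds)}  ({ratio:.1f}x vs. before)")
```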

@irevoire
Member

🔥 🔥 🔥 I don't know why it's faster but 🔥

@curquiza
Member Author

🤘🤘🤘

@curquiza
Member Author

Closing this then!

@curquiza added the v0.27.0 label Aug 24, 2022