Build the trees in parallel #32

Merged
Kerollmops merged 30 commits into main from parallel-building
Dec 6, 2023

Conversation

Kerollmops
Member

@Kerollmops Kerollmops commented Dec 2, 2023

This PR builds the trees in parallel and stores the tree nodes in files before writing them to the database through the write transaction.

However, the first implementation had a flaw. We were creating read transactions to be able to read from different threads in parallel, but those transactions can only read committed changes, which means they cannot see the non-committed user items/vectors.

After some research, I found a clean solution to the original problem. Since it is valid to safely keep pointers to the entries' data (as long as we do not use fancy LMDB features like encryption), I created the ImmutableLeafs data structure, which lists the leaf nodes and keeps pointers to them, all from the current RwTxn. We therefore no longer need to commit the user items transaction before being able to build the trees.
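The idea of listing the leaf nodes once and then sharing read-only views of them across threads can be sketched with the standard library alone. This is not the code from this PR: `LeafViews` and `sum_in_parallel` are hypothetical names, an in-memory slice of entries stands in for the data reachable through the RwTxn, and borrowed slices stand in for the pointers that ImmutableLeafs keeps.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for the idea behind `ImmutableLeafs`: a lookup
/// table from item id to a borrowed, read-only view of the item's bytes.
struct LeafViews<'t> {
    leafs: HashMap<u32, &'t [u8]>,
}

impl<'t> LeafViews<'t> {
    /// Lists every entry once and keeps a view of its bytes. Nothing is copied.
    /// ASSUMPTION: `entries` plays the role of the data owned by the write
    /// transaction; it must not be modified while the views are alive,
    /// which the borrow checker enforces here.
    fn new(entries: &'t [(u32, Vec<u8>)]) -> Self {
        let leafs = entries
            .iter()
            .map(|(id, bytes)| (*id, bytes.as_slice()))
            .collect();
        LeafViews { leafs }
    }

    /// Returns the bytes of the leaf `id`, if it exists.
    fn get(&self, id: u32) -> Option<&'t [u8]> {
        self.leafs.get(&id).copied()
    }
}

/// Reads the same leafs from `n_threads` threads at once and returns one
/// byte sum per thread. Unknown ids are skipped.
fn sum_in_parallel(views: &LeafViews<'_>, ids: &[u32], n_threads: usize) -> Vec<u64> {
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..n_threads {
            // `views` and `ids` are shared references, so `move` only copies them.
            handles.push(s.spawn(move || {
                ids.iter()
                    .filter_map(|id| views.get(*id))
                    .flat_map(|bytes| bytes.iter())
                    .map(|b| u64::from(*b))
                    .sum::<u64>()
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Because the views are immutable and borrowed from data that outlives the scoped threads, every thread can read the uncommitted entries without opening its own read transaction.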

TODO

  • Check and fix the tests.
  • Compute the n_trees by ourselves.
  • Update the README.
  • Use the right TMPDIR variable.
  • Document ImmutableLeafs, and specifically why it is safe (because it is).
  • Rename the build_in_parallel method to build.

@Kerollmops Kerollmops force-pushed the parallel-building branch 5 times, most recently from 6aedb57 to b6a66ca Compare December 3, 2023 21:54
@Kerollmops Kerollmops mentioned this pull request Dec 3, 2023
@Kerollmops Kerollmops added this to the v0.2.0 milestone Dec 4, 2023
@Kerollmops Kerollmops force-pushed the parallel-building branch 5 times, most recently from dec0e40 to 840cb30 Compare December 6, 2023 09:50
@Kerollmops Kerollmops marked this pull request as ready for review December 6, 2023 14:09
@Kerollmops Kerollmops merged commit 25eb41b into main Dec 6, 2023
5 checks passed
@Kerollmops Kerollmops deleted the parallel-building branch December 6, 2023 14:11
Member Author

Kerollmops commented Dec 6, 2023

I ran some experiments to compare the indexing speed and the database size with Spotify/Annoy. The results are very good for arroy: in every run we were faster and the database was smaller.

The speed difference is probably related to the large number of mutexes and the amount of synchronisation needed to store tree nodes in the Annoy database format, whereas Meilisearch/arroy only requires a single atomic sequential number.
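A minimal sketch of what "a single atomic sequential number" can look like: every thread reserves node ids from one shared counter and keeps the nodes it builds in its own buffer, so no mutex is involved. `NodeIds` and `build_in_threads` are hypothetical names, and the `Relaxed` ordering is an assumption that is sufficient for handing out unique ids; the PR does not state which ordering arroy uses.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Hypothetical id generator shared by all tree-building threads.
struct NodeIds(AtomicU32);

impl NodeIds {
    fn new(first: u32) -> Self {
        NodeIds(AtomicU32::new(first))
    }

    /// Reserves and returns the next free node id.
    /// `fetch_add` is a single atomic read-modify-write, so two threads
    /// can never receive the same id.
    fn next(&self) -> u32 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

/// Each thread "builds" `nodes_per_thread` nodes and records their ids in a
/// thread-local vector. The counter is the only shared state.
fn build_in_threads(ids: &NodeIds, n_threads: usize, nodes_per_thread: usize) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..n_threads {
            handles.push(s.spawn(move || {
                (0..nodes_per_thread)
                    .map(|_| ids.next())
                    .collect::<Vec<u32>>()
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The ids handed to one thread are not contiguous, but taken together they form a gap-free sequence with no duplicates, which is all that is needed to store the nodes under distinct keys afterwards.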

The difference in database size, on the other hand, probably comes from the fact that we store lists of integers in RoaringBitmaps instead of in uncompressed lists as Annoy does.

Meilisearch/arroy takes 52s to index 95832 vectors of 678 dimensions into 200 trees using 12 threads, where Spotify/Annoy takes 69s. When we force both libraries to use a single thread, arroy takes 225s and Annoy takes 338s. This holds even though we are using LMDB and the documents are stored in a B-Tree that supports atomic operations.
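For reference, the ratios implied by these four timings, computed only from the numbers quoted above (a single run each, so they should be read as rough indications):

```latex
\begin{aligned}
\text{12 threads, Annoy / arroy:} \quad & \frac{69}{52} \approx 1.33 \\
\text{1 thread, Annoy / arroy:} \quad & \frac{338}{225} \approx 1.50 \\
\text{arroy, 1 thread / 12 threads:} \quad & \frac{225}{52} \approx 4.33 \\
\text{Annoy, 1 thread / 12 threads:} \quad & \frac{338}{69} \approx 4.90
\end{aligned}
```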
