Build the trees in parallel #32
I ran some experiments to compare the speed and database size with Spotify/Annoy. The results are very good for arroy: we are always faster, and the database is always smaller. The speed difference is probably related to the high number of mutexes and synchronisation points needed to store tree nodes in the Annoy database format, where Meilisearch/arroy only requires a single atomic sequential number. The size difference, on the other hand, probably comes from the fact that we store lists of integers in RoaringBitmaps instead of the uncompressed lists that Annoy uses.

When indexing 95832 vectors of 678 dimensions into 200 trees using 12 threads, Meilisearch/arroy takes 52s where Spotify/Annoy takes 69s. When we force both libraries to use a single thread, arroy takes 225s where Annoy takes 338s. This holds even though we are using LMDB and documents are stored in a B-Tree that supports atomic operations.
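To illustrate the "single atomic sequential number" point, here is a minimal sketch of lock-free id allocation shared by several builder threads. It is not arroy's actual code: `NodeIdAllocator` and `allocate_in_parallel` are hypothetical names, and only the standard library is used.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical helper: hands out unique, sequential node ids to many
/// threads without taking a lock.
pub struct NodeIdAllocator {
    next: AtomicU64,
}

impl NodeIdAllocator {
    pub fn new() -> Self {
        Self { next: AtomicU64::new(0) }
    }

    /// Returns the next free id. `fetch_add` is a single atomic operation,
    /// so two threads can never receive the same value.
    pub fn next_id(&self) -> u64 {
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}

/// Spawns `n_threads` workers that each reserve `per_thread` ids, then
/// returns every id that was handed out, sorted.
pub fn allocate_in_parallel(n_threads: usize, per_thread: usize) -> Vec<u64> {
    let allocator = NodeIdAllocator::new();
    let mut all_ids = Vec::new();
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..n_threads {
            // Each worker only needs a shared reference to the allocator.
            handles.push(s.spawn(|| {
                (0..per_thread).map(|_| allocator.next_id()).collect::<Vec<u64>>()
            }));
        }
        for handle in handles {
            all_ids.extend(handle.join().unwrap());
        }
    });
    all_ids.sort_unstable();
    all_ids
}
```

Calling `allocate_in_parallel(4, 100)` yields 400 distinct ids; the only shared state between the workers is one integer, which is the contrast with a mutex-guarded node store.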
This PR plans to build the trees in parallel and store the tree nodes in files before storing them in the database using the write transactions.
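The plan above can be sketched roughly as follows, under heavy assumptions: nodes are reduced to an `(id, payload)` pair of `u64`, the database is replaced by a `BTreeMap`, and `build_to_file` and `build_then_store` are hypothetical names that do not come from the PR. Each worker writes to its own temporary file, and a single thread then replays the files into the store, mimicking one write transaction.

```rust
use std::collections::BTreeMap;
use std::env;
use std::fs::{self, File};
use std::io::{self, BufWriter, Read, Write};
use std::path::{Path, PathBuf};
use std::process;
use std::thread;

/// Each worker "builds" its share of nodes and writes them to its own
/// file, so workers never contend on the database.
fn build_to_file(worker: u64, nodes_per_worker: u64, dir: &Path) -> io::Result<PathBuf> {
    let path = dir.join(format!("nodes-{}-{}.tmp", process::id(), worker));
    let mut writer = BufWriter::new(File::create(&path)?);
    for i in 0..nodes_per_worker {
        let id = worker * nodes_per_worker + i;
        let payload = id * 2; // placeholder for real node content
        writer.write_all(&id.to_le_bytes())?;
        writer.write_all(&payload.to_le_bytes())?;
    }
    writer.flush()?;
    Ok(path)
}

/// Builds in parallel, then replays every file into the map from a single
/// thread, standing in for a single write transaction.
pub fn build_then_store(workers: u64, nodes_per_worker: u64) -> io::Result<BTreeMap<u64, u64>> {
    let dir = env::temp_dir();
    let paths: Vec<io::Result<PathBuf>> = thread::scope(|s| {
        let mut handles = Vec::new();
        for w in 0..workers {
            let dir_ref = &dir;
            handles.push(s.spawn(move || build_to_file(w, nodes_per_worker, dir_ref)));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut db = BTreeMap::new();
    for path in paths {
        let path = path?;
        let mut bytes = Vec::new();
        File::open(&path)?.read_to_end(&mut bytes)?;
        // Every record is 16 bytes: an 8-byte id followed by an 8-byte payload.
        for record in bytes.chunks_exact(16) {
            let mut id = [0u8; 8];
            let mut payload = [0u8; 8];
            id.copy_from_slice(&record[..8]);
            payload.copy_from_slice(&record[8..]);
            db.insert(u64::from_le_bytes(id), u64::from_le_bytes(payload));
        }
        fs::remove_file(&path)?;
    }
    Ok(db)
}
```

`env::temp_dir()` honours the `TMPDIR` environment variable on Unix, so the location of the intermediate files can be changed without touching the code.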
However, the current implementation needs to be fixed. We are creating read transactions to be able to read from different threads in parallel, but those transactions can only read the committed changes, which means that they cannot see the non-committed user items/vectors.

After some research, I found a clean solution to the original problem. As it is valid to keep pointers to the entries' data safely (if not using fancy LMDB features like encryption), I created the `ImmutableLeafs` data structure that lists the leaf nodes and keeps pointers to them, all of that from the current `RwTxn`. We, therefore, no longer need to commit the user items transaction before being able to `build` the trees.

TODO
- […] `n_trees` by ourselves.
- […] `TMPDIR` variable.
- […] `ImmutableLeafs` and specifically the safety of it (because it is).
- […] `build_in_parallel` method into the `build` one.
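The idea behind `ImmutableLeafs` can be illustrated with a safe-Rust sketch. This is not the PR's implementation and not the heed or LMDB API: `FakeRwTxn` is a hypothetical stand-in for a write transaction holding uncommitted vectors, and borrowed slices replace the pointers described above. The lifetime `'t` plays the role of the safety argument: as long as the views exist, the transaction cannot be written to.

```rust
use std::collections::{BTreeMap, HashMap};
use std::thread;

/// Stand-in for a write transaction that owns uncommitted vectors
/// (hypothetical; not the heed `RwTxn` API).
pub struct FakeRwTxn {
    items: BTreeMap<u32, Vec<f32>>,
}

impl FakeRwTxn {
    pub fn new() -> Self {
        Self { items: BTreeMap::new() }
    }

    pub fn put(&mut self, id: u32, vector: Vec<f32>) {
        self.items.insert(id, vector);
    }
}

/// Borrowed views of every leaf. The lifetime `'t` ties the views to the
/// transaction, so the compiler rejects any write while they are alive.
pub struct ImmutableLeafs<'t> {
    leafs: HashMap<u32, &'t [f32]>,
}

impl<'t> ImmutableLeafs<'t> {
    pub fn new(txn: &'t FakeRwTxn) -> Self {
        let leafs = txn.items.iter().map(|(id, v)| (*id, v.as_slice())).collect();
        Self { leafs }
    }

    pub fn get(&self, id: u32) -> Option<&'t [f32]> {
        self.leafs.get(&id).copied()
    }
}

/// Reads the uncommitted vectors from several threads and sums each one;
/// unknown ids contribute 0.0.
pub fn parallel_sums(txn: &FakeRwTxn, ids: &[u32]) -> Vec<f32> {
    let leafs = ImmutableLeafs::new(txn);
    thread::scope(|s| {
        let mut handles = Vec::new();
        for &id in ids {
            let leafs = &leafs;
            handles.push(s.spawn(move || {
                leafs.get(id).map_or(0.0, |v| v.iter().sum::<f32>())
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Because the views only hold shared references, they can be handed to any number of threads without committing first, which is the property the PR relies on.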