Process files in parallel #15

Merged: 1 commit into prrao87:meili-multi-process on Apr 19, 2023

Conversation

sanders41 (Contributor)

Yesterday I wasn't happy with the processing times and thought they could be better. I messed with it a bit more and got the bulk indexing time down from ~9.3s to ~3.7s by processing the files in parallel.

call_coroutines = []
for call in calls:
    # Schedule each CPU-bound call on the process pool; this returns an
    # awaitable future that completes when the worker process finishes
    call_coroutines.append(loop.run_in_executor(process_pool, call))

data = await asyncio.gather(*call_coroutines)
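
For context, the full pattern looks something like this minimal, self-contained sketch (validate_and_load, the file names, and the dict it returns are hypothetical stand-ins for the repo's actual loading and Pydantic validation code):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def validate_and_load(path: str) -> dict:
    # Hypothetical CPU-bound worker: read a file and "validate" it
    # (stands in for the real Pydantic validation step)
    with open(path) as f:
        return {"path": path, "num_chars": len(f.read())}

async def main(file_paths: list[str]) -> list[dict]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as process_pool:
        # Each worker runs in its own process (and core); the event loop
        # only awaits the futures, so it's never blocked by the CPU work
        call_coroutines = [
            loop.run_in_executor(process_pool, validate_and_load, path)
            for path in file_paths
        ]
        return await asyncio.gather(*call_coroutines)

if __name__ == "__main__":  # guard needed for the multiprocessing spawn start method
    data = asyncio.run(main(["data1.json", "data2.json"]))
    print(data)
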
prrao87 (Owner)

Wow, this is a really neat trick! Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?

In any case, very cool, need to try this out in other dbs as well. Thanks for this PR!

prrao87 (Owner)

I was messing around with aiofiles here and it turned out to be slower than regular sequential file processing. I presume that's because of the CPU overhead from Pydantic's data validation, i.e., the file processing in this case isn't purely I/O-bound? Curious to hear your thoughts on this too.

I know I'm working with a bit of a toy example here, but in a real-world situation it's very likely I'll be using Pydantic extensively to handle data quality issues and ensure the right data is being indexed. I'm eager to see what Pydantic 2.x (with its core rewritten in Rust) brings to the table, and I'm looking forward to running some experiments on it!

sanders41 (Contributor, Author) commented Apr 19, 2023

> Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?

I may get the exact details wrong here, but basically it schedules each call as a task on the event loop and then runs it on a separate core via multiprocessing. When a process finishes, it reports back to the event loop that its result is ready.
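
Under the hood (again, I may be off on details), run_in_executor is roughly equivalent to submitting the call to the pool and wrapping the resulting concurrent.futures.Future in an asyncio future, so the worker's completion wakes the event loop. A minimal sketch of that round trip, with crunch as a hypothetical CPU-bound function:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(n: int) -> int:
    # Hypothetical CPU-bound work; runs in a separate worker process
    return sum(i * i for i in range(n))

async def main() -> None:
    with ProcessPoolExecutor() as pool:
        # Roughly what loop.run_in_executor(pool, ...) does: submit to the
        # pool, then wrap the concurrent.futures.Future so that its
        # completion is reported back to the event loop
        cf = pool.submit(crunch, 5_000_000)
        result = await asyncio.wrap_future(cf)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())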

I initially tried aiofiles too and got the same result you did. My intuition is that it can run asynchronously while waiting on the disk to respond, but once it actually has the file contents it blocks just like it would without aiofiles. My guess is that the slowdown comes from tasks waiting for their next turn in the event loop.
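
A minimal sketch of where the blocking creeps back in (expensive_validation is a hypothetical stand-in for the Pydantic step):

import asyncio
import json

import aiofiles  # third-party: pip install aiofiles

def expensive_validation(raw: str) -> dict:
    # Hypothetical stand-in for CPU-bound work like Pydantic validation
    return json.loads(raw)

async def load_one(path: str) -> dict:
    async with aiofiles.open(path) as f:
        raw = await f.read()  # only the disk read itself is non-blocking
    # The parse/validate step runs on the event loop's thread and blocks it,
    # so other tasks have to wait despite the async file I/O
    return expensive_validation(raw)

if __name__ == "__main__":
    print(asyncio.run(load_one("data1.json")))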

Yes, I am looking forward to seeing what Pydantic 2 can do also! Early numbers look very promising.

prrao87 (Owner)

Great, thanks for clarifying, and again, for the PR! 😄

prrao87 changed the base branch from main to meili-multi-process on Apr 19, 2023, 13:53
prrao87 (Owner) commented Apr 19, 2023

I'm running this on an M2 Mac and it finishes in about 2.4 seconds! 😅

prrao87 merged commit f7eacfc into prrao87:meili-multi-process on Apr 19, 2023
prrao87 mentioned this pull request on Apr 19, 2023