Process files in parallel #15
Conversation
```python
for call in calls:
    call_coroutines.append(loop.run_in_executor(process_pool, call))

data = await asyncio.gather(*call_coroutines)
```
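For context, the pattern in the diff can be sketched end-to-end roughly like this. This is a minimal, self-contained sketch, not the PR's actual code: `process_file` and `process_all` are hypothetical stand-ins for the real per-file parsing/validation work and driver:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Hypothetical stand-in for the CPU-bound parse/validate step.
# It must be a module-level function so it can be pickled for the pool.
def process_file(n: int) -> int:
    return n * n

async def process_all(items):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as process_pool:
        # Each call runs in a worker process; run_in_executor wraps the
        # pool's future so the event loop can await it without blocking.
        call_coroutines = [
            loop.run_in_executor(process_pool, partial(process_file, n))
            for n in items
        ]
        # gather preserves input order, so results line up with items.
        data = await asyncio.gather(*call_coroutines)
    return data

if __name__ == "__main__":
    print(asyncio.run(process_all(range(4))))  # [0, 1, 4, 9]
```

The event loop stays free while the workers crunch: it only wakes up to collect each result as the corresponding process finishes.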
Wow, this is a really neat trick! Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?
In any case, very cool, need to try this out in other dbs as well. Thanks for this PR!
I was messing around with `aiofiles` in this case and it was turning out to be slower than the regular sequential file processing. I presume that's because there's some CPU overhead from the data validation in Pydantic, and that the bottleneck in the file processing here isn't purely I/O-bound? Curious to hear your thoughts on this too.
I know I'm working with a bit of a toy example here, but in a real-world situation, it's very likely I'll be using Pydantic extensively to handle issues with data and ensure the right data is being indexed. I'm eager to see what Pydantic 2.x (in Rust) brings to this fold, looking forward to running some experiments on that!
> Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?

I may get the exact details wrong here, but basically it schedules each call as a task on the event loop, then runs it on a separate core via multiprocessing. When the process is done, it reports back to the event loop that its result is ready.
I initially tried `aiofiles` as well and got the same result you did. My intuition is that the coroutine can run async while waiting on the disk to respond, but once it actually has the file contents it starts blocking just like it would without `aiofiles`. My guess is the slowdown happens while it waits for its next turn in the event loop.
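A minimal, self-contained way to see why async file reads alone didn't help: in the sketch below, `time.sleep` is a hypothetical stand-in for the CPU-bound Pydantic validation (it blocks the event loop the same way real CPU work would), while `asyncio.sleep` stands in for the disk wait. All names here are illustrative, not from the PR:

```python
import asyncio
import time

def parse(i):
    # Stand-in for CPU-bound validation: blocks the event loop.
    time.sleep(0.1)
    return i

async def read_and_parse(i):
    # Stand-in for the disk wait: this part overlaps across tasks...
    await asyncio.sleep(0.1)
    # ...but the parse step runs serially on the single event loop thread.
    return parse(i)

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(read_and_parse(i) for i in range(4)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# The four "disk waits" overlap (~0.1 s total), but the four "parse"
# calls serialize (~0.4 s), so total time is roughly 0.5 s, not 0.2 s.
print(results, elapsed)
```

That matches the observation above: once validation dominates, `aiofiles` can't speed things up, and only moving the CPU work off the loop (threads or, for real CPU-bound work, processes) does.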
Yes, I am looking forward to seeing what Pydantic 2 can do also! Early numbers look very promising.
Great, thanks for clarifying, and again, for the PR! 😄
I'm running this on an M2 Mac and it finishes in about 2.4 sec! 😅
Yesterday I wasn't happy with the processing times and thought they could be better. I messed with it a bit more and got the bulk indexing time down from ~9.3s to ~3.7s by processing the files in parallel.