Process files in parallel #15

Merged: 1 commit into prrao87:meili-multi-process on Apr 19, 2023

Conversation

sanders41 (Contributor)

Yesterday I wasn't happy with the processing times and thought they could be better. I messed with it a bit more and got the bulk indexing time down from ~9.3s to ~3.7s by processing the files in parallel.

call_coroutines = []
for call in calls:
    # Schedule each CPU-bound call on the process pool; this returns an
    # awaitable future that completes when the worker process finishes
    call_coroutines.append(loop.run_in_executor(process_pool, call))

data = await asyncio.gather(*call_coroutines)
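
For context, the full pattern looks something like this minimal, self-contained sketch (validate_and_load, the file names, and the dict it returns are hypothetical stand-ins for the repo's actual loading and Pydantic validation code):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def validate_and_load(path: str) -> dict:
    # Hypothetical CPU-bound worker: read a file and "validate" it
    # (stands in for the real Pydantic validation step)
    with open(path) as f:
        return {"path": path, "num_chars": len(f.read())}

async def main(file_paths: list[str]) -> list[dict]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as process_pool:
        # Each worker runs in its own process (and core); the event loop
        # only awaits the futures, so it's never blocked by the CPU work
        call_coroutines = [
            loop.run_in_executor(process_pool, validate_and_load, path)
            for path in file_paths
        ]
        return await asyncio.gather(*call_coroutines)

if __name__ == "__main__":  # guard needed for the multiprocessing spawn start method
    data = asyncio.run(main(["data1.json", "data2.json"]))
    print(data)
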
prrao87 (Owner)

Wow, this is a really neat trick! Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?

In any case, very cool, need to try this out in other dbs as well. Thanks for this PR!

prrao87 (Owner)

I was messing around with aiofiles here and it turned out to be slower than regular sequential file processing. I presume that's because of the CPU overhead from Pydantic's data validation, i.e., the file processing in this case isn't purely I/O-bound? Curious to hear your thoughts on this too.

I know I'm working with a bit of a toy example here, but in a real-world situation it's very likely I'll be using Pydantic extensively to handle data quality issues and ensure the right data is being indexed. I'm eager to see what Pydantic 2.x (with its core rewritten in Rust) brings to the table, and I'm looking forward to running some experiments on it!

sanders41 (Contributor, Author) commented Apr 19, 2023

> Just so I understand, are you creating process pool coroutines and attaching them as tasks within an async event loop? If so, is the whole thing (CPU-bound pydantic and I/O-bound file and db ops) running in a single event loop? How is the whole thing happening in a non-blocking fashion?

I may get the exact details wrong here, but basically it schedules each call as a task on the event loop and then runs it on a separate core via multiprocessing. When a process finishes, it reports back to the event loop that its result is ready.
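
Under the hood (again, I may be off on details), run_in_executor is roughly equivalent to submitting the call to the pool and wrapping the resulting concurrent.futures.Future in an asyncio future, so the worker's completion wakes the event loop. A minimal sketch of that round trip, with crunch as a hypothetical CPU-bound function:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(n: int) -> int:
    # Hypothetical CPU-bound work; runs in a separate worker process
    return sum(i * i for i in range(n))

async def main() -> None:
    with ProcessPoolExecutor() as pool:
        # Roughly what loop.run_in_executor(pool, ...) does: submit to the
        # pool, then wrap the concurrent.futures.Future so that its
        # completion is reported back to the event loop
        cf = pool.submit(crunch, 5_000_000)
        result = await asyncio.wrap_future(cf)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())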

I initially tried aiofiles too and got the same result you did. My intuition is that it can run asynchronously while waiting on the disk to respond, but once it actually has the file contents it blocks just like it would without aiofiles. My guess is that the slowdown comes from tasks waiting for their next turn in the event loop.
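
A minimal sketch of where the blocking creeps back in (expensive_validation is a hypothetical stand-in for the Pydantic step):

import asyncio
import json

import aiofiles  # third-party: pip install aiofiles

def expensive_validation(raw: str) -> dict:
    # Hypothetical stand-in for CPU-bound work like Pydantic validation
    return json.loads(raw)

async def load_one(path: str) -> dict:
    async with aiofiles.open(path) as f:
        raw = await f.read()  # only the disk read itself is non-blocking
    # The parse/validate step runs on the event loop's thread and blocks it,
    # so other tasks have to wait despite the async file I/O
    return expensive_validation(raw)

if __name__ == "__main__":
    print(asyncio.run(load_one("data1.json")))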

Yes, I am looking forward to seeing what Pydantic 2 can do also! Early numbers look very promising.

prrao87 (Owner)

Great, thanks for clarifying, and again, for the PR! 😄

prrao87 changed the base branch from main to meili-multi-process on Apr 19, 2023, 13:53
prrao87 (Owner) commented Apr 19, 2023

I'm running this on an M2 Mac and it finishes in about 2.4 seconds! 😅

prrao87 merged commit f7eacfc into prrao87:meili-multi-process on Apr 19, 2023
prrao87 mentioned this pull request on Apr 19, 2023