
Pydantic v2 Meilisearch #32

Merged
merged 6 commits into from
Jul 15, 2023
Conversation

prrao87
Owner

@prrao87 prrao87 commented Jul 15, 2023

Fixes #31.

Updates

Updates the Meilisearch section to use Pydantic v2. Performance is GREAT!

  • No changes required to the meilisearch-python-async code from the prior version (using the latest meilisearch-python-async, version 1.4.5 as of today)
  • Minor changes to validator and config logic in the Pydantic schemas
  • Impact on run time from Pydantic is small, mainly because most of the overhead in the bulk_index.py script comes from firing up multiple processes via concurrent.futures.ProcessPoolExecutor
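The "minor changes to validator and config logic" are the standard Pydantic v1 → v2 migrations. A minimal sketch of what that looks like (the model and field names here are illustrative, not the repo's actual schema):

```python
from pydantic import BaseModel, ConfigDict, field_validator


class Wine(BaseModel):
    # v2 replaces the inner `class Config` with a `model_config` attribute
    model_config = ConfigDict(populate_by_name=True)

    id: int
    country: str = "Unknown"

    # v2 replaces `@validator` with `@field_validator`
    @field_validator("country")
    @classmethod
    def fill_unknown_country(cls, value: str) -> str:
        # Replace empty/falsy values with a sentinel string
        return value or "Unknown"


print(Wine(id=1, country="").country)  # → Unknown
```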

The earlier version using Pydantic v1 ran in ~3.3 seconds; the new version with Pydantic v2 runs in ~2.7 seconds (on an M2 MacBook Pro).

$ cd dbs/meilisearch/scripts
$ time python bulk_index.py
Finished updating database index settings
Processing chunks
Processed ids in range 1-10000
Processed ids in range 20001-30000
Processed ids in range 60001-70000
Processed ids in range 50001-60000
Processed ids in range 10001-20000
Processed ids in range 100001-110000
Processed ids in range 80001-90000
Processed ids in range 70001-80000
Processed ids in range 40001-50000
Processed ids in range 120001-129971
Processed ids in range 30001-40000
Processed ids in range 110001-120000
Processed ids in range 90001-100000
Finished execution!
python bulk_index.py  4.97s user 0.77s system 206% cpu 2.783 total

Most of the runtime is spent firing up the worker processes. Although the total runtime reduction in this case was relatively small, on a larger dataset (with millions of records) the initial overhead of spawning multiple CPU processes is totally worth it, as the async client will very efficiently process each batch with faster underlying validation logic (enabled by Pydantic v2).

Rule of thumb

@sanders41, I think that for best performance for async loading, it makes sense to wrap the validation logic for the data within the multiprocess logic, as you originally suggested. That allows users to exploit the best of the underlying CPUs as well as the event loop that's handling many batches concurrently. Let me know if you see anything that's off.
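The pattern described above could be sketched roughly like this (a minimal, hypothetical sketch: `validate_chunk`, `index_all`, and the simplified "indexing" step are illustrative stand-ins, not the actual code in bulk_index.py):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def validate_chunk(chunk: list[dict]) -> list[dict]:
    # The Pydantic validation would happen here, INSIDE the worker process,
    # so each CPU core handles its own chunk's validation work
    return [{**record, "id": int(record["id"])} for record in chunk]


async def index_all(chunks: list[list[dict]]) -> int:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Fan validation out across processes...
        validated = await asyncio.gather(
            *(loop.run_in_executor(pool, validate_chunk, c) for c in chunks)
        )
    # ...then let the event loop push the validated batches concurrently;
    # the count below is a stand-in for an async client call such as
    # `await client.index(...).add_documents(batch)`
    return sum(len(batch) for batch in validated)


if __name__ == "__main__":
    chunks = [[{"id": "1"}], [{"id": "2"}, {"id": "3"}]]
    print(asyncio.run(index_all(chunks)))  # → 3
```

The key design point is that CPU-bound validation stays in the process pool while I/O-bound indexing stays on the event loop, so neither starves the other.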

@prrao87 prrao87 merged commit 8d4c93e into main Jul 15, 2023
1 check passed
@sanders41
Contributor

sanders41 commented Jul 15, 2023

Looks good to me! I agree leaving the multiprocessing in makes sense, especially for larger datasets.

@prrao87
Owner Author

prrao87 commented Jul 15, 2023

I'm planning on writing a blog post on searching via Meilisearch, including best practices, and of course will link your awesome async client heavily throughout 😁. Thanks a lot for chiming in!

@prrao87
Owner Author

prrao87 commented Jul 29, 2023

@sanders41 I've released the blog post, and your name features in the acknowledgements -- thanks again for putting in all the work you do!
https://thedataquarry.com/posts/meilisearch-async/

@sanders41
Contributor

Very nice write-up! I'll share it around
