
Pydantic v2 Meilisearch #32

Merged
merged 6 commits into from
Jul 15, 2023
Conversation

prrao87
Owner

@prrao87 prrao87 commented Jul 15, 2023

Fixes #31.

Updates

Updates the Meilisearch section to use Pydantic v2. Performance is GREAT!

  • No changes required to the meilisearch-python-async code from the prior version (using the latest meilisearch-python-async, version 1.4.5 as of today)
  • Minor changes to validator and config logic in the Pydantic schemas
  • Impact on run time from Pydantic is small, mainly because most of the overhead in the bulk_index.py script comes from firing up multiple processes via concurrent.futures.ProcessPoolExecutor
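The "minor changes to validator and config logic" are the standard Pydantic v1 → v2 migrations. A minimal sketch of what that looks like (the model and field names here are illustrative, not the repo's actual schema):

```python
from pydantic import BaseModel, ConfigDict, field_validator


class Wine(BaseModel):
    # v2 replaces the inner `class Config` with a `model_config` attribute
    model_config = ConfigDict(populate_by_name=True)

    id: int
    country: str = "Unknown"

    # v2 replaces `@validator` with `@field_validator`
    @field_validator("country")
    @classmethod
    def fill_unknown_country(cls, value: str) -> str:
        # Replace empty/falsy values with a sentinel string
        return value or "Unknown"


print(Wine(id=1, country="").country)  # → Unknown
```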

The earlier version using Pydantic v1 ran in ~3.3 seconds; the new version with Pydantic v2 runs in ~2.7 seconds (on an M2 MacBook Pro).

$ cd dbs/meilisearch/scripts
$ time python bulk_index.py
Finished updating database index settings
Processing chunks
Processed ids in range 1-10000
Processed ids in range 20001-30000
Processed ids in range 60001-70000
Processed ids in range 50001-60000
Processed ids in range 10001-20000
Processed ids in range 100001-110000
Processed ids in range 80001-90000
Processed ids in range 70001-80000
Processed ids in range 40001-50000
Processed ids in range 120001-129971
Processed ids in range 30001-40000
Processed ids in range 110001-120000
Processed ids in range 90001-100000
Finished execution!
python bulk_index.py  4.97s user 0.77s system 206% cpu 2.783 total

Most of the runtime is spent firing up the worker processes. Although the total runtime reduction in this case was relatively small, on a larger dataset (with millions of records) the initial overhead of spawning multiple CPU processes is totally worth it, as the async client will very efficiently process each batch with faster underlying validation logic (enabled by Pydantic v2).

Rule of thumb

@sanders41, I think that for best performance for async loading, it makes sense to wrap the validation logic for the data within the multiprocess logic, as you originally suggested. That allows users to exploit the best of the underlying CPUs as well as the event loop that's handling many batches concurrently. Let me know if you see anything that's off.
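The pattern described above could be sketched roughly like this (a minimal, hypothetical sketch: `validate_chunk`, `index_all`, and the simplified "indexing" step are illustrative stand-ins, not the actual code in bulk_index.py):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def validate_chunk(chunk: list[dict]) -> list[dict]:
    # The Pydantic validation would happen here, INSIDE the worker process,
    # so each CPU core handles its own chunk's validation work
    return [{**record, "id": int(record["id"])} for record in chunk]


async def index_all(chunks: list[list[dict]]) -> int:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Fan validation out across processes...
        validated = await asyncio.gather(
            *(loop.run_in_executor(pool, validate_chunk, c) for c in chunks)
        )
    # ...then let the event loop push the validated batches concurrently;
    # the count below is a stand-in for an async client call such as
    # `await client.index(...).add_documents(batch)`
    return sum(len(batch) for batch in validated)


if __name__ == "__main__":
    chunks = [[{"id": "1"}], [{"id": "2"}, {"id": "3"}]]
    print(asyncio.run(index_all(chunks)))  # → 3
```

The key design point is that CPU-bound validation stays in the process pool while I/O-bound indexing stays on the event loop, so neither starves the other.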

@prrao87 prrao87 merged commit 8d4c93e into main Jul 15, 2023
1 check passed
@sanders41
Contributor

sanders41 commented Jul 15, 2023

Looks good to me! I agree leaving the multiprocessing in makes sense, especially for larger datasets.

@prrao87
Owner Author

prrao87 commented Jul 15, 2023

I'm planning on writing a blog post on searching via Meilisearch, including best practices, and of course will link your awesome async client heavily throughout 😁. Thanks a lot for chiming in!

@prrao87
Owner Author

prrao87 commented Jul 29, 2023

@sanders41 I've released the blog post, and your name features in the acknowledgements -- thanks again for putting in all the work you do!
https://thedataquarry.com/posts/meilisearch-async/

@sanders41
Contributor

Very nice write-up! I'll share it around
