perf Mistral.py#13761
Closed
PVBLIC-F wants to merge 3 commits into
I started by streamlining the HTTP layer: individual requests calls are replaced with a shared aiohttp.ClientSession, and PDF uploads are streamed, which cuts down on repeated TCP/TLS handshakes. From there, I converted every network interaction (_upload_file, _get_signed_url, _process_ocr, and _delete_file) into async def methods so nothing blocks the event loop. To speed up CPU-bound work, I introduced a helper, _build_document, and used asyncio.get_event_loop().run_in_executor together with asyncio.gather to build the Document for every page in parallel. Because OCR calls can fail transiently, I wrapped _process_ocr in a Tenacity @retry decorator with exponential backoff. Finally, to avoid overloading the OCR service or local resources, I added an asyncio.Semaphore that limits concurrent load() operations to five, and provided a shutdown_loader helper that closes the session cleanly once the loader is no longer needed.
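The throttling and parallel page-building steps described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: the Document dataclass, build_document, and the page dictionaries are hypothetical stand-ins, and the semaphore is created inside the running loop and passed in explicitly.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for the loader's Document type."""
    page_content: str
    metadata: dict


def build_document(page: dict) -> Document:
    """Synchronous page-to-Document conversion (stand-in for _build_document)."""
    return Document(page_content=page["markdown"], metadata={"page": page["index"]})


async def load(pages: list, limit: asyncio.Semaphore) -> list:
    # At most `limit` load() calls run this body at the same time.
    async with limit:
        loop = asyncio.get_running_loop()
        # Hand each page to the default thread pool; gather keeps input order.
        futures = [loop.run_in_executor(None, build_document, p) for p in pages]
        return await asyncio.gather(*futures)


async def main() -> list:
    limit = asyncio.Semaphore(5)  # five concurrent loads, as in the description
    batches = [
        [{"index": i, "markdown": f"doc{d}-page{i}"} for i in range(3)]
        for d in range(8)
    ]
    return await asyncio.gather(*(load(b, limit) for b in batches))


if __name__ == "__main__":
    for docs in asyncio.run(main()):
        print([d.page_content for d in docs])
```

Two caveats apply to this sketch. Inside a coroutine, asyncio.get_running_loop() is generally preferred over asyncio.get_event_loop(). Also, the default executor is a thread pool, so pure-Python CPU-bound work remains limited by the GIL; a process pool may be needed if page parsing is genuinely CPU-heavy.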
Safe file handling: In _upload_file, I replaced the manual open/close calls with a with open(..., "rb") context manager so the file descriptor is always closed, even if an exception occurs during upload.
Request timeouts: I imported ClientTimeout from aiohttp and configured the ClientSession with a 30 s total timeout. Any stalled upload, GET, POST, or DELETE now raises a timeout error instead of hanging indefinitely.
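The effect of a total request timeout can be illustrated without aiohttp. The sketch below uses asyncio.wait_for from the standard library as a stand-in for the ClientTimeout configuration described above; with_timeout and stalled_request are hypothetical names introduced only for this example.

```python
import asyncio

REQUEST_TIMEOUT_S = 30.0  # mirrors the 30 s total timeout described above


async def with_timeout(coro, timeout: float = REQUEST_TIMEOUT_S):
    """Await `coro`, raising asyncio.TimeoutError if it runs longer than `timeout`."""
    return await asyncio.wait_for(coro, timeout=timeout)


async def stalled_request() -> str:
    # Hypothetical stand-in for an upload that never completes.
    await asyncio.sleep(3600)
    return "done"


async def demo() -> str:
    try:
        # A short timeout keeps the demonstration fast.
        return await with_timeout(stalled_request(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

When the timeout fires, wait_for cancels the pending coroutine, so the stalled operation does not keep running in the background.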
Switched to httpx.AsyncClient, replacing the previous aiohttp setup with a single, unified client. httpx offers optional HTTP/2 support (enabled with http2=True and the httpx[http2] extra), simpler timeout configuration, and a familiar requests-style API for both sync and async code.
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
Changelog Entry
Description
I’ve overhauled the Mistral OCR loader to improve performance and resilience. I replaced raw requests calls with a single aiohttp.ClientSession configured with a 30 s ClientTimeout so no request can hang indefinitely; converted every network interaction (_upload_file, _get_signed_url, _process_ocr, and _delete_file) to non-blocking async def methods; wrapped file uploads in a with open(...) context manager to guarantee the PDF handle is always closed; parallelized page-to-Document construction using asyncio.get_event_loop().run_in_executor so the pages of a multi-page document are processed in parallel; hardened the OCR step with a Tenacity @retry decorator for exponential-backoff retries; throttled overall concurrency with an asyncio.Semaphore(5) around the entire load() method; and added __aenter__/__aexit__ so the loader works as a proper async context manager, plus a helper to shut down the session cleanly.
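A rough sketch of how the async context manager and the retry behaviour fit together is shown below. It is not the PR's implementation: FakeSession and OcrLoader are hypothetical stand-ins, and retry_async is a hand-rolled substitute for the Tenacity decorator so that the example needs only the standard library.

```python
import asyncio
import functools


def retry_async(attempts: int = 3, base_delay: float = 0.5):
    """Retry an async callable with exponential backoff (stand-in for Tenacity)."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return await fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the last error
                    # Delay doubles each time: base, 2*base, 4*base, ...
                    await asyncio.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


class FakeSession:
    """Hypothetical stand-in for the shared HTTP client session."""
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class OcrLoader:
    def __init__(self) -> None:
        self.session = None
        self.calls = 0

    async def __aenter__(self) -> "OcrLoader":
        self.session = FakeSession()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on error, so the session is always closed.
        await self.session.close()

    @retry_async(attempts=3, base_delay=0.01)
    async def process_ocr(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient OCR failure")  # simulated hiccup
        return "ocr-result"


async def demo():
    async with OcrLoader() as loader:
        result = await loader.process_ocr()
    return result, loader.calls, loader.session.closed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch only ConnectionError is retried; which exceptions count as transient for the real OCR endpoint is an assumption that would need to be checked against the service's behaviour.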
Added
Changed
Deprecated
Removed
Fixed
Security
Breaking Changes
Additional Information
Screenshots or Videos
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.