New Feature PR: Streaming live parsing via gRPC for agents #2110
krickert
started this conversation in
Show and tell
Replies: 0 comments
TL;DR
Sick of waiting for PDF pages to come back as Markdown?
Well, that's fixed in PR #2109.
Want to set up mTLS? You got it...
Want to use this from any of 12 popular languages? We got you...
Here's the PR to bring in the gRPC server to markitdown:
#2109
Give it a try! There are samples. Now you can start computing embeddings before the full document is done, which saves time and money for RAG.
I've finished a gRPC interface for MarkItDown and opened a PR. The full implementation is working, tested, and you can find some examples in the PR.
The problem and details
MarkItDown today is batch-oriented. A caller submits a document, waits for the entire conversion to finish, and receives the complete result. For large PDFs and long presentations this means several seconds of dead time before downstream systems can do anything with the output. Services that index, summarize, or render progressively are blocked on the slowest page of the document.
What this adds
A gRPC service with three RPCs on top of the existing conversion engine:
- `Convert` returns the full Markdown in one response, equivalent to the current API but with generated clients for up to 12 languages.
- `ConvertStream` returns the Markdown as an ordered stream of chunks.
- `ConvertDocumentStream` returns typed structural elements (headings, tables with parsed cells, lists with parsed items, code blocks, images, and so on) so consumers can process document structure without re-parsing Markdown.

PDFs are processed page by page and PPTX files slide by slide. On a 120-page PDF, time-to-first-chunk drops from about 2.8 seconds to about 0.08 seconds, with byte-identical output. This can greatly reduce end-to-end processing time in pipelines that handle large PDFs and PPTX decks.
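To make the typed stream a bit more concrete, here is a minimal plain-Python model of the idea behind `ConvertDocumentStream`. It is only a sketch: the `Element` shape, `to_elements`, and `convert_document_stream` are hypothetical names for illustration, not the messages or functions defined in the PR, and it only distinguishes headings from paragraphs.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Element:
    """One typed structural element (hypothetical shape, not the PR's proto)."""
    kind: str  # "heading" or "paragraph" in this sketch
    text: str
    page: int


def to_elements(markdown: str, page: int) -> List[Element]:
    """Classify each non-empty line of one page's Markdown."""
    elements = []
    for line in markdown.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            elements.append(Element("heading", line.lstrip("#").strip(), page))
        else:
            elements.append(Element("paragraph", line.strip(), page))
    return elements


def convert_document_stream(pages: Iterable[str]) -> Iterator[Element]:
    """Yield typed elements page by page, not after the whole document."""
    for number, markdown in enumerate(pages, start=1):
        yield from to_elements(markdown, number)


pages = ["# Intro\nFirst page text.", "## Details\nSecond page text."]
stream = convert_document_stream(pages)
first = next(stream)  # available before page 2 has been touched
rest = list(stream)
headings = [e.text for e in [first, *rest] if e.kind == "heading"]
```

A consumer such as an indexer can act on `first` immediately and pick out headings without re-parsing Markdown, which is the point of the typed variant.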
    sequenceDiagram
        participant Client
        participant Server as gRPC Server
        participant Conv as PDF/PPTX Converter
        Client->>Server: ConvertDocumentStream (incremental=true)
        Server-->>Client: started
        loop each page or slide
            Server->>Conv: convert one page/slide
            Conv-->>Server: markdown fragment
            Server-->>Client: typed elements for that page
        end
        Server-->>Client: completed

Design constraints followed
I tried to be minimally invasive at the code level: to stream, I took the PPTX and PDF converters, exposed their per-slide and per-page loops, and kept the same API structure. Those converters are now stream-capable.
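As a rough illustration of what "exposing the loop" can look like, the sketch below pulls the per-slide step into its own method and keeps the batch call as a join over the same fragments. `SlideDeckConverter`, `_convert_slide`, and `iter_convert` are made-up names for this example, and the slide marker format is illustrative; the real converter in the PR may be structured differently.

```python
from typing import Iterator, List


class SlideDeckConverter:
    """Hypothetical stand-in for a converter whose loop has been exposed."""

    def __init__(self) -> None:
        # Records which slides have been processed so far.
        self.converted: List[int] = []

    def _convert_slide(self, index: int, slide: str) -> str:
        """Per-slide step, extracted from what used to be the loop body."""
        self.converted.append(index)
        return f"<!-- Slide number: {index} -->\n{slide}\n\n"

    def iter_convert(self, slides: List[str]) -> Iterator[str]:
        """Streaming entry point: one Markdown fragment per slide."""
        for index, slide in enumerate(slides, start=1):
            yield self._convert_slide(index, slide)

    def convert(self, slides: List[str]) -> str:
        """Batch entry point, unchanged for callers: join the fragments."""
        return "".join(self.iter_convert(slides))


slides = ["# Title", "- point one", "- point two"]

batch_output = SlideDeckConverter().convert(slides)

streaming = SlideDeckConverter()
chunks = streaming.iter_convert(slides)
first_chunk = next(chunks)
# Only slide 1 has been converted when the first chunk is handed over.
slides_done_at_first_chunk = list(streaming.converted)
streamed_output = first_chunk + "".join(chunks)
```

Because both entry points share one per-slide method, the concatenated stream and the batch result come out identical by construction, which is the property the parity tests in the PR assert.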
The `DocumentConverter` contract and the plugin API are unchanged. The streaming logic lives in a separate experimental package (`markitdown.streaming`) that reuses the existing converters' extraction code behind its own controller. The one core edit is a behavior-preserving refactor in the PPTX converter that extracts per-slide conversion into its own method, which has been verified as byte-identical against the existing test vectors.

Everything experimental is opt-in and clearly marked. Defaults are byte-stable with current behavior. gRPC itself is an optional extra (`pip install markitdown[grpc]`), so the core package gains no new dependencies.

Operationally, the server ships with the standard pieces: health checking and reflection (works with Kubernetes probes and grpcurl out of the box), proper gRPC status code mapping instead of leaked tracebacks, a non-localhost bind warning matching the MCP server, and a 100 MiB default message limit so real documents work without tuning.
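For readers unfamiliar with status code mapping, here is a small stdlib-only sketch of the general pattern: exceptions are translated into a status name plus a safe message, so no traceback reaches the client. The mapping table, `to_status`, and `handle` are assumptions for illustration and not the PR's actual mapping. The status names themselves are standard gRPC codes, and the size constant mirrors the 100 MiB default mentioned above.

```python
from typing import Callable, Tuple

# 100 MiB, the default message limit named in the post.
MAX_MESSAGE_BYTES = 100 * 1024 * 1024


def to_status(exc: Exception) -> Tuple[str, str]:
    """Map an exception to (gRPC status name, client-safe message)."""
    if isinstance(exc, FileNotFoundError):
        return "NOT_FOUND", "document not found"
    if isinstance(exc, ValueError):
        return "INVALID_ARGUMENT", str(exc)
    # Anything unexpected: hide internals instead of leaking a traceback.
    return "INTERNAL", "conversion failed"


def handle(convert: Callable[[bytes], str], payload: bytes) -> Tuple[str, str]:
    """Run a conversion and always return a status plus a body or message."""
    if len(payload) > MAX_MESSAGE_BYTES:
        return "RESOURCE_EXHAUSTED", "message exceeds the 100 MiB limit"
    try:
        return "OK", convert(payload)
    except Exception as exc:
        return to_status(exc)


def reject_empty(payload: bytes) -> str:
    raise ValueError("empty document")


def crash(payload: bytes) -> str:
    raise RuntimeError("failed reading /srv/private/path")


ok_result = handle(lambda payload: "# ok", b"data")
bad_input_result = handle(reject_empty, b"")
crash_result = handle(crash, b"data")
```

In a real servicer the same decision would typically be applied through the RPC context instead of a returned tuple; the tuple keeps this example free of any gRPC dependency.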
The test suite has 376 passing tests under `hatch test`, including byte-exact parity assertions between incremental and whole-document output for PPTX and table-bearing PDFs.

Why start with PDF and PPTX
Both formats have natural structural boundaries (pages, slides) and existing converter logic that could be reused without rewriting anything. DOCX, HTML, and XLSX don't have an equally clean incremental boundary in the current implementation, so they fall back to whole-document conversion transparently through the same API (I can help fix this). If the approach is well received, the streaming layer can grow format by format without forcing changes to the converter architecture.
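The transparent fallback can be pictured as a single streaming entry point that emits one chunk per page or slide for formats with a clean boundary, and one chunk for the whole document otherwise. This is a sketch under stated assumptions: `INCREMENTAL_EXTENSIONS`, `convert_unit`, and `convert_stream` are hypothetical names, and the input is pre-split text standing in for real document parsing.

```python
import os
from typing import Iterator, List

# Formats with a natural page/slide boundary, per the post.
INCREMENTAL_EXTENSIONS = {".pdf", ".pptx"}


def convert_unit(unit: str) -> str:
    """Stand-in for converting one page, slide, or section to Markdown."""
    return unit.strip() + "\n\n"


def convert_stream(filename: str, units: List[str]) -> Iterator[str]:
    """Same API for every format; only the chunking differs."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in INCREMENTAL_EXTENSIONS:
        # Incremental path: emit each unit as soon as it is converted.
        for unit in units:
            yield convert_unit(unit)
    else:
        # Fallback path: whole-document conversion, delivered as one chunk.
        yield "".join(convert_unit(unit) for unit in units)


units = ["Page one ", "Page two ", "Page three "]
pdf_chunks = list(convert_stream("report.pdf", units))
docx_chunks = list(convert_stream("report.docx", units))
```

Callers iterate the same way in both cases, so adding incremental support for another format later would only change how many chunks they see, not the code they write.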
Try it!
`markitdown.streaming` package living inside the core package?