Minimal repro PDF
A 1.1 KB synthetic PDF that reproduces the bug deterministically lives at /tmp/markitdown_repro/repro.pdf (built by /tmp/markitdown_repro/build_repro.py, verified by /tmp/markitdown_repro/verify.py). It contains exactly one page with:
- one Type1 Helvetica font (WinAnsiEncoding)
- one text-showing op drawing BEFORE_IMAGE: this text should be extracted
- one inline image (BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID ... EI) whose ASCII85+Flate payload terminates with a bare ~ (no trailing >), matching the SAP/Crystal Reports output that triggers the bug in real life
- four text-showing ops drawing AFTER_IMAGE: ... lines totalling >200 chars, so the MarkItDown PyMuPDF fallback's threshold is exercised end-to-end
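The page content described in the list above could be assembled roughly as follows. This is a minimal sketch, not the actual build_repro.py: the function names, coordinates, pixel pattern and filler text are assumptions made for illustration, and only the content stream is built (the surrounding PDF objects and xref table are left out).

```python
import base64
import zlib


def build_inline_image(width: int = 16, height: int = 16) -> bytes:
    """Return a BI ... ID ... EI sequence whose ASCII85 data ends with a bare '~'."""
    row_bytes = (width + 7) // 8           # 1 bit per pixel, rows padded to bytes
    raw = b"\xaa" * (row_bytes * height)   # arbitrary pixel pattern (assumption)
    # /F [/A85 /Fl] lists filters in decode order, so the encoder applies
    # Flate first and ASCII85 second.
    encoded = base64.a85encode(zlib.compress(raw))
    header = f"BI /W {width} /H {height} /BPC 1 /IM true /F [/A85 /Fl] ID\n"
    # Terminate with a bare '~' instead of the usual '~>' end-of-data marker.
    return header.encode("ascii") + encoded + b"\n~\nEI\n"


def build_content_stream() -> bytes:
    """Text before the image, the inline image, then four text lines after it."""
    parts = [
        b"BT /F1 12 Tf 72 720 Td "
        b"(BEFORE_IMAGE: this text should be extracted) Tj ET\n",
        b"q\n16 0 0 16 72 680 cm\n" + build_inline_image() + b"Q\n",
    ]
    for i in range(4):
        line = f"AFTER_IMAGE: line {i + 1} " + "x" * 50   # filler text (assumption)
        y = 640 - 20 * i
        parts.append(f"BT /F1 12 Tf 72 {y} Td ({line}) Tj ET\n".encode("ascii"))
    return b"".join(parts)
```

Because `~` is not part of the ASCII85 alphabet, the only `~` in the stream is the terminator, which makes the missing `>` easy to confirm by inspection.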
Verified output:
- pdfminer.six: 46 chars, only BEFORE_IMAGE present (FAIL)
- pdfplumber: 43 chars, only BEFORE_IMAGE present (FAIL)
- PyMuPDF: 295 chars, both markers present (PASS)
Root cause confirmed in source: in pdfminer/pdfinterp.py, do_keyword (lines ~342-348) sets eos = b"~>" whenever the inline image declares an ASCII85 (/A85) filter. The reference Cyberport invoice and the synthetic repro both terminate the ASCII85 payload with a single ~ (a variant that tolerant parsers such as PyMuPDF accept, although the PDF specification defines the ASCII85 end-of-data marker as ~>), so get_inline_data never finds ~> and keeps consuming bytes past the real EI until end-of-stream, swallowing every subsequent text-showing operator.
Maintainers can attach /tmp/markitdown_repro/repro.pdf (or rebuild it from the script) when filing this upstream against pdfminer.six, and can use it as a regression fixture in markitdown's PDF tests.
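The over-consumption described above can be illustrated with a simplified model. This is a sketch of the behaviour as reported in this issue, not pdfminer.six's actual code: `remaining_after_inline_data` is a hypothetical helper that searches for the end-of-data marker and, when it is absent, consumes everything up to end-of-stream.

```python
def remaining_after_inline_data(stream: bytes, eos: bytes) -> bytes:
    """Return what is left of `stream` once the inline image data is consumed.

    Simplified stand-in: look for `eos` after the ID operator; if it never
    appears, the whole rest of the stream is treated as image data.
    """
    data_start = stream.index(b"ID\n") + len(b"ID\n")
    idx = stream.find(eos, data_start)
    if idx == -1:
        return b""                      # consumed up to end-of-stream
    return stream[idx + len(eos):]


HEADER = b"BT (BEFORE_IMAGE) Tj ET\nBI /W 16 /H 16 /F [/A85 /Fl] ID\n"
TRAILER = b"\nEI\nBT (AFTER_IMAGE) Tj ET\n"

well_formed = HEADER + b'Gb"0F_$Rn2~>' + TRAILER   # payload ends with '~>'
bare_tilde = HEADER + b'Gb"0F_$Rn2\n~' + TRAILER   # payload ends with bare '~'

# With '~>' present, the text after the image survives.
print(b"AFTER_IMAGE" in remaining_after_inline_data(well_formed, b"~>"))
# With a bare '~', the search fails and nothing is left to interpret.
print(b"AFTER_IMAGE" in remaining_after_inline_data(bare_tilde, b"~>"))
```

The payload bytes here are placeholders; only the terminator matters for the model.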
Summary
Both extraction paths in PdfConverter (pdfplumber via extract_words / extract_text, and pdfminer via pdfminer.high_level.extract_text) silently fail to return any text positioned after an inline image (BI ... ID ... EI operator sequence) in a page's content stream. The text exists in the PDF as ordinary Tj operators with WinAnsi encoding — there is no decoding ambiguity — it is simply not surfaced.
The user-visible symptom is a "successful" conversion that returns only header content from the page (everything drawn before the inline image) and silently omits the body. The result string can be hundreds of characters out of thousands actually present.
Affected file
packages/markitdown/src/markitdown/converters/_pdf_converter.py — both the pdfplumber.open(...) per-page path and the pdfminer.high_level.extract_text(...) fallback exhibit the same blind spot, because they share the same content-stream parser family.
Root cause
When a content stream contains a sequence like:
... <header text> ...
q
65.30 0 0 18.00 272 768 cm
BI /W 544 /H 150 /BPC 1 /IM true /F [/A85 /Fl] ID
Gb"0F_$Rn2$j/a\#KTG=4X4NgK'T9aoQ_R<NQ#J1g::s@hIY"pSb0!HPBlQf;Z3_?c?X/3iU(92_-er6$jM@#?n`E+#(sa"0Gk3&K>CqL(^pV$_-er6%)ss?!unfl)u
~
EI
Q
... <body text — never returned> ...
the inline-image data uses ASCII85 (/A85) + Flate (/Fl) filters. The parser fails to recognise the real EI: the payload ends with a bare ~ rather than ~>, and the encoded bytes can also contain sequences that look like EI to the content-stream tokenizer. As a result the parser consumes too much input, and the rest of the stream, including subsequent BT ... Tj ... ET blocks, is dropped on the floor.
The bug is not in MarkItDown's higher-level logic; it is inherited from the pdfminer/pdfplumber tokenizer. However, MarkItDown surfaces it as a silent partial extraction with no warning.
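One way to avoid being misled by EI-shaped bytes or a missing ~> is to look for the next EI that stands alone as a token. The following is a minimal sketch under that assumption; `split_inline_image` is a hypothetical helper, not an existing pdfminer.six or pdfplumber function, and it is a heuristic: raw binary image data, or ASCII85 data that happens to wrap around the letters EI, could still produce a false match.

```python
import re

# PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
_WS = rb"\x00\t\n\x0c\r "
# 'EI' preceded by whitespace and followed by whitespace or end-of-stream.
_EI_TOKEN = re.compile(rb"(?<=[" + _WS + rb"])EI(?=[" + _WS + rb"]|\Z)")


def split_inline_image(stream: bytes, data_start: int) -> tuple[bytes, int]:
    """Return (image_data, resume_index) for inline data starting at data_start.

    image_data excludes the EI operator; resume_index points just past it,
    where normal content-stream parsing can continue.
    """
    match = _EI_TOKEN.search(stream, data_start)
    if match is None:
        raise ValueError("no EI operator found after ID")
    return stream[data_start:match.start()], match.end()


stream = b"BI /W 1 /H 1 ID\nabEIcd\n~\nEI\nBT (AFTER_IMAGE) Tj ET\n"
data, resume = split_inline_image(stream, stream.index(b"ID\n") + 3)
print(data)             # the embedded 'EI' inside 'abEIcd' stays part of the data
print(stream[resume:])  # the text-showing operators are still available
```

The scan does not depend on the ASCII85 terminator at all, so it handles both ~> and a bare ~.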
Reproduction
Any PDF with an inline image (BI ... EI) that uses the /A85 /Fl filters and has text after it triggers this. Concretely, for a real-world invoice in our environment:
- pdfplumber: 28 words, 463 chars
- pdfminer: 507 chars
- pymupdf: ~2.0 KB of text (full content, including all line items, totals, footer)
PyMuPDF parses the inline image correctly and returns text positioned after it.
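A comparison like the one above can be scripted along these lines. This is a sketch, not the verify.py mentioned in this issue: `compare_extractors` and the three wrapper functions are illustrative names, the BEFORE_IMAGE / AFTER_IMAGE markers assume the synthetic repro PDF, and the wrappers import their libraries lazily so that only the installed ones need to be passed in.

```python
from typing import Callable, Dict


def pdfminer_text(path: str) -> str:
    from pdfminer.high_level import extract_text
    return extract_text(path)


def pdfplumber_text(path: str) -> str:
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def pymupdf_text(path: str) -> str:
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def compare_extractors(
    path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, dict]:
    """Run each extractor on `path` and record length and marker presence."""
    report = {}
    for name, extract in extractors.items():
        text = extract(path)
        report[name] = {
            "chars": len(text),
            "before": "BEFORE_IMAGE" in text,
            "after": "AFTER_IMAGE" in text,
        }
    return report
```

Usage would look like `compare_extractors("/tmp/markitdown_repro/repro.pdf", {"pdfminer": pdfminer_text, "pdfplumber": pdfplumber_text, "pymupdf": pymupdf_text})`; an extractor affected by the bug reports `"after": False`.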
Why this is hard to detect heuristically
pdfplumber.Page.chars is also short (matches the broken extraction), so we can't compare counts to spot the gap. The PDF reports a normal MediaBox, has font resources, has reasonable annotations — nothing visible at the metadata level signals truncation. The only reliable signal is "another parser returns substantially more text".
Suggested fixes
Recommended: add a pymupdf-based fallback inside PdfConverter.convert that runs only when the primary path returns suspiciously little text (pymupdf_text_len > existing * 1.3 + 500, for example). This keeps the import lazy and avoids paying the cost on documents that already extracted well.
Acknowledged constraint: PyMuPDF is AGPL-3.0 (see Use of Pymupdf #1675), so it cannot be a hard dependency of markitdown core. Implement the fallback as an optional extra (pip install 'markitdown[pymupdf]') — same pattern the markitdown-ocr plugin already uses to depend on PyMuPDF transitively.
Alternative without PyMuPDF: forward-scan for the next EI operator at a token boundary (whitespace or end-of-stream on either side) when handling BI ... ID ... EI, instead of letting the tokenizer be confused by EI-shaped bytes inside compressed image data. This is a deeper fix in pdfminer/pdfplumber territory.
Independent of fix choice: when extraction returns < 1KB for a PDF with > 50KB of file size and ≥ 1 inline image, emit a warning. The current behaviour (silent partial extraction) is the worst failure mode for downstream RAG/document pipelines because the missing content looks like absence rather than corruption.
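The two thresholds mentioned in the suggestions above can be written as small predicates. This is a sketch only: the function and parameter names are hypothetical, the constants are the example values from the text, and how the number of inline images would be obtained is left open.

```python
def should_prefer_fallback(
    primary_len: int, fallback_len: int, ratio: float = 1.3, slack: int = 500
) -> bool:
    """True when the fallback text is substantially longer than the primary text.

    Mirrors the example condition pymupdf_text_len > existing * 1.3 + 500.
    """
    return fallback_len > primary_len * ratio + slack


def should_warn(extracted_len: int, file_size: int, inline_images: int) -> bool:
    """True for a large file with inline images but very little extracted text.

    Mirrors the suggested rule: < 1 KB of text, > 50 KB file, >= 1 inline image.
    """
    return extracted_len < 1024 and file_size > 50 * 1024 and inline_images >= 1


# Real-world invoice figures from this report: 463 chars vs. roughly 2.0 KB.
print(should_prefer_fallback(primary_len=463, fallback_len=2000))
# Synthetic repro figures: 46 chars vs. 295 chars.
print(should_prefer_fallback(primary_len=46, fallback_len=295))
```

With these example constants the real-world invoice crosses the threshold (2000 > 463 * 1.3 + 500), but the synthetic repro does not (295 < 46 * 1.3 + 500), so the constants would need tuning if the small fixture is meant to trigger the fallback.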
Related issues (none cover this exact mode)
Environment
4b65609 (2026-05-07)