PDF text after an inline image (BI ... EI) is silently dropped from extraction #1870

@martinsotirov

Minimal repro PDF

A 1.1 KB synthetic PDF that reproduces the bug deterministically lives at
/tmp/markitdown_repro/repro.pdf (built by /tmp/markitdown_repro/build_repro.py,
verified by /tmp/markitdown_repro/verify.py). It contains exactly one page with:

  • one Type1 Helvetica font (WinAnsiEncoding)
  • one text-showing op drawing BEFORE_IMAGE: this text should be extracted
  • one inline image (BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID ... EI)
    whose ASCII85+Flate payload terminates with a bare ~ (no trailing >),
    matching the SAP/Crystal Reports output that triggers the bug in real life
  • four text-showing ops drawing AFTER_IMAGE: ... lines totalling >200 chars
    so the MarkItDown PyMuPDF fallback's threshold is exercised end-to-end
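
For readers who want to see what such a page looks like at the byte level, here is a minimal sketch of a content stream with the same shape. It is not the actual build_repro.py (which is not shown here). The font resource name /F1, the coordinates, the all-zero mask data, and the AFTER_IMAGE wording are placeholder assumptions, and the sketch emits only one AFTER_IMAGE line and no surrounding PDF objects.

```python
import base64
import zlib


def build_content_stream() -> bytes:
    """Build a page content stream with text, an inline image, then more text."""
    # 16x16 1-bit image mask: 2 bytes per row, 16 rows (all-zero placeholder data).
    raw = bytes(2 * 16)
    # /F [/A85 /Fl] lists filters in decode order, so encoding runs in reverse:
    # Flate-compress first, then ASCII85-encode. a85encode() adds no <~ ~> framing.
    payload = base64.a85encode(zlib.compress(raw))
    parts = [
        # /F1 and all coordinates below are hypothetical placeholders.
        b"BT /F1 12 Tf 72 720 Td (BEFORE_IMAGE: this text should be extracted) Tj ET",
        b"q 16 0 0 16 72 680 cm",
        b"BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID",
        payload + b"\n~",  # bare "~" terminator, deliberately without ">"
        b"EI",
        b"Q",
        b"BT /F1 12 Tf 72 640 Td (AFTER_IMAGE: placeholder body text) Tj ET",
    ]
    return b"\n".join(parts) + b"\n"


stream = build_content_stream()
print(stream.decode("latin-1"))
```

The ASCII85 alphabet never contains "~", so the only "~" in the stream is the terminator, and the two-byte sequence "~>" does not occur anywhere.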

Verified output:

  • pdfminer.six: 46 chars, only BEFORE_IMAGE present (FAIL)
  • pdfplumber: 43 chars, only BEFORE_IMAGE present (FAIL)
  • PyMuPDF: 295 chars, both markers present (PASS)

Root cause confirmed in source: pdfminer/pdfinterp.py do_keyword
(lines ~342-348) sets eos = b"~>" whenever the inline image declares an
ASCII85 (/A85) filter. The reference Cyberport invoice and the synthetic
repro both terminate the ASCII85 payload with a single ~. The PDF
specification defines ~> as the ASCII85 end-of-data marker, but tolerant
readers such as PyMuPDF accept the bare ~ variant. Because get_inline_data
never finds ~>, it keeps consuming bytes past the real EI until
end-of-stream, swallowing every subsequent text-showing operator.
Maintainers can attach /tmp/markitdown_repro/repro.pdf (or rebuild it from
the script) when filing this upstream against pdfminer.six, and can add it
as a regression fixture in markitdown's PDF tests.
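
The mechanism can be illustrated without pdfminer at all. The following is a simplified stand-in, not pdfminer's code: get_inline_data and pick_terminator here are hypothetical re-implementations that only mimic the behaviour described above (search for a terminator, fall through to end-of-stream if it is missing), and the stream bytes are a shortened made-up example.

```python
def pick_terminator(filters: list[str]) -> bytes:
    """Mimic the described rule: an ASCII85 filter switches the terminator to b"~>"."""
    if filters and filters[0] in ("A85", "ASCII85Decode"):
        return b"~>"
    return b"EI"


def get_inline_data(stream: bytes, start: int, target: bytes) -> tuple[int, bytes]:
    """Simplified stand-in: read from `start` until `target`, or to end-of-stream."""
    end = stream.find(target, start)
    if end == -1:
        # Terminator never found: everything up to end-of-stream becomes image data.
        return len(stream), stream[start:]
    return end + len(target), stream[start:end]


# Shortened, made-up content stream whose ASCII85 payload ends with a bare "~".
stream = (
    b"(BEFORE_IMAGE) Tj\n"
    b"BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID\n"
    b"Gb\"0F_$Rn2\n~\nEI\n"
    b"(AFTER_IMAGE) Tj\n"
)
start = stream.index(b"ID\n") + len(b"ID\n")
eos = pick_terminator(["A85", "Fl"])

pos, data = get_inline_data(stream, start, eos)
swallowed = b"AFTER_IMAGE" in data  # the trailing text ended up inside the image data

# Same stream with a well-formed "~>" terminator, for comparison.
fixed = stream.replace(b"\n~\n", b"\n~>\n")
pos_ok, data_ok = get_inline_data(fixed, start, eos)

print("bare ~ :", pos == len(stream), swallowed)
print("with ~>:", pos_ok < len(fixed), b"AFTER_IMAGE" in data_ok)
```

With the bare ~ the search runs to the end of the stream and the AFTER_IMAGE operator is absorbed into the image data. With ~> the search stops at the terminator and the trailing text stays in the operator stream.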

Summary

Both extraction paths in PdfConverter (pdfplumber via extract_words / extract_text, and pdfminer via pdfminer.high_level.extract_text) silently fail to return any text positioned after an inline image (BI ... ID ... EI operator sequence) in a page's content stream. The text exists in the PDF as ordinary Tj operators with WinAnsi encoding — there is no decoding ambiguity — it is simply not surfaced.

The user-visible symptom is a "successful" conversion that returns only header content from the page (everything drawn before the inline image) and silently omits the body. The returned string can be a few hundred characters long when the page actually contains thousands.

Affected file

packages/markitdown/src/markitdown/converters/_pdf_converter.py: both the pdfplumber.open(...) per-page path and the pdfminer.high_level.extract_text(...) fallback exhibit the same blind spot, because pdfplumber is built on pdfminer.six and both paths go through the same content-stream parser.

Root cause

When a content stream contains a sequence like:

... <header text> ...
q
65.30 0 0 18.00 272 768 cm
BI /W 544 /H 150 /BPC 1 /IM true /F [/A85 /Fl] ID
Gb"0F_$Rn2$j/a\#KTG=4X4NgK'T9aoQ_R<NQ#J1g::s@hIY"pSb0!HPBlQf;Z3_?c?X/3iU(92_-er6$jM@#?n`E+#(sa"0Gk3&K>CqL(^pV$_-er6%)ss?!unfl)u
~
EI
Q
... <body text — never returned> ...

the inline-image data uses ASCII85 (/A85) + Flate (/Fl) filters and ends with a bare ~ rather than ~>. As described under "Minimal repro PDF", pdfminer looks for ~> as the end of the image data whenever /A85 is declared, never finds it, and reads past the real EI to the end of the stream. The rest of the stream, including subsequent BT ... Tj ... ET blocks, is consumed as image data and dropped.

The bug is not in MarkItDown's higher-level logic; it is inherited from the pdfminer content-stream parser that pdfplumber also uses. However, MarkItDown surfaces it as a silent partial extraction with no warning.

Reproduction

Any PDF with an inline image (BI ... EI) that uses the /A85 /Fl filters, ends its payload with a bare ~, and has text after the image triggers this. Concretely:

import pdfminer.high_level
import pdfplumber

path = "<affected.pdf>"

# pdfplumber path: per-page word and text extraction
with pdfplumber.open(path) as pdf:
    for p in pdf.pages:
        words = p.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
        print("words:", len(words), "extract_text length:", len(p.extract_text() or ""))

# pdfminer path: whole-document text extraction
with open(path, "rb") as f:
    print("pdfminer length:", len(pdfminer.high_level.extract_text(f)))

For a real-world invoice in our environment:

  • pdfplumber: 28 words, 463 chars
  • pdfminer: 507 chars
  • pymupdf: ~2.0 KB of text (full content, including all line items, totals, footer)

PyMuPDF parses the inline image correctly and returns text positioned after it.

Why this is hard to detect heuristically

pdfplumber.Page.chars is also short (matches the broken extraction), so we can't compare counts to spot the gap. The PDF reports a normal MediaBox, has font resources, has reasonable annotations — nothing visible at the metadata level signals truncation. The only reliable signal is "another parser returns substantially more text".
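
The "another parser returns substantially more text" signal can be reduced to a single comparison. This is a minimal sketch using the example threshold from the suggested fixes in this issue (fallback length > primary length * 1.3 + 500). The function name fallback_is_warranted and its parameter names are hypothetical, and the ~2.0 KB PyMuPDF figure is approximated as 2000 characters.

```python
def fallback_is_warranted(
    primary_len: int,
    fallback_len: int,
    ratio: float = 1.3,
    margin: int = 500,
) -> bool:
    """Return True when the second parser found substantially more text."""
    return fallback_len > primary_len * ratio + margin


# Real-world invoice figures reported above (pdfplumber 463 chars, PyMuPDF ~2000).
invoice = fallback_is_warranted(primary_len=463, fallback_len=2000)

# Synthetic repro figures reported above (pdfminer 46 chars, PyMuPDF 295).
synthetic = fallback_is_warranted(primary_len=46, fallback_len=295)

print("invoice:", invoice)
print("synthetic repro:", synthetic)
```

With these example constants the invoice figures clear the threshold, but the synthetic repro's 295 characters do not, so the ratio and margin would need tuning (or a lower margin for short documents) before the repro could serve as a regression fixture for this check.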

Suggested fixes

  1. Recommended: add a pymupdf-based fallback inside PdfConverter.convert that runs only when the primary path returns suspiciously little text (pymupdf_text_len > existing * 1.3 + 500, for example). This keeps the import lazy and avoids paying the cost on documents that already extracted well.

  2. Acknowledged constraint: PyMuPDF is AGPL-3.0 (see Use of Pymupdf #1675), so it cannot be a hard dependency of markitdown core. Implement the fallback as an optional extra (pip install 'markitdown[pymupdf]') — same pattern the markitdown-ocr plugin already uses to depend on PyMuPDF transitively.

  3. Alternative without PyMuPDF: forward-scan for the next EI operator at a token boundary (whitespace or end-of-stream on either side) when handling BI ... ID ... EI, instead of depending on the ~> end-of-data marker or being confused by EI-shaped bytes inside the image data. This is a deeper fix in pdfminer/pdfplumber territory.

  4. Independent of fix choice: when extraction returns < 1KB for a PDF with > 50KB of file size and ≥ 1 inline image, emit a warning. The current behaviour (silent partial extraction) is the worst failure mode for downstream RAG/document pipelines because the missing content looks like absence rather than corruption.
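
Fixes 3 and 4 can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a patch for pdfminer: find_inline_image_end and should_warn are hypothetical names, the whitespace set is the PDF white-space characters, and text length is measured with len() on the extracted string.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_inline_image_end(stream: bytes, data_start: int) -> int:
    """Return the index of the first EI at a token boundary, or -1 if none.

    A boundary means whitespace before EI, and whitespace or end-of-stream after it.
    """
    pos = stream.find(b"EI", data_start)
    while pos != -1:
        before_ok = pos > 0 and stream[pos - 1] in PDF_WHITESPACE
        after = pos + 2
        after_ok = after == len(stream) or stream[after] in PDF_WHITESPACE
        if before_ok and after_ok:
            return pos
        pos = stream.find(b"EI", pos + 1)
    return -1


def should_warn(text_len: int, file_size: int, inline_images: int) -> bool:
    """Fix 4 heuristic: under 1 KB of text from a >50 KB file with an inline image."""
    return text_len < 1024 and file_size > 50 * 1024 and inline_images >= 1


# Made-up stream: two EI-shaped byte pairs inside the data, then the real EI.
stream = b"ID\nabEIcd xEI \n~\nEI\nQ\n(AFTER_IMAGE) Tj\n"
end = find_inline_image_end(stream, data_start=3)
rest = stream[end + 2:]  # operators after the inline image survive

print(end, rest)
```

The boundary check is still a heuristic: raw (unencoded) image data can contain whitespace-delimited EI by chance, so a production fix would probably also validate what follows the candidate EI. The should_warn thresholds are the ones proposed in fix 4 and would not fire for very small files such as the 1.1 KB synthetic repro.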

Related issues (none cover this exact mode)

Environment

  • markitdown commit: 4b65609 (2026-05-07)
  • pdfplumber 0.11.9, pdfminer.six 20251230, pymupdf 1.27.2.3
  • macOS 15.6 (Darwin 24.6.0), Python 3.13
