PDF text after an inline image (`BI ... EI`) is silently dropped from extraction #1870

## Minimal repro PDF

A 1.1 KB synthetic PDF that reproduces the bug deterministically lives at
`/tmp/markitdown_repro/repro.pdf` (built by `/tmp/markitdown_repro/build_repro.py`,
verified by `/tmp/markitdown_repro/verify.py`). It contains exactly one page with:

- one Type1 Helvetica font (WinAnsiEncoding)
- one text-showing op drawing `BEFORE_IMAGE: this text should be extracted`
- one inline image (`BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID ... EI`)
  whose ASCII85+Flate payload terminates with a bare `~` (no trailing `>`),
  matching the SAP/Crystal Reports output that triggers the bug in real life
- four text-showing ops drawing `AFTER_IMAGE: ...` lines totalling >200 chars
  so the MarkItDown PyMuPDF fallback's threshold is exercised end-to-end
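The build script itself is not reproduced in this issue. As a rough sketch of how a content stream with the layout listed above could be assembled, the following uses only the standard library. The function name, the pixel pattern, the coordinates, and the `AFTER_IMAGE` wording are assumptions for illustration, and a complete PDF would still need the page, font, and xref objects around this stream.

```python
import base64
import zlib


def build_content_stream() -> bytes:
    """Assemble text, an inline image, then more text (hypothetical helper)."""
    # 16x16 image mask at 1 bit per component: 2 bytes per row, 16 rows.
    raw = bytes([0xAA, 0x55]) * 16
    # /F [/A85 /Fl] lists filters in decoding order, so encoding runs in
    # reverse: Flate-compress first, then ASCII85-encode the result.
    payload = base64.a85encode(zlib.compress(raw))
    parts = [
        b"BT /F1 12 Tf 72 720 Td (BEFORE_IMAGE: this text should be extracted) Tj ET",
        b"q 16 0 0 16 72 680 cm",
        b"BI /W 16 /H 16 /BPC 1 /IM true /F [/A85 /Fl] ID",
        payload + b"\n~",  # bare "~" on purpose, NOT the usual "~>" marker
        b"EI",
        b"Q",
        # Illustrative wording only; the real repro draws four such lines.
        b"BT /F1 12 Tf 72 640 Td (AFTER_IMAGE: this text should also be extracted) Tj ET",
    ]
    return b"\n".join(parts) + b"\n"


if __name__ == "__main__":
    print(build_content_stream().decode("latin-1"))
```

The only unusual detail is the terminator: `base64.a85encode` is called without `adobe=True`, so no `<~ ... ~>` framing is added, and the single `~` is appended by hand.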

Verified output:

- pdfminer.six: 46 chars, only `BEFORE_IMAGE` present (FAIL)
- pdfplumber: 43 chars, only `BEFORE_IMAGE` present (FAIL)
- PyMuPDF: 295 chars, both markers present (PASS)

**Root cause confirmed in source:** `pdfminer/pdfinterp.py` `do_keyword`
(lines ~342-348) sets `eos = b"~>"` whenever the inline image declares an
ASCII85 (`/A85`) filter. Both the reference Cyberport invoice and the
synthetic repro terminate the ASCII85 payload with a single `~`. The PDF
specification defines `~>` as the ASCII85 end-of-data marker, but lenient
readers such as PyMuPDF accept the bare `~`. As a result, `get_inline_data`
never finds `~>` and keeps consuming bytes past the real `EI` until
end-of-stream, swallowing every subsequent text-showing operator.
Maintainers can attach `/tmp/markitdown_repro/repro.pdf` (or rebuild it
from the script) when filing this upstream against pdfminer.six, and use it
as a regression fixture in markitdown's PDF tests.
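The effect of searching for a marker that never appears can be illustrated with a small standalone model. This is not pdfminer's code: both function names are hypothetical, the stream is a toy, and the second function is only one possible tolerant strategy (the first `EI` delimited by whitespace).

```python
import re

# "EI" with whitespace before it and whitespace or end-of-data after it.
EI_TOKEN = re.compile(rb"(?<=\s)EI(?=\s|\Z)")


def find_inline_data_end(stream: bytes, start: int, target: bytes) -> int:
    """Simplified model of a strict marker search: return the index just
    past `target`, or len(stream) if the marker never appears."""
    hit = stream.find(target, start)
    return len(stream) if hit == -1 else hit + len(target)


def find_ei_at_token_boundary(stream: bytes, start: int) -> int:
    """Hypothetical tolerant search: return the index just past the first
    whitespace-delimited `EI`, or len(stream) if there is none."""
    match = EI_TOKEN.search(stream, start)
    return len(stream) if match is None else match.end()


# Toy content stream: the ASCII85 payload ends with a bare "~".
stream = (
    b"BT (BEFORE_IMAGE) Tj ET\n"
    b"BI /W 16 /H 16 /F [/A85 /Fl] ID\n"
    b"Gb\"0F_$Rn2\n~\nEI\n"
    b"BT (AFTER_IMAGE) Tj ET\n"
)
data_start = stream.index(b"ID\n") + 3

strict_end = find_inline_data_end(stream, data_start, b"~>")
tolerant_end = find_ei_at_token_boundary(stream, data_start)

print("left after strict search:  ", stream[strict_end:])
print("left after tolerant search:", stream[tolerant_end:])
```

In this model the strict search runs to the end of the stream, so nothing is left to parse, while the token-boundary search stops at `EI` and leaves the trailing text block intact. A whitespace-delimited `EI` can still occur by chance inside unencoded binary image data, so a real fix would probably need further validation.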

## Summary

Both extraction paths in `PdfConverter` (pdfplumber via `extract_words` / `extract_text`, and pdfminer via `pdfminer.high_level.extract_text`) silently fail to return any text positioned **after an inline image** (`BI ... ID ... EI` operator sequence) in a page's content stream. The text exists in the PDF as ordinary `Tj` operators with WinAnsi encoding — there is no decoding ambiguity — it is simply not surfaced.

The user-visible symptom is a "successful" conversion that returns only header content from the page (everything drawn before the inline image) and silently omits the body. The result string can be hundreds of characters out of thousands actually present.

## Affected file

`packages/markitdown/src/markitdown/converters/_pdf_converter.py`: both the `pdfplumber.open(...)` per-page path and the `pdfminer.high_level.extract_text(...)` fallback exhibit the same blind spot, because pdfplumber is built on pdfminer.six and both paths go through the same content-stream parser.

## Root cause

When a content stream contains a sequence like:

```
... <header text> ...
q
65.30 0 0 18.00 272 768 cm
BI /W 544 /H 150 /BPC 1 /IM true /F [/A85 /Fl] ID
Gb"0F_$Rn2$j/a\#KTG=4X4NgK'T9aoQ_R<NQ#J1g::s@hIY"pSb0!HPBlQf;Z3_?c?X/3iU(92_-er6$jM@#?n`E+#(sa"0Gk3&K>CqL(^pV$_-er6%)ss?!unfl)u
~
EI
Q
... <body text — never returned> ...
```

the inline-image data uses ASCII85 (`/A85`) + Flate (`/Fl`) filters, and the ASCII85 payload ends with a bare `~` rather than `~>`. Because pdfminer searches for `~>` as the end marker whenever the inline image declares an `/A85` filter, it never stops at the real `EI`, and the rest of the stream, including subsequent `BT ... Tj ... ET` blocks, is dropped on the floor.

The bug is not in MarkItDown's higher-level logic, but inherited from pdfminer/pdfplumber's tokenizer. However, MarkItDown surfaces it as a silent partial-extraction with no warning.

## Reproduction

Any PDF that has an inline image (`BI...EI`) with `/A85 /Fl` filters, an ASCII85 payload that ends without the `~>` marker, and text after the image should trigger this. Concretely:

```python
import pdfminer.high_level
import pdfplumber

path = "<affected.pdf>"  # placeholder: substitute an affected PDF

# pdfplumber path: per-page word and text extraction
with pdfplumber.open(path) as pdf:
    for p in pdf.pages:
        words = p.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
        print("words:", len(words), "extract_text length:", len(p.extract_text() or ""))

# pdfminer path: whole-document text extraction
with open(path, "rb") as f:
    print("pdfminer length:", len(pdfminer.high_level.extract_text(f)))
```

For a real-world invoice in our environment:

- `pdfplumber`: 28 words, 463 chars
- `pdfminer`: 507 chars
- `pymupdf`: ~2.0 KB of text (full content, including all line items, totals, footer)

PyMuPDF parses the inline image correctly and returns text positioned after it.

## Why this is hard to detect heuristically

`pdfplumber.Page.chars` is also short (matches the broken extraction), so we can't compare counts to spot the gap. The PDF reports a normal `MediaBox`, has font resources, has reasonable annotations — nothing visible at the metadata level signals truncation. The only reliable signal is "another parser returns substantially more text".
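A minimal sketch of that "another parser returns more" signal, reusing the example threshold from the Suggested fixes section. The function names and default constants are assumptions, not existing MarkItDown API, and PyMuPDF is imported lazily so the comparison helper works without it installed.

```python
def fallback_gained_text(primary_len: int, fallback_len: int,
                         ratio: float = 1.3, margin: int = 500) -> bool:
    """True when the second parser returned substantially more text.

    Mirrors the example condition `pymupdf_text_len > existing * 1.3 + 500`.
    """
    return fallback_len > primary_len * ratio + margin


def pymupdf_text(path: str) -> str:
    """Extract plain text with PyMuPDF (optional dependency, lazy import)."""
    import pymupdf  # assumes a PyMuPDF release that ships the `pymupdf` module

    with pymupdf.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


if __name__ == "__main__":
    # Figures from this issue; ~2.0 KB is approximated as 2000 characters.
    print(fallback_gained_text(507, 2000))  # real-world invoice, pdfminer
    print(fallback_gained_text(46, 295))    # synthetic repro, pdfminer
```

With these example constants the real-world invoice crosses the threshold, but the synthetic repro (46 vs 295 characters) does not, so the constants would need tuning before the small fixture could exercise this exact condition.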

## Suggested fixes

1. **Recommended:** add a pymupdf-based fallback inside `PdfConverter.convert` that runs only when the primary path returns suspiciously little text, and prefer its output only when it is substantially longer (`pymupdf_text_len > existing * 1.3 + 500`, for example). This keeps the import lazy and avoids paying the cost on documents that already extracted well.

2. Acknowledged constraint: PyMuPDF is AGPL-3.0 (see #1675), so it cannot be a hard dependency of `markitdown` core. Implement the fallback as an **optional extra** (`pip install 'markitdown[pymupdf]'`) — same pattern the `markitdown-ocr` plugin already uses to depend on PyMuPDF transitively.

3. Alternative without PyMuPDF: when handling `BI ... ID ... EI`, forward-scan for the next `EI` operator at a token boundary (whitespace or end-of-stream on either side) instead of relying solely on the `~>` end-of-data marker, which is absent in the affected files. This is a deeper fix in pdfminer/pdfplumber territory.

4. **Independent of fix choice:** when extraction returns < 1KB for a PDF with > 50KB of file size and ≥ 1 inline image, emit a warning. The current behaviour (silent partial extraction) is the worst failure mode for downstream RAG/document pipelines because the missing content looks like absence rather than corruption.
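The warning in item 4 could look roughly like the sketch below. The function names and message text are hypothetical, the thresholds are the ones quoted above, and the inline-image count is assumed to be supplied by the caller, since this issue does not specify how it would be obtained.

```python
import warnings


def looks_truncated(text_bytes: int, file_size: int, inline_images: int) -> bool:
    """Heuristic from item 4: under 1 KB of text from a file over 50 KB
    that contains at least one inline image."""
    return text_bytes < 1024 and file_size > 50 * 1024 and inline_images >= 1


def warn_if_truncated(text: str, file_size: int, inline_images: int) -> bool:
    """Emit a UserWarning when the extraction looks partial.

    Returns True if a warning was emitted, so callers can also branch on it.
    """
    text_bytes = len(text.encode("utf-8"))
    if looks_truncated(text_bytes, file_size, inline_images):
        warnings.warn(
            f"PDF extraction returned only {text_bytes} bytes of text from a "
            f"{file_size}-byte file with {inline_images} inline image(s); "
            "text after an inline image may be missing.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Using the standard `warnings` module keeps the signal visible by default while letting pipelines escalate it to an error with a warnings filter.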

## Related issues (none cover this exact mode)

- #1276 — pdfminer is slow and lower-quality vs PyMuPDF/PyMuPDF4LLM
- #293 — tables not extracted properly
- #1675 — PyMuPDF AGPL licensing concern (justifies the optional-extra approach)
- #131 — long-standing request for a better PDF converter
- #217 — different symptom (MediaBox KeyError) but same family

## Environment

- markitdown commit: `4b65609` (2026-05-07)
- pdfplumber 0.11.9, pdfminer.six 20251230, pymupdf 1.27.2.3
- macOS 15.6 (Darwin 24.6.0), Python 3.13

