Problem
When converting PDFs with embedded images to Markdown, markitdown currently either:
- Truncates the base64 data, or
- Preserves full base64 inline (which makes the Markdown file very large and unsuitable for RAG/indexing)
For full-context RAG pipelines and document archival, I need:
- Embedded images extracted and saved as separate files (e.g. images/page3-figure1.png)
- Markdown references to those files using relative paths
- Optional: image metadata/captions in front matter or inline comments
Use case
- Converting technical documentation, datasheets, and PDFs with diagrams for AI/LLM ingestion
- Preserving full document context including visual information
- Keeping Markdown files lightweight and portable while still referencing images
Proposed behavior
Add a CLI flag and Python API option, for example:

CLI:
    markitdown document.pdf --extract-images --output-dir ./output

Python API:
    from markitdown import MarkItDown
    md = MarkItDown(extract_images=True, output_dir="./output")
    result = md.convert("document.pdf")
    # result.markdown includes relative-path image references
    # Actual image files saved to ./output/images/
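
To make the intended output concrete, here is a rough post-processing sketch of the behavior I have in mind. It is not how markitdown works today. It assumes the converted Markdown already contains full inline base64 data URIs, and the helper name `externalize_images`, the `image-N` file naming, and the `images` subdirectory are all hypothetical choices for illustration. Page or figure numbers are not recoverable from the Markdown alone, so a real implementation inside the PDF converter could produce better names.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is an inline base64 data URI.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[A-Za-z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown: str, output_dir: str, image_subdir: str = "images") -> str:
    """Write inline base64 images to files and return Markdown with relative links.

    Hypothetical helper for illustration; not part of markitdown.
    """
    image_dir = Path(output_dir) / image_subdir
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        image_dir.mkdir(parents=True, exist_ok=True)
        # Map MIME subtypes to common file extensions where they differ.
        ext = match.group("ext").lower()
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(ext, ext)
        filename = f"image-{counter}.{ext}"
        (image_dir / filename).write_bytes(base64.b64decode(match.group("data")))
        # The link is relative to output_dir, where the Markdown file would live.
        return f"![{match.group('alt')}]({image_subdir}/{filename})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

With this, the Markdown written to ./output would reference images/image-1.png and so on, while the decoded image bytes live in ./output/images/.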
Alternatives considered
- Full base64 inline: makes files too large for indexing/LLM pipelines
- Image descriptions only: loses visual context for diagrams, charts, schematics
- Manual extraction: error-prone and not reproducible
Additional context
Environment
- markitdown version: 0.1.6
- Python version: 3.11+
- OS: macOS