
Feature Request: Save embedded images from PDFs as separate files with relative Markdown links #2049

@agodianel


Problem

When converting PDFs with embedded images to Markdown, markitdown currently either:

  • Truncates base64 data (showing ![](data:image/png;base64...)), or
  • Preserves full base64 inline (which makes the Markdown file very large and unsuitable for RAG/indexing)

For full-context RAG pipelines and document archival, I need:

  1. Embedded images extracted and saved as separate files (e.g. images/page3-figure1.png)
  2. Markdown references using relative paths: ![description](images/page3-figure1.png)
  3. Optional: image metadata/captions in front matter or inline comments
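Until such an option exists, items 1 and 2 can be approximated by post-processing Markdown that already contains full inline base64 images. The sketch below is a workaround under stated assumptions, not markitdown functionality: the function name `externalize_images`, the `images_subdir` parameter, and the sequential `imageN` naming are hypothetical. Page and figure numbers cannot be recovered from the Markdown alone, so the `page3-figure1` style of name is not reproduced here.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is a complete base64 data URI,
# e.g. ![alt](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[a-zA-Z0-9.+-]+);base64,"
    r"(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown, output_dir, images_subdir="images"):
    """Write inline base64 images to files and link them by relative path.

    Returns the rewritten Markdown and the list of files written.
    Hypothetical helper, not part of markitdown.
    """
    images_path = Path(output_dir) / images_subdir
    images_path.mkdir(parents=True, exist_ok=True)
    saved = []

    def replace(match):
        # Sequential names: image1.png, image2.jpg, ...
        index = len(saved) + 1
        subtype = match.group("subtype").lower()
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(subtype, subtype)
        name = f"image{index}.{ext}"

        target = images_path / name
        target.write_bytes(base64.b64decode(match.group("data")))
        saved.append(target)

        # Relative link with forward slashes keeps the Markdown portable.
        return f"![{match.group('alt')}]({images_subdir}/{name})"

    return DATA_URI_IMAGE.sub(replace, markdown), saved
```

Truncated placeholders such as `![](data:image/png;base64...)` carry no image data and do not match the pattern, so they are left untouched. This only helps when the converter was able to emit the full base64 payload in the first place.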

Use case

  • Converting technical documentation, datasheets, and PDFs with diagrams for AI/LLM ingestion
  • Preserving full document context including visual information
  • Keeping Markdown files lightweight and portable while still referencing images

Proposed behavior

Add a CLI flag and Python API option, for example:

CLI:

    markitdown document.pdf --extract-images --output-dir ./output

Python API:

    from markitdown import MarkItDown
    md = MarkItDown(extract_images=True, output_dir="./output")
    result = md.convert("document.pdf")
    # result.markdown includes relative image links, as in item 2 above
    # Actual image files saved to ./output/images/
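
To make the expected output concrete, the link that `result.markdown` would contain under this proposal can be built from the naming pattern given earlier (`images/page3-figure1.png`). This is a minimal illustration only: `image_link` and its parameters are hypothetical and not part of markitdown.

```python
from pathlib import PurePosixPath


def image_link(page, figure, alt="", ext="png", images_dir="images"):
    """Build a relative Markdown image link such as
    ![description](images/page3-figure1.png).

    Hypothetical helper illustrating the proposed naming scheme.
    """
    # PurePosixPath always joins with forward slashes, which keeps the
    # link valid regardless of the OS that produced it.
    rel = PurePosixPath(images_dir) / f"page{page}-figure{figure}.{ext}"
    return f"![{alt}]({rel})"
```

Calling `image_link(3, 1, alt="description")` yields the link form shown in item 2 of the requirements.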

Alternatives considered

  • Full base64 inline: makes files too large for indexing/LLM pipelines
  • Image descriptions only: loses visual context for diagrams, charts, schematics
  • Manual extraction: error-prone and not reproducible

Additional context

Environment

  • markitdown version: 0.1.6
  • Python version: 3.11+
  • OS: macOS
