Summary
Currently, MarkItDown converts supported documents into Markdown format only. It would be useful to provide an optional JSON output format that preserves document structure, such as headings, paragraphs, tables, images, and metadata.
Motivation
Many developers use MarkItDown as part of automated processing pipelines and LLM workflows. Structured JSON output would allow easier integration with downstream applications without requiring additional parsing of Markdown.
Proposed Solution
Add a command-line option such as:
markitdown document.pdf --output-format json
Example output:
{
  "title": "Sample Document",
  "sections": [
    {
      "heading": "Introduction",
      "content": "..."
    }
  ]
}
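To make the target shape concrete, here is a minimal sketch of the kind of post-processing users currently have to write themselves to get from Markdown text to the structure above. It is not MarkItDown code: the function name markdown_to_sections is hypothetical, it assumes the Markdown uses ATX-style headings (lines starting with #), and it only covers the "title" and "sections" keys from the example, not tables, images, or metadata.

```python
import json
import re

# ATX-style heading: one to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def markdown_to_sections(markdown: str) -> dict:
    """Group Markdown text into a {"title", "sections"} dictionary.

    Hypothetical helper for illustration only. The first level-1 heading
    becomes the title; every other heading starts a new section. Text that
    appears before the first section heading is dropped in this sketch.
    """
    title = None
    sections = []
    current = None
    in_fence = False

    for line in markdown.splitlines():
        # Track fenced code blocks so '#' comments inside code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence

        match = None if in_fence else HEADING_RE.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
            if level == 1 and title is None:
                title = text
                continue
            current = {"heading": text, "content": []}
            sections.append(current)
        elif current is not None:
            current["content"].append(line)

    return {
        "title": title,
        "sections": [
            {"heading": s["heading"], "content": "\n".join(s["content"]).strip()}
            for s in sections
        ],
    }


if __name__ == "__main__":
    sample = "# Sample Document\n\n## Introduction\nSome text.\n"
    print(json.dumps(markdown_to_sections(sample), indent=2))
```

A built-in JSON output could skip this round trip through Markdown and keep information that heading-based splitting loses, such as table cells and image references.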
Benefits
Easier integration with AI pipelines
Structured document representation
Better support for data extraction workflows
Reduced need for custom Markdown parsing