Feature request
The current PdfConverter extracts only the text body of a PDF. Standard PDF documents carry structured metadata in their document info dictionary: title, author, subject, keywords, creator, creation date, and modification date.
For research, legal, and document management workflows this metadata is often exactly what you need. It is also useful as context when the LLM processes the converted markdown - knowing the author and date helps with citation and provenance.
Proposed output
When PDF metadata is present, prepend a metadata block to the converted markdown:
# Document Metadata
**Title:** Annual Report 2025
**Author:** Jane Smith
**Subject:** Financial Results
**Keywords:** annual report, financials, 2025
**Created:** 2025-03-15
**Modified:** 2025-03-20
---
[body text follows]
Implementation notes
pdfminer.six (already a dependency) exposes document info via PDFDocument and resolve1. Alternatively, pypdf exposes it via reader.metadata. Either can be used without adding a new dependency.
Fields with empty or None values should be skipped. The metadata block should only appear when at least one metadata field is non-empty.
Why this matters
- Researchers converting batches of papers get author/title/year without parsing the body
- The DocumentConverterResult.title field can be populated from the PDF title metadata automatically
- Consistent with how EmlConverter surfaces email headers as structured metadata