PdfConverter does not extract PDF metadata (title, author, creation date) #1664

@VANDRANKI

Feature request

The current PdfConverter extracts only the text body of a PDF. Standard PDF documents also carry structured metadata in their document info dictionary: title, author, subject, keywords, creator, and creation and modification dates.

For research, legal, and document management workflows this metadata is often exactly what you need. It is also useful as context when an LLM processes the converted markdown: knowing the author and date helps with citation and provenance.

Proposed output

When PDF metadata is present, prepend a metadata block to the converted markdown:

# Document Metadata

**Title:** Annual Report 2025
**Author:** Jane Smith
**Subject:** Financial Results
**Keywords:** annual report, financials, 2025
**Created:** 2025-03-15
**Modified:** 2025-03-20

---

[body text follows]

Implementation notes

pdfminer.six (already a dependency) exposes the document info dictionary via PDFDocument.info, with resolve1 available to dereference indirect values. Alternatively, pypdf exposes it via reader.metadata. Either can be used without adding a new dependency.
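A minimal sketch of the pdfminer.six route, for illustration only. The function names `read_pdf_info` and `decode_pdf_string` are hypothetical and not part of the existing converter, and decoding non-UTF-16 strings as latin-1 is a simplifying assumption standing in for PDFDocEncoding.

```python
from typing import Any, Dict


def decode_pdf_string(value: Any) -> str:
    """Turn a raw info-dictionary value into text (simplified decoding)."""
    if isinstance(value, bytes):
        if value.startswith(b"\xfe\xff"):
            # UTF-16BE with a byte order mark, as PDF text strings may use.
            return value[2:].decode("utf-16-be", errors="replace").strip()
        # Assumption: latin-1 as a stand-in for PDFDocEncoding.
        return value.decode("latin-1").strip()
    if value is None:
        return ""
    return str(value).strip()


def read_pdf_info(path: str) -> Dict[str, str]:
    """Return the non-empty document info entries of a PDF as plain strings."""
    # Imported lazily so decode_pdf_string works without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    result: Dict[str, str] = {}
    with open(path, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        # doc.info is a list of dictionaries; values may be indirect references.
        for info in doc.info:
            for key, raw in info.items():
                text = decode_pdf_string(resolve1(raw))
                if text:
                    result[key] = text
    return result
```

Calling `read_pdf_info("report.pdf")` on a file with an info dictionary would return keys such as `Title`, `Author`, and `CreationDate`, with dates still in the raw PDF form (for example `D:20250315120000Z`).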

Fields with empty or None values should be skipped. The metadata block should only appear when at least one metadata field is non-empty.
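The skipping rules could be sketched roughly as follows. `build_metadata_block`, `format_pdf_date`, and the `FIELDS` mapping are hypothetical names chosen for illustration; the input is assumed to be a plain dict keyed by the standard info dictionary names (`Title`, `Author`, `CreationDate`, and so on), with values already decoded to strings.

```python
import re
from typing import Mapping, Optional

# (info-dictionary key, label used in the markdown block)
FIELDS = [
    ("Title", "Title"),
    ("Author", "Author"),
    ("Subject", "Subject"),
    ("Keywords", "Keywords"),
    ("CreationDate", "Created"),
    ("ModDate", "Modified"),
]

# PDF dates look like D:YYYYMMDDHHmmSS...; month and day are optional.
_PDF_DATE = re.compile(r"^D:(\d{4})(\d{2})?(\d{2})?")


def format_pdf_date(value: str) -> str:
    """Convert 'D:20250315120000Z' to '2025-03-15'; return other input unchanged."""
    match = _PDF_DATE.match(value)
    if not match:
        return value
    year, month, day = match.groups()
    # Missing month or day defaults to 01.
    return f"{year}-{month or '01'}-{day or '01'}"


def build_metadata_block(info: Mapping[str, Optional[str]]) -> str:
    """Build the markdown metadata block, or '' when every field is empty."""
    lines = []
    for key, label in FIELDS:
        value = (info.get(key) or "").strip()
        if not value:
            continue  # skip empty and None values
        if key in ("CreationDate", "ModDate"):
            value = format_pdf_date(value)
        lines.append(f"**{label}:** {value}")
    if not lines:
        return ""
    return "# Document Metadata\n\n" + "\n".join(lines) + "\n\n---\n\n"
```

Because the function returns an empty string when nothing survives the filter, the converter could prepend its result to the body text unconditionally.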

Why this matters

  • Researchers converting batches of papers get author/title/year without parsing the body
  • The DocumentConverterResult.title field can be populated from the PDF title metadata automatically
  • Consistent with how EmlConverter surfaces email headers as structured metadata
