Unescaped data going to XMP metadata makes output documents unparseable #79

@epuronta

Description

When XMP metadata is generated in drafthorse/pdf.py:158, the data is embedded in the XML template string unescaped. If any value contains characters that are not XML-safe (such as &), the resulting XMP is invalid and the output PDF can no longer be parsed as Factur-X.

This will produce something like this when reading the file back:
pypdf.errors.PdfReadError: XML in XmpInformation was invalid: not well-formed (invalid token)

Or when the file is handled with Mustang:

WARNING: Problems with parsing metadata. XML parsing failure
org.verapdf.xmp.XMPException: XML parsing failure
...
Caused by: org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 28; The entity name must immediately follow the '&' in the entity reference.

Two cases easily trigger this in practice, because the metadata is extracted automatically from the Factur-X payload in drafthorse/pdf.py:294:

  • Selling company name in ApplicableHeaderTradeAgreement/SellerTradePartyName/Name. Generating invoices for Michael & Son will fail.
  • Invoice number in ExchangedDocument/ID. Less likely, but still possible. Fails with an invoice number like A&A-1.
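
A minimal sketch of how both cases break parsing, using only the Python standard library. The `TEMPLATE` string and the `is_well_formed` helper are hypothetical stand-ins for illustration; they are not the actual template or code in drafthorse/pdf.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an XMP template.
# The real template in drafthorse/pdf.py is larger and uses RDF namespaces.
TEMPLATE = "<xmp><seller>{seller}</seller><id>{invoice_id}</id></xmp>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Values are dropped into the template without escaping, as described above.
broken_seller = TEMPLATE.format(seller="Michael & Son", invoice_id="2024-001")
broken_id = TEMPLATE.format(seller="Example GmbH", invoice_id="A&A-1")
harmless = TEMPLATE.format(seller="Example GmbH", invoice_id="2024-001")

print(is_well_formed(broken_seller))
print(is_well_formed(broken_id))
print(is_well_formed(harmless))
```

Only the last document parses; a bare `&` in either field is enough to make the whole XML fragment invalid.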

All metadata going to XMP generation should be escaped.
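
One possible shape for such a fix, sketched with `xml.sax.saxutils.escape` from the standard library. The template and the `render_xmp` function are hypothetical and simplified; this is not necessarily how the eventual PR implements it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical, simplified stand-in for the XMP template in drafthorse/pdf.py.
TEMPLATE = "<xmp><seller>{seller}</seller><id>{invoice_id}</id></xmp>"


def render_xmp(seller: str, invoice_id: str) -> str:
    """Fill the template, escaping &, < and > in every value first."""
    return TEMPLATE.format(
        seller=escape(seller),
        invoice_id=escape(invoice_id),
    )


xmp = render_xmp("Michael & Son", "A&A-1")

# The escaped document parses, and the parser restores the original text.
root = ET.fromstring(xmp)
print(root.findtext("seller"))
print(root.findtext("id"))
```

Note that `escape` covers element content only. If a value were placed inside an XML attribute, quote characters would need handling as well, for example with `xml.sax.saxutils.quoteattr`.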

I'll be creating a PR to fix it.
