Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docx: corrupted file caused by metadata fields with different case #5413

Open
tolot27 opened this issue Mar 31, 2019 · 16 comments
Open

docx: corrupted file caused by metadata fields with different case #5413

tolot27 opened this issue Mar 31, 2019 · 16 comments

Comments

@tolot27
Copy link
Contributor

tolot27 commented Mar 31, 2019

With #5252 (pandoc 2.6 ) it is possible to store custom metadata properties in docx. But in rare cases this can lead to corrupt docx files if two metadata fields just differing in case.

The cause of this problem was found by @lierdakil as part of lierdakil/pandoc-crossref#223. A MWE without involving pandoc-crossref is:

echo -e "---\ntest: 1\nTest: 1\n..." | pandoc -o corrupt.docx
@jkr
Copy link
Collaborator

jkr commented Apr 1, 2019

Interesting! There's probably no perfect solution, but @agusmba, @jgm, how does this sound?

  1. the first usage of a custom prop gets kept, regardless of case.

  2. future uses with different case get thrown out, and trigger a warning.

  3. future uses with same case overwrite.

  4. a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

@jkr
Copy link
Collaborator

jkr commented Apr 1, 2019

Oh, and question: should this be specific to docx or a general rule? I'd tend to think the latter -- less confusion.

@lierdakil
Copy link
Contributor

lierdakil commented Apr 1, 2019

or a general rule

Please don't go redefining YAML semantics just because you feel it's "less confusion". By the end of the road, you'll most likely get more confusion. YAML is case-sensitive. In general, that should be the only thing that matters. At the very least, pandoc's metadata has been case-sensitive since the beginning. Changing that now should at least be accompanied by a major version bump, since it's a huge change in behaviour, as far as I'm concerned.

future uses with same case overwrite

This is kinda irrelevant. YAML parser used in Pandoc will refuse to parse YAML objects with duplicate fields. As it should to be a compliant YAML parser (YAML spec 1.2 says that YAML object MUST have unique keys to be valid)

Edit: nope, apparently I'm wrong here. Pandoc's YAML parser will silently use the latest definition of the field (my previous test had other syntactical errors):

$ echo -e "---\ntest: 1\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 2
---

Lorem

a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

Not sure what you mean here. Could you explain in more detail?

As I see it, the issue is that OOXML's custom props use case-insensitive names, while pandoc's metadata is case-sensitive (and it was since the beginning, so changing that now doesn't sound good; maybe in pandoc v3.0 that would be justified). Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

@jkr
Copy link
Collaborator

jkr commented Apr 1, 2019

Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

That's what I meant to say. I wasn't talking about YAML at all -- I was referring to what the docx writer does with the metadata it's handed, and how we get produce docx custom properties. (That's why I was referring to it as properties, and not fields.)

As far as a general rule goes, I just meant for writers that produce these xml formats with custom metadata properties. But yes, that's a silly way to deal with individual clunky formats.

@lierdakil
Copy link
Contributor

That's what I meant to say.

Okay, cool.

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

or writers that produce these xml formats with custom metadata properties

About that... pptx is obviously the same as docx, that's a given. But I'm not so sure about ODT. LibreOffice doesn't mind metadata properties differing in case only (both in docx and odt), but whether that's a quirk of LibreOffice, or something related to the ODF specification, I can't say.

@agusmba
Copy link
Contributor

agusmba commented Apr 1, 2019

@lierdakil

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

Not really, please note that:

$ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 1
---

which is consistent with the documentation:

A document may contain multiple metadata blocks. The metadata fields will be combined through a left-biased union: if two metadata blocks attempt to set the same field, the value from the first block will be taken.

What wasn't stated apparently was that if a field is defined multiple times within the same yaml block, the last definition is kept (I don't think this is a bug since many libraries act this way, but it might be a good idea to document this also).

That said, I would apply the same rationale to custom properties in docx|odt (ignoring their case):

  • within the same yaml block, the latest definition is kept
  • duplicate definitions in subsequent yaml blocks are ignored

@agusmba
Copy link
Contributor

agusmba commented Apr 1, 2019

actually since this is a docx|odt quirk, we might go around solving it in a simpler way, that is:

At the writer level, where pandoc has already filtered real duplicates for us, use the first definition of a property and if a second definition with different case is found, ignore it throwing a warning.

@lierdakil
Copy link
Contributor

lierdakil commented Apr 1, 2019 via email

@lierdakil
Copy link
Contributor

lierdakil commented Apr 1, 2019 via email

@agusmba
Copy link
Contributor

agusmba commented Apr 1, 2019

Except writer has no information about original definition order, and
changing that would require nontrivial changes to AST. So "first" is
ambiguous here.

Fair enough, I'd rather have a simple solution with a warning, so if the order is unknown let's not even worry about it (if it is known, like alphabetical, lower case first..., we could include it in the warning). Whenever the docx|odt writer detects a case-duplicate ignore it and warn the user.

Warning example:

Ambiguous definition of property XXXX, docx|odt does not support custom properties with the same name (ignoring case), only one definition will be included in the output.

Actually if we document the behavior in the MANUAL, there'd be no need for the warning (as what happens currently with real duplicates)

@lierdakil
Copy link
Contributor

there'd be no need for the warning

I would argue having a warning is better than having no warning, regardless of documentation. Explicit over implicit.

@jgm
Copy link
Owner

jgm commented Apr 1, 2019

I like the idea of having a warning in the docx writer when it ignores these case-duplicates.

@lierdakil
Copy link
Contributor

Small update. We need to do this for odt too. All the spec has to say is this:

The name shall be unique within the domain of meta:user-defined elements.

without defining precisely what "unique" means.

While LibreOffice and OpenOffice seemingly use case-sensitive comparison (or at least they don't complain about "corrupted file" and don't break rendering), Word does not, and hence also complains about odt with properties with names that differ only in case.

@tolot27
Copy link
Contributor Author

tolot27 commented Apr 1, 2019

Ignoring case-insensitive doublicates and having a warning is IMHO a simple fix and restores the compatibility as before of #5252.

@agusmba
Copy link
Contributor

agusmba commented Apr 1, 2019

pptx writer will need to be fixed as well probably.

@tolot27
Copy link
Contributor Author

tolot27 commented May 8, 2019

Is this easy to fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants