docx: corrupted file caused by metadata fields with different case #5413

tolot27 · 2019-03-31T21:49:26Z

With #5252 (pandoc 2.6 ) it is possible to store custom metadata properties in docx. But in rare cases this can lead to corrupt docx files if two metadata fields just differing in case.

The cause of this problem was found by @lierdakil as part of lierdakil/pandoc-crossref#223. A MWE without involving pandoc-crossref is:

echo -e "---\ntest: 1\nTest: 1\n..." | pandoc -o corrupt.docx

The text was updated successfully, but these errors were encountered:

jkr · 2019-04-01T02:30:49Z

Interesting! There's probably no perfect solution, but @agusmba, @jgm, how does this sound?

the first usage of a custom prop gets kept, regardless of case.
future uses with different case get thrown out, and trigger a warning.
future uses with same case overwrite.
a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

jkr · 2019-04-01T02:31:24Z

Oh, and question: should this be specific to docx or a general rule? I'd tend to think the latter -- less confusion.

lierdakil · 2019-04-01T03:44:09Z

or a general rule

Please don't go redefining YAML semantics just because you feel it's "less confusion". By the end of the road, you'll most likely get more confusion. YAML is case-sensitive. In general, that should be the only thing that matters. At the very least, pandoc's metadata has been case-sensitive since the beginning. Changing that now should at least be accompanied by a major version bump, since it's a huge change in behaviour, as far as I'm concerned.

future uses with same case overwrite

This is kinda irrelevant. YAML parser used in Pandoc will refuse to parse YAML objects with duplicate fields. As it should to be a compliant YAML parser (YAML spec 1.2 says that YAML object MUST have unique keys to be valid)

Edit: nope, apparently I'm wrong here. Pandoc's YAML parser will silently use the latest definition of the field (my previous test had other syntactical errors):

$ echo -e "---\ntest: 1\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 2
---

Lorem

a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

Not sure what you mean here. Could you explain in more detail?

As I see it, the issue is that OOXML's custom props use case-insensitive names, while pandoc's metadata is case-sensitive (and it was since the beginning, so changing that now doesn't sound good; maybe in pandoc v3.0 that would be justified). Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

jkr · 2019-04-01T04:00:38Z

Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

That's what I meant to say. I wasn't talking about YAML at all -- I was referring to what the docx writer does with the metadata it's handed, and how we get produce docx custom properties. (That's why I was referring to it as properties, and not fields.)

As far as a general rule goes, I just meant for writers that produce these xml formats with custom metadata properties. But yes, that's a silly way to deal with individual clunky formats.

lierdakil · 2019-04-01T04:37:56Z

That's what I meant to say.

Okay, cool.

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

or writers that produce these xml formats with custom metadata properties

About that... pptx is obviously the same as docx, that's a given. But I'm not so sure about ODT. LibreOffice doesn't mind metadata properties differing in case only (both in docx and odt), but whether that's a quirk of LibreOffice, or something related to the ODF specification, I can't say.

agusmba · 2019-04-01T07:13:49Z

@lierdakil

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

Not really, please note that:

$ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 1
---

which is consistent with the documentation:

A document may contain multiple metadata blocks. The metadata fields will be combined through a left-biased union: if two metadata blocks attempt to set the same field, the value from the first block will be taken.

What wasn't stated apparently was that if a field is defined multiple times within the same yaml block, the last definition is kept (I don't think this is a bug since many libraries act this way, but it might be a good idea to document this also).

That said, I would apply the same rationale to custom properties in docx|odt (ignoring their case):

within the same yaml block, the latest definition is kept
duplicate definitions in subsequent yaml blocks are ignored

agusmba · 2019-04-01T07:19:58Z

actually since this is a docx|odt quirk, we might go around solving it in a simpler way, that is:

At the writer level, where pandoc has already filtered real duplicates for us, use the first definition of a property and if a second definition with different case is found, ignore it throwing a warning.

lierdakil · 2019-04-01T07:28:51Z

пн, 1 апр. 2019 г., 10:14 Agustín Martín Barbero <notifications@github.com>:

@lierdakil <https://github.com/lierdakil> I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep. Not really, please note that: $ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown --- test: 1 --- which is consistent with the documentation <https://pandoc.org/MANUAL.html#extension-yaml_metadata_block>: A document may contain multiple metadata blocks. The metadata fields will be combined through a left-biased union: if two metadata blocks attempt to set the same field, the value from the first block will be taken.

In my mind, this is a completely different case. I.e. here we're talking about parsing each metadata block individually, then combining the results, as opposed to parsing a single metadata block. According to YAML spec, duplicate field names are an error, which I initially assumed pandoc's YAML parser observes (aeson? can't recall from the top of my head). I guess it makes sense tat it behaves the way it does considering basically the same code is used to parse JSON (which allows duplicate object fields). It's not a huge deal of course, just something to be aware of.

What wasn't stated apparently was that if a field is defined multiple times within the *same* yaml block, the last definition is kept (I don't think this is a bug since many libraries act this way, but it might be a good idea to document this also). That said, I would apply the same rationale to custom properties in docx|odt (ignoring their case): - within the same yaml block, the latest definition is kept - duplicate definitions in subsequent yaml blocks are ignored

This would work for reading custom properties from those formats (which isn't implemented yet AFAIK), but not as much for writing. IIRC pandoc uses a Map to store metadata objects, so any information about original order is lost by the time AST gets to writer code.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5413 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AG8EZi8QR58BTQ9nv0QGb0xgS9dGFXYeks5vcbG6gaJpZM4cUkZW> .

lierdakil · 2019-04-01T07:32:35Z

пн, 1 апр. 2019 г., 10:20 Agustín Martín Barbero <notifications@github.com>:

actually since this is a docx|odt quirk, we might go around solving it in a simpler way, that is: At the writer level, where pandoc has already filtered real duplicates for us, use the first definition of a property and if a second definition with different case is found, ignore it throwing a warning.

Except writer has no information about original definition order, and changing that would require nontrivial changes to AST. So "first" is ambiguous here.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5413 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AG8EZnp4W-JOUgxAZpyEQKGFC9-EQKO8ks5vcbMqgaJpZM4cUkZW> .

agusmba · 2019-04-01T07:46:23Z

Except writer has no information about original definition order, and
changing that would require nontrivial changes to AST. So "first" is
ambiguous here.

Fair enough, I'd rather have a simple solution with a warning, so if the order is unknown let's not even worry about it (if it is known, like alphabetical, lower case first..., we could include it in the warning). Whenever the docx|odt writer detects a case-duplicate ignore it and warn the user.

Warning example:

Ambiguous definition of property XXXX, docx|odt does not support custom properties with the same name (ignoring case), only one definition will be included in the output.

Actually if we document the behavior in the MANUAL, there'd be no need for the warning (as what happens currently with real duplicates)

lierdakil · 2019-04-01T07:51:34Z

there'd be no need for the warning

I would argue having a warning is better than having no warning, regardless of documentation. Explicit over implicit.

jgm · 2019-04-01T17:35:12Z

I like the idea of having a warning in the docx writer when it ignores these case-duplicates.

lierdakil · 2019-04-01T19:41:53Z

Small update. We need to do this for odt too. All the spec has to say is this:

The name shall be unique within the domain of meta:user-defined elements.

without defining precisely what "unique" means.

While LibreOffice and OpenOffice seemingly use case-sensitive comparison (or at least they don't complain about "corrupted file" and don't break rendering), Word does not, and hence also complains about odt with properties with names that differ only in case.

tolot27 · 2019-04-01T20:57:58Z

Ignoring case-insensitive doublicates and having a warning is IMHO a simple fix and restores the compatibility as before of #5252.

agusmba · 2019-04-01T23:33:08Z

pptx writer will need to be fixed as well probably.

tolot27 · 2019-05-08T04:51:30Z

Is this easy to fix?

mb21 added format:Docx writer labels Apr 1, 2019

mb21 mentioned this issue May 24, 2019

'-f docx+style' misinterprets metadata styles (Title, Author, Date...) if docx file is modified in Word #5523

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docx: corrupted file caused by metadata fields with different case #5413

docx: corrupted file caused by metadata fields with different case #5413

tolot27 commented Mar 31, 2019

jkr commented Apr 1, 2019

jkr commented Apr 1, 2019

lierdakil commented Apr 1, 2019 •

edited

Loading

jkr commented Apr 1, 2019

lierdakil commented Apr 1, 2019

agusmba commented Apr 1, 2019

agusmba commented Apr 1, 2019

lierdakil commented Apr 1, 2019 via email •

edited

Loading

lierdakil commented Apr 1, 2019 via email

agusmba commented Apr 1, 2019

lierdakil commented Apr 1, 2019

jgm commented Apr 1, 2019

lierdakil commented Apr 1, 2019

tolot27 commented Apr 1, 2019

agusmba commented Apr 1, 2019

tolot27 commented May 8, 2019

docx: corrupted file caused by metadata fields with different case #5413

docx: corrupted file caused by metadata fields with different case #5413

Comments

tolot27 commented Mar 31, 2019

jkr commented Apr 1, 2019

jkr commented Apr 1, 2019

lierdakil commented Apr 1, 2019 • edited Loading

jkr commented Apr 1, 2019

lierdakil commented Apr 1, 2019

agusmba commented Apr 1, 2019

agusmba commented Apr 1, 2019

lierdakil commented Apr 1, 2019 via email • edited Loading

lierdakil commented Apr 1, 2019 via email

agusmba commented Apr 1, 2019

lierdakil commented Apr 1, 2019

jgm commented Apr 1, 2019

lierdakil commented Apr 1, 2019

tolot27 commented Apr 1, 2019

agusmba commented Apr 1, 2019

tolot27 commented May 8, 2019

lierdakil commented Apr 1, 2019 •

edited

Loading

lierdakil commented Apr 1, 2019 via email •

edited

Loading