-
-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docx: corrupted file caused by metadata fields with different case #5413
Comments
Interesting! There's probably no perfect solution, but @agusmba, @jgm, how does this sound?
|
Oh, and question: should this be specific to docx or a general rule? I'd tend to think the latter -- less confusion. |
Please don't go redefining YAML semantics just because you feel it's "less confusion". By the end of the road, you'll most likely get more confusion. YAML is case-sensitive. In general, that should be the only thing that matters. At the very least, pandoc's metadata has been case-sensitive since the beginning. Changing that now should at least be accompanied by a major version bump, since it's a huge change in behaviour, as far as I'm concerned.
Edit: nope, apparently I'm wrong here. Pandoc's YAML parser will silently use the latest definition of the field (my previous test had other syntactical errors):
Not sure what you mean here. Could you explain in more detail? As I see it, the issue is that OOXML's custom props use case-insensitive names, while pandoc's metadata is case-sensitive (and it was since the beginning, so changing that now doesn't sound good; maybe in pandoc v3.0 that would be justified). Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx. |
That's what I meant to say. I wasn't talking about YAML at all -- I was referring to what the docx writer does with the metadata it's handed, and how we get produce docx custom properties. (That's why I was referring to it as properties, and not fields.) As far as a general rule goes, I just meant for writers that produce these xml formats with custom metadata properties. But yes, that's a silly way to deal with individual clunky formats. |
Okay, cool. I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.
About that... pptx is obviously the same as docx, that's a given. But I'm not so sure about ODT. LibreOffice doesn't mind metadata properties differing in case only (both in docx and odt), but whether that's a quirk of LibreOffice, or something related to the ODF specification, I can't say. |
Not really, please note that:
which is consistent with the documentation:
What wasn't stated apparently was that if a field is defined multiple times within the same yaml block, the last definition is kept (I don't think this is a bug since many libraries act this way, but it might be a good idea to document this also). That said, I would apply the same rationale to custom properties in docx|odt (ignoring their case):
|
actually since this is a docx|odt quirk, we might go around solving it in a simpler way, that is: At the writer level, where pandoc has already filtered real duplicates for us, use the first definition of a property and if a second definition with different case is found, ignore it throwing a warning. |
пн, 1 апр. 2019 г., 10:14 Agustín Martín Barbero <notifications@github.com>:
@lierdakil <https://github.com/lierdakil>
I was wrong about pandoc's YAML parser btw, it will silently accept
duplicate fields and use the last value it's seen, apparently. That'll
teach me not to run tests while still half-asleep.
Not really, please note that:
$ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 1
---
which is consistent with the documentation
<https://pandoc.org/MANUAL.html#extension-yaml_metadata_block>:
A document may contain multiple metadata blocks. The metadata fields will
be combined through a left-biased union: if two metadata blocks attempt to
set the same field, the value from the first block will be taken.
In my mind, this is a completely different case. I.e. here we're talking
about parsing each metadata block individually, then combining the results,
as opposed to parsing a single metadata block. According to YAML spec,
duplicate field names are an error, which I initially assumed pandoc's YAML
parser observes (aeson? can't recall from the top of my head). I guess it
makes sense tat it behaves the way it does considering basically the same
code is used to parse JSON (which allows duplicate object fields). It's not
a huge deal of course, just something to be aware of.
What wasn't stated apparently was that if a field is defined multiple
times within the *same* yaml block, the last definition is kept (I don't
think this is a bug since many libraries act this way, but it might be a
good idea to document this also).
That said, I would apply the same rationale to custom properties in
docx|odt (ignoring their case):
- within the same yaml block, the latest definition is kept
- duplicate definitions in subsequent yaml blocks are ignored
This would work for reading custom properties from those formats (which
isn't implemented yet AFAIK), but not as much for writing. IIRC pandoc uses
a Map to store metadata objects, so any information about original order is
lost by the time AST gets to writer code.
…
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5413 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AG8EZi8QR58BTQ9nv0QGb0xgS9dGFXYeks5vcbG6gaJpZM4cUkZW>
.
|
пн, 1 апр. 2019 г., 10:20 Agustín Martín Barbero <notifications@github.com>:
actually since this is a docx|odt quirk, we might go around solving it in
a simpler way, that is:
At the writer level, where pandoc has already filtered real duplicates for
us, use the first definition of a property and if a second definition with
different case is found, ignore it throwing a warning.
Except writer has no information about original definition order, and
changing that would require nontrivial changes to AST. So "first" is
ambiguous here.
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5413 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AG8EZnp4W-JOUgxAZpyEQKGFC9-EQKO8ks5vcbMqgaJpZM4cUkZW>
.
|
Fair enough, I'd rather have a simple solution with a warning, so if the order is unknown let's not even worry about it (if it is known, like alphabetical, lower case first..., we could include it in the warning). Whenever the docx|odt writer detects a case-duplicate ignore it and warn the user. Warning example:
Actually if we document the behavior in the MANUAL, there'd be no need for the warning (as what happens currently with real duplicates) |
I would argue having a warning is better than having no warning, regardless of documentation. Explicit over implicit. |
I like the idea of having a warning in the docx writer when it ignores these case-duplicates. |
Small update. We need to do this for odt too. All the spec has to say is this:
without defining precisely what "unique" means. While LibreOffice and OpenOffice seemingly use case-sensitive comparison (or at least they don't complain about "corrupted file" and don't break rendering), Word does not, and hence also complains about odt with properties with names that differ only in case. |
Ignoring case-insensitive doublicates and having a warning is IMHO a simple fix and restores the compatibility as before of #5252. |
|
Is this easy to fix? |
With #5252 (pandoc 2.6 ) it is possible to store custom metadata properties in docx. But in rare cases this can lead to corrupt docx files if two metadata fields just differing in case.
The cause of this problem was found by @lierdakil as part of lierdakil/pandoc-crossref#223. A MWE without involving pandoc-crossref is:
The text was updated successfully, but these errors were encountered: