'-f docx+style' misinterprets metadata styles (Title, Author, Date...) if docx file is modified in Word #5523
Some initial investigation: the original docx generated by pandoc uses English style names:
<w:p>
<w:pPr><w:pStyle w:val="Title" /></w:pPr>
<w:r><w:t xml:space="preserve">AMB</w:t></w:r>
<w:r><w:t xml:space="preserve"> </w:t></w:r>
<w:r><w:t xml:space="preserve">Título</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Author" /></w:pPr>
<w:r><w:t xml:space="preserve">AMB</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Date" /></w:pPr>
<w:r><w:t xml:space="preserve">24/05/19</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading1" /></w:pPr>
<w:bookmarkStart w:id="20" w:name="my-first-level-title" />
<w:r><w:t xml:space="preserve">My first level title</w:t></w:r>
<w:bookmarkEnd w:id="20" />
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="FirstParagraph" /></w:pPr>
<w:r><w:t xml:space="preserve">Blah blah blah</w:t></w:r>
</w:p>
Note that opening this file with Word shows localized style names like "Título" and "Fecha", but "Author" is shown in English. In the edited docx we have:
<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
<w:pPr><w:pStyle w:val="Ttulo"/></w:pPr>
<w:r><w:t>AMB Título</w:t></w:r>
</w:p>
<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
<w:pPr><w:pStyle w:val="Author"/></w:pPr>
<w:r><w:t>AMB</w:t></w:r>
</w:p>
<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
<w:pPr><w:pStyle w:val="Fecha"/></w:pPr>
<w:r><w:t>24/05/19</w:t></w:r>
</w:p>
<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
<w:pPr><w:pStyle w:val="Ttulo1"/></w:pPr>
<w:bookmarkStart w:id="0" w:name="my-first-level-title"/>
<w:r><w:t>My first level title</w:t></w:r>
<w:bookmarkEnd w:id="0"/>
<w:r w:rsidR="00F13E83"><w:t xml:space="preserve"> modified</w:t></w:r>
<w:bookmarkStart w:id="1" w:name="_GoBack"/>
<w:bookmarkEnd w:id="1"/>
</w:p>
<w:p w:rsidR="00AF7378" w:rsidRDefault="00F13E83">
<w:pPr><w:pStyle w:val="FirstParagraph"/></w:pPr>
<w:r><w:t>Modified</w:t></w:r>
</w:p>
Note that the first-level title style is "Ttulo1" instead of "Heading1", but pandoc still interprets it correctly (see previous post). That is not the case for "Ttulo" versus "Title", or for "Fecha" versus "Date". Another strange one is "Author": the style name hasn't changed in the modified file, yet pandoc no longer recognizes it as metadata. |
I added some more text to the initial markdown so as to get normal text style, and found that
in the modified docx is converted back to
So it looks as if the only working styles are Heading1, Heading2, etc. Is there any intelligent lookup there that could be applied to the rest of the styles? |
Note that in the modified docx we have the following styles.xml [...]
<w:style w:type="paragraph" w:styleId="Textoindependiente">
<w:name w:val="Body Text"/>
<w:basedOn w:val="Normal"/>
[...]
</w:style>
<w:style w:type="paragraph" w:styleId="Ttulo">
<w:name w:val="Title"/><w:basedOn w:val="Normal"/>
<w:next w:val="Textoindependiente"/>[...]
</w:style>
<w:style w:type="paragraph" w:styleId="Ttulo1">
<w:name w:val="heading 1"/><w:basedOn w:val="Normal"/>
<w:next w:val="Textoindependiente"/>[...]
</w:style>
[...] I don't see any fundamental difference between Heading 1 and Body Text (or any of the rest) styles, but one works and the other doesn't when going from docx to markdown. |
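To make the styleId/name distinction in that styles.xml excerpt concrete, here is a minimal Python sketch (not pandoc's code, which is Haskell) that reads a trimmed-down copy of the excerpt and maps each `w:styleId` to its `w:name`. The function name `style_names_by_id` and the constant `STYLES_XML` are made up for illustration, and falling back to the id when `w:name` is absent is an assumption.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, normally bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed-down stand-in for the styles.xml excerpt quoted in the thread.
STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Textoindependiente">
    <w:name w:val="Body Text"/>
    <w:basedOn w:val="Normal"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Ttulo">
    <w:name w:val="Title"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Ttulo1">
    <w:name w:val="heading 1"/>
  </w:style>
</w:styles>
"""


def style_names_by_id(styles_xml):
    """Map each w:styleId to its w:name (assumed fallback: the id itself)."""
    root = ET.fromstring(styles_xml)
    mapping = {}
    for style in root.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        name_el = style.find(f"{{{W}}}name")
        name = name_el.get(f"{{{W}}}val") if name_el is not None else None
        mapping[style_id] = name or style_id
    return mapping


NAMES = style_names_by_id(STYLES_XML)
```

With this map, the localized ids "Ttulo" and "Ttulo1" resolve to the locale-independent names "Title" and "heading 1", which is the information a reader would need to treat them like the English originals.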
Note that even the first docx to markdown conversion (unmodified) is not really correct, since it is giving us:
However, in order for it to work going from markdown to docx, the standard English names need to be used (notice the space):
This is probably going to be quite a problem for non-English Word users, since I don't think the internal English name with spaces is available anywhere. The more I look into it, the more complex it seems. |
To recap, is this about:
|
It's definitely not #5413 It's about custom-styles from(/to) docx |
Ah, then it's #5074 perhaps? |
Hi, no, it's not the same thing either. I read that one before opening this one, and IMHO they are separate issues. This one stems from a discussion-list message, where the OP complained about pandoc behaving strangely when reading a docx that had been modified in Word. I initially dismissed it, but when I tested it I saw that there was indeed something strange going on. In the process of documenting it, I may have gone a bit overboard and muddled things a bit. Let me try to summarize the problem (in the first post):
Then, while investigating what was going on in order to pinpoint where to fix it, I discovered that for other docx styles the docx+styles option was not working quite right either, because the names it chose in one direction (docx > md) were not compatible with the opposite direction (md > docx). So I'm not sure if what I describe here is one or two issues, but they are related to how pandoc manages (and interprets) the docx style names back and forth. |
Note that #1716 or thereabouts might be the reason why headings work correctly in both directions. |
@jkr may have some ideas about this; I know he has worked on similar issues in the past. |
@agusmba, @mb21, |
Would it be interesting to generate a lookup-table for different languages using the word-macro I linked above? It could ease Pandoc selecting the appropriate "built-in" style when converting from non-English Word documents. And even for English documents, we could use it to select the right style (I'm talking about the space issue presented here and in #5074). |
Currently the docx reader just hardcodes these associations:
We could certainly add non-English style names to this. Better would be to make it sensitive to language, but perhaps that's not necessary as there likely won't be ambiguities. I'm not sure how or whether the docx reader represents the document's language. @jkr if you have a minute to chime in on this, it would be helpful to get your feedback. |
IIRC, IIRC, Readers.Docx currently looks at I could post P.S. NB: |
Yes, please do. Great tip to use |
Here it is for Word 2019 Russian: reference_w2019_ru.docx Here's the important bit, in <w:style w:type="paragraph" w:default="1" w:styleId="a">
<w:name w:val="Normal" />
<w:style w:type="paragraph" w:styleId="1">
<w:name w:val="heading 1" /> <!-- notice how "heading" is not capitalized; see below -->
<w:style w:type="paragraph" w:styleId="2">
<w:name w:val="heading 2" />
<w:style w:type="table" w:default="1" w:styleId="a2">
<w:name w:val="Normal Table" />
<w:style w:type="paragraph" w:styleId="a4">
<w:name w:val="Title" />
<!-- etc for built-in styles, but for custom styles with ASCII names, it's a bit different: -->
<w:style w:type="paragraph" w:customStyle="1" w:styleId="FirstParagraph">
<w:name w:val="First Paragraph" />
<w:style w:type="paragraph" w:customStyle="1" w:styleId="Compact">
<w:name w:val="Compact" />
<!-- etc --> Notice how As a side note, custom styles will also have their identifiers mangled if Also, notice pandoc/data/docx/word/styles.xml Lines 125 to 126 in 0e31483
However, it is not so after re-saving with Word! |
This looks promising. If we can make pandoc understand and use docx's style name values, it could solve both the internationalization problems and round-trip conversions. |
Note that while the style name stays the same across different international Word versions, it is the styleId (which does change) that is used to reference styles within the text. So pandoc needs to understand and use both, when reading and also when writing (getting the information from the reference-doc). |
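The two directions described here can be sketched as a pair of lookups. This is an illustrative Python sketch only, not pandoc's implementation: `build_maps`, `name_for_id` and `id_for_name` are hypothetical names, the (styleId, name) pairs are the ones from the Spanish styles.xml quoted earlier, and matching names case-insensitively is an assumption.

```python
def build_maps(styles):
    """Build both lookup directions from (style_id, style_name) pairs."""
    id_to_name = {}
    name_to_id = {}
    for style_id, name in styles:
        id_to_name[style_id] = name
        # Assumption: names are matched case-insensitively.
        name_to_id[name.lower()] = style_id
    return id_to_name, name_to_id


# (styleId, name) pairs taken from the Spanish styles.xml quoted in the thread.
STYLES = [
    ("Textoindependiente", "Body Text"),
    ("Ttulo", "Title"),
    ("Ttulo1", "heading 1"),
]

ID_TO_NAME, NAME_TO_ID = build_maps(STYLES)


def name_for_id(style_id):
    """Reading: a paragraph references a styleId; recover the stable name."""
    return ID_TO_NAME.get(style_id, style_id)


def id_for_name(style_name):
    """Writing: find the styleId the reference document uses for a name."""
    return NAME_TO_ID.get(style_name.lower())
```

Reading a paragraph tagged `w:val="Ttulo"` would go through `name_for_id`, while writing a title paragraph against a Spanish reference document would go through `id_for_name("Title")`.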
I had a brief look. It's a bit complicated how all of this works in the docx reader, so I think I'll have to leave it up to @jkr to implement using w:name. |
FWIW, pandoc already does something very similar in the docx writer. I vaguely recall implementing a good chunk of that a few years back. In fact, I think I can find the PRs... yeah, the relevant ones seem to be #1968 and #2023. I think I was going to do the same in reader, but never got to it. So basically there's already code that builds style name -> style id map. For reader, we probably need the reverse though, but changes are more or less straightforward (or one could try to use a bidirectional map of some description). Then, instead of comparing |
You are right of course, otherwise using a reference-doc wouldn't work correctly for international users who modified standard styles such as Author etc. I was a bit careless/partial in my previous comment. Thanks for the clarification and the link to relevant code! |
Hmm. I've taken a closer look, and apparently Docx reader already uses pandoc/src/Text/Pandoc/Readers/Docx/Parse.hs Lines 1128 to 1141 in ad9770f
Apparently not everywhere though, for instance, not for code, definitions, and indeed not for "meta styles" (like author, etc) |
I'll take a look at this, and be able to chime in, when I get finished with the |
So, uh... I've searched around (I knew this all sounded suspiciously familiar), and basically this comes back to #5052. Well, at least for the most part. As I see it right now, there are three options to proceed with this:
One curious thing to note, by the way. |
I like option 1 if it can be managed. There have been many issues associated with this, and if we can get a solid framework for handling style names in a way that survives localization, it will save headaches and more work later on. |
…meaning
Motivating issues: jgm#5523, jgm#5052, jgm#5074
* Style name comparisons are case-insensitive, since those are case-insensitive in Word.
* w:styleId will be used as style name if w:name is missing (this should only happen for malformed docx and is kept as a fallback to avoid failing altogether on malformed documents)
* Block quote detection code moved from Docx.Parser to Readers.Docx
* Code styles, i.e. "Source Code" and "Verbatim Char" now honor style inheritance
* Docx Reader now honours "Compact" style (used in Pandoc-generated docx). The side-effect is that "Compact" style no longer shows up in docx+styles output. Styles inherited from "Compact" will still show up.
* Removed obsolete list-item style from divsToKeep. That didn't really do anything for a while now.
* Add newtypes to differentiate between style names, ids, and different style types (that is, paragraph and character styles)
* Since docx style names can have spaces in them, and pandoc-markdown classes can't, anywhere when style name is used as a class name, spaces are replaced with ASCII dashes `-`.
* Get rid of extraneous intermediate types, carrying styleId information. Instead, styleId is saved with other style data.
* Use RunStyle for inline style definitions only (lacking styleId and styleName); for Character Styles use CharStyle type (which is basically RunStyle with styleId and StyleName bolted onto it).
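A few of the rules in this commit message can be illustrated with a short Python sketch. It is not the actual Haskell change: every function name below is hypothetical, and the "MyCode" style in the example data is invented to show inheritance.

```python
def effective_style_name(style_id, name):
    """Use w:name when present; otherwise fall back to w:styleId."""
    return name if name else style_id


def same_style(a, b):
    """Compare style names case-insensitively."""
    return a.casefold() == b.casefold()


def class_name(style_name):
    """Replace spaces with ASCII dashes so the name can serve as a class."""
    return style_name.replace(" ", "-")


def inherits_from(style_id, target_name, styles):
    """Walk the basedOn chain; styles maps style_id -> (name, parent_id)."""
    seen = set()
    while style_id is not None and style_id not in seen:
        seen.add(style_id)  # guard against cyclic basedOn references
        name, parent = styles.get(style_id, (None, None))
        if same_style(effective_style_name(style_id, name), target_name):
            return True
        style_id = parent
    return False


# Hypothetical data: "MyCode" is an invented style based on "Source Code".
EXAMPLE_STYLES = {
    "SourceCode": ("Source Code", None),
    "MyCode": ("My Code", "SourceCode"),
}
```

Here `inherits_from("MyCode", "source code", EXAMPLE_STYLES)` holds because the lookup follows the basedOn link and ignores case, and `class_name("First Paragraph")` yields the dash-separated form used for class names.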
This issue is apparently fixed by #5732 and hence can be closed:
|
Thanks @lierdakil - it's great to have all these issues connected with style localization fixed. |
Thanks @lierdakil !!! I think this is huge, especially for international Word users. EDIT: I confirm that this is working great on my Win7 (Spanish locale), using a nightly build from yesterday. |
Using pandoc v2.7.2
If we create a simple docx file:
We can convert back to markdown quite well:
However if I open the docx, modify some text and save, pandoc doesn't understand the metadata styles anymore:
Here are the docx files:
test.docx
testM.docx