'-f docx+style' misinterprets metadata styles (Title, Author, Date...) if docx file is modified in Word #5523

Closed
agusmba opened this issue May 24, 2019 · 28 comments

Comments

@agusmba
Contributor

agusmba commented May 24, 2019

Using pandoc v2.7.2

If we create a simple docx file:

$ pandoc -f markdown -t docx -o test.docx  << EOF
---
title: AMB Título
author: AMB
date: 24/05/19
...

# My first level title

Blah blah blah
EOF

We can convert back to markdown quite well:

$ pandoc -f docx+styles -t markdown -s test.docx
---
author: AMB
date: '24/05/19'
title: AMB Título
---

My first level title
====================

::: {custom-style="FirstParagraph"}
Blah blah blah
:::

However if I open the docx, modify some text and save, pandoc doesn't understand the metadata styles anymore:

$ pandoc -f docx+styles -t markdown -s testM.docx
::: {custom-style="Ttulo"}
AMB Título
:::

::: {custom-style="Author"}
AMB
:::

::: {custom-style="Fecha"}
24/05/19
:::

My first level title modified
=============================

::: {custom-style="FirstParagraph"}
Modified
:::

Here are the docx files:

test.docx

testM.docx

@agusmba
Contributor Author

agusmba commented May 24, 2019

Some initial investigation:

The original docx generated by pandoc uses English style names:

		<w:p>
			<w:pPr><w:pStyle w:val="Title" /></w:pPr>
			<w:r><w:t xml:space="preserve">AMB</w:t></w:r>
			<w:r><w:t xml:space="preserve"> </w:t></w:r>
			<w:r><w:t xml:space="preserve">Título</w:t></w:r>
		</w:p>
		<w:p>
			<w:pPr><w:pStyle w:val="Author" /></w:pPr>
			<w:r><w:t xml:space="preserve">AMB</w:t></w:r>
		</w:p>
		<w:p>
			<w:pPr><w:pStyle w:val="Date" /></w:pPr>
			<w:r><w:t xml:space="preserve">24/05/19</w:t></w:r>
		</w:p>
		<w:p>
			<w:pPr><w:pStyle w:val="Heading1" /></w:pPr>
			<w:bookmarkStart w:id="20" w:name="my-first-level-title" />
			<w:r><w:t xml:space="preserve">My first level title</w:t></w:r>
			<w:bookmarkEnd w:id="20" />
		</w:p>
		<w:p>
			<w:pPr><w:pStyle w:val="FirstParagraph" /></w:pPr>
			<w:r><w:t xml:space="preserve">Blah blah blah</w:t></w:r>
		</w:p>

Note that opening this file with Word shows localized style names like "Título" and "Fecha", but "Author" is shown in English.

In the edited docx we have:

		<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
			<w:pPr><w:pStyle w:val="Ttulo"/></w:pPr>
			<w:r><w:t>AMB Título</w:t></w:r>
		</w:p>
		<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
			<w:pPr><w:pStyle w:val="Author"/></w:pPr>
			<w:r><w:t>AMB</w:t></w:r>
		</w:p>
		<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
			<w:pPr><w:pStyle w:val="Fecha"/></w:pPr>
			<w:r><w:t>24/05/19</w:t></w:r>
		</w:p>
		<w:p w:rsidR="00AF7378" w:rsidRDefault="00CC51CA">
			<w:pPr><w:pStyle w:val="Ttulo1"/></w:pPr>
			<w:bookmarkStart w:id="0" w:name="my-first-level-title"/>
			<w:r><w:t>My first level title</w:t></w:r>
			<w:bookmarkEnd w:id="0"/>
			<w:r w:rsidR="00F13E83"><w:t xml:space="preserve"> modified</w:t></w:r>
			<w:bookmarkStart w:id="1" w:name="_GoBack"/>
			<w:bookmarkEnd w:id="1"/>
		</w:p>
		<w:p w:rsidR="00AF7378" w:rsidRDefault="00F13E83">
			<w:pPr><w:pStyle w:val="FirstParagraph"/></w:pPr>
			<w:r><w:t>Modified</w:t></w:r>
		</w:p>

Note that the first-level title is "Ttulo1" instead of "Heading1", but pandoc still interprets it correctly (see previous post). That is not the case for "Ttulo" versus "Title" or for "Fecha" versus "Date".

Another strange one is "Author" since in the modified file the name of the style hasn't changed, but it's not recognized as metadata by pandoc.

@agusmba
Contributor Author

agusmba commented May 24, 2019

I added some more text to the initial markdown so as to get normal text style, and found that

::: {custom-style="BodyText"}
Blah blah blah
:::

in the modified docx is converted back to

::: {custom-style="Textoindependiente"}
Blah blah blahasdf
:::

So it looks as if the only styles that work are Heading1, Heading2, etc.

Is there any intelligent lookup there that could be applied to the rest of the styles?

@agusmba
Contributor Author

agusmba commented May 24, 2019

Note that in the modified docx we have the following styles.xml

[...]
<w:style w:type="paragraph" w:styleId="Textoindependiente">
	<w:name w:val="Body Text"/>
	<w:basedOn w:val="Normal"/>
	[...]
</w:style>
<w:style w:type="paragraph" w:styleId="Ttulo">
        <w:name w:val="Title"/><w:basedOn w:val="Normal"/>
        <w:next w:val="Textoindependiente"/>[...]
</w:style>
<w:style w:type="paragraph" w:styleId="Ttulo1">
        <w:name w:val="heading 1"/><w:basedOn w:val="Normal"/>
        <w:next w:val="Textoindependiente"/>[...]
</w:style>
[...]

I don't see any fundamental difference between Heading 1 and Body Text (or any of the rest) styles, but one works and the other doesn't when going from docx to markdown.

@agusmba
Contributor Author

agusmba commented May 24, 2019

Note that even the first docx to markdown conversion (unmodified) is not really correct, since it is giving us:

::: {custom-style="BodyText"}
Blah blah blah
:::

However in order for it to work going from markdown to docx, the standard english names need to be used (notice the space):

::: {custom-style="Body Text"}
Blah blah blah
:::

This is probably going to be quite a problem with non-english word users, since I don't think the internal english name with spaces is available anywhere.

The more I look into it, the more complex it seems.

@mb21
Collaborator

mb21 commented May 24, 2019

To recap, is this about:

@agusmba
Contributor Author

agusmba commented May 24, 2019

It's definitely not #5413

It's about custom-styles from(/to) docx

@mb21
Collaborator

mb21 commented May 24, 2019

Ah, then it's #5074 perhaps?

@agusmba
Contributor Author

agusmba commented May 27, 2019

Hi, no, it's not the same thing either. I read that one before opening this one, and IMHO they are separate issues.

This one stems from a discussion-list message, where the OP complained about pandoc behaving strangely when reading a docx that had been modified in Word. I initially dismissed it, but when I tested it I saw that there was indeed something strange going on.

In the process of documenting it, I may have gone a bit overboard and muddled things a bit. Let me try to summarize the problem (in the first post):

Pandoc can understand metadata "Title, Author and Date" when going from a docx it created to markdown. But if you modify and save the docx in Word (any simple edit), pandoc no longer recognizes them.

Then, while investigating in order to pinpoint where to fix it, I discovered that for other docx styles the docx+styles extension was not working quite right either, because the names it chose in one direction (docx > md) were not compatible with the opposite direction (md > docx).

So I'm not sure if what I describe here is one or two issues, but they are related to how pandoc manages (and interprets) the docx style names back and forth.

@agusmba
Contributor Author

agusmba commented May 27, 2019

Note that #1716 or thereabouts might be the reason why Headings work correctly in both directions.

@jgm
Owner

jgm commented May 27, 2019

@jkr may have some ideas about this; I know he has worked on similar issues in the past.

@K4zuki
Sponsor

K4zuki commented Jun 25, 2019

Note that even the first docx to markdown conversion (unmodified) is not really correct, since it is giving us:

::: {custom-style="BodyText"}
Blah blah blah
:::

However in order for it to work going from markdown to docx, the standard english names need to be used (notice the space):

::: {custom-style="Body Text"}
Blah blah blah
:::

This is probably going to be quite a problem with non-english word users, since I don't think the internal english name with spaces is available anywhere.

The more I look into it, the more complex it seems.

@agusmba, @mb21,
I agree with this particular part. This is what I raised #5074 for.
When a style name containing spaces is set in custom-style, md->docx (the docx writer) works as expected. But when converting docx->md, where the docx is the output of that conversion, the docx reader picks up the wrong style name, even if the output file was never opened.
In the conversation in #5074, @lierdakil kindly suggested that this problem would be even more complicated with a non-English version of Word.

@agusmba
Contributor Author

agusmba commented Jul 1, 2019

Would it be interesting to generate a lookup-table for different languages using the word-macro I linked above? It could ease Pandoc selecting the appropriate "built-in" style when converting from non-English Word documents. And even for English documents, we could use it to select the right style (I'm talking about the space issue presented here and in #5074).

@jgm
Owner

jgm commented Sep 4, 2019

Currently the docx reader just hardcodes these associations:

metaStyles :: M.Map String String
metaStyles = M.fromList [ ("Title", "title")
                        , ("Subtitle", "subtitle")
                        , ("Author", "author")
                        , ("Date", "date")
                        , ("Abstract", "abstract")]

We could certainly add non-English style names to this. Better would be to make it sensitive to language, but perhaps that's not necessary as there likely won't be ambiguities. I'm not sure how or whether the docx reader represents the document's language. @jkr if you have a minute to chime in on this, it would be helpful to get your feedback.

@jgm jgm added the reader label Sep 4, 2019
@jgm jgm added this to the 2.8 milestone Sep 4, 2019
@lierdakil
Contributor

lierdakil commented Sep 4, 2019

IIRC, w:name for Word's built-in styles is always in English (not so for user styles, but that's somewhat out of scope anyway). w:styleId is however arbitrary, with little relation to actual style name. IIRC, the reason is that Word tries to write displayed style name into w:styleId, but w:styleId can't contain non-ASCII characters; Word then invents an arbitrary ASCII identifier on the spot and puts that into w:styleId.

IIRC, Readers.Docx currently looks at w:styleId (well, technically, w:pStyle, which references w:styleId) when looking for "meta styles". Changing that to look up style names (as given by w:name) using Readers.Docx.StyleMap, similar to Writers.Docx, should be by far more robust.

I could post reference.docx re-saved with Russian version of Word 2013 or 2019 if that would help -- as far as I can tell, the behaviour will be pretty similar for Japanese, Chinese, Hindi, etc versions of Word -- in short, versions for languages that primarily use non-ASCII characters. And it should also be somewhat similar for German and other languages using extended ASCII, where non-ASCII characters crop up in style names only occasionally (which would mean w:styleId is mangled only occasionally)

P.S. NB: w:name is not an attribute, but actually a child node in style definition.
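
A minimal sketch of the lookup described above, assuming the styleId-to-name table has already been read from styles.xml. The names `styleNames` and `metaField` are hypothetical and do not come from pandoc's source; the table entries are the ones visible in the Spanish-locale styles.xml quoted earlier in this thread, and `metaStyles` repeats the associations jgm quoted.

```haskell
import qualified Data.Map as M

-- Hypothetical styleId -> w:name table, as it might be collected from
-- styles.xml. Entries mirror the Spanish-locale file quoted in this thread.
styleNames :: M.Map String String
styleNames = M.fromList
  [ ("Ttulo", "Title")
  , ("Ttulo1", "heading 1")
  , ("Textoindependiente", "Body Text")
  ]

-- Style names that map to metadata fields (same pairs as quoted above).
metaStyles :: M.Map String String
metaStyles = M.fromList
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- Resolve a paragraph's w:pStyle value to a metadata field by going
-- through w:name. If the id has no entry in the table, the id itself
-- is used as the name (an assumption made for this sketch).
metaField :: String -> Maybe String
metaField styleId = M.lookup name metaStyles
  where
    name = M.findWithDefault styleId styleId styleNames
```

With this table, `metaField "Ttulo"` resolves to the `title` field even though the id is mangled, while `metaField "Ttulo1"` yields `Nothing` because "heading 1" is not a metadata style.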

@jgm
Owner

jgm commented Sep 4, 2019

I could post reference.docx re-saved with Russian version of Word 2013 or 2019 if that would help

Yes, please do. Great tip to use w:name.

@lierdakil
Contributor

lierdakil commented Sep 4, 2019

Here it is for Word 2019 Russian: reference_w2019_ru.docx

Here's the important bit, in styles.xml we find:

<w:style w:type="paragraph" w:default="1" w:styleId="a">
    <w:name w:val="Normal" />
<w:style w:type="paragraph" w:styleId="1">
    <w:name w:val="heading 1" /> <!-- notice how "heading" is not capitalized; see below -->
<w:style w:type="paragraph" w:styleId="2">
    <w:name w:val="heading 2" />
<w:style w:type="table" w:default="1" w:styleId="a2">
    <w:name w:val="Normal Table" />
<w:style w:type="paragraph" w:styleId="a4">
    <w:name w:val="Title" />
<!-- etc for built-in styles, but for custom styles with ASCII names, it's a bit different: -->
<w:style w:type="paragraph" w:customStyle="1"  w:styleId="FirstParagraph">
    <w:name w:val="First Paragraph" />
<w:style w:type="paragraph" w:customStyle="1"  w:styleId="Compact">
    <w:name w:val="Compact" />
<!-- etc -->

Notice how w:name is in English, while w:styleId is semi-arbitrary: the particular identifier chosen by Word seems to depend on style type and the order in which those appear in the styles.xml, not the actual name or purpose.

As a side note, custom styles will also have their identifiers mangled if w:name contains non-ASCII characters -- either those non-ASCII characters (and spaces) will be stripped, or an arbitrary identifier will be generated.

Also, notice w:name is case-insensitive, meaning Word can change capitalization arbitrarily! Pandoc's default reference.docx has heading names capitalized, e.g.

<w:style w:type="paragraph" w:styleId="Heading1">
<w:name w:val="Heading 1" />

However, it is not so after re-saving with Word!
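
A small sketch of what case-insensitive handling of w:name could look like. The function names `sameStyleName` and `headingLevel` are made up for illustration and are not pandoc's API; the only assumption about the data is the "heading N" / "Heading N" shape shown above.

```haskell
import Data.Char (toLower)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Compare two style names while ignoring case, so that "Heading 1"
-- (pandoc's reference.docx) and "heading 1" (after a re-save in Word)
-- are treated as the same style.
sameStyleName :: String -> String -> Bool
sameStyleName a b = map toLower a == map toLower b

-- Extract a heading level from a w:name value such as "heading 1" or
-- "Heading 1". Returns Nothing for other names and for levels below 1.
headingLevel :: String -> Maybe Int
headingLevel name = do
  digits <- stripPrefix "heading " (map toLower name)
  n <- readMaybe digits
  if n > 0 then Just n else Nothing
```

Lowercasing both sides before comparing keeps the check independent of whichever capitalization Word happens to write.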

@agusmba
Contributor Author

agusmba commented Sep 5, 2019

This looks promising. If we can make pandoc understand and use docx's styles name values, it could solve both internationalized problems and round-trip conversions.

@agusmba
Contributor Author

agusmba commented Sep 5, 2019

Note that while the style name stays the same across localized Word versions, it is the styleId (which changes) that is used to reference styles within the text. So pandoc needs to understand and use both, when reading and also when writing (getting the information from the reference-doc).

@jgm jgm removed this from the 2.8 milestone Sep 5, 2019
@jgm
Owner

jgm commented Sep 5, 2019

I had a brief look. It's a bit complicated how all of this works in the docx reader, so I think I'll have to leave it up to @jkr to implement using w:name.

@lierdakil
Contributor

lierdakil commented Sep 6, 2019

pandoc needs to understand and use both

FWIW, pandoc already does something very similar in the docx writer. I vaguely recall implementing a good chunk of that a few years back. In fact, I think I can find the PRs... yeah, the relevant ones seem to be #1968 and #2023. I think I was going to do the same in reader, but never got to it. So basically there's already code that builds style name -> style id map. For reader, we probably need the reverse though, but changes are more or less straightforward (or one could try to use a bidirectional map of some description). Then, instead of comparing styleId against the predefined list, one would compare fromMaybe (fallbackBasedOn styleId) $ Map.lookup styleId styleMap, or something along those lines.
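
Roughly, the reverse lookup could be sketched as below. This is an illustration only: `nameToId`, `idToName` and `styleNameFor` are hypothetical names, the entries are taken from the Russian styles.xml posted above, and the fallback here is simply the styleId itself rather than the `fallbackBasedOn` idea mentioned in the comment.

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- Hypothetical name -> id map of the kind the writer builds. The ids are
-- the ones from the Russian-locale styles.xml shown earlier in the thread.
nameToId :: M.Map String String
nameToId = M.fromList
  [ ("Normal", "a")
  , ("heading 1", "1")
  , ("Title", "a4")
  , ("First Paragraph", "FirstParagraph")
  ]

-- Invert the map for the reader. This assumes style ids are unique;
-- if two names shared an id, only one of them would survive here.
idToName :: M.Map String String
idToName = M.fromList [ (sid, name) | (name, sid) <- M.toList nameToId ]

-- Look up the style name for an id, falling back to the id itself when
-- the id is unknown.
styleNameFor :: String -> String
styleNameFor sid = fromMaybe sid (M.lookup sid idToName)
```

The result of `styleNameFor` is what would then be compared against the predefined list of style names, instead of the raw styleId.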

@agusmba
Contributor Author

agusmba commented Sep 6, 2019

FWIW, pandoc already does something very similar in the docx writer.

You are right of course, otherwise using a reference-doc wouldn't work correctly for international users who modified standard styles such as Author etc. I was a bit careless/partial in my previous comment.

Thanks for the clarification and the link to relevant code!

@lierdakil
Contributor

Hmm. I've taken a closer look, and apparently Docx reader already uses w:name in some places, e.g. for headers:

getHeaderLevel :: NameSpaces -> Element -> Maybe (String,Int)
getHeaderLevel ns element
  | Just styleId <- findAttrByName ns "w" "styleId" element
  , Just index   <- stripPrefix "Heading" styleId
  , Just n       <- stringToInteger index
  , n > 0 = Just (styleId, fromInteger n)
  | Just styleId <- findAttrByName ns "w" "styleId" element
  , Just index   <- findChildByName ns "w" "name" element >>=
                    findAttrByName ns "w" "val" >>=
                    stripPrefix "heading "
  , Just n       <- stringToInteger index
  , n > 0 = Just (styleId, fromInteger n)
getHeaderLevel _ _ = Nothing

Apparently not everywhere though, for instance, not for code, definitions, and indeed not for "meta styles" (like author, etc)

@jkr
Collaborator

jkr commented Sep 6, 2019

I'll take a look at this, and be able to chime in, when I get finished with the sectPr parsing (on the way to correct bidi parsing) -- probably later today. But, FWIW, I think @lierdakil did a lot of the heavy lifting on this way back when.

@lierdakil
Contributor

So, uh... I've searched around (I knew this all sounded suspiciously familiar), and basically this comes back to #5052. Well, at least for the most part.

As I see it right now, there are three options to proceed with this:

  1. Rewrite Readers.Docx.Parse to use w:name as the primary style identifier, and use w:styleId only internally to match paragraphs/runs with their style. This is probably the most "correct" option, but seems like a lot of work. It will solve a lot of issues at once though. One caveat is that several distinct styles can have identical w:name, but that's a very specific corner case.
  2. Go the same route I went way back when with the writer code, and build a name <=> styleId map, then reference this map when matching styles in Readers.Docx. This is a bit of a hack, but it needs far fewer changes to work (or at least it seems so from where I'm sitting). It will probably get increasingly messy as time goes on though, so it doesn't seem like a good long-term solution.
  3. Modify Readers.Docx.Parse to collect more metadata while parsing styles, similar to how it already handles headers and block quotes. This implies essentially moving all predefined style lists to Readers.Docx.Parse and handling those during initial parsing.

One curious thing to note, by the way. custom-style feature expects style name (as given by w:name) in the Writers.Docx, but Readers.Docx outputs style id (as given by w:styleId). This inconsistency is described in #5074. Going with (1) will "automatically" fix this. Going with (2) or (3), on the other hand, will require additional considerations.

@jgm
Owner

jgm commented Sep 6, 2019

I like option 1 if it can be managed. There have been many issues associated with this, and if we can get a solid framework for handling style names in a way that survives localization, it will save headaches and more work later on.

@lierdakil lierdakil mentioned this issue Sep 7, 2019
8 tasks
lierdakil added a commit to lierdakil/pandoc that referenced this issue Sep 17, 2019
…meaning

Motivating issues: jgm#5523, jgm#5052, jgm#5074

Style name comparisons are case-insensitive, since those are
case-insensitive in Word.

w:styleId will be used as style name if w:name is missing (this should
only happen for malformed docx and is kept as a fallback to avoid
failing altogether on malformed documents)

Block quote detection code moved from Docx.Parser to Readers.Docx

Code styles, i.e. "Source Code" and "Verbatim Char" now honor style
inheritance

Docx Reader now honours "Compact" style (used in Pandoc-generated docx).
The side-effect is that "Compact" style no longer shows up in
docx+styles output. Styles inherited from "Compact" will still
show up.

Removed obsolete list-item style from divsToKeep. That didn't
really do anything for a while now.

Add newtypes to differentiate between style names, ids, and
different style types (that is, paragraph and character styles)

Since docx style names can have spaces in them, and pandoc-markdown
classes can't, anywhere when style name is used as a class name,
spaces are replaced with ASCII dashes `-`.

Get rid of extraneous intermediate types, carrying styleId information.
Instead, styleId is saved with other style data.

Use RunStyle for inline style definitions only (lacking styleId and styleName);
for Character Styles use CharStyle type (which is basically RunStyle with styleId
and StyleName bolted onto it).
lierdakil added a commit to lierdakil/pandoc that referenced this issue Sep 17, 2019
jgm pushed a commit that referenced this issue Sep 21, 2019
Motivating issues: #5523, #5052, #5074
@lierdakil
Contributor

This issue is apparently fixed by #5732 and hence can be closed:

$ pandoc -f docx -t markdown test.docx -s
---
author: AMB
date: 24/05/19
title: AMB Título
---

My first level title
====================

Blah blah blah

$ pandoc -f docx -t markdown testM.docx -s
---
author: AMB
date: 24/05/19
title: AMB Título
---

My first level title modified
=============================

Modified

$ pandoc -f docx+styles -t markdown test.docx -s
---
author: AMB
date: 24/05/19
title: AMB Título
---

My first level title
====================

::: {custom-style="First Paragraph"}
Blah blah blah
:::

$ pandoc -f docx+styles -t markdown testM.docx -s
---
author: AMB
date: 24/05/19
title: AMB Título
---

My first level title modified
=============================

::: {custom-style="First Paragraph"}
Modified
:::

@jgm
Owner

jgm commented Sep 22, 2019

Thanks @lierdakil - it's great to have all these issues connected with style localization fixed.

@jgm jgm closed this as completed Sep 22, 2019
@agusmba
Contributor Author

agusmba commented Sep 23, 2019

Thanks @lierdakil !!! I think this is huge, especially for international Word users.

EDIT: I confirm that this is working great on my Win7 (Spanish locale), using a nightly build from yesterday.
