docx reader produces level-0 headers #3830

jgm · 2017-08-03T16:09:39Z

Here is a snippet of the parse tree produced from a docx file by pandoc 1.19.2.1:

,BlockQuote
    [Para [Str "Appendix"]]
   ,Para [Str "A",Space,Str "Instructions",Space,Str "for",Space,Str
   "Adding",Space,Str "an",Space,Str "Appendix",Space,Str "29"]
   ,Header 0 ("figures",[],[]) [Str "FIGURES"]
   ,Para [Strong [Str "Figure",Space,Str "Page"]]
   ,Para [Strong [Str "No",Space,Str "table",Space,Str "of",Space,Str
   "figures",Space,Str "entries",Space,Str "found."]]
   ,Header 0 ("tables",[],[]) [Str "TABLES"]
   ,Para [Strong [Str "Table",Space,Str "Page"]]
   ,Para [Strong [Str "No",Space,Str "table",Space,Str "of",Space,Str
   "figures",Space,Str "entries",Space,Str "found."]]

(Still waiting for a docx file to use for testing.)

This causes a problem when rendering to RST, because the RST writer has:

          let headerChar = if level > 5 then ' ' else "=-~^'" !! (level - 1)

and when level == 0, we get a runtime error for using !! with a negative index.
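
To make the failure mode concrete, here is a minimal sketch of a total version of that lookup. The name `headerCharFor` and the choice to return a `Maybe` are assumptions for illustration, not the RST writer's actual code; only the `"=-~^'"` string and the `level > 5` fallback come from the line quoted above.

```haskell
-- Hypothetical helper, not pandoc's code: find the RST underline
-- character for a header level without letting (!!) see a negative index.
headerCharFor :: Int -> Maybe Char
headerCharFor level
  | level < 1 = Nothing                        -- Header 0 or below: (!!) would fail here
  | level > 5 = Just ' '                       -- same fallback as the quoted line
  | otherwise = Just ("=-~^'" !! (level - 1))  -- index is within 0..4, so this is safe

main :: IO ()
main = mapM_ (print . headerCharFor) [0, 1, 5, 6]
```

A guard like this would only keep the writer from crashing; it does not settle what a writer should emit for a level below 1, which is why the reader is the better place to fix this.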

The Markdown writer doesn't crash, but its output is not ideal either; the Header 0 is not rendered as a header at all.

The readers should never produce Header n with n < 1. Note: in some of the writers, we use Header 0 internally to represent chapters, when --top-level-division=chapter is used. (And we use -1 to represent a part!) This is a bit of a hack, and maybe we should code differently. In any case, the readers should never produce Header 0.

@jkr, can you see why the docx reader might produce Header 0, and can you see how to fix?

Linked pandoc-discuss thread.

sound-fx · 2017-08-03T17:39:20Z

The attached input docx file demonstrates the behavior when writing a reStructuredText file. Note that the docx file includes as-yet empty lists to eventually show page numbers for figures and tables.

gug.docx

jkr · 2017-08-03T17:54:45Z

Sure -- I'll try to take a look this afternoon.

jkr · 2017-08-03T20:26:31Z

Okay -- so the issue is that this file has a class named "Heading0", in addition to the more normal "Heading1". And "Heading0" is at a lower organizational level than "Heading1". So there are two ways we could deal with this:

1. Use the lowest level of headings as level-1 header (so in this case "Heading0" becomes a level 1 header). And bump the rest up. This would not know what to do if you named your classes "HeadingA" and "HeadingB", but it would do the right thing in this case.
2. Just bump {n<1}-level headers up to level 1 and leave the rest the same. This would mess up the structure, but it would avoid adding an extra layer of code for what seems like a very rare case (it's been about 3 years before we saw this.)

At first I had been inclined toward option 1, but I think I talked myself into option 2. @jgm?
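
As a rough illustration of how the two options differ, here is a minimal sketch that works on a plain list of heading levels. The names `Level`, `shiftLevels` and `clampLevel` are hypothetical and the list stands in for the levels read from style names such as "Heading0" and "Heading1"; none of this is the docx reader's actual code or the fix that was eventually committed.

```haskell
-- Hypothetical stand-in for heading levels taken from docx style names;
-- not pandoc's actual types.
type Level = Int

-- Option 1: if the lowest level is below 1, shift every level up so that
-- the lowest one becomes 1. Levels that already start at 1 or higher are
-- left alone.
shiftLevels :: [Level] -> [Level]
shiftLevels [] = []
shiftLevels ls = map (+ offset) ls
  where offset = max 0 (1 - minimum ls)

-- Option 2: raise any level below 1 to 1 and leave the rest unchanged.
clampLevel :: Level -> Level
clampLevel = max 1

main :: IO ()
main = do
  let levels = [0, 1, 2, 1, 0]
  print (shiftLevels levels)     -- [1,2,3,2,1]
  print (map clampLevel levels)  -- [1,1,2,1,1]
```

The second result shows the structural cost of option 2: the "Heading0" and "Heading1" entries end up at the same level, while option 1 keeps them distinct.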

jgm · 2017-08-03T20:56:16Z

I'd be okay with option 2.

jgm added format:Docx reader labels Aug 3, 2017

jgm added this to the pandoc 2.0 milestone Aug 3, 2017

jgm added the bug label Aug 3, 2017

jkr closed this as completed in a36a56b Aug 7, 2017
