docx reader produces level-0 headers #3830

Closed
jgm opened this issue Aug 3, 2017 · 4 comments

@jgm
Owner

jgm commented Aug 3, 2017

Here is a snippet of the parse tree produced from a docx file by pandoc 1.19.2.1:

,BlockQuote
    [Para [Str "Appendix"]]
   ,Para [Str "A",Space,Str "Instructions",Space,Str "for",Space,Str
   "Adding",Space,Str "an",Space,Str "Appendix",Space,Str "29"]
   ,Header 0 ("figures",[],[]) [Str "FIGURES"]
   ,Para [Strong [Str "Figure",Space,Str "Page"]]
   ,Para [Strong [Str "No",Space,Str "table",Space,Str "of",Space,Str
   "figures",Space,Str "entries",Space,Str "found."]]
   ,Header 0 ("tables",[],[]) [Str "TABLES"]
   ,Para [Strong [Str "Table",Space,Str "Page"]]
   ,Para [Strong [Str "No",Space,Str "table",Space,Str "of",Space,Str
   "figures",Space,Str "entries",Space,Str "found."]]

(Still waiting for a docx file to use for testing.)

This causes a problem when rendering to RST, because the RST writer has:

          let headerChar = if level > 5 then ' ' else "=-~^'" !! (level - 1)

and when level == 0, we get a runtime error for using !! with a negative index.
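For illustration, here is a minimal sketch of the failure and one way to guard against it. `headerCharUnsafe` mirrors the line quoted above; `headerCharSafe` is a hypothetical name, not pandoc's actual code, and falling back to `' '` for any out-of-range level is an assumption made only for this example.

```haskell
-- Mirrors the expression from the RST writer quoted above.
-- For level == 0 this evaluates "=-~^'" !! (-1), which raises a
-- runtime exception because (!!) rejects negative indices.
headerCharUnsafe :: Int -> Char
headerCharUnsafe level =
  if level > 5 then ' ' else "=-~^'" !! (level - 1)

-- Hypothetical total variant (assumption, not pandoc's code):
-- only levels 1..5 index into the string; anything else yields ' '.
headerCharSafe :: Int -> Char
headerCharSafe level
  | level >= 1 && level <= 5 = "=-~^'" !! (level - 1)
  | otherwise                = ' '
```

A guard like this only hides the symptom in one writer; the reader still should not emit the invalid level in the first place.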

The Markdown writer doesn't crash, but its output is not ideal either; the Header 0 is not rendered as a header at all.

The readers should never produce Header n with n < 1. Note: in some of the writers, we use Header 0 internally to represent chapters when --top-level-division=chapter is used. (And we use -1 to represent a part!) This is a bit of a hack, and maybe we should code it differently. In any case, the readers should never produce Header 0.

@jkr, can you see why the docx reader might produce Header 0, and can you see how to fix it?

Linked pandoc-discuss thread.

@jgm jgm added this to the pandoc 2.0 milestone Aug 3, 2017
@jgm jgm added the bug label Aug 3, 2017
@sound-fx

sound-fx commented Aug 3, 2017

The attached input docx file demonstrates the behavior when writing a reStructuredText file. Note that the docx file includes as-yet empty lists to eventually show page numbers for figures and tables.

gug.docx

@jkr
Collaborator

jkr commented Aug 3, 2017

Sure -- I'll try to take a look this afternoon.

@jkr
Collaborator

jkr commented Aug 3, 2017

Okay -- so the issue is that this file has a class named "Heading0", in addition to the more normal "Heading1". And "Heading0" is at a lower organizational level than "Heading1". So there are two ways we could deal with this:

  1. Use the lowest level of headings as the level-1 header (so in this case "Heading0" becomes a level 1 header), and bump the rest up. This would not know what to do if you named your classes "HeadingA" and "HeadingB", but it would do the right thing in this case.

  2. Just bump {n<1}-level headers up to level 1 and leave the rest the same. This would mess up the structure, but it would avoid adding an extra layer of code for what seems like a very rare case (it took about 3 years before we saw this).

At first I had been inclined toward option 1, but I think I talked myself into option 2. @jgm?
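To make the trade-off concrete, the two options can be sketched roughly as follows. This is not the docx reader's code: `Block` here is a simplified stand-in for pandoc's type (attributes and inlines omitted), the function names `clampHeaders` and `shiftHeaders` are hypothetical, and the choice to shift only when some level is below 1 is an assumption.

```haskell
-- Simplified stand-in for pandoc's Block type (assumption: only the
-- header level and a plain-text title matter for this sketch).
data Block
  = Header Int String
  | Para String
  deriving (Eq, Show)

-- Option 2: raise any header below level 1 up to level 1 and leave
-- every other block unchanged. Simple, but flattens the structure.
clampHeaders :: [Block] -> [Block]
clampHeaders = map clamp
  where
    clamp (Header n t) = Header (max 1 n) t
    clamp b            = b

-- Option 1: shift all headers by the same offset so that the lowest
-- level present becomes 1. No shift is applied when every level is
-- already >= 1 (assumption for this sketch).
shiftHeaders :: [Block] -> [Block]
shiftHeaders bs = map shift bs
  where
    levels = [n | Header n _ <- bs]
    offset = if null levels then 0 else max 0 (1 - minimum levels)
    shift (Header n t) = Header (n + offset) t
    shift b            = b
```

With a "Heading0" followed by a "Heading1", `clampHeaders` puts both at level 1, while `shiftHeaders` keeps them distinct as levels 1 and 2.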

@jgm
Owner Author

jgm commented Aug 3, 2017 via email

@jkr jkr closed this as completed in a36a56b Aug 7, 2017