Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove extraneous, significant whitespace in JATS writer output #4335

Merged
merged 8 commits into from
Mar 5, 2018
Merged

Remove extraneous, significant whitespace in JATS writer output #4335

merged 8 commits into from
Mar 5, 2018

Conversation

nokome
Copy link

@nokome nokome commented Feb 1, 2018

The JATS writer produces more readable XML by using indentation. However, it was inappropriately indenting the content of elements which may contain #PCDATA, thereby adding extraneous, significant whitespace. In addition, it had wrapping turned on which added more whitespace.

I checked all the uses of inTagsIndented in src/Text/Pandoc/Writers/JATS.hs against the JATS spec an fixed those tags which can have #PCDATA content:

tag status
<caption> OK
<def-item> OK
<def> OK
<disp-quote> OK
<fn> OK
<label> fixed
<list-item> OK
<p> fixed
<ref-list> OK
<tbody> OK
<td> fixed
<term> fixed
<th> fixed
<thead> OK
<tr> OK

Nokome Bentley added 5 commits February 2, 2018 10:05
…> tags

Text wrapping, and indentation in paragraphs, causes creates extra
whitespace. The JATS spec has a content model for `<p>` tags of `(#PCDATA | ...`.
Any tag where `#PCDATA` children are possible should not have any
indentation. This is consistent with the Pandoc HTML writer.
These tags contain `#PCDATA` so shouldn't introduce extra whitespace.
These tags contain `#PCDATA` so shouldn't introduce extra whitespace.
These tags contain `#PCDATA` so shouldn't introduce extra whitespace.
let render' :: Doc -> Text
render' = render colwidth
render' = render Nothing
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! One comment on this line. This disables wrapping entirely. I don't think we need to do anything that drastic. Text.Pandoc.Pretty provides a nowrap combinator which can locally disable wrapping when that is absolutely needed. But normal wrapping, which occurs on Space elements, should be innocuous, so you should only need the nowrap in a few special cases (e.g. verbatim).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @jgm. I've reinstated the wrapping now.

@nokome
Copy link
Author

nokome commented Feb 26, 2018

@jgm : I reinstated wrapping with 10eb42d which broke the tests but which illustrates the issue of introducing extra whitespace when both indentation and wrapping enabled as currently implemented. Here's an example from the breaking test in test/writer.jats:

Without wrapping:

<sec id="paragraphs">
  <title>Paragraphs</title>
  <p>Here’s a regular paragraph.</p>
  <p>In Markdown 1.0.0 and earlier. Version 8. This line turns into a list item. Because a hard-wrapped line in the middle of a paragraph looked like a list item.</p>
  <p>Here’s one with a bullet. * criminey.</p>
  <p>There should be a hard line break<break />here.</p>
</sec>

With wrapping:

<sec id="paragraphs">
  <title>Paragraphs</title>
  <p>Here’s a regular paragraph.</p>
  <p>In Markdown 1.0.0 and earlier. Version 8. This line turns into a list
  item. Because a hard-wrapped line in the middle of a paragraph looked like a
  list item.</p>
  <p>Here’s one with a bullet. * criminey.</p>
  <p>There should be a hard line break<break />here.</p>
</sec>

Note that because the <p> is a child of <sec> that an extra 2 significant spaces are inserted at the start of each wrapped line.

I'm afraid I don't have a good enough knowledge of Haskell or Pandoc to be able to fix this myself.

@jgm
Copy link
Owner

jgm commented Feb 26, 2018 via email

@nokome
Copy link
Author

nokome commented Feb 26, 2018

You're right, I wrongly assumed there was a semantic difference but that might not be the case. I did some searching but found it difficult to find clear documentation on this matter in the JATS standard https://jats.nlm.nih.gov/archiving/tag-library/1.1d1/.

The best I found was the "Whitespace Handling" section in https://www.ncbi.nlm.nih.gov/books/NBK425547/ which says:

XML processors may remove all insignificant white space but must retain one space where whitespace is significant

Which implies that processors usually combine more than one space into one space. Pinging @michael and @oliver---- who may be able to shed some light on the matter.

Either way, as you say a user can turn off wrapping locally, so I'm happy to defer to your judgement and adjust the test case so that it with wrapping on.

@jgm
Copy link
Owner

jgm commented Mar 3, 2018

I'd prefer it if the test case had wrapping, because that would be consistent with the other writer tests.
This also allows us to make sure that whitespace is treated appropriately even when there's wrapping.

@nokome
Copy link
Author

nokome commented Mar 4, 2018

@jgm: I have reintroduced wrapping into the test case and all tests are now passing.

@jgm
Copy link
Owner

jgm commented Mar 5, 2018

Great, thanks for the patch!

@jgm jgm merged commit 7d193b2 into jgm:master Mar 5, 2018
nokome added a commit to stencila/encoda that referenced this pull request Apr 11, 2018
Version 2.1.3 of Pandoc now deals with extra whitespace in JATS
jgm/pandoc#4335
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants