Functions isBlockElement and parseMixed cancel each other out; JATS body blocks are treated as inlines #8889
@tarleb, @hamishmack, @jgm do you have any insights on why this happens? TLDR; the functions
Therefore, as they are written now, neither of these two functions has any functional purpose. Instead of calling Your thoughts really appreciated. * This is because it can only output TRUE if the input is |
Something very weird about the definition of isBlockElement! I see what you mean: all the block-level tags are reproduced in |
@jgm After A LOT of analysis I came to the following conclusions, as to what the rationale was when the above code was written; and best next-steps: Points 1 and 2 below refer to the JATS 1.1 specification, the latest specification when the code was pushed in 2017 (JATS 1.2 appeared in 2019, and JATS 1.3 in 2021). Line numbers refer to the JATS reader, JATS.hs in 16f28ef.
Unless @tarleb, @hamishmack, yourself, or anyone else has an objection, I will prepare a fix for the above. Otherwise do share your thoughts. |
Thanks for this detailed analysis! I have a couple questions about the rationales on the linked spreadsheet.
I may be misunderstanding the broader context -- I don't have a good picture of how this parser works. But in general, I would think that we should be guided by the following criterion: If the JATS spec says that an element can appear inside |
Thanks for your quick input! Regarding specific ways to treat specific elements, I do not feel strongly one way or the other, I suppose what is important is that everything is explained and makes sense. On that, I am happy to elaborate on your questions:
Now, it is interesting you put in writing the criterion, since that actually explains why we have this issue. In JATS, I do not think we can assume that anything that can appear inside a I suggest, In JATS, we do not make this assumption, and rather reduce that list of elements that will be treated as inlines. After your input, I believe this list would now contain |
I was thinking that the most reasonable approach would be to take it as inline if it occurs inside the context of a As for disp-formula: pandoc doesn't have a block-level construct for Math, just a Math constructor for Inline that has two variants, InlineMath and DisplayMath. In this case you'll use DisplayMath. Don't worry: it will still be displayed as a separate block in e.g. LaTeX, HTML, or docx output. So I'm confident in this case that treating it as inline is the right approach |
This is what I am saying, yes. If pandoc assumes everything that can appear inside a pandoc/src/Text/Pandoc/Readers/JATS.hs Line 166 in 16f28ef
pandoc/src/Text/Pandoc/Readers/JATS.hs Lines 202 to 211 in 16f28ef
|
OK, I see, yes, that must be the intent of Question: currently the JATS writer uses As for
How does that sound? |
I think so. Taylor & Francis' guide to JATS does explicitly say not to use
The JATS spec recommends
|
I'll change the writer so it doesn't use Code for inline code. And on your side you can just treat |
See #8889. The Taylor and Francis guide to JATS says that `<code>` is block level and not intended to be used inline within standard text.
Sounds good. Regarding To recap, all I will do is simplify the long list of (I removed Would this be a sensible approach? |
In pandoc, the normal way of writing block math is
This is parsed as
[ Para
    [ Str "Einstein"
    , Space
    , Str "showed"
    , Space
    , Str "that"
    , Space
    , Math DisplayMath "e=mc^2"
    , Str "."
    , Space
    , Str "This"
    , Space
    , Str "formula"
    , Space
    , Str "is"
    , Space
    , Str "important"
    , Space
    , Str "because\8230"
    ]
]
So there is no separate block for the display formula. If we treated disp-formula as a block (say, a special div), then we'd end up with a new paragraph after the formula, which isn't what is wanted. (For example, in processing with LaTeX, if you had a new paragraph after the formula you'd get unwanted indentation.) When you pass the above through the JATS writer you get:
<p>Einstein showed that <disp-formula><alternatives>
<tex-math><![CDATA[e=mc^2]]></tex-math>
<mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></alternatives></disp-formula>.
This formula is important because…</p>
It is important that passing this JATS back through the pandoc reader parses the same way as the original. That is why disp-formula needs special conditional treatment. Or, just always treat it as inline and ignore contents that can't be represented that way -- that would be better than taking it as a block. It seems to me that |
Why does the native representation of the formula above use
|
Because in pandoc markdown |
Semantically the formula is part of the paragraph. Even if it gets rendered as a separated block, we don't start a new paragraph after the formula but continue the preceding paragraph. (Hence, no indent if paragraph indentation is called for in the style.) |
Got it. To recap, in the JATS reader:
Have we covered it? |
I think I'm still confused -- I'm sorry if I'm missing the context. Certainly you don't mean to say that the only tags that should be parsed as inline elements (e.g. in a Para or Header text) are mml:math, tex-math, related-object, and x? What about, e.g., bold, italic, strike, inline-graphic, etc.? What am I missing? |
No worries. That is not what I mean. In the context of pandoc/src/Text/Pandoc/Readers/JATS.hs Lines 107 to 108 in 16f28ef
Whatever we do to the list |
OK, thanks for the clarification. The code might be a bit unnecessarily confusing with the name |
Absolutely. To recap, in the
If no further comments, or objections, I'll prepare a fix. |
Looks good! Of course it would be good to have some tests so we can see clearly the effects of this change. |
Alright, here is the fix: #8971 It is failing 1 check, though. I am not sure if that is critical or normal. @jgm, do let me know if I should be doing something else. The test output was as expected, and minor discrepancies exist with the previous test output file because the test file (test/jats-reader.xml) was not JATS compliant and I cleaned it up. |
Background
In JATS, there are a number of paragraph-level elements (body blocks) and other structurally similar elements that sit at the same level as a <p>. The JATS specification defines them as "elements, such as tables and figures, that are content units separated from other content visually and logically, typically with whitespace before and after them. These elements are typically used in the same places a paragraph may be used, for example, inside a section after the section title."
Now, in JATS, the element <p> can contain some of these block-like elements, e.g. <code>:
Because <code> is supposed to sit at paragraph level, the above should look like three independent paragraphs separated by whitespace, although in JATS it is all just one paragraph.
Pandoc seems to have taken this into account with the parseMixed function of the JATS reader, where every element is parsed either as a block or as an inline, as appropriate:
pandoc/src/Text/Pandoc/Readers/JATS.hs
Lines 202 to 211 in 16f28ef
According to the above, when a block-like element is found, the reader parses it as a block, creating a separate paragraph for it. Otherwise, the reader parses the element as an inline.
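To make the intended behavior concrete, here is a minimal Python model of mixed-content parsing. It is not pandoc's Haskell code: the tag set, the (tag, text) representation, and the function names are hypothetical and chosen only for illustration. Consecutive inline children are gathered into one paragraph, and each block-like child becomes a block of its own.

```python
from itertools import groupby

# Hypothetical, deliberately tiny tag set; the list in the real reader is much longer.
BLOCK_TAGS = {"code", "fig", "table-wrap"}


def is_block(tag):
    """Return True if the (hypothetical) tag should be treated as a block."""
    return tag in BLOCK_TAGS


def parse_mixed(children):
    """Split the children of a <p> into paragraphs and standalone blocks.

    `children` is a list of (tag, text) pairs; plain text uses the tag "#text".
    Runs of inline children are merged into a single ("Para", [...]) entry,
    while every block child is emitted separately as ("Block", tag, text).
    """
    blocks = []
    for is_block_run, run in groupby(children, key=lambda child: is_block(child[0])):
        run = list(run)
        if is_block_run:
            blocks.extend(("Block", tag, text) for tag, text in run)
        else:
            blocks.append(("Para", [text for _, text in run]))
    return blocks


children = [
    ("#text", "Some text before,"),
    ("code", "x = 1"),
    ("#text", "and some text after."),
]
result = parse_mixed(children)
for block in result:
    print(block)
```

With a working block test, the single JATS paragraph yields three separate blocks. If `is_block` always returned False, the same input would collapse into one paragraph.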
The problem
The output of the isBlockElement function determines whether the element is parsed as a block or as an inline (see L203 in the code extract above). However, the isBlockElement function actually only returns TRUE if the input element is a <p> (see discussion here). Since isBlockElement is only ever called on children of <p>*, isBlockElement is never TRUE**, and as a result, all children of <p> are always parsed as inlines (L208-L211 above are never reached).
So the whole purpose of the parseMixed function is defeated. This is causing issue #8804, for instance.
* This is because isBlockElement is only ever called from parseMixed, and parseMixed is only ever called from the case of parsing <p>.
** This is because in the JATS specification <p> cannot contain a <p>; in other words, no input element of isBlockElement is ever a <p> that could yield a value of TRUE.
The root cause
It all boils down to a likely mix-up when the isBlockElement function was originally adapted from the DocBook reader (do scroll down to see it all):
pandoc/src/Text/Pandoc/Readers/JATS.hs
Lines 106 to 132 in 16f28ef
The function lists the elements that should be considered paragraph-level elements, then filters out any inline element. The problem is that it defines inline elements as simply any element that can be contained inside a paragraph (the list called inlinetags is an exact copy of all elements that <p> can contain in the JATS 1.1 spec). This way of defining inline elements does not acknowledge the body blocks and other paragraph-level elements that can exist inside <p> elements, as defined at the beginning of this post.
The only survivor of this filter is the element <p> itself (which is useless in this particular context).
The solution
A trivial solution would be to not filter out the inline elements from the allowed block elements in isBlockElement (this is achieved by removing L117-L131, and \\ S.fromList inlinetags from L108).
But this might not be a trivial problem, and I might well be missing something from the history of the isBlockElement function.
Thoughts, anyone?
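The root cause and the trivial fix described above can be modeled with plain Python sets. This is only a sketch of the set-difference logic, not the Haskell source: the two tag sets are small hypothetical subsets, and the names BLOCK_TAGS, INLINE_TAGS, is_block_element_current, and is_block_element_fixed are made up for this example.

```python
# Hypothetical subsets for illustration only; the real lists in JATS.hs are much longer.
BLOCK_TAGS = {"p", "code", "fig", "table-wrap", "disp-quote"}

# Modeled on the issue's description: a copy of everything <p> may contain,
# which includes body blocks. <p> itself is absent because <p> cannot contain <p>.
INLINE_TAGS = {"bold", "italic", "code", "fig", "table-wrap", "disp-quote"}


def is_block_element_current(tag):
    """Model of the current behavior: block tags minus the 'inline' tags."""
    return tag in (BLOCK_TAGS - INLINE_TAGS)


def is_block_element_fixed(tag):
    """Model of the trivial fix: no subtraction of the 'inline' tags."""
    return tag in BLOCK_TAGS


# Only <p> survives the subtraction, and <p> never occurs as a child of <p>.
surviving = BLOCK_TAGS - INLINE_TAGS
print(surviving)

for tag in ["code", "fig", "bold"]:
    print(tag, is_block_element_current(tag), is_block_element_fixed(tag))
```

In this model, the current predicate rejects every tag that can actually appear inside a paragraph, while the version without the subtraction recognizes body blocks such as code and fig and still treats bold as an inline.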