structured abstract generated for HTML and LaTeX from JATS XML #8015

castedo · 2022-04-10T23:27:25Z

To be honest, this level of JATS processing is probably beyond the scope of pandoc. I imagine that at some point there is a level of JATS-specific semantics for which pandoc is no longer the right tool for the job. So I'm labeling this as an enhancement.

Nonetheless, I report the limitation here with pandoc 2.18.

REPRO STEPS
With source.md

pandoc source.md -t jats -s > jats.xml
pandoc jats.xml -f jats -t html -s > got.html
pandoc jats.xml -f jats -t latex -s > got.tex

GOT
jats.xml.txt
got.tex.txt
got.html.txt

EXPECTED

The abstract should NOT get the same conversion of JATS XML <sec>, <title>, and <p> elements that is done in the body. Rather, the output should be something semantic for the abstract section.

For instance, the HTML generated for the abstract is

<div class="abstract">
<div class="abstract-title">Abstract</div>
<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
...

which isn't really right because Objective is not an h1-level heading. Although CSS is powerful enough to hack around this, it would be more appropriate to output something like:

<div class="abstract">
  <div class="abstract-title">Abstract</div>
  <div class="abstract-section" id="objective">
    <div class="abstract-section-title">Objective</div>
    <p>To examine the effectiveness of day hospital attendance</p>
  </div>
...
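
As a rough illustration of the mapping I have in mind (this is not pandoc code), here is a minimal Python sketch using only the standard library. The function name `abstract_to_html` is hypothetical, and the sketch only handles the structured case of <sec> elements containing <title> and <p>; inline markup is flattened to plain text.

```python
import xml.etree.ElementTree as ET
from html import escape


def abstract_to_html(abstract_xml: str) -> str:
    """Render a JATS <abstract> with <sec> children as nested <div>s (no headings)."""
    abstract = ET.fromstring(abstract_xml)
    lines = [
        '<div class="abstract">',
        '  <div class="abstract-title">Abstract</div>',
    ]
    for sec in abstract.findall("sec"):
        sec_id = sec.get("id")
        id_attr = f' id="{escape(sec_id, quote=True)}"' if sec_id else ""
        lines.append(f'  <div class="abstract-section"{id_attr}>')
        title = sec.find("title")
        if title is not None:
            # itertext() drops inline markup such as <italic>; fine for a sketch.
            text = escape("".join(title.itertext()).strip())
            lines.append(f'    <div class="abstract-section-title">{text}</div>')
        for p in sec.findall("p"):
            text = escape("".join(p.itertext()).strip())
            lines.append(f"    <p>{text}</p>")
        lines.append("  </div>")
    lines.append("</div>")
    return "\n".join(lines)


example = """<abstract>
  <sec id="objective">
    <title>Objective</title>
    <p>To examine the effectiveness of day hospital attendance</p>
  </sec>
</abstract>"""
print(abstract_to_html(example))
```

The point is only that each <sec> becomes a container with its own title element, so no <h1> ends up inside the abstract.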

In the case of LaTeX, the current output is:

\begin{abstract}
\hypertarget{objective}{%
\section{Objective}\label{objective}}

To examine the effectiveness of day hospital attendance

\hypertarget{design}{%
\section{Design}\label{design}}

Systematic review of 12 controlled clinical trials

\hypertarget{subjects}{%
\section{Subjects}\label{subjects}}

2867 elderly people.
\end{abstract}

My guess is there is a way to hack around this in LaTeX, but I'm not as knowledgeable about LaTeX as I am about HTML/CSS.

Currently the default output looks pretty bad for JATS structured abstracts in both HTML and LaTeX.


jgm · 2022-04-10T23:45:05Z

I'm tempted to say: if you don't want level-1 headings in the abstract, don't use Markdown # in that context...

If you don't have control over that, another option is to use a Lua filter that converts level 1 headings in an abstract to something else.
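
For anyone who prefers a JSON filter to Lua, the same idea can be sketched in Python with only the standard library. This is a sketch under assumptions: it assumes the abstract arrives in the metadata as a `MetaBlocks` value, and the class name `abstract-section-title` and the choice of a bold `Plain` inside a `Div` are purely illustrative.

```python
#!/usr/bin/env python3
"""Pandoc JSON filter sketch: demote headings inside the abstract metadata."""
import json
import sys


def demote_header(block):
    """Turn a Header block into a Div holding the heading text in bold."""
    if block.get("t") != "Header":
        return block
    _level, (ident, classes, keyvals), inlines = block["c"]
    # Keep the original id so links to the section still resolve.
    attr = [ident, classes + ["abstract-section-title"], keyvals]
    strong = {"t": "Strong", "c": inlines}
    return {"t": "Div", "c": [attr, [{"t": "Plain", "c": [strong]}]]}


def transform(doc):
    """Rewrite top-level headers in meta.abstract; the document body is untouched."""
    abstract = doc.get("meta", {}).get("abstract")
    if abstract and abstract.get("t") == "MetaBlocks":
        abstract["c"] = [demote_header(b) for b in abstract["c"]]
    return doc


# Pandoc passes the target format as the first argument when it runs a filter,
# so only read stdin when invoked that way.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(transform(json.load(sys.stdin)), sys.stdout)
```

Saved under a hypothetical name such as abstract_headers.py, it would be used as `pandoc jats.xml -f jats -t html -s --filter ./abstract_headers.py`.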

castedo · 2022-04-11T01:15:29Z

In hindsight, my repro steps are confusing in that I start with source.md. I should have focused the repro steps on starting from jats.xml. As a side note, jats.xml is generated quite well from the source.md I include. So pandoc works very well writing JATS XML structured abstracts, just not so well reading them.

So having <sec> elements in the <abstract> element in JATS XML is out of my control so to speak.

I haven't learned how to write Lua filters, but that sounds like a reasonable approach if one wants to generate HTML and LaTeX from JATS XML.

In my particular situation I have a quick workaround, so I'm good for now. For the long term I suspect I will want to upgrade my JATS -> HTML/LaTeX conversion from the Swiss Army knife that is pandoc to a more specialized knife that only cuts JATS. It's amazing that pandoc can convert so much to so much! But I bet it's inevitable I'll want to upgrade to a specialized JATS -> HTML/LaTeX solution soon.

castedo · 2023-05-22T12:35:17Z

To help clarify this issue, here is a summary. The attached JATS XML example is (roughly):

<?xml ... ?>
<!DOCTYPE ... >
<article ...>
    <front>
        <article-meta>
            <abstract>
                <sec id="objective">
                    <title>Objective</title>
                    <p>To examine the effectiveness of day hospital attendance</p>
                </sec>
                <sec id="design">
                    <title>Design</title>
                    <p>Systematic review of 12 controlled clinical trials</p>
                </sec>
                <sec id="subjects">
                    <title>Subjects</title>
                    <p>2867 elderly people.</p>
                </sec>
            </abstract>
            ...
        </article-meta>
        ...
    </front>
    ...
</article>

which pandoc converts to (roughly):

<html ...>
    ...
    <body>
        <header id="title-block-header">
            <h1 class="title">JATS an abstract</h1>
            <div class="abstract">
                <div class="abstract-title">Abstract</div>
                <h1 id="objective">Objective</h1>
                <p>To examine the effectiveness of day hospital attendance</p>
                <h1 id="design">Design</h1>
                <p>Systematic review of 12 controlled clinical trials</p>
                <h1 id="subjects">Subjects</h1>
                <p>2867 elderly people.</p>
            </div>
        </header>
    </body>
</html>

where the pandoc template variable $abstract$ gets the value:

<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
<h1 id="design">Design</h1>
<p>Systematic review of 12 controlled clinical trials</p>
<h1 id="subjects">Subjects</h1>
<p>2867 elderly people.</p>

So the issue here is that pandoc is converting JATS
<article ...><front><article-meta><abstract><sec><title>
to HTML <h1>. This is essentially never going to be the HTML that somebody wants for an abstract that is embedded inside a full-text document. There should only be one <h1> for the document and it should not be inside the abstract.
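
To make the problem easy to check on generated output, here is a small Python sketch (standard library only; the class name `H1Counter` is just for illustration) that counts <h1> elements overall and those nested inside <div class="abstract">.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Count <h1> elements overall and those nested inside <div class="abstract">."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.in_abstract = 0
        self._div_stack = []  # one entry per open <div>; True if it is the abstract

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self._div_stack.append("abstract" in classes)
        elif tag == "h1":
            self.total += 1
            if any(self._div_stack):
                self.in_abstract += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()


sample = """
<header id="title-block-header">
<h1 class="title">JATS an abstract</h1>
<div class="abstract">
<div class="abstract-title">Abstract</div>
<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
<h1 id="design">Design</h1>
<h1 id="subjects">Subjects</h1>
</div>
</header>
"""
counter = H1Counter()
counter.feed(sample)
print(counter.total, counter.in_abstract)  # prints: 4 3
```

For output like the one above, anything other than zero for the second number indicates the issue described here.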

kamoe · 2023-05-22T15:14:58Z

The root cause here is that front//abstract/sec elements are handled by the same function as body//sec elements, and I see why the outcome should be different.

A solution to this could be to write a customized treatment for front//abstract elements, changing the below, default recursive line inside the getAbstract function:

pandoc/src/Text/Pandoc/Readers/JATS.hs

Line 389 in 16f28ef

blks <- getBlocks s

to a behaviour that processes the inner <sec> elements without adding a header at level current+1 (which is the behaviour for <sec> elements inside <body>, currently applied in parseBlock and, by transitivity, in getBlocks). This could be achieved with an analogous "front" function, e.g. a getFrontBlocks function, that does not add headers at level current+1.

castedo · 2023-05-22T15:45:30Z

Sounds like a promising idea. Thanks for thinking it out! However to be honest, I barely understand the code. I'm not very fluent in Haskell.

A net result that seems like a big improvement is something like:

<article ...>
    <front>
        <article-meta>
            <abstract>
                <sec id="objective">
                    <title>Objective</title>
                    <p>To examine the effectiveness of day hospital attendance</p>
                </sec>

getting converted to

<html ...>
    ...
    <body>
        <header id="title-block-header">
            ...
            <div class="abstract">
                <div class="abstract-title">Abstract</div>
                <div class="abstract-subtitle-1" id="objective">Objective</div>
                <p>To examine the effectiveness of day hospital attendance</p>

But if it's easier to output <h2> rather than <div class="abstract-subtitle-1">, that is certainly better than outputting <h1>.

jgm · 2023-05-22T17:59:38Z

An easier fix would be to add a function to the abstract processing that just converts the Header elements to something more appropriate.
[EDIT: of course, this can be done already using a filter.]

castedo · 2023-05-22T18:40:06Z

FWIW, my really easy fix is to just not use headers in abstracts. 😅

So as an author I do this instead of authoring section headers:

\textbf{AUDIENCE}: Developers and early adopters of tools and services for research communication.

\textbf{STAGE}: Edition 2 planned. Feedback welcome.

which after LaTeX -> JATS -> HTML ends up not looking too bad:
https://perm.pub/H5NOlCVM9P5Vv4LbeuwJsaME8kM
but it is not very semantic.

kamoe · 2023-05-22T20:00:50Z

An easier fix would be to add a function to the abstract processing that just converts the Header elements to something more appropriate. [EDIT: of course, this can be done already using a filter.]

Actually, an even easier solution would be to wrap the getBlocks line in code that sets the header level to an agreed value and restores it afterwards, in much the same way as is done (or used to be done) in the treatment of sec here:

pandoc/src/Text/Pandoc/Readers/JATS.hs

Lines 336 to 340 in 16f28ef

    
           oldN <- gets jatsSectionLevel
           modify $ \st -> st{ jatsSectionLevel = n }
           b <- getBlocks e
           let ident = attrValue "id" e
           modify $ \st -> st{ jatsSectionLevel = oldN }

So the getAbstract function would look like:

getAbstract :: PandocMonad m => Element -> JATS m ()
getAbstract e =
  case filterElement (named "abstract") e of
    Just s -> do
      oldN <- gets jatsSectionLevel
      modify $ \st -> st{ jatsSectionLevel = 6 } -- or whatever level we agree on for front headers
      blks <- getBlocks s
      modify $ \st -> st{ jatsSectionLevel = oldN }
      addMeta "abstract" blks
    Nothing -> pure ()
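
For readers less fluent in Haskell, the save/set/restore pattern can be modelled in a few lines of Python. This is a toy model, not pandoc's reader: `ReaderState`, `get_blocks` and `get_abstract` are hypothetical stand-ins, and the rule that a <sec> title becomes a header one level below the current section level is an assumption taken from the discussion above.

```python
from contextlib import contextmanager
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class ReaderState:
    section_level: int = 0  # plays the role of jatsSectionLevel


@contextmanager
def section_level(state, level):
    """Temporarily override the section level, restoring the old value afterwards."""
    old = state.section_level
    state.section_level = level
    try:
        yield
    finally:
        state.section_level = old


def get_blocks(elem, state):
    """Return (kind, level, text) tuples; each <sec> title is one level deeper."""
    blocks = []
    for child in elem:
        if child.tag == "sec":
            with section_level(state, state.section_level + 1):
                title = child.find("title")
                if title is not None:
                    blocks.append(("header", state.section_level, title.text))
                blocks.extend(get_blocks(child, state))
        elif child.tag == "p":
            blocks.append(("para", None, child.text))
    return blocks


def get_abstract(abstract, state, base_level=1):
    """Parse the abstract with the section level pinned to base_level."""
    with section_level(state, base_level):
        return get_blocks(abstract, state)


abstract = ET.fromstring(
    "<abstract><sec><title>Objective</title>"
    "<p>To examine the effectiveness of day hospital attendance</p></sec></abstract>"
)
state = ReaderState()
print(get_abstract(abstract, state, base_level=1))  # the header comes out at level 2
print(state.section_level)  # prints: 0 (restored)
```

Whatever base level is agreed on, the state seen by the rest of the reader is unchanged once the abstract has been processed.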

castedo added the enhancement label Apr 10, 2022

castedo mentioned this issue May 22, 2023

Add raw_xml extension for JATS reader for reading additional elements not recognized by the built-in parser #8424

Open
