structured abstract generated for HTML and LaTeX from JATS XML #8015

Open
castedo opened this issue Apr 10, 2022 · 8 comments

Contributor

castedo commented Apr 10, 2022

To be honest, this level of JATS processing is probably beyond the scope of pandoc. I imagine that at some point there is a level of JATS-specific semantics for which pandoc is no longer the right tool for the job. So I'm labeling this as an enhancement.

Nonetheless, I report the limitation here with pandoc 2.18.

REPRO STEPS
With source.md

pandoc source.md -t jats -s > jats.xml
pandoc jats.xml -f jats -t html -s > got.html
pandoc jats.xml -f jats -t latex -s > got.tex

GOT
jats.xml.txt
got.tex.txt
got.html.txt

EXPECTED

The abstract should NOT get the same conversion of JATS XML <sec>, <title>, and <p> elements that is done in the body. Rather, it should be converted to something semantic for the abstract section.

For instance, the HTML generated for the abstract is

<div class="abstract">
<div class="abstract-title">Abstract</div>
<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
...

which isn't really right, because Objective is not an h1-level heading. Although CSS is powerful enough to hack around this, it would be more appropriate to output something like:

<div class="abstract">
  <div class="abstract-title">Abstract</div>
  <div class="abstract-section" id="objective">
    <div class="abstract-section-title">Objective</div>
    <p>To examine the effectiveness of day hospital attendance</p>
  </div>
...
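
The nested structure proposed above can be built from the flat run of headers and paragraphs by grouping. Here is a minimal sketch in Python over pandoc's JSON block representation (the `t`/`c` objects produced by `-t json`). The names `group_abstract_sections` and `words` are hypothetical, and the class names are simply the ones from the HTML above. It only illustrates the idea; it is not how pandoc does or would implement it.

```python
def words(s):
    """Hypothetical helper: build pandoc-style inlines (Str/Space) from plain text."""
    inlines = []
    for i, w in enumerate(s.split()):
        if i:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": w})
    return inlines


def group_abstract_sections(blocks):
    """Wrap each Header and the blocks that follow it in one 'abstract-section' Div."""
    grouped = []
    current = None  # the Div currently collecting blocks, if any
    for block in blocks:
        if block["t"] == "Header":
            _level, (ident, classes, keyvals), inlines = block["c"]
            title = {
                "t": "Div",
                "c": [["", ["abstract-section-title"], []],
                      [{"t": "Plain", "c": inlines}]],
            }
            # The section Div takes over the header's id, classes and attributes.
            current = {
                "t": "Div",
                "c": [[ident, classes + ["abstract-section"], keyvals], [title]],
            }
            grouped.append(current)
        elif current is not None:
            current["c"][1].append(block)
        else:
            grouped.append(block)  # content before the first header stays as-is
    return grouped


flat = [
    {"t": "Header", "c": [1, ["objective", [], []], words("Objective")]},
    {"t": "Para", "c": words("To examine the effectiveness of day hospital attendance")},
    {"t": "Header", "c": [1, ["design", [], []], words("Design")]},
    {"t": "Para", "c": words("Systematic review of 12 controlled clinical trials")},
]
for section in group_abstract_sections(flat):
    print(section["c"][0][0], len(section["c"][1]))
```

Each resulting Div holds a title Div followed by that section's paragraphs, which maps onto the `abstract-section` / `abstract-section-title` markup above.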

In the case of LaTeX, the current output is:

\begin{abstract}
\hypertarget{objective}{%
\section{Objective}\label{objective}}

To examine the effectiveness of day hospital attendance

\hypertarget{design}{%
\section{Design}\label{design}}

Systematic review of 12 controlled clinical trials

\hypertarget{subjects}{%
\section{Subjects}\label{subjects}}

2867 elderly people.
\end{abstract}

My guess is there is a way to hack around this in LaTeX, but I'm not as knowledgeable with LaTeX as with HTML/CSS.

Currently, the default output looks pretty bad for JATS structured abstracts in both HTML and LaTeX.

Owner

jgm commented Apr 10, 2022

I'm tempted to say: if you don't want level-1 headings in the abstract, don't use Markdown # in that context...

If you don't have control over that, another option is to use a Lua filter that converts level 1 headings in an abstract to something else.
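
A minimal sketch of that filter idea, written here as Python over pandoc's JSON AST rather than as a Lua filter. It assumes the abstract arrives as a `MetaBlocks` value under `meta.abstract`; the function name `demote_abstract_headers` and the sample document are made up for illustration. Each header in the abstract becomes a bold paragraph, and headers in the document body are left alone.

```python
def demote_abstract_headers(doc):
    """Replace Header blocks inside meta.abstract with bold paragraphs."""
    abstract = doc.get("meta", {}).get("abstract")
    # Assumption: the abstract is stored as MetaBlocks; anything else is skipped.
    if not abstract or abstract.get("t") != "MetaBlocks":
        return doc
    new_blocks = []
    for block in abstract["c"]:
        if block["t"] == "Header":
            inlines = block["c"][2]  # Header content is [level, attr, inlines]
            block = {"t": "Para", "c": [{"t": "Strong", "c": inlines}]}
        new_blocks.append(block)
    abstract["c"] = new_blocks
    return doc


# Trimmed-down sample document (a real one also carries a pandoc-api-version field).
sample = {
    "meta": {
        "abstract": {
            "t": "MetaBlocks",
            "c": [
                {"t": "Header", "c": [1, ["objective", [], []],
                                      [{"t": "Str", "c": "Objective"}]]},
                {"t": "Para", "c": [{"t": "Str", "c": "To"}, {"t": "Space"},
                                    {"t": "Str", "c": "examine"}]},
            ],
        }
    },
    "blocks": [],
}
result = demote_abstract_headers(sample)
print([block["t"] for block in result["meta"]["abstract"]["c"]])
```

To use this as an actual filter, the script would also need to read the JSON document from stdin and write the result to stdout, after which it could be passed to pandoc with `--filter`. Bold text survives in both HTML and LaTeX output, but this variant drops the header's `id`.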

Contributor Author

castedo commented Apr 11, 2022

In hindsight, my repro steps are confusing in that I start with source.md. I should have focused the repro steps on starting with the jats.xml. As more of a side note, the jats.xml is generated quite well from the source.md I include. So pandoc works very well writing JATS XML structured abstracts, just not so well reading them.

So having <sec> elements in the <abstract> element in JATS XML is out of my control so to speak.

I haven't learned how to write Lua filters, but that sounds like a reasonable approach if one wants to generate HTML and LaTeX from JATS XML.

In my particular situation I have a quick workaround, so I'm good for now. For the long term, I suspect I will want to upgrade my JATS -> HTML/LaTeX conversion from the Swiss Army knife that is pandoc to a more specialized knife that only cuts JATS. It's amazing that pandoc can convert so much to so much! But I bet it's inevitable I'll want to upgrade to a specialized JATS -> HTML/LaTeX solution soon.

Contributor Author

castedo commented May 22, 2023

To help clarify this issue, here is a summary. The attached JATS XML example is (roughly):

<?xml ... ?>
<!DOCTYPE ... >
<article ...>
    <front>
        <article-meta>
            <abstract>
                <sec id="objective">
                    <title>Objective</title>
                    <p>To examine the effectiveness of day hospital attendance</p>
                </sec>
                <sec id="design">
                    <title>Design</title>
                    <p>Systematic review of 12 controlled clinical trials</p>
                </sec>
                <sec id="subjects">
                    <title>Subjects</title>
                    <p>2867 elderly people.</p>
                </sec>
            </abstract>
            ...
        </article-meta>
        ...
    </front>
    ...
</article>

which pandoc converts to (roughly):

<html ...>
    ...
    <body>
        <header id="title-block-header">
            <h1 class="title">JATS an abstract</h1>
            <div class="abstract">
                <div class="abstract-title">Abstract</div>
                <h1 id="objective">Objective</h1>
                <p>To examine the effectiveness of day hospital attendance</p>
                <h1 id="design">Design</h1>
                <p>Systematic review of 12 controlled clinical trials</p>
                <h1 id="subjects">Subjects</h1>
                <p>2867 elderly people.</p>
            </div>
        </header>
    </body>
</html>

where the pandoc template variable $abstract$ gets the value:

<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
<h1 id="design">Design</h1>
<p>Systematic review of 12 controlled clinical trials</p>
<h1 id="subjects">Subjects</h1>
<p>2867 elderly people.</p>

So the issue here is that pandoc is converting JATS
<article ...><front><article-meta><abstract><sec><title>
to HTML <h1>. This is essentially never going to be the HTML that somebody wants for an abstract that is embedded inside a full-text document. There should only be one <h1> for the document, and it should not be inside the abstract.

Contributor

kamoe commented May 22, 2023

The root cause here is that front//abstract/sec elements are using the same function as body//abstract/sec elements, and I see why the outcome should be different.

A solution to this could be to write a customized treatment for front//abstract elements, changing the default recursive line below, inside the getAbstract function:

blks <- getBlocks s

to a behaviour that processes the inner <sec>s without adding a header at level current+1 (which is the behaviour for secs inside <body>, currently applied in parseBlock and, by transitivity, in getBlocks). This could be achieved with an analogous "front" function, e.g. getFrontBlocks, that does not emit headers at level current+1.

Contributor Author

castedo commented May 22, 2023

Sounds like a promising idea. Thanks for thinking it out! However, to be honest, I barely understand the code. I'm not very fluent in Haskell.

A net result that seems like a big improvement is something like:

<article ...>
    <front>
        <article-meta>
            <abstract>
                <sec id="objective">
                    <title>Objective</title>
                    <p>To examine the effectiveness of day hospital attendance</p>
                </sec>

getting converted to

<html ...>
    ...
    <body>
        <header id="title-block-header">
            ...
            <div class="abstract">
                <div class="abstract-title">Abstract</div>
                <div class="abstract-subtitle-1" id="objective">Objective</div>
                <p>To examine the effectiveness of day hospital attendance</p>

But if it's easier to output <h2> rather than <div class="abstract-subtitle-1">, that is certainly better than outputting <h1>.
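
For the simpler <h2> fallback, a minimal sketch in Python of shifting header levels in a list of pandoc JSON blocks. The function name `shift_headers` and its parameters are hypothetical; this only illustrates the transformation and is not taken from pandoc's code.

```python
def shift_headers(blocks, offset=1, max_level=6):
    """Return a copy of the block list with every Header level raised by `offset`."""
    shifted = []
    for block in blocks:
        if block["t"] == "Header":
            level, attr, inlines = block["c"]
            # Clamp so that deeply nested sections never exceed the deepest HTML heading.
            new_level = min(level + offset, max_level)
            block = {"t": "Header", "c": [new_level, attr, inlines]}
        shifted.append(block)
    return shifted


abstract_blocks = [
    {"t": "Header", "c": [1, ["objective", [], []], [{"t": "Str", "c": "Objective"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "To"}, {"t": "Space"}, {"t": "Str", "c": "examine"}]},
    {"t": "Header", "c": [1, ["design", [], []], [{"t": "Str", "c": "Design"}]]},
]
levels = [b["c"][0] for b in shift_headers(abstract_blocks) if b["t"] == "Header"]
print(levels)  # [2, 2]
```

Unlike the <div> variant, this keeps the `id` attributes and the heading semantics, at the cost of the abstract's section titles still appearing in the document's heading outline.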

Owner

jgm commented May 22, 2023

An easier fix would be to add a function to the abstract processing that just converts the Header elements to something more appropriate.
[EDIT: of course, this can be done already using a filter.]

Contributor Author

castedo commented May 22, 2023

FWIW, my really easy fix is to just not use headers in abstracts. 😅

So as an author I do this instead of authoring section headers:

\textbf{AUDIENCE}: Developers and early adopters of tools and services for research communication.

\textbf{STAGE}: Edition 2 planned. Feedback welcome.

which after LaTeX -> JATS -> HTML ends up not looking too bad:
https://perm.pub/H5NOlCVM9P5Vv4LbeuwJsaME8kM
but it is not very semantic.

Contributor

kamoe commented May 22, 2023

> An easier fix would be to add a function to the abstract processing that just converts the Header elements to something more appropriate. [EDIT: of course, this can be done already using a filter.]

Actually, an even easier solution would be to wrap the getBlocks line in manipulations of the header level, setting it to an agreed value, in much the same way as is (or used to be) done in the treatment of sec here:

oldN <- gets jatsSectionLevel
modify $ \st -> st{ jatsSectionLevel = n }
b <- getBlocks e
let ident = attrValue "id" e
modify $ \st -> st{ jatsSectionLevel = oldN }

So the getAbstract function would look like:

getAbstract :: PandocMonad m => Element -> JATS m ()
getAbstract e =
  case filterElement (named "abstract") e of
    Just s -> do
      oldN <- gets jatsSectionLevel
      modify $ \st -> st{ jatsSectionLevel = 6 } -- or whatever level we agree on for front headers
      blks <- getBlocks s
      modify $ \st -> st{ jatsSectionLevel = oldN }
      addMeta "abstract" blks
    Nothing -> pure ()
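
To make the save/set/restore pattern concrete, here is a toy model in Python. It is not pandoc's code: the `Reader` class, its method names, and the tuple output are all hypothetical, and it assumes (as described earlier in this thread) that a <sec> title becomes a header one level below the current section level.

```python
import xml.etree.ElementTree as ET


class Reader:
    """Toy stand-in for the reader state; section_level plays the role of jatsSectionLevel."""

    def __init__(self):
        self.section_level = 0

    def blocks(self, element):
        out = []
        for child in element:
            if child.tag == "sec":
                old = self.section_level
                self.section_level = old + 1  # a nested sec goes one level deeper
                title = child.find("title")
                if title is not None:
                    out.append(("header", self.section_level, title.text))
                out.extend(self.blocks(child))
                self.section_level = old  # restore on the way back out
            elif child.tag == "p":
                out.append(("para", child.text))
        return out

    def abstract(self, element, base_level):
        """Process an abstract with the section level temporarily set to base_level."""
        old = self.section_level
        self.section_level = base_level
        try:
            return self.blocks(element)
        finally:
            self.section_level = old


xml = """<abstract>
  <sec id="objective"><title>Objective</title>
    <p>To examine the effectiveness of day hospital attendance</p></sec>
  <sec id="design"><title>Design</title>
    <p>Systematic review of 12 controlled clinical trials</p></sec>
</abstract>"""
for item in Reader().abstract(ET.fromstring(xml), base_level=1):
    print(item)
```

In this model a `base_level` of 1 yields level-2 headers for the abstract's sections, and the reader's own level is unchanged once the abstract has been processed.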
