alt-text and caption content missing from graphics in JATS #9130

Closed
kamoe opened this issue Oct 11, 2023 · 4 comments

kamoe commented Oct 11, 2023

Problem
In the function getGraphic, the JATS reader fetches the identifier, title, and alt-text* of a figure as follows:

getGraphic mbfigdata e = do
  let atVal a = attrValue a e
      (ident, title, capt) =
         case mbfigdata of
           Just (capt', i) -> (i, "fig:" <> atVal "title", capt')
           Nothing         -> (atVal "id", atVal "title",
                               text (atVal "alt-text"))
      attr = (ident, T.words $ atVal "role", [])
      imageUrl = atVal "href"
  return $ imageWith attr imageUrl title capt

However, the attribute @alt-text does not exist in the <graphic> element. Only a child element <alt-text> exists, as per the JATS specs: https://jats.nlm.nih.gov/archiving/tag-library/1.3/element/graphic.html

As a result, no alt-text is ever captured for <graphic> elements, even if one is given. In addition, since no provision is made for any <caption> child, this content gets skipped.
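
The mismatch described above can be illustrated with a minimal, self-contained sketch. The `Node` type and the helpers `attrVal`, `textOf`, and `childText` are hypothetical stand-ins invented for this example; they are not pandoc's XML types or functions. The sketch also assumes that looking up a missing attribute yields an empty string, which is consistent with the empty image description pandoc produces for such input.

```haskell
import Data.Maybe (fromMaybe, listToMaybe)

-- Hypothetical, simplified XML node; NOT pandoc's real element type.
data Node
  = Elem String [(String, String)] [Node]  -- name, attributes, children
  | Txt String
  deriving (Show, Eq)

-- Attribute lookup that yields "" when the attribute is missing (assumption).
attrVal :: String -> Node -> String
attrVal key (Elem _ attrs _) = fromMaybe "" (lookup key attrs)
attrVal _ _ = ""

-- Concatenated text content of a node and its descendants.
textOf :: Node -> String
textOf (Txt s) = s
textOf (Elem _ _ children) = concatMap textOf children

-- Text of the first child element with the given name, or "" if there is none.
childText :: String -> Node -> String
childText name (Elem _ _ children) =
  maybe "" textOf (listToMaybe [c | c@(Elem n _ _) <- children, n == name])
childText _ _ = ""

-- A <graphic> whose alternative text lives in a child element, as JATS requires.
sampleGraphic :: Node
sampleGraphic =
  Elem "graphic" [("id", "graphic001")]
    [ Elem "alt-text" [] [Txt "Alternative text of the graphic"]
    , Elem "caption" [] [Elem "p" [] [Txt "Google doodle from 14 March 2003"]]
    ]
```

Here `attrVal "alt-text" sampleGraphic` finds nothing because no such attribute exists, while `childText "alt-text" sampleGraphic` reaches the text of the child element.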

The result from the input JATS fragment below

<graphic id="graphic001"
        xlink:href="https://lh3.googleusercontent.com/dB7iirJ3ncQaVMBGE2YX-WCeoAVIChb6NAzoFcKCFChMsrixJvD7ZRbvcaC-ceXEzXYaoH4K5vaoRDsUyBHFkpIDPnsn3bnzovbvi0a2Gg=s660"
        xlink:title="This is the title of the graphic"
        xlink:role="This is the role of the graphic">
          <alt-text>Alternative text of the graphic</alt-text>
          <caption>
            <title>This is the title of the caption</title>
            <p>Google doodle from 14 March 2003</p>
          </caption>
</graphic>

is

Para
          [ Image
              ( "graphic001"
              , [ "This"
                , "is"
                , "the"
                , "role"
                , "of"
                , "the"
                , "graphic"
                ]
              , []
              )
              []
              ( "https://lh3.googleusercontent.com/dB7iirJ3ncQaVMBGE2YX-WCeoAVIChb6NAzoFcKCFChMsrixJvD7ZRbvcaC-ceXEzXYaoH4K5vaoRDsUyBHFkpIDPnsn3bnzovbvi0a2Gg=s660"
              , "This is the title of the graphic"
              )
          ]

Note how neither <alt-text> nor <caption> content has been captured.

Pandoc version?
All versions.

*For some reason the variables that represent alt-text are called capt and capt', suggesting caption content, which might have confused things in the original code.

@kamoe kamoe added the bug label Oct 11, 2023

jgm commented Oct 12, 2023

Thanks. There aren't places here to put both an alt-text and a caption. (But the JATS documentation says to put the caption element in a figure, not under graphic, except in cases where a figure contains multiple subfigures; a caption under a figure is handled elsewhere in the code.) I'd suggest we look first in caption to populate the "image description," and then in alt-text if there is nothing in caption.


kamoe commented Oct 12, 2023

I don't think that's what the JATS spec is saying. It says to put the <caption> at the highest possible level. In many cases that is a <figure>, but in others the highest possible level is indeed a <graphic>. It does not say that a lone graphic with a caption should absolutely go inside a figure. It gives a clear rule of thumb: If a graphic element is not to go in the list of figures, then it should not go inside <figure>. Where should one put the caption for that graphic, then?

I really think that if the original JATS document has a valid structure, like a <caption> inside a <graphic>, which is perfectly legal JATS, then I think pandoc should process exactly that structure. If the output does not make sense because best practice was not respected at input e.g. a graphic with a caption inside a figure with a caption, then that should be the input's problem, not pandoc's. Pandoc should not be cleaning up badly structured input.

I also don't think <alt-text> and <caption> should be used interchangeably. They are completely different things. If there is no alt-text, then there is no alt-text, period. Caption should not be filling in. I say this because we would be tempted to then not develop the correct support for caption.

The bottom-line issue is that pandoc assumes <graphic> elements are inlines, and does not allow the imageWith constructor to take a Block content parameter (unlike figureWith, which does). This is a problem because in JATS, <graphic>s (as well as <fig>s) can be very rich block structures that can have, for instance, captions with inner paragraphs. In BITS, they can also contain <legend>s, which are a crucial content element for figures and tables.

Therefore, I suggest we split this problem. We can easily take care of the treatment for alt-text now by simply fetching it from a child element instead of an attribute. But no replacement with caption or vice versa. Then, we can separately discuss how to better address the remaining block elements of <graphic> (<caption>, <legend>, etc).
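
The difference between the two approaches discussed in this thread can be sketched as follows. `GraphicInfo` and both functions are hypothetical names made up for illustration, not pandoc code; the sketch only contrasts a caption-first fallback with an alt-text-only lookup.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)

-- Hypothetical summary of what was found under a <graphic>; not a pandoc type.
data GraphicInfo = GraphicInfo
  { giAltText :: Maybe String  -- text of the <alt-text> child, if present
  , giCaption :: Maybe String  -- flattened text of the <caption> child, if present
  }

-- Caption first, then alt-text: the fallback suggested earlier in the thread.
descriptionWithFallback :: GraphicInfo -> String
descriptionWithFallback g = fromMaybe "" (giCaption g <|> giAltText g)

-- Alt-text only: a caption never fills in for a missing alt-text.
descriptionAltOnly :: GraphicInfo -> String
descriptionAltOnly g = fromMaybe "" (giAltText g)

-- A graphic that has a caption but no alt-text.
captionOnly :: GraphicInfo
captionOnly = GraphicInfo
  { giAltText = Nothing
  , giCaption = Just "Google doodle from 14 March 2003"
  }
```

For `captionOnly`, the fallback version reports the caption text as the image description, whereas the alt-text-only version reports an empty description, keeping the two concepts separate.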

What do you think?


jgm commented Oct 13, 2023

> I really think that if the original JATS document has a valid structure, like a <caption> inside a <graphic>, which is perfectly legal JATS, then I think pandoc should process exactly that structure.

Pandoc has to fit everything into its own structure, the Pandoc AST. This doesn't correspond exactly to the structure JATS represents, so there might not be an exact fit. JATS has an inline level element graphic that can contain a caption; pandoc just doesn't have an element like that. We have a block-level Figure element with space for a caption, and an inline-level Image with space for an image description, traditionally rendered as alt text.
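
A rough sketch of the shape being described, using deliberately simplified stand-ins: the real types in pandoc-types carry attributes, titles, and many more constructors, so the definitions below are assumptions for illustration only, and "image.png" is a made-up target.

```haskell
-- Simplified stand-ins for the Pandoc AST; not the real pandoc-types definitions.
data Inline
  = Str String
  | Image [Inline] String   -- image description (inlines), target URL
  deriving (Show, Eq)

data Block
  = Para [Inline]
  | Figure [Block] [Block]  -- caption (blocks), figure content (blocks)
  deriving (Show, Eq)

-- An inline image only has room for an inline description.
inlineImage :: Inline
inlineImage = Image [Str "Alternative text of the graphic"] "image.png"

-- A block-level figure can wrap the image and carry a block-level caption.
figureBlock :: Block
figureBlock =
  Figure [Para [Str "Google doodle from 14 March 2003"]] [Para [inlineImage]]

-- Caption blocks of a figure; other blocks have none.
captionOf :: Block -> [Block]
captionOf (Figure caption _) = caption
captionOf _ = []
```

In this simplified model there is no inline constructor that holds both a description and a block-level caption, which is the gap a <caption> inside an inline <graphic> runs into.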

I'm okay with your suggestion of grabbing the alt-text from a child element, and waiting until later to decide what to do with caption.


kamoe commented Oct 13, 2023

@jgm Absolutely.

Here's a PR that I think addresses the above: #9134

@jgm jgm closed this as completed in fa3513b Oct 17, 2023