pandoc metadata as representation of JATS metadata #8359

Closed
castedo opened this issue Oct 7, 2022 · 6 comments

Comments

@castedo
Contributor

castedo commented Oct 7, 2022

In using pandoc I've encountered issues that I'm not sure whether to consider inside or outside the scope of what pandoc should handle.

This issue/feature of pandoc metadata representing JATS metadata can probably be closed, but I wanted to share my usage scenario and double check what is outside of scope. To frame the scope, I suspect the following question is useful:

What is the pandoc metadata for JATS supposed to be? Is it:

  1. a highly interoperable common data schema to be shared by many different formats, or
  2. a YAML representation that is convenient for authors to set data inside JATS XML output, or
  3. a JSON-compatible passive data structure [1] representation of JATS XML article metadata?

Currently it seems the answer is primarily 1) and optionally 2), and not 3). I'd say pandoc currently does a poor job of 3), which I hope is because that's out of scope.

Here's a concrete use case that affects me and illustrates some of the issues. In my YAML header I have the following metadata for pandoc:

author:
- surname: Ellerman
  given-names: E. Castedo
  email: castedo@castedo.com
  orcid: 0000-0002-5014-4809
date:
  iso-8601: 2022-08-24
  type: eprint
  year: 2022
  month: 08
  day: 24

which outputs the following JATS XML:

  <contrib-group>
    <contrib contrib-type="author">
      <contrib-id contrib-id-type="orcid">0000-0002-5014-4809</contrib-id>
      <name>
        <surname>Ellerman</surname>
        <given-names>E. Castedo</given-names>
      </name>
      <email>castedo@castedo.com</email>
    </contrib>
  </contrib-group>
  <pub-date date-type="eprint" publication-format="electronic" iso-8601-date="2022-08-24">
    <day>24</day>
    <month>8</month>
    <year>2022</year>
  </pub-date>

That JATS XML, if converted back into YAML+markdown via pandoc, becomes:

author:
- E. Castedo Ellerman
date: 2022-08-24

If pandoc metadata is supposed to be primarily 1) and secondarily 2), then this seems fine, and this issue can be closed. If not, then I can file some more issues. I am currently starting to use separate Python libraries to extract metadata from JATS XML.
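
As a rough illustration of option 3), the sketch below uses Python's standard `xml.etree.ElementTree` to pull the contributor and publication date out of the JATS fragment shown above into a JSON-compatible structure. The function name `extract_meta` and the output key names are assumptions chosen to mirror the YAML header; they are not part of pandoc or of any JATS library. The fragment is wrapped in an `<article-meta>` root only so that it parses as a single document.

```python
import json
import xml.etree.ElementTree as ET

# The JATS fragment from the post, wrapped in one root element so it parses.
JATS = """
<article-meta>
  <contrib-group>
    <contrib contrib-type="author">
      <contrib-id contrib-id-type="orcid">0000-0002-5014-4809</contrib-id>
      <name>
        <surname>Ellerman</surname>
        <given-names>E. Castedo</given-names>
      </name>
      <email>castedo@castedo.com</email>
    </contrib>
  </contrib-group>
  <pub-date date-type="eprint" publication-format="electronic" iso-8601-date="2022-08-24">
    <day>24</day>
    <month>8</month>
    <year>2022</year>
  </pub-date>
</article-meta>
"""


def extract_meta(xml_text):
    """Return a JSON-compatible dict of authors and publication date.

    Hypothetical helper: the key names mirror the YAML header, not a pandoc API.
    """
    root = ET.fromstring(xml_text)

    authors = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        authors.append({
            "surname": contrib.findtext("name/surname"),
            "given-names": contrib.findtext("name/given-names"),
            "email": contrib.findtext("email"),
            "orcid": contrib.findtext("contrib-id[@contrib-id-type='orcid']"),
        })

    meta = {"author": authors}
    pub_date = root.find("pub-date")
    if pub_date is not None:
        # Element text is kept as strings; no date parsing is attempted.
        meta["date"] = {
            "iso-8601": pub_date.get("iso-8601-date"),
            "type": pub_date.get("date-type"),
            "year": pub_date.findtext("year"),
            "month": pub_date.findtext("month"),
            "day": pub_date.findtext("day"),
        }
    return meta


meta = extract_meta(JATS)
print(json.dumps(meta, indent=2))
```

Unlike the round trip through pandoc, this keeps the ORCID, email, name parts, and date type as separate fields.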

Thank y'all for such a wonderful tool!

[1] https://en.wikipedia.org/wiki/Passive_data_structure

@jgm
Owner

jgm commented Oct 7, 2022

There isn't currently a standardized structured metadata format that will work optimally with all formats pandoc supports. The JATS writer supports JATS-specific structured metadata, as you've illustrated. But should the JATS reader produce this too? That would be very useful if you're going to re-render as JATS. (Then again, converting JATS to JATS is not so useful.) But if you're going to be rendering some other format, then you'd prefer to have something every pandoc format can handle, which is what the JATS reader currently gives you.
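
The trade-off described here can be made concrete with a small Python sketch. It is not how pandoc's JATS reader is implemented; `to_common_metadata` is a hypothetical helper that collapses the JATS-specific structure from the YAML header into the plain form the reader currently returns, which shows the fields that are lost along the way.

```python
def to_common_metadata(structured):
    """Collapse JATS-specific structured metadata into plain strings.

    Hypothetical helper for illustration only; field names follow the
    YAML header in the original post.
    """
    authors = [
        " ".join(part for part in (a.get("given-names"), a.get("surname")) if part)
        for a in structured.get("author", [])
    ]
    date = structured.get("date", {})
    return {"author": authors, "date": date.get("iso-8601")}


structured = {
    "author": [{
        "surname": "Ellerman",
        "given-names": "E. Castedo",
        "email": "castedo@castedo.com",
        "orcid": "0000-0002-5014-4809",
    }],
    "date": {"iso-8601": "2022-08-24", "type": "eprint",
             "year": 2022, "month": 8, "day": 24},
}

common = to_common_metadata(structured)
print(common)
```

The result carries only a display name and a date string, which any output format can render, while the email, ORCID, and date type are dropped.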

@jgm
Owner

jgm commented Oct 7, 2022

I think @tarleb has done some thinking about standardizing structured metadata, e.g. in his scholarly markdown project, so he may want to comment.

@castedo
Contributor Author

castedo commented Oct 7, 2022

> (Then again, converting JATS to JATS is not so useful.) But if you're going to be rendering some other format, then you'd prefer to have something every pandoc format can handle

Great point that I very much agree with.

@castedo
Contributor Author

castedo commented May 23, 2023

For reference, I will use this closed issue as a high-level nexus for other, more specific issues that relate to pandoc metadata representing JATS metadata.

"JATS" is ambiguous since there are so many dialects of JATS. I can suggest some names for dialects. I list them in rough order from least specific to most specific:

@castedo
Contributor Author

castedo commented May 23, 2023

@kamoe, here's a summary of issues with pandoc attempting to represent JATS metadata.

There are issues where the pandoc reader incorrectly represents metadata in JATS:
#8865
#8866
This affects not just PMC JATS but also the JATS that pandoc itself generates, which is documented at https://pandoc.org/jats.html

Then there's PMC & pandoc JATS metadata that isn't read at all and is absent from the pandoc metadata produced by the reader:
#8867

Last but not least, in addition to the above, there are more JATS elements that are documented on https://pandoc.org/jats.html and show up in PMC XML but do not appear in the pandoc metadata from the JATS reader:

  • everything under pandoc YAML article (<article-meta> JATS)
  • everything except title under pandoc YAML journal (<journal-meta> JATS)
  • tags in pandoc YAML (kwd-group in JATS)

My solution to all these problems is to not use pandoc and instead use an XML parser. The fixes and enhancements that I would actually use are improvements to the processing not of metadata but of marked-up text (e.g. #8847).
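
A minimal sketch of the XML-parser approach, using Python's standard `xml.etree.ElementTree`, might look like the following. The sample document and its values are made up for illustration, `read_front_matter` is a hypothetical helper, and the output key names are assumptions. The sketch covers only keywords and a few `<journal-meta>` fields; real PMC XML may use namespaces or extra nesting that it does not handle.

```python
import xml.etree.ElementTree as ET

# Made-up sample: element names are standard JATS, values are placeholders.
SAMPLE = """
<article>
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">example-journal</journal-id>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
      <issn pub-type="epub">0000-0000</issn>
      <publisher>
        <publisher-name>Example Publisher</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <kwd-group kwd-group-type="author">
        <kwd>metadata</kwd>
        <kwd>JATS</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
"""


def read_front_matter(xml_text):
    """Read journal fields and keywords from <front> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "journal": {
            # findtext returns None when an element is missing.
            "id": root.findtext("front/journal-meta/journal-id"),
            "title": root.findtext(
                "front/journal-meta/journal-title-group/journal-title"),
            "issn": root.findtext("front/journal-meta/issn"),
            "publisher-name": root.findtext(
                "front/journal-meta/publisher/publisher-name"),
        },
        "tags": [kwd.text for kwd in
                 root.iterfind("front/article-meta/kwd-group/kwd")],
    }


front = read_front_matter(SAMPLE)
print(front)
```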

@kamoe
Contributor

kamoe commented May 24, 2023

Thanks for this, @castedo. I note all your comments and concerns, and will take a good look at this. I'm very interested from the perspective of the implications for a future BITS reader, so this is all very relevant. The more JATS bugs get addressed, the fewer issues BITS inherits!


3 participants