JATS permission/copyright metadata absent from pandoc metadata #8867

Closed
castedo opened this issue May 23, 2023 · 17 comments

@castedo
Contributor

castedo commented May 23, 2023

I am submitting this issue because @kamoe was interested in seeing cases like this. It is one case of the more general issue #8359 (closed as out of scope in late 2022).

This applies to PMC JATS, JATS4R, and "pandoc JATS", that is, JATS generated by the default JATS template and documented as the JATS that pandoc supports on https://pandoc.org/jats.html.

It is worth noting that this uses a JATS4R feature that depends on XML namespaces. Specifically the part with

<ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">

This introduces the complications of XML namespaces. For instance, I believe the prefix is technically arbitrary: the xmlns:ali= declaration can be changed so that the ali: prefix is renamed or dropped, and the document should still parse the same way. This is the kind of thing a full XML parser handles.
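
As a minimal sketch of that point, the following Python snippet uses the standard library's xml.etree.ElementTree (not pandoc's own XML parser) to show that three spellings of the element resolve to the same namespace-qualified name. The VARIANTS fragments and the license_ref helper are hypothetical and written only for this illustration.

```python
import xml.etree.ElementTree as ET

ALI_NS = "http://www.niso.org/schemas/ali/1.0/"
URL = "https://creativecommons.org/licenses/by/4.0/"

# Three spellings of the same element (illustrative fragments, not from the issue).
# Only the namespace URI is significant; the prefix is an arbitrary local alias.
VARIANTS = {
    "ali prefix": (
        f'<license><ali:license_ref xmlns:ali="{ALI_NS}">{URL}'
        "</ali:license_ref></license>"
    ),
    "renamed prefix declared on an ancestor": (
        f'<license xmlns:crazy="{ALI_NS}"><crazy:license_ref>{URL}'
        "</crazy:license_ref></license>"
    ),
    "default namespace, no prefix": (
        f'<license><license_ref xmlns="{ALI_NS}">{URL}</license_ref></license>'
    ),
}


def license_ref(xml_text):
    """Return the license URL, matching on the namespace URI rather than the prefix."""
    root = ET.fromstring(xml_text)
    element = root.find(f"{{{ALI_NS}}}license_ref")
    return None if element is None else element.text


results = {name: license_ref(xml) for name, xml in VARIANTS.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

A reader that matches on the literal string "ali:license_ref" would only recognize the first variant, which is why namespace-aware parsing matters here.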

Pandoc returns nothing in the pandoc metadata for this simple example of "pandoc JATS" input:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.2 20190208//EN" "JATS-archivearticle1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.2" article-type="other">
  <front>
    <journal-meta>
      <journal-id/>
      <journal-title-group>
</journal-title-group>
      <issn/>
      <publisher>
        <publisher-name/>
      </publisher>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title></article-title>
      </title-group>
      <permissions>
        <copyright-statement>© 2023, Ellerman et al</copyright-statement>
        <copyright-year>2023</copyright-year>
        <copyright-holder>Ellerman et al</copyright-holder>
        <license license-type="open-access">
          <ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
          <license-p>This document is distributed under a Creative Commons Attribution 4.0 International license.</license-p>
        </license>
      </permissions>
    </article-meta>
  </front>
  <body>
  </body>
  <back>
  </back>
</article>

It is also worth noting that an older JATS dialect is in circulation whose license schema differs from the JATS4R one shown here. One example is the JATS XML that eLife still provides publicly (from GitHub and their website, I believe), which differs from the eLife JATS XML stored in PMC. So for the very same eLife article there are two flavors of JATS XML: one uses the JATS4R license schema above and the other does something else.

@kamoe
Contributor

kamoe commented Aug 28, 2023

> Pandoc returns nothing in the pandoc metadata for this simple example of "pandoc JATS" input

@castedo This happens because the JATS reader has no provision for parsing the contents of the <permissions> element. That is straightforward to fix. I experimented and got the output below. Would that work?

Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "copyright-holder"
            , MetaInlines
                [ Str "Ellerman" , Space , Str "et" , Space , Str "al" ]
            )
          , ( "copyright-statement"
            , MetaInlines
                [ Str "\169"
                , Space
                , Str "2023,"
                , Space
                , Str "Ellerman"
                , Space
                , Str "et"
                , Space
                , Str "al"
                ]
            )
          , ( "copyright-year" , MetaInlines [ Str "2023" ] )
          , ( "license"
            , MetaBlocks
                [ Plain
                    [ Str "https://creativecommons.org/licenses/by/4.0/"
                    ]
                , Plain
                    [ Str "This"
                    , Space
                    , Str "document"
                    , Space
                    , Str "is"
                    , Space
                    , Str "distributed"
                    , Space
                    , Str "under"
                    , Space
                    , Str "a"
                    , Space
                    , Str "Creative"
                    , Space
                    , Str "Commons"
                    , Space
                    , Str "Attribution"
                    , Space
                    , Str "4.0"
                    , Space
                    , Str "International"
                    , Space
                    , Str "license."
                    ]
                ]
            )
          ]
    }
  []

@castedo
Contributor Author

castedo commented Aug 28, 2023

Nice, definitely an improvement!

However, it falls slightly short of a feature level we could call "JATS4R level 42" for lack of a better name (or another name, as long as it is funny 😃). It is based on the JATS4R that I've been archiving in baseprints like this one:
https://archive.softwareheritage.org/swh:1:dir:aa9a884908ef6d4fad57f368ad8bdc0865f976f4

Some more examples of JATS4R license XML are here:
https://jats4r.org/permissions/

I'll propose more details on JATS4R level 42 functionality here soon (like within an hour).

@castedo
Contributor Author

castedo commented Aug 28, 2023

Here is some YAML that is of interest:

license:
  type: open-access
  link: "https://creativecommons.org/licenses/by/4.0/"
  text: >-
    This document is distributed under a Creative Commons Attribution 4.0
    International license.

This is the YAML that I currently use with pandoc to produce JATS4R XML. I think it generates a native object model that is more structured than the license object model you quoted in #8867 (comment). Pandoc generates proper JATS4R XML from the more structured license, but not from a single-value license object model.

@castedo
Contributor Author

castedo commented Aug 28, 2023

Another YAML detail:

copyright:
  year: "2023"
  statement: "© 2023, Ellerman et al"
  holder: Ellerman et al

That's the YAML I use to produce good JATS4R XML. It looks like the copyright needs to be a single object/key with three members/subkeys rather than three separate top-level objects/keys.

@castedo
Contributor Author

castedo commented Aug 28, 2023

I propose that the first key behavior of JATS4R level 42 functionality is that this panda.md round-trips unchanged through the following command:

pandoc panda.md --to jats -s | pandoc --from jats --to markdown -s

A second key behavior I propose is that converting jats-without-license_ref.xml.txt with

pandoc jats-without-license_ref.xml.txt --from jats --to markdown -s

generates

---
copyright:
  year: 2023
  statement: "© 2023, Panda et al"
  holder: "Panda et al"
license:
  type: open-access
  text: This is plain text.
---
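
To make the target shape concrete, here is a rough Python sketch of the mapping from <permissions> to the nested copyright and license objects. It is not pandoc's implementation (the JATS reader is written in Haskell). The JATS string is a hypothetical stand-in inferred from the expected YAML above, not the actual attachment, and permissions_metadata and text_of are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the attachment; only <permissions> matters here.
JATS = """<article><front><article-meta>
  <permissions>
    <copyright-statement>© 2023, Panda et al</copyright-statement>
    <copyright-year>2023</copyright-year>
    <copyright-holder>Panda et al</copyright-holder>
    <license license-type="open-access">
      <license-p>This is plain text.</license-p>
    </license>
  </permissions>
</article-meta></front></article>"""

ALI_REF = "{http://www.niso.org/schemas/ali/1.0/}license_ref"


def text_of(parent, tag):
    """Whitespace-normalized text of the first matching child, or None if absent."""
    element = parent.find(tag)
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split())


def permissions_metadata(xml_text):
    """Build nested copyright/license dicts, omitting fields that are absent."""
    perms = ET.fromstring(xml_text).find(".//permissions")
    if perms is None:
        return {}
    meta = {}
    copyright_fields = {
        "year": text_of(perms, "copyright-year"),
        "statement": text_of(perms, "copyright-statement"),
        "holder": text_of(perms, "copyright-holder"),
    }
    copyright_fields = {k: v for k, v in copyright_fields.items() if v is not None}
    if copyright_fields:
        meta["copyright"] = copyright_fields
    license_el = perms.find("license")
    if license_el is not None:
        license_fields = {
            "type": license_el.get("license-type"),
            "link": text_of(license_el, ALI_REF),  # absent in this input
            "text": text_of(license_el, "license-p"),
        }
        meta["license"] = {k: v for k, v in license_fields.items() if v is not None}
    return meta


metadata = permissions_metadata(JATS)
print(metadata)
```

Because the input has no license_ref, the resulting license object has only type and text, matching the YAML above.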

@jgm
Owner

jgm commented Aug 28, 2023

See #9034 - does this take care of it?

@castedo
Contributor Author

castedo commented Aug 28, 2023

> See #9034 - does this take care of it?

Eyeballing the Haskell code, it looks to me like it does not. copyright-statement, copyright-year, and copyright-holder are top-level keys, and license is a single value, not a 3-part object. (I'm not sure whether I'm using sensible terms for Haskell.)

@jgm
Owner

jgm commented Aug 28, 2023

@kamoe It seems to me that making copyright and license objects with multiple fields makes more sense than using hyphenated top-level copyright-holder, etc. What do you think?

@kamoe
Contributor

kamoe commented Aug 28, 2023

@castedo #9034 is the code that produced my first proposal above. Happy to keep refining it as we converge on the final output.

@jgm Agree, happy to refine those as you suggest.

@castedo
Contributor Author

castedo commented Aug 28, 2023

Regarding the JATS4R recommendation of <ali:license_ref>, I'm not sure what to think. Parsing that XML properly requires XML namespace support, which is difficult. But perhaps pandoc can handle XML namespaces. Assuming it can, I propose a third behavior, which is that converting jats-with-license_ref.xml.txt, an XML file with a license_ref, using

pandoc jats-with-license_ref.xml --from jats --to markdown -s

outputs panda.md.

Furthermore, proper valid XML parsing means the following XML should produce the same result: jats-crazy.xml.txt
This valid XML variation names the XML child element <crazy:license_ref> and has the xmlns attribute in the <front ...> tag. I'm pretty sure a valid XML parser is supposed to parse this the same way as the previous XML.

@castedo
Contributor Author

castedo commented Aug 28, 2023

One quick note: the panda.md used in the above examples is the minimal case using all the copyright and license metadata documented on https://pandoc.org/jats.html.

So if XML namespaces are handled, then supporting all three proposed behaviors above is a very nice "JATS4R level 42" to achieve for pandoc.

@kamoe
Contributor

kamoe commented Aug 29, 2023

How about this output:

Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "copyright"
            , MetaMap
                (fromList
                   [ ( "holder" , MetaString "Ellerman et al" )
                   , ( "statement"
                     , MetaString "\169 2023, Ellerman et al"
                     )
                   , ( "year" , MetaString "2023" )
                   ])
            )
          , ( "license"
            , MetaMap
                (fromList
                   [ ( "link"
                     , MetaString
                         "https://creativecommons.org/licenses/by/4.0/"
                     )
                   , ( "text"
                     , MetaString
                         "This document is distributed under a Creative Commons Attribution 4.0 International license."
                     )
                   , ( "type" , MetaString "open-access" )
                   ])
            )
          ]
    }
  []

@castedo
Contributor Author

castedo commented Aug 29, 2023

@kamoe LGTM (since pandoc converts it to yaml that matches pandoc.org/jats.html)

Thanks! This is all data I'd like to use and I currently hard code it into HTML and PDF outputs rather than reading it from JATS XML. Now I'll be able to actually read it from JATS XML via pandoc! 👍
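
For downstream tools, one way to consume this metadata is through pandoc's JSON output (`--to json`). The sketch below flattens the tagged metadata values into plain Python objects. The PANDOC_JSON sample is hand-written for illustration (including the API version number), and whether pandoc emits MetaString or MetaInlines for a given field is not assumed, so both are handled. The plain and inline_text helpers are hypothetical names.

```python
import json

# Hand-written sample in the shape of pandoc's JSON AST; not captured pandoc output.
PANDOC_JSON = """
{"pandoc-api-version": [1, 23],
 "meta": {
   "copyright": {"t": "MetaMap", "c": {
     "holder": {"t": "MetaInlines", "c": [
       {"t": "Str", "c": "Ellerman"}, {"t": "Space"},
       {"t": "Str", "c": "et"}, {"t": "Space"}, {"t": "Str", "c": "al"}]},
     "year": {"t": "MetaString", "c": "2023"}}},
   "license": {"t": "MetaMap", "c": {
     "link": {"t": "MetaString",
              "c": "https://creativecommons.org/licenses/by/4.0/"},
     "type": {"t": "MetaString", "c": "open-access"}}}},
 "blocks": []}
"""


def inline_text(inline):
    """Render one inline element as text; only Str and breaks are handled here."""
    if inline["t"] == "Str":
        return inline["c"]
    if inline["t"] in ("Space", "SoftBreak"):
        return " "
    return ""  # other inline types are ignored in this sketch


def plain(value):
    """Convert a tagged pandoc metadata value into plain Python data."""
    tag, content = value["t"], value.get("c")
    if tag == "MetaMap":
        return {key: plain(item) for key, item in content.items()}
    if tag == "MetaList":
        return [plain(item) for item in content]
    if tag in ("MetaString", "MetaBool"):
        return content
    if tag == "MetaInlines":
        return "".join(inline_text(inline) for inline in content)
    raise ValueError(f"unhandled metadata type: {tag}")  # e.g. MetaBlocks


document = json.loads(PANDOC_JSON)
metadata = {key: plain(value) for key, value in document["meta"].items()}
print(metadata["license"]["link"])
```

In real use, the JSON string would come from running pandoc on the JATS file rather than from an embedded sample.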

@castedo
Contributor Author

castedo commented Aug 29, 2023

and by "I" I really should say the open-source library epijats (gitlab.com/perm.pub/epijats) 😬

@kamoe
Contributor

kamoe commented Aug 30, 2023

@jgm PR incorporating the above is here: #9037

@jgm
Owner

jgm commented Aug 30, 2023

Closed by #9037.

@castedo
Contributor Author

castedo commented Sep 1, 2023

Nice enhancement! I just tested changing <ali:license_ref> to <crazy:license_ref> and moving the xmlns attribute around with pandoc 3.1.7 and this enhancement is working great even in this crazy XML corner case. Thanks!
