PMC JATS dates are incorrectly represented by pandoc metadata #8865

Closed
castedo opened this issue May 23, 2023 · 6 comments

@castedo
Contributor

castedo commented May 23, 2023

I submit this issue because @kamoe was interested in seeing cases like this. My opinion is that fixing an issue like this is out of scope for pandoc.

My advice for extracting PMC JATS specific metadata is to not use pandoc for that and instead use an XML parser. #8359 has more discussion and a list of JATS dialects.

Pandoc is a great tool for converting between many different formats. I think it is the wrong choice for extracting PMC JATS specific metadata out of the millions of JATS XML files for published journal articles archived by PMC for the long term.

Here is a summary of the attached jats.xml.txt:

<article ...>
  <front>
    ...
    <article-meta>
      ...
      <pub-date pub-type="collection">
        <month>3</month>
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub" iso-8601-date="2021-12-13">
        <day>13</day>
        <month>12</month>
        <year>2021</year>
      </pub-date>
      <pub-date pub-type="pmc-release">
        <day>13</day>
        <month>12</month>
        <year>2021</year>
      </pub-date>
      <volume>220</volume>
      <issue>3</issue>
      <history>
        <date date-type="received">
          <day>22</day>
          <month>9</month>
          <year>2021</year>
        </date>
        <date date-type="accepted">
          <day>3</day>
          <month>12</month>
          <year>2021</year>
        </date>
        <date date-type="corrected-typeset">
          <day>08</day>
          <month>2</month>
          <year>2022</year>
        </date>
      </history>
      ...
    </article-meta>
  </front>
  ...
</article>

which is a simplification of the PMC JATS XML file of article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9176297/
This is not a contrived example. This is the very first JATS XML file I picked out of the millions of JATS XML files archived in the PMC Open Access Subset.

At least four of these dates show up in either the HTML page or the PDF file for this article. Arguably the most important one is the "Published online" date which is 2021 Dec 13.

Here is what pandoc returns as metadata:

$ pandoc jats.xml.txt --from jats -s --to json | jq .meta
{
  "date": {
    "t": "MetaInlines",
    "c": [
      {
        "t": "Str",
        "c": "2022-3"
      }
    ]
  }
}

Not only does pandoc return text that isn't a full date (and, I suspect, isn't even a valid ISO 8601 year-month), the month it returns isn't the month of the date one would choose as the single date for the document. That date would be the one that PMC shows prominently on the HTML page and PDF file: 2021 Dec 13, which is not in March 2022.
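The suspicion about `2022-3` can be checked mechanically. This is a minimal sketch, assuming only the extended calendar forms of ISO 8601 (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`) are of interest. The function name is hypothetical, and the pattern does not check the number of days in each month.

```python
import re

# Extended calendar-date forms only: YYYY, YYYY-MM, YYYY-MM-DD.
# Assumption: week dates, ordinal dates and the basic (no-hyphen) form
# are out of scope for this check.
_ISO_CALENDAR = re.compile(
    r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?"
)

def is_iso_calendar_date(text: str) -> bool:
    """Return True if text is a zero-padded year, year-month, or full date."""
    return _ISO_CALENDAR.fullmatch(text) is not None

for candidate in ("2022-3", "2022-03", "2021-12-13"):
    print(candidate, is_iso_calendar_date(candidate))
```

Under this check the unpadded month is rejected, while the zero-padded year-month and the full date are accepted.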

In addition to getting an actual date, and one that makes sense, one would want to get the date-type attribute value to know what kind of date one is looking at. Lastly, I think one would want a PMC JATS parser to return a list of dates, not just one.
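As an illustration of the XML-parser route, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It returns every `<pub-date>` and `<date>` element together with its type attribute. The `SAMPLE` string is a trimmed copy of the summary above, and `list_dates` is a hypothetical helper name. Real PMC files may contain `<date>` elements in other contexts, so treat this as a starting point rather than a complete PMC JATS parser.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the dates from the summary above (elided parts omitted).
SAMPLE = """<article><front><article-meta>
  <pub-date pub-type="collection"><month>3</month><year>2022</year></pub-date>
  <pub-date pub-type="epub" iso-8601-date="2021-12-13">
    <day>13</day><month>12</month><year>2021</year>
  </pub-date>
  <history>
    <date date-type="received"><day>22</day><month>9</month><year>2021</year></date>
  </history>
</article-meta></front></article>"""

def list_dates(xml_text):
    """Return one dict per <pub-date> or <date> element, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        if elem.tag not in ("pub-date", "date"):
            continue
        found.append({
            "element": elem.tag,
            # In this sample, pub-date carries pub-type and history
            # dates carry date-type.
            "type": elem.get("pub-type") or elem.get("date-type"),
            "year": elem.findtext("year"),
            "month": elem.findtext("month"),
            "day": elem.findtext("day"),  # None when the element is absent
        })
    return found

for entry in list_dates(SAMPLE):
    print(entry)
```

The values are kept as the raw strings found in the XML, so a caller can decide separately how to normalise them and which date, if any, to treat as primary.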

@kamoe
Contributor

kamoe commented Sep 4, 2023

This is a similar problem to #8866.
If I play around with a more structured approach, I get the following native representation:

Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "date"
            , MetaMap
                (fromList
                   [ ( "month" , MetaString "3" )
                   , ( "type" , MetaString "collection" )
                   , ( "year" , MetaString "2022" )
                   ])
            )
          ]
    }

Which, when converted to markdown, works all right and gives us:

---
date:
  month: 3
  type: collection
  year: 2022
---

But when converted to DocBook, yields:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE article>
<article
  xmlns="http://docbook.org/ns/docbook" version="5.0"
  xmlns:xlink="http://www.w3.org/1999/xlink" >
  <info>
    <title></title>
    <date>true</date>
  </info>
  
</article>

In contrast, the original one-liner representation of the date gives the following in DocBook:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE article>
<article
  xmlns="http://docbook.org/ns/docbook" version="5.0"
  xmlns:xlink="http://www.w3.org/1999/xlink" >
  <info>
    <title></title>
    <date>2022-3</date>
  </info>
  
</article>

So the problem is not that JATS dates are incorrectly represented; it is that, across formats, there is no agreed multi-level structure that ensures information not stored as a one-liner won't be lost. I could go ahead and allow a multi-level, more complex representation in the native format, but if I do that, crucial information will be lost at some point, for some formats. As @jgm said, this is much more work than it seems.
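The trade-off can be made concrete with a small sketch that collapses a structured date back into a one-liner before a writer sees it. It assumes the pandoc JSON form of the structured date mirrors the native form shown above (a `MetaMap` of `MetaString` values). That structured form is experimental, and `flatten_date` is a hypothetical helper, not part of pandoc.

```python
def flatten_date(meta):
    """Replace a MetaMap 'date' with a MetaInlines holding one date string.

    `meta` is the "meta" object of a pandoc JSON document. Anything that is
    not a MetaMap with a year is returned unchanged.
    """
    date = meta.get("date")
    if not date or date.get("t") != "MetaMap":
        return meta
    fields = {key: val["c"] for key, val in date["c"].items()
              if val.get("t") == "MetaString"}
    if "year" not in fields:
        return meta
    parts = [fields["year"]]
    if "month" in fields:
        parts.append(fields["month"].zfill(2))  # "3" becomes "03"
        if "day" in fields:
            parts.append(fields["day"].zfill(2))
    flat = dict(meta)
    flat["date"] = {"t": "MetaInlines",
                    "c": [{"t": "Str", "c": "-".join(parts)}]}
    return flat

# Assumed JSON rendering of the structured date shown above.
structured = {"date": {"t": "MetaMap", "c": {
    "month": {"t": "MetaString", "c": "3"},
    "type": {"t": "MetaString", "c": "collection"},
    "year": {"t": "MetaString", "c": "2022"},
}}}
print(flatten_date(structured))
```

The flattened value survives in writers that expect a single string, but the `type` field is dropped along the way, which is exactly the kind of loss described above.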

My answer to this is the same as for #8866: given that this involves not only the JATS reader but also a number of writers, I personally prefer not to approach it until I have understood how to propose a more coordinated approach.

Of course, if someone else has an alternative solution, I would be curious to see it.

@castedo
Contributor Author

castedo commented Sep 4, 2023

Seems to me there are fundamental long-term problems with dates in PMC JATS. I suspect they will always be incompatible with dating in other document formats.

FWIW, my current thinking for "Baseprints JATS" is that only one single date is internal and stored inside the baseprint. I'm currently using the intentionally ambiguous name "Author Date" for this single internal date. Other dates are external and not stored inside the baseprint, like when the baseprint is publicly archived. These other external dates are evidence from sources separate from the baseprint itself.

For a single date internal to a document I think a string in ISO 8601 format is a great standard and there is little need for multi-level dates in the pandoc object model or XML. A string in ISO 8601 sounds like a standard that works well with the current pandoc implementation in all other formats.

@jgm
Owner

jgm commented Sep 4, 2023

> For a single date internal to a document I think a string in ISO 8601 format is a great standard

Agreed. And it looks like that's what we were trying to achieve in the reader (but not completely successfully).

@jgm
Owner

jgm commented Sep 4, 2023

I've fixed the problem with the month (it will now be 03 rather than 3).
As for the problem of selecting between different pub-date elements with different pub-types, I'm not quite sure how to solve that.
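To show why the selection is a policy decision rather than a parsing detail, here is a sketch of one possible rule: a fixed preference order over `pub-type` values. The order below is purely an assumption for illustration and uses only the types that appear in the example file. The thread does not settle on any such policy, and `pick_pub_date` is a hypothetical name.

```python
# Hypothetical preference order; nothing in this thread endorses it.
PREFERENCE = ("epub", "collection", "pmc-release")

def pick_pub_date(pub_dates, preference=PREFERENCE):
    """Choose one (pub_type, date_string) pair from a list in document order.

    Returns the pair whose type ranks highest in `preference`, falling back
    to the first pair when no type is recognised, or None for an empty list.
    """
    for wanted in preference:
        for pub_type, value in pub_dates:
            if pub_type == wanted:
                return (pub_type, value)
    return pub_dates[0] if pub_dates else None

dates = [
    ("collection", "2022-03"),
    ("epub", "2021-12-13"),
    ("pmc-release", "2021-12-13"),
]
print(pick_pub_date(dates))
```

With this order the "Published online" date is chosen for the example article, but a different JATS dialect or publisher could reasonably call for a different order, so any such list would encode a judgment rather than a fact about the format.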

@castedo
Contributor Author

castedo commented Sep 4, 2023

I have a great solution: take the average of all the dates; it's a robust estimate using all the data! 😆

For PMC JATS I don't think there is a good solution. At least for my use case of a subset of JATS (with only an "author date") the current pandoc behaviour works fine.

@jgm
Owner

jgm commented Sep 4, 2023

Should we close this?

@jgm closed this as completed Sep 4, 2023