CSL Data: empty date-parts in issued causing pandoc-citeproc to crash #66

Closed
dhimmel opened this issue Oct 9, 2018 · 3 comments

@dhimmel
Member

dhimmel commented Oct 9, 2018

Some Crossref DOI records have their issued date-parts set to null, such as 10.22541/au.149693987.70506124. In these situations, DOI Content Negotiation for CSL JSON returns:

    "issued": {
      "date-parts": [
        [
          null
        ]
      ]
    },

In the past, before we pruned invalid CSL using the CSL Data JSON Schema, we addressed this case with the following:

https://github.com/greenelab/manubot/blob/693fbb7758b5922add30ecaa6e30acb98426a977/manubot/cite/citeproc.py#L88-L95

Hence, we'd remove issued if the first element of the date-parts list in Python was None. In #49, we switched to using the JSON Schema to remove invalid fields and removed custom CSL fixing logic. Our hope was that CSL that passed the JSON Schema would be compliant with pandoc-citeproc.

The JSON Schema currently specifies that elements in date-parts arrays must be strings or numbers (which excludes null), but it does not specify a minItems of 1. Hence, our CSL Data pruning removes the null item but keeps the empty list:

    "issued": {
      "date-parts": [
        []
      ]
    },
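
The effect can be reproduced with a small sketch. The function `prune_date_parts` below is a hypothetical stand-in for schema-based pruning, not Manubot's actual code: it drops elements that are neither strings nor numbers and, like a schema without minItems, accepts whatever list remains.

```python
import copy


def prune_date_parts(csl_item):
    """Drop date-parts elements that are not strings or numbers.

    Hypothetical simplification of schema-based pruning: since no minItems
    constraint is applied, an emptied inner list is treated as valid and kept.
    """
    item = copy.deepcopy(csl_item)
    for value in item.values():
        if isinstance(value, dict) and "date-parts" in value:
            value["date-parts"] = [
                [part for part in date if isinstance(part, (str, int, float))]
                for date in value["date-parts"]
            ]
    return item


crossref_item = {
    "DOI": "10.22541/au.149693987.70506124",
    "issued": {"date-parts": [[None]]},
}
pruned = prune_date_parts(crossref_item)
print(pruned["issued"])  # {'date-parts': [[]]}
```

The null is gone, but the empty inner list survives, which is the shape that pandoc-citeproc fails to parse.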

This causes pandoc-citeproc to crash (as currently happens with manubot cite --render doi:10.22541/au.149693987.70506124 and in greenelab/meta-review#101 (comment)):

Error parsing references: Could not parse RefDate
Error running filter pandoc-citeproc:
Filter returned error status 1

This issue was discovered in greenelab/meta-review#101 (comment). The corresponding WIP PR to fix it is #65 (pending a solution).

@dhimmel
Member Author

dhimmel commented Oct 9, 2018

Solutions

This issue touches on many projects, and there are many possible solutions that should be explored concurrently:

  1. We (Manubot) should consider checking for empty date-parts arrays and removing their parent object. CC @agitter
  2. pandoc-citeproc should consider accepting empty date-parts and treating it the same as if the field were not supplied at all. CC @jgm
  3. The JSON Schema for CSL Data should consider setting minItems: 1 for the date-parts array. CC @rmzelle
  4. @CrossRef should consider omitting the issued object when the date is missing, rather than returning invalid CSL. Refs "Invalid types in CSL JSON items returned by DOI content negotiation" (CrossRef/rest-api-doc#187). CC @kjw @gbilder
  5. Authorea should set their @CrossRef issued metadata, see tweet. This is only a partial solution as many publishers probably have this issue.
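
As a rough illustration of solution 1, the sketch below removes any date field whose date-parts carry no values. The function name `remove_empty_dates` and the rule that any dict containing a "date-parts" key counts as a date field are assumptions made for this example, not Manubot's implementation.

```python
def remove_empty_dates(csl_item):
    """Return a copy of csl_item without date fields that have empty date-parts.

    Assumption: a dict-valued field with a "date-parts" key is a date field.
    A field is dropped when date-parts is missing values entirely ([] or None)
    or when any nested date array is empty ([[]]).
    """
    cleaned = {}
    for key, value in csl_item.items():
        if isinstance(value, dict) and "date-parts" in value:
            date_parts = value["date-parts"]
            if not date_parts or not all(date_parts):
                continue  # skip the parent object, e.g. "issued"
        cleaned[key] = value
    return cleaned


item = {"title": "Example", "issued": {"date-parts": [[]]}}
print(remove_empty_dates(item))  # {'title': 'Example'}
```

Dropping the whole parent object matches what solution 2 asks of pandoc-citeproc: an empty date is treated as if the field had never been supplied.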

@agitter
Member

agitter commented Oct 9, 2018

> We (Manubot) should consider checking for empty date-parts arrays and removing their parent object.

This makes sense to me as a temporary workaround. Our short term goal is to be able to upgrade manuscripts to use Manubot v0.2.0.

In the longer term, one or more of the other solutions (especially 2-4) would be helpful so that we don't have to treat this as a special case.

jgm added a commit to jgm/pandoc-citeproc that referenced this issue Oct 9, 2018
These seem to occur in the wild, and it seems better to parse them
as an empty date parts than to crash with a parse error.

See manubot/manubot#66.
dhimmel added a commit to dhimmel/csl-schema that referenced this issue Oct 15, 2018
Currently, the JSON Schema allows empty arrays in date-parts, which
can cause downstream citeproc utilities to crash. See
manubot/manubot#66

Set minItems to 1 for the date-parts array as well as the nested
arrays with year, month, day info.
dhimmel added a commit to dhimmel/csl-schema that referenced this issue Oct 23, 2018
Currently, the JSON Schema allows empty arrays in date-parts, which
can cause downstream citeproc utilities to crash. See
manubot/manubot#66

Set minItems to 1 for the date-parts array as well as the nested
arrays with year, month, day info.
dhimmel added a commit that referenced this issue Oct 23, 2018
Merges #65

Fixes empty issued date-parts bug reported in the following issues:
- Closes #66
- Closes #75
- Fixes greenelab/meta-review#101 (comment)

Recursively remove errors in remove_jsonschema_errors. Combined with a CSL
JSON schema that specifies minItems for date-parts, this change fixes the above
issues. See citation-style-language/schema#158 for the CSL
JSON schema changes that are intended to be present in the CSL schema loaded
by this package.

Tests removal of empty date-parts for issued object.
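
One way to picture how repeated error removal combines with minItems is the sketch below. It is not the code from #65: `find_errors` is a hypothetical hand-written checker standing in for JSON Schema validation, and its rule that an empty date object is invalid is an assumption made so the example terminates with the field removed. Each pass deletes the offending elements and validates again, so removing a null exposes an empty nested array, then an empty date-parts, and finally the issued object itself.

```python
import copy


def find_errors(item):
    """Return paths to elements that violate the toy rules (not the CSL schema)."""
    if "issued" not in item:
        return []
    issued = item["issued"]
    if not issued:
        return [("issued",)]  # assumed rule: an empty date object is invalid
    date_parts = issued.get("date-parts")
    if date_parts is None:
        return []
    if len(date_parts) < 1:
        return [("issued", "date-parts")]  # minItems: 1 on the outer array
    errors = []
    for i, date in enumerate(date_parts):
        if len(date) < 1:
            errors.append(("issued", "date-parts", i))  # minItems: 1 on nested arrays
        for j, part in enumerate(date):
            if not isinstance(part, (str, int, float)):
                errors.append(("issued", "date-parts", i, j))
    return errors


def delete_path(item, path):
    """Delete the element addressed by a path of dict keys and list indexes."""
    parent = item
    for key in path[:-1]:
        parent = parent[key]
    del parent[path[-1]]


def remove_errors(item):
    """Delete invalid elements and revalidate until no errors remain."""
    item = copy.deepcopy(item)
    while True:
        errors = find_errors(item)
        if not errors:
            return item
        # Delete higher indexes first so earlier deletions do not shift later ones.
        for path in sorted(errors, reverse=True):
            delete_path(item, path)


record = {"title": "Example", "issued": {"date-parts": [[None]]}}
print(remove_errors(record))  # {'title': 'Example'}
```

Without the minItems rules, the loop would stop after the first pass and leave the empty list in place.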
@dhimmel
Member Author

dhimmel commented Oct 23, 2018

We've addressed this issue for Manubot in #65 / 4e6a0f6. @jgm has updated pandoc-citeproc to not crash when encountering empty date-parts (thanks!) in jgm/pandoc-citeproc@bde816c. I opened a PR to update the CSL JSON schema in citation-style-language/schema#158. So overall things are looking good. Hopefully Crossref and Authorea will implement solutions 4 and 5, but that is not required for Manubot to work.

dhimmel added a commit to dhimmel/csl-schema that referenced this issue Oct 30, 2018
Currently, the JSON Schema allows empty arrays in date-parts, which
can cause downstream citeproc utilities to crash. See
manubot/manubot#66

Set minItems to 1 for the date-parts array as well as the nested
arrays with year, month, day info.
dhimmel added a commit to dhimmel/csl-schema that referenced this issue Dec 10, 2018
Currently, the JSON Schema allows empty arrays in date-parts, which
can cause downstream citeproc utilities to crash. See
manubot/manubot#66

Set minItems to 1 for the date-parts array as well as the nested
arrays with year, month, day info.
bdarcus pushed a commit to citation-style-language/schema that referenced this issue May 21, 2020
Currently, the JSON Schema allows empty arrays in date-parts, which
can cause downstream citeproc utilities to crash. See
manubot/manubot#66

Set minItems to 1 for the date-parts array as well as the nested
arrays with year, month, day info.