Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a pandoc-manubot-cite filter for pandoc #99

Merged
merged 56 commits into from
Jan 14, 2020

Conversation

dhimmel
Copy link
Member

@dhimmel dhimmel commented Mar 8, 2019

Create a pandoc filter providing manubot's cite-by-ID functionality.

Todo:

  • logging exception to nonzero exit status (off by default)
  • caching (off by default)
  • export references.json to file in addition to pandoc metadata?

@dhimmel
Copy link
Member Author

dhimmel commented Mar 11, 2019

As of 87d1e3c, I've implemented a proof of concept pandoc filter that extracts citationIds from pandoc's AST (abstract syntax tree) and replaces them with manubot citation_ids. The primary remaining step would be to generate CSL and store it in the appropriate pandoc metadata fields.

Since it seems feasible to create this filter, I thought I'd open the idea to discussion of its merits. The main benefits are:

  1. the filter could massively increase the userbase of manubot's cite-by-id functionality.
  2. the filter would use pandoc's parsing of documents to extract citations, helping fix bugs like how currently citations in code elements are modified (see Do not extract or replace citation strings inside code or code_blocks #13).

The downsides are:

  1. Increased support burden. More research is needed how the plugin will interact with common pandoc commands, especially those using bibtex citations.
  2. Currently, we use a less stringent method for matching citations than pandoc:

"""
Regex to extract citations.
Same rules as pandoc, except more permissive in the following ways:
1. the final character can be a slash because many URLs end in a slash.
2. underscores are allowed in internal characters because URLs, DOIs, and
citation tags often contain underscores.
If a citation string does not match this regex, it can be substituted for a
tag that does, as defined in citation-tags.tsv.
https://github.com/greenelab/manubot-rootstock/issues/2#issuecomment-312153192
Prototyped at https://regex101.com/r/s3Asz3/2
"""
citation_pattern = re.compile(
r'(?<!\w)@[a-zA-Z0-9][\w:.#$%&\-+?<>~/]*[a-zA-Z0-9/]')

If we want to switch from using manubot process to using a pandoc filter for citation-by-id for manubot manuscripts, then we will break some existing citation strings. See also discussion at manubot/rootstock#2 (comment), where @tarleb initially suggested using a pandoc filter for this purpose. One aspect will be how often are persistent identifiers invalid pandoc citations due to forbidden characters.

"""
parser = argparse.ArgumentParser(description='Pandoc filter for citation by persistent identifier')
parser.add_argument('target_format')
parser.add_argument('--pandocversion', help='The pandoc version.')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomduck, I copied this --pandocversion argument from pandoc-fignos. Do you know what it's for and whether it's needed? It doesn't seem to me that pandoc 2.5 supplies this argument to filters.

--filter=pandoc-manubot-cite \
--filter pandoc-citeproc \
manubot/pandoc_filter/tests/input-with-cites.md

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between --filter= and --filter (with a space)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

@dhimmel dhimmel force-pushed the pandoc-filter branch 2 times, most recently from c818c12 to 6c02ac7 Compare January 9, 2020 18:23
@dhimmel
Copy link
Member Author

dhimmel commented Jan 9, 2020

@agitter would be great for you to take a look at this. Would like to move forward with merging.

@agitter
Copy link
Member

agitter commented Jan 9, 2020

Okay, I'll try to make time Friday or Saturday.

Copy link
Member

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me at a high level. I don't have much knowledge about panflute and its data structures so some of the details of the AST processing are still opaque to me. Many of my questions/comments are seeking to confirm I understand the processing correctly.

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated
specifying --filter=pandoc-manubot-cite.

positional arguments:
target_format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the valid values? It would be helpful even if this has a help message that says "any valid Pandoc format" or something brief.

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
---
# yaml_metadata_block with pandoc metadata
citekey-aliases:
tag:meta-review: url:https://greenelab.github.io/meta-review/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tag:meta-review: syntax temporary to be backwards compatible? Or is this how you envision using tags/aliases in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backwards compatible syntax that is required until we implement #192. After we implement #192 then you will no longer need to prefix with tag:. But for now it's needed.


Citing pre-defined citekey aliases [@tag:meta-review; @tag:deep-review].

Citing @raw:dongbo-conversation whose metadata is already in pandoc's reference metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we see this being used? Is the idea that a valid Pandoc filter can't ignore valid reference metadata if someone were to use it outside the typical Manbot workflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that we should support the standard pandoc methods for defining reference metadata in addition to any manubot magic. That means we need to support metadata.bibliography (paths to bibliography files) and metadata.references (CSL JSON style reference metadata), since these are the standard pandoc ways to specify reference metadata.

@@ -27,14 +30,18 @@ def test_example_manuscript(manuscript):
"--output-directory",
str(manuscript_dir.joinpath("output")),
]
if skip_citations:
args.append("--skip-citations")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to test that skip citations actually skips the citation processing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really because I view manubot process as a command we're phasing out and test takes time to write. Relying on manubot/rootstock#300 to test --skip-citations actually works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--skip-citations is working in https://github.com/manubot/rootstock/runs/387673500 / manubot/rootstock@49dcfcd which relies on citation-tags.tsv.

@dhimmel
Copy link
Member Author

dhimmel commented Jan 13, 2020

@dhimmel dhimmel requested a review from agitter January 13, 2020 22:37
@dhimmel
Copy link
Member Author

dhimmel commented Jan 14, 2020

@agitter I'm going to merge and address additional feedback in subsequent PRs.

@dhimmel dhimmel merged commit 5ac49d7 into manubot:master Jan 14, 2020
@dhimmel dhimmel deleted the pandoc-filter branch January 14, 2020 14:47
@agitter
Copy link
Member

agitter commented Jan 14, 2020

I'm going to merge and address additional feedback in subsequent PRs.

Sounds good to me. I'll look over your responses and new commits here later in the week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants