Create a pandoc-manubot-cite filter for pandoc #99

Draft
wants to merge 23 commits into
base: master
from

2 participants
@dhimmel
Member

commented Mar 8, 2019

This is an experimental PR to see if it would be easy to create a pandoc filter providing manubot's cite-by-ID functionality.

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch from 20e9297 to 87d1e3c Mar 11, 2019

@dhimmel

Member Author

commented Mar 11, 2019

As of 87d1e3c, I've implemented a proof-of-concept pandoc filter that extracts citationIds from pandoc's AST (abstract syntax tree) and replaces them with manubot citation_ids. The main remaining step is to generate CSL and store it in the appropriate pandoc metadata fields.
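For readers unfamiliar with pandoc JSON filters, the approach described here can be sketched roughly as follows. This is not the code from 87d1e3c: it is a minimal illustration that walks the JSON AST with the standard library only, and `standardize` is a hypothetical placeholder (it merely lowercases) for whatever manubot does to turn a raw citationId into a manubot citation_id. The names `rewrite_citation_ids` and `run_filter` are likewise assumptions made for this sketch.

```python
import json
import sys


def standardize(citation_id):
    """Hypothetical stand-in for manubot's ID handling; lowercasing is a placeholder."""
    return citation_id.lower()


def rewrite_citation_ids(node, seen):
    """Recursively walk a pandoc JSON AST and rewrite every citationId in place.

    Returns the list of rewritten IDs in document order.
    """
    if isinstance(node, list):
        for child in node:
            rewrite_citation_ids(child, seen)
    elif isinstance(node, dict):
        if node.get("t") == "Cite":
            # A Cite element's content is [list of citations, list of inlines].
            citations, _inlines = node["c"]
            for citation in citations:
                new_id = standardize(citation["citationId"])
                citation["citationId"] = new_id
                seen.append(new_id)
        for value in node.values():
            rewrite_citation_ids(value, seen)
    return seen


def run_filter(stdin=sys.stdin, stdout=sys.stdout):
    """Read a pandoc JSON document, rewrite its citation IDs, and write it back."""
    doc = json.load(stdin)
    rewrite_citation_ids(doc["blocks"], [])
    json.dump(doc, stdout)
```

A script exposing this as a filter would call `run_filter()` from its entry point. Generating CSL for the collected IDs and storing it in the document metadata, the remaining step mentioned above, is not covered by this sketch.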

Since it seems feasible to create this filter, I thought I'd open the idea to discussion of its merits. The main benefits are:

  1. The filter could greatly increase the userbase of manubot's cite-by-ID functionality.
  2. The filter would rely on pandoc's own parsing of documents to extract citations, which would help fix bugs such as citations inside code elements currently being modified (see #13).

The downsides are:

  1. Increased support burden. More research is needed on how the filter will interact with common pandoc commands, especially those using BibTeX citations.
  2. Currently, we use a less stringent method for matching citations than pandoc does:

import re

"""
Regex to extract citations.
Same rules as pandoc, except more permissive in the following ways:
1. the final character can be a slash because many URLs end in a slash.
2. underscores are allowed in internal characters because URLs, DOIs, and
citation tags often contain underscores.
If a citation string does not match this regex, it can be substituted for a
tag that does, as defined in citation-tags.tsv.
https://github.com/greenelab/manubot-rootstock/issues/2#issuecomment-312153192
Prototyped at https://regex101.com/r/s3Asz3/2
"""
citation_pattern = re.compile(
r'(?<!\w)@[a-zA-Z0-9][\w:.#$%&\-+?<>~/]*[a-zA-Z0-9/]')

If we want to switch from using manubot process to using a pandoc filter for citation-by-id in manubot manuscripts, then we will break some existing citation strings. See also the discussion at manubot/rootstock#2 (comment), where @tarleb initially suggested using a pandoc filter for this purpose. One open question is how often persistent identifiers are invalid pandoc citations because they contain forbidden characters.
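To make the permissive matching concrete, here is a small sketch that runs the pattern quoted above against a sample string. The sample text, the DOI, and the URL are made up for illustration, and `find_citations` is a hypothetical helper name, not a manubot function.

```python
import re

# Pattern copied from the excerpt quoted in this comment.
citation_pattern = re.compile(
    r'(?<!\w)@[a-zA-Z0-9][\w:.#$%&\-+?<>~/]*[a-zA-Z0-9/]')


def find_citations(text):
    """Return every citation string (including the leading @) found in text."""
    return citation_pattern.findall(text)


sample = (
    "See @doi:10.1234/example_id, @url:https://example.org/a_b/. "
    "Contact user@example.com."
)
print(find_citations(sample))
# → ['@doi:10.1234/example_id', '@url:https://example.org/a_b/']
```

The internal underscore and the trailing slash are both kept, while the sentence-ending period is dropped and the email address is skipped because its `@` follows a word character. Whether pandoc would accept each of these strings as a citation is exactly the open question raised above, so this sketch does not attempt to reproduce pandoc's rules.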

"""
parser = argparse.ArgumentParser(description='Pandoc filter for citation by persistent identifier')
parser.add_argument('target_format')
parser.add_argument('--pandocversion', help='The pandoc version.')

@dhimmel

dhimmel Mar 11, 2019

Author Member

@tomduck, I copied this --pandocversion argument from pandoc-fignos. Do you know what it's for and whether it's needed? It doesn't seem to me that pandoc 2.5 supplies this argument to filters.
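As a point of reference for the question above, pandoc passes the name of the output format to a JSON filter as its first argument. The sketch below rebuilds the parser from the reviewed lines (wrapped in a hypothetical `build_parser` helper) and shows that, because `--pandocversion` is optional, parsing succeeds when only the format is supplied. It says nothing about whether any pandoc version actually sends that flag.

```python
import argparse


def build_parser():
    """Rebuild the argument parser from the reviewed snippet."""
    parser = argparse.ArgumentParser(
        description="Pandoc filter for citation by persistent identifier"
    )
    # pandoc supplies the output format as the first positional argument.
    parser.add_argument("target_format")
    # Optional flag: stays None unless the caller supplies it.
    parser.add_argument("--pandocversion", help="The pandoc version.")
    return parser


args = build_parser().parse_args(["html"])
print(args.target_format, args.pandocversion)
# → html None
```

Under this reading, keeping the argument is harmless for callers that omit it, and removing it would only matter if some caller passes it explicitly.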

dhimmel added some commits Mar 11, 2019

--filter=pandoc-manubot-cite \
--filter pandoc-citeproc \
manubot/pandoc_filter/tests/input-with-cites.md

@slochower

slochower Mar 12, 2019

Collaborator

Is there a difference between --filter= and --filter (with a space)?

@dhimmel

dhimmel Mar 13, 2019

Author Member

No

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch from cb8bce6 to fa4198f Mar 13, 2019

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch from 5431650 to b09d6c4 Mar 15, 2019

dhimmel added some commits Mar 15, 2019

dhimmel added some commits Mar 26, 2019

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch from c196751 to b2c505a Mar 26, 2019

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch 2 times, most recently from d68ee49 to c811714 Mar 28, 2019

@dhimmel dhimmel force-pushed the dhimmel:pandoc-filter branch from c811714 to 6480168 Mar 28, 2019

@dhimmel dhimmel referenced this pull request Mar 28, 2019

Merged

Create pandoc submodule #103

@agitter agitter referenced this pull request Apr 22, 2019

Open

Manubot and Bookdown #212

@dhimmel dhimmel referenced this pull request Apr 25, 2019

Merged

Allow multiple manual-reference files of many formats #104

3 of 3 tasks complete