build: common intermediate JSON for all pandoc outputs #196

Closed
wants to merge 2 commits into from


dhimmel
Member

@dhimmel dhimmel commented Mar 18, 2019

We use pandoc to convert from markdown to several output formats: currently HTML, DOCX, and PDF, with JATS planned for the future. Would it make sense to streamline the pandoc pipeline by performing the shared conversion steps once and saving the result as an intermediate Pandoc JSON abstract syntax tree?

The benefits would be more assurance that different outputs include the same processing filters. The output-specific steps would be more clearly delineated from the common processing steps. Build times may improve slightly. We will also retain the manuscript.json file, which could be helpful for debugging and converting to additional formats down the road.
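A minimal sketch of the two-stage layout proposed here, assuming the build is driven from Python. The function names and the `content.md` input path are hypothetical, and the filter and metadata arguments from the real build are left out; only the `--from`, `--to`, and `--output` pandoc options are used.

```python
from typing import Dict, List


def common_stage(markdown_paths: List[str], json_path: str) -> List[str]:
    """Shared stage: convert markdown to a Pandoc JSON AST once."""
    return ["pandoc", "--from=markdown", "--to=json",
            f"--output={json_path}", *markdown_paths]


def output_stage(json_path: str, to_format: str, output_path: str) -> List[str]:
    """Output-specific stage: read the saved AST and write one format."""
    return ["pandoc", "--from=json", f"--to={to_format}",
            f"--output={output_path}", json_path]


def build_plan(markdown_paths: List[str], json_path: str,
               targets: Dict[str, str]) -> List[List[str]]:
    """Return the shared command followed by one command per target format."""
    plan = [common_stage(markdown_paths, json_path)]
    for to_format, output_path in targets.items():
        plan.append(output_stage(json_path, to_format, output_path))
    return plan


# Hypothetical inputs and outputs; manuscript.json is the retained intermediate file.
plan = build_plan(
    ["content.md"],
    "manuscript.json",
    {"html5": "manuscript.html", "docx": "manuscript.docx"},
)
```

Each command in `plan` could then be run in order, for example with `subprocess.run(command, check=True)`.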

Draft.

Refs jgm/pandoc#3211 (comment):

I don't think the output is completely deterministic, because of maps in YAML metadata. (JSON serialization of hash maps is not deterministic.)

Todo:

  • look into broken table / equation links
  • upgrade pandoc
  • reduce JSON indentation
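One possible way to approach both the determinism remark quoted above and the JSON indentation item, assuming it is acceptable to post-process manuscript.json in Python. This is only a sketch, not something this pull request implements, and `canonical_json` is a hypothetical helper name.

```python
import json


def canonical_json(ast: dict) -> str:
    """Serialize with sorted keys and no indentation or extra whitespace,
    so the same AST always produces the same text."""
    return json.dumps(ast, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


# Two dicts with the same content but different key insertion order.
first = {"meta": {"b": 1, "a": 2}, "blocks": []}
second = {"blocks": [], "meta": {"a": 2, "b": 1}}
same_text = canonical_json(first) == canonical_json(second)
```

Sorting keys only normalizes the order of JSON object members; it would not change the order of blocks or inlines, which are JSON arrays.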

@dhimmel
Member Author

dhimmel commented Mar 18, 2019

Exporting to an intermediate JSON breaks table & equation anchors

Our project uses pandoc to export to multiple formats (HTML, PDF, DOCX, and in the future JATS). Here we investigate using a unified pandoc command to convert to Pandoc's Abstract Syntax Tree (i.e., --to=json). To number figures, equations, and tables, we use the pandoc-fignos, pandoc-eqnos, and pandoc-tablenos filters by @tomduck.

For some reason, 896d2ca breaks the anchors for equations and tables, but not figures. As seen in the build log, WeasyPrint complains about the broken anchors:

ERROR: No anchor #tbl:bowling-scores for internal URI reference
ERROR: No anchor #eq:regular-equation for internal URI reference
ERROR: No anchor #eq:long-equation for internal URI reference

In the HTML output prior to this pull request:

<span>Table 1:</span> A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tbl:bowling-scores">

In the HTML output when processing through an intermediate JSON AST:

Table 1: A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tables">

Note that href="#tbl:bowling-scores" has changed to href="#tables". #tbl:bowling-scores is the anchor we need to allow the internal cross-references to resolve.
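A rough way to catch this kind of breakage before WeasyPrint does is to compare internal `href` targets against the `id` attributes present in the HTML. The sketch below uses only the Python standard library; the class and function names are hypothetical, and it assumes fragments are not percent-encoded.

```python
from html.parser import HTMLParser
from typing import List


class AnchorCollector(HTMLParser):
    """Collect every id attribute and every internal link target (#fragment)."""

    def __init__(self) -> None:
        super().__init__()
        self.ids = set()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        href = attributes.get("href")
        if tag == "a" and href and href.startswith("#") and len(href) > 1:
            self.targets.append(href[1:])


def broken_internal_links(html: str) -> List[str]:
    """Return the sorted fragment targets that have no matching id."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return sorted({t for t in collector.targets if t not in collector.ids})


# Illustrative snippet: the table anchor exists, the equation anchor does not.
sample = (
    '<div id="tbl:bowling-scores"><table></table></div>'
    '<a href="#tbl:bowling-scores">Table 1</a>'
    '<a href="#eq:regular-equation">Eq. 1</a>'
)
missing = broken_internal_links(sample)
```

Running the same check on the HTML produced before and after the change would show which anchors disappeared.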

@jgm / @vincerubinetti any idea what could be breaking? Does the AST strip out equation and table identifiers?

@dhimmel
Member Author

dhimmel commented Mar 21, 2019

These broken table & equation (but not figure) anchors have been difficult for me to diagnose. My current understanding is that pandoc does not provide a way to set the id for tables or equations (see tomduck/pandoc-tablenos#11 & lierdakil/pandoc-crossref#30 (comment)). Therefore, the filters -- such as pandoc-xnos and pandoc-citeproc -- each have their own approach to creating anchors. Now it seems that when I change the output format to json, pandoc-eqnos and pandoc-tablenos no longer set the href to what was specified.

I tried using pandoc-crossref with the intermediate JSON AST workflow, and pandoc-crossref did set divs properly. We originally chose pandoc-fig/eq/tablenos over pandoc-crossref in #1 due to their easier inclusion in our conda-managed environment. However, now pandoc-crossref is available on conda-forge. The in-markdown syntax is mostly the same. Therefore it is possible we could switch if we decide an intermediate JSON stage is a good idea.

A few questions for @lierdakil:

  • We convert from markdown to several output formats. Does it make sense to try to consolidate as much of the processing as possible into a single conversion to Pandoc JSON?
  • Does the behavior of pandoc-crossref vary based on the output format?

@lierdakil

lierdakil commented Mar 22, 2019

We convert from markdown to several output formats. Does it make sense to try to consolidate as much of the processing as possible into a single conversion to Pandoc JSON?

Honestly, not sure. You might save a few CPU cycles this way, but I can't say off the top of my head how many exactly. One thing I will note is that the approach you're taking in this pull request does not make sense. If you're using Pandoc JSON anyway, it seems logical to use filters as a pipe instead of passing those as --filter arguments. The first argument to the filter binary may be a string identifying the output format (also, that's probably why pandoc-tablenos doesn't do what you expect it to do). So, for instance:

# The html5 argument may be omitted or replaced with another output format.
pandoc --verbose \
  --from=markdown \
  --to=json \
  --metadata bibliography=$BIBLIOGRAPHY_PATH \
  --metadata csl=$CSL_PATH \
  --metadata link-citations=true \
  ... \
  | python pandoc-fignos.py html5 \
  | python pandoc-eqnos.py html5 \
  | python pandoc-tablenos.py html5 \
  | ...
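To illustrate why the format argument matters in pipe mode, here is a minimal sketch of a JSON filter that reads the target format from its first argument. It is not the pandoc-xnos code: `run_filter` and the `filter-target-format` metadata key are hypothetical, and the `pandoc-api-version` value in the demo is only illustrative.

```python
import io
import json


def run_filter(stdin, stdout, argv):
    """Read a Pandoc JSON AST from stdin, note the target format, write it back.

    When used as `... | python this_filter.py html5`, argv[1] is "html5".
    When no argument is given, the filter cannot know the final output format.
    """
    target_format = argv[1] if len(argv) > 1 else ""
    doc = json.load(stdin)
    # Hypothetical metadata key, used here only to make the format visible.
    meta = doc.setdefault("meta", {})
    meta["filter-target-format"] = {"t": "MetaString", "c": target_format}
    json.dump(doc, stdout)
    return target_format


# Simulate `pandoc --to=json ... | python this_filter.py html5` in memory.
source = io.StringIO('{"pandoc-api-version": [1, 17, 5, 4], "meta": {}, "blocks": []}')
sink = io.StringIO()
seen_format = run_filter(source, sink, ["this_filter.py", "html5"])
```

A real filter script would call `run_filter(sys.stdin, sys.stdout, sys.argv)` and branch on the format where its behavior differs by output.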

Does the behavior of pandoc-crossref vary based on the output format?

It does. Specifically, output is slightly adjusted in case output format is docx (aka OOXML), to apply caption style to some ad-hoc captions, and is adjusted a lot when outputting to LaTeX (or PDF via LaTeX).

If you use a JSON intermediary explicitly, pandoc-crossref will not do that, instead showing "generic" behaviour, unless you also pass the output format as the first argument to the pandoc-crossref binary (i.e. pandoc -f markdown -t json | pandoc-crossref docx | pandoc -f json -t docx). That might be either a good thing or a bad thing depending on how you look at it. One particular thing to note is that LaTeX output would need a bit of additional tweaking to work right in the "generic mode" (for one, without tweaks, figure and table captions will have duplicate Figure and Table headers and numbers, one added by pandoc-crossref, and another by LaTeX itself).

Fair warning: LaTeX output in pandoc-crossref is a bit of a different beast from the rest of those. Outputting to, say, docx and LaTeX with default options will produce results that differ in surprising ways. If you want consistency, using pandoc-crossref in "pipe mode" (i.e. pandoc -f markdown -t json | pandoc-crossref | pandoc -f json -t latex) will likely be an easier option.
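The two pipe-mode variants described above can be sketched as command lists, assuming the pipeline is driven from Python rather than a shell. The helper names are hypothetical; the commands themselves mirror the ones quoted in this comment.

```python
import subprocess
from typing import List, Optional


def crossref_pipeline(to_format: str,
                      crossref_format: Optional[str] = None) -> List[List[str]]:
    """Build the three stages: markdown -> JSON, pandoc-crossref, JSON -> output.

    With crossref_format=None, pandoc-crossref runs in its "generic" mode;
    otherwise the format is passed as its first argument.
    """
    crossref = ["pandoc-crossref"]
    if crossref_format is not None:
        crossref.append(crossref_format)
    return [
        ["pandoc", "-f", "markdown", "-t", "json"],
        crossref,
        ["pandoc", "-f", "json", "-t", to_format],
    ]


def run_pipeline(stages: List[List[str]], markdown_text: str) -> bytes:
    """Feed each stage's stdout into the next stage's stdin."""
    data = markdown_text.encode("utf-8")
    for command in stages:
        data = subprocess.run(command, input=data,
                              stdout=subprocess.PIPE, check=True).stdout
    return data


generic_latex = crossref_pipeline("latex")
docx_aware = crossref_pipeline("docx", crossref_format="docx")
```

Calling `run_pipeline(generic_latex, text)` requires pandoc and pandoc-crossref to be installed and on the PATH.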

@dhimmel
Member Author

dhimmel commented Mar 23, 2019

Thanks @lierdakil for your explanation. It seems that it really only makes sense to consolidate the conversion pipeline for filters that are output-format agnostic. Since none of the filters we currently use are actually output-format agnostic, I will close this pull request. If we do incorporate output-format agnostic filters, such as the cite-by-id filter under development in manubot/manubot#99, we can revisit this proposal.

Going forward we will keep pandoc-crossref in mind, especially if our current suite of pandoc-xnos filters starts experiencing limitations.
