build: common intermediate JSON for all pandoc outputs #196

dhimmel · 2019-03-18T20:18:19Z

We use pandoc to convert from markdown to several output formats: HTML, DOCX, PDF, and, in the future, JATS. Does it make sense to streamline the pandoc pipeline by doing the shared conversion once and saving an intermediate Pandoc JSON abstract syntax tree?

The main benefit would be greater assurance that every output goes through the same processing filters. The output-specific steps would also be more clearly delineated from the common processing steps, and build times may improve slightly. We would also retain the manuscript.json file, which could be helpful for debugging and converting to additional formats down the road.

Draft.

Refs jgm/pandoc#3211 (comment):

I don't think the output is completely deterministic, because of maps in YAML metadata. (JSON serialization of hash maps is not deterministic.)

Todo:

look into broken table / equation links
upgrade pandoc
reduce JSON indentation
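
A possible way to handle both the determinism concern and the "reduce JSON indentation" item, sketched in Python under the assumption that manuscript.json is re-serialized after pandoc writes it. Sorting object keys and using compact separators gives byte-identical output for equal data. The function name `canonical_json` is hypothetical and not part of pandoc or the current build:

```python
import json


def canonical_json(text: str) -> str:
    """Re-serialize JSON text with sorted keys and no extra whitespace."""
    data = json.loads(text)
    # sort_keys makes the order of object keys reproducible;
    # compact separators drop indentation and padding.
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two serializations of the same data that differ only in key order.
a = '{"meta": {"title": "x", "csl": "y"}, "blocks": []}'
b = '{"blocks": [], "meta": {"csl": "y", "title": "x"}}'
print(canonical_json(a) == canonical_json(b))  # True
```

Whether this is worth an extra build step depends on how much we care about diffing manuscript.json between builds.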

dhimmel · 2019-03-18T20:59:33Z

Exporting to an intermediate JSON breaks table & equation anchors

Our project uses pandoc to export to multiple formats (HTML, PDF, DOCX, and in the future JATS). Here we investigate using a unified pandoc command to convert to Pandoc's Abstract Syntax Tree (i.e., --to=json). To number figures, equations, and tables, we use the pandoc-fignos, pandoc-eqnos, and pandoc-tablenos filters by @tomduck.

For some reason, 896d2ca breaks the anchors for equations and tables, but not figures. As seen in the build log, WeasyPrint complains about the broken anchors:

ERROR: No anchor #tbl:bowling-scores for internal URI reference
ERROR: No anchor #eq:regular-equation for internal URI reference
ERROR: No anchor #eq:long-equation for internal URI reference

In the HTML output prior to this pull request:

<span>Table 1:</span> A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tbl:bowling-scores">

In the HTML output when processing through an intermediate JSON AST:

Table 1: A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tables">

Note that href="#tbl:bowling-scores" has changed to href="#tables". #tbl:bowling-scores is the anchor we need to allow the internal cross-references to resolve.
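
To catch this kind of breakage without waiting for WeasyPrint, a check along the following lines might help. It is a minimal sketch using only the Python standard library; the names `AnchorCollector` and `broken_anchors` are hypothetical, and it only looks at `id` attributes and `href="#..."` links:

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute and every internal (#...) link target."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "href" and value.startswith("#") and len(value) > 1:
                self.targets.add(value[1:])


def broken_anchors(html: str) -> list:
    """Return the internal link targets that have no matching id, sorted."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return sorted(parser.targets - parser.ids)


# Simplified sample: the table id is missing, as in the JSON-intermediate output.
sample = (
    '<div id="tables"></div>'
    '<a href="#tbl:bowling-scores">Table 1</a>'
    '<a href="#tables">section</a>'
)
print(broken_anchors(sample))  # ['tbl:bowling-scores']
```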

@jgm / @vincerubinetti any idea what could be breaking? Does the AST strip out equation and table identifiers?

dhimmel · 2019-03-21T19:15:42Z

These broken table & equation (but not figure) anchors have been difficult for me to diagnose. My current understanding is that pandoc does not provide a way to set the id for tables or equations (see tomduck/pandoc-tablenos#11 & lierdakil/pandoc-crossref#30 (comment)). Therefore, the filters -- such as pandoc-xnos and pandoc-citeproc -- have their own approach to creating anchors. Now it seems that when I change the output format to json, pandoc-eqnos and pandoc-tablenos no longer set the href to what was specified.

I tried using pandoc-crossref with the intermediate JSON AST workflow, and pandoc-crossref did set divs properly. We originally chose pandoc-fig/eq/tablenos over pandoc-crossref in #1 due to their easier inclusion in our conda-managed environment. However, now pandoc-crossref is available on conda-forge. The in-markdown syntax is mostly the same. Therefore it is possible we could switch if we decide an intermediate JSON stage is a good idea.

A few questions for @lierdakil:

We convert from markdown to several output formats. Does it make sense to try to consolidate as much of the processing as possible into a single conversion to Pandoc JSON?
Does the behavior of pandoc-crossref vary based on the output format?

lierdakil · 2019-03-22T02:37:52Z

We convert from markdown to several output formats. Does it make sense to try to consolidate as much of the processing as possible into a single conversion to Pandoc JSON?

Honestly, not sure. You might save a few CPU cycles this way, but I can't say off the top of my head how many exactly. One thing I will note is that the approach you're taking in this pull request does not make sense. If you're using Pandoc JSON anyway, it seems logical to use filters as a pipe instead of passing them as --filter arguments. The first argument to the filter binary may be a string identifying the output format (with --to=json, a filter run via --filter is told the format is json, which is probably why pandoc-tablenos doesn't do what you expect). So, for instance:

# The filter argument (html5 here) names the output format;
# it can be omitted or replaced with another format.
pandoc --verbose \
  --from=markdown \
  --to=json \
  --metadata bibliography="$BIBLIOGRAPHY_PATH" \
  --metadata csl="$CSL_PATH" \
  --metadata link-citations=true \
  ... |
  python pandoc-fignos.py html5 |
  python pandoc-eqnos.py html5 |
  python pandoc-tablenos.py html5 |
  ...
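
For illustration, here is roughly what "pipe mode" looks like from the filter's side. This is a minimal sketch using only the Python standard library: it reads the JSON AST on stdin, checks its first argument for the output format, and writes the AST back to stdout. The names `target_format` and `tag_format` and the `filter-format` metadata key are hypothetical, and this is not how pandoc-fignos or pandoc-crossref are actually implemented:

```python
import json
import sys


def target_format(argv):
    """Return the output format given as the first argument, or '' if absent."""
    return argv[1] if len(argv) > 1 else ""


def tag_format(doc, fmt):
    """Record the format in the document metadata (stand-in for real filter logic)."""
    doc.setdefault("meta", {})["filter-format"] = {"t": "MetaString", "c": fmt}
    return doc


def main(argv, stdin, stdout):
    text = stdin.read()
    if not text.strip():
        return  # nothing was piped in
    doc = json.loads(text)
    json.dump(tag_format(doc, target_format(argv)), stdout)


# Only act when something is actually piped into the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    main(sys.argv, sys.stdin, sys.stdout)
```

Saved under a hypothetical name such as tag_format.py, it would slot into the pipeline like the filters above: pandoc -f markdown -t json | python tag_format.py html5 | pandoc -f json -t html5.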

Does the behavior of pandoc-crossref vary based on the output format?

It does. Specifically, the output is slightly adjusted when the output format is docx (aka OOXML), to apply a caption style to some ad-hoc captions, and is adjusted a lot when outputting to LaTeX (or PDF via LaTeX).

If you use a JSON intermediary explicitly, pandoc-crossref will not do that and will instead show "generic" behaviour, unless you also pass the output format as the first argument to the pandoc-crossref binary (i.e. pandoc -f markdown -t json | pandoc-crossref docx | pandoc -f json -t docx). That might be either a good thing or a bad thing depending on how you look at it. One particular thing to note is that LaTeX output would need a bit of additional tweaking to work right in the "generic mode" (for one, without tweaks, figure and table captions will have duplicate Figure and Table headers and numbers, one added by pandoc-crossref and another by LaTeX itself).

Fair warning: LaTeX output in pandoc-crossref is a bit of a different beast from the rest of those. Outputting to, say, docx and LaTeX with default options will produce results that differ in surprising ways. If you want consistency, using pandoc-crossref in "pipe mode" (i.e. pandoc -f markdown -t json | pandoc-crossref | pandoc -f json -t latex) will likely be an easier option.

dhimmel · 2019-03-23T03:32:04Z

Thanks @lierdakil for your explanation. It seems that it really only makes sense to consolidate the conversion pipeline for filters that are output-format agnostic. Since none of the filters we currently use are actually output-format agnostic, I will close this pull request. If we do incorporate output-format agnostic filters, such as the cite-by-id filter under development in manubot/manubot#99, we can revisit this proposal.

Going forward we will keep pandoc-crossref in mind, especially if our current suite of pandoc-xnos filters starts experiencing limitations.

build: common intermediate JSON for all pandoc outputs

896d2ca

Move metadata commands before filters

7e5a114

dhimmel closed this Mar 23, 2019

dhimmel mentioned this pull request Mar 23, 2019

Update acknowledgements greenelab/meta-review#126

Closed
