YAML bibliography processing is very slow #6084

brevzin · 2020-01-25T21:12:19Z

Here's a set of instructions to reproduce:

$ touch x.md
$ wget https://wg21.link/index.yaml -O index.yaml
$ time pandoc x.md -o x.html --bibliography index.yaml

That is, empty markdown file to html, no defaults, and using the wg21 bibligraphy (which is a little over 100k lines long).

With pandoc 2.7.3, this takes 0.8s.
With pandoc 2.9.1.1, this now takes 16.8s.

The text was updated successfully, but these errors were encountered:

brevzin · 2020-01-25T21:16:32Z

pandoc 2.8.x also takes ~17s.

jgm · 2020-01-26T04:44:57Z

We switched from a wrapper over the C libyaml library (yaml package) to a pure Haskell YAML parser (HsYAML), and I suspect that is the cause of the slowdown. To confirm this we should try using HsYAML by itself on this file (no pandoc).

jgm · 2020-01-26T05:04:34Z

Sure enough, decodeNode takes 25 seconds to parse this input on my machine in a ghci session.
I will create an issue on HsYAML to see if performance there can be improved.

jgm · 2020-02-13T16:13:49Z

Nothing from HsYAML yet. I'm thinking the best path forward would be to switch back to yaml. Although there are drawbacks to a dependency on a C library, processing large YAML files is something you should be able to do with pandoc.

https://hackage.haskell.org/package/yaml-0.11.2.0/docs/Data-Yaml-Parser.html provides an interface similar to HsYAML's and would allow us to duplicate the code.

This would require changes in:

src/Text/Pandoc/Filter.hs (FromYaml instances)
src/Text/Pandoc/App/Opt.hs (FromYaml instances)
src/Text/Pandoc/Options.hs (FromYaml instances)
src/Text/Pandoc/Logging.hs (FromYaml instances)
src/Text/Pandoc/Translations.hs (FromYaml instances)
src/Text/Pandoc/Readers/Metadata.hs (YAML decoding of metadata)
src/Text/Pandoc/App/CommandLineOptions.hs (decoding YAML defaults file)

It seems attractive to add a cabal flag allowing compilation with both HsYAML and yaml, but unfortunately this would violate the expectation that the public API not be affected by compilation flags (since this would affect whether we had FromYAML or FromYaml instances).

ickc · 2020-12-18T10:47:17Z

Hi, I'm benchmarking this behavior and it is somewhat surprising:

I created a square matrix and converted it in YAML format string. Then I test various different YAML read speed.

Note that I checked and pandoc has been using HsYAML since pandoc 2.2. So if it is just a matter of calling yaml vs HsYAML, then we expect the perf. diff. shows up between pandoc 2.11 and 2.11+, not between 2.7 and 2.8. (Unless I made a mistake here?) I don't understand why the test above showed that the diff. is between 2.7 and 2.8.

Below size is the number of elements in that matrix (so n x n), and time in second.
All yaml* functions are using pyyaml, in short those with c are calling the C library, and without C is pure-Python.
All pandoc* are calling pandoc to process the file as YAML metadata in markdown to native. The numbers are obviously the pandoc versions.

Summary:

all C functions basically has the same time behavior.
the pure Python is an order of magnitude slower than that.
the pandoc's is about a factor of 3 slower than pure Python, independent of pandoc versions (i.e. calling yaml or HsYAML.)

Then I try loading the https://wg21.link/index.yaml above, (Edit 3: take these number with a grain of salt)

pandoc_2_1_3: 35.268665639000005
pandoc_2_7_3: 38.718894643
pandoc_2_8_1: 34.14647693799998
yaml_csafeloader: 1.2296081969999761
yaml_safeloader: 13.213422126000012

So, if indeed the perf. regression happens at pandoc 2.8, probably it is not due to yaml vs HsYAML, but something else?

Edit: the notebook used to produce the plot is in: https://gist.github.com/ickc/f9d6c57338d4be07f6cf70991d462fc3, with all the pandoc bins located in ~/temp/temp/pandoc-*/bin and index.yaml downloaded in ~/temp/temp.

Edit 2: I added the timing of index.yaml. Also, I noticed that I'm treating the YAML differently than the initial report hence the discrepancy: they load it as a bibliography and I put it as the YAML metadata of the markdown. I don't understand why the perf. would be so different here as they are essentially parsing the same info. (Or may be because of the construction of native AST?)

Edit 3: see comments below:

I did a similar thing expect I only prepend ---, because some markdown versions expect a closing ... and some other expects --- and I didn't bother to check which is which in my benchmark script.

But I observed a discrepancy in the benchmark speed of https://wg21.link/index.yaml, so it may be caused by that. So take my benchmark of https://wg21.link/index.yaml with a grain of salt (which shows no diff. comparing 2.1.3 and 2.7.3.)

jgm · 2020-12-18T16:49:59Z

Going from 35 to 38 isn't really the sort of performance regression I'd worry about. I imagine that most of the time is taken parsing the contents of the YAML entries as Markdown and constructing an AST. (That's something your Python and C tests are not doing at all.) Markdown parsing is slower than YAML parsing.

ickc · 2020-12-18T22:35:37Z

I ran pandoc x.md -o x.html --bibliography index.yaml using different pandoc versions and the change has to be between pandoc 2.7.3 and 2.8. pandoc 2.7.3 ship with pandoc-citeproc 0.16.2 and pandoc 2.8 ship with pandoc-citeproc 0.16.4.

Diff between pandoc-citeproc 0.16.2 and 0.16.4 are:

pandoc-citeproc (0.16.4)

Simplify compat since we now require pandoc >= 2.8.

Update man page. Clarify that reference-section-title doesn't have any
effect when an explicit Div is used (PDF TOC page numbering off by sizeof(TOC) #424).

Qualify fail with Prelude to fix compiler warnings.

Disable raw_attribute extension when writing Markdown.
This way we avoid e.g. <i>{=html} in the CSL export.

Use pandoc-types 1.20. Note that this change removes
the ability to compile pandoc-citeproc with
older versions of pandoc (< 2.8).

Incorporate the switch to Text in the dependencies (Christian
Despres). The changes to the structure of the code are fairly minimal.
None of the types have changed, requiring a reasonable amount of
packing, unpacking, and view patterns.

pandoc-citeproc (0.16.3.1)

Fix how LANG is set for bibtex conversion.
We were using a - where _ is standard.

Make locale retrieval more robust (Indented code ignoreed #420).
Previously an error was raised if the locale was 'C'.
Now 'C' is treated as default (en-US locale is used).

pandoc-citeproc (0.16.3)

Set hanging-indent class on refs Div if specified by the style (underscore _ not followed by a character in LaTeX environments #410).
This depends on whether the hanging-indent attribute of
<bibliography> is true.

Update chicago-author-date.csl and locales.

Add new biblatex @dataset entry type (s5 slides are not working in 1.9.0.5 #413).

Cleanup cabal file and remove old ghc 6.10 conditional.

Replace some of the yaml use with HsYAML-aeson (vijayphoenix).

Allow xml-conduit 1.9.

Allow network 3.1.

Allow pandoc 2.8.

Use pandoc-types 1.17.6.1

So I now understand why HsYAML might be to be blamed: "Replace some of the yaml use with HsYAML-aeson"

But then my question was that pandoc has been using HsYAML to process YAML metadata since 2.2, but processing YAML metadata between pandoc 2.1.3 vs 2.7.3 has similar performance.

In other words, I'm questioning if the perf. diff. is really from yaml vs HsYAML? Would it be from somewhere else, such as "Use pandoc-types 1.20"?

Also, from the testing above it seems that while it is slow, but it is still in ~linear time. Practically the example in this issue is kind of a worst case scenario. Although it wasn't quick, but it is still under a minute. So may be it is good enough?

jgm · 2020-12-18T23:02:22Z

Here are my observations. First, I added --- and ... around index.yaml so it becomes a proper pandoc markdown YAML metadata section. Then with pandoc 2.1, pandoc index.yaml -o /dev/null takes 4.9 seconds, while with pandoc 2.7.3 and pandoc 2.11.3, it takes 24-28 seconds. Since Markdown parsing is being done in both cases, I speculate that the difference (20 seconds or so) is due to a performance slowdown in YAML parsing due to HsYAML. That is also suggested by the experiment reported above, using the HsYAML library by itself on index.yaml.

The difference between pandoc 2.7 and pandoc 2.8 is, as you infer, not due to YAML parsing in pandoc itself. It is due to changes in the versions of pandoc-citeproc distributed with these pandocs, and in particular the change from yaml to HsYAML.

So I stand by my original conclusion that HsYAML is quite slow and makes pandoc difficult to use with large YAML bibliographies.

ickc · 2020-12-19T00:24:08Z

I did a similar thing expect I only prepend ---, because some markdown versions expect a closing ... and some other expects --- and I didn't bother to check which is which in my benchmark script.

But I observed a discrepancy in the benchmark speed of https://wg21.link/index.yaml, so it may be caused by that. So take my benchmark of https://wg21.link/index.yaml with a grain of salt (which shows no diff. comparing 2.1.3 and 2.7.3.)

ickc · 2020-12-19T01:13:22Z

Update: (this changes the scaling I see earlier, and is now sharing a similar story with @jgm's) Enclosing the whole thing by ---, ... as suggested by @jgm shows that,

For the matrix scaling:
- bottom-most is the C function
- next are the ruamel.yaml, which fork the libyaml to implement YAML 1.2 spec, roughly 2 times slower than above
- next is pandoc 2.1.3, similar to ruamel.yaml at large N, roughly 2 times slowly than C
- next 2 are pandoc 2.7.3 and 2.8.1, around 2 dozen slower than pandoc 2.1.3
- next is pandoc 2.11.2, around 50% slower than pandoc 2.8.1

For https://wg21.link/index.yaml,

pandoc_2_11_2: 79.06120117299997
pandoc_2_1_3: 1.9598751930000162
pandoc_2_7_3: 49.15886576700001
pandoc_2_8_1: 45.44163866100001
yaml_cfullloader: 1.173710701999994
yaml_cloader: 1.164000395999949
yaml_csafeloader: 1.1844055330000174
yaml_csafeloader_ruamel_1_1: 2.458159934999969
yaml_csafeloader_ruamel_1_2: 2.4833653080000317
yaml_fullloader: 12.84384976299998
yaml_loader: 12.782994889999998
yaml_safeloader: 12.736911373999988
yamlloader_cloader: 1.1987743669999986
yamlloader_csafeloader: 1.1853255159999776
yamlloader_loader: 12.743876136999972
yamlloader_safeloader: 12.713291493999975

with a similar story (still slower than I see from terminal directly. So may be passing the string to pandoc from Python has extra overhead. But the story is the same.)

Notebook used here.

mpark · 2020-12-19T04:43:25Z

Just FYI, I've worked around this problem recently in mpark/wg21 (where the discussion with @brevzin originated) by transforming the yaml bibliography into its json equivalent: mpark/wg21@1e3bd19

With Pandoc 2.9.2.1:

time pandoc x.md -o x.html --bibliography data/index.yaml
pandoc x.md -o x.html --bibliography data/index.yaml  20.18s user 0.56s system 97% cpu 21.176 total

time pandoc x.md -o x.html --bibliography data/csl.json
pandoc x.md -o x.html --bibliography data/csl.json  0.54s user 0.16s system 90% cpu 0.779 total

jgm · 2021-10-25T17:38:26Z

Upstream HsYAML seems unresponsive and I'm getting worried that it's a dead project. More reason to switch back to yaml, perhaps.

There is a problem, however. In pandoc 2.2.2 we noted the following change:

Note: HsYAML implements YAML 1.2, in which the valid true values are true, True, TRUE. This means a change in the semantics of YAML metadata that could affect users: y, yes, and on no longer count as true values.

As far as I can see, the yaml package doesn't have a configuration for YAML 1.2, so we'd go back to parsing y, yes, no, etc. as Boolean values. This could affect some existing documents.

Reasons: - Performance: HsYAML is around 20 times slower in parsing large YAML files, such as bibliographies (#6084). An issue was submitted to HsYAML, but it hasn't gotten any attention. - HsYAML seems borderline unmaintained; it hasn't had a commit in over a year. - Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available. Closes #6084. Notes: - We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.) - Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string. - Some tests had to be adjusted accordingly.

jgm · 2021-10-27T04:35:16Z

I've got a branch now that uses yaml instead of HsYAML. (yaml branch)

I tried to render a document with one citation, using --citeproc and the 3.5 MB index.yaml linked above. With an optimized build, it took about 4 seconds, as compared to 12 seconds for the released pandoc 2.15. (M1 mac laptop.) Not sure this is a big enough difference to justify the change, but as noted above the lack of current maintenance of HsYAML is another reason to switch.

jgm · 2021-10-27T18:30:37Z

OK, I was testing the wrong thing. I should have been using --bibliography instead of --metadata-file. With --bibliography=index.yaml and a document with one citation, I see a 20-fold difference in speed.

Reasons: - Performance: HsYAML is around 20 times slower in parsing large YAML bibliographies (#6084). - An issue was submitted to HsYAML, but it hasn't gotten any attention. HsYAML seems borderline unmaintained; it hasn't had a commit in over a year. - Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available. Closes #6084. Notes: - We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.) - Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string. - Some tests had to be adjusted accordingly.

Reasons: - Performance: HsYAML is around 20 times slower in parsing large YAML bibliographies (#6084). - An issue was submitted to HsYAML, but it hasn't gotten any attention. HsYAML seems borderline unmaintained; it hasn't had a commit in over a year. - Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available. Closes #6084. Notes: - We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.) - Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string. - Some tests had to be adjusted accordingly. - Pandoc now behaves better when the YAML metadata contains escaping errors: instead of just falling back on treating the section as a table, it raises a YAML parsing error.

jgm mentioned this issue Jan 26, 2020

Performance could be better haskell-hvr/HsYAML#40

Open

jgm changed the title ~~Bibliography processing is very slow~~ YAML bibliography processing is very slow Jan 26, 2020

mb21 added performance status:waiting-for-upstream labels Jan 26, 2020

jgm mentioned this issue Jul 20, 2020

Removing non-Haskell dependencies (C dependencies) #4535

Open

4 tasks

the-solipsist mentioned this issue Nov 19, 2020

[ENHANCEMENT]: Support pandoc citeproc YAML as a citation database. Zettlr/Zettlr#1407

Closed

ickc mentioned this issue Dec 19, 2020

Upgrading YAML spec from 1.1 to 1.2? sergiocorreia/panflute#178

Closed

jgm closed this as completed in d226a35 Oct 27, 2021

cderv mentioned this issue Apr 7, 2022

Did version 2.1.3 change how '---' is parsed? (Error: pandoc document conversion failed with error 64) rstudio/rmarkdown#2343

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

YAML bibliography processing is very slow #6084

YAML bibliography processing is very slow #6084

brevzin commented Jan 25, 2020 •

edited

Loading

brevzin commented Jan 25, 2020

jgm commented Jan 26, 2020

jgm commented Jan 26, 2020

jgm commented Feb 13, 2020

ickc commented Dec 18, 2020 •

edited

Loading

jgm commented Dec 18, 2020 •

edited

Loading

ickc commented Dec 18, 2020

jgm commented Dec 18, 2020

ickc commented Dec 19, 2020

ickc commented Dec 19, 2020 •

edited

Loading

mpark commented Dec 19, 2020 •

edited

Loading

jgm commented Oct 25, 2021

jgm commented Oct 27, 2021

jgm commented Oct 27, 2021

YAML bibliography processing is very slow #6084

YAML bibliography processing is very slow #6084

Comments

brevzin commented Jan 25, 2020 • edited Loading

brevzin commented Jan 25, 2020

jgm commented Jan 26, 2020

jgm commented Jan 26, 2020

jgm commented Feb 13, 2020

ickc commented Dec 18, 2020 • edited Loading

jgm commented Dec 18, 2020 • edited Loading

ickc commented Dec 18, 2020

jgm commented Dec 18, 2020

ickc commented Dec 19, 2020

ickc commented Dec 19, 2020 • edited Loading

mpark commented Dec 19, 2020 • edited Loading

jgm commented Oct 25, 2021

jgm commented Oct 27, 2021

jgm commented Oct 27, 2021

brevzin commented Jan 25, 2020 •

edited

Loading

ickc commented Dec 18, 2020 •

edited

Loading

jgm commented Dec 18, 2020 •

edited

Loading

ickc commented Dec 19, 2020 •

edited

Loading

mpark commented Dec 19, 2020 •

edited

Loading