Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YAML bibliography processing is very slow #6084

Closed
brevzin opened this issue Jan 25, 2020 · 14 comments
Closed

YAML bibliography processing is very slow #6084

brevzin opened this issue Jan 25, 2020 · 14 comments

Comments

@brevzin
Copy link

brevzin commented Jan 25, 2020

Here's a set of instructions to reproduce:

$ touch x.md
$ wget https://wg21.link/index.yaml -O index.yaml
$ time pandoc x.md -o x.html --bibliography index.yaml

That is, empty markdown file to html, no defaults, and using the wg21 bibligraphy (which is a little over 100k lines long).

With pandoc 2.7.3, this takes 0.8s.
With pandoc 2.9.1.1, this now takes 16.8s.

@brevzin
Copy link
Author

brevzin commented Jan 25, 2020

pandoc 2.8.x also takes ~17s.

@jgm
Copy link
Owner

jgm commented Jan 26, 2020

We switched from a wrapper over the C libyaml library (yaml package) to a pure Haskell YAML parser (HsYAML), and I suspect that is the cause of the slowdown. To confirm this we should try using HsYAML by itself on this file (no pandoc).

@jgm
Copy link
Owner

jgm commented Jan 26, 2020

Sure enough, decodeNode takes 25 seconds to parse this input on my machine in a ghci session.
I will create an issue on HsYAML to see if performance there can be improved.

@jgm jgm changed the title Bibliography processing is very slow YAML bibliography processing is very slow Jan 26, 2020
@jgm
Copy link
Owner

jgm commented Feb 13, 2020

Nothing from HsYAML yet. I'm thinking the best path forward would be to switch back to yaml. Although there are drawbacks to a dependency on a C library, processing large YAML files is something you should be able to do with pandoc.

https://hackage.haskell.org/package/yaml-0.11.2.0/docs/Data-Yaml-Parser.html provides an interface similar to HsYAML's and would allow us to duplicate the code.

This would require changes in:

  • src/Text/Pandoc/Filter.hs (FromYaml instances)
  • src/Text/Pandoc/App/Opt.hs (FromYaml instances)
  • src/Text/Pandoc/Options.hs (FromYaml instances)
  • src/Text/Pandoc/Logging.hs (FromYaml instances)
  • src/Text/Pandoc/Translations.hs (FromYaml instances)
  • src/Text/Pandoc/Readers/Metadata.hs (YAML decoding of metadata)
  • src/Text/Pandoc/App/CommandLineOptions.hs (decoding YAML defaults file)

It seems attractive to add a cabal flag allowing compilation with both HsYAML and yaml, but unfortunately this would violate the expectation that the public API not be affected by compilation flags (since this would affect whether we had FromYAML or FromYaml instances).

@ickc
Copy link
Contributor

ickc commented Dec 18, 2020

Hi, I'm benchmarking this behavior and it is somewhat surprising:

I created a square matrix and converted it in YAML format string. Then I test various different YAML read speed.

Note that I checked and pandoc has been using HsYAML since pandoc 2.2. So if it is just a matter of calling yaml vs HsYAML, then we expect the perf. diff. shows up between pandoc 2.11 and 2.11+, not between 2.7 and 2.8. (Unless I made a mistake here?) I don't understand why the test above showed that the diff. is between 2.7 and 2.8.

  • Below size is the number of elements in that matrix (so n x n), and time in second.

  • All yaml* functions are using pyyaml, in short those with c are calling the C library, and without C is pure-Python.

  • All pandoc* are calling pandoc to process the file as YAML metadata in markdown to native. The numbers are obviously the pandoc versions.

newplot

Summary:

  • all C functions basically has the same time behavior.
  • the pure Python is an order of magnitude slower than that.
  • the pandoc's is about a factor of 3 slower than pure Python, independent of pandoc versions (i.e. calling yaml or HsYAML.)

Then I try loading the https://wg21.link/index.yaml above, (Edit 3: take these number with a grain of salt)

pandoc_2_1_3: 35.268665639000005
pandoc_2_7_3: 38.718894643
pandoc_2_8_1: 34.14647693799998
yaml_csafeloader: 1.2296081969999761
yaml_safeloader: 13.213422126000012

So, if indeed the perf. regression happens at pandoc 2.8, probably it is not due to yaml vs HsYAML, but something else?

Edit: the notebook used to produce the plot is in: https://gist.github.com/ickc/f9d6c57338d4be07f6cf70991d462fc3, with all the pandoc bins located in ~/temp/temp/pandoc-*/bin and index.yaml downloaded in ~/temp/temp.

Edit 2: I added the timing of index.yaml. Also, I noticed that I'm treating the YAML differently than the initial report hence the discrepancy: they load it as a bibliography and I put it as the YAML metadata of the markdown. I don't understand why the perf. would be so different here as they are essentially parsing the same info. (Or may be because of the construction of native AST?)

Edit 3: see comments below:

I did a similar thing expect I only prepend ---, because some markdown versions expect a closing ... and some other expects --- and I didn't bother to check which is which in my benchmark script.

But I observed a discrepancy in the benchmark speed of https://wg21.link/index.yaml, so it may be caused by that. So take my benchmark of https://wg21.link/index.yaml with a grain of salt (which shows no diff. comparing 2.1.3 and 2.7.3.)

@jgm
Copy link
Owner

jgm commented Dec 18, 2020

Going from 35 to 38 isn't really the sort of performance regression I'd worry about. I imagine that most of the time is taken parsing the contents of the YAML entries as Markdown and constructing an AST. (That's something your Python and C tests are not doing at all.) Markdown parsing is slower than YAML parsing.

@ickc
Copy link
Contributor

ickc commented Dec 18, 2020

I ran pandoc x.md -o x.html --bibliography index.yaml using different pandoc versions and the change has to be between pandoc 2.7.3 and 2.8. pandoc 2.7.3 ship with pandoc-citeproc 0.16.2 and pandoc 2.8 ship with pandoc-citeproc 0.16.4.

Diff between pandoc-citeproc 0.16.2 and 0.16.4 are:

pandoc-citeproc (0.16.4)

  • Simplify compat since we now require pandoc >= 2.8.
  • Update man page. Clarify that reference-section-title doesn't have any
    effect when an explicit Div is used (PDF TOC page numbering off by sizeof(TOC) #424).
  • Qualify fail with Prelude to fix compiler warnings.
  • Disable raw_attribute extension when writing Markdown.
    This way we avoid e.g. <i>{=html} in the CSL export.
  • Use pandoc-types 1.20. Note that this change removes
    the ability to compile pandoc-citeproc with
    older versions of pandoc (< 2.8).
  • Incorporate the switch to Text in the dependencies (Christian
    Despres). The changes to the structure of the code are fairly minimal.
    None of the types have changed, requiring a reasonable amount of
    packing, unpacking, and view patterns.

pandoc-citeproc (0.16.3.1)

  • Fix how LANG is set for bibtex conversion.
    We were using a - where _ is standard.
  • Make locale retrieval more robust (Indented code ignoreed #420).
    Previously an error was raised if the locale was 'C'.
    Now 'C' is treated as default (en-US locale is used).

pandoc-citeproc (0.16.3)

So I now understand why HsYAML might be to be blamed: "Replace some of the yaml use with HsYAML-aeson"

But then my question was that pandoc has been using HsYAML to process YAML metadata since 2.2, but processing YAML metadata between pandoc 2.1.3 vs 2.7.3 has similar performance.


In other words, I'm questioning if the perf. diff. is really from yaml vs HsYAML? Would it be from somewhere else, such as "Use pandoc-types 1.20"?

Also, from the testing above it seems that while it is slow, but it is still in ~linear time. Practically the example in this issue is kind of a worst case scenario. Although it wasn't quick, but it is still under a minute. So may be it is good enough?

@jgm
Copy link
Owner

jgm commented Dec 18, 2020

Here are my observations. First, I added --- and ... around index.yaml so it becomes a proper pandoc markdown YAML metadata section. Then with pandoc 2.1, pandoc index.yaml -o /dev/null takes 4.9 seconds, while with pandoc 2.7.3 and pandoc 2.11.3, it takes 24-28 seconds. Since Markdown parsing is being done in both cases, I speculate that the difference (20 seconds or so) is due to a performance slowdown in YAML parsing due to HsYAML. That is also suggested by the experiment reported above, using the HsYAML library by itself on index.yaml.

The difference between pandoc 2.7 and pandoc 2.8 is, as you infer, not due to YAML parsing in pandoc itself. It is due to changes in the versions of pandoc-citeproc distributed with these pandocs, and in particular the change from yaml to HsYAML.

So I stand by my original conclusion that HsYAML is quite slow and makes pandoc difficult to use with large YAML bibliographies.

@ickc
Copy link
Contributor

ickc commented Dec 19, 2020

I did a similar thing expect I only prepend ---, because some markdown versions expect a closing ... and some other expects --- and I didn't bother to check which is which in my benchmark script.

But I observed a discrepancy in the benchmark speed of https://wg21.link/index.yaml, so it may be caused by that. So take my benchmark of https://wg21.link/index.yaml with a grain of salt (which shows no diff. comparing 2.1.3 and 2.7.3.)

@ickc
Copy link
Contributor

ickc commented Dec 19, 2020

Update: (this changes the scaling I see earlier, and is now sharing a similar story with @jgm's) Enclosing the whole thing by ---, ... as suggested by @jgm shows that,

  • For the matrix scaling:

    • bottom-most is the C function

    • next are the ruamel.yaml, which fork the libyaml to implement YAML 1.2 spec, roughly 2 times slower than above

    • next is pandoc 2.1.3, similar to ruamel.yaml at large N, roughly 2 times slowly than C

    • next 2 are pandoc 2.7.3 and 2.8.1, around 2 dozen slower than pandoc 2.1.3

    • next is pandoc 2.11.2, around 50% slower than pandoc 2.8.1

  • For https://wg21.link/index.yaml,

    pandoc_2_11_2: 79.06120117299997
    pandoc_2_1_3: 1.9598751930000162
    pandoc_2_7_3: 49.15886576700001
    pandoc_2_8_1: 45.44163866100001
    yaml_cfullloader: 1.173710701999994
    yaml_cloader: 1.164000395999949
    yaml_csafeloader: 1.1844055330000174
    yaml_csafeloader_ruamel_1_1: 2.458159934999969
    yaml_csafeloader_ruamel_1_2: 2.4833653080000317
    yaml_fullloader: 12.84384976299998
    yaml_loader: 12.782994889999998
    yaml_safeloader: 12.736911373999988
    yamlloader_cloader: 1.1987743669999986
    yamlloader_csafeloader: 1.1853255159999776
    yamlloader_loader: 12.743876136999972
    yamlloader_safeloader: 12.713291493999975

    with a similar story (still slower than I see from terminal directly. So may be passing the string to pandoc from Python has extra overhead. But the story is the same.)

Notebook used here.

@mpark
Copy link

mpark commented Dec 19, 2020

Just FYI, I've worked around this problem recently in mpark/wg21 (where the discussion with @brevzin originated) by transforming the yaml bibliography into its json equivalent: mpark/wg21@1e3bd19

With Pandoc 2.9.2.1:

time pandoc x.md -o x.html --bibliography data/index.yaml
pandoc x.md -o x.html --bibliography data/index.yaml  20.18s user 0.56s system 97% cpu 21.176 total

time pandoc x.md -o x.html --bibliography data/csl.json
pandoc x.md -o x.html --bibliography data/csl.json  0.54s user 0.16s system 90% cpu 0.779 total

@jgm
Copy link
Owner

jgm commented Oct 25, 2021

Upstream HsYAML seems unresponsive and I'm getting worried that it's a dead project. More reason to switch back to yaml, perhaps.

There is a problem, however. In pandoc 2.2.2 we noted the following change:

Note: HsYAML implements YAML 1.2, in which the valid true values are true, True, TRUE. This means a change in the semantics of YAML metadata that could affect users: y, yes, and on no longer count as true values.

As far as I can see, the yaml package doesn't have a configuration for YAML 1.2, so we'd go back to parsing y, yes, no, etc. as Boolean values. This could affect some existing documents.

jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML files, such as bibliographies (#6084).
  An issue was submitted to HsYAML, but it hasn't gotten
  any attention.
- HsYAML seems borderline unmaintained; it hasn't had a
  commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
@jgm
Copy link
Owner

jgm commented Oct 27, 2021

I've got a branch now that uses yaml instead of HsYAML. (yaml branch)

I tried to render a document with one citation, using --citeproc and the 3.5 MB index.yaml linked above. With an optimized build, it took about 4 seconds, as compared to 12 seconds for the released pandoc 2.15. (M1 mac laptop.) Not sure this is a big enough difference to justify the change, but as noted above the lack of current maintenance of HsYAML is another reason to switch.

@jgm
Copy link
Owner

jgm commented Oct 27, 2021

OK, I was testing the wrong thing. I should have been using --bibliography instead of --metadata-file. With --bibliography=index.yaml and a document with one citation, I see a 20-fold difference in speed.

jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten
  any attention.  HsYAML seems borderline unmaintained; it hasn't
  had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
jgm added a commit that referenced this issue Oct 27, 2021
Reasons:

- Performance: HsYAML is around 20 times slower in parsing
  large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten
  any attention.  HsYAML seems borderline unmaintained; it hasn't
  had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves
  from C dependencies (#4535).  But I don't see a better alternative
  until a better pure Haskell parser is available.

Closes #6084.

Notes:

- We've removed the FromYAML instances for all types that had
  them, since this is a HsYAML-specific typeclass [API change].
  (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml
  parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values.
  Users may need to quote these when they are meant to be
  interpreted as strings.  Similarly, 'null' is parsed as
  a YAML null value (and will be treated as an empty string
  by pandoc rather than the string 'null').  Quoting it will
  force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
- Pandoc now behaves better when the YAML metadata contains
  escaping errors: instead of just falling back on treating
  the section as a table, it raises a YAML parsing error.
@jgm jgm closed this as completed in d226a35 Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants