YAML bibliography processing is very slow #6084
pandoc 2.8.x also takes ~17s. |
We switched from a wrapper over the C libyaml library (yaml package) to a pure Haskell YAML parser (HsYAML), and I suspect that is the cause of the slowdown. To confirm this we should try using HsYAML by itself on this file (no pandoc). |
Sure enough, |
Nothing from HsYAML yet. I'm thinking the best path forward would be to switch back to yaml. Although there are drawbacks to a dependency on a C library, processing large YAML files is something you should be able to do with pandoc. https://hackage.haskell.org/package/yaml-0.11.2.0/docs/Data-Yaml-Parser.html provides an interface similar to HsYAML's and would allow us to duplicate the code. This would require changes in:
It seems attractive to add a cabal flag allowing compilation with both HsYAML and yaml, but unfortunately this would violate the expectation that the public API not be affected by compilation flags (since this would affect whether we had FromYAML or FromYaml instances). |
Hi, I'm benchmarking this behavior and the result is somewhat surprising. I created a square matrix and converted it to a YAML-format string, then measured the read speed of several YAML readers. Note that I checked, and pandoc has been using HsYAML since pandoc 2.2. So if it is just a matter of calling yaml vs. HsYAML, we would expect the performance difference to show up between pandoc 2.11 and 2.11+, not between 2.7 and 2.8. (Unless I made a mistake here?) I don't understand why the test above showed the difference between 2.7 and 2.8.
Summary:
Then I tried loading the https://wg21.link/index.yaml above (Edit 3: take these numbers with a grain of salt):
pandoc_2_1_3: 35.268665639000005
pandoc_2_7_3: 38.718894643
pandoc_2_8_1: 34.14647693799998
yaml_csafeloader: 1.2296081969999761
yaml_safeloader: 13.213422126000012
So, if the performance regression indeed happens at pandoc 2.8, it is probably not due to yaml vs. HsYAML but to something else?
Edit: the notebook used to produce the plot is at https://gist.github.com/ickc/f9d6c57338d4be07f6cf70991d462fc3, with all the pandoc bins located in
Edit 2: I added the timing of index.yaml. Also, I noticed that I'm treating the YAML differently from the initial report, hence the discrepancy: they load it as a bibliography, and I put it in the YAML metadata of the Markdown. I don't understand why the performance would be so different here, as they are essentially parsing the same information. (Or maybe it is because of the construction of the native AST?)
Edit 3: see comments below:
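The square-matrix benchmark described in this comment can be sketched roughly as follows. This is only a minimal sketch, not the linked notebook: the helper names `matrix_to_yaml`, `best_time`, and `pyyaml_loaders` are hypothetical, the matrix is written as flow-style rows by hand, and PyYAML is assumed to be the library behind the `yaml_safeloader` / `yaml_csafeloader` labels above.

```python
import time


def matrix_to_yaml(n):
    """Return an n x n integer matrix as a YAML sequence of flow-style rows."""
    rows = []
    for i in range(n):
        cells = ", ".join(str(i * n + j) for j in range(n))
        rows.append("- [" + cells + "]")
    return "\n".join(rows) + "\n"


def best_time(load, text, repeat=3):
    """Return the fastest wall-clock time (seconds) of load(text) over `repeat` runs."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        load(text)
        timings.append(time.perf_counter() - start)
    return min(timings)


def pyyaml_loaders():
    """Map a label to a loader callable; empty if PyYAML is not installed."""
    try:
        import yaml
    except ImportError:
        return {}
    loaders = {"yaml_safeloader": lambda s: yaml.load(s, Loader=yaml.SafeLoader)}
    # CSafeLoader only exists when PyYAML was built against libyaml.
    if hasattr(yaml, "CSafeLoader"):
        loaders["yaml_csafeloader"] = lambda s: yaml.load(s, Loader=yaml.CSafeLoader)
    return loaders


if __name__ == "__main__":
    text = matrix_to_yaml(200)
    for name, load in pyyaml_loaders().items():
        print(name, best_time(load, text))
```

Timing a pandoc binary would need a separate subprocess wrapper; that part is left out here because the exact invocation is not shown in the thread.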
|
Going from 35 to 38 isn't really the sort of performance regression I'd worry about. I imagine that most of the time is taken parsing the contents of the YAML entries as Markdown and constructing an AST. (That's something your Python and C tests are not doing at all.) Markdown parsing is slower than YAML parsing. |
I ran Diff between pandoc-citeproc 0.16.2 and 0.16.4 are:
So I now understand why HsYAML might be to blame: "Replace some of the yaml use with HsYAML-aeson". But my question was this: pandoc has been using HsYAML to process YAML metadata since 2.2, yet processing YAML metadata in pandoc 2.1.3 and 2.7.3 has similar performance. In other words, I'm questioning whether the performance difference is really from yaml vs. HsYAML. Could it come from somewhere else, such as "Use pandoc-types 1.20"? Also, from the testing above, it seems that while it is slow, it still runs in ~linear time. In practice, the example in this issue is something of a worst-case scenario. It wasn't quick, but it still finished in under a minute. So maybe it is good enough? |
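One way to check the "~linear time" impression is to fit the slope of log(time) against log(input size): a slope near 1 suggests linear scaling, near 2 quadratic. The sketch below assumes you already have paired measurements; the function name `scaling_exponent` is hypothetical and the numbers in the usage lines are synthetic, not taken from this thread.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic data: time grows in proportion to size.
    print(scaling_exponent([10, 100, 1000], [0.2, 2.0, 20.0]))
```

At least two distinct sizes are needed, otherwise the denominator is zero.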
Here are my observations. First, I added The difference between pandoc 2.7 and pandoc 2.8 is, as you infer, not due to YAML parsing in pandoc itself. It is due to changes in the versions of pandoc-citeproc distributed with these pandocs, and in particular the change from yaml to HsYAML. So I stand by my original conclusion that HsYAML is quite slow and makes pandoc difficult to use with large YAML bibliographies. |
I did a similar thing except I only prepend But I observed a discrepancy in the benchmark speed of |
Update: (this changes the scaling I saw earlier, and now tells a similar story to @jgm's) Enclosing the whole thing by
|
Just FYI, I've worked around this problem recently in mpark/wg21 (where the discussion with @brevzin originated) by transforming the With Pandoc 2.9.2.1:
|
Upstream HsYAML seems unresponsive and I'm getting worried that it's a dead project. More reason to switch back to yaml, perhaps. There is a problem, however. In pandoc 2.2.2 we noted the following change:
As far as I can see, the yaml package doesn't have a configuration for YAML 1.2, so we'd go back to parsing |
Reasons:
- Performance: HsYAML is around 20 times slower in parsing large YAML files, such as bibliographies (#6084). An issue was submitted to HsYAML, but it hasn't gotten any attention.
- HsYAML seems borderline unmaintained; it hasn't had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available.
Closes #6084.
Notes:
- We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
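The quoting note above can be illustrated with a small helper that flags plain scalars whose meaning depends on the YAML version. This is only a sketch under stated assumptions: the boolean word lists follow the YAML 1.1 bool type and the YAML 1.2 core schema, the HsYAML configuration mentioned above is assumed to behave like the 1.2 core schema, and whether the Haskell yaml package matches these lists exactly is not confirmed here. The names `changes_meaning` and `quote_if_ambiguous` are hypothetical.

```python
# Plain scalars resolved as booleans by the YAML 1.1 bool type.
YAML11_BOOLS = frozenset(
    "y Y yes Yes YES n N no No NO true True TRUE false False FALSE "
    "on On ON off Off OFF".split()
)
# Plain scalars resolved as booleans by the YAML 1.2 core schema.
YAML12_BOOLS = frozenset("true True TRUE false False FALSE".split())
# Plain scalars resolved as null in both versions.
NULLS = frozenset(["null", "Null", "NULL", "~"])


def changes_meaning(scalar):
    """True if the scalar is a boolean under YAML 1.1 but a string under YAML 1.2."""
    return scalar in YAML11_BOOLS and scalar not in YAML12_BOOLS


def quote_if_ambiguous(scalar):
    """Wrap the scalar in double quotes if it would not be read as a string."""
    if scalar in YAML11_BOOLS or scalar in NULLS:
        return '"' + scalar + '"'
    return scalar


if __name__ == "__main__":
    for word in ["Yes", "Off", "true", "null", "Norway"]:
        print(word, changes_meaning(word), quote_if_ambiguous(word))
```

A bibliography field such as `note: No` would be one of the cases where quoting matters after the switch.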
I've got a branch now that uses yaml instead of HsYAML. ( I tried to render a document with one citation, using |
OK, I was testing the wrong thing. I should have been using |
Reasons:
- Performance: HsYAML is around 20 times slower in parsing large YAML bibliographies (#6084).
- An issue was submitted to HsYAML, but it hasn't gotten any attention. HsYAML seems borderline unmaintained; it hasn't had a commit in over a year.
- Unfortunately this goes back on our attempts to free ourselves from C dependencies (#4535). But I don't see a better alternative until a better pure Haskell parser is available.
Closes #6084.
Notes:
- We've removed the FromYAML instances for all types that had them, since this is a HsYAML-specific typeclass [API change]. (The yaml package just uses From/ToJSON.)
- Unlike HsYAML (in the configuration we were using), yaml parses 'Y', 'N', 'Yes', 'No', 'On', 'Off' as boolean values. Users may need to quote these when they are meant to be interpreted as strings. Similarly, 'null' is parsed as a YAML null value (and will be treated as an empty string by pandoc rather than the string 'null'). Quoting it will force it to be interpreted as a string.
- Some tests had to be adjusted accordingly.
- Pandoc now behaves better when the YAML metadata contains escaping errors: instead of just falling back on treating the section as a table, it raises a YAML parsing error.
Here's a set of instructions to reproduce:
That is, an empty Markdown file to HTML, no defaults, and using the wg21 bibliography (which is a little over 100k lines long).
With pandoc 2.7.3, this takes 0.8s.
With pandoc 2.9.1.1, this now takes 16.8s.