Performance degrades quickly with line length and line count, possibly line content #6
Comments
Thanks for the impressive analysis, it is super useful. Yes, performance was never the focus up to now. I know there will be issues when the file gets big, because I have similar experiences in my |
@whacked Sorry for the delay on this issue. Thinking about it recently, the cause is probably that inline parsing happens across multiple lines, so if you have a huge |
Thanks for the update! I think that's generally reasonable. Two of the other org parsers that I have played with, https://github.com/mooz/org-js/ and https://github.com/daitangio/org-mode-parser, also assume single-line parsers at the bottom of the parser hierarchy. I think users who need multiple lines will compromise by using long lines or by repeating the markup across several lines. On the topic of regexes, one issue which also affects inline parsers (as used in the other 2 libraries mentioned here) is that they cannot handle nested markup, such as
FWIW I have been running a fork of orga using a modified inline parser like the one in #7 for a while. I think it's ~3x faster, but it still chokes on inputs over e.g. 10k lines. I don't know how easy it is to preserve nested parsing capability and speed. |
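To illustrate the nested-markup point: a single flat regex such as `/\*(.+?)\*/` can find the outer bold span in `*bold /italic/ text*`, but it treats the inner `/italic/` as plain text. One way around that is to re-parse the contents of each matched span recursively. The following is only a minimal sketch of that idea, not code from orga or from either of the libraries mentioned above. The `MARKERS` table, the `parseInline` name and the node shape are all made up for illustration, and the sketch ignores Org's rules about which characters may surround a marker.

```javascript
// Hypothetical marker table (an assumption for this sketch, not orga's API).
const MARKERS = { "*": "bold", "/": "italic", "_": "underline", "+": "strikethrough" };

// Parse a single line of text into a small tree, recursing into each
// marked span so that markup inside markup is also recognised.
function parseInline(text) {
  const nodes = [];
  let plain = "";
  let i = 0;
  while (i < text.length) {
    const type = MARKERS[text[i]];
    // Look for the next occurrence of the same marker character.
    const close = type ? text.indexOf(text[i], i + 1) : -1;
    if (close > i + 1) {
      if (plain) {
        nodes.push({ type: "text", value: plain });
        plain = "";
      }
      // Recurse on the span contents to pick up nested markup.
      nodes.push({ type, children: parseInline(text.slice(i + 1, close)) });
      i = close + 1;
    } else {
      // Not a marker, or no usable closing marker: keep as plain text.
      plain += text[i];
      i += 1;
    }
  }
  if (plain) nodes.push({ type: "text", value: plain });
  return nodes;
}

console.log(JSON.stringify(parseInline("*bold /italic/ text*"), null, 2));
```

Note that this sketch says nothing about speed: every unmatched marker still triggers a scan to the end of the string, so it does not by itself address the slowdown discussed in this issue.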
@whacked I have just removed multiline support for inline parsing (commit). Please verify if it improves the performance (with the release of v1.1.0). Also, there's an overhaul for |
I did not do anything extensive, but rerunning the script in this issue yields a 20x~1000x speedup. I will report anything interesting as I update to the new version. Thanks a lot for your work! |
Thanks for this @whacked! 2021-07-18T13:20 re-run with #104 which supports multiline inline parsing. Only run up to 3000 because there's a recursion bug I need to fix, but otherwise seems okay. TODO need to compare to current master.
Edit: note that the markup search space is bounded in #104 - per the Org spec which doesn't allow text markup to span more than 3 lines (the parser actually only allows 2, it seems), so the search space is never more than 2 lines. Edit: 2021-07-19T10:55. Fixed stack depth bug. Results below:
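The bounded search space can be sketched roughly like this. It is not the code from #104: `findClosing` and `MAX_SPAN_LINES` are hypothetical names, and the limit of 2 lines is taken from the observation above about what the parser appears to allow. The point is only that the scan for a closing marker stops after a fixed number of lines, so the cost per opening marker no longer depends on the size of the file.

```javascript
// Assumed limit on how many lines a markup span may cover (hypothetical constant).
const MAX_SPAN_LINES = 2;

// Look for the closing `marker` for an opening marker found at
// lines[lineIndex][column], scanning at most MAX_SPAN_LINES lines.
// Returns the position of the closing marker, or null if none is in range.
function findClosing(lines, lineIndex, column, marker) {
  const lastLine = Math.min(lines.length, lineIndex + MAX_SPAN_LINES) - 1;
  for (let l = lineIndex; l <= lastLine; l++) {
    // On the opening line, start just after the opening marker.
    const from = l === lineIndex ? column + 1 : 0;
    const col = lines[l].indexOf(marker, from);
    if (col !== -1) return { line: l, column: col };
  }
  return null;
}

// The closing marker on the next line is found...
console.log(findClosing(["a *bold", "still bold* b", "c"], 0, 2, "*"));
// ...but one that is two lines away is outside the window.
console.log(findClosing(["a *open", "x", "y*"], 0, 2, "*"));
```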
At 5k lines on the last example, it's showing an approx 40x speed up. 2021-07-20
(consistently about 2x faster - not too bad because it means we aren't growing at a different rate, but does mean there is potential room for improvement) And for v2.4.9 (2021-07-20) we have:
Which is about the same performance as #104. |
I noticed that for some files (e.g. > 10k lines) the parser practically cannot complete. This seems to be a function of number of lines and line lengths. For example, a large file with many blank lines can still complete. I don't know if this is specific to orga, or to unifyjs.
Here's a quick test on how the total parsing time for a trivial file changes with the line length, content, and line count:
Plotting the output shows this:
I was wondering whether the same regex issue was affecting lines with `(` and `)`. That does not seem to be the case, but the lines with parens are clearly slower than lines without parens.
This is the table from the script (first column is added afterwards):
Is there an easy way to fix this?
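For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a timing harness. It is not the script that produced the numbers above. `makeDocument`, `timeParse` and `benchmark` are hypothetical helper names, and `standInParse` is a trivial placeholder that should be replaced with the real parser function under test. It assumes a Node version that provides the global `performance` object.

```javascript
// Build a synthetic document: `lineCount` identical lines of `lineLength`
// characters, made by repeating the `fill` string.
function makeDocument(lineCount, lineLength, fill = "a") {
  const line = fill.repeat(Math.ceil(lineLength / fill.length)).slice(0, lineLength);
  return Array(lineCount).fill(line).join("\n");
}

// Time a single call of parseFn on the given text, in milliseconds.
function timeParse(parseFn, text) {
  const start = performance.now();
  parseFn(text);
  return performance.now() - start;
}

// Measure every combination of line content, line length and line count.
function benchmark(parseFn, lineCounts, lineLengths, fills) {
  const rows = [];
  for (const fill of fills) {
    for (const lineLength of lineLengths) {
      for (const lineCount of lineCounts) {
        const ms = timeParse(parseFn, makeDocument(lineCount, lineLength, fill));
        rows.push({ fill, lineLength, lineCount, ms });
      }
    }
  }
  return rows;
}

// Placeholder parser (an assumption): swap in the parser being measured.
const standInParse = (text) => text.split("\n").map((line) => line.length);

const rows = benchmark(standInParse, [10, 100, 1000], [10, 100], ["a", "(a) "]);
console.table(rows);
```

Using lines with and without parentheses as the `fill` value makes it possible to compare the effect of line content separately from line length and line count.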