Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance degrades quickly with line length and line count, possibly line content #6

Closed
whacked opened this issue Jul 29, 2018 · 7 comments

Comments

@whacked
Copy link
Contributor

whacked commented Jul 29, 2018

I noticed that for some files (e.g. > 10k lines) the parser practically cannot complete. This seems to be a function of number of lines and line lengths. For example, a large file with many blank lines can still complete. I don't know if this is specific to orga, or to unifyjs.

Here's a quick test on how the total parsing time for a trivial file changes with the line length, content, and line count:

parse = require("orga").parse;

function pad(n, width) {
    n += '';
    return n.length >= width ? n : new Array(width - n.length + 1).join('0') + n;
}

[
    (n) => `_${n}___${n}_`,
    (n) => `(${n})_(${n})`,
    (_) => 'xxxxxx_xxxxxx',
    (_) => 'xxxxxx_xxxxxx######_######',
].forEach(function(templater) {
    for(nlines = 500; nlines <= 5000; nlines += 500) {
        var lines = [
        '* blah\n',
        ];
        for(var n=1;++n<=nlines;) {
            lines.push(templater(pad(n, 4)));
        }
        var text = lines.join('\n');
        var t0 = new Date().getTime();
        var ast = parse(text)
        var tN = new Date().getTime();
        var dt = tN - t0;
        var lastLine = lines[lines.length-1];
        console.log(`|\t${lastLine}\t|\t${nlines}\t|\t${dt}\t|`);
    }
});

Plotting the output shows this:

speedtest

I was wondering whether the same regex issue was affecting likes w ith ( and ) but it seems like that is not the case, but the lines with parens are clearly slower than lines without parens.

This is the table from the script (first column is added afterwards):

|under| _0500___0500_              |  500 |   162 |
|under| _1000___1000_              | 1000 |   626 |
|under| _1500___1500_              | 1500 |  1376 |
|under| _2000___2000_              | 2000 |  2474 |
|under| _2500___2500_              | 2500 |  3865 |
|under| _3000___3000_              | 3000 |  5475 |
|under| _3500___3500_              | 3500 |  7521 |
|under| _4000___4000_              | 4000 |  9850 |
|under| _4500___4500_              | 4500 | 12427 |
|under| _5000___5000_              | 5000 | 15190 |
|paren| (0500)_(0500)              |  500 |   217 |
|paren| (1000)_(1000)              | 1000 |   858 |
|paren| (1500)_(1500)              | 1500 |  1955 |
|paren| (2000)_(2000)              | 2000 |  3483 |
|paren| (2500)_(2500)              | 2500 |  5439 |
|paren| (3000)_(3000)              | 3000 |  7850 |
|paren| (3500)_(3500)              | 3500 | 10711 |
|paren| (4000)_(4000)              | 4000 | 13873 |
|paren| (4500)_(4500)              | 4500 | 17560 |
|paren| (5000)_(5000)              | 5000 | 21619 |
|xxxxx| xxxxxx_xxxxxx              |  500 |   215 |
|xxxxx| xxxxxx_xxxxxx              | 1000 |   854 |
|xxxxx| xxxxxx_xxxxxx              | 1500 |  1923 |
|xxxxx| xxxxxx_xxxxxx              | 2000 |  3429 |
|xxxxx| xxxxxx_xxxxxx              | 2500 |  5471 |
|xxxxx| xxxxxx_xxxxxx              | 3000 |  7843 |
|xxxxx| xxxxxx_xxxxxx              | 3500 | 10564 |
|xxxxx| xxxxxx_xxxxxx              | 4000 | 13859 |
|xxxxx| xxxxxx_xxxxxx              | 4500 | 17445 |
|xxxxx| xxxxxx_xxxxxx              | 5000 | 21576 |
|xand#| xxxxxx_xxxxxx######_###### |  500 |   815 |
|xand#| xxxxxx_xxxxxx######_###### | 1000 |  3211 |
|xand#| xxxxxx_xxxxxx######_###### | 1500 |  7253 |
|xand#| xxxxxx_xxxxxx######_###### | 2000 | 12967 |
|xand#| xxxxxx_xxxxxx######_###### | 2500 | 20093 |
|xand#| xxxxxx_xxxxxx######_###### | 3000 | 28973 |
|xand#| xxxxxx_xxxxxx######_###### | 3500 | 39765 |
|xand#| xxxxxx_xxxxxx######_###### | 4000 | 51717 |
|xand#| xxxxxx_xxxxxx######_###### | 4500 | 65081 |
|xand#| xxxxxx_xxxxxx######_###### | 5000 | 80055 |

Is there an easy way to fix this?

@whacked
Copy link
Contributor Author

whacked commented Jul 29, 2018

Parsing an example string such as

var content = `* blah

xxxxxx
xxxxxx
xxxxxx
xxxxxx
...

running parse(content + "\n" + content + "\n" + content); gives this profiler output:

screenshot 4

@xiaoxinghu
Copy link
Collaborator

Thanks for the impressive analysis, they are super useful. Yes, performance was never the focus up to now, I know there will be issues when the file gets big, because I have similar experiences in my swift version. I have a couple of ideas for fixing it, but it will take some time. Regex is convenient, very good for a quick implementation, but it's not performant in some cases. The heavy regex matching in your screenshots are inline format parsing (e.g. bold, italic, links ...). I won't put quick bandaids on it now, I think I need to do it properly. I will add proper performance tests and try to replace the heavy regex with faster solutions.

@xiaoxinghu
Copy link
Collaborator

@whacked Sorry for the delay of this issue. I was thinking recently, the cause of this issue is probably due to the inline parsing is happening cross multiple lines, so if you have a huge paragraph, the regex perf is going to be slaughtered. New lines (empty lines) are considered breaker of the paragraph, that's why huge files full of empty lines are ok. Maybe it makes sense to only do inline parsing within each line. I was looking around, non of the markdown parsers, or even markdown apps, do cross-line parsing. Even org-mode itself only support inline syntax within 3 lines. If I change the parse to not do multiline parsing, What do you think? That means you have to keep them all in one line, if you use fill-paragraph in emacs to tidy things up, it will sometimes break long syntax into multiple lines, happen often for links. Is it acceptable?

@whacked
Copy link
Contributor Author

whacked commented Nov 13, 2018

Thanks for the update! I think that's generally reasonable. 2 of the other org parsers that I have played with, https://github.com/mooz/org-js/ and https://github.com/daitangio/org-mode-parser also assume single-line parsers at the bottom of the parser hierarchy. I think users who need multiple lines will compromise by using long lines or repeat the markup across several lines.

On the topic of regexes, one issue which also affects inline parsers (as used in the other 2 libraries mentioned here) is that they cannot handle nested markup, such as /_italicized bold_/ or *=bolded code=* (while =*asterisk code*= does not render bold). A stack based parser can handle that, but I don't know how well it fits into orga's architecture.

FWIW I have been running a fork of orga using a modified inline parser like the one in #7 for a while. I think it's ~3x faster, so it still chokes on inputs over e.g. 10k lines. I don't know how easy it is to preserve nested parsing capability and speed.

@xiaoxinghu
Copy link
Collaborator

@whacked I have just removed multiline support for inline parsing (commit). Please verify if it improves the performance (with the release of v1.1.0). Also, there's an overhaul for gatsby-transformer-orga which will probably break your gatsby project. But it adds lots of useful features. Check it out if you use gatsby.

@whacked
Copy link
Contributor Author

whacked commented Jan 3, 2019

I did not do anything extensive, but rerunning the script in this issue yields a 20x~1000x speedup. I will report anything interesting as I update to the new version. Thanks a lot for your work!

@GuiltyDolphin
Copy link
Contributor

GuiltyDolphin commented Jul 18, 2021

Thanks for this @whacked!

2021-07-18T13:20 re-run with #104 which supports multiline inline parsing.

Only run up to 3000 because there's a recursion bug I need to fix, but otherwise seems okay. TODO need to compare to current master.

|  _0500___0500_               |  500   |  131   |
|  _1000___1000_               |  1000  |  263   |
|  _1500___1500_               |  1500  |  433   |
|  _2000___2000_               |  2000  |  619   |
|  _2500___2500_               |  2500  |  852   |
|  _3000___3000_               |  3000  |  1104  |
|  (0500)_(0500)               |  500   |  93    |
|  (1000)_(1000)               |  1000  |  214   |
|  (1500)_(1500)               |  1500  |  376   |
|  (2000)_(2000)               |  2000  |  556   |
|  (2500)_(2500)               |  2500  |  772   |
|  (3000)_(3000)               |  3000  |  958   |
|  xxxxxx_xxxxxx               |  500   |  83    |
|  xxxxxx_xxxxxx               |  1000  |  197   |
|  xxxxxx_xxxxxx               |  1500  |  337   |
|  xxxxxx_xxxxxx               |  2000  |  521   |
|  xxxxxx_xxxxxx               |  2500  |  788   |
|  xxxxxx_xxxxxx               |  3000  |  1043  |
|  xxxxxx_xxxxxx######_######  |  500   |  167   |
|  xxxxxx_xxxxxx######_######  |  1000  |  475   |
|  xxxxxx_xxxxxx######_######  |  1500  |  653   |
|  xxxxxx_xxxxxx######_######  |  2000  |  956   |
|  xxxxxx_xxxxxx######_######  |  2500  |  1343  |
|  xxxxxx_xxxxxx######_######  |  3000  |  1788  |

Edit: note that the markup search space is bounded in #104 - per the Org spec which doesn't allow text markup to span more than 3 lines (the parser actually only allows 2, it seems), so the search space is never more than 2 lines.

Edit: 2021-07-19T10:55. Fixed stack depth bug. Results below:

|  _0500___0500_               |  500   |  133   |
|  _1000___1000_               |  1000  |  245   |
|  _1500___1500_               |  1500  |  343   |
|  _2000___2000_               |  2000  |  456   |
|  _2500___2500_               |  2500  |  585   |
|  _3000___3000_               |  3000  |  740   |
|  _3500___3500_               |  3500  |  887   |
|  _4000___4000_               |  4000  |  1072  |
|  _4500___4500_               |  4500  |  1277  |
|  _5000___5000_               |  5000  |  1524  |
|  (0500)_(0500)               |  500   |  76    |
|  (1000)_(1000)               |  1000  |  160   |
|  (1500)_(1500)               |  1500  |  265   |
|  (2000)_(2000)               |  2000  |  361   |
|  (2500)_(2500)               |  2500  |  461   |
|  (3000)_(3000)               |  3000  |  580   |
|  (3500)_(3500)               |  3500  |  706   |
|  (4000)_(4000)               |  4000  |  845   |
|  (4500)_(4500)               |  4500  |  992   |
|  (5000)_(5000)               |  5000  |  1146  |
|  xxxxxx_xxxxxx               |  500   |  74    |
|  xxxxxx_xxxxxx               |  1000  |  158   |
|  xxxxxx_xxxxxx               |  1500  |  249   |
|  xxxxxx_xxxxxx               |  2000  |  358   |
|  xxxxxx_xxxxxx               |  2500  |  485   |
|  xxxxxx_xxxxxx               |  3000  |  582   |
|  xxxxxx_xxxxxx               |  3500  |  709   |
|  xxxxxx_xxxxxx               |  4000  |  846   |
|  xxxxxx_xxxxxx               |  4500  |  990   |
|  xxxxxx_xxxxxx               |  5000  |  1145  |
|  xxxxxx_xxxxxx######_######  |  500   |  131   |
|  xxxxxx_xxxxxx######_######  |  1000  |  279   |
|  xxxxxx_xxxxxx######_######  |  1500  |  443   |
|  xxxxxx_xxxxxx######_######  |  2000  |  622   |
|  xxxxxx_xxxxxx######_######  |  2500  |  821   |
|  xxxxxx_xxxxxx######_######  |  3000  |  1030  |
|  xxxxxx_xxxxxx######_######  |  3500  |  1262  |
|  xxxxxx_xxxxxx######_######  |  4000  |  1510  |
|  xxxxxx_xxxxxx######_######  |  4500  |  1771  |
|  xxxxxx_xxxxxx######_######  |  5000  |  2052  |

At 5k lines on the last example, it's showing an approx 40x speed up.

2021-07-20 master (v2.4.8) branch (with stack depth fix):

|  _0500___0500_               |  500   |  64   |
|  _1000___1000_               |  1000  |  131  |
|  _1500___1500_               |  1500  |  93   |
|  _2000___2000_               |  2000  |  114  |
|  _2500___2500_               |  2500  |  145  |
|  _3000___3000_               |  3000  |  166  |
|  _3500___3500_               |  3500  |  196  |
|  _4000___4000_               |  4000  |  225  |
|  _4500___4500_               |  4500  |  259  |
|  _5000___5000_               |  5000  |  296  |
|  (0500)_(0500)               |  500   |  59   |
|  (1000)_(1000)               |  1000  |  107  |
|  (1500)_(1500)               |  1500  |  164  |
|  (2000)_(2000)               |  2000  |  222  |
|  (2500)_(2500)               |  2500  |  286  |
|  (3000)_(3000)               |  3000  |  354  |
|  (3500)_(3500)               |  3500  |  422  |
|  (4000)_(4000)               |  4000  |  500  |
|  (4500)_(4500)               |  4500  |  569  |
|  (5000)_(5000)               |  5000  |  634  |
|  xxxxxx_xxxxxx               |  500   |  44   |
|  xxxxxx_xxxxxx               |  1000  |  97   |
|  xxxxxx_xxxxxx               |  1500  |  151  |
|  xxxxxx_xxxxxx               |  2000  |  208  |
|  xxxxxx_xxxxxx               |  2500  |  276  |
|  xxxxxx_xxxxxx               |  3000  |  333  |
|  xxxxxx_xxxxxx               |  3500  |  399  |
|  xxxxxx_xxxxxx               |  4000  |  468  |
|  xxxxxx_xxxxxx               |  4500  |  537  |
|  xxxxxx_xxxxxx               |  5000  |  614  |
|  xxxxxx_xxxxxx######_######  |  500   |  74   |
|  xxxxxx_xxxxxx######_######  |  1000  |  155  |
|  xxxxxx_xxxxxx######_######  |  1500  |  251  |
|  xxxxxx_xxxxxx######_######  |  2000  |  341  |
|  xxxxxx_xxxxxx######_######  |  2500  |  439  |
|  xxxxxx_xxxxxx######_######  |  3000  |  542  |
|  xxxxxx_xxxxxx######_######  |  3500  |  650  |
|  xxxxxx_xxxxxx######_######  |  4000  |  761  |
|  xxxxxx_xxxxxx######_######  |  4500  |  862  |
|  xxxxxx_xxxxxx######_######  |  5000  |  974  |

(consistently about 2x faster - not too bad because it means we aren't growing at a different rate, but does mean there is potential room for improvement)

And for v2.4.9 (2021-07-20) we have:

|  _0500___0500_               |  500   |  50    |
|  _1000___1000_               |  1000  |  112   |
|  _1500___1500_               |  1500  |  141   |
|  _2000___2000_               |  2000  |  183   |
|  _2500___2500_               |  2500  |  218   |
|  _3000___3000_               |  3000  |  267   |
|  _3500___3500_               |  3500  |  332   |
|  _4000___4000_               |  4000  |  375   |
|  _4500___4500_               |  4500  |  419   |
|  _5000___5000_               |  5000  |  483   |
|  (0500)_(0500)               |  500   |  90    |
|  (1000)_(1000)               |  1000  |  190   |
|  (1500)_(1500)               |  1500  |  295   |
|  (2000)_(2000)               |  2000  |  401   |
|  (2500)_(2500)               |  2500  |  514   |
|  (3000)_(3000)               |  3000  |  635   |
|  (3500)_(3500)               |  3500  |  755   |
|  (4000)_(4000)               |  4000  |  870   |
|  (4500)_(4500)               |  4500  |  1013  |
|  (5000)_(5000)               |  5000  |  1140  |
|  xxxxxx_xxxxxx               |  500   |  90    |
|  xxxxxx_xxxxxx               |  1000  |  190   |
|  xxxxxx_xxxxxx               |  1500  |  301   |
|  xxxxxx_xxxxxx               |  2000  |  412   |
|  xxxxxx_xxxxxx               |  2500  |  527   |
|  xxxxxx_xxxxxx               |  3000  |  652   |
|  xxxxxx_xxxxxx               |  3500  |  769   |
|  xxxxxx_xxxxxx               |  4000  |  894   |
|  xxxxxx_xxxxxx               |  4500  |  1019  |
|  xxxxxx_xxxxxx               |  5000  |  1162  |
|  xxxxxx_xxxxxx######_######  |  500   |  151   |
|  xxxxxx_xxxxxx######_######  |  1000  |  323   |
|  xxxxxx_xxxxxx######_######  |  1500  |  507   |
|  xxxxxx_xxxxxx######_######  |  2000  |  691   |
|  xxxxxx_xxxxxx######_######  |  2500  |  886   |
|  xxxxxx_xxxxxx######_######  |  3000  |  1089  |
|  xxxxxx_xxxxxx######_######  |  3500  |  1285  |
|  xxxxxx_xxxxxx######_######  |  4000  |  1494  |
|  xxxxxx_xxxxxx######_######  |  4500  |  1719  |
|  xxxxxx_xxxxxx######_######  |  5000  |  1934  |

Which is about the same performance as #104.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants