Performance degrades quickly with line length and line count, possibly line content #6

whacked · 2018-07-29T21:57:07Z

I noticed that for some files (e.g. > 10k lines) the parser practically cannot complete. This seems to be a function of number of lines and line lengths. For example, a large file with many blank lines can still complete. I don't know if this is specific to orga, or to unifyjs.

Here's a quick test on how the total parsing time for a trivial file changes with the line length, content, and line count:

parse = require("orga").parse;

function pad(n, width) {
    n += '';
    return n.length >= width ? n : new Array(width - n.length + 1).join('0') + n;
}

[
    (n) => `_${n}___${n}_`,
    (n) => `(${n})_(${n})`,
    (_) => 'xxxxxx_xxxxxx',
    (_) => 'xxxxxx_xxxxxx######_######',
].forEach(function(templater) {
    for(nlines = 500; nlines <= 5000; nlines += 500) {
        var lines = [
        '* blah\n',
        ];
        for(var n=1;++n<=nlines;) {
            lines.push(templater(pad(n, 4)));
        }
        var text = lines.join('\n');
        var t0 = new Date().getTime();
        var ast = parse(text)
        var tN = new Date().getTime();
        var dt = tN - t0;
        var lastLine = lines[lines.length-1];
        console.log(`|\t${lastLine}\t|\t${nlines}\t|\t${dt}\t|`);
    }
});

Plotting the output shows this:

I was wondering whether the same regex issue was affecting likes w ith ( and ) but it seems like that is not the case, but the lines with parens are clearly slower than lines without parens.

This is the table from the script (first column is added afterwards):

|under| _0500___0500_              |  500 |   162 |
|under| _1000___1000_              | 1000 |   626 |
|under| _1500___1500_              | 1500 |  1376 |
|under| _2000___2000_              | 2000 |  2474 |
|under| _2500___2500_              | 2500 |  3865 |
|under| _3000___3000_              | 3000 |  5475 |
|under| _3500___3500_              | 3500 |  7521 |
|under| _4000___4000_              | 4000 |  9850 |
|under| _4500___4500_              | 4500 | 12427 |
|under| _5000___5000_              | 5000 | 15190 |
|paren| (0500)_(0500)              |  500 |   217 |
|paren| (1000)_(1000)              | 1000 |   858 |
|paren| (1500)_(1500)              | 1500 |  1955 |
|paren| (2000)_(2000)              | 2000 |  3483 |
|paren| (2500)_(2500)              | 2500 |  5439 |
|paren| (3000)_(3000)              | 3000 |  7850 |
|paren| (3500)_(3500)              | 3500 | 10711 |
|paren| (4000)_(4000)              | 4000 | 13873 |
|paren| (4500)_(4500)              | 4500 | 17560 |
|paren| (5000)_(5000)              | 5000 | 21619 |
|xxxxx| xxxxxx_xxxxxx              |  500 |   215 |
|xxxxx| xxxxxx_xxxxxx              | 1000 |   854 |
|xxxxx| xxxxxx_xxxxxx              | 1500 |  1923 |
|xxxxx| xxxxxx_xxxxxx              | 2000 |  3429 |
|xxxxx| xxxxxx_xxxxxx              | 2500 |  5471 |
|xxxxx| xxxxxx_xxxxxx              | 3000 |  7843 |
|xxxxx| xxxxxx_xxxxxx              | 3500 | 10564 |
|xxxxx| xxxxxx_xxxxxx              | 4000 | 13859 |
|xxxxx| xxxxxx_xxxxxx              | 4500 | 17445 |
|xxxxx| xxxxxx_xxxxxx              | 5000 | 21576 |
|xand#| xxxxxx_xxxxxx######_###### |  500 |   815 |
|xand#| xxxxxx_xxxxxx######_###### | 1000 |  3211 |
|xand#| xxxxxx_xxxxxx######_###### | 1500 |  7253 |
|xand#| xxxxxx_xxxxxx######_###### | 2000 | 12967 |
|xand#| xxxxxx_xxxxxx######_###### | 2500 | 20093 |
|xand#| xxxxxx_xxxxxx######_###### | 3000 | 28973 |
|xand#| xxxxxx_xxxxxx######_###### | 3500 | 39765 |
|xand#| xxxxxx_xxxxxx######_###### | 4000 | 51717 |
|xand#| xxxxxx_xxxxxx######_###### | 4500 | 65081 |
|xand#| xxxxxx_xxxxxx######_###### | 5000 | 80055 |

Is there an easy way to fix this?

The text was updated successfully, but these errors were encountered:

whacked · 2018-07-29T22:19:22Z

Parsing an example string such as

var content = `* blah

xxxxxx
xxxxxx
xxxxxx
xxxxxx
...

running parse(content + "\n" + content + "\n" + content); gives this profiler output:

xiaoxinghu · 2018-07-29T23:12:34Z

Thanks for the impressive analysis, they are super useful. Yes, performance was never the focus up to now, I know there will be issues when the file gets big, because I have similar experiences in my swift version. I have a couple of ideas for fixing it, but it will take some time. Regex is convenient, very good for a quick implementation, but it's not performant in some cases. The heavy regex matching in your screenshots are inline format parsing (e.g. bold, italic, links ...). I won't put quick bandaids on it now, I think I need to do it properly. I will add proper performance tests and try to replace the heavy regex with faster solutions.

xiaoxinghu · 2018-11-09T03:54:46Z

@whacked Sorry for the delay of this issue. I was thinking recently, the cause of this issue is probably due to the inline parsing is happening cross multiple lines, so if you have a huge paragraph, the regex perf is going to be slaughtered. New lines (empty lines) are considered breaker of the paragraph, that's why huge files full of empty lines are ok. Maybe it makes sense to only do inline parsing within each line. I was looking around, non of the markdown parsers, or even markdown apps, do cross-line parsing. Even org-mode itself only support inline syntax within 3 lines. If I change the parse to not do multiline parsing, What do you think? That means you have to keep them all in one line, if you use fill-paragraph in emacs to tidy things up, it will sometimes break long syntax into multiple lines, happen often for links. Is it acceptable?

whacked · 2018-11-13T05:36:39Z

Thanks for the update! I think that's generally reasonable. 2 of the other org parsers that I have played with, https://github.com/mooz/org-js/ and https://github.com/daitangio/org-mode-parser also assume single-line parsers at the bottom of the parser hierarchy. I think users who need multiple lines will compromise by using long lines or repeat the markup across several lines.

On the topic of regexes, one issue which also affects inline parsers (as used in the other 2 libraries mentioned here) is that they cannot handle nested markup, such as /_italicized bold_/ or *=bolded code=* (while =*asterisk code*= does not render bold). A stack based parser can handle that, but I don't know how well it fits into orga's architecture.

FWIW I have been running a fork of orga using a modified inline parser like the one in #7 for a while. I think it's ~3x faster, so it still chokes on inputs over e.g. 10k lines. I don't know how easy it is to preserve nested parsing capability and speed.

xiaoxinghu · 2018-12-24T23:09:07Z

@whacked I have just removed multiline support for inline parsing (commit). Please verify if it improves the performance (with the release of v1.1.0). Also, there's an overhaul for gatsby-transformer-orga which will probably break your gatsby project. But it adds lots of useful features. Check it out if you use gatsby.

whacked · 2019-01-03T07:01:48Z

I did not do anything extensive, but rerunning the script in this issue yields a 20x~1000x speedup. I will report anything interesting as I update to the new version. Thanks a lot for your work!

GuiltyDolphin · 2021-07-18T12:24:32Z

Thanks for this @whacked!

2021-07-18T13:20 re-run with #104 which supports multiline inline parsing.

Only run up to 3000 because there's a recursion bug I need to fix, but otherwise seems okay. TODO need to compare to current master.

|  _0500___0500_               |  500   |  131   |
|  _1000___1000_               |  1000  |  263   |
|  _1500___1500_               |  1500  |  433   |
|  _2000___2000_               |  2000  |  619   |
|  _2500___2500_               |  2500  |  852   |
|  _3000___3000_               |  3000  |  1104  |
|  (0500)_(0500)               |  500   |  93    |
|  (1000)_(1000)               |  1000  |  214   |
|  (1500)_(1500)               |  1500  |  376   |
|  (2000)_(2000)               |  2000  |  556   |
|  (2500)_(2500)               |  2500  |  772   |
|  (3000)_(3000)               |  3000  |  958   |
|  xxxxxx_xxxxxx               |  500   |  83    |
|  xxxxxx_xxxxxx               |  1000  |  197   |
|  xxxxxx_xxxxxx               |  1500  |  337   |
|  xxxxxx_xxxxxx               |  2000  |  521   |
|  xxxxxx_xxxxxx               |  2500  |  788   |
|  xxxxxx_xxxxxx               |  3000  |  1043  |
|  xxxxxx_xxxxxx######_######  |  500   |  167   |
|  xxxxxx_xxxxxx######_######  |  1000  |  475   |
|  xxxxxx_xxxxxx######_######  |  1500  |  653   |
|  xxxxxx_xxxxxx######_######  |  2000  |  956   |
|  xxxxxx_xxxxxx######_######  |  2500  |  1343  |
|  xxxxxx_xxxxxx######_######  |  3000  |  1788  |

Edit: note that the markup search space is bounded in #104 - per the Org spec which doesn't allow text markup to span more than 3 lines (the parser actually only allows 2, it seems), so the search space is never more than 2 lines.

Edit: 2021-07-19T10:55. Fixed stack depth bug. Results below:

|  _0500___0500_               |  500   |  133   |
|  _1000___1000_               |  1000  |  245   |
|  _1500___1500_               |  1500  |  343   |
|  _2000___2000_               |  2000  |  456   |
|  _2500___2500_               |  2500  |  585   |
|  _3000___3000_               |  3000  |  740   |
|  _3500___3500_               |  3500  |  887   |
|  _4000___4000_               |  4000  |  1072  |
|  _4500___4500_               |  4500  |  1277  |
|  _5000___5000_               |  5000  |  1524  |
|  (0500)_(0500)               |  500   |  76    |
|  (1000)_(1000)               |  1000  |  160   |
|  (1500)_(1500)               |  1500  |  265   |
|  (2000)_(2000)               |  2000  |  361   |
|  (2500)_(2500)               |  2500  |  461   |
|  (3000)_(3000)               |  3000  |  580   |
|  (3500)_(3500)               |  3500  |  706   |
|  (4000)_(4000)               |  4000  |  845   |
|  (4500)_(4500)               |  4500  |  992   |
|  (5000)_(5000)               |  5000  |  1146  |
|  xxxxxx_xxxxxx               |  500   |  74    |
|  xxxxxx_xxxxxx               |  1000  |  158   |
|  xxxxxx_xxxxxx               |  1500  |  249   |
|  xxxxxx_xxxxxx               |  2000  |  358   |
|  xxxxxx_xxxxxx               |  2500  |  485   |
|  xxxxxx_xxxxxx               |  3000  |  582   |
|  xxxxxx_xxxxxx               |  3500  |  709   |
|  xxxxxx_xxxxxx               |  4000  |  846   |
|  xxxxxx_xxxxxx               |  4500  |  990   |
|  xxxxxx_xxxxxx               |  5000  |  1145  |
|  xxxxxx_xxxxxx######_######  |  500   |  131   |
|  xxxxxx_xxxxxx######_######  |  1000  |  279   |
|  xxxxxx_xxxxxx######_######  |  1500  |  443   |
|  xxxxxx_xxxxxx######_######  |  2000  |  622   |
|  xxxxxx_xxxxxx######_######  |  2500  |  821   |
|  xxxxxx_xxxxxx######_######  |  3000  |  1030  |
|  xxxxxx_xxxxxx######_######  |  3500  |  1262  |
|  xxxxxx_xxxxxx######_######  |  4000  |  1510  |
|  xxxxxx_xxxxxx######_######  |  4500  |  1771  |
|  xxxxxx_xxxxxx######_######  |  5000  |  2052  |

At 5k lines on the last example, it's showing an approx 40x speed up.

2021-07-20 master (v2.4.8) branch (with stack depth fix):

|  _0500___0500_               |  500   |  64   |
|  _1000___1000_               |  1000  |  131  |
|  _1500___1500_               |  1500  |  93   |
|  _2000___2000_               |  2000  |  114  |
|  _2500___2500_               |  2500  |  145  |
|  _3000___3000_               |  3000  |  166  |
|  _3500___3500_               |  3500  |  196  |
|  _4000___4000_               |  4000  |  225  |
|  _4500___4500_               |  4500  |  259  |
|  _5000___5000_               |  5000  |  296  |
|  (0500)_(0500)               |  500   |  59   |
|  (1000)_(1000)               |  1000  |  107  |
|  (1500)_(1500)               |  1500  |  164  |
|  (2000)_(2000)               |  2000  |  222  |
|  (2500)_(2500)               |  2500  |  286  |
|  (3000)_(3000)               |  3000  |  354  |
|  (3500)_(3500)               |  3500  |  422  |
|  (4000)_(4000)               |  4000  |  500  |
|  (4500)_(4500)               |  4500  |  569  |
|  (5000)_(5000)               |  5000  |  634  |
|  xxxxxx_xxxxxx               |  500   |  44   |
|  xxxxxx_xxxxxx               |  1000  |  97   |
|  xxxxxx_xxxxxx               |  1500  |  151  |
|  xxxxxx_xxxxxx               |  2000  |  208  |
|  xxxxxx_xxxxxx               |  2500  |  276  |
|  xxxxxx_xxxxxx               |  3000  |  333  |
|  xxxxxx_xxxxxx               |  3500  |  399  |
|  xxxxxx_xxxxxx               |  4000  |  468  |
|  xxxxxx_xxxxxx               |  4500  |  537  |
|  xxxxxx_xxxxxx               |  5000  |  614  |
|  xxxxxx_xxxxxx######_######  |  500   |  74   |
|  xxxxxx_xxxxxx######_######  |  1000  |  155  |
|  xxxxxx_xxxxxx######_######  |  1500  |  251  |
|  xxxxxx_xxxxxx######_######  |  2000  |  341  |
|  xxxxxx_xxxxxx######_######  |  2500  |  439  |
|  xxxxxx_xxxxxx######_######  |  3000  |  542  |
|  xxxxxx_xxxxxx######_######  |  3500  |  650  |
|  xxxxxx_xxxxxx######_######  |  4000  |  761  |
|  xxxxxx_xxxxxx######_######  |  4500  |  862  |
|  xxxxxx_xxxxxx######_######  |  5000  |  974  |

(consistently about 2x faster - not too bad because it means we aren't growing at a different rate, but does mean there is potential room for improvement)

And for v2.4.9 (2021-07-20) we have:

|  _0500___0500_               |  500   |  50    |
|  _1000___1000_               |  1000  |  112   |
|  _1500___1500_               |  1500  |  141   |
|  _2000___2000_               |  2000  |  183   |
|  _2500___2500_               |  2500  |  218   |
|  _3000___3000_               |  3000  |  267   |
|  _3500___3500_               |  3500  |  332   |
|  _4000___4000_               |  4000  |  375   |
|  _4500___4500_               |  4500  |  419   |
|  _5000___5000_               |  5000  |  483   |
|  (0500)_(0500)               |  500   |  90    |
|  (1000)_(1000)               |  1000  |  190   |
|  (1500)_(1500)               |  1500  |  295   |
|  (2000)_(2000)               |  2000  |  401   |
|  (2500)_(2500)               |  2500  |  514   |
|  (3000)_(3000)               |  3000  |  635   |
|  (3500)_(3500)               |  3500  |  755   |
|  (4000)_(4000)               |  4000  |  870   |
|  (4500)_(4500)               |  4500  |  1013  |
|  (5000)_(5000)               |  5000  |  1140  |
|  xxxxxx_xxxxxx               |  500   |  90    |
|  xxxxxx_xxxxxx               |  1000  |  190   |
|  xxxxxx_xxxxxx               |  1500  |  301   |
|  xxxxxx_xxxxxx               |  2000  |  412   |
|  xxxxxx_xxxxxx               |  2500  |  527   |
|  xxxxxx_xxxxxx               |  3000  |  652   |
|  xxxxxx_xxxxxx               |  3500  |  769   |
|  xxxxxx_xxxxxx               |  4000  |  894   |
|  xxxxxx_xxxxxx               |  4500  |  1019  |
|  xxxxxx_xxxxxx               |  5000  |  1162  |
|  xxxxxx_xxxxxx######_######  |  500   |  151   |
|  xxxxxx_xxxxxx######_######  |  1000  |  323   |
|  xxxxxx_xxxxxx######_######  |  1500  |  507   |
|  xxxxxx_xxxxxx######_######  |  2000  |  691   |
|  xxxxxx_xxxxxx######_######  |  2500  |  886   |
|  xxxxxx_xxxxxx######_######  |  3000  |  1089  |
|  xxxxxx_xxxxxx######_######  |  3500  |  1285  |
|  xxxxxx_xxxxxx######_######  |  4000  |  1494  |
|  xxxxxx_xxxxxx######_######  |  4500  |  1719  |
|  xxxxxx_xxxxxx######_######  |  5000  |  1934  |

Which is about the same performance as #104.

whacked mentioned this issue Aug 27, 2018

Minimal example does not work #7

Closed

whacked closed this as completed Jan 3, 2019

xiaoxinghu mentioned this issue Aug 28, 2019

newline breaks italic markup #34

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance degrades quickly with line length and line count, possibly line content #6

Performance degrades quickly with line length and line count, possibly line content #6

whacked commented Jul 29, 2018

whacked commented Jul 29, 2018

xiaoxinghu commented Jul 29, 2018

xiaoxinghu commented Nov 9, 2018

whacked commented Nov 13, 2018

xiaoxinghu commented Dec 24, 2018

whacked commented Jan 3, 2019

GuiltyDolphin commented Jul 18, 2021 •

edited

Loading

Performance degrades quickly with line length and line count, possibly line content #6

Performance degrades quickly with line length and line count, possibly line content #6

Comments

whacked commented Jul 29, 2018

whacked commented Jul 29, 2018

xiaoxinghu commented Jul 29, 2018

xiaoxinghu commented Nov 9, 2018

whacked commented Nov 13, 2018

xiaoxinghu commented Dec 24, 2018

whacked commented Jan 3, 2019

GuiltyDolphin commented Jul 18, 2021 • edited Loading

GuiltyDolphin commented Jul 18, 2021 •

edited

Loading