Add a memoize to makeLineColumnIndex for large input scenarios #297

matthemsteger · 2020-07-04T17:45:10Z

I had an example of parsing a 231K file, which took around 56 seconds. Profiling showed that makeLineColumnIndex was being called a lot and accounted for ~40 seconds of that time. While debugging, I found that the function was being called multiple times in sequence with the same arguments.

I added a simple memoize with a cache of 1 (the last argument set) and the parse time went down to 13 seconds.
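For readers who want to see the idea, here is a minimal sketch of a "cache of 1" memoize. It is not the code from this PR: `memoizeLast` and `slowLookup` are hypothetical names, and the sketch assumes the wrapped function takes exactly two arguments (the input and an index) that can be compared with `===`.

```javascript
// Hypothetical helper: remembers only the most recent (input, index) pair.
function memoizeLast(fn) {
  let hasCached = false;
  let lastInput;
  let lastIndex;
  let lastResult;
  return function (input, index) {
    // Cache hit only when both arguments match the previous call exactly.
    if (hasCached && input === lastInput && index === lastIndex) {
      return lastResult;
    }
    lastResult = fn(input, index);
    lastInput = input;
    lastIndex = index;
    hasCached = true;
    return lastResult;
  };
}

// Stand-in for an expensive lookup; counts how often it really runs.
let calls = 0;
function slowLookup(input, index) {
  calls += 1;
  return { offset: index };
}

const cachedLookup = memoizeLast(slowLookup);
cachedLookup("ab\ncd", 4);
cachedLookup("ab\ncd", 4); // same arguments in sequence: served from the cache
cachedLookup("ab\ncd", 5); // different index: computed again
console.log(calls); // 2
```

One thing to keep in mind with this shape: a cache hit returns the same object as the previous call, so callers should treat the result as read-only.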

I am not familiar with the internals of parsimmon, so I can't say why the same function is being called with the same arguments (maybe it is my parser implementation). My use case is parsing this file while keeping a lot of marks/index values, because later I want to use the result as an AST to edit the file quickly and precisely.

coveralls · 2020-07-04T17:46:40Z

Coverage remained the same at 100.0% when pulling fd08950 on matthemsteger:features/memoize-makelinecolumnindex into ce035ee on jneen:master.

wavebeem · 2020-07-04T20:46:57Z

Ooh, nice!

I'm gonna guess you're either calling .mark or .node a lot (this is expected and good). I had a feeling this area probably had really bad performance, but I never bothered profiling it to see how to optimize it. I really appreciate this optimization.

I'd love to get this and #298 released together soon :)

anko · 2020-07-04T21:53:44Z

It might also be worth optimising the logic.

Right now makeLineColumnIndex splits the entire input by newlines into an array of strings, but it only actually uses the last one (to calculate columnWeAreUpTo), and even then the actual string content is unimportant.
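As a rough illustration of the split-based approach described above (not a copy of Parsimmon's source), the sketch below uses a hypothetical function `lineColumnBySplit`. It assumes that only the text before the index is split, and that line and column are 1-based while the offset is 0-based.

```javascript
// Sketch of a split-based lookup; names and 1-based conventions are assumptions.
function lineColumnBySplit(input, index) {
  // Split the text before the index into lines.
  const lines = input.slice(0, index).split("\n");
  return {
    offset: index,
    // The number of pieces is the line we are on.
    line: lines.length,
    // Only the length of the last piece matters, not its content.
    column: lines[lines.length - 1].length + 1,
  };
}

console.log(lineColumnBySplit("one\ntwo\nthree", 9)); // { offset: 9, line: 3, column: 2 }
```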

I think it might be faster to find matches for /\n/g and calculate line and column indexes based on the match indexes, never splitting the string.
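A minimal sketch of what that could look like, assuming 1-based line and column numbers and a 0-based offset. `lineColumnByRegex` is a hypothetical name, and the sketch makes no claim about speed; it only shows that the positions can be derived from match indexes without building an array of lines.

```javascript
// Sketch: derive line/column from newline match indexes, never splitting.
function lineColumnByRegex(input, index) {
  const newline = /\n/g; // fresh regex per call, so lastIndex starts at 0
  let line = 1;
  let lineStart = 0; // offset of the first character of the current line
  let match;
  // Walk newline matches until one sits at or past the target index.
  while ((match = newline.exec(input)) !== null && match.index < index) {
    line += 1;
    lineStart = match.index + 1;
  }
  return { offset: index, line: line, column: index - lineStart + 1 };
}

console.log(lineColumnByRegex("one\ntwo\nthree", 9)); // { offset: 9, line: 3, column: 2 }
```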

wavebeem · 2020-07-04T23:39:42Z

@anko that's a great point. i've definitely thought about changing it to just be a tight loop that looks at charAt(i) === "\n" to manage the numbers, but i just haven't done any real perf testing myself on parsimmon
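For reference, a tight loop of that kind might look like the sketch below. `lineColumnByLoop` is a hypothetical name, and 1-based line and column numbers are an assumption; nothing here has been measured.

```javascript
// Sketch: count newlines with charAt, tracking line and column as we go.
function lineColumnByLoop(input, index) {
  let line = 1;
  let column = 1;
  for (let j = 0; j < index; j++) {
    if (input.charAt(j) === "\n") {
      // A newline starts the next line at column 1.
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
  return { offset: index, line: line, column: column };
}

console.log(lineColumnByLoop("one\ntwo\nthree", 9)); // { offset: 9, line: 3, column: 2 }
```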

wavebeem · 2020-07-06T04:14:55Z

@anko I tried switching to a for (var j = 0; j < i; j++) loop with charAt and it was actually way slower than using split 🤷 Either way, line/column index is kind of a hack right now, since it involves re-scanning the input up to the current index. It shouldn't be O(n^2) to get the line/column info, but it's gonna stay that way until Parsimmon is rewritten to track that info along with the index.
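To make the "track that info along with the index" idea concrete, here is a small sketch of an incremental tracker. It is not how Parsimmon works today, and `makePositionTracker` and `advanceTo` are hypothetical names. The sketch assumes 1-based line and column numbers, and it simply restarts from the beginning when asked for an earlier offset, which a backtracking parser would do regularly.

```javascript
// Sketch: remember the last position and only scan the characters in between.
function makePositionTracker(input) {
  let offset = 0;
  let line = 1;
  let column = 1;
  return function advanceTo(target) {
    if (target < offset) {
      // Moving backwards: fall back to a rescan from the start.
      offset = 0;
      line = 1;
      column = 1;
    }
    while (offset < target && offset < input.length) {
      if (input.charAt(offset) === "\n") {
        line += 1;
        column = 1;
      } else {
        column += 1;
      }
      offset += 1;
    }
    return { offset: offset, line: line, column: column };
  };
}

const advanceTo = makePositionTracker("one\ntwo\nthree");
console.log(advanceTo(4)); // { offset: 4, line: 2, column: 1 }
console.log(advanceTo(9)); // { offset: 9, line: 3, column: 2 }
```

With forward-only queries, each character is visited once in total rather than once per lookup.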

anko · 2020-07-07T17:07:32Z

Following up on my own guess:

"I think it might be faster to find matches for /\n/g and calculate line and column indexes based on the match indexes, never splitting the string."

I've greatly underestimated .split()'s performance! Both regex-based alternatives I could think of perform worse than it. On Firefox I get

split: 486,486 ops/s ±0.41%, fastest
match lines, then last line: 276,167 ops/s ±1.87%, 43.23% slower
re.exec loop: 381,400 ops/s ±2.31%, 21.6% slower

On Chromium both are almost 100% worse.

wavebeem · 2020-07-07T17:42:43Z

@anko wow, thanks for checking that out! good to know 😄

Add a memoize to makeLineColumnIndex for large input scenarios

fd08950

wavebeem merged commit 3827e04 into jneen:master Jul 4, 2020

brendo-m mentioned this pull request Apr 16, 2021

fix: Improve the line/column number caching logic #321 (merged)
