Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a memoize to makeLineColumnIndex for large input scenarios #297

Merged

Conversation

matthemsteger
Copy link
Contributor

I had an example of parsing a 231K file. This was taking around 56 seconds. I did profiling and found that it was calling makeLineColumnIndex a lot, it accounted for ~40 seconds of the time spent. While running through debugging, I found that this function was being called multiple times with the same arguments, in sequence.

I added a simple memoize with a cache of 1 (the last argument set) and the parse time went down to 13 seconds.

So I am not familiar with the internals of parsimmon so I can't say why the same function is being with the same arguments (maybe it is my parser implementation). The use case I have is parsing through this file, and I want to keep a lot of marks/index values because later I want to use this as an AST to edit the file quickly and precisely.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling fd08950 on matthemsteger:features/memoize-makelinecolumnindex into ce035ee on jneen:master.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 4, 2020

Ooh, nice!

I'm gonna guess you're either calling .mark or .node a lot (this is expected and good). I had a feeling this area probably had really bad performance, but I never bothered profiling it to see how to optimize it. I really appreciate this optimization.

I'd love to get this and #298 released together soon :)

@wavebeem wavebeem merged commit 3827e04 into jneen:master Jul 4, 2020
@anko
Copy link
Contributor

anko commented Jul 4, 2020

It might also be worth optimising the logic.

Right now makeLineColumnIndex splits the entire input by newlines into an array of strings, but it only actually uses the last one (to calculate columnWeAreUpTo), and even then the actual string content is unimportant.

I think it might be faster to find matches for /\n/g and calculate line and column indexes based on the match indexes, never splitting the string.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 4, 2020

@anko that's a great point. i've definitely thought about changing it to just be a tight loop that looks at charAt(i) === "\n" to manage the numbers, but i just haven't done any real perf testing myself on parsimmon

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 6, 2020

@anko I tried switching to a for (var j = 0; j < i; j++) loop with charAt and it was actually way slower than using split 🤷 Either way, line/column index is kind of a hack right now, since it involves re-scanning the input up to the current index. It shouldn't be O(n^2) to get the line/column info, but it's gonna stay that way until Parsimmon is rewritten to track that info along with the index.

@anko
Copy link
Contributor

anko commented Jul 7, 2020

Following up on my own guess:

I think it might be faster to find matches for /\n/g and calculate line and column indexes based on the match indexes, never splitting the string.

I've greatly underestimated .split()'s performance! Both regex-based alternatives I could think of perform worse than it. On Firefox I get

  • split: 486,486 ops/s ±0.41%, fastest
  • match lines, then last line: 276,167 ops/s ±1.87%, 43.23% slower
  • re.exec loop: 381,400 ops/s ±2.31%, 21.6% slower

On Chromium both are almost 100% worse.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 7, 2020

@anko wow, thanks for checking that out! good to know 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants