
Let's booooooost! #814

Merged (5 commits, Sep 30, 2023)

Conversation

@QuarticCat (Contributor) commented Jul 29, 2023

I'm a developer of paguroidea. After we benchmarked several parsers, we noticed that lalrpop was abnormally slow, so I decided to dig into the reason. It turned out to be a problem in the lexer. This PR aims to improve the lexing performance, making it a bit more decent.

First commit

  • Change the default skip rule from \s* to \s+. If the match is empty, why bother skipping it? This change boosted throughput from 4.5 MiB/s to 5.5 MiB/s in our JSON benchmark.
  • Use an ASCII skip rule when the "unicode" feature is disabled. Purely an ergonomic improvement.
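As a toy sketch of the difference (plain Rust with made-up helper names, not lalrpop's actual code): a \s*-style rule always succeeds, possibly with an empty match, so the lexer pays a skip step at every token boundary, while a \s+-style rule simply fails when there is nothing to skip.

```rust
/// `\s*` behavior: always "matches", possibly with length 0.
fn skip_len_star(input: &str) -> usize {
    input.len() - input.trim_start().len()
}

/// `\s+` behavior: fails outright when there is no whitespace,
/// so the caller can branch straight to token matching.
fn skip_len_plus(input: &str) -> Option<usize> {
    match skip_len_star(input) {
        0 => None,
        n => Some(n),
    }
}
```

With the \s* behavior, an input with no whitespace at all still produces one (empty) skip match per token.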

Second commit

  • Rewrite the lexer to use the underlying regex-automata directly, which gives us more control and lets us achieve better performance. This change boosted throughput from 5.5 MiB/s to 138 MiB/s in our JSON benchmark.

The benchmark code can be found at https://github.com/SchrodingerZhu/paguroidea/tree/main/benches/json; I tested it under the qc-temp branch. Here are some of the final results:

twitter-json/lalrpop    time:   [5.2708 ms 5.2853 ms 5.3033 ms]
                        thrpt:  [137.91 MiB/s 138.38 MiB/s 138.76 MiB/s]
twitter-json/lalrpop+logos
                        time:   [2.2479 ms 2.2544 ms 2.2623 ms]
                        thrpt:  [323.29 MiB/s 324.43 MiB/s 325.37 MiB/s]

EDIT: In our benchmark we didn't turn off the unicode feature, although it's not required. This is because, due to lalrpop limitations, disabling the unicode feature means rewriting the skip rule, and rewriting the skip rule means duplicating all the other lexer rules. The first commit solves this problem.

@youknowone (Contributor) left a comment

Could you check why CI is broken?
If it's only a Rust version issue, please feel free to update lalrpop's Rust version for the stabilized features.

Thank you!

Review threads on Cargo.toml (outdated), lalrpop-util/Cargo.toml, and lalrpop-util/src/lexer.rs were resolved.
@QuarticCat (Contributor, Author)

Could you check why CI is broken?

It's a regex-automata issue that I just reported. Luckily it was fixed in the blink of an eye. rust-lang/regex#1056

@QuarticCat (Contributor, Author)

Any update?

@youknowone (Contributor)

I triggered CI again

@QuarticCat (Contributor, Author)

I force pushed again. This time it should be fixed. @youknowone

@QuarticCat (Contributor, Author)

My fault. Force-pushed again.

@youknowone (Contributor)

@nikomatsakis @yannham Could you review this patch?

@yannham (Contributor) commented Aug 7, 2023

I'm back from vacation, I'll try to take a look soon!

@dburgener (Contributor)

Did a small amount of local testing on my box, compiling lrgrammar.lalrpop with the HEAD of this branch vs. just the skip-rule rewrite commit, using /usr/bin/time -v. I got a very minor speed improvement, from 4.03 seconds user time to 3.91 seconds, with the lexer rewrite here. Peak RAM usage increased from 12.5MB to 13.8MB. (I was concerned about RAM because of the comment about "exorbitant RAM usage" here.)

On this lalrpop file that I care about, I regressed speed from 1.98 seconds to 2.5 seconds, with peak RAM going up from 11.4MB to 12.8MB.

I also grabbed json.lalrpop which I think was used by the submitter for the benchmark. I got an improvement from 0.06 seconds user time to 0.03 seconds user time, with a peak RAM increase from 9.9MB to 11.4MB.

Obviously this is all one box, using time, without any statistical sampling, so I wouldn't read too much into the above numbers. I don't view the RAM increases as "exorbitant", but they are a trade-off to be aware of. I am a little concerned about the variation across different workloads. The tiny JSON file got good results, but the percentage win I observed on the much larger lrgrammar.lalrpop file was much smaller, and the third file I tested actually regressed.

I tried running paguroidea's benchmarking, but I got unrelated compile errors, and didn't spend the time to debug them. The benchmark tests I run in my project show an extreme regression with the lexer rewrite, vs just the whitespace fix.

Full system compile time: [268.80 ms 278.32 ms 288.93 ms]
change: [+1101.8% +1153.0% +1214.7%] (p = 0.00 < 0.05)
Performance has regressed.

This is a sort of shockingly huge number though, particularly since I expect my code's performance, rather than lalrpop's, to be the major bottleneck here. So I'm wondering if there's some sort of measurement or environment error on my end, but I do seem to be getting similar results consistently.

@QuarticCat (Contributor, Author)

@dburgener Thanks for testing. I don't have time to dive into these results in the next few days, but I do have some speculations.

IIRC, the default strategy of regex is hybrid, while I used dfa here. One difference is that hybrid builds the DFA lazily. If the lexing phase doesn't have enough workload and/or doesn't cover enough DFA states, dfa might not be worth it. I missed that scenario. Furthermore, I used a dense DFA here and also enabled minimize. Both of those take more time and memory to compile.
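For intuition, here is a self-contained toy sketch (not regex-automata's API; the names are made up) of why a lazy DFA can win on small workloads: transitions are derived and cached the first time they are needed, so a short input only pays for the handful of states it actually visits instead of a full up-front table build.

```rust
use std::collections::HashMap;

struct LazyDfa {
    cache: HashMap<(u32, u8), u32>,
    computed: usize, // how many transitions we actually had to derive
}

impl LazyDfa {
    fn new() -> Self {
        LazyDfa { cache: HashMap::new(), computed: 0 }
    }

    // Stand-in for expensive determinization work: here, a toy rule
    // that counts a run of ASCII digits and resets on anything else.
    fn compute_transition(state: u32, byte: u8) -> u32 {
        if byte.is_ascii_digit() { state + 1 } else { 0 }
    }

    // Derive a transition only the first time (state, byte) is seen.
    fn next(&mut self, state: u32, byte: u8) -> u32 {
        if let Some(&s) = self.cache.get(&(state, byte)) {
            return s;
        }
        self.computed += 1;
        let s = Self::compute_transition(state, byte);
        self.cache.insert((state, byte), s);
        s
    }

    fn run(&mut self, input: &[u8]) -> u32 {
        input.iter().fold(0, |st, &b| self.next(st, b))
    }
}
```

An ahead-of-time dense DFA would instead pay to fill every (state, byte) entry before the first match, which is exactly the cost that dominates when inputs are tiny.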

If you look through the original implementation, you can see a lot of redundant calculations due to the limitations of regex's high-level API. So I do believe that there's a configuration that can be strictly superior to the original lexer.

Anyway, some simple profiling should reveal what's going wrong under the hood. Could you upload the complete benchmark somewhere?

dburgener added a commit to dburgener/cascade that referenced this pull request Aug 30, 2023
lalrpop/lalrpop#814 is the lalrpop change

Commenting in the 366e87a commit lines exercises the PR with just its
first commit, and switching to d1e99cc exercises the full PR.
@dburgener (Contributor)

Could you upload the complete benchmark to somewhere?

Sure. I have a lot of skepticism about my initial time-based benchmarks, but here's a branch of my project, which uses lalrpop:

https://github.com/dburgener/cascade/tree/dburgener/bench-lalrpop-lexer

I haven't dug into the results here much on my end yet (I do intend to in a day or two), so it's possible there's something in my code that's relevant here. But if you modify my Cargo.toml lines to point to your first commit, run cargo bench, and then modify to the second commit and run cargo bench again, you should see the results. In my testing it was rather dramatic, which as I said is surprising, since I expect my code to dominate the performance, rather than lalrpop.

That said, I haven't dug too heavily into ruling out environmental differences while the benchmark was running, or done any profiling, so maybe there's something majorly wrong with the way I'm testing it.

@youknowone (Contributor)

@QuarticCat This PR seems to require more discussion. If you don't mind, could you open another PR about the regex pattern to fix #820 first?

@QuarticCat (Contributor, Author)

Cascade (old):

Full system compile     time:   [14.138 ms 14.149 ms 14.161 ms]
Stress functions        time:   [69.476 ms 69.712 ms 69.993 ms]

Cascade (dfa::dense+minimize):

Full system compile     time:   [214.66 ms 215.11 ms 215.55 ms]
Stress functions        time:   [103.75 ms 103.92 ms 104.13 ms]

Cascade (dfa::dense):

Full system compile     time:   [34.717 ms 34.815 ms 34.922 ms]
Stress functions        time:   [74.633 ms 75.063 ms 75.505 ms]

Obviously, it's the regex compilation that took too much time in your benchmark. Considering that dfa::sparse can only be converted from dfa::dense, it must take more time to compile. So let's skip it and check hybrid.

Cascade (hybrid):

Full system compile     time:   [5.2068 ms 5.2125 ms 5.2188 ms]
Stress functions        time:   [67.099 ms 67.172 ms 67.246 ms]

Twitter JSON (hybrid):

twitter-json/lalrpop    time:   [4.3239 ms 4.3260 ms 4.3282 ms]
                        thrpt:  [168.98 MiB/s 169.07 MiB/s 169.15 MiB/s]

Surprisingly, this one is faster. All results in this run are slower than in my initial measurements, so the real improvement is even larger than the numbers show. I should have used this script to stabilize the environment. Anyway, that doesn't affect our conclusion.

I'm a bit bored with this PR. If you want to tweak the regex-automata configuration further, I can grant you access to my forked repo.

@QuarticCat (Contributor, Author)

Oh I forgot to post memory changes.

The results below were collected with time cargo bench after a few warmups, where time is the zsh built-in, not GNU time. My configuration can be found here.

Cascade (old):

avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                55 MB
page faults from disk:     1
other page faults:         103484

This result fluctuates between 55MB and 59MB.

Cascade (hybrid):

avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                60 MB
page faults from disk:     0
other page faults:         33483

It's a little bit larger, presumably because the hybrid version keeps its transition cache around.

@yannham (Contributor) commented Sep 6, 2023

Thanks for the thorough investigation, @QuarticCat. Sorry for not being able to review yet, but backlog accumulates elsewhere... It's not entirely clear to me how the tables you show relate to LALRPOP after this PR, LALRPOP before this PR, or yet another potential combination that isn't used in this PR nor previously in LALRPOP?

If the compilation can be a huge pessimization in some workflows, I would be tempted not to proceed with this PR. Especially because, if the lexer's performance matters in one's use case, you can still use Logos or even roll out a fine-tuned custom lexer, which can be plugged into LALRPOP easily (the complexity is nowhere near that of writing a custom parser, which would indeed defeat the purpose of LALRPOP entirely).

What do you think?

@QuarticCat (Contributor, Author) commented Sep 6, 2023

The original implementation also needs to compile regexes on the fly; the regex crate doesn't support building regexes at compile time at all. Given that this PR has already switched to regex's default, most commonly used engine, I don't think there will be pessimization in most cases. I say "most" because in some edge cases regex will choose a different strategy. For example, when more than one literal prefix is found, regex will use an Aho-Corasick DFA. More details can be found here. These cases are extremely rare in lexing: the default skip rule, and basically any common skip rule, will prevent regex from choosing that engine. Even if you do find yourself in such a case, this PR might still be faster, since it avoids many redundant matches. My benchmark and @dburgener's benchmark represent two extremes: lexing large files (where compile cost is negligible) and lexing small files (where compile cost is significant). It now performs better than the original implementation in both cases.

More tests & benchmarks are welcome. I'm confident in my work.

@QuarticCat (Contributor, Author)

It's not entirely clear to me how the table you show relates to LALRPOP after this PR, LALRPOP before this PR, or to yet another potential combination that isn't used in this PR nor previously in LALRPOP?

"old" -> LALRPOP before this PR
"dfa::dense+minimize" -> commit d1e99cc
"dfa::dense" -> commit d1e99cc without DFA minimization, not committed
"hybrid" -> commit c41e122, the latest one

@dburgener (Contributor)

IMHO what we really need is good lalrpop benchmarking automation. This likely won't be the last PR whose performance we want to understand. I've been super busy all week, and expect to be recovering and catching up all next week, but the week after, I can start looking at our options for benchmarking automation, with a particular emphasis on applying it to this case.

The claimed performance wins here are pretty considerable, so hopefully we can gain some confidence that they are fairly universal rather than workload dependent and move forward.

I think the other problem here is that while @QuarticCat clearly understands the regex-automata internals and can speak to them, it doesn't seem like any of the maintainers or reviewers have a ton of background there. I started looking at reviewing last week and dipped my toe into the regex-automata docs, which is what motivated me to try more benchmarking, out of concern for memory consumption, but then I went down the benchmarking path and didn't finish going through the automata docs and the code here.

So all that to say, I'm enthusiastic about this change in general, and would love to do what I can to drive it forward in terms of testing and reviewing, but it'll likely be a couple of weeks until I get sufficient reprieve from my current busyness to devote some time to it again.

@Pat-Lafon (Contributor)

IMHO what we really need is a good lalrpop benchmarking automation. This likely won't be the last PR we want to understand performance on. I've been super busy all week, and expect to be recovering and catching up all next week, but the week after, I can start looking our options for benchmarking automation, with a particular emphasis on applying it in this case.

The claimed performance wins here are pretty considerable, so hopefully we can gain some confidence that they are fairly universal rather than workload dependent and move forward.

Somewhat related is https://github.com/orgs/lalrpop/discussions/809

@dburgener (Contributor)

Somewhat related is https://github.com/orgs/lalrpop/discussions/809

Yes, good point. I had vaguely recalled that post, but forgot to go dig it up when posting. I very much agree with the sentiment there, and we should probably work on that as part (or all) of "improve our benchmarking". I'll take a deeper look when I set aside lalrpop time in a week or two.

The thing that a generic benchmark solution like that is lacking is testing workloads of lalrpop's actual users. I don't think it would be a ton of effort to set up a repo that automates fetching a handful of public repos that use lalrpop, and running their benchmarks to compare two pinned lalrpop commits. That doesn't isolate parser performance of course, unless any of the projects expose a parser-specific benchmark, but it would be nice to detect ahead of time if we're likely to regress actual users noticeably with a proposed lalrpop change.

@dburgener (Contributor)

FYI, I've gotten lalrpop benchmarking merged in https://github.com/rosetta-rs/parse-rosetta-rs. That's certainly not the complete answer for our performance-testing story, but it's a good step. I'll aim to get some time tomorrow to run head-to-head benchmarks with/without the perf part of this PR and see how that affects our results.

After that, I'd like to do some benchmarking on various other workloads, and give this a thorough review. Thanks for your patience.

@dburgener (Contributor)

Based on one benchmark run, it looks to me like this change cuts our runtime in half on the json parsing test in rosetta. Very good news. I'll aim to do more benchmarking and a code review later this week.

With lalrpop pinned to 366e87ac (first commit only):

Name      Overhead (release)   Build (debug)   Parse (release)   Version
null      0 KiB                484ms           26ms              -
chumsky   780 KiB              9379ms          1636ms            v0.3.0
combine   227 KiB              5513ms          1179ms            v3.8.1
nom       158 KiB              3026ms          1270ms            v7.1.3
peg       34 KiB               2755ms          20ms              v0.8.1
pest      154 KiB              5583ms          1172ms            v2.7.3
pom       216 KiB              2586ms          2076ms            v3.3.0
lalrpop   1,615 KiB            15149ms         2135ms            git rev 366e87ac
winnow    137 KiB              2385ms          1119ms            v0.5.15
yap       89 KiB               879ms           1115ms            v0.10.0

With lalrpop pinned to c41e122c (full PR, hybrid):

Name      Overhead (release)   Build (debug)   Parse (release)   Version
null      0 KiB                472ms           26ms              -
chumsky   780 KiB              9427ms          1652ms            v0.3.0
combine   227 KiB              5426ms          1182ms            v3.8.1
nom       158 KiB              2979ms          1251ms            v7.1.3
peg       34 KiB               2740ms          19ms              v0.8.1
pest      154 KiB              5632ms          1178ms            v2.7.3
pom       216 KiB              2571ms          2080ms            v3.3.0
lalrpop   1,468 KiB            17000ms         1091ms            git rev c41e122c
winnow    137 KiB              2377ms          1133ms            v0.5.15
yap       89 KiB               869ms           1112ms            v0.10.0

@dburgener (Contributor)

Okay, I've read through the code here, and I think I somewhat have my head around the new implementation, as well as the original regex-crate-based implementation. The new implementation looks fine. I'm still a little fuzzy on why this is faster. Is the main win that in the original implementation regex::find() essentially has to start from the beginning on each search when looking for a longer match, while with the dfa::next_state() approach we can traverse the input string once?

I'd also like to make sure I understand all the compile-time angles here. I think I understand that both the original regex-crate implementation and the final hybrid one use a lazy DFA (at least in the common case) and don't need much overhead for initial compilation. The dfa::dense approach initially tried adds precompilation to MatchBuilder::new(), which slows down that function. The reason this was so bad for Cascade is that Cascade (probably wrongly) assumes that FooParser::new() is cheap and calls it in a loop, which caused a bunch of recompilation. Arguably, that's a Cascade performance bug rather than anything about this implementation. But in general, the dfa::dense approach should make next() faster at the expense of new() by precompiling. Uses with a low next()-to-new() ratio (e.g. small inputs) would see degradation in this case.
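A minimal sketch of that new()-in-a-loop pitfall, with a hypothetical Parser type standing in for a generated FooParser (the names and the counted "compilation" are illustrative, not lalrpop's API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how often the "expensive" precompilation runs.
static COMPILATIONS: AtomicUsize = AtomicUsize::new(0);

struct Parser;

impl Parser {
    fn new() -> Self {
        // Pretend this is costly DFA construction.
        COMPILATIONS.fetch_add(1, Ordering::Relaxed);
        Parser
    }

    fn parse(&self, input: &str) -> usize {
        input.len()
    }
}

// new() in the hot loop: pays the precompilation once per input.
fn parse_all_per_input(inputs: &[&str]) -> usize {
    inputs.iter().map(|s| Parser::new().parse(s)).sum()
}

// Construct once and reuse: the intended usage when new() precompiles.
fn parse_all_reused(inputs: &[&str]) -> usize {
    let parser = Parser::new();
    inputs.iter().map(|s| parser.parse(s)).sum()
}
```

Both functions produce the same results; only the number of "compilations" differs, which is why a precompiling new() punishes callers that construct a parser per input.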

I'm still confused by this statement though:

The original implementation also needs to compile regex on the fly. The regex crate doesn't support building regexes in compile-time at all.

As far as I can tell, both the regex implementation and the hybrid automata approach will typically use a lazy DFA under the hood, doing minimal setup in new() at the cost of a longer next(). There are a number of "compile times" in parser-generator land, so I'm not sure if you mean "regex compile time", "parser generation time" (i.e. build.rs), or "user compile time" (i.e. compiling a generated parser). Regardless of which you mean, I'm missing how the latest implementation changes that. If you could help me understand, that would be super helpful. The benchmarking does seem to show an improvement, and the proof is in the pudding in that sense, but I would feel more comfortable that we're not missing some horrendous corner cases if I better understood what was wrong in the initial implementation and how this differs.

Given the above, I'd be curious to fix the Cascade performance bug of calling new() in a loop and rerun that benchmark against all three cases. I did start looking at more extensive automated benchmarking last week and got a decent start on the harness, but still have a ways to go. I'll see about making more progress there ASAP.

@QuarticCat (Contributor, Author)

I'm still a little fuzzy on why this is faster.

In the old implementation (before this PR), every token is first matched by regex_set. Then, for each rule that matched, regex_vec[i] is used to match once more to determine which match is the longest. That means every token is matched at least twice, a huge waste. This is because the high-level API regex::RegexSet doesn't provide enough information. This PR solves the problem by using the low-level APIs. That is the biggest win, I believe.
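A toy illustration of that difference (hardcoded rules in plain Rust, not the actual lexer code): the old scheme runs every candidate rule separately and keeps the longest match, while the single-pass scheme walks one combined automaton and remembers the last accepting position (maximal munch).

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Word, // [a-z]+
    Num,  // [0-9]+
}

// "Old" style: every candidate rule rescans the input from the start,
// then the longest match wins.
fn match_twice(input: &str) -> Option<(Tok, usize)> {
    let word_len = input.bytes().take_while(|b| b.is_ascii_lowercase()).count();
    let num_len = input.bytes().take_while(|b| b.is_ascii_digit()).count();
    [(Tok::Word, word_len), (Tok::Num, num_len)]
        .into_iter()
        .filter(|&(_, n)| n > 0)
        .max_by_key(|&(_, n)| n)
}

// "New" style: one pass through a combined DFA, recording the most
// recent accepting state and position.
fn match_once(input: &str) -> Option<(Tok, usize)> {
    // States: 0 = start, 1 = in word (accepts Word), 2 = in number (accepts Num).
    let mut state = 0u8;
    let mut last_accept = None;
    for (i, b) in input.bytes().enumerate() {
        state = match (state, b) {
            (0 | 1, b'a'..=b'z') => 1,
            (0 | 2, b'0'..=b'9') => 2,
            _ => break,
        };
        last_accept = Some((if state == 1 { Tok::Word } else { Tok::Num }, i + 1));
    }
    last_accept
}
```

Both return the same token, but match_once touches each input byte once, while the RegexSet-style approach pays one scan per candidate rule on top of the set match.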

But in general, the dfa::dense approach should make next() faster at the expense of new() by precompiling.

True. But my twitter_json benchmark shows that hybrid is faster, which implies that dfa::dense doesn't necessarily have a faster next(). I haven't tried dfa::sparse, as it would probably result in an even more expensive new().

There's a number of "compile times" in parser generator land, so I'm not sure if you mean "regex compile time", or "parser generation time" (ie build.rs), or "user compile time" (ie compiling a generated parser). Regardless of which you mean, I'm missing how the latest implementation changes that.

I mean "user compile time". And this PR doesn't change that. So I used the word "also".

@yannham (Contributor) commented Sep 29, 2023

I just took a very quick look at the code and tried to grok what's going on at a high level, but FWIW I think this whole PR and the convergence toward the hybrid approach make a lot of sense. It sounds like the hybrid approach should not have corner cases that are obvious pessimizations (such as dfa::dense+minimize). Once again, a custom lexer can still be used for very specific cases, so the generic lexer should be good enough in most situations; it doesn't have to be the absolute best solution everywhere or even for some defined cases.

Modulo code reviewing and technical considerations, my half-educated vote would be to accept the approach and the changes proposed here.

@dburgener (Contributor)

I reran my Cascade benchmark after fixing the performance bug I mentioned above, comparing this branch with just the skip-rule commit against the latest commit on the branch, and it's a solid performance improvement in that case:

Full system compile time: [4.2422 ms 4.3042 ms 4.3713 ms]
change: [-24.977% -23.160% -21.153%] (p = 0.00 < 0.05)
Performance has improved.

That means all tokens are matched at least twice, a huge waste.

Ah, yes I see this now, and this makes a ton of sense. Thanks for the explanation.

So I used the word "also".

Ah, I see now what you mean. I had been interpreting "also" to mean "in addition to the other problems with the original implementation" rather than "like the most recent version".

FWIW I think this whole PR and the convergence toward hybrid approach make a lot of sense.

I agree.

At this point, we have three benchmarks on this change (parse-rosetta, Cascade, twitter-json). It bothers me slightly that two of those are json parsing, so we may not be super representative of diverse performance cases, but I don't think that's a good reason to hold this up any longer. The approach makes sense, and should be an across the board improvement.

Once again, a custom lexer can still be used for very specific cases, so the generic lexer should be good enough in most situations, but doesn't have to be the absolute best solution everywhere or even for some defined cases.

This is to some extent a fair point, although I do think that if we published a new release that tanked the performance of the built-in lexer in some use case, then "you can write a custom lexer" isn't a very user-friendly response. The bigger thing for me is that I don't think this PR is likely to tank performance in weird corner cases. If it somehow does, we can treat it as a performance bug and investigate and fix it then.

Modulo code reviewing and technical considerations, my half-educated vote would be to accept the approach and the changes proposed here.

I did a pretty thorough technical review yesterday. I'm no regex-automata expert, but I dug through the docs and the code here makes sense. I'll go ahead and mark this as approved from me, and resolve the merge conflicts. If I don't hear objections from another maintainer I'll plan on merging it sometime over the weekend.

@dburgener (Contributor)

I'll go ahead and mark this as approved from me, and resolve the merge conflicts.

I've resolved the conflicts via GitHub's merge interface, so there's a merge commit from me on the HEAD of quarticcat/master now. The only merge issues were that the is-terminal PR had updated the Rust version past where this PR did, and that the skip rule had merged independently, with the remaining commits then modifying it to remove the anchors.

@yannham (Contributor) commented Sep 29, 2023

Sounds good to me! Hats off @QuarticCat for the high-quality contribution and @dburgener for having put quite a lot of time and energy to review and make sure we understand very precisely the tradeoffs involved 🙂

@dburgener added this pull request to the merge queue Sep 30, 2023
Merged via the queue into lalrpop:master with commit e2eb361 Sep 30, 2023
9 checks passed
@dburgener (Contributor)

Merged! Thanks @QuarticCat for the contribution!

@dburgener mentioned this pull request Oct 16, 2023
dburgener added a commit to dburgener/lalrpop that referenced this pull request Oct 17, 2023
regex_syntax::parse() converts our regex strings into the HIR of the
regex; part of that includes unpacking various metacharacters into a
list of symbols.  In many cases, this expansion changes depending on
whether it should expand into unicode or not.

Prior to lalrpop#814, we were still outputting unicode regexes unconditionally,
but regex internals seem to have been compiling them away and avoiding
errors.  The switch in lalrpop#814 caused these to result in real errors.

Follow-up work will be needed to determine why existing tests didn't
detect this.
dburgener added a commit to dburgener/lalrpop that referenced this pull request Oct 18, 2023
github-merge-queue bot pushed a commit that referenced this pull request Oct 20, 2023