# perf(linter): apply small file optimization, up to 30% faster #6600
## Conversation
CodSpeed Performance Report: merging #6600 will improve performance by 35.14%.
Force-pushed from `4206d0c` to `06000dc`.
Benchmark results are wild! I'll take a look at this tomorrow.
Force-pushed from `06000dc` to `5faeea2`.
This is rather mysterious! Personally, I think it's unlikely this effect is due to cache locality. I could be wrong, but it seems unlikely to me that it'd have such a large effect. More likely it's just doing a lot less work turning the loops this way around. The big question is: what's the big difference between …? I suggest running before/after with a proper flamegraph (better than what CodSpeed gives) and seeing what that reveals. Also, you might want to remove the intermediate `Vec`.
I did try this first; I had just re-added it to see if it made any difference. It didn't improve the regression, so I'm going to leave it out.
Yeah, I'm going to work on profiling this to see what's up.
Force-pushed from `8601a72` to `eab9176`.
Samply didn't help too much in profiling; I'm still a bit confused. The only thing that sticks out is that this branch doesn't have a big call to …. The call tree shows some strange differences, like ….
This is a benchmark fallacy where all rules are turned on. For the two loops, that's 420 rules × the total number of AST nodes, so performance will obviously differ for different file sizes. For real usage, it's around 100 rules, so performance will differ as well.
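To make the arithmetic concrete, here is a minimal sketch of the two loop orders under discussion. This is not oxc's actual code; `Rule`, `run`, and the counts are illustrative stand-ins:

```rust
/// Illustrative stand-in for a lint rule; not oxc's actual `Rule` trait.
struct Rule {
    id: usize,
}

impl Rule {
    fn run(&self, node: usize) {
        // A real rule would inspect the AST node here.
        std::hint::black_box((self.id, node));
    }
}

// Nodes in the outer loop: the node list is walked once,
// and the rule list is walked once per node.
fn nodes_outer(rules: &[Rule], nodes: &[usize]) {
    for &node in nodes {
        for rule in rules {
            rule.run(node);
        }
    }
}

// Rules in the outer loop: the rule list is walked once,
// and the node list is walked once per rule.
fn rules_outer(rules: &[Rule], nodes: &[usize]) {
    for rule in rules {
        for &node in nodes {
            rule.run(node);
        }
    }
}

fn main() {
    let rules: Vec<Rule> = (0..420).map(|id| Rule { id }).collect();
    let nodes: Vec<usize> = (0..10_000).collect();
    // Both orders perform 420 * 10_000 `run` calls; only the memory
    // access pattern, and therefore the cache behavior, differs.
    nodes_outer(&rules, &nodes);
    rules_outer(&rules, &nodes);
}
```

Either order does the same rules × nodes amount of work, which is why any win or loss shows up only through cache effects and per-iteration overhead, and why it scales with file size and rule count.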
Yeah, the number on CodSpeed is much larger than what I am seeing locally in more realistic scenarios. But still, when I do …, I'm still not sure why ….
What about applying a strategy where we switch the loop order when the total number of AST nodes is smaller than a threshold?
I don't see any good reason why …. Maybe #6623 will make that go away, but if it does, that'd probably indicate another problem! It would suggest we are storing a large amount of temporary data in the arena, which is an anti-pattern.
Force-pushed from `eab9176` to `90a93af`.
I'm not in a rush to merge this, so I'm going to continue investigating as best I can. However, the performance gains here are definitely real and very statistically significant. With default settings used for ….
I did some debugging in Xcode and did find that this new version actually has more L1 data cache misses than the code on `main`.
Force-pushed from `90a93af` to `14ada18`.
So I noticed an interesting pattern here between `main` and this PR. … However, for this PR, the L1 data cache misses are evenly distributed over the duration of the process, and there are ~20x more of them, at ~45M L1 cache misses. For some reason, ~35M of these cache misses are being attributed to …, and specifically, it looks like a lot of time is being spent getting these possible Jest nodes. So I'm wondering if, by changing the data access pattern here, we've actually made the caching worse, because now the Jest nodes are constantly getting pushed out of the cache and then put back in. It might be worth working on #6038 and then revisiting this to see if it has improved. It's possible that the performance regression in CodSpeed could be explained by different caching behavior versus my M1 laptop?
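For context, the kind of fix #6038 gestures at could look something like the following hedged sketch: compute the possible Jest nodes once per file and share them across rules, instead of re-collecting them for every Jest-related rule. Everything here (`PossibleJestNode`, `LintContext`, `possible_jest_nodes`) is a hypothetical stand-in for illustration, not oxc's actual API:

```rust
use std::cell::OnceCell;

/// Hypothetical stand-in for illustration; not oxc's actual type.
struct PossibleJestNode(usize);

struct LintContext {
    node_count: usize,
    // Filled lazily, at most once per file, then shared by every rule.
    jest_nodes: OnceCell<Vec<PossibleJestNode>>,
}

impl LintContext {
    fn possible_jest_nodes(&self) -> &[PossibleJestNode] {
        self.jest_nodes.get_or_init(|| {
            // Expensive scan over all AST nodes. Doing it once per file,
            // rather than once per Jest rule, avoids repeatedly evicting
            // and refilling the same data in the L1 cache.
            (0..self.node_count).map(PossibleJestNode).collect()
        })
    }
}

fn main() {
    let ctx = LintContext {
        node_count: 1_000,
        jest_nodes: OnceCell::new(),
    };
    // Many rules can now share a single collection pass.
    assert_eq!(ctx.possible_jest_nodes().len(), 1_000);
    assert_eq!(ctx.possible_jest_nodes().len(), 1_000);
}
```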
Force-pushed from `b3c3c2f` to `318bf0e`.
Force-pushed from `a8e6348` to `41980c3`.
I think I finally figured this one out. Re-profiling with the latest code changes shows a much healthier usage of the L1 cache, where we have a bunch of misses initially but it falls off to a smaller amount. I think the reason why we were encountering issues in the …. So I've taken @Boshen's suggestion from earlier to have a threshold for when we apply this strategy. This makes it essentially a "small file" optimization, where smaller files are linted in a slightly different order but large files are unaffected. This does complicate the linter running code, but I believe it's worth the benefit. Benchmarking shows this can speed up the linter by 10-40%, depending on the number of rules and plugins; generally, the higher the number of rules, the better this performs. I think the extra complication here is well worth the slightly duplicated implementations.
Great that the perf drop on … Out of interest, how many nodes do the various benchmarks have? Clearly … But... (and I'm sorry to be the guy to piss on your parade when you've achieved such a stellar result)... are you sure you're not over-fitting the data? i.e., optimizing for our specific benchmarks rather than the general case? And did you ever get to the bottom of what's going on with the expensive …? Just to be clear: despite me asking questions, what I'm not questioning is that this is excellent work. Bravo!
Totally agree. It's not a huge complication, and you've made a comment which clearly explains the rationale.
I might have misspoken earlier: when I said "dropped out", I think I just meant it didn't appear in the profiler call tree. But I think there was some strange debug-symbol attribution going on there; I didn't see this happening again when profiling in Xcode Instruments. I didn't see anything that pointed to this method being a big issue currently.
Force-pushed from `41980c3` to `f5e1426`.
Thanks very much for giving the real-world results on various codebases. That's pretty convincing. Perhaps we could finesse the 200,000 "tipping point", but it seems to me it's clearly good enough. In the real world, who (apart from the maniac authors of TS) writes code in files anywhere near as large as that? We might encounter files that big in the transformer - large libraries pre-bundled into a single file in ….
Can I ask you a favour? Would you mind just checking the benchmark result on …? Beyond that, LGTM!
@overlookmotel I think I did benchmark this on ….
Force-pushed from `f5e1426` to `bdf6007`.
The CodSpeed benchmark has all the rules turned on. Is your hyperfine benchmark running all rules or just the default? I would optimize for the default case, where most people run ~100 rules, not 430.
This benchmarking was done with the default rules and plugins, so it was 98 rules in total.
I'm super happy with the results, eager to merge!
## Merge activity
Theory: iterating over the rules three times has slightly worse cache locality, because the prior iterations have pushed `rule` out of the cache by the time we iterate over it again. By iterating over each rule only once, we improve cache performance (hopefully). We also don't need to collect rules to a `Vec`, so it saves some CPU/memory there too.

In practice: the behavior here actually depends on the number of AST nodes in the program. If the number of nodes is large, then it's better to iterate over the nodes only once and iterate over the rules multiple times. But if the number of nodes is small, then it's better to iterate over the nodes multiple times and only iterate over the rules once. See this comment for more context: #6600 (comment), as well as the comment inside the PR: https://github.com/oxc-project/oxc/pull/6600/files#diff-207225884c5e031ffd802bb99e4fbacbd8364b1343a1cec5485bf50f29186300R131-R143.

In practice, this can make linting a file 1-45% faster, depending on the size of the file, number of AST nodes, number of files, CPU cache size, etc. To accommodate large and small files better, we have an explicit threshold of 200,000 AST nodes, which is an arbitrary number picked based on some benchmarks on my laptop. For large files, the linter behavior doesn't change. For small files, we switch to iterating over nodes in the inner loop and iterating over rules once in the outer loop.
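As a rough illustration of the strategy described above, here is a hedged sketch, not the actual oxc implementation: the 200,000 threshold comes from the PR description, while `Rule` and `lint_file` are illustrative stand-ins.

```rust
/// Illustrative stand-in for a lint rule; not oxc's actual types.
struct Rule;

impl Rule {
    fn run(&self, node: usize) {
        // A real rule would inspect the AST node here.
        std::hint::black_box(node);
    }
}

/// Arbitrary tipping point from the PR description, picked via benchmarks.
const SMALL_FILE_NODE_THRESHOLD: usize = 200_000;

fn lint_file(rules: &[Rule], nodes: &[usize]) {
    if nodes.len() <= SMALL_FILE_NODE_THRESHOLD {
        // Small file: rules in the outer loop. Each rule is touched exactly
        // once, and the small node list is likely to stay resident in cache
        // while it streams past each rule.
        for rule in rules {
            for &node in nodes {
                rule.run(node);
            }
        }
    } else {
        // Large file: behavior unchanged. Nodes stay in the outer loop,
        // so the huge node list is walked only once.
        for &node in nodes {
            for rule in rules {
                rule.run(node);
            }
        }
    }
}

fn main() {
    let rules = vec![Rule, Rule, Rule];
    let small: Vec<usize> = (0..1_000).collect();
    lint_file(&rules, &small); // takes the rules-outer path
}
```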
Force-pushed from `bdf6007` to `8387bac`.
## [0.10.2] - 2024-10-22

### Features

- dbe1972 linter: Import/no-cycle should turn on ignore_types by default (#6761) (Boshen)
- 619d06f linter: Fix suggestion for `eslint:no_empty_static_block` rule (#6732) (Tapan Prakash)

### Bug Fixes

### Performance

- 8387bac linter: Apply small file optimization, up to 30% faster (#6600) (camchenry)

### Refactor

- b884577 linter: All ast_util functions take Semantic (#6753) (DonIsaac)
- 744aa74 linter: Impl `Deref<Target = Semantic>` for `LintContext` (#6752) (DonIsaac)
- 6ffdcc0 oxlint: Lint/mod.rs -> lint.rs (#6746) (Boshen)

### Testing

- b03cec6 oxlint: Add `--fix` test case (#6747) (Boshen)

---------

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>