8340272: C2 SuperWord: JMH benchmark for Reduction vectorization #21032
Conversation
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements; please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, 82 new commits had been pushed to the target branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message, issue the /integrate command.
Good.
Looks nice, the benchmark is very thorough! I was interested to see how it performed on my Zen 3 (AVX2) machine, so I've attached the results here in case they are interesting/useful: perf_results.txt
@jaskarth Thanks for the benchmark! I have now included it in these results. The results are quite comparable to the AVX512 results. Some comments:
Going to push as commit aeba1ea.
Your commit was automatically rebased without conflicts.
The subword results seem quite tricky, especially since some things that performed well for me (like char) ended up causing regressions on your machine. The long results are also quite strange, but that may just be random noise. I'll definitely make sure to investigate further. Thanks a lot for the analysis!
@eme64 Thanks for building this. I ended up creating a min/max-specific benchmark in #20098. The main reason I created something different was to be able to control the data in the arrays such that the branching factors could be pre-determined; the results can vary depending on that. I then used the opportunity to add both reduction and non-reduction vector benchmarks.


I'm adding some proper JMH benchmarks for vectorized reductions. There are already some others, but they are either not comprehensive or not JMH.
Plus, I wanted to do a performance investigation, hopefully leading to some improvements. See Future Work below.
How I run my benchmarks
All benchmarks:

```bash
make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64
```

Some specific benchmark, with a profiler that tells me which code snippet is hottest:

```bash
make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"
```

JMH logs
Run on my AVX512 laptop, with master:
run_avx512_master.txt
Run on remote asimd (aarch64, NEON) machine:
run_asimd_master.txt
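For context, the kind of kernel a benchmark like `doubleMinDotProduct` measures can be sketched in plain Java. This is my own minimal reconstruction, not the actual benchmark source; the class name and array contents are illustrative only:

```java
// Hypothetical reconstruction (not the actual benchmark source) of the
// kind of kernel a doubleMinDotProduct-style benchmark measures: a min
// reduction over the element-wise products of two arrays.
public class MinDotProductSketch {
    static double doubleMinDotProduct(double[] a, double[] b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < a.length; i++) {
            // The multiply is vectorizable "work"; the min is the reduction.
            min = Math.min(min, a[i] * b[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        double[] a = {1.0, -2.0, 3.0};
        double[] b = {4.0, 5.0, 6.0};
        // Products are 4.0, -10.0, 18.0, so the min is -10.0.
        System.out.println(doubleMinDotProduct(a, b)); // prints -10.0
    }
}
```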
Results
I ran it on 2 machines so far. Left on my AVX512 machine, right on an ASIMD/NEON/aarch64 machine.
Here the interesting `int / long / float / double` results, discussion further below:

And there the less spectacular `byte / char / short` results. There is no vectorization of these cases. But there seems to be some issue with over-unrolling on my AVX512 machine: one case I looked at would only unroll 4x without SuperWord, but 16x with it, and that seems to be unfavourable.
Here the PDF:
benchmark_results.pdf
Why are all the `...Simple` benchmarks not vectorizing, i.e. "not profitable"?
Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
The idea used to be that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
But when I disable this code, I see the following on the aarch64/ASIMD machine:
Hence, this assumption no longer holds. I think that is because we are now actually able to move the reductions out of the loop, which was not the case when this code was added.
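To illustrate the heuristic, here is a hedged sketch (method names are my own, not the benchmark's): the first loop is a "Simple" reduction with no extra vector work, while the second carries multiply and add "work" vectors alongside the reduction, which the old heuristic would count toward profitability:

```java
// My own illustration of the profitability heuristic, not JDK code.
public class ReductionKernels {
    // "Simple" reduction: only the reduction itself, no extra vector work.
    static int intAddSimple(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Reduction plus "work" vectors: the multiply and add are profitable
    // vector work even if the reduction itself is costly.
    static int intAddBig(int[] a, int[] b, int[] c) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i] + c[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1024;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2; c[i] = 1; }
        System.out.println(intAddSimple(a));      // prints 523776
        System.out.println(intAddBig(a, b, c));   // prints 1048576
    }
}
```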
2-Element Reductions for INT / LONG
Apparently, all 2-element int and long reductions are currently deemed not profitable, see change: a880f3d
This means that the `long` reductions do not vectorize on the ASIMD / aarch64 machine with `MaxVectorSize=16`.

AARCH64 / NEON / ASIMD
This is why the `float / double` `add / mul` reductions (yellow `MATCHER`) do not vectorize. We might be able to tackle this with an appropriate alternative implementation.
Also, all of the `long` cases fail: for one, because they would only be 2-element reductions (since only `MaxVectorSize=16`), but also because `MulVL` is apparently not allowed, see below.

Float / Double with Add and Mul
On `aarch64` NEON these cases do not vectorize, see the last section.
It turns out that many of these cases actually do vectorize (on x64), but the code is just as fast as the scalar code.
This is because the reduction order is strict, to maintain correct rounding.
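The strictness matters because floating-point addition is not associative, so reordering the additions can change the rounded result. A minimal self-contained demonstration (my own example, not from the benchmark):

```java
// My own minimal example: float addition is not associative, so a
// reduction that reorders the additions can change the result.
public class FloatAssoc {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        System.out.println((a + b) + c); // prints 1.0
        System.out.println(a + (b + c)); // prints 0.0 -- the 1f is absorbed by -1e8f
    }
}
```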
Interestingly, the code runs at about the same speed whether vectorized or not. It seems that the latency of the reduction chain is simply the determining factor, no matter if it is vectorized or scalar.
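What a non-strict (reassociated) reduction could buy here: splitting the accumulator into independent chains breaks the loop-carried latency chain, at the cost of potentially different rounding. A hedged sketch of the idea, my own code and not C2 output:

```java
// My own sketch of reassociation, not actual C2 output. The strict sum
// has one long dependency chain; the reassociated sum uses four
// independent chains, which is what vectorized reductions effectively do.
public class ReassociatedSum {
    static float strictSum(float[] a) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i]; // each add depends on the previous one
        }
        return sum;
    }

    static float reassociatedSum(float[] a) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            // four independent chains, combined only once at the end
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        for (; i < a.length; i++) sum += a[i]; // scalar tail
        return sum;
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        java.util.Arrays.fill(a, 1.0f); // sums of 1.0f are exact in float
        System.out.println(strictSum(a));       // prints 1024.0
        System.out.println(reassociatedSum(a)); // prints 1024.0 (equal here, not in general)
    }
}
```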
Running this, for example, shows that the loop bodies are quite different:
Scalar loop body, note that only scalar xmm registers are used:

The vector loop uses `zmm`, `ymm` and `xmm` registers, and not just `mul` and `add`, but also `vpshufd` and `vextractf128` instructions to shuffle the values around in the reduction.

No vectorization for longMulBig?
I can get vectorization locally in another setting, but somehow not in the JMH benchmark. This is strange, and I'll have to keep investigating.
No speedup with doubleMinDotProduct
Strangely, on my AVX512 machine, that benchmark did vectorize, but it did not experience any speedup. I'm quite confused about that, especially because the parallel benchmark `doubleMaxDotProduct` vectorizes just fine. More investigation needed.

Future Work
- `Simple` benchmarks should be profitable -> allow vectorization!
- `byte / char / short`: is it all due to over-unrolling?
- `float / double` with `add / mul`: `Float.addAssociative` so that we can do non-strict reduction? That could be a cool feature for those who care about performance and are willing to give up some rounding precision.
- `longMulBig` and `doubleMinDotProduct`.

Progress
Issue
Reviewers
Reviewing
Using git

Checkout this PR locally:

```bash
$ git fetch https://git.openjdk.org/jdk.git pull/21032/head:pull/21032
$ git checkout pull/21032
```

Update a local copy of the PR:

```bash
$ git checkout pull/21032
$ git pull https://git.openjdk.org/jdk.git pull/21032/head
```

Using Skara CLI tools

Checkout this PR locally:

```bash
$ git pr checkout 21032
```

View PR using the GUI difftool:

```bash
$ git pr show -t 21032
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21032.diff
Webrev
Link to Webrev Comment