8340093: C2 SuperWord: implement cost model #27803
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into

@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 144 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the

Webrevs
SirYwell
left a comment
Nice work :)
```cpp
// For now, we use unit cost. We might refine that in the future.
// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_vector_reduction(int opcode, int vlen, BasicType bt, bool requires_strict_order) const {
  // Each reduction is composed of multiple instructions, each estimated with a unit cost.
  // Linear: shuffle and reduce    Recursive: shuffle and reduce
  float c = requires_strict_order ? 2 * vlen : 2 * exact_log2(vlen);
```
"unit cost" sounds a bit too simple given that there is some kind of estimation going on already. Maybe it would make sense to add some discussion how strict order affects the shape of the resulting vectorized code?
I assume cases where the reduction can be moved after the loop are covered somewhere else?
Thanks for the comment :)
By "unit cost" I mean unit cost per hardware instruction. Reduction ops use multiple instructions, so we count the instructions, and return that count.
Yes, if we move reductions out of the loop, then the reduction node is not in the loop anymore, and instead we have vector accumulators. And then we count the cost of the vector accumulators.
That's why I need methods like VTransformGraph::mark_vtnodes_in_loop to know what nodes are in the loop (the new vector accumulators, and not the reductions if moved out of the loop).
I think I'll improve the comments a little to make that more clear :)
Ah, when referring to hardware instructions this makes perfect sense; somehow I assumed "unit cost of a node". Thanks for clarifying!
Co-authored-by: Hannes Greule <SirYwell@users.noreply.github.com>
```diff
-// NEON instructions support them. But the match rule support for them is profitable for
-// Vector API intrinsics.
+// NEON instructions support them. They use multiple instructions which is more
+// expensive in almost all cases where we would auto vectorize.
+// But the match rule support for them is profitable for Vector API intrinsics.
 if ((opcode == Op_VectorCastD2X && (bt == T_INT || bt == T_SHORT)) ||
     (opcode == Op_VectorCastL2X && bt == T_FLOAT) ||
     (opcode == Op_CountLeadingZerosV && bt == T_LONG) ||
     (opcode == Op_CountTrailingZerosV && bt == T_LONG) ||
+    opcode == Op_MulVL ||
     // The implementations of Op_AddReductionVD/F in Neon are for the Vector API only.
     // They are not suitable for auto-vectorization because the result would not conform
     // to the JLS, Section Evaluation Order.
     // Note: we could implement sequential reductions for these reduction operators, but
     //       this will still almost never lead to speedups, because the sequential
     //       reductions are latency limited along the reduction chain, and not
     //       throughput limited. This is unlike unordered reductions (associative op)
     //       and element-wise ops which are usually throughput limited.
     opcode == Op_AddReductionVD || opcode == Op_AddReductionVF ||
-    opcode == Op_MulReductionVD || opcode == Op_MulReductionVF ||
-    opcode == Op_MulVL) {
+    opcode == Op_MulReductionVD || opcode == Op_MulReductionVF) {
```
Note: no functional changes, only moving Op_MulVL up to the other cases that work the same as it. And improving some comments.
```diff
-_do_vector_loop(phase()->C->do_vector_loop()), // whether to do vectorization/simd style
-_num_work_vecs(0),                             // amount of vector work we have
-_num_reductions(0)                             // amount of reduction work we have
+_do_vector_loop(phase()->C->do_vector_loop())  // whether to do vectorization/simd style
```
Note: part of old reduction heuristic, no longer needed.
```diff
-if (!Matcher::match_rule_supported_vector(vopc, vlen, bt)) {
-  DEBUG_ONLY( this->print(); )
-  assert(false, "do not have normal vector op for this reduction");
-  return false; // not implemented
+if (!Matcher::match_rule_supported_auto_vectorization(vopc, vlen, bt)) {
+  // The element-wise vector operation needed for the vector accumulator
+  // is not implemented / supported.
+  return false;
```
I consider this a "performance bug", but it makes sense to fix it here.
match_rule_supported_vector returns true on aarch64 for MulVL, but match_rule_supported_auto_vectorization returns false. That is because MulVL has a "fake vector implementation" in the NEON backend, which just extracts the lanes to scalars, performs scalar multiplications, and packs the results again.
Since we are auto-vectorizing, we should trust the match_rule_supported_auto_vectorization here.
On aarch64, the MulReductionVL is allowed for vectorization. But if we move it out of the loop here, we end up introducing a MulVL, which is very much not profitable. Making this change avoids this issue, and is also consistent with the match_rule_supported_auto_vectorization use instead of match_rule_supported_vector elsewhere in SuperWord.
@SirYwell Thanks for the comments and suggestions :)
```cpp
if (_trace._info) {
  tty->print_cr("\nForced bailout of vectorization (AutoVectorizationOverrideProfitability=0).");
```
Side note. Consider separate RFE to change this to UL for such outputs.
Absolutely. The tricky part is that the current TraceAutoVectorization is a compile command that can be enabled with method name filtering. Is that already available via UL now?
Unfortunately no. I think this is what @anton-seoane worked on before.
Yes, I have taken up the task again, so sooner rather than later CompileCommand filtering for UL will be enabled for cases such as this.
Ok, that's what I thought. For now, I'll extend the tracing the way I've been doing, and once we have UL available with method-level filtering, then I can migrate it all in one single PR :)
vnkozlov
left a comment
This looks fine and not complex. I have only nitpicks.
```cpp
#endif

float sum = 0;
for (int j = 0; j < body().body().length(); j++) {
```
What does body().body() mean?
VLoopAnalyzer (this) has multiple analysis subcomponents. One of them is the VLoopBody, i.e. this->body() / this->_body. And it has access to a GrowableArray body(), which maps the nodes of the loop.
Maybe loopBody().nodes() would sound better here. If you prefer that, I'll file a separate renaming RFE.
Yes, would be nice if you move body().body() into separate method with comment explaining it. Thanks!
FYI, I filed: JDK-8371391 C2 SuperWord: rename body().body() to something more understandable
```cpp
}

// Compute the cost over all operations in the (scalar) loop.
float VLoopAnalyzer::cost() const {
```
consider renaming it to cost_for_scalar() and cost_for_scalar() to cost_for_scalar_node()
I'll do some renamings to make it explicit which are for nodes, and which for the loop.
@vnkozlov Thanks for reviewing and the suggestions. I renamed some cost functions, and I like it better this way now too :)
galderz
left a comment
JDK-8370671 C2 SuperWord [x86]: implement Long.max/min reduction for AVX2
This is familiar to me. I discovered this when I was intrinsifying MinL/MaxL for JDK-8307513 and one of my servers only had AVX2. Vectorization kicked in with AVX512 so I left it there.
Note: some of the min/max benchmarks are not very stable. That is due to random input data: in some cases the scalar performance is better because it uses branching.
Looking at the results, seems like most instability is with doubles? In any case, on the topic of instability of min/max and branching, #20098 (comment) has a good analysis on past observations with the JMH benchmark now called MinMaxVector. This benchmark shapes the data such that data in the arrays is laid out to achieve a certain % of branch taken. It might not be fully applicable to the instabilities you observe but might help direct attention.
WRT the code changes in this PR, I don't have anything else to say other than I'm glad basic cases like JDK-8345044 are getting solved.
@galderz Right, I did remember that you had a better benchmark, and that's why I understood more quickly that the issue here with the doubles is just noise :)
vnkozlov
left a comment
Good.
```cpp
float VLoopAnalyzer::cost_for_vector_reduction_node(int opcode, int vlen, BasicType bt, bool requires_strict_order) const {
  // Each reduction is composed of multiple instructions, each estimated with a unit cost.
  // Linear: shuffle and reduce    Recursive: shuffle and reduce
  float c = requires_strict_order ? 2 * vlen : 2 * exact_log2(vlen);
```
Can we ask for the cost of the element-wise opcode here, something like (1 + element_wise_cost) would be more accurate?
To be a little more precise, the strict one should be something like:

```cpp
vlen * (1 + Matcher::vector_op_pre_select_sz_estimate(Op_Extract, bt, vlen)) +
    (vlen - 1) * (1 + Matcher::scalar_op_pre_select_sz_estimate(opcode, bt));
```

and the non-strict one would be:

```cpp
float c = Matcher::vector_op_pre_select_sz_estimate(Op_Extract, bt, 2) * 2 +
          Matcher::scalar_op_pre_select_sz_estimate(opcode) + 3;
for (int i = 4; i <= vlen; i *= 2) {
  c += 2 + Matcher::vector_op_pre_select_sz_estimate(Op_VectorRearrange, bt, i) +
           Matcher::vector_op_pre_select_sz_estimate(opcode, bt, i);
}
```
Maybe refactoring a little bit to make the Matcher::vector_op_pre_select_sz_estimate less awkward would be welcomed, too. Currently, it returns the estimated size - 1, which is unsettling.
@merykitty Can we do that in a follow-up RFE? For now, I'd like to keep it as simple as possible. Cost-models can become arbitrarily complex. There is a bit of a trade-off between simplicity and accuracy. And we can for sure improve things in the future, this PR just lays the foundation.
My goal here is to start as simple as possible, and then add complexity if there is a proven need for it.
So if you/we can find a benchmark where the cost model is not accurate enough yet, provable by -XX:AutoVectorizationOverrideProfitability=0/2, then we should make it more complex.
Would that be acceptable for you?
What exactly does Matcher::vector_op_pre_select_sz_estimate return? The number of instructions or some kind of throughput estimate?
Personally, I don't want to get too stuck on counting instructions, but rather want a throughput estimate. Counting instructions is an estimate for throughput, but I don't know yet if it is the best choice long-term.
I would like to wait a little more, and start depending on the cost model for more and more cases (extract, pack, shuffle, if-conversion, ...) and then we will run into issues along the way where the cost model is not yet accurate enough. And at that point we can think again what would produce the most accurate results.
What exactly does Matcher::vector_op_pre_select_sz_estimate return? The number of instructions or some kind of throughput estimate?
I believe it tries to estimate the number of instructions generated by a node.
I'm filing an RFE now
JDK-8371393
C2 SuperWord: improve cost model
```cpp
// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_vector_node(int opcode, int vlen, BasicType bt) const {
  float c = 1;
```
We have Matcher::vector_op_pre_select_sz_estimate, could it be used here? The corresponding for scalar is Matcher::scalar_op_pre_select_sz_estimate
Same answer as above :)
```cpp
// For now, we use unit cost. We might refine that in the future.
// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_scalar_node(int opcode) const {
```
You need a BasicType parameter for this method, some opcodes are used for multiple kinds of operands.
Will add it :)
Well, I actually tried it right now, and it would take a bit of engineering at the call sites. In quite a few cases the BasicType is not immediately available.
Is it ok if we ignore it for now, and only add it in once we really need it?
```cpp
//
// in_loop: vtn->_idx -> bool
void VTransformGraph::mark_vtnodes_in_loop(VectorSet& in_loop) const {
  assert(is_scheduled(), "must already be scheduled");
```
May I ask if this schedule has already moved unordered reductions like addition out of the loop yet?
optimize happens before schedule. But the unordered reduction is still in the VTransformGraph, and so it is also scheduled. But mark_vtnodes_in_loop will find that the unordered reduction is outside the loop :)
Does that answer your question?
merykitty
left a comment
Thanks for your replies. I think leaving my suggestions to future RFEs is reasonable.
@vnkozlov Thanks for reviewing and the approval! @merykitty Thanks a lot for reviewing as well, and the ideas about improving the cost model. There is actually a lot of literature out there about cost models, and various compilers employ various methods. There could be a lot of exciting work in this area, but let's take it step-by-step ;)
@merykitty @vnkozlov Thank you very much for the reviews! /integrate
Going to push as commit 72989e0.
Your commit was automatically rebased without conflicts.
Note: this looks like a large change, but only about 400-500 lines are VM changes. 2.5k comes from new tests.
Finally: after a long list of refactorings, we can implement the Cost-Model. The refactorings and this implementation were first PoC'd here: #20964
Main goal:
Why cost-model?
Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it is the general idea.
But: some vector ops are expensive. Sometimes, the vector op can be more expensive than the multiple scalar ops it replaces. This is the case with some reduction ops. Or we may introduce a vector op that does not have any corresponding scalar op (e.g. in the case of shuffle). This prevents simple heuristics that only focus on single operations.
Weighing the total cost of the scalar loop vs the vector loop allows us a more "holistic" approach. There may be expensive vector ops, but other cheaper vector ops may still make it profitable.
Implementation
Items:
- VTransform::is_profitable: checks cost-model and some other cost related checks.
- VLoopAnalyzer::cost: scalar loop cost
- VTransformGraph::cost: vector loop cost
- Removed _num_work_vecs and _num_reductions, used to check for "simple" reductions where the only "work" vector was the reduction itself. Reductions were not considered profitable if they were "simple". I was able to lift those restrictions.

Testing
Regular correctness testing, and performance testing. In addition to the JMH micro benchmarks below.
Some History
I have been bothered by "simple" reductions not vectorizing for a long time. It was also a part of my JVMLS2025 presentation.
During JDK9, reductions were first vectorized, but then restricted for "simple" and "2-element" reductions:
Integer/FP scalar reduction optimization
During JDK21, I further improved reductions:
Other reports:
And I've been mapping out the reduction performance with benchmarks: #25387
You can see that we already used to vectorize a lot of cases, but especially did not vectorize:
Future Work, discovered while writing the attached IR test:
Reduction Benchmarks
Results from the benchmark #25387 that is related to the attached IR test.
Legend:
- master: performance before this patch
- P1: default with this patch, i.e. -XX:AutoVectorizationOverrideProfitability=1, relying on the new cost-model.
- P0: patch, but auto vectorization disabled, i.e. -XX:AutoVectorizationOverrideProfitability=0.
- P2: patch, but auto vectorization forced, i.e. -XX:AutoVectorizationOverrideProfitability=2.

How to look at the results below:
- P1 vs master: Lower is better (marked green).
- P1 vs P0: gives you a view on how many cases already profit from auto vectorization in total.
- P1 vs P2: shows us how forced vectorization affects performance. There is basically no impact any more. See results from 8357530: C2 SuperWord: Diagnostic flag AutoVectorizationOverrideProfitability #25387 to see that we used to have a lot of cases where forcing vectorization led to speedups.

Note: some of the min/max benchmarks are not very stable. That is due to random input data: in some cases the scalar performance is better because it uses branching.
- linux_x64 (AVX512)
- windows_x64 (AVX2)
- macosx_x64_sandybridge
- linux_aarch64 (NEON)
- macosx_aarch64 (NEON)

Progress
Issue
Reviewers
Reviewing
Using git

Checkout this PR locally:

```
$ git fetch https://git.openjdk.org/jdk.git pull/27803/head:pull/27803
$ git checkout pull/27803
```

Update a local copy of the PR:

```
$ git checkout pull/27803
$ git pull https://git.openjdk.org/jdk.git pull/27803/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 27803
```

View PR using the GUI difftool:

```
$ git pr show -t 27803
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27803.diff
Using Webrev
Link to Webrev Comment