
Conversation

@eme64 (Contributor) commented Oct 14, 2025

Note: this looks like a large change, but only about 400-500 lines are VM changes; the remaining ~2.5k lines come from new tests.

Finally: after a long list of refactorings, we can implement the cost model. The refactorings and this implementation were first PoC'd here: #20964

Main goal:

  • Carefully allow the vectorization of reduction cases that lead to speedups, and prevent those that do not (or may even cause regressions).
  • Open up future vectorization opportunities that introduce expensive vector nodes, which are only profitable in some cases but not others.

Why cost-model?

Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it captures the general idea.
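
To make the 4x intuition concrete, here is a minimal sketch (C++ for illustration, standing in for the Java loops SuperWord actually compiles):

// Four independent scalar adds per unrolled iteration: roughly cost 4
// under a unit-cost-per-instruction model.
void add4(const int* a, const int* b, int* c, int n) {
  for (int i = 0; i + 3 < n; i += 4) {
    c[i]     = a[i]     + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
  }
}
// After vectorization, the four adds become a single 4-lane vector add,
// roughly cost 1, hence the ~4x intuition for the loop body.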

But: some vector ops are expensive. Sometimes, a vector op can be more expensive than the multiple scalar ops it replaces. This is the case with some reduction ops. Or we may introduce a vector op that has no corresponding scalar op (e.g. a shuffle). This defeats simple heuristics that focus only on single operations.

Weighing the total cost of the scalar loop against the vector loop allows a more "holistic" approach: there may be expensive vector ops, but other, cheaper vector ops may still make vectorization profitable overall.
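
A worked example, using the unit-cost estimates this patch introduces (see the cost_for_vector_reduction excerpt in the review discussion below): a strict-order float add reduction with vlen = 4 is estimated at 2 * 4 = 8 instruction units, replacing 4 scalar adds of total cost 4, so in isolation it is a loss. But if the loop body also contains, say, two vector loads and a vector store (cost ~3) replacing 12 scalar memory ops (cost ~12), the vector total of ~11 still beats the scalar total of ~16.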

Implementation

Items:

  • New VTransform::is_profitable: evaluates the cost model, plus some other cost-related checks (see the sketch after this list).
    • VLoopAnalyzer::cost: scalar loop cost
    • VTransformGraph::cost: vector loop cost
  • The old reduction heuristic used _num_work_vecs and _num_reductions to detect "simple" reductions, where the only "work" vector was the reduction itself; such reductions were not considered profitable. I was able to lift those restrictions.
  • Adapted existing tests.
  • Wrote a new comprehensive test, matching the related JMH benchmark, which we use below.
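
A standalone sketch of the resulting decision structure (illustrative names and shapes only, not the actual VM code; the real VTransform::is_profitable also performs additional cost-related checks):

struct LoopCosts {
  float scalar; // like VLoopAnalyzer::cost: summed per-node cost of the scalar loop body
  float vector; // like VTransformGraph::cost: summed per-node cost of the vectorized body
};

// Both totals cover the same number of iterations (the vector body handles
// vlen iterations at once), so the two sums compare directly.
static bool is_profitable(const LoopCosts& costs) {
  return costs.vector < costs.scalar;
}

The interesting part is therefore entirely in how the per-node costs are estimated, which the code excerpts in the review discussion below show.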

Testing
Regular correctness testing and performance testing, in addition to the JMH micro benchmarks below.


Some History

I have been bothered by "simple" reductions not vectorizing for a long time. It was also a part of my JVMLS2025 presentation.

During JDK9, reductions were first vectorized, but then restricted for "simple" and "2-element" reductions:

  • JDK-8074981
    Integer/FP scalar reduction optimization
    • Vectorized reductions, but led to regressions in some cases.
  • JDK-8078563 Restrict reduction optimization
    • Disabled vectorization for many cases. It seems we disabled a bit too many: the regression really only happened for the float/double add/mul cases with linear reductions, while the int/long reductions were not affected but still got disabled. We filed the following RFE for investigation:
  • JDK-8188313 C2: Consider enabling auto-vectorization for simple reductions (disabled by JDK-8078563)
    • Was never addressed.

During JDK21, I further improved reductions:

  • JDK-8302652 [SuperWord] Reduction should happen after loop, when possible
    • Now "simple" and "2-element" reductions of the int/long variety would be even more worth it, but still disabled because of JDK-8078563.

Other reports:

  • JDK-8345044 Sum of array elements not vectorized
  • JDK-8336000 C2 SuperWord: report that 2-element reductions do not vectorize
  • JDK-8307516 C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction

I have also been mapping out the reduction performance with benchmarks: #25387
You can see that we already vectorized a lot of cases, but notably did not vectorize:

  • "simple" reductions
  • "2-element" reductions

Future Work, discovered while writing the attached IR test:

  • JDK-8370671 C2 SuperWord [x86]: implement Long.max/min reduction for AVX2
  • JDK-8370673 C2 SuperWord [x86]: implement long mul reduction
  • JDK-8370677 C2 SuperWord [aarch64]: implement sequential reduction for add/mul D/F
  • JDK-8370685 C2 SuperWord: investigate why longMulBig does not vectorize
  • JDK-8370686 C2 SuperWord [aarch64]: investigate long mul reductions performance on NEON

Reduction Benchmarks

Results from the benchmark #25387 that is related to the attached IR test.

Legend:

  • master: performance before this patch
  • P1: default with this patch, i.e. -XX:AutoVectorizationOverrideProfitability=1, relying on the new cost model.
  • P0: patch, but auto vectorization disabled, i.e. -XX:AutoVectorizationOverrideProfitability=0.
  • P2: patch, but auto vectorization forced, i.e. -XX:AutoVectorizationOverrideProfitability=2.
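
For reference, the shape of an invocation with the override flag (the benchmark name here is hypothetical; since the flag is diagnostic per JDK-8357530, it needs unlocking):

$ java -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2 -jar benchmarks.jar MyReductionBenchmark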

How to look at the results below:

  • On the left, we have the raw performance numbers, and the errors.
  • On the right, we have the performance differences, marked with colors.
  • First focus on P1 vs master. Lower is better (marked green).
  • P1 vs P0 gives you a view on how many cases already profit from auto vectorization in total.
  • P1 vs P2 shows us how forced vectorization affects performance. There is basically no impact any more. See the results from #25387 (8357530: C2 SuperWord: Diagnostic flag AutoVectorizationOverrideProfitability) to see that we used to have a lot of cases where forcing vectorization led to speedups.

Note: some of the min/max benchmarks are not very stable. That is due to random input data: in some cases the scalar performance is better because it uses branching.

linux_x64 (AVX512)
[benchmark results image]

windows_x64 (AVX2)
[benchmark results image]

macosx_x64_sandybridge
[benchmark results image]

linux_aarch64 (NEON)
[benchmark results image]

macosx_aarch64 (NEON)
[benchmark results image]


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8340093: C2 SuperWord: implement cost model (Enhancement - P4)

Reviewers: kvn, qamai (per the commit message above)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27803/head:pull/27803
$ git checkout pull/27803

Update a local copy of the PR:
$ git checkout pull/27803
$ git pull https://git.openjdk.org/jdk.git pull/27803/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27803

View PR using the GUI difftool:
$ git pr show -t 27803

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27803.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper bot commented Oct 14, 2025

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk bot commented Oct 14, 2025

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8340093: C2 SuperWord: implement cost model

Reviewed-by: kvn, qamai

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated, there had been 144 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "8340093" to "8340093: C2 SuperWord: implement cost model" Oct 14, 2025
@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Oct 14, 2025
@openjdk bot commented Oct 14, 2025

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@eme64 eme64 marked this pull request as ready for review November 3, 2025 12:20
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 3, 2025
@mlbridge bot commented Nov 3, 2025

Webrevs

@SirYwell (Member) left a comment

Nice work :)

Comment on lines 628 to 634
// For now, we use unit cost. We might refine that in the future.
// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_vector_reduction(int opcode, int vlen, BasicType bt, bool requires_strict_order) const {
  // Each reduction is composed of multiple instructions, each estimated with a unit cost.
  // Linear:    shuffle and reduce
  // Recursive: shuffle and reduce
  float c = requires_strict_order ? 2 * vlen : 2 * exact_log2(vlen);
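
(For scale, the quoted formula estimates a strict-order reduction at vlen = 8 as 2 * 8 = 16 instruction units, vs 2 * log2(8) = 6 for an unordered one.)
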
Member

"unit cost" sounds a bit too simple given that there is some kind of estimation going on already. Maybe it would make sense to add some discussion how strict order affects the shape of the resulting vectorized code?

I assume cases where the reduction can be moved after the loop are covered somewhere else?

Contributor Author

Thanks for the comment :)

By "unit cost" I mean unit cost per hardware instruction. Reduction ops use multiple instructions, so we count the instructions, and return that count.

Yes, if we move reductions out of the loop, then the reduction node is not in the loop anymore, and instead we have vector accumulators. And then we count the cost of the vector accumulators.

That's why I need methods like VTransformGraph::mark_vtnodes_in_loop to know what nodes are in the loop (the new vector accumulators, and not the reductions if moved out of the loop).

I think I'll improve the comments a little to make that more clear :)

Member

Ah, when referring to hardware instructions this makes perfect sense; somehow I assumed "unit cost of a node". Thanks for clarifying!

Co-authored-by: Hannes Greule <SirYwell@users.noreply.github.com>
Comment on lines -132 to +149

-  // NEON instructions support them. But the match rule support for them is profitable for
-  // Vector API intrinsics.
+  // NEON instructions support them. They use multiple instructions which is more
+  // expensive in almost all cases where we would auto vectorize.
+  // But the match rule support for them is profitable for Vector API intrinsics.
   if ((opcode == Op_VectorCastD2X && (bt == T_INT || bt == T_SHORT)) ||
       (opcode == Op_VectorCastL2X && bt == T_FLOAT) ||
       (opcode == Op_CountLeadingZerosV && bt == T_LONG) ||
       (opcode == Op_CountTrailingZerosV && bt == T_LONG) ||
+      opcode == Op_MulVL ||
       // The implementations of Op_AddReductionVD/F in Neon are for the Vector API only.
       // They are not suitable for auto-vectorization because the result would not conform
       // to the JLS, Section Evaluation Order.
       // Note: we could implement sequential reductions for these reduction operators, but
       //       this will still almost never lead to speedups, because the sequential
       //       reductions are latency limited along the reduction chain, and not
       //       throughput limited. This is unlike unordered reductions (associative op)
       //       and element-wise ops which are usually throughput limited.
       opcode == Op_AddReductionVD || opcode == Op_AddReductionVF ||
-      opcode == Op_MulReductionVD || opcode == Op_MulReductionVF ||
-      opcode == Op_MulVL) {
+      opcode == Op_MulReductionVD || opcode == Op_MulReductionVF) {
Contributor Author

Note: no functional changes, only moving Op_MulVL up to the other cases that work the same way as it, and improving some comments.

-  _do_vector_loop(phase()->C->do_vector_loop()),  // whether to do vectorization/simd style
-  _num_work_vecs(0),                              // amount of vector work we have
-  _num_reductions(0)                              // amount of reduction work we have
+  _do_vector_loop(phase()->C->do_vector_loop())   // whether to do vectorization/simd style
Contributor Author

Note: part of old reduction heuristic, no longer needed.

Comment on lines -1104 to +1248

-  if (!Matcher::match_rule_supported_vector(vopc, vlen, bt)) {
-    DEBUG_ONLY( this->print(); )
-    assert(false, "do not have normal vector op for this reduction");
-    return false; // not implemented
+  if (!Matcher::match_rule_supported_auto_vectorization(vopc, vlen, bt)) {
+    // The element-wise vector operation needed for the vector accumulator
+    // is not implemented / supported.
+    return false;
Contributor Author

I consider this a "performance bug", but it makes sense to fix it here.
match_rule_supported_vector returns true on aarch64 for MulVL, but match_rule_supported_auto_vectorization returns false. That is because MulVL has a "fake vector implementation" in the NEON backend: it just extracts to scalars, does the multiplications as scalar ops, and packs the results again.

Since we are auto-vectorizing, we should trust match_rule_supported_auto_vectorization here.

On aarch64, MulReductionVL is allowed for vectorization. But if we move it out of the loop here, we end up introducing a MulVL, which is very much not profitable. Making this change avoids that issue, and is also consistent with using match_rule_supported_auto_vectorization instead of match_rule_supported_vector elsewhere in SuperWord.


@eme64 (Contributor, Author) commented Nov 3, 2025

@SirYwell Thanks for the comments and suggestions :)
I pushed a small update, hope that helps. I also added some GitHub comments that should additionally help in understanding some of the small changes.

Comment on lines +1909 to 1910
if (_trace._info) {
tty->print_cr("\nForced bailout of vectorization (AutoVectorizationOverrideProfitability=0).");
Contributor

Side note. Consider separate RFE to change this to UL for such outputs.

Contributor Author

Absolutely. The tricky part is that the current TraceAutoVectorization is a compile command that can be enabled with method name filtering. Is that already available via UL now?

Contributor

Unfortunately no. I think this is what @anton-seoane worked on before.

Contributor

Yes, I have taken up the task again, so sooner rather than later CompileCommand filtering for UL will be enabled for cases such as this.

Contributor Author

Ok, that's what I thought. For now, I'll extend the tracing the way I've been doing, and once we have UL available with method-level filtering, then I can migrate it all in one single PR :)

@vnkozlov (Contributor) left a comment

This looks fine and not complex. I have only nitpicks.

#endif

float sum = 0;
for (int j = 0; j < body().body().length(); j++) {
Contributor

What does body().body() mean?

Contributor Author

VLoopAnalyzer (this) has multiple analysis subcomponents. One of them is the VLoopBody, i.e. this->body() / this->_body. And it exposes a GrowableArray via body(), which holds the nodes of the loop.

Maybe loopBody().nodes() would sound better here. If you prefer that, I can file a separate renaming RFE.

Contributor

Yes, would be nice if you move body().body() into separate method with comment explaining it. Thanks!

Contributor Author

FYI, I filed: JDK-8371391 C2 SuperWord: rename body().body() to something more understandable

}

// Compute the cost over all operations in the (scalar) loop.
float VLoopAnalyzer::cost() const {
Contributor

consider renaming it to cost_for_scalar() and cost_for_scalar() to cost_for_scalar_node()

Contributor Author

I'll do some renamings to make it explicit which are for nodes, and which for the loop.

@eme64 (Contributor, Author) commented Nov 5, 2025

@vnkozlov Thanks for reviewing and the suggestions. I renamed some cost functions, and I like it better this way now too :)

@galderz (Contributor) left a comment

JDK-8370671 C2 SuperWord [x86]: implement Long.max/min reduction for AVX2

This is familiar to me. I discovered this when I was intrinsifying MinL/MaxL for JDK-8307513 and one of my servers only had AVX2. Vectorization kicked in with AVX512, so I left it there.

Note: some of the min/max benchmarks are not very stable. That is due to random input data: in some cases the scalar performance is better because it uses branching.

Looking at the results, it seems like most of the instability is with doubles? In any case, on the topic of instability of min/max and branching, #20098 (comment) has a good analysis of past observations with the JMH benchmark now called MinMaxVector. That benchmark shapes the data such that the arrays are laid out to achieve a certain percentage of branches taken. It might not be fully applicable to the instabilities you observe, but it might help direct attention.

WRT to the code changes in this PR, I don't have anything else to say other than I'm glad basic cases like JDK-8345044 are getting solved.

@eme64 (Contributor, Author) commented Nov 5, 2025

@galderz Right, I remembered that you had a better benchmark, and that's why I understood more quickly that the issue here with the doubles is just noise :)

@vnkozlov (Contributor) left a comment

Good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 5, 2025
float VLoopAnalyzer::cost_for_vector_reduction_node(int opcode, int vlen, BasicType bt, bool requires_strict_order) const {
  // Each reduction is composed of multiple instructions, each estimated with a unit cost.
  // Linear:    shuffle and reduce
  // Recursive: shuffle and reduce
  float c = requires_strict_order ? 2 * vlen : 2 * exact_log2(vlen);
Member

Can we ask for the cost of the element-wise opcode here, something like (1 + element_wise_cost) would be more accurate?

Member

To be a little more precise, the strict one should be something like:

vlen * (1 + Matcher::vector_op_pre_select_sz_estimate(Op_Extract, bt, vlen)) + (vlen - 1) * (1 + Matcher::scalar_op_pre_select_sz_estimate(opcode, bt));

and the non-strict one would be:

float c = Matcher::vector_op_pre_select_sz_estimate(Op_Extract, bt, 2) * 2 + Matcher::scalar_op_pre_select_sz_estimate(opcode, bt) + 3;
for (int i = 4; i <= vlen; i *= 2) {
  c += 2 + Matcher::vector_op_pre_select_sz_estimate(Op_VectorRearrange, bt, i) + Matcher::vector_op_pre_select_sz_estimate(opcode, bt, i);
}

Maybe refactoring a little bit to make the Matcher::vector_op_pre_select_sz_estimate less awkward would be welcomed, too. Currently, it returns the estimated size - 1, which is unsettling.

Contributor Author

@merykitty Can we do that in a follow-up RFE? For now, I'd like to keep it as simple as possible. Cost models can become arbitrarily complex, and there is a bit of a trade-off between simplicity and accuracy. We can for sure improve things in the future; this PR just lays the foundation.

My goal here is to start as simple as possible, and then add complexity if there is a proven need for it.

So if you/we can find a benchmark where the cost model is not accurate enough yet, provable by -XX:AutoVectorizationOverrideProfitability=0/2, then we should make it more complex.

Would that be acceptable for you?

Contributor Author

What exactly does Matcher::vector_op_pre_select_sz_estimate return? The number of instructions or some kind of throughput estimate?

Personally, I don't want to get too attached to counting instructions; I'd rather get a throughput estimate. Counting instructions is an estimate of throughput, but I don't know yet if it is the best one long-term.

I would like to wait a little longer and start depending on the cost model for more and more cases (extract, pack, shuffle, if-conversion, ...). We will then run into issues along the way where the cost model is not yet accurate enough, and at that point we can think again about what would produce the most accurate results.

Member

What exactly does Matcher::vector_op_pre_select_sz_estimate return? The number of instructions or some kind of throughput estimate?

I believe it tries to estimate the number of instructions generated by a node.

Contributor Author

I'm filing an RFE now

Contributor Author

JDK-8371393: C2 SuperWord: improve cost model

// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_vector_node(int opcode, int vlen, BasicType bt) const {
  float c = 1;
Member

We have Matcher::vector_op_pre_select_sz_estimate, could it be used here? The corresponding one for scalars is Matcher::scalar_op_pre_select_sz_estimate.

Contributor Author

Same answer as above :)

// For now, we use unit cost. We might refine that in the future.
// If needed, we could also use platform specific costs, if the
// default here is not accurate enough.
float VLoopAnalyzer::cost_for_scalar_node(int opcode) const {
Member

You need a BasicType parameter for this method; some opcodes are used for multiple kinds of operands.

Contributor Author

Will add it :)

Contributor Author

Well, I actually tried it just now, and it would take a bit of engineering at the call sites. In quite a few cases the BasicType is not immediately available.

Is it ok if we ignore it for now, and only add it in once we really need it?

//
// in_loop: vtn->_idx -> bool
void VTransformGraph::mark_vtnodes_in_loop(VectorSet& in_loop) const {
  assert(is_scheduled(), "must already be scheduled");
Member

May I ask whether this schedule has already moved unordered reductions like addition out of the loop?

Contributor Author

optimize happens before schedule. The unordered reduction is still in the VTransformGraph, and so it is also scheduled. But mark_vtnodes_in_loop will find that the unordered reduction is outside the loop :)

Does that answer your question?

@merykitty (Member) left a comment

Thanks for your replies. I think leaving my suggestions to future RFEs is reasonable.


@eme64 (Contributor, Author) commented Nov 6, 2025

@vnkozlov Thanks for reviewing and the approval!
FYI, I filed: JDK-8371391 C2 SuperWord: rename body().body() to something more understandable

@merykitty Thanks a lot for reviewing as well, and the ideas about improving the cost model. There is actually a lot of literature out there about cost models, and various compilers employ various methods. There could be a lot of exciting work in this area, but let's take it step-by-step ;)
FYI, I filed: JDK-8371393 C2 SuperWord: improve cost model

@eme64 (Contributor, Author) commented Nov 10, 2025

@merykitty @vnkozlov Thank you very much for the reviews!

/integrate

@openjdk bot commented Nov 10, 2025

Going to push as commit 72989e0.
Since your change was applied there have been 200 commits pushed to the master branch.

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 10, 2025
@openjdk openjdk bot closed this Nov 10, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 10, 2025
@openjdk bot commented Nov 10, 2025

@eme64 Pushed as commit 72989e0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated


6 participants