
8302652: [SuperWord] Reduction should happen after loop, when possible #13056

Closed
wants to merge 20 commits into master

Conversation

eme64
Contributor

@eme64 eme64 commented Mar 16, 2023

// Having ReductionNodes in the loop is expensive. They need to recursively
// fold together the vector values, for every vectorized loop iteration. If
// we encounter the following pattern, we can vector accumulate the values
// inside the loop, and only have a single UnorderedReduction after the loop.
//
// CountedLoop init
// | |
// +------+ | +-----------------------+
// | | | |
// PhiNode (s) |
// | |
// | Vector |
// | | |
// UnorderedReduction (first_ur) |
// | |
// ... Vector |
// | | |
// UnorderedReduction (last_ur) |
// | |
// +---------------------+
//
// We patch the graph to look like this:
//
// CountedLoop identity_vector
// | |
// +-------+ | +---------------+
// | | | |
// PhiNode (v) |
// | |
// | Vector |
// | | |
// VectorAccumulator |
// | |
// ... Vector |
// | | |
// init VectorAccumulator |
// | | | |
// UnorderedReduction +-----------+
//
// We turned the scalar (s) Phi into a vectorized one (v). In the loop, we
// use vector_accumulators, which do the same reductions, but only element
// wise. This is a single operation per vector_accumulator, rather than many
// for a UnorderedReduction. We can then reduce the last vector_accumulator
// after the loop, and also reduce the init value into it.
// We can not do this with all reductions. Some reductions do not allow the
// reordering of operations (for example float addition).
void PhaseIdealLoop::move_unordered_reduction_out_of_loop(IdealLoopTree* loop) {

I introduced a new abstract node type UnorderedReductionNode (a subtype of ReductionNode). All reductions whose operations can be reordered extend from this node type: int/long add/mul/and/or/xor/min/max, as well as float/double min/max. float/double add/mul do not allow reordering of operations.
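As a quick illustration (my own minimal sketch, not code from this PR): floating-point addition is not associative, so reordering a float/double add reduction can change the result, while an int add reduction like the loop below can legally be accumulated lane-wise and reduced once after the loop.

public class ReorderDemo {
    public static void main(String[] args) {
        // Floating-point addition is not associative, so the order of a
        // float add reduction is observable and must stay linear.
        float a = 1e20f, b = -1e20f, c = 1.0f;
        System.out.println((a + b) + c); // prints 1.0
        System.out.println(a + (b + c)); // prints 0.0

        // An int add reduction is associative and commutative, so the
        // compiler may accumulate lane-wise in the loop and do a single
        // reduction after the loop.
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) { data[i] = i; }
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i]; // candidate for the optimization in this PR
        }
        System.out.println(sum);
    }
}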

The optimization is part of loop-opts and is called after SuperWord in PhaseIdealLoop::build_and_optimize.

Performance results
I ran test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java, with 2_000 warmup and 100_000 perf iterations. I also increased the array length to RANGE = 16*1024.

I disabled turbo-boost.
Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16.
Full avx512 support, including avx512dq required for MulReductionVL.

operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
---------------------------------------------------------------
int add       2063   2085   660    530    415    283   |      |
int mul       2272   2257   1189   733    908    439   |      |
int min       2527   2520   2516   2579   2585   2542  | 1    |
int max       2548   2525   2551   2516   2515   2517  | 1    |
int and       2410   2414   602    480    353    263   |      |
int or        2149   2151   597    498    354    262   |      |
int xor       2059   2062   605    476    364    263   |      |
long add      1776   1790   2000   1000   1683   591   | 2    |
long mul      2135   2199   2185   2001   2176   1307  | 2    |
long min      1439   1424   1421   1420   1430   1427  | 3    |
long max      2299   2287   2303   2305   1433   1425  | 3    |
long and      1657   1667   2015   1003   1679   568   | 4    |
long or       1776   1783   2032   1009   1680   569   | 4    |
long xor      1834   1783   2012   1024   1679   570   | 4    |
float add     2779   2644   2633   2648   2632   2639  | 5    |
float mul     2779   2871   2810   2776   2732   2791  | 5    |
float min     2294   2620   1725   1286   872    672   |      |
float max     2371   2519   1697   1265   841    468   |      |
double add    2634   2636   2635   2650   2635   2648  | 5    |
double mul    3053   2955   2881   3030   2979   2927  | 5    |
double min    2364   2400   2439   2399   2486   2398  | 6    |
double max    2488   2459   2501   2451   2493   2498  | 6    |

Legend: M master, P with patch, N no superword reductions (-XX:-SuperWordReductions), 2 AVX2, 3 AVX512.

The lines without a note show a clear speedup, as expected.

Notes:

  1. int min/max: bug JDK-8302673
  2. long add/mul: without the patch, it seems that vectorization would actually be slower. Even now, only AVX512 really leads to a speedup. Note: MulReductionVL requires avx512dq.
  3. long min/max: Math.max(long, long) is currently not intrinsified: JDK-8307513.
  4. long and/or/xor: without the patch on AVX2, vectorization is slower. With the patch, it is always faster now.
  5. float/double add/mul: IEEE requires a linear reduction, which cannot be moved outside the loop. Vectorization has no benefit in these examples.
  6. double min/max: bug JDK-8300865.

Testing

I modified the reduction IR tests so that they expect at most 2 Reduction nodes (one per main loop, and optionally one for the vectorized post loop). Before my patch, these IR tests would have found many Reduction nodes and failed: after SuperWord we unroll the loop multiple times, and so the Reduction nodes inside the main loop are cloned.
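For illustration, a hedged sketch of the kind of IR constraint the modified tests express, written against the IR test framework; the exact rule names and count syntax in the real tests may differ:

import compiler.lib.ir_framework.*;

public class SumRedIRSketch {
    static int[] data = new int[1024];

    public static void main(String[] args) {
        TestFramework.run();
    }

    @Test
    @IR(counts = {IRNode.ADD_REDUCTION_VI, "<= 2"}) // at most: one in the main loop, one in a vectorized post loop
    static int sum() {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}

A "<= 2" count holds whether or not the platform vectorizes the loop, so a rule of this shape does not need per-CPU guards.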

Passes up to tier5 and stress-testing.
Performance testing did not show any regressions.
TODO can someone benchmark on aarch64?

Discussion

We should investigate if we can now allow reductions more eagerly, at least for UnorderedReduction, as the overhead is now much lower. @jatin-bhateja pointed to this:

if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {

I filed JDK-8307516.

So far I have not worked on byte/char/short; we can investigate this in the future.

FYI: I investigated whether this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-input, and the scalar-input is always the identity element. The optimization here assumes that the loop Phi goes through the scalar-input. So I think this optimization really only helps SuperWord for now.
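For comparison, a minimal Vector API sketch (my own example using jdk.incubator.vector; requires --add-modules jdk.incubator.vector): the user code already accumulates lane-wise inside the loop and calls reduceLanes only once after it, which is the shape this optimization produces automatically for SuperWord loops.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorApiReduction {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Accumulate lane-wise in the loop; reduce to a scalar only once after it.
    static int sum(int[] a) {
        IntVector acc = IntVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            acc = acc.add(IntVector.fromArray(SPECIES, a, i));
        }
        int sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail
            sum += a[i];
        }
        return sum;
    }
}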


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8302652: [SuperWord] Reduction should happen after loop, when possible

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13056/head:pull/13056
$ git checkout pull/13056

Update a local copy of the PR:
$ git checkout pull/13056
$ git pull https://git.openjdk.org/jdk.git pull/13056/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13056

View PR using the GUI difftool:
$ git pr show -t 13056

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13056.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 16, 2023

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 16, 2023

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 16, 2023
Contributor

@vnkozlov vnkozlov left a comment

In general it looks good.

* compiler.loopopts.superword.ReductionPerf
* @bug 8074981 8302652
* @summary Test SuperWord Reduction Perf.
* @requires vm.compiler2.enabled
Contributor

This is not enough. Yes, we need to check for C2 presence. But you also need to skip arm, ppc, and s390, which have C2. You need a second @requires as in the original, but you can reduce the checks for x86:

 * @requires vm.simpleArch == "x86" | vm.simpleArch == "x64" | vm.simpleArch == "aarch64" | vm.simpleArch == "riscv64"

Comment on lines 44 to 45
int iter_warmup = 2_000;
int iter_perf = 5_000;
Contributor

Did you measure total execution time of this test?
A warmup of 2_000 iterations is too big a number, I think. You already have 8192 iterations in the tested methods; 10 should be enough to trigger compilation. Maybe add -Xbatch if you want to make sure C2 does compile it.
If a reduction is not supported (MulReduction on RISC-V and AArch64), the test will be slow.

Contributor Author

I reduced the iteration counts to 100 and 1000. For performance measurement they can be increased. I also added -Xbatch. On my laptop, the test now definitely runs in less than 2 seconds, with or without SuperWord. So even platforms that do not support a feature, or SuperWord as a whole, can run it reasonably fast.

if (ctrl == nullptr &&
in1 != nullptr && in1->is_Phi() && in1->in(2) == this && in1->outcnt() == 1 &&
in1->in(0)->is_CountedLoop() &&
in2->is_Vector()) {
Contributor

Should you also check that this reduction node doesn't have users inside loop?

Contributor Author

@vnkozlov How should I do that? Can that even be done during IGVN? Or should I move the implementation to loopopts?

Contributor Author

For example, I could put it before or after try_sink_out_of_loop inside PhaseIdealLoop::split_if_with_blocks_post.

Contributor

@vnkozlov vnkozlov Mar 22, 2023

As we discussed offline, you could check and mark the Reduction node if it has users (other than the Phi) inside the loop, in the SuperWord code where we create the vector referenced by the Reduction node.

@eme64
Contributor Author

eme64 commented Mar 22, 2023

@jatin-bhateja @sviswa7
What do you think about the performance numbers I measured? Do they make sense to you?

A few questions:

  • long min/max: Why do we require avx512vlbwdq in Matcher::match_rule_supported_vector? Would avx512f not be sufficient? C2_MacroAssembler::reduceL leads me to vextracti64x4 (that should only require avx512f) and reduce_operation_256 (where vpminsq only requires avx2 via assert).
  • What do you think about the double min/max performance? What do you think could be the reason it is not similar to the behavior of float min/max?

@sviswa7

sviswa7 commented Mar 22, 2023

@eme64 The double min/max reduction is also affected by JDK-8300865. With the following patch, I see double min/max reduction happening and good perf gain (> 2x) with your PR:

diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index baf880aac20..04b20904cf4 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -3353,7 +3353,8 @@ bool SuperWord::construct_bb() {
             // First see if we can map the reduction on the given system we are on, then
             // make a data entry operation for each reduction we see.
             BasicType bt = use->bottom_type()->basic_type();
-            if (ReductionNode::implemented(use->Opcode(), Matcher::min_vector_size(bt), bt)) {
+            int min_vec_size = Matcher::min_vector_size(bt);
+            if (ReductionNode::implemented(use->Opcode(), min_vec_size < 2 ? 2 : min_vec_size, bt)) {
               reduction_uses++;
             }
           }

Hope this helps.

@sviswa7

sviswa7 commented Mar 22, 2023

@eme64 For long min/max, currently Math.min(long, long) is not getting intrinsified. Only int/float/double are getting intrinsified. No scalar intrinsification for Math.min(long, long) leads to no MinL scalar node generation and in turn no vectorization and no reduction.
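To make that concrete, a minimal sketch (my own example, not from the test suite) of the affected loop shape: with Math.min(long, long) only inlined as compare-and-branch code, there is no MinL node for SuperWord to turn into a min reduction.

public class LongMinLoop {
    // Without intrinsification of Math.min(long, long), this loop body is
    // inlined as CmpL/If-branching, so SuperWord sees no MinL node and
    // cannot build a vectorized min reduction.
    static long min(long[] a) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            min = Math.min(min, a[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1024 - i;
        }
        System.out.println(min(a)); // prints 1
    }
}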

@eme64
Contributor Author

eme64 commented Mar 23, 2023

@sviswa7 thanks for your quick response!

I can confirm: we do not "intrinsify" (ie turn into MinL/MaxL), rather we just inline the java.lang.Math::Min/Max methods, implemented with CmpL / If-branching. Do you think this makes sense, or should we intrinsify, at least when the hardware supports it?

@sviswa7

sviswa7 commented Mar 23, 2023

@sviswa7 thanks for your quick response!

I can confirm: we do not "intrinsify" (ie turn into MinL/MaxL), rather we just inline the java.lang.Math::Min/Max methods, implemented with CmpL / If-branching. Do you think this makes sense, or should we intrinsify, at least when the hardware supports it?

@eme64 We should intrinsify MinL/MaxL when the hardware supports it.

@sviswa7

sviswa7 commented Mar 23, 2023

@eme64 The MinI doesn't vectorize due to the rewrite as a right-spline graph in MinINode::Ideal.

Comment on lines 2887 to 2890
for (uint i = 0; i < unordered_reductions.size(); i++) {
UnorderedReductionNode* ur = unordered_reductions.at(i)->as_UnorderedReduction();
move_unordered_reduction_out_of_loop(ur);
}
Member

Hi @eme64, if we move this processing post-SLP to a stand-alone pass, we can also handle vector IR created through the VectorAPI.

Member

We can also relax the following limitation with your patch: since the loop body will now comprise lane-wise vector operations, with the reduction moved out of the loop, it may allow vectorizing patterns like res += a[i];, which consist of a single load and a reduction operation. Unrolling will then create multiple vector operations within the loop, which may improve performance.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2265

Contributor Author

@eme64 eme64 Mar 30, 2023

@jatin-bhateja

if we move this processing post-SLP to a stand-alone pass, we can also handle vector IR created through the VectorAPI.

Where exactly would you put it? We need a location during LoopOpts, so that we have the ctrl information. I previously suggested in split_if, but @vnkozlov seemed not very excited. Additionally, I have not seen any case where VectorAPI could make use of it. I gave it a quick look, so maybe you can find something.

Maybe in the long run, we should have a node-by-node pass during loop-opts, and allow all sorts of peep-hole optimizations that require ctrl/idom information. We already have a number of non-split-if optimizations that have snuck into the split_if code. Maybe a refactoring would be a good idea there. What do you think?

And about:

if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {

Yes, that is the hope, that we could allow things like that to vectorize. The question is if we can guarantee that my new optimization will happen. But probably it is ok to be a bit optimistic here.

Member

Where exactly would you put it? We need a location during LoopOpts, so that we have the ctrl information. I previously suggested in split_if, but @vnkozlov seemed not very excited. Additionally, I have not seen any case where VectorAPI could make use of it. I gave it a quick look, so maybe you can find something.

May I know what penalty you see if we do this as a separate pass towards the end of PhaseIdealLoop::build_and_optimize, where we iterate over _ltree_root and, for each counted loop marked as a vector loop, do this processing for all the reduction nodes that are part of the loop body?

Member

There is also an opportunity to support reduction involving non-commutative bytecodes like isub and lsub, but it may need explicit backend support and can be taken up separately.

@@ -2664,6 +2667,9 @@ bool SuperWord::output() {
if (node_isa_reduction) {
const Type *arith_type = n->bottom_type();
vn = ReductionNode::make(opc, nullptr, in1, in2, arith_type->basic_type());
if (vn->is_UnorderedReduction()) {
Contributor Author

@jatin-bhateja I want to check if vn is an UnorderedReduction, so I want a bool answer. If I ask for isa_UnorderedReduction(), I would get an UnorderedReductionNode*, and nullptr if it is not an UnorderedReduction.
Maybe I did not understand your suggestion.

BasicType bt = ur->vect_type()->element_basic_type();
const Type* bt_t = Type::get_const_basic_type(bt);

// Create vector of neutral elements (zero for add, one for mul, etc)
Member

A minor nomenclature fix: we can use the name identity_scalar instead of neutral_scalar; 0 is the additive identity, 1 is the multiplicative identity.

Contributor Author

@jatin-bhateja Ok, "neutral element" and "identity element" seem to be synonyms. I'll change it to "identity", since that is what we seem to use in the code already.


@eme64 eme64 marked this pull request as ready for review May 5, 2023 07:31
@openjdk openjdk bot added the rfr Pull request is ready for review label May 5, 2023
@eme64
Contributor Author

eme64 commented May 5, 2023

@jatin-bhateja @vnkozlov @sviswa7 I substantially reworked this RFE, and have it working now, and included your suggestions.

The algorithm now sits in PhaseIdealLoop::build_and_optimize, after SuperWord. It can now handle chains of UnorderedReductions, so it is more robust against unrolling.

The only thing missing for me is:

  1. Benchmark on aarch64. @fg1417 Would you want to have a look at that?
  2. Wait for the performance testing results.

@mlbridge

mlbridge bot commented May 5, 2023

@fg1417

fg1417 commented May 5, 2023

The only thing missing for me is:

  1. Benchmark on aarch64. @fg1417 Would you want to have a look at that?
  2. Wait for the performance testing results.

Nice Optimization! Sure, I'll test the benchmark on aarch64 machines.

@eme64
Contributor Author

eme64 commented May 11, 2023

@jatin-bhateja exactly. With the Vector API the vector reduction can be explicitly put outside the loop. With SuperWord, we need to take care of it in the compiler.

@jatin-bhateja
Member

@jatin-bhateja exactly. With the Vector API the vector reduction can be explicitly put outside the loop. With SuperWord, we need to take care of it in the compiler.

Your changes look good to me. Thanks!

@sviswa7

sviswa7 commented May 12, 2023

@eme64 Very nice and clean work. Thanks a lot for taking this up.

Node* use = current->fast_out(k);
if (use != phi && ctrl_or_self(use) == cl) {
DEBUG_ONLY( current->dump(-1); )
assert(false, "reduction has use inside loop");

I have been wondering: it is right to bail out of the optimization here, but why do we assert? It is perfectly legal (if not very meaningful) to have a scalar use of the last unordered reduction within the loop. This will still auto-vectorize, as the reduction is to a scalar. E.g. a slight modification of SumRed_Int.java still auto-vectorizes and has a use of the last unordered reduction within the loop:
public static int sumReductionImplement(
        int[] a,
        int[] b,
        int[] c,
        int total) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
        total += (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
        sum = total + i;
    }
    return total + sum;
}
Do you think this is a valid concern?

Contributor Author

I agree, the assert is not strictly necessary, but I'd rather have one assert too many in there and figure out what cases I missed when the fuzzer eventually finds one. But if it is wished, I can also just remove that assert.

I wrote this Test.java:

class Test {
    static final int RANGE = 1024;
    static final int ITER  = 10_000;

    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }

    static int test(int[] data, int sum) {
        int x = 0;
        for (int i = 0; i < RANGE; i++) {
            sum += 11 * data[i];
            x = sum & i; // what happens with this AndI ?
        }
        return sum + x;
    }

    public static void main(String[] args) {
        int[] data = new int[RANGE];
        init(data);
        for (int i = 0; i < ITER; i++) {
            test(data, i);
        }
    }
}

And ran it like this, with my patch:

./java -Xbatch -XX:CompileCommand=compileonly,Test::test  -XX:+TraceNewVectors -XX:+TraceSuperWord Test.java

Everything vectorized as usual. But what happens with the AndI? It actually drops outside the loop. Its left input is the AddReductionVI, and the right input is (Phi #tripcount) + 63 (the last i thus already drops outside the loop).

Contributor Author

Note: If I have uses of the reduction in each iteration, then we already refuse to vectorize the reduction, as in this case:

    static int test(int[] data, int sum) {
        int x = 0;
        for (int i = 0; i < RANGE; i++) {
            sum += 11 * data[i];
            x += sum & i;  // vector use of sum prevents vectorization of sum's reduction-vectorization -> whole chain not vectorized
        }
        return sum + x;
    }

Contributor Author

@eme64 eme64 May 12, 2023

My conclusion, given my best understanding: either we have a use of the sum in all iterations, which prevents vectorization of the reduction, or we only have a use of the last iteration, and it already drops out of the loop.

So if there is such an odd example, I'd rather we run into an assert in debug and look at it again. Maybe it would be perfectly legal, or maybe it reveals a bug here or elsewhere in the reduction code.

@sviswa7 what do you think?

Contributor Author

Ah, but this hits one of my asserts:

    static int test(int[] data, int sum) {
        int x = 0;
        for (int i = 0; i < RANGE; i+=8) {
            sum += 11 * data[i+0];
            sum += 11 * data[i+1];
            sum += 11 * data[i+2];
            sum += 11 * data[i+3];
            x = sum + i;
            sum += 11 * data[i+4];
            sum += 11 * data[i+5];
            sum += 11 * data[i+6];
            sum += 11 * data[i+7];
        }
        return sum + x;
    }

With

./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:MaxVectorSize=16 Test.java

Triggers

assert(false, "reduction (not last) has more than one use");

Contributor Author

I will add this as a regression test, and remove that assert. Thanks @sviswa7 for making me look at this more closely :)

Still, I think it may be valuable to keep these two asserts - both indicate that something strange has happened:

assert(false, "reduction has use inside loop");

assert(false, "reduction has ctrl or bad vector_input");


Sounds good to me.

// Create vector_accumulator to replace current.
Node* last_vector_accumulator = current->in(1);
Node* vector_input = current->in(2);
VectorNode* vector_accumulator = current->make_normal_vector_op(last_vector_accumulator, vector_input, vec_t);
Member

Should this be guarded by a Matcher::match_rule_supported_vector check?

Contributor Author

Right, makes sense. I'd have to guard before any transformations take place. So maybe I'll have a second method:
UnorderedReductionNode::make_normal_vector_op_supported, and use Matcher::match_rule_supported_vector inside.

@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 15, 2023
@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 15, 2023
@eme64
Contributor Author

eme64 commented May 16, 2023

@sviswa7 thanks for the review!
@jatin-bhateja @fg1417 @vnkozlov can I have at least one more re-review after the last changes, please?

Comment on lines 243 to 244
virtual VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) = 0;
virtual bool make_normal_vector_op_implemented(const TypeVect* vt) = 0;
Contributor

How about introducing virtual int vect_Opcode() (or norm_vect_Opcode(), or something similar) which returns the normal vector opcode (Op_AddVI for AddReductionVINode, for example)? Then you don't need these 2 functions to be virtual:

  virtual int vect_Opcode() const = 0;
  VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) {
    return VectorNode::make(vect_Opcode(), in1, in2, vt);
  }
  bool make_normal_vector_op_implemented(const TypeVect* vt) {
    return Matcher::match_rule_supported_vector(vect_Opcode(), vt->length(), vt->element_basic_type());
  }

If we need that in more cases than in your changes, maybe have an even more general scalar_Opcode() (in the VectorNode class) and use VectorNode::opcode(scalar_Opcode(), vt->element_basic_type()) to get the normal vector opcode. This may need more changes and testing - a separate RFE.

Contributor

I am also not sure about the need for _op in these function names.

Comment on lines 433 to 438
virtual VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) {
return new MulVINode(in1, in2, vt);
}
virtual bool make_normal_vector_op_implemented(const TypeVect* vt) {
return Matcher::match_rule_supported_vector(Op_MulVI, vt->length(), vt->element_basic_type());
}
Member

I agree with Vladimir's comments; we can remove the explicit calls from each reduction node class and introduce a factory method VectorNode::make_from_ropc in vectornode.cpp, similar to ReductionNode::make_from_vopc, which accepts a reduction opcode and returns the equivalent vector node.

BasicType bt = vec_t->element_basic_type();
const Type* bt_t = Type::get_const_basic_type(bt);

if (!last_ur->make_normal_vector_op_implemented(vec_t)) {
Member

Some naming comments: the make prefix is more suitable for IR creation routines.

@eme64
Contributor Author

eme64 commented May 17, 2023

@vnkozlov @jatin-bhateja I took the idea with VectorNode::scalar_opcode. It might be useful in the future, and it makes the code much simpler.
Running testing again...

Contributor

@vnkozlov vnkozlov left a comment

This looks good.

const Type* bt_t = Type::get_const_basic_type(bt);

// Convert opcode from vector-reduction -> scalar -> normal-vector-op
const int sopc = VectorNode::scalar_opcode(last_ur->Opcode(), bt);
Member

Other changes look good to me. Can you rename VectorNode::scalar_opcode to ReductionNode::scalar_opcode, and also move the vector opcode cases out into a separate vector-to-scalar mapping routine if needed?

Contributor Author

Is it not better to have VectorNode::scalar_opcode? It is more general - maybe it is useful in the future.

Member

Is it not better to have VectorNode::scalar_opcode? It is more general - maybe it is useful in the future.

Not a blocker, but the intent is to get a scalar opcode for a ReductionNode, and we have different factory methods for Vector/Reduction nodes; you can keep it for now.

Best Regards,
Jatin

Contributor Author

@jatin-bhateja I see your point. On the other hand, we would have quite some code duplication handling all the BasicType cases for every operation. I'll leave it the way I have it now, and we can still reconsider it if we want to in the future.

@eme64
Contributor Author

eme64 commented May 22, 2023

@sviswa7 @pfustc @vnkozlov @jatin-bhateja Thanks for all the help! Let me know if there is still any concern, otherwise I will integrate this in 24h.

@theRealAph
Contributor

@sviswa7 thanks for your quick response!
I can confirm: we do not "intrinsify" (ie turn into MinL/MaxL), rather we just inline the java.lang.Math::Min/Max methods, implemented with CmpL / If-branching. Do you think this makes sense, or should we intrinsify, at least when the hardware supports it?

@eme64 We should intrinsify MinL/MaxL when the hardware supports it.

I doubt it, unless there really is a performance payoff.

@eme64
Contributor Author

eme64 commented May 23, 2023

@jatin-bhateja @sviswa7 @fg1417 @vnkozlov @pfustc
Thanks to everybody for the help and reviews!
/integrate

@openjdk

openjdk bot commented May 23, 2023

Going to push as commit 06b0a5e.
Since your change was applied there have been 300 commits pushed to the master branch:

  • 69f508a: 8308300: enhance exceptions in MappedMemoryUtils.c
  • c440827: 8308093: Disable language preview features use in JDK
  • 422128b: 8306992: [JVMCI] mitigate more against JVMCI related OOME causing VM to exit
  • fe8c689: 8308038: java/util/concurrent/ThreadPerTaskExecutor/ThreadPerTaskExecutorTest.java timed out
  • ada416e: 8308235: ThreadContainer registry accumulates weak refs
  • 5d8ba93: 8308046: Move Solaris related charsets from java.base to jdk.charsets module
  • 878162b: 8306507: [linux] Print number of memory mappings in error reports
  • 90d5041: 8300086: Replace NULL with nullptr in share/c1/
  • 8474e69: 8308465: Reduce memory accesses in AArch64 MD5 intrinsic
  • f99ad11: 8302218: CHeapBitMap::free frees with incorrect size
  • ... and 290 more: https://git.openjdk.org/jdk/compare/ccf91f881c06308f39740751161111946487abf1...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 23, 2023
@openjdk openjdk bot closed this May 23, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 23, 2023
@openjdk

openjdk bot commented May 23, 2023

@eme64 Pushed as commit 06b0a5e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
