8302652: [SuperWord] Reduction should happen after loop, when possible #13056
eme64 wants to merge 20 commits into openjdk:master
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
vnkozlov
left a comment
In general it looks good.
 * compiler.loopopts.superword.ReductionPerf
 * @bug 8074981 8302652
 * @summary Test SuperWord Reduction Perf.
 * @requires vm.compiler2.enabled
This is not enough. Yes, we need to check for C2 presence. But you also need to skip arm, ppc and s390, which have C2. You need a second @requires as in the original, but you can reduce the checks for x86:
* @requires vm.simpleArch == "x86" | vm.simpleArch == "x64" | vm.simpleArch == "aarch64" | vm.simpleArch == "riscv64"
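Put together, the test header might look roughly like this (a sketch combining both suggested @requires lines; the @run line and tag order are illustrative, not taken from the actual patch):

```java
/*
 * @test
 * @bug 8074981 8302652
 * @summary Test SuperWord Reduction Perf.
 * @requires vm.compiler2.enabled
 * @requires vm.simpleArch == "x86" | vm.simpleArch == "x64" | vm.simpleArch == "aarch64" | vm.simpleArch == "riscv64"
 * @run main/othervm -Xbatch compiler.loopopts.superword.ReductionPerf
 */
```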
Did you measure total execution time of this test?
A warmup of 2_000 iterations is too big a number, I think. You already have 8192 iterations in the tested methods. 10 should be enough to trigger compilation. Maybe add -Xbatch if you want to make sure C2 does compile it.
If a reduction is not supported (MulReduction on RISC-V and AArch64) the test will be slow.
I reduced the iteration counts to 100 and 1000. For performance measurement they can be increased. I also added -Xbatch. On my laptop, the test now definitely runs in less than 2 seconds, with or without SuperWord. So even platforms that do not support a feature, or SuperWord as a whole, can run it decently fast.
Should you also check that this reduction node doesn't have users inside loop?
@vnkozlov How should I do that? Can that even be done during IGVN? Or should I move the implementation to loopopts?
For example, I could put it before or after try_sink_out_of_loop inside PhaseIdealLoop::split_if_with_blocks_post.
As we discussed offline, you may check and mark the Reduction node if it has users (other than the Phi) inside the loop, in the SuperWord code where we create the vector referenced by the Reduction node.
@jatin-bhateja @sviswa7 A few questions:
@eme64 The double min/max reduction is also affected by JDK-8300865. With the following patch, I see double min/max reduction happening and a good perf gain (> 2x) with your PR: Hope this helps.
@eme64 For long min/max, Math.min(long, long) is currently not intrinsified. Only int/float/double are intrinsified. No scalar intrinsification for Math.min(long, long) means no MinL scalar node is generated, and in turn no vectorization and no reduction.
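For reference, this is the kind of long-min reduction loop being discussed (a hypothetical example, not from the PR; per the comment above, it is not expected to vectorize as long as Math.min(long, long) is not intrinsified):

```java
public class LongMinReduction {
    // Math.min(long, long) has no scalar intrinsic (no MinL node),
    // so C2 is not expected to vectorize this reduction.
    static long min(long[] a) {
        long m = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            m = Math.min(m, a[i]);
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(min(new long[]{3L, 1L, 2L})); // prints 1
    }
}
```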
@sviswa7 thanks for your quick response! I can confirm: we do not "intrinsify" (i.e. turn into
@eme64 We should intrinsify MinL/MaxL when the hardware supports it.
@eme64 MinI doesn't vectorize due to the rewrite as a right-spline graph in MinINode::Ideal.
src/hotspot/share/opto/superword.cpp
Outdated
Hi @eme64, if we move this processing post-SLP to a standalone pass, we can also handle vector IR created through the VectorAPI.
We can also relax the following limitation with your patch: since the loop body will now comprise lane-wise vector operations, with the reduction moved out of the loop, it may allow vectorizing patterns like res += a[i]; which consist of a single load and a reduction operation. Unrolling will then create multiple vector operations within the loop, which may improve performance.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2265
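A minimal, hypothetical example of the pattern referred to above (res += a[i]: one load plus one reduction per iteration, nothing else in the loop body):

```java
public class SimpleSumReduction {
    // The pattern in question: a single load and a single reduction
    // operation per iteration, with no other work in the loop body.
    static int sum(int[] a) {
        int res = 0;
        for (int i = 0; i < a.length; i++) {
            res += a[i];
        }
        return res;
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        System.out.println(sum(a)); // prints 523776 (= 1023 * 1024 / 2)
    }
}
```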
if we move this processing post-SLP to a standalone pass, we can also handle vector IR created through the VectorAPI.
Where exactly would you put it? We need a location during LoopOpts, so that we have the ctrl information. I previously suggested in split_if, but @vnkozlov seemed not very excited. Additionally, I have not seen any case where VectorAPI could make use of it. I gave it a quick look, so maybe you can find something.
Maybe in the long run, we should have a node-by-node pass during loop-opts, and allow all sorts of peep-hole optimizations that require ctrl/idom information. We already have a number of non-split-if optimizations that have snuck into the split_if code. Maybe a refactoring would be a good idea there. What do you think?
And about:
jdk/src/hotspot/share/opto/superword.cpp
Line 2265 in 941a7ac
Yes, that is the hope: that we could allow things like that to vectorize. The question is whether we can guarantee that my new optimization will happen. But it is probably ok to be a bit optimistic here.
Where exactly would you put it? We need a location during LoopOpts, so that we have the ctrl information. I previously suggested in split_if, but @vnkozlov seemed not very excited. Additionally, I have not seen any case where VectorAPI could make use of it. I gave it a quick look, so maybe you can find something.
May I know the penalty you see if we do this as a separate pass towards the end of PhaseIdealLoop::build_and_optimize? There we can iterate over _ltree_root, and for each counted loop marked as a vector loop we can do this processing for all the reduction nodes that are part of the loop body.
There is also an opportunity to support reductions involving non-commutative bytecodes like isub and lsub, but that may need explicit backend support and can be taken up separately.
src/hotspot/share/opto/superword.cpp
Outdated
@jatin-bhateja I want to check if vn is an UnorderedReduction, so I want a bool answer. If I ask for isa_UnorderedReduction(), I would get a UnorderedReductionNode*, and nullptr if it is not an UnorderedReduction.
Maybe I did not understand your suggestion.
src/hotspot/share/opto/superword.cpp
Outdated
A minor nomenclature fix: we can use the name identity_scalar instead of neutral_scalar; 0 is an additive identity, 1 is a multiplicative identity.
@jatin-bhateja Ok, "neutral element" and "identity element" seem to be synonyms. I'll change it to "identity", since that is what we seem to use in the code already.
@jatin-bhateja @vnkozlov @sviswa7 I substantially reworked this RFE, have it working now, and included your suggestions. The algorithm now sits in
The only thing missing for me is:
Webrevs
Nice optimization! Sure, I'll test the benchmark on aarch64 machines.
@eme64 Very nice and clean work. Thanks a lot for taking this up.
Node* use = current->fast_out(k);
if (use != phi && ctrl_or_self(use) == cl) {
  DEBUG_ONLY( current->dump(-1); )
  assert(false, "reduction has use inside loop");
I have been wondering: it is right to bail out of the optimization here, but why do we assert? It is perfectly legal (if not very meaningful) to have a scalar use of the last unordered reduction within the loop. This will still auto-vectorize, as the reduction is to a scalar. E.g. a slight modification of SumRed_Int.java still auto-vectorizes and has a use of the last unordered reduction within the loop:
public static int sumReductionImplement(int[] a, int[] b, int[] c, int total) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
        total += (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
        sum = total + i;
    }
    return total + sum;
}
Do you think this is a valid concern?
I agree, the assert is not strictly necessary, but I'd rather have one assert too many in there, and figure out what cases I missed when the fuzzer eventually finds one. But if it is wished, I can also just remove that assert.
I wrote this Test.java:
class Test {
    static final int RANGE = 1024;
    static final int ITER = 10_000;

    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }

    static int test(int[] data, int sum) {
        int x = 0;
        for (int i = 0; i < RANGE; i++) {
            sum += 11 * data[i];
            x = sum & i; // what happens with this AndI ?
        }
        return sum + x;
    }

    public static void main(String[] args) {
        int[] data = new int[RANGE];
        init(data);
        for (int i = 0; i < ITER; i++) {
            test(data, i);
        }
    }
}
And ran it like this, with my patch:
./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord Test.java
Everything vectorized as usual. But what happens with the AndI? It actually drops out of the loop. Its left input is the AddReductionVI, and its right input is (Phi #tripcount) + 63 (the last i thus already drops out of the loop).
Note: If I have uses of the reduction in each iteration, then we already refuse to vectorize the reduction, as in this case:
static int test(int[] data, int sum) {
    int x = 0;
    for (int i = 0; i < RANGE; i++) {
        sum += 11 * data[i];
        x += sum & i; // vector use of sum prevents vectorization of sum's reduction-vectorization -> whole chain not vectorized
    }
    return sum + x;
}
My conclusion, given my best understanding: either we have a use of the sum in all iterations, which prevents vectorization of the reduction, or we only have a use of the last iteration, and it already drops out of the loop.
So if there is such an odd example, I'd rather we run into an assert in debug and look at it again. Maybe it would be perfectly legal, or maybe it reveals a bug here or elsewhere in the reduction code.
@sviswa7 what do you think?
Ah, but this hits one of my asserts:
static int test(int[] data, int sum) {
    int x = 0;
    for (int i = 0; i < RANGE; i += 8) {
        sum += 11 * data[i + 0];
        sum += 11 * data[i + 1];
        sum += 11 * data[i + 2];
        sum += 11 * data[i + 3];
        x = sum + i;
        sum += 11 * data[i + 4];
        sum += 11 * data[i + 5];
        sum += 11 * data[i + 6];
        sum += 11 * data[i + 7];
    }
    return sum + x;
}
With
./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:MaxVectorSize=16 Test.java
Triggers
jdk/src/hotspot/share/opto/loopopts.cpp
Line 4217 in 31d977c
I will add this as a regression test, and remove that assert. Thanks @sviswa7 for making me look at this more closely :)
Still, I think it may be valuable to keep these two asserts - both indicate that something strange has happened:
jdk/src/hotspot/share/opto/loopopts.cpp
Line 4210 in 31d977c
jdk/src/hotspot/share/opto/loopopts.cpp
Line 4199 in 31d977c
src/hotspot/share/opto/loopopts.cpp
Outdated
Should this be guarded by a Matcher::match_rule_supported_vector check?
Right, makes sense. I'd have to guard before any transformations take place. So maybe I'll add a second method, UnorderedReductionNode::make_normal_vector_op_supported, and use Matcher::match_rule_supported_vector inside.
… added test for it
@sviswa7 thanks for the review!
How about introducing a virtual int vect_Opcode() (or norm_vect_Opcode()) or something similar, which returns the normal vector opcode (Op_AddVI for AddReductionVINode, for example). Then you don't need these 2 functions to be virtual:

virtual int vect_Opcode() const = 0;

VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) {
    return VectorNode::make(vect_Opcode(), in1, in2, vt);
}

bool make_normal_vector_op_implemented(const TypeVect* vt) {
    return Matcher::match_rule_supported_vector(vect_Opcode(), vt->length(), vt->element_basic_type());
}

If we need that in more cases than in your changes, we could have an even more general scalar_Opcode() (in the VectorNode class) and use VectorNode::opcode(scalar_Opcode(), vt->element_basic_type()) to get the normal vector opcode. That may need more changes and testing - a separate RFE.
I am also not sure about the need for _op in these function names.
I agree with Vladimir's comments; we can remove the explicit code from each reduction node class and introduce a factory method VectorNode::make_from_ropc in vectornode.cpp, similar to ReductionNode::make_from_vopc, which accepts a reduction opcode and returns the equivalent vector node.
src/hotspot/share/opto/loopopts.cpp
Outdated
Some naming comments: the make prefix is more suitable for IR creation routines.
@vnkozlov @jatin-bhateja I took the idea with
const Type* bt_t = Type::get_const_basic_type(bt);
// Convert opcode from vector-reduction -> scalar -> normal-vector-op
const int sopc = VectorNode::scalar_opcode(last_ur->Opcode(), bt);
The other changes look good to me. Can you rename VectorNode::scalar_opcode to ReductionNode::scalar_opcode, and also move the vector opcode cases out into a separate vector-to-scalar mapping routine if needed?
Is it not better to have VectorNode::scalar_opcode? It is more general - maybe it is useful in the future.
Is it not better to have VectorNode::scalar_opcode? It is more general - maybe it is useful in the future.

Not a blocker, but since we intend to get a scalar opcode for a ReductionNode, and we have different factory methods for Vector/Reduction nodes, you can keep it for now.
Best Regards,
Jatin
@jatin-bhateja I see your point. On the other hand, we would have quite some code duplication handling all the BasicType cases for every operation. I'll leave it the way I have it now, and we can still reconsider it if we want to in the future.
@sviswa7 @pfustc @vnkozlov @jatin-bhateja Thanks for all the help! Let me know if there is still any concern, otherwise I will integrate this in 24h.
I doubt it, unless there really is a performance payoff.
@jatin-bhateja @sviswa7 @fg1417 @vnkozlov @pfustc |
Going to push as commit 06b0a5e.
Your commit was automatically rebased without conflicts.
jdk/src/hotspot/share/opto/loopopts.cpp
Lines 4125 to 4171 in cc9e7e8
I introduced a new abstract node type UnorderedReductionNode (subtype of ReductionNode). All of the reductions that can be re-ordered are to extend from this node type: int/long add/mul/and/or/xor/min/max, as well as float/double min/max. float/double add/mul do not allow for reordering of operations.

The optimization is part of loop-opts, and is called after SuperWord in PhaseIdealLoop::build_and_optimize.

Performance results
I ran test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java, with 2_000 warmup and 100_000 perf iterations. I also increased the array length to RANGE = 16*1024. I disabled turbo-boost.

Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16. Full avx512 support, including avx512dq required for MulReductionVL.

Legend: M master, P with patch, N no superword reductions (-XX:-SuperWordReductions), 2 AVX2, 3 AVX512.

The lines without a note show a clear speedup, as expected.
Notes:
- int min/max: bug JDK-8302673
- long add/mul: without the patch, it seems that vectorization actually would be slower. Even now, only AVX512 really leads to a speedup. Note: MulReductionVL requires avx512dq.
- long min/max: Math.max(long, long) is currently not intrinsified, JDK-8307513.
- long and/or/xor: without the patch on AVX2, vectorization is slower. With the patch, it is always faster now.
- float/double add/mul: IEEE requires a linear reduction. This cannot be moved outside the loop. Vectorization has no benefit in these examples.
- double min/max: bug JDK-8300865.

Testing
I modified the reduction IR tests, so that they expect at most 2 Reduction nodes (one per main-loop, and optionally one for the vectorized post-loop). Before my patch, these IR tests would find many Reduction nodes, and would have failed. This is because after SuperWord, we unroll the loop multiple times, and so we clone the Reduction nodes inside the main loop.
Passes up to tier5 and stress-testing.
Performance testing did not show any regressions.
TODO can someone benchmark on aarch64?

Discussion
We should investigate if we can now allow reductions more eagerly, at least for UnorderedReduction, as the overhead is now much lower. @jatin-bhateja pointed to this:

jdk/src/hotspot/share/opto/superword.cpp
Line 2265 in 941a7ac
I filed JDK-8307516.
So far, I did not work on byte, char, short; we can investigate this in the future.

FYI: I investigated if this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-input, and the scalar-input is always the identity-element. This optimization here assumes that we have the Phi-loop going through the scalar-input. So I think this optimization really only helps SuperWord for now.
Issue
Reviewers
Reviewing
Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13056/head:pull/13056
$ git checkout pull/13056

Update a local copy of the PR:
$ git checkout pull/13056
$ git pull https://git.openjdk.org/jdk.git pull/13056/head

Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 13056

View PR using the GUI difftool:
$ git pr show -t 13056

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13056.diff
Webrev
Link to Webrev Comment