New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8266054: VectorAPI rotate operation optimization #3720
Conversation
/label hotspot-compiler-dev |
|
@jatin-bhateja The label
|
@jatin-bhateja The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
/label remove hotspot |
@jatin-bhateja |
Webrevs
|
/label add hotspot-compiler-dev |
@jatin-bhateja |
I noticed the tests are only updated for int and long, is that intentional? The HotSpot changes in some cases seem to imply all integral types are supported via the use of is_integral_type
, contradicted by the use of is_subword_type
.
I would recommend trying to leverage Integer/Long.rotateLeft/Right implementations. They are not available for byte/short, so lets add specific methods in those cases, that should make the Java op implementation clearer.
|
||
@Benchmark | ||
public void testRotateLeftI(Blackhole bh) { | ||
for(int i = 0 ; i < 10000; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the outer loop. JMH will do that for you.
public class RotateBenchmark { | ||
|
||
@Param({"64","128","256"}) | ||
public int TESTSIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public int TESTSIZE; | |
int size; |
Lower case for instance field names (same applies to the species).
No need for public
.
public final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE}; | ||
public final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE}; | |
public final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE}; | |
static final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE}; | |
static final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE}; |
public void testRotateLeftI(Blackhole bh) { | ||
for(int i = 0 ; i < 10000; i++) { | ||
for (int j = 0 ; j < TESTSIZE; j+= ISPECIES.length()) { | ||
vecI = IntVector.fromArray(ISPECIES, inpI, j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vecI = IntVector.fromArray(ISPECIES, inpI, j); | |
var vecI = IntVector.fromArray(ISPECIES, inpI, j); |
Use a local variable instead of storing to a field
vecI = vecI.lanewise(VectorOperators.ROL, i); | ||
vecI = vecI.lanewise(VectorOperators.ROL, i); | ||
vecI.lanewise(VectorOperators.ROL, i).intoArray(resI, j); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
return resI; |
return the result to better ensure HotSpot does not detect the result is unused.
@@ -730,6 +725,24 @@ public abstract class $abstractvectortype$ extends AbstractVector<$Boxtype$> { | |||
v0.bOp(v1, (i, a, n) -> ($type$)(a >> n)); | |||
case VECTOR_OP_URSHIFT: return (v0, v1) -> | |||
v0.bOp(v1, (i, a, n) -> ($type$)((a & LSHR_SETUP_MASK) >>> n)); | |||
#if[long] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend you create new methods in IntVector
etc called rotateLeft
and rotateRight
that do what is in the lambda expressions. Then you can collapse this to non-conditional cases calling those methods.
Do the same for the tests (like i did with the unsigned support), see
#if[!FP] |
and
gen_compare_op "UNSIGNED_LT" "ult" "BITWISE" |
That will avoid the embedding of complex expressions.
src/hotspot/cpu/x86/x86.ad
Outdated
@@ -1649,6 +1649,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType | |||
break; | |||
case Op_RotateRightV: | |||
case Op_RotateLeftV: | |||
if (is_subword_type(bt)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that have the effect of not intrinsifying for byte
or short
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes sure that intrinsification is based on Shifts and Or operations instead of Rotation operation.
Hi @PaulSandoz , thanks your comments have been addressed. |
Testing-wise, can we reuse the Kernel-Binary-*-op.template
files? hence no need for separate templates
Further, i think we need to test with the vector accepting lane-wise method and the broadcast accepting method, since they go through different code paths. The broadcast method can use primitive type rather than cast to int
, likely making it easier to reuse the binary templates.
It would be good if the scalar methods for rotating left/right were identical for the main code and tests. I prefer the code in the test methods.
@@ -521,9 +521,9 @@ static boolean opKind(Operator op, int bit) { | |||
/** Produce {@code a>>>(n&(ESIZE*8-1))}. Integral only. */ | |||
public static final /*bitwise*/ Binary LSHR = binary("LSHR", ">>>", VectorSupport.VECTOR_OP_URSHIFT, VO_SHIFT); | |||
/** Produce {@code rotateLeft(a,n)}. Integral only. */ | |||
public static final /*bitwise*/ Binary ROL = binary("ROL", "rotateLeft", -1 /*VectorSupport.VECTOR_OP_LROTATE*/, VO_SHIFT | VO_SPECIAL); | |||
public static final /*bitwise*/ Binary ROL = binary("ROL", "rotateLeft", VectorSupport.VECTOR_OP_LROTATE, VO_SHIFT | VO_SPECIAL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove the VO_SPECIAL
flag on ROL
and ROR
now it is uniformly managed?
Java code updates look good
I believe you can now remove the four "-Rotate_.template" files now that you leverage exiting templates?
Also, i believe ancillary changes to gen-template.sh
are no longer strictly required, now that we defer to method calls for ROL/ROR?
Thanks Paul, redundant files (missed in last check-in) have been removed and benchmark results with latest code updated. |
Looks good. Someone from the HotSpot side needs to review related changes.
The way i read the perf numbers is that on non AVX512 systems the numbers are in the noise (no worse, no better), with significant improvement on AVX512.
1 similar comment
Just some format issues when I tried to use this benchmark.
intinp[i] = i; | ||
longinp[i] = i; | ||
} | ||
for (int i = 0 ; i < specialvalsbyte.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (int i = 0 ; i < specialvalsbyte.length; i++) { | |
for (int i = 0; i < specialvalsbyte.length; i++) { |
Please remove this kind of space.
@Benchmark | ||
public void testRotateLeftB(Blackhole bh) { | ||
ByteVector bytevec = null; | ||
for (int j = 0 ; j < size; j+= bspecies.length()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (int j = 0 ; j < size; j+= bspecies.length()) { | |
for (int j = 0 ; j < size; j += bspecies.length()) { |
Needs a space between +=
.
int rshiftopc = VectorNode::opcode(urshiftopc(), elem_bt); | ||
if (!is_supported && | ||
arch_supports_vector(lshiftopc, num_elem, elem_bt, VecMaskNotUsed) && | ||
arch_supports_vector(rshiftopc, num_elem, elem_bt, VecMaskNotUsed) && | ||
arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) { | ||
is_supported = true; | ||
} | ||
return is_supported; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments here why the Left/Right shift and Or opcodes are being checked here. Also add comments why for left shift we are only checking for int and long left shift opcodes whereas for right shift sub word opcodes are being checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both left and right shifts opcodes are selected for all integral types (byte/short/int/long). VectorNode::opcode returns the granular left shift type based on the sub-type i.e. elem_bt in case of LeftShiftI. Re-organizing the code for better readability.
if (!is_supported && | ||
arch_supports_vector(lshiftopc, num_elem, elem_bt, VecMaskNotUsed) && | ||
arch_supports_vector(rshiftopc, num_elem, elem_bt, VecMaskNotUsed) && | ||
arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) { | ||
is_supported = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When check_bcast is set, is_supported could be false when replicate is not supported. Is replicate not needed for shift+or sequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_bcast is true only when shift value is a non-constant scalar value, in that case we need to check for broadcasting operation for shift, in all other cases broadcast is not needed. Constant shift value is an optimizing case since AVX512 offers instructions which directly accept 8bit immediate shift value.
} else { | ||
// TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad. | ||
if ((sopc != 0) && | ||
!arch_supports_vector(sopc, num_elem, elem_bt, is_vector_mask(vbox_klass) ? VecMaskUseAll : VecMaskNotUsed)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we not call arch_supports_vector_rotate from arch_supports_vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
bool is_const_rotate = is_rotate && cnt_type && cnt_type->is_con() && | ||
-0x80 <= cnt_type->get_con() && cnt_type->get_con() < 0x80; | ||
if (is_rotate) { | ||
if (!arch_supports_vector_rotate(sopc, num_elem, elem_bt, !is_const_rotate)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the relationship between check_bcast and !is_const_rotate? Some comments here on this would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant shift value is an optimizing case since AVX512 offers instructions which directly accept constant shifts in the range (-256, 255). Similar handling is done in SLP.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2493
But I feel this is very X86 specific check in generic code, so moving decision to a new target specific matcher routine.
cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) : cnt; | ||
opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt)); | ||
} else { | ||
// constant shift. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean constant rotate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
assert(VectorNode::is_invariant_vector(cnt), "Broadcast expected"); | ||
cnt = cnt->in(1); | ||
if (bt == T_LONG) { | ||
// Shift count vector for Rotate vector has long elements too. | ||
assert(cnt->Opcode() == Op_ConvI2L, "ConvI2L expected"); | ||
cnt = cnt->in(1); | ||
} | ||
shiftRCnt = phase->transform(new AndINode(cnt, phase->intcon(shift_mask))); | ||
shiftRCnt = cnt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we remove the And with mask here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And'ing with shift_mask is already done on Java API side implementation before making a call to intrinsic rountine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path is also taken from non vector api path, wont masking be needed there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jatin-bhateja This question is still pending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sviswa7, SLP flow will either have a constant 8bit shift value or a variable shift present in vector, this also include broadcasted non-constant shift value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better comment here, since the correctness relay on some others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theRealELiu , @sviswa7 , comment already exist in code, I guess I mentioned incorrectly earlier on this thread, rectified my comments.
Hi @sviswa7 your comments have been addressed. |
Hi @sviswa7, |
if (!is_const_rotate) { | ||
const Type * type_bt = Type::get_const_basic_type(elem_bt); | ||
cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) : cnt; | ||
opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt)); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why conversion for only T_LONG and not for T_BYTE and T_SHORT? Is there an assumption here that only T_INT and T_LONG elem_bt are supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correcting this, I2L may be needed in auto-vectorization flow since Integer/Long.rotate[Right/Left] APIs accept only integral shift, so for Long.rotate* operations integral shift value must be converted to long using I2L before broadcasting it. VectorAPI lanewise operations between vector-scalar, scalar type already matches with vector basic type. Since degeneration routine is common b/w both the flows so maintaining IR consistency here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Vector API the shift is always coming in as int type for rotate by scalar (lanewiseShiftTemplate). The down conversion to byte or short needs to be done before scalar2vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that similar thing is done before for shift, so down conversion to sub type is not required.
Node* shift_mask_node = (bt == T_LONG) ? (Node*)(phase->longcon(shift_mask + 1L)) : | ||
(Node*)(phase->intcon(shift_mask + 1)); | ||
Node* vector_mask = phase->transform(VectorNode::scalar2vector(shift_mask_node,vlen, elem_ty)); | ||
int subVopc = VectorNode::opcode((bt == T_LONG) ? Op_SubL : Op_SubI, bt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be an assumption here that the vector type is INT or LONG only and not subword type. From Vector API you can get the sub word types as well.
Also if this path is coming from auto-vectorizer, don't we need masking here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subtype is being passed to VectorNode::opcode for correct opcode selection. Also shift_mask_node is a constant value node, so there is no assumption on vector type. Wrap around (masking) for shift value may not be needed here since we are degenerating rotate into shifts (logical left and rights). Also scalar rotate nodes getting into SLP flow wraps the constant shift values appropriately during RotateLeft/RightNode Idealizations.
if (!is_const_rotate) { | ||
const Type * type_bt = Type::get_const_basic_type(elem_bt); | ||
cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) : cnt; | ||
opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt)); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that similar thing is done before for shift, so down conversion to sub type is not required.
/integrate |
Going to push as commit d994b93.
Your commit was automatically rebased without conflicts. |
@jatin-bhateja Pushed as commit d994b93. |
No comma after "2021" in |
@sviswa7 and @jatin-bhateja jatin-bhateja |
Yes, as discussed before please request that we perform internal tests before integrating e.g. CC me. Unfortunately the pre-commit PR tests don't cover all the tests cases and we don't yet have a way to expand that set. |
@vnkozlov @PaulSandoz Sorry for the inconvenience. @jatin-bhateja Please don't be in a hurry to push and reach out to Oracle engineers for testing before pushing. |
Current VectorAPI Java side implementation expresses rotateLeft and rotateRight operation using following operations:-
This patch moves above handling from Java side to C2 compiler which facilitates dismantling the rotate operation if target ISA does not support a direct rotate instruction.
AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over long and integer type vectors. For other cases (i.e. sub-word type vectors or for targets which do not support direct rotate operations ) instruction sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
Please find below the performance data for included JMH benchmark.
<style> </style>Machine: Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720
$ git checkout pull/3720
Update a local copy of the PR:
$ git checkout pull/3720
$ git pull https://git.openjdk.java.net/jdk pull/3720/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 3720
View PR using the GUI difftool:
$ git pr show -t 3720
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3720.diff