8256973: Intrinsic creation for VectorMask query (lastTrue,firstTrue,trueCount) APIs #3916
Conversation
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.

/label hotspot-compiler-dev

@jatin-bhateja

Webrevs
These mask operations can be considered a form of reduction. Do you think it makes sense to reuse VectorSupport.reductionCoerced instead of adding a new intrinsic? (Note that we reuse VectorSupport.binaryOp for mask logical binary operations.)
Perhaps that allows for further reuse later if/when we add operations to integral vectors to count bits like we already have with scalars, such as Integer.bitCount, Integer.numberOfLeadingZeros, etc.?
@@ -173,6 +143,31 @@ static boolean allTrueHelper(boolean[] bits) {
         return true;
     }

     /*package-private*/
     static int trueCountHelper(boolean[] bits) {
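The diff above truncates at the method signature. A plausible shape for this scalar fallback, sketched here for illustration (the class wrapper and test values are mine, not from the patch), is a simple counting loop over the mask's backing boolean array:

```java
public class TrueCountSketch {
    // Sketch of the fallback: count the set lanes in the mask's boolean array.
    // (/*package-private*/ in the real sources; public here so it is runnable standalone.)
    public static int trueCountHelper(boolean[] bits) {
        int count = 0;
        for (boolean bit : bits) {
            if (bit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(trueCountHelper(new boolean[] {true, false, true, true})); // 3
    }
}
```

This is the path taken only when the intrinsic is not applied; C2 replaces calls like this with the vectorized sequence on supported targets.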
Naming-wise I think you can drop Helper from such methods.
This is indeed a Helper routine called from the lambda expression.
We don't use that naming pattern in other places for the fallback Java code, though; it's just the scalar implementation.
Hi @PaulSandoz, that's a nice suggestion. Instead of a reduction, which may emit a bulky sequence, VectorMask.toLong() + Long.bitCount() could have been used for trueCount. But since toLong may not work for ARM SVE, intrinsifying at the level of the API looked reasonable in the meantime.
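The toLong() + bitCount() route mentioned above can be sketched with plain scalar bit operations. The encoding below (lane i maps to bit i of the long, matching what VectorMask.toLong() documents) and the sample mask value are illustrative assumptions, not code from the patch:

```java
public class MaskBitOpsSketch {
    public static void main(String[] args) {
        // Hypothetical toLong() result for an 8-lane mask: lanes 2, 4, 5, 7 set.
        long maskBits = 0b1011_0100L;

        int trueCount = Long.bitCount(maskBits);                  // number of set lanes
        int firstTrue = Long.numberOfTrailingZeros(maskBits);     // lowest set lane
        int lastTrue  = 63 - Long.numberOfLeadingZeros(maskBits); // highest set lane

        System.out.println(trueCount + " " + firstTrue + " " + lastTrue); // 4 2 7
    }
}
```

This is why the x86 sequences in the patch move the mask into a GP/opmask register first: once the mask is a bitset, the queries reduce to single instructions (POPCNT, TZCNT, LZCNT).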
Do you mean that reusing … Note that i was not suggesting to reuse … For example: … And notice that … So rather than introducing specific constants, such as …
Hi @PaulSandoz, semantically reductionCoerced could be used as an entry point for trueCount (VECTOR_OP_BITCOUNT), since we are iterating over each lane element (boolean type in this case) and returning the final count of set bits. But lastTrue and firstTrue are more like iterative operations, along the lines of Vector.lane and Vector.withLane, for which we have explicit entry points. Also, VectorSupport.reductionCoerced constrains the type parameter V to be bounded by Vector, and VectorMask is not in the Vector class hierarchy. We could relax that constraint, but we may also need to bypass some portions of LibraryCallKit::inline_vector_reduction for the mask query APIs. Given all this, does it sound reasonable to add one separate entry point (maskOp) for all the mask query APIs? Looking forward to your feedback.
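The iterative character of firstTrue/lastTrue mentioned above can be seen in their scalar fallbacks. The sketch below is my illustration (class wrapper and values are mine); the edge-case return values follow the VectorMask javadoc, where firstTrue() yields VLENGTH and lastTrue() yields -1 when no lane is set:

```java
public class MaskIterationSketch {
    // Lowest set lane; VLENGTH (bits.length) when the mask is all-false.
    public static int firstTrueHelper(boolean[] bits) {
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) return i;
        }
        return bits.length;
    }

    // Highest set lane; -1 when the mask is all-false.
    public static int lastTrueHelper(boolean[] bits) {
        for (int i = bits.length - 1; i >= 0; i--) {
            if (bits[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] m = {false, true, false, true};
        System.out.println(firstTrueHelper(m) + " " + lastTrueHelper(m)); // 1 3
    }
}
```

Unlike trueCount, these searches terminate early, which is why they do not map naturally onto the accumulate-over-all-lanes shape of reductionCoerced.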
Java code looks good. Perhaps when we add bit-counting operations to vectors we might find a way to consolidate. I don't want to block progress based on something we might do in the future.
@Benchmark
public void testTrueCountByte(Blackhole bh) {
    bh.consume(bmask.trueCount());
}
No need to use a black hole. A returned value will be consumed by a black hole by the framework.
Suggested change:
-@Benchmark
-public void testTrueCountByte(Blackhole bh) {
-    bh.consume(bmask.trueCount());
-}
+@Benchmark
+public int testTrueCountByte() {
+    return bmask.trueCount();
+}
@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 231 new commits pushed to the target branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message, use the /integrate command.
Hi @PaulSandoz, thanks; your comments on JMH have been addressed. @neliasso @iwanowww kindly share your feedback/comments on the compiler-side changes.
@IntrinsicCandidate
public static
<M>
int maskOp(int oper, Class<?> maskClass, Class<?> elemClass, int length, M m,
I second Paul here: the maskOp case is already covered by reductionCoerced.
As discussed above, mixing it with reductionCoerced would require changes to the original entry point (the type parameter is bounded by Vector), and we may also need to bypass some irrelevant portions of inline_vector_reduction(). For the time being, to keep things clean, I added a different entry point for all the masked operations.
Ok, fair enough. We can revisit that later and merge them if needed.

Some suggestions to consider to align it with reductionCoerced:
- reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);
- return type can be generalized to long;
- bound on M: <M extends VectorMask>;
- no need to introduce a special interface, Function<T,R> just works: VectorMaskOp<M> -> Function<M, Long>.
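Taken together, the suggestions above amount to a signature roughly like the following. This is a hedged illustration, not the actual VectorSupport code: the Mask stand-in interface replaces jdk.incubator.vector.VectorMask so the sketch is self-contained, and the body shown is only the Java fallback path (the intrinsic would replace it in compiled code):

```java
import java.util.function.Function;

public class MaskOpSketch {
    // Stand-in for VectorMask, for illustration only.
    public interface Mask { boolean[] toArray(); }

    // Shape suggested in the review: bounded M, long return type,
    // and a plain Function instead of a dedicated VectorMaskOp interface.
    public static <M extends Mask> long maskReductionCoerced(
            int oper, Class<?> maskClass, Class<?> elemClass, int length,
            M m, Function<M, Long> defaultImpl) {
        return defaultImpl.apply(m); // fallback path; intrinsified by C2
    }

    public static void main(String[] args) {
        Mask m = () -> new boolean[] {true, false, true};
        long trueCount = maskReductionCoerced(0, Mask.class, byte.class, 3, m,
                mm -> {
                    long n = 0;
                    for (boolean b : mm.toArray()) if (b) n++;
                    return n;
                });
        System.out.println(trueCount); // 2
    }
}
```

The long return type keeps room for future 64-bit mask results; callers of int-valued APIs such as trueCount() would simply narrow the result.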
> Ok, fair enough. We can revisit that later and merge them if needed.
> Some suggestions to consider to align it with reductionCoerced:
> - reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);
> - return type can be generalized to long;

Hi @iwanowww, can you kindly elaborate why the return type should be long here? We will need to downcast it to int again, since these APIs return an integer value.

> - bound on M: <M extends VectorMask>;
> - no need to introduce a special interface, Function<T,R> just works: VectorMaskOp<M> -> Function<M, Long>.
@@ -9217,6 +9217,14 @@ void Assembler::shrxq(Register dst, Register src1, Register src2) {
   emit_int16((unsigned char)0xF7, (0xC0 | encode));
 }

 void Assembler::evpmovb2m(KRegister dst, XMMRegister src, int vector_len) {
   assert(VM_Version::supports_avx512bw(), "");
Should it be VM_Version::supports_avx512vlbw()? VPMOVB2M requires AVX512VL for the 128-/256-bit cases.
src/hotspot/cpu/x86/x86.ad
Outdated
@@ -8054,4 +8061,42 @@ instruct vmasked_store64(memory mem, vec src, kReg mask) %{
 %}
   ins_pipe( pipe_slow );
 %}

 instruct vmask_true_count_evex(rRegI dst, vec mask, rRegL tmp, kReg ktmp, vec xtmp) %{
   predicate(VM_Version::supports_avx512bw());
Same here: VM_Version::supports_avx512vlbw()?
src/hotspot/cpu/x86/x86.ad
Outdated
int vector_len = vector_length_encoding(mask_node);
int opcode = this->ideal_Opcode();
int mask_len = mask_node->bottom_type()->is_vect()->length();
__ vector_mask_oper(opcode, $dst$$Register, $mask$$XMMRegister, $xtmp$$XMMRegister,
oper looks misleading to me here: it usually means operand in Mach-related code. Either vector_mask_operation() or vector_mask_op() is a better alternative IMO.
src/hotspot/cpu/x86/x86.ad
Outdated
%}

instruct vmask_true_count_avx(rRegI dst, vec mask, rRegL tmp, vec xtmp, vec xtmp1) %{
  predicate(!VM_Version::supports_avx512bw());
VM_Version::supports_avx512vlbw()
Handled in match_rule_supported_vector.
I think you still need to adjust the predicate to be able to correctly split between AVX512BW+VL and AVX512F/AVX/AVX2 configurations.
@@ -3721,3 +3721,99 @@ void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1, Register
     vpxor(vec2, vec2);
   }
 }

 #ifdef _LP64
 void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
What about stressing that it requires the AVX512BW & VL extensions? For example, by putting in an assert and adding an _evex suffix to the name.
#ifdef _LP64
void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
                                         Register tmp, KRegister ktmp, int masklen, int vlen) {
s/vlen/vlen_enc/
}

void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
                                         XMMRegister xtmp1, Register tmp, int masklen, int vlen) {
s/vlen/vlen_enc/
@@ -1294,6 +1294,37 @@ Node* ShiftVNode::Identity(PhaseGVN* phase) {
   return this;
 }

 Node* VectorMaskOpNode::Ideal(PhaseGVN* phase, bool can_reshape) {
It doesn't make much sense to me. Why don't you simply require the input to be in canonical shape from the very beginning by unconditionally wrapping it into VectorStoreMask during construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -853,6 +853,47 @@ class VectorMaskGenNode : public TypeNode {
   const Type* _elemType;
 };

 class VectorMaskOpNode : public TypeNode {
  public:
   VectorMaskOpNode(Node* mask, const Type* ty, const Type* ety, int mopc):
ty/ety caught my eye. It doesn't match anything in vectornode.hpp and may confuse readers. Any reason not to use vt? Also, any particular reason to cache the full-blown type instead of capturing just the BasicType?
In this case all the mask operations produce an integer value, thus I did not use vt. I have removed ety since it does not have any direct use currently.
src/hotspot/cpu/x86/x86.ad
Outdated
@@ -1650,6 +1657,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType
     case Op_RotateRightV:
     case Op_RotateLeftV:
     case Op_MacroLogicV:
     case Op_VectorMaskLastTrue:
But don't you support 128-/256-bit cases w/ AVX/AVX2 instructions?
This check effectively requires AVX512 as a baseline.
> Ok, fair enough. We can revisit that later and merge them if needed.
> Some suggestions to consider to align it with reductionCoerced:
> - reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);
> - return type can be generalized to long;
> - bound on M: <M extends VectorMask>;
> - no need to introduce a special interface, Function<T,R> just works: VectorMaskOp<M> -> Function<M, Long>.
Hi @iwanowww, your comments have been addressed.

There are two patterns now: one which requires AVX512VLBW (to handle mask lengths from 2-64) and a non-AVX512VLBW one (to handle mask lengths 2-32). Byte512Vector mandates the presence of AVX512BW, as enforced by Matcher::match_rule_supported_vector(), thus I removed the special code sequence for 512-bit vectors in the absence of the AVX512BW feature.
Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

FTR downcasts are fine here. In the context of JVM intrinsics the main question is what carrier type to use. If you don't envision any future operations on masks returning 64-bit values, int is fine. Otherwise, it's better to start with long, because when such an operation is introduced, the return type (and all use sites) would have to change.

Best regards,
> Byte512Vector mandates the presence of AVX512BW as enforced by Matcher::match_rule_supported_vector(), thus removed the special code sequence for 512 bit vector in absence of AVX512BW feature.

Please elaborate why Byte512Vector matters here.

Intrinsics are fed with the corresponding vector element type, so unconditionally rejecting the AVX512F case (w/ BW & VL absent) means that on Xeon Phis VectorMask.lastTrue/firstTrue/trueCount on 512-bit masks are useless (irrespective of element type), while some 512-bit vector shapes are supported. Is that intended?
ciType* elem_type = elem_klass->const_oop()->as_instance()->java_mirror_type();
BasicType elem_bt = elem_type->basic_type();

if (num_elem <= 2) {
You mentioned that masks of length 2 are supported, but it's rejected here.
This is being enforced by Matcher::match_rule_supported_vector(): a 512-bit vector of sub-word type is supported only if the target supports AVX512BW.
Ah, now I get it! Thanks for the clarifications.
Leaving the check on mask length aside (num_elem <= 2 in LibraryCallKit::inline_vector_mask_operation), the patch looks good. A couple of minor suggestions follow.
src/hotspot/cpu/x86/x86.ad
Outdated
ins_encode %{
  const MachNode* mask_node = static_cast<const MachNode*>(this->in(this->operand_index($mask)));
  assert(mask_node->bottom_type()->isa_vect(), "");
  int vector_len = vector_length_encoding(mask_node);
I think you can just use int vlen_enc = vector_length_encoding(this, $mask); here.
src/hotspot/cpu/x86/x86.ad
Outdated
int opcode = this->ideal_Opcode();
int mask_len = mask_node->bottom_type()->is_vect()->length();
__ vector_mask_operation(opcode, $dst$$Register, $mask$$XMMRegister, $xtmp$$XMMRegister,
                         $tmp$$Register, $ktmp$$KRegister, mask_len, vector_len);
On naming: vector_len and mask_len are misleadingly similar. While the latter represents the number of elements, the former is the x86-specific encoding of the vector length. It makes sense to stress the difference with a different name; that's why I propose vlen_enc. Unfortunately, it's not yet used uniformly across x86.ad, but at least some code has already migrated.
/integrate
@jatin-bhateja Since your change was applied there have been 232 commits pushed to the target branch.

Your commit was automatically rebased without conflicts. Pushed as commit 7aa6568.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Jatin, the final commit erroneously contains …
PR: #4107
Thanks, Jie. Reviewed.
This patch intrinsifies the following mask query APIs using an optimal instruction sequence for the X86 target.

The current implementations of the above APIs iterate over the underlying boolean array encapsulated in a mask instance to ascertain the count/position index of true bits.

X86 AVX2 and AVX512 targets offer direct instructions to move the mask held in a byte vector into a GP or an opmask register, thereby accelerating further querying.

Intrinsification is not performed for vector species containing fewer than two vector lanes.

Please find below the performance numbers for the benchmark included in the patch:

Machine: Cascade Lake server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz, 28C)

ALGO (1=bestcase, 2=worstcase, 3=avgcase)
Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3916/head:pull/3916
$ git checkout pull/3916
Update a local copy of the PR:
$ git checkout pull/3916
$ git pull https://git.openjdk.java.net/jdk pull/3916/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 3916
View PR using the GUI difftool:
$ git pr show -t 3916
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3916.diff