New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8262355: Support for AVX-512 opmask register allocation. #2768
Conversation
/label hotspot-compiler-dev |
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
@jatin-bhateja |
Webrevs
|
Mailing list message from Vladimir Ivanov on hotspot-compiler-dev: Good work, Jatin! I'd like to focus high-level aspects first. There's a significant amount of crux coming from the fact that masks Also, my understanding is AArch64/SVE allows predicate registers to be Second question is about x86 and different mask representations it has: Regarding the patch itself, RegisterSaver support and related changes Best regards, On 28.02.2021 21:40, Jatin Bhateja wrote: |
void setvectmask(Register dst, Register src); | ||
void restorevectmask(); | ||
void setvectmask(Register dst, Register src, KRegister mask = k1); | ||
void restorevectmask(KRegister mask = k1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much sense in default values here. The register should be explicitly specified at call sites.
@@ -146,10 +146,33 @@ void ZBarrierSetAssembler::load_at(MacroAssembler* masm, | |||
__ movdqu(Address(rsp, xmm_size * 1), xmm1); | |||
__ movdqu(Address(rsp, xmm_size * 0), xmm0); | |||
|
|||
const int opmask_size = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in ZBarrierSetAssembler::load_at
doesn't handle vector registers. Why should it care about predicate registers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still curious.
@@ -468,6 +502,15 @@ class ZSaveLiveRegisters { | |||
caller_saved.Insert(OptoReg::as_OptoReg(r9->as_VMReg())); | |||
caller_saved.Insert(OptoReg::as_OptoReg(r10->as_VMReg())); | |||
caller_saved.Insert(OptoReg::as_OptoReg(r11->as_VMReg())); | |||
if (UseAVX > 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question here: ZSaveLiveRegisters
handles GP registers and vector registers differently. Why predicate registers are handled the same way GP registers are and not as vector registers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Vladimir,
ZGC load barriers embeds a CALL instruction which is not exposed to compiler, thus opmask registers which are caller saved as per X86 ABI are saved before making the call and later restored back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a closer look at the relevant code and now I better understand what confuses me here.
Through vector registers are caller saved, they aren't added into caller_saved, but just unconditionally spilled.
Why can't you do the same for KRegisters? caller_saved.Member(opto_reg)
should always be true for KRegister, shouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still curious.
Hi, @XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: openjdk/panama-vector#40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (openjdk/panama-vector@e658f4d). If it helps, we can also propose a patch here in openjdk/jdk.
I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: openjdk/panama-vector@3f69d40
Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register? Thanks, |
Yes, as @nsjian mentioned above, we added a new mask type mapped to a predicate register. Besides, to make a difference with the old vector IRs that uses vector registers for mask on other platforms, we also added a new abstract IR ( |
src/hotspot/share/opto/chaitin.cpp
Outdated
@@ -894,12 +895,15 @@ void PhaseChaitin::gather_lrg_masks( bool after_aggressive ) { | |||
break; | |||
case Op_RegL: // Check for long or double | |||
case Op_RegD: | |||
#if defined(IA32) || defined(AMD64) | |||
case Op_RegVMask: | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this #ifdef is not necessary here, and should not put it together with Op_RegL here. For other arch, at least for Arm SVE, the num_regs for predicate reg is not 2. In openjdk/panama-vector@e658f4d, we define a new enum SlotsPerRegVMask. Do you think that works for x86 as well?
src/hotspot/share/opto/node.hpp
Outdated
@@ -1022,6 +1022,7 @@ class Node { | |||
// Check if 'this' node dominates or equal to 'sub'. | |||
bool dominates(Node* sub, Node_List &nlist); | |||
|
|||
virtual const void* meta_data() const { return NULL; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being used?
src/hotspot/share/opto/regmask.cpp
Outdated
@@ -67,6 +67,9 @@ int RegMask::num_registers(uint ireg) { | |||
return SlotsPerVecX; | |||
case Op_VecD: | |||
return SlotsPerVecD; | |||
#if defined(AMD64) || defined(IA32) | |||
case Op_RegVMask: | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, it would be better we have less #ifdef in the shared code. So SlotsPerVMask? :-)
Current register allocation framework can perform allocation at the granularity of 32 bit. Thus in order to allocate a 64 bit register we reserve 2 bits from the register mask of its corresponding live range. Spilling code (MachSpillCopyNode::implementation) also is sensitive to this since a 32 bit def in 64 bit mode spill only 32 bit value. Opmask register is special in a way such that usable register portion could be 8,16,32 or 64 bit wide depending on the lane type and vector size. Thus in an optimal implementation both allocator and spill code may allocate and spill only the usable portion of the opmask register. This may not be possible in current allocation frame work. Keeping this added complexity out of implementation, existing patch performs both allocation and spilling at 64 bit granularity. This is why a LONG type is sufficient to represent an Opmask register for X86. I agree that ARM SVE may have to create a new mask Type since performing the spill and allocation at widest possible mask will be costly. Thus Matcher::perdicate_reg_type() can be used to return the LONG type for X86 and new mask type for ARM SVE. This will prevent any modification in target independent IR. Also for X86 a mask generating node may have different Ideal type and register for non-AVX512 targets. Please let me know if there is any disconnect in my understanding here.
For AVX-512 targets any mask generating node will have LONG type and a vector type for non-AVX512 case. Please elaborate what adverse implication do you see with this approach.
|
Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:
Yes, it may be attractive in the future, but I don't see it as something
That's fine with me.
Matcher::perdicate_reg_type() is not enough to hide the Considering there's active work going on on SVE support, I'm in favor of
What I'd like to avoid is the situation when different Ideal IR shapes Customizing node types fits that goal well, but I don't fully understand (1) having a node which can be of type TypeVect or TypeLong may be (2) as of now, there's nothing which forbids pre-AVX512 and AVX512 Best regards, |
Mailing list message from John Rose on hotspot-compiler-dev: On Mar 3, 2021, at 4:22 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
I agree. Would it help to have TypeMask as a #define of TypeLong, |
Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:
Unfortunately, I don't see how a macro could help here. It would still I find TypeVMask which is proposed as part of SVE support (Ningsheng https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1 Also, a comment on process: Ningsheng and Xiaohong started with RFC on Best regards, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks much better now.
@@ -257,9 +257,9 @@ class ConcreteRegisterImpl : public AbstractRegisterImpl { | |||
// it's optoregs. | |||
|
|||
number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register + | |||
2 * FloatRegisterImpl::number_of_registers + | |||
2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate, please, why 8 more slots are needed on x86_32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to cover for additional FILL0-7 definitions present only for 32 bit JVM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, still not clear to me why it matters only now.
Is it because predicate registers are put at the very end and the wrong sizing leaves them out of the RegMask bit map on x86-32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes , its needed for correct computation RM_SIZE (number of of words to hold a RegMask), till now we were not accounting FILL register count while computing total number of registers, but it was not hurting since number of words is the ceiling multiple of word size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment to this line explaining where 8 and 0 numbers coming from.
@@ -105,6 +107,10 @@ class RegisterSaver { | |||
ymm_off = xmm_off + (XSAVE_AREA_YMM_BEGIN - XSAVE_AREA_BEGIN)/BytesPerInt, | |||
DEF_YMM_OFFS(0), | |||
DEF_YMM_OFFS(1), | |||
// 2..7 are implied in range usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is wrongly placed: it reads as if 2..7
were related to DEF_YMM_OFFS
while it refers to DEF_OPMASK_OFFS
.
Also, what about moving the declaration after DEF_ZMM_UPPER_OFFS
? It's a bit weird to see OPMASK
between YMM
and ZMM
declarations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it consistent with XSAVE stack layout dumping opmask at 1088 offset, anyways space was reserved in the frame, just that we weren't using it.
@@ -72,6 +72,7 @@ void VMRegImpl::set_regName() { | |||
#define X87_TYPE 2 | |||
#define STACK_TYPE 3 | |||
|
|||
//TODO: Case for KRegisters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, elaborate what is missing and how the missing code manifests at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would seek your suggestion here. Can it wait for base patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it's used exclusively by Foreign ABI implementation which doesn't support k-registers on x86 now.
I'm fine with leaving it as is for now.
src/hotspot/cpu/x86/x86_64.ad
Outdated
} else if (dst_first_rc == rc_kreg) { | ||
if ((src_first & 1) == 0 && src_first + 1 == src_second && | ||
(dst_first & 1) == 0 && dst_first + 1 == dst_second) { | ||
// 64-bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// 64-bit
identation is wrong.
src/hotspot/cpu/x86/x86_64.ad
Outdated
} | ||
} | ||
return 0; | ||
} else if(dst_first_rc == rc_float) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a space between if and the bracket.
src/hotspot/share/opto/regmask.hpp
Outdated
@@ -105,6 +105,7 @@ class RegMask { | |||
SlotsPerVecX = 4, | |||
SlotsPerVecY = 8, | |||
SlotsPerVecZ = 16, | |||
SlotsPerRegVmask = X86_ONLY(2) NOT_X86(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be named SlotsPerRegVMask
instead?
src/hotspot/share/opto/type.cpp
Outdated
@@ -658,6 +661,11 @@ void Type::Initialize_shared(Compile* current) { | |||
// get_zero_type() should not happen for T_CONFLICT | |||
_zero_type[T_CONFLICT]= NULL; | |||
|
|||
#if defined(AMD64) || defined(IA32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use X86
instead.
src/hotspot/share/opto/matcher.cpp
Outdated
if (Matcher::has_predicated_vectors()) { | ||
idealreg2regmask[Op_RegVMask] = regmask_for_ideal_register(Op_RegVMask, ret); | ||
} else { | ||
idealreg2regmask[Op_RegVMask] = idealreg2regmask[Op_RegI]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How idealreg2regmask[Op_RegVMask]
is used when Matcher::has_predicated_vectors() == false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How
idealreg2regmask[Op_RegVMask]
is used whenMatcher::has_predicated_vectors() == false
?
No new register pressure threshold has been introduced for predicate registers, hence just a default initialization for targets which do not support them, since its getting used in raise/lower_pressure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply unconditionally initialize idealreg2regmask[Op_RegVMask]
to regmask_for_ideal_register(Op_RegVMask, ret)
then? rm
should never overlap with regmask_for_ideal_register(Op_RegVMask, ret)
in raise
/lower_pressure
when Matcher::has_predicated_vectors() == false
.
PS: you could also put Matcher::has_predicated_vectors()
in raise
/lower_pressure
for r.overlap(*Matcher::idealreg2regmask[Op_RegVMask])
check. Or even factor the whole condition out into a helper method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Jatin,
Thanks for the work. Unfortunately, it seems that there are some conflicts (e.g. type definition) with our work at openjdk/panama-vector#40. Could you please help to take a look at openjdk/panama-vector#40, so that we could improve that as well and reduce the potential merge effort?
Thanks,
Ningsheng
src/hotspot/share/opto/type.cpp
Outdated
{ Bad, T_ILLEGAL, "vectora:", false, Op_VecA, relocInfo::none }, // VectorA. | ||
{ Bad, T_ILLEGAL, "vectors:", false, 0, relocInfo::none }, // VectorS | ||
{ Bad, T_ILLEGAL, "vectord:", false, Op_RegL, relocInfo::none }, // VectorD | ||
{ Bad, T_ILLEGAL, "vectorx:", false, Op_VecX, relocInfo::none }, // VectorX | ||
{ Bad, T_ILLEGAL, "vectory:", false, 0, relocInfo::none }, // VectorY | ||
{ Bad, T_ILLEGAL, "vectorz:", false, 0, relocInfo::none }, // VectorZ | ||
#elif defined(S390) | ||
{ Bad, T_ILLEGAL, "vectorm:", false, Op_RegVMask, relocInfo::none }, // VectorM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name "vectorm" here looks like a kind of vector, is it? Also, since this line is the same in all the #define-else blocks, perhaps you can move it out of the #define block?
src/hotspot/share/opto/type.hpp
Outdated
@@ -299,6 +302,8 @@ class Type { | |||
const TypeAry *isa_ary() const; // Returns NULL of not ary | |||
const TypeVect *is_vect() const; // Vector | |||
const TypeVect *isa_vect() const; // Returns NULL if not a Vector | |||
const TypeVectMask *is_vectmask() const; // Predicate/Mask Vector | |||
const TypeVectMask *isa_vectmask() const; // Returns NULL if not a Predicate/Mask Vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"if not a Predicate/Mask Vector" --> "if not a Vector Predicate/Mask"?
src/hotspot/share/opto/type.hpp
Outdated
class TypeVectMask : public TypeVect { | ||
public: | ||
friend class TypeVect; | ||
TypeVectMask(const Type* elem, uint length) : TypeVect(VectorM, elem, length) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why TypeVectMask is a subclass of TypeVect? Do existing helper functions in TypeVect also work for TypeVectMask, e.g. xmeet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For targets which do not support predicated registers, masks are propagated though vectors. Thus inheriting it seemed logical extension for mask type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think there should not be any mask type for those non-mask hardware targets. If in future we would like to have some type based optimization, e.g. all-true/all-false, that may be confusing with vector type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think there should not be any mask type for those non-mask hardware targets.
It seems what you are calling "non-mask hardware targets" are ISAs without dedicated predicate registers. Still, it doesn't mean such ISA doesn't support masked operations (e.g., AVX/AVX2 on x86). It will be unfortunate if we have to distinguish between different mask representations in Ideal IR. TypeVectMask
describes an "ideal" vector mask and IMO it is natural to represent it as a vector.
If in future we would like to have some type based optimization, e.g. all-true/all-false, that may be confusing with vector type.
The missing piece to make it work is vector constant support. Once it's there, it'll be possible to inspect elements of a vector constant. It is not limited to masks, but nicely generalizes to all vector shapes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Vladimir, Ningsheng,
In the recent version of the patch I have demonstrated the special handling for opmask registers during de-optimization.
This code may not be relevant for base version of patch though, since only mask generating node defining an opmask register
currently is VectorMaskGenNode.
In PhaseVector::scalarize_vbox_node we stitch the vector payload to SafePoint node by creating
a SafepointScalarObjectNode, passing the vector payload of masked class though a VectorStoreMaskNode before creating a
SafePointScalarObjectNode will make value re-materialization process during de-optimization agnostic to existence
of predicate vectors. Wanted to know your thoughts on it?
This may further help in removing the logic to add mask vectors to callee saved list of RegisterMap
(accessed during de-opt for object reallocation) since payload will always be present in a byte vector.
It may come at an extra cost though, since register pressure over vectors may increase.
Do you suggest removing re-construction logic and addition of opmask vectors to callee save list from base patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest removing re-construction logic and addition of opmask vectors to callee save list from base patch?
Yes, please. That was my first thought when I saw the latest patch.
Let's review/discuss it separately once the base patch lands in the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to make this patch simpler and have separate patches for further work.
…ing nodes during de-opt, it will be added separately.
src/hotspot/share/opto/type.cpp
Outdated
@@ -2406,6 +2412,14 @@ const TypeVect* TypeVect::make(const Type *elem, uint length) { | |||
return NULL; | |||
} | |||
|
|||
const TypeVect *TypeVect::makemask(const BasicType elem_bt, uint length) { | |||
if (Matcher::has_predicated_vectors()) { | |||
return (TypeVect*)(new TypeVectMask(get_const_basic_type(T_BOOLEAN), MaxVectorSize))->hashcons(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the TypeVectMask
only saves the T_BOOLEAN
as the basic type for all the mask type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct for targets which do not support predicate registers flow differs and a TypeVect matches with type of vector register.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Actually that's correct for AVX-512. However, unlike AVX-512 K registers, SVE predicate needs the real basic type as a reference when controlling the vector operations. I'm worried that this is not friendly for SVE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XiaohongGong, can you elaborate why SVE predicate needs the basic type of vector element it is applied to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason is that the lane size of SVE predicate depends on the vector lane type, i.e 1-mask-bit per one byte
while 2-mask-bit per one short
and so on. So most of the predicate operations depend on the vector element basic type. We uses the CNTP
as an example which computes the active numbers in a predicate register: CNTP <Xd>, <Pg>, <Pn>.B
is different from CNTP <Xd>, <Pg>, <Pn>.H
. Another usage of the real basic type is the API VectorMask.toVector()
. When creating a vector from a mask, we need the real element type (T_BOOLEAN is not enough).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so it's a mask bit per vector byte then. It looks similar to how AVX/AVX2 represents masks (w/o predicate registers, but reusing wide vector registers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a mask bit per byte.
src/hotspot/share/opto/matcher.cpp
Outdated
@@ -531,6 +538,9 @@ void Matcher::init_first_stack_mask() { | |||
*idealreg2spillmask[Op_RegD] = *idealreg2regmask[Op_RegD]; | |||
idealreg2spillmask[Op_RegD]->OR(aligned_stack_mask); | |||
|
|||
*idealreg2spillmask[Op_RegVMask] = *idealreg2regmask[Op_RegVMask]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can still get a SEGV on arm (32bit) build. I think it's here?
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xfee7ed8e, pid=18153, tid=18162
#
# JRE version: OpenJDK Runtime Environment (17.0) (fastdebug build 17-internal+0-git-85a6b6e69)
# Java VM: OpenJDK Server VM (fastdebug 17-internal+0-git-85a6b6e69, mixed mode, serial gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0xaeed8e] Matcher::init_first_stack_mask()+0x89d
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to //core.18153)
#
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidentally removed earlier Matcher::has_predicated_vectors() checks, thanks for reporting before check-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! (Modulo the crash Ningsheng observed.)
Thanks a lot for addressing all my comments/requests!
@@ -409,7 +409,9 @@ uint PhaseChaitin::count_int_pressure(IndexSet* liveout) { | |||
LRG& lrg = lrgs(lidx); | |||
if (lrg.mask_is_nonempty_and_up() && | |||
!lrg.is_float_or_vector() && | |||
lrg.mask().overlap(*Matcher::idealreg2regmask[Op_RegI])) { | |||
(lrg.mask().overlap(*Matcher::idealreg2regmask[Op_RegI]) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleanup suggestion: the check grows in size and it is repeated 3 times. Please, extract it into a helper method.
@jatin-bhateja This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 15 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Hi @iwanowww , @reviewers, we have got one reviewer clearance, please suggest if it's ok to integrate it or wait for another review concent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a (R)eviewer, but looks good to me. (Those files you've modified still need to update copyright year?)
@@ -845,6 +853,12 @@ class TypeVectZ : public TypeVect { | |||
TypeVectZ(const Type* elem, uint length) : TypeVect(VectorZ, elem, length) {} | |||
}; | |||
|
|||
class TypeVectMask : public TypeVect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, recently we found an issue that some optimizations cannot work well for the floating point VectorMask due to the different element type for different APIs (see: #3238). It might need more consideration for the hash()/eq()
method TypeVectMask
especially for SVE. I guess the current implementation works well for AVX-512? If so, maybe we can have a separate patch to fix it in future once the main mask support codes are merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, bitwise operation involving VectorMask nodes (i.e. the nodes with type TypeVectMask) works over predicate registers. Since for x86 opmask register representation is consistent for all primitive types, it may not be a problem. For SVE a different instruction pattern for masks of floating types may suffice. Yes, we can do refinements in subsequent patches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks! Another consideration is: since mask type representations are different for platforms that do not have mask features (using TypeVect), and SVE/AVX-512 (using TypeVectMask). And this special handling is needed for both platforms/types. Is it possible to use one type (i.e. TypeVectMask) for all mask types, with different idea_reg()
for different platforms (like what you did in your original patch) ? Or did you have considered this solution before? Same with above, we can also refine this in future if it's possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @XiaohongGong , this base patch is purely to extend allocation support without modifying existing vector API mask nodes behavior. We can do that discussion on panama-dev to reach to a final solution which covers both SVE and AVX512 masking operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's great to me. Thanks!
Considering the size of the patch, I'd prefer to see 2 (R)eviews. Meanwhile, I'll submit it for testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few notes.
- In
sharedRuntime_*.cpp
you have nameopmask_state_*
and.ad
files alsoopmask
andopmask_reg_k*'. But in C2 code you call them
vectmaskwhich is confusing. I prefer to use the same name everywhere -
vectmask`:
const RegMask* Matcher::predicate_reg_mask(void) {
return &_OPMASK_REG_mask;
}
const TypeVect* Matcher::predicate_reg_type(const Type* elemTy, int length) {
return new TypeVectMask(TypeInt::BOOL, length);
}
- I assume that they are used only for vectors in compiled code and in other cases the don't need to be saved/restored at safepoints.
- In C2 shared code names are mess too:
VectorM, VMASK, RegVMask, TypeVectMask, VectorMaskCmp
. I suggest to useVectorMask, VECTMASK, RegVectMask
.
@@ -218,7 +218,7 @@ class KRegisterImpl : public AbstractRegisterImpl { | |||
public: | |||
enum { | |||
number_of_registers = 8, | |||
max_slots_per_register = 1 | |||
max_slots_per_register = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment. So the max size is 64 bytes?
@@ -257,9 +257,9 @@ class ConcreteRegisterImpl : public AbstractRegisterImpl { | |||
// it's optoregs. | |||
|
|||
number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register + | |||
2 * FloatRegisterImpl::number_of_registers + | |||
2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment to this line explaining where 8 and 0 numbers coming from.
AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
Please review and share your feedback.
Summary of changes:
AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
Register class initializations related to Op_RegVMask during matcher startup.
Handling for mask generating node: Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands for such MachNodes. This will constrain the allocation set for these operands to opmask registers(K1-K7).
Creation of a new concrete type TypeVectMask for mask generation nodes and a convivence routine Type::makemask which creates a regular vector types (TypeVect[SDXYZ]) for non-AVX-512 targets and TypeVectMask for a AVX-512 targets.
[1] : Section 15.1.3 : https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
[2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768
$ git checkout pull/2768
Update a local copy of the PR:
$ git checkout pull/2768
$ git pull https://git.openjdk.java.net/jdk pull/2768/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 2768
View PR using the GUI difftool:
$ git pr show -t 2768
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/2768.diff