Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8262355: Support for AVX-512 opmask register allocation. #2768

Closed
wants to merge 23 commits into from

Conversation

jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Feb 28, 2021

AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).

This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.

We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.

Please review and share your feedback.

Summary of changes:

  1. AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.

  2. Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.

  3. Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.

  4. Register class initializations related to Op_RegVMask during matcher startup.

  5. Handling for mask generating node: Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands for such MachNodes. This will constrain the allocation set for these operands to opmask registers(K1-K7).

  6. Creation of a new concrete type TypeVectMask for mask generation nodes and a convivence routine Type::makemask which creates a regular vector types (TypeVect[SDXYZ]) for non-AVX-512 targets and TypeVectMask for a AVX-512 targets.

[1] : Section 15.1.3 : https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
[2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8262355: Support for AVX-512 opmask register allocation.

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768
$ git checkout pull/2768

Update a local copy of the PR:
$ git checkout pull/2768
$ git pull https://git.openjdk.java.net/jdk pull/2768/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 2768

View PR using the GUI difftool:
$ git pr show -t 2768

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/2768.diff

@jatin-bhateja
Copy link
Member Author

/label hotspot-compiler-dev

@bridgekeeper
Copy link

bridgekeeper bot commented Feb 28, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler hotspot-compiler-dev@openjdk.org labels Feb 28, 2021
@openjdk
Copy link

openjdk bot commented Feb 28, 2021

@jatin-bhateja
The hotspot-compiler label was successfully added.

@mlbridge
Copy link

mlbridge bot commented Feb 28, 2021

@mlbridge
Copy link

mlbridge bot commented Mar 2, 2021

Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

Good work, Jatin!

I'd like to focus high-level aspects first.

There's a significant amount of crux coming from the fact that masks
don't have their own type. Reusing TypeLong::LONG for k-registers may
look appealing at first, but then all the places where RegVMask matters
have to handle the types specially. Why not introduce a dedicated type
for masks?

Also, my understanding is AArch64/SVE allows predicate registers to be
larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
representation will be needed.

Second question is about x86 and different mask representations it has:
AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
vector registers. Are we fine with leaking this difference into Ideal IR
and specifying what shape a mask value has during Ideal construction?

Regarding the patch itself, RegisterSaver support and related changes
can be integrated separately.

Best regards,
Vladimir Ivanov

On 28.02.2021 21:40, Jatin Bhateja wrote:

void setvectmask(Register dst, Register src);
void restorevectmask();
void setvectmask(Register dst, Register src, KRegister mask = k1);
void restorevectmask(KRegister mask = k1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much sense in default values here. The register should be explicitly specified at call sites.

@@ -146,10 +146,33 @@ void ZBarrierSetAssembler::load_at(MacroAssembler* masm,
__ movdqu(Address(rsp, xmm_size * 1), xmm1);
__ movdqu(Address(rsp, xmm_size * 0), xmm0);

const int opmask_size = 8;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in ZBarrierSetAssembler::load_at doesn't handle vector registers. Why should it care about predicate registers?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious.

@@ -468,6 +502,15 @@ class ZSaveLiveRegisters {
caller_saved.Insert(OptoReg::as_OptoReg(r9->as_VMReg()));
caller_saved.Insert(OptoReg::as_OptoReg(r10->as_VMReg()));
caller_saved.Insert(OptoReg::as_OptoReg(r11->as_VMReg()));
if (UseAVX > 2) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here: ZSaveLiveRegisters handles GP registers and vector registers differently. Why predicate registers are handled the same way GP registers are and not as vector registers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Vladimir,
ZGC load barriers embeds a CALL instruction which is not exposed to compiler, thus opmask registers which are caller saved as per X86 ABI are saved before making the call and later restored back.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a closer look at the relevant code and now I better understand what confuses me here.

Through vector registers are caller saved, they aren't added into caller_saved, but just unconditionally spilled.
Why can't you do the same for KRegisters? caller_saved.Member(opto_reg) should always be true for KRegister, shouldn't it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious.

@nsjian
Copy link

nsjian commented Mar 3, 2021

Hi,

@XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: openjdk/panama-vector#40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (openjdk/panama-vector@e658f4d). If it helps, we can also propose a patch here in openjdk/jdk.

I'd like to focus high-level aspects first.

There's a significant amount of crux coming from the fact that masks
don't have their own type. Reusing TypeLong::LONG for k-registers may
look appealing at first, but then all the places where RegVMask matters
have to handle the types specially. Why not introduce a dedicated type
for masks?

I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: openjdk/panama-vector@3f69d40

Also, my understanding is AArch64/SVE allows predicate registers to be
larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
representation will be needed.

Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register?

Thanks,
Ningsheng

@XiaohongGong
Copy link
Contributor

Second question is about x86 and different mask representations it has:
AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
vector registers. Are we fine with leaking this difference into Ideal IR
and specifying what shape a mask value has during Ideal construction?

Yes, as @nsjian mentioned above, we added a new mask type mapped to a predicate register. Besides, to make a difference with the old vector IRs that uses vector registers for mask on other platforms, we also added a new abstract IR (VectorMaskNode) to represent the mask on SVE. All the mask generation IRs are extended from it. Please see codes: openjdk/panama-vector@3f69d40 .

@@ -894,12 +895,15 @@ void PhaseChaitin::gather_lrg_masks( bool after_aggressive ) {
break;
case Op_RegL: // Check for long or double
case Op_RegD:
#if defined(IA32) || defined(AMD64)
case Op_RegVMask:
#endif
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this #ifdef is not necessary here, and should not put it together with Op_RegL here. For other arch, at least for Arm SVE, the num_regs for predicate reg is not 2. In openjdk/panama-vector@e658f4d, we define a new enum SlotsPerRegVMask. Do you think that works for x86 as well?

@@ -1022,6 +1022,7 @@ class Node {
// Check if 'this' node dominates or equal to 'sub'.
bool dominates(Node* sub, Node_List &nlist);

virtual const void* meta_data() const { return NULL; }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used?

@@ -67,6 +67,9 @@ int RegMask::num_registers(uint ireg) {
return SlotsPerVecX;
case Op_VecD:
return SlotsPerVecD;
#if defined(AMD64) || defined(IA32)
case Op_RegVMask:
#endif
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it would be better we have less #ifdef in the shared code. So SlotsPerVMask? :-)

@jatin-bhateja
Copy link
Member Author

Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

Good work, Jatin!

I'd like to focus high-level aspects first.

There's a significant amount of crux coming from the fact that masks
don't have their own type. Reusing TypeLong::LONG for k-registers may
look appealing at first, but then all the places where RegVMask matters
have to handle the types specially. Why not introduce a dedicated type
for masks?

Also, my understanding is AArch64/SVE allows predicate registers to be
larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
representation will be needed.

Current register allocation framework can perform allocation at the granularity of 32 bit. Thus in order to allocate a 64 bit register we reserve 2 bits from the register mask of its corresponding live range. Spilling code (MachSpillCopyNode::implementation) also is sensitive to this since a 32 bit def in 64 bit mode spill only 32 bit value.

Opmask register is special in a way such that usable register portion could be 8,16,32 or 64 bit wide depending on the lane type and vector size. Thus in an optimal implementation both allocator and spill code may allocate and spill only the usable portion of the opmask register. This may not be possible in current allocation frame work.

Keeping this added complexity out of implementation, existing patch performs both allocation and spilling at 64 bit granularity. This is why a LONG type is sufficient to represent an Opmask register for X86.

I agree that ARM SVE may have to create a new mask Type since performing the spill and allocation at widest possible mask will be costly. Thus Matcher::perdicate_reg_type() can be used to return the LONG type for X86 and new mask type for ARM SVE. This will prevent any modification in target independent IR.

Also for X86 a mask generating node may have different Ideal type and register for non-AVX512 targets.

Please let me know if there is any disconnect in my understanding here.

Second question is about x86 and different mask representations it has:
AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
vector registers. Are we fine with leaking this difference into Ideal IR
and specifying what shape a mask value has during Ideal construction?

For AVX-512 targets any mask generating node will have LONG type and a vector type for non-AVX512 case.
This will pass down from IDEAL IR to Machine IR and thus allocator will take allocation call based on the ideal register corresponding to the type.

Please elaborate what adverse implication do you see with this approach.

Regarding the patch itself, RegisterSaver support and related changes
can be integrated separately.

Best regards,
Vladimir Ivanov

On 28.02.2021 21:40, Jatin Bhateja wrote:

@mlbridge
Copy link

mlbridge bot commented Mar 3, 2021

Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

There's a significant amount of crux coming from the fact that masks
don't have their own type. Reusing TypeLong::LONG for k-registers may
look appealing at first, but then all the places where RegVMask matters
have to handle the types specially. Why not introduce a dedicated type
for masks?

Also, my understanding is AArch64/SVE allows predicate registers to be
larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
representation will be needed.

Current register allocation framework can perform allocation at the granularity of 32 bit. Thus in order to allocate a 64 bit register we reserve 2 bits from the register mask of its corresponding live range. Spilling code (MachSpillCopyNode::implementation) also is sensitive to this since a 32 bit def in 64 bit mode spill only 32 bit value.

Opmask register is special in a way such that usable register portion could be 8,16,32 or 64 bit wide depending on the lane type and vector size. Thus in an optimal implementation both allocator and spill code may allocate and spill only the usable portion of the opmask register. This may not be possible in current allocation frame work.

Yes, it may be attractive in the future, but I don't see it as something
important enough for the first versions.

Keeping this added complexity out of implementation, existing patch performs both allocation and spilling at 64 bit granularity. This is why a LONG type is sufficient to represent an Opmask register for X86.

That's fine with me.

I agree that ARM SVE may have to create a new mask Type since performing the spill and allocation at widest possible mask will be costly. Thus Matcher::perdicate_reg_type() can be used to return the LONG type for X86 and new mask type for ARM SVE. This will prevent any modification in target independent IR.

Matcher::perdicate_reg_type() is not enough to hide the
platform-specific choice. The choice of TypeLong introduces significant
complexity in shared code (e.g, PhiNode and MachSpillCopyNode-related
changes). Moreover, SVE will have to choose an alternative and more
generic representation. Hence, it makes the ad-hoc changes needed for
RegVMask+TypeLong even less attractive.

Considering there's active work going on on SVE support, I'm in favor of
collaborating on unified representation between platforms and rely on it
in the first version.

Also for X86 a mask generating node may have different Ideal type and register for non-AVX512 targets.

Second question is about x86 and different mask representations it has:
AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
vector registers. Are we fine with leaking this difference into Ideal IR
and specifying what shape a mask value has during Ideal construction?

For AVX-512 targets any mask generating node will have LONG type and a vector type for non-AVX512 case.
This will pass down from IDEAL IR to Machine IR and thus allocator will take allocation call based on the ideal register corresponding to the type.

What I'd like to avoid is the situation when different Ideal IR shapes
are needed to work with different mask representations.

Customizing node types fits that goal well, but I don't fully understand
all the implications yet, in particular:

(1) having a node which can be of type TypeVect or TypeLong may be
problematic for some existing code;

(2) as of now, there's nothing which forbids pre-AVX512 and AVX512
code to be mixed in a single compilation unit. Is it feasible to require
only a single mask representation to be used at runtime?

Best regards,
Vladimir Ivanov

@mlbridge
Copy link

mlbridge bot commented Mar 4, 2021

Mailing list message from John Rose on hotspot-compiler-dev:

On Mar 3, 2021, at 4:22 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:

Matcher::perdicate_reg_type() is not enough to hide the platform-specific choice. The choice of TypeLong introduces significant complexity in shared code (e.g, PhiNode and MachSpillCopyNode-related changes). Moreover, SVE will have to choose an alternative and more generic representation. Hence, it makes the ad-hoc changes needed for RegVMask+TypeLong even less attractive.

Considering there's active work going on on SVE support, I'm in favor of collaborating on unified representation between platforms and rely on it in the first version.

I agree. Would it help to have TypeMask as a #define of TypeLong,
in the first versions? (I?m thinking of TypeX as a similar pseudo-type.)

@mlbridge
Copy link

mlbridge bot commented Mar 4, 2021

Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

Matcher::perdicate_reg_type() is not enough to hide the
platform-specific choice. The choice of TypeLong introduces
significant complexity in shared code (e.g, PhiNode and
MachSpillCopyNode-related changes). Moreover, SVE will have to choose
an alternative and more generic representation. Hence, it makes the
ad-hoc changes needed for RegVMask+TypeLong even less attractive.

Considering there's active work going on on SVE support, I'm in favor
of collaborating on unified representation between platforms and rely
on it in the first version.

I agree. ?Would it help to have TypeMask as a #define of TypeLong,
in the first versions? ?(I?m thinking of TypeX as a similar pseudo-type.)

Unfortunately, I don't see how a macro could help here. It would still
be an instance of TypeLong and the code would have to explicitly
dispatch between mask and non-mask cases when choosing appropriate ideal
register.

I find TypeVMask which is proposed as part of SVE support (Ningsheng
mentioned it earlier) a much better alternative:

https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1

Also, a comment on process: Ningsheng and Xiaohong started with RFC on
panama-dev which looks more convenient. It allows to commit earlier and
iterate quicker compared to jdk. It makes sense to come up with missing
primitives first which satisfy both AVX-512 and SVE requirements. Once
they are available, proper support can follow. And alternative
implementations can happily coexist in Panama repo while the
implementations converge.

Best regards,
Vladimir Ivanov

@jatin-bhateja
Copy link
Member Author

@nsjian @iwanowww , all your outstanding comments have been resolved.

Copy link

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better now.

@@ -257,9 +257,9 @@ class ConcreteRegisterImpl : public AbstractRegisterImpl {
// it's optoregs.

number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register +
2 * FloatRegisterImpl::number_of_registers +
2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate, please, why 8 more slots are needed on x86_32?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to cover for additional FILL0-7 definitions present only for 32 bit JVM

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, still not clear to me why it matters only now.
Is it because predicate registers are put at the very end and the wrong sizing leaves them out of the RegMask bit map on x86-32?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , its needed for correct computation RM_SIZE (number of of words to hold a RegMask), till now we were not accounting FILL register count while computing total number of registers, but it was not hurting since number of words is the ceiling multiple of word size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to this line explaining where 8 and 0 numbers coming from.

@@ -105,6 +107,10 @@ class RegisterSaver {
ymm_off = xmm_off + (XSAVE_AREA_YMM_BEGIN - XSAVE_AREA_BEGIN)/BytesPerInt,
DEF_YMM_OFFS(0),
DEF_YMM_OFFS(1),
// 2..7 are implied in range usage
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is wrongly placed: it reads as if 2..7 were related to DEF_YMM_OFFS while it refers to DEF_OPMASK_OFFS.

Also, what about moving the declaration after DEF_ZMM_UPPER_OFFS? It's a bit weird to see OPMASK between YMM and ZMM declarations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it consistent with XSAVE stack layout dumping opmask at 1088 offset, anyways space was reserved in the frame, just that we weren't using it.

@@ -72,6 +72,7 @@ void VMRegImpl::set_regName() {
#define X87_TYPE 2
#define STACK_TYPE 3

//TODO: Case for KRegisters
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, elaborate what is missing and how the missing code manifests at runtime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would seek your suggestion here. Can it wait for base patch

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it's used exclusively by Foreign ABI implementation which doesn't support k-registers on x86 now.
I'm fine with leaving it as is for now.

src/hotspot/cpu/x86/x86_64.ad Outdated Show resolved Hide resolved
} else if (dst_first_rc == rc_kreg) {
if ((src_first & 1) == 0 && src_first + 1 == src_second &&
(dst_first & 1) == 0 && dst_first + 1 == dst_second) {
// 64-bit
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// 64-bit identation is wrong.

}
}
return 0;
} else if(dst_first_rc == rc_float) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a space between if and the bracket.

@@ -105,6 +105,7 @@ class RegMask {
SlotsPerVecX = 4,
SlotsPerVecY = 8,
SlotsPerVecZ = 16,
SlotsPerRegVmask = X86_ONLY(2) NOT_X86(1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be named SlotsPerRegVMask instead?

@@ -658,6 +661,11 @@ void Type::Initialize_shared(Compile* current) {
// get_zero_type() should not happen for T_CONFLICT
_zero_type[T_CONFLICT]= NULL;

#if defined(AMD64) || defined(IA32)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use X86 instead.

if (Matcher::has_predicated_vectors()) {
idealreg2regmask[Op_RegVMask] = regmask_for_ideal_register(Op_RegVMask, ret);
} else {
idealreg2regmask[Op_RegVMask] = idealreg2regmask[Op_RegI];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How idealreg2regmask[Op_RegVMask] is used when Matcher::has_predicated_vectors() == false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How idealreg2regmask[Op_RegVMask] is used when Matcher::has_predicated_vectors() == false?

No new register pressure threshold has been introduced for predicate registers, hence just a default initialization for targets which do not support them, since its getting used in raise/lower_pressure

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply unconditionally initialize idealreg2regmask[Op_RegVMask] to regmask_for_ideal_register(Op_RegVMask, ret) then? rm should never overlap with regmask_for_ideal_register(Op_RegVMask, ret) in raise/lower_pressure when Matcher::has_predicated_vectors() == false.

PS: you could also put Matcher::has_predicated_vectors() in raise/lower_pressure for r.overlap(*Matcher::idealreg2regmask[Op_RegVMask]) check. Or even factor the whole condition out into a helper method.

Copy link

@nsjian nsjian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jatin,

Thanks for the work. Unfortunately, it seems that there are some conflicts (e.g. type definition) with our work at openjdk/panama-vector#40. Could you please help to take a look at openjdk/panama-vector#40, so that we could improve that as well and reduce the potential merge effort?

Thanks,
Ningsheng

{ Bad, T_ILLEGAL, "vectora:", false, Op_VecA, relocInfo::none }, // VectorA.
{ Bad, T_ILLEGAL, "vectors:", false, 0, relocInfo::none }, // VectorS
{ Bad, T_ILLEGAL, "vectord:", false, Op_RegL, relocInfo::none }, // VectorD
{ Bad, T_ILLEGAL, "vectorx:", false, Op_VecX, relocInfo::none }, // VectorX
{ Bad, T_ILLEGAL, "vectory:", false, 0, relocInfo::none }, // VectorY
{ Bad, T_ILLEGAL, "vectorz:", false, 0, relocInfo::none }, // VectorZ
#elif defined(S390)
{ Bad, T_ILLEGAL, "vectorm:", false, Op_RegVMask, relocInfo::none }, // VectorM.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name "vectorm" here looks like a kind of vector, is it? Also, since this line is the same in all the #define-else blocks, perhaps you can move it out of the #define block?

@@ -299,6 +302,8 @@ class Type {
const TypeAry *isa_ary() const; // Returns NULL of not ary
const TypeVect *is_vect() const; // Vector
const TypeVect *isa_vect() const; // Returns NULL if not a Vector
const TypeVectMask *is_vectmask() const; // Predicate/Mask Vector
const TypeVectMask *isa_vectmask() const; // Returns NULL if not a Predicate/Mask Vector
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if not a Predicate/Mask Vector" --> "if not a Vector Predicate/Mask"?

class TypeVectMask : public TypeVect {
public:
friend class TypeVect;
TypeVectMask(const Type* elem, uint length) : TypeVect(VectorM, elem, length) {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why TypeVectMask is a subclass of TypeVect? Do existing helper functions in TypeVect also work for TypeVectMask, e.g. xmeet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For targets which do not support predicated registers, masks are propagated though vectors. Thus inheriting it seemed logical extension for mask type.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think there should not be any mask type for those non-mask hardware targets. If in future we would like to have some type based optimization, e.g. all-true/all-false, that may be confusing with vector type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think there should not be any mask type for those non-mask hardware targets.

It seems what you are calling "non-mask hardware targets" are ISAs without dedicated predicate registers. Still, it doesn't mean such ISA doesn't support masked operations (e.g., AVX/AVX2 on x86). It will be unfortunate if we have to distinguish between different mask representations in Ideal IR. TypeVectMask describes an "ideal" vector mask and IMO it is natural to represent it as a vector.

If in future we would like to have some type based optimization, e.g. all-true/all-false, that may be confusing with vector type.

The missing piece to make it work is vector constant support. Once it's there, it'll be possible to inspect elements of a vector constant. It is not limited to masks, but nicely generalizes to all vector shapes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Vladimir, Ningsheng,

In the recent version of the patch I have demonstrated the special handling for opmask registers during de-optimization.
This code may not be relevant for base version of patch though, since only mask generating node defining an opmask register
currently is VectorMaskGenNode.

In PhaseVector::scalarize_vbox_node we stitch the vector payload to SafePoint node by creating
a SafepointScalarObjectNode, passing the vector payload of masked class though a VectorStoreMaskNode before creating a
SafePointScalarObjectNode will make value re-materialization process during de-optimization agnostic to existence
of predicate vectors. Wanted to know your thoughts on it?

This may further help in removing the logic to add mask vectors to callee saved list of RegisterMap
(accessed during de-opt for object reallocation) since payload will always be present in a byte vector.
It may come at an extra cost though, since register pressure over vectors may increase.

Do you suggest removing re-construction logic and addition of opmask vectors to callee save list from base patch?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest removing re-construction logic and addition of opmask vectors to callee save list from base patch?

Yes, please. That was my first thought when I saw the latest patch.
Let's review/discuss it separately once the base patch lands in the repo.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to make this patch simpler and have separate patches for further work.

@@ -2406,6 +2412,14 @@ const TypeVect* TypeVect::make(const Type *elem, uint length) {
return NULL;
}

const TypeVect *TypeVect::makemask(const BasicType elem_bt, uint length) {
if (Matcher::has_predicated_vectors()) {
return (TypeVect*)(new TypeVectMask(get_const_basic_type(T_BOOLEAN), MaxVectorSize))->hashcons();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the TypeVectMask only saves the T_BOOLEAN as the basic type for all the mask type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct for targets which do not support predicate registers flow differs and a TypeVect matches with type of vector register.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Actually that's correct for AVX-512. However, unlike AVX-512 K registers, SVE predicate needs the real basic type as a reference when controlling the vector operations. I'm worried that this is not friendly for SVE.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XiaohongGong, can you elaborate why SVE predicate needs the basic type of vector element it is applied to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason is that the lane size of SVE predicate depends on the vector lane type, i.e 1-mask-bit per one byte while 2-mask-bit per one short and so on. So most of the predicate operations depend on the vector element basic type. We uses the CNTP as an example which computes the active numbers in a predicate register: CNTP <Xd>, <Pg>, <Pn>.B is different from CNTP <Xd>, <Pg>, <Pn>.H. Another usage of the real basic type is the API VectorMask.toVector(). When creating a vector from a mask, we need the real element type (T_BOOLEAN is not enough).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so it's a mask bit per vector byte then. It looks similar to how AVX/AVX2 represents masks (w/o predicate registers, but reusing wide vector registers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a mask bit per byte.

@jatin-bhateja
Copy link
Member Author

Hi @iwanowww , @nsjian, please let me know if there are other comments, related to the patch or is it good to land in.

@@ -531,6 +538,9 @@ void Matcher::init_first_stack_mask() {
*idealreg2spillmask[Op_RegD] = *idealreg2regmask[Op_RegD];
idealreg2spillmask[Op_RegD]->OR(aligned_stack_mask);

*idealreg2spillmask[Op_RegVMask] = *idealreg2regmask[Op_RegVMask];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still get a SEGV on arm (32bit) build. I think it's here?

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xfee7ed8e, pid=18153, tid=18162
#
# JRE version: OpenJDK Runtime Environment (17.0) (fastdebug build 17-internal+0-git-85a6b6e69)
# Java VM: OpenJDK Server VM (fastdebug 17-internal+0-git-85a6b6e69, mixed mode, serial gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0xaeed8e] Matcher::init_first_stack_mask()+0x89d
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to //core.18153)
#

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidentally removed earlier Matcher::has_predicated_vectors() checks, thanks for reporting before check-in.

Copy link

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! (Modulo the crash Ningsheng observed.)

Thanks a lot for addressing all my comments/requests!

@@ -409,7 +409,9 @@ uint PhaseChaitin::count_int_pressure(IndexSet* liveout) {
LRG& lrg = lrgs(lidx);
if (lrg.mask_is_nonempty_and_up() &&
!lrg.is_float_or_vector() &&
lrg.mask().overlap(*Matcher::idealreg2regmask[Op_RegI])) {
(lrg.mask().overlap(*Matcher::idealreg2regmask[Op_RegI]) ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup suggestion: the check grows in size and it is repeated 3 times. Please, extract it into a helper method.

@openjdk
Copy link

openjdk bot commented Mar 25, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8262355: Support for AVX-512 opmask register allocation.

Reviewed-by: vlivanov, njian, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 15 new commits pushed to the master branch:

  • 0780666: 8254050: HotSpot Style Guide should permit using the "override" virtual specifier
  • f259eea: 8264393: JDK-8258284 introduced dangling TLH race
  • 9b2232b: 8264123: add ThreadsList.is_valid() support
  • e8eda65: 8264664: use text blocks in javac module tests
  • cec66cf: 8264572: ForkJoinPool.getCommonPoolParallelism() reports always 1
  • 9c283da: 8264662: ProblemList vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java on win-x64 with ZGC
  • eb0ac86: 8264655: Minor internal doc comment cleanup
  • 3991b32: 8264539: Improve failure message of java/nio/file/WatchService/SensitivityModifier.java
  • 4133ded: 8264658: ProblemList javax/swing/JInternalFrame/8146321/JInternalFrameIconTest.java on linux-x64
  • 220ddbd: 8264657: ProblemList java/awt/Focus/FrameMinimizeTest/FrameMinimizeTest.java on linux-x64
  • ... and 5 more: https://git.openjdk.java.net/jdk/compare/7d0a0bad7f3b6f9900d44ec5cd64c6d2a151a804...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 25, 2021
@jatin-bhateja
Copy link
Member Author

Hi @iwanowww , @reviewers, we have got one reviewer clearance, please suggest if it's ok to integrate it or wait for another review concent.

Copy link

@nsjian nsjian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a (R)eviewer, but looks good to me. (Those files you've modified still need to update copyright year?)

@@ -845,6 +853,12 @@ class TypeVectZ : public TypeVect {
TypeVectZ(const Type* elem, uint length) : TypeVect(VectorZ, elem, length) {}
};

class TypeVectMask : public TypeVect {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, recently we found an issue that some optimizations cannot work well for the floating point VectorMask due to the different element type for different APIs (see: #3238). It might need more consideration for the hash()/eq() method TypeVectMask especially for SVE. I guess the current implementation works well for AVX-512? If so, maybe we can have a separate patch to fix it in future once the main mask support codes are merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, bitwise operation involving VectorMask nodes (i.e. the nodes with type TypeVectMask) works over predicate registers. Since for x86 opmask register representation is consistent for all primitive types, it may not be a problem. For SVE a different instruction pattern for masks of floating types may suffice. Yes, we can do refinements in subsequent patches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks! Another consideration is: since mask type representations are different for platforms that do not have mask features (using TypeVect), and SVE/AVX-512 (using TypeVectMask). And this special handling is needed for both platforms/types. Is it possible to use one type (i.e. TypeVectMask) for all mask types, with different idea_reg() for different platforms (like what you did in your original patch) ? Or did you have considered this solution before? Same with above, we can also refine this in future if it's possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @XiaohongGong , this base patch is purely to extend allocation support without modifying existing vector API mask nodes behavior. We can do that discussion on panama-dev to reach to a final solution which covers both SVE and AVX512 masking operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's great to me. Thanks!

@iwanowww
Copy link

please suggest if it's ok to integrate it or wait for another review concent.

Considering the size of the patch, I'd prefer to see 2 (R)eviews.

Meanwhile, I'll submit it for testing.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few notes.

  1. In sharedRuntime_*.cpp you have name opmask_state_* and .ad files also opmask and opmask_reg_k*'. But in C2 code you call them vectmaskwhich is confusing. I prefer to use the same name everywhere -vectmask`:
const RegMask* Matcher::predicate_reg_mask(void) {
   return &_OPMASK_REG_mask;
 }

 const TypeVect* Matcher::predicate_reg_type(const Type* elemTy, int length) {
   return new TypeVectMask(TypeInt::BOOL, length);
 }
  1. I assume that they are used only for vectors in compiled code and in other cases the don't need to be saved/restored at safepoints.
  2. In C2 shared code names are mess too: VectorM, VMASK, RegVMask, TypeVectMask, VectorMaskCmp. I suggest to use VectorMask, VECTMASK, RegVectMask.

@@ -218,7 +218,7 @@ class KRegisterImpl : public AbstractRegisterImpl {
public:
enum {
number_of_registers = 8,
max_slots_per_register = 1
max_slots_per_register = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment. So the max size is 64 bytes?

@@ -257,9 +257,9 @@ class ConcreteRegisterImpl : public AbstractRegisterImpl {
// it's optoregs.

number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register +
2 * FloatRegisterImpl::number_of_registers +
2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to this line explaining where 8 and 0 numbers coming from.

@jatin-bhateja
Copy link
Member Author