8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. #24
Conversation
|
@jatin-bhateja The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
Regular testing (x64) results are good.
But I also run jdk/incubator/vector/
tests locally with 32-bit VM (fastdebug). Most tests passed but some hit timeout (run for more than default 2 min):
jdk/incubator/vector/Byte512VectorTests.java
jdk/incubator/vector/ByteMaxVectorTests.java
jdk/incubator/vector/VectorReshapeTests.java
I assume such vectors are not supported in 32-bit VM and code is slow. Simple solution is to add /timeout=240
(4 min) to these tests. They passed for me after that.
Hi @vnkozlov , x86 32bit target constraints the number of vector registers to 8 (zmm0-zmm7) so above testcases should still be able to intrinsify the operation. I tested above tests on AVX512 target and they executes in lass than 2 mins. On a non-AVX512 target any operation based on Byte512 species is not inline expanded. Since the changes in this patch are only related to AVX512 thus my testing went clean with UseAVX=3. |
I ran these tests with |
Hi @vnkozlov , I have added explicit time out check as suggested in above 3 testcases. |
@jatin-bhateja This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 28 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
|
@svisva7 can you kindly be second reviewer. |
src/hotspot/cpu/x86/x86_32.ad
Outdated
instruct mask_all_evexI_imm_GT32(kReg dst, immI cnt, rRegI tmp, kReg ktmp) %{ | ||
predicate(Matcher::vector_length(n) > 32); | ||
match(Set dst (MaskAll cnt)); | ||
effect(TEMP dst, TEMP tmp, TEMP ktmp); | ||
format %{ "mask_all_evex_imm_GT32 $dst, $cnt \t! using $tmp and $ktmp as TEMP" %} | ||
ins_encode %{ | ||
int mask_len = Matcher::vector_length(this); | ||
__ movl($tmp$$Register, $cnt$$constant); | ||
__ vector_maskall_operation32($dst$$KRegister, $tmp$$Register, $ktmp$$KRegister, mask_len); | ||
%} | ||
ins_pipe( pipe_slow ); | ||
%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed. The more generic mask_all_evexI_GT32 rule will take care of this.
src/hotspot/cpu/x86/x86.ad
Outdated
match(Set dst (MaskAll src)); | ||
effect(TEMP_DEF dst); | ||
format %{ "mask_all_evexL $dst, $src \t! mask all operation" %} | ||
effect(TEMP dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP dst is not needed.
src/hotspot/cpu/x86/x86.ad
Outdated
match(Set dst (MaskAll src)); | ||
effect(TEMP_DEF dst, TEMP tmp); | ||
format %{ "mask_all_evexI $dst, $src \t! using $tmp as TEMP" %} | ||
effect(TEMP dst, TEMP tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP dst is not needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dst is both read and written to in macro assembly routine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dst both written and read is not the criteria for TEMP dst effect. Only if the src is used after dest is written in the encoding for instruct, TEMP dst is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, do we need TEMP dst
if some tmp
is used after dst
is written, as without it tmp
and dst
are implied to have non-overlapping lifetime and can lead to conflict. This is a pure question unrelated to the PR. Thank you very much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @merykitty, yes we would need TEMP dst in that case as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP attribute ensures creation of a temporary machine operand which interferes with source operands if their registerclasses are overlapping, a TEMP + DST data flow attribute ensures that DST is not accidentally allocated a SRC register even if SRC is not live beyond that instruction. This shall prevent overriding the src before using its value in the instruction encoding block.
src/hotspot/cpu/x86/x86_32.ad
Outdated
instruct mask_all_evexL_LT32(kReg dst, eRegL src) %{ | ||
predicate(Matcher::vector_length(n) <= 32); | ||
match(Set dst (MaskAll src)); | ||
effect(TEMP dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP dst is not needed.
src/hotspot/cpu/x86/x86_64.ad
Outdated
@@ -13011,6 +13011,43 @@ instruct safePoint_poll_tls(rFlagsReg cr, rRegP poll) | |||
ins_pipe(ialu_reg_mem); | |||
%} | |||
|
|||
instruct mask_all_evexL(kReg dst, rRegL src) %{ | |||
match(Set dst (MaskAll src)); | |||
effect(TEMP dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP dst not needed.
src/hotspot/cpu/x86/x86_64.ad
Outdated
instruct mask_all_evex_imm_GT32(kReg dst, immI cnt, rRegL tmp) %{ | ||
predicate(Matcher::vector_length(n) > 32); | ||
match(Set dst (MaskAll cnt)); | ||
effect(TEMP dst, TEMP tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEMP dst not needed.
assert(mask_len <= 16, ""); | ||
kmovwl(dst, src); | ||
kshiftrwl(dst, dst, 16 - mask_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If masklen == 16 then kshiftrwl is not needed?
if (VM_Version::supports_avx512bw()) { | ||
if (mask_len > 32) { | ||
kmovql(dst, src); | ||
kshiftrql(dst, dst, 64 - mask_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here masklen is 64, so kshiftrql is not needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src operand carry either a 0 (false mask) or 0XFFFFFFFFFFFFFFFF (true mask) value. Thus a right shift here ensures that destination bits whose size matches with mask_len are set.
I have introduced special case handling when mask_len is 64, 32 and 16.
kmovdl(dst, src); | ||
kshiftrdl(dst, dst, 32 - mask_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If masklen is 32 then kshiftrdl is not needed. Only needed if masklen < 32.
assert(VM_Version::supports_avx512bw(), ""); | ||
kmovdl(tmp, src); | ||
kshiftlql(dst, tmp, 32); | ||
korql(dst, dst, tmp); | ||
kshiftrql(dst, dst, 64 - mask_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the kshiftrql here? The masklen is 64 here.
You could alternatively use:
kmovdl dst, src
kunpckdq dst, dst, dst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
if (mask_len != 64) { | ||
kshiftrql(dst, dst, 64 - mask_len); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only supporting vector lengths of power of 2 for x86 (8,16,32,64) for byte vector. The only case that comes here is mask_len == 64.
assert(VM_Version::supports_avx512bw(), ""); | ||
kmovdl(tmp, src); | ||
kunpckdql(dst, tmp, tmp); | ||
kshiftrql(dst, dst, 64 - mask_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only supporting vector lengths of power of 2 for x86 (8,16,32,64) for byte vector. The only case that comes here is mask_len == 64 so we dont need kshiftrql.
Hi @sviswa7, your comments addressed. TEMP def was an artifact of some other change I was doing initially. |
/integrate |
Going to push as commit 97c5cd7.
Your commit was automatically rebased without conflicts. |
@jatin-bhateja Pushed as commit 97c5cd7. |
Kindly review and share your feedback.
Best Regards,
Jatin
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24
$ git checkout pull/24
Update a local copy of the PR:
$ git checkout pull/24
$ git pull https://git.openjdk.java.net/jdk18 pull/24/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24
View PR using the GUI difftool:
$ git pr show -t 24
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk18/pull/24.diff