Skip to content
This repository has been archived by the owner. It is now read-only.

8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. #24

Closed
wants to merge 4 commits into from

Conversation

jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Dec 14, 2021

  • Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector.
  • In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector.

Kindly review and share your feedback.

Best Regards,
Jatin


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24
$ git checkout pull/24

Update a local copy of the PR:
$ git checkout pull/24
$ git pull https://git.openjdk.java.net/jdk18 pull/24/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24

View PR using the GUI difftool:
$ git pr show -t 24

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk18/pull/24.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Dec 14, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Dec 14, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Dec 14, 2021

@jatin-bhateja The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler label Dec 14, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Dec 14, 2021

Webrevs

Copy link

@vnkozlov vnkozlov left a comment

Seems reasonable. Let me test it.

Copy link

@vnkozlov vnkozlov left a comment

Regular testing (x64) results are good.

But I also run jdk/incubator/vector/ tests locally with 32-bit VM (fastdebug). Most tests passed but some hit timeout (run for more than default 2 min):

jdk/incubator/vector/Byte512VectorTests.java
jdk/incubator/vector/ByteMaxVectorTests.java
jdk/incubator/vector/VectorReshapeTests.java

I assume such vectors are not supported in 32-bit VM and code is slow. Simple solution is to add /timeout=240 (4 min) to these tests. They passed for me after that.

@jatin-bhateja
Copy link
Member Author

@jatin-bhateja jatin-bhateja commented Dec 15, 2021

Regular testing (x64) results are good.

But I also run jdk/incubator/vector/ tests locally with 32-bit VM (fastdebug). Most tests passed but some hit timeout (run for more than default 2 min):

jdk/incubator/vector/Byte512VectorTests.java
jdk/incubator/vector/ByteMaxVectorTests.java
jdk/incubator/vector/VectorReshapeTests.java

I assume such vectors are not supported in 32-bit VM and code is slow. Simple solution is to add /timeout=240 (4 min) to these tests. They passed for me after that.

Hi @vnkozlov , x86 32bit target constraints the number of vector registers to 8 (zmm0-zmm7) so above testcases should still be able to intrinsify the operation. I tested above tests on AVX512 target and they executes in lass than 2 mins. On a non-AVX512 target any operation based on Byte512 species is not inline expanded.

Since the changes in this patch are only related to AVX512 thus my testing went clean with UseAVX=3.

@vnkozlov
Copy link

@vnkozlov vnkozlov commented Dec 15, 2021

I ran these tests with -XX:-TieredCompilation -Xbatch -XX:UseAVX=3 flags and 32-bit VM on Skylake.

@jatin-bhateja
Copy link
Member Author

@jatin-bhateja jatin-bhateja commented Dec 16, 2021

jdk/incubator/vector/VectorReshapeTests.java

Hi @vnkozlov , I have added explicit time out check as suggested in above 3 testcases.

Copy link

@vnkozlov vnkozlov left a comment

Good. You need second review for this.

@openjdk
Copy link

@openjdk openjdk bot commented Dec 16, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.

Reviewed-by: kvn, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 28 new commits pushed to the master branch:

  • 84d3333: 8279081: ProblemList jdk/jfr/event/oldobject/TestLargeRootSet.java on 2 platforms
  • 1128674: 8278627: Shenandoah: TestHeapDump test failed
  • 54517fa: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64
  • ac7430c: 8278044: ObjectInputStream methods invoking the OIF.CFG.getSerialFilterFactory() silent about error cases.
  • db3d6d7: 8278087: Deserialization filter and filter factory property error reporting under specified
  • 467f654: 8279011: JFR: JfrChunkWriter incorrectly handles int64_t chunk size as size_t
  • 819f9bd: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297
  • ad12828: 8278609: [macos] accessibility frame is misplaced on a secondary monitor on macOS
  • deaf75a: 8278413: C2 crash when allocating array of size too large
  • 36676db: 8278970: [macos] SigningPackageTest is failed with runtime exception
  • ... and 18 more: https://git.openjdk.java.net/jdk18/compare/475ec8e6c5abc3431344d69bd46395e8c4b46e4c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Dec 16, 2021
@jatin-bhateja
Copy link
Member Author

@jatin-bhateja jatin-bhateja commented Dec 17, 2021

@svisva7 can you kindly be second reviewer.

instruct mask_all_evexI_imm_GT32(kReg dst, immI cnt, rRegI tmp, kReg ktmp) %{
predicate(Matcher::vector_length(n) > 32);
match(Set dst (MaskAll cnt));
effect(TEMP dst, TEMP tmp, TEMP ktmp);
format %{ "mask_all_evex_imm_GT32 $dst, $cnt \t! using $tmp and $ktmp as TEMP" %}
ins_encode %{
int mask_len = Matcher::vector_length(this);
__ movl($tmp$$Register, $cnt$$constant);
__ vector_maskall_operation32($dst$$KRegister, $tmp$$Register, $ktmp$$KRegister, mask_len);
%}
ins_pipe( pipe_slow );
%}
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed. The more generic mask_all_evexI_GT32 rule will take care of this.

match(Set dst (MaskAll src));
effect(TEMP_DEF dst);
format %{ "mask_all_evexL $dst, $src \t! mask all operation" %}
effect(TEMP dst);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP dst is not needed.

match(Set dst (MaskAll src));
effect(TEMP_DEF dst, TEMP tmp);
format %{ "mask_all_evexI $dst, $src \t! using $tmp as TEMP" %}
effect(TEMP dst, TEMP tmp);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP dst is not needed here.

Copy link
Member Author

@jatin-bhateja jatin-bhateja Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dst is both read and written to in macro assembly routine.

Copy link

@sviswa7 sviswa7 Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dst both written and read is not the criteria for TEMP dst effect. Only if the src is used after dest is written in the encoding for instruct, TEMP dst is needed.

Copy link
Contributor

@merykitty merykitty Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, do we need TEMP dst if some tmp is used after dst is written, as without it tmp and dst are implied to have non-overlapping lifetime and can lead to conflict. This is a pure question unrelated to the PR. Thank you very much.

Copy link

@sviswa7 sviswa7 Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merykitty, yes we would need TEMP dst in that case as well.

Copy link
Member Author

@jatin-bhateja jatin-bhateja Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP attribute ensures creation of a temporary machine operand which interferes with source operands if their registerclasses are overlapping, a TEMP + DST data flow attribute ensures that DST is not accidentally allocated a SRC register even if SRC is not live beyond that instruction. This shall prevent overriding the src before using its value in the instruction encoding block.

instruct mask_all_evexL_LT32(kReg dst, eRegL src) %{
predicate(Matcher::vector_length(n) <= 32);
match(Set dst (MaskAll src));
effect(TEMP dst);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP dst is not needed.

@@ -13011,6 +13011,43 @@ instruct safePoint_poll_tls(rFlagsReg cr, rRegP poll)
ins_pipe(ialu_reg_mem);
%}

instruct mask_all_evexL(kReg dst, rRegL src) %{
match(Set dst (MaskAll src));
effect(TEMP dst);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP dst not needed.

instruct mask_all_evex_imm_GT32(kReg dst, immI cnt, rRegL tmp) %{
predicate(Matcher::vector_length(n) > 32);
match(Set dst (MaskAll cnt));
effect(TEMP dst, TEMP tmp);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEMP dst not needed.

assert(mask_len <= 16, "");
kmovwl(dst, src);
kshiftrwl(dst, dst, 16 - mask_len);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If masklen == 16 then kshiftrwl is not needed?

if (VM_Version::supports_avx512bw()) {
if (mask_len > 32) {
kmovql(dst, src);
kshiftrql(dst, dst, 64 - mask_len);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here masklen is 64, so kshiftrql is not needed here?

Copy link
Member Author

@jatin-bhateja jatin-bhateja Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src operand carry either a 0 (false mask) or 0XFFFFFFFFFFFFFFFF (true mask) value. Thus a right shift here ensures that destination bits whose size matches with mask_len are set.

I have introduced special case handling when mask_len is 64, 32 and 16.

kmovdl(dst, src);
kshiftrdl(dst, dst, 32 - mask_len);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If masklen is 32 then kshiftrdl is not needed. Only needed if masklen < 32.

assert(VM_Version::supports_avx512bw(), "");
kmovdl(tmp, src);
kshiftlql(dst, tmp, 32);
korql(dst, dst, tmp);
kshiftrql(dst, dst, 64 - mask_len);
Copy link

@sviswa7 sviswa7 Dec 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the kshiftrql here? The masklen is 64 here.
You could alternatively use:
kmovdl dst, src
kunpckdq dst, dst, dst

Copy link
Member Author

@jatin-bhateja jatin-bhateja Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

if (mask_len != 64) {
kshiftrql(dst, dst, 64 - mask_len);
}
Copy link

@sviswa7 sviswa7 Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only supporting vector lengths of power of 2 for x86 (8,16,32,64) for byte vector. The only case that comes here is mask_len == 64.

assert(VM_Version::supports_avx512bw(), "");
kmovdl(tmp, src);
kunpckdql(dst, tmp, tmp);
kshiftrql(dst, dst, 64 - mask_len);
Copy link

@sviswa7 sviswa7 Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only supporting vector lengths of power of 2 for x86 (8,16,32,64) for byte vector. The only case that comes here is mask_len == 64 so we dont need kshiftrql.

@jatin-bhateja
Copy link
Member Author

@jatin-bhateja jatin-bhateja commented Dec 21, 2021

Hi @sviswa7, your comments addressed. TEMP def was an artifact of some other change I was doing initially.

Copy link

@sviswa7 sviswa7 left a comment

Patch looks good to me.

@jatin-bhateja
Copy link
Member Author

@jatin-bhateja jatin-bhateja commented Dec 22, 2021

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Dec 22, 2021

Going to push as commit 97c5cd7.
Since your change was applied there have been 29 commits pushed to the master branch:

  • 9ee3ccf: 8279045: Intrinsics missing vzeroupper instruction
  • 84d3333: 8279081: ProblemList jdk/jfr/event/oldobject/TestLargeRootSet.java on 2 platforms
  • 1128674: 8278627: Shenandoah: TestHeapDump test failed
  • 54517fa: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64
  • ac7430c: 8278044: ObjectInputStream methods invoking the OIF.CFG.getSerialFilterFactory() silent about error cases.
  • db3d6d7: 8278087: Deserialization filter and filter factory property error reporting under specified
  • 467f654: 8279011: JFR: JfrChunkWriter incorrectly handles int64_t chunk size as size_t
  • 819f9bd: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297
  • ad12828: 8278609: [macos] accessibility frame is misplaced on a secondary monitor on macOS
  • deaf75a: 8278413: C2 crash when allocating array of size too large
  • ... and 19 more: https://git.openjdk.java.net/jdk18/compare/475ec8e6c5abc3431344d69bd46395e8c4b46e4c...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label Dec 22, 2021
@openjdk openjdk bot closed this Dec 22, 2021
@openjdk openjdk bot removed ready rfr labels Dec 22, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Dec 22, 2021

@jatin-bhateja Pushed as commit 97c5cd7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
hotspot-compiler integrated
4 participants