8283232: x86: Improve vector broadcast operations #7832

merykitty · 2022-03-16T01:19:24Z

Hi,

This patch improves code generation for broadcasting a scalar in several ways:

As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size. This patch minimises that overhead for vector replicates of constants. Options are also available to emit constants with stricter alignment, so that vector loads can be performed efficiently without crossing cache lines.
Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
Vectors are loaded using the same kind of instruction (integral vs floating point) as the one that produces the result, to avoid potential data bypass delays.
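
To make the alignment point in the first item concrete, here is a minimal sketch of when a vector load straddles two cache lines. It is not code from the patch: the class name, the helper, and the 64-byte line size are assumptions chosen for illustration.

```java
// Illustrative sketch only; not part of the patch.
final class CacheLineCrossing {
    // Assumption: 64-byte cache lines, which is typical for current x86 CPUs.
    static final int CACHE_LINE_BYTES = 64;

    // Returns true if a load of `size` bytes starting at `address`
    // touches two different cache lines.
    static boolean crossesCacheLine(long address, int size) {
        long firstLine = address / CACHE_LINE_BYTES;
        long lastLine = (address + size - 1) / CACHE_LINE_BYTES;
        return firstLine != lastLine;
    }

    public static void main(String[] args) {
        // A 32-byte (256-bit) load from an address that is only 8-byte aligned.
        System.out.println(crossesCacheLine(40, 32));
        // The same load from a 32-byte-aligned address.
        System.out.println(crossesCacheLine(64, 32));
    }
}
```

A load whose size divides the line size and whose address is aligned to that size can never straddle a line, which is why stricter alignment of the constant helps here.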

With this patch, the results of the added benchmark, which performs operations under very high register pressure, on my machine with an Intel i7-7700HQ (AVX2) are as follows:

                                          Before          After
Benchmark                  Mode  Cnt   Score   Error   Score   Error  Units     Gain
SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598  38.771 ± 0.797  ns/op   +9.03%
SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464  38.603 ± 0.367  ns/op   +8.62%
SpiltReplicate.testInt     avgt    5  20.581 ± 5.791  13.755 ± 0.375  ns/op  +33.17%
SpiltReplicate.testLong    avgt    5  17.794 ± 4.781  13.663 ± 0.387  ns/op  +23.22%
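
For readers checking the Gain column: it appears to be the relative reduction in ns/op, that is (before - after) / before. The sketch below recomputes it from the scores in the table. The class and method names are made up for illustration, and the error bars are ignored.

```java
// Illustrative sketch: recompute the Gain column from the Before/After scores.
final class GainColumn {
    // Relative reduction in ns/op, in percent (lower ns/op is better).
    static double gainPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"testDouble", "testFloat", "testInt", "testLong"};
        double[] before = {42.621, 42.245, 20.581, 17.794};
        double[] after = {38.771, 38.603, 13.755, 13.663};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-10s %+.2f%%%n", names[i], gainPercent(before[i], after[i]));
        }
    }
}
```

Note that the Before errors for testInt and testLong are large relative to the scores, so the gains for those two rows carry more uncertainty than the others.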

As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for long/double and 128 bytes for int/float cases.

This patch also removes some redundant code paths and renames some incorrectly named instructions.

Thank you very much.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8283232: x86: Improve vector broadcast operations

Reviewers

Vladimir Kozlov (@vnkozlov - Reviewer)
Jatin Bhateja (@jatin-bhateja - Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832
$ git checkout pull/7832

Update a local copy of the PR:
$ git checkout pull/7832
$ git pull https://git.openjdk.org/jdk pull/7832/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7832

View PR using the GUI difftool:
$ git pr show -t 7832

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7832.diff

bridgekeeper · 2022-03-16T01:19:45Z

👋 Welcome back merykitty! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-03-16T01:22:33Z

@merykitty The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

merykitty · 2022-03-16T01:24:51Z

/label hotspot-compiler

openjdk · 2022-03-16T01:25:26Z

@merykitty
The hotspot-compiler label was successfully added.

mlbridge · 2022-03-16T01:28:03Z

Webrevs

merykitty · 2022-03-16T14:52:07Z

Hi, forwarding results within the same bypass domain does not incur a delay. Data bypass delay happens when the data crosses domains, according to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow.

The manual gives this guideline in section 3.5.2.2.

Thanks.

jatin-bhateja · 2022-03-16T15:56:44Z

Thanks, I meant to refer to the text above. I have removed the incorrect reference.

jatin-bhateja · 2022-03-16T17:25:53Z

It would still be good to have a microbenchmark that shows the gain from the patch.

merykitty · 2022-03-17T12:48:37Z

Doing a simple benchmark that has a lot of register pressure

@Benchmark
public long broadcastCon() {
    var species = IntVector.SPECIES_PREFERRED;
    var sum = IntVector.zero(species);
    return sum.add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
            .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
            .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
            .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
            .add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
            .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
            .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
            .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
            .reinterpretAsLongs()
            .lane(0);
}

provides the following result:

Before:
Benchmark                     Mode  Cnt   Score   Error  Units
VectorReplicate.broadcastCon  avgt    5  16.417 ± 0.515  ns/op

After:
Benchmark                     Mode  Cnt   Score   Error  Units
VectorReplicate.broadcastCon  avgt    5  13.851 ± 0.154  ns/op

The constant table size decreases from 1024 bytes to 128 bytes, which is much more manageable. The throughput improvement mostly comes from the vector being rematerialized instead of being spilt on the stack.
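
The sizes quoted above are consistent with simple arithmetic. This is a sketch under assumptions, not a description of the actual implementation: the benchmark uses 32 distinct constants (1 to 32), SPECIES_PREFERRED is a 32-byte vector on an AVX2 machine, the old scheme stores one full vector per distinct constant, and the new scheme stores one element per distinct constant and broadcasts it at load time. The class and method names are hypothetical.

```java
// Illustrative sketch: constant table footprint under the two assumed schemes.
final class ConstantTableSize {
    // Assumed old scheme: each distinct constant occupies a whole vector.
    static int wholeVectorBytes(int distinctConstants, int vectorBytes) {
        return distinctConstants * vectorBytes;
    }

    // Assumed new scheme: each distinct constant occupies a single element.
    static int perElementBytes(int distinctConstants, int elementBytes) {
        return distinctConstants * elementBytes;
    }

    public static void main(String[] args) {
        int distinctConstants = 32; // the values 1..32 used by broadcastCon
        int vectorBytes = 32;       // assumption: 256-bit vectors (AVX2)
        System.out.println("before:     " + wholeVectorBytes(distinctConstants, vectorBytes));
        System.out.println("after int:  " + perElementBytes(distinctConstants, Integer.BYTES));
        System.out.println("after long: " + perElementBytes(distinctConstants, Long.BYTES));
    }
}
```

Under the same assumptions, the 8-byte element case gives 256 bytes, which matches the long/double figure in the PR description.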

I have not been able to observe a performance gain related to bypass delay. This is expected: according to Agner Fog's optimisation manual "The microarchitecture of Intel, AMD and VIA CPUs", Intel CPUs since Skylake seem to have only a few such delays.

Thank you very much.

bridgekeeper · 2022-05-13T02:14:33Z

@merykitty This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2022-07-08T03:08:32Z

@merykitty This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

merykitty · 2022-07-27T09:45:31Z

@vnkozlov I have fixed those errors in the latest commits. The second one was caused by pshufb being supported only on SSSE3 machines. The first one occurred because the constant table itself is not sufficiently aligned: currently it is only aligned to 8 bytes. Since this patch already touches many areas, I chose to sidestep the problem and only emit constants requiring at most 8 bytes of alignment. A proper solution would belong in a separate issue. What do you think?

Thanks a lot.

jdk/src/hotspot/share/asm/codeBuffer.hpp

Line 742 in 2bd90c2

inline int CodeSection::alignment(int section) {
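
The alignment problem described above can be illustrated with plain address arithmetic. This sketch is not HotSpot code; the base address and offset are hypothetical values chosen to show that padding an offset inside the table does not help when the table base itself is only 8-byte aligned.

```java
// Illustrative sketch: alignment of a constant depends on the table base too.
final class ConstantTableAlignment {
    static boolean isAligned(long address, int alignment) {
        return address % alignment == 0;
    }

    public static void main(String[] args) {
        long tableBase = 0x1008;  // hypothetical: 8-byte aligned, not 32-byte aligned
        long offsetInTable = 32;  // offset padded to a multiple of 32 bytes
        long constantAddress = tableBase + offsetInTable;

        System.out.println("8-byte aligned:  " + isAligned(constantAddress, 8));
        System.out.println("32-byte aligned: " + isAligned(constantAddress, 32));
    }
}
```

With these values the constant is 8-byte aligned but not 32-byte aligned, so only constants needing at most 8 bytes of alignment can be placed reliably until the table base is aligned more strictly.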

vnkozlov · 2022-07-27T15:07:42Z

I agree with doing that in a separate change. I will start new testing.

vnkozlov · 2022-07-27T20:53:02Z

Got a new failure (testing is still running). Test compiler/c2/cr7200264/TestSSE2IntVect.java failed with -Xcomp:

java.lang.RuntimeException: Unexpected SubVI number: expected 2 >= 4
	at jdk.test.lib.Asserts.fail(Asserts.java:594)
	at jdk.test.lib.Asserts.assertGreaterThanOrEqual(Asserts.java:288)
	at jdk.test.lib.Asserts.assertGTE(Asserts.java:259)
	at compiler.c2.cr7200264.TestDriver.verifyVectorizationNumber(TestDriver.java:65)
	at compiler.c2.cr7200264.TestDriver.run(TestDriver.java:43)
	at compiler.c2.cr7200264.TestSSE2IntVect.main(TestSSE2IntVect.java:48)

merykitty · 2022-07-28T02:53:24Z

It does not seem related: this patch only has effects after matching, so it should not change the IR graph of the compilations.

vnkozlov

I verified that the latest failure I posted is not related to these changes. There were no other failures. Approved.

openjdk · 2022-07-28T16:47:12Z

@merykitty This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283232: x86: Improve vector broadcast operations

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 87 new commits pushed to the master branch:

0ae8341: 8290908: misc tests fail: assert(!thread->owns_locks()) failed: must release all locks when leaving VM
5acf2d7: 8291578: Remove JMX related tests from ProblemList-svc-vthreads.txt
a6564d4: 8291650: Add delay to ClassUnloadEventTest before exiting to give time for JVM to send all events before VMDeath
af76c0c: 8291654: AArch64: assert from JDK-8287393 causes crashes
a9db5bb: 8291626: Remove Mutex::contains as it is unused
a2cff26: 8291597: [BACKOUT] JDK-8289996: Fix array range check hoisting for some scaled loop iv
554f44e: 8282730: LdapLoginModule throw NPE from logout method after login failure
f714ac5: 8290718: Remove ALLOCATION_SUPER_CLASS_SPEC
6cbc234: 8287393: AArch64: Remove trampoline_call1
57bf603: 8289948: Improve test coverage for XPath functions: Node Set Functions
... and 77 more: https://git.openjdk.org/jdk/compare/0599a05f8c7e26d4acae0b2cc805a65bdd6c6f67...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @jatin-bhateja) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

jatin-bhateja · 2022-07-29T05:23:30Z

src/hotspot/cpu/x86/x86.ad

+instruct ReplB_mem(vec dst, memory mem) %{
+  predicate(UseAVX >= 2);
+  match(Set dst (ReplicateB (LoadB mem)));
+  format %{ "replicateB $dst,$mem" %}
  ins_encode %{
-    InternalAddress addr = $constantaddress(T_BYTE, vreplicate_imm(T_BYTE, $con$$constant, Matcher::vector_length(this)));
-    __ load_vector($dst$$XMMRegister, addr, Matcher::vector_length_in_bytes(this));
+    int vlen_enc = vector_length_encoding(this);
+    __ vpbroadcastb($dst$$XMMRegister, $mem$$Address, vlen_enc);
  %}
  ins_pipe( pipe_slow );
 %}

 // ====================ReplicateS=======================================

-instruct ReplS_reg(vec dst, rRegI src) %{
+instruct vReplS_reg(vec dst, rRegI src) %{
+  predicate(UseAVX >= 2);


This can be folded with the pattern below by pushing the predicate into the encoding block.

Aligning the predicates of the reg and mem versions allows the ADLC parser to recognise their relationship, so that during register allocation a reg operation with a spilt operand can be substituted with its corresponding mem node. You can see in the generated code that the reg node has specific methods such as cisc_operand and cisc_version.

This may have been a misplaced comment; what I meant was to collapse patterns whose operand count and register classes match.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp

jatin-bhateja · 2022-07-29T05:25:32Z

src/hotspot/cpu/x86/x86.ad

+instruct vReplB_reg(vec dst, rRegI src) %{
+  predicate(UseAVX >= 2);
  match(Set dst (ReplicateB src));
  format %{ "replicateB $dst,$src" %}
  ins_encode %{
    uint vlen = Matcher::vector_length(this);
+    int vlen_enc = vector_length_encoding(this);
    if (vlen == 64 || VM_Version::supports_avx512vlbw()) { // AVX512VL for <512bit operands
      assert(VM_Version::supports_avx512bw(), "required"); // 512-bit byte vectors assume AVX512BW
-      int vlen_enc = vector_length_encoding(this);
      __ evpbroadcastb($dst$$XMMRegister, $src$$Register, vlen_enc);
-    } else if (VM_Version::supports_avx2()) {
-      int vlen_enc = vector_length_encoding(this);
-      __ movdl($dst$$XMMRegister, $src$$Register);
-      __ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
    } else {
      __ movdl($dst$$XMMRegister, $src$$Register);
-      __ punpcklbw($dst$$XMMRegister, $dst$$XMMRegister);
-      __ pshuflw($dst$$XMMRegister, $dst$$XMMRegister, 0x00);
-      if (vlen >= 16) {
-        __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
-        if (vlen >= 32) {
-          assert(vlen == 32, "sanity");
-          __ vinserti128_high($dst$$XMMRegister, $dst$$XMMRegister);
-        }
-      }
+      __ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
    }
  %}
  ins_pipe( pipe_slow );
 %}

-instruct ReplB_mem(vec dst, memory mem) %{
-  predicate(VM_Version::supports_avx2());
-  match(Set dst (ReplicateB (LoadB mem)));


Merge these rules and create a macro assembly routine for encoding block logic.

src/hotspot/cpu/x86/x86.ad

jatin-bhateja · 2022-07-29T07:51:04Z

src/hotspot/share/opto/machnode.cpp

  // Stretching lots of inputs - don't do it.
-  if (req() > 2) {
+  // A MachContant has the last input being the constant base
+  if (req() > (is_MachConstant() ? 3U : 2U)) {


Earlier, some nodes such as add/sub/mul/divF_imm, which carry 3 inputs, were not cloned. With this change they may be rematerialized before their uses, which may increase code size but will of course reduce interference. With the earlier cap of 2, only Replicates passed this check.

Saving a spill at the cost of rematerialization looks better when the instruction is comparatively cheap, like add/sub/mul, but for divD it may be costly.

There are other machine nodes that just accept constants as a mode, like the vround* and vcompu* nodes, which will now qualify for rematerialization, leading to high-cost instructions being emitted.

I think we should have a rough cost model here rather than basing the decision purely on the connectivity of the node. Alternatively, could you remove this change for the time being?

For a node to prefer rematerialising to spilling, it has to satisfy all of the following:

The node is not explicitly marked as expensive; divD and divF fail at this stage.

The node declaration contains only simple register rules (explicit or implicit DEF dst and USE src); vround fails this because it has a temp register, and cmpF_imm and cmpD_imm fail because they kill flags.

The method under discussion here agrees with rematerialising.

I have looked at all uses of constantaddress and found no node where accidental rematerialisation would be inefficient.
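
The three conditions above can be summarised as a toy predicate. This is a simplified model for discussion, not the HotSpot implementation: the Node class, its fields, and the flag values assigned to each example node are all hypothetical, and only the input-count limit (3 for a MachConstant, 2 otherwise) is taken from the diff hunk.

```java
// Toy model of the rematerialisation conditions; not HotSpot code.
final class RematerializeModel {
    // Hypothetical stand-in for a machine node.
    static final class Node {
        final String name;
        final boolean expensive;     // explicitly marked expensive (e.g. division)
        final boolean hasTemp;       // declares a TEMP register
        final boolean killsFlags;    // declares KILL of the flags register
        final boolean machConstant;  // last input is the constant base
        final int inputCount;        // number of required inputs

        Node(String name, boolean expensive, boolean hasTemp, boolean killsFlags,
             boolean machConstant, int inputCount) {
            this.name = name;
            this.expensive = expensive;
            this.hasTemp = hasTemp;
            this.killsFlags = killsFlags;
            this.machConstant = machConstant;
            this.inputCount = inputCount;
        }
    }

    static boolean prefersRematerialization(Node n) {
        if (n.expensive) return false;                 // condition 1
        if (n.hasTemp || n.killsFlags) return false;   // condition 2
        int maxInputs = n.machConstant ? 3 : 2;        // condition 3, limit from the hunk
        return n.inputCount <= maxInputs;
    }

    public static void main(String[] args) {
        Node[] nodes = {
            new Node("divD_imm", true, false, false, true, 3),
            new Node("vround", false, true, false, true, 3),
            new Node("cmpF_imm", false, false, true, true, 3),
            new Node("addF_imm", false, false, false, true, 3),
        };
        for (Node n : nodes) {
            System.out.println(n.name + " -> " + prefersRematerialization(n));
        }
    }
}
```

In this model only addF_imm qualifies, which matches the argument that the expensive and non-trivial nodes are filtered out before the input-count check matters.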

Thanks for your explanations, I agree.

merykitty · 2022-07-29T13:57:53Z

@jatin-bhateja Thanks a lot for your comments, I have addressed those in the last commit.
@vnkozlov Thanks very much for the review and testing.

jatin-bhateja · 2022-07-29T18:46:13Z

This may have been a misplaced comment; what I meant was to collapse patterns whose operand count and register classes match.

merykitty · 2022-07-31T15:24:01Z

Thanks for your reviews. Does this PR need another run through the tests?

vnkozlov

I submitted new testing for version 012. The last update was not simple.

vnkozlov

Testing of version 12 passed.

merykitty · 2022-08-03T02:18:50Z

/integrate

openjdk · 2022-08-03T02:20:40Z

@merykitty
Your change (at version e83ccaa) is now ready to be sponsored by a Committer.

merykitty · 2022-08-04T07:36:46Z

May I have this PR sponsored, please? Thanks a lot for your help.

jatin-bhateja · 2022-08-04T16:25:30Z

/sponsor

openjdk · 2022-08-04T16:28:15Z

Going to push as commit 92d2982.
Since your change was applied there have been 113 commits pushed to the master branch:

966ab21: 8291895: Remove PRAGMA_NONNULL_IGNORED from x86 and AArch64
aa557b9: 8288327: Executable.hasRealParameterData should not be volatile
d4a795d: 8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs
a3040fc: 8291360: Create entry points to expose low-level class file information
ce61eb6: 8290349: IP_DONTFRAGMENT doesn't set DF bit in IPv4 header
26e5c11: 4890041: Remove TAB and Shift TAB from Popup Menu in Motif Look & Feel
0bc804d: 8291762: Backout JDK-8291757 from jdk/jdk
3493973: Merge
43bb399: 8291757: Remove EA from JDK 19 version string starting with Initial RC promotion B35 on August 11, 2022
4772354: 8291825: java/time/nontestng/java/time/zone/CustomZoneNameTest.java fails if defaultLocale and defaultFormatLocale are different
... and 103 more: https://git.openjdk.org/jdk/compare/0599a05f8c7e26d4acae0b2cc805a65bdd6c6f67...master

Your commit was automatically rebased without conflicts.

openjdk · 2022-08-04T16:28:48Z

@jatin-bhateja @merykitty Pushed as commit 92d2982.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

merykitty added 7 commits March 15, 2022 19:20

initial commit

55eeadb

fix

f61c3f5

improve

89b8184

rematerialize

cc71a14

minor changes

8101d7e

fix

a48ff70

fix

0706aa5

openjdk bot added the rfr Pull request is ready for review label Mar 16, 2022

openjdk bot added the hotspot hotspot-dev@openjdk.org label Mar 16, 2022

openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 16, 2022

fix crash in sse

8216d79

fix rematerialize, constant deduplication

3bc7731

merykitty added 4 commits March 17, 2022 22:43

fix comparison

3dbc743

rematerializing input count

bb494bc

unsignness

2b1c1da

remove duplicate

63d84bd

merykitty marked this pull request as draft March 29, 2022 10:24

openjdk bot removed the rfr Pull request is ready for review label Mar 29, 2022

bridgekeeper bot closed this Jul 8, 2022

unnecessary TEMP dst

bc01c21

vnkozlov approved these changes Jul 28, 2022

openjdk bot added the ready Pull request is ready to be integrated label Jul 28, 2022

jatin-bhateja reviewed Jul 29, 2022

add load_constant_vector

e83ccaa

jatin-bhateja approved these changes Jul 29, 2022

vnkozlov reviewed Aug 2, 2022

openjdk bot removed the ready Pull request is ready to be integrated label Aug 2, 2022

vnkozlov approved these changes Aug 2, 2022

openjdk bot added the ready Pull request is ready to be integrated label Aug 2, 2022

openjdk bot added the sponsor Pull request is ready to be sponsored label Aug 3, 2022

openjdk bot added the integrated Pull request has been integrated label Aug 4, 2022

openjdk bot closed this Aug 4, 2022

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 4, 2022

merykitty deleted the improveReplicate branch August 12, 2022 19:22
