@merykitty
Member

@merykitty merykitty commented Mar 16, 2022

Hi,

This patch improves the code generated for broadcasting a scalar in several ways:

  • As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size. This patch minimises that overhead for vector replicates of constants. Options are also available for constants to be generated with more alignment, so that vector loads can be performed efficiently without crossing cache lines.
  • Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
  • Vectors are loaded using the same kind of instructions (integral vs floating point) as their results, to avoid potential data bypass delays.

With this patch, the results of the added benchmark, which performs some operations under very high register pressure, on my machine with an Intel i7-7700HQ (AVX2) are as follows:

                                          Before          After
Benchmark                  Mode  Cnt   Score   Error   Score   Error  Units     Gain
SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598  38.771 ± 0.797  ns/op   +9.03%
SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464  38.603 ± 0.367  ns/op   +8.62%
SpiltReplicate.testInt     avgt    5  20.581 ± 5.791  13.755 ± 0.375  ns/op  +33.17%
SpiltReplicate.testLong    avgt    5  17.794 ± 4.781  13.663 ± 0.387  ns/op  +23.22%
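
The Gain column can be cross-checked from the two score columns. The sketch below assumes the gain is the relative reduction in ns/op, that is (before - after) / before; the class and method names are illustrative only and are not part of the patch.

```java
import java.util.Locale;

final class GainCheck {
    // Relative improvement for a "lower is better" metric such as ns/op.
    static double gainPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"testDouble", "testFloat", "testInt", "testLong"};
        // Scores copied from the table above.
        double[] before = {42.621, 42.245, 20.581, 17.794};
        double[] after = {38.771, 38.603, 13.755, 13.663};
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-10s %+.2f%%%n",
                    names[i], gainPercent(before[i], after[i]));
        }
    }
}
```

Rounded to two decimals, the computed values agree with the Gain column. The error bars on the "Before" int and long rows are large, so those two gains are less certain than the floating-point ones.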

As expected, the constant table shrinks significantly, from 1024 bytes to 256 bytes in the long/double cases and to 128 bytes in the int/float cases.

This patch also removes some redundant code paths and renames some incorrectly named instructions.

Thank you very much.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8283232: x86: Improve vector broadcast operations

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832
$ git checkout pull/7832

Update a local copy of the PR:
$ git checkout pull/7832
$ git pull https://git.openjdk.org/jdk pull/7832/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7832

View PR using the GUI difftool:
$ git pr show -t 7832

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7832.diff

@bridgekeeper

bridgekeeper bot commented Mar 16, 2022

👋 Welcome back merykitty! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 16, 2022
@openjdk

openjdk bot commented Mar 16, 2022

@merykitty The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Mar 16, 2022
@merykitty
Member Author

/label hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 16, 2022
@openjdk

openjdk bot commented Mar 16, 2022

@merykitty
The hotspot-compiler label was successfully added.

@mlbridge

mlbridge bot commented Mar 16, 2022

@merykitty
Member Author

Hi, forwarding results within the same bypass domain does not result in a delay. Data bypass delay happens when the data crosses different domains, according to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow.

The manual gives the guideline in section 3.5.2.2:

image

Thanks.

@jatin-bhateja
Member

Hi, forwarding results within the same bypass domain does not result in delay, data bypass delay happens when the data crosses different domains, according to "Intel® 64 and IA-32 Architectures Optimization Reference Manual"

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow.

The manual mentions the guideline at section 3.5.2.2
image
Thanks.

Thanks, I meant to refer to the above text. I have removed the incorrect reference.

It would still be good if we could come up with a microbenchmark that shows the gain from the patch.

@merykitty
Member Author

Doing a simple benchmark that has a lot of register pressure

@Benchmark
public long broadcastCon() {
    var species = IntVector.SPECIES_PREFERRED;
    var sum = IntVector.zero(species);
    return sum.add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
            .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
            .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
            .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
            .add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
            .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
            .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
            .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
            .reinterpretAsLongs()
            .lane(0);
}

provides the following result:

Before:
Benchmark                     Mode  Cnt   Score   Error  Units
VectorReplicate.broadcastCon  avgt    5  16.417 ± 0.515  ns/op

After:
Benchmark                     Mode  Cnt   Score   Error  Units
VectorReplicate.broadcastCon  avgt    5  13.851 ± 0.154  ns/op

The constant table size decreases from 1024 bytes to 128 bytes, which is much more manageable. The throughput improvement mostly comes from the vector being rematerialized instead of being spilt on the stack.
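
The two sizes quoted above follow from simple arithmetic. The sketch below assumes that the benchmark's 32 distinct constants (1 to 32) each get one constant table entry, that a full vector is 32 bytes (256-bit AVX2), and that padding and alignment are ignored; the class and method names are illustrative only.

```java
final class ConstantTableSize {
    // Bytes needed when each distinct constant is stored as a full, already-broadcast vector.
    static int fullVectorBytes(int distinctConstants, int vectorBytes) {
        return distinctConstants * vectorBytes;
    }

    // Bytes needed when only the scalar is stored and the broadcast happens at load time.
    static int scalarBytes(int distinctConstants, int elementBytes) {
        return distinctConstants * elementBytes;
    }

    public static void main(String[] args) {
        int distinct = 32;      // constants 1..32 in broadcastCon
        int vectorBytes = 32;   // assumption: 256-bit vectors
        System.out.println(fullVectorBytes(distinct, vectorBytes));   // 1024
        System.out.println(scalarBytes(distinct, Integer.BYTES));     // 128
        System.out.println(scalarBytes(distinct, Long.BYTES));        // 256
    }
}
```

Under the same assumptions, 8-byte elements give 256 bytes, which is consistent with the long/double figure in the pull request description.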

I have not been able to observe a performance gain related to bypass delay. This is expected: according to Agner's optimisation manual on the microarchitecture of Intel, AMD and VIA CPUs, Intel CPUs since Skylake seem to have only a few such delays.

Thank you very much.

@merykitty merykitty marked this pull request as draft March 29, 2022 10:24
@openjdk openjdk bot removed the rfr Pull request is ready for review label Mar 29, 2022
@bridgekeeper

bridgekeeper bot commented May 13, 2022

@merykitty This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Jul 8, 2022

@merykitty This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Jul 8, 2022
@merykitty
Member Author

@vnkozlov I have fixed those errors in the last commits. The second one is due to pshufb being supported only on SSSE3 machines. The first one occurs because the constant table itself is not sufficiently aligned: currently it is only aligned to 8 bytes. I chose to avoid the problem by only emitting constants that require at most 8 bytes of alignment, as this patch has already touched many areas. A proper solution would belong in a separate issue. What do you think?

Thanks a lot.

inline int CodeSection::alignment(int section) {
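
To illustrate why an 8-byte aligned table cannot hold constants that need more alignment, here is a minimal sketch. The base address, offsets, and helper names are hypothetical and do not come from HotSpot; the point is only that aligning an offset inside the table does not align the absolute address when the table base itself is less aligned.

```java
final class AlignmentSketch {
    // Round addr up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long addr, long alignment) {
        return (addr + alignment - 1) & -alignment;
    }

    static boolean isAligned(long addr, long alignment) {
        return (addr & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        long tableBase = 0x1008;            // hypothetical base: 8-byte aligned, not 32-byte aligned
        long offset = alignUp(40, 32);      // offset 64: 32-byte aligned relative to the base
        long absolute = tableBase + offset; // 0x1048

        System.out.println(isAligned(offset, 32));   // true
        System.out.println(isAligned(absolute, 32)); // false
        System.out.println(isAligned(absolute, 8));  // true
    }
}
```

Limiting constants to at most 8 bytes of alignment, as described above, sidesteps this mismatch without changing how the table itself is placed.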

@vnkozlov
Contributor

@vnkozlov I have fixed those errors in the last commits. The second one is due to pshufb being supported only on ssse3 machines. And the first one is because the constant table itself is not aligned enough given that currently, it is only aligned at 8 bytes. I chose to avoid the problem and only emit constants requiring at most 8 bytes of alignment as this patch has already touched many areas. A proper solution would be in a separate issue. What do you think?

I agree with doing it in a separate change. I will start new testing.

@vnkozlov
Contributor

Got a new failure (testing is still running). Test compiler/c2/cr7200264/TestSSE2IntVect.java failed with -Xcomp:

java.lang.RuntimeException: Unexpected SubVI number: expected 2 >= 4
	at jdk.test.lib.Asserts.fail(Asserts.java:594)
	at jdk.test.lib.Asserts.assertGreaterThanOrEqual(Asserts.java:288)
	at jdk.test.lib.Asserts.assertGTE(Asserts.java:259)
	at compiler.c2.cr7200264.TestDriver.verifyVectorizationNumber(TestDriver.java:65)
	at compiler.c2.cr7200264.TestDriver.run(TestDriver.java:43)
	at compiler.c2.cr7200264.TestSSE2IntVect.main(TestSSE2IntVect.java:48)

@merykitty
Member Author

It does not seem related: this patch only has effects after matching, so it should not change the IR graph of the compilations.

Contributor

@vnkozlov vnkozlov left a comment

I verified that the latest failure I posted is not related to these changes. There were no other failures. Approved.

@openjdk

openjdk bot commented Jul 28, 2022

@merykitty This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283232: x86: Improve vector broadcast operations

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 87 new commits pushed to the master branch:

  • 0ae8341: 8290908: misc tests fail: assert(!thread->owns_locks()) failed: must release all locks when leaving VM
  • 5acf2d7: 8291578: Remove JMX related tests from ProblemList-svc-vthreads.txt
  • a6564d4: 8291650: Add delay to ClassUnloadEventTest before exiting to give time for JVM to send all events before VMDeath
  • af76c0c: 8291654: AArch64: assert from JDK-8287393 causes crashes
  • a9db5bb: 8291626: Remove Mutex::contains as it is unused
  • a2cff26: 8291597: [BACKOUT] JDK-8289996: Fix array range check hoisting for some scaled loop iv
  • 554f44e: 8282730: LdapLoginModule throw NPE from logout method after login failure
  • f714ac5: 8290718: Remove ALLOCATION_SUPER_CLASS_SPEC
  • 6cbc234: 8287393: AArch64: Remove trampoline_call1
  • 57bf603: 8289948: Improve test coverage for XPath functions: Node Set Functions
  • ... and 77 more: https://git.openjdk.org/jdk/compare/0599a05f8c7e26d4acae0b2cc805a65bdd6c6f67...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @jatin-bhateja) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 28, 2022
Comment on lines +4145 to +4159
instruct ReplB_mem(vec dst, memory mem) %{
predicate(UseAVX >= 2);
match(Set dst (ReplicateB (LoadB mem)));
format %{ "replicateB $dst,$mem" %}
ins_encode %{
InternalAddress addr = $constantaddress(T_BYTE, vreplicate_imm(T_BYTE, $con$$constant, Matcher::vector_length(this)));
__ load_vector($dst$$XMMRegister, addr, Matcher::vector_length_in_bytes(this));
int vlen_enc = vector_length_encoding(this);
__ vpbroadcastb($dst$$XMMRegister, $mem$$Address, vlen_enc);
%}
ins_pipe( pipe_slow );
%}

// ====================ReplicateS=======================================

instruct ReplS_reg(vec dst, rRegI src) %{
instruct vReplS_reg(vec dst, rRegI src) %{
predicate(UseAVX >= 2);
Member

This can be folded into the pattern below by pushing the predicate into the encoding block.

Member Author

@merykitty merykitty Jul 29, 2022

Aligning the predicates of the reg and the mem versions allows the adlc parser to recognise their relationship, so that during register allocation a reg operation with a spilt operand can be substituted with its corresponding mem node. You can see in the generated code that the reg node has specific methods such as cisc_operand and cisc_version.
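
A rough model of that pairing, as a sketch only: the Rule class and both methods are hypothetical simplifications, not adlc types, and the predicate comparison is modelled as plain string equality. The rule names and predicate strings are taken from the snippets in this review.

```java
final class CiscSpillSketch {
    // Hypothetical, simplified view of an instruct rule; not an adlc type.
    static final class Rule {
        final String name;
        final String predicate;
        final boolean memoryOperand;

        Rule(String name, String predicate, boolean memoryOperand) {
            this.name = name;
            this.predicate = predicate;
            this.memoryOperand = memoryOperand;
        }
    }

    // Assumption: a reg rule and a mem rule are paired only when their predicates are identical.
    static boolean canPair(Rule reg, Rule mem) {
        return !reg.memoryOperand && mem.memoryOperand && reg.predicate.equals(mem.predicate);
    }

    // If the input operand was spilt and a paired mem rule exists, use it instead of reloading.
    static String select(Rule reg, Rule mem, boolean operandSpilled) {
        return operandSpilled && canPair(reg, mem) ? mem.name : reg.name;
    }

    public static void main(String[] args) {
        Rule reg = new Rule("vReplB_reg", "UseAVX >= 2", false);
        Rule memOld = new Rule("ReplB_mem", "VM_Version::supports_avx2()", true);
        Rule memNew = new Rule("ReplB_mem", "UseAVX >= 2", true);

        System.out.println(select(reg, memOld, true));  // vReplB_reg (predicates differ, no pairing)
        System.out.println(select(reg, memNew, true));  // ReplB_mem
        System.out.println(select(reg, memNew, false)); // vReplB_reg (nothing was spilt)
    }
}
```

In this model the two predicates describe a similar hardware requirement, yet only the textually aligned form lets the pairing happen, which is the motivation given in the comment above.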

Member

That may have been a misplaced comment. What I meant was to collapse patterns if the number and register class of the operands match.

Comment on lines +4110 to -4141
instruct vReplB_reg(vec dst, rRegI src) %{
predicate(UseAVX >= 2);
match(Set dst (ReplicateB src));
format %{ "replicateB $dst,$src" %}
ins_encode %{
uint vlen = Matcher::vector_length(this);
int vlen_enc = vector_length_encoding(this);
if (vlen == 64 || VM_Version::supports_avx512vlbw()) { // AVX512VL for <512bit operands
assert(VM_Version::supports_avx512bw(), "required"); // 512-bit byte vectors assume AVX512BW
int vlen_enc = vector_length_encoding(this);
__ evpbroadcastb($dst$$XMMRegister, $src$$Register, vlen_enc);
} else if (VM_Version::supports_avx2()) {
int vlen_enc = vector_length_encoding(this);
__ movdl($dst$$XMMRegister, $src$$Register);
__ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
} else {
__ movdl($dst$$XMMRegister, $src$$Register);
__ punpcklbw($dst$$XMMRegister, $dst$$XMMRegister);
__ pshuflw($dst$$XMMRegister, $dst$$XMMRegister, 0x00);
if (vlen >= 16) {
__ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
if (vlen >= 32) {
assert(vlen == 32, "sanity");
__ vinserti128_high($dst$$XMMRegister, $dst$$XMMRegister);
}
}
__ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
}
%}
ins_pipe( pipe_slow );
%}

instruct ReplB_mem(vec dst, memory mem) %{
predicate(VM_Version::supports_avx2());
match(Set dst (ReplicateB (LoadB mem)));
Member

Merge these rules and create a macro assembly routine for encoding block logic.

// Stretching lots of inputs - don't do it.
if (req() > 2) {
// A MachContant has the last input being the constant base
if (req() > (is_MachConstant() ? 3U : 2U)) {
Member

Earlier, some nodes like add/sub/mul/divF_imm, which carry 3 inputs, were not getting cloned. With this change we may see them rematerialized before their uses, which may increase code size but will of course reduce interference. With the earlier cap of 2, only Replicates were passing this check.

Member

Saving a spill at the cost of rematerialization using a comparatively cheap instruction like add/sub/mul looks better, but for divD it may be costly.

Member

There are other machine nodes which just accept constants as a mode, like the vround* and vcompu* nodes, which will now qualify for rematerialization, leading to high-cost instructions being emitted.

Member

I think we should have a rough cost model here rather than basing the decision purely on the connectivity of the node. Or, for the time being, could you remove this change?

Member Author

@merykitty merykitty Jul 29, 2022

For a node to prefer rematerialising to spilling, all of the following must hold:

  • The node is not explicitly marked as expensive. divD and divF fail at this stage.
  • The node declaration only contains simple register rules (explicit or implicit DEF dst and USE src). vround fails this because it has a temp register; cmpF_imm and cmpD_imm fail because they kill flags.
  • This method agrees with rematerialising.

I have looked at all instances where constantaddress is used and found no node where accidental rematerialisation is inefficient.
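
The three conditions can be read as a chain of filters. The sketch below is a hypothetical, simplified model, not HotSpot code: NodeModel, its fields, and most input counts are illustrative assumptions. Only the input cap (3 for nodes whose last input is the constant base, 2 otherwise) and the named pass/fail cases are taken from this thread.

```java
final class RematSketch {
    // Hypothetical, simplified stand-in for a machine node; not a HotSpot type.
    static final class NodeModel {
        final String name;
        final boolean expensive;        // explicitly marked as expensive
        final boolean hasTempOrKill;    // declaration has a temp register or kills flags
        final boolean usesConstantBase; // last input is the constant table base
        final int inputCount;           // illustrative number of inputs

        NodeModel(String name, boolean expensive, boolean hasTempOrKill,
                  boolean usesConstantBase, int inputCount) {
            this.name = name;
            this.expensive = expensive;
            this.hasTempOrKill = hasTempOrKill;
            this.usesConstantBase = usesConstantBase;
            this.inputCount = inputCount;
        }
    }

    static boolean prefersRematerialising(NodeModel n) {
        if (n.expensive) return false;         // condition 1
        if (n.hasTempOrKill) return false;     // condition 2
        int cap = n.usesConstantBase ? 3 : 2;  // condition 3: cap on the number of inputs
        return n.inputCount <= cap;
    }

    public static void main(String[] args) {
        NodeModel[] nodes = {
            new NodeModel("replicate", false, false, false, 2),
            new NodeModel("addF_imm", false, false, true, 3),
            new NodeModel("divD_imm", true, false, true, 3),
            new NodeModel("vround", false, true, true, 3),
            new NodeModel("cmpF_imm", false, true, true, 3),
        };
        for (NodeModel n : nodes) {
            System.out.println(n.name + " -> " + prefersRematerialising(n));
        }
    }
}
```

In this model only the replicate and addF_imm entries pass, which matches the cases described above: the division is rejected as expensive, and vround and cmpF_imm are rejected for their temp register and flag kill respectively.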

Member

Thanks for your explanations, I agree.

@merykitty
Member Author

@jatin-bhateja Thanks a lot for your comments, I have addressed those in the last commit.
@vnkozlov Thanks very much for the review and testing.

@merykitty
Member Author

Thanks for your reviews. Does this PR need another run through the tests?

Contributor

@vnkozlov vnkozlov left a comment

I submitted new testing for version 012. The last update was not simple.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 2, 2022
Contributor

@vnkozlov vnkozlov left a comment

Testing of version 12 passed.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 2, 2022
@merykitty
Member Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Aug 3, 2022
@openjdk

openjdk bot commented Aug 3, 2022

@merykitty
Your change (at version e83ccaa) is now ready to be sponsored by a Committer.

@merykitty
Member Author

May I have this PR sponsored, please? Thanks a lot for your help.

@jatin-bhateja
Member

/sponsor

@openjdk

openjdk bot commented Aug 4, 2022

Going to push as commit 92d2982.
Since your change was applied there have been 113 commits pushed to the master branch:

  • 966ab21: 8291895: Remove PRAGMA_NONNULL_IGNORED from x86 and AArch64
  • aa557b9: 8288327: Executable.hasRealParameterData should not be volatile
  • d4a795d: 8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs
  • a3040fc: 8291360: Create entry points to expose low-level class file information
  • ce61eb6: 8290349: IP_DONTFRAGMENT doesn't set DF bit in IPv4 header
  • 26e5c11: 4890041: Remove TAB and Shift TAB from Popup Menu in Motif Look & Feel
  • 0bc804d: 8291762: Backout JDK-8291757 from jdk/jdk
  • 3493973: Merge
  • 43bb399: 8291757: Remove EA from JDK 19 version string starting with Initial RC promotion B35 on August 11, 2022
  • 4772354: 8291825: java/time/nontestng/java/time/zone/CustomZoneNameTest.java fails if defaultLocale and defaultFormatLocale are different
  • ... and 103 more: https://git.openjdk.org/jdk/compare/0599a05f8c7e26d4acae0b2cc805a65bdd6c6f67...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 4, 2022
@openjdk openjdk bot closed this Aug 4, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 4, 2022
@openjdk

openjdk bot commented Aug 4, 2022

@jatin-bhateja @merykitty Pushed as commit 92d2982.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@merykitty merykitty deleted the improveReplicate branch August 12, 2022 19:22