8283232: x86: Improve vector broadcast operations #7832
|
👋 Welcome back merykitty! A progress list of the required criteria for merging this PR into |
|
@merykitty The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
|
/label hotspot-compiler |
|
@merykitty |
Webrevs
|
|
Hi, forwarding results within the same bypass domain does not incur a delay; a data bypass delay occurs only when the data crosses between domains, according to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual".
The manual gives this guideline in section 3.5.2.2. Thanks. |
Thanks, I meant to refer to the text above. I have removed the incorrect reference. |
|
Running a simple benchmark with a lot of register pressure gives the following result: The constant table size decreases from 1024 bytes to 128 bytes, which is much more manageable. The throughput improvement mostly comes from the vector being rematerialized instead of being spilt to the stack. I have not been able to observe any performance gain related to bypass delay, which is expected: according to "Agner's optimisation manual on the micro architecture of Intel, AMD and VIA CPUs", Intel CPUs since Skylake seem to have only a few such delays. Thank you very much. |
|
@merykitty This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
|
@merykitty This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
|
@vnkozlov I have fixed those errors in the last commits. The second one is due to Thanks a lot. jdk/src/hotspot/share/asm/codeBuffer.hpp Line 742 in 2bd90c2
|
I agree with doing that in separate changes, and I will start new testing. |
|
Got a new failure (and testing is still running). Test compiler/c2/cr7200264/TestSSE2IntVect.java failed with |
|
It does not seem related, as this patch only takes effect after matching, so it should not change the IR graph of the compilations.
I verified that the latest failure I posted is not related to these changes. There were no other failures. Approved.
|
@merykitty This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 87 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @jatin-bhateja) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
instruct ReplB_mem(vec dst, memory mem) %{
  predicate(UseAVX >= 2);
  match(Set dst (ReplicateB (LoadB mem)));
  format %{ "replicateB $dst,$mem" %}
  ins_encode %{
    InternalAddress addr = $constantaddress(T_BYTE, vreplicate_imm(T_BYTE, $con$$constant, Matcher::vector_length(this)));
    __ load_vector($dst$$XMMRegister, addr, Matcher::vector_length_in_bytes(this));
    int vlen_enc = vector_length_encoding(this);
    __ vpbroadcastb($dst$$XMMRegister, $mem$$Address, vlen_enc);
  %}
  ins_pipe( pipe_slow );
%}

// ====================ReplicateS=======================================

instruct ReplS_reg(vec dst, rRegI src) %{
instruct vReplS_reg(vec dst, rRegI src) %{
  predicate(UseAVX >= 2);
This can be folded with the pattern below by pushing the predicate into the encoding block.
Aligning the predicates of the reg and the mem versions allows the adlc parser to recognise their relationship, so that during register allocation a reg operation with a spilt operand can be substituted with its corresponding mem node. You can see in the generated code that the reg node has specific methods such as cisc_operand and cisc_version.
This may be a misplaced comment; what I meant was to collapse patterns when the number and register class of their operands match.
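The reg/mem pairing discussed in this thread can be pictured with a small model. This is only a sketch under stated assumptions: the `Rule` struct, `find_mem_version` and `select_rule` are hypothetical names, not adlc code, and comparing predicates as plain text is a simplification of whatever equivalence check adlc really performs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an ADL instruct rule.
struct Rule {
    std::string name;       // e.g. "vReplB_reg"
    std::string predicate;  // predicate text as written in the .ad file
    std::string match_op;   // ideal operation, e.g. "ReplicateB"
    bool input_is_memory;   // true for the (LoadB mem) form
};

// Assumption: a register-form rule is paired with a memory-form rule only
// when both match the same operation under the same predicate text.
const Rule* find_mem_version(const Rule& reg, const std::vector<Rule>& rules) {
    for (const Rule& r : rules) {
        if (r.input_is_memory && r.match_op == reg.match_op &&
            r.predicate == reg.predicate) {
            return &r;
        }
    }
    return nullptr;
}

// If the input was spilt and a paired memory rule exists, use the memory
// form so the stack slot is read directly; otherwise keep the register form.
std::string select_rule(const Rule& reg, bool input_spilt,
                        const std::vector<Rule>& rules) {
    assert(!reg.input_is_memory && "expects the register form");
    if (input_spilt) {
        if (const Rule* mem = find_mem_version(reg, rules)) {
            return mem->name;
        }
    }
    return reg.name;
}
```

In this model, giving the mem rule the predicate `VM_Version::supports_avx2()` while the reg rule uses `UseAVX >= 2` means no pairing is found, which is the situation that aligning the predicates is meant to avoid.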
instruct vReplB_reg(vec dst, rRegI src) %{
  predicate(UseAVX >= 2);
  match(Set dst (ReplicateB src));
  format %{ "replicateB $dst,$src" %}
  ins_encode %{
    uint vlen = Matcher::vector_length(this);
    int vlen_enc = vector_length_encoding(this);
    if (vlen == 64 || VM_Version::supports_avx512vlbw()) { // AVX512VL for <512bit operands
      assert(VM_Version::supports_avx512bw(), "required"); // 512-bit byte vectors assume AVX512BW
      int vlen_enc = vector_length_encoding(this);
      __ evpbroadcastb($dst$$XMMRegister, $src$$Register, vlen_enc);
    } else if (VM_Version::supports_avx2()) {
      int vlen_enc = vector_length_encoding(this);
      __ movdl($dst$$XMMRegister, $src$$Register);
      __ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
    } else {
      __ movdl($dst$$XMMRegister, $src$$Register);
      __ punpcklbw($dst$$XMMRegister, $dst$$XMMRegister);
      __ pshuflw($dst$$XMMRegister, $dst$$XMMRegister, 0x00);
      if (vlen >= 16) {
        __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
        if (vlen >= 32) {
          assert(vlen == 32, "sanity");
          __ vinserti128_high($dst$$XMMRegister, $dst$$XMMRegister);
        }
      }
      __ vpbroadcastb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc);
    }
  %}
  ins_pipe( pipe_slow );
%}

instruct ReplB_mem(vec dst, memory mem) %{
  predicate(VM_Version::supports_avx2());
  match(Set dst (ReplicateB (LoadB mem)));
Merge these rules and create a macro assembler routine for the encoding block logic.
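For readers unfamiliar with the non-AVX2 fallback in the vReplB_reg excerpt quoted above, the movdl / punpcklbw / pshuflw / punpcklqdq sequence can be emulated on a 16-byte array to see why it ends up as a byte broadcast. This is an illustrative sketch, not HotSpot code: the `Xmm` alias and the helper functions are hypothetical names that model only the documented lane behaviour of the corresponding x86 instructions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Xmm = std::array<std::uint8_t, 16>;  // one 128-bit register, little-endian

// movd: low 32 bits come from the GPR, the remaining bytes are zeroed.
Xmm movd(std::uint32_t src) {
    Xmm r{};
    for (int i = 0; i < 4; ++i) r[i] = static_cast<std::uint8_t>(src >> (8 * i));
    return r;
}

// punpcklbw a, b: interleave the low 8 bytes of a and b.
Xmm punpcklbw(const Xmm& a, const Xmm& b) {
    Xmm r{};
    for (int i = 0; i < 8; ++i) { r[2 * i] = a[i]; r[2 * i + 1] = b[i]; }
    return r;
}

// pshuflw with imm8 = 0x00: word 0 is copied into the four low words,
// the high 64 bits pass through unchanged.
Xmm pshuflw_0(const Xmm& a) {
    Xmm r = a;
    for (int w = 0; w < 4; ++w) { r[2 * w] = a[0]; r[2 * w + 1] = a[1]; }
    return r;
}

// punpcklqdq a, b: low qword of a followed by low qword of b.
Xmm punpcklqdq(const Xmm& a, const Xmm& b) {
    Xmm r{};
    for (int i = 0; i < 8; ++i) { r[i] = a[i]; r[8 + i] = b[i]; }
    return r;
}

// The fallback sequence for a 16-byte vector.
Xmm broadcast_byte_sse(std::uint32_t src) {
    Xmm x = movd(src);       // [b0 b1 b2 b3 0 ...]
    x = punpcklbw(x, x);     // [b0 b0 b1 b1 b2 b2 ...]
    x = pshuflw_0(x);        // low 8 bytes are all b0
    x = punpcklqdq(x, x);    // all 16 bytes are b0
    assert(x[0] == static_cast<std::uint8_t>(src));
    return x;
}
```

Only the lowest byte of the source register survives, which is the ReplicateB semantics; with AVX2 the same result is a single vpbroadcastb.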
// Stretching lots of inputs - don't do it.
if (req() > 2) {
// A MachContant has the last input being the constant base
if (req() > (is_MachConstant() ? 3U : 2U)) {
Earlier, some nodes such as add/sub/mul/divF_imm, which carry 3 inputs, were not cloned. With this change we may see them rematerialized before their uses, which may increase code size but will of course reduce interference. With the earlier cap of 2, only Replicates passed this check.
Saving a spill at the cost of rematerialization looks better for a comparatively cheap instruction like add/sub/mul, but for divD it may be costly.
There are other machine nodes that just accept a constant as a mode, like the vround* and vcompu* nodes, which will now qualify for rematerialization, leading to high-cost instructions being emitted.
I think we should have a rough cost model here rather than basing the decision purely on the connectivity of the node. Or, for the time being, could you remove this change?
For a node to prefer rematerialising over spilling, it has to satisfy the following:
- The node is not explicitly marked as expensive; divD and divF fail at this stage.
- The node declaration contains only simple register rules (explicit or implicit DEF dst and USE src); vround fails this because it has a temp register, and cmpF_imm and cmpD_imm fail this because they kill flags.
- The method we are in agrees with rematerialising.
I have looked at all instances where constantaddress is used and found no node where accidental rematerialisation is inefficient.
Thanks for your explanations, I agree.
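The three conditions listed in this thread can be summarised as a small predicate. This is a rough sketch only: `NodeInfo` and its fields are hypothetical, the real checks live in different places in HotSpot, and the assumption that the quoted `req()` test rejects rematerialisation when it is true is taken from the "don't do it" comment rather than from code shown here.

```cpp
#include <cassert>

// Hypothetical, simplified description of a machine node.
struct NodeInfo {
    bool marked_expensive;   // e.g. divF / divD
    bool has_temp_or_kill;   // e.g. a TEMP register (vround) or a flags KILL (cmpF_imm)
    bool is_mach_constant;   // node reads from the constant table
    unsigned req;            // number of required inputs
};

// Mirrors the quoted check: a constant-table node carries one extra input,
// the constant base, so its cap is 3 instead of 2.
bool too_many_inputs(const NodeInfo& n) {
    return n.req > (n.is_mach_constant ? 3U : 2U);
}

// A node is preferred for rematerialisation only if all three conditions hold.
bool prefers_rematerialisation(const NodeInfo& n) {
    if (n.marked_expensive) return false;   // condition 1
    if (n.has_temp_or_kill) return false;   // condition 2
    return !too_many_inputs(n);             // condition 3
}
```

Under this model a broadcast from the constant table (3 inputs, no temp, not expensive) qualifies, while divF_imm and cmpF_imm are filtered out by the first two conditions before the input count is considered.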
|
@jatin-bhateja Thanks a lot for your comments, I have addressed those in the last commit. |
|
Thanks for your reviews. Does this PR need another run through the tests? |
I submitted new testing for version 012. Last update was not simple.
Testing of version 12 passed.
|
/integrate |
|
@merykitty |
|
May I have this PR sponsored, please? Thanks a lot for your help. |
|
/sponsor |
|
Going to push as commit 92d2982.
Your commit was automatically rebased without conflicts. |
|
@jatin-bhateja @merykitty Pushed as commit 92d2982. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |

Hi,
This patch improves the generation of broadcasting a scalar in several ways:
With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for long/double and 128 bytes for int/float cases.
This patch also removes some redundant code paths and renames some incorrectly named instructions.
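One way to arrive at these figures is the arithmetic below. It is a sketch under assumptions that the description does not state: that the benchmark uses 32 distinct constants and 32-byte (256-bit) vectors, and that after the patch only one scalar per constant is stored and broadcast at use. `table_bytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied in the constant table by n entries of a given size.
constexpr std::size_t table_bytes(std::size_t n_constants,
                                  std::size_t bytes_per_entry) {
    return n_constants * bytes_per_entry;
}

// Assumed, not taken from the PR: 32 constants, 256-bit vectors.
constexpr std::size_t kConstants = 32;
constexpr std::size_t kVectorBytes = 32;

// Before: every constant stored as a full replicated vector.
constexpr std::size_t before_bytes = table_bytes(kConstants, kVectorBytes);
// After: one 8-byte scalar per constant for long/double ...
constexpr std::size_t after_long_double = table_bytes(kConstants, 8);
// ... and one 4-byte scalar per constant for int/float.
constexpr std::size_t after_int_float = table_bytes(kConstants, 4);

static_assert(before_bytes == 1024, "full vectors");
static_assert(after_long_double == 256, "8-byte scalars");
static_assert(after_int_float == 128, "4-byte scalars");
```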
Thank you very much.
Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832
$ git checkout pull/7832
Update a local copy of the PR:
$ git checkout pull/7832
$ git pull https://git.openjdk.org/jdk pull/7832/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7832
View PR using the GUI difftool:
$ git pr show -t 7832
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7832.diff