8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble #3256

Closed
sviswa7 wants to merge 6 commits


@sviswa7 sviswa7 commented Mar 30, 2021

In the java.lang.Math JMH suite at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance of the sqrt benchmark can be improved. Thanks a lot to Eric Caspole for finding this issue.

Benchmark:
@Benchmark
public double sqrtDouble() {
    return Math.sqrt(double4Dot1);
}

Code currently generated by the C2 JIT (Linux/AT&T syntax):
vsqrtsd 0x50(%r10),%xmm0,%xmm0

The vsqrtsd instruction operation is specified as below:
VSQRTSD (VEX.128 encoded version)
DEST[63:0] := SQRT(SRC2[63:0])
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

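Bits 127:64 of the destination are copied from the previous contents of xmm0 (SRC1 is xmm0 itself here). Because the C2 JIT does not initialize the destination register xmm0 before use, the instruction depends on whatever last wrote xmm0, which causes a stall and lowers performance.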

By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.

Code generated after patch:
vxorpd %xmm0,%xmm0,%xmm0
vsqrtsd 0x50(%r10),%xmm0,%xmm0
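
The merge behaviour described by the VSQRTSD specification can be sketched as a toy Java model. This is only an illustration, not HotSpot code: the class `VsqrtsdModel`, its methods, and the sample register contents are hypothetical names and values chosen for this sketch. It models an XMM register as two 64-bit lanes and shows that the high lane of the result comes from the old register contents unless the register is zeroed first.

```java
/** Toy model of the VSQRTSD merge semantics; hypothetical illustration, not HotSpot code. */
final class VsqrtsdModel {

    // A 128-bit XMM register is modeled as two 64-bit lanes:
    // index 0 holds bits 63:0, index 1 holds bits 127:64.

    /** VEX.128 VSQRTSD: low lane is sqrt(src2), high lane is copied from src1. */
    static long[] vsqrtsd(long[] src1, double src2) {
        long low = Double.doubleToRawLongBits(Math.sqrt(src2)); // DEST[63:0]   := SQRT(SRC2[63:0])
        long high = src1[1];                                    // DEST[127:64] := SRC1[127:64]
        return new long[] { low, high };
    }

    /** Lane-wise XOR; XOR of a register with itself always yields zero. */
    static long[] vxorpd(long[] a, long[] b) {
        return new long[] { a[0] ^ b[0], a[1] ^ b[1] };
    }

    public static void main(String[] args) {
        // Arbitrary stale contents left in xmm0 by earlier code (made-up values).
        long[] xmm0 = { 0x1234L, 0xDEADBEEFL };

        // Before the patch: the high lane of the result is the stale high lane of xmm0.
        long[] before = vsqrtsd(xmm0, 4.0);
        System.out.printf("before: low=%s high=0x%X%n",
                Double.longBitsToDouble(before[0]), before[1]);

        // After the patch: xmm0 is zeroed first, so no stale value reaches the result.
        long[] after = vsqrtsd(vxorpd(xmm0, xmm0), 4.0);
        System.out.printf("after:  low=%s high=0x%X%n",
                Double.longBitsToDouble(after[0]), after[1]);
    }
}
```

The model captures only the data flow. The performance effect itself comes from the hardware having to wait for the previous writer of xmm0, which a software model cannot show.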

Performance before patch:
Benchmark             Mode   Cnt       Score      Error  Units
MathBench.sqrtDouble  thrpt    8  193612.396 ±   95.807  ops/ms

Performance after patch:
Benchmark             Mode   Cnt       Score      Error  Units
MathBench.sqrtDouble  thrpt    8  276388.024 ±  846.372  ops/ms
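
For a quick sense of scale, the two reported scores can be turned into a speedup ratio. The snippet below is a small sketch using only the numbers quoted above; the class name `SqrtSpeedup` is hypothetical. The ratio works out to roughly 1.43x.

```java
/** Computes the throughput ratio from the two scores quoted in the PR description. */
final class SqrtSpeedup {

    /** Ratio of after-patch to before-patch throughput (both in ops/ms). */
    static double speedup(double beforeOpsPerMs, double afterOpsPerMs) {
        return afterOpsPerMs / beforeOpsPerMs;
    }

    public static void main(String[] args) {
        double before = 193612.396; // MathBench.sqrtDouble before patch, ops/ms
        double after = 276388.024;  // MathBench.sqrtDouble after patch, ops/ms

        double ratio = speedup(before, after);
        System.out.printf("speedup: %.3fx (%.1f%% more ops/ms)%n",
                ratio, (ratio - 1.0) * 100.0);
    }
}
```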

Best Regards,
Sandhya


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Reviewers

Contributors

  • Eric Caspole <ecaspole@openjdk.org>
  • Charlie Hunt <huntch@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256
$ git checkout pull/3256

Update a local copy of the PR:
$ git checkout pull/3256
$ git pull https://git.openjdk.java.net/jdk pull/3256/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3256

View PR using the GUI difftool:
$ git pr show -t 3256

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3256.diff

@bridgekeeper bridgekeeper bot commented Mar 30, 2021

👋 Welcome back sviswanathan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot commented Mar 30, 2021

@sviswa7 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@sviswa7 sviswa7 marked this pull request as ready for review Mar 30, 2021
@openjdk openjdk bot added the rfr label Mar 30, 2021

@mlbridge mlbridge bot commented Mar 30, 2021

@neliasso neliasso left a comment

In the bug report you mention new microbenchmark test cases. Have you already contributed them, or would you like to add them to this PR?

@ericcaspole ericcaspole commented Mar 30, 2021

I added these micros to the https://github.com/openjdk/jmh-jdk-microbenchmarks repository a couple of weeks ago, before we knew about this problem, because it is easy to work on JMH with Maven. But it would be best to contribute these micros to the JDK repo at the same time as this patch. I will get them together for the JDK repo and connect with Sandhya.

@vnkozlov vnkozlov left a comment

Which is faster: xor(xmm) + sqrt(xmm, mem) or mov(xmm, mem) + sqrt(xmm, xmm)?

Are there other instructions that may also stall if the dst register is not zeroed?

@@ -3247,10 +3249,11 @@ instruct sqrtF_reg(regF dst, regF src) %{
instruct sqrtF_mem(regF dst, memory src) %{
  predicate(UseSSE>=1);
  match(Set dst (SqrtF (LoadF src)));
  effect(TEMP dst);

iwanowww Mar 30, 2021

Why do you declare dst as TEMP (here and in other places)?

sviswa7 Mar 30, 2021

Good point. I was being extra cautious to make sure that a register used in the address is not overwritten by the xorps. But of course the address cannot use an XMM register here, so TEMP dst is not needed. I will update the patch accordingly.

@sviswa7 sviswa7 commented Mar 30, 2021

@vnkozlov Both xor(xmm) + sqrt(xmm, mem) and mov(xmm, mem) + sqrt(xmm, xmm) have the same performance. I could look into simplifying the patch by keeping only the register version of these rules.

I looked through x86.ad for other usages of unary instructions but didn't find any other instances. We implement negate and abs using binary instructions.

@sviswa7 sviswa7 commented Mar 31, 2021

/contributor add @ericcaspole

@openjdk openjdk bot removed the rfr label Mar 31, 2021

@openjdk openjdk bot commented Mar 31, 2021

@sviswa7
Contributor Eric Caspole <ecaspole@openjdk.org> successfully added.

@sviswa7 sviswa7 commented Mar 31, 2021

/contributor add @huntch

@openjdk openjdk bot commented Mar 31, 2021

@sviswa7 Could not parse @huntch as a valid contributor.
Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>
@openjdk openjdk bot added the rfr label Mar 31, 2021

@sviswa7 sviswa7 commented Mar 31, 2021

/contributor add huntch

@openjdk openjdk bot commented Mar 31, 2021

@sviswa7
Contributor Charlie Hunt <huntch@openjdk.org> successfully added.

@sviswa7 sviswa7 commented Mar 31, 2021

@neliasso @vnkozlov @iwanowww All review comments have been addressed.
The MathBench.java and StrictMathBench.java contribution is from Eric Caspole and Charlie Hunt.
Please let me know if any other changes are needed.

@vnkozlov vnkozlov left a comment

Nice solution - let C2 generate the best value loading instruction. I assume the improvement is the same with this update.

Thank you for checking other instructions.

@openjdk openjdk bot commented Mar 31, 2021

@sviswa7 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Co-authored-by: Eric Caspole <ecaspole@openjdk.org>
Co-authored-by: Charlie Hunt <huntch@openjdk.org>
Reviewed-by: neliasso, kvn, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 130 new commits pushed to the master branch:

  • cb70ab0: 8263235: sanity/client/SwingSet/src/ColorChooserDemoTest.java failed throwing java.lang.NoClassDefFoundError
  • e2ec997: 8263551: Provide shared lock-free FIFO queue implementation
  • dec3447: 8264346: nullptr_t undefined in global namespace for clang+libstdc++
  • 0fa3572: 8264489: Add more logging to LargeCopyWithMark.java
  • f43d14a: 8264396: Use the blessed modifier order in jdk.internal.jvmstat
  • 6225ae6: 8264466: Cut-paste error in InterfaceCalls JMH
  • 40c3249: 8264149: BreakpointInfo::set allocates metaspace object in VM thread
  • 999c134: 8264417: ParallelCompactData::region_offset should not accept pointers outside the current region
  • 604b14c: 8264112: (fs) Reorder methods/constructor/fields in UnixUserDefinedFileAttributeView.java
  • 9061271: 8261957: [PPC64] Support for Concurrent Thread-Stack Processing
  • ... and 120 more: https://git.openjdk.java.net/jdk/compare/47ef038977cf02ccfd52c283cba755c9ae6f444b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Mar 31, 2021

@sviswa7 sviswa7 commented Mar 31, 2021

@vnkozlov Yes, the performance improvement is the same with the updated patch.

@iwanowww iwanowww left a comment

Looks good.

@@ -3232,73 +3232,22 @@ instruct negD_reg_reg(vlRegD dst, vlRegD src) %{
ins_pipe(pipe_slow);
%}

instruct sqrtF_reg(regF dst, regF src) %{

iwanowww Mar 31, 2021

It would be helpful to have a comment describing why only the reg-to-reg variants are kept for SqrtF/SqrtD.

sviswa7 Mar 31, 2021

Done, added comments for the sqrt rules.

@neliasso neliasso left a comment

Thanks for adding the test.

Looks good.

sviswa7 added 2 commits Mar 31, 2021

@sviswa7 sviswa7 commented Apr 1, 2021

/integrate

@openjdk openjdk bot closed this Apr 1, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Apr 1, 2021

@openjdk openjdk bot commented Apr 1, 2021

@sviswa7 Since your change was applied there have been 131 commits pushed to the master branch:

  • 16acfaf: 8012229: [lcms] Improve performance of color conversion for images with alpha channel
  • cb70ab0: 8263235: sanity/client/SwingSet/src/ColorChooserDemoTest.java failed throwing java.lang.NoClassDefFoundError
  • e2ec997: 8263551: Provide shared lock-free FIFO queue implementation
  • dec3447: 8264346: nullptr_t undefined in global namespace for clang+libstdc++
  • 0fa3572: 8264489: Add more logging to LargeCopyWithMark.java
  • f43d14a: 8264396: Use the blessed modifier order in jdk.internal.jvmstat
  • 6225ae6: 8264466: Cut-paste error in InterfaceCalls JMH
  • 40c3249: 8264149: BreakpointInfo::set allocates metaspace object in VM thread
  • 999c134: 8264417: ParallelCompactData::region_offset should not accept pointers outside the current region
  • 604b14c: 8264112: (fs) Reorder methods/constructor/fields in UnixUserDefinedFileAttributeView.java
  • ... and 121 more: https://git.openjdk.java.net/jdk/compare/47ef038977cf02ccfd52c283cba755c9ae6f444b...master

Your commit was automatically rebased without conflicts.

Pushed as commit 52d8a22.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
