
8259498: Reduce overhead of MD5 and SHA digests #1855

Closed
wants to merge 23 commits into from

Conversation

@cl4es cl4es commented Dec 20, 2020

  • The MD5 intrinsics added by JDK-8250902 show that the int[] x state isn't actually needed. This also applies to the SHA intrinsics, from which the MD5 intrinsic takes inspiration
  • Using VarHandles we can simplify the code in ByteArrayAccess enough to make it acceptable to use inline and replace the array in MD5 wholesale. This improves performance both in the presence and the absence of the intrinsic optimization.
  • Doing the exact same thing in the SHA impls would be unwieldy (64+ element arrays), but allocating the array lazily gets most of the speed-up in the presence of an intrinsic while being neutral in its absence.
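The VarHandle-based access described above can be sketched roughly as follows (the class and method names here are illustrative, not the actual ByteArrayAccess code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LeIntRead {
    // Little-endian int view over a byte[], the kind of access MD5's
    // block digestion needs when reading the message words directly.
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int getIntLE(byte[] buf, int ofs) {
        // The VarHandle performs its own bounds checking on buf and ofs
        return (int) INT_LE.get(buf, ofs);
    }

    public static void main(String[] args) {
        byte[] buf = {0x78, 0x56, 0x34, 0x12};
        System.out.println(Integer.toHexString(getIntLE(buf, 0))); // prints "12345678"
    }
}
```

Because the JIT inlines and optimizes these view VarHandles aggressively, this style can replace Unsafe-based helpers without an extra copy into a temporary int[].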

Baseline:

Benchmark                                 (digesterName)  (length)    Cnt     Score      Error   Units
MessageDigests.digest                                MD5        16     15  2714.307 ±   21.133  ops/ms
MessageDigests.digest                                MD5      1024     15   318.087 ±    0.637  ops/ms
MessageDigests.digest                              SHA-1        16     15  1387.266 ±   40.932  ops/ms
MessageDigests.digest                              SHA-1      1024     15   109.273 ±    0.149  ops/ms
MessageDigests.digest                            SHA-256        16     15   995.566 ±   21.186  ops/ms
MessageDigests.digest                            SHA-256      1024     15    89.104 ±    0.079  ops/ms
MessageDigests.digest                            SHA-512        16     15   803.030 ±   15.722  ops/ms
MessageDigests.digest                            SHA-512      1024     15   115.611 ±    0.234  ops/ms
MessageDigests.getAndDigest                          MD5        16     15  2190.367 ±   97.037  ops/ms
MessageDigests.getAndDigest                          MD5      1024     15   302.903 ±    1.809  ops/ms
MessageDigests.getAndDigest                        SHA-1        16     15  1262.656 ±   43.751  ops/ms
MessageDigests.getAndDigest                        SHA-1      1024     15   104.889 ±    3.554  ops/ms
MessageDigests.getAndDigest                      SHA-256        16     15   914.541 ±   55.621  ops/ms
MessageDigests.getAndDigest                      SHA-256      1024     15    85.708 ±    1.394  ops/ms
MessageDigests.getAndDigest                      SHA-512        16     15   737.719 ±   53.671  ops/ms
MessageDigests.getAndDigest                      SHA-512      1024     15   112.307 ±    1.950  ops/ms

GC:
MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16     15   312.011 ±    0.005    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16     15   584.020 ±    0.006    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16     15   544.019 ±    0.016    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16     15  1056.037 ±    0.003    B/op

Target:

Benchmark                                 (digesterName)  (length)    Cnt     Score      Error   Units
MessageDigests.digest                                MD5        16     15  3134.462 ±   43.685  ops/ms
MessageDigests.digest                                MD5      1024     15   323.667 ±    0.633  ops/ms
MessageDigests.digest                              SHA-1        16     15  1418.742 ±   38.223  ops/ms
MessageDigests.digest                              SHA-1      1024     15   110.178 ±    0.788  ops/ms
MessageDigests.digest                            SHA-256        16     15  1037.949 ±   21.214  ops/ms
MessageDigests.digest                            SHA-256      1024     15    89.671 ±    0.228  ops/ms
MessageDigests.digest                            SHA-512        16     15   812.028 ±   39.489  ops/ms
MessageDigests.digest                            SHA-512      1024     15   116.738 ±    0.249  ops/ms
MessageDigests.getAndDigest                          MD5        16     15  2314.379 ±  229.294  ops/ms
MessageDigests.getAndDigest                          MD5      1024     15   307.835 ±    5.730  ops/ms
MessageDigests.getAndDigest                        SHA-1        16     15  1326.887 ±   63.263  ops/ms
MessageDigests.getAndDigest                        SHA-1      1024     15   106.611 ±    2.292  ops/ms
MessageDigests.getAndDigest                      SHA-256        16     15   961.589 ±   82.052  ops/ms
MessageDigests.getAndDigest                      SHA-256      1024     15    88.646 ±    0.194  ops/ms
MessageDigests.getAndDigest                      SHA-512        16     15   775.417 ±   56.775  ops/ms
MessageDigests.getAndDigest                      SHA-512      1024     15   112.904 ±    2.014  ops/ms

GC
MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16     15   232.009 ±    0.006    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16     15   584.021 ±    0.001    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16     15   272.012 ±    0.015    B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16     15   400.017 ±    0.019    B/op

For the digest micro, digesting small inputs is faster with all algorithms, ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not allocating and reading into a temporary buffer outside of the intrinsic. SHA-1 does not see a statistically significant gain because the intrinsic is disabled by default on my HW.

For the getAndDigest micro - which tests MessageDigest.getInstance(..).digest(..) - there are similar gains with this patch. The interesting aspect here is verifying the reduction in allocations per operation when there's an active intrinsic (again, not for SHA-1). JDK-8259065 (#1933) reduced allocations on each of these by 144 B/op, which means allocation pressure for SHA-512 is down two thirds, from 1200 B/op to 400 B/op, in this contrived test.

I've verified there are no regressions in the absence of the intrinsic - which the SHA-1 numbers here help show.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1855/head:pull/1855
$ git checkout pull/1855

@bridgekeeper bridgekeeper bot commented Dec 20, 2020

👋 Welcome back redestad! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot commented Dec 20, 2020

@cl4es The following labels will be automatically applied to this pull request:

  • core-libs
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@DellCliff DellCliff commented Jan 5, 2021

Since java.util.UUID and sun.security.provider.MD5 are both in java.base, would it make sense to create new instances by calling new MD5() instead of java.security.MessageDigest.getInstance("MD5") and bypassing the whole MessageDigest logic?
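For context, the public path in question can be sketched as follows: UUID.nameUUIDFromBytes goes through MessageDigest.getInstance("MD5") internally, which involves provider lookup (the class name below is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class NameUuidDemo {
    // The lookup UUID.nameUUIDFromBytes performs today: provider
    // resolution via getInstance, then a one-shot digest call.
    static byte[] md5(byte[] name) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(name);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] name = "example".getBytes(StandardCharsets.UTF_8);
        byte[] hash = md5(name);                // 16-byte MD5 digest
        UUID u = UUID.nameUUIDFromBytes(name);  // name-based UUID built from that digest
        System.out.println(hash.length + " " + u.version()); // prints "16 3"
    }
}
```

Calling new MD5() directly would skip the getInstance lookup entirely, at the cost of ignoring any alternative registered MD5 provider.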

@DellCliff DellCliff commented Jan 5, 2021

Are you sure you're not ending up paying more by using a VarHandle, with the cast and varargs-style call (long) LONG_ARRAY_HANDLE.get(buf, ofs), instead of creating a ByteBuffer once via ByteBuffer.wrap(buffer).order(ByteOrder.nativeOrder()).asLongBuffer()?
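The two approaches being compared can be sketched side by side (names here are illustrative, not from the patch; note the VarHandle indexes in bytes while the LongBuffer view indexes in longs):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LongReadCompare {
    private static final VarHandle LONG_LE =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // VarHandle: byte-indexed access, no wrapper object per call
    static long viaVarHandle(byte[] buf, int byteOfs) {
        return (long) LONG_LE.get(buf, byteOfs);
    }

    // ByteBuffer view: wrapper objects allocated here on every call; in a
    // hot loop the LongBuffer would instead be created once and reused
    static long viaByteBuffer(byte[] buf, int longIndex) {
        return ByteBuffer.wrap(buf)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .asLongBuffer()
                         .get(longIndex);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        for (int i = 0; i < buf.length; i++) buf[i] = (byte) i;
        // Same 8 bytes (offsets 8..15) read through both paths
        System.out.println(viaVarHandle(buf, 8) == viaByteBuffer(buf, 1)); // prints "true"
    }
}
```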

@cl4es cl4es commented Jan 6, 2021

Hitting up new MD5() directly could be a great idea. I expect this would be just as fast as the cache+clone (if not faster), but I'm a bit worried we'd be short-circuiting the ability to install an alternative MD5 provider (which may or may not be something we must support). Still, it's worth exploring.

Comparing performance of this against a ByteBuffer impl is on my TODO. The VarHandle gets heavily inlined and optimized here, though, with performance in my tests similar to the Unsafe use in ByteArrayAccess.

@cl4es cl4es commented Jan 6, 2021

I've identified a number of optimizations to the plumbing behind MessageDigest.getDigest(..) over in #1933 that remove 80-90% of the throughput overhead and all the allocation overhead compared to the clone() approach prototyped here. The remaining 20ns/op overhead might not be enough of a concern to do a point fix in UUID::nameUUIDFromBytes.

@cl4es cl4es commented Jan 7, 2021

Removing the UUID clone cache and running the microbenchmark along with the changes in #1933:

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    2.182 ±  0.090  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate                     20000  thrpt   12  439.020 ± 18.241  MB/sec
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  264.022 ±  0.003    B/op

The goal now is to simplify the digest code and compare alternatives.

@cl4es cl4es commented Jan 7, 2021

I've run various tests and concluded that the VarHandleized code is matching or improving upon the Unsafe-riddled code in ByteArrayAccess. I then went ahead and consolidated ByteArrayAccess to use a similar code pattern throughout for consistency, which amounts to a good cleanup.

With MD5 intrinsics disabled, I get this baseline:

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.245 ±  0.077  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  488.042 ±  0.004    B/op

With the current patch here (not including #1933):

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.431 ±  0.106  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  408.035 ±  0.006    B/op

If I isolate the ByteArrayAccess changes I'm getting performance neutral or slightly better numbers compared to baseline for these tests:

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.317 ±  0.092  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  488.042 ±  0.004    B/op
@cl4es cl4es changed the title Improve performance of digesting small payloads with MD5 8259498: Reduce overhead of MD5 and SHA digests Jan 8, 2021
@cl4es cl4es marked this pull request as ready for review Jan 8, 2021
@openjdk openjdk bot added the rfr label Jan 8, 2021
@valeriepeng valeriepeng commented Jan 13, 2021

Thanks for the performance enhancement, I will take a look.

if ((outOfs < 0) || ((out.length - outOfs) < 4)) {
throw new ArrayIndexOutOfBoundsException();
}
Comment on lines -212 to -214

valeriepeng Jan 15, 2021

Why do we remove the index checking from all methods? Isn't it safer to check here in case the caller didn't? Or is such checking already implemented inside the various methods of VarHandle?

cl4es Jan 15, 2021
Author Member

Yes, IOOBE checking is done by the VarHandle methods, while the Unsafe API is unsafe and needs careful precondition checking. The removal doesn't seem to matter for performance (interpreted code even sees some benefit from it).

With the current usage an IOOBE is probably not observable, but there's a test that reflects into ByteArrayAccess and verifies exceptions are thrown as expected on faulty inputs.
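The built-in bounds checking can be seen in a small standalone sketch (illustrative, not from the patch; depending on JDK version the exception thrown is IndexOutOfBoundsException or, after JDK-8259911, its ArrayIndexOutOfBoundsException subclass, so the supertype is caught here):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class BoundsCheckDemo {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Returns true if the VarHandle itself rejects the access
    static boolean throwsOnOutOfBounds(byte[] buf, int ofs) {
        try {
            int ignored = (int) INT_LE.get(buf, ofs);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        // ofs 0 reads bytes 0..3 (in bounds); ofs 1 would read bytes 1..4
        System.out.println(throwsOnOutOfBounds(buf, 0) + " " + throwsOnOutOfBounds(buf, 1)); // prints "false true"
    }
}
```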

@valeriepeng valeriepeng left a comment

Changes look good. Thanks.

@openjdk openjdk bot commented Jan 16, 2021

@cl4es This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8259498: Reduce overhead of MD5 and SHA digests

Reviewed-by: valeriep

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 21 new commits pushed to the master branch:

  • a37cd5a: 8259859: Missing metaspace NMT memory tag
  • 33dcc00: 8132984: incorrect type for Reference.discovered
  • 3edf393: 8259978: PPC64 builds broken after JDK-8258004
  • 5d8861b: 8259995: Missing comma to separate years in copyright header
  • 5cfb36e: 8259036: Failed JfrVersionSystem invariant when VM built with -fno-elide-constructors
  • c0e9c44: 8259962: Shenandoah: task queue statistics is inconsistent after JDK-8255019
  • 82adfb3: 8134540: Much nearly duplicated code for PerfMemory support
  • a9519c8: 8259924: GitHub actions fail on Linux x86_32 with "Could not configure libc6:i386"
  • 139f5d3: 8259035: Comments for load order of hsdis should be updated
  • bd81ccf: 8259957: Build failure without C1 Compiler after JDK-8258004
  • ... and 11 more: https://git.openjdk.java.net/jdk/compare/db9c114d40cd20c2854121ed8b40a7c9ef8e59b3...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Jan 16, 2021

@cl4es cl4es commented Jan 20, 2021

/integrate

@openjdk openjdk bot closed this Jan 20, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Jan 20, 2021
@openjdk openjdk bot commented Jan 20, 2021

@cl4es Since your change was applied there have been 36 commits pushed to the master branch:

  • 27cc62a: 8259911: byteArrayViewVarHandle should throw ArrayIndexOutOfBoundsException
  • 1f47de5: 8260010: UTF8ZipCoder not thread-safe since JDK-8243469
  • 8b95d95: 8256895: Add support for RFC 8954: Online Certificate Status Protocol (OCSP) Nonce Extension
  • 4f11ff3: 8259488: Shenandoah: Missing timing tracking for STW CLD root processing
  • 0785147: 8259949: x86 32-bit build fails when -fcf-protection is passed in the compiler flags
  • 5891509: 8259947: (fs) Optimize UnixPath.encode implementation
  • 69f90b5: 8259843: initialize dli_fname array before calling dll_address_to_library_name
  • 52ed2aa: 8259786: initialize last parameter of getpwuid_r
  • 70b5b31: 8257664: HTMLEditorKit: Wrong CSS relative font sizes
  • 0b01d69: 8260005: Shenandoah: Remove unused AlwaysTrueClosure in ShenandoahConcurrentRootScanner::roots_do()
  • ... and 26 more: https://git.openjdk.java.net/jdk/compare/db9c114d40cd20c2854121ed8b40a7c9ef8e59b3...master

Your commit was automatically rebased without conflicts.

Pushed as commit 35c9da7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
