-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8308465: Reduce memory accesses in AArch64 MD5 intrinsic #14068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The performance is improved by ~ with `micro:org.openjdk.bench.java.security.MessageDigests`. | | digest | digest | getAndDigest | getAndDigest | | |------------|---------|--------|--------------|---------------|-------| | | 64 | 16,384 | 64 | 16,384 | bytes | | Graviton 2 | -1.37% | 1.51% | -0.65% | 1.89% | Graviton 2 ``` Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units ---- baseline ------------------------------------------------------------------------------------------ MessageDigests.digest md5 64 DEFAULT thrpt 15 3706.112 ± 29.183 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 31.270 ± 0.012 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 2898.102 ± 97.519 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 31.013 ± 0.006 ops/ms ---- optimized ----------------------------------------------------------------------------------------- MessageDigests.digest md5 64 DEFAULT thrpt 15 3655.283 ± 36.055 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 31.742 ± 0.051 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 2879.340 ± 90.737 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 31.599 ± 0.008 ops/ms ```
👋 Welcome back yftsai! A progress list of the required criteria for merging this PR into |
Webrevs
|
_masm(masm), _base(base) { | ||
assert(rs.size() == 8, "%u registers are used to cache 16 4-byte data", rs.size()); | ||
auto it = rs.begin(); | ||
for (int i = 0; i < 8; ++i, ++it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a counter here. Just loop over the RegSet
.
} | ||
|
||
// Generate code extracting i-th unsigned word (4 bytes) from cached 64 bytes. | ||
void gen_unsigned_word_extract(Register dest, int i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void gen_unsigned_word_extract(Register dest, int i) { | |
void extract_u32(Register dest, int i) { |
Register state_regs[2] = { r12, r13 }; | ||
RegSet saved_regs = RegSet::range(r16, r22) - r18_tls; | ||
Cached64Bytes reg_cache(_masm, buf, RegSet::of(r14, r15) + saved_regs); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add a note here to the effect that the rest of this patch requires there to be exactly 8 registers in this set. Maybe assert that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requirement has been asserted in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but a note here would have made it easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this patch looks pretty good. Just a few minor nits.
__ ubfx(rscratch1, state_regs[0], 0, 32); | ||
__ ubfx(rscratch2, state_regs[0], 32, 32); | ||
__ ubfx(rscratch3, state_regs[1], 0, 32); | ||
__ ubfx(rscratch4, state_regs[1], 32, 32); | ||
|
||
__ addw(a, rscratch1, a); | ||
__ addw(b, rscratch2, b); | ||
__ addw(c, rscratch3, c); | ||
__ addw(d, rscratch4, d); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__ ubfx(rscratch1, state_regs[0], 0, 32); | |
__ ubfx(rscratch2, state_regs[0], 32, 32); | |
__ ubfx(rscratch3, state_regs[1], 0, 32); | |
__ ubfx(rscratch4, state_regs[1], 32, 32); | |
__ addw(a, rscratch1, a); | |
__ addw(b, rscratch2, b); | |
__ addw(c, rscratch3, c); | |
__ addw(d, rscratch4, d); | |
__ addw(a, state_regs[0], a); | |
__ ubfx(rscratch2, state_regs[0], 32, 32); | |
__ addw(b, rscratch2, b); | |
__ addw(c, state_regs[1], c); | |
__ ubfx(rscratch4, state_regs[1], 32, 32); | |
__ addw(d, rscratch4, d); | |
@yftsai This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 98 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @phohensee) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
/integrate |
/sponsor |
Going to push as commit 8474e69.
Your commit was automatically rebased without conflicts. |
@phohensee @yftsai Pushed as commit 8474e69. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Two optimizations have been implemented in this change to reduce memory reads in AArch64 MD5 intrinsic.
Optimization 1: Memory loads and stores updating hash values are moved out of the loop. The final results are only written to memory once.
The original snippet loaded the value (step 3) soon after it was written to the memory (step 2).
The snippet is optimized to avoid memory loads and writes in the loop.
Optimization 2: Redundant loads generated by
md5_GG
,md5_HH
, andmd5_II
are removed.The original snippet, generated by two
md5_FF
s andmd5_GG
s, read the same data repeatedly.The snippet is optimized by caching the values in registers and removing the redundant loads.
Test
The following tests have passed.
Performance
The performance is improved by ~ 1-2% with
micro:org.openjdk.bench.java.security.MessageDigests
on larger inputs.MessageDigests.digest improvement
MessageDigests.getAndDigest improvement
Graviton 2
Graviton 3
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14068/head:pull/14068
$ git checkout pull/14068
Update a local copy of the PR:
$ git checkout pull/14068
$ git pull https://git.openjdk.org/jdk.git pull/14068/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14068
View PR using the GUI difftool:
$ git pr show -t 14068
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14068.diff
Webrev
Link to Webrev Comment