8308465: Reduce memory accesses in AArch64 MD5 intrinsic #14068

Closed
wants to merge 2 commits into from

Conversation

yftsai
Contributor

@yftsai yftsai commented May 20, 2023

This change implements two optimizations that reduce memory accesses in the AArch64 MD5 intrinsic.

Optimization 1: The loads and stores that update the hash values are moved out of the loop, so the final results are written to memory only once.

In the original snippet, each iteration reloaded the value (step 3) soon after the previous iteration had written it to memory (step 2).

md5_loop:
    __ ldrw(a, Address(state, 0));         // step 3: load the value from memory
    ... // loop body
    __ ldrw(rscratch1, Address(state, 0)); // step 1: load the value at Address(state, 0)
    __ addw(rscratch1, rscratch1, a);
    __ strw(rscratch1, Address(state, 0)); // step 2: write the value to memory
    ...
    __ br(Assembler::LE, md5_loop);

The optimized snippet keeps the value in a register, which avoids loads and stores inside the loop.

    __ ldp(s0, s1, Address(state, 0));     // load the value at Address(state, 0) into a register
    __ ubfx(a, s0, 0, 32);
md5_loop:
    ... // loop body
    __ ubfx(rscratch1, s0, 0, 32);         // step 1: extract the value from the register
    __ addw(a, rscratch1, a);
    __ orr(s0, a, b, Assembler::LSL, 32);  // step 2: preserve the value in the register
    ...
    __ br(Assembler::LE, md5_loop);
    ...
    __ str(s0, Address(state, 0));         // write the result to memory only once
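
The register packing used above can be modelled in plain C++. The following is a minimal sketch, not the intrinsic itself: `ubfx64`, `pack_words`, and `add_block` are hypothetical helpers that mimic what `ubfx` and `orr` with `LSL 32` do to 64-bit register values, under the assumption that `a` and `b` each hold a zero-extended 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of AArch64 `ubfx Xd, Xn, #lsb, #width`:
// extract `width` bits starting at bit `lsb`, zero-extended.
inline uint64_t ubfx64(uint64_t xn, unsigned lsb, unsigned width) {
  assert(width >= 1 && lsb + width <= 64);
  uint64_t mask = (width == 64) ? ~uint64_t{0} : ((uint64_t{1} << width) - 1);
  return (xn >> lsb) & mask;
}

// Hypothetical model of `orr Xd, Xa, Xb, LSL #32` when Xa and Xb each hold
// a zero-extended 32-bit word: a lands in the low half, b in the high half.
inline uint64_t pack_words(uint32_t a, uint32_t b) {
  return uint64_t{a} | (uint64_t{b} << 32);
}

// Add the working values a and b to the two state words kept in s0 and
// return the updated packed register; no memory access is involved.
inline uint64_t add_block(uint64_t s0, uint32_t a, uint32_t b) {
  uint32_t new_a = static_cast<uint32_t>(ubfx64(s0, 0, 32)) + a;   // wraps mod 2^32
  uint32_t new_b = static_cast<uint32_t>(ubfx64(s0, 32, 32)) + b;  // wraps mod 2^32
  return pack_words(new_a, new_b);
}
```

Calling `add_block` once per 64-byte block and writing the packed result to memory after the last block corresponds to the single `str` at the end of the optimized snippet.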

Optimization 2: Redundant loads generated by md5_GG, md5_HH, and md5_II are removed.

The original snippet, generated by two md5_FFs and md5_GGs, read the same data repeatedly.

__ ldrw(rscratch1, Address(buf, 0));    // from md5_FF(.., k = 0, ..)
...
__ ldrw(rscratch1, Address(buf, 4));    // from md5_FF(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 4));    // from md5_GG(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 0));    // from md5_GG(.., k = 0, ..)

The snippet is optimized by caching the values in registers and removing the redundant loads.

__ ldp(buf0, buf1, Address(buf, 0));    // load 16 bytes; buf0 holds the values for k = 0 and k = 1
...
__ ubfx(rscratch1, buf0, 0, 32);        // md5_FF, k = 0: lower 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32);       // md5_FF, k = 1: upper 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32);       // md5_GG, k = 1: reuses the cached value
...
__ ubfx(rscratch1, buf0, 0, 32);        // md5_GG, k = 0: reuses the cached value
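
As a rough model of the caching scheme, a 64-byte MD5 block can be held in eight 64-bit values, with word k living in value k / 2 at bit offset 32 * (k % 2). The sketch below assumes a little-endian layout; `BlockCache`, `load_block`, and `word_at` are hypothetical names, not functions from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Eight 64-bit values standing in for the eight cache registers.
using BlockCache = std::array<uint64_t, 8>;

// Read the 64-byte block once. Bytes are assembled explicitly in
// little-endian order so the sketch behaves the same on any host.
inline BlockCache load_block(const unsigned char* buf) {
  BlockCache cache{};
  for (int r = 0; r < 8; ++r) {
    uint64_t v = 0;
    for (int byte = 7; byte >= 0; --byte) {
      v = (v << 8) | buf[r * 8 + byte];
    }
    cache[r] = v;
  }
  return cache;
}

// Return the k-th 32-bit word (0 <= k < 16) of the cached block, the
// equivalent of ubfx(dest, reg[k / 2], 32 * (k % 2), 32).
inline uint32_t word_at(const BlockCache& cache, int k) {
  assert(k >= 0 && k < 16);
  return static_cast<uint32_t>(cache[k / 2] >> (32 * (k % 2)));
}
```

Every later round that needs word k then calls `word_at` on the cached values instead of reading the buffer again, which is the effect the optimized snippet achieves with registers.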

Test
The following tests have passed.

test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5Intrinsics.java
test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5MultiBlockIntrinsics.java

Performance
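Measured with micro:org.openjdk.bench.java.security.MessageDigests, throughput improves by about 0.7% to 2.3% on inputs of 1,024 bytes and larger, while 64-byte inputs are slightly slower (between -0.06% and -3.71%).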

MessageDigests.digest improvement

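|            | 64 bytes | 256 bytes | 1,024 bytes | 4,096 bytes | 16,384 bytes |
|------------|----------|-----------|-------------|-------------|--------------|
| Graviton 2 | -1.39%   | 0.57%     | 1.80%       | 2.20%       | 2.32%        |
| Graviton 3 | -3.71%   | -0.40%    | 0.72%       | 1.05%       | 1.13%        |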

MessageDigests.getAndDigest improvement

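|            | 64 bytes | 256 bytes | 1,024 bytes | 4,096 bytes | 16,384 bytes |
|------------|----------|-----------|-------------|-------------|--------------|
| Graviton 2 | -0.78%   | 0.32%     | 1.47%       | 1.83%       | 1.93%        |
| Graviton 3 | -0.06%   | 0.71%     | 1.05%       | 1.15%       | 1.16%        |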
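
The percentages above appear to be the relative change in throughput, (optimized / baseline - 1) × 100, which matches the raw scores listed below. A small sketch of that calculation, with `improvement_percent` as a hypothetical helper name:

```cpp
#include <cassert>

// Relative throughput change in percent; a positive value means the
// optimized build completed more operations per millisecond.
inline double improvement_percent(double baseline_ops, double optimized_ops) {
  assert(baseline_ops > 0.0);
  return (optimized_ops / baseline_ops - 1.0) * 100.0;
}
```

For example, the Graviton 2 `digest` scores at 4,096 bytes (122.360 and 125.053 ops/ms) give about 2.20%.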

Graviton 2

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3709.849 ± 30.327  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1513.543 ±  0.616  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   462.135 ±  0.382  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   122.360 ±  0.024  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.037 ±  0.010  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2902.714 ± 92.908  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1395.815 ±  2.292  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   448.729 ±  7.343  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   120.616 ±  0.038  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.010 ±  0.007  ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3658.278 ± 43.499  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1522.199 ±  2.013  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   470.466 ±  0.282  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   125.053 ±  0.035  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.757 ±  0.006  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2879.987 ± 88.433  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1400.283 ±  5.980  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   455.327 ±  6.663  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   122.828 ±  0.162  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.609 ±  0.010  ops/ms

Graviton 3

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  4122.050 ±  8.495  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1634.045 ±  0.341  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   490.091 ±  0.072  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   129.017 ±  0.007  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    32.687 ±  0.002  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3212.170 ± 81.253  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1504.159 ±  1.091  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   476.164 ±  3.869  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   126.983 ±  0.011  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    32.546 ±  0.004  ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3968.992 ± 12.005  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1627.529 ±  0.857  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   493.640 ±  0.050  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   130.374 ±  0.010  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    33.057 ±  0.002  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3210.093 ± 76.102  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1514.893 ±  1.163  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   481.159 ±  3.447  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   128.440 ±  0.017  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    32.924 ±  0.006  ops/ms

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8308465: Reduce memory accesses in AArch64 MD5 intrinsic

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14068/head:pull/14068
$ git checkout pull/14068

Update a local copy of the PR:
$ git checkout pull/14068
$ git pull https://git.openjdk.org/jdk.git pull/14068/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14068

View PR using the GUI difftool:
$ git pr show -t 14068

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14068.diff

Webrev

Link to Webrev Comment

Performance measured with `micro:org.openjdk.bench.java.security.MessageDigests`:

|            | digest, 64 bytes | digest, 16,384 bytes | getAndDigest, 64 bytes | getAndDigest, 16,384 bytes |
|------------|------------------|----------------------|------------------------|----------------------------|
| Graviton 2 | -1.37%           | 1.51%                | -0.65%                 | 1.89%                      |

Graviton 2
```
Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3706.112 ± 29.183  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.270 ±  0.012  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2898.102 ± 97.519  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.013 ±  0.006  ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3655.283 ± 36.055  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.742 ±  0.051  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2879.340 ± 90.737  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.599 ±  0.008  ops/ms
```
@bridgekeeper

bridgekeeper bot commented May 20, 2023

👋 Welcome back yftsai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 20, 2023
@openjdk

openjdk bot commented May 20, 2023

@yftsai The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 20, 2023
@mlbridge

mlbridge bot commented May 20, 2023

Webrevs

_masm(masm), _base(base) {
assert(rs.size() == 8, "%u registers are used to cache 16 4-byte data", rs.size());
auto it = rs.begin();
for (int i = 0; i < 8; ++i, ++it) {
Contributor

We don't need a counter here. Just loop over the RegSet.
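
One way to read this suggestion, sketched with a stand-in container because the real `RegSet` iterator interface is not shown here: `Reg`, `FakeRegSet`, and `collect` are hypothetical, and only the loop shape is the point.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Reg = int;                      // hypothetical stand-in for Register
using FakeRegSet = std::vector<Reg>;  // hypothetical stand-in for RegSet

// Copy exactly eight registers out of the set. The loop is bounded by the
// contents of the set rather than by a hard-coded count of 8, so the size
// requirement lives only in the assert.
inline std::array<Reg, 8> collect(const FakeRegSet& rs) {
  assert(rs.size() == 8);
  std::array<Reg, 8> regs{};
  std::size_t next = 0;  // destination slot, advanced once per register
  for (Reg r : rs) {
    regs[next++] = r;
  }
  return regs;
}
```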

}

// Generate code extracting i-th unsigned word (4 bytes) from cached 64 bytes.
void gen_unsigned_word_extract(Register dest, int i) {
Contributor

Suggested change
- void gen_unsigned_word_extract(Register dest, int i) {
+ void extract_u32(Register dest, int i) {

Register state_regs[2] = { r12, r13 };
RegSet saved_regs = RegSet::range(r16, r22) - r18_tls;
Cached64Bytes reg_cache(_masm, buf, RegSet::of(r14, r15) + saved_regs);

Contributor

Perhaps add a note here to the effect that the rest of this patch requires there to be exactly 8 registers in this set. Maybe assert that here?

Contributor Author

This requirement has been asserted in the constructor.

Contributor

Sure, but a note here would have made it easier to understand.

Contributor

@theRealAph theRealAph left a comment

In general this patch looks pretty good. Just a few minor nits.

Comment on lines 3544 to 3553
__ ubfx(rscratch1, state_regs[0], 0, 32);
__ ubfx(rscratch2, state_regs[0], 32, 32);
__ ubfx(rscratch3, state_regs[1], 0, 32);
__ ubfx(rscratch4, state_regs[1], 32, 32);

__ addw(a, rscratch1, a);
__ addw(b, rscratch2, b);
__ addw(c, rscratch3, c);
__ addw(d, rscratch4, d);

Contributor

Suggested change
- __ ubfx(rscratch1, state_regs[0], 0, 32);
- __ ubfx(rscratch2, state_regs[0], 32, 32);
- __ ubfx(rscratch3, state_regs[1], 0, 32);
- __ ubfx(rscratch4, state_regs[1], 32, 32);
-
- __ addw(a, rscratch1, a);
- __ addw(b, rscratch2, b);
- __ addw(c, rscratch3, c);
- __ addw(d, rscratch4, d);
+ __ addw(a, state_regs[0], a);
+ __ ubfx(rscratch2, state_regs[0], 32, 32);
+ __ addw(b, rscratch2, b);
+ __ addw(c, state_regs[1], c);
+ __ ubfx(rscratch4, state_regs[1], 32, 32);
+ __ addw(d, rscratch4, d);
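
The suggestion relies on `addw` being a 32-bit operation: it reads only the low 32 bits of its source registers, so the low word of `state_regs[0]` can be added directly and only the high word needs a `ubfx`. A minimal C++ model of that equivalence, with `addw_model`, `high_word`, and `update_ab` as hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Model of a 32-bit add on 64-bit register values: only the low 32 bits
// of each operand take part, and the result wraps modulo 2^32.
inline uint32_t addw_model(uint64_t xn, uint64_t xm) {
  return static_cast<uint32_t>(xn) + static_cast<uint32_t>(xm);
}

// Model of ubfx(dest, src, 32, 32): the upper 32 bits, zero-extended.
inline uint64_t high_word(uint64_t x) {
  return x >> 32;
}

// Update a (low word of the packed state) and b (high word) the way the
// suggested sequence does: no extract for a, one extract for b.
inline void update_ab(uint64_t state0, uint32_t& a, uint32_t& b) {
  a = addw_model(state0, a);             // upper half of state0 is ignored
  b = addw_model(high_word(state0), b);
}
```

The same reasoning applies to `c` and `d` with `state_regs[1]`, which is why the suggested sequence needs two `ubfx` instructions instead of four.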

@openjdk

openjdk bot commented May 21, 2023

@yftsai This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308465: Reduce memory accesses in AArch64 MD5 intrinsic

Reviewed-by: aph, phh

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 98 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @phohensee) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 21, 2023
@yftsai
Contributor Author

yftsai commented May 21, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label May 21, 2023
@openjdk

openjdk bot commented May 21, 2023

@yftsai
Your change (at version 0fcb9d4) is now ready to be sponsored by a Committer.

@yftsai yftsai changed the title 8308465: Reduce memory reads in AArch64 MD5 intrinsic 8308465: Reduce memory accesses in AArch64 MD5 intrinsic May 22, 2023
@phohensee
Member

/sponsor

@openjdk

openjdk bot commented May 22, 2023

Going to push as commit 8474e69.
Since your change was applied there have been 98 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 22, 2023
@openjdk openjdk bot closed this May 22, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels May 22, 2023
@openjdk

openjdk bot commented May 22, 2023

@phohensee @yftsai Pushed as commit 8474e69.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated