Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions #5390

Closed

Conversation

theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Sep 7, 2021

An interleaved version of AES/GCM.

Performance, now and then:

Apple M1, 3.2 GHz:

Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op

AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op

Neoverse N1, 2.5GHz:

Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op

AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op

A note about the implementation for the reviewers:

Unrolled and hand-scheduled intrinsics are often written in a way that
I don't find satisfactory. Often they are a conglomeration of
copy-and-paste programming and C macros, which makes them hard to
understand and hard to maintain. I won't name any names, but there are
many examples to be found in free software across the Internet,

I spent a while thinking about a structured way to develop and
implement them, and I think I've got something better. The idea is
that you transform a pre-existing implementation into a generator for
the interleaved version. The transformation shouldn't be too hard to
do, but more importantly it should be possible for a reader to verify
that the interleaved and unrolled version performs the same function.

A generator takes the form of a subclass of KernelGenerator. The
core idea is that the programmer defines the base case of the
intrinsic and a method to generate a clone of it, shifted to a
different set of registers. KernelGenerator will then generate
several interleaved copies of the function, with each one using a
different set of registers.

The subclass must implement three methods: length(), which is the
number of instruction bundles in the intrinsic, generate(int n)
which emits the nth instruction bundle in the intrinsic, and next()
which takes an instance of the generator and returns a version of it,
shifted to a new set of registers.

As an example, here's the inner loop of AES encryption:

(Some details elided for clarity.)

    BIND(L_aes_loop);
      ld1(v0, T16B, post(from, 16));
      
      cmpw(keylen, 44);
      br(Assembler::CC, L_rounds_44);
      br(Assembler::EQ, L_rounds_52);

      aes_round(v0, v17);
      aes_round(v0, v18);
    BIND(L_rounds_52);
      aes_round(v0, v19);
      aes_round(v0, v20);
    BIND(L_rounds_44);
    ...

The generator for the unrolled version looks like:

  virtual void generate(int index) {
    switch (index) {
    case  0:
      ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
      break;
    case  1:
      if (_once) {
        cmpw(_keylen, 52);
        br(Assembler::LO, _rounds_44);
        br(Assembler::EQ, _rounds_52);
      }
      break;
    case  2:  aes_round(_data, _subkeys +  0);  break;
    case  3:  aes_round(_data, _subkeys +  1);  break;
    case  4:
      if (_once)  bind(_rounds_52);
      break;
    case  5:  aes_round(_data, _subkeys +  2);  break;
    case  6:  aes_round(_data, _subkeys +  3);  break;
    case  7:
      if (_once)  bind(_rounds_44);
      break;
    ...

The job of converting a single inline intrinsic is, as you can see,
not much more than adding a switch statement. Some instructions should
only be emitted once, rather than several times, such as the labels
and branches. (You can use a list of C++ lambdas rather than a switch
statement to do the same thing, very LISP, but that seems a bit of a
sledgehammer. YMMV.)

I believe that this approach will be more maintainable and easier to
understand than other approaches we've seen. Also, the number of
unrolls is just a number that can be tweaked as required.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5390/head:pull/5390
$ git checkout pull/5390

Update a local copy of the PR:
$ git checkout pull/5390
$ git pull https://git.openjdk.java.net/jdk pull/5390/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5390

View PR using the GUI difftool:
$ git pr show -t 5390

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5390.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 7, 2021

👋 Welcome back aph! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 7, 2021
@openjdk
Copy link

openjdk bot commented Sep 7, 2021

@theRealAph The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 7, 2021
@theRealAph theRealAph changed the title JDK-8271567: Arch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions Sep 7, 2021
@mlbridge
Copy link

mlbridge bot commented Sep 7, 2021

@openjdk
Copy link

openjdk bot commented Sep 7, 2021

⚠️ @theRealAph This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@theRealAph theRealAph force-pushed the JDK-8271567-aarch64-gcm-rebase branch from e5ea9b3 to ac6b7f5 Compare September 8, 2021 08:55
@mlbridge
Copy link

mlbridge bot commented Sep 10, 2021

Mailing list message from Nick Gasson on hotspot-dev:

On 07/09/21 22:36 pm, Andrew Haley wrote:

A generator takes the form of a subclass of `KernelGenerator`. The
core idea is that the programmer defines the base case of the
intrinsic and a method to generate a clone of it, shifted to a
different set of registers. `KernelGenerator` will then generate
several interleaved copies of the function, with each one using a
different set of registers.

The subclass must implement three methods: `length()`, which is the
number of instruction bundles in the intrinsic, `generate(int n)`
which emits the nth instruction bundle in the intrinsic, and `next()`
which takes an instance of the generator and returns a version of it,
shifted to a new set of registers.

Can you include this explanation in the code somewhere? Perhaps as a
comment above KernelGenerator. I like the idea but the generate()
method is a bit opaque without this.

--
Thanks,
Nick

@theRealAph
Copy link
Contributor Author

Can you include this explanation in the code somewhere? Perhaps as a
comment above KernelGenerator. I like the idea but the generate()
method is a bit opaque without this.

Right you are: I'm forever asking committers to do just that.
Physician, heal thyself!

@openjdk openjdk bot removed the rfr Pull request is ready for review label Sep 10, 2021
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 10, 2021
Copy link
Contributor

@nick-arm nick-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I tested on several different machines and got speed-ups between 5x and 17x (dataSize=16384).

@openjdk
Copy link

openjdk bot commented Sep 13, 2021

@theRealAph This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions

Reviewed-by: ngasson, adinn, xliu

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 112 new commits pushed to the master branch:

  • 394ebc8: 8270553: Tests should not use (real, in-use, routable) 1.1.1.1 as dummy IP value
  • 0f31d0f: 8273373: Zero: Cannot invoke JVM in primordial threads on Zero
  • fe89dd3: 8271254: javac generates unreachable code when using empty semicolon statement
  • 8974b95: 8273730: WorkGangBarrierSync constructor unused
  • 1d3eb14: 8273635: Attempting to acquire lock StackWatermark_lock/9 out of order with lock tty_lock/3
  • 31667da: 8273491: java.util.spi.LocaleServiceProvider spec contains statement that is too strict
  • ed7789d: 8256977: Bump minimum GCC from 5.x to 6 for JDK
  • 5bfd043: 8273497: building.md should link to both md and html
  • 3884580: 8273597: Rectify Thread::is_ConcurrentGC_thread()
  • f527289: 8273639: tests fail with "assert(_handle_mark_nesting > 1) failed: memory leak: allocating handle outside HandleMark"
  • ... and 102 more: https://git.openjdk.java.net/jdk/compare/fc546d6de9a3ed33cf4b04e24e62714332b069cb...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 13, 2021
Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. In particular, use of the kernel generator model to mange unrolling is something that should be used in all generated code that relies on unrolling. It is highly readable, which is rarely the case with hand-crafted code, because the generator methods clearly signal the structure of the interleaved code. It should also be far easier to update if the code ever needs revising. I suspect it would be hard to produce hand-crafted code that does significantly better when it comes to performance.

@theRealAph
Copy link
Contributor Author

In case anyone is wondering why this one hasn't been committed yet.
There's a problem with prolonged time-to-safepoint when these intrinsics are executed with large-sized arguments. Intel are also looking at this intrinsic on x86, 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125.

I could commit this now, and fix its time-to-safepoint later. Thoughts?

@phohensee
Copy link
Member

I'd commit it now in order to get experience with it, and fix time-to-safepoint later. There's still plenty of time left in the Java 18 schedule for the latter.

@rose00
Copy link
Contributor

rose00 commented Sep 22, 2021

Not a review, but that's the best assembly code I think I've ever seen.

Probably the only way to make it decisively better would be to code it in Java, using the Vector API on top of the (as yet uninvented) statically compiled but self-hosting System Java dialect.

@theRealAph
Copy link
Contributor Author

Not a review, but that's the best assembly code I think I've ever seen.

I'm going to frame that and put it on my wall.

@theRealAph theRealAph closed this Sep 23, 2021
@theRealAph
Copy link
Contributor Author

/integrate

@theRealAph theRealAph reopened this Sep 23, 2021
@openjdk
Copy link

openjdk bot commented Sep 23, 2021

Going to push as commit 4f3b626.
Since your change was applied there have been 231 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Sep 23, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 23, 2021
@openjdk
Copy link

openjdk bot commented Sep 23, 2021

@theRealAph Pushed as commit 4f3b626.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

6 participants