Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8261027: AArch64: Support for LSE atomics C++ HotSpot code #2434

Closed
wants to merge 16 commits into from

Conversation

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Feb 5, 2021

Go back a few years, and there were simple atomic load/store exclusive
instructions on Arm. Say you want to do an atomic increment of a
counter. You'd do an atomic load to get the counter into your local cache
in exclusive state, increment that counter locally, then write that
incremented counter back to memory with an atomic store. All the time
that cache line was in exclusive state, so you're guaranteed that
no-one else changed anything on that cache line while you had it.

This is hard to scale on a very large system (e.g. Fugaku) because if
many processors are incrementing that counter you get a lot of cache
line ping-ponging between cores.

So, Arm decided to add a locked memory increment instruction that
works without needing to load an entire line into local cache. It's a
single instruction that loads, increments, and writes back. The secret
is to send a cache control message to whichever processor owns the
cache line containing the count, tell that processor to increment the
counter and return the incremented value. That way cache coherency
traffic is mimimized. This new set of instructions is known as Large
System Extensions, or LSE.

Unfortunately, in recent processors, the "old" load/store exclusive
instructions, sometimes perform very badly. Therefore, it's now
necessary for software to detect which version of Arm it's running
on, and use the "new" LSE instructions if they're available. Otherwise
performance can be very poor under heavy contention.

GCC's -moutline-atomics does this by providing library calls which use
LSE if it's available, but this option is only provided on newer
versions of GCC. This is particularly problematic with older versions
of OpenJDK, which build using old GCC versions.

Also, I suspect that some other operating systems could use this.
Perhaps not MacOS, given that all Apple CPUs support LSE, but
maybe Windows.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8261027: AArch64: Support for LSE atomics C++ HotSpot code

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/2434/head:pull/2434
$ git checkout pull/2434

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 5, 2021

👋 Welcome back aph! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Feb 5, 2021

@theRealAph The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@theRealAph theRealAph changed the title 8261027: AArch64: Placeholder for -moutline-atomics 8261027: AArch64: Support for LSE atomics C++ HotSpot code Feb 5, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 5, 2021

@theRealAph
Copy link
Contributor Author

@theRealAph theRealAph commented Feb 6, 2021

With regard to performance, the overhead of the call ... ret sequence seems to be almost negligible on the systems I've tested.

On ThunderX2, there is little difference, whatever you do. A straight-line count and increment loop is 5% slower.

On Neoverse N1 there is some 25% straight-line improvement for a simple count and increment loop with this patch. GCC's -moutline-atomics isn't quite as good as this patch, with only a 17% improvement.

But simple straight-line tests aren't really the point of LSE. The big performance hit with the "old" atomics happens at times of heavy contention, when fairness problems cause severe scaling issues. This is more likely to be a problem on large systems with many cores and large heaps.

ThunderX2:

Baseline:

real 0m24.001s

Patched:

-XX:+UseLSE

real 0m25.222s

-XX:-UseLSE

real 0m25.215s

Built with -moutline-atomics:

real 0m25.227s

Neoverse N1:

Baseline:

real 0m10.027s

Patched:

-XX:+UseLSE

real 0m8.027s

-XX:-UseLSE

real 0m10.429s

Built with -moutline-atomics:

real 0m8.538s

@@ -0,0 +1,89 @@
// Copyright (c) 2021, Red Hat Inc. All rights reserved.

This comment has been minimized.

@nick-arm

nick-arm Feb 8, 2021
Member

Does this file work with the Windows assembler?

This comment has been minimized.

@theRealAph

theRealAph Feb 8, 2021
Author Contributor

I have no idea. If it doesn't, please tell me; I have no Windows system.

This comment has been minimized.

@nick-arm

nick-arm Feb 8, 2021
Member

I don't either unfortunately. Maybe @mo-beck or @luhenry could help? The Windows build on GitHub Actions failed cryptically with:

 Creating support/modules_libs/java.base/server/jvm.dll from 1045 file(s)
/usr/bin/bash: -c: line 0: syntax error near unexpected token `;'
make[3]: *** [lib/CompileJvm.gmk:143: /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm/objs/atomic_aarch64.obj] Error 2

This comment has been minimized.

@nick-arm

nick-arm Feb 8, 2021
Member

I suppose you could just move it to os_cpu/linux_aarch64/ as these are only called from the Linux atomics?

This comment has been minimized.

@theRealAph

theRealAph Feb 8, 2021
Author Contributor

They're probably needed on Windows, or I'd have put them in linux_aarch64.

This comment has been minimized.

@luhenry

luhenry Feb 8, 2021
Member

I can confirm that assembly code targeted at Linux generally won't compile out of the box on Windows. For windows_aarch64, could we have simple fallbacks in C++ code that simply call InterlockedAdd, InterlockedExchange, and InterlockedCompareExchange? It would be similar to atomic_windows_aarch64.hpp

src/hotspot/cpu/aarch64/atomic_aarch64.S Outdated Show resolved Hide resolved
@adinn
adinn approved these changes Feb 8, 2021
Copy link
Contributor

@adinn adinn left a comment

It would help to change the name of old_value to value. Otherwise this si ok as is.

@openjdk
Copy link

@openjdk openjdk bot commented Feb 8, 2021

@theRealAph This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8261027: AArch64: Support for LSE atomics C++ HotSpot code

Reviewed-by: adinn, simonis

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 128 new commits pushed to the master branch:

  • 1740de2: 8261297: NMT: Final report should use scale 1
  • c342323: 8261431: SA: Add comments about load address of executable
  • 4a72cea: 8261509: Move per-thread StackWatermark from Thread to JavaThread class
  • eef86a8: 8261029: Code heap page sizes not traced correctly using os::trace_page_sizes
  • 0a89987: 8240281: Remove failing assertion code when selecting first memory state in SuperWord::co_locate_pack
  • 9fed604: 8261300: jpackage: rewrite while(0)/while(false) to proper blocks
  • 8b6ab31: 8261418: Reduce decoder creation overheads for sun.nio.cs.ext Charsets
  • 5e1b809: 8261444: Remove unused fields in Lower
  • a9c3680: 8261250: Dependencies: Remove unused dependency types
  • 3ede231: 8259430: C2: assert(in_vt->length() == out_vt->length()) failed: mismatch on number of elements
  • ... and 118 more: https://git.openjdk.java.net/jdk/compare/90376156be1a2d93dac612ef77d1c0f267e72c60...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Feb 8, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 8, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 2/8/21 1:38 PM, Ludovic Henry wrote:

On Mon, 8 Feb 2021 11:24:00 GMT, Andrew Haley <aph at openjdk.org> wrote:

I suppose you could just move it to `os_cpu/linux_aarch64/` as these are only called from the Linux atomics?

They're probably needed on Windows, or I'd have put them in linux_aarch64.

I can confirm that assembly code targeted at Linux generally won't compile out of the box on Windows.

Generally, OK, but what's wrong with that specific file? It should run
just fine. We're getting an incomprehensible error message, but what
does it mean?

For `windows_aarch64`, could we have simple fallbacks in C++ code that simply call `InterlockedAdd`, `InterlockedExchange`, and `InterlockedCompareExchange`? It would be similar to [atomic_windows_aarch64.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp)

Well, I don't know. Do InterlockedAdd and its friends generate good code?
You've got the machines there, so you can have a look, but I can't.
It's up to you to decide whether you care about code quality. I'm
trying my best to help, but without your assistance there's nothing I
can do.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@theRealAph
Copy link
Contributor Author

@theRealAph theRealAph commented Feb 8, 2021

Mailing list message from Andrew Haley on hotspot-dev:
Well, I don't know. Do InterlockedAdd and its friends generate good code?
You've got the machines there, so you can have a look, but I can't.

I'm sorry, that was unnecessarily sharp of me! It's entirely up to you, but you might find investigating this to be useful.

Copy link
Member

@simonis simonis left a comment

Hi Andrew,

I'm happy to see this change after the long and tedious discussions we had about preferring C++ intrinsic over inline assembly :)

In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. __atomic_exchange_n and __atomic_add_fetch) were called with __ATOMIC_RELEASE semantics which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.

One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.

Finally, I didn't fully understand how you've measured the call..ret overhead and what's the "simple stright-line test" you've posted performance numbers for.

Other than that, the change looks fine to me.

@theRealAph
Copy link
Contributor Author

@theRealAph theRealAph commented Feb 9, 2021

I'm happy to see this change after the long and tedious discussions we had about preferring C++ intrinsic over inline assembly :)

In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. __atomic_exchange_n and __atomic_add_fetch) were called with __ATOMIC_RELEASE semantics which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.

None of these sequences is ideal, so I'll follow up with some higher-performance LSE versions in a new patch.

One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.

We'd need an instance of Assembler very early, before the JVM is initialized. It could be done, but it would also a page of memory to be allocated early too. I did try, but it was rather awful. That is why I ended up with these simple bootstrapping versions of the atomics.

Finally, I didn't fully understand how you've measured the call..ret overhead and what's the "simple stright-line test" you've posted performance numbers for.

That was just a counter in a loop. It's not terribly important for this case, given that the current code is way from optimal, but I wanted to know if the call...ret overhead could be measured. It can't, because it's swamped by the cost of the barriers.

@simonis
Copy link
Member

@simonis simonis commented Feb 9, 2021

I'm happy to see this change after the long and tedious discussions we had about preferring C++ intrinsic over inline assembly :)
In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. __atomic_exchange_n and __atomic_add_fetch) were called with __ATOMIC_RELEASE semantics which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.

None of these sequences is ideal, so I'll follow up with some higher-performance LSE versions in a new patch.

One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.

We'd need an instance of Assembler very early, before the JVM is initialized. It could be done, but it would also a page of memory to be allocated early too. I did try, but it was rather awful. That is why I ended up with these simple bootstrapping versions of the atomics.

OK, I see. Bootstrapping is more complex than I thought :)

But nevertheless I think implementing the default versions in native assembly isn't really simple and putting that Linux/gcc specific assembly code into the generic aarch64 directory src/hotspot/cpu/aarch64 will break other aarch64 platforms like Windows and Mac. Why don't you leave the default implementation as simple wrappers for the C++ compiler intrinsics somewhere under src/hotspot/os_cpu/linux_aarch64 and remove the:

    if (! UseLSE) {
      return;
    }

in generate_atomic_entry_points()? In that case, the intrinsic versions would really only be used for bootstrapping until the stubs have been generated. I don't see a good reason why we should maintain two different assembler implementations of the non-LSE atomics for aarch64 - one in native assembler and on in generated assembler.

Finally, I didn't fully understand how you've measured the call..ret overhead and what's the "simple stright-line test" you've posted performance numbers for.

That was just a counter in a loop. It's not terribly important for this case, given that the current code is way from optimal, but I wanted to know if the call...ret overhead could be measured. It can't, because it's swamped by the cost of the barriers.

@openjdk openjdk bot added ready and removed ready labels Feb 9, 2021
@theRealAph
Copy link
Contributor Author

@theRealAph theRealAph commented Feb 9, 2021

But nevertheless I think implementing the default versions in native assembly isn't really simple and putting that Linux/gcc specific assembly code into the generic aarch64 directory src/hotspot/cpu/aarch64 will break other aarch64 platforms like Windows and Mac.

OK. We can do Windows another way, and I will move the assembler stubs to Linux.

Why don't you leave the default implementation as simple wrappers for the C++ compiler intrinsics

Because the atomic stubs use a non-standard calling convention that only clobbers a few registers, so they can't be written in C++ because we can't control which registers the C++ compiler uses. If we were to use the native calling convention to call stubs we'd need to save and restore a ton of registers somehow - and not just the integer registers but also the vectors. It wouldn't be any simpler.

I do intend to provide lower-overhead versions of the Atomic functions in a later patch. This one does the LSE/non-LSE split without changing anything else.

@simonis
Copy link
Member

@simonis simonis commented Feb 9, 2021

OK, got it. Then I'm fine with these changes modulo the upcoming move of the assembler stubs to Linux.

theRealAph added 2 commits Feb 10, 2021
theRealAph added 2 commits Feb 10, 2021
@theRealAph
Copy link
Contributor Author

@theRealAph theRealAph commented Feb 12, 2021

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Feb 12, 2021

@theRealAph Since your change was applied there have been 141 commits pushed to the master branch:

  • 9ffabf3: 8252971: WindowsFileAttributes does not know about Unix domain sockets
  • 682e78e: 8261071: AArch64: Refactor interpreter native wrappers
  • ebaa58d: 8261505: Test test/hotspot/jtreg/gc/parallel/TestDynShrinkHeap.java killed by Linux OOM Killer
  • 3210095: 8261079: Fix support for @hidden in classes and interfaces
  • 9c0ec8d: 8260941: Remove the conc_scan parameter for CardTable
  • da9895a: 8261499: Simplify HTML for javadoc links
  • 0779add: 8255059: Regressions >5% in all Javadoc benchmarks in 16-b19
  • 6a84ec6: 8260044: Parallel GC: Concurrent allocation after heap expansion may cause unnecessary full gc
  • 92ff891: 8261593: Do not use NULL pointer as write buffer parameter in jfrEmergencyDump.cpp write_repository_files
  • 60a2072: 8260431: com/sun/jdi/JdbOptions.java failed with "RuntimeException: 'prop[boo] = >foo<' missing from stdout/stderr"
  • ... and 131 more: https://git.openjdk.java.net/jdk/compare/90376156be1a2d93dac612ef77d1c0f267e72c60...master

Your commit was automatically rebased without conflicts.

Pushed as commit 40ae993.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment