8261027: AArch64: Support for LSE atomics C++ HotSpot code #2434
Go back a few years, and there were simple atomic load/store exclusive instructions on Arm: an atomic read-modify-write was a loop of load exclusive, modify, store exclusive, retried until the store succeeded.
This is hard to scale on a very large system (e.g. Fugaku) because if many cores contend for the same location, the cache line holding it bounces between them, the exclusive stores keep failing, and no single core is guaranteed to make progress.
So, Arm decided to add a locked memory increment instruction (part of the Large System Extensions, LSE, introduced in Armv8.1) that performs the whole read-modify-write atomically in a single instruction.
Unfortunately, in recent processors, the "old" load/store exclusive sequences can be much slower than their LSE equivalents, so we want to use LSE whenever the hardware supports it, while still running correctly on Armv8.0 cores.
GCC's -moutline-atomics does this by providing library calls which use LSE when the processor supports it and fall back to load/store exclusive otherwise. This patch does the equivalent for HotSpot's C++ code.
Also, I suspect that some other operating systems could use this.
With regard to performance, the overhead of the out-of-line atomic stubs is small:
On ThunderX2, there is little difference whatever you do: a straight-line count-and-increment loop is 5% slower with this patch.
On Neoverse N1 there is some 25% straight-line improvement for a simple count and increment loop with this patch. GCC's -moutline-atomics isn't quite as good as this patch, with only a 17% improvement.
But simple straight-line tests aren't really the point of LSE. The big performance hit with the "old" atomics happens at times of heavy contention, when fairness problems cause severe scaling issues. This is more likely to be a problem on large systems with many cores and large heaps.
Built with -moutline-atomics:
@theRealAph This change now passes all automated pre-integration checks.
After integration, the commit message for the final commit will be:
At the time when this comment was updated there had been 128 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
On 2/8/21 1:38 PM, Ludovic Henry wrote:
Generally, OK, but what's wrong with that specific file? It should run
Well, I don't know. Do InterlockedAdd and its friends generate good code?
I'm sorry, that was unnecessarily sharp of me! It's entirely up to you, but you might find investigating this to be useful.
simonis left a comment
In general I'm fine with the change. Some of the previous C++ intrinsics (e.g.
One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.
Finally, I didn't fully understand how you've measured the overhead of calling the atomic stubs.
Other than that, the change looks fine to me.
None of these sequences is ideal, so I'll follow up with some higher-performance LSE versions in a new patch.
We'd need an instance of Assembler very early, before the JVM is initialized. It could be done, but it would also require a page of memory to be allocated early. I did try, but it was rather awful. That is why I ended up with these simple bootstrapping versions of the atomics.
That was just a counter in a loop. It's not terribly important for this case, given that the current code is a long way from optimal, but I wanted to know if the call...ret overhead could be measured. It can't, because it's swamped by the cost of the barriers.
OK, I see. Bootstrapping is more complex than I thought :)
But nevertheless I think implementing the default versions in native assembly isn't really simple, and putting that Linux/gcc-specific assembly code into the generic aarch64 directory doesn't seem right.
OK. We can do Windows another way, and I will move the assembler stubs to Linux.
Because the atomic stubs use a non-standard calling convention that clobbers only a few registers, they can't be written in C++: we can't control which registers the C++ compiler uses. If we were to use the native calling convention to call the stubs, we'd need to save and restore a ton of registers somehow - and not just the integer registers but also the vectors. It wouldn't be any simpler.
I do intend to provide lower-overhead versions of the Atomic functions in a later patch. This one does the LSE/non-LSE split without changing anything else.
@theRealAph Since your change was applied there have been 141 commits pushed to the
Your commit was automatically rebased without conflicts.
Pushed as commit 40ae993.