Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8322535: Change default AArch64 SpinPause instruction #17430

Closed

Conversation

fbredber
Copy link
Contributor

@fbredber fbredber commented Jan 15, 2024

The Java options OnSpinWaitInst lets you choose which AArch64 instruction should be used in SpinPause(). Valid values are "none", "nop", "isb" and "yield". Today the default value for OnSpinWaitInst is unfortunately "none".

However some CPUs changes the default SpinPause instruction to something better if the user hasn't used the OnSpinWaitInst option. For instance if you run a Neoverse N1, N2, V1 or V2, the default SpinPause instruction will be changed to "isb". After doing some measurements on Apple's M1-M3 CPUs it also seems like "isb" is the best yielding instruction on on those CPUs.

This PR changes the default SpinPause instruction to "yield" on all AArch64 platforms except on Apple's M1, M2 and M3 CPUs on which the default value will be "isb".

Tested tier1-tier7 successfully on linux-aarch64 and macosx-aarch64.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8322535: Change default AArch64 SpinPause instruction (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17430/head:pull/17430
$ git checkout pull/17430

Update a local copy of the PR:
$ git checkout pull/17430
$ git pull https://git.openjdk.org/jdk.git pull/17430/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17430

View PR using the GUI difftool:
$ git pr show -t 17430

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17430.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 15, 2024

👋 Welcome back fbredberg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 15, 2024

@fbredber The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jan 15, 2024
@fbredber fbredber marked this pull request as ready for review January 16, 2024 09:59
@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 16, 2024
@mlbridge
Copy link

mlbridge bot commented Jan 16, 2024

Webrevs

@shipilev
Copy link
Member

Attn @eastig.

Copy link
Contributor

@theRealAph theRealAph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ISB isn't really the right thing for this. Sure, it causes a delay, but the extent of the delay depends on what else the processor is doing. In some cases an ISB can work well, in other cases not. Some micro benchmarks show a great improvement with ISB.
It doesn't depend only on the target hardware, but on the application. Sure, in some cases an ISB is going to be exactly right, but on others it might be too much.

@stooart-mon
Copy link
Contributor

For the most part, "YIELD" is probably going to be equivalent to a "NOP". Unless there is a a demonstrable reason for this change, I would leave it as it is.
With regards to the change, do you have a suite of benchmark data that demonstrates this is a benefit on Apple Silicon?
Otherwise, as @theRealAph says, microbenchmarks can demonstrate an benefit from ISBs, but applications overall won't necessarily show any benefit.

@eastig
Copy link
Member

eastig commented Jan 16, 2024

ISB isn't really the right thing for this. Sure, it causes a delay, but the extent of the delay depends on what else the processor is doing. In some cases an ISB can work well, in other cases not. Some micro benchmarks show a great improvement with ISB. It doesn't depend only on the target hardware, but on the application. Sure, in some cases an ISB is going to be exactly right, but on others it might be too much.

BTW, In Armv8.7-A/Armv9.2-A we have WFE/WFI with timeouts which is supported by Cortex-X4, A720 and A520. Available implementations of them are MediaTek Dimensity 9300 and Qualcomm Snapdragon 8 Gen 3. It would be possible to benchmark WFE-based implementation.

@eastig
Copy link
Member

eastig commented Jan 16, 2024

For the most part, "YIELD" is probably going to be equivalent to a "NOP". Unless there is a a demonstrable reason for this change, I would leave it as it is. With regards to the change, do you have a suite of benchmark data that demonstrates this is a benefit on Apple Silicon? Otherwise, as @theRealAph says, microbenchmarks can demonstrate an benefit from ISBs, but applications overall won't necessarily show any benefit.

I agree it would be interesting to see whether desktop applications get any improvements from ISB.
We observed good performance improvements in our customers' cloud applications.

BTW, ISB gets spread:
https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L4812C14-L4812C14
rust-lang/rust@c064b65
https://github.com/simd-everywhere/simde/blob/14311d60539303ca8bad6204dcbc6a29f51b0e09/simde/x86/sse2.h#L4770

Copy link
Member

@eastig eastig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@theRealAph
Copy link
Contributor

For the most part, "YIELD" is probably going to be equivalent to a "NOP". Unless there is a a demonstrable reason for this change, I would leave it as it is. With regards to the change, do you have a suite of benchmark data that demonstrates this is a benefit on Apple Silicon? Otherwise, as @theRealAph says, microbenchmarks can demonstrate an benefit from ISBs, but applications overall won't necessarily show any benefit.

I agree it would be interesting to see whether desktop applications get any improvements from ISB. We observed good performance improvements in our customers' cloud applications.

Your customers are running cloud apps on Apple M1/M2?

BTW, ISB gets spread: https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L4812C14-L4812C14 rust-lang/rust@c064b65 https://github.com/simd-everywhere/simde/blob/14311d60539303ca8bad6204dcbc6a29f51b0e09/simde/x86/sse2.h#L4770

Yeah, I know. What I don't know is how much of a cargo cult this is. Apple M1 etc. have very large reorder buffers, and serializing all instructions may not be the best plan.

@eastig
Copy link
Member

eastig commented Jan 16, 2024

Your customers are running cloud apps on Apple M1/M2?

In theory they could. M1, M2 and M2 Pro instances are available in cloud. However I am not aware any such cases.

Yeah, I know. What I don't know is how much of a cargo cult this is. Apple M1 etc. have very large reorder buffers, and serializing all instructions may not be the best plan.

I hope hardware engineers will notice this improper uses of ISB and will either implement YIELD or something equivalent to it. YIELD could be an alias of a new instruction.

@theRealAph
Copy link
Contributor

Your customers are running cloud apps on Apple M1/M2?

In theory they could. M1, M2 and M2 Pro instances are available in cloud. However I am not aware any such cases.

Right.

Yeah, I know. What I don't know is how much of a cargo cult this is. Apple M1 etc. have very large reorder buffers, and serializing all instructions may not be the best plan.

I hope hardware engineers will notice this improper uses of ISB and will either implement YIELD or something equivalent to it. YIELD could be an alias of a new instruction.

I think the problem is that the right amount of time to spin for is application dependent. It also depends on things like the way the memory coherence system works, which is architecturally very different on Apple designs.

We could do something less violent than ISB for SpinPause. We could execute a bunch of UDIV instructions with a loop-carried dependency, or cycle an xor-shift generator. That could be made to delay for any number of clock cycles, so we can delay without the side effects of an ISB. We could try to measure how many cycles ISB takes in the "good" cases and design a delay that takes as long as an ISB without disrupting everything else.

@fbredber
Copy link
Contributor Author

When I was browsing the interweb I saw that it's not uncommon to use isb instead of yield while spinning on AArch64. Before jumping on the bandwagon I created a test program to measure how long time it takes to issue a large number of instructions from several threads running in parallel. I tested nop, yield and isb on Apple's M1, M2 and M3 CPUs. The yield instruction doesn't take longer to execute than a nop instruction (in fact it takes less time than nop). However isb always takes significantly longer time to run than nop or yield on all of the above mentioned Apple CPUs. This finding combined with the fact that the JVM
today uses isb as default for Neoverse CPUs, justified the use of isb on Apple's M1-M3 CPUs.

But I do agree with both @theRealAph and @stooart-mon, isb is not intended for this purpose. It might create a delay that is too long for spinning purposes and applications overall won't necessarily show any benefit from isb vs yield.

Maybe the most reasonable way forward is to only change the default value of OnSpinWaitInst from "none" to "yield" and NOT change it to "isb" for Apple CPUs.

After all, that would make us use the "correct" spinning instruction on all AArch64 CPUs (except Neoverse).

@eastig
Copy link
Member

eastig commented Jan 17, 2024

Maybe the most reasonable way forward is to only change the default value of OnSpinWaitInst from "none" to "yield" and NOT change it to "isb" for Apple CPUs.

Do we have anyone from Apple who can suggest a spin pause implementation?
As no real cases for Apple CPUs exists, just microbenchmarks, choosing isb might be premature.
IMO, even without real cases I would have chosen isb if it had a similar latency as the Intel pause.

We could execute a bunch of UDIV instructions with a loop-carried dependency, or cycle an xor-shift generator. That could be made to delay for any number of clock cycles, so we can delay without the side effects of an ISB.

This approach is not power efficient. In case of Neoverse isb have shown to use less power than any instruction executing on the CPU back-end. If Apple CPUs have the similar isb behaviour, it would be a reason to use isb.

@theRealAph
Copy link
Contributor

Maybe the most reasonable way forward is to only change the default value of OnSpinWaitInst from "none" to "yield" and NOT change it to "isb" for Apple CPUs.

Do we have anyone from Apple who can suggest a spin pause implementation? As no real cases for Apple CPUs exists, just microbenchmarks, choosing isb might be premature. IMO, even without real cases I would have chosen isb if it had a similar latency as the Intel pause.

It'd be nice if we knew what that latency was.

We could execute a bunch of UDIV instructions with a loop-carried dependency, or cycle an xor-shift generator. That could be made to delay for any number of clock cycles, so we can delay without the side effects of an ISB.

This approach is not power efficient.

Huh? The only real use for SpinPause is to prevent bus contention when trying to acquire a lock. Chances are we only really have to spin for a few dozen cycles before retrying. It's not long enough to affect power consumption much. Are you thinking of a longer pause?

In case of Neoverse isb have shown to use less power than any instruction executing on the CPU back-end. If Apple CPUs have the similar isb behaviour, it would be a reason to use isb.

OK, so now I'm really curious, given that ISB has a lot of work to do because it has to flush and restart a bunch of on-the-fly instructions. Can you provide any links for where it's been shown to use less power?

@fisk
Copy link
Contributor

fisk commented Jan 18, 2024

The main point of this PR is not to figure out what Apple HW should bind to, but rather to figure out what a good default is for unrecognized HW. The current default is "none", and the proposal is to change it to "yield". Since yield is the ISA defined instruction for this exact purpose, I think it makes more sense to use yield instead of none. It is certainly less surprising.
It's worth pointing out that new AmpereOne chips do indeed implement yield. By having the unsurprising yield instruction be the default, hopefully more HW vendors will find the motivation to follow suit and implement the instruction. We can have horrible hacks as opt-in for machines that have ignored this, but I think figuring out what said hacks should be and on what machines, is an orthogonal concern. The default should still be "yield", regardless, IMO.

@eastig
Copy link
Member

eastig commented Jan 18, 2024

OK, so now I'm really curious, given that ISB has a lot of work to do because it has to flush and restart a bunch of on-the-fly instructions.

According to our hardware enigeers isb don't need to flush anything. It can just stop fetching instructions until the backend gets idle. It's clearly faster and cheaper than mul/div instructions. Mul/div will spin up a whole complex arithmetic unit that might otherwise be idle. Except some cases, mul/div don't really pause CPU because CPU can execute instructions around it. For example it can get more loads out into the pipeline.

Can you provide any links for where it's been shown to use less power?

I don't have data I can share.

Stuart (@stooart-mon) is from arm. He might ask hardware engineers as well. Maybe he knows more or might provide some data.

@theRealAph
Copy link
Contributor

OK, so now I'm really curious, given that ISB has a lot of work to do because it has to flush and restart a bunch of on-the-fly instructions.

According to our hardware enigeers isb don't need to flush anything. It can just stop fetching instructions until the backend gets idle.

How is that any better? I get that how it might work, but that means that you have to wait for every instruction in progress to retire. And in a CPU with a hundreds of instructions on the fly that's no small thing. You have a choice, either to speculate and then rollback when the ISB is actually executed, or to stop speculating for a while. The effect is the same.

It's clearly faster

It's a delay.

and cheaper than mul/div instructions. Mul/div will spin up a whole complex arithmetic unit that might otherwise be idle. Except some cases, mul/div don't really pause CPU because CPU can execute instructions around it. For example it can get more loads out into the pipeline.

Definitely so, yes.

Stuart (@stooart-mon) is from arm. He might ask hardware engineers as well. Maybe he knows more or might provide some data.

OK. What concerns me is the blast radius of all this. It'd be nice to have some actual experiments.

@theRealAph
Copy link
Contributor

theRealAph commented Jan 19, 2024

So, if I may summarize:

Some Arm software uses ISB as a spin pause, and some claim better performance in some cases, but we have no supporting data.

At present, HotSpot on Apple silicon, spin pause is a nop. Apple silicon is an in-house design, which speculates more than other AArch64 implementations, and has more to lose with an ISB. That doesn't mean that an ISB on Apple silicon is bad for the purpose, it's just that we don't know.

I was hoping that we'd have an opportunity to do some experiments on contended spin locks to try some alternatives. I was also hoping that the PR to implement spin pause on some target would be a forcing function in that direction.

YIELD, which is the instruction actually intended for this purpose, has been implemented by Arm as a nop, which is why we're looking for alternatives. WFET is another possibility.

But "do nothing" is not a neutral position, even though we have no basis on which to make a decision..

@theRealAph
Copy link
Contributor

theRealAph commented Jan 27, 2024

@fbredber In https://bugs.openjdk.org/browse/JDK-8320317 you said "The performance decrease seen on AArch64 based macOS can be fixed by implementing SpinPause() (see: JDK-8321371)."

Please, where is the test case?

@fbredber
Copy link
Contributor Author

@theRealAph
The DaCapo-h2 test indicated that the regression could be mitigated by implementing SpinPause().

Since there is no consensus about if ISB is a good idea or not, we have decided not to use it as default for Apple silicon and just use YIELD for all AArch64 CPUs.

@theRealAph
Copy link
Contributor

@theRealAph The DaCapo-h2 test indicated that the regression could be mitigated by implementing SpinPause().

Since there is no consensus about if ISB is a good idea or not, we have decided not to use it as default for Apple silicon and just use YIELD for all AArch64 CPUs.

But there's been no consensus because (as far as I know) no-one has published the test results. With evidence we can discuss, consensus should be achievable.

@coleenp
Copy link
Contributor

coleenp commented Jan 29, 2024

It seems to me that we should just change the default to 'yield' and let other platforms determine the best tuning, since this might be the cause of aarch64 regression in JDK-8324221.

@theRealAph
Copy link
Contributor

Arm tell us that 'yield' is basically implemented as a nop.

@fisk
Copy link
Contributor

fisk commented Jan 30, 2024

Arm tell us that 'yield' is basically implemented as a nop.

It is not doing much in their current designs, indeed. That's their (questionable?) implementation choice. New AmpereOne chips do however implement yield.

Since the ISA interface gives us a yield instruction dedicated for this and at least one new chip implements it, it makes sense to me that it is the default instruction, rather than none, going forward.

Then we can continue to recognize chips that didn't implement yield and try to figure out how to deal with that awkwardness separately, and hope that vendors (indeed including ARM), start implementing the ISA, instead of having us doubling down on horrible hacks that try to quack like a yield instruction.

@theRealAph
Copy link
Contributor

theRealAph commented Jan 30, 2024

Arm tell us that 'yield' is basically implemented as a nop.

It is not doing much in their current designs, indeed. That's their (questionable?) implementation choice. New AmpereOne chips do however implement yield.

Since the ISA interface gives us a yield instruction dedicated for this and at least one new chip implements it, it makes sense to me that it is the default instruction, rather than none, going forward.

Then we can continue to recognize chips that didn't implement yield and try to figure out how to deal with that awkwardness separately, and hope that vendors (indeed including ARM), start implementing the ISA, instead of having us doubling down on horrible hacks that try to quack like a yield instruction.

This does have the appeal of sweet reasonableness, I agree. However, the vast majority of non-Apple parts are Arm's designs. But OK, yield it is. One day I'd love to try a few other ideas, but not now.

@openjdk
Copy link

openjdk bot commented Feb 5, 2024

@fbredber This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8322535: Change default AArch64 SpinPause instruction

Reviewed-by: eastigeevich, eosterlund, coleenp

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 21 new commits pushed to the master branch:

  • b75c134: 8325313: Header format error in TestIntrinsicBailOut after JDK-8317299
  • 4cd3187: 8324874: AArch64: crypto pmull based CRC32/CRC32C intrinsics clobber V8-V15 registers
  • b02599d: 8298046: Fix hidden but significant trailing whitespace in properties files for serviceability code
  • 6d911f6: 8317299: safepoint scalarization doesn't keep track of the depth of the JVM state
  • 542b0b6: 8324126: Error message for mistyping -XX:+Unlock...Options is not helpful
  • 9ee9f28: 8325213: Flags introduced by configure script are not passed to ADLC build
  • 729ae1d: 8325266: Enable this-escape javac warning in jdk.javadoc
  • e0fd3f4: 8325081: Move '_soft_ref_policy' to 'CollectedHeap'
  • f1f9398: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern
  • ab3b941: 8325270: ProblemList two compiler/intrinsics/float16 tests that fail due to JDK-8324724
  • ... and 11 more: https://git.openjdk.org/jdk/compare/51671c0b92ce9ee581bc850dff382b35a528b1cd...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @eastig, @fisk, @coleenp) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 5, 2024
@fbredber
Copy link
Contributor Author

fbredber commented Feb 6, 2024

Thank you guys for review comments. If anyone wants to continue to evaluate different SpinPause() instructions I've attached a file (spin_pause.c) to JDK-8322535 which can be used to measure the execution time for different instructions that are more or less suitable for use in SpinPause(), e.g. nop, pause, yield and isb.

If no one else has anything to add, I'll integrate (as soon as I can convince a sponsor).

@fbredber
Copy link
Contributor Author

fbredber commented Feb 6, 2024

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Feb 6, 2024
@openjdk
Copy link

openjdk bot commented Feb 6, 2024

@fbredber
Your change (at version c839a3d) is now ready to be sponsored by a Committer.

@coleenp
Copy link
Contributor

coleenp commented Feb 6, 2024

/sponsor

@openjdk
Copy link

openjdk bot commented Feb 6, 2024

Going to push as commit f356970.
Since your change was applied there have been 21 commits pushed to the master branch:

  • b75c134: 8325313: Header format error in TestIntrinsicBailOut after JDK-8317299
  • 4cd3187: 8324874: AArch64: crypto pmull based CRC32/CRC32C intrinsics clobber V8-V15 registers
  • b02599d: 8298046: Fix hidden but significant trailing whitespace in properties files for serviceability code
  • 6d911f6: 8317299: safepoint scalarization doesn't keep track of the depth of the JVM state
  • 542b0b6: 8324126: Error message for mistyping -XX:+Unlock...Options is not helpful
  • 9ee9f28: 8325213: Flags introduced by configure script are not passed to ADLC build
  • 729ae1d: 8325266: Enable this-escape javac warning in jdk.javadoc
  • e0fd3f4: 8325081: Move '_soft_ref_policy' to 'CollectedHeap'
  • f1f9398: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern
  • ab3b941: 8325270: ProblemList two compiler/intrinsics/float16 tests that fail due to JDK-8324724
  • ... and 11 more: https://git.openjdk.org/jdk/compare/51671c0b92ce9ee581bc850dff382b35a528b1cd...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 6, 2024
@openjdk openjdk bot closed this Feb 6, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Feb 6, 2024
@openjdk
Copy link

openjdk bot commented Feb 6, 2024

@coleenp @fbredber Pushed as commit f356970.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

7 participants