8322535: Change default AArch64 SpinPause instruction #17430
Attn @eastig. |
ISB isn't really the right thing for this. Sure, it causes a delay, but the extent of the delay depends on what else the processor is doing. In some cases an ISB can work well, in other cases not. Some micro benchmarks show a great improvement with ISB.
It doesn't depend only on the target hardware, but on the application. Sure, in some cases an ISB is going to be exactly right, but on others it might be too much.
For the most part, "YIELD" is probably going to be equivalent to a "NOP". Unless there is a demonstrable reason for this change, I would leave it as it is.
BTW, in Armv8.7-A/Armv9.2-A we have WFE/WFI with timeouts, which are supported by Cortex-X4, A720 and A520. Available implementations of them are MediaTek Dimensity 9300 and Qualcomm Snapdragon 8 Gen 3. It would be possible to benchmark a WFE-based implementation.
I agree it would be interesting to see whether desktop applications get any improvements from ISB. BTW, ISB gets spread: |
lgtm
Your customers are running cloud apps on Apple M1/M2?
Yeah, I know. What I don't know is how much of a cargo cult this is. Apple M1 etc. have very large reorder buffers, and serializing all instructions may not be the best plan. |
In theory they could. M1, M2 and M2 Pro instances are available in the cloud. However, I am not aware of any such cases.
I hope hardware engineers will notice these improper uses of ISB and will either implement YIELD or something equivalent to it. YIELD could be an alias of a new instruction.
Right.
I think the problem is that the right amount of time to spin for is application dependent. It also depends on things like the way the memory coherence system works, which is architecturally very different on Apple designs. We could do something less violent than ISB for SpinPause. We could execute a bunch of UDIV instructions with a loop-carried dependency, or cycle an xor-shift generator. That could be made to delay for any number of clock cycles, so we can delay without the side effects of an ISB. We could try to measure how many cycles ISB takes in the "good" cases and design a delay that takes as long as an ISB without disrupting everything else. |
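The xor-shift idea in the comment above could be sketched roughly as follows. This is only an illustration, not HotSpot code: the names `xorshift_delay` and `delay_spin_pause`, the seed constant, and the step count are all assumptions, and the number of steps that would match an ISB in the "good" cases would still have to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not HotSpot code): advance a xorshift64 generator
// `steps` times. Each step needs the result of the previous one, so an
// out-of-order core cannot overlap the iterations.
inline std::uint64_t xorshift_delay(std::uint64_t state, unsigned steps) {
  assert(state != 0 && "xorshift64 never leaves the all-zero state");
  for (unsigned i = 0; i < steps; ++i) {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
  }
  return state;
}

// Hypothetical spin-pause replacement: the per-thread state is read and
// written back on every call, so the chain is not dead code within a call.
inline void delay_spin_pause(unsigned steps) {
  static thread_local std::uint64_t state = 0x9E3779B97F4A7C15ull;
  state = xorshift_delay(state, steps);
}
```

A caller would use something like `delay_spin_pause(16)` inside its retry loop. Unlike an ISB, this only occupies one dependency chain and does not force in-flight instructions to drain.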
When I was browsing the interweb I saw that it's not uncommon to use isb instead of yield while spinning on AArch64. Before jumping on the bandwagon I created a test program to measure how long it takes to issue a large number of instructions from several threads running in parallel. I tested nop, yield and isb on Apple's M1, M2 and M3 CPUs. The yield instruction doesn't take longer to execute than a nop instruction (in fact it takes less time than nop). However, isb always takes significantly longer to run than nop or yield on all of the above-mentioned Apple CPUs. This finding combined with the fact that the JVM But I do agree with both @theRealAph and @stooart-mon, isb is not intended for this purpose. It might create a delay that is too long for spinning purposes, and applications overall won't necessarily show any benefit from isb vs yield. Maybe the most reasonable way forward is to only change the default value of OnSpinWaitInst from "none" to "yield" and NOT change it to "isb" for Apple CPUs. After all, that would make us use the "correct" spinning instruction on all AArch64 CPUs (except Neoverse).
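The test program itself is not shown in this thread. A measurement of the kind described might look roughly like the sketch below, which assumes GCC or Clang inline assembly. `SpinInst`, `issue` and `time_ns` are hypothetical names, and on non-AArch64 hosts the instruction is replaced by a compiler barrier so the code still builds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

enum class SpinInst { Nop, Yield, Isb };

// Issue one candidate instruction. Assumption: GCC/Clang inline asm syntax.
template <SpinInst I>
inline void issue() {
#if defined(__aarch64__)
  if (I == SpinInst::Nop)   asm volatile("nop"   ::: "memory");
  if (I == SpinInst::Yield) asm volatile("yield" ::: "memory");
  if (I == SpinInst::Isb)   asm volatile("isb"   ::: "memory");
#else
  asm volatile("" ::: "memory");  // compiler barrier only, no real pause
#endif
}

// Run `iterations` instructions on each of `threads` threads and return the
// wall-clock time in nanoseconds. Thread start-up and join are included.
template <SpinInst I>
std::int64_t time_ns(unsigned threads, std::uint64_t iterations) {
  assert(threads > 0);
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([iterations] {
      for (std::uint64_t i = 0; i < iterations; ++i) issue<I>();
    });
  }
  for (auto& th : pool) th.join();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Comparing `time_ns<SpinInst::Nop>`, `time_ns<SpinInst::Yield>` and `time_ns<SpinInst::Isb>` with the same arguments gives per-instruction cost only. It says nothing about how a contended lock behaves, which is the question raised later in the thread.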
Do we have anyone from Apple who can suggest a spin pause implementation?
This approach is not power efficient. In case of Neoverse |
It'd be nice if we knew what that latency was.
Huh? The only real use for SpinPause is to prevent bus contention when trying to acquire a lock. Chances are we only really have to spin for a few dozen cycles before retrying. It's not long enough to affect power consumption much. Are you thinking of a longer pause?
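For readers unfamiliar with the pattern being discussed, here is a minimal sketch of a test-and-test-and-set spin lock that pauses between retries. It is not HotSpot's implementation. `SpinLock` and `spin_pause` are illustrative names, and the pause uses `yield` on AArch64 (GCC/Clang inline asm assumed) with a plain compiler barrier elsewhere.

```cpp
#include <atomic>
#include <cassert>

// Illustrative pause: "yield" on AArch64, a compiler barrier elsewhere.
inline void spin_pause() {
#if defined(__aarch64__)
  asm volatile("yield" ::: "memory");
#else
  asm volatile("" ::: "memory");
#endif
}

class SpinLock {
 public:
  // Returns true if the lock was free and is now held by the caller.
  bool try_lock() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  void lock() {
    while (!try_lock()) {
      // Wait on a plain load so contending cores do not keep bouncing the
      // cache line with failed writes; pause between polls.
      while (locked_.load(std::memory_order_relaxed)) spin_pause();
    }
  }

  void unlock() {
    assert(locked_.load(std::memory_order_relaxed));
    locked_.store(false, std::memory_order_release);
  }

 private:
  std::atomic<bool> locked_{false};
};
```

The pause sits only in the inner polling loop, which is where the choice between nop, yield, isb or a longer delay would matter under contention.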
OK, so now I'm really curious, given that ISB has a lot of work to do because it has to flush and restart a bunch of on-the-fly instructions. Can you provide any links for where it's been shown to use less power? |
The main point of this PR is not to figure out what Apple HW should bind to, but rather to figure out what a good default is for unrecognized HW. The current default is "none", and the proposal is to change it to "yield". Since yield is the ISA defined instruction for this exact purpose, I think it makes more sense to use yield instead of none. It is certainly less surprising. |
According to our hardware engineers
I don't have data I can share. Stuart (@stooart-mon) is from arm. He might ask hardware engineers as well. Maybe he knows more or might provide some data. |
How is that any better? I get how it might work, but that means that you have to wait for every instruction in progress to retire. And in a CPU with hundreds of instructions on the fly that's no small thing. You have a choice, either to speculate and then roll back when the ISB is actually executed, or to stop speculating for a while. The effect is the same.
It's a delay.
Definitely so, yes.
OK. What concerns me is the blast radius of all this. It'd be nice to have some actual experiments. |
So, if I may summarize: Some Arm software uses ISB as a spin pause, and some claim better performance in some cases, but we have no supporting data. At present, in HotSpot on Apple silicon, spin pause is a nop. Apple silicon is an in-house design, which speculates more than other AArch64 implementations, and has more to lose with an ISB. That doesn't mean that an ISB on Apple silicon is bad for the purpose, it's just that we don't know. I was hoping that we'd have an opportunity to do some experiments on contended spin locks to try some alternatives. I was also hoping that the PR to implement spin pause on some target would be a forcing function in that direction. YIELD, which is the instruction actually intended for this purpose, has been implemented by Arm as a nop, which is why we're looking for alternatives. WFET is another possibility. But "do nothing" is not a neutral position, even though we have no basis on which to make a decision.
@fbredber In https://bugs.openjdk.org/browse/JDK-8320317 you said "The performance decrease seen on AArch64 based macOS can be fixed by implementing SpinPause() (see: JDK-8321371)." Please, where is the test case? |
@theRealAph Since there is no consensus about whether ISB is a good idea or not, we have decided not to use it as the default for Apple silicon and just use YIELD for all AArch64 CPUs.
But there's been no consensus because (as far as I know) no-one has published the test results. With evidence we can discuss, consensus should be achievable. |
It seems to me that we should just change the default to 'yield' and let other platforms determine the best tuning, since this might be the cause of aarch64 regression in JDK-8324221. |
Arm tell us that 'yield' is basically implemented as a nop. |
It is not doing much in their current designs, indeed. That's their (questionable?) implementation choice. New AmpereOne chips do however implement yield. Since the ISA interface gives us a yield instruction dedicated for this and at least one new chip implements it, it makes sense to me that it is the default instruction, rather than none, going forward. Then we can continue to recognize chips that didn't implement yield and try to figure out how to deal with that awkwardness separately, and hope that vendors (indeed including ARM), start implementing the ISA, instead of having us doubling down on horrible hacks that try to quack like a yield instruction. |
This does have the appeal of sweet reasonableness, I agree. However, the vast majority of non-Apple parts are Arm's designs. But OK, |
@fbredber This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 21 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @eastig, @fisk, @coleenp) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Thank you guys for review comments. If anyone wants to continue to evaluate different If no one else has anything to add, I'll integrate (as soon as I can convince a sponsor). |
/integrate |
/sponsor |
Going to push as commit f356970.
Your commit was automatically rebased without conflicts. |
The Java option OnSpinWaitInst lets you choose which AArch64 instruction should be used in SpinPause(). Valid values are "none", "nop", "isb" and "yield". Today the default value for OnSpinWaitInst is unfortunately "none". However, some CPUs change the default SpinPause instruction to something better if the user hasn't set the OnSpinWaitInst option. For instance, if you run on a Neoverse N1, N2, V1 or V2, the default SpinPause instruction is changed to "isb". After doing some measurements on Apple's M1-M3 CPUs, it also seems like "isb" is the best yielding instruction on those CPUs.
This PR changes the default SpinPause instruction to "yield" on all AArch64 platforms except on Apple's M1, M2 and M3 CPUs on which the default value will be "isb".
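As a rough illustration of the option values listed above, the sketch below maps an OnSpinWaitInst string to an enum and falls back to "yield" when no value is given. It is not the HotSpot option-handling code. `SpinWaitInst`, `parse_spin_wait_inst` and `spin_wait_inst_name` are hypothetical names, and CPU-specific overrides such as the Neoverse or Apple cases are deliberately left out.

```cpp
#include <cassert>
#include <cstring>

enum class SpinWaitInst { None, Nop, Isb, Yield, Invalid };

// Hypothetical parser: a null pointer means the user did not set the option,
// in which case the proposed general default "yield" is returned.
inline SpinWaitInst parse_spin_wait_inst(const char* value) {
  if (value == nullptr) return SpinWaitInst::Yield;
  if (std::strcmp(value, "none") == 0)  return SpinWaitInst::None;
  if (std::strcmp(value, "nop") == 0)   return SpinWaitInst::Nop;
  if (std::strcmp(value, "isb") == 0)   return SpinWaitInst::Isb;
  if (std::strcmp(value, "yield") == 0) return SpinWaitInst::Yield;
  return SpinWaitInst::Invalid;  // anything else is rejected
}

// Inverse mapping for the four valid values.
inline const char* spin_wait_inst_name(SpinWaitInst inst) {
  assert(inst != SpinWaitInst::Invalid);
  switch (inst) {
    case SpinWaitInst::None: return "none";
    case SpinWaitInst::Nop:  return "nop";
    case SpinWaitInst::Isb:  return "isb";
    default:                 return "yield";
  }
}
```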
Tested tier1-tier7 successfully on linux-aarch64 and macosx-aarch64.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17430/head:pull/17430
$ git checkout pull/17430
Update a local copy of the PR:
$ git checkout pull/17430
$ git pull https://git.openjdk.org/jdk.git pull/17430/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17430
View PR using the GUI difftool:
$ git pr show -t 17430
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17430.diff