Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8321371: SpinPause() not implemented for bsd_aarch64/macOS #16994

Closed

Conversation

fbredber
Copy link
Contributor

@fbredber fbredber commented Dec 6, 2023

The SpinPause() function only returns 0 on bsd_aarch64 (i.e. macOS)

This PR implements SpinPause() for MacOS on AArch64 and makes it possible to choose between none, nop, isb and yield by using the OnSpinWaitInst option. The same functionality is found on AArch64 based Linux platforms.

Tested successfully on macosx-aarch64 tier1-tier5.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8321371: SpinPause() not implemented for bsd_aarch64/macOS (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16994/head:pull/16994
$ git checkout pull/16994

Update a local copy of the PR:
$ git checkout pull/16994
$ git pull https://git.openjdk.org/jdk.git pull/16994/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16994

View PR using the GUI difftool:
$ git pr show -t 16994

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16994.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 6, 2023

👋 Welcome back fbredberg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 6, 2023

@fbredber The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Dec 6, 2023
@fbredber fbredber marked this pull request as ready for review December 11, 2023 12:42
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 11, 2023
@mlbridge
Copy link

mlbridge bot commented Dec 11, 2023

Webrevs

Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@openjdk
Copy link

openjdk bot commented Dec 13, 2023

@fbredber This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8321371: SpinPause() not implemented for bsd_aarch64/macOS

Reviewed-by: eosterlund, dholmes, dcubed, eastigeevich, shade

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 196 new commits pushed to the master branch:

  • 71aac7a: 8276809: java/awt/font/JNICheck/FreeTypeScalerJNICheck.java shows JNI warning on Windows
  • 09c6c4f: 8322489: 22-b27: Up to 7% regression in all Footprint3-*-G1/ZGC
  • eb9e754: 8323065: Unneccesary CodeBlob lookup in CompiledIC::internal_set_ic_destination
  • a40d397: 8323110: Eliminate -Wparentheses warnings in ppc code
  • 7edd10e: 8321786: SegmentAllocator:allocateFrom(ValueLayout, MemorySegment,ValueLayout,long,long) spec mismatch in exception scenario
  • d75d876: 8322806: Eliminate -Wparentheses warnings in aarch64 code
  • e442769: 8322754: click JComboBox when dialog about to close causes IllegalComponentStateException
  • 3560e68: 8322815: Eliminate -Wparentheses warnings in shenandoah code
  • faa9c69: 8322846: Running with -Djdk.tracePinnedThreads set can hang
  • ace010b: 8319757: java/nio/channels/DatagramChannel/InterruptibleOrNot.java failed: wrong exception thrown
  • ... and 186 more: https://git.openjdk.org/jdk/compare/92fd490f22f690ff7698182658363b7035bcc3bf...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@fisk, @dholmes-ora, @dcubed-ojdk, @eastig, @shipilev) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 13, 2023
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, but we really need the Aarch64 folk to chime in on this one.

Thanks

@shipilev
Copy link
Member

shipilev commented Dec 13, 2023

Paging @eastig and @theRealAph here.

@eastig
Copy link
Member

eastig commented Dec 13, 2023

Hi @fbredber,

If you want to implement SpinPause for bsd_aarch64/macOS, you need:

    if (FLAG_IS_DEFAULT(OnSpinWaitInst)) {
      FLAG_SET_DEFAULT(OnSpinWaitInst, "yield");
    }

    if (FLAG_IS_DEFAULT(OnSpinWaitInstCount)) {
      FLAG_SET_DEFAULT(OnSpinWaitInstCount, 1);
    }

@eastig
Copy link
Member

eastig commented Dec 13, 2023

Based on https://github.com/dougallj/applecpu, yield is nop. I'd use one isb as we use it for Neoverse. On M1, isb is ~25-30 clocks.

@eastig
Copy link
Member

eastig commented Dec 13, 2023

The better way to find out an instruction to use is to run microbenchmarks/benchmarks. See #6803.

@fbredber
Copy link
Contributor Author

fbredber commented Dec 13, 2023

Hi @eastig,

I initially did the things you suggest, but after some internal discussions changed the implementation into a single yield instruction.

Here's some background info from JDK-8321371:

Fredrik: My thought was to implement SpinPause() for MacOS by copying the implementing from src/hotspot/os_cpu/linux_aarch64/os_linux_aarch64.cpp to src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp.

@shipilev I remember trying the same thing along with JDK-8318986, but I realized SpinPause() is only used from the quite hot VM native code, and so it would probably affect GC and runtime performance. The actual Thread.onSpinWait from Java code should be already handled by intrinsics. I suspect the overhead of doing the stub call is already similar to whatever hint we finally emit in the stub, but the WX transition back and forth is likely to be quite bad to make often. SpinWait is quite likely used in busy loops, so this would add up.

So if we are doing this, we need to check how much does this actually cost.

Fredrik: I was running some performance tests after removing ObjectMonitor::NotRunnable() (see JDK-8320317).

The performance went up on Linux x86 and Windows x86 by approximately 12%, but went down with roughly the same amount on macOS AArch64. The performance decreased only slightly on Linux AArch64. So I stated to focus on the differences between macOS and Linux on AArch64 and found out that SpinPause() is implemented on Linux but not on macOS.

So I copied the source from Linux to macOS (or bsd_aarch64 if you'd like) and re-run the tests. This seemed to help bringing back macOS to the Linux level on AArch64.

I do agree that the overhead of doing the stub call is already similar to whatever hint we finally emit in the stub, and that the WX transition back and forth is likely to be quite bad.

My measurements showed that among the different OnSpinWaitInst options, "isb" generated the best result. If we could get rid of the OnSpinWaitInst options and just hard code an isb instruction (or any other instruction that people can agree upon) like it's done on x86, that would probably be best. For now I just wanted macOS AArch64 to be on par with Linux after the removal of NotRunnable().

About measuring the actual cost. Some of the performance tests show notoriously unstable values, when run multiple times. I've focused on the ones that I feel is stable.

After having some internal discussions, it seems like the most reasonable thing to do is to implement SpinPause() using a single inline yield instruction. This way we get rid of both the call and the WX stuff.

This solution also showed better performance figures than the OnSpinWaitInst options did.

The reason for using the yield instruction instead of the the isb instruction (which showed slightly better performance figures) is that the yield instruction is meant for this kind of use cases. So even if isb is slightly better on today's silicon, yield is likely to be better in the long run.

@eastig
Copy link
Member

eastig commented Dec 13, 2023

@fbredber I read comments on JDK-8321371.
I see misunderstanding of the purpose of SpinPause:

After having some internal discussions, it seems like the most reasonable thing to do is to implement SpinPause() using a single inline yield instruction. This way we get rid of both the call and the WX stuff.

This solution also showed better performance figures than the OnSpinWaitInst options did.

The reason for using the yield instruction instead of the the isb instruction (which showed slightly better performance figures) is that the yield instruction is meant for this kind of use cases. So even if isb is slightly better on today's silicon, yield is likely to be better in the long run.

SpinPause should introduce short delays ~30 - 100 clocks. It should have ARM yield or Intel pause semantics to reduce the power consumption. As it is unlikely yield to become a real instruction in the long run, not nop. We tried different approaches to simulate yield. We found that isb is the closest to the needed semantics.

Also without reviewing your benchmarking results, it is difficult to say what you measured. Could you share them?

You might also try another implementation: #5562 (comment)

@eastig
Copy link
Member

eastig commented Dec 13, 2023

According to #5562 (comment), isb works well on Apple M1.

@eastig
Copy link
Member

eastig commented Dec 13, 2023

@shipilev wrote on JDK-8321371:

I remember trying the same thing along with JDK-8318986, but I realized SpinPause() is only used from the quite hot VM native code, and so it would probably affect GC and runtime performance. The actual Thread.onSpinWait from Java code should be already handled by intrinsics. I suspect the overhead of doing the stub call is already similar to whatever hint we finally emit in the stub, but the WX transition back and forth is likely to be quite bad to make often. SpinWait is quite likely used in busy loops, so this would add up.

I don't know what WX transition is.
I can comment:

The actual Thread.onSpinWait from Java code should be already handled by intrinsics.

As bsd_aarch64 does not define what to use for spin wait, the intrinsic has no code.

@eastig
Copy link
Member

eastig commented Dec 13, 2023

The discussion of the ISB based approach: RFC: AArch64: Implementing spin pauses with ISB

@eastig
Copy link
Member

eastig commented Dec 13, 2023

... just hard code an isb instruction (or any other instruction that people can agree upon) like it's done on x86, that would probably be best.

I wanted the same but it was decided it'd be nice to have other options for spin wait/pause: nop and yield.

@fbredber
Copy link
Contributor Author

fbredber commented Dec 14, 2023

I understand. Here the discussion went something like this:

The BSD port is almost strictly a MacOSX port. The configurability is more costly on MacOSX, because of the need to call os::current_thread_enable_wx().

The vast majority of MacOSX users don't want to configure which SpinPause instruction to use, they want something that is good straight out of the box.

Since there is a dedicated instruction (yield) in the AArch64 for this use case, we ought to use it. Because it's in the interest of the CPU vendor to have as good yield implementation as possible for each and every variety of the CPU. So if the user upgrades to a new CPU version, the SpinPause should still perform as good as possible, without the need to reconfigure.

For more info about the WX stuff, see here:
https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon

@eastig
Copy link
Member

eastig commented Dec 14, 2023

I understand. Here the discussion went something like this:

The BSD port is almost strictly a MacOSX port. The configurability is more costly on MacOSX, because of the need to call os::current_thread_enable_wx().

Am I correct that in our current implementation if we call code of any generated stub we need to use MACOS_AARCH64_ONLY(ThreadWXEnable wx(WXExec, thread));?

The vast majority of MacOSX users don't want to configure which SpinPause instruction to use, they want something that is good straight out of the box.

And they don't need to do. The configuration feature is for us to define the best spin pause implementation for an AArch64 CPU.
Different AArch64 CPU might need different implementations. For Neoverse CPUs and Apple M1, one ISB is good.

As I wrote if you don't choose an instruction for spin pause, you will have an empty onSpinWait intrinsic.

Since there is a dedicated instruction (yield) in the AArch64 for this use case, we ought to use it.

yield first appeared in ARMv6. I think it can be considered as a legacy instruction. I don't know any currently existing ARM CPU which implements it. As it is not implemented, it is a NOP.

Because it's in the interest of the CPU vendor to have as good yield implementation as possible for each and every variety of the CPU.

It looks like vendors have not been interested inyield for 22 years since ARMv6 introduced in 2001. Instead in Armv8.7-A and Armv9.2-A WFE/WFI with timeouts were introduced. They might allow more effective spin loops in hardware independent way. Cortex-X4, A720 and A520 support them. Implementations of Cortex-X4 are MediaTek Dimensity 9300 and Qualcomm Snapdragon 8 Gen 3. BTW, Apple M3 is ARMv8.6-A.

So if the user upgrades to a new CPU version, the SpinPause should still perform as good as possible, without the need to reconfigure.

We usually know ahead of a user, when a widely used CPU family gets new features useful for JVM. By the time users get a new CPU, they will get JVM supporting the new hardware features. With the current JDK Project release model, it's more likely users get them as soon as possible.

@eastig
Copy link
Member

eastig commented Dec 14, 2023

@fbredber
I think what we have here is the XY problem. Please help me to understand what a problem you are trying to solve.

@fisk
Copy link
Contributor

fisk commented Dec 15, 2023

Let’s zoom out and look at the big picture for a bit.

So yield is the obvious instruction dedicated for this purpose in the ISA, and has been for a very long time. I suppose early ARM chips didn’t have a whole lot of concurrency and implementing it wasn’t all that beneficial as you frankly didn’t spend considerable time spinning. And now we seemingly have come to a classic chicken and egg problem. Software people like us don’t want to use the obvious yield instruction intended for this exact purpose, because hardware vendors haven’t implemented it. And hardware vendors don’t want to implement it, because no software is using it.

It feels like we had a sort of similar situation with neon vs SVE. All hardware was running neon, and nobody was running SVE. That doesn’t make it very encouraging as a software developer to implement SVE support in software, for an imaginary chip that doesn’t exist. And the fact that software doesn’t implement SVE doesn’t make it very encouraging to implement it. Yet we did it because it was the right thing to do, and no benchmark thanked us for it.

In the short term, it would seem like a better idea to use ISB, if you look at micro benchmarks. But what we are doing then is IMO what we tell our Java users not to do, and for good reason. We say “don’t use Unsafe to expose JDK and JVM internals that you happen to know how it works today, but you don’t know how it will work tomorrow, so you can look like a winner in a microbenchmark”. We are looking past the intended ISA contract, and look at current implementations today, and finding that as a hack, the ISB instruction which was designed to deal with cross modifying code, and not at all designed for this, is currently a better fit for doing what the yield instruction should have done had it been implemented, as the current ISB implementation has a long latency. And then a couple of years later, when the next LTS is released, a the Apple M3 is released. And then by the time we hit bulk of mainstream adoption, maybe we cycle through M4 and M5, and perhaps we end up with a hyper threaded core with 4 threads per core, and the isb turns out to be a disaster as it impedes progress of the other threads on the core to avoid shared resources being tripped over by cross modifying code, which is the exact opposite behaviour of what a pause instruction should do. Or maybe it won’t happen and everything is fine. The point is that we don’t know.

We are now in a situation where on MacOS the spinning hint does nothing. There is no cost over the baseline of making that spinning hint bind to the obvious yield instruction. It might not do a lot on Apple silicon today, and as of today, isb would probably yield better results (pun intended). But we don’t have to choose ISB just because it looks better today. We can instead be bold and break the chicken and egg situation, and use yield. Then the ball is on the HW court to do the right thing.

Because of this, I still think the right thing to do, is to bind SpinPause to the obvious ISA intended “yield” instruction, even though it is currently a nop.

@fbredber
Copy link
Contributor Author

@eastig

I think what we have here is the XY problem. Please help me to understand what a problem you are trying to solve.

I'd like to have a SpinPause() implementation in place for AArch64 based MacOSX platforms when I integrate #17050, which removes ObjectMonitor::NotRunnable().

@eastig
Copy link
Member

eastig commented Dec 15, 2023

@fisk I don't mind to use yield.

isb appeared as a result of real customer cases and consultations with hardware engineers. It was a real problem to write a micro-benchmark. I think it's still a problem.
We are aware that different implementations of AArch64 might need different spin pause implementations. This is why we provide options which instructions to use: yield, nop or isb.

I think hardcoding yield is not a good idea. We should use the current implemented approach where yield is selected. If the current approach does not work because it is based on a generated stub in CodeCache. Let's modify it to get code in place instead of stub.

@fbredber @fisk
What do you think about the following solution:

In src/hotspot/cpu/aarch64/spin_wait_aarch64.hpp we add DEFAULT to SpinWait::Inst which is set to YIELD. We also add a template function exec<SpinWait::Instr>() and its implementation exec<SpinWait::YIELD>()

class SpinWait {
public:
  enum Inst {
    NONE = -1,
    NOP,
    ISB,
    YIELD,
    DEFAULT = YIELD
  };
...
};

template<SpinWait::Inst inst> void exec();

template<> exec<SpinWait::YIELD>() {
  asm volatile("yield" : : : "memory");
}

In src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp:

extern "C" {
  int SpinPause() {
    exec<SpinWait::DEFAULT>();
    return 1;
  }

To enable onSpinWait intrinsic (this or a separate PR) you will need to set yield as default in src/hotspot/cpu/aarch64/globals_aarch64.hpp

  product(ccstr, OnSpinWaitInst, "yield", DIAGNOSTIC,                    \
          "The instruction to use to implement "                        \
          "java.lang.Thread.onSpinWait()."                              \
          "Options: none, nop, isb, yield.")                            \

You will also need to update tests:

  • test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java
  • test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitNoneAArch64.java

@fisk
Copy link
Contributor

fisk commented Dec 15, 2023

@eastig I think the essence of what you are proposing - having a non-WX solution for making the instruction selection configurable, but defaulting to yield - sounds very reasonable. Good suggestion!

@fbredber
Copy link
Contributor Author

Sounds like a way forward. I'll try it out. Thanks!

@fbredber
Copy link
Contributor Author

@eastig @fisk

I've now rewritten SpinPause() so you can use the OnSpinWaitInst option to choose between none, nop, isb and yield.
I choose not to change the default from none to yield in this PR. For that I've created JDK-8322535.

Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@eastig
Copy link
Member

eastig commented Dec 20, 2023

@fbredber
The assembly code looks good to me.
Out of curiosity, why not to use a regular switch? Why is the manually written assembly better?
What if it looked like the following:

class SpinWait {
public:
   static void exec(Inst inst) {
      switch (inst) {
        case NOP:
            asm volatile();
            break;
        case ISB:
            asm volatile();
            break;
         case YIELD:
            asm volatile();
            break;
        default:
      }
   }
...
};

extern "C" {
  int SpinPause() {
    SpinWait::exec(VM_Version::spin_wait_desc().inst());
    return 1;
  }

@fbredber
Copy link
Contributor Author

@eastig

Out of curiosity, why not to use a regular switch? Why is the manually written assembly better?

I just like to keep away from conditional branches in code that is supposed to be in tight loops. :)

@dcubed-ojdk
Copy link
Member

@fbredber - This PR's description and the final comment in JDK-8321371
are out of date relative to the change in direction. Please update the PR's description
and add a new comment to the bug to reflect the current design of the change.

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumbs up with a couple of minor nits.

I like the idea of hooking the os_bsd_aarch64 version of SpinPause
into the OnSpinWaitInst option. This will enable easier experimentation
in the future without specially built binaries to switch the SpinPause
implementation.

src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp Outdated Show resolved Hide resolved
src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp Outdated Show resolved Hide resolved
@fbredber
Copy link
Contributor Author

@dcubed-ojdk

Please update the PR's description and add a new comment to the bug to reflect the current design of the change.

Fixed

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumbs up.

Copy link
Member

@eastig eastig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@fbredber
Copy link
Contributor Author

Thanks for the review guys. Since it's now the night before my Christmas vacation, I will wait and integrate this PR another year.

@fbredber
Copy link
Contributor Author

fbredber commented Jan 8, 2024

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jan 8, 2024
@openjdk
Copy link

openjdk bot commented Jan 8, 2024

@fbredber
Your change (at version 935f586) is now ready to be sponsored by a Committer.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, this one is pretty wild, but I somehow like it.

@fisk
Copy link
Contributor

fisk commented Jan 8, 2024

/sponsor

@openjdk
Copy link

openjdk bot commented Jan 8, 2024

Going to push as commit fc04750.
Since your change was applied there have been 197 commits pushed to the master branch:

  • 458e563: 8310711: [IR Framework] Remove safepoint while printing handling
  • 71aac7a: 8276809: java/awt/font/JNICheck/FreeTypeScalerJNICheck.java shows JNI warning on Windows
  • 09c6c4f: 8322489: 22-b27: Up to 7% regression in all Footprint3-*-G1/ZGC
  • eb9e754: 8323065: Unneccesary CodeBlob lookup in CompiledIC::internal_set_ic_destination
  • a40d397: 8323110: Eliminate -Wparentheses warnings in ppc code
  • 7edd10e: 8321786: SegmentAllocator:allocateFrom(ValueLayout, MemorySegment,ValueLayout,long,long) spec mismatch in exception scenario
  • d75d876: 8322806: Eliminate -Wparentheses warnings in aarch64 code
  • e442769: 8322754: click JComboBox when dialog about to close causes IllegalComponentStateException
  • 3560e68: 8322815: Eliminate -Wparentheses warnings in shenandoah code
  • faa9c69: 8322846: Running with -Djdk.tracePinnedThreads set can hang
  • ... and 187 more: https://git.openjdk.org/jdk/compare/92fd490f22f690ff7698182658363b7035bcc3bf...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 8, 2024
@openjdk openjdk bot closed this Jan 8, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Jan 8, 2024
@openjdk
Copy link

openjdk bot commented Jan 8, 2024

@fisk @fbredber Pushed as commit fc04750.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

6 participants