8186670: Implement _onSpinWait() intrinsic for AArch64 #5562

Closed
wants to merge 16 commits into from

eastig
Copy link
Contributor

@eastig eastig commented Sep 17, 2021

This PR is a follow-up on the discussion “RFC: AArch64: Implementing spin pauses with ISB”.

It adds DIAGNOSTIC options OnSpinWaitInst=inst, where inst can be:

  • none: no implementation for spin pauses. This is the default value.
  • nop: use nop instruction for spin pauses.
  • isb: use isb instruction for spin pauses.
  • yield: use yield instruction for spin pauses.

And OnSpinWaitInstCount=count, where count specifies the number of OnSpinWaitInst instructions to emit and must be in the range 1..99. It is an error to use OnSpinWaitInstCount when OnSpinWaitInst is none.

The code for the Thread.onSpinWait intrinsic is generated based on the values of OnSpinWaitInst and OnSpinWaitInstCount.
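As a minimal sketch of the kind of code the intrinsic affects, here is a Java spin loop that calls Thread.onSpinWait() while polling a flag. The class name `SpinWaitDemo` and the method `spinUntilSet` are illustrative and not part of this PR. Because the options are diagnostic, running with them presumably also requires `-XX:+UnlockDiagnosticVMOptions`, for example `java -XX:+UnlockDiagnosticVMOptions -XX:OnSpinWaitInst=isb -XX:OnSpinWaitInstCount=1 SpinWaitDemo` on an AArch64 build that contains this change.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitDemo {
    // Spins until the flag becomes true and returns how many times it spun.
    // Thread.onSpinWait() is only a hint; the loop is correct without it.
    static long spinUntilSet(AtomicBoolean flag) {
        long spins = 0;
        while (!flag.get()) {
            Thread.onSpinWait();
            spins++;
        }
        return spins;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread setter = new Thread(() -> ready.set(true));
        setter.start();
        long spins = spinUntilSet(ready);
        setter.join();
        System.out.println("flag observed after " + spins + " spins");
    }
}
```

With OnSpinWaitInst=none the call compiles to nothing extra, so the same program can be used to compare settings.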

Testing:

  • make test TEST="gtest": Passed
  • make run-test TEST="tier1": Passed
  • make run-test TEST="tier2": Passed
  • make run-test TEST=hotspot/jtreg/compiler/onSpinWait: Passed

CSR: https://bugs.openjdk.java.net/browse/JDK-8274564


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8186670: Implement _onSpinWait() intrinsic for AArch64

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5562/head:pull/5562
$ git checkout pull/5562

Update a local copy of the PR:
$ git checkout pull/5562
$ git pull https://git.openjdk.java.net/jdk pull/5562/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5562

View PR using the GUI difftool:
$ git pr show -t 5562

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5562.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Sep 17, 2021

👋 Welcome back eastig! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Sep 17, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Sep 17, 2021

@eastig The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot label Sep 17, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 17, 2021

Copy link
Contributor

@stooart-mon stooart-mon left a comment

Looks Ok to me, this is the most future proof option.
Will you be adding code to set the default depending on model, or is that something for your fork?

Unimplemented();
switch (VM_Version::pause_impl_desc().inst()) {
case NOP:
for (unsigned int i = 1; i < VM_Version::pause_impl_desc().inst_count(); ++i) {
Contributor

@stooart-mon stooart-mon Sep 17, 2021

Shouldn't these loops be indexed from 0?

Contributor Author

@eastig eastig Sep 17, 2021

Good catch. It is a copy-paste error.
Is there any method to test C1 generated assembly code?

Contributor

@theRealAph theRealAph Sep 18, 2021

You could do it the same way as hotspot/jtreg/compiler/c2/aarch64/TestVolatiles.java, i.e. spawn a subtask and parse the output dump. It's very fiddly, though.

Contributor Author

@eastig eastig Sep 20, 2021

Yes, I used it as an example when I was writing tests for the PR. It works only for C2 because it relies on the C2 option -XX:+PrintOptoAssembly. I haven't found anything similar for C1.

Contributor Author

@eastig eastig Sep 20, 2021

Fixed.

Contributor

@theRealAph theRealAph Sep 21, 2021

-XX:+PrintAssembly

Contributor Author

@eastig eastig Sep 21, 2021

For -XX:+PrintAssembly output to contain assembly instructions, hsdis needs to be provided. Without it, the output looks like this:

  0x0000ffff61ba2b5c: ;   {metadata({method} {0x0000000800466ab8} 'isLatin1' '()Z' in 'java/lang/String')}
  0x0000ffff61ba2b5c: 0857 8dd2 | c808 a0f2 | 0801 c0f2 | e807 00f9
 ;; 0xFFFFFFFFFFFFFFFF
  0x0000ffff61ba2b6c: 0800 8092 | e803 00f9

  0x0000ffff61ba2b74: ;   {runtime_call counter_overflow Runtime1 stub}

However, it can help to skip to the place where instructions are expected and to check the instructions' hex codes.

Contributor

@theRealAph theRealAph Sep 21, 2021

True. There's no C1 equivalent.

Contributor Author

@eastig eastig Sep 21, 2021

I rewrote a test to parse the hex instructions in -XX:+PrintAssembly output.
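For illustration only, checking hex instructions in such a dump could look roughly like the sketch below. It is not the test from this PR: the class `HexRunCounter` and its methods are hypothetical, and the sample text is a dump of the form quoted in this thread. It assumes words are separated by `|`, that comment lines start with `;` after the address, and that `1f20 03d5` is how the dump prints the AArch64 NOP (0xd503201f).

```java
import java.util.ArrayList;
import java.util.List;

public class HexRunCounter {
    // Byte-wise rendering of the AArch64 NOP (0xd503201f) as it appears in the dumps.
    static final String NOP = "1f20 03d5";

    // Sample PrintAssembly output without hsdis, as quoted in this thread.
    static final String SAMPLE = String.join("\n",
        "  0x0000ffff9d557680: 1f20 03d5 | e953 40d1 | 3f01 00f9 | ff03 01d1 | fd7b 03a9 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5",
        "  0x0000ffff9d5576a0: 1f20 03d5 | 1f20 03d5 | 1f20 03d5",
        "",
        "  0x0000ffff9d5576ac: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0}",
        "                      ; - compiler.onSpinWait.TestOnSpinWaitImplAArch64$Launcher::test@0 (line 161)",
        "  0x0000ffff9d5576ac: 1f20 03d5 | fd7b 43a9 | ff03 0191");

    // Collects instruction words from lines of the form "0x<addr>: w1 | w2 | ...",
    // skipping blank lines and comment lines.
    static List<String> instructionWords(String dump) {
        List<String> words = new ArrayList<>();
        for (String line : dump.split("\n")) {
            String trimmed = line.trim();
            int colon = trimmed.indexOf(':');
            if (!trimmed.startsWith("0x") || colon < 0) {
                continue;
            }
            String rest = trimmed.substring(colon + 1).trim();
            if (rest.isEmpty() || rest.startsWith(";")) {
                continue;
            }
            for (String word : rest.split("\\|")) {
                words.add(word.trim());
            }
        }
        return words;
    }

    // Length of the longest run of consecutive words equal to the given encoding.
    static int longestRun(List<String> words, String encoding) {
        int best = 0;
        int current = 0;
        for (String word : words) {
            current = word.equals(encoding) ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("longest NOP run: " + longestRun(instructionWords(SAMPLE), NOP));
    }
}
```

Looking for the longest consecutive run, rather than counting instructions after a particular comment line, sidesteps the imprecise placement of debug-info comments in the dump.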

Copy link
Member

@phohensee phohensee left a comment

Do you intend to make isb the default for N1?

@eastig
Copy link
Contributor Author

@eastig eastig commented Sep 17, 2021

Do you intend to make isb the default for N1?

Yes, I do.
I'll rewrite #5112 to use different implementations.
After that, I'd like to enable ISB for N1.

@dholmes-ora
Copy link
Member

@dholmes-ora dholmes-ora commented Sep 19, 2021

/csr needed

@dholmes-ora
Copy link
Member

@dholmes-ora dholmes-ora commented Sep 19, 2021

If you are adding a new product flag then a CSR request is needed.

David

@openjdk openjdk bot added the csr label Sep 20, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Sep 20, 2021

@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@eastig please create a CSR request for issue JDK-8186670. This pull request cannot be integrated until the CSR request is approved.

switch (VM_Version::pause_impl_desc().inst()) {
case NOP:
PRINT_N_INST(nop)
break;
case ISB:
PRINT_N_INST(isb)
break;
case YIELD:
PRINT_N_INST(yield)
break;
default:
ShouldNotReachHere();
}
#undef EMIT_N_ASM_STRINGS
Contributor

@theRealAph theRealAph Sep 21, 2021

None of this is necessary. Printing "onspinwait" is enough.

Contributor Author

@eastig eastig Sep 21, 2021

This results in no instructions implementing onspinwait in the OptoAssembly output. Why do we want to hide the details?

Contributor

@theRealAph theRealAph Sep 21, 2021

It's not useful as a verification that the correct instructions were generated: you need a disassembly for that. OptoAssembly is usually a somewhat briefer format.

Contributor Author

@eastig eastig Sep 21, 2021

Done

switch (VM_Version::pause_impl_desc().inst()) {
case NOP:
EMIT_N_INST(inst_count, nop);
break;
case ISB:
EMIT_N_INST(inst_count, isb);
break;
case YIELD:
EMIT_N_INST(inst_count, yield);
break;
default:
ShouldNotReachHere();
}
%}
Contributor

@theRealAph theRealAph Sep 21, 2021

Please let the MacroAssembler do this. Just call MacroAssembler::spin_wait().

Contributor Author

@eastig eastig Sep 21, 2021

Done

EMIT_N_INST(inst_count, nop);
break;
case ISB:
EMIT_N_INST(inst_count, isb);
break;
case YIELD:
EMIT_N_INST(inst_count, yield);
break;
default:
ShouldNotReachHere();
Contributor

@theRealAph theRealAph Sep 21, 2021

Same here. Please just call MacroAssembler::spin_wait().

Contributor Author

@eastig eastig Sep 21, 2021

Done

@@ -110,7 +110,10 @@ define_pd_global(intx, InlineSmallCode, 1000);
product(int, SoftwarePrefetchHintDistance, -1, \
"Use prfm hint with specified distance in compiled code." \
"Value -1 means off.") \
-range(-1, 4096)
+range(-1, 4096) \
+product(ccstr, UsePauseImpl, "none", \
Contributor

@theRealAph theRealAph Sep 21, 2021

The name "UsePauseImpl" fails to make the connection with onSpinWait. If you called it something like OnSpinWaitImpl that would make the connection.

Contributor Author

@eastig eastig Sep 21, 2021

Done

-range(-1, 4096)
+range(-1, 4096) \
+product(ccstr, UsePauseImpl, "none", \
+"Use instructions to implement pauses." \
Contributor

@theRealAph theRealAph Sep 21, 2021

Suggested change
-"Use instructions to implement pauses." \
+"Use instructions to implement java.lang.Thread.onSpinWait() ." \

Contributor Author

@eastig eastig Sep 21, 2021

Done

@eastig
Copy link
Contributor Author

@eastig eastig commented Sep 21, 2021

If you are adding a new product flag then a CSR request is needed.

David

Hi David,
I'll create a CSR when the name of the option is finalized.

Code emitting spin pauses is moved to MacroAssembler::spin_wait.
As OptoAssembly output is changed, tests are updated to parse
PrintAssembly.
@@ -1380,6 +1380,27 @@ class MacroAssembler: public Assembler {
void cache_wb(Address line);
void cache_wbsync(bool is_pre);

// Code for java.lang.Thread::onSpinWait() intrinsic.
void spin_wait() {
#define EMIT_N_INST(n, inst) for (int i = 0; i < (n); ++i) inst()
Member

@nick-arm nick-arm Sep 22, 2021

Why use a macro here? You could just put the loop around the switch statement. And the method body seems sufficiently large that it ought to go in the .cpp file.

Contributor

@theRealAph theRealAph Sep 22, 2021

Good point. There's no significant performance advantage to having this in the header.

Contributor Author

@eastig eastig Sep 22, 2021

Why use a macro here? You could just put the loop around the switch statement. And the method body seems sufficiently large that it ought to go in the .cpp file.

:) compiler engineering experience. Compilers have trouble applying the unswitching optimization to loop-invariant switch statements.
I'll update the code as suggested.

Contributor Author

@eastig eastig Sep 28, 2021

Done

@eastig
Copy link
Contributor Author

@eastig eastig commented Sep 22, 2021

@theRealAph, when I was writing a test I noticed a strange thing in the PrintAssembly output:

# {method} {0x0000ffff6ac00370} 'test' '()V' in 'compiler/onSpinWait/TestOnSpinWaitImplAArch64$Launcher'
  #           [sp+0x40]  (sp of caller)
  0x0000ffff9d557680: 1f20 03d5 | e953 40d1 | 3f01 00f9 | ff03 01d1 | fd7b 03a9 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5
  0x0000ffff9d5576a0: 1f20 03d5 | 1f20 03d5 | 1f20 03d5

  0x0000ffff9d5576ac: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0}
                      ; - compiler.onSpinWait.TestOnSpinWaitImplAArch64$Launcher::test@0 (line 161)
  0x0000ffff9d5576ac: 1f20 03d5 | fd7b 43a9 | ff03 0191

The code is for the case when 7 NOPs are used for a spin pause.
In the output, only one instruction comes after invokestatic onSpinWait. The other 6 instructions come before it.
Is this expected behaviour or a bug?

In addition, comments are added to a checking method of a test.
@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Sep 22, 2021

It's pretty much expected, yes. The debuginfo isn't all that precise.

Copy link
Member

@phohensee phohensee left a comment

In pause_aarch64.hpp, I'd put the definition of PauseInst inside the definition of PauseImplDesc in order to not clutter the global namespace more than needed.

} else if (strcmp(s, "yield") == 0) {
return PauseImplDesc(YIELD, count);
} else if (strcmp(s, "none") != 0) {
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl", OnSpinWaitImpl);
Contributor

@theRealAph theRealAph Sep 23, 2021

Suggested change
-vm_exit_during_initialization("Invalid value for OnSpinWaitImpl", OnSpinWaitImpl);
+vm_exit_during_initialization("The options for OnSpinWaitImpl are nop, isb, yield, and none", OnSpinWaitImpl);

Contributor Author

@eastig eastig Sep 28, 2021

Done

if (isdigit(*s)) {
count = *s - '0';
if (count == 0) {
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl: zero instruction count", OnSpinWaitImpl);
}
s += 1;
}
Contributor

@theRealAph theRealAph Sep 23, 2021

Suggested change
-if (isdigit(*s)) {
-count = *s - '0';
-if (count == 0) {
-vm_exit_during_initialization("Invalid value for OnSpinWaitImpl: zero instruction count", OnSpinWaitImpl);
-}
-s += 1;
-}
+while (isdigit(*s++));
+count = atoi(OnSpinWaitImpl);

Contributor

@theRealAph theRealAph Sep 23, 2021

As far as I know, this combination of digits and named option is unusual in HotSpot; it may be unique. For the sake of not doing something so unfamiliar to our users, it may be worth separating the count and the option string.
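To make the separated form concrete, here is a small Java model of the validation rules stated in the PR description (OnSpinWaitInst is one of none, nop, isb, yield; OnSpinWaitInstCount is in 1..99 and is an error with none). This is a sketch, not HotSpot's C++ option handling: the class `SpinWaitOptions`, its `parse` method, the error messages, and the default count of 1 when no count is given are all assumptions for illustration.

```java
public class SpinWaitOptions {
    enum Inst { NONE, NOP, ISB, YIELD }

    final Inst inst;
    final int count;

    private SpinWaitOptions(Inst inst, int count) {
        this.inst = inst;
        this.count = count;
    }

    // instName models the value of OnSpinWaitInst.
    // countOrNull models OnSpinWaitInstCount, or null if the option was not given.
    static SpinWaitOptions parse(String instName, Integer countOrNull) {
        Inst inst;
        switch (instName) {
            case "none":  inst = Inst.NONE;  break;
            case "nop":   inst = Inst.NOP;   break;
            case "isb":   inst = Inst.ISB;   break;
            case "yield": inst = Inst.YIELD; break;
            default:
                throw new IllegalArgumentException(
                    "The options for OnSpinWaitInst are nop, isb, yield, and none: " + instName);
        }
        if (inst == Inst.NONE) {
            if (countOrNull != null) {
                throw new IllegalArgumentException(
                    "OnSpinWaitInstCount cannot be used when OnSpinWaitInst is none");
            }
            return new SpinWaitOptions(inst, 0);
        }
        // Assumption: a single instruction when no count is given.
        int count = (countOrNull == null) ? 1 : countOrNull;
        if (count < 1 || count > 99) {
            throw new IllegalArgumentException("OnSpinWaitInstCount must be in 1..99: " + count);
        }
        return new SpinWaitOptions(inst, count);
    }

    public static void main(String[] args) {
        SpinWaitOptions options = parse("isb", 4);
        System.out.println(options.inst + " x " + options.count);
    }
}
```

Keeping the instruction name and the count as two values means each can be validated with an ordinary enum match and range check, with no digit parsing inside the name.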

@eastig
Copy link
Contributor Author

@eastig eastig commented Oct 14, 2021

Results:

  • 1 isb
Benchmark                              (maxNum)  Mode  Cnt    Score   Error  Units
ThreadOnSpinWait.count:withOnSpinWait   1000000  avgt   10   18.029 ± 0.001  ms/op
ThreadOnSpinWait.count:withSleep0       1000000  avgt   10  337.506 ± 2.530  ms/op
ThreadOnSpinWait.count:withoutPause     1000000  avgt   10    2.663 ± 0.073  ms/op
  • 1 yield
Benchmark                              (maxNum)  Mode  Cnt    Score   Error  Units
ThreadOnSpinWait.count:withOnSpinWait   1000000  avgt   10    2.667 ± 0.069  ms/op
ThreadOnSpinWait.count:withSleep0       1000000  avgt   10  339.933 ± 4.954  ms/op
ThreadOnSpinWait.count:withoutPause     1000000  avgt   10    2.677 ± 0.075  ms/op

@eastig
Copy link
Contributor Author

@eastig eastig commented Oct 15, 2021

@theRealAph, any comments on the microbenchmark I wrote?

@nick-arm
Copy link
Member

@nick-arm nick-arm commented Oct 15, 2021

@theRealAph, any comments on the microbenchmark I wrote?

I think to show the benefit of the onSpinWait() changes you'd need some contention between the threads - e.g. if nowork() was updating a shared counter.

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Oct 15, 2021

@theRealAph, any comments on the microbenchmark I wrote?

Something like this works well:

   @Param({"1000000"})
    public int maxNum;

    @Param({"4"})
    public int threadCount;

    AtomicInteger theCounter;

    Thread threads[];

    void work() {
        for (;;) {
            int prev = theCounter.get();
            if (prev >= maxNum) {
                break;
            }
            if (theCounter.compareAndExchange(prev, prev + 1) != prev) {
                Thread.onSpinWait();
            }
        }
    }

    @Setup(Level.Trial)
    public void foo() {
        theCounter = new AtomicInteger();
    }

    @Setup(Level.Invocation)
    public void setup() {
        theCounter.set(0);
        threads = new Thread[threadCount];

        for (int i = 0; i< threads.length; i++) {
            threads[i] = new Thread(this::work);
        }

    }

    @Benchmark
    public void trial() throws Exception {
        for (int i = 0; i< threads.length; i++) {
            threads[i].start();
        }
        for (int i = 0; i< threads.length; i++) {
            threads[i].join();
        }
    }
}

Before:

Benchmark               (maxNum)  (threadCount)  Mode  Cnt   Score    Error  Units
ThreadOnSpinWait.trial   1000000              2  avgt    3  43.830 ± 32.543  ms/op

With -XX:OnSpinWaitInst=isb -XX:OnSpinWaitInstCount=4

Benchmark               (maxNum)  (threadCount)  Mode  Cnt   Score    Error  Units
ThreadOnSpinWait.trial   1000000              2  avgt    3  22.181 ± 11.592  ms/op

With -XX:OnSpinWaitInst=isb -XX:OnSpinWaitInstCount=1

Benchmark               (maxNum)  (threadCount)  Mode  Cnt   Score    Error  Units
ThreadOnSpinWait.trial   1000000              2  avgt    3  36.281 ± 31.700  ms/op

This is Apple M1, where you have to be very careful because there's some processor
frequency scaling going on.

eastig added 2 commits Oct 18, 2021
ThreadOnSpinWait: latency of Thread.onSpinWait() vs Thread.sleep(0) vs
no pause.
ThreadOnSpinWaitProducerConsumer: producer-consumer where consumer can
go to 1ms sleep.
ThreadOnSpinWaitSharedCounter: threads racing to count up (code provided
by Andrew Haley, Red Hat).
@eastig
Copy link
Contributor Author

@eastig eastig commented Oct 18, 2021

@theRealAph, any comments on the microbenchmark I wrote?

Something like this works well:

Thank you for the microbenchmark. I added it.
I also added a microbenchmark to measure the latency of Thread.onSpinWait vs Thread.sleep(0) vs no pause.
And a microbenchmark ThreadOnSpinWaitProducerConsumer.

@eastig
Copy link
Contributor Author

@eastig eastig commented Oct 20, 2021

Hi @theRealAph,
Any comments on the microbenchmarks?

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ThreadOnSpinWaitProducerConsumer {
@Param({"100"})
Contributor

@theRealAph theRealAph Oct 20, 2021

This test seems rather artificial and unrealistic to me. You can get improved performance simply by increasing the spinNum so that it waits for longer. The way to get some advantage from onSpinWait is to have multiple threads racing to update the same thing, e.g. to acquire a lock.
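A minimal sketch of that kind of contention, assuming a simple test-and-test-and-set spin lock: several threads race to acquire the same lock and call Thread.onSpinWait() while it is held. The class `SpinLockDemo` and its methods are hypothetical and are not one of the benchmarks in this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLockDemo {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private int counter; // guarded by the spin lock

    void lock() {
        while (!locked.compareAndSet(false, true)) {
            // Spin on a plain read until the lock looks free, then retry the CAS.
            while (locked.get()) {
                Thread.onSpinWait();
            }
        }
    }

    void unlock() {
        locked.set(false);
    }

    // Starts threadCount threads that each increment the shared counter
    // incrementsPerThread times under the lock, and returns the final value.
    static int run(int threadCount, int incrementsPerThread) {
        SpinLockDemo demo = new SpinLockDemo();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    demo.lock();
                    try {
                        demo.counter++;
                    } finally {
                        demo.unlock();
                    }
                }
            });
            threads[i].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return demo.counter;
    }

    public static void main(String[] args) {
        System.out.println("final counter: " + run(4, 100_000));
    }
}
```

The inner read-only loop keeps waiting threads from hammering the lock word with failed CAS attempts; that loop is where the spin-wait hint is meant to help.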

Contributor Author

@eastig eastig Oct 21, 2021

I agree with you. I am removing it.

Contributor Author

@eastig eastig Oct 21, 2021

@theRealAph, done

Contributor

@theRealAph theRealAph Oct 21, 2021

Looks good. I'm not entirely sure whether this test is truly representative of the real-world cases that people have seen, but if we find out more we can always add another JMH test.

Contributor Author

@eastig eastig Oct 21, 2021

This test is too artificial. Going through my records, I found I have a microbenchmark for java.util.concurrent.SynchronousQueue which shows good improvements on jdk11, where SynchronousQueue uses onSpinWait. Since jdk17, SynchronousQueue no longer uses onSpinWait (see https://bugs.openjdk.java.net/browse/JDK-8267502). Maybe I can come up with a microbenchmark based on SynchronousQueue code:

        SNode awaitFulfill(SNode s, boolean timed, long nanos) {
            /*
             * When a node/thread is about to block, it sets its waiter
             * field and then rechecks state at least one more time
             * before actually parking, thus covering race vs
             * fulfiller noticing that waiter is non-null so should be
             * woken.
             *
             * When invoked by nodes that appear at the point of call
             * to be at the head of the stack, calls to park are
             * preceded by spins to avoid blocking when producers and
             * consumers are arriving very close in time.  This can
             * happen enough to bother only on multiprocessors.
             *
             * The order of checks for returning out of main loop
             * reflects fact that interrupts have precedence over
             * normal returns, which have precedence over
             * timeouts. (So, on timeout, one last check for match is
             * done before giving up.) Except that calls from untimed
             * SynchronousQueue.{poll/offer} don't check interrupts
             * and don't wait at all, so are trapped in transfer
             * method rather than calling awaitFulfill.
             */
            final long deadline = timed ? System.nanoTime() + nanos : 0L;
            Thread w = Thread.currentThread();
            int spins = shouldSpin(s)
                ? (timed ? MAX_TIMED_SPINS : MAX_UNTIMED_SPINS)
                : 0;
            for (;;) {
                if (w.isInterrupted())
                    s.tryCancel();
                SNode m = s.match;
                if (m != null)
                    return m;
                if (timed) {
                    nanos = deadline - System.nanoTime();
                    if (nanos <= 0L) {
                        s.tryCancel();
                        continue;
                    }
                }
                if (spins > 0) {
                    Thread.onSpinWait();
                    spins = shouldSpin(s) ? (spins - 1) : 0;
                }
                else if (s.waiter == null)
                    s.waiter = w; // establish waiter so can park next iter
                else if (!timed)
                    LockSupport.park(this);
                else if (nanos > SPIN_FOR_TIMEOUT_THRESHOLD)
                    LockSupport.parkNanos(this, nanos);
            }
        }

I've created https://bugs.openjdk.java.net/browse/JDK-8275728 to write such a microbenchmark.

Contributor

@theRealAph theRealAph Nov 1, 2021

I suggest you do https://bugs.openjdk.java.net/browse/JDK-8275728 before you commit this. A benchmark which proves that this patch has some utility is needed, isn't it?

Contributor Author

@eastig eastig Nov 10, 2021

Hi Andrew (@theRealAph),
I've created a PR: #6338 with a microbenchmark.

Contributor

@theRealAph theRealAph Nov 11, 2021

Hi Andrew (@theRealAph), I've created a PR: #6338 with a microbenchmark.

That's really weird. Why is the benchmark not here?

Contributor Author

@eastig eastig Nov 11, 2021

I thought a separate PR would simplify a discussion. Sorry if I was wrong.
I added it here.

@eastig
Copy link
Contributor Author

@eastig eastig commented Nov 1, 2021

Hi @theRealAph,
I see there are no other comments.
Can I proceed to integrate?

@eastig
Copy link
Contributor Author

@eastig eastig commented Nov 11, 2021

ThreadOnSpinWaitProducerConsumer demonstrates that Thread.onSpinWait can be used to avoid heavy locks.
The microbenchmark differs from Gil's original benchmark and Dmitry's variations. Those benchmarks produce/consume data by incrementing a volatile counter. The latency of such operations is almost zero. They also don't use heavy locks. According to Gil's SpinWaitTest.java:

This test can be used to measure and document the impact of Runtime.onSpinWait() behavior
on thread-to-thread communication latencies. E.g. when the two threads are pinned to
the two hardware threads of a shared x86 core (with a shared L1), this test will
demonstrate an estimate the best case thread-to-thread latencies possible on the
platform

Gil's microbenchmark targets SMT cases (x86 hyperthreading). As not all CPUs support SMT, the microbenchmarks cannot demonstrate the benefits of Thread.onSpinWait. It is actually the opposite: they show Thread.onSpinWait has a negative impact on performance.

The microbenchmark from the PR uses BigInteger to have 100 - 200 ns latencies for producing/consuming data. These latencies can cause either the producer or the consumer to wait for the other. Waiting is implemented with Object.wait/Object.notify, which are heavy. Thread.onSpinWait can be used in a spin loop to avoid them.

ARM64 results:

  • No spin loop
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt     Score    Error  Units
ThreadOnSpinWaitProducerConsumer.trial       100          0  avgt   75  1520.448 ± 40.507  us/op
  • No Thread.onSpinWait intrinsic
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt     Score    Error  Units
ThreadOnSpinWaitProducerConsumer.trial       100        125  avgt   75  1580.756 ± 47.501  us/op
  • ISB-based Thread.onSpinWait intrinsic
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt    Score     Error  Units
ThreadOnSpinWaitProducerConsumer.trial       100        125  avgt   75  617.454 ± 174.431  us/op

X86_64 results:

  • No spin loop
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt    Score     Error  Units
ThreadOnSpinWaitProducerConsumer.trial      100        125  avgt   75  1417.944 ± 1.691  us/op
  • No Thread.onSpinWait intrinsic
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt    Score     Error  Units
ThreadOnSpinWaitProducerConsumer.trial      100        125  avgt   75  1410.987 ± 2.093  us/op
  • PAUSE-based Thread.onSpinWait intrinsic
Benchmark                               (maxNum)  (spinNum)  Mode  Cnt    Score     Error  Units
ThreadOnSpinWaitProducerConsumer.trial      100        125  avgt   75  217.054 ± 1.283  us/op

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Nov 11, 2021

I'm getting this for -XX:OnSpinWaitInst=yield on Apple M1:

Benchmark                                (maxNum)  (spinNum)    Score   Error  Units
ThreadOnSpinWaitProducerConsumer.trial       100        125   355.686 ± 1.263  us/op

This for -XX:OnSpinWaitInst=none:

ThreadOnSpinWaitProducerConsumer.trial       100        125   359.635 ± 0.912  us/op

This for -XX:OnSpinWaitInst=isb:

ThreadOnSpinWaitProducerConsumer.trial       100        125   169.353 ± 3.932  us/op

Which looks pretty convincing, at least for this benchmark.

I'm a bit concerned that it took so much effort to find a convincing benchmark, but I note that OnSpinWaitInst=isb doesn't seem to make anything worse, so OK.

@eastig
Copy link
Contributor Author

@eastig eastig commented Nov 11, 2021

I'm a bit concerned that it took so much effort to find a convincing benchmark, but I note that OnSpinWaitInst=isb doesn't seem to make anything worse, so OK.

Thank you Andrew.
It took time to study the current use cases of Thread.onSpinWait and why their performance improved or did not. As usual, when you have written something simple you need to check it is correct. All of this took most of the time.

@eastig
Copy link
Contributor Author

@eastig eastig commented Nov 11, 2021

/integrate

@openjdk openjdk bot added the sponsor label Nov 11, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 11, 2021

@eastig
Your change (at version 0d6fc3f) is now ready to be sponsored by a Committer.

@phohensee
Copy link
Member

@phohensee phohensee commented Nov 11, 2021

/sponsor

@openjdk
Copy link

@openjdk openjdk bot commented Nov 11, 2021

Going to push as commit 6954b98.
Since your change was applied there have been 725 commits pushed to the master branch:

  • 3445e50: 8276265: jcmd man page is outdated
  • 0ca0acf: 8276947: Clarify how DateTimeFormatterBuilder.appendFraction handles value ranges
  • b0d7a9d: 8276994: java/nio/channels/Channels/TransferTo.java leaves multi-GB files in /tmp
  • 8aae88b: 8276763: java/nio/channels/SocketChannel/AdaptorStreams.java fails with "SocketTimeoutException: Read timed out"
  • 6f35eed: 8079267: [TEST_BUG] Test java/awt/Frame/MiscUndecorated/RepaintTest.java fails
  • 5e98f99: 8276800: Fix table headers in NumericShaper.html
  • 2ca4ff8: 8244202: Implementation of JEP 418: Internet-Address Resolution SPI
  • c29cab8: 8276112: Inconsistent scalar replacement debug info at safepoints
  • aea0967: 8275854: C2: assert(stride_con != 0) failed: missed some peephole opt
  • 9862cd0: 8275786: New javadoc option to add script files to generated documentation
  • ... and 715 more: https://git.openjdk.java.net/jdk/compare/7e92abe7a4bd2840fed19826fbff0285732f1765...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Nov 11, 2021
@openjdk openjdk bot added integrated and removed ready rfr sponsor labels Nov 11, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 11, 2021

@phohensee @eastig Pushed as commit 6954b98.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@mlbridge
Copy link

@mlbridge mlbridge bot commented May 10, 2022

Mailing list message from Stuart Monteith on hotspot-dev:

Hello,
Following on from Evgeny's patch, I've been looking at an alternative implementation of onSpinWait. It is great
that we can choose between YIELDs, NOPs and ISBs or none, but it is desirable to move away from ISB. They are not
intended to be used for delays, and their implementation would likely change in future.

There are other possible implementations with Arm V8.7 features besides what we've implemented here, but silicon is just
not there yet. I've put a patch here:

https://github.com/stooart-mon/jdk/commit/5a973ac9c67db32c649be1c317adc2185c2568fd

This implements a delay by reading a virtualized timer and waiting for a period of time to be exceeded:

MRS X0, CNTVCT_EL0
ADD X0, X0, #<value>
loop: YIELD
MRS X1, CNTVCT_EL0
CMP X1, X0
B.LT loop

The counter is incremented at a rate that is particular to the CPU implementation - this is held in the CNTFRQ_EL0
register. For example, an Altra may tick at 25 MHz (40 ns per tick), a Raspberry Pi 4B will tick at 54 MHz (18 ns).

This is straightforward in Java, we can just generate the static code as is. However, other software would need to load
the delay each time.

To enable this implementation, pass options like so:

-XX:OnSpinWaitInst=counter -XX:OnSpinWaitCounterDelay=10

One of the problems with this approach, and spinwaits in general, is knowing what the correct value should be. The
counter frequency varies between machines, and it is not clear what the actual delay itself should be. I would expect
we'd offer a minimum value, and expect the algorithm calling the wait to adjust as necessary for its purposes. Ideally
we'd implement our spinloops in such a way that all this may be unnecessary - objectMonitor.cpp alludes to spinning
locally such as MCS.

For most systems, a delay of "2" to "15" would be a good range to test. With future revisions of the Arm ARM and some
more recent cores, the CNTFRQ register will report the counters increasing by 1 billion every second, but not
necessarily in increments of 1.
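As a worked example of the arithmetic, the sketch below converts a counter frequency into nanoseconds per tick and a delay value into an approximate wait time. It assumes the delay value is expressed in counter ticks, as the ADD in the assembly above suggests. The class `CounterTicks` and its methods are illustrative names, and the frequencies are the two quoted in this message.

```java
public class CounterTicks {
    // Nanoseconds per counter tick for a counter running at the given frequency (CNTFRQ_EL0, in Hz).
    static double nanosPerTick(long counterFrequencyHz) {
        return 1e9 / counterFrequencyHz;
    }

    // Approximate minimum wait, in nanoseconds, for a delay of the given number of ticks.
    static double delayNanos(long counterFrequencyHz, int delayTicks) {
        return delayTicks * nanosPerTick(counterFrequencyHz);
    }

    public static void main(String[] args) {
        long[] frequenciesHz = {25_000_000L, 54_000_000L};
        int[] delays = {2, 10, 15};
        for (long hz : frequenciesHz) {
            System.out.printf("%d Hz: %.1f ns per tick%n", hz, nanosPerTick(hz));
            for (int ticks : delays) {
                System.out.printf("  delay %d -> about %.0f ns%n", ticks, delayNanos(hz, ticks));
            }
        }
    }
}
```

At 25 MHz a tick is 40 ns, so the suggested range of 2 to 15 corresponds to roughly 80 to 600 ns, while the same values on a faster counter give proportionally shorter waits. This is the machine-to-machine variation the message describes.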

I'll be interested in hearing what people think.

BR,
Stuart

For the ThreadProducerConsumer spinwait tests,

On 17/09/2021 12:36, Evgeny Astigeevich wrote:


@mlbridge
Copy link

@mlbridge mlbridge bot commented May 12, 2022

Mailing list message from Andrew Haley on hotspot-dev:

On 5/10/22 18:02, Stuart Monteith wrote:

I'll be interested in hearing what people think.

This looks like a very interesting idea.

Evgeny, could you try this on some of your benchmarks?

Thank you,

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Labels
hotspot integrated
9 participants