-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8186670: Implement _onSpinWait() intrinsic for AArch64 #5562
Conversation
👋 Welcome back eastig! A progress list of the required criteria for merging this PR into |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Ok to me, this is the most future proof option.
Will you be adding code to set the default depending on model, or is that something for your fork?
Unimplemented(); | ||
switch (VM_Version::pause_impl_desc().inst()) { | ||
case NOP: | ||
for (unsigned int i = 1; i < VM_Version::pause_impl_desc().inst_count(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these loops be indexed from 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. It is a copy-paste error.
Is there any method to test C1 generated assembly code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do it the same way as hotspot/jtreg/compiler/c2/aarch64/TestVolatiles.java, i.e. spawn a subtask and parse the output dump. It's very fiddly, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I used it as an example when I was writing tests for the PR. It works only for C2 because it relies on C2 XX:+PrintOptoAssembly
. I haven't found anything similar for C1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-XX:+PrintAssembly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To have assembly instructions in -XX:+PrintAssembly
output hsdis
needs to be provided:
0x0000ffff61ba2b5c: ; {metadata({method} {0x0000000800466ab8} 'isLatin1' '()Z' in 'java/lang/String')}
0x0000ffff61ba2b5c: 0857 8dd2 | c808 a0f2 | 0801 c0f2 | e807 00f9
;; 0xFFFFFFFFFFFFFFFF
0x0000ffff61ba2b6c: 0800 8092 | e803 00f9
0x0000ffff61ba2b74: ; {runtime_call counter_overflow Runtime1 stub}
However it can help to skip to the place where instructions are expected and to check instructions' hex code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. There's no C1 equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote a test to parse XX:+PrintAssembly
hex instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend to make isb the default for N1?
Yes, I do. |
/csr needed |
If you are adding a new product flag then a CSR request is needed. David |
@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request. |
src/hotspot/cpu/aarch64/aarch64.ad
Outdated
switch (VM_Version::pause_impl_desc().inst()) { | ||
case NOP: | ||
PRINT_N_INST(nop) | ||
break; | ||
case ISB: | ||
PRINT_N_INST(isb) | ||
break; | ||
case YIELD: | ||
PRINT_N_INST(yield) | ||
break; | ||
default: | ||
ShouldNotReachHere(); | ||
} | ||
#undef EMIT_N_ASM_STRINGS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of this is necessary. Printing "onspinwait" is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results no instructions implementing onspinwait
in OptoAssembly output. Why do we want to hide the details?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not useful as a verification that the correct instructions were generated: you need a disassembly for that. OptoAssembly is usually a somewhat briefer format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/hotspot/cpu/aarch64/aarch64.ad
Outdated
switch (VM_Version::pause_impl_desc().inst()) { | ||
case NOP: | ||
EMIT_N_INST(inst_count, nop); | ||
break; | ||
case ISB: | ||
EMIT_N_INST(inst_count, isb); | ||
break; | ||
case YIELD: | ||
EMIT_N_INST(inst_count, yield); | ||
break; | ||
default: | ||
ShouldNotReachHere(); | ||
} | ||
%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let the MacroAssembler do this. Just call MacroAssembler::spin_wait()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
EMIT_N_INST(inst_count, nop); | ||
break; | ||
case ISB: | ||
EMIT_N_INST(inst_count, isb); | ||
break; | ||
case YIELD: | ||
EMIT_N_INST(inst_count, yield); | ||
break; | ||
default: | ||
ShouldNotReachHere(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Please just call MacroAssembler::spin_wait()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -110,7 +110,10 @@ define_pd_global(intx, InlineSmallCode, 1000); | |||
product(int, SoftwarePrefetchHintDistance, -1, \ | |||
"Use prfm hint with specified distance in compiled code." \ | |||
"Value -1 means off.") \ | |||
range(-1, 4096) | |||
range(-1, 4096) \ | |||
product(ccstr, UsePauseImpl, "none", \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name "UsePauseImpl" fails to make the connection with onSpinWait
. If you called it something like OnSpinWaitImpl
that would make the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
range(-1, 4096) | ||
range(-1, 4096) \ | ||
product(ccstr, UsePauseImpl, "none", \ | ||
"Use instructions to implement pauses." \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Use instructions to implement pauses." \ | |
"Use instructions to implement java.lang.Thread.onSpinWait() ." \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Hi David, |
Code emitting spin pauses is moved to MacroAssembler::spin_wait. As OptoAssembly output is changed, tests are updated to parse PrintAssembly.
@@ -1380,6 +1380,27 @@ class MacroAssembler: public Assembler { | |||
void cache_wb(Address line); | |||
void cache_wbsync(bool is_pre); | |||
|
|||
// Code for java.lang.Thread::onSpinWait() intrinsic. | |||
void spin_wait() { | |||
#define EMIT_N_INST(n, inst) for (int i = 0; i < (n); ++i) inst() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a macro here? You could just put the loop around the switch statement. And the method body seems sufficiently large that it ought to go in the .cpp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. There's no significant performance advantage to having this in the header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a macro here? You could just put the loop around the switch statement. And the method body seems sufficiently large that it ought to go in the .cpp file.
:) compiler engineering experience. Compilers have a problem to apply unswitching optimization to loop-invariant SWITCHes.
I'll update the code as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@theRealAph, when I was writing a test I notice a strange thing in
The code is for the case when 7 NOPs are used for a spin pause. |
In addition, comments are added to a checking method of a test.
It's pretty much expected, yes. The debuginfo isn't all that precise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pause_aarch64.hpp, I'd put the definition of PauseInst inside the definition of PauseImplDesc in order to not clutter the global namespec more than needed.
} else if (strcmp(s, "yield") == 0) { | ||
return PauseImplDesc(YIELD, count); | ||
} else if (strcmp(s, "none") != 0) { | ||
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl", OnSpinWaitImpl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl", OnSpinWaitImpl); | |
vm_exit_during_initialization("The options for OnSpinWaitImpl are nop, isb, yield, and none", OnSpinWaitImpl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if (isdigit(*s)) { | ||
count = *s - '0'; | ||
if (count == 0) { | ||
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl: zero instruction count", OnSpinWaitImpl); | ||
} | ||
s += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (isdigit(*s)) { | |
count = *s - '0'; | |
if (count == 0) { | |
vm_exit_during_initialization("Invalid value for OnSpinWaitImpl: zero instruction count", OnSpinWaitImpl); | |
} | |
s += 1; | |
} | |
while (isdigit(*s++)); | |
count = atoi(OnSpinWaitImpl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, this combination of digits and named option is unusual in HotSpot; it may be unique. For the sake of not doing something so unfamiliar to our users, it may be worth separating the count and the option string.
@theRealAph, any comments on the microbenchmark I wrote? |
I think to show the benefit of the |
Something like this works well:
Before:
With
With
This is Apple M1, where you have to be very careful because there's some processor |
ThreadOnSpinWait: latency of Thread.onSpinWait() vs Thread.sleep(0) vs no pause. ThreadOnSpinWaitProducerConsumer: producer-consumer where consumer can go to 1ms sleep. ThreadOnSpinWaitSharedCounter: threads racing to count up (code provided by Andrew Haley, Red Hat).
Thank you for the microbenchmark. I added it. |
Hi @theRealAph, |
@OutputTimeUnit(TimeUnit.MILLISECONDS) | ||
@State(Scope.Benchmark) | ||
public class ThreadOnSpinWaitProducerConsumer { | ||
@Param({"100"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test seems rather artificial and unrealistic to me. You can get improved performance simply by increasing the spinNum
so that it waits for longer. The way to get some advantage from onSpinWait
is to have multiple threads racing to update the same thing, e.g. to acquire a lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I am removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theRealAph, done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'm not entirely sure whether this test is truly representative of the real-world cases that people have seen, but if we find out more we can always add another JMH test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is too artificial. Going through my records I've found I have a microbenchmark for java.util.concurrent. SynchronousQueue
which shows good improvements on jdk11. SynchronousQueue
uses onSpinWait
. Since jdk17 SynchronousQueue
has not been using onSpinWait
any more (See https://bugs.openjdk.java.net/browse/JDK-8267502). Maybe I can come up with a microbenchmark based on SynchronousQueue
code:
SNode awaitFulfill(SNode s, boolean timed, long nanos) {
/*
* When a node/thread is about to block, it sets its waiter
* field and then rechecks state at least one more time
* before actually parking, thus covering race vs
* fulfiller noticing that waiter is non-null so should be
* woken.
*
* When invoked by nodes that appear at the point of call
* to be at the head of the stack, calls to park are
* preceded by spins to avoid blocking when producers and
* consumers are arriving very close in time. This can
* happen enough to bother only on multiprocessors.
*
* The order of checks for returning out of main loop
* reflects fact that interrupts have precedence over
* normal returns, which have precedence over
* timeouts. (So, on timeout, one last check for match is
* done before giving up.) Except that calls from untimed
* SynchronousQueue.{poll/offer} don't check interrupts
* and don't wait at all, so are trapped in transfer
* method rather than calling awaitFulfill.
*/
final long deadline = timed ? System.nanoTime() + nanos : 0L;
Thread w = Thread.currentThread();
int spins = shouldSpin(s)
? (timed ? MAX_TIMED_SPINS : MAX_UNTIMED_SPINS)
: 0;
for (;;) {
if (w.isInterrupted())
s.tryCancel();
SNode m = s.match;
if (m != null)
return m;
if (timed) {
nanos = deadline - System.nanoTime();
if (nanos <= 0L) {
s.tryCancel();
continue;
}
}
if (spins > 0) {
Thread.onSpinWait();
spins = shouldSpin(s) ? (spins - 1) : 0;
}
else if (s.waiter == null)
s.waiter = w; // establish waiter so can park next iter
else if (!timed)
LockSupport.park(this);
else if (nanos > SPIN_FOR_TIMEOUT_THRESHOLD)
LockSupport.parkNanos(this, nanos);
}
}
I've created https://bugs.openjdk.java.net/browse/JDK-8275728 to write such a microbenchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest you do https://bugs.openjdk.java.net/browse/JDK-8275728 before you commit this. A benchmark which proves that this patch has some utility is needed, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Andrew (@theRealAph),
I've created a PR: #6338 with a microbenchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Andrew (@theRealAph), I've created a PR: #6338 with a microbenchmark.
That's really weird. Why is the benchmark not here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a separate PR would simplify a discussion. Sorry if I was wrong.
I added it here.
Hi @theRealAph, |
Gil's microbenchmark targets SMT cases (x86 hyperthreading). As not all CPUs support SMT, the microbenchmarks cannot demonstrate benefits of The microbenchmark from PR uses ARM64 results:
X86_64 results:
|
I'm getting this for
This for
This for
Which looks pretty convincing, at least for this benchmark. I'm a bit concerned that it took so much effort to find a convincing benchmark, but I note that OnSpinWaitInst=isb doesn't seem to make anything worse, so OK. |
Thank you Andrew. |
/integrate |
/sponsor |
Going to push as commit 6954b98.
Your commit was automatically rebased without conflicts. |
@phohensee @eastig Pushed as commit 6954b98. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Mailing list message from Stuart Monteith on hotspot-dev: Hello, There are other possible implementations with Arm V8.7 features besides what we've implemented here, but silicon is just https://github.com/stooart-mon/jdk/commit/5a973ac9c67db32c649be1c317adc2185c2568fd This implements a delay by reading a virtualized timer and waiting for a period of time to be exceeded: MRS X0, CNTVCT_EL0 The counter is incremented at a rate that is particular to the CPU implementation - this is held in the CNTFRQ_EL0 This is straightforward in Java, we can just generate the static code as is. However, other software would need to load To enable this implementation, pass options like so: -XX:OnSpinWaitInst=counter -XX:OnSpinWaitCounterDelay=10 One of the problems with this approach, and spinwaits in general, is knowing what the correct value should be. As the For most systems, a delay of "2" to "15" would be a good range to test. With future revisions of the Arm ARM and some I'll be interested in hearing what people think. BR, For the the ThreadProducerConsumer spinwait tests, On 17/09/2021 12:36, Evgeny Astigeevich wrote: IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. |
Mailing list message from Andrew Haley on hotspot-dev: On 5/10/22 18:02, Stuart Monteith wrote:
This looks like a very interesting ides. Evgeny, could you try this on some of your benchmarks? Thanks you, -- |
This PR is a follow-up on the discussion “RFC: AArch64: Implementing spin pauses with ISB”.
It adds DIAGNOSTIC options
OnSpinWaitInst=inst
, whereinst
can be:none
: no implementation for spin pauses. This is the default value.nop
: usenop
instruction for spin pauses.isb
: useisb
instruction for spin pauses.yield
: useyield
instruction for spin pauses.And
OnSpinWaitInstCount=count
, wherecount
specifies a number ofOnSpinWaitInst
and can be in1..99
range. It is an error to useOnSpinWaitInstCount
whenOnSpinWaitInst
isnone
.The code for the
Thread.onSpinWait
intrinsic is generated based on the values ofOnSpinWaitInst
andOnSpinWaitInstCount
.Testing:
make test TEST="gtest"
: Passedmake run-test TEST="tier1"
: Passedmake run-test TEST="tier2"
: Passedmake run-test TEST=hotspot/jtreg/compiler/onSpinWait
: PassedCSR: https://bugs.openjdk.java.net/browse/JDK-8274564
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5562/head:pull/5562
$ git checkout pull/5562
Update a local copy of the PR:
$ git checkout pull/5562
$ git pull https://git.openjdk.java.net/jdk pull/5562/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 5562
View PR using the GUI difftool:
$ git pr show -t 5562
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5562.diff