
8318986: Improve GenericWaitBarrier performance #16404

Closed

Conversation

@shipilev (Member) commented Oct 27, 2023

See the symptoms, reproducer and analysis in the bug.

Current code waits on disarm(), which effectively stalls leaving the safepoint if some threads lag behind. Having more runnable threads than CPUs nearly guarantees that we would wait for quite some time, but it also reproduces well if you have enough threads near the CPU count.

This PR implements a more efficient GenericWaitBarrier to recover the performance. Most of the implementation discussion is in the code comments. The key observation that drives this work is that we want to reuse Semaphore and related counters without being stuck waiting for threads to leave. (AFAICS, futex-based LinuxWaitBarrier does roughly the same, but handles this reuse on futex side, by assigning the "address" per futex.)
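
(For illustration, a rough sketch of how the safepoint code drives a WaitBarrier. The arm/disarm/wait names follow waitBarrier.hpp; the begin/end driver functions and the safepoint_id parameter are illustrative, not the actual SafepointSynchronize code.)

WaitBarrier _wait_barrier;           // GenericWaitBarrier on non-Linux platforms

void begin_safepoint(int safepoint_id) {
  _wait_barrier.arm(safepoint_id);   // tag must be non-zero while armed
  // ... stop Java threads; a thread hitting the safepoint poll now blocks
  //     in _wait_barrier.wait(safepoint_id) ...
}

void end_safepoint() {
  // Wakes every thread blocked in wait(). In the old generic implementation,
  // this path could also stall until stragglers from the previous safepoint
  // fully left the underlying Semaphore, which is the stall described above.
  _wait_barrier.disarm();
}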

This issue affects everything except Linux. I initially found this on my M1 Mac, but I am pretty sure it is easy to reproduce on Windows as well. The safepoint times from the reproducer in the bug improved dramatically on a Mac; see the graph below. The new version gives orders of magnitude better safepoint times. This also translates to a much more active GC and a higher attainable allocation rate, because GC throughput is no longer blocked by overly long safepoints.

[Figure plot-generic-wait-barrier-macos: safepoint times on macOS before and after this change]

Additional testing:

  • MacOS AArch64 server fastdebug, tier1
  • Linux x86_64 server fastdebug, tier1 tier2 tier3 (generic wait barrier enabled explicitly)
  • Linux AArch64 server fastdebug, tier1 tier2 tier3 (generic wait barrier enabled explicitly)
  • MacOS AArch64 server fastdebug, tier2 tier3
  • Linux x86_64 server fastdebug, tier4 (generic wait barrier enabled explicitly)
  • Linux AArch64 server fastdebug, tier4 (generic wait barrier enabled explicitly)

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8318986: Improve GenericWaitBarrier performance (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16404/head:pull/16404
$ git checkout pull/16404

Update a local copy of the PR:
$ git checkout pull/16404
$ git pull https://git.openjdk.org/jdk.git pull/16404/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16404

View PR using the GUI difftool:
$ git pr show -t 16404

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16404.diff


@bridgekeeper (bot) commented Oct 27, 2023

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk (bot) commented Oct 27, 2023

@shipilev The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk bot added the hotspot (hotspot-dev@openjdk.org) label on Oct 27, 2023
@shipilev force-pushed the JDK-8318986-generic-wait-barrier branch from b4dff05 to e39ba69 on November 1, 2023 14:11
@shipilev force-pushed the JDK-8318986-generic-wait-barrier branch from a0694a8 to a390610 on November 1, 2023 16:07
@shipilev marked this pull request as ready for review on November 1, 2023 23:07
@openjdk bot added the rfr (Pull request is ready for review) label on Nov 1, 2023
@shipilev (Member, Author) commented Nov 2, 2023

@robehn, you might be interested in this :)

@shipilev (Member, Author) commented Nov 2, 2023

Stress tests show there is still a race condition, which can be triggered by putting a small sleep on this path:

diff --git a/src/hotspot/share/utilities/waitBarrier_generic.cpp b/src/hotspot/share/utilities/waitBarrier_generic.cpp
index f886e7b4cf5..13ba12c39a6 100644
--- a/src/hotspot/share/utilities/waitBarrier_generic.cpp
+++ b/src/hotspot/share/utilities/waitBarrier_generic.cpp
@@ -174,6 +174,7 @@ void GenericWaitBarrier::wait(int barrier_tag) {
   OrderAccess::fence();
 
   if (Atomic::load_acquire(&_barrier_tag) == barrier_tag) {
+    if (UseNewCode) os::naked_short_nanosleep(100);
     Atomic::add(&cell._unsignaled_waits, 1);
 
     // Wait for notification.

It would then fail/hang this tier4 test:

$ CONF=linux-x86_64-server-fastdebug make images test TEST=vmTestbase/nsk/monitoring/ThreadInfo/isSuspended/issuspended002.java TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode" 

Putting this PR to draft until I figure out a better way to do this.

@shipilev marked this pull request as draft on November 2, 2023 12:33
@openjdk bot removed the rfr (Pull request is ready for review) label on Nov 2, 2023
@robehn (Contributor) commented Nov 2, 2023

> @robehn, you might be interested in this :)

Yepp, thanks for pinging me! I'll have a look when you are ready!

Regarding the bug: accounting for a context switch in every 'sub-state' usually gets you at least once. That is what the Atomic::add(&_barrier_threads, 1); did, "no marine left behind" :)

And while on this topic:
Note that there is an optimization that can be done on Linux as well.
When waking all threads via futex, the VM thread accrues a lot of runtime, so two safepoints very close to each other are slow.
Meaning: if there is another safepoint op pending, the VM thread will often get context-switched out just after waking all threads.
Secondly, the futex wake, which involves visiting all run queues, can be parallelized similarly to this patch, by waking just a few threads and letting them wake the rest. Intel actually did a draft of that just before Meltdown/Spectre, so it got lost.
I think that draft woke something like 6 or 8 threads at a time; on one large system with ~128 cores you could get the time to full utilization down by almost 50% (but you lose some latency for the JavaThreads doing the second round of waking).

I mention this since you have set up measurements and graphs, so maybe you would like to continue with this code :) (no JIRA issue for this; a sketch of the fan-out idea follows below)
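
(For illustration, a minimal sketch of that fan-out idea in portable C++20. This is not the Intel draft, which was never published, and not HotSpot code; FANOUT, wake_all and wait_and_help are made-up names, and waiters are assumed to be registered before wake_all runs.)

#include <algorithm>
#include <atomic>
#include <semaphore>

static constexpr int FANOUT = 8;        // the draft reportedly woke 6-8 at a time

std::counting_semaphore<> _sem{0};
std::atomic<int> _signals_issued{0};
int _total_waiters = 0;                 // set before any waiter can be woken

void wake_all(int nof_waiters) {        // called by the VM thread
  _total_waiters = nof_waiters;
  _signals_issued.store(0, std::memory_order_relaxed);
  int first = std::min(FANOUT, nof_waiters);
  _signals_issued.fetch_add(first, std::memory_order_relaxed);
  _sem.release(first);                  // the VM thread pays for only FANOUT wakeups
}

void wait_and_help() {                  // called by every waiting thread
  _sem.acquire();
  // Each woken thread claims a batch of the remaining wakeups and delivers
  // them, spreading the wakeup cost over threads that are already running.
  int issued = _signals_issued.load(std::memory_order_relaxed);
  while (issued < _total_waiters) {
    int claim = std::min(FANOUT, _total_waiters - issued);
    if (_signals_issued.compare_exchange_weak(issued, issued + claim,
                                              std::memory_order_relaxed)) {
      _sem.release(claim);
      break;
    }
  }
}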

@shipilev (Member, Author) commented Nov 2, 2023

> Regarding the bug: accounting for a context switch in every 'sub-state' usually gets you at least once. That is what the Atomic::add(&_barrier_threads, 1); did, "no marine left behind" :)

Yes. I think most of the race condition mess comes from juggling several counters at once, plus depending on barrier_tag. The new version fuses the important counters together and melds the arm/disarm status into the counter, which allows managing things more easily by CAS-ing only one counter. I think that resolves the race conditions I saw in the previous implementation. It yields the same performance. Now running it through functional testing.
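
(A minimal sketch of the fused-counter idea, with a hypothetical encoding; the actual layout and transitions in waitBarrier_generic.cpp differ in detail.)

#include <atomic>
#include <cstdint>

// Hypothetical encoding, for illustration only:
//   state == 0  -> disarmed, no waiters may block
//   state  > 0  -> armed; (state - 1) waiters are registered
//   state  < 0  -> disarming; -state waiters still need wakeups
std::atomic<int64_t> _state{0};

void arm() { _state.store(1, std::memory_order_release); }

bool try_register_waiter() {
  // Arm status and waiter count live in one word, so this check-and-increment
  // is a single CAS: a waiter can never register against a disarmed barrier.
  int64_t s = _state.load(std::memory_order_acquire);
  while (s > 0) {
    if (_state.compare_exchange_weak(s, s + 1, std::memory_order_acq_rel)) {
      return true;                     // registered; safe to block on the semaphore
    }
  }
  return false;                        // disarmed or disarming; do not block
}

void disarm() {                        // caller guarantees the barrier is armed
  // One CAS flips armed -> disarming and captures the waiter count at once.
  int64_t s = _state.load(std::memory_order_acquire);
  while (!_state.compare_exchange_weak(s, -(s - 1), std::memory_order_acq_rel)) {}
  // ... now deliver (s - 1) semaphore wakeups ...
}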

@shipilev marked this pull request as ready for review on November 2, 2023 20:56
@shipilev (Member, Author) commented Nov 2, 2023

I think this is good for review. The reproducer that used to hang/fail on assert is now passing. tier1 tier2 tier3 are all passing. I am running more tests overnight.

@openjdk bot added the rfr (Pull request is ready for review) label on Nov 2, 2023
@shipilev (Member, Author) commented Nov 6, 2023

> I think this is good for review. The reproducer that used to hang/fail on assert is now passing. tier1 tier2 tier3 are all passing. I am running more tests overnight.

Testing seems all good. I'll leave the Linux -> Generic switch in this PR until the very last moment before integration, to keep testing easier.

@dcubed-ojdk (Member) commented

@shipilev - I'm glad that:

vmTestbase/nsk/monitoring/ThreadInfo/isSuspended/issuspended002.java

has proven to be useful. I had been thinking about removing it from my weekly stress kit runs, since it has been a long time since I've seen a failure flushed out by that test running the stress config. I think I'll keep it around for longer...

@robehn (Contributor) left a comment

Thanks for addressing this!
I had some comments.

@robehn (Contributor) left a comment

Thank you! I like the new protocol with tag == 0 as disarmed.
Looks good!

(A bit more tired than this morning, so I'll have a new look tomorrow, just in case I missed something. Silence == all good.)

@cl4es (Member) commented Nov 15, 2023

> Agreed. Would you like to formally approve this PR?

Even with Robbin's prior approval I feel it would be presumptuous of me to approve a PR in a piece of code I'm completely unfamiliar with.

@robehn (Contributor) commented Nov 15, 2023

@pchilano can you have a look?

@pchilano (Contributor) commented

> @pchilano can you have a look?

I will. I might not finish the review until next week though.

@walulyai (Member) left a comment

LGTM!

@shipilev (Member, Author) commented
All right, thanks all. I think we are waiting for @pchilano's review, and then we can integrate.

@pchilano (Contributor) left a comment

Looks good to me.

@shipilev (Member, Author) commented
The performance improvements still hold. I would wait for some light testing to complete, and then I will integrate.

@robehn (Contributor) left a comment

Thanks!

Did you want to tackle the futex version also?
I'll create a JBS issue anyhow to track it.

@shipilev (Member, Author) commented

> Did you want to tackle the futex version also? I'll create a JBS issue anyhow to track it.

I have no plans to tackle the futex version. Trying to finish the tasks already started this year :) But I would be surprised if there are things that can be improved on top of what futex already does for us. Probably avalanche wakeups would be a thing to try?

@robehn (Contributor) commented Nov 21, 2023

> > Did you want to tackle the futex version also? I'll create a JBS issue anyhow to track it.
>
> I have no plans to tackle the futex version. Trying to finish the tasks already started this year :) But I would be surprised if there are things that can be improved on top of what futex already does for us. Probably avalanche wakeups would be a thing to try?

Ok! Yes!

@shipilev (Member, Author) commented Nov 21, 2023

@pchilano @dholmes-ora @cl4es, may I ask you to check if anyone at Oracle did the testing runs for this PR, and schedule a run if not? I am sure we are good with current testing, but additional safety would be nice to avoid surprises this close to JDK 22 RDP1.

@pchilano (Contributor) commented

> @pchilano @dholmes-ora @cl4es, may I ask you to check if anyone at Oracle did the testing runs for this PR, and schedule a run if not? I am sure we are good with current testing, but additional safety would be nice to avoid surprises this close to JDK 22 RDP1.

I'll schedule a round of testing for Tiers1-7 with the latest version.

@shipilev (Member, Author) commented

> I'll schedule a round of testing for Tiers1-7 with the latest version.

Thanks! Any failures so far?

@pchilano (Contributor) commented

I ran Tiers [1-7] and there is one failure in tier5, in test vmTestbase/nsk/monitoring/stress/thread/strace016/TestDescription.java on windows-x64-debug. The output is:

#>  
#>  WARNING: switching log to verbose mode,
#>      because error is complained
#>  
ThreadMonitor> Test mode:	DIRECTLY access to MBean
ThreadController> number of created threads:	30
ThreadController> depth for all threads:	100
ThreadController> invocation type:	mixed

Starting threads.

ThreadController> locking threads

States of the threads are culminated.
# ERROR: 	Thread BLOCKED_ThreadMM001 wrong thread state: RUNNABLE
The following stacktrace is for failure analysis.
nsk.share.TestFailure:  Thread BLOCKED_ThreadMM001 wrong thread state: RUNNABLE
	at nsk.share.Log.logExceptionForFailureAnalysis(Log.java:431)
	at nsk.share.Log.complain(Log.java:402)
	at nsk.monitoring.stress.thread.strace010.runIt(strace010.java:148)
	at nsk.monitoring.stress.thread.strace010.run(strace010.java:99)
	at nsk.monitoring.stress.thread.strace010.main(strace010.java:95)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1570)

Checked 6 BLOCKED threads
# ERROR: Expected amount: 7 for BLOCKED threads actual: 6
Checked 7 WAITING threads
Checked 8 TIMED_WAITING threads
Checked 9 RUNNABLE threads
# ERROR: Expected amount: 8 for RUNNABLE threads actual: 9


Test FAILED


#>  
#>  SUMMARY: Following errors occured
#>      during test execution:
#>  
# ERROR: 	Thread BLOCKED_ThreadMM001 wrong thread state: RUNNABLE
# ERROR: Expected amount: 7 for BLOCKED threads actual: 6
# ERROR: Expected amount: 8 for RUNNABLE threads actual: 9

I re-ran tier5 twice and ran the test alone 100 times, but unfortunately couldn't reproduce the issue. I checked the history of failures and haven't seen this fail before. But it could also be that there is some race already in the test, uncovered by this patch.

There are some jobs pending for macos-x64 (there is currently a bottleneck in the pipeline for this platform).

@shipilev (Member, Author) commented Nov 22, 2023

Thanks for testing!

> I ran Tiers [1-7] and there is one failure in tier5, in test vmTestbase/nsk/monitoring/stress/thread/strace016/TestDescription.java on windows-x64-debug. I re-ran tier5 twice and ran the test alone 100 times, but unfortunately couldn't reproduce the issue. I checked the history of failures and haven't seen this fail before. But it could also be that there is some race already in the test, uncovered by this patch.

Yes, I think so too. I ran this test hundreds of times without failure. The output implies there is a thread that should be "blocked", but instead it is "runnable". I think the test itself contains the race condition; submitted: https://bugs.openjdk.org/browse/JDK-8320599. I would not treat this failure as an integration blocker, then.

Do you think we should wait for Mac pipeline to complete?

@robehn (Contributor) commented Nov 22, 2023

@pchilano you can change waitBarrier.hpp so that Linux also uses the generic one, as @shipilev did when he tested:

#if defined(LINUX)
#include "waitBarrier_linux.hpp"

And just use typedef GenericWaitBarrier WaitBarrierDefault;

For better coverage. (A sketch of the resulting header follows below.)
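
(For reference, a testing-only sketch of that change, assuming the usual implementation-selection block in share/utilities/waitBarrier.hpp; not meant for integration.)

// waitBarrier.hpp: bypass the platform selection and use the generic barrier
// everywhere, so the Linux tiers exercise GenericWaitBarrier too.
#include "waitBarrier_generic.hpp"
typedef GenericWaitBarrier WaitBarrierDefault;   // instead of LinuxWaitBarrier on Linux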

@pchilano (Contributor) commented

> Thanks for testing!
>
> > I ran Tiers [1-7] and there is one failure in tier5, in test vmTestbase/nsk/monitoring/stress/thread/strace016/TestDescription.java on windows-x64-debug. I re-ran tier5 twice and ran the test alone 100 times, but unfortunately couldn't reproduce the issue. I checked the history of failures and haven't seen this fail before. But it could also be that there is some race already in the test, uncovered by this patch.
>
> Yes, I think so too. I ran this test hundreds of times without failure. The output implies there is a thread that should be "blocked", but instead it is "runnable". I think the test itself contains the race condition; submitted: https://bugs.openjdk.org/browse/JDK-8320599. I would not treat this failure as an integration blocker, then.

Yes, I think the issue is in ThreadController.java with Blocker.block(). I'll keep investigating to see if I can reproduce it.

> Do you think we should wait for the Mac pipeline to complete?

I'm not sure when these tasks will finish. I think we should be good with all the testing done so far.

@pchilano (Contributor) commented

> @pchilano you can change waitBarrier.hpp so that Linux also uses the generic one, as @shipilev did when he tested:
>
> #if defined(LINUX)
> #include "waitBarrier_linux.hpp"
>
> And just use typedef GenericWaitBarrier WaitBarrierDefault;
>
> For better coverage.

I actually realized this yesterday, after the jobs had been running for a while. So I submitted extra runs with that change to test Linux too. I ran Tiers [4-7]. Tier7 completed successfully, and Tiers [4-6] are almost done too, with no failures. There are again some macos-x64 jobs that are pending.

@shipilev (Member, Author) commented
All right then. I will integrate today, hopefully within an hour. Thank you all!

@shipilev (Member, Author) commented
/integrate

@openjdk (bot) commented Nov 22, 2023

Going to push as commit 30462f9.
Since your change was applied there have been 30 commits pushed to the master branch:

  • 407cdd4: 8320207: doclet incorrectly chooses code font for a See Also link
  • 1629a90: 8320331: G1 Full GC Heap verification relies on metadata not reset before verification
  • 93bdc2a: 8306055: Add a built-in Catalog to JDK XML module
  • a4bd9e4: 8319440: RISC-V: jdk can't be built with clang due to register keyword
  • 524da14: 8320418: PPC64: invokevfinal_helper duplicates code to handle ResolvedMethodEntry
  • 5d4a54b: 8319449: compiler/print/CompileCommandPrintMemStat.java fails on Graal
  • 35526d0: 8257076: os::scan_pages is empty on all platforms
  • 25cebe8: 8317799: AIX PPC64: FFI symbol lookup doesn't find symbols
  • c39d001: 8319137: release _object in ObjectMonitor dtor to avoid races
  • 8b47a14: 8320309: AIX: pthreads created by foreign test library don't work as expected
  • ... and 20 more: https://git.openjdk.org/jdk/compare/e055fae104a887c436da9f2924e88029518d5d96...master

Your commit was automatically rebased without conflicts.

@openjdk bot added the integrated (Pull request has been integrated) label on Nov 22, 2023
@openjdk bot closed this on Nov 22, 2023
@openjdk bot removed the ready (Pull request is ready to be integrated) and rfr (Pull request is ready for review) labels on Nov 22, 2023
@openjdk (bot) commented Nov 22, 2023

@shipilev Pushed as commit 30462f9.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shipilev deleted the JDK-8318986-generic-wait-barrier branch on November 28, 2023 14:54