8318986: Improve GenericWaitBarrier performance #16404
Conversation
Force-pushed from b4dff05 to e39ba69
Force-pushed from a0694a8 to a390610
Webrevs
@robehn, you might be interested in this :)
Stress tests show there is still a race condition, which can be triggered by putting a small sleep on this path:
It would then fail/hang this
Putting this issue to draft until I figure out a better way to do this.
Yepp, thanks for pinging me! I'll have a look when you are ready! Regarding the bug, accounting for context switches in every 'sub-state' usually gets you at least once. And while on this topic: I mention this since you have set up measurements and graphs, so maybe you'd like to continue on this code :) (no JIRA issue for this)
Yes. I think most of the race condition mess comes from juggling several counters at once, plus depending on
I think this is good for review. The reproducer that used to hang/fail on assert is now passing.
Testing seems all good. I'll leave the
@shipilev - I'm glad that vmTestbase/nsk/monitoring/ThreadInfo/isSuspended/issuspended002.java has proven to be useful. I had been thinking about removing it from my weekly stress
Thanks for addressing this!
I had some comments.
Thank you! I like the new protocol with tag == 0 as disarmed.
Looks good!
(A bit more tired than this morning, so I'll have a new look tomorrow, just in case I missed something. Silence == all good.)
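For context, here is a minimal, hypothetical sketch of the tag protocol being discussed: a nonzero tag identifies the safepoint the barrier is armed for, and tag == 0 means disarmed. The ToyWaitBarrier name and the mutex/condvar machinery are illustrative assumptions only; the actual HotSpot GenericWaitBarrier uses semaphores and lock-free counters instead:

#include <condition_variable>
#include <mutex>

// Toy illustration only: a nonzero tag identifies the safepoint the
// barrier is armed for; tag == 0 is reserved to mean "disarmed".
class ToyWaitBarrier {
  std::mutex _lock;
  std::condition_variable _cv;
  int _tag = 0;                            // 0 == disarmed

public:
  void arm(int tag) {                      // precondition: tag != 0
    std::lock_guard<std::mutex> g(_lock);
    _tag = tag;
  }

  void wait(int tag) {                     // block while armed with 'tag'
    std::unique_lock<std::mutex> l(_lock);
    _cv.wait(l, [&] { return _tag != tag; });
  }

  void disarm() {                          // wake all waiters; never waits
    {                                      // for them to actually leave
      std::lock_guard<std::mutex> g(_lock);
      _tag = 0;
    }
    _cv.notify_all();
  }
};

The attraction of this scheme is that a single word encodes both which safepoint the barrier is armed for and whether it is armed at all, so disarm() can publish the new state and wake everyone without waiting for waiters to drain.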
Even with Robbin's prior approval I feel it would be presumptuous of me to approve a PR in a piece of code I'm completely unfamiliar with.
@pchilano, can you have a look?
I will. I might not finish the review until next week though.
LGTM!
All right, thanks all. I think we are waiting for @pchilano's review, and then we can integrate.
Looks good to me.
The performance improvements still hold. I would wait for some light testing to complete, and then I will integrate.
Thanks!
Did you want to tackle the futex version also?
I'll create a JBS issue anyhow to track it.
I have no plans to tackle the futex version. Trying to finish the tasks already started this year :) But I would be surprised if there are things that can be improved on top of what futex already does for us. Probably avalanche wakeups would be a thing to try?
Ok! Yes!
@pchilano @dholmes-ora @cl4es, may I ask you to check if anyone at Oracle did the testing runs for this PR, and schedule a run if not? I am sure we are good with current testing, but additional safety would be nice to avoid surprises this close to JDK 22 RDP1.
I'll schedule a round of testing for Tiers 1-7 with the latest version.
Thanks! Any failures so far?
I ran Tiers[1-7] and there is one failure in tier5, in test vmTestbase/nsk/monitoring/stress/thread/strace016/TestDescription.java on windows-x64-debug. The output is:
I re-ran tier5 twice and the test alone 100 times, but unfortunately couldn't reproduce the issue. I checked the history of failures and haven't seen this fail before. But it could also be that there is some race already in the test, uncovered by this patch. There are some jobs pending for macos-x64 (there is currently a bottleneck in the pipeline for this platform).
Thanks for testing!
Yes, I think so too. I ran this test hundreds of times without failure. The output implies there is a thread that should be "blocked", but instead it is "runnable". I think the test itself contains the race condition; submitted: https://bugs.openjdk.org/browse/JDK-8320599. I would not treat this failure as an integration blocker then. Do you think we should wait for the Mac pipeline to complete?
Yes, I think the issue is in ThreadController.java with Blocker.block(). I'll keep investigating to see if I can reproduce it.
I'm not sure when these tasks will finish. I think we should be good with all the testing done so far.
I actually realized this yesterday, after the jobs had been running for a while. So I submitted extra runs with that change to test Linux too. I ran Tiers[4-7]. Tier7 completed successfully, and Tiers[4-6] are almost done too, with no failures. There are again some macos-x64 jobs that are pending.
All right then. I will integrate today, hopefully within an hour. Thank you all!
/integrate |
Going to push as commit 30462f9.
Your commit was automatically rebased without conflicts.
See the symptoms, reproducer, and analysis in the bug.

Current code waits on disarm(), which effectively stalls leaving the safepoint if some threads lag behind. Having more runnable threads than CPUs nearly guarantees that we would wait for quite some time, but it also reproduces well if you have enough threads near the CPU count.

This PR implements a more efficient GenericWaitBarrier to recover the performance. Most of the implementation discussion is in the code comments. The key observation that drives this work is that we want to reuse Semaphore and related counters without being stuck waiting for threads to leave. (AFAICS, the futex-based LinuxWaitBarrier does roughly the same, but handles this reuse on the futex side, by assigning the "address" per futex.)

This issue affects everything except Linux. I initially found this on my M1 Mac, but I am pretty sure it is easy to reproduce on Windows as well. The safepoint times from the reproducer in the bug improved dramatically on a Mac; see the graph below. The new version gives orders-of-magnitude better safepoint times. This also translates to much more active GC and attainable allocation rate, because GC throughput is not blocked by overly long safepoints.
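As a rough illustration of the futex-side reuse described above, a sketch along these lines might work. This is an assumption-laden toy, not the actual LinuxWaitBarrier code: the arm/wait_on/disarm free functions and the barrier_word variable are invented names, and it is Linux-only since it calls futex(2) directly. The futex word itself carries the armed tag, so disarm() wakes everyone by value and never waits for stragglers:

#include <atomic>
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Toy illustration: the futex word doubles as the barrier state.
// 0 == disarmed, any nonzero value == tag of the armed safepoint.
static std::atomic<int> barrier_word{0};

static long futex_op(std::atomic<int>* addr, int op, int val) {
  // futex(2) operates on a plain int; std::atomic<int> is
  // layout-compatible with int on this platform.
  return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                 nullptr, nullptr, 0);
}

void arm(int tag) {                        // precondition: tag != 0
  barrier_word.store(tag);
}

void wait_on(int tag) {
  // FUTEX_WAIT blocks only if the word still equals 'tag'; a disarm
  // that races with us makes the kernel return immediately, and the
  // loop condition then fails.
  while (barrier_word.load() == tag) {
    futex_op(&barrier_word, FUTEX_WAIT, tag);
  }
}

void disarm() {
  barrier_word.store(0);
  futex_op(&barrier_word, FUTEX_WAKE, INT_MAX);  // wake all; never
                                                 // waits for waiters
                                                 // to actually leave
}

Because FUTEX_WAIT re-checks the word against the expected tag atomically in the kernel, waiters racing with a disarm fall through immediately instead of blocking on a wakeup that already happened.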
Additional testing:
- tier1
- tier1 tier2 tier3 (generic wait barrier enabled explicitly)
- tier1 tier2 tier3 (generic wait barrier enabled explicitly)
- tier2 tier3
- tier4 (generic wait barrier enabled explicitly)
- tier4 (generic wait barrier enabled explicitly)
Reviewing
Using git
Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16404/head:pull/16404
$ git checkout pull/16404
Update a local copy of the PR:
$ git checkout pull/16404
$ git pull https://git.openjdk.org/jdk.git pull/16404/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 16404
View PR using the GUI difftool:
$ git pr show -t 16404
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16404.diff
Webrev
Link to Webrev Comment