8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures #3143
Conversation
👋 Welcome back adityam! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
@adityamandaleeka The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
There are one or two whitespace changes that I'll remove in a future commit prior to submission, since the bot isn't happy about them. Will also handle the Windows build error.
Webrevs
About the second question: the current code makes the thread stall if there is a proactive gc request while the gc locker is active. This behavior seems reasonable to me: otherwise, since there is a high likelihood of an evacuation failure, that gclocker-induced gc would take a long time, which is exactly the situation we want to avoid. Others might disagree.
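For reference, a rough sketch of the stall behavior being described, using real GCLocker entry points but assumed surrounding helpers (this is not the actual patch):

```c++
// Sketch: on the slow allocation path, if the policy wants a proactive GC
// but the GC locker is active, stall this thread until the locker-induced
// GC has run, instead of handing out another free region.
if (policy()->proactive_collection_required(1)) {   // assumed policy hook
  if (GCLocker::is_active_and_needs_gc()) {
    // Real HotSpot API: blocks until the GCLocker-induced GC completes.
    GCLocker::stall_until_clear();
  } else {
    // Assumed trigger path for the proactive young collection.
    collect(GCCause::_g1_proactive_collection);
  }
}
```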
During some internal discussion with the people that had looked at the change a bit, we agreed that the additional parameter is awkward. Here is an attempt to remove it by adding an explicit method that allocates using a new region; that does look nicer to me. No testing has been performed, this is just to see how it would look. Please also merge with latest once more, the change is a bit out of date.
Thanks @tschatzl, will review the feedback, re-test, and update the PR (likely sometime next week). |
If there is a burst of humongous allocations, they can consume all of the free regions left in the heap. If there are no free regions for to-space, a collect will fail to evacuate all objects, which makes the GC very slow. Since humongous regions can be collected in G1GC young collects, force a young collect before the number of free regions drops below the number the GC is likely to require. Fix TestGCLogMessages.java so it still causes evacuation failures. Signed-off-by: Charlie Gracie <charlie.gracie@microsoft.com>
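As a rough illustration of the idea in this commit message, the slow allocation path would consult the policy before consuming another free region; all names below are assumed for the sketch, not taken from the patch:

```c++
// Sketch: trigger a young ("proactive") collect from the mutator slow path
// while there are still enough free regions for to-space, rather than
// waiting until evacuation is doomed to fail.
HeapWord* attempt_allocation_slow(size_t word_size) {
  if (policy()->proactive_collection_required(required_regions(word_size))) {
    // Humongous regions are reclaimable by young collects in G1, so a young
    // collection here can replenish the free region pool.
    return do_collection_pause(word_size, GCCause::_g1_proactive_collection);
  }
  return attempt_allocation_in_current_region(word_size);  // assumed helper
}
```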
Force-pushed from 4f1d5db to 66fa097.
src/hotspot/share/gc/g1/g1Policy.cpp (outdated)
@@ -780,6 +783,9 @@ void G1Policy::record_collection_pause_end(double pause_time_ms, bool concurrent
  update_rs_length_prediction();
  // Is this the right place? Should it be in the below?
Can a reviewer confirm this so that I can remove the comment?
The comment can be removed imho.
Thanks again for the review. I agree that the proposed refactor of the alloc functions looks good, and I've applied it. Functional testing of the latest code is ongoing (doing hs_gc and vmTestbase_vm_gc in release and fastdebug on x64 Linux), but so far everything that's run has passed. What are your thoughts on the further steps involved before getting this change merged? For instance, is there value/interest in putting the proactive GCs behind an -XX: option? Every such flag adds complexity and surface area for testing, so I'm somewhat hesitant, but I'm interested to hear others' opinions on this.
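For illustration, a hypothetical diagnostic switch in the g1_globals.hpp style; treat the name, default, and description as assumptions at this point in the thread:

```c++
// Hypothetical flag declaration (g1_globals.hpp entry style; in the real
// file such an entry sits inside the GC_G1_FLAGS macro). Whether an option
// like this was actually added is resolved later in the review.
product(bool, G1UseProactiveGC, true, DIAGNOSTIC,
        "Allow G1 to trigger collections proactively, based on the "
        "number of free regions and the expected survival rates in "
        "each section of the heap.")
```

Making it DIAGNOSTIC would keep it out of the supported flag surface while still allowing it to be switched off for testing.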
@@ -112,10 +112,21 @@ class G1Allocator : public CHeapObj<mtGC> {
  // Allocate blocks of memory during mutator time.
  // Attempt allocation in the current alloc region. If use_retained_region_if_available
Is the reference to use_retained_region_if_available out of date?
use_retained_region_if_available is no longer used; remove it from the comment.
I think this is starting to look pretty good. No major comments at this point, but I will have to take one more look in detail at the calculations in the policy.
// to the locking protocol.
// Tries to allocate at least min_word_size words, and at most desired_word_size.
// Returns the actual size of the block in actual_word_size.
Please format these lines to fit the rest of the comment:
// to the locking protocol. The min and desired word size allow
// specifying a minimum and maximum size of the allocation. The
// actual size of allocation is returned in actual_word_size.
|
|
Remove at least one of those two blank lines.
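For context, the allocation entry point these comment lines describe has roughly this shape (paraphrased for illustration, not copied verbatim from the patch):

```c++
// Tries to allocate at least min_word_size words and at most
// desired_word_size, subject to the locking protocol; the size actually
// obtained is returned through actual_word_size.
inline HeapWord* attempt_allocation(size_t min_word_size,
                                    size_t desired_word_size,
                                    size_t* actual_word_size);
```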
src/hotspot/share/gc/g1/g1Policy.cpp (outdated)
  return false;
  }
  if (_g1h->young_regions_count() == 0 && _collection_set->candidates() != NULL && _collection_set->candidates()->is_empty()) {
Add a comment that we have no regions to collect. Adding a has_candidates() to G1CollectionSet seems to make sense now that we check for candidates a couple of times, and it will make this condition read a bit better as well:
if (_g1h->young_regions_count() == 0 && !_collection_set->has_candidates()) {
  // Don't attempt a proactive GC when there are no regions to collect.
Thomas and I talked about this before but decided not to do it, because the two places where we check this are doing slightly different things (one is checking for NULL || empty and the other is checking !NULL && empty, which is not the inverse of the first). Looking at it closer now though, I think this one should also bail in the case where candidates is NULL, right? If we do that, we can definitely add the has_candidates function and use it in both cases.
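A minimal sketch of the helper under discussion, assuming it folds the NULL check in so that both call sites can bail out the same way (not the verbatim JDK change):

```c++
class G1CollectionSet {
  G1CollectionSetCandidates* _candidates;
public:
  // True iff there is a non-empty set of old-region candidates. Folding the
  // NULL check in makes "no candidate list at all" and "empty candidate
  // list" behave identically at both call sites.
  bool has_candidates() const {
    return _candidates != NULL && !_candidates->is_empty();
  }
};
```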
src/hotspot/share/gc/g1/g1Policy.cpp (outdated)
  // Old regions
  G1CollectionSetCandidates *candidates = _collection_set->candidates();
  if (candidates == NULL || candidates->is_empty()) {
Could use !_collection_set->has_candidates() if added.
@@ -122,7 +122,7 @@ VM_G1CollectForAllocation::VM_G1CollectForAllocation(size_t word_size,
void VM_G1CollectForAllocation::doit() {
  G1CollectedHeap* g1h = G1CollectedHeap::heap();
-  if (_word_size > 0) {
+  if (_word_size > 0 && _gc_cause != GCCause::_g1_proactive_collection) {
What do you think about adding a helper to `VM_G1CollectForAllocation`:

bool try_allocate_before_gc() {
  if (_gc_cause == GCCause::_g1_proactive_collection) {
    // Never allocate before proactive GCs.
    return false;
  }
  return true;
}

The condition would then read a bit better:

if (try_allocate_before_gc() && _word_size > 0) {
Fyi, I've been running this change through our performance test suite, and there are two notable differences in results in the Renaissance benchmark suite (http://renaissance.dev). For the latter, more concerning result in renaissance-scrabble, we use the JMH wrappers which should be part of the benchmark suite (I think). Comparing the original results against the results with this change, there are significant differences I can't explain right now. It may be due to heap sizing or similar (since we do not use any special VM options except logging), but can you look into this a bit?
@tschatzl Sorry for the delay, I was wrapped up in some other work and then wasn't feeling well the last couple of days. I finally had a chance to run the Renaissance-Scrabble benchmark with and without this change, but I was not able to reproduce the delta you observed. My "before" build is based on commit adityamandaleeka@28c35ae and my "after" build is based on this PR; using the jmh flags you listed, the two produced comparable results.
I also tried running the Neo4jAnalytics one, but got "java.lang.RuntimeException: Benchmark 'neo4j-analytics' is not compatible with this JVM version!". Looking at the benchmark sources, I found a JVM version check that excludes the JVM version I was testing with.
In the interest of completeness, I re-ran the Renaissance-Scrabble benchmark with the latest updates (even though there is no reason to believe they'd affect the results). I also re-ran the "before" test (adityamandaleeka@28c35ae) against the latest "after" build (57421e1).
As before, I'm not seeing the delta that was reported. @tschatzl Please let me know if you're still seeing the regression, and we can figure out if perhaps I'm doing something different than you. Thanks!
I'll probably look at this today; I was busy yesterday with incoming issues. Please also consider that in most of Europe Thursday was a public holiday, and most people took Friday off for an extended weekend. As you might imagine from my post, I was surprised by this difference as well; it is likely an artifact of our performance regression suite setup.
One of our tests is failing. I can't reproduce the issue right now, and the logs do not offer enough information to proceed. Maybe it is useful and enough to add some more logging to it. Another option is to disable preventive gcs for this case, which is probably the safer bet - I can imagine that in some cases this change would prevent the evacuation failures necessary for the test to complete successfully (i.e. to actually produce an evacuation failure).
I think this sounds like a good approach. If we test to get evac failures, it is reasonable to turn off "preventive" GCs.
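For a test that must provoke evacuation failures, the switch would presumably be applied on the test's command line, along the lines of the following (the flag name here is an assumption, matching the hypothetical diagnostic flag sketched earlier in this thread):

```
-XX:+UnlockDiagnosticVMOptions -XX:-G1UseProactiveGC
```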
The failing test made me look a bit more at the logs when running with this feature, and I found at least one thing I think we need to fix. What I did was to run SPECjbb2015 with a fixed injection-rate and with tweaked parameters to force a bit more live objects. Running this with an 8g heap was fine and I saw no "preventive" GCs. Turning the heap size down to 6g, the preventive GCs started to show, and we do a lot more GCs with this patch than without; but looking at the total GC time, it is pretty similar to before, so I'm not sure how much to care about this. What I think we should change, or at least discuss, is the fact that we now get Full GCs attributed to the "preventive" cause.
A "preventive" Full GC looks a bit strange to me, I mean the Full GC occurs because we failed not prevent it. So I think we should avoid using this GC cause for Full collections. What do others think? |
Mailing list message from Kirk Pepperdine on hotspot-gc-dev: Hi Stefan,
This is certainly not preventing a full GC, implying that the name is misleading. At 8G, I'm not sure that I care all that much about a full collection being triggered, but it's at a threshold where I'm going to take notice. At issue is: it's very unlikely that we'd want to see a full collection when heaps get bigger than this. I'd much rather set a larger reserve or take other measures to prevent the failure rather than endure the penalty of a full collection. Kind regards,
It is certainly misleading for the Full GC, but for normal collections I think it's the best alternative presented so far. You are not always certain that a preventive measure will be successful.
I agree, in cases like this the best thing from a configuration perspective is to give more memory to the process and hopefully avoid the Full GCs. But we still need to test and make sure a feature like this handles high heap occupancy in a good way. I did some extended runs during the night, and after a while the "preventive" GCs occur even for the 8g case. But the overall numbers still look ok; I am currently running a less stressful configuration to make sure that is even less affected.
Yup. I'll revert the test change that was made in the original PR and instead disable preventive collections for that test.
I think you're right, this cause can be confusing when it's on a full collection. The question then is where to switch the cause, and what to switch it to. The problem is that we don't know, at the point the cause is set, what kind of collection will end up happening; we may decide to upgrade to a full collection later on.
We discussed this a bit today and the general idea is that we should change the GC-cause for upgraded GCs to something better and more descriptive. We can do that outside this change, either before or after. I filed JDK-8268122 and will look into this. |
@kstefanj Sounds great, thanks. I've pushed the change to disable preventive GC for the evac failure test (and undid the other test changes that were done in the original PR).
The three CI failures appear unrelated to this PR: the macOS x64 run failed to download the boot JDK, macOS arm64 didn't even get as far as checking out the source (?), and the Windows arm64 one failed with a linker error trying to open libcpmt.lib.
Mailing list message from Kirk Pepperdine on hotspot-gc-dev: Traditionally in the logs you would see that the current collection is "closed out" and a new record is started for the "new" full collection. Kind regards,
This will still be the case; saying that the young collection is upgraded is a bit misleading (but it is called that in the code). What we are discussing is to add a new, specific cause used by the Full GCs. Right now the upgraded Full GC simply inherits the cause of the young collection that triggered it; with this change, that inherited cause would be the preventive one, which looks even more strange because the Full GC itself is not preventive. So the idea is to add a separate cause used by any Full GC triggered by a young collection not generating enough free memory, while the preventive cause stays on the young collections themselves.
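To make the naming question concrete, the shapes under discussion would look roughly like this in the unified GC log (cause strings here are placeholders for the ones being debated, not verbatim output):

```
GC(12) Pause Young (Normal) (G1 Preventive Collection)   <- preventive young collect
GC(13) Pause Full (G1 Preventive Collection)             <- reads oddly: this Full GC prevented nothing
GC(13) Pause Full (G1 Upgraded Collection)               <- hypothetical separate cause for upgrades
```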
All outstanding questions are answered and we should be ready to integrate.
Depending on who integrates first, there might be some small merge conflict with my new GC cause for Full collections (PR #4357). But it should be very easy to fix.
Thanks for being patient and addressing all our comments in this fairly long review process.
@adityamandaleeka This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated, there had been 722 new commits pushed to the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@kstefanj, @tschatzl), but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.
Same as kstefanj. Thanks.
/integrate
@adityamandaleeka Your change is now ready to be sponsored by a Committer.
/sponsor |
@kstefanj @adityamandaleeka Since your change was applied there have been 729 commits pushed to the master branch.
Your commit was automatically rebased without conflicts. Pushed as commit 81bad59. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
This PR picks up from this previous PR by @charliegracie.
I won't repeat the full description from the prior PR, but the general idea is to add the notion of a "proactive GC" to G1 which gets triggered in the slow allocation path if the number of free regions drops below the amount that would be required to complete a GC if it happened at that moment. The threshold is based on the survival rates from eden and survivor spaces along with the space required for tenured space evacuations.
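For intuition, the threshold check described above might look something like the following; all names and the exact accounting here are assumptions for the sketch, not the patch itself:

```c++
// Sketch: require enough free regions to absorb (a) survivors copied out of
// eden and survivor spaces, per recent survival rates, and (b) regions
// needed for evacuating old (tenured) regions, plus whatever the current
// allocation itself wants. All called helpers are assumed predictors.
bool proactive_collection_required(uint alloc_region_count) {
  uint regions_for_young_survivors =
      (uint)ceil(eden_region_count() * predicted_eden_survival_rate() +
                 survivor_region_count() * predicted_survivor_survival_rate());
  uint regions_for_old_evac = predicted_old_evacuation_regions();
  uint required = regions_for_young_survivors + regions_for_old_evac +
                  alloc_region_count;
  return free_region_count() < required;  // too few free regions: collect now
}
```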
There are a couple of outstanding issues/questions known; one is a temporary rename (zzinit_complete) made just to unblock testing.
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3143/head:pull/3143
$ git checkout pull/3143
Update a local copy of the PR:
$ git checkout pull/3143
$ git pull https://git.openjdk.java.net/jdk pull/3143/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 3143
View PR using the GUI difftool:
$ git pr show -t 3143
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3143.diff