8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures #1650
Conversation
If there is a burst of humongous allocations, they can consume all of the free regions left in the heap. If there are no free regions for to-space, a collect will fail to evacuate all objects, which makes the GC very slow. Since humongous regions can be collected in G1GC young collects, force a young collect before the number of free regions becomes less than the number the GC is likely to require. Fix TestGCLogMessages.java so it still causes evacuation failures. Signed-off-by: Charlie Gracie <charlie.gracie@microsoft.com>
👋 Welcome back cgracie! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
@charliegracie The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
This is only a first pass over the changes; I need to think a bit more about the estimation and whether there are any issues with this new heuristic, i.e. whether we could end up doing lots of GCs that effectively achieve nothing at some point, and whether this is preferable to running into an evacuation failure.
@@ -313,16 +313,17 @@ public static void main(String [] args) {
     static class GCTestWithEvacuationFailure {
         private static byte[] garbage;
         private static byte[] largeObject;
-        private static Object[] holder = new Object[200]; // Must be larger than G1EvacuationFailureALotCount
+        private static Object[] holder = new Object[800]; // Must be larger than G1EvacuationFailureALotCount
Just curious about these changes: it is not immediately obvious to me why they are necessary, as the mechanism to force evacuation failure (G1EvacuationFailureALotCount et al) should be independent of these changes.
And the 17MB (for the humongous object) + 16MB of garbage should be enough for at least one GC; but maybe these changes trigger an early GC?
Yes, the GC was being triggered earlier and not getting the evacuation failure. With these adjustments it consistently gets an evacuation failure. I was seeing this with my initial prototype, so I will verify that it is still required.
@@ -68,13 +68,15 @@ class VM_G1TryInitiateConcMark : public VM_GC_Operation {

 class VM_G1CollectForAllocation : public VM_CollectForAllocation {
   bool _gc_succeeded;
+  bool _force_gc;
Not completely happy about using an extra flag for these forced GCs here; also, this makes them indistinguishable from other GCs in the logs as far as I can see.
What do you think about adding GCCause(s) instead, to automatically make them stand out in the logs?
Or at least make sure that they stand out in the logs for debugging issues.
I will remove the _force_gc flag and add a Preemptive GCCause.
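For reference, a dedicated cause along those lines might look like the following sketch; the enum value name and the description string are assumptions for illustration, not code from this PR:

```cpp
// share/gc/shared/gcCause.hpp (sketch; _g1_preemptive_collection is a
// hypothetical name for the new cause)
enum Cause {
  // ... existing causes ...
  _g1_inc_collection_pause,
  _g1_humongous_allocation,
  _g1_preemptive_collection,  // forced GC taken before free regions run out
  // ...
};

// share/gc/shared/gcCause.cpp, inside GCCause::to_string() (sketch):
case _g1_preemptive_collection:
  return "G1 Preemptive Collection";
```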
@@ -120,7 +122,7 @@ VM_G1CollectForAllocation::VM_G1CollectForAllocation(size_t word_size,
 void VM_G1CollectForAllocation::doit() {
   G1CollectedHeap* g1h = G1CollectedHeap::heap();

-  if (_word_size > 0) {
+  if (_word_size > 0 && !_force_gc) {
Why not start the GC with a requested word_size == 0 here and remove the need for the flag (though there is still a need to record the purpose of this GC somehow)? In a way it makes sense, as space is not really exhausted.
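A minimal sketch of that suggestion, assuming the surrounding allocation-slow-path context (the constructor argument list is simplified and the GC cause name is hypothetical):

```cpp
// Request the forced GC with no allocation to satisfy; doit() then skips
// the allocation retry naturally because _word_size is 0, so no _force_gc
// flag is needed.
VM_G1CollectForAllocation op(0 /* word_size */,
                             gc_count_before,
                             GCCause::_g1_preemptive_collection /* hypothetical */);
VMThread::execute(&op);
```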
@@ -98,6 +98,9 @@ class G1Policy: public CHeapObj<mtGC> {

   uint _free_regions_at_end_of_collection;

+  size_t _predicted_survival_bytes_from_survivor;
+  size_t _predicted_survival_bytes_from_old;
As a non-native English speaker I think "survival_bytes" is strange, as "survival" isn't an adjective. Maybe "surviving_bytes" sounds better?
The code in calculate_required_regions_for_next_collect uses "survivor", btw. I would still prefer "surviving" in some way, as it differs from the "survivor" in "survivor regions", but let's keep the nomenclature at least consistent.
Also, please add a comment describing what these are used for.
I will change the names and properly comment them.
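For illustration, the renamed and commented fields might end up looking like this (a sketch following the reviewer's naming suggestion; the comment wording is an assumption):

```cpp
// Predicted number of bytes that will survive (be copied out of) the
// survivor regions during the next collection.
size_t _predicted_surviving_bytes_from_survivor;

// Predicted number of bytes that will be evacuated out of the old
// collection set candidate regions during the next collection.
size_t _predicted_surviving_bytes_from_old;
```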
@@ -362,6 +365,8 @@ class G1Policy: public CHeapObj<mtGC> {
                       double time_remaining_ms,
                       uint& num_optional_regions);

+  bool can_mutator_consume_free_regions(uint region_count);
Comments missing.
I will add a comment in my next revision.
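The requested comment might read something like the following sketch (the wording is an assumption based on the PR description):

```cpp
// Returns whether the mutator can take region_count more free regions
// while still leaving enough free regions for the next collection to
// evacuate into, based on the predicted surviving bytes from eden,
// survivor and old regions. If this returns false, the caller should
// trigger a preemptive collection instead of satisfying the allocation.
bool can_mutator_consume_free_regions(uint region_count);
```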
@@ -420,25 +420,30 @@ HeapWord* G1CollectedHeap::attempt_allocation_slow(size_t word_size) {
   HeapWord* result = NULL;
   for (uint try_count = 1, gclocker_retry_count = 0; /* we'll return */; try_count += 1) {
     bool should_try_gc;
+    bool force_gc = false;
force_gc and should_try_gc seem to overlap a bit here. At least the naming isn't perfect, because we may not do a GC even if force_gc is true, which is what I'd kind of expect.
I do not have a good new name for fixing this right now.
It will be removed as part of my next round of changes.
      // No need for an ergo message here, can_expand_young_list() does this when
      // it returns true.
      result = _allocator->attempt_allocation_force(word_size);
+     if (policy()->can_mutator_consume_free_regions(1)) {
I would prefer if force_gc (or whatever name it will have) were set here unconditionally, as the else is pretty far away here. I.e.:

    force_gc = policy()->can_mutator_consume_free_regions(1);
    if (force_gc) { // needing to use the name "force_gc" here shows that the name is wrong...
      ... try allocation
      ... check if we should expand young gen beyond regular size due to GCLocker
    }

The other issue I have with using can_mutator_consume_free_regions() here is that there is already a very similar G1Policy::should_allocate_mutator_region(); and anyway, the attempt_allocation_locked call may actually succeed without requiring a new region (actually, it is not uncommon that another thread gets a free region while trying to take the Heap_lock). I think a better place for can_mutator_consume_free_regions() is in G1Policy::should_allocate_mutator_region() for this case.
attempt_allocation_locked, however, does not return a reason for why the allocation failed (at the moment). Maybe it is better to let it return a tuple of result and reason (or a second "out" parameter)? (I haven't tried how this would look; it seems worth trying, and better than the current way of handling this.) This could then be used in the following code.
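A sketch of how that out-parameter variant could look; the enum and parameter names are assumptions for illustration, not code from the PR or from HotSpot:

```cpp
// Hypothetical reason codes for a failed mutator allocation.
enum class AllocFailureReason {
  none,            // allocation succeeded
  no_free_regions, // a new region was needed but none were free
  policy_denied    // policy is reserving free regions for the next GC
};

// attempt_allocation_locked() could report why it failed via a second
// "out" parameter (sketch only).
HeapWord* attempt_allocation_locked(size_t word_size, AllocFailureReason* reason);

// The caller could then distinguish a preemptive GC from a regular one:
//   AllocFailureReason reason;
//   HeapWord* result = attempt_allocation_locked(word_size, &reason);
//   if (result == NULL) {
//     force_gc = (reason == AllocFailureReason::policy_denied);
//   }
```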
For this code I am investigating moving the check into either G1Policy::should_allocate_mutator_region() or the caller of G1Policy::should_allocate_mutator_region(), so I can distinguish the reason why the allocation failed. Hopefully I will have something ready to push this week. I am on vacation, so my responses are a little slow.
@@ -451,7 +456,7 @@ HeapWord* G1CollectedHeap::attempt_allocation_slow(size_t word_size) {
     if (should_try_gc) {
       bool succeeded;
       result = do_collection_pause(word_size, gc_count_before, &succeeded,
-                                   GCCause::_g1_inc_collection_pause);
+                                   GCCause::_g1_inc_collection_pause, force_gc);
I really think it is better to add a new GCCause for this instead of the additional parameter. We are not doing a GC because we are out of space due to the allocation of word_size, but a "pre-emptive" GC because of the allocation of word_size.
This warrants a new GC cause imho.
@@ -872,7 +882,7 @@ HeapWord* G1CollectedHeap::attempt_allocation_humongous(size_t word_size) {
     if (should_try_gc) {
       bool succeeded;
       result = do_collection_pause(word_size, gc_count_before, &succeeded,
-                                   GCCause::_g1_humongous_allocation);
+                                   GCCause::_g1_humongous_allocation, force_gc);
I believe the reason for this GC, if force_gc is true, is not that we do not have enough space for the humongous allocation; rather, we are doing a "pre-emptive" GC because of the allocation of word_size, similar to before.
      add_allocated_humongous_bytes_since_last_gc(size_in_regions * HeapRegion::GrainBytes);
      return result;
+   size_t size_in_regions = humongous_obj_size_in_regions(word_size);
+   if (policy()->can_mutator_consume_free_regions((uint)size_in_regions)) {
Again, I would prefer that the result of this is stored in a local not called force_gc :) and then used, instead of appending such a small else after the long'ish true case.
@charliegracie This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@charliegracie This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it!
Bursts of short-lived humongous object allocations can cause GCs to be initiated with 0 free regions. When these GCs happen, they take significantly longer to complete: no objects are evacuated, so a large amount of time is spent reversing self-forwarded pointers, and the only memory recovered comes from the short-lived humongous objects. My proposal is to add a check to the slow allocation path which forces a GC if the number of free regions drops below the number that would be required to complete a GC if it happened at that moment. The threshold is based on the survival rates from the eden and survivor spaces, along with the space required for tenured-space evacuations.
The goal is to resolve the issue with bursts of short-lived humongous objects without negatively impacting other workloads. I would appreciate reviews and any feedback you might have. Thanks.
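To make the threshold concrete, here is a minimal standalone sketch of the arithmetic described above; the function and parameter names, the round-up convention, and the fixed region size are illustrative assumptions, not the PR's actual code:

```cpp
#include <cstddef>

const size_t region_size_bytes = 4 * 1024 * 1024; // assume 4M G1 regions

// Round up: a partially filled evacuation destination still takes a region.
static size_t bytes_to_regions(size_t bytes) {
  return (bytes + region_size_bytes - 1) / region_size_bytes;
}

// True if handing region_count free regions to the mutator still leaves
// enough free regions for the next GC to evacuate into.
bool can_mutator_consume_free_regions(size_t free_regions,
                                      size_t region_count,
                                      size_t predicted_eden_surviving_bytes,
                                      size_t predicted_survivor_surviving_bytes,
                                      size_t predicted_old_evacuation_bytes) {
  const size_t required_for_gc =
      bytes_to_regions(predicted_eden_surviving_bytes) +
      bytes_to_regions(predicted_survivor_surviving_bytes) +
      bytes_to_regions(predicted_old_evacuation_bytes);
  return free_regions >= region_count + required_for_gc;
}
```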
Here are the links to the threads on the mailing list where I initially discussed the issue and my idea to resolve it:
https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-November/032189.html
https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-December/032677.html
Download
$ git fetch https://git.openjdk.java.net/jdk pull/1650/head:pull/1650
$ git checkout pull/1650