
8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures #1650

Closed

Conversation

charliegracie
Member

@charliegracie charliegracie commented Dec 6, 2020

Bursts of short-lived humongous object allocations can cause GCs to be initiated with 0 free regions. When these GCs happen, they take significantly longer to complete: no objects are evacuated, so a large amount of time is spent reversing self-forwarded pointers, and the only memory recovered comes from the short-lived humongous objects. My proposal is to add a check to the slow allocation path which forces a GC if the number of free regions drops below the number that would be required to complete a GC if it happened at that moment. The threshold is based on the survival rates of eden and the survivor spaces, along with the space required for tenured-space evacuations.

The goal is to resolve the issue with bursts of short-lived humongous objects without negatively impacting other workloads. I would appreciate reviews and any feedback that you might have. Thanks.
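For illustration, the threshold check described above could be sketched roughly as follows. All names, types, and the exact prediction formula here are hypothetical stand-ins for the purpose of this sketch, not the actual HotSpot code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inputs to the prediction; the real patch derives these
// from G1's survival-rate tracking, not from a plain struct.
struct RegionPrediction {
  uint32_t eden_regions;            // current eden length in regions
  double   eden_survival_rate;      // predicted fraction surviving eden
  uint32_t survivor_regions;        // current survivor length in regions
  double   survivor_survival_rate;  // predicted fraction surviving survivor
  uint32_t old_evac_regions;        // predicted regions for old evacuation
};

// Regions the next GC is predicted to need as to-space.
static uint32_t predicted_regions_needed(const RegionPrediction& p) {
  uint32_t from_eden =
      (uint32_t)(p.eden_regions * p.eden_survival_rate + 0.5);
  uint32_t from_survivor =
      (uint32_t)(p.survivor_regions * p.survivor_survival_rate + 0.5);
  return from_eden + from_survivor + p.old_evac_regions;
}

// The mutator may consume `request` free regions only if enough free
// regions would remain for the GC to complete without evacuation failure.
static bool can_mutator_consume_free_regions(uint32_t free_regions,
                                             uint32_t request,
                                             const RegionPrediction& p) {
  return free_regions >= request + predicted_regions_needed(p);
}
```

When the check fails, the slow path would trigger a young collection instead of handing the region to the mutator.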

Here are the links to the threads on the mailing list where I initially discussed the issue and my idea for resolving it:
https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-November/032189.html
https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-December/032677.html


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1650/head:pull/1650
$ git checkout pull/1650

If there is a burst of humongous allocations, they can
consume all of the free regions left in the heap. If
there are no free regions for to-space, a collection
will fail to evacuate all objects, which makes the GC
very slow. Since humongous regions can be collected in
G1 young collections, force a young collection before
the number of free regions drops below the number the
GC is likely to require.

Fix TestGCLogMessages.java so it still causes evacuation
failures.

Signed-off-by: Charlie Gracie <charlie.gracie@microsoft.com>
@bridgekeeper

bridgekeeper bot commented Dec 6, 2020

👋 Welcome back cgracie! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 6, 2020
@openjdk

openjdk bot commented Dec 6, 2020

@charliegracie The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Dec 6, 2020
@mlbridge

mlbridge bot commented Dec 6, 2020

Webrevs

Contributor

@tschatzl tschatzl left a comment


This is only a first pass over the changes; I need to think a bit more about the estimation, and about whether there are any issues with this new heuristic, i.e. whether we could end up doing lots of GCs that effectively achieve nothing at some point, and whether that is preferable to running into an evacuation failure.

@@ -313,16 +313,17 @@ public static void main(String [] args) {
static class GCTestWithEvacuationFailure {
private static byte[] garbage;
private static byte[] largeObject;
-    private static Object[] holder = new Object[200]; // Must be larger than G1EvacuationFailureALotCount
+    private static Object[] holder = new Object[800]; // Must be larger than G1EvacuationFailureALotCount
Contributor


Just curious about these changes: it is not immediately obvious to me why they are necessary as the mechanism to force evacuation failure (G1EvacuationFailureALotCount et al) should be independent of these changes.

And the 17MB (for the humongous object) + 16MB of garbage should be enough for at least one GC; but maybe these changes trigger an early GC?

Member Author


Yes, the GC was being triggered earlier and not producing the evacuation failure. With these adjustments it consistently gets an evacuation failure. I was seeing this with my initial prototype, so I will verify that it is still required.

@@ -68,13 +68,15 @@ class VM_G1TryInitiateConcMark : public VM_GC_Operation {

class VM_G1CollectForAllocation : public VM_CollectForAllocation {
bool _gc_succeeded;
bool _force_gc;
Contributor


Not completely happy about using an extra flag for these forced GCs here; it also makes them indistinguishable from other GCs in the logs, as far as I can see.
What do you think about adding a GCCause (or several) instead, to automatically make them stand out in the logs?
Or at least make sure that they stand out in the logs for debugging issues.

Member Author


I will remove the _force_gc flag and add a Preemptive GCCause.

@@ -120,7 +122,7 @@ VM_G1CollectForAllocation::VM_G1CollectForAllocation(size_t word_size,
void VM_G1CollectForAllocation::doit() {
G1CollectedHeap* g1h = G1CollectedHeap::heap();

-  if (_word_size > 0) {
+  if (_word_size > 0 && !_force_gc) {
Contributor


Why not start the GC with a requested word_size == 0 here and remove the need for the flag (though there is still a need to record the purpose of this GC somehow)? In some ways that makes sense, as space is not really exhausted.

@@ -98,6 +98,9 @@ class G1Policy: public CHeapObj<mtGC> {

uint _free_regions_at_end_of_collection;

size_t _predicted_survival_bytes_from_survivor;
size_t _predicted_survival_bytes_from_old;
Contributor


As a non-native English speaker I think "survival_bytes" is strange, as "survival" isn't an adjective; maybe "surviving_bytes" sounds better?
The code in calculate_required_regions_for_next_collect uses "survivor", btw. I would still prefer "surviving" in some way, as it differs from the "survivor" in "survivor regions", but let's at least keep the nomenclature consistent.

Also, please add a comment explaining what these are used for.

Member Author


I will change the names and properly comment them.

@@ -362,6 +365,8 @@ class G1Policy: public CHeapObj<mtGC> {
double time_remaining_ms,
uint& num_optional_regions);

bool can_mutator_consume_free_regions(uint region_count);
Contributor


Comments missing.

Member Author


I will add a comment in my next revision.

@@ -420,25 +420,30 @@ HeapWord* G1CollectedHeap::attempt_allocation_slow(size_t word_size) {
HeapWord* result = NULL;
for (uint try_count = 1, gclocker_retry_count = 0; /* we'll return */; try_count += 1) {
bool should_try_gc;
bool force_gc = false;
Contributor


force_gc and should_try_gc seem to overlap a bit here. At the least, the naming isn't perfect, because we may not do a GC even if force_gc is true, which is what I'd kind of expect.

I do not have a good new name for fixing this right now.

Member Author


It will be removed as part of my next round of changes.

// No need for an ergo message here, can_expand_young_list() does this when
// it returns true.
result = _allocator->attempt_allocation_force(word_size);
if (policy()->can_mutator_consume_free_regions(1)) {
Contributor


I would prefer if force_gc (or whatever name it will have) were set here unconditionally, as the else is pretty far away.

I.e.

force_gc = policy()->can_mutator_consume_free_regions(1);
  
  if (force_gc) { // needing to use the name "force_gc" here shows that the name is wrong...
    ... try allocation
    ... check if we should expand young gen beyond regular size due to GCLocker
  }

The other issue I have with using can_mutator_consume_free_regions() here is that there is already a very similar G1Policy::should_allocate_mutator_region; and anyway, the attempt_allocation_locked call may actually succeed without requiring a new region (in fact, it is not uncommon that another thread got a free region while this one was trying to take the Heap_lock).

I think a better place for can_mutator_consume_free_regions() is in G1Policy::should_allocate_mutator_region() for this case.

attempt_allocation_locked, however, does not return a reason for why allocation failed (at the moment). Maybe it is better to let it return a tuple of result and reason (or add a second "out" parameter)? I haven't tried how this would look, but it seems worth trying and better than the current way of handling this.

This could then also be used in the following code.
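A rough sketch of the "result plus reason" shape suggested above, using an out-parameter; the type and all names here are hypothetical illustrations, not the actual patch:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reasons an allocation attempt might fail.
enum class AllocFailReason { None, NoFreeRegions, PolicyVeto };

// Stand-in allocator; the real G1Allocator works on regions, not a
// single word counter.
struct FakeAllocator {
  std::size_t free_words;
  bool policy_allows;

  // Returns whether the allocation succeeded and reports the failure
  // reason through the `reason` out-parameter.
  bool attempt_allocation_locked(std::size_t word_size,
                                 AllocFailReason& reason) {
    if (!policy_allows) {
      reason = AllocFailReason::PolicyVeto;
      return false;
    }
    if (word_size > free_words) {
      reason = AllocFailReason::NoFreeRegions;
      return false;
    }
    free_words -= word_size;
    reason = AllocFailReason::None;
    return true;
  }
};
```

The caller could then decide between expanding, forcing a preemptive GC, or retrying based on the reported reason rather than a separate flag.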

Member Author


For this code I am investigating moving the check into either G1Policy::should_allocate_mutator_region() or its caller, so I can distinguish the reason why the allocation failed. Hopefully I will have something ready to push this week. I am on vacation, so my responses are a little slow.

@@ -451,7 +456,7 @@ HeapWord* G1CollectedHeap::attempt_allocation_slow(size_t word_size) {
if (should_try_gc) {
bool succeeded;
result = do_collection_pause(word_size, gc_count_before, &succeeded,
-                                GCCause::_g1_inc_collection_pause);
+                                GCCause::_g1_inc_collection_pause, force_gc);
Contributor


I really think it is better to add a new GCCause for this instead of the additional parameter. We are not doing a GC because we ran out of space due to the allocation of word_size, but a "pre-emptive" GC triggered by the allocation of word_size.
This warrants a new GC cause, imho.
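As an illustration of this suggestion, a dedicated cause could surface in logs roughly like this. The new enumerator and its log string are hypothetical; only the two existing cause names are taken from the diff above:

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-in for HotSpot's GCCause; a hypothetical
// _g1_preemptive_collection enumerator is added for the new heuristic.
enum class GCCause {
  _g1_inc_collection_pause,
  _g1_humongous_allocation,
  _g1_preemptive_collection  // hypothetical new cause
};

// Maps a cause to the string a GC log line would carry, so preemptive
// GCs stand out from ordinary evacuation pauses.
static const char* to_string(GCCause c) {
  switch (c) {
    case GCCause::_g1_inc_collection_pause:  return "G1 Evacuation Pause";
    case GCCause::_g1_humongous_allocation:  return "G1 Humongous Allocation";
    case GCCause::_g1_preemptive_collection: return "G1 Preemptive Collection";
  }
  return "unknown";
}
```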

@@ -872,7 +882,7 @@ HeapWord* G1CollectedHeap::attempt_allocation_humongous(size_t word_size) {
if (should_try_gc) {
bool succeeded;
result = do_collection_pause(word_size, gc_count_before, &succeeded,
-                                GCCause::_g1_humongous_allocation);
+                                GCCause::_g1_humongous_allocation, force_gc);
Contributor


I believe that if force_gc is true, the reason for this GC is not that we do not have enough space for the humongous allocation; rather, we are doing a "pre-emptive" GC because of the allocation of word_size, similar to before.

add_allocated_humongous_bytes_since_last_gc(size_in_regions * HeapRegion::GrainBytes);
return result;
size_t size_in_regions = humongous_obj_size_in_regions(word_size);
if (policy()->can_mutator_consume_free_regions((uint)size_in_regions)) {
Contributor


Again, I would prefer that the result of this be stored in a local not called force_gc :) and then used, instead of appending such a small else after the long'ish true case.

@bridgekeeper

bridgekeeper bot commented Jan 13, 2021

@charliegracie This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Feb 10, 2021

@charliegracie This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it!

Labels
hotspot-gc hotspot-gc-dev@openjdk.org rfr Pull request is ready for review