8317755: G1: Periodic GC interval should test for the last whole heap GC #16107
Conversation
👋 Welcome back shade! A progress list of the required criteria for merging this PR into
Webrevs
Initial comments when looking at it:

From what I understand from the CR description, the VM is not completely idle but has GCs now and then which do not (or almost do not) increase old gen occupancy. So the problem seems to be that the existing policy for the periodic gcs does not detect idleness (and the request to give back memory) in this case. The JEP mentions the

This is probably a significant behavioral change which I think is unintended: if the user enabled execution of a full gc instead of a concurrent marking, then instead of chugging along while there is at least some activity, there would now be regular full gcs regardless of "being idle" (depending on how you see idle; very long periods of only young gcs can be induced by e.g. an overly large heap). I do not think this is expected, even if the user asked for full gcs for periodic gcs.

Because this topic about periodic gcs has come up a few times, and the RFE does not describe the actual situation you are in, it would be interesting to have more information. For example, would it be acceptable to have some kind of interaction with the VM to set e.g.

There may be other aspects of this suggestion.

Formal things: This is a significant behavioral change which needs a CSR imo. The referenced JEP exactly defines the expected behavior to listen to young gcs.

/csr needed
@tschatzl has indicated that a compatibility and specification (CSR) request is needed for this pull request. @shipilev please create a CSR request for issue JDK-8317755 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.
Note that this requirement can be removed again if there are significant changes, so probably don't start writing yet... however the current change as is would need one imo.
We discussed this issue at Oracle internally a bit, with the following thoughts:

We are still not sure about the use case you want to cover with that change, and probably the option

Instead, the intention of this change is probably something more like general cleanup (potentially including heap compaction) that is currently only performed after a whole heap analysis. Similar issues were reported in the past (like https://mail.openjdk.org/pipermail/hotspot-gc-dev/2020-March/028857.html), but nobody acted on them yet.

So the problems we see here are:
What we think would be most appropriate for solving your problem (if the suggestion in an earlier post does not apply) currently would be
Also note that just for heap compaction there is this CR, https://bugs.openjdk.org/browse/JDK-8238687, that I've been working on, on and off, for a long time, and which imo only needs testing and cleanup (hah!). If you want to look into this, feel free to contact me.

Hth,
Thanks for looking at it!

Re-reading JEP 346: “Promptly Return Unused Committed Memory from G1”... In the scenario we are seeing, we do have lots of unused committed memory that would not be reclaimed promptly until a concurrent cycle executes. The need for that cleanup is, in the worst case, driven by heap connectivity changes that do not readily reflect in other observable heap metrics. A poster example would be a cache sitting perfectly in "oldgen", eventually dropping its entries and producing a huge garbage patch. We would only discover this after whole heap marking. In some cases, tuning based on occupancy (e.g. soft max heap size, heap reserve, etc.) would help if we promote enough stuff to actually trigger the concurrent cycle. But if we keep churning very efficient young collections, we would get there very slowly.

Therefore, I’d argue the current behavior is against the spirit of JEP 346, even though the letter says that we track “any” GC for periodic GC. There is no explicit mention of why young GCs should actually be treated as recent GCs, even though — with the benefit of hindsight — they throw away the promptness guarantees.

Aside: Shenandoah periodic GC does not make any claims it would run only in idle phases, albeit the story is simpler without young GCs. But Generational Shenandoah would follow the same route: the periodic whole heap GC would start periodically regardless of whether young collections run in between or not.

The current use of “idle” is also awkward in JEP 346. If user enables

I doubt that users would mind if we change the behavior of

So as a compromise I am leaning towards introducing a flag that defines whether periodic GC checks for the last whole heap collection or not. The default can still be the current behavior, but the behavior we want here is also easily achievable. This is implemented as a new commit. (I realize this still requires a CSR, being a product flag and all.)
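For illustration, the gist of that check can be sketched in a few lines of Java (this is not the HotSpot code, which lives on the C++ side, and checkLastWholeHeapGc below is only a placeholder, not the actual name of the proposed flag):

    class PeriodicGCPolicySketch {
        // Sketch only: just the shape of the decision being discussed.
        static boolean shouldStartPeriodicGC(long nowMs,
                                             long lastAnyGcMs,
                                             long lastWholeHeapGcMs,
                                             long periodicIntervalMs,
                                             boolean checkLastWholeHeapGc) {
            if (periodicIntervalMs == 0) {
                return false;   // an interval of zero means periodic GC is disabled
            }
            // Current behavior: any GC, including young-only pauses, resets the idle timer.
            // Proposed behavior: only a whole-heap GC (concurrent mark or full GC) resets it.
            long reference = checkLastWholeHeapGc ? lastWholeHeapGcMs : lastAnyGcMs;
            return nowMs - reference >= periodicIntervalMs;
        }
    }

With the placeholder flag off, this degenerates to the current behavior; with it on, steady young-only activity no longer postpones the periodic whole-heap cycle.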
I believe that having periodic Garbage Collections (whether they be young, concurrent, full or mixed) may be something that user applications need in certain scenarios. However, the current JVM lacks a way for Java to invoke a specific kind of GC cycle. If the WhiteBox API (or something equivalent) were available in Java, a user application could easily implement periodic GCs, along with any custom tweaks, e.g. the behavior of master -- short-circuiting periodic concurrent GC. That way, one can avoid confusion around
The user can already do this to some degree with

I think the point @shipilev (and @gctony) make is that there are resources (not only Java heap memory) that ought to be cleaned up "in time" to keep resource usage and performance "optimal". The end user has even less insight into this than the VM - being "idle" is only a convenient time (for the user) to do that, so making the end user do random stuff does not make anyone happy. This is corroborated by my experience working with end users actually doing that (i.e. forcing a whole heap analysis externally). The solution was to tell them not to do that, because they were spamming these to an extent very detrimental to performance - they did not know better (it did not help that at that time a forced concurrent cycle interrupted an ongoing one, so it happened that they effectively went into a full gc sometimes).

Both ZGC and Shenandoah have this regular time-based cleanup (and CMS also provided the option). Not the best solution, but workable.
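As a rough sketch of that "do it yourself, by timer" route, an application can already schedule whole-heap collections on its own; this assumes the VM is started with -XX:+ExplicitGCInvokesConcurrent so the forced System.gc() becomes a concurrent cycle rather than a stop-the-world full GC:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PeriodicWholeHeapGC {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor(r -> {
                        Thread t = new Thread(r, "periodic-gc");
                        t.setDaemon(true);   // do not keep the VM alive just for this thread
                        return t;
                    });
            // Request a whole-heap collection once per hour, regardless of young GC activity.
            scheduler.scheduleAtFixedRate(System::gc, 1, 1, TimeUnit.HOURS);

            // ... the rest of the application would run here ...
            Thread.currentThread().join();   // placeholder so this sketch does not exit immediately
        }
    }

As noted above, this is coarse (the application cannot pick a particular kind of cycle) and easy to overdo, which is exactly why letting the GC itself make the call is attractive.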
@shipilev : I have not made up my mind about the other parts of your proposal, but:
This is the reason for the
Yes, that is why I said "without doing anything else". With that example, I wanted to point out that the definition of "idle" is already quite murky even with the current JEP, where we have an additional option to tell whether "idle" includes the actual system load. In this view, having another option that would tell whether "idle" includes young GCs fits well, I think.
Yes, and that is the beauty of periodic GCs that are driven by the GC itself, which knows it should not start the periodic GC if there was recent activity. That part is captured in the JEP, I think. What this PR does is extend the usefulness of periodic GCs to the cases where you need a whole heap analysis to make the best call about what to do next. In the G1 case, maybe discovering that we need to start running mixed collections to unclutter the heap. We can indeed make an external agent that calls
Mailing list message from Kirk Pepperdine on hotspot-gc-dev:

Don't we already have a GCInterval with a default value of Long.MAX_VALUE?

When I hear the word idle I immediately start thinking, CPU idle. In this case however, I quickly shifted to memory idle, which I think translates nicely into how idle the allocators are. Thus basing heap sizing ergonomics on allocation rates seems like a reasonable metric until you consider the edge cases. The most significant "edge case" IME is when GC overheads start exceeding 20%. In those cases GC will throttle allocations, and that in turn causes ergonomics to reduce sizes instead of increasing them (to reduce GC overhead). My conclusion from this is that ergonomics should consider both allocation rates and GC overhead when deciding how to resize the heap at the end of a collection.

Fortunately there is a steady stream of GC events that create a convenient point in time to make an ergonomic adjustment. Not having allocations, and as a result not having the collector run, implies one has to manufacture a convenient point in time to make an ergonomic sizing decision. Unfortunately time based triggers are speculative, and the history of speculatively triggered GC cycles has been less than wonderful (think DGC as but one case). My last consulting engagement prior to joining MS involved me tuning an application where the OEM's recommended configuration (set out of the box) was to run a full collection every two minutes. As one can imagine, the results were devastating. It took 3 days of meetings with various stakeholders and managers to get permission to turn that setting off. If the application is truly idle then it's a no foul, no harm situation. However there are entire classes of applications that are light allocators, and the question would be, what would be the impact of speculative collections on those applications?

As for returning memory, two issues: first, there appears to be no definition for "unused memory". Secondly, what I can say after looking at 1000s of GC logs is that the amount of floating garbage that G1 leaves behind even after several concurrent cycles is not insignificant. I also wrote a G1 heap fragmentation viewer, and what it revealed is that the heap remains highly fragmented and scattered after each GC cycle. All this suggests that the heap will need to be compacted with a full collection in order to return a significantly large enough block of memory to make the entire effort worthwhile. Again, if the application is idle, then no harm, no foul. However, for those applications that are memory-idle but not CPU-idle this might not be a great course of action.

In my mind, any trigger for a speculative collection would need to take into consideration allocation rates, GC overhead, and mutator busyness (for cases when GC and allocator activity is low to 0).

Kind regards,
Sorry for the late reply, I have been busy with getting the 423 JEP out the door :)
There is no such flag.
Allocation rate directly impacts GC overhead by causing more GCs.
See JDK-8238687. That CR and this proposed feature complement each other; GC should do something when there is no steady stream of gc events.
G1 currently respects
There is no harm with fragmentation on a region level; G1 will give back memory on a region level anyway if able to do so (see
Hence the flag described in the JEP. The reason why the defaults disable the CPU-idle check is that there is no good default.
It does (if told to do so appropriately). From @shipilev :
I think there is still a misconception with the current "periodic gcs" here: it is first and foremost thought to be applicable for catching the idle case with all its options (also consider cpu idle, do a full or concurrent gc). This is the reason for the suggestion to phase out the current option

I could easily imagine the use case where one would want to act fairly quickly (and allow a disruptive full gc) on the detected idle case ("shortly" after detecting idle), and be more relaxed about general cleanup (long period, being least disruptive by using a concurrent mark).

Thomas
Yes, "idle case". This PR adds another option that lets users to further fine-tune what "idle case" is.
IMO, the cost/benefit of renaming flags is too high to be worthwhile here. It would add further confusion for users, as we change more and more flags across JDK releases. I would argue that the proposed scheme fits the existing amendments to the periodic GC configurations:
All right, so Shenandoah has

It looks to me like creating a parallel hierarchy of periodic GC control options. If we phase out
Doing periodic whole heap inspections to free resources should not be tied to some notion of "idle" (ideally gc would schedule them at that time, but that is another matter). E.g. the problem mentioned in https://mail.openjdk.org/pipermail/hotspot-gc-dev/2020-March/028857.html is independent of idle cleanup. The same issue applies to finalizers/symbol table/etc. (also written up in https://bugs.openjdk.org/browse/JDK-8213198). As far as I understand your proposal, in combination with
Now one can argue that it is up to the user of
There is also the option to trigger full gcs on "idle"; it would be fairly bad if they were triggered regularly with the
LGTM. Very useful when there is no direct feedback loop from resource pressure that we might otherwise have used for signaling the need for a whole heap GC, so it is a good escape hatch to put in the hands of service owners. Of course, like any escape hatch, it could be misused, but I don't think that is a concern; think of ejection seats in fighter jets, for example. We trust those who fly the jets not to eject for no reason.
I would suggest formally dissociating this from the JEP itself. The fix-ups to the JEP's and its implementation's shortcomings can be addressed in the fullness of time. My USD 0.02.
I do not think that breaking existing functionality to add new functionality (given that the former is widely publicized via the JEP for the default collector) and "fixing it up later" is the proper way forward.
See comment stream.
@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@shipilev This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the
/open
@shipilev This pull request is now open
@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Hi Thomas @tschatzl :
Thanks for the response. I read the JEP (authored by you, Thomas). I got the impression from the verbiage there (albeit slightly ambiguous) that the intention was to do periodic gc's which would induce a concurrent major cycle and collect the entire heap, because the objective was to run a cycle that reclaims garbage in the absence of other triggers and returns memory to the OS. The JEP also states that a major cycle is essential because a minor cycle doesn't suffice. As such, the only rational solution that meets the proposed objectives of the JEP is to consider major cycles as the criterion for the timer whose tripping triggers the major cycle.

I submit that, from this point of view, this PR actually corrects a small defect in the original implementation, and does so in a backward compatible fashion, so that any users dependent on the old, arguably defective, behavior are not surprised. I think your stance, if I understood correctly, is that there are other things that we have learned since that need to be corrected too, but it looks like we are then allowing the perfect to be the enemy of the better by not allowing this correction to be made. I may have misunderstood the long comment stream, so I will go back and reread in case I missed something crucial about the objection to this PR in its current form.
Clearly ...
I now see that you noted it might be better to separate this out into its own independent controller option:
I believe I now understand your objection. Thanks, and sorry for the noise :-)
❗ This change is not yet ready to be integrated.
@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@shipilev This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the
See the description in the bug. Fortunately, we already track the last whole-heap GC. The new regression test verifies the behavior.
Additional testing:
tier1 tier2 tier3
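(Not the regression test added by this PR, but a simple way to watch periodic collections by hand: run an otherwise idle program under G1 with a short interval and check the GC log. The class name and flag values below are illustrative only.)

    public class ObservePeriodicGC {
        // Illustrative run line (flag values chosen for demonstration only):
        //   java -XX:+UseG1GC -XX:G1PeriodicGCInterval=5000 -Xlog:gc ObservePeriodicGC
        // With the VM otherwise idle, periodic collections should show up in the GC log.
        public static void main(String[] args) throws InterruptedException {
            byte[][] retained = new byte[4096][];
            for (int i = 0; i < retained.length; i++) {
                retained[i] = new byte[16 * 1024];   // touch some heap, then go idle
            }
            Thread.sleep(60_000);                    // stay idle long enough for periodic GCs to fire
        }
    }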
Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16107/head:pull/16107
$ git checkout pull/16107
Update a local copy of the PR:
$ git checkout pull/16107
$ git pull https://git.openjdk.org/jdk.git pull/16107/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16107
View PR using the GUI difftool:
$ git pr show -t 16107
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16107.diff
Webrev
Link to Webrev Comment