8140326: G1: Consider putting regions where evacuation failed into next collection set #14220
👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into |
497d402
to
dba3b65
Compare
…ate evacuation failed regions asap. It consists of several changes:
* collect dirty cards into evacuation failed regions proactively (always). In my tests, the amount of cards/live objects has been very small. Dirty cards are always put into the global refinement buffer, assuming that we will keep most if not all evacuation failed regions.
* post evac phase 2 will determine whether the region will be retained (kept for "immediate" evacuation) or not.
* these regions will be collected in a new "from retained" set in the collection set candidates. Having the "from retained" and "from marking" sets separate in the collection set candidates is ultimately easier than using a single list and then picking from it.
  * particularly with respect to making sure that mixed gcs preferentially pick from the "from marking" list first, then second from the "from retained" list.
* changes to determining the collection set:
  * at any time before gc (determining young gen length), if there are entries in the retained collection set candidates set ("retained set" in the following), g1 reserves up to 20% of max gc pause time for it (random number) to make sure that these are cleared out asap to free memory.
  * during gc, the collection set is preferentially (first) populated with regions from the "from marking" candidates (these are the important regions to get cleaned out), second from the "from retained" list.
* changes to end of gc/marking
  * retained regions will not be marked through during concurrent mark, i.e. they are considered outside of the snapshot. So they are added to the "root regions" during the concurrent start pause. This may be a performance issue (we can't do a gc until all root regions have been marked through), but so far, since evacuation failure regions are typically very sparsely populated, this is very fast.
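The selection order described above (take "from marking" candidates first, then "from retained" ones) can be illustrated with a small model. This is only a sketch in Python, not the HotSpot code: the `Region` type, the per-region predicted times, and the rule of stopping at the first region that does not fit the remaining budget are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    predicted_ms: float  # hypothetical predicted evacuation time


def pick_old_regions(from_marking, from_retained, budget_ms):
    """Fill the collection set from the marking list first, then the retained list."""
    picked = []
    remaining = budget_ms
    # The order of this tuple encodes the preference stated in the description.
    for candidates in (from_marking, from_retained):
        for region in candidates:
            if region.predicted_ms > remaining:
                break  # assumed rule: stop with this list once a region does not fit
            picked.append(region.name)
            remaining -= region.predicted_ms
    return picked


marking = [Region("M0", 4.0), Region("M1", 5.0)]
retained = [Region("R0", 1.0), Region("R1", 3.0)]
selected = pick_old_regions(marking, retained, budget_ms=10.5)
print(selected)
```

With these made-up numbers both marking candidates are taken before any retained region is considered, and only one retained region still fits afterwards.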
dba3b65
to
263ff28
Compare
Because I've been asked about why the strict separation of from-marking and retained regions in the policy: to keep the current gc cycle policy fairly intact. The current heuristic to do young gcs, then a fixed amount of mixed gcs that clean out the old gen asap is, as ugly as it is, surprisingly good in the general case.
Treating the retained regions the same as from-marking regions would make it necessary to rethink that: retained regions are often/most of the time prime targets for evacuation (high efficiency), which means that g1 would start concentrating on these regions first even during the mixed phase (which is generally fine...), but due to how mixed gc works (use "smallest young gen", conceptually fixed amount of gcs) it ultimately would not clean out old gen fast enough (or completely, depends) as tuned right now. All the low efficiency regions would need to be cleaned out later. However, the prediction isn't good enough to cope well with them. They are typically predicted worse than high efficiency ones, which means failing to be exact for them is worse than for high efficiency regions, so it will not take enough of them.
Now one could extend the mixed phase, but in corner cases g1 would then potentially stay in mixed phase forever (if evacuation failure and, later, pinned regions were commonly encountered but still few), which prohibits marking (leading to full gcs) and degrades performance (as it will be using a small young gen).
In some way, evacuation failed regions can be seen as kind of extra regions/work due to a mismatch between application and VM configuration (compared to current master). Concentrating on that extra work in the phase that's about keeping the gc cycle going without full gc isn't the best thing to do (the current policy just has fairly strong provisions to take low efficiency regions and avoid full gc). In particular, it has less overall impact to stuff high efficiency region collection into the existing young gcs that can go almost if not at full speed (i.e. it has less impact to be wrong about the prediction of a high efficiency region vs. a low efficiency one).
Basically I think g1 ultimately needs to get away from what mixed gc is now, and how g1 determines when to start/stop that reclamation phase and how to determine the "right" amount of things to evacuate at what speed. That will certainly have to do something with making sure that the old gen allocation rate is countered, and at the same time being efficient overall (i.e. doing more than necessary to do as many young collections as possible within allowed time goal) without degrading into full gcs. Fwiw: https://bugs.openjdk.org/browse/JDK-8159697. I tried and failed to do that in this change (to be better than the current heuristic).
Apart from being a different topic, this change in itself has its own merit: it improves resiliency vs. evacuation failure. In the past, what you often had was an evacuation failure because you ran out of memory. Since this produced lots of garbage, there has been a very high risk of getting into another evacuation failure because the even more decreased available free space causes gcs more often (smaller young gen available), which causes more surviving objects, resulting in more serious evacuation failures (with more live objects). Ultimately you end up with a full gc very quickly because there is not enough time to mark through the old gen.
Obviously it is also an important stepping stone for handling pinned regions reasonably.
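To make the "high efficiency" argument concrete, here is a toy calculation in Python. It assumes efficiency means reclaimable bytes per millisecond of predicted evacuation time, and that evacuation time grows linearly with live data; the region size, the cost per MB, and the region names are invented for illustration and are not taken from the change.

```python
REGION_MB = 4   # assumed region size
MS_PER_LIVE_MB = 2  # assumed cost model: time proportional to live data

# Hypothetical candidates: name -> live data in MB.
live_mb = {
    "marking-A": 3.0,    # densely populated old region
    "marking-B": 2.0,
    "retained-C": 0.25,  # sparsely populated evacuation failed region
}


def gc_efficiency(live):
    """Reclaimable MB per predicted millisecond of evacuation work."""
    reclaimable = REGION_MB - live
    predicted_ms = live * MS_PER_LIVE_MB
    return reclaimable / predicted_ms


# A single merged list ordered by efficiency puts the retained region first.
ranked = sorted(live_mb, key=lambda name: gc_efficiency(live_mb[name]), reverse=True)
print(ranked)
```

Under these assumptions a merged, efficiency-ordered candidate list would always start with the retained region, which is the effect the separation into two sets is meant to avoid during the mixed phase.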
Lgtm!
@tschatzl This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 48 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
- improve comment in ClearRetainedRegionData (formerly ClearRetainedRegionBitmap) about why we need to clear marking data here - added separate flag for threshold for retaining evacuation failed regions - fix bug in calculating total base time
- do not add retained regions time during base time calculation when finalizing young gen - add marking state (tams, bitmap, live bytes) verification at the end of a concurrent start pause (with -XX:+VerifyAfterGC)
During review, @albertnetymk found that there is a bug in base time calculation when finalizing the young generation: it added retained regions time, which is wrong. Evacuating retained regions is optional and should not count there. There has also been the question why the policy to take retained regions is different from old regions: as described earlier, retained regions are considered fairly optional to take, and any retained region that one takes is better than the current handling. It is also very hard to come up with a good policy that subsumes both from-marking regions and retained regions; this change does not even attempt that. The current policy for retained regions uses magic constants at this time, based on the experience that while we want these regions to be fairly optional, we still want to account for them somewhat when determining eden size. Then there has been the question about
This verification passed tier 1-5 without any special flag and with all verification (before/during/after including this new verification) on plus |
- More documentation - some renames - also exclude collection set regions from initial TAMS update as it is likely they will not fail evacuation
…ed-into-collection-set
…nt_start to before determining collection set to simplify condition what regions to set tams for
/contributor add @albertnetymk |
@tschatzl |
After some discussion with @albertnetymk we agreed that updating stats (tams, liveness) as implemented in that other branch would be the best way forward in this matter. |
I also did another merge with master. |
Pushed a new change, testing more thoroughly. |
Correctly use live_bytes in set_live_bytes() Some includes, fix using live bytes in set_live_bytes()
After some discussion with @albertnetymk we changed the following:
|
Still LGTM! |
Thanks @albertnetymk @walulyai for your reviews /integrate |
tier1-5 passed |
Going to push as commit 7e20952.
Your commit was automatically rebased without conflicts. |
This change adds management of retained regions, i.e. trying to evacuate evacuation failed regions asap.
The advantage is that evacuation failed regions do not need to wait until the next marking to be cleaned out. They are often very sparsely occupied (often being eden regions), so leaving them in place wastes a lot of space, potentially causing additional evacuation failures later on.
Another use of this change will be region pinning: pinned regions are basically evacuation failed regions that cannot be reclaimed as long as they are pinned. However, as soon as they are unpinned, they should be reclaimed for the same reasons.
It consists of several behavioral changes:
During garbage collection:
... in the Evacuation phase:
... during Post Evacuation 2/Free Collection Set phase:
These "retained" regions are collected in a new "from retained" set in the collection set candidates and managed separately from "from marking" regions. Keeping the "from retained" and "from marking" sets separate in the collection set candidates is easier to manage than using a single list and then picking from it, particularly with respect to making sure that mixed gcs preferentially pick from the "from marking" list first, then from the "from retained" list.
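As a rough model of the retain decision and the two candidate sets, consider the Python sketch below. It is an illustration under assumptions, not the implementation: the change mentions a threshold flag for retaining evacuation failed regions, but the idea that the threshold is compared against the region's live bytes, the value 85, and all names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionSetCandidates:
    # Two separately managed candidate sets, as described above.
    from_marking: list = field(default_factory=list)
    from_retained: list = field(default_factory=list)


def should_retain(live_bytes, region_bytes, live_threshold_percent):
    """Assumed rule: retain only regions whose liveness is below the threshold."""
    return live_bytes * 100 < region_bytes * live_threshold_percent


def after_evacuation_failure(candidates, name, live_bytes, region_bytes, threshold):
    """Decide, after the pause, whether a failed region becomes a retained candidate."""
    if should_retain(live_bytes, region_bytes, threshold):
        candidates.from_retained.append(name)
        return True
    return False  # otherwise the region waits for the next marking as before


REGION_BYTES = 4 * 1024 * 1024  # assumed region size
candidates = CollectionSetCandidates()
sparse = after_evacuation_failure(candidates, "eden-17", 64 * 1024, REGION_BYTES, 85)
dense = after_evacuation_failure(candidates, "old-3", 4000 * 1024, REGION_BYTES, 85)
print(candidates)
```

The sparsely populated region ends up in the retained set, while the almost full one is left alone and the "from marking" set is never touched by this path.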
... determining the collection set during the pause:
During marking
... changes to marking proper
... changes to scrubbing
During mutator time:
... try to accommodate retained candidate regions in the predictions, giving them at most 20% of pause time (random value)
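The 20% cap can be read as simple arithmetic on the pause time goal. The Python sketch below assumes the reserve is the smaller of the predicted time for all retained candidates and 20% of the maximum pause time, and that the remainder after a base cost is what is left for sizing the young generation; the function names and the sample numbers are hypothetical.

```python
RETAINED_SHARE_PERCENT = 20  # the "random value" from the description


def retained_reserve_ms(max_pause_ms, retained_predictions_ms):
    """Time set aside for retained candidates, capped at 20% of the pause goal."""
    if not retained_predictions_ms:
        return 0.0  # nothing retained, nothing reserved
    cap = max_pause_ms * RETAINED_SHARE_PERCENT / 100
    return min(sum(retained_predictions_ms), cap)


def young_budget_ms(max_pause_ms, base_ms, retained_predictions_ms):
    """Pause time left for young regions after base cost and retained reserve."""
    reserve = retained_reserve_ms(max_pause_ms, retained_predictions_ms)
    return max(0.0, max_pause_ms - base_ms - reserve)


# 200 ms pause goal, 20 ms base cost, retained regions predicted at 55 ms total.
capped = young_budget_ms(200, 20, [10.0, 15.0, 30.0])
# Same goal, but only 5 ms of retained work predicted.
uncapped = young_budget_ms(200, 20, [5.0])
print(capped, uncapped)
```

In the first case the reserve hits the 40 ms cap, in the second only the predicted 5 ms is set aside, so retained regions never shrink the young generation by more than a fifth of the pause goal in this model.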
Testing: multiple tier1-5 runs, with forced verification on and/or induced evacuation failure
Progress
Issue
Reviewers
Contributors
<ayang@openjdk.org>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14220/head:pull/14220
$ git checkout pull/14220
Update a local copy of the PR:
$ git checkout pull/14220
$ git pull https://git.openjdk.org/jdk.git pull/14220/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14220
View PR using the GUI difftool:
$ git pr show -t 14220
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14220.diff
Webrev
Link to Webrev Comment