Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8140326: G1: Consider putting regions where evacuation failed into next collection set #14220

Conversation

tschatzl
Copy link
Contributor

@tschatzl tschatzl commented May 30, 2023

This change adds management of retained regions, i.e. trying to evacuate evacuation failed regions asap.

The advantage is that evacuation failed regions do not need to wait until the next marking to be cleaned out; as they are often very sparsely occupied (often being eden regions), this occupies a lot of space, potentially causing additional evacuation failures later on.
Another use of this change will be region pinning, which are basically evacuation failed regions that can not be reclaimed as long as they are pinned - however as soon as they are unpinned, they should be reclaimed for the same reasons as well.

It consists of several behavioral changes:

During garbage collection:

... in the Evacuation phase:

  • always collect dirty cards into evacuation failed regions proactively. In tests, the amount of cards/live objects per evacuation failed region is typically very small. Dirty cards are always put into the global refinement buffer immediately, assuming that we will keep most if not all evacuation failed regions.

... during Post Evacuation 2/Free Collection Set phase:

  • determine whether the region will be retained (kept for "immediate" evacuation) or not. Highly occupied regions are assumed to stay (mostly) live at least until the next marking, so do not bother with them.

These "retained" regions are collected in a new "from retained" set in the collection set candidates and managed separately from "from marking" regions. Having the "from retained" and "from marking" sets separate in the collection set candidates is easier to manage than to use a single list and the picking stuff from it. Particularly wrt to making sure that mixed gcs preferentially pick from the "from marking" list first, then second from the "from retained" list.

... determining the collection set during the pause:

  • during gc, the collection set is preferentially (first) populated with regions from the "from marking" candidates (these are the important regions to get cleaned out), second from the "from retained" list.
  • g1 reserves up to 20% of max gc pause time for retained regions as optional candidates (this is a random number) to make sure that these are cleared out asap to free memory. There is also a minimum number of regions to take from the retained regions list.

During marking

... changes to marking proper

  • retained regions will not be marked through during concurrent mark, i.e. they are considered outside of the snapshot. So they are added to the "root regions" during the concurrent start pause. This may be a performance issue (we can't do a gc until all root regions have been marked through), but so far since evacuation failure regions are typically very sparsely populated, this is very fast.

... changes to scrubbing

  • during scrubbing, regions may now be reclaimed. That means scrubbing needs to be aware of (more) regions being reclaimed while working on them.

During mutator time:

... try to accomodate retained candidate regions in the predictions, giving them at most 20% of pause time (random value)

Testing: multiple tier1-5 runs, with forced verification on and/or induced evacuation failure


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8140326: G1: Consider putting regions where evacuation failed into next collection set (Enhancement - P4)

Reviewers

Contributors

  • Albert Mingkun Yang <ayang@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14220/head:pull/14220
$ git checkout pull/14220

Update a local copy of the PR:
$ git checkout pull/14220
$ git pull https://git.openjdk.org/jdk.git pull/14220/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14220

View PR using the GUI difftool:
$ git pr show -t 14220

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14220.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented May 30, 2023

👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title 8140326 8140326: G1: Consider putting regions where evacuation failed into next collection set May 30, 2023
@openjdk
Copy link

openjdk bot commented May 30, 2023

@tschatzl The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label May 30, 2023
@tschatzl tschatzl force-pushed the submit/8140326-evacuate-retained-regions-at-any-time branch from 497d402 to dba3b65 Compare May 31, 2023 07:29
…ate evacuation failed

regions asap.

It consists of several changes:

* collect dirty cards into evacuation failed regions proactively (always). In my tests, the
  amount of cards/live objects has been very small. Dirty cards are put into the global
  refinement buffer always, assuming that we will keep most if not all evacuation failed
  regions.

* post evac phase 2 will determine whether the region will be retained (kept for "immediate"
  evacuation) or not.

* these regions will be collected in a new "from retained" set in the collection set
  candidates. Having the "from retained" and "from marking" sets separate in the collection
  set candidates is ultimately easier than to use a single list and the picking stuff from
  it.
    * particularly wrt to making sure that mixed gcs preferentially pick from the "from
      marking" list first, then second from the "from retained" list.

* changes to determining the collection set:
  * at any time before gc (determining young gen length) if there are entries in the retained
    collection set candidates set ("retained set" in the following) g1 reserves up to 20% of
    max gc pause time for it (random number) to make sure that these are cleared out asap to
    free memory.
  * during gc, the collection set is preferentially (first) populated with regions from the
    "from marking" candidates (these are the important regions to get cleaned out), second
    from the "from retained" list.

* changes to end of gc/marking
  * retained regions will not be marked through during concurrent mark, i.e. they are
    considered outside of the snapshot. So they are added to the "root regions" during the
    concurrent start pause.
    This may be a performance issue (we can't do a gc until all root regions have been
    marked through), but so far since evacuation failure regions are typically very sparsely
    populated, this is very fast.
@tschatzl tschatzl force-pushed the submit/8140326-evacuate-retained-regions-at-any-time branch from dba3b65 to 263ff28 Compare May 31, 2023 07:37
@tschatzl tschatzl marked this pull request as ready for review May 31, 2023 07:37
@openjdk openjdk bot added the rfr Pull request is ready for review label May 31, 2023
@mlbridge
Copy link

mlbridge bot commented May 31, 2023

@tschatzl
Copy link
Contributor Author

tschatzl commented Jun 1, 2023

Because I've been asked about why the strict separation of from-marking and retained regions in the policy: to keep the current gc cycle policy fairly intact.
I.e. do the same as before, with some almost optional additional reclamation of known sparsely populated regions (compared to existing code).

The current heuristic to do young gcs, then a fixed amount of mixed gcs that clean out the old gen asap is, as ugly as it is, surprisingly good in the general case. Treating the retained regions the same as from-marking regions would make it necessary to rethink that: Retained regions are often/most of the time prime targets for evacuation (high efficiency), which means that g1 would start concentrating on these regions first even during the mixed phase (which is generally fine...), but due to how mixed gc works (use "smallest young gen", conceptually fixed amount of gcs) it ultimately would not clean out old gen fast enough (or completely, depends) as tuned right now. All the low efficiency regions would need to be cleaned out later. However the prediction isn't good enough to cope well with them. They are typically predicted worse than high efficicency ones, that means failing to be exact for them is worse than for high efficiency regions, so it will not take enough of them.

Now one could extend the mixed phase, but in corner cases g1 would then potentially stay in mixed phase (if evacuation failure/and later pinned regions were commonly encountered but still few) forever, which prohibits marking (leading to full gcs), and degrades performance (as it will be using a small young gen).

In some way, evacuation failed regions, can be seen as kind of extra regions/work due to a mismatch between application and VM configuration (compared to current master). Concentrating on that extra work in the phase that's about keeping the gc cycle going without full gc isn't the best thing to do (the current policy just has fairly strong provisions to take low efficiency regions and avoid full gc), particularly it has less overall impact to stuff high efficiency region collection into the existing young gcs that can go almost if not at full speed (i.e. it has less impact to be wrong about the prediction of a high efficiency region vs. low efficiency one).

Basically I think g1 ultimately needs to get away from what mixed gc is now, and how g1 determines when to start/stop that reclamation phase and how to determine the "right" amount of things to evacuate at what speed. That will certainly have to do something with making sure that the old gen allocation rate is countered, and at the same time being efficient overall (i.e. doing more than necessary to do as many young collections as possible within allowed time goal) without degrading into full gcs.

Fwiw: https://bugs.openjdk.org/browse/JDK-8159697.

I tried and failed to do that in this change (to be better than the current heuristic). Apart from being a different topic, this change in itself has its own merit: it improves resiliency vs. evacuation failure. In the past, what you often had is like having an evacuation failure because you ran out of memory. Since this produced lots of garbage, there has been a very high risk of getting into another evacuation failure because the even more decreased available free space causes gcs more often (smaller young gen available), which causes more surviving objects, resulting in more serious evacuation failures (with more live objects). Ultimately you end up with a full gc very quickly because there is not enough time to mark through the old gen.
With this change at least in the next young gc g1 will compact the evacuation failed regions fairly quickly (next gc), allow better recovery and/or allowing G1 to more easily not loose the race with the mutator until the next marking cycle.

Obviously it is also an important stepping stone for handling pinned regions reasonably.

src/hotspot/share/gc/g1/g1Policy.cpp Show resolved Hide resolved
src/hotspot/share/gc/g1/g1Policy.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1CollectionSetCandidates.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1CollectionSetChooser.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1ParScanThreadState.hpp Outdated Show resolved Hide resolved
test/hotspot/jtreg/gc/g1/TestVerifyGCType.java Outdated Show resolved Hide resolved
Copy link
Member

@walulyai walulyai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@openjdk
Copy link

openjdk bot commented Jun 6, 2023

@tschatzl This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8140326: G1: Consider putting regions where evacuation failed into next collection set

Co-authored-by: Albert Mingkun Yang <ayang@openjdk.org>
Reviewed-by: iwalulya, ayang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 48 new commits pushed to the master branch:

  • 4726960: 8313779: RISC-V: use andn / orn in the MD5 instrinsic
  • bbbfa21: 8313880: Incorrect copyright header in jdk/java/foreign/TestFree.java after JDK-8310643
  • 0bb6af3: 8313791: Fix just zPage.inline.hpp and xPage.inline.hpp
  • 4b192a8: 8313676: Amend TestLoadIndexedMismatch test to target intrinsic directly
  • 0b4387e: 8310643: Misformatted copyright messages in FFM
  • 538f955: 8313701: GHA: RISC-V should use the official repository for bootstrap
  • 226cdc6: 8312585: Rename DisableTHPStackMitigation flag to THPStackMitigation
  • dc01604: 8305636: Expand and clean up predicate classes and move them into separate files
  • a38fdaf: 8166900: If you wrap a JTable in a JLayer, the cursor is moved to the last row of table by you press the page down key.
  • c1f4595: 8311160: [macOS, Accessibility] VoiceOver: No announcements on JRadioButtonMenuItem and JCheckBoxMenuItem
  • ... and 38 more: https://git.openjdk.org/jdk/compare/5d1b911c92b933c257c8e9afe1464ec175ca1cc2...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 6, 2023
src/hotspot/share/gc/g1/g1Policy.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1CollectionSetCandidates.hpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1Policy.cpp Show resolved Hide resolved
tschatzl added 2 commits June 13, 2023 11:34
- improve comment in ClearRetainedRegionData (formerly ClearRetainedRegionBitmap) about why we need to clear marking data here
- added separate flag for threshold for retaining evacuation failed regions
- fix bug in calculating total base time
- do not add retained regions time during base time calculation when finalizing young gen
- add marking state (tams, bitmap, live bytes) verification at the end of a concurrent start pause (with -XX:+VerifyAfterGC)
@tschatzl
Copy link
Contributor Author

During review, @albertnetymk found that there is a bug in base time calculation when finalizing the young generation: it added retained regions time, which is wrong. Evacuating retained regions is optional and should not count there.

There has also been the question why the policy to take retained regions is different from old regions: as described earlier, retained regions are considered fairly optional to take, and any retained region that one takes is better than the current handling. It is also very hard to come up with a good policy that subsumes both from-marking regions and retained regions; this change does not even attempt that. The current policy for retained regions uses magic constants at this time based on experience that while we want these regions be fairly optional, we still want to account for them when determining eden size somewhat.

Then there has been the question about ClearRetainedRegionData again and what marking expects at mark start for particular regions:

  • regions that are not going to be marked through need tams at bottom(), no marks on the bitmap and their liveness value zero.
  • regions that are going to be marked through need tams at top(), and either no marks on the bitmap and their liveness value zero OR marks on the bitmap and their liveness value correspond to the total size of the marked objects. This is the case for evacuation failed regions that are not retained (or, before this change, all evacuation failed regions). There is now some code to verify that during VerifyAfterGC.

This verification passed tier 1-5 without any special flag and with all verification (before/during/after including this new verification) on plus G1EvacuationFailureALot enabled.

tschatzl and others added 5 commits June 30, 2023 10:36
- More documentation
- some renames
- also exclude collection set regions from initial TAMS update as it is likely they will not fail evacuation
…nt_start to before determining collection set to simplify condition what regions to set tams for
@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 3, 2023

/contributor add @albertnetymk

@openjdk
Copy link

openjdk bot commented Aug 3, 2023

@tschatzl
Contributor Albert Mingkun Yang <ayang@openjdk.org> successfully added.

@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 3, 2023

After some discussion with @albertnetymk we agreed that updating stats (tams, liveness) as implemented in that other branch would be the best way forward in this matter.
There has been a minor optimization to note_start_of_marking(): if the calls to it were executed before selecting the collection sets, the condition to set the TAMS could be simplified greatly, because all regions in the collection set must have been candidates before.
This is implemented in the latest commit.
Currently testing this change more thoroughly, with initial local testing having passed.

@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 3, 2023

I also did another merge with master.

src/hotspot/share/gc/g1/g1HeapVerifier.cpp Show resolved Hide resolved
src/hotspot/share/gc/g1/g1HeapVerifier.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1Policy.cpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1CollectionSetCandidates.hpp Outdated Show resolved Hide resolved
src/hotspot/share/gc/g1/g1Policy.cpp Outdated Show resolved Hide resolved
@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 4, 2023

Pushed a new change, testing more thoroughly.

Correctly use live_bytes in set_live_bytes()

Some includes, fix using live bytes in set_live_bytes()
@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 7, 2023

After some discussion with @albertnetymk we changed the following:

  • Iterator for G1CollectionSetCandidates does not need ordering, so removed that code
  • there has been a bug in G1ConcurrentMark::set_live_bytes() where the code did not divide the given live bytes by HeapWordSize to correctly set live words
  • Added some missing includes
  • Retained regions evac time prediction for young gen length only considers minimum amount of regions similar to marking regions now.

@walulyai
Copy link
Member

walulyai commented Aug 8, 2023

Still LGTM!

@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 8, 2023

Thanks @albertnetymk @walulyai for your reviews

/integrate

@tschatzl
Copy link
Contributor Author

tschatzl commented Aug 8, 2023

tier1-5 passed

@openjdk
Copy link

openjdk bot commented Aug 8, 2023

Going to push as commit 7e20952.
Since your change was applied there have been 56 commits pushed to the master branch:

  • 28fd7a1: 8311179: Generational ZGC: gc/z/TestSmallHeap.java failed with OutOfMemoryError
  • a1115a7: 8312204: unexpected else with statement causes compiler crash
  • 87a6acb: 8313792: Verify 4th party information in src/jdk.internal.le/share/legal/jline.md
  • 87b08b6: 8307408: Some jdk/sun/tools/jhsdb tests don't pass test JVM args to the debuggee JVM
  • 83edffa: 8309663: test fails "assert(check_alignment(result)) failed: address not aligned: 0x00000008baadbabe"
  • 1da82a3: 8313702: Update IANA Language Subtag Registry to Version 2023-08-02
  • 9c6eb67: 8313167: Update to use jtreg 7.3
  • 380418f: 8295058: test/langtools/tools/javac 116 test classes uses com.sun.tools.classfile library
  • 4726960: 8313779: RISC-V: use andn / orn in the MD5 instrinsic
  • bbbfa21: 8313880: Incorrect copyright header in jdk/java/foreign/TestFree.java after JDK-8310643
  • ... and 46 more: https://git.openjdk.org/jdk/compare/5d1b911c92b933c257c8e9afe1464ec175ca1cc2...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 8, 2023
@openjdk openjdk bot closed this Aug 8, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 8, 2023
@openjdk
Copy link

openjdk bot commented Aug 8, 2023

@tschatzl Pushed as commit 7e20952.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tschatzl tschatzl deleted the submit/8140326-evacuate-retained-regions-at-any-time branch August 18, 2023 11:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-gc hotspot-gc-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

3 participants