Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8267073: Race between Card Redirtying and Freeing Collection Set regions results in missing remembered set entries with G1 #4429

Conversation

tschatzl
Copy link
Contributor

@tschatzl tschatzl commented Jun 9, 2021

Hi all,

can I have reviews for this change that fixes a race between card redirtying and collection set freeing?

So card redirtying tries to be clever about which cards to redirty by not keeping (redirtying) cards in regions that have been fully evacuated as they can't have any references. (We did collect them exactly in the case of evac failure of an old region X: if that region X fails evacuation it is kept and may still contain references to other old regions Y; if it is successfully evacuated, then obviously we keep these cards for nothing):

  bool RedirtyLoggedCardTableEntryClosure::will_become_free(HeapRegion* hr) const {
    // A region will be freed by during the FreeCollectionSet phase if the region is in the
    // collection set and has not had an evacuation failure.
    return _g1h->is_in_cset(hr) && !hr->evacuation_failed();
  }

At the same time, since JDK-8214237, G1 frees the collection set which modifies both the is-in-cset field in the attribute table in is_in_cset() and the region's _evacuation_failed field which yields to returning the wrong value and so dropping cards from the dirty card queue (and in extension from the remembered sets).

The suggested fix consists of two parts:

  1. do not have the free collection set phase change the is-in-cset field in the attribute table
  2. do not have the free collection set phase change the _evacuation_failed field in the HeapRegion

The first item is fairly trivial, there is actually no need to clear that field at that point. Almost right after this phase the entire attribute table is reset again (in start_new_collection_set()).
The second is a bit trickier, as if we did not clear then, we would need to do it at a later point. So while this could be done, adding the clearing somewhere in the pre_evacuate_collection_set_phase(), the easier (and imo better solution) is to move that _evacuation_failed field for regions that is only necessary during young gc, out of HeapRegion and collecting it in a side table.
There are two reasons for this: first I want to in the future split out all transient, young-gc related data and code from G1CollectedHeap into a separate class like G1FullCollector, and we need that information available in such a side table for future improved pinned region support JDK-8254167.

After this change a test where I could reproduce this failure fairly consistently with -XX:+VerifyAfterGC (rate ~1/100), did not fail in thousands of iterations.

Testing: tier1-5

Thanks,
Thomas


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8267073: Race between Card Redirtying and Freeing Collection Set regions results in missing remembered set entries with G1

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4429/head:pull/4429
$ git checkout pull/4429

Update a local copy of the PR:
$ git checkout pull/4429
$ git pull https://git.openjdk.java.net/jdk pull/4429/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4429

View PR using the GUI difftool:
$ git pr show -t 4429

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4429.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jun 9, 2021

👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Jun 9, 2021

@tschatzl The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc label Jun 9, 2021
@tschatzl tschatzl force-pushed the submit/8267073-fix-crashes-during-evacuation-failure branch from 80fd8e2 to b7c52d0 Compare Jun 9, 2021
@kstefanj
Copy link
Contributor

@kstefanj kstefanj commented Jun 9, 2021

An early comment, haven't thought about all implications and if it would be bad for future work or performance. But wouldn't it be quite convenient to bake the evacuation failed status into the attribute table? Similar to have the _need_remset_update member in there, we could also have a _failed_evacuation member. This could always be initialized to false to make clearing as simple as it currently is and then this is the only thing we need to clear after the collection.

@tschatzl
Copy link
Contributor Author

@tschatzl tschatzl commented Jun 10, 2021

There are a few concerns in addition to those that you mentioned that does not make it that proposal clear cut:

  • The difference between the attribute table and this status is the lifetime: we need to attribute table always as we update it as we allocate more regions at the moment - this is something that may be reconsidered. Especially if we want to improve region allocation during mutator time (i.e. making it lock-free or something) where it might be more convenient to not try to make this update atomic too, and make the attribute table only an per-collection data structure.
  • evacuation failure is at this time (not considering region pinning yet) is something exceptional, and may warrant extra side data structures for cachability. In most cases we do not need it anyway.
  • we need to atomically update this flag (compared to the other stuff in the attribute table), and there is no infrastructure for that.
  • the attribute table is supposed to be read-only during actual gc to avoid reloads during gc (supposedly) for performance. For the same reason the _humongous_reclaim_candidate state is also separate. We may certainly want to combine this one with the evacuation failure table. There is a few things in G1CollectedHeap I would like to move to a G1YoungGC(Context) similar to the full gc. That will hopefully make reasoning about the code easier if less is crammed into G1CollectedHeap.
  • for JDK-8254167 we need another side table actually collecting the region ids that are affected (we have this idiom elsewhere too, so it might be useful to actually factor out an appropriate data structure). This might be another candidate to join with above mentioned two tables.
  • I would like to at least attempt to keep space in the attribute table free for some kind of remembered set identifier :), for fast access (as I hope there will not be an 1:1 mapping soon'ish anymore) but that is more an idea floating at the back of my mind and may not be necessary.

Hth,
Thomas

@tschatzl tschatzl marked this pull request as ready for review Jun 10, 2021
@tschatzl
Copy link
Contributor Author

@tschatzl tschatzl commented Jun 10, 2021

Tier1-5 testing completed with no particular issues so I opened it as ready for review.

@openjdk openjdk bot added the rfr label Jun 10, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Jun 10, 2021

Webrevs

@kstefanj
Copy link
Contributor

@kstefanj kstefanj commented Jun 10, 2021

Thanks Thomas for the reasoning, I think this all makes sense. I agree that the attribute table is not optimal, but at a first glance it looked like a good candidate for this information. I think working towards grouping young-gen things together is good a idea and once that is done we might find an even better abstraction for this.

Copy link

@kimbarrett kimbarrett left a comment

Generally looks good. One suggestion about the representation of the evacuation-failed info.

@@ -867,6 +867,8 @@ class G1CollectedHeap : public CollectedHeap {

// Number of regions evacuation failed in the current collection.
volatile uint _num_regions_failed_evacuation;
// Records for every region on the heap whether evacuation failed for it.
volatile bool* _regions_failed_evacuation;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using BitMap and par_xxx operations instead of an array of bool and explicit atomic operations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be fine with the change as is, if that gets this fix for a fairly noisy bug pushed sooner. So also consider a followup RFE to switch to using BitMap.

Copy link
Contributor

@kstefanj kstefanj left a comment

Changes looks good. I like Kim's suggestion to use BitMap, so would not object to that, but I'm fine with the change as is.

@openjdk
Copy link

@openjdk openjdk bot commented Jun 10, 2021

@tschatzl This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267073: Race between Card Redirtying and Freeing Collection Set regions results in missing remembered set entries with G1

Reviewed-by: kbarrett, sjohanss

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 42 new commits pushed to the master branch:

  • f677163: 8266130: convert Thread-SMR stress tests from counter based to time based
  • 6c552a7: 8268544: some runtime/sealedClasses tests should be run in driver mode
  • f4b3170: 8268428: Test java/foreign/TestResourceScope.java fails: expected [M] but found [N]
  • 6b6ff53: 8268543: some runtime/verifier tests should be run in driver mode
  • 0924382: 8266766: Arrays of types that cannot be an annotation member do not yield exceptions
  • d43c8a7: 8268124: Update java.lang to use switch expressions
  • a187fcc: 8238197: JFR: Rework setting and getting EventHandler
  • f770f77: 8268390: G1 concurrent gc upgrade to full gc not working
  • e0c0b13: 8268534: some serviceability/jvmti tests should be run in driver mode
  • 92f0b6d: 8268532: several serviceability/attach tests should be run in driver mode
  • ... and 32 more: https://git.openjdk.java.net/jdk/compare/438895903b1de71b88951a4117c240baf410fd5d...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Jun 10, 2021
Copy link

@kimbarrett kimbarrett left a comment

Looks good, with possible change to _region_failed_evacuation either now or as a future enhancement.

@@ -867,6 +867,8 @@ class G1CollectedHeap : public CollectedHeap {

// Number of regions evacuation failed in the current collection.
volatile uint _num_regions_failed_evacuation;
// Records for every region on the heap whether evacuation failed for it.
volatile bool* _regions_failed_evacuation;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be fine with the change as is, if that gets this fix for a fairly noisy bug pushed sooner. So also consider a followup RFE to switch to using BitMap.

@tschatzl
Copy link
Contributor Author

@tschatzl tschatzl commented Jun 10, 2021

I'll change to bitmaps later - the earlier we get the noise away from CI, the better imho.

Thanks @kstefanj and @kimbarrett for your reviews.

/integrate

@openjdk openjdk bot closed this Jun 10, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Jun 10, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jun 10, 2021

@tschatzl Since your change was applied there have been 44 commits pushed to the master branch:

  • 7cd5a6e: 8268537: (Temporary) Disable ParallelRefProcEnabled for Parallel GC
  • f716711: 8265271: JFR: Allow use of .jfc options when starting JFR
  • f677163: 8266130: convert Thread-SMR stress tests from counter based to time based
  • 6c552a7: 8268544: some runtime/sealedClasses tests should be run in driver mode
  • f4b3170: 8268428: Test java/foreign/TestResourceScope.java fails: expected [M] but found [N]
  • 6b6ff53: 8268543: some runtime/verifier tests should be run in driver mode
  • 0924382: 8266766: Arrays of types that cannot be an annotation member do not yield exceptions
  • d43c8a7: 8268124: Update java.lang to use switch expressions
  • a187fcc: 8238197: JFR: Rework setting and getting EventHandler
  • f770f77: 8268390: G1 concurrent gc upgrade to full gc not working
  • ... and 34 more: https://git.openjdk.java.net/jdk/compare/438895903b1de71b88951a4117c240baf410fd5d...master

Your commit was automatically rebased without conflicts.

Pushed as commit 2b41459.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tschatzl tschatzl deleted the submit/8267073-fix-crashes-during-evacuation-failure branch Jun 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-gc integrated
3 participants