
8256265: G1: Improve parallelism in regions that failed evacuation #7047

Closed

Conversation

Hamlin-Li

@Hamlin-Li Hamlin-Li commented Jan 12, 2022

Currently G1 assigns one thread per evacuation-failed region. This can in effect serialize the whole process, as often (particularly with region pinning) there is only one region to fix up.

This patch tries to improve parallelism by walking over such regions in chunks.

The latest implementation scans regions in chunks to increase parallelism. It is based on JDK-8278917, which changes G1 to use the prev bitmap to mark evacuation failure objects.

Here's a summary of the performance data for the latest implementation: basically, it brings better and more stable performance than the baseline in the "Post Evacuate Cleanup 1"/"remove self forwardee" phase. (Some regression shows up when the results are aggregated as a geomean, because one pause time from the baseline is far smaller than the others.)

The performance benefit trend is:

  • the pause time (Post Evacuate Cleanup 1) reduction ranges from 76.79% down to 2.28% for the average and from 71.61% down to 3.04% for the geomean, as G1EvacuationFailureALotCSetPercent is varied from 2 to 90 (-XX:ParallelGCThreads=8)
  • the pause time (Post Evacuate Cleanup 1) reduction ranges from 63.84% down to 15.16% for the average and from 55.41% down to 12.45% for the geomean, as G1EvacuationFailureALotCSetPercent is varied from 2 to 90 (-XX:ParallelGCThreads=<default=123>)
    ( Other common Evacuation Failure configurations are:
    -XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )

For more detailed performance data, please check the related bug.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8256265: G1: Improve parallelism in regions that failed evacuation

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7047/head:pull/7047
$ git checkout pull/7047

Update a local copy of the PR:
$ git checkout pull/7047
$ git pull https://git.openjdk.java.net/jdk pull/7047/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7047

View PR using the GUI difftool:
$ git pr show -t 7047

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7047.diff

@bridgekeeper

bridgekeeper bot commented Jan 12, 2022

👋 Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jan 12, 2022

@Hamlin-Li The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Jan 12, 2022
@Hamlin-Li Hamlin-Li marked this pull request as ready for review January 22, 2022 04:15
@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 22, 2022
@mlbridge

mlbridge bot commented Jan 22, 2022

@tschatzl
Contributor

Just a recap of what the change adds:

  • on evacuation failure, also records the number of bytes that failed evacuation in that region in a per-region live-map (using G1RegionMarkStats)
  • (at the start of the Post Evacuation Cleanup 1 we flush that cache - ideally this would be done in Merge PSS, but we can't because we need it in the remove self forwards pointer task potentially running in parallel)
  • remove self forwards in Post Evacuation Cleanup 1 does roughly the following:
  1. let the threads claim and "prepare" the region - mostly setting live bytes from that new per-region live map, "readying the region" (BOT reset, some statistics), finally set to "ready"
  2. wait for the region being "ready"
  3. let the threads claim parts ("chunks") of the region; these chunks are first generated using information from the region (and the bitmap). They contain information to handle the zapping and restoring, which is then immediately used by that thread.
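
To make step 3 concrete, here is a minimal standalone sketch of the chunk-claiming idea (plain C++ with std::atomic and std::thread, not the actual HotSpot code; names such as chunks_per_region and process_chunk are made up): workers atomically claim global chunk indices and map them back to (region, chunk) pairs, so even a single retained region gets spread over all workers.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static const int num_failed_regions = 4;   // retained (evacuation-failed) regions
static const int chunks_per_region  = 16;  // each region is split into chunks
static std::atomic<int> next_chunk{0};     // global claim counter

// Stand-in for removing self-forwards in one chunk of one region.
static void process_chunk(int region, int chunk) {
  std::printf("region %d, chunk %d\n", region, chunk);
}

static void worker() {
  const int total = num_failed_regions * chunks_per_region;
  for (int idx = next_chunk.fetch_add(1); idx < total; idx = next_chunk.fetch_add(1)) {
    process_chunk(idx / chunks_per_region, idx % chunks_per_region);
  }
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; i++) workers.emplace_back(worker);
  for (std::thread& t : workers) t.join();
  return 0;
}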

Fwiw, I did some hacking, adding lots of statistics output to it, because I was a bit surprised by some of the numbers I saw (available at https://github.com/tschatzl/jdk/tree/pull/7047-evac-failure-chunking).

@Hamlin-Li
Author

Thanks Thomas for the summary and logging code.
I attached related log output below.

[3.236s][debug][gc,phases] GC(0)     Post Evacuate Cleanup 1: 5.4ms
[3.236s][debug][gc,phases] GC(0)     Evac Fail Merge Live: 0.0ms
...
[3.236s][debug][gc,phases] GC(0)       Restore Retained Regions (ms): Min:  5.1, Avg:  5.1, Max:  5.3, Diff:  0.2, Sum: 41.0, Workers: 8
[3.236s][debug][gc,phases] GC(0)         Regions:                       Min: 1, Avg:  1.0, Max: 1, Diff: 0, Sum: 4, Workers: 4
[3.237s][debug][gc,phases] GC(0)         Prepared Retained Regions (ms): Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 4
[3.237s][debug][gc,phases] GC(0)         Wait For Ready In Retained Regions (ms): Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 8
[3.237s][debug][gc,phases] GC(0)         Prepare Chunks (ms):           Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.1, Workers: 8
[3.237s][debug][gc,phases] GC(0)         Remove Self Forwards In Chunks (ms): Min:  5.0, Avg:  5.1, Max:  5.2, Diff:  0.2, Sum: 40.6, Workers: 8
[3.237s][debug][gc,phases] GC(0)           Forward Chunks:                Min: 16, Avg: 16.0, Max: 16, Diff: 0, Sum: 128, Workers: 8
[3.237s][debug][gc,phases] GC(0)           Empty Forward Chunks:          Min: 32, Avg: 128.0, Max: 351, Diff: 319, Sum: 896, Workers: 7
[3.237s][debug][gc,phases] GC(0)           Forward Objects:               Min: 432330, Avg: 440400.4, Max: 449475, Diff: 17145, Sum: 3523203, Workers: 8
[3.237s][debug][gc,phases] GC(0)           Forward Bytes:                 Min: 13543912, Avg: 13680514.0, Max: 13779400, Diff: 235488, Sum: 109444112, Workers: 8

@tschatzl
Contributor

Regarding the log messages: We might want to fix them up a bit; I did not look at our recent email discussion on what we came up with, and their level.

Some other initial thoughts worth considering:

*) What I already noticed yesterday on some tests, and can also be seen in your log snippet, is that the "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).

*) The other concern I have is whether we really need (or can avoid) the need for the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization or something to wait on if it is really needed. What of the retained region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the Prepare Retained regions should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase.

I can see that from a timing perspective "Wait For Ready" is not a problem in all of my tests so far.

*) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.

*) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.

*) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.

@Hamlin-Li
Author

Regarding the log messages: We might want to fix them up a bit; I did not look at our recent email discussion on what we came up with, and their level.

Some other initial thoughts worth considering:

*) What I already noticed yesterday on some tests, and can also be seen in your log snippet, is that the "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).

In fact, in the baseline version most of the time of "Post Evacuate Cleanup 1" is normally spent in "Restore Retained Regions". In the parallel version, the proportion of "Restore Retained Regions" within "Post Evacuate Cleanup 1" is reduced. E.g. the following is the "Post Evacuate Cleanup 1"/"Restore Retained Regions" time comparison between baseline and parallel:
baseline:

[3.169s][info ][gc,phases] GC(0)   Post Evacuate Collection Set: 10.0ms
[3.169s][debug][gc,phases] GC(0)     Post Evacuate Cleanup 1: 9.5ms

parallel

[3.105s][info ][gc,phases] GC(0)   Post Evacuate Collection Set: 2.5ms
[3.106s][debug][gc,phases] GC(0)     Post Evacuate Cleanup 1: 2.0ms

The difference between "Post Evacuate Cleanup 1" and "Restore Retained Regions" is about the same in the baseline and parallel versions; it is spent on the other subphases of "Post Evacuate Cleanup 1".

*) The other concern I have is whether we really need (or can avoid) the need for the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization or something to wait on if it is really needed. What of the retained region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the Prepare Retained regions should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase.

I can see that from a timing perspective "Wait For Ready" is not a problem in all of my tests so far.

Yes, currently it seems "Wait For Ready" does not cost much time, as "Prepared Retained Regions" is quick; I'm not sure synchronization would help much.
But I will investigate if we can omit "Prepared Retained Regions" and "Wait For Ready" subphases totally to simplify the logic. [TODO]

*) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.

I will do this refactor soon.
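
For reference, a minimal sketch of the suggested accumulation scheme (standard C++ std::atomic standing in for HotSpot's Atomic::add; names such as region_live_bytes and restore_chunk are made up): each worker sums the live bytes of the objects it restores locally and publishes the sum with a single atomic add on the retained region's counter, which is what makes a separate merge phase unnecessary.

#include <atomic>
#include <cstddef>
#include <cstdio>

// One live-bytes counter per retained region; in HotSpot this would end up
// in the HeapRegion itself.
static std::atomic<size_t> region_live_bytes[4];

// A worker restores one chunk of region `region_idx`: it sums object sizes
// locally and does a single atomic add when the chunk is finished (batching
// per thread per region, as suggested, would reduce the adds even further).
static void restore_chunk(int region_idx, const size_t* obj_sizes, int n) {
  size_t local_sum = 0;
  for (int i = 0; i < n; i++) {
    local_sum += obj_sizes[i];   // the actual self-forward removal is elided
  }
  region_live_bytes[region_idx].fetch_add(local_sum, std::memory_order_relaxed);
}

int main() {
  const size_t sizes[] = {24, 40, 16, 64};
  restore_chunk(0, sizes, 4);
  std::printf("region 0 live bytes: %zu\n", region_live_bytes[0].load());
  return 0;
}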

*) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.

*) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.

I will put it on backlog to see if it can be simplified. [TODO]

@tschatzl
Contributor

tschatzl commented Jan 27, 2022

Some other initial thoughts worth considering:

*) What I already noticed yesterday on some tests, and can also be seen in your log snippet, is that the "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).

In fact, normally most of time of "Post Evacuate Cleanup 1" is spent on "Restore Retained Regions" in baseline version. In parallel version, the proportion of "Restore Retained Regions" in "Post Evacuate Cleanup 1" is reduced. [...]

I agree. This has only been a general remark about its performance, not meant to belittle the usefulness this change and in general all the changes in this series have, which are quite substantial. 👍 I compared the throughput (bytes/ms) between Object Copy and this phase, and at least without JDK-8280374 the remove self forwards is only like 2x the throughput of Object Copy, which seemed quite bad compared to what they do. With JDK-8280374 the results are much better (~4.5x) afaict on a single benchmark I tried though. It's a bit hard to reproduce the exact situation/heap though...

Another (future) optimization that may be worthwhile here may be to get some occupancy statistics of the chunks and switch between walking the bitmap and walking the objects; another one that might be "simpler" to implement (but fairly messy probably) is to simply check if the object after the current one is also forwarded, and if so, do not switch back to the bitmap walking but immediately process that one as well.
This might help somewhat because given typical avg. object sizes (~40 bytes), the mark word after the current one might be already in the cache anyway, so a read access practically free.

These are only ideas though.

*) The other concern I have is whether we really need (or can avoid) the need for the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization or something to wait on if it is really needed. What of the retained region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the Prepare Retained regions should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase.

I can see that from a timing perspective "Wait For Ready" is not a problem in all of my tests so far.

Yes, currently it seems "Wait For Ready" does not cost much time, as "Prepared Retained Regions" is quick; I'm not sure synchronization would help much.
But I will investigate if we can omit "Prepared Retained Regions" and "Wait For Ready" subphases totally to simplify the logic. [TODO]

The point of "proper synchronization" isn't that it's faster, but that it does not burn CPU cycles unnecessarily, which could otherwise keep the one thread that the others are waiting on from doing the work. Ideally we can remove the dependency between "Prepare Retained Regions" and the remaining phases altogether, which only seems to be the BOT. One idea is that maybe all that prepare work could be placed where G1 adds that region to the list of retained regions. This does not work for the liveness count obviously - but that can be recreated by the actual self-forward removal as suggested earlier 😸.

Then none of that is required, which is even better.
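
If the wait turned out to be needed after all, "proper synchronization" could look roughly like the following sketch (standard C++ mutex/condition variable, purely illustrative; HotSpot would use its own synchronization primitives, and RegionReadiness is a made-up name): waiters sleep instead of spinning until the preparing thread signals the region as ready.

#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative "region is prepared" flag guarded by a mutex and condition
// variable, instead of busy-looping on an atomic flag.
struct RegionReadiness {
  std::mutex lock;
  std::condition_variable cv;
  bool ready = false;

  void mark_ready() {
    { std::lock_guard<std::mutex> g(lock); ready = true; }
    cv.notify_all();                          // wake every waiting worker
  }
  void wait_until_ready() {
    std::unique_lock<std::mutex> g(lock);
    cv.wait(g, [this] { return ready; });     // sleeps rather than burning CPU
  }
};

int main() {
  RegionReadiness r;
  std::thread preparer([&r] { r.mark_ready(); });
  std::thread worker([&r] { r.wait_until_ready(); });
  preparer.join();
  worker.join();
  return 0;
}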

*) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.

I will do this refactor soon.

Thanks!

*) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.

*) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.

I will put it on backlog to see if it can be simplified. [TODO]

Not necessarily simplified: one option is to make that work explicit (we tend to try to not do too much work in constructors - but maybe this just fits here), another is to pre-calculate some of these values during evacuation failure somehow.

If it seems to be too much work, we can maybe also postpone the optimization part of that suggestion, given that currently that phase takes almost no time.

Thanks for your hard work,
Thomas

@Hamlin-Li
Author

I agree. This has only been a general remark about its performance, not meant to belittle the usefulness this change and in general all the changes in this series have, which are quite substantial. 👍 I compared the throughput (bytes/ms) between Object Copy and this phase, and at least without JDK-8280374 the remove self forwards is only like 2x the throughput of Object Copy, which seemed quite bad compared to what they do. With JDK-8280374 the results are much better (~4.5x) afaict on a single benchmark I tried though. It's a bit hard to reproduce the exact situation/heap though...

Another (future) optimization that may be worthwhile here may be to get some occupancy statistics of the chunks and switch between walking the bitmap and walking the objects;

Yes, we have a similar task on our backlog, to fall back to walking the objects if the statistics tell us so.

another one that might be "simpler" to implement (but fairly messy probably) is to simply check if the object after the current one is also forwarded, and if so, do not switch back to the bitmap walking but immediately process that one as well. This might help somewhat because given typical avg. object sizes (~40 bytes), the mark word after the current one might be already in the cache anyway, so a read access practically free.

I'm not sure how much this will help. Currently the code looks like below: if the next object is also marked, the closure will be applied to it in the next iteration of the loop, so it should have the same cache behaviour as the way you suggested above; the difference is the overhead of the method invocation (apply(current)). But I will put it on the backlog too. [TODO]

  while (next_addr < _limit) {
    Prefetch::write(next_addr, PrefetchScanIntervalInBytes);
    if (_bitmap->is_marked(next_addr)) {
      // The object at next_addr failed evacuation: apply the closure to it and
      // advance by the object's size, i.e. straight to the next object.
      oop current = cast_to_oop(next_addr);
      next_addr += closure->apply(current);
    } else {
      // The next object is not marked: skip ahead via the bitmap.
      next_addr = _bitmap->get_next_marked_addr(next_addr, _limit);
    }
  }

These are only ideas though.

[...]
But I will investigate if we can omit "Prepared Retained Regions" and "Wait For Ready" subphases totally to simplify the logic. [TODO]

The point of "proper synchronization" isn't that it's faster, but that it does not burn CPU cycles unnecessarily, which could otherwise keep the one thread that the others are waiting on from doing the work. Ideally we can remove the dependency between "Prepare Retained Regions" and the remaining phases altogether, which only seems to be the BOT. One idea is that maybe all that prepare work could be placed where G1 adds that region to the list of retained regions. This does not work for the liveness count obviously - but that can be recreated by the actual self-forward removal as suggested earlier 😸.

Then none of that is required, which is even better.

I have just deleted the code related to "Prepared Retained Regions" and "Wait For Ready", and put the logic in G1EvacFailureRegions::record(...), SampleCollectionSetCandidatesTask and VerifyAfterSelfForwardingPtrRemovalTask.

*) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.

I will do this refactor soon.

Thanks!

This one is also done.

*) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.
*) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.

I will put it on backlog to see if it can be simplified. [TODO]

Not necessarily simplified: one option is to make that work explicit (we tend to try to not do too much work in constructors - but maybe this just fits here), another is to pre-calculate some of these values during evacuation failure somehow.

If it seems to be too much work, we can maybe also postpone the optimization part of that suggestion, given that currently that phase takes almost no time.

OK, let's get back to this when it starts to take up a significant share of the phase.

Thanks for your hard work, Thomas

Thanks a lot for the detailed discussion and valuable suggestions, it helps a lot :)

@openjdk

openjdk bot commented Feb 11, 2022

@Hamlin-Li this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout parallelize-evac-failure-in-bm
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Feb 11, 2022
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 14, 2022
@Hamlin-Li
Author

Hamlin-Li commented Feb 14, 2022

Hi Thomas,

My test (with the latest implementation) shows that when the number of evacuation failure regions is less than the number of parallel GC threads, it brings a stable benefit in the post 1 phase; but when the number of evacuation failure regions is greater than the number of parallel GC threads, the benefit is not stable and can even cause some regression in the post 1 phase.
I think the test result is reasonable. When there are more evacuation failure regions than parallel GC threads, region-level parallelism already assigns some regions to every GC thread, i.e. the work is already fairly well parallelized; whether chunk-level parallelism brings additional benefit then depends on the distribution of evacuation failure objects in the regions. Conversely, when there are fewer evacuation failure regions, region-level parallelism cannot give every GC thread a region to process; in that situation chunk-level parallelism brings more benefit, and the benefit is stable.

A simple heuristic is to switch to the original implementation, i.e. parallelize only at the region level, when the number of evacuation failure regions exceeds the number of parallel GC threads (see the sketch below). The advantage is that it avoids consuming extra CPU on unnecessary chunk-level parallelism. The drawback is that it keeps two pieces of code: parallelism over regions and parallelism over chunks.
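
A sketch of that heuristic (hypothetical names, not actual G1 code):

// Use chunk-level parallelism only when there are too few failed regions to
// keep every worker busy; otherwise fall back to one region per thread.
bool use_chunk_level_parallelism(unsigned num_evac_failure_regions,
                                 unsigned num_parallel_gc_threads) {
  return num_evac_failure_regions < num_parallel_gc_threads;
}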

What do you think?

Thanks

@Hamlin-Li
Author

Thanks for the clarification, I see the point.
Will update the patch.

Member

@albertnetymk albertnetymk left a comment


Some minor comments/suggestions.

src/hotspot/share/gc/g1/g1EvacFailure.cpp
src/hotspot/share/gc/g1/g1EvacFailure.cpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.hpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.hpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1GCPhaseTimes.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.hpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
@Hamlin-Li
Author

Thanks for the detailed reviews. :)

I'm not sure if it's feasible to move prepare_regions into post_evacuate_cleanup_1 as a G1BatchedTask.
My concern is that in prepare_regions()->PrepareEvacFailureRegionTask::prepare_region()->HeapRegion::note_self_forwarding_removal_start(), nTAMS is set, and nTAMS is used in the cleanup 1 phase by RemoveSelfForwardPtrObjClosure to set the next bitmap. If we move prepare_regions into post_evacuate_cleanup_1 as a G1BatchedTask, there is no guarantee that the preparation will be done before nTAMS is used in RemoveSelfForwardPtrObjClosure.
What do you think?

@tschatzl
Contributor

tschatzl commented Mar 18, 2022

Thanks for the detailed reviews. :)

I'm not sure if it's feasible to move prepare_regions into post_evacuate_cleanup_1 as a G1BatchedTask. My concern is that in prepare_regions()->PrepareEvacFailureRegionTask::prepare_region()->HeapRegion::note_self_forwarding_removal_start(), nTAMS is set, and nTAMS is used in the cleanup 1 phase by RemoveSelfForwardPtrObjClosure to set the next bitmap. If we move prepare_regions into post_evacuate_cleanup_1 as a G1BatchedTask, there is no guarantee that the preparation will be done before nTAMS is used in RemoveSelfForwardPtrObjClosure. What do you think?

I believe this is an unnecessary dependency.

PrepareEvacFailureRegionTask::prepare_region()->HeapRegion::note_self_forwarding_removal_start sets nTAMS to top() unconditionally with the intent that the mark always happens.

So instead of calling _cm->mark_in_next_bitmap(_worker_id, obj); which checks nTAMS, just unconditionally mark the next bitmap (not sure if there is already a method for this) to achieve the same effect.

The alternative would be to add a new G1BatchedTask for just this preparation, which seems much more work not only in terms of code, but also much more work spinning up threads.

Of course, the use of this "raw" mark method needs to be documented.

Fwiw, in the protoype we have for JDK-8210708, which looks fairly good at this point, a similar change would be needed anyway.

Thanks,
Thomas

@Hamlin-Li
Author

It seems there is another dependency: in RemoveSelfForwardPtrHRChunkClosure, _prev_marked_bytes is accumulated concurrently; _prev_marked_bytes should be reset to zero in prepare_regions.

@tschatzl
Contributor

After a quick look through the code, I think we could just call note_self_forwarding_removal_start in G1ParScanThreadState::handle_evacuation_failure_par(), when a region is added to the list of failed regions the first time instead.

@Hamlin-Li
Author

Thanks, I've moved the note_self_forwarding_removal_start to G1EvacFailureRegions::record
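
As an illustration of the "prepare on first record" idea (a standalone model with made-up names such as FailedRegions; the real code lives in G1EvacFailureRegions::record), only the thread that records a failed region first runs the one-time preparation, so no separate prepare/wait phases are needed:

#include <atomic>
#include <cstddef>
#include <vector>

class FailedRegions {
  std::vector<std::atomic<bool>> _recorded;   // one flag per region, initially false
public:
  explicit FailedRegions(size_t max_regions) : _recorded(max_regions) {}

  void record(size_t region_idx) {
    // exchange() returns the previous value, so exactly one caller sees false
    // and performs the one-time preparation for this region.
    if (!_recorded[region_idx].exchange(true, std::memory_order_acq_rel)) {
      prepare_region(region_idx);
    }
  }
private:
  void prepare_region(size_t /*region_idx*/) { /* note_self_forwarding_removal_start-style prep */ }
};

int main() {
  FailedRegions regions(16);
  regions.record(3);
  regions.record(3);   // second record is a no-op for the preparation
  return 0;
}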

Contributor

@tschatzl tschatzl left a comment


I will push the change through our testing again, since so much time has passed and so many changes have happened since last time.

@@ -97,7 +97,7 @@ class RemoveSelfForwardPtrObjClosure {
 // explicitly and all objects in the CSet are considered
 // (implicitly) live. So, we won't mark them explicitly and
 // we'll leave them over NTAMS.
-_cm->mark_in_next_bitmap(_worker_id, obj);
+_cm->mark_in_next_bitmap_unconditionally(_worker_id, obj);
Contributor


I think this change, the introduction of this method, is unnecessary after moving the update to the nTAMS into the G1EvacFailureRegions::record method.

@tschatzl
Contributor

I will push the change through our testing again, since so much time has passed and so many changes have happened since last time.

Testing seems good.

@Hamlin-Li
Author

I will push the change through our testing again, since so much time has passed and so many changes have happened since last time.

Testing seems good.

Thanks Thomas for reviewing and testing. :)
I'll update the patch soon.

Contributor

@tschatzl tschatzl left a comment


This looks good to me, with some final cleanup comments. Apologies for taking a bit.

src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
src/hotspot/share/gc/g1/g1EvacFailureRegions.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.hpp
@Hamlin-Li
Author

Thanks, it's fine :). I've just updated the patch as suggested.

src/hotspot/share/gc/g1/g1EvacFailure.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp
src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp
src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp
src/hotspot/share/gc/g1/g1EvacFailure.hpp
src/hotspot/share/gc/g1/g1_globals.hpp
src/hotspot/share/gc/g1/g1_globals.hpp
@Hamlin-Li
Author

Thanks for the detailed review, nice catch! I will update the patch as suggested.

@albertnetymk
Member

As a followup to the explicit-loop topic raised before, here's a patch exploring that alternative.

Last commit of https://github.com/openjdk/jdk/compare/master...albertnetymk:explicit-loop?expand=1

It contains mostly two changes: the explicit loop and encapsulation of the chunking logic.

  1. I think the workflow is easier to follow in the explicit-loop approach, compared with iterator + closure.

  2. The chunking logic is not inherent to (evac-failure) regions; instead, it's related to processing evac-failure regions. This way the evac-failure regions class is just a collection of regions, nothing more.

It can probably be further polished, but hopefully it illustrates the gist for now. What do you think?
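
For readers following along, here is a toy illustration of the two styles being compared (standard C++ with made-up names such as Chunk and PrintClosure, not the actual patch): the closure version lets the container drive the iteration and call back into a closure object, while the explicit loop keeps the control flow in the caller.

#include <cstdio>
#include <vector>

struct Chunk { int id; };

// Iterator + closure style: the container drives the loop and calls back.
template <typename Closure>
void iterate_chunks(const std::vector<Chunk>& chunks, Closure& cl) {
  for (const Chunk& c : chunks) cl.do_chunk(c);
}

struct PrintClosure {
  void do_chunk(const Chunk& c) { std::printf("closure: chunk %d\n", c.id); }
};

int main() {
  std::vector<Chunk> chunks = {{0}, {1}, {2}};

  PrintClosure cl;
  iterate_chunks(chunks, cl);            // closure-based iteration

  // Explicit-loop style: the caller owns the loop, which can be easier to follow.
  for (const Chunk& c : chunks) {
    std::printf("explicit: chunk %d\n", c.id);
  }
  return 0;
}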

@Hamlin-Li
Author

Not sure, do you mind if I do this refactoring in another PR?

@albertnetymk
Member

Since the chunking files/logic are added in this PR, I lean towards addressing them in the same PR, if you agree the explicit-loop approach is cleaner/better.

@bridgekeeper

bridgekeeper bot commented May 10, 2022

@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Jun 7, 2022

@Hamlin-Li This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Jun 7, 2022
@Hamlin-Li Hamlin-Li deleted the parallelize-evac-failure-in-bm branch February 27, 2024 08:51
Labels
hotspot-gc hotspot-gc-dev@openjdk.org rfr Pull request is ready for review
3 participants