8256265: G1: Improve parallelism in regions that failed evacuation #7047
Conversation
👋 Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
@Hamlin-Li The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
Just a recap of what the change adds:
Fwiw, I did some hacking, adding lots of statistics output to it, because I was a bit surprised by some of the numbers I saw (available at https://github.com/tschatzl/jdk/tree/pull/7047-evac-failure-chunking).
Thanks Thomas for the summary and logging code.
Regarding the log messages: we might want to fix them up a bit; I did not look at our recent email discussion on what we came up with, and their level. Some other initial thoughts worth considering:

*) What I already noticed yesterday in some tests, and which can also be seen in your log snippet, is that "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).

*) The other concern I have is whether we really need (or can avoid) the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization, or something to wait on, if it is really needed. What of the retained-region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the "Prepare Retained Regions" work should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase. I can see that from a timing perspective "Wait For Ready" is not a problem in any of my tests so far.

*) The "Prepare Retained Regions" phase stores the amount of live data into the …

*) Not too happy that the …

*) I was wondering whether it would be somewhat more efficient for the …
In fact, in the baseline version most of the time of "Post Evacuate Cleanup 1" is normally spent on "Restore Retained Regions". In the parallel version, the proportion of "Restore Retained Regions" within "Post Evacuate Cleanup 1" is reduced. E.g., the "Post Evacuate Cleanup 1" / "Restore Retained Regions" time comparison between baseline and parallel:

(timing table elided)

The difference between "Post Evacuate Cleanup 1" and "Restore Retained Regions" is the same in the baseline and parallel versions; it is spent on the other subphases of "Post Evacuate Cleanup 1".
Yes, currently it seems "Wait For Ready" does not cost much time, as "Prepared Retained Regions" is quick; I'm not sure if proper synchronization will help much more.
I will do this refactoring soon.
I will put it on the backlog to see if it can be simplified. [TODO]
I agree. This was only a general remark about its performance, not meant to belittle the usefulness of this change and, in general, of all the changes in this series, which are quite substantial. 👍 I compared the throughput (bytes/ms) between …

Another (future) optimization that may be worthwhile here would be to gather some occupancy statistics of the chunks and switch between walking the bitmap and walking the objects; another one that might be "simpler" to implement (but probably fairly messy) is to simply check whether the object after the current one is also forwarded and, if so, not switch back to bitmap walking but immediately process that one as well. These are only ideas though.
The point of "proper synchronization" isn't that it's faster, but that it does not burn CPU cycles unnecessarily, leaving the CPU to the one thread that the others are waiting on to do its work. Even better would be if we could remove the dependencies between "Prepare Retained Regions" and the remaining phases, which only seem to be the BOT. One idea is that maybe all that prepare work could be placed where G1 adds the region to the list of retained regions. This does not work for the liveness count, obviously, but that can be recreated by the actual self-forwarding removal as suggested earlier 😸. Then none of that is required, which is even better.
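For illustration, a minimal self-contained sketch of the "proper synchronization" point (plain C++, not HotSpot's actual primitives; all names are made up): workers block on a condition variable instead of spinning, so the CPU stays available to the preparing thread.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative model: one thread prepares a retained region, the
// others wait for it without burning CPU cycles in a busy-loop.
struct RegionReady {
  std::mutex mu;
  std::condition_variable cv;
  bool ready = false;

  void mark_ready() {            // called by the preparing thread
    { std::lock_guard<std::mutex> l(mu); ready = true; }
    cv.notify_all();
  }
  void wait_until_ready() {      // called by workers; blocks, no spin
    std::unique_lock<std::mutex> l(mu);
    cv.wait(l, [this] { return ready; });
  }
};
```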
Thanks!
Not necessarily simplified: one option is to make that work explicit (we tend to try not to do too much work in constructors, but maybe it just fits here); another is to pre-calculate some of these values during evacuation failure somehow. Given that this phase currently takes almost no time, we can also postpone that optimization if it seems to be too much work. Thanks for your hard work,
Yes, we have a similar task on our backlog, to fall back to walking the objects if the statistics tell us so.
I'm not sure how much this will help. Currently the code looks like the sketch below: if the next object is also marked, the closure will be applied to it in the next loop iteration, so it should have the same cache-hit behavior as the way you suggested above; the difference is the invocation overhead of apply(current). But I will put it on the backlog too. [TODO]
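A minimal self-contained model of the loop being described (illustrative names and types, not the actual HotSpot code): walk the mark bitmap over a chunk and apply the closure to each marked object; a marked neighbor is picked up by the very next iteration anyway.

```cpp
#include <cstddef>
#include <vector>

// Illustrative model of the bitmap-walking loop under discussion.
// 'marked' stands in for the mark bitmap; 'apply' for the closure that
// removes one self-forward pointer and returns the object's word size.
size_t remove_self_forwards(const std::vector<bool>& marked,
                            size_t chunk_start, size_t chunk_end,
                            size_t (*apply)(size_t /*obj_index*/)) {
  size_t live_words = 0;
  size_t cur = chunk_start;
  while (cur < chunk_end) {
    if (marked[cur]) {
      // If the next object is also marked, the next iteration finds it
      // immediately; the extra cost is only the apply() call overhead.
      size_t size = apply(cur);
      live_words += size;
      cur += size;
    } else {
      ++cur;  // scan forward to the next marked bit
    }
  }
  return live_words;
}
```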
I have just deleted the code related to "Prepared Retained Regions" and "Wait For Ready", and put the logic in G1EvacFailureRegions::record(...), SampleCollectionSetCandidatesTask, and VerifyAfterSelfForwardingPtrRemovalTask.
This one is also done.
OK, let's get back to this if it starts to occupy much time in the phase.
Thanks a lot for the detailed discussion and valuable suggestions, it helps a lot :)
…et verification crash
@Hamlin-Li this pull request can not be integrated into master due to merge conflicts. To resolve these conflicts and update this pull request, run the following commands in the local repository for your personal fork:
git checkout parallelize-evac-failure-in-bm
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Hi Thomas, my test (with the latest implementation) shows that when the number of evacuation-failure regions is less than the number of parallel GC threads, the change brings a stable benefit in the "Post Evacuate Cleanup 1" phase; but when the number of evacuation-failure regions is larger than the number of parallel GC threads, the benefit is not stable and can even bring some regression in that phase. A simple heuristic is to switch to the original implementation, i.e. parallelize only at the region level, when the number of evacuation-failure regions is detected to be larger than the number of parallel GC threads. The advantage is that this avoids consuming extra CPU on unnecessary parallelism at the chunk level. The drawback of this solution is that it leaves two pieces of code: parallelism over regions, and parallelism over chunks. What do you think about it? Thanks
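A sketch of the proposed heuristic (hypothetical names; the actual decision point in the G1 code would differ): pick chunk-level parallelism only when there are fewer failed regions than GC workers.

```cpp
// Hypothetical sketch of the proposed switch between the two modes.
enum class EvacFailureParallelism { PerRegion, PerChunk };

EvacFailureParallelism choose_parallelism(unsigned num_failed_regions,
                                          unsigned num_gc_workers) {
  // With at least as many failed regions as workers, region-level
  // distribution already keeps every worker busy; chunking would only
  // add claiming overhead.
  return (num_failed_regions < num_gc_workers)
             ? EvacFailureParallelism::PerChunk
             : EvacFailureParallelism::PerRegion;
}
```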
Thanks for the clarification, I see the point.
Some minor comments/suggestions.
Thanks for the detailed reviews. :) I'm not sure if it's feasible to move …
I believe this is an unnecessary dependency.
So instead of calling mark_in_next_bitmap … The alternative would be to add a new "raw" mark method (the mark_in_next_bitmap_unconditionally seen in the diff below). Of course, the use of this "raw" mark method needs to be documented. Fwiw, in the prototype we have for JDK-8210708, which looks fairly good at this point, a similar change would be needed anyway. Thanks,
Seems there is another dependency: in …
After a quick look through the code, I think we could just call …
Thanks, I've moved the …
I will push the change through our testing again, since so much time has passed and so many changes have happened since last time.
@@ -97,7 +97,7 @@ class RemoveSelfForwardPtrObjClosure {
   // explicitly and all objects in the CSet are considered
   // (implicitly) live. So, we won't mark them explicitly and
   // we'll leave them over NTAMS.
-  _cm->mark_in_next_bitmap(_worker_id, obj);
+  _cm->mark_in_next_bitmap_unconditionally(_worker_id, obj);
I think this change, the introduction of this method, is unnecessary after moving the update of the nTAMS into the G1EvacFailureRegions::record method.
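A sketch of what moving that logic into record() might look like (a self-contained model with made-up names, not the actual G1EvacFailureRegions class): the first thread to record a failure in a region performs the one-time per-region preparation, removing the need for a separate prepare phase and the "wait for ready" loop.

```cpp
#include <atomic>
#include <vector>

// Illustrative model: the first recorder of an evacuation failure in a
// region does the one-time preparation inline.
class EvacFailureRegions {
  std::vector<std::atomic<bool>> _recorded;  // one flag per region
 public:
  explicit EvacFailureRegions(size_t num_regions)
      : _recorded(num_regions) {}

  void record(size_t region_idx) {
    bool expected = false;
    if (_recorded[region_idx].compare_exchange_strong(expected, true)) {
      // One-time preparation, e.g. resetting the BOT and updating
      // nTAMS, moved here from the removed phases.
      prepare_region(region_idx);
    }
  }
 private:
  void prepare_region(size_t /*region_idx*/) {
    // Stand-in for the real per-region preparation work.
  }
};
```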
Testing seems good.
Thanks Thomas for reviewing and testing. :)
This looks good to me, with some final cleanup comments. Apologies for taking a bit.
Thanks, it's fine :). I've just updated the patch as suggested.
Thanks for the detailed review, nice catch! I will update the patch as suggested.
As a follow-up to the explicit-loop topic raised before, here's a patch exploring that alternative: the last commit of https://github.com/openjdk/jdk/compare/master...albertnetymk:explicit-loop?expand=1. It contains mainly two changes: the explicit loop and the encapsulation of the chunking logic.
It can probably be polished further, but hopefully it illustrates the gist for now. What do you think?
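To illustrate the encapsulation idea, a minimal self-contained model of chunk claiming (made-up names, not the code in the linked branch): chunks are handed out from a shared atomic counter, and each worker processes its claimed chunks with an explicit loop rather than through a closure object.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative model: chunks of failed regions are claimed from a
// shared atomic counter.
class ChunkClaimer {
  std::atomic<uint32_t> _next{0};
  const uint32_t _num_chunks;
 public:
  explicit ChunkClaimer(uint32_t num_chunks) : _num_chunks(num_chunks) {}

  // Returns true and sets *idx while unclaimed chunks remain.
  bool claim(uint32_t* idx) {
    uint32_t i = _next.fetch_add(1, std::memory_order_relaxed);
    if (i >= _num_chunks) return false;
    *idx = i;
    return true;
  }
};

// Worker loop: claim and process chunks until none remain.
// process_chunk stands in for the explicit per-chunk loop.
void worker(ChunkClaimer& claimer, void (*process_chunk)(uint32_t)) {
  uint32_t idx;
  while (claimer.claim(&idx)) {
    process_chunk(idx);
  }
}
```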
Not sure; do you mind if I do this refactoring in another PR?
Since the chunking files/logic are added in this PR, I lean towards addressing them in the same PR, if you agree that the explicit-loop approach is cleaner/better.
@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@Hamlin-Li This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process, as often (particularly with region pinning) there is only one region to fix up.
This patch tries to improve parallelism by walking over the failed regions in chunks.
The latest implementation scans regions in chunks to gain parallelism; it is based on JDK-8278917, which changes G1 to use the prev bitmap to mark evacuation-failure objects.
Here's a summary of the performance data based on the latest implementation: basically, it shows better and more stable performance than the baseline in the "Post Evacuate Cleanup 1" / "remove self forwardee" phase. (Although some regression is spotted when calculating the results as a geomean, because one pause time from the baseline is far smaller than the others.)
The performance benefit trend is:
(Other common evacuation-failure configurations are:
-XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0)
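For context, a hedged example of how such a run might be launched; as far as I know, G1EvacuationFailureALot and its companions are develop flags, so this assumes a debug/fastdebug JDK build, and the benchmark class name is a placeholder:

```
$ java -XX:+UseG1GC -Xlog:gc+phases=debug \
       -XX:+G1EvacuationFailureALot \
       -XX:G1EvacuationFailureALotInterval=0 \
       -XX:G1EvacuationFailureALotCount=0 \
       MyBenchmark
```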
For more detailed performance data, please check the related bug.
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7047/head:pull/7047
$ git checkout pull/7047
Update a local copy of the PR:
$ git checkout pull/7047
$ git pull https://git.openjdk.java.net/jdk pull/7047/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7047
View PR using the GUI difftool:
$ git pr show -t 7047
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7047.diff