
8256265 G1: Improve parallelism in regions that failed evacuation #6627

Open
wants to merge 9 commits into master

Conversation

@Hamlin-Li

@Hamlin-Li Hamlin-Li commented Dec 1, 2021

Summary

Currently G1 assigns one thread per evacuation-failed region. This can in effect serialize the whole process, as often (particularly with region pinning) there is only one region to fix up.
Try to improve the parallelism when walking over the regions by:

  • first, splitting a region into tasks;
  • then, processing these tasks in parallel, load-balancing them among GC threads;
  • last, performing any necessary cleanup.

NOTE: the load-balancing part of the code is almost the same as in G1ParScanThreadState; if necessary and feasible, consider refactoring this part into a shared code base.
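The split-and-claim idea above can be sketched roughly as follows. This is a stand-alone illustration, not HotSpot code: ChunkClaimer and process_region_in_parallel are hypothetical names, and the actual change distributes tasks via overflow task queues with work stealing rather than a bare atomic counter.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical illustration, not HotSpot code: a failed region is split into
// fixed-size chunks, and GC workers claim chunks via an atomic counter so a
// single large region no longer serializes on one thread.
struct ChunkClaimer {
  std::atomic<std::size_t> _next{0};
  const std::size_t _num_chunks;
  explicit ChunkClaimer(std::size_t num_chunks) : _num_chunks(num_chunks) {}
  // Returns the next unclaimed chunk index, or _num_chunks when exhausted.
  std::size_t claim() {
    std::size_t idx = _next.fetch_add(1, std::memory_order_relaxed);
    return idx < _num_chunks ? idx : _num_chunks;
  }
};

// Workers drain chunks until none remain; load balancing falls out naturally
// because faster workers simply claim more chunks. Returns chunks processed.
inline std::size_t process_region_in_parallel(std::size_t num_chunks,
                                              unsigned num_workers) {
  ChunkClaimer claimer(num_chunks);
  std::atomic<std::size_t> processed{0};
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; i++) {
    workers.emplace_back([&] {
      for (std::size_t c = claimer.claim(); c < num_chunks; c = claimer.claim()) {
        processed.fetch_add(1, std::memory_order_relaxed);  // stand-in for the per-chunk fix-up work
      }
    });
  }
  for (auto& t : workers) t.join();
  return processed.load();
}
```

Every chunk is claimed exactly once regardless of how threads interleave, which is the property the per-region fix-up needs.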

Performance Test

The perf test based on the latest implementation + JDK-8277736 shows that:

  • when ParallelGCThreads=32 and G1EvacuationFailureALotCSetPercent <= 50, the parallelism brings more benefit than regression;
  • when ParallelGCThreads=128, whatever G1EvacuationFailureALotCSetPercent is, the parallelism brings more benefit than regression.

Other related evacuation-failure VM options:

  • G1EvacuationFailureALotInterval=1
  • G1EvacuationFailureALotCount=1

For detailed perf test result, please check:

For situations like G1EvacuationFailureALotCSetPercent > 50 with ParallelGCThreads=32, we could fall back to the current implementation, or further optimize the thread sizing at this phase if necessary.

NOTE: I don't include perf data for Remove Self Forwards, because the pause-time comparison for that phase alone does not clearly show the improvement of this implementation. I think the reason is that the original implementation is not load-balanced while the new one is; since Remove Self Forwards is part of Post Evacuate Cleanup 1, only Post Evacuate Cleanup 1 clearly shows the improvement of the new implementation.
It could be a potential improvement to refine the pause-time data of the Remove Self Forwards phase.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8256265: G1: Improve parallelism in regions that failed evacuation

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6627/head:pull/6627
$ git checkout pull/6627

Update a local copy of the PR:
$ git checkout pull/6627
$ git pull https://git.openjdk.java.net/jdk pull/6627/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6627

View PR using the GUI difftool:
$ git pr show -t 6627

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6627.diff

@bridgekeeper

@bridgekeeper bridgekeeper bot commented Dec 1, 2021

👋 Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Dec 1, 2021
@openjdk

@openjdk openjdk bot commented Dec 1, 2021

@Hamlin-Li The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc label Dec 1, 2021
@mlbridge

@mlbridge mlbridge bot commented Dec 1, 2021

Webrevs

@mlbridge

@mlbridge mlbridge bot commented Dec 2, 2021

Mailing list message from Kirk Pepperdine on hotspot-gc-dev:

Hi Hamlin,

Am I to understand that these benchmarks were running for 3-5 seconds?

Kind regards,
Kirk Pepperdine

@Hamlin-Li
Author

@Hamlin-Li Hamlin-Li commented Dec 2, 2021

Am I to understand that these benchmarks were running for 3-5 seconds?

No, the time is just the Post Evacuate Cleanup 1 pause time.
For the test, each round takes dozens of seconds.

Contributor

@tschatzl tschatzl left a comment

I am currently doing a few benchmark runs with it, with so far expected results, and looking over the sources. There are a few places I will comment about later after having a more thorough look, but some initial comments now. They are a bit random, so please bear with me.

General questions/suggestions:

  • we reviewers would really appreciate an overview of the design decisions when posting such a significant change or deviation from the current parallelization mechanism. I.e. this change uses special (overflow-)task queues for distribution, pre-filling them and using the existing work stealing mechanism.
    No problem with that, but it would have been nice to add this information in addition to the source code dump :)
    As a bonus, add this information to the CR when done.
  • further, this change implements limiting the number of threads for a particular subphase of a G1BatchedTask. That's new functionality, and while coincidentally we in the gc team talked about something like this (including the use of task queues) as a possibility for PR#6612 a few days ago, it would have been nice to discuss such a change - and, as usual, to split it out beforehand. :)
  • it would be interesting to understand whether you tried other approaches on work distribution here and what their results were (if so). Naively I would have initially "just" put the G1EvacFailureObjectsSets into an array of those, and provided an atomic claim counter for those. Not that it is particularly better, but it is interesting that the change immediately goes for the task queue approach. (One problem with that is that the task stealing isn't very scalable to a lot of threads at the moment, particularly on aarch64).
  • the worker_cost could certainly be improved a little: if there are a lot of threads (like 128 in your example), a given task does around 1000k entries, and there are not (much) more than 128000 objects to process (which seems little, but there are no statistics gathered by this change), there is no point in using all those 128 threads (and you do not want to start 128 threads for in-total 128 tasks; currently the heuristic is 5 threads per region). So if it is not too much work (that information might be provided anyway) I would prefer a more exact estimate.
  • you mentioned a possible optimization for 32 threads: not sure this is the target we should aim for with this change - I would actually count a 32 thread machine already as fairly "large" by server/VM standards, but you can convince me otherwise :) So while the change should scale up, smaller sizes are likely more interesting.
  • not sure if there is a problem with always stuffing all possible tasks into the task queues at the beginning: that seems fairly expensive to do, and the overflow mechanism should be avoided as it allows unbounded memory use (consider all regions with all objects on the heap failing - the code is stuffing billions / 1000 tasks into the queues then). Of course this needs evaluation of how many tasks there typically are - the TASKQUEUE_STATS mentioned below can give an idea...
    I'm also not sure if it is useful to provide separate task queues for this mechanism, maybe the existing one can be reused. At least not sure if 128k entries for each queue allocated at startup for every thread is a good idea (Did you measure how big they are? These tasks are significantly larger than others). Again, statistics can help.
  • the results posted here are not very useful given that we do not know the environment and the actual application and settings. E.g. something like dacapo h2 with 1gb -Xms and -Xmx would be much better. I can only guess that you were using some ARM servers for testing, but may as well be something like a dual-socket Epyc.
    All that helps categorizing the results quite a bit - see the question from Kirk. E.g. mention that at higher values of G1EvacuationFailureALotCSetPercent, particularly with the other settings, the entire young gen will fail, so which objects each gc fails to evacuate depends on luck, at least after a few gcs, making the measurements quite useless without a confidence interval; I would largely guess that they are equal in the end.
    I.e. this change needs something like JDK-8140326 for example.
    But there may be other causes for the problem, and without providing statistics you can't tell.

Other requests:

  • traceability: the change needs log messages at different levels, first some summary information (at debug) and second some detailed information (at trace).
    For the debug level message I'd suggest an extension of the gc+heap=debug messages to also show the number of failed regions of that category.
    E.g. for these messages
[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56)
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8)
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5

to add something like

[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56) [0]
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8) [3]
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56 [2]
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2 [0]
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5 [2]

and even also add the distinction between actual evac failed and pinned regions.
I'm not suggesting my idea is perfect, to be discussed with others.

Also maybe some additional gc+evacfail or similar category could track, at detail level, per-region statistics on which regions failed (region#), how many objects/bytes failed, and the reasons for the failure (pinned and/or "real" evacuation failure). Note that some (or most) of this information should already be provided in an extra CR before doing this change imho.

  • maybe also the "Remove Self Forwards" phase could get some additional information about how many regions/tasks/bytes/objects it worked on (just throwing some ideas here, some may be redundant)
  • also Termination information for any phase that uses a TaskTerminator is a must; also, the TASKQUEUE_STATS mechanism should be supported in a useful way.

Thanks for your patience,
Thomas

@Hamlin-Li
Author

@Hamlin-Li Hamlin-Li commented Dec 3, 2021

I am currently doing a few benchmark runs with it, with so far expected results, and looking over the sources. There are a few places I will comment about later after having a more thorough look, but some initial comments now. They are a bit random, so please bear with me.

Thanks a lot. :)

General questions/suggestions:

  • we reviewers would really appreciate an overview of the design decisions when posting such a significant change or deviation from the current parallelization mechanism. I.e. this change uses special (overflow-)task queues for distribution, pre-filling them and using the existing work stealing mechanism.
    No problem with that, but it would have been nice to add this information in addition to the source code dump :)
    As a bonus, add this information to the CR when done.

Sure, will add this information later. :) (TO add info)

  • further, this change implements limiting the number of threads for a particular subphase of a G1BatchedTask. That's new functionality, and while coincidentally we in the gc team talked about something like this (including the use of task queues) as a possibility for PR#6612 a few days ago, it would have been nice to discuss such a change - and, as usual, to split it out beforehand. :)

Not sure if I get the point; the current implementation calculates the thread sizing to set the terminator threads and task queue size at the end of the G1PostEvacuateCollectionSetCleanupTask1 constructor. I think there is some better refactoring solution to get this done.
I will check PR #6612 further to see what we can do first. :) (TO double check)

  • it would be interesting to understand whether you tried other approaches on work distribution here and what their results were (if so). Naively I would have initially "just" put the G1EvacFailureObjectsSets into an array of those, and provided an atomic claim counter for those. Not that it is particularly better, but it is interesting that the change immediately goes for the task queue approach. (One problem with that is that the task stealing isn't very scalable to a lot of threads at the moment, particularly on aarch64).

Thanks, yes, that's another way to do it, by providing an atomic claimer; let me measure it later. (TODO & measure)
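For reference, the atomic-claimer alternative could look roughly like this. This is a stand-alone illustrative sketch; FailedObjectSet, ArrayClaimer, and drain_with_workers are hypothetical names, not HotSpot identifiers.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch of the alternative discussed above: put the per-region
// failed-object sets into a plain array and let workers claim entries with a
// single atomic counter, instead of pre-filling task queues.
struct FailedObjectSet {
  std::size_t num_objects;  // stand-in for a region's recorded self-forwarded objects
};

class ArrayClaimer {
  std::atomic<std::size_t> _claim{0};
  const std::vector<FailedObjectSet>& _sets;
public:
  explicit ArrayClaimer(const std::vector<FailedObjectSet>& sets) : _sets(sets) {}
  // Claims the next unprocessed set; returns nullptr when all are claimed.
  const FailedObjectSet* claim_next() {
    std::size_t idx = _claim.fetch_add(1, std::memory_order_relaxed);
    return idx < _sets.size() ? &_sets[idx] : nullptr;
  }
};

// Returns the total number of objects "processed" across all workers.
inline std::size_t drain_with_workers(const std::vector<FailedObjectSet>& sets,
                                      unsigned num_workers) {
  ArrayClaimer claimer(sets);
  std::atomic<std::size_t> total{0};
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; i++) {
    workers.emplace_back([&] {
      while (const FailedObjectSet* s = claimer.claim_next()) {
        total.fetch_add(s->num_objects, std::memory_order_relaxed);
      }
    });
  }
  for (auto& t : workers) t.join();
  return total.load();
}
```

The trade-off versus task queues: no per-thread queue memory and no stealing overhead, but the claim granularity is a whole per-region set, so load balance is coarser.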

  • the worker_cost could certainly be improved a little: if there are a lot of threads (like 128 in your example), a given task does around 1000k entries, and there are not (much) more than 128000 objects to process (which seems little, but there are no statistics gathered by this change), there is no point in using all those 128 threads (and you do not want to start 128 threads for in-total 128 tasks; currently the heuristic is 5 threads per region). So if it is not too much work (that information might be provided anyway) I would prefer a more exact estimate.

Yes, this part of thread sizing is very draft, we will definitely need to refine the estimate. (TODO)

  • you mentioned a possible optimization for 32 threads: not sure this is the target we should aim for with this change

Agree, this is to be done in another pr if necessary. (TODO new pr)

  • I would actually count a 32 thread machine already as fairly "large" by server/VM standards but you can convince me otherwise :) So while the change should scale up, smaller sizes are likely more interesting.

Thanks for the suggestion. Let me measure this later. (TO measure)

  • not sure if there is a problem with always stuffing all possible tasks into the task queues at the beginning: that seems fairly expensive to do, and the overflow mechanism should be avoided as it allows unbounded memory use (consider all regions with all objects on the heap failing - the code is stuffing billions / 1000 tasks into the queues then). Of course this needs evaluation of how many tasks there typically are - the TASKQUEUE_STATS mentioned below can give an idea...

Agree. This is another part needed to be refined. (TODO)

I'm also not sure if it is useful to provide separate task queues for this mechanism, maybe the existing one can be reused.

Do you mean G1EvacFailureParScanTasksQueue* _task_queue in G1EvacFailureParScanState? It's just an OverflowTaskQueue, not a new structure. Would you mind clarifying further? :)

At least not sure if 128k entries for each queue allocated at startup for every thread is a good idea (Did you measure how big they are? These tasks are significantly larger than others). Again, statistics can help.

Not quite getting your point; would you mind clarifying further what you mean by "128k entries for each queue allocated at startup for every thread"? :)

  • the results posted here are not very useful given that we do not know the environment and the actual application and settings. E.g. something like dacapo h2 with 1gb -Xms and -Xmx would be much better. I can only guess that you were using some ARM servers for testing, but may as well be something like a dual-socket Epyc.

Sure, I will try to use a more standard benchmark to measure, and will update the results and the environment information later too. (TO measure)

All that helps categorizing the results quite a bit - see the question from Kirk. E.g. mention that at higher values of G1EvacuationFailureALotCSetPercent, particularly with the other settings, the entire young gen will fail, so which objects each gc fails to evacuate depends on luck, at least after a few gcs, making the measurements quite useless without a confidence interval; I would largely guess that they are equal in the end.

I think so too, will add more information in the pr. (TO add info)

I.e. this change needs something like JDK-8140326 for example.

Yes, this is to be done.

But there may be other causes for the problem, and without providing statistics you can't tell.

Other requests:

  • traceability: the change needs log messages at different levels, first some summary information (at debug) and second some detailed information (at trace).
    For the debug level message I'd suggest an extension of the gc+heap=debug messages to also show the number of failed regions of that category.
    E.g. for these messages
[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56)
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8)
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5

to add something like

[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56) [0]
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8) [3]
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56 [2]
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2 [0]
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5 [2]

and even also add the distinction between actual evac failed and pinned regions. I'm not suggesting my idea is perfect, to be discussed with others.

Also maybe some additional gc+evacfail or similar category could track, at detail level, per-region statistics on which regions failed (region#), how many objects/bytes failed, and the reasons for the failure (pinned and/or "real" evacuation failure). Note that some (or most) of this information should already be provided in an extra CR before doing this change imho.

  • maybe also the "Remove Self Forwards" phase could get some additional information about how many regions/tasks/bytes/objects it worked on (just throwing some ideas here, some may be redundant)
  • also Termination information for any phase that uses a TaskTerminator is a must; also, the TASKQUEUE_STATS mechanism should be supported in a useful way.

Can't agree more, I will first collect more statistics by adding logging, will do it in another PR first. (TODO new pr)

Thanks for your patience, Thomas

Thanks a lot for the detailed initial comments.
-Hamlin

@tschatzl
Contributor

@tschatzl tschatzl commented Dec 3, 2021

  • further, this change implements limiting the number of threads for a particular subphase of a G1BatchedTask. That's new functionality, and while coincidentally we in the gc team talked about something like this (including the use of task queues) as a possibility for PR#6612 a few days ago, it would have been nice to discuss such a change - and, as usual, to split it out beforehand. :)

Not sure if I get the point; the current implementation calculates the thread sizing to set the terminator threads and task queue size at the end of the G1PostEvacuateCollectionSetCleanupTask1 constructor. I think there is some better refactoring solution to get this done. I will check PR #6612 further to see what we can do first. :) (TO double check)

I may have misread the code a bit; pr#6612 does not contain any similar changes - using the task queues has only been a thought.

I'm also not sure if it is useful to provide separate task queues for this mechanism, maybe the existing one can be reused.

Do you mean G1EvacFailureParScanTasksQueue* _task_queue in G1EvacFailureParScanState? It's just an OverflowTaskQueue, not a new structure. Would you mind clarifying further? :)

Yes, I meant reusing G1EvacFailureParScanTasksQueue, but that's not possible afaict - forget it.

At least not sure if 128k entries for each queue allocated at startup for every thread is a good idea (Did you measure how big they are? These tasks are significantly larger than others). Again, statistics can help.

Not quite getting your point; would you mind clarifying further what you mean by "128k entries for each queue allocated at startup for every thread"? :)

Just seeing that the task queues are dynamically allocated; still, they are 128k entries * 32 bytes (i.e. 4MB per thread) in size. Do we need task queues that big (all the time)? And if not, how do we scale them? (The size is a template parameter right now....) Do we want to have changeable task queue sizes (only if using task queues is the way to go)?

I.e. this change needs something like JDK-8140326 for example.

Yes, this is to be done.

Yes, a separate CR.

But there may be other causes for the problem, and without providing statistics you can't tell.
Other requests:

  • traceability: the change needs log messages at different levels,
    [...]

Can't agree more, I will first collect more statistics by adding logging, will do it in another PR first. (TODO new pr)

Things might be quicker if you sent an email about the proposed changes to hotspot-gc-dev first for discussion; otherwise every reviewer needs to check out, set everything up and run the code to see them...

Thanks,
Thomas

@Hamlin-Li
Author

@Hamlin-Li Hamlin-Li commented Dec 3, 2021

Not sure if I get the point; the current implementation calculates the thread sizing to set the terminator threads and task queue size at the end of the G1PostEvacuateCollectionSetCleanupTask1 constructor. I think there is some better refactoring solution to get this done. I will check PR #6612 further to see what we can do first. :) (TO double check)

I may have misread the code a bit; pr#6612 does not contain any similar changes - using the task queues has only been a thought.

I see.

Do you mean G1EvacFailureParScanTasksQueue* _task_queue in G1EvacFailureParScanState? It's just an OverflowTaskQueue, not a new structure. Would you mind clarifying further? :)

Yes, I meant reusing G1EvacFailureParScanTasksQueue, but that's not possible afaict - forget it.

OK.

Just seeing that the task queues are dynamically allocated; still, they are 128k entries * 32 bytes (i.e. 4MB per thread) in size. Do we need task queues that big (all the time)? And if not, how do we scale them? (The size is a template parameter right now....) Do we want to have changeable task queue sizes (only if using task queues is the way to go)?

I see your concerns. (TODO, adjust task queue size, or consider using simpler solution e.g. atomic claimer(mentioned above) )

Yes, this is to be done.

Yes, a separate CR.

Sure.

But there may be other causes for the problem, and without providing statistics you can't tell.
Other requests:

  • traceability: the change needs log messages at different levels,
    [...]

Can't agree more, I will first collect more statistics by adding logging, will do it in another PR first. (TODO new pr)

Things might be quicker if you sent an email about the proposed changes to hotspot-gc-dev first for discussion; otherwise every reviewer needs to check out, set everything up and run the code to see them...

I think here you mean the log message content and format to be collected, right? If so, I will summarize and send it out to hotspot-gc-dev before the related PR.

Thanks
-Hamlin

@tschatzl
Contributor

@tschatzl tschatzl commented Dec 3, 2021

Hi again,

  • maybe also the "Remove Self Forwards" phase could get some additional information about how many regions/tasks/bytes/objects it worked on (just throwing some ideas here, some may be redundant)

  • also Termination information for any phase that uses a TaskTerminator is a must; also, the TASKQUEUE_STATS mechanism should be supported in a useful way.

It would also be very interesting to get information to differentiate how long the sorting and how long the actual self-forward removal of that phase took. Additional interesting information is memory consumption of the segment allocators/sorting. If significant compared to other allocation, maybe we could even separate them out in a separate category (not sure, I do not really think so, just thinking out loud).

If you need help with adding gc pause log statistics, ping me; for timing information it should be sufficient to add enum values to G1GCPhaseTimes::GCParPhases, and initialize them like the others at the appropriate places; counts can be added as "thread work item" like for the ScanHR enum value.

The current evacuation failure handling implementation is already really good compared to 17 due to your hard work, now I guess it starts to become necessary knowing more details for further investigation and in general, for analysis and troubleshooting by users now and later :-).

Thanks a lot for your effort,
Thomas

@Hamlin-Li
Author

@Hamlin-Li Hamlin-Li commented Dec 6, 2021

It would also be very interesting to get information to differentiate how long the sorting and how long the actual self-forward removal of that phase took. Additional interesting information is memory consumption of the segment allocators/sorting. If significant compared to other allocation, maybe we could even separate them out in a separate category (not sure, I do not really think so, just thinking out loud).

Agree, this information will be very helpful for the subsequent optimizations we will soon work on.

If you need help with adding gc pause log statistics, ping me;

Thanks a lot for your help all the time. :) I will ask for your help.

for timing information it should be sufficient to add enum values to G1GCPhaseTimes::GCParPhases, and initialize them like the others at the appropriate places; counts can be added as "thread work item" like for the ScanHR enum value.

Thanks for the information. :)

The current evacuation failure handling implementation is already really good compared to 17 due to your hard work, now I guess it starts to become necessary knowing more details for further investigation and in general, for analysis and troubleshooting by users now and later :-).

Agree, we need to supply more detailed information to users.

Thanks a lot for your effort, Thomas

Thanks a lot, my pleasure :)

@kstefanj
Contributor

@kstefanj kstefanj commented Dec 10, 2021

Hi Hamlin,

I looked at this patch a bit and got some other ideas. Similar to some of your initial ideas around using bitmaps. See a PoC here (that I've just done limited testing on):
master...kstefanj:rnd-evac-failure-in-prev-bm

I think the main problem then was the additional memory overhead, and my approach avoids this by directly using the prev marking bitmap (we could also investigate using the next bitmap). So instead of first recording the failing objects and then marking them in the closure:

class RemoveSelfForwardPtrObjClosure: public ObjectClosure {
...
  void do_object(oop obj) {
    ...
    // We consider all objects that we find self-forwarded to be
    // live. What we'll do is that we'll update the prev marking
    // info so that they are all under PTAMS and explicitly marked.
    if (!_cm->is_marked_in_prev_bitmap(obj)) {
      _cm->mark_in_prev_bitmap(obj);
    }
    ...
  }
};

I directly mark them in the bitmap (that is already in place). One drawback with this approach is that we need all failing regions to have clean bitmaps ready for use. I solve this by clearing them up front. We can probably make this more efficient, but in the PoC I just clear all old regions added to the collection set and all regions compacted in the Full GC. To see the additional cost of the CSet clearing I added a new phase showing this and it takes significant time in some cases, but still it seems to be a lot faster than the current approach. Your logging additions really showed that the sorting part of the current approach is quite slow when many objects are failing the evacuation (and this will be the normal case with region pinning).

I also think this approach will be easier to parallelize, we can just split each region into multiple chunks that are taken care of individually (we just need to make sure the zapping is handled correctly). What do you think about this?
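The chunked bitmap walk suggested above could be sketched as follows. This is a stand-alone illustration under stated assumptions, not the PoC code: count_marked_parallel is a hypothetical name, bit counting stands in for the per-object self-forward removal, and the real code would operate on G1's mark bitmap rather than a plain vector of words.

```cpp
#include <algorithm>
#include <atomic>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative sketch: split a region's mark bitmap into fixed-size chunks of
// 64-bit words that workers claim with an atomic counter; each worker visits
// only the marked bits in its claimed chunks, independently of the others.
inline std::size_t count_marked_parallel(const std::vector<std::uint64_t>& bitmap,
                                         std::size_t words_per_chunk,
                                         unsigned num_workers) {
  const std::size_t num_chunks =
      (bitmap.size() + words_per_chunk - 1) / words_per_chunk;
  std::atomic<std::size_t> next_chunk{0};
  std::atomic<std::size_t> marked{0};
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; i++) {
    workers.emplace_back([&] {
      for (std::size_t c = next_chunk.fetch_add(1); c < num_chunks;
           c = next_chunk.fetch_add(1)) {
        std::size_t begin = c * words_per_chunk;
        std::size_t end = std::min(bitmap.size(), begin + words_per_chunk);
        std::size_t local = 0;
        for (std::size_t w = begin; w < end; w++) {
          // Counting set bits stands in for processing each marked object.
          local += std::bitset<64>(bitmap[w]).count();
        }
        marked.fetch_add(local, std::memory_order_relaxed);
      }
    });
  }
  for (auto& t : workers) t.join();
  return marked.load();
}
```

Since chunk boundaries are word-aligned in the bitmap rather than object-aligned in the heap, the real implementation would also need a rule for objects straddling a chunk boundary (e.g. a chunk owns only objects whose mark bit lies within it).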

There are more things on the agenda that will affect this, me and Thomas have ideas on how to remove the need for two marking bitmaps. Currently we keep the prev bitmap to be able to parse regions with unloaded classes, but there are other ways to avoid this without keeping the second bitmap. We also need the bitmap for verification, but this can also be managed in other ways.

If we do these things, the approach in my PoC would also need to be revised a bit. So the question is how we should proceed forward. I'm a bit hesitant to this change because of its complexity and knowing that there probably is a simpler way to improve parallelism if we take another initial approach.

Please let me know what you think.

@Hamlin-Li
Author

@Hamlin-Li Hamlin-Li commented Dec 13, 2021

Hi Hamlin,

I looked at this patch a bit and got some other ideas. Similar to some of your initial ideas around using bitmaps. See a PoC here (that I've just done limited testing on): master...kstefanj:rnd-evac-failure-in-prev-bm

I think the main problem then was the additional memory overhead, and my approach avoids this by directly using the prev marking bitmap (we could also investigate using the next bitmap). So instead of first recording the failing objects and then marking them in the closure:
...
I directly mark them in the bitmap (that is already in place). One drawback with this approach is that we need all failing regions to have clean bitmaps ready for use. I solve this by clearing them up front. We can probably make this more efficient, but in the PoC I just clear all old regions added to the collection set and all regions compacted in the Full GC. To see the additional cost of the CSet clearing I added a new phase showing this and it takes significant time in some cases, but still it seems to be a lot faster than the current approach. Your logging additions really showed that the sorting part of the current approach is quite slow when many objects are failing the evacuation (and this will be the normal case with region pinning).

Thanks a lot Stefan, I like the idea very much! :)
I'll see what can be done to further optimize this approach.

I also think this approach will be easier to parallelize, we can just split each region into multiple chunks that are taken care of individually (we just need to make sure the zapping is handled correctly).

I think we can handle this zapping correctly, let me do some investigation and test.

What do you think about this?

Of course, I agree!

There are more things on the agenda that will affect this, me and Thomas have ideas on how to remove the need for two marking bitmaps. Currently we keep the prev bitmap to be able to parse regions with unloaded classes, but there are other ways to avoid this without keeping the second bitmap. We also need the bitmap for verification, but this can also be managed in other ways.

If we do these things, the approach in my PoC would also need to be revised a bit.

Thanks for sharing, it will be helpful if there's more details in this direction. :)

So the question is how we should proceed forward. I'm a bit hesitant to this change because of its complexity and knowing that there probably is a simpler way to improve parallelism if we take another initial approach.

Please let me know what you think.

I like the idea, and fully support the new direction, hope to get more details if it's convenient for now. :)

Thanks a lot
-Hamlin

@kstefanj
Contributor

@kstefanj kstefanj commented Dec 13, 2021

...
I directly mark them in the bitmap (that is already in place). One drawback with this approach is that we need all failing regions to have clean bitmaps ready for use. I solve this by clearing them up front. We can probably make this more efficient, but in the PoC I just clear all old regions added to the collection set and all regions compacted in the Full GC. To see the additional cost of the CSet clearing I added a new phase showing this and it takes significant time in some cases, but still it seems to be a lot faster than the current approach. Your logging additions really showed that the sorting part of the current approach is quite slow when many objects are failing the evacuation (and this will be the normal case with region pinning).

Thanks a lot Stefan, I like the idea very much! :) I'll see what can be done to further optimize this approach.

One thing that we need to look more at is the clearing of the bitmap for old regions. This can probably be moved to a better location and be done in parallel to minimize the impact.

I also think this approach will be easier to parallelize, we can just split each region into multiple chunks that are taken care of individually (we just need to make sure the zapping is handled correctly).

I think we can handle this zapping correctly, let me do some investigation and test.

Yeah, this is just a matter of coming up with a nice clean approach to handle it.

What do you think about this?

Of course, I agree!

There are more things on the agenda that will affect this; Thomas and I have ideas on how to remove the need for two marking bitmaps. Currently we keep the prev bitmap to be able to parse regions with unloaded classes, but there are other ways to avoid this without keeping the second bitmap. We also need the bitmap for verification, but this can also be managed in other ways.
If we do these things, the approach in my PoC would also need to be revised a bit.

Thanks for sharing, it will be helpful if there's more details in this direction. :)

If we manage to remove one of the bitmaps there will no longer be a "prev" and a "next" bitmap, but just "the marking" bitmap. The bitmap is populated during the marking and during this point no old regions will be collected so there will be no need to clear parts of this bitmap. Once we are ready to do mixed collections the marking information in the bitmap is no longer needed and the bitmap will be cleared (this will be done concurrently, as it is today for the "next" bitmap). So what I mean by revised is that we need to adapt the solution to this new state of things.

So the question is how we should proceed forward. I'm a bit hesitant to this change because of its complexity and knowing that there probably is a simpler way to improve parallelism if we take another initial approach.
Please let me know what you think.

I like the idea, and fully support the new direction, hope to get more details if it's convenient for now. :)

Sounds great. The question is how we should move forward to make this most efficient. I could propose my PoC as a PR and then take it from there, allowing you to later work on making it parallelize better. Then in the future attack the other optimizations.


@Hamlin-Li Hamlin-Li commented Dec 14, 2021

One thing that we need to look more at is the clearing of the bitmap for old regions. This can probably be moved to a better location and be done in parallel to minimize the impact.

Thanks for reminding.

I think we can handle this zapping correctly, let me do some investigation and test.

Yeah, this is just a matter of coming up with a nice clean approach to handle it.
Agree.

If we manage to remove one of the bitmaps there will no longer be a "prev" and a "next" bitmap, but just "the marking" bitmap. The bitmap is populated during the marking and during this point no old regions will be collected so there will be no need to clear parts of this bitmap. Once we are ready to do mixed collections the marking information in the bitmap is no longer needed and the bitmap will be cleared (this will be done concurrently, as it is today for the "next" bitmap). So what I mean by revised is that we need to adapt the solution to this new state of things.

I see, thanks for clarifying.

Sounds great. The question is how we should move forward to make this most efficient. I could propose my PoC as a PR and then take it from there, allowing you to later work on making it parallelize better. Then in the future attack the other optimizations.

Sure, I think your PoC is good.
I just drafted a version that parallelizes evac failure processing in chunks, and used it to evaluate the potential benefit of parallelism on top of your PoC. It seems my initial version does not bring much benefit; I think it might be because the mark bitmap does not work well when I parallelize in chunks. I will investigate further. Here is my draft version:
master...Hamlin-Li:parallelize-evac-failure-v2-in-bm


@kstefanj kstefanj commented Dec 14, 2021

Sure, I think your PoC is good. I just drafted a version that parallelizes evac failure processing in chunks, and used it to evaluate the potential benefit of parallelism on top of your PoC. It seems my initial version does not bring much benefit; I think it might be because the mark bitmap does not work well when I parallelize in chunks. I will investigate further. Here is my draft version: master...Hamlin-Li:parallelize-evac-failure-v2-in-bm

Thanks for taking a stab at this. I don't have time to look at it in detail now, but I think you should be able to re-use some of the ideas used when chunking regions for rebuilding the remembered sets. Look at G1RebuildRemSetTask.
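For reference, the chunk-claiming idea behind that kind of task boils down to workers atomically claiming fixed-size chunk indices from a shared counter. A minimal standalone sketch (plain C++ with standard threads; `process_region` and its parameters are invented for illustration, not the actual G1RebuildRemSetTask code):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next chunk index via fetch_add
// until all chunks are taken. fetch_add hands out each index to
// exactly one worker, so every chunk is processed exactly once
// regardless of how many workers participate.
void process_region(size_t num_chunks, unsigned num_workers,
                    std::vector<int>& processed_count) {
  std::atomic<size_t> next_chunk{0};
  auto worker = [&]() {
    for (;;) {
      size_t chunk = next_chunk.fetch_add(1, std::memory_order_relaxed);
      if (chunk >= num_chunks) break;
      processed_count[chunk] += 1;  // stand-in for real per-chunk work
    }
  };
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; i++) workers.emplace_back(worker);
  for (auto& t : workers) t.join();
}
```

With this shape, a single large failed region no longer serializes on one thread: every available worker pulls chunks until the counter runs past the chunk count.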

I'll try to find time this week to clean the PoC and get it out for review. But I will soon be out on vacation over the holidays, so not sure I will get it in before that.


@Hamlin-Li Hamlin-Li commented Dec 15, 2021

Sure, I think your PoC is good. I just drafted a version that parallelizes evac failure processing in chunks, and used it to evaluate the potential benefit of parallelism on top of your PoC. It seems my initial version does not bring much benefit; I think it might be because the mark bitmap does not work well when I parallelize in chunks. I will investigate further. Here is my draft version: master...Hamlin-Li:parallelize-evac-failure-v2-in-bm

Thanks for taking a stab at this. I don't have time to look at it in detail now, but I think you should be able to re-use some of the ideas used when chunking regions for rebuilding the remembered sets. Look at G1RebuildRemSetTask.

Thanks for the information.

I'll try to find time this week to clean the PoC and get it out for review. But I will soon be out on vacation over the holidays, so not sure I will get it in before that.
Sure, please take your time. :)


@Hamlin-Li Hamlin-Li commented Dec 15, 2021

I have found the root cause of the inefficient parallelism in chunks in my PoC; it's not related to the bitmap.
I have also fixed this fake-parallelism issue in my PoC, which is based on yours. Now parallelism in chunks brings a big improvement when G1EvacuationFailureALotCSetPercent is low (e.g. below 25), and does not bring much regression when G1EvacuationFailureALotCSetPercent is high. It might be improved further, or at least we can fall back to the previous version (your PoC) when we detect at runtime that G1EvacuationFailureALotCSetPercent is high.

So, I will send out my PoC of parallelism in chunks after your PoC is integrated.


@kstefanj kstefanj commented Dec 16, 2021

Opened the PoC as a draft: #6867

Still looking at some Full GC regressions that I need to understand.


@Hamlin-Li Hamlin-Li commented Dec 17, 2021

Thanks Stefan.


@bridgekeeper bridgekeeper bot commented Jan 14, 2022

@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!


@Hamlin-Li Hamlin-Li commented Jan 14, 2022

I've created a draft PR at #7047, this one will be closed.
