8278824: Increase chunks per region for G1 vm root scan #6840
Thanks for reporting this issue - nice find!
This is, as you correctly noted, an issue with work distribution during the Object Copy phase. There are known issues with work stealing that we have been working on over the last few weeks; the graph mentioned below shows the current results (fwiw, that work also gives good results without this change).
This change improves on that by making the initial distribution of work better, which so far seems a good solution for this particular case.
After reproducing this particular issue and some internal discussion, we think that adding a new flag is something to avoid here. The main reason is that improving the defaults for the number of scan chunks seems side-effect free: tests so far do not show a regression, and the additional memory usage (and the effort to manage this scan chunk table memory) seems negligible.
The graph attached to the CR (GitHub does not allow me to attach it here for some reason) shows pause times for your reproducer on jdk11, jdk17, and jdk17 with different values for the number of chunks per region.
So our suggestion is to set the default number of chunks per region for 16M regions to 128 or 256, depending on further testing results, for JDK 18 (and then backport it to 17). When we are ready to post the task queue changes (probably JDK 19), we might want to revisit these defaults.
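For intuition, here is a back-of-the-envelope illustration (not HotSpot code) of what those chunk counts mean per 16M region, assuming G1's default 512-byte card size:

```cpp
#include <cstdio>
#include <cstddef>

// Illustrative arithmetic only: relate a chunks-per-region count to the
// resulting chunk size and cards per chunk for a 16M region, assuming the
// default 512-byte card size.
int main() {
  const size_t region_size = 16 * 1024 * 1024;   // 16M region
  const size_t card_size   = 512;                // default G1 card size
  const unsigned counts[]  = { 128, 256 };       // proposed chunk counts

  for (unsigned chunks : counts) {
    size_t chunk_bytes = region_size / chunks;
    printf("%3u chunks -> %zu KB per chunk (%zu cards)\n",
           chunks, chunk_bytes / 1024, chunk_bytes / card_size);
  }
  return 0;
}
```

So with 256 chunks each claimable unit is 64 KB (128 cards), which is what makes the per-region work fine-grained enough to balance across threads.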
Would that be an acceptable plan for you?
Thanks for taking a look at this!
I can understand why you'd be reluctant to add a new flag - perhaps it would be irrelevant with the changes you're describing for task queue stealing? On the other hand, having a flag here is the most flexible option and facilitates further testing. Changing the default for 16M regions to 256 is also sensible and would work in our scenario. Should we also use 256 for regions larger than 16M, or maybe even 512?
With the task queue changes, this change will be irrelevant, as the graph shows, and they bring quite a few other significant pause time improvements. However, that is likely for the 19 release only, so if you want a quicker fix for older releases (mainly 17, but also 18), we should do it this way.
As for the changes, I would suggest only tweaking the existing chunk-count formula, or something similar, to get the 256 chunks for 16M regions (if I calculated correctly).
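A minimal sketch of the kind of tweak I mean, assuming the current helper computes the chunk count as a power of two from the log2 of the region size (the "-7" baseline below is an assumption, as is the standalone helper shape):

```cpp
#include <cstdio>

// Sketch only: chunk count as a power of two derived from log2 of the region
// size. For 16M regions, log_region_size == 24.
static unsigned get_chunks_per_region(unsigned log_region_size) {
  // before (assumed): return 1u << (log_region_size / 2 - 7);  // 24/2 - 7 = 5 ->  32 chunks
  return 1u << (log_region_size / 2 - 4);                       // 24/2 - 4 = 8 -> 256 chunks
}

int main() {
  printf("16M regions -> %u chunks\n", get_chunks_per_region(24));
  return 0;
}
```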
This change by itself will also give some improvements for other applications. We did not notice perf regressions when using the (fixed) suggested value of 256 for this region size. I'll also start a perf run with my suggestion.
With configurable card size in 18 we might want to adjust this value based on that as well in the future; I'll file an issue to investigate this specifically. I think that is out of scope for fixing this particular regression though; potentially it makes sense to go for something like a fixed chunk size instead of something dependent on region and, in particular, card size.
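To make that last point concrete, a sketch of the fixed-chunk-size alternative; the 64K target and the helper name are placeholders, not tuned or existing values:

```cpp
#include <cstddef>

// Sketch of the "fixed chunk size" idea: derive the per-region chunk count
// from a fixed target chunk size rather than from the region (and card) size.
// The 64K target is a placeholder, not a tuned value.
static unsigned chunks_per_region_fixed(size_t region_size_bytes) {
  const size_t target_chunk_bytes = 64 * 1024;        // hypothetical target
  size_t chunks = region_size_bytes / target_chunk_bytes;
  return chunks > 0 ? (unsigned)chunks : 1u;          // at least one chunk
}
// e.g. 16M regions -> 256 chunks, 32M regions -> 512 chunks
```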