8278824: Increase chunks per region for G1 vm root scan #6840

earthling-amzn · 2021-12-14T22:09:49Z

During an upgrade from JDK11 to JDK15 (and JDK17) we experience an increase in garbage collection latency from approximately 10-15ms at p99.9 to approximately 45-50ms. Logs showed the object copy phase consuming most of that time.

Enabling task queue stats on the build and increasing the log level showed that one evacuation thread is doing far more work than the other threads.

148905 [2021-12-06T22:12:38.338+0000][debug][gc,phases ] GC(581) Object Copy (ms): Min: 13.3, Avg: 16.6, Max: 67.2, Diff: 53.8, Sum: 546.7, Workers: 33
148906 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task ] GC(581) 14.8 15.4 15.3 16.0 14.7 14.6 16.6 15.3 16.3 15.2 67.2 13.8 14.1 16.5 14.5 14.3 14.1 14.9 15.3 15.8 13.3 15.6 15.0 15.6 14.7 15.6 14.6 14.3 15.1 15.2 14.4 13.5 14.9

Looking back, we see a pattern in these cycles where the thread doing most of the work scans only a 2 or 3 "blocks" (here, just 2):

148899 [2021-12-06T22:12:38.338+0000][debug][gc,phases ] GC(581) Scanned Blocks: Min: 2, Avg: 1203.9, Max: 1875, Diff: 1873, Sum: 39730, Workers: 33
148900 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task ] GC(581) 949 1838 1397 1548 1450 1875 821 1067 1312 1463 2 1282 1319 38 1177 1199 819 1170 897 1343 1860 1070 1059 1552 1217 1296 1068 1092 1645 1166 1002 1140 1597

And in the task queue stats this thread performs an order of magnitude more operations on the task queue:

149005 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 8 51757 51143 9173 121241 9379 0 0
149006 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 9 37328 36045 6291 113112 8332 0 0
149007 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 10 350079 77644 0 132 0 267666 226166
149008 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 11 38889 37932 7100 110650 7940 0 0
149009 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 12 53762 53174 9375 110255 8637 0 0

We traced the origin of this behavior to: https://bugs.openjdk.java.net/browse/JDK-8213108

We tried changing ParGCArrayScanChunk, but to no avail. Introducing a flag to override the ergonomics that select chunks per region was effective.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Integration blocker

⚠️ Title mismatch between PR and JBS for issue JDK-8278824

Issue

JDK-8278824: Uneven work distribution when scanning heap roots in G1 ⚠️ Title mismatch between PR and JBS.

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6840/head:pull/6840
$ git checkout pull/6840

Update a local copy of the PR:
$ git checkout pull/6840
$ git pull https://git.openjdk.java.net/jdk pull/6840/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6840

View PR using the GUI difftool:
$ git pr show -t 6840

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6840.diff

bridgekeeper · 2021-12-14T22:09:55Z

👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2021-12-14T22:12:05Z

@earthling-amzn The following label will be automatically applied to this pull request:

hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2021-12-14T22:17:23Z

Webrevs

tschatzl · 2021-12-15T16:08:00Z

Thanks for reporting this issue - nice find!

This is as you correctly noted an issue with work distribution during the Object Copy phase. There are known issues with work stealing that we've been working on specifically in the last few weeks; the graph mentioned below shows the current results (fwiw, it also gives good results without this change).

So this change improves upon by making the initial distribution of work better, which so far seems a good solution for this particular case.

After reproducing this particular issue and internal discussion we think that adding a new flag is something to avoid at all here. The main reason is that improving the defaults for the number of scan chunks seems side-effect free so far - tests so far do not show a regression either, and the additional memory usage (and the effort to manage this scan chunk table memory) seems negligible.

The graph attached to the CR (github does not allow me to attach it for some reason) shows pause times for your reproducer on jdk11, jdk17, and jdk17 with values of G1RemSetScanChunksPerRegion from 64 to 1024; the selected 256 seems to work very well for this case :) and that improved-steal-multi that shows a recent prototype for the mentioned task queue changes (which covers more cases than this one).

So our suggestion is to, for the 16m regions, set the default number of chunks per region to 128 or 256, depending on further testing results, for JDK 18 (and then use this to backport to 17). When we are ready to post the task queue changes (probably JDK19), we might want to reconsider these defaults again.

Would that be an acceptable plan for you?

earthling-amzn · 2021-12-15T18:26:35Z

Thanks for taking a look at this!

I can understand why you'd be reluctant to add a new flag - perhaps it would be irrelevant with the changes you're describing for task queue stealing? On the other hand, having a flag here is the most flexible option and facilitates further testing. Changing the default for 16M regions to 256 is also sensible and would work in our scenario. Should we also use 256 for regions larger than 16M? or maybe 512 even?

tschatzl · 2021-12-15T19:20:54Z

With the task queue changes these changes will be irrelevant as the graph shows and there are quite a few other significant pause time improvements caused by it. However this is for likely the 19 release only, and if you want a quicker fix for older releases (mainly 17, but also 18), we should do it this way.

As for the changes, I would suggest to only change the existing formula from

    1u << (HeapRegion::LogOfHRGrainBytes / 2 - 7)

to

   1u << (HeapRegion::LogOfHRGrainBytes / 2 - 4)

or similar to get the 256 chunks for 16M regions (if I calculated correctly).

tschatzl · 2021-12-16T12:27:32Z

The ParGCArrayScanChunk option has a different purpose: when scanning recently copied objArrays, what is the size of a part of it scanned by a single task in the task queue.

This change by itself will also give some improvements on other applications. We did not notice perf regressions by using a (fixed) suggested value of 256 for this size. I'll also start a perf run with my suggestion.

With configurable card size in 18 we might want to adjust this value based on that as well in the future, I'll file an issue to investigate this specifically - I think this is out of scope for fixing this particular regression though; potentially it makes sense to go for something like a fixed chunk size instead something dependent on region and particularly card size.

earthling-amzn · 2021-12-21T00:02:28Z

Sorry for the delayed response. Log4Shell and vacation plans kept me away from this. I've updated the PR as you requested. Thanks again.

tschatzl

This looks good to me - but could you rebase this on the jdk18 tree (and just withdraw this change for tip)? Then we would not need manually backport this change for the 18 release.

That's why I am not approving this here right away but requesting changes.

The 18 changes will then be automatically forward-ported to 19 the next time we do that (every ~Thursday).

Thanks,
Thomas

tschatzl · 2021-12-21T09:30:32Z

src/hotspot/share/gc/g1/g1RemSet.cpp

@@ -114,7 +114,7 @@ class G1RemSetScanState : public CHeapObj<mtGC> {
    // values for region size.
    assert(log_region_size >= 20 && log_region_size <= 29,
           "expected value in [20,29], but got %u", log_region_size);
-    return 1u << (log_region_size / 2 - 7);
+    return 1u << (log_region_size / 2 - 4);


The comment above also needs to be adjusted to this change, octupling the mentioned numbers.

earthling-amzn · 2021-12-22T19:48:37Z

Rebased onto 18 here: openjdk/jdk18#66 (also updated figures in the comment).

openjdk bot added the rfr Pull request is ready for review label Dec 14, 2021

openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Dec 14, 2021

openjdk bot added rfr Pull request is ready for review and removed rfr Pull request is ready for review labels Dec 14, 2021

Increase chunks per region to improve work distribution during rset scan

831b916

earthling-amzn force-pushed the configurable-card-block-size branch from 830d028 to 831b916 Compare December 20, 2021 23:58

earthling-amzn changed the title ~~8278824: Add an option for G1 to configure chunks per region~~ 8278824: Increase chunks per region for G1 vm root scan Dec 21, 2021

tschatzl suggested changes Dec 21, 2021

View reviewed changes

tschatzl reviewed Dec 21, 2021

View reviewed changes

earthling-amzn mentioned this pull request Dec 22, 2021

8278824: Uneven work distribution when scanning heap roots in G1 openjdk/jdk18#66

Closed

3 tasks

earthling-amzn closed this Dec 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8278824: Increase chunks per region for G1 vm root scan #6840

8278824: Increase chunks per region for G1 vm root scan #6840

earthling-amzn commented Dec 14, 2021 •

edited by openjdk bot

Loading

bridgekeeper bot commented Dec 14, 2021

openjdk bot commented Dec 14, 2021

mlbridge bot commented Dec 14, 2021 •

edited

Loading

tschatzl commented Dec 15, 2021

earthling-amzn commented Dec 15, 2021

tschatzl commented Dec 15, 2021 •

edited

Loading

tschatzl commented Dec 16, 2021 •

edited

Loading

earthling-amzn commented Dec 21, 2021

tschatzl left a comment

tschatzl Dec 21, 2021

earthling-amzn commented Dec 22, 2021

8278824: Increase chunks per region for G1 vm root scan #6840

8278824: Increase chunks per region for G1 vm root scan #6840

Conversation

earthling-amzn commented Dec 14, 2021 • edited by openjdk bot Loading

Progress

Integration blocker

Issue

Reviewing

bridgekeeper bot commented Dec 14, 2021

openjdk bot commented Dec 14, 2021

mlbridge bot commented Dec 14, 2021 • edited Loading

Webrevs

tschatzl commented Dec 15, 2021

earthling-amzn commented Dec 15, 2021

tschatzl commented Dec 15, 2021 • edited Loading

tschatzl commented Dec 16, 2021 • edited Loading

earthling-amzn commented Dec 21, 2021

tschatzl left a comment

Choose a reason for hiding this comment

tschatzl Dec 21, 2021

Choose a reason for hiding this comment

earthling-amzn commented Dec 22, 2021

earthling-amzn commented Dec 14, 2021 •

edited by openjdk bot

Loading

mlbridge bot commented Dec 14, 2021 •

edited

Loading

tschatzl commented Dec 15, 2021 •

edited

Loading

tschatzl commented Dec 16, 2021 •

edited

Loading