8278824: Increase chunks per region for G1 vm root scan #6840

Conversation

@earthling-amzn (Contributor) commented Dec 14, 2021

During an upgrade from JDK11 to JDK15 (and JDK17) we experienced an increase in garbage collection latency from approximately 10-15ms at p99.9 to approximately 45-50ms. Logs showed the object copy phase consuming most of that time.

Enabling task queue stats on the build and increasing the log level showed that one evacuation thread was doing far more work than the others.

148905 [2021-12-06T22:12:38.338+0000][debug][gc,phases ] GC(581) Object Copy (ms): Min: 13.3, Avg: 16.6, Max: 67.2, Diff: 53.8, Sum: 546.7, Workers: 33
148906 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task ] GC(581) 14.8 15.4 15.3 16.0 14.7 14.6 16.6 15.3 16.3 15.2 67.2 13.8 14.1 16.5 14.5 14.3 14.1 14.9 15.3 15.8 13.3 15.6 15.0 15.6 14.7 15.6 14.6 14.3 15.1 15.2 14.4 13.5 14.9

Looking back, we see a pattern in these cycles where the thread doing most of the work scans only 2 or 3 "blocks" (here, just 2):

148899 [2021-12-06T22:12:38.338+0000][debug][gc,phases ] GC(581) Scanned Blocks: Min: 2, Avg: 1203.9, Max: 1875, Diff: 1873, Sum: 39730, Workers: 33
148900 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task ] GC(581) 949 1838 1397 1548 1450 1875 821 1067 1312 1463 2 1282 1319 38 1177 1199 819 1170 897 1343 1860 1070 1059 1552 1217 1296 1068 1092 1645 1166 1002 1140 1597

And in the task queue stats, this thread performed an order of magnitude more operations on the task queue:

149005 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 8 51757 51143 9173 121241 9379 0 0
149006 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 9 37328 36045 6291 113112 8332 0 0
149007 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 10 350079 77644 0 132 0 267666 226166
149008 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 11 38889 37932 7100 110650 7940 0 0
149009 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats ] GC(581) 12 53762 53174 9375 110255 8637 0 0

We traced the origin of this behavior to: https://bugs.openjdk.java.net/browse/JDK-8213108

We tried changing ParGCArrayScanChunk, but to no avail. Introducing a flag to override the ergonomics that select chunks per region was effective.
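
The original version of this change added an option for this (the flag name G1RemSetScanChunksPerRegion is the one mentioned later in this thread; the wiring below is only a rough sketch of the idea, not the actual patch, and the final change adjusts the default instead):

    // Sketch only: let an explicit flag override the ergonomically selected
    // number of scan chunks per region, falling back to the existing formula.
    static uint get_chunks_per_region(uint log_region_size) {
      if (!FLAG_IS_DEFAULT(G1RemSetScanChunksPerRegion)) {
        return (uint)G1RemSetScanChunksPerRegion;  // explicit user override
      }
      return 1u << (log_region_size / 2 - 7);      // current ergonomic default
    }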


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8278824

Issue

  • JDK-8278824: Uneven work distribution when scanning heap roots in G1 ⚠️ Title mismatch between PR and JBS.

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6840/head:pull/6840
$ git checkout pull/6840

Update a local copy of the PR:
$ git checkout pull/6840
$ git pull https://git.openjdk.java.net/jdk pull/6840/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6840

View PR using the GUI difftool:
$ git pr show -t 6840

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6840.diff

@bridgekeeper (bot) commented Dec 14, 2021

👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 14, 2021
@openjdk (bot) commented Dec 14, 2021

@earthling-amzn The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Dec 14, 2021
@mlbridge (bot) commented Dec 14, 2021

Webrevs

@openjdk openjdk bot added rfr Pull request is ready for review and removed rfr Pull request is ready for review labels Dec 14, 2021
@tschatzl (Contributor)

Thanks for reporting this issue - nice find!

This is, as you correctly noted, an issue with work distribution during the Object Copy phase. There are known issues with work stealing that we have been working on over the last few weeks; the graph mentioned below shows the current results (fwiw, that work also gives good results without this change).

So this change improves on that by making the initial distribution of work better, which so far seems like a good solution for this particular case.

After reproducing this particular issue and some internal discussion, we think adding a new flag is something to avoid here. The main reason is that improving the default number of scan chunks seems side-effect free: tests so far do not show a regression, and the additional memory usage (and the effort to manage this scan chunk table memory) seems negligible.

The graph attached to the CR (github does not allow me to attach it here for some reason) shows pause times for your reproducer on jdk11, jdk17, and jdk17 with values of G1RemSetScanChunksPerRegion from 64 to 1024; the selected 256 seems to work very well for this case :). It also includes an improved-steal-multi line showing a recent prototype of the mentioned task queue changes (which covers more cases than this one).

So our suggestion is to set the default number of chunks per region for 16m regions to 128 or 256 (depending on further testing results) for JDK 18, and then backport that to 17. When we are ready to post the task queue changes (probably JDK 19), we may want to reconsider these defaults again.

Would that be an acceptable plan for you?

@earthling-amzn (Contributor, Author)

Thanks for taking a look at this!

I can understand why you'd be reluctant to add a new flag - perhaps it would become irrelevant with the changes you're describing for task queue stealing? On the other hand, having a flag here is the most flexible option and facilitates further testing. Changing the default for 16M regions to 256 is also sensible and would work in our scenario. Should we also use 256 for regions larger than 16M, or maybe even 512?

@tschatzl (Contributor) commented Dec 15, 2021

With the task queue changes these changes will be irrelevant, as the graph shows, and there are quite a few other significant pause time improvements from that work. However, that will likely land in the 19 release only, so if you want a quicker fix for older releases (mainly 17, but also 18), we should do it this way.

As for the changes, I would suggest only changing the existing formula from

    1u << (HeapRegion::LogOfHRGrainBytes / 2 - 7)

to

    1u << (HeapRegion::LogOfHRGrainBytes / 2 - 4)

or similar to get the 256 chunks for 16M regions (if I calculated correctly).
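
As a quick check of that arithmetic (16M regions correspond to log_region_size == 24, per the assert in the patch below):

    // 16M regions: log_region_size == 24
    //   old: 1u << (24 / 2 - 7) == 1u << 5 ==  32 chunks per region
    //   new: 1u << (24 / 2 - 4) == 1u << 8 == 256 chunks per region
    static_assert((1u << (24 / 2 - 4)) == 256, "256 chunks for 16M regions");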

@tschatzl (Contributor) commented Dec 16, 2021

The ParGCArrayScanChunk option has a different purpose: when scanning recently copied objArrays, it controls how large a part of the array is scanned by a single task from the task queue.
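
For context, the rough idea behind that option is chunked scanning of large object arrays, so that the remainder of a partially scanned array can be stolen by other workers. A simplified sketch (not the actual HotSpot code; process_element() and push_partial_array_task() are made-up helpers):

    // Scan at most ParGCArrayScanChunk elements of a recently copied objArray,
    // pushing the remainder back onto the task queue for other threads to steal.
    void scan_obj_array_chunk(objArrayOop array, int from) {
      int end = MIN2(from + (int)ParGCArrayScanChunk, array->length());
      if (end < array->length()) {
        push_partial_array_task(array, end);   // continuation, stealable
      }
      for (int i = from; i < end; i++) {
        process_element(array->obj_at(i));     // evacuate/mark each element
      }
    }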

This change by itself will also give some improvements for other applications. We did not notice perf regressions when using a (fixed) value of 256 for this size. I'll also start a perf run with my suggestion.

With configurable card size in 18, we might want to adjust this value based on that as well in the future; I'll file an issue to investigate this specifically. I think that is out of scope for fixing this particular regression, though; potentially it makes sense to go for something like a fixed chunk size instead of something dependent on region size and, in particular, card size.

@earthling-amzn earthling-amzn changed the title 8278824: Add an option for G1 to configure chunks per region 8278824: Increase chunks per region for G1 vm root scan Dec 21, 2021
@earthling-amzn (Contributor, Author)

Sorry for the delayed response. Log4Shell and vacation plans kept me away from this. I've updated the PR as you requested. Thanks again.

@tschatzl (Contributor) left a comment

This looks good to me - but could you rebase this on the jdk18 tree (and just withdraw this change for tip)? Then we would not need to manually backport this change for the 18 release.

That's why I am not approving this right away but requesting changes instead.

The 18 changes will then be automatically forward-ported to 19 the next time we do that (every ~Thursday).

Thanks,
Thomas

@@ -114,7 +114,7 @@ class G1RemSetScanState : public CHeapObj<mtGC> {
     // values for region size.
     assert(log_region_size >= 20 && log_region_size <= 29,
            "expected value in [20,29], but got %u", log_region_size);
-    return 1u << (log_region_size / 2 - 7);
+    return 1u << (log_region_size / 2 - 4);

Contributor

The comment above also needs to be adjusted to this change, octupling the mentioned numbers.
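
For reference, the numbers implied by the formula do octuple across the supported range of region sizes (computed from the formula itself, not quoted from the existing comment):

    // 1M regions   (log_region_size == 20):   8 ->   64 chunks per region
    // 512M regions (log_region_size == 29): 128 -> 1024 chunks per region
    static_assert((1u << (20 / 2 - 7)) ==   8 && (1u << (20 / 2 - 4)) ==   64, "1M");
    static_assert((1u << (29 / 2 - 7)) == 128 && (1u << (29 / 2 - 4)) == 1024, "512M");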

@earthling-amzn (Contributor, Author)

Rebased onto 18 here: openjdk/jdk18#66 (also updated figures in the comment).

Labels

  • hotspot-gc (hotspot-gc-dev@openjdk.org)
  • rfr (Pull request is ready for review)

2 participants