
8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal #10256

Closed
wants to merge 25 commits

Conversation


@kimbarrett kimbarrett commented Sep 14, 2022

8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal
8155996: Improve concurrent refinement green zone control
8134303: Introduce -XX:-G1UseConcRefinement

Please review this change to the control of concurrent refinement.

This new controller takes a different approach to the problem, addressing a
number of issues.

The old controller used a multiple of the target number of cards to determine
the range over which increasing numbers of refinement threads should be
activated, finally also activating mutator refinement. This has a variety of
problems. It doesn't account for the processing rate, the rate of new dirty
cards, or the time available to perform the processing. This often leads to
unnecessary spikes in the number of running refinement threads. It also tends
to drive the number of pending cards to the target quickly and keep it there,
removing the benefit of having pending dirty cards filter out new cards for
nearby writes. It can't delay and leave excess cards in the queue, because it
could be a long time before another buffer is enqueued.

The old controller was triggered by mutator threads enqueuing card buffers
when the number of cards in the queue exceeded a threshold near the target.
This required a complex activation protocol between the mutators and the
refinement threads.

With the new controller there is a primary refinement thread that periodically
estimates how many refinement threads need to be running to reach the target
in time for the next GC, along with whether to also activate mutator
refinement. If the primary thread stops running because it isn't currently
needed, it sleeps for a period and reevaluates on wakeup. This eliminates any
involvement by mutator threads in the activation of refinement threads.

The estimate of how many refinement threads are needed uses a prediction of
time until the next GC, the number of buffered cards, the predicted rate of
new dirty cards, and the predicted refinement rate. The number of running
threads is adjusted based on these periodically performed estimates.
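
A rough sketch of that estimate (a minimal sketch with illustrative names;
the actual logic lives in g1ConcurrentRefine.cpp and uses G1Analytics
predictions):

#include <cmath>
#include <cstddef>

// Illustrative only: estimate how many refinement threads should run so
// that the number of pending dirty cards reaches the target by the next GC.
unsigned wanted_refinement_threads(double time_until_gc_ms,
                                   size_t pending_cards,
                                   double new_cards_per_ms,
                                   double refined_cards_per_ms_per_thread,
                                   size_t target_cards) {
  // Cards expected to be pending at the next GC if no refinement runs.
  double projected = pending_cards + new_cards_per_ms * time_until_gc_ms;
  if (projected <= target_cards) {
    return 0;  // Stay lazy; queued cards filter out writes to nearby cards.
  }
  // Excess cards that must be refined before the GC, and the number of
  // cards one thread is predicted to refine in the remaining time.
  double excess = projected - target_cards;
  double per_thread_cards = refined_cards_per_ms_per_thread * time_until_gc_ms;
  if (per_thread_cards <= 0.0) {
    return 1;  // No usable rate prediction yet; run a single thread.
  }
  // If this exceeds the available refinement threads, the controller also
  // activates mutator refinement.
  return (unsigned)std::ceil(excess / per_thread_cards);
}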

This new approach allows more dirty cards to be left in the queue until late
in the mutator phase, typically reducing the rate of new dirty cards, which
reduces the amount of concurrent refinement work needed.

It also smooths out the number of running refinement threads, eliminating the
unnecessarily large spikes that are common with the old method. One benefit
is that the number of refinement threads (lazily) allocated is often much
lower now. (This plus UseDynamicNumberOfGCThreads mitigates the problem
described in JDK-8153225.)

This change also provides a new method for calculating the number of dirty
cards that should be pending at the start of a GC. While this calculation is
conceptually distinct from the thread control, the two were significantly
intertwined in the old controller. Changing this calculation separately and
first would have changed the behavior of the old controller in ways that might
have introduced regressions. Changing it after the thread control was changed
would have made it more difficult to test and measure the thread control in a
desirable configuration.
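
As a sketch of the shape of the new calculation (the option names are real;
the prediction inputs and the clamping are illustrative assumptions drawn
from the review discussion below):

#include <algorithm>
#include <cstddef>

// Illustrative only: target number of pending dirty cards at GC start,
// derived from the in-pause refinement budget.
size_t target_pending_cards(double max_pause_ms,           // MaxGCPauseMillis
                            double updating_pause_percent, // G1RSetUpdatingPauseTimePercent
                            double pause_refine_cards_per_ms,
                            double predicted_thread_buffer_cards) {
  // Pause time budgeted for processing pending dirty cards.
  double budget_ms = max_pause_ms * (updating_pause_percent / 100.0);
  // Cards the pause can process in that budget, discounted by cards
  // predicted to still sit in per-thread buffers at pause start, since
  // those must also be processed within the budget.
  double cards = budget_ms * pause_refine_cards_per_ms;
  return (size_t)std::max(cards - predicted_thread_buffer_cards, 0.0);
}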

The old calculation had various problems that are described in JDK-8155996.
In particular, it can get more or less stuck at low values, and is slow to
respond to changes.

The old controller provided a number of product options, none of which were
very useful for real applications, and none of which are very applicable to
the new controller. All of these are being obsoleted.

-XX:-G1UseAdaptiveConcRefinement
-XX:G1ConcRefinementGreenZone=
-XX:G1ConcRefinementYellowZone=
-XX:G1ConcRefinementRedZone=
-XX:G1ConcRefinementThresholdStep=

The new controller could use G1ConcRefinementGreenZone to provide a fixed
value for the target number of cards, though it is poorly named for that.

A configuration that was useful for some kinds of debugging and testing was to
disable G1UseAdaptiveConcRefinement and set G1ConcRefinementGreenZone to a
very large value, effectively disabling concurrent refinement. To support
this use case with the new controller, the -XX:-G1UseConcRefinement diagnostic
option has been added (see JDK-8134303).
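
For example, the old debugging configuration and its new equivalent (values
illustrative; G1UseConcRefinement is diagnostic, so it needs unlocking):

$ java -XX:-G1UseAdaptiveConcRefinement -XX:G1ConcRefinementGreenZone=1000000 ...
$ java -XX:+UnlockDiagnosticVMOptions -XX:-G1UseConcRefinement ...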

The other options are meaningless for the new controller.

Because of these option changes, a CSR and a release note need to accompany
this change.

Testing:
mach5 tier1-6
various performance tests
local (linux-x64) tier1 with -XX:-G1UseConcRefinement

Performance testing found no regressions, but also little or no improvement
with default options, which was expected. With default options most of our
performance tests do very little concurrent refinement. And even for those
that do, while the old controller had a number of problems, the impact of
those problems is small and hard to measure for most applications.

When reducing G1RSetUpdatingPauseTimePercent the new controller seems to fare
better, particularly when also reducing MaxGCPauseMillis. specjbb2015 with
MaxGCPauseMillis=75 and G1RSetUpdatingPauseTimePercent=3 (and other options
held constant) showed a statistically significant improvement of about 4.5%
for critical-jOPS. Using the changed controller, the difference between this
configuration and the default is fairly small, while the baseline shows
significant degradation with the more restrictive options.
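
For reference, that configuration expressed as flags (benchmark invocation
elided):

$ java -XX:MaxGCPauseMillis=75 -XX:G1RSetUpdatingPauseTimePercent=3 ...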

Across all tests and configurations, the new controller often creates many fewer
refinement threads.

/issue add 8155996
/issue add 8134303

/csr


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires a CSR request to be approved

Issues

  • JDK-8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal
  • JDK-8155996: Improve concurrent refinement green zone control
  • JDK-8134303: Introduce -XX:-G1UseConcRefinement
  • JDK-8294206: Concurrent refinement thread adjustment and (de-)activation suboptimal (CSR)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10256/head:pull/10256
$ git checkout pull/10256

Update a local copy of the PR:
$ git checkout pull/10256
$ git pull https://git.openjdk.org/jdk pull/10256/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10256

View PR using the GUI difftool:
$ git pr show -t 10256

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10256.diff

@bridgekeeper

bridgekeeper bot commented Sep 14, 2022

👋 Welcome back kbarrett! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 14, 2022
@openjdk

openjdk bot commented Sep 14, 2022

@kimbarrett
Adding additional issue to issue list: 8155996: Improve concurrent refinement green zone control.

@openjdk

openjdk bot commented Sep 14, 2022

@kimbarrett
Adding additional issue to issue list: 8134303: Introduce -XX:-G1UseConcRefinement.

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Sep 14, 2022
@openjdk

openjdk bot commented Sep 14, 2022

@kimbarrett has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@kimbarrett please create a CSR request for issue JDK-8137022 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

@openjdk

openjdk bot commented Sep 14, 2022

@kimbarrett The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 14, 2022
@mlbridge

mlbridge bot commented Sep 14, 2022

Webrevs

Contributor

@kstefanj kstefanj left a comment

Took a first pass over the code and I really like the new approach. Feels much simpler to reason about and understand.

I have a few small suggestions/questions.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp (outdated; resolved)
src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp (outdated; resolved)
src/hotspot/share/gc/g1/g1Policy.cpp (outdated; resolved)
@kimbarrett
Author

There is now an associated draft CSR for folks to review: https://bugs.openjdk.org/browse/JDK-8294206

double alloc_region_rate = analytics->predict_alloc_rate_ms();
double alloc_bytes_rate = alloc_region_rate * HeapRegion::GrainBytes;
if (alloc_bytes_rate == 0.0) {
_predicted_time_until_next_gc_ms = 0.0;
Member

I don't get why time-till-next-gc is zero when the alloc rate is at its
minimum -- by continuity, I'd expect this to be the max of time-till-next-gc,
i.e. one_hour_ms.

Author

A zero value for the prediction indicates that we don't have a valid
prediction, because we're starting up the VM and don't have data yet. That
puts us in a place where we can't really do much with the predictive model.
Later clauses (no refinement needed, or time is short) also "handle" this
case, though maybe there should be some commentary about that.

Member

A zero value for the prediction indicates that we don't have a valid
prediction

Why not? It's still possible that the alloc-rate is zero after start-up; I mean alloc-rate is up to applications.

On a related note, there's special treatment later on for a too-close upcoming
GC pause: if (_predicted_time_until_next_gc_ms > _update_period_ms). Shouldn't
there be something similar for a too-far upcoming GC pause? In other words,
incoming_rate * _predicted_time_until_next_gc_ms would be unreliable for a
farther prediction, right?

Contributor

Why not? It's still possible that the alloc-rate is zero after start-up; I mean alloc-rate is up to applications.

Allocation rate is the rate of allocation between GCs; to have a GC, you
(almost always) need a non-zero allocation rate (except with periodic GCs).

Author

I've added some comments regarding the alloc_bytes_rate == 0 case.

I don't think we need to do anything special for predicted GC times being far
away, because we'll be periodically re-evaluating.
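
For reference, the snippet under discussion with that rationale spelled out
as comments (the comment wording is mine, not necessarily the patch's):

double alloc_region_rate = analytics->predict_alloc_rate_ms();
double alloc_bytes_rate = alloc_region_rate * HeapRegion::GrainBytes;
if (alloc_bytes_rate == 0.0) {
  // A zero rate means there is no valid prediction yet (e.g. VM startup,
  // before any samples exist).  Use 0.0 as a "no prediction" marker; the
  // later clauses (no refinement needed, or time too short) handle this
  // case safely.  No special case is needed for a far-off predicted GC,
  // because the estimate is re-evaluated periodically.
  _predicted_time_until_next_gc_ms = 0.0;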

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp (outdated; resolved)
Comment on lines 249 to 251
double G1Analytics::predict_dirtied_cards_in_thread_buffers() const {
return predict_zero_bounded(_dirtied_cards_in_thread_buffers_seq);
}
Member

I believe this sequence captures #dirty cards in thread buffers at GC pause start. However, I don't think it follows a normal distribution because of the buffer-size clamping. In comparison, the dirty-cards-generation-rate (predict_dirtied_cards_rate_ms) more likely follows a normal distribution.

Author

The number of dirty cards in thread buffers at the start of GC pause is
exactly what this is supposed to capture. We discount the number of cards that
can be processed in the budgeted time by this prediction to get the target
number of cards in the queue. It's not a very accurate prediction, but it's
still worth doing. For some applications and configurations I've tested (with
low G1RSetUpdatingPauseTimePercent) it might be 5-10% of the target. A model
based on the number of threads tends to do very poorly for some applications.

This is entirely different from predict_dirtied_cards_rate_ms, which is a
different value and has different uses.

Member

My reasoning is that #cards in the bounded thread-buffers doesn't necessarily
follow a normal distribution, so one can't predict future values using
avg + std. Taking an extreme example of a single thread-buffer, if the
population avg is ~buffer_capacity, #cards in the thread-buffer can exhibit
large jumps between 0 and ~buffer_capacity due to the implicit modulo
operation.

It's not a very accurate prediction, but it's
still worth doing.

Which benchmark shows its effect? I hard-coded size_t predicted_thread_buffer_cards = 0; in G1Policy::record_young_collection_end but can't see much difference. Normally, #cards in global dirty queue should be >> #cards in thread-local buffers.

Author

Here's an example log line from my development machine:
[241.020s][debug][gc,ergo,refine ] GC(27) GC refinement: goal: 86449 + 8201 / 2.00ms, actual: 100607 / 2.32ms, HCC: 1024 / 0.00ms (exceeded goal)
Note the cards in thread buffers prediction (8201) is approaching 10% of the goal.
This is from specjbb2015 with
-Xmx40g -XX:MaxGCPauseMillis=100 -XX:G1RSetUpdatingPauseTimePercent=2
on a machine with 32 cores.

specjbb2015 with default pause time and refinement budget probably won't see
much impact from the cards still in buffers because the goal will be so much
larger. OTOH, such a configuration also probably does very little concurrent
refinement.

Lest one think that configuration is unreasonable or unlikely, part of the
point of this change is to improve the behavior with a smaller percentage of a
pause budgeted for refinement. That allows more time in a pause for other
things, like evacuation. (Even with that more restrictive configuration
specjbb2015 still doesn't do much concurrent refinement. For example, during
the mutator phase before that GC there was never more than one refinement
thread running, and it was only running for about the last 5% of the phase.)

I'm using the prediction infrastructure to get a moving average over several
recent samples, to get a number that has some basis. The stddev implicit in
that infrastructure makes the result a bit higher than the average. I think it
probably doesn't matter much, as none of the inputs nor the calculations that
use them are very precise. But the behavior does seem to be worse (in the
sense of more frequently blowing the associated budget, and by larger amounts)
if this isn't accounted for to some extent.

But maybe your point is more about the stddev, and that it should not be
included. I can see that, and could just use the moving average.

Author

I experimented with using the average rather than the prediction. The
difference between the two is not large (estimating just by eye, average
difference is in the neighborhood of 10%). Using the average seems more
appropriate though, so I don't mind changing to it.
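
To make the average-versus-prediction distinction concrete, here is a tiny
standalone sketch (this Seq type only mimics HotSpot's TruncatedSeq and
G1Predictions; the sample values are made up):

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Toy statistics over recent samples of "dirty cards in thread buffers".
struct Seq {
  double sum = 0.0, sum_sq = 0.0;
  int n = 0;
  void add(double v) { sum += v; sum_sq += v * v; ++n; }
  double avg() const { return n > 0 ? sum / n : 0.0; }
  double sd() const {
    if (n < 2) return 0.0;
    double a = avg();
    return std::sqrt((sum_sq - n * a * a) / (n - 1));
  }
  // "Prediction" in the G1 style: the average padded by a multiple of the
  // standard deviation, so it sits a bit above the average.
  double predict(double sigma = 1.0) const { return avg() + sigma * sd(); }
};

int main() {
  Seq cards;
  for (double s : {7800.0, 8400.0, 7900.0, 8600.0, 8300.0}) cards.add(s);
  // Prints avg=8200 prediction=8539: the padded prediction overshoots the
  // plain average, which is the difference discussed above.
  std::printf("avg=%.0f prediction=%.0f\n", cards.avg(), cards.predict());
  return 0;
}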

Contributor

@tschatzl tschatzl left a comment

Some typos. Going to do some testing.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp (outdated; resolved)
src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp (outdated; resolved)

Contributor

@kstefanj kstefanj left a comment

Did some testing with the patch to get a feel for the log changes and found a couple of things that we might want to improve.

src/hotspot/share/gc/g1/g1Policy.cpp (resolved)
src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp (outdated; resolved)
@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed rfr Pull request is ready for review labels Oct 13, 2022
@kimbarrett
Author

kimbarrett commented Oct 16, 2022

@tschatzl did some additional performance testing and found a regression. The
periodic remset sampling, with its associated young gen target length
adjustment, can interact poorly with this change. This change tries to be
lazy about performing concurrent refinement, delaying it until late in the
mutator phase. Because the target length adjustment happens relatively
infrequently (300ms period by default), it may make a large change. But if
the length is reduced by enough to make a GC needed soon, it may be too late
to do much of the needed refinement. This can lead to significantly increased
pause times. The solution is to move the target length adjustment into the
primary refinement thread. Currently doing more perf testing.

@openjdk openjdk bot added rfr Pull request is ready for review and removed merge-conflict Pull request has merge conflict with target branch labels Oct 16, 2022
Member

@walulyai walulyai left a comment

LGTM!

}
// Flush dirty card queues to qset, so later phases don't need to account
// for partially filled per-thread queues and such.
flush_dirty_card_queues();
Member

You could also fix the existing issue of calling flush_dirty_card_queues() a bit earlier, if you hadn't already filed it as a separate bug.

Author

@kimbarrett kimbarrett Oct 16, 2022

Already a bug for this - https://bugs.openjdk.org/browse/JDK-8295319. I'm still thinking about / looking at that one. I'm not sure moving it earlier is correct. Might be that record_concurrent_refinement_stats should be combined with concatenate_logs, maybe where concatenate_logs / flush_dirty_card_queues is now. I think it can be addressed separately from the new concurrent refinement controller.

Contributor

@tschatzl tschatzl left a comment

Lgtm, see minor comments.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp (outdated; resolved)
src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp (outdated; resolved)
src/hotspot/share/gc/g1/g1_globals.hpp (resolved)
Member

@albertnetymk albertnetymk left a comment

I think predict_dirtied_cards_in_thread_buffers should not use avg + stddev as the prediction, since the underlying value doesn't follow a normal distribution; but if the measured results are good enough, this is fine as well.

@openjdk openjdk bot removed the csr Pull request needs approved CSR before integration label Oct 20, 2022
@openjdk

openjdk bot commented Oct 20, 2022

@kimbarrett This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal
8155996: Improve concurrent refinement green zone control
8134303: Introduce -XX:-G1UseConcRefinement

Reviewed-by: sjohanss, tschatzl, iwalulya, ayang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 4 new commits pushed to the master branch:

  • b37421e: 8295564: Norwegian Nynorsk Locale is missing formatting
  • 6707bfb: 8029633: Raw inner class constructor ref should not perform diamond inference
  • 7bc9692: 8294670: Enhanced switch statements have an implicit default which does not complete normally
  • 95dd376: 8291914: generated constructors are considered compact when they shouldn't

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 20, 2022
@kimbarrett
Author

Thanks @kstefanj, @tschatzl, @albertnetymk, @walulyai for reviews and additional performance testing.

@kimbarrett
Author

/integrate

@openjdk

openjdk bot commented Oct 20, 2022

Going to push as commit 028e8b3.
Since your change was applied there have been 7 commits pushed to the master branch:

  • faa6b66: 8295715: Minimize disabled warnings in serviceability libs
  • de1e0c5: 8295719: Remove unneeded disabled warnings in jdk.sctp
  • 9612cf9: 8295529: Add link to JBS to README.md
  • b37421e: 8295564: Norwegian Nynorsk Locale is missing formatting
  • 6707bfb: 8029633: Raw inner class constructor ref should not perform diamond inference
  • 7bc9692: 8294670: Enhanced switch statements have an implicit default which does not complete normally
  • 95dd376: 8291914: generated constructors are considered compact when they shouldn't

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 20, 2022
@openjdk openjdk bot closed this Oct 20, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 20, 2022
@openjdk

openjdk bot commented Oct 20, 2022

@kimbarrett Pushed as commit 028e8b3.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@kimbarrett kimbarrett deleted the crt2 branch October 20, 2022 20:45