8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal #10256
Conversation
👋 Welcome back kbarrett! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.
@kimbarrett has indicated that a compatibility and specification (CSR) request is needed for this pull request. @kimbarrett please create a CSR request for issue JDK-8137022 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.
@kimbarrett The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
Took a first pass over the code and I really like the new approach. Feels much simpler to reason about and understand.
I have a few small suggestions/questions.
There is now an associated draft CSR for folks to review: https://bugs.openjdk.org/browse/JDK-8294206
double alloc_region_rate = analytics->predict_alloc_rate_ms();
double alloc_bytes_rate = alloc_region_rate * HeapRegion::GrainBytes;
if (alloc_bytes_rate == 0.0) {
  _predicted_time_until_next_gc_ms = 0.0;
I don't get why time-till-next-gc is zero when the alloc rate is at its minimum -- by continuity, I'd expect this to be the max of time-till-next-gc, i.e. one_hour_ms.
A zero value for the prediction indicates that we don't have a valid
prediction, because we're starting up the VM and don't have data yet. That
puts us in a place where we can't really do much with the predictive model.
Later clauses (no refinement needed, or time is short) also "handle" this
case, though maybe there should be some commentary about that.
A zero value for the prediction indicates that we don't have a valid
prediction
Why not? It's still possible that the alloc-rate is zero after start-up; I mean alloc-rate is up to applications.
On a related note, there's special treatment for a too-close upcoming GC pause later on, in if (_predicted_time_until_next_gc_ms > _update_period_ms). Shouldn't there be something similar for a too-far upcoming GC pause? IOW, incoming_rate * _predicted_time_until_next_gc_ms would be unreliable for a farther prediction, right?
Why not? It's still possible that the alloc-rate is zero after start-up; I mean alloc-rate is up to applications.
Allocation rate is the rate of allocation between GCs; to have a GC, you (almost always) need a non-zero allocation rate (the exception being periodic GCs).
I've added some comments regarding the alloc_bytes_rate == 0 case.
I don't think we need to do anything special for predicted GC times being far
away, because we'll be periodically re-evaluating.
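The zero-rate handling discussed above can be sketched as follows. This is an illustrative stand-in, not the actual HotSpot code; the function and parameter names are made up for the example.

```cpp
#include <cassert>

// Hypothetical sketch of the prediction behavior discussed above: a zero
// allocation rate means "no data yet" (e.g. during VM startup), so the
// controller reports 0.0 ms, and later clauses treat that as "no valid
// prediction" rather than extrapolating to a huge time-until-GC.
static double predict_time_until_next_gc_ms(double alloc_bytes_per_ms,
                                            double free_bytes) {
  if (alloc_bytes_per_ms == 0.0) {
    return 0.0;  // no valid prediction yet; handled by later clauses
  }
  return free_bytes / alloc_bytes_per_ms;
}
```

Because the estimate is periodically re-evaluated, an unreliable far-future prediction self-corrects on the next evaluation, which is why no special too-far case is needed.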
double G1Analytics::predict_dirtied_cards_in_thread_buffers() const {
  return predict_zero_bounded(_dirtied_cards_in_thread_buffers_seq);
}
I believe this sequence captures #dirty cards in thread buffers at GC pause start. However, I don't think it follows a normal distribution because of the buffer-size clamping. In comparison, the dirty-cards-generation-rate (predict_dirtied_cards_rate_ms) more likely follows a normal distribution.
The number of dirty cards in thread buffers at the start of GC pause is
exactly what this is supposed to capture. We discount the number of cards that
can be processed in the budgeted time by this prediction to get the target
number of cards in the queue. It's not a very accurate prediction, but it's
still worth doing. For some applications and configurations I've tested (with
low G1RSetUpdatingPauseTimePercent) it might be 5-10% of the target. A model
based on the number of threads tends to do very poorly for some applications.
This is entirely different from predict_dirtied_cards_rate_ms, which is a
different value and has different uses.
My reasoning is that #cards in the bounded thread-buffers doesn't necessarily follow a normal distribution, so one can't predict future values using avg + std. Taking an extreme example of a single thread-buffer, if the population avg is ~buffer_capacity, #cards in the thread-buffer can exhibit large jumps between 0 and ~buffer_capacity due to the implicit modulo operation.
It's not a very accurate prediction, but it's
still worth doing.
Which benchmark shows its effect? I hard-coded size_t predicted_thread_buffer_cards = 0; in G1Policy::record_young_collection_end but can't see much difference. Normally, #cards in the global dirty queue should be >> #cards in thread-local buffers.
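The "implicit modulo" behavior the reviewer describes can be illustrated with a minimal sketch (names are hypothetical, not HotSpot's): with a single thread buffer of capacity C, the cards still buffered at pause start are roughly the total dirtied count modulo C, so successive samples can jump between near 0 and near C even at a perfectly steady dirtying rate.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model: full buffers are enqueued to the global queue as
// they fill, so only the remainder stays in the thread-local buffer.
// Samples of this remainder are far from normally distributed when the
// mean is near the buffer capacity.
static size_t cards_left_in_buffer(size_t total_dirtied, size_t capacity) {
  return total_dirtied % capacity;
}
```

For example, at capacity 256, totals of 255 and 257 leave 255 and 1 cards buffered respectively, despite differing by only two dirtied cards.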
Here's an example log line from my development machine:
[241.020s][debug][gc,ergo,refine ] GC(27) GC refinement: goal: 86449 + 8201 / 2.00ms, actual: 100607 / 2.32ms, HCC: 1024 / 0.00ms (exceeded goal)
Note the cards in thread buffers prediction (8201) is approaching 10% of the goal.
This is from specjbb2015 with
-Xmx40g -XX:MaxGCPauseMillis=100 -XX:G1RSetUpdatingPauseTimePercent=2
on a machine with 32 cores.
specjbb2015 with default pause time and refinement budget probably won't see
much impact from the cards still in buffers because the goal will be so much
larger. OTOH, such a configuration also probably does very little concurrent
refinement.
Lest one thinks that configuration is unreasonable or unlikely, part of the
point of this change is to improve the behavior with a smaller percentage of a
pause budgeted for refinement. That allows more time in a pause for other
things, like evacuation. (Even with that more restrictive configuration
specjbb2015 still doesn't do much concurrent refinement. For example, during
the mutator phase before that GC there was never more than one refinement
thread running, and it was only running for about the last 5% of the phase.)
I'm using the prediction infrastructure to get a moving average over several
recent samples, to get a number that has some basis. The stdev implicit in
that infrastructure makes the result a bit higher than the average. I think it
probably doesn't matter much, as neither the inputs nor the calculations that
use them are very precise. But the behavior does seem to be worse (in the
sense of more frequently blowing the associated budget and by larger amounts)
if this isn't accounted for to some extent.
But maybe your point is more about the stddev, and that should not be
included. I can see that, and could just use the moving average.
I experimented with using the average rather than the prediction. The
difference between the two is not large (estimating just by eye, average
difference is in the neighborhood of 10%). Using the average seems more
appropriate though, so I don't mind changing to it.
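The two predictors being compared can be sketched as below. These are assumed formulas for illustration, not the exact G1 TruncatedSeq math: the "prediction" pads the moving average with one standard deviation, while the plain average drops that pad.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain moving average over recent samples.
static double average(const std::vector<double>& s) {
  double sum = 0.0;
  for (double v : s) sum += v;
  return sum / s.size();
}

// Prediction padded by one (population) standard deviation, making it a
// bit higher than the average whenever the samples vary.
static double predict_with_stddev(const std::vector<double>& s) {
  double avg = average(s);
  double var = 0.0;
  for (double v : s) var += (v - avg) * (v - avg);
  return avg + std::sqrt(var / s.size());
}
```

For samples {2, 6} the average is 4 while the padded prediction is 6, showing how the stddev term inflates the estimate when samples jump around, as clamped buffer counts do.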
Some typos. Going to do some testing.
Did some testing with the patch to get a feel for the log changes and found a couple of things that we might want to improve.
@tschatzl did some additional performance testing and found a regression.
LGTM!
}
// Flush dirty card queues to qset, so later phases don't need to account
// for partially filled per-thread queues and such.
flush_dirty_card_queues();
Can also fix the existing issue to call flush_dirty_card_queues() a bit earlier, if you hadn't already filed it as a separate bug.
Already a bug for this - https://bugs.openjdk.org/browse/JDK-8295319. I'm still thinking about / looking at that one. I'm not sure moving it earlier is correct. Might be that record_concurrent_refinement_stats should be combined with concatenate_logs, maybe where concatenate_logs / flush_dirty_card_queues is now. I think it can be addressed separately from the new concurrent refinement controller.
Lgtm, see minor comments.
I think predict_dirtied_cards_in_thread_buffers should not use avg + stddev as the prediction, since it doesn't follow a normal distribution, but if the measured results are good enough, this is fine as well.
@kimbarrett This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 4 new commits pushed to the target branch. Please see this link for an up-to-date comparison between the source branch of this pull request and the target branch. ➡️ To integrate this PR with the above commit message, use the /integrate command.
Thanks @kstefanj , @tschatzl , @albertnetymk , @walulyai for reviews and additional performance testing.
/integrate
Going to push as commit 028e8b3.
Your commit was automatically rebased without conflicts.
@kimbarrett Pushed as commit 028e8b3. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal
8155996: Improve concurrent refinement green zone control
8134303: Introduce -XX:-G1UseConcRefinement
Please review this change to the control of concurrent refinement.
This new controller takes a different approach to the problem, addressing a
number of issues.
The old controller used a multiple of the target number of cards to determine
the range over which increasing numbers of refinement threads should be
activated, and finally activating mutator refinement. This has a variety of
problems. It doesn't account for the processing rate, the rate of new dirty
cards, or the time available to perform the processing. This often leads to
unnecessary spikes in the number of running refinement threads. It also tends
to drive the pending number to the target quickly and keep it there, removing
the benefit from having pending dirty cards filter out new cards for nearby
writes. It can't delay and leave excess cards in the queue because it could
be a long time before another buffer is enqueued.
The old controller was triggered by mutator threads enqueuing card buffers,
when the number of cards in the queue exceeded a threshold near the target.
This required a complex activation protocol between the mutators and the
refinement threads.
With the new controller there is a primary refinement thread that periodically
estimates how many refinement threads need to be running to reach the target
in time for the next GC, along with whether to also activate mutator
refinement. If the primary thread stops running because it isn't currently
needed, it sleeps for a period and reevaluates on wakeup. This eliminates any
involvement in the activation of refinement threads by mutator threads.
The estimate of how many refinement threads are needed uses a prediction of
time until the next GC, the number of buffered cards, the predicted rate of
new dirty cards, and the predicted refinement rate. The number of running
threads is adjusted based on these periodically performed estimates.
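The periodic estimate described above can be sketched as follows. The names and the exact formula are illustrative, not HotSpot's: given the cards already pending, the predicted incoming rate, the predicted per-thread refinement rate, and the predicted time until the next GC, compute how many refinement threads must run for the pending count to reach the target in time.

```cpp
#include <cassert>
#include <cmath>

// Hedged sketch of the thread-count estimate. If refinement did nothing,
// the queue at GC start would hold pending + incoming * time cards; the
// threads together must remove the excess over the target within the
// remaining time.
static unsigned refinement_threads_needed(double pending_cards,
                                          double incoming_cards_per_ms,
                                          double cards_per_ms_per_thread,
                                          double target_cards,
                                          double time_until_gc_ms) {
  double cards_at_gc = pending_cards + incoming_cards_per_ms * time_until_gc_ms;
  double excess = cards_at_gc - target_cards;
  if (excess <= 0.0) {
    return 0;  // target will be met without running any refinement threads
  }
  double rate_needed = excess / time_until_gc_ms;
  return (unsigned)std::ceil(rate_needed / cards_per_ms_per_thread);
}
```

Returning 0 when the target will be met anyway is what lets excess cards sit in the queue until late in the mutator phase, which is the behavior the next paragraph describes.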
This new approach allows more dirty cards to be left in the queue until late
in the mutator phase, typically reducing the rate of new dirty cards, which
reduces the amount of concurrent refinement work needed.
It also smooths out the number of running refinement threads, eliminating the
unnecessarily large spikes that are common with the old method. One benefit
is that the number of refinement threads (lazily) allocated is often much
lower now. (This plus UseDynamicNumberOfGCThreads mitigates the problem
described in JDK-8153225.)
This change also provides a new method for calculating the number of dirty
cards that should be pending at the start of a GC. While this calculation is
conceptually distinct from the thread control, the two were significantly
intertwined in the old controller. Changing this calculation separately and
first would have changed the behavior of the old controller in ways that might
have introduced regressions. Changing it after the thread control was changed
would have made it more difficult to test and measure the thread control in a
desirable configuration.
The old calculation had various problems that are described in JDK-8155996.
In particular, it can get more or less stuck at low values, and is slow to
respond to changes.
The old controller provided a number of product options, none of which were
very useful for real applications, and none of which are very applicable to
the new controller. All of these are being obsoleted.
-XX:-G1UseAdaptiveConcRefinement
-XX:G1ConcRefinementGreenZone=
-XX:G1ConcRefinementYellowZone=
-XX:G1ConcRefinementRedZone=
-XX:G1ConcRefinementThresholdStep=
The new controller could use G1ConcRefinementGreenZone to provide a fixed
value for the target number of cards, though it is poorly named for that.
A configuration that was useful for some kinds of debugging and testing was to
disable G1UseAdaptiveConcRefinement and set G1ConcRefinementGreenZone to a
very large value, effectively disabling concurrent refinement. To support
this use case with the new controller, the -XX:-G1UseConcRefinement diagnostic
option has been added (see JDK-8155996).
The other options are meaningless for the new controller.
Because of these option changes, a CSR and a release note need to accompany
this change.
Testing:
mach5 tier1-6
various performance tests.
local (linux-x64) tier1 with -XX:-G1UseConcRefinement
Performance testing found no regressions, but also little or no improvement
with default options, which was expected. With default options most of our
performance tests do very little concurrent refinement. And even for those
that do, while the old controller had a number of problems, the impact of
those problems is small and hard to measure for most applications.
When reducing G1RSetUpdatingPauseTimePercent the new controller seems to fare
better, particularly when also reducing MaxGCPauseMillis. specjbb2015 with
MaxGCPauseMillis=75 and G1RSetUpdatingPauseTimePercent=3 (and other options
held constant) showed a statistically significant improvement of about 4.5%
for critical-jOPS. Using the changed controller, the difference between this
configuration and the default is fairly small, while the baseline shows
significant degradation with the more restrictive options.
For all tests and configurations the new controller often creates many fewer
refinement threads.
/issue add 8155996
/issue add 8134303
/csr
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10256/head:pull/10256
$ git checkout pull/10256
Update a local copy of the PR:
$ git checkout pull/10256
$ git pull https://git.openjdk.org/jdk pull/10256/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10256
View PR using the GUI difftool:
$ git pr show -t 10256
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10256.diff