
Change JVM GC duration metric from milliseconds to seconds #3414

Closed
trask opened this issue Apr 19, 2023 · 8 comments · Fixed by #3458
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory

Comments

@trask
Member

trask commented Apr 19, 2023

What are you trying to achieve?

Follow the decision that was made in #2977 and change JVM GC duration unit to seconds.

EDIT: also decide on bucket boundaries.
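As context for the unit change, the conversion itself is a simple division at the point of measurement. A minimal, hypothetical sketch (not the actual instrumentation) using the standard JMX beans, which report cumulative collection time in milliseconds:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcDurationSeconds {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // JMX reports cumulative collection time in milliseconds (or -1 if
            // unsupported); per this issue, the metric should be in seconds.
            long millis = gc.getCollectionTime();
            double seconds = millis < 0 ? 0.0 : millis / 1000.0;
            System.out.printf("%s: %.3f s over %d collections%n",
                    gc.getName(), seconds, gc.getCollectionCount());
        }
    }
}
```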

@trask trask added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory labels Apr 19, 2023
@trask trask assigned trask and unassigned jmacd Apr 20, 2023
@jack-berg
Member

One idea for the advice bucket boundaries would be to recommend no buckets at all, downgrading the histogram to a summary by default with min, max, count, and sum (and implicitly average).

Otherwise, we would need to find some way to base the defaults off of real world data.
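As a sketch of what that "summary" shape looks like with no buckets at all (the class name and API here are hypothetical, not the SDK's aggregator), only min, max, count, and sum are kept, with average derived from the last two:

```java
/** Hypothetical summary aggregator: no buckets, just min/max/count/sum. */
public class DurationSummary {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;
    private double sum = 0.0;

    /** Record one GC duration, in seconds per this issue's unit change. */
    public synchronized void record(double seconds) {
        min = Math.min(min, seconds);
        max = Math.max(max, seconds);
        count++;
        sum += seconds;
    }

    public synchronized long count() { return count; }
    public synchronized double sum() { return sum; }
    public synchronized double min() { return min; }
    public synchronized double max() { return max; }

    /** Average is implicit: sum / count. */
    public synchronized double average() {
        return count == 0 ? 0.0 : sum / count;
    }
}
```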

@pirgeo
Member

pirgeo commented Apr 25, 2023

Otherwise, we would need to find some way to base the defaults off of real world data.

I guess that is the whole point of the advice API, but is this going to lead to us having different bucket boundaries for every semantic convention?

@kittylyst

Otherwise, we would need to find some way to base the defaults off of real world data.

That's going to be very difficult to do in a general way. For smallish heaps (<2G), sub-1ms for young collections and sub-200ms for Parallel (STW) old collections is fine.

But for G1 those numbers, especially for the old collections, are different. And for a super big heap they'll be different again.

@breedx-splk
Contributor

I agree with @kittylyst; it will be challenging to find a one-size-fits-all scheme here.

To help inform my own thoughts on this, I thought it might be useful to look at GC logs from a real-world example. This data comes from one of our production services that is using G1; the instance has been running for more than 7 days. The x-axis is GC id, the y-axis is milliseconds:

[chart: GC duration per GC id, linear y-axis]

and the same data with a logarithmic y-axis:

[chart: same data, logarithmic y-axis]

Not sure if other folks want to (or can) contribute other GC data distributions, or if it's too academic to even look at these.

@kittylyst

I don't think it's too academic to look at these - but we should be clear about what we're looking at.

Is this both young and old collections? Or just old? And what's the time - is it total STW pause time or GC duration (which for G1 Old should be mostly concurrent)?

@breedx-splk
Contributor

Is this both young and old collections? Or just old? And what's the time - is it total STW pause time or GC duration (which for G1 Old should be mostly concurrent)?

Right. I did the simplest thing and just looked at every GC in the gc.log file. Specifically, I took the last occurrence of a milliseconds value like 123.456ms in each GC "block". So I think the above includes both STW and concurrent GC, as well as old and young collections.
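For reference, that extraction approach can be sketched roughly like this (the class name and regex are my reconstruction for illustration, not the actual script used):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogMs {
    // Matches duration values like "123.456ms" anywhere in a GC log block.
    private static final Pattern MS = Pattern.compile("(\\d+(?:\\.\\d+)?)ms");

    /** Returns the last milliseconds value found in the block, or -1 if none. */
    public static double lastMs(String gcBlock) {
        Matcher m = MS.matcher(gcBlock);
        double last = -1.0;
        while (m.find()) {
            last = Double.parseDouble(m.group(1));
        }
        return last;
    }
}
```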

@kittylyst I can separate the data, but I'm not confident about which boundary/ies would be relevant/important. The current spec is a little fuzzy here as well.

@jack-berg
Member

I guess that is the whole point of the advice API, but is this going to lead to us having different bucket boundaries for every semantic convention?

It could, but doing so would lead to a bunch of conversations that repeat the same line of reasoning:

  • Do we have real world data to inform what the boundaries should be?
  • How do we consider scenarios that fall outside the norms?
  • Given that histogram size is proportional to the number of buckets, what's the right number and distribution of buckets?

Of course users will always be able to get what they want by using view and metric reader configuration, but we'll still worry / debate over the defaults as if we're forcing them on users.

Here are a couple options I see:

  • Downgrade to summary by specifying advice bucket boundaries of an empty list. This sidesteps the issues of deciding the right number of buckets and their boundaries, and makes it the user's responsibility to opt into an actual histogram by specifying explicit bucket boundaries or an exponential histogram.
  • Extend advice to include exponential histogram preference and use it. This capability doesn't exist yet but was discussed here. Exponential histograms automatically adjust to the range of measurements recorded, so the only decision for users is how many buckets to allocate.
  • Specify explicit bucket boundaries informed by real world data, acknowledging that we will never make everyone happy.
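For the exponential-histogram option, the "automatically adjusts" property comes from its value-to-bucket mapping. A simplified illustration (not the SDK implementation) of the base-2 mapping, where bucket i covers (base^i, base^(i+1)] with base = 2^(2^-scale), so index = ceil(log2(value) * 2^scale) - 1:

```java
public class ExpBucket {
    // Simplified exponential-histogram mapping: bucket i covers
    // (base^i, base^(i+1)] where base = 2^(2^-scale). Higher scale
    // means finer buckets; the range adapts to whatever is recorded.
    public static long bucketIndex(double value, int scale) {
        double log2 = Math.log(value) / Math.log(2.0);
        return (long) Math.ceil(log2 * Math.pow(2.0, scale)) - 1;
    }
}
```

At scale 0 the buckets are simply powers of two, e.g. a 3 s pause falls in bucket 1, covering (2, 4].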

In the case of gc duration, my preference would be to downgrade to summary. This is probably not the right decision for all histograms, but for gc duration the most important information is contained in min, max, sum, count. The distribution is a luxury.

There's a related question about whether a change to histogram bucket boundary advice is a breaking change. We've gone back and forth on whether changing the default bucket boundaries is breaking, but seem to have landed on it indeed being breaking since duration units have been changed from ms to s and we haven't changed the defaults. Maybe advice bucket boundaries are different, but given that advice can influence the default behavior of the SDK, maybe the same applies? cc @jsuereth

@trask
Member Author

trask commented Apr 28, 2023

Downgrade to summary

I like this option. I'll check out existing histogram metrics in the spec to see if we can make any general recommendation around this.
