
Change JVM GC duration metric from milliseconds to seconds #3414

Closed
trask opened this issue Apr 19, 2023 · 8 comments · Fixed by #3458
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory

Comments

@trask
Member

trask commented Apr 19, 2023

What are you trying to achieve?

Follow the decision that was made in #2977 and change JVM GC duration unit to seconds.

EDIT: also decide on bucket boundaries.
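As context for the unit change, the conversion itself is a simple division at the point of measurement. A minimal, hypothetical sketch (not the actual instrumentation) using the standard JMX beans, which report cumulative collection time in milliseconds:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcDurationSeconds {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // JMX reports cumulative collection time in milliseconds (or -1 if
            // unsupported); per this issue, the metric should be in seconds.
            long millis = gc.getCollectionTime();
            double seconds = millis < 0 ? 0.0 : millis / 1000.0;
            System.out.printf("%s: %.3f s over %d collections%n",
                    gc.getName(), seconds, gc.getCollectionCount());
        }
    }
}
```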

@trask trask added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory labels Apr 19, 2023
@trask trask assigned trask and unassigned jmacd Apr 20, 2023
@jack-berg
Member

One idea for the advice bucket boundaries would be to recommend no buckets at all, downgrading the histogram to a summary by default with min, max, count, and sum (and implicitly average).

Otherwise, we would need to find some way to base the defaults off of real world data.
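As a sketch of what that "summary" shape looks like with no buckets at all (the class name and API here are hypothetical, not the SDK's aggregator), only min, max, count, and sum are kept, with average derived from the last two:

```java
/** Hypothetical summary aggregator: no buckets, just min/max/count/sum. */
public class DurationSummary {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;
    private double sum = 0.0;

    /** Record one GC duration, in seconds per this issue's unit change. */
    public synchronized void record(double seconds) {
        min = Math.min(min, seconds);
        max = Math.max(max, seconds);
        count++;
        sum += seconds;
    }

    public synchronized long count() { return count; }
    public synchronized double sum() { return sum; }
    public synchronized double min() { return min; }
    public synchronized double max() { return max; }

    /** Average is implicit: sum / count. */
    public synchronized double average() {
        return count == 0 ? 0.0 : sum / count;
    }
}
```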

@pirgeo
Member

pirgeo commented Apr 25, 2023

Otherwise, we would need to find some way to base the defaults off of real world data.

I guess that is the whole point of the advice API, but is this going to lead to us having different bucket boundaries for every semantic convention?

@kittylyst

Otherwise, we would need to find some way to base the defaults off of real world data.

That's going to be very difficult to do in a general way. For smallish heaps (<2G), sub-1ms for young collections and sub-200ms for Parallel (STW) old collections is fine.

But for G1 those numbers, especially for the old collections, are different. And for a super big heap they'll be different again.

@breedx-splk
Contributor

I agree with @kittylyst; it will be challenging to find a one-size-fits-all scheme here.

To help inform my own thoughts on this, I thought it might be useful to look at GC logs from a real-world example. This data comes from one of our production services that is using G1; the instance has been running for more than 7 days. The x-axis is GC id, the y-axis is milliseconds:

[chart: GC duration per GC id, linear y-axis]

and the same data with a logarithmic y-axis:

[chart: same data, logarithmic y-axis]

Not sure if other folks want to (or can) contribute other GC data distributions, or if it's too academic to even look at these.

@kittylyst

I don't think it's too academic to look at these - but we should be clear about what we're looking at.

Is this both young and old collections? Or just old? And what's the time - is it total STW pause time or GC duration (which for G1 Old should be mostly concurrent)?

@breedx-splk
Contributor

Is this both young and old collections? Or just old? And what's the time - is it total STW pause time or GC duration (which for G1 Old should be mostly concurrent)?

Right. I did the simplest thing and just looked at every GC in the gc.log file. Specifically, I took the last occurrence of a milliseconds value like 123.456ms in each GC "block". So I think the above includes both STW and concurrent GC, as well as old and young collections.
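For reference, that extraction approach can be sketched roughly like this (the class name and regex are my reconstruction for illustration, not the actual script used):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogMs {
    // Matches duration values like "123.456ms" anywhere in a GC log block.
    private static final Pattern MS = Pattern.compile("(\\d+(?:\\.\\d+)?)ms");

    /** Returns the last milliseconds value found in the block, or -1 if none. */
    public static double lastMs(String gcBlock) {
        Matcher m = MS.matcher(gcBlock);
        double last = -1.0;
        while (m.find()) {
            last = Double.parseDouble(m.group(1));
        }
        return last;
    }
}
```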

@kittylyst I can separate the data, but I'm not confident about which boundary/ies would be relevant/important. The current spec is a little fuzzy here as well.

@jack-berg
Member

I guess that is the whole point of the advice API, but is this going to lead to us having different bucket boundaries for every semantic convention?

It could, but doing so would lead to a bunch of conversations that repeat the same line of reasoning:

  • Do we have real world data to inform what the boundaries should be?
  • How do we consider scenarios that fall outside the norms?
  • Given that histogram size is proportional to the number of buckets, what's the right number and distribution of buckets?

Of course users will always be able to get what they want by using view and metric reader configuration, but we'll still worry / debate over the defaults as if we're forcing them on users.

Here are a couple options I see:

  • Downgrade to summary by specifying advice bucket boundaries of an empty list. This sidesteps the issues of deciding the right number of buckets and their boundaries, and makes it the user's responsibility to opt into an actual histogram by specifying explicit bucket boundaries or an exponential histogram.
  • Extend advice to include exponential histogram preference and use it. This capability doesn't exist yet but was discussed here. Exponential histograms automatically adjust to the range of measurements recorded, so the only decision for users is how many buckets to allocate.
  • Specify explicit bucket boundaries informed by real world data, acknowledging that we will never make everyone happy.
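For the exponential-histogram option, the "automatically adjusts" property comes from its value-to-bucket mapping. A simplified illustration (not the SDK implementation) of the base-2 mapping, where bucket i covers (base^i, base^(i+1)] with base = 2^(2^-scale), so index = ceil(log2(value) * 2^scale) - 1:

```java
public class ExpBucket {
    // Simplified exponential-histogram mapping: bucket i covers
    // (base^i, base^(i+1)] where base = 2^(2^-scale). Higher scale
    // means finer buckets; the range adapts to whatever is recorded.
    public static long bucketIndex(double value, int scale) {
        double log2 = Math.log(value) / Math.log(2.0);
        return (long) Math.ceil(log2 * Math.pow(2.0, scale)) - 1;
    }
}
```

At scale 0 the buckets are simply powers of two, e.g. a 3 s pause falls in bucket 1, covering (2, 4].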

In the case of gc duration, my preference would be to downgrade to summary. This is probably not the right decision for all histograms, but for gc duration the most important information is contained in min, max, sum, count. The distribution is a luxury.

There's a related question about whether a change to histogram bucket boundary advice is a breaking change. We've gone back and forth on whether changing the default bucket boundaries is breaking, but seem to have landed on it indeed being breaking since duration units have been changed from ms to s and we haven't changed the defaults. Maybe advice bucket boundaries are different, but given that advice can influence the default behavior of the SDK, maybe the same applies? cc @jsuereth

@trask
Member Author

trask commented Apr 28, 2023

Downgrade to summary

I like this option. I'll check out existing histogram metrics in the spec to see if we can make any general recommendation around this.
