Fix admission metrics in true units #72343

Merged
merged 2 commits into from Jan 29, 2019
6 changes: 3 additions & 3 deletions staging/src/k8s.io/apiserver/pkg/admission/metrics/metrics.go
@@ -206,9 +206,9 @@ func (m *metricSet) reset() {

 // Observe records an observed admission event to all metrics in the metricSet.
 func (m *metricSet) observe(elapsed time.Duration, labels ...string) {
-	elapsedMicroseconds := float64(elapsed / time.Microsecond)
-	m.latencies.WithLabelValues(labels...).Observe(elapsedMicroseconds)
Member

Will changing the values here cause problems for monitoring that is already set up to track or alarm on these?

Member

I can think of two groups of people with monitoring set up against this metric, and it is (or can be) broken for both.

The first group knows that the metric emits microseconds even though it is labeled in seconds. If they have set their alerts accordingly (which would be weird but not impossible), this fix would break their monitoring.

The second group uses the metric as if it were working correctly, i.e. emitting latency in seconds. In that case, the thresholds currently set for alerting are off by orders of magnitude, and this fix would actually make those alerts start working as intended.

Personally, I think we should just fix it.

Member

I'm also for changing this, as it is an actual bug, but can you add a changelog item noting that this is a change?

Member

This is part of the metrics overhaul planned for 1.14, where a number of metrics are changing and we're documenting every single case, including what to change. As a middle ground, let's add an already-deprecated metric called admission_latencies_milliseconds_summary, so that people affected by the break only have to change the metric name and not do a unit conversion. I think this would work well, as 1.14 is the "metric migration" release: it ships the deprecated metrics alongside the new metrics that follow best practice, and the deprecated ones will be removed in 1.15.

This one is an interesting case, as it not only fails to follow the best practice but also mislabels its unit. Either way, it will need a separate, additional changelog notice.

Contributor Author

Agree with @brancz's proposal.
Added admission_latencies_milliseconds and admission_latencies_milliseconds_summary for backward compatibility. PTAL.

+	elapsedSeconds := elapsed.Seconds()
+	m.latencies.WithLabelValues(labels...).Observe(elapsedSeconds)
 	if m.latenciesSummary != nil {
-		m.latenciesSummary.WithLabelValues(labels...).Observe(elapsedMicroseconds)
+		m.latenciesSummary.WithLabelValues(labels...).Observe(elapsedSeconds)
 	}
 }