Fix admission metrics in true units #72343

Merged
merged 2 commits into kubernetes:master from danielqsj:adm on Jan 29, 2019

Conversation

@danielqsj
Member

danielqsj commented Dec 26, 2018

What type of PR is this?

/kind bug

What this PR does / why we need it:

The admission metrics are named *_admission_latencies_seconds and *_admission_latencies_seconds_summary, so the names indicate seconds, but the values are actually reported in microseconds. This PR fixes these metrics to report in seconds.
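
For illustration only, here is a rough sketch of the kind of change involved (not the exact diff in this PR; the `metricSet` fields shown are simplified assumptions): the elapsed `time.Duration` is converted to seconds before being observed, instead of being divided by `time.Microsecond`.

```go
package admission

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// metricSet is a simplified stand-in for the real type; only the fields
// needed for this sketch are shown.
type metricSet struct {
	latencies        *prometheus.HistogramVec
	latenciesSummary *prometheus.SummaryVec
}

// observe records the elapsed admission time in seconds, so the reported
// values match the *_seconds metric names.
func (m *metricSet) observe(elapsed time.Duration, labels ...string) {
	// Before the fix, float64(elapsed / time.Microsecond) was observed here,
	// i.e. microseconds under a seconds-named metric.
	elapsedSeconds := elapsed.Seconds()
	m.latencies.WithLabelValues(labels...).Observe(elapsedSeconds)
	m.latenciesSummary.WithLabelValues(labels...).Observe(elapsedSeconds)
}
```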

Which issue(s) this PR fixes:

Fixes #72342

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix admission metrics to report in seconds.
Add metrics `*_admission_latencies_milliseconds` and `*_admission_latencies_milliseconds_summary` for backward compatibility; they will be removed in a future release.
@logicalhan
Contributor

logicalhan left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 26, 2018

@liggitt

Member

liggitt commented Dec 27, 2018

/unassign
/assign @jpbetz @sttts

@@ -206,9 +206,9 @@ func (m *metricSet) reset() {

// Observe records an observed admission event to all metrics in the metricSet.
func (m *metricSet) observe(elapsed time.Duration, labels ...string) {
	elapsedMicroseconds := float64(elapsed / time.Microsecond)
	m.latencies.WithLabelValues(labels...).Observe(elapsedMicroseconds)


@liggitt

liggitt Dec 29, 2018

Member

will changing the values here cause problems for monitoring already set up to track/alarm on these?


@logicalhan

logicalhan Jan 3, 2019

Contributor

Mainly, I can think of two ways this metric is (and can be) broken for people who have monitoring set up against these.

The first group would be people who are using this metric and are aware that it emits microseconds even though the name says seconds. If they have set their alerts accordingly (this would be weird but not impossible), then this fix would break their monitoring.

The second group are people who are using this metric as if it were working correctly, i.e. emitting latency in seconds. In that case, the alerting thresholds they currently have set would be off by orders of magnitude, and this fix would actually make those alerts start working as intended.

Personally, I think we should just fix it.


@brancz

brancz Jan 8, 2019

Member

I'm also in favor of the change since this is an actual bug, but can you add an item to the changelog noting this change?


@brancz

brancz Jan 17, 2019

Member

This is part of the metrics overhaul planned for 1.14, where a number of metrics are changing and we're documenting every single case, including what to change. As a middle ground, let's add an already-deprecated metric called admission_latencies_milliseconds_summary, so people affected by the break only have to change the metric name and not do a unit conversion. I think this would work well, as 1.14 is the "metric migration" release, where we have the deprecated metrics alongside the new best-practice-following metrics, and the deprecated ones will be removed in 1.15.

This one is an interesting case: it not only fails to follow the best practice, it also incorrectly labels its unit. Either way, it will need a separate, additional changelog notice.


@danielqsj

danielqsj Jan 18, 2019

Author Member

Agree with @brancz's proposal.
Added admission_latencies_milliseconds and admission_latencies_milliseconds_summary for backward compatibility. PTAL
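
As a rough sketch of the approach agreed on above (illustrative only; the helper name, label handling, and help text are assumptions, not the exact code in this PR), the corrected seconds histogram is registered alongside an already-deprecated milliseconds twin, so affected users only need to rename the metric rather than convert units:

```go
package admission

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// newLatencyHistograms is a hypothetical helper (not the PR's actual code)
// showing the backward-compatibility approach: the corrected *_seconds
// histogram plus an already-deprecated *_milliseconds histogram that is
// scheduled for removal in a later release.
func newLatencyHistograms(namespace, subsystem, name string, labels []string) (seconds, milliseconds *prometheus.HistogramVec) {
	seconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: namespace,
			Subsystem: subsystem,
			Name:      fmt.Sprintf("%s_admission_latencies_seconds", name),
			Help:      "Admission latency histogram in seconds.",
		},
		labels,
	)
	milliseconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: namespace,
			Subsystem: subsystem,
			Name:      fmt.Sprintf("%s_admission_latencies_milliseconds", name),
			Help:      "(Deprecated) Admission latency histogram in milliseconds.",
		},
		labels,
	)
	return seconds, milliseconds
}
```

On the recording side, the seconds histogram would then receive `elapsed.Seconds()` while the deprecated milliseconds histogram receives `float64(elapsed / time.Millisecond)`, so existing dashboards only need a metric-name change.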

@danielqsj

Member Author

danielqsj commented Jan 8, 2019

/cc @brancz

@k8s-ci-robot k8s-ci-robot requested a review from brancz Jan 8, 2019

@danielqsj danielqsj referenced this pull request Jan 8, 2019

Open

Change existing metrics to conform metrics guidelines #72333

7 of 9 tasks complete

@k8s-ci-robot k8s-ci-robot added size/M and removed lgtm size/XS labels Jan 18, 2019

@danielqsj

Member Author

danielqsj commented Jan 18, 2019

/retest

@brancz
Member

brancz left a comment

Just a small thing on consistency. Otherwise looks good.

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds", name),
Help: fmt.Sprintf(helpTemplate, "latency histogram in milliseconds (deprecated)"),


@brancz

brancz Jan 18, 2019

Member

Same here

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf(helpTemplate, "latency summary in milliseconds (deprecated)"),


@brancz

brancz Jan 18, 2019

Member

We should be consistent with the deprecation warning. Let's make sure the help text is prefixed with (Deprecated), like the other metrics we have already deprecated.


@danielqsj

danielqsj Jan 18, 2019

Author Member

@brancz fixed.

@danielqsj danielqsj force-pushed the danielqsj:adm branch from 7050745 to d9c57e7 Jan 18, 2019

@brancz

Member

brancz commented Jan 18, 2019

/lgtm

@danielqsj

Member Author

danielqsj commented Jan 21, 2019

@sttts @deads2k if you have time, can you help review this? Thanks

@jpbetz

Contributor

jpbetz commented Jan 22, 2019

Apologies for mis-labeling this metric. That's clearly my fault.

Note that we'll be doubling the memory utilization of these admission metrics, which might matter for those who use a lot of admission controllers. But given that we reduced the cardinality of these metrics in the last release, I don't think that will be a show stopper. Let's just document the deprecation plan so we know when we can remove the old metrics for good.

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf("(Deprecated) "+helpTemplate, "latency summary in milliseconds"),


@jpbetz

jpbetz Jan 22, 2019

Contributor

Should we add the Kubernetes version in which these will be removed to the help string?


@danielqsj

danielqsj Jan 23, 2019

Author Member

This is consistent with the deprecation warning in the other metrics we have already deprecated.
That said, we will certainly announce the metrics migration/deprecation plan in the release notes or through other channels.
cc @brancz


@brancz

brancz Jan 23, 2019

Member

In 1.14 the "old" metrics are deprecated, and 1.15 is targeted for their removal. Let's explicitly document this in the KEP.

@brancz

Member

brancz commented Jan 23, 2019

@danielqsj let's make sure the deprecation plan is more thoroughly documented in the KEP. Do you want to take care of that?

@danielqsj

Member Author

danielqsj commented Jan 23, 2019

@brancz sure. I will update the KEP with the deprecation plan and the latest PRs that are not yet covered.

@danielqsj

Member Author

danielqsj commented Jan 28, 2019

@sttts @deads2k @smarterclayton if you have time, can you help review this? Thanks

@sttts

Contributor

sttts commented Jan 28, 2019

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 28, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielqsj, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@brancz

Member

brancz commented Jan 28, 2019

/retest

@fejta-bot


fejta-bot commented Jan 28, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 29, 2019

@danielqsj: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kops-aws | d9c57e7 | link | /test pull-kubernetes-e2e-kops-aws |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 035332f into kubernetes:master Jan 29, 2019

16 of 18 checks passed

pull-kubernetes-e2e-kops-aws Job failed.
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
cla/linuxfoundation danielqsj authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-godeps Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.