convert latency/latencies in metrics name to duration #74418

Merged: 3 commits merged into kubernetes:master on Mar 2, 2019

Conversation

@danielqsj (Member) commented Feb 22, 2019

What type of PR is this?

/kind feature
/sig instrumentation

What this PR does / why we need it:

As discussed here, we need to convert latency/latencies in metric names to duration to follow Prometheus best practices.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Convert `latency`/`latencies` in metric names to `duration`.
The following metrics are renamed, and the previous metrics are marked as deprecated:
* `rest_client_request_latency_seconds` -> `rest_client_request_duration_seconds`
* `apiserver_proxy_tunnel_sync_latency_secs` -> `apiserver_proxy_tunnel_sync_duration_seconds`
* `scheduler_scheduling_latency_seconds` -> `scheduler_scheduling_duration_seconds`
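
For illustration, here is a minimal sketch of the rename-plus-deprecation pattern this PR applies, using the Prometheus Go client. The variable names, label set, and buckets below are assumptions for the sketch, not the PR's exact code:

package restclientmetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// New metric name, following the Prometheus convention of
	// <name>_duration_seconds for timing histograms.
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "rest_client_request_duration_seconds",
			Help:    "Request latency in seconds. Broken down by verb and URL.",
			Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
		},
		[]string{"verb", "url"},
	)

	// Old metric kept for one release so existing dashboards can migrate;
	// the "(Deprecated)" prefix in Help follows the convention used here.
	deprecatedRequestLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "rest_client_request_latency_seconds",
			Help:    "(Deprecated) Request latency in seconds. Broken down by verb and URL.",
			Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
		},
		[]string{"verb", "url"},
	)
)

func init() {
	prometheus.MustRegister(requestDuration, deprecatedRequestLatency)
}

// observeRequest records one request into both the new and the deprecated
// metric for the duration of the transition period.
func observeRequest(verb, url string, seconds float64) {
	requestDuration.WithLabelValues(verb, url).Observe(seconds)
	deprecatedRequestLatency.WithLabelValues(verb, url).Observe(seconds)
}
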
@danielqsj (Member, Author) commented Feb 22, 2019

/cc @brancz

@k8s-ci-robot requested a review from brancz on Feb 22, 2019

@brancz (Member) left a review comment

One question; otherwise, I double-checked that we have already deprecated the old metrics for those changed here, so we're good to go once we can confirm this one remaining case.

@@ -31,7 +31,7 @@ const (
 	WorkQueueSubsystem = "workqueue"
 	DepthKey           = "depth"
 	AddsKey            = "adds_total"
-	QueueLatencyKey    = "queue_latency_seconds"
+	QueueLatencyKey    = "queue_duration_seconds"

@brancz (Member), Feb 22, 2019:

I don't see deprecation for this?

@danielqsj (Author, Member), Feb 22, 2019:

This metric

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.HistogramMetric {

was newly introduced in 1.14 by #71300 to deprecate this metric (queue_latency):

func (prometheusMetricsProvider) NewDeprecatedLatencyMetric(name string) workqueue.SummaryMetric {
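
For context, a minimal sketch (not the exact code from #71300) of how such a provider can back the new histogram-based metric and the deprecated summary-based one; the subsystem, help strings, and buckets here are assumptions:

package workqueuemetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

type prometheusMetricsProvider struct{}

// NewLatencyMetric backs the new workqueue_queue_duration_seconds metric
// with a histogram, keyed by the queue name as a const label.
func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.HistogramMetric {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem:   "workqueue",
		Name:        "queue_duration_seconds",
		Help:        "How long in seconds an item stays in the workqueue before being requested.",
		ConstLabels: prometheus.Labels{"name": name},
		Buckets:     prometheus.ExponentialBuckets(1e-6, 10, 10),
	})
	prometheus.MustRegister(h)
	return h
}

// NewDeprecatedLatencyMetric keeps the old per-queue summary metric
// <name>_queue_latency exposed for one more release before removal.
func (prometheusMetricsProvider) NewDeprecatedLatencyMetric(name string) workqueue.SummaryMetric {
	s := prometheus.NewSummary(prometheus.SummaryOpts{
		Subsystem: name,
		Name:      "queue_latency",
		Help:      "(Deprecated) How long an item stays in the workqueue before being requested.",
	})
	prometheus.MustRegister(s)
	return s
}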

@danielqsj (Author, Member), Feb 22, 2019:

What I mean is: with this change, the newly introduced metrics (added via the KEP, but with latency/latencies still in their names) will not need to be replaced later by yet another, newer metric with duration in the name, right?

@brancz (Member), Feb 23, 2019:

Yes, you’re totally right, I just didn’t see it :)

@brancz (Member) commented Feb 23, 2019

/lgtm
/retest

@liggitt (Member) left a review comment

I'll defer to @kubernetes/sig-instrumentation-api-reviews on the cost of doubling these up during the deprecation period

deprecatedRequestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "rest_client_request_latency_seconds",
		Help: "(Deprecated) Request latency in seconds. Broken down by verb and URL.",

@liggitt (Member), Feb 24, 2019:

Specify a time for removal

Help: "The time since the last successful synchronization of the SSH tunnels for proxy requests.",
}, func() float64 { return float64(nodeTunneler.SecondsSinceSync()) })
prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Name: "apiserver_proxy_tunnel_sync_latency_secs",
Help: "(Deprecated) The time since the last successful synchronization of the SSH tunnels for proxy requests.",

@liggitt (Member), Feb 24, 2019:

Specify a time for removal (usually 3 releases)
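
For reference, a minimal sketch of registering the renamed gauge alongside the deprecated one during the transition release. The syncTracker interface stands in for the nodeTunneler from the quoted snippet; this is not the PR's exact code:

package tunnelmetrics

import "github.com/prometheus/client_golang/prometheus"

// syncTracker stands in for the nodeTunneler in the quoted snippet; only the
// SecondsSinceSync method matters for these gauges.
type syncTracker interface {
	SecondsSinceSync() int64
}

// registerTunnelSyncMetrics exposes the renamed gauge and the deprecated one
// side by side so dashboards can switch over before the old name is removed.
func registerTunnelSyncMetrics(nodeTunneler syncTracker) {
	prometheus.MustRegister(
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{
			Name: "apiserver_proxy_tunnel_sync_duration_seconds",
			Help: "The time since the last successful synchronization of the SSH tunnels for proxy requests.",
		}, func() float64 { return float64(nodeTunneler.SecondsSinceSync()) }),
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{
			Name: "apiserver_proxy_tunnel_sync_latency_secs",
			Help: "(Deprecated) The time since the last successful synchronization of the SSH tunnels for proxy requests.",
		}, func() float64 { return float64(nodeTunneler.SecondsSinceSync()) }),
	)
}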

@danielqsj (Author, Member), Feb 25, 2019:

This change is part of the Kubernetes Metrics Overhaul KEP. We plan to mark all deprecated metrics in 1.14 and remove them in 1.15.
The "(Deprecated)" prefix in front of the metric help text is consistent with the other changes.
Does that sound good?
/cc @brancz

@brancz (Member), Feb 25, 2019:

There are no guarantees around metrics at all right now, so technically we can break as much as we want, but we decided we want at least one release as a transition period.

@liggitt (Member), Mar 1, 2019:

For visibility, I asked for an ack on that position from sig-instrumentation and the API reviewers. We can proceed once there's agreement there.

@fejta-bot commented Feb 24, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot requested a review from brancz on Feb 25, 2019

@brancz (Member) commented Feb 25, 2019

I think for high cardinality metrics we can decide on a case by case basis which metrics we may not be ok with duplicating. For what it's worth, Prometheus has a built-in mechanism to drop metrics at ingestion, so apart from network bytes transferred, there is no efficiency loss, as dropping metrics at ingestion is a very cheap operation. As described in the KEP, we are providing transition mechanisms and lists of metrics before/after so that people can convert their setups in the 1.14 timeframe while maintaining efficiency of their Prometheus setups.

@danielqsj (Member, Author) commented Feb 27, 2019

ping @liggitt

@liggitt removed the kind/api-change label on Mar 1, 2019


@liggitt (Member) commented Mar 1, 2019

> As described in the KEP, we are providing transition mechanisms and lists of metrics before/after so that people can convert their setups in the 1.14 timeframe while maintaining efficiency of their Prometheus setups.

Should the changed names be described in detail in the PR release note?

@danielqsj (Member, Author) commented Mar 1, 2019

@liggitt Updated the release note with all the metrics changed here.

@brancz (Member) commented Mar 1, 2019

Absolutely, thanks for the catch @liggitt! And thanks for adding so quickly @danielqsj!

@liggitt (Member) commented Mar 1, 2019

/approve

@k8s-ci-robot (Contributor) commented Mar 1, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielqsj, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 9b8c586 into kubernetes:master on Mar 2, 2019

16 checks passed:

cla/linuxfoundation: danielqsj authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-godeps: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped
tide: In merge pool.
@liggitt (Member) commented Mar 5, 2019

> The following metrics are newly introduced in 1.14, so they are directly changed:
> ...
> * `*_admission_latencies_seconds_summary` -> `*_admission_duration_seconds_summary`

these two existed in 1.13:

https://github.com/kubernetes/kubernetes/blob/release-1.13/staging/src/k8s.io/apiserver/pkg/admission/metrics/metrics.go#L160-L189

prometheus.HistogramOpts{
	Subsystem: kubeProxySubsystem,
	Name:      "network_programming_latency_seconds",
	Help:      "(Deprecated) In Cluster Network Programming Latency in seconds",

@liggitt (Member), Mar 5, 2019:

network_programming_latency_seconds is new in 1.14, so we can drop this deprecated copy.

@liggitt (Member) commented Mar 5, 2019

I swept the 1.13 metrics modified in this PR, with these results:

in 1.13:
rest_client_request_latency_seconds (deprecated)
apiserver_proxy_tunnel_sync_latency_secs (deprecated)
scheduling_latency_seconds (deprecated)
*_admission_latencies_seconds_summary (removed!)
*_admission_latencies_seconds (removed!)

new in 1.14, free to change:
docker_operations_latency_seconds
network_plugin_operations_latency_seconds
sync_proxy_rules_latency_seconds
network_programming_latency_seconds
e2e_scheduling_latency_seconds
scheduling_algorithm_latency_seconds
binding_latency_seconds
queue_latency_seconds
apiserver_request_latency_seconds
etcd_request_cache_get_latency_seconds
etcd_request_cache_add_latency_seconds
etcd_request_latency_seconds
transformation_latencies_seconds
data_key_generation_latencies_seconds

we should restore/deprecate these two:
*_admission_latencies_seconds_summary (removed!)
*_admission_latencies_seconds (removed!)

and we can drop the new-in-1.14 metrics from the release note... the intra-release name change is just noise in the release note

@brancz (Member) commented Mar 5, 2019

@liggitt the admission latency metrics were intentionally removed because they had too high cardinality (not uncommon to cause 30k+ active time-series per apiserver instance).

@liggitt (Member) commented Mar 5, 2019

> @liggitt the admission latency metrics were intentionally removed because they had too high cardinality (not uncommon to cause 30k+ active time-series per apiserver instance).

Ah, I see. Can we call those out explicitly in the release note, then, along with the reason for removal?

@brancz (Member) commented Mar 5, 2019

Yes absolutely.

@liggitt (Member) commented Mar 5, 2019

also, it looks like those were renamed, not removed... are you saying the cardinality was too high to maintain the same metric under the deprecated name?

@brancz (Member) commented Mar 5, 2019

I recall them being reported to have high cardinality (and I am seeing that in my environments currently), so my assumption was that they were removed for that reason, but I need to double-check.

@danielqsj (Member, Author) commented Mar 6, 2019

@liggitt
Yes, in 1.13, these two metrics, _admission_latencies_seconds_summary
and _admission_latencies_seconds, actually had the wrong unit; that was fixed in 1.14 by #72343.
For backward compatibility, we renamed the metrics with the wrong unit to _admission_latencies_milliseconds_summary and _admission_latencies_milliseconds.

@brancz Should we remove these admission metrics or change the design of the admission metrics?
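
To make the unit mix-up concrete, a minimal sketch (the metric names below are illustrative, not the exact registered names) of recording the same elapsed time into the corrected seconds-based metric and the renamed milliseconds-based one:

package admissionmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// 1.14 metric: the name says seconds and the recorded values are seconds.
	admissionDuration = prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "apiserver_admission_step_admission_duration_seconds_summary",
		Help: "Admission sub-step latency summary in seconds.",
	})
	// Renamed 1.13 metric: it always recorded milliseconds, so its name was
	// changed to say milliseconds instead of seconds.
	deprecatedAdmissionLatency = prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "apiserver_admission_step_admission_latencies_milliseconds_summary",
		Help: "(Deprecated) Admission sub-step latency summary in milliseconds.",
	})
)

func init() {
	prometheus.MustRegister(admissionDuration, deprecatedAdmissionLatency)
}

// observeAdmissionStep records one admission step into both metrics, each in
// the unit its name promises.
func observeAdmissionStep(elapsed time.Duration) {
	admissionDuration.Observe(elapsed.Seconds())
	deprecatedAdmissionLatency.Observe(float64(elapsed / time.Millisecond))
}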

@danielqsj (Member, Author) commented Mar 6, 2019

@liggitt @brancz
Thanks for the finding about network_programming_latency_seconds.
I dropped the new-in-1.14 changes from the release note,
and opened a PR to clean up the deprecated metric network_programming_latency_seconds. PTAL: #75023

@brancz (Member) commented Mar 6, 2019

Aha, yes, thanks for digging that up @danielqsj. For what it's worth, we do drop those metrics in our setup because their cardinality is too high.

@liggitt we can do one of the following:

  1. leave it as is, since the old metric was inconsistent in its naming (it said "seconds" but reported milliseconds), so this was a deprecation and a bugfix in one go
  2. immediately remove the deprecated metric, since it both stated the wrong unit and otherwise violated the naming scheme
  3. revert the deprecated metric to the old inconsistent state, leave it deprecated, and remove it along with the other deprecated metrics
  4. one of the above + look into reducing cardinality
  5. remove both metrics due to high cardinality

Option 3 might be the least breaking for people using this metric today; option 2 is what we typically do in such a case.

Looking through the most popular dashboard and alert definitions [1], none of them seem to use this metric and most of them even drop it. Given the nature of the metric, I think it could be useful in a general sense, so I feel we should drop the old metric entirely (2) and look into reducing cardinality with the new metric (4).

[1] https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus/manifests
https://github.com/kubernetes-monitoring/kubernetes-mixin
https://github.com/grafana/jsonnet-libs/tree/master/prometheus-ksonnet/lib
https://github.com/helm/charts/tree/master/stable/prometheus-operator

@liggitt (Member) commented Mar 6, 2019

> For backward compatibility, we renamed the metrics with the wrong unit to _admission_latencies_milliseconds_summary and _admission_latencies_milliseconds.

renaming a metric doesn't really help us, compatibility-wise

> I feel we should drop the old metric entirely (2) and look into reducing cardinality with the new metric (4).

if these were problematic for performance reasons and were misnamed, I'd tend to agree

@brancz (Member) commented Mar 6, 2019

Looks like we already solved the cardinality explosion, which is why we didn't remove the metric (re #69540; I just checked: it's present in 1.12 and not in 1.13). In that case, we should just remove the old metric and explicitly call this out in the release notes for 1.14. Sounds good?

@danielqsj (Member, Author) commented Mar 12, 2019

@brancz thanks for the clarification.
I agree to directly remove the old metrics; as @liggitt mentioned, they don't seem to help much.
I submitted a PR to remove them in #75279. PTAL
