Change apiserver metrics to conform to metrics guidelines #72336

Merged

merged 3 commits into kubernetes:master from danielqsj:apimetrics on Jan 18, 2019

Conversation

@danielqsj
Member

danielqsj commented Dec 26, 2018

What type of PR is this?

/kind feature
/sig instrumentation

What this PR does / why we need it:

  1. As part of the Kubernetes metrics overhaul, change apiserver metrics to conform to the Kubernetes metrics instrumentation guidelines.
  2. Change etcd latency metrics to histograms for better aggregation.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This patch does not remove the existing metrics; it only marks them as deprecated (the rename pattern is sketched after the release note below).
We need two releases to give users time to convert their monitoring configuration.

Does this PR introduce a user-facing change?:

Change apiserver metrics to conform to metrics guidelines.
The following metrics are deprecated and will be removed in a future release:
* `apiserver_request_count`
* `apiserver_request_latencies`
* `apiserver_request_latencies_summary`
* `apiserver_dropped_requests`
* `etcd_helper_cache_hit_count`
* `etcd_helper_cache_miss_count`
* `etcd_helper_cache_entry_count`
* `etcd_request_cache_get_latencies_summary`
* `etcd_request_cache_add_latencies_summary`
* `etcd_request_latencies_summary`
* `transformation_latencies_microseconds`
* `data_key_generation_latencies_microseconds`
Please convert to the following metrics:
* `apiserver_request_total`
* `apiserver_request_latency_seconds`
* `apiserver_dropped_requests_total`
* `etcd_helper_cache_hit_total`
* `etcd_helper_cache_miss_total`
* `etcd_helper_cache_entry_total`
* `etcd_request_cache_get_latency_seconds`
* `etcd_request_cache_add_latency_seconds`
* `etcd_request_latency_seconds`
* `transformation_latencies_seconds`
* `data_key_generation_latencies_seconds`
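For anyone updating dashboards or alerts, the rename pattern is mechanical: counters gain a `_total` suffix and latency metrics become histograms observed in seconds instead of microsecond-based summaries. The sketch below is illustrative only; the variable names, help strings and bucket choice are placeholders rather than the exact definitions in this PR (the labels are the ones used by the apiserver request metrics):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Guideline-conforming counter: base name plus a _total suffix.
    requestTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "apiserver_request_total",
            Help: "Counter of apiserver requests, partitioned by verb, group, version, resource, subresource, scope and component.",
        },
        []string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
    )
    // Guideline-conforming latency metric: a histogram observed in seconds.
    requestLatencySeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "apiserver_request_latency_seconds",
            Help:    "Request latency distribution in seconds, partitioned by verb, group, version, resource, subresource, scope and component.",
            Buckets: prometheus.ExponentialBuckets(0.125, 2, 7), // 125ms, 250ms, 500ms, 1s, 2s, 4s, 8s
        },
        []string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
    )
)

func init() {
    // The new metrics are registered alongside the deprecated ones during the
    // two-release transition window described above.
    prometheus.MustRegister(requestTotal, requestLatencySeconds)
}

Observations are then recorded in seconds at the call site, e.g. `requestLatencySeconds.WithLabelValues(...).Observe(elapsed.Seconds())`.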
@danielqsj

Member Author

danielqsj commented Dec 26, 2018

/assign @brancz

@danielqsj danielqsj force-pushed the danielqsj:apimetrics branch from 87a1536 to e4c9b52 Dec 26, 2018

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Dec 26, 2018

@danielqsj danielqsj force-pushed the danielqsj:apimetrics branch from 26e31d0 to 88c4b64 Dec 26, 2018

},
[]string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
)
requestLatenciesSummary = prometheus.NewSummaryVec(

@logicalhan

logicalhan Dec 26, 2018

Contributor

Do we want to export summary metrics in the future? I thought we were generally trying to move off of them.

@danielqsj

danielqsj Dec 27, 2018

Author Member

I'm not sure whether to remove all of these existing summary metrics or convert them to histograms.

@brancz @loburm: can you give some suggestions?

@brancz

brancz Jan 7, 2019

Member

I believe the summaries are currently in use by the performance tests, so we'll need to keep them around for now, but eventually I agree we should just delete them or at least make them opt-in.

As for this PR, I think it's OK as is.

@logicalhan

logicalhan Jan 8, 2019

Contributor

@brancz, but this PR is adding a new summary metric. How can it already be in use by perf tests?

@brancz

brancz Jan 8, 2019

Member

Oops, I missed that. I meant the existing summary metric (the one that is deprecated). My thinking is to just skip the new summary metric, as summaries are too expensive here.

@logicalhan

logicalhan Jan 8, 2019

Contributor

Great, that's basically in line with what I was thinking as well.

@danielqsj

Member Author

danielqsj commented Jan 8, 2019

@brancz @logicalhan
Added "Deprecated" to the metrics descriptions. PTAL.
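Marking deprecation in the metric's description means putting it in the Help string of the Prometheus metric definition, so it is visible to anyone scraping /metrics. A minimal sketch of that pattern, with placeholder wording rather than the exact text used in the PR (same package and import as the sketch after the release note):

var deprecatedRequestCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "apiserver_request_count",
        Help: "(Deprecated) Counter of apiserver requests; use apiserver_request_total instead.",
    },
    []string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
)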

@brancz

Member

brancz commented Jan 10, 2019

Let's remove the newly added summary and only keep the old, deprecated one. Then this lgtm.

@danielqsj danielqsj force-pushed the danielqsj:apimetrics branch from 66de638 to 8b41863 Jan 11, 2019

@danielqsj

Member Author

danielqsj commented Jan 11, 2019

@brancz @logicalhan
Removed the newly added summary. PTAL, thanks.

@logicalhan
Contributor

logicalhan left a comment

/lgtm

Thanks for the iterations!

@danielqsj

Member Author

danielqsj commented Jan 14, 2019

/assign @deads2k

@deads2k

Contributor

deads2k commented Jan 17, 2019

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 17, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielqsj, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot


fejta-bot commented Jan 18, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@danielqsj

Member Author

danielqsj commented Jan 18, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 24643fd into kubernetes:master Jan 18, 2019

18 checks passed

cla/linuxfoundation danielqsj authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-godeps Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
// Use buckets ranging from 125 ms to 8 seconds.
Buckets: prometheus.ExponentialBuckets(125000, 2.0, 7),
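// 125000 µs doubled across 7 buckets: 125 ms, 250 ms, 500 ms, 1 s, 2 s, 4 s, 8 s.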
},
[]string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
)
requestLatenciesSummary = prometheus.NewSummaryVec(
deprecatedRequestLatenciesSummary = prometheus.NewSummaryVec(

@wojtek-t

wojtek-t Jan 30, 2019

Member

@danielqsj - this metric was deprecated without anything new being added in its place.
Was that intentional or just an oversight?

@brancz

brancz Jan 30, 2019

Member

The intention was that the latency metrics as summaries are too expensive and not aggregatable across instances, so just having the histograms should be sufficient. If there are other voices, I'm happy to reconsider. As I mentioned above, I'm aware that the scalability tests currently use this metric (or at least did); I recall a discussion a year or so ago that the scalability tests should move away from using these (my apologies if I'm mixing up conversations and this wasn't with you).

@wojtek-t

wojtek-t Jan 31, 2019

Member

The context is that this PR broke the tests, so we temporarily reverted to using the old metric: #73524

I think I roughly understand the concerns, but I would like to discuss what exactly we are trying to achieve and how we can do that without having a summary.
So the requirements are:

  1. We need to be able to say whether the 99th percentile was higher or lower than X (X being 1s, 5s or 30s depending on the call; the exact SLO is described here: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
  2. As defined in the SLI, we would like the value from the last 5 minutes.
  3. I would like to be able to spot regressions (at the very least by looking at graphs post factum).
  4. Aggregating across a few masters is probably also important.

@brancz - are those reasonable expectations?

IIUC, we can achieve the first by choosing buckets appropriately (if we had buckets whose upper bounds land at 1s, 5s and 30s, we would be done).
The third is probably also doable: we can graph the percentage of requests falling into a given bucket, and if that grows, it's probably a regression.
The question is what to do about number (2). Do you have any recommendations?

Any other thoughts?

@brancz

brancz Jan 31, 2019

Member

As these are Prometheus metrics, and assuming you're collecting them with Prometheus, I see no problem at all with any of the points above:

  1. exactly
  2. we can see the historic 99th percentile latency with:

histogram_quantile(0.99, sum without(instance, pod) (rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])))

This assumes the Prometheus config you are using has the job="apiserver", pod, and instance labels. If you choose a different labelling, those may vary, but the idea is the same.

  3. the above calculates this over time
  4. the above already aggregates latency over all instances (with today's summaries that's fundamentally not possible, which is why we preferred the histogram :) )

The only thing that's different right now is that the buckets do not go up to 30s; they go up to 8s. If you have data to back up the relevance of buckets up to 30s, that would be wonderful, and I'd be more than happy to change the bucketing according to real-world experience from the scalability tests! :)

@wojtek-t

wojtek-t Jan 31, 2019

Member

@brancz - thanks

I have one more question about buckets. Can we customize them further?
In particular, there is a huge difference for us between, say, 500ms and 900ms (which currently fall into the same bucket).
What I would ideally like to see is probably something like:

  • every 100ms (or even a bit more) up to 1s
  • then every 500ms up to 5s
  • then maybe every 5s or so up to 30s

Regarding 30s: yes, in large clusters with 150k pods, listing all pods can easily take more than 10s.

@brancz

brancz Jan 31, 2019

Member

Buckets are completely free-form. Linear, log, and exponential buckets are just conveniences that make common layouts easier to declare. If we have a better idea of the buckets we want, that's absolutely possible, and we should do it.

The godoc comment is literally:

	// Buckets defines the buckets into which observations are counted. Each
	// element in the slice is the upper inclusive bound of a bucket. The
	// values must be sorted in strictly increasing order. There is no need
	// to add a highest bucket with +Inf bound, it will be added
	// implicitly. The default value is DefBuckets.
	Buckets []float64
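So a layout like the one proposed above (steps of 100ms up to 1s, 500ms up to 5s, then 5s up to 30s) can simply be written out as a literal slice. A hedged sketch, reusing the prometheus import from the earlier examples; the boundaries are illustrative and not necessarily the ones the follow-up PR settled on:

// Hypothetical bucket boundaries: 100ms steps to 1s, 500ms steps to 5s, 5s steps to 30s.
var customBuckets = []float64{
    0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
    1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
    10, 15, 20, 25, 30,
}

var requestLatencySecondsCustom = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "apiserver_request_latency_seconds",
        Help:    "Request latency distribution in seconds.",
        Buckets: customBuckets, // the +Inf bucket is added implicitly, per the godoc above
    },
    []string{"verb", "group", "version", "resource", "subresource", "scope", "component"},
)

That is 23 explicit buckets (24 bucket series per label combination counting +Inf), which feeds directly into the cardinality question below.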

@wojtek-t

wojtek-t Jan 31, 2019

Member

@brancz - thanks
Let me think a bit more; I will send you a PR tomorrow or early next week that changes the buckets in that metric (given that we added it in this release, we can still do that) so that they are more useful for us.

Probably the last question in this area: what is a reasonable limit on the number of buckets? For example, would 50 buckets or so still be OK?

@wojtek-t

wojtek-t Feb 1, 2019

Member

Anyway - I opened #73638 for that.

@brancz

brancz Feb 1, 2019

Member

Looking at the cardinality, we're creating roughly 300 histograms, and with 50 buckets each that's about 15k metrics. That's a lot, but I feel this metric is important enough to be worth it.
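For reference, the arithmetic behind that estimate: roughly 300 label combinations (across verb, group, version, resource, subresource, scope and component) times 50 buckets is about 300 × 50 = 15,000 `_bucket` series; each histogram also exposes a `_sum`, a `_count`, and an implicit `+Inf` bucket series, so the true total is somewhat higher.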
