
src/k8s.io/apiserver: Increase cert expiration histogram resolution #74806

Merged
merged 1 commit into kubernetes:master on Mar 3, 2019

Conversation

@mxinden (Member) commented Mar 1, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

The certificate_expiration_seconds histogram measures the remaining
time of client certificates used to authenticate to the API server. It
records the lifetime of received client request certificates in buckets
of 6h, 12h, ..., 1y.

In environments with automated certificate rotation it is not uncommon
for issued certificates to expire in less than the 6h minimum bucket
mentioned above. In such environments the histogram is useless, given
that every request is recorded in the first bucket.

This patch increases the histogram resolution by adding 30m, 1h and 2h
buckets. Prometheus histogram buckets are cumulative, e.g. the 12h
bucket counts all records with an expiration date less than or equal
to 12h, including all requests in the 6h bucket. Thus this patch
does not break existing monitoring setups. The histogram is exposed
once per API server, so the 3 additional time series do not cause a
cardinality issue.
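
For illustration, a minimal sketch of how such a histogram could be declared with Prometheus client_golang. The variable name and the exact bucket list are assumptions for the example, not necessarily the apiserver's actual code:

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    // Sketch only: bucket bounds are in seconds. The new 30m, 1h and 2h
    // buckets sit in front of the existing 6h, 12h, ..., 1y buckets, so
    // every existing (cumulative) bucket keeps its meaning.
    var clientCertificateExpirationHistogram = prometheus.NewHistogram(
    	prometheus.HistogramOpts{
    		Namespace: "apiserver",
    		Subsystem: "client",
    		Name:      "certificate_expiration_seconds",
    		Help:      "Distribution of the remaining lifetime on the certificate used to authenticate a request.",
    		Buckets: []float64{
    			0,
    			1800,     // 30m (new)
    			3600,     // 1h (new)
    			7200,     // 2h (new)
    			21600,    // 6h
    			43200,    // 12h
    			86400,    // 1d
    			172800,   // 2d
    			345600,   // 4d
    			604800,   // 1w
    			2592000,  // ~1mo
    			7776000,  // ~3mo
    			15552000, // ~6mo
    			31104000, // ~1y
    		},
    	},
    )

Each authenticated request would then record one observation, e.g. clientCertificateExpirationHistogram.Observe(cert.NotAfter.Sub(now).Seconds()).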

Which issue(s) this PR fixes:

Relates to openshift/cluster-monitoring-operator#275

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Increase api server client certificate expiration histogram resolution to accommodate short-lived (< 6h) client certificates.
@sttts (Contributor) commented Mar 1, 2019

What's the memory cost of this?

@brancz (Member) commented Mar 1, 2019

@sttts as there are no labels on this metric, there is no multiplied cardinality; the extra memory usage amounts to 3 more float64 values in memory 🙂 .

@sttts (Contributor) commented Mar 1, 2019

/lgtm

@brancz (Member) commented Mar 1, 2019

Looks good from sig-instrumentation as well.

/lgtm

@mxinden (Member, Author) commented Mar 1, 2019

> What's the memory cost of this?

@sttts adding some details to @brancz's answer.

First of all, we need to store the bucket bounds themselves (e.g. 6h, 12h, ...):

	upperBounds []float64

https://github.com/prometheus/client_golang/blob/2daed26f633ce23fab5efc02fb3f2e803b3526c3/prometheus/histogram.go#L255

In addition, we need to record the number of samples in each bucket. Previously this was done with a single uint64 per bucket:

	counts []uint64

https://github.com/prometheus/client_golang/blob/v0.8.0/prometheus/histogram.go#L231

Nowadays Prometheus client_golang leverages atomic increments for lock-free recording. For that it needs two instances of the count vector, one for hot writes and one for cold reads. Hence we now have two uint64 per bucket:

	// Two counts, one is "hot" for lock-free observations, the other is
	// "cold" for writing out a dto.Metric. It has to be an array of
	// pointers to guarantee 64bit alignment of the histogramCounts, see
	// http://golang.org/pkg/sync/atomic/#pkg-note-BUG.
	counts [2]*histogramCounts

https://github.com/prometheus/client_golang/blob/2daed26f633ce23fab5efc02fb3f2e803b3526c3/prometheus/histogram.go#L249-L253


Summarizing: adding 3 new buckets adds 3 bucket-bound float64 allocations and 6 bucket-count uint64 allocations.
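
In other words, assuming 8 bytes per float64/uint64, that is 3×8 B for the bounds plus 6×8 B for the hot and cold counts, roughly 72 extra bytes per API server process.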

@mikedanese (Member) commented Mar 2, 2019

/lgtm
/approve

@k8s-ci-robot (Contributor) commented Mar 2, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikedanese, mxinden

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mikedanese (Member) commented Mar 2, 2019

Aside: Ideally we would record this metric after we've validated chain signatures and before we've validated chain expiration. I've noticed that this metric can pick up data from scanners in deployments where the apiserver is exposed publicly.
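
A minimal sketch of that ordering, assuming a single-CA chain and the histogram variable from the earlier sketch; verifyAndObserve is a hypothetical helper, not the apiserver's actual code path:

    package metrics

    import (
    	"crypto/x509"
    	"errors"
    	"time"
    )

    // Hypothetical: check the chain signature first, record the metric
    // for trusted certificates only, and enforce expiration last, so
    // scanner traffic that fails signature validation never reaches
    // the histogram.
    func verifyAndObserve(cert, ca *x509.Certificate, now time.Time) error {
    	// 1. Signature check only; expiry is not enforced yet.
    	if err := cert.CheckSignatureFrom(ca); err != nil {
    		return err // untrusted certificate: do not record
    	}

    	// 2. Record the remaining lifetime (negative if expired).
    	clientCertificateExpirationHistogram.Observe(cert.NotAfter.Sub(now).Seconds())

    	// 3. Enforce expiration last.
    	if now.After(cert.NotAfter) {
    		return errors.New("client certificate expired")
    	}
    	return nil
    }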

@fejta-bot commented Mar 2, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@k8s-ci-robot merged commit c54978a into kubernetes:master on Mar 3, 2019

17 checks passed

cla/linuxfoundation: mxinden authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-godeps: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.
@brancz (Member) commented Mar 4, 2019

@mikedanese absolutely! I think we should have that second metric.

@caesarxuchao (Member) commented Mar 4, 2019

/assign @logicalhan
