kubelet: add eviction counter to kubelet metrics #81377

Merged
merged 1 commit into from Aug 15, 2019

@sjenning
Contributor

commented Aug 13, 2019

Evictions in a cluster are typically undesired and indicate contention for resources on the node. If the overcommit is unintentional, it is typically caused by pods that significantly underestimate their resource usage in their requests.

Currently we only emit events when a pod is evicted. Events are subject to a TTL, and most monitoring solutions can't do anything with them.

This PR adds a counter metric to track eviction by the kubelet so that standard monitoring/alerting stacks like prometheus/alertmanager can alert the cluster admin if so desired.

@dashpole @mrunalp @derekwaynecarr
/sig node
@kubernetes/sig-instrumentation-pr-reviews
/priority important-soon
/kind feature

kubelet now exports a "kubelet_evictions" metric that counts the number of pod evictions carried out by the kubelet to reclaim resources
@k8s-ci-robot

Contributor

commented Aug 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sjenning

Contributor Author

commented Aug 13, 2019

/retest

@brancz

Member

commented Aug 14, 2019

Can you elaborate on the possible values that thresholdToReclaim.Signal may have?

@mattjmcnaughton
Contributor

left a comment

/lgtm

can definitely see the benefit of this metric. my only concern would be if the cardinality of eviction key is exceptionally high, but I imagine there's very little chance of that being the case.

@@ -205,6 +206,16 @@ var (
},
[]string{"operation_type"},
)
// Evictions is a Counter that tracks the cumulative number of pod evictions initiated by the kubelet.
// Broken down by eviction signal.

@mattjmcnaughton

mattjmcnaughton Aug 14, 2019

Contributor

+1 to being interested in the cardinality of evictions_signal here.

@mattjmcnaughton

Contributor

commented Aug 14, 2019

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce

Test failures look like flake.

@fejta-bot


commented Aug 14, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@dashpole

Contributor

commented Aug 14, 2019

Signals can be found here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/api/types.go#L30. There are currently 7 of them.

/lgtm
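
To make the cardinality point concrete, here is a minimal stdlib-only sketch, not the code from this PR. The `evictionCounter` type is a hypothetical stand-in for a labeled Prometheus counter, and the signal names in `knownSignals` are an assumption recalled from the linked `types.go`; check that file for the authoritative list. Because the label value always comes from a fixed set of signals, the number of series cannot grow past the size of that set, no matter how many evictions happen.

```go
package main

import (
	"fmt"
	"sync"
)

// knownSignals is an assumed copy of the eviction signal names from
// pkg/kubelet/eviction/api/types.go; verify against the linked file.
var knownSignals = []string{
	"memory.available",
	"allocatableMemory.available",
	"nodefs.available",
	"nodefs.inodesFree",
	"imagefs.available",
	"imagefs.inodesFree",
	"pid.available",
}

// evictionCounter is a hypothetical stand-in for a labeled counter:
// one monotonically increasing value per eviction signal.
type evictionCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newEvictionCounter() *evictionCounter {
	return &evictionCounter{counts: make(map[string]uint64)}
}

// Inc records one eviction attributed to the given signal.
func (c *evictionCounter) Inc(signal string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[signal]++
}

// Get returns the number of evictions recorded for the given signal.
func (c *evictionCounter) Get(signal string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[signal]
}

// Series returns the number of distinct label values seen so far,
// which is the number of time series a scraper would store.
func (c *evictionCounter) Series() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.counts)
}

func main() {
	c := newEvictionCounter()
	// Simulate 1000 evictions, each attributed to one of the fixed signals.
	for i := 0; i < 1000; i++ {
		c.Inc(knownSignals[i%len(knownSignals)])
	}
	// The series count is bounded by len(knownSignals), not by eviction volume.
	fmt.Println("evictions recorded: 1000, series:", c.Series())
}
```

The design point is that the label is a property of the node-level threshold that was crossed, not of the evicted pod, so per-pod or per-namespace growth never enters the metric.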

@derekwaynecarr

Member

commented Aug 14, 2019

/hold

this makes sense to me. @brancz please ack if you think order of 10 possible signals is ok.

@brancz

Member

commented Aug 15, 2019

We have a defined set of values, which is most important, and 7 is perfectly fine I think.

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 3645041 into kubernetes:master Aug 15, 2019

23 checks passed

cla/linuxfoundation sjenning authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 15, 2019
