
Kubelet: add a metric to observe time since PLEG last seen #86251

Merged
merged 1 commit into kubernetes:master from bboreham:pleg-last-seen-metric on Jan 7, 2020

Conversation

@bboreham
Contributor

bboreham commented Dec 13, 2019

/sig node
/kind feature

What this PR does / why we need it:

Expose the measurement that kubelet uses to judge that "PLEG is unhealthy".
If we can observe the measurement growing then we can alert before the node goes unhealthy.

Note that the existing metrics PLEGRelistInterval and PLEGRelistDuration are not suitable for this, because when relist() gets stuck they are never updated.

Which issue(s) this PR fixes:

Not a fix, but relates to #45419, #72533, #77728, etc

Does this PR introduce a user-facing change?:

New metric kubelet_pleg_last_seen_seconds to aid diagnosis of PLEG not healthy issues.
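
As an illustration of the distinction the description draws, here is a rough Go sketch using the Prometheus client (not the kubelet's actual code; every name below is hypothetical) contrasting a duration metric that is only observed when relist() returns with a last-seen gauge that stays useful even while relist() is stuck:

package plegsketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative metrics only; the real kubelet definitions live in
// pkg/kubelet/metrics and use the k8s.io/component-base/metrics wrapper.
var (
	relistDurationSketch = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "sketch_pleg_relist_duration_seconds",
		Help: "Duration of relist(); a sample is recorded only when relist() returns.",
	})
	plegLastSeenSketch = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sketch_pleg_last_seen_seconds",
		Help: "Unix timestamp at which the PLEG was last seen active.",
	})
)

func init() {
	prometheus.MustRegister(relistDurationSketch, plegLastSeenSketch)
}

// relistSketch shows why duration/interval-style metrics cannot detect a hang:
// if doWork blocks forever, Observe is never reached, so the histogram simply
// stops receiving samples, which is indistinguishable from "all is quiet".
func relistSketch(doWork func()) {
	start := time.Now()
	doWork()
	relistDurationSketch.Observe(time.Since(start).Seconds())
}

// recordLastSeenSketch is updated on every health check instead, so an
// external system can compute "now minus last seen" and alert as that value
// grows, even while relist() is stuck.
func recordLastSeenSketch(lastRelist time.Time) {
	plegLastSeenSketch.Set(float64(lastRelist.Unix()))
}

With a gauge of this shape, the age derived at query time can trigger an alert before the kubelet itself declares PLEG unhealthy.
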
Contributor

mattjmcnaughton left a comment

Adding this metric certainly seems like it would provide monitoring benefit with respect to the issue you mentioned. Additionally, it seems like a low-cost metric (i.e. no risk of high cardinality, etc.). So LGTM!

Pretty sure the test failures are flakes - will trigger retries.

/lgtm

@mattjmcnaughton

Contributor

mattjmcnaughton commented Dec 14, 2019

/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-integration

/assign @yujuhong

@@ -187,6 +188,16 @@ var (
StabilityLevel: metrics.ALPHA,
},
)
// PLEGLastSeen is a Gauge that tracks the interval (in seconds) since the Kubelet's
// Pod Lifecycle Event Generator (PLEG) was last seen active.
PLEGLastSeen = metrics.NewGauge(

@dashpole

dashpole Dec 16, 2019

Contributor

It is generally preferable to report the time at which an event occurred, rather than the time since the event occurred. This makes it easier to deal with latency in monitoring pipelines: the time at which an event occurred is always correct, whereas the time since an event occurred is already outdated by the time it is read.

It also copes much better with bad kubelet health. If the kubelet stops responding to metric scrapes, Prometheus will continue to use the previously reported value for some time (usually a few minutes). Reporting the time at which an event occurred allows Prometheus to alert on such cases as well.
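
A minimal Go sketch of the two reporting styles described here, with the Prometheus client (illustrative names only, not kubelet code):

package timestampsketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Style 1: report the time since the event. The value is already stale
	// when scraped, and it freezes at its last reading if the kubelet stops
	// serving scrapes, so staleness is hard to tell apart from health.
	secondsSinceRelistSketch = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sketch_seconds_since_last_relist",
		Help: "Seconds elapsed since the last relist (the discouraged style).",
	})

	// Style 2: report the time at which the event happened. Whatever was
	// scraped is correct, and the consumer derives the age at query time
	// (e.g. time() - metric in PromQL), so the derived age keeps growing even
	// when the scrape target has gone quiet.
	lastRelistTimestampSketch = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sketch_last_relist_timestamp_seconds",
		Help: "Unix timestamp of the last relist (the preferred style).",
	})
)

func recordBothStylesSketch(lastRelist time.Time) {
	secondsSinceRelistSketch.Set(time.Since(lastRelist).Seconds())
	lastRelistTimestampSketch.Set(float64(lastRelist.Unix()))
	// At the event site itself, lastRelistTimestampSketch.SetToCurrentTime()
	// would achieve the same thing.
}
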

@dashpole

dashpole Dec 16, 2019

Contributor

See google/cadvisor#2305 for an example of where cAdvisor does this.

@mattjmcnaughton

mattjmcnaughton Dec 17, 2019

Contributor

Oooo interesting! Thank you - today I learned :)

@bboreham

bboreham Jan 3, 2020

Author Contributor

I have updated the code as you suggested.
I considered for a while whether the metric update should move to where relistTime is set, but decided against it: my objective is to provide a metric that gives the same answer the kubelet uses when deciding to go unhealthy.
The name remains last_seen rather than, say, last_relist, to emphasise this.
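
For readers following along, an approximate sketch of that placement (hypothetical type and field names, not the actual kubelet source): the gauge is updated on the same path as the health decision, reading the same relistTime, so the exported value can never disagree with the "PLEG is not healthy" judgement.

package healthsketch

import (
	"fmt"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type plegSketch struct {
	mu         sync.Mutex
	relistTime time.Time
	lastSeen   prometheus.Gauge // last-seen timestamp gauge, as sketched earlier
	threshold  time.Duration    // how stale the relist may get before "unhealthy"
}

func (p *plegSketch) getRelistTime() time.Time {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.relistTime
}

// Healthy sets the metric from the exact value it is about to judge.
func (p *plegSketch) Healthy() (bool, error) {
	relistTime := p.getRelistTime()
	if relistTime.IsZero() {
		return false, fmt.Errorf("pleg has yet to be successful")
	}
	// Report the timestamp of the event, not the elapsed time (per the review
	// feedback above).
	p.lastSeen.Set(float64(relistTime.Unix()))

	if elapsed := time.Since(relistTime); elapsed > p.threshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, p.threshold)
	}
	return true, nil
}
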

@dashpole

Contributor

dashpole commented Dec 16, 2019

/assign

Expose the measurement that kubelet uses to judge that "PLEG is
unhealthy". If we can observe the measurement growing then we can
alert before the node goes unhealthy.

Note that the existing metrics PLEGRelistInterval and
PLEGRelistDuration are poor for this, because when relist() gets
stuck they are never updated.

Signed-off-by: Bryan Boreham <bryan@weave.works>
@bboreham bboreham force-pushed the bboreham:pleg-last-seen-metric branch from ea0063b to cc0b3e8 Jan 3, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 3, 2020
@dashpole

Contributor

dashpole commented Jan 6, 2020

/retest
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 6, 2020
@bboreham

Contributor Author

bboreham commented Jan 7, 2020

/assign @yujuhong

@yujuhong

Member

yujuhong commented Jan 7, 2020

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 7, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bboreham, dashpole, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 49bc696 into kubernetes:master Jan 7, 2020
14 of 15 checks passed
tide: Not mergeable. Retesting: pull-kubernetes-kubemark-e2e-gce-big
cla/linuxfoundation: bboreham authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-kind: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 7, 2020