Add kubelet metrics for ephemeral containers #99000

verb · 2021-02-11T13:47:56Z

What type of PR is this?

/kind feature
/sig node
/priority important-soon

What this PR does / why we need it:

This adds alpha metrics to the kubelet: gauges that track pods and containers under management, and counters for container creation. This is part of the Ephemeral Containers (kubernetes/enhancements#277) PRR.

Which issue(s) this PR fixes:

Fixes #97974

Special notes for your reviewer:

As part of code cleanup, this slightly changes the format of log messages for particular container types. This isn't necessary and can be reverted, but using the same label in log messages and in metrics seems cleaner and more supportable.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://git.k8s.io/enhancements/keps/sig-node/277-ephemeral-containers

verb · 2021-02-11T15:44:01Z

/retest

verb · 2021-02-11T17:06:37Z

I was going to ask @ehashman for advice on these metrics, but I just found sig-instrumentation docs that make me think that I've got it approximately right.

I'll go ahead and finish up my TODO to add tests, remove the WIP hold, and then submit it for review.

ehashman · 2021-02-11T23:46:08Z

Let me know when this is ready for review and I'll TAL.

verb · 2021-02-12T15:25:47Z

/retest

verb · 2021-02-12T15:58:12Z

Looks like these are timeouts unrelated to this change.

/retest

kikisdeliveryservice · 2021-02-12T17:42:04Z

/triage accepted

verb · 2021-05-25T12:47:43Z

@dashpole Thanks for the guidance here. No, I don't need to compare these two. For ephemeral containers I really only want to answer two questions:

1. How many ephemeral containers are running on this node?
2. How many ephemeral containers has this node started, and were there errors?

The kubelet discards container type information, so the pod manager was the most convenient place to track the first. It's still unclear what we want to measure. I plan on bringing it up at tonight's sig-node meetup.

Random-Liu · 2021-05-25T18:18:57Z

I have always treated the "pod manager" as a local cache of the pods in the apiserver. It therefore holds the "desired state", and counting its pods is not much different from counting pods in the apiserver.

However, I do agree with the point brought up in the meeting that having the pods in the apiserver doesn't necessarily mean that the kubelet has seen them:

1. There might be a bug that keeps the kubelet from seeing a pod;
2. There might be a delay before the kubelet sees the pod.

So in theory, the "pod manager" does contain some fact about the actual state - "whether kubelet has seen the pods".

However, it is not clear to me how useful it is to introduce metrics that count pods in the pod manager.
For case 1), if it happens, it is a bug we should troubleshoot. I think the pod events/status and the kubelet log are a more useful indication of the actual issue than comparing the pod count between the apiserver and the managed pods.
For case 2), if the delay is huge, it is also a bug we should troubleshoot, and the pod events/status and kubelet log can be used to debug it; just counting pods may not be that useful. If the delay is minimal, the metric is almost the same as the count on the apiserver side.

Given that, I'm not sure adding metrics for managed_pods actually brings more value than simply counting objects on the apiserver side.

The started containers/pods make more sense to me, because it exposes more "actual state" from kubelet, and I guess you are going to use it to measure "container creation rate" and "error rate"?

Please note that once we add these metrics, they become part of the kubelet API and are hard to deprecate. People may start depending on them and may complain that the metrics don't always match the actual pods in the apiserver, as happened before with running_pods (#99624).

The initial problem we are trying to solve does make sense. Can we limit the scope of this change to #97974?

ehashman · 2021-06-03T21:35:14Z

Waiting on changes per the above from @Random-Liu, thanks @verb for all the work you've done here :)

verb · 2021-06-04T12:32:28Z

Ah, thanks for pinging this. If I understand correctly, both @Random-Liu and @dchen1107 are in favor of changing the "managed {containers,pods}" gauge to be narrowly scoped for ephemeral containers, and @derekwaynecarr thought the additional information for all types would be useful.

I would want metrics at every stage of an automation system. I agree that logs are better for debugging kubelet problems, but these metrics are for observing cluster state rather than debugging problems with the kubelet. Surfacing information on the kubelet's internal state, even if it is theoretically available from the API server, is useful. The API server may be using different feature flags or perhaps it's unreachable.

I think @dchen1107's point was that solving this problem deserves a full design rather than a hasty solution to meet the PRR requirements for a single feature, so we should add a metric that's specific to that PRR. On the other hand, it seems strange to me to provide this information for ephemeral containers and withhold it for other types of containers.

I can think of 3 different ways forward:

1. Add container_type to the existing running_containers metrics. This is more monitoring driving design, so let's avoid it.
2. All of these metrics are marked alpha. We could continue to add alpha metrics and that way, when someone has time to design the kubelet metrics API, there's a ready list of metrics that people actually want with no commitment to support them indefinitely.
3. Add a managed_ephemeral_containers metric, as suggested by @Random-Liu.

I like the idea of there being some mechanism of iterating on unstable metrics (as in the second), but I don't have time to work on that right now, so I'll just plan on the third option.

This replaces the generic ManagedPod and ManagedContainer kubelet metrics with a gauge to track only ephemeral container usage.

verb · 2021-06-16T09:00:52Z

/retest

verb · 2021-06-16T09:52:30Z

@ehashman PTAL! Thanks 😄

ehashman

/lgtm

ehashman · 2021-06-23T21:32:07Z

pkg/kubelet/metrics/metrics.go

@@ -431,6 +445,54 @@ var (
 		},
 		[]string{"container_state"},
 	)
+	// StartedPodsTotal is a counter that tracks pod sandbox creation operations
+	StartedPodsTotal = metrics.NewCounter(


Note to self: this one is NewCounter but the rest of these new metrics are NewCounterVec because this one doesn't have any additional labels (e.g. container type, error).
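
To make the distinction in the note above concrete, here is a toy model in plain Go using only the standard library. It is not the k8s.io/component-base/metrics API used in the PR: the `Counter` and `CounterVec` types below are hypothetical stand-ins, and the label values are illustrative. An unlabeled counter is a single series, while a vector holds one child counter per combination of label values.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Counter is a hypothetical unlabeled counter: exactly one series.
type Counter struct {
	mu    sync.Mutex
	value float64
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Value returns the current count.
func (c *Counter) Value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

// CounterVec is a hypothetical labeled counter: one child Counter per
// distinct combination of label values.
type CounterVec struct {
	mu         sync.Mutex
	labelNames []string
	series     map[string]*Counter
}

// NewCounterVec creates a vector with the given label names.
func NewCounterVec(labelNames ...string) *CounterVec {
	return &CounterVec{labelNames: labelNames, series: map[string]*Counter{}}
}

// WithLabelValues returns the child for these label values, creating it
// on first use. It panics if the number of values is wrong.
func (v *CounterVec) WithLabelValues(values ...string) *Counter {
	if len(values) != len(v.labelNames) {
		panic("label value count does not match label name count")
	}
	key := strings.Join(values, "\x00")
	v.mu.Lock()
	defer v.mu.Unlock()
	c, ok := v.series[key]
	if !ok {
		c = &Counter{}
		v.series[key] = c
	}
	return c
}

func main() {
	startedPods := &Counter{}                              // no labels needed
	startedContainers := NewCounterVec("container_type") // split by type

	startedPods.Inc()
	startedContainers.WithLabelValues("container").Inc()
	startedContainers.WithLabelValues("ephemeral_container").Inc()

	fmt.Println("pods:", startedPods.Value())
	fmt.Println("ephemeral:", startedContainers.WithLabelValues("ephemeral_container").Value())
}
```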

verb · 2021-06-25T15:01:46Z

/assign @klueska

verb · 2021-06-29T09:53:13Z

Hi @ehashman, I see you added this to the "Needs Approver" board. Is there a process for assigning the Approver or should I seek one out? Thanks!

verb · 2021-07-05T14:42:07Z

/assign @dchen1107

dchen1107 · 2021-07-08T18:30:02Z

/lgtm
/approve

Thanks for narrowing down the scope to ephemeral containers and renaming some of the metrics per the reviewers' and SIG Node's suggestions. Also, two reviewers from SIG Instrumentation were OK with the small overlap even before the narrowing.

> However, it is not quite clear to me how useful it is to introduce metrics counting pods in the pod manager.
> For case 1), if it happens it is a bug we should troubleshoot, I think the pod events/status and kubelet log is a more useful indication of the actual issue, than comparing the pod number between apiserver and managed pods.
> For case 2), if the delay is huge, it is also a bug we should troubleshoot, and the pod events/status and kubelet log can be used to debug the issue, just counting the number may not be that useful; if the delay is minimal, the metrics is almost the same with the counting on the apiserver side.

We discussed this at SIG Node a while back, and here is my take on this topic: logging and metrics serve different purposes for observability and debuggability. Metrics are usually measured as time series. They can be optimized for storage and longer data retention, make it easy to build dashboards that reflect historical trends, and can be aggregated at daily or weekly frequency. This is why admins and SREs normally prefer to use metrics to detect abnormal states, while developers look into the logs for troubleshooting.

Back to this particular PR: I have to admit that the metrics being introduced here might be best effort. In a real environment, the kubelet might be restarted for various reasons. Once the kubelet is restarted, the count of container starts with errors would be lost, even though the currently running containers and pods can be regenerated from the current state. I just want to point this out.
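
The asymmetry described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `Pod` type and `CountRunningEphemeral` function are hypothetical and much simpler than anything in the kubelet, and the container names are made up. A gauge derived from current state can be recomputed at any time, while a counter held in process memory starts again from zero after a restart.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down view of one pod on the node.
type Pod struct {
	EphemeralContainerNames []string        // names listed in the pod spec
	Running                 map[string]bool // container name -> running now
}

// CountRunningEphemeral derives the gauge value purely from current
// state, so it gives the same answer before and after a restart.
func CountRunningEphemeral(pods []Pod) int {
	n := 0
	for _, p := range pods {
		for _, name := range p.EphemeralContainerNames {
			if p.Running[name] {
				n++
			}
		}
	}
	return n
}

func main() {
	// State as it might be observed right after a restart.
	pods := []Pod{
		{
			EphemeralContainerNames: []string{"debugger"},
			Running:                 map[string]bool{"app": true, "debugger": true},
		},
		{
			EphemeralContainerNames: []string{"shell"},
			Running:                 map[string]bool{"app": true}, // shell has exited
		},
		{Running: map[string]bool{"app": true}}, // no ephemeral containers
	}

	// A counter lives only in process memory, so a fresh process knows
	// nothing about starts or errors that happened before the restart.
	startedSinceRestart := 0

	fmt.Println("running ephemeral containers (gauge):", CountRunningEphemeral(pods))
	fmt.Println("started since restart (counter):", startedSinceRestart)
}
```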

k8s-ci-robot · 2021-07-08T18:30:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, verb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [dchen1107]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ehashman · 2021-07-08T19:27:47Z

/milestone v1.22

k8s-ci-robot requested review from Random-Liu and yifan-gu February 11, 2021 13:49

verb force-pushed the 1.21-kubelet-metrics branch from 4da2b1b to f81009d Compare February 11, 2021 16:46

verb force-pushed the 1.21-kubelet-metrics branch from f81009d to f71f45c Compare February 11, 2021 17:09

ehashman added this to Triage in SIG Node PR Triage Feb 11, 2021

ehashman moved this from Triage to Waiting on Author in SIG Node PR Triage Feb 11, 2021

verb force-pushed the 1.21-kubelet-metrics branch 2 times, most recently from 314b071 to 437ec78 Compare February 12, 2021 13:13

verb changed the title ~~WIP: Add kubelet managed pod metrics~~ Add kubelet managed pod metrics Feb 12, 2021

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 12, 2021

kikisdeliveryservice added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 12, 2021

k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 12, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 12, 2021

ehashman moved this from Needs Reviewer to Waiting on Author in SIG Node PR Triage Jun 3, 2021

Remove ManagedPod,ManagedContainer metrics

30d2ad5

This replaces the generic ManagedPod and ManagedContainer kubelet metrics with a gauge to track only ephemeral container usage.

verb changed the title ~~Add kubelet managed pod metrics~~ WIP: Add kubelet metrics for ephemeral containers Jun 15, 2021

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2021

verb changed the title ~~WIP: Add kubelet metrics for ephemeral containers~~ Add kubelet metrics for ephemeral containers Jun 16, 2021

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2021

ehashman moved this from Waiting on Author to Needs Reviewer in SIG Node PR Triage Jun 23, 2021

ehashman reviewed Jun 23, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2021

ehashman moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Jun 23, 2021

k8s-ci-robot assigned klueska Jun 25, 2021

k8s-ci-robot assigned dchen1107 Jul 5, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2021

ehashman moved this from Needs Approver to Done in SIG Node PR Triage Jul 8, 2021

k8s-ci-robot added this to the v1.22 milestone Jul 8, 2021

k8s-ci-robot merged commit 7c84064 into kubernetes:master Jul 8, 2021

This was referenced Oct 5, 2021

Promote Ephemeral Containers to beta and enable by default #98808

Closed

Promote EphemeralContainers to beta #105405

Merged
