Add kubelet metrics for ephemeral containers #99000
/retest |
4da2b1b
to
f81009d
Compare
I was going to ask @ehashman for advice on these metrics, but I just found sig-instrumentation docs that make me think that I've got it approximately right. I'll go ahead and finish up my TODO to add tests, remove the WIP hold, and then submit it for review. |
f81009d
to
f71f45c
Compare
Let me know when this is ready for review and I'll TAL. |
314b071
to
437ec78
Compare
/retest |
Looks like timeouts that are unrelated to this change /retest |
/triage accepted |
@dashpole Thanks for the guidance here. No, I don't need to compare these two. For ephemeral containers I really only want to answer two questions:
The kubelet discards container type information, so the pod manager was the most convenient place to track the first. It's still unclear what we want to measure. I plan on bringing it up at tonight's sig-node meetup. |
I always treated the "pod manager" as a local cache of the pods in the apiserver, so it contains the "desired state" and is not much different from counting pods in the apiserver. However, I do agree with the point brought up in the meeting that having the pods in the apiserver doesn't necessarily mean that the kubelet has seen them:
So in theory, the "pod manager" does contain some fact about the actual state: "whether kubelet has seen the pods". However, it is not quite clear to me how useful it is to introduce metrics counting pods in the pod manager. Given that, I'm not sure adding metrics for managed_pods actually brings more value than simply counting objects on the apiserver side. The started containers/pods make more sense to me, because they expose more "actual state" from the kubelet, and I guess you are going to use them to measure "container creation rate" and "error rate"? Please note that once we add these metrics, they become part of the kubelet API and are hard to deprecate. People may start depending on them, and may complain that the metrics don't always match the actual pods in the apiserver, as happened before with running_pods (#99624). The initial problem we are trying to solve does make sense; can we limit the scope of this change to #97974? |
Waiting on changes per the above from @Random-Liu, thanks @verb for all the work you've done here :) |
Ah, thanks for pinging this. If I understand correctly, both @Random-Liu and @dchen1107 are in favor of changing the "managed {containers,pods}" gauge to be narrowly scoped for ephemeral containers, and @derekwaynecarr thought the additional information for all types would be useful. I would want metrics at every stage of an automation system. I agree that logs are better for debugging kubelet problems, but these metrics are for observing cluster state rather than debugging problems with the kubelet. Surfacing information on the kubelet's internal state, even if it is theoretically available from the API server, is useful. The API server may be using different feature flags or perhaps it's unreachable. I think @dchen1107's point was that solving this problem deserves a full design rather than a hasty solution to meet the PRR requirements for a single feature, so we should add a metric that's specific to that PRR. On the other hand, it seems strange to me to provide this information for ephemeral containers and withhold it for other types of containers. I can think of 3 different ways forward:
I like the idea of there being some mechanism of iterating on unstable metrics (as in the second), but I don't have time to work on that right now, so I'll just plan on the third option. |
This replaces the generic ManagedPod and ManagedContainer kubelet metrics with a gauge to track only ephemeral container usage.
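A minimal sketch of what a gauge scoped to ephemeral containers could compute, assuming the value is simply the number of ephemeral containers across the pods the kubelet currently knows about. The `Pod` type and `countEphemeralContainers` helper are hypothetical stand-ins for illustration, not the kubelet's actual types or the code in this PR.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a pod spec.
// It only carries the fields needed to illustrate the gauge.
type Pod struct {
	Name                string
	Containers          []string
	EphemeralContainers []string
}

// countEphemeralContainers returns the value a gauge would report:
// the total number of ephemeral containers across all given pods.
// Unlike a counter, this is recomputed from current state each time.
func countEphemeralContainers(pods []Pod) int {
	total := 0
	for _, p := range pods {
		total += len(p.EphemeralContainers)
	}
	return total
}

func main() {
	pods := []Pod{
		{Name: "web", Containers: []string{"nginx"}},
		{Name: "db", Containers: []string{"postgres"}, EphemeralContainers: []string{"debugger"}},
		{Name: "job", Containers: []string{"worker"}, EphemeralContainers: []string{"shell", "tcpdump"}},
	}
	fmt.Println("ephemeral containers under management:", countEphemeralContainers(pods))
}
```

Because a gauge like this is derived from current state, it can be rebuilt after a restart, which is not true of the creation counters.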
/retest |
@ehashman PTAL! Thanks 😄 |
/lgtm
@@ -431,6 +445,54 @@ var (
},
[]string{"container_state"},
)
// StartedPodsTotal is a counter that tracks pod sandbox creation operations
StartedPodsTotal = metrics.NewCounter(
Note to self: this one is NewCounter but the rest of these new metrics are NewCounterVec because this one doesn't have any additional labels (e.g. container type, error).
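To illustrate the distinction in the note above, here is a small standard-library-only sketch. `Counter` and `CounterVec` below are hypothetical toy types written for this example, not the `k8s.io/component-base/metrics` implementations, and the `container_type` label name and its values are assumptions. The point is only that an unlabeled counter is a single series, while a labeled one holds one series per distinct combination of label values.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Counter is a toy stand-in for an unlabeled counter: exactly one series.
type Counter struct {
	mu    sync.Mutex
	value float64
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Value returns the current count.
func (c *Counter) Value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

// CounterVec is a toy stand-in for a labeled counter: it lazily creates
// one Counter per distinct combination of label values.
type CounterVec struct {
	mu         sync.Mutex
	labelNames []string
	series     map[string]*Counter
}

// NewCounterVec creates a labeled counter with the given label names.
func NewCounterVec(labelNames ...string) *CounterVec {
	return &CounterVec{labelNames: labelNames, series: map[string]*Counter{}}
}

// WithLabelValues returns the series for the given label values,
// creating it on first use. The number of values must match the labels.
func (v *CounterVec) WithLabelValues(values ...string) *Counter {
	if len(values) != len(v.labelNames) {
		panic("label value count does not match label names")
	}
	key := strings.Join(values, "\xff")
	v.mu.Lock()
	defer v.mu.Unlock()
	c, ok := v.series[key]
	if !ok {
		c = &Counter{}
		v.series[key] = c
	}
	return c
}

// SeriesCount reports how many distinct label combinations have been seen.
func (v *CounterVec) SeriesCount() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.series)
}

func main() {
	startedPods := &Counter{}                            // no labels: a single series
	startedContainers := NewCounterVec("container_type") // assumed label name

	startedPods.Inc()
	startedContainers.WithLabelValues("container").Inc()
	startedContainers.WithLabelValues("ephemeral_container").Inc()
	startedContainers.WithLabelValues("ephemeral_container").Inc()

	fmt.Println("pods started:", startedPods.Value())
	fmt.Println("container series:", startedContainers.SeriesCount())
}
```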
/assign @klueska |
Hi @ehashman, I see you added this to the "Needs Approver" board. Is there a process for assigning the Approver or should I seek one out? Thanks! |
/assign @dchen1107 |
/lgtm Thanks for narrowing down the scope to ephemeral containers and renaming some of the metrics per the reviewers' and SIG Node's suggestions. Also, two reviewers from SIG Instrumentation were OK with the small overlap even before the scope was narrowed.
Discussed this at SIG Node a while back, and here is my take on this very topic: Logging and metrics serve different purposes for observability and debuggability. Metrics are usually measured as a time series, can be optimized for storage and longer retention of data, and make it easy to build dashboards that reflect historical trends and aggregate to daily or weekly frequency. This is why admins and SREs normally prefer to use metrics to detect abnormal states, while developers look into the logs for troubleshooting. Back to this particular PR, I have to admit that the metrics being introduced here might only be best effort. In a real environment, the kubelet might be restarted for various reasons. Once the kubelet is restarted, the count of container starts with errors would be lost, even though the currently running containers and pods can be regenerated from the current state. Just want to point this out. |
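One way a consumer of these counters might tolerate the restart behavior described above is to treat any drop in a counter's value as a reset to zero, similar in spirit to how Prometheus handles counter resets in `rate()` and `increase()`. The following is a hedged sketch: `increaseWithResets` is a hypothetical helper and the sample values are made up for illustration. Events that happened between the last scrape and the restart are still lost.

```go
package main

import "fmt"

// increaseWithResets sums the growth of a monotonically increasing counter
// across consecutive samples. A sample lower than its predecessor is treated
// as a process restart: the counter is assumed to have resumed from zero,
// so the whole new value counts as growth.
func increaseWithResets(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur >= prev {
			total += cur - prev
		} else {
			// Counter reset detected (for example, a kubelet restart).
			total += cur
		}
	}
	return total
}

func main() {
	// Made-up scrape values; the drop from 5 to 1 models a restart.
	samples := []float64{3, 5, 1, 4}
	fmt.Println("increase across samples:", increaseWithResets(samples))
}
```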
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, verb The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/milestone v1.22 |
What type of PR is this?
/kind feature
/sig node
/priority important-soon
What this PR does / why we need it:
This adds alpha metrics to the kubelet that track pods and containers under management and counters for container creation. This is part of the Ephemeral Containers (kubernetes/enhancements#277) PRR.
Which issue(s) this PR fixes:
Fixes #97974
Special notes for your reviewer:
As part of code cleanup, this slightly changes the format of log messages for particular container types. This isn't necessary and can be reverted, but using the same label in log messages and metrics seems cleaner and more supportable.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: