[KEP] Kubelet Resource Metrics Endpoint #726
/cc
xref: #727
/cc
For the purposes of this document, I will use the following definitions:

* Resource Metrics: Metrics for the consumption of first-class resources (CPU, Memory, Ephemeral Storage) which are aggregated by the [Metrics Server](https://github.com/kubernetes-incubator/metrics-server#kubernetes-metrics-server), and served by the [Resource Metrics API](https://github.com/kubernetes/metrics#resource-metrics-api)
* Monitoring Metrics: Metrics for overvability and introspection of the cluster, which are used by end-users, operators, devs, etc.
overvability->observability?
fixed
minor nit inline, otherwise 👍
/lgtm from sig-instrumentation side. Thanks for the detailed write-up!
cc @tallclair
looks good to me as well. /approve
This is a very well written KEP, thanks!

```
Name: node_memory_working_set_bytes
Labels:
```
I like that the endpoint is versioned, and I like this simple expression of the API (though it could use a detailed description along with each). This is much easier to read than the codified representation here: kubernetes/kubernetes@master...dashpole:prometheus_core_metrics#diff-f5670671376f2de630ba50525fc86d77R27
Is there a way we can format the code so there is a canonical versioned file that resembles this, the equivalent of types.go or proto definition? In other words, I want a (source of truth) file that I can look at that easily expresses what is included in the API at a given version. This might be as simple as coming up with a nice way of formatting the metric registration, and separating that from the metric usage.
That sounds like a good idea. I'll incorporate that into the example PR at some point.
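One possible shape for such a source-of-truth file, sketched in Python purely for illustration (the kubelet itself is written in Go). The `MetricSpec` type, the `RESOURCE_METRICS` table, the `describe` helper, and the help strings are hypothetical and not taken from the KEP; only the two metric names and their labels come from the snippets quoted in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Declarative description of one metric exposed by the endpoint."""
    name: str
    labels: tuple
    help: str  # illustrative description, not KEP wording


# Hypothetical canonical table for one API version: definitions only, no usage.
RESOURCE_METRICS = (
    MetricSpec("node_memory_working_set_bytes", (),
               "Illustrative help text: node working set, in bytes"),
    MetricSpec("container_cpu_usage_seconds_total",
               ("container", "pod", "namespace"),
               "Illustrative help text: cumulative container CPU time, in seconds"),
)


def describe(specs):
    """Render the table in the same Name/Labels layout used in the KEP text."""
    lines = []
    for spec in specs:
        lines.append(f"Name: {spec.name}")
        lines.append(f"Labels: {', '.join(spec.labels)}".rstrip())
    return "\n".join(lines)


print(describe(RESOURCE_METRICS))
```

Keeping the table free of collection logic is what would let a reader see at a glance what a given version of the API contains; the code that records values would look metrics up by name instead of declaring them inline.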
Alpha:

- [] Implement and test the kubelet resource metrics endpoint as described above
- [] Modify the metrics server to consume the kubelet resource metrics endpoint
Is the metrics server alpha/beta or GA? How will the endpoint be configured?
I can't find whether it is alpha/beta/GA, but I believe it is beta since it is enabled by default in Kubernetes clusters.
What do you mean by "how will the endpoint be configured"? Are you asking about something in the metrics server?
Yeah. I imagine the metrics server will need to support ingesting both summary metrics and resource metrics for a while. I'm assuming this will be some sort of flag or configuration parameter. For heterogeneous clusters with older nodes, I suppose the metrics server will just continue to use the summary endpoint until all nodes support the required version of resource metrics. Just thinking out loud, not sure you need to add this.
The current plan is just to switch over once all nodes are expected to have the new endpoint. I know we only support a two-version skew between node and master versions, but I'm not sure how lenient we want to be on enabling a larger skew by providing config to revert to the older API. It would just be easier for them to run an older metrics server than to add extra configuration...
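The switch-over condition discussed in this thread can be sketched as follows. This is only an illustration in Python, not the metrics server's actual logic: `parse_major_minor`, `choose_endpoint`, the `"resource"`/`"summary"` return values, and the 1.14 threshold are all assumptions made for the example.

```python
def parse_major_minor(version):
    """Return (major, minor) from a kubelet version string such as 'v1.14.2'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def choose_endpoint(node_versions, min_version=(1, 14)):
    """Pick which kubelet endpoint to scrape for the whole cluster.

    Returns "resource" only when every node is at least min_version
    (a placeholder threshold); otherwise falls back to "summary".
    """
    if all(parse_major_minor(v) >= min_version for v in node_versions):
        return "resource"
    return "summary"


# A cluster with one older node keeps using the summary endpoint.
print(choose_endpoint(["v1.13.4", "v1.14.0"]))
# Once every node is new enough, the resource metrics endpoint is used.
print(choose_endpoint(["v1.14.0", "v1.15.1"]))
```

A per-cluster decision like this is the simplest form; the alternative raised above, running an older metrics server against older nodes, needs no such check at all.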

```
Name: container_cpu_usage_seconds_total
Labels: container, pod, namespace
```
container name? pod name or UID?
These are all names. It is following the instrumentation guidelines by labeling using `pod`, rather than `pod_name`.
Got it. See the last point in the document re: UUIDs. Do we need to be able to separate metrics across pod recreation? If so, a UUID should be included.
The docs suggest adding an info metric, `kube_pod_info`, which includes lots of details, such as UID, and then performing a join between that metric and the container metrics exposed here.
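To make the join concrete, here is a minimal Python sketch of attaching a UID from an info metric to container samples that share the same `namespace` and `pod` labels. The sample values, the `uid` label name, and the `join_on_pod` helper are assumptions for illustration only; in practice this kind of join would more likely be written in the query language of the monitoring system.

```python
# Hypothetical samples as (labels, value) pairs; the numbers are made up.
container_cpu_usage_seconds_total = [
    ({"container": "app", "pod": "web-0", "namespace": "default"}, 12.5),
    ({"container": "sidecar", "pod": "web-0", "namespace": "default"}, 1.25),
]
kube_pod_info = [
    ({"pod": "web-0", "namespace": "default", "uid": "uid-1234"}, 1),
]


def join_on_pod(samples, info_samples, extra_label="uid"):
    """Copy extra_label from the info metric onto samples with the same namespace and pod."""
    info_by_pod = {
        (labels["namespace"], labels["pod"]): labels[extra_label]
        for labels, _ in info_samples
    }
    joined = []
    for labels, value in samples:
        key = (labels["namespace"], labels["pod"])
        if key in info_by_pod:  # samples without a matching info series are dropped
            joined.append(({**labels, extra_label: info_by_pod[key]}, value))
    return joined


for labels, value in join_on_pod(container_cpu_usage_seconds_total, kube_pod_info):
    print(labels, value)
```

Because the UID lives only on the info metric, the resource metrics themselves can keep the small `container`, `pod`, `namespace` label set while still allowing series to be separated across pod recreation when needed.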
/lgtm
a0b1818 to 7dbc3b2
/test pull-enhancements-verify
@dashpole: The following test failed, say
73a0600 to d6d87a8
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: brancz, dashpole, derekwaynecarr, tallclair
/close
/reopen
@krzyzacy: Closed this PR.
/reopen
@krzyzacy: Reopened this PR.
Bumping to update the stale Tide status.
/hold cancel
As described in kubernetes/kubernetes#68522, a long-term goal is to reduce the set of metrics provided by the kubelet to make monitoring.
Resource Metrics == Core Metrics. The term "Core" is overloaded, so I am trying to switch to using this terminology. This is about metrics for first-class resources, such as CPU and memory.
Feature Issue: #727. This is ideally targeted for 1.14, but obviously pending approval.
cc @kubernetes/sig-node-proposals
@kubernetes/sig-instrumentation-api-reviews