[KEP] Kubelet Resource Metrics Endpoint #726

dashpole · 2019-01-24T23:35:47Z

As described in kubernetes/kubernetes#68522, a long-term goal is to reduce the set of metrics provided by the kubelet to make monitoring.

Resource Metrics == Core Metrics. The term "Core" is overloaded, so I am trying to switch to using this terminology. This is about metrics for first-class resources, such as CPU and memory.

Feature Issue: #727. This is ideally targeted for 1.14, but obviously pending approval.

cc @kubernetes/sig-node-proposals
@kubernetes/sig-instrumentation-api-reviews

danielqsj · 2019-01-25T03:41:45Z

/cc

vikaschoudhary16 · 2019-01-25T17:37:27Z

@vikaschoudhary16

dashpole · 2019-01-25T17:39:06Z

xref: #727

WanLinghao · 2019-01-28T02:41:57Z

/cc

WanLinghao · 2019-01-28T02:54:13Z

keps/sig-node/kubelet-resource-metrics-endpoint.md

+For the purposes of this document, I will use the following definitions:
+
+* Resource Metrics: Metrics for the consumption of first-class resources (CPU, Memory, Ephemeral Storage) which are aggregated by the [Metrics Server](https://github.com/kubernetes-incubator/metrics-server#kubernetes-metrics-server), and served by the [Resource Metrics API](https://github.com/kubernetes/metrics#resource-metrics-api)
+* Monitoring Metrics: Metrics for overvability and introspection of the cluster, which are used by end-users, operators, devs, etc. 


overvability->observability?

DirectXMan12

minor nit inline, otherwise 👍

brancz · 2019-01-30T21:34:35Z

/lgtm
/approve

from sig-instrumentation side. Thanks for the detailed write up!

dashpole · 2019-02-04T21:31:34Z

cc @tallclair

derekwaynecarr · 2019-02-04T22:43:48Z

looks good to me as well.

/approve

tallclair

This is a very well written KEP, thanks!

tallclair · 2019-02-05T19:14:25Z

keps/sig-node/kubelet-resource-metrics-endpoint.md

+
+Name: node_memory_working_set_bytes
+Labels:
+```


I like that the endpoint is versioned, and I like this simple expression of the API (though it could use a detailed description along with each). This is much easier to read than the codified representation here: kubernetes/kubernetes@master...dashpole:prometheus_core_metrics#diff-f5670671376f2de630ba50525fc86d77R27

Is there a way we can format the code so there is a canonical versioned file that resembles this, the equivalent of types.go or proto definition? In other words, I want a (source of truth) file that I can look at that easily expresses what is included in the API at a given version. This might be as simple as coming up with a nice way of formatting the metric registration, and separating that from the metric usage.

That sounds like a good idea. I'll incorporate that into the example PR at some point.
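For illustration only (this is not the layout used in the example PR): a minimal sketch of what a "source of truth" file could look like when metric registration is kept apart from metric usage. The metric names and the container labels are taken from the KEP excerpts quoted in this review; `MetricSpec`, `RESOURCE_METRICS_V1ALPHA1`, `describe`, the metric kinds, and the help strings are hypothetical and introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class MetricSpec:
    """Declarative description of one metric in a versioned endpoint."""
    name: str
    kind: str  # "counter" or "gauge" (assumed vocabulary)
    labels: Tuple[str, ...]
    help: str


# Hypothetical registry for one endpoint version. Kinds and help strings
# are illustrative assumptions, not text from the KEP.
RESOURCE_METRICS_V1ALPHA1 = (
    MetricSpec(
        name="container_cpu_usage_seconds_total",
        kind="counter",
        labels=("container", "pod", "namespace"),
        help="Cumulative CPU time consumed by the container, in seconds.",
    ),
    MetricSpec(
        name="node_memory_working_set_bytes",
        kind="gauge",
        labels=(),  # label set is not shown in the quoted excerpt; left empty
        help="Current working set of the node, in bytes.",
    ),
)


def describe(specs: Iterable[MetricSpec]) -> str:
    """Render the registry as the simple Name/Labels listing used in the KEP."""
    lines = []
    for spec in specs:
        lines.append(f"Name: {spec.name}")
        lines.append(f"Labels: {', '.join(spec.labels)}".rstrip())
    return "\n".join(lines)
```

Keeping the definitions in one immutable tuple per version means a reader can see what a given version exposes without reading the collection code, which is the property asked for above.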

tallclair · 2019-02-05T19:25:25Z

keps/sig-node/kubelet-resource-metrics-endpoint.md

+Alpha:
+
+- [] Implement and test the kubelet resource metrics endpoint as described above
+- [] Modify the metrics server to consume the kubelet resource metrics endpoint


Is the metrics server alpha/beta or GA? How will the endpoint be configured?

I can't find whether it is alpha/beta/GA, but I believe it is beta since it is enabled by default in Kubernetes clusters.

What do you mean by "how will the endpoint be configured"? Are you asking about something in the metrics server?

Yeah. I imagine the metrics server will need to support both summary metric and resource metrics ingestion for a while. I'm assuming this will be some sort of flag or configuration parameter. For heterogeneous clusters with older nodes, I suppose the metrics server will just continue to use the summary endpoint until all nodes support the required version of resource metrics. Just thinking out loud, not sure you need to add this.

The current plan is just to switch over once all nodes are expected to have the new endpoint. I know we only support a two-version skew between node and master versions, but I'm not sure how lenient we want to be about enabling a larger skew by providing config to revert to the older API. It would just be easier for them to run an older metrics server than to add extra configuration...
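To make the fallback idea floated in this thread concrete (it is not the adopted plan, which is simply to switch over), here is a rough sketch of a metrics server choosing an endpoint from the kubelet versions it sees. The function names, the two path constants, and the minimum version are assumptions made for this sketch and are not confirmed anywhere in this conversation.

```python
import re
from typing import Iterable, Tuple

# Assumed values for illustration only.
SUMMARY_PATH = "/stats/summary"
RESOURCE_METRICS_PATH = "/metrics/resource/v1alpha1"
MIN_RESOURCE_METRICS_VERSION = (1, 14)


def parse_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a kubelet version such as 'v1.14.2'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def choose_endpoint(kubelet_versions: Iterable[str]) -> str:
    """Use the resource metrics endpoint only if every node supports it."""
    versions = [parse_minor(v) for v in kubelet_versions]
    if versions and all(v >= MIN_RESOURCE_METRICS_VERSION for v in versions):
        return RESOURCE_METRICS_PATH
    # Any older node (or no known nodes) keeps the whole cluster on summary.
    return SUMMARY_PATH
```

A single cluster-wide choice keeps the server simple; a per-node choice would tolerate more skew at the cost of running two ingestion paths side by side.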

tallclair · 2019-02-05T19:28:45Z

keps/sig-node/kubelet-resource-metrics-endpoint.md

+
+```
+Name: container_cpu_usage_seconds_total
+Labels: container, pod, namespace


container name? pod name or UID?

These are all names. It follows the instrumentation guidelines by labeling with `pod` rather than `pod_name`.

Got it. See the last point in the document re: UUIDs. Do we need to be able to separate metrics across pod recreation? If so, a UUID should be included.

The docs suggest adding an info metric, kube_pod_info, which includes details such as the UID, and then joining that metric with the container metrics exposed here.
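As a rough illustration of that join (a sketch, not how any particular consumer implements it): container samples are matched to info samples on the `namespace` and `pod` labels, and the UID is copied across. The sample layout, the `join_pod_uid` helper, and the assumption that kube_pod_info carries `namespace`, `pod`, and `uid` labels are all hypothetical here.

```python
from typing import Dict, List

Sample = Dict[str, object]  # {"labels": {label: value}, "value": float}


def join_pod_uid(container_samples: List[Sample],
                 pod_info_samples: List[Sample]) -> List[Sample]:
    """Attach each pod's UID to its container samples.

    Joins on the (namespace, pod) label pair, like a one-to-many join
    between container metrics and an info metric.
    """
    uid_by_pod = {
        (s["labels"]["namespace"], s["labels"]["pod"]): s["labels"]["uid"]
        for s in pod_info_samples
    }
    joined = []
    for sample in container_samples:
        key = (sample["labels"]["namespace"], sample["labels"]["pod"])
        if key not in uid_by_pod:
            continue  # no info sample for this pod; dropped, as in an inner join
        labels = dict(sample["labels"], uid=uid_by_pod[key])
        joined.append({"labels": labels, "value": sample["value"]})
    return joined
```

Because the UID changes when a pod is recreated under the same name, the joined series can be separated across recreation even though the resource metrics themselves only carry names.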

tallclair · 2019-02-05T22:40:32Z

/lgtm

nikhita · 2019-02-06T15:36:11Z

/test pull-enhancements-verify

k8s-ci-robot · 2019-02-06T15:37:06Z

@dashpole: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-enhancements-verify | `7dbc3b2` | link | `/test pull-enhancements-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

tallclair · 2019-02-07T20:13:44Z

/lgtm

k8s-ci-robot · 2019-02-07T20:13:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, dashpole, derekwaynecarr, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/sig-node/OWNERS~~ [derekwaynecarr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

krzyzacy · 2019-02-09T05:39:33Z

/close

krzyzacy · 2019-02-09T05:39:38Z

/reopen

k8s-ci-robot · 2019-02-09T05:39:39Z

@krzyzacy: Closed this PR.

In response to this:

/close

krzyzacy · 2019-02-09T05:39:59Z

/reopen

k8s-ci-robot · 2019-02-09T05:40:00Z

@krzyzacy: Reopened this PR.

In response to this:

/reopen

cjwagner · 2019-02-10T21:44:50Z

The tide/squash label is preventing this PR from merging because squash merging is forbidden in this GitHub repo. Please remove the label to unblock tide on this repo or enable squash merging for the repo. I'll add a hold to this PR so that other PRs in this repo can merge in the meantime.
/hold

cjwagner · 2019-02-11T00:55:53Z

Bumping to update the stale Tide status.
Based on the Tide logs, it appears that the GitHub search index took so long to update that this PR fell outside the Tide status controller's search window (even with a hefty window overlap). This is very likely related to the search indexing failures/corruption that we have been seeing with increasing frequency. It looks like GitHub's background indexing jobs are taking a long time to complete or timing out altogether, resulting in invalid search results 😞

dashpole · 2019-02-11T18:26:54Z

/hold cancel

k8s-ci-robot requested review from dchen1107 and derekwaynecarr January 24, 2019 23:35

k8s-ci-robot added the kind/kep, sig/architecture, size/L, and sig/pm labels Jan 24, 2019

dashpole mentioned this pull request Jan 24, 2019

Kubelet Resource Metrics Endpoint #727

Closed

10 tasks

k8s-ci-robot requested a review from danielqsj January 25, 2019 03:41

k8s-ci-robot requested a review from WanLinghao January 28, 2019 02:41

WanLinghao reviewed Jan 28, 2019

DirectXMan12 suggested changes Jan 29, 2019

k8s-ci-robot assigned brancz Jan 30, 2019

k8s-ci-robot added the lgtm label Jan 30, 2019

k8s-ci-robot removed the lgtm label Feb 4, 2019

k8s-ci-robot added the approved label Feb 4, 2019

tallclair reviewed Feb 5, 2019

k8s-ci-robot added the lgtm label Feb 5, 2019

tallclair added the tide/merge-method-squash label Feb 5, 2019

dashpole force-pushed the kubelet_resource_metrics branch from a0b1818 to 7dbc3b2 Compare February 5, 2019 22:51

k8s-ci-robot removed the lgtm label Feb 5, 2019

dashpole mentioned this pull request Feb 5, 2019

Add spelling verification to the enhancements repo #745

Closed

Add Kubelet Resource Metrics Endpoint KEP

d6d87a8

dashpole force-pushed the kubelet_resource_metrics branch from 73a0600 to d6d87a8 Compare February 7, 2019 18:21

k8s-ci-robot added the lgtm label Feb 7, 2019

k8s-ci-robot closed this Feb 9, 2019

k8s-ci-robot reopened this Feb 9, 2019

k8s-ci-robot added the do-not-merge/hold label Feb 10, 2019

tallclair removed the tide/merge-method-squash label Feb 11, 2019

k8s-ci-robot removed the do-not-merge/hold label Feb 11, 2019

k8s-ci-robot merged commit b683294 into kubernetes:master Feb 11, 2019

dashpole mentioned this pull request Feb 12, 2019

Add kubelet resource metrics v1alpha1 endpoint kubernetes/kubernetes#73946

Merged

dashpole deleted the kubelet_resource_metrics branch February 23, 2019 00:47

justaugustus removed the sig/pm label Apr 19, 2020

jan--f mentioned this pull request Sep 5, 2022

OCPBUGS-1364: Dedicated kubelet ServiceMonitor for prometheus-adapter openshift/cluster-monitoring-operator#1752

Merged

2 tasks
