Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP] Kubelet Resource Metrics Endpoint #726

Merged
merged 1 commit into from
Feb 11, 2019

Conversation

dashpole
Copy link
Contributor

@dashpole dashpole commented Jan 24, 2019

As described in kubernetes/kubernetes#68522, a long-term goal is to reduce the set of metrics provided by the kubelet to make monitoring.

Resource Metrics == Core Metrics. The term "Core" is overloaded, so I am trying to switch to using this terminology. This is about metrics for first-class resources, such as cpu, and memory.

Feature Issue: #727. This is ideally targeted for 1.14, but obviously pending approval.

cc @kubernetes/sig-node-proposals
@kubernetes/sig-instrumentation-api-reviews

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/design Categorizes issue or PR as related to design. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jan 24, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/pm labels Jan 24, 2019
@dashpole dashpole mentioned this pull request Jan 24, 2019
10 tasks
@danielqsj
Copy link
Contributor

/cc

@vikaschoudhary16
Copy link

@vikaschoudhary16

@dashpole
Copy link
Contributor Author

xref: #727

@WanLinghao
Copy link

/cc

For the purposes of this document, I will use the following definitions:

* Resource Metrics: Metrics for the consumption of first-class resources (CPU, Memory, Ephemeral Storage) which are aggregated by the [Metrics Server](https://github.com/kubernetes-incubator/metrics-server#kubernetes-metrics-server), and served by the [Resource Metrics API](https://github.com/kubernetes/metrics#resource-metrics-api)
* Monitoring Metrics: Metrics for overvability and introspection of the cluster, which are used by end-users, operators, devs, etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overvability->observability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@DirectXMan12 DirectXMan12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit inline, otherwise 👍

keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved
@brancz
Copy link
Member

brancz commented Jan 30, 2019

/lgtm
/approve

from sig-instrumentation side. Thanks for the detailed write up!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2019
@dashpole
Copy link
Contributor Author

dashpole commented Feb 4, 2019

cc @tallclair

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2019
@derekwaynecarr
Copy link
Member

looks good to me as well.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2019
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very well written KEP, thanks!


Name: node_memory_working_set_bytes
Labels:
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that the endpoint is versioned, and I like this simple expression of the API (though it could use a detailed description along with each). This is much easier to read than the codified representation here: kubernetes/kubernetes@master...dashpole:prometheus_core_metrics#diff-f5670671376f2de630ba50525fc86d77R27

Is there a way we can format the code so there is a canonical versioned file that resembles this, the equivalent of types.go or proto definition? In other words, I want a (source of truth) file that I can look at that easily expresses what is included in the API at a given version. This might be as simple as coming up with a nice way of formatting the metric registration, and separating that from the metric usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea. Ill incorporate that into the example PR at some point.

keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved
keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved
keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved
keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved
Alpha:

- [] Implement and test the kubelet resource metrics endpoint as described above
- [] Modify the metrics server to consume the kubelet resource metrics endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the metrics server alpha/beta or GA? How will the endpoint be configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find whether it is alpha/beta/GA, but I believe it is beta since it is enabled by default in kubernetes clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "how will the endpoint be configured"? Are you asking about something about the metrics server?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I imagine the metrics server will need to support both summary metric and resource metrics ingestion for a while. I'm assuming this will be some sort of flag or configuration parameter. For heterogeneous clusters with older nodes, I suppose the metrics server will just continue to use the summary endpoint until all nodes support the required version of resource metrics. Just thinking out loud, not sure you need to add this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current plan is just to switch over once all nodes are expected to have the new endpoint. I know we only support a 2 version skew between nodes and master version, but i'm not sure how lenient we want to be on enabling a larger skew by providing config to revert to the older API. It would just be easier for them to run an older metrics server than to add extra configuration...

keps/sig-node/kubelet-resource-metrics-endpoint.md Outdated Show resolved Hide resolved

```
Name: container_cpu_usage_seconds_total
Labels: container, pod, namespace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

container name? pod name or UID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all names. It is following the instrumentation guidelines by labeling using pod, rather than pod_name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. See the last point in the document re: UUIDs. Do we need to be able to separate metrics across pod recreation? If so, a UUID should be included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs are suggesting adding an info metric kube_pod_info which includes lots of details, such as UID, and then performing a join on that metric, and the container metrics exposed here.

@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2019
@tallclair tallclair added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 5, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2019
@nikhita
Copy link
Member

nikhita commented Feb 6, 2019

/test pull-enhancements-verify

@k8s-ci-robot
Copy link
Contributor

@dashpole: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-enhancements-verify 7dbc3b2 link /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, dashpole, derekwaynecarr, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@krzyzacy
Copy link
Member

krzyzacy commented Feb 9, 2019

/close

@krzyzacy
Copy link
Member

krzyzacy commented Feb 9, 2019

/reopen

@k8s-ci-robot
Copy link
Contributor

@krzyzacy: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krzyzacy
Copy link
Member

krzyzacy commented Feb 9, 2019

/reopen

@k8s-ci-robot
Copy link
Contributor

@krzyzacy: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 9, 2019
@cjwagner
Copy link
Member

The tide/squash label is preventing this PR from merging because squash merging is forbidden in this GitHub repo. Please remove the label to unblock tide on this repo or enable squash merging for the repo. I'll add a hold to this PR so that other PR in this repo can merge in the meantime.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2019
@cjwagner
Copy link
Member

Bumping to update the stale Tide status.
Based on the Tide logs it appears that GitHub search index took so long to update that this PR fell outside the Tide status controller's search window (even with a hefty window overlap). This is very likely related to the search indexing failures/corruption that we have been seeing with increasing frequency. It looks like GitHub's background indexing jobs are taking a long time to complete or timing out altogether resulting in invalid search results 😞

@tallclair tallclair removed the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 11, 2019
@dashpole
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2019
@k8s-ci-robot k8s-ci-robot merged commit b683294 into kubernetes:master Feb 11, 2019
@dashpole dashpole deleted the kubelet_resource_metrics branch February 23, 2019 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.