Resource Metrics API proposal #24253
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->
# Resource Metrics API

*This proposal is based on and supersedes [compute-resource-metrics-api.md](compute-resource-metrics-api.md).*

This document describes the API part of the MVP version of the Resource Metrics API effort in Kubernetes.
Once agreement is reached, the document will be extended to also cover implementation details.
The shape of the effort may also change once we have more well-defined use cases.
## Goal

The goal of this effort is to provide resource usage metrics for pods and nodes through the API server.
This will be a stable, versioned API which core Kubernetes components can rely on.
In the first version only the well-defined use cases will be handled,
although the API should be easily extensible for potential future use cases.
## Main use cases

This section describes well-defined use cases which should be handled in the first version.
Use cases which are not listed below are out of scope for the MVP version of the Resource Metrics API.
#### Horizontal Pod Autoscaler

HPA uses the latest value of CPU usage as an average aggregated across 1 minute
(the window may change in the future). The data for a given set of pods
(defined either by a pod list or by a label selector) should be accessible in one request
for performance reasons.
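To illustrate the consumption pattern (not part of the proposal itself), the sketch below shows how an HPA-like consumer could derive the mean CPU usage across a set of pods returned by one list request. The types here are simplified stand-ins for the proposed schema: plain `int64` millicores replace `v1.ResourceList`, and `TypeMeta`/`ObjectMeta` are omitted.

```go
package main

import "fmt"

// ContainerMetrics is a simplified stand-in for the proposed type:
// Usage maps a resource name (e.g. "cpu") to a value in millicores.
type ContainerMetrics struct {
	Name  string
	Usage map[string]int64
}

// PodMetrics is a simplified stand-in for the proposed top-level type.
type PodMetrics struct {
	Name       string
	Containers []ContainerMetrics
}

// podCPU sums CPU usage over all containers of a pod; pod-level usage
// is the sum over the pod's containers.
func podCPU(p PodMetrics) int64 {
	var total int64
	for _, c := range p.Containers {
		total += c.Usage["cpu"]
	}
	return total
}

// meanCPU computes the average CPU usage (in millicores) across pods,
// as an HPA-like consumer would after a single list request.
func meanCPU(pods []PodMetrics) int64 {
	if len(pods) == 0 {
		return 0
	}
	var total int64
	for _, p := range pods {
		total += podCPU(p)
	}
	return total / int64(len(pods))
}

func main() {
	pods := []PodMetrics{
		{Name: "frontend-1", Containers: []ContainerMetrics{
			{Name: "app", Usage: map[string]int64{"cpu": 250}},
		}},
		{Name: "frontend-2", Containers: []ContainerMetrics{
			{Name: "app", Usage: map[string]int64{"cpu": 300}},
			{Name: "sidecar", Usage: map[string]int64{"cpu": 50}},
		}},
	}
	// (250 + 350) / 2 = 300 millicores
	fmt.Printf("mean cpu: %dm\n", meanCPU(pods))
}
```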
#### Scheduler

In order to schedule best-effort pods, the scheduler will require node-level resource usage metrics
as an average aggregated across 1 minute (the window may change in the future).
The metrics should be available for all resources supported in the scheduler.
Currently the scheduler does not need this information, because it schedules best-effort pods
without considering node usage. But having the metrics available in the API server is a blocker
for adding the ability to take node usage into account when scheduling best-effort pods.
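As a rough illustration of the future scheduler behavior described above (assumed, not specified by this proposal), a scheduler could prefer the node with the lowest 1-minute average CPU usage when placing a best-effort pod. The `NodeMetrics` type below is a simplified stand-in for the proposed schema.

```go
package main

import "fmt"

// NodeMetrics is a simplified stand-in for the proposed type: Usage maps
// a resource name to a value (CPU in millicores).
type NodeMetrics struct {
	Name  string
	Usage map[string]int64
}

// leastLoadedNode sketches a "least loaded" heuristic: among candidate
// nodes, pick the one with the lowest average CPU usage. Returns false
// when there are no candidates.
func leastLoadedNode(nodes []NodeMetrics) (string, bool) {
	if len(nodes) == 0 {
		return "", false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.Usage["cpu"] < best.Usage["cpu"] {
			best = n
		}
	}
	return best.Name, true
}

func main() {
	nodes := []NodeMetrics{
		{Name: "node-a", Usage: map[string]int64{"cpu": 1800}},
		{Name: "node-b", Usage: map[string]int64{"cpu": 400}},
	}
	if name, ok := leastLoadedNode(nodes); ok {
		fmt.Println("place best-effort pod on:", name)
	}
}
```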
## Other considered use cases

This section describes the other considered use cases and explains why they are out
of the scope of the MVP version.
#### Custom metrics in HPA

HPA requires the latest value of application-level metrics.

The design of the pipeline for collecting application-level metrics should
be revisited, and it's not clear whether application-level metrics should be
available in the API server, so this use case won't be supported initially.
#### Ubernetes

Ubernetes might want to consider cluster-level usage (in addition to cluster-level request)
of running pods when choosing where to schedule new pods. Although Ubernetes is still in design,
we expect the metrics API described here to be sufficient. Cluster-level usage can be
obtained by summing over usage of all nodes in the cluster.
#### kubectl top

This feature is not yet specified or implemented, although it seems reasonable to provide users with information
about resource usage at the pod/node level.

Since this feature has not been fully specified yet, it will not be supported initially in the API, although
it will probably be possible to provide a reasonable implementation of the feature anyway.
#### Kubernetes dashboard

In order to draw graphs, the [Kubernetes dashboard](https://github.com/kubernetes/dashboard) requires resource usage
in timeseries format over a relatively long period of time. Aggregations should also be possible at various levels,
including replication controllers, deployments, services, etc.

Since this use case is complicated, it will not be supported initially in the API, and the dashboard will query Heapster
directly using a custom API there.
## Proposed API

Initially the metrics API will be in a separate [API group](api-group.md) called `metrics`.
Later, if we decide to have Node and Pod in different API groups,
NodeMetrics and PodMetrics should also be in different API groups.
#### Schema

The proposed schema is as follows. Each top-level object has `TypeMeta` and `ObjectMeta` fields
to be compatible with Kubernetes API standards.
```go
type NodeMetrics struct {
	unversioned.TypeMeta
	ObjectMeta

	// The following fields define the time interval from which metrics were
	// collected in the following format: [Timestamp-Window, Timestamp].
	Timestamp unversioned.Time
	Window    unversioned.Duration

	// The memory usage is the memory working set.
	Usage v1.ResourceList
}

type PodMetrics struct {
	unversioned.TypeMeta
	ObjectMeta

	// The following fields define the time interval from which metrics were
	// collected in the following format: [Timestamp-Window, Timestamp].
	Timestamp unversioned.Time
	Window    unversioned.Duration

	// Metrics for all containers are collected within the same time window.
	Containers []ContainerMetrics
}

type ContainerMetrics struct {
	// Container name corresponding to the one from v1.Pod.Spec.Containers.
	Name string
	// The memory usage is the memory working set.
	Usage v1.ResourceList
}
```
By default `Usage` is the mean from samples collected within the returned time window.
The default time window is 1 minute.
#### Endpoints

All endpoints are GET endpoints, rooted at `/apis/metrics/v1alpha1/`.
There won't be support for the other REST methods.
The list of supported endpoints:
- `/nodes` - all node metrics; type `[]NodeMetrics`
- `/nodes/{node}` - metrics for a specified node; type `NodeMetrics`
- `/namespaces/{namespace}/pods` - all pod metrics within namespace with support for `all-namespaces`; type `[]PodMetrics`
- `/namespaces/{namespace}/pods/{pod}` - metrics for a specified pod; type `PodMetrics`

The following query parameters are supported:
- `labelSelector` - restrict the list of returned objects by labels (list endpoints only)

In the future we may want to introduce the following params:
`aggregator` (`max`, `min`, `95th`, etc.) and `window` (`1h`, `1d`, `1w`, etc.),
which will allow getting other aggregates over a custom time window.
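As a sketch of client-side usage (the helper below is hypothetical, not part of the proposal), a pod-metrics list request with an optional `labelSelector` could be constructed like this:

```go
package main

import (
	"fmt"
	"net/url"
)

// podMetricsURL is a hypothetical helper that builds the URL for the
// pod-metrics list endpoint, optionally restricted by a label selector.
func podMetricsURL(namespace, labelSelector string) string {
	u := url.URL{Path: "/apis/metrics/v1alpha1/namespaces/" + namespace + "/pods"}
	if labelSelector != "" {
		q := url.Values{}
		q.Set("labelSelector", labelSelector)
		u.RawQuery = q.Encode() // percent-encodes the selector, e.g. "=" -> "%3D"
	}
	return u.String()
}

func main() {
	fmt.Println(podMetricsURL("default", ""))
	fmt.Println(podMetricsURL("default", "app=frontend"))
}
```

This matches the HPA use case above: one list request, restricted by a label selector, returns metrics for the whole set of pods.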
## Further improvements

Depending on further requirements, the following features may be added:
- support for more metrics
- support for application-level metrics
- watch for metrics
- possibility to query for window sizes and aggregation functions (though a single window size/aggregation function per request)
- cluster-level metrics
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-metrics-api.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->