# Resource Quota - Scoping resources

## Problem Description

### Ability to limit compute requests and limits

The existing `ResourceQuota` API object constrains the total amount of compute
resource requests. This is useful when a cluster-admin is interested in
controlling explicit resource guarantees such that there is a relatively
strong guarantee that pods created by users who stay within their quota will find
enough free resources in the cluster to schedule. The end-user creating
the pod is expected to have intimate knowledge of their minimum required resources
as well as their potential limits.

There are many environments where a cluster-admin does not extend this level
of trust to their end-users because users often request too much resource and
have trouble reasoning about what they hope to have available for their
application versus what their application actually needs. In these environments,
the cluster-admin will often expose only a single value (the limit) to the end-user.
Internally, they may choose a variety of strategies for setting the request.
For example, some cluster operators are focused on satisfying a particular over-commit
ratio and may choose to set the request as a factor of the limit to control
over-commit. Other cluster operators may defer to a resource estimation tool that
sets the request based on known historical trends. In these environments, the
cluster-admin is interested in exposing a quota to end-users that maps
to their desired limit instead of their request, since the limit is the value the
user manages.

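To make the over-commit strategy concrete, a small sketch follows; the helper
name and the 4x ratio are illustrative assumptions, not part of this proposal.

```
// Illustrative only: derive an internal request from a user-visible limit
// when the operator targets a fixed over-commit ratio. The function name and
// the 4x ratio below are hypothetical examples, not part of the quota API.
func requestFromLimit(limitMilliCPU, overCommitRatio int64) int64 {
	if overCommitRatio <= 0 {
		return limitMilliCPU
	}
	return limitMilliCPU / overCommitRatio
}

// requestFromLimit(2000, 4) == 500: a user-specified 2-core (2000m) limit
// maps to a 500m internal request under a 4x over-commit target.
```
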
### Ability to limit impact to node and promote fair-use

The current `ResourceQuota` API object does not provide the ability
to quota best-effort pods separately from pods with resource guarantees.
For example, if a cluster-admin applies a quota that caps requested
cpu at 10 cores and memory at 10Gi, all pods in the namespace must
make an explicit resource request for cpu and memory to satisfy
quota. This prevents a namespace with a quota from supporting best-effort
pods.

In practice, the cluster-admin wants to control the impact of best-effort
pods on the cluster, but not restrict the ability to run best-effort pods
altogether.

As a result, the cluster-admin requires the ability to control the
maximum number of active best-effort pods. In addition, the cluster-admin
requires the ability to scope a quota that limits compute resources to
exclude best-effort pods.

### Ability to quota long-running vs bounded-duration compute resources

The cluster-admin may want to quota end-users separately
based on long-running vs bounded-duration compute resources.

For example, a cluster-admin may offer more compute resources
for long-running pods that are expected to have a more permanent residence
on the node than for bounded-duration pods. Many batch-style workloads
tend to consume as much resource as they can until something else applies
the brakes. As a result, these workloads tend to operate at their limit,
while many traditional web applications may often consume closer to their
request if there is no active traffic. An operator that wants to control
density will offer lower quota limits for batch workloads than for web applications.

A classic example is a PaaS deployment where the cluster-admin may
allow a separate budget for pods that run a web application vs pods that
build web applications.

Another example is providing more quota to a database pod than to a
pod that performs a database migration.

## Use Cases

* As a cluster-admin, I want the ability to quota
  * compute resource requests
  * compute resource limits
  * compute resources for terminating vs non-terminating workloads
  * compute resources for best-effort vs non-best-effort pods

## Proposed Change

### New quota tracked resources

Support the following resources that can be tracked by quota.

| Resource Name | Description |
| ------------- | ----------- |
| cpu | total cpu requests (backwards compatibility) |
| cpu.request | total cpu requests |
| cpu.limit | total cpu limits |
| memory | total memory requests (backwards compatibility) |
| memory.request | total memory requests |
| memory.limit | total memory limits |

Review discussion on the resource names:

> Another alternative would be something closer to the name of the field, such as `requests.cpu`. I agree `cpu.request` reads better, but it is ad hoc and possibly more difficult to handle extended resources offered by nodes, such as gpus.

> I am fine to update to `requests.cpu` in the implementation. Will take a note to do that.

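As a sketch of how these totals could be derived, the snippet below sums
requests and limits across a pod's containers; the stand-in types are
simplified assumptions rather than the real Pod API and quantity types.

```
// Simplified stand-in types; a real implementation would use the Kubernetes
// Pod API and resource.Quantity rather than raw int64 values.
type container struct {
	cpuRequestMilli, cpuLimitMilli int64
	memRequestBytes, memLimitBytes int64
}

type podSpec struct {
	containers []container
}

// usage sums requests and limits across a pod's containers, which is the
// amount a quota tracking cpu.request, cpu.limit, memory.request and
// memory.limit would charge for that pod.
func usage(p podSpec) (cpuReq, cpuLim, memReq, memLim int64) {
	for _, c := range p.containers {
		cpuReq += c.cpuRequestMilli
		cpuLim += c.cpuLimitMilli
		memReq += c.memRequestBytes
		memLim += c.memLimitBytes
	}
	return
}
```
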
### Resource Quota Scopes

Add the ability to associate a set of `scopes` to a quota.

A quota will only measure usage for a `resource` if it matches
the intersection of enumerated `scopes`.

Adding a `scope` to a quota limits the set of resources
it supports to those that pertain to the `scope`. Specifying
a resource on the quota object outside of the allowed set
would result in a validation error.

| Scope | Description |
| ----- | ----------- |
| Terminating | Match `kind=Pod` where `spec.activeDeadlineSeconds >= 0` |
| NotTerminating | Match `kind=Pod` where `spec.activeDeadlineSeconds = nil` |
| BestEffort | Match `kind=Pod` where `status.qualityOfService in (BestEffort)` |
| NotBestEffort | Match `kind=Pod` where `status.qualityOfService not in (BestEffort)` |

Review discussion on the `Terminating` scope:

> This doesn't take restartPolicy into account? A Pod might not have a deadline, but still could terminate in 30 seconds if it had restartPolicy Never or OnFailure.

> Saw that there was some commentary about this. I commented on #20199. I'm ok with a separate mechanism to require reasonable deadlines on terminating pods, but that's worth mentioning here as the solution.

> There is nothing stopping another scope that matches on RestartPolicy being added in the future as well...

A `BestEffort` scope restricts a quota to tracking the following resources:

* pod

A `Terminating`, `NotTerminating`, or `NotBestEffort` scope restricts a quota to
tracking the following resources:

* pod
* memory, memory.request, memory.limit
* cpu, cpu.request, cpu.limit

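The sketch below illustrates how the scope matching described above could be
evaluated for a pod; the helper names and the simplified pod fields are
assumptions for illustration, not the actual implementation.

```
// Illustrative sketch only; the actual evaluator works on the real Pod type.
type ResourceQuotaScope string

const (
	ScopeTerminating    ResourceQuotaScope = "Terminating"
	ScopeNotTerminating ResourceQuotaScope = "NotTerminating"
	ScopeBestEffort     ResourceQuotaScope = "BestEffort"
	ScopeNotBestEffort  ResourceQuotaScope = "NotBestEffort"
)

// simplified stand-in for the pod fields the scopes inspect
type podInfo struct {
	activeDeadlineSeconds *int64 // nil when the pod has no bounded duration
	bestEffort            bool   // true when status.qualityOfService is BestEffort
}

func matchesScope(p podInfo, scope ResourceQuotaScope) bool {
	switch scope {
	case ScopeTerminating:
		return p.activeDeadlineSeconds != nil
	case ScopeNotTerminating:
		return p.activeDeadlineSeconds == nil
	case ScopeBestEffort:
		return p.bestEffort
	case ScopeNotBestEffort:
		return !p.bestEffort
	}
	return false
}

// A pod is tracked by a scoped quota only if it matches every enumerated scope.
func matchesAllScopes(p podInfo, scopes []ResourceQuotaScope) bool {
	for _, scope := range scopes {
		if !matchesScope(p, scope) {
			return false
		}
	}
	return true
}
```

For example, a build pod with `activeDeadlineSeconds` set and guaranteed
resources would match `[Terminating, NotBestEffort]`, while a best-effort pod
would not.
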
## Data Model Impact

```
// The following identify resource constants for Kubernetes object types
const (
	// CPU Request, in cores
	ResourceCPURequest ResourceName = "cpu.request"
	// CPU Limit, in cores
	ResourceCPULimit ResourceName = "cpu.limit"
	// Memory Request, in bytes
	ResourceMemoryRequest ResourceName = "memory.request"
	// Memory Limit, in bytes
	ResourceMemoryLimit ResourceName = "memory.limit"
)

// A scope is a filter that matches an object
type ResourceQuotaScope string

const (
	ResourceQuotaScopeTerminating    ResourceQuotaScope = "Terminating"
	ResourceQuotaScopeNotTerminating ResourceQuotaScope = "NotTerminating"
	ResourceQuotaScopeBestEffort     ResourceQuotaScope = "BestEffort"
	ResourceQuotaScopeNotBestEffort  ResourceQuotaScope = "NotBestEffort"
)

// ResourceQuotaSpec defines the desired hard limits to enforce for Quota.
// The quota matches by default on all objects in its namespace.
// The quota can optionally match objects that satisfy a set of scopes.
type ResourceQuotaSpec struct {
	// Hard is the set of desired hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty"`
	// Scopes is the set of filters that must match an object for it to be
	// tracked by the quota
	Scopes []ResourceQuotaScope `json:"scopes,omitempty"`
}
```

Review discussion on the data model:

> Aside: Could you please rename ResourceList to ResourceMap?

> Opened and assigned issue: #21584

> Nit: This needs patch merge key tags.

> Also note that API field comments should be in terms of json field names, or plain English (without mentioning names/types).

> Also assumes the addition of

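For illustration, the `quota-terminating` object from the example later in
this document could be constructed against these types roughly as follows;
the snippet assumes the existing `ResourcePods` constant and the
`resource.MustParse` helper from the API types.

```
// Rough illustration of building a scoped quota spec with the types above.
spec := ResourceQuotaSpec{
	Hard: ResourceList{
		ResourcePods:        resource.MustParse("2"),
		ResourceMemoryLimit: resource.MustParse("1Gi"),
		ResourceCPULimit:    resource.MustParse("2"),
	},
	Scopes: []ResourceQuotaScope{
		ResourceQuotaScopeTerminating,
		ResourceQuotaScopeNotBestEffort,
	},
}
```
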
## Rest API Impact

None.

## Security Impact

None.

## End User Impact

The `kubectl` commands that render quota should display its scopes.

## Performance Impact

This feature will make it more common for certain clusters to have multiple
quota objects in a namespace. This increases the number of quota objects
that need to be incremented during creation of an object in admission
control, and the number of quota objects that need to be updated during
controller loops.

## Developer Impact

None.

## Alternatives

This proposal initially enumerated a solution that leveraged a
`FieldSelector` on a `ResourceQuota` object. A `FieldSelector`
grouped an `APIVersion` and `Kind` with a selector over its
fields that supported set-based requirements. It would have allowed
a quota to track objects based on cluster-defined attributes.

Review discussion on this alternative:

> FieldSelector came up for MetadataPolicy, also. I'd be in favor of that.

> That said, I'm also ok with the approach proposed above.

> Long term, I am in favor of FieldSelector to handle edge cases, and scopes to handle common cases.

For example, a quota could do the following:

* match `Kind=Pod` where `spec.restartPolicy in (Always)`
* match `Kind=Pod` where `spec.restartPolicy in (Never, OnFailure)`
* match `Kind=Pod` where `status.qualityOfService in (BestEffort)`
* match `Kind=Service` where `spec.type in (LoadBalancer)`
  * see [#17484](https://github.com/kubernetes/kubernetes/issues/17484)
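
A purely hypothetical sketch of what such a `FieldSelector` could have looked
like; these type and field names were never adopted and are shown only to
make the alternative concrete.

```
// Hypothetical types for the FieldSelector alternative; not part of any
// Kubernetes API.
type FieldSelectorRequirement struct {
	Field    string   // e.g. "spec.restartPolicy"
	Operator string   // e.g. "In", "NotIn"
	Values   []string // e.g. []string{"Never", "OnFailure"}
}

type FieldSelector struct {
	APIVersion   string // e.g. "v1"
	Kind         string // e.g. "Pod"
	Requirements []FieldSelectorRequirement
}
```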

Theoretically, it would enable support for fine-grained tracking
on a variety of resource types. While extremely flexible, there
are cons to this approach that make it premature to pursue
at this time.

* Generic field selectors are not yet settled art
  * see [#1362](https://github.com/kubernetes/kubernetes/issues/1362)
  * see [#19804](https://github.com/kubernetes/kubernetes/pull/19804)
* Discovery API Limitations
  * Not possible to discover the set of field selectors supported by a kind.
  * Not possible to discover if a field is readonly, readwrite, or immutable
    post-creation.

The quota system would want to validate that a field selector is valid,
and it would only want to select on those fields that are readonly/immutable
post-creation to make resource tracking work during update operations.

The current proposal could grow to support a `FieldSelector` on a
`ResourceQuotaSpec` and support a simple migration path to convert
`scopes` to the matching `FieldSelector` once the project has identified
how it wants to handle `fieldSelector` requirements longer term.

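Under the hypothetical `FieldSelector` types sketched above, such a migration
could map each scope to an equivalent selector; the operator names such as
`Exists` and `DoesNotExist` are assumptions.

```
// Hypothetical mapping from scopes to the FieldSelector sketch above,
// reusing the scope constants from the earlier matching sketch.
func fieldSelectorForScope(scope ResourceQuotaScope) FieldSelector {
	switch scope {
	case ScopeTerminating:
		return FieldSelector{APIVersion: "v1", Kind: "Pod", Requirements: []FieldSelectorRequirement{
			{Field: "spec.activeDeadlineSeconds", Operator: "Exists"},
		}}
	case ScopeNotTerminating:
		return FieldSelector{APIVersion: "v1", Kind: "Pod", Requirements: []FieldSelectorRequirement{
			{Field: "spec.activeDeadlineSeconds", Operator: "DoesNotExist"},
		}}
	case ScopeBestEffort:
		return FieldSelector{APIVersion: "v1", Kind: "Pod", Requirements: []FieldSelectorRequirement{
			{Field: "status.qualityOfService", Operator: "In", Values: []string{"BestEffort"}},
		}}
	case ScopeNotBestEffort:
		return FieldSelector{APIVersion: "v1", Kind: "Pod", Requirements: []FieldSelectorRequirement{
			{Field: "status.qualityOfService", Operator: "NotIn", Values: []string{"BestEffort"}},
		}}
	}
	return FieldSelector{}
}
```
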
This proposal previously discussed a solution that leveraged a
`LabelSelector` as a mechanism to partition quota. This is potentially
interesting to explore in the future to allow `namespace-admins` to
quota workloads based on local knowledge. For example, a quota
could match all kinds that match the selector
`tier=cache, environment in (dev, qa)` separately from quota that
matched `tier=cache, environment in (prod)`. However, labels are
insufficient selection targets for `cluster-administrators` to control
footprint. In those instances, you need fields that are cluster-controlled
and not user-defined.

Review discussion on labels vs fields:

> This general issue seems to keep coming up... It seems like we need label-namespace-level ACLs on labels, so we can then create a label namespace to house non-user-modifiable system-generated labels. I wonder if we could even get rid of the concept of field selector entirely if we do that, if we publish every field as a label (?!?)...

> I am not sure I want the ability to publish every field as a label, but I do think it may make some sense to let you put system style labels in a different space with different ACLs. As part of defining your API object, you can write the projection code that promotes specific fields on an object to that special label-space. That said, even with this capability, I do think scopes are easier to understand for common use cases.

> We shouldn't publish every field as a label. ACLs on labels and annotations would be useful.

> What is the downside of publishing every field as a label (in a separate namespace)? It would allow us to unify field selectors and label selectors (both in use and in implementation/code). If it's in a special namespace we could hide it when displaying kubectl unless you ask to see that namespace. It would eliminate other kinds of redundancies too, for example having to publish QoS both as a field and as a label (the latter so you can use it in MetadataPolicy, the former for reasons @derekwaynecarr will remember but not me).

## Example

### Scenario 1

The cluster-admin wants to restrict the following:

* limit 2 best-effort pods
* limit 2 terminating pods that cannot use more than 1Gi of memory and 2 cpu cores
* limit 4 long-running pods that cannot use more than 4Gi of memory and 4 cpu cores
* limit 6 pods in total, 10 replication controllers

This would require the following quotas to be added to the namespace:

```
$ cat quota-best-effort
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-best-effort
spec:
  hard:
    pods: "2"
  scopes:
  - BestEffort

$ cat quota-terminating
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-terminating
spec:
  hard:
    pods: "2"
    memory.limit: 1Gi
    cpu.limit: 2
  scopes:
  - Terminating
  - NotBestEffort

$ cat quota-longrunning
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-longrunning
spec:
  hard:
    pods: "4"
    memory.limit: 4Gi
    cpu.limit: 4
  scopes:
  - NotTerminating
  - NotBestEffort

$ cat quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    pods: "6"
    replicationcontrollers: "10"
```

In the above scenario, every pod creation will result in its usage being
tracked by `quota`, since it has no additional scoping. The pod will then
be tracked by exactly one additional quota object based on the scopes it
matches. In order for the pod creation to succeed, it must not violate
the constraint of any matching quota. For example, a best-effort pod
would only be created if there is available quota in both `quota-best-effort`
and `quota`.

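A minimal sketch of the admission-time check implied above, reusing the
scope-matching helpers from the earlier sketch; resource arithmetic is
reduced to simple counters keyed by resource name rather than the real
quantity types, and the names are illustrative only.

```
// Illustrative admission check, building on podInfo and matchesAllScopes
// from the scope-matching sketch above.
type quotaDoc struct {
	scopes []ResourceQuotaScope // empty means the quota matches all pods
	hard   map[string]int64
	used   map[string]int64
}

// admitPod returns true only if adding podUsage would keep every matching
// quota within its hard limits.
func admitPod(p podInfo, podUsage map[string]int64, quotas []quotaDoc) bool {
	for _, q := range quotas {
		if !matchesAllScopes(p, q.scopes) {
			continue
		}
		for name, hard := range q.hard {
			if q.used[name]+podUsage[name] > hard {
				return false
			}
		}
	}
	return true
}
```
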
## Implementation

### Assignee

@derekwaynecarr

### Work Items

* Add support for requests and limits
* Add support for scopes in quota-related admission and controller code

## Dependencies

None.

Longer term, we should evaluate what we want to do with `fieldSelector`, as
requests for different quota semantics will continue to grow.

## Testing

Appropriate unit and e2e tests will be authored.

## Documentation Impact

Existing resource quota documentation and examples will be updated.

Additional review discussion:

> For this whole section, is the key reason admins treat these pods differently in policies due to the duration, or is duration merely correlated with importance? That is, admins set aside fewer guaranteed resources for these short-duration jobs because they can generally afford brief downtime, not merely because they'll finish quickly. It just so happens that most such jobs can finish quickly.

> Typically, batch-style workloads consume as much resource as they can until something applies the brakes, while transactional web applications consume resource in response to user requests. Transactional web applications tend to stay put once scheduled until a deployment occurs; batch-style workloads cause more bursts in scheduling. Cluster administrators coming from a traditional PaaS background often plan their clusters with a view to long-running workloads. Admins in this case want to prioritize their cluster planning around transactional web applications that tend not to move once scheduled and that scale based on incoming traffic they can measure. They are often willing to provide higher quota limits for those workloads because most applications never actually consume up to their limit. Batch workloads, on the other hand, are greedy and often consume up to their limit until done, and therefore may have a greater impact on what is actually consumed on the node to which they are scheduled.

> Batch workloads may use many more resources than requested, but at a lower QoS level, with throughput more important than latency. They also often have weaker availability requirements, and may be deferred in time. Additionally, because they are bounded in duration, they are launched more often, and a rate limit on creation is needed.