
Design proposal for adding priority to Kubernetes API #604

Merged
merged 2 commits into from
Jul 13, 2017

Conversation

@bsalamat (Member) commented May 5, 2017:

@kubernetes/sig-scheduling-api-reviews
@kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 5, 2017
@davidopp (Member) commented May 5, 2017:

ref/ kubernetes/kubernetes#22212

@bsalamat bsalamat changed the title Add a design proposal for adding priority to Kubernetes API Design proposal for adding priority to Kubernetes API May 5, 2017
@bsalamat (Member, Author) commented May 5, 2017:

ref/ kubernetes/enhancements#268

A Kubernetes cluster has four predefined (built-in) priority names. The top
priority is reserved for the system pods and its value is the largest 32-bit
integer. The following list is an example of how the priorities and their
values may look. I am open to suggestions for the names and values of the
@davidopp (Member), May 6, 2017:

It might be good to give an example of what kinds of workloads might map to each priority level. For example, in decreasing order of priority:

  • Kubernetes system daemons (per-node like fluentd, and cluster-level like Heapster)
  • critical user infrastructure (e.g. storage servers, monitoring system like Prometheus, etc.)
  • components that are in the user-facing request serving path and must be able to scale up arbitrarily in response to load spikes (web servers, middleware, etc.)
  • important interruptible workloads that need strong-ish guarantee of schedulability and of not being interrupted
  • less important interruptible workloads that need a less strong guarantee of schedulability and of not being interrupted
  • best effort / opportunistic

Member Author (bsalamat):

Added the list.


This design doc introduces the concept of priorities for pods in Kubernetes and
how priority impacts scheduling and preemption of pods when the cluster
runs out of resources. A pod can specify a priority at creation time. The
Member:

Do we have an admission check for it?

Member Author (bsalamat):

Yes, described a little farther below.

One could generally expect that a pod with higher priority has a higher chance of
getting scheduled than the same pod with lower priority. However, there are
many other parameters that affect scheduling decisions. So, a high priority pod
may or may not be scheduled sooner than lower priority pods. The details of
Member:

I'm a little confused :(. My understanding is that priority does not guarantee scheduling order, e.g. a lower priority pod may be scheduled first because of a race condition; but the scheduler may preempt lower priority pods if it decides to schedule a higher priority pod onto that host, with the decision made according to the current scheduling algorithm.

@bsalamat (Member, Author), May 6, 2017:

That's more or less what I was trying to say as well: a higher priority pod is not guaranteed to be scheduled sooner than lower priority ones, as there are many other parameters that affect scheduling order.

Contributor:

how about earlier instead of sooner?

Member Author (bsalamat):

I am not sure which one is grammatically more suitable. I will change "sooner than" to "before".

string to an integer. The PriorityName specified in a pod spec must be found in
this map or the pod creation request will be rejected. If PriorityName is
empty, "tier3" (see below) is assumed. I am open to other suggestions about
what the default priority should be. Once the PriorityName is resolved to an
Member:

I'd prefer to have the AdmissionController set the priority to the lowest one.

specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only pods created after the change;
existing pods keep their existing Priority.
Member:

Changing a priority may confuse the cluster and admins; for example, if we change the config from tier_1:1000, tier_2:2000 to tier_1:2000, tier_2:1000, it's hard for admins/devs to debug issues (with a mix of new and old Pods). Deleting has a similar issue.

The draft option in my mind is to mark the old priority as deprecated: new Pods cannot be created with that priority, but the priority of old Pods stays; the admin can delete it once all objects using it are gone (for now, we do not provide a way to delete all objects with the same priority).

Member Author (bsalamat):

I am not a huge fan of changing priorities at runtime, but looks like there are customers and engineers who want this capability.

I am not sure if I understand the second paragraph.

@k82cn (Member), May 6, 2017:

Re the second paragraph: if we rename "tier1" to "tier2", we just create a new name "tier2" and let the admin delete "tier1" when all objects using it are deleted; but "tier1" is deprecated (new pods cannot use it).

Member Author (bsalamat):

I am not sure if we should add all the logic to obsolete one name and restrict the use of names. Pretty much all the components of the system will use the integer value of Priority anyway.

Member:

Hmm, I think we can just document it: it's better to delete/change a name after deleting all related objects. Adding such logic may make it complex :).

Member:

One can only add or delete objects.

I don't think deleting should be possible. Once you've created a priority, it shouldn't be possible to delete it either.

Member Author (bsalamat):

Once you've created a priority, it shouldn't be possible to delete it either.

What is the concern with deleting? If one deletes an existing PriorityClass, no more pods can be created with that name. I think we should give cluster admins the flexibility to delete PriorityClasses if needed. They may later create one with the same name as the deleted one and a different value. I don't see any problem with that either.

Member:

@wojtek-t, I wrote "document it"; I'd like to keep it flexible enough for users :). But the admin can delete it: it needs the admin to make sure all related objects are deleted.

With PriorityClass, k8s can help to delete them.

Member:

@bsalamat - my main problem with deleting is basically that "delete + create = update". That said, relying on the admin checking if it is fine to delete a priority and deleting it is not a good solution in my opinion.
If we really want to allow for deleting a priority, then the system itself has to somehow check if it is safe to remove a given priority and disallow it if there are any pods with it in the system.

But to be honest, I'm not sure how to do it in a non-racy way, which is why I'm very much against allowing this.
Another problem with allowing deletion of a priority is performance - if we want to allow it, I would like to see a similar discussion of how we would solve the races we have in #611

Member Author (bsalamat):

@wojtek-t We can add a restriction on allowing deletion only when there is no pod with that priority class in the system, but completely disallowing deletion of priority classes wouldn't be practical IMO. Over time these classes might get accumulated and there would be no way to clean them up.
A race window will exist between performing the check and actually deleting the class, but I guess it will be OK given that deletion is probably infrequent, and if a pod manages to sneak in due to a race, it won't break the cluster.
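A minimal sketch of the restriction described here (all types and names are hypothetical, not the proposal's API; as noted, the check only narrows the race window rather than eliminating it, since a pod can be admitted between the scan and the delete):

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the fields relevant here.
type pod struct{ priorityName string }

// checkedDelete refuses to delete a priority class while any pod still
// references it; otherwise it removes the class from the mapping.
func checkedDelete(class string, pods []pod, classes map[string]int32) error {
	for _, p := range pods {
		if p.priorityName == class {
			return fmt.Errorf("PriorityClass %q still in use; refusing to delete", class)
		}
	}
	delete(classes, class)
	return nil
}

func main() {
	classes := map[string]int32{"tier1": 4000, "tier2": 2000}
	pods := []pod{{priorityName: "tier1"}}

	fmt.Println(checkedDelete("tier1", pods, classes)) // in use: returns an error
	fmt.Println(checkedDelete("tier2", pods, classes)) // unused: deleted, returns nil
}
```

A real implementation would also need to guard against pods admitted concurrently with the delete, which is exactly the racy part debated in this thread.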


system 2147483647 (int_max)
tier1 4000
tier2 2000
tier3 1000
Contributor:

Given the detailed list below, why define just 4 default priorities?
If defining the default set is an issue why not define just two - default and system? Any pod without a priority field maps to the default priority.

Member Author (bsalamat):

Both the above and below lists are examples. That said, in the list below some items can have equal priority. Moreover, not all clusters have all types of workloads given below and some clusters may have many more types of workloads with different priorities. So, I think we should have a small number of default priorities, like the two that you mentioned, and let customers add more when they need more.

Member:

@vishh also the doc specifies that it is configurable.

Contributor:

By default I meant an empty PriorityName field. There is a clear need for defining priority for system pods. Can we avoid defining standard priorities for all other use cases?

Member Author (bsalamat):

I think we will ship with only system as a built-in value and the rest will be user configurable. If PriorityName is empty, I'd propose resolving the priority to zero.
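A minimal sketch of this resolution behavior, assuming an admission-time lookup: the tier names and values are illustrative (taken from the example table earlier in the thread), and an empty PriorityName resolves to zero as proposed here. `resolvePriority` is a hypothetical helper, not the proposal's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative mapping only; in the proposal this is admin-configurable,
// with "system" reserved and fixed at the largest 32-bit integer.
var priorityNames = map[string]int32{
	"system": math.MaxInt32,
	"tier1":  4000,
	"tier2":  2000,
	"tier3":  1000,
}

// resolvePriority sketches what the admission controller might do:
// an empty PriorityName resolves to zero, an unknown name rejects the
// pod creation request, and a known name resolves to its integer value.
func resolvePriority(name string) (int32, error) {
	if name == "" {
		return 0, nil
	}
	if v, ok := priorityNames[name]; ok {
		return v, nil
	}
	return 0, fmt.Errorf("unknown PriorityName %q: rejecting pod creation", name)
}

func main() {
	for _, n := range []string{"system", "tier2", "", "gold"} {
		if v, err := resolvePriority(n); err != nil {
			fmt.Println(err)
		} else {
			fmt.Printf("%q resolves to %d\n", n, v)
		}
	}
}
```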


### Resolving priority names

User requests sent to Kubernetes may have "PriorityName" in their PodSpec.
Contributor:

What is the purpose of "PriorityName"? Is it meant to define portable names? If so, we need a set of default Priority names that can satisfy the requirements for most users.
If it's only a matter of defaulting, can pod presets be used instead?

Member Author (bsalamat):

PriorityNames limit pods to have a small set of priorities, as opposed to any valid integer. A small set makes it easier to predict schedulability of pods. It also makes it easier to create policies based on known priorities. For example, we may add quota for various priorities in the future.

Contributor:

On the other hand an explicit name brings in portability constraints.

Member Author (bsalamat):

On the other hand an explicit name brings in portability constraints.

That's why I think we should ship Kubernetes with a few predefined priority names. These predefined priority names can be changed (except for the system), but they bring config portability to default Kubernetes deployment.

system pods will have "system" priority and no pod should be able to preempt
them.

Alternatively, we may ship Kubernetes with only the highest priority level
Contributor:

Where will the system priority level be hard-coded? In the admission controller or across the codebase?
Admission controllers are moderately easy to plug in, so how will this be enforced?

Member Author (bsalamat):

I haven't thought much about placement of code, but I guess admission controller is the only place that should know about priority names. When one replaces an admission controller that resolves priority names, they should add proper priority name resolution logic to their controller. Whether it is the system or any other priority name, if it is not resolved to the right priority, the rest of the system may not behave correctly.

specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only pods created after the change;
existing pods keep their existing Priority.
Contributor:

Priority is essentially an API. Updating or deleting existing priority names is a breaking change. Why should we optimize for handling such a situation?
+1 for documentation.

classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from request and limit of pods. Priority is introduced as an
independent concept; meaning that any QoS class may have any valid priority.
When a node is out of resources and pods need to be preempted, we may give
Contributor:

This has been discussed in multiple forums. In the case of a tie, I'd prefer kubelet computing relative priority based on absolute priority and usage above request.

Member Author (bsalamat):

That's fine with me.

### Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher
priority pods when the cluster is out of resources. In such a case, the scheduler may
Member:

s/is out of resources/has reached a threshold.


type PodStatus struct {
...
Priority int // Populated by Admission Controller.
Member:

s/Priority/EffectivePriority ;-) b/c there may be scaling applied at different layers, e.g. admission.

Member Author (bsalamat):

I am still trying to avoid adding adjectives. ;-) Even with scaling applied, can't we use just "Priority"? It looks cleaner to me, but it is just personal preference.

@timothysc (Member), May 10, 2017:

I'm just trying to eliminate any ambiguity an operator might have in the case where they might have a (PriorityName, Value) association ... and the expectation that Priority will always equal the Value.

EffectivePriority to me indicates that it could be scaled by some other value (e.g. PriorityFactor - which could be based on user or group).

Member:

Just my two cents, I think "effective" will raise more questions in the user's mind than it answers. For example, if they see "effective" they might assume the system has some scaling factor like what you described, but it may not have one. I think calling it "Priority" and making it clear in the comment that this is the priority used by the system to make decisions might be sufficient.

@timothysc (Member), May 10, 2017:

I'm not losing sleep over the name, but we should have someone other than us comment on this too.

/cc @smarterclayton

Member:

I think calling it "Priority" and making it clear in the comment that this is the priority used by the system to make decisions might be sufficient.

+1

Member:

I think the part that is unclear is that the PriorityName could end up with a different value over time, but the value is latched onto this pod and not updated.

Member Author (bsalamat):

The integer value of PriorityName could be changed (although some people believe we should not allow it), but pods created before the change will keep the old value. I believe it is fine as long as we clearly say that a change applies only to pods created after the change.

@timothysc (Member):

/cc @derekwaynecarr you will want to review this one.

@davidopp (Member) commented May 9, 2017:

BTW here is the info on how to add an alpha (or beta) field to a GA API object (which is what you're doing in this feature)

https://docs.google.com/document/d/1wuoSqHkeT51mQQ7dIFhUKrdi3-1wbKrNWeIL4cKb9zU/edit#heading=h.3sjjx2gvn9hz

@bsalamat (Member, Author) commented May 9, 2017:

cc/ @dchen1107 @dashpole

@derekwaynecarr (Member):

I plan to review this this week.

@derekwaynecarr derekwaynecarr self-assigned this May 10, 2017
@davidopp (Member):

cc/ @wojtek-t

@0xmichalis (Contributor):

@kubernetes/sig-apps-misc this is going to affect the workload controllers

@k8s-ci-robot (Contributor):

@Kargakis: These labels do not exist in this repository: sig/apps.

In response to this:

@kubernetes/sig-apps-misc this is going to affect the workload controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@davidopp (Member):

@kubernetes/sig-apps-misc this is going to affect the workload controllers

Can you say why? Priority (or PriorityName, to be more precise) would just go in the pod template and shouldn't really need to be visible to the workload controller.

@k8s-ci-robot (Contributor):

@davidopp: These labels do not exist in this repository: sig/apps.

In response to this:

@kubernetes/sig-apps-misc this is going to affect the workload controllers

Can you say why? Priority (or PriorityName, to be more precise) would just go in the pod template and shouldn't really need to be visible to the workload controller.


@0xmichalis (Contributor):

Can you say why? Priority (or PriorityName, to be more precise) would just go in the pod template and shouldn't really need to be visible to the workload controller.

IIUC this feature can be used for implementing a pruning policy for pods, i.e., by killing pods with lower priority first. The latter feature will be part of the controllers.

@0xmichalis (Contributor):

IIUC this feature can be used for implementing a pruning policy for pods, i.e., by killing pods with lower priority first. The latter feature will be part of the controllers.

See kubernetes/kubernetes#4301 kubernetes/kubernetes#45339 kubernetes/kubernetes#45509

@davidopp (Member):

No, that's different. For that we are proposing to use something called evictionCost, which is localized per-controller and is only used by the controller. We are not planning to implement that feature soon.

The priority @bsalamat is introducing here is global across all controllers/sets and namespaces, and is used by the scheduler to decide which pods to evict to make room for a pending pod (think of it as a fancy version of the critical pod rescheduler).



of the valid priority names are defined by a system-wide mapping that maps each
string to an integer. The PriorityName specified in a pod spec must be found in
this map or the pod creation request will be rejected. If PriorityName is
empty, it will resolve to priority zero. Once the PriorityName is resolved to an
Member:

What will be resolving PriorityName and when? I imagine that it will be some admission plugin, but no matter what I think this information should be added here.

Member:

OK - it seems it's described below.

specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only pods created after the change;
existing pods keep their existing Priority.
Member:

Sorry for jumping into the discussion late.
I completely agree with all the concerns mentioned above. I don't see any strong argument for needing to change the "integer value of a given priority name".
What is more, if someone does this it will be extremely confusing for users. As an example, imagine the scenario:

  • we have priorityX: 1000
  • then we create a number of pods with this priority - it is resolved to 1000 integer priority
  • then we change the priority value to: priorityX: 2000
  • then we create a number of pods with this priority - it is resolved to 2000 integer priority.

And what do we end up with? We have a number of pods with PriorityName priorityX, but some of them resolve to priority 1000 and others to priority 2000.
This in my opinion would be extremely confusing to users, and I'm really against introducing such an operation.

That said, I think users want the ability to add new priority names at runtime, and we should allow that.
@davidopp - for opinions
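The mixed-value outcome described in the comment above can be sketched directly (the types and the `admit` helper are hypothetical, only to illustrate that the integer priority is resolved once at creation and latched on the pod):

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the relevant PodSpec/PodStatus fields:
// the name is kept, and the integer priority is resolved once at admission.
type pod struct {
	priorityName string
	priority     int32 // latched at admission time, never updated
}

var classes = map[string]int32{"priorityX": 1000}

// admit sketches admission-time resolution of PriorityName.
func admit(name string) pod {
	return pod{priorityName: name, priority: classes[name]}
}

func main() {
	before := admit("priorityX") // resolved while priorityX -> 1000
	classes["priorityX"] = 2000  // admin changes the mapping
	after := admit("priorityX")  // resolved while priorityX -> 2000

	// Same PriorityName, two different effective priorities:
	fmt.Println(before.priority, after.priority) // prints: 1000 2000
}
```

This is exactly the state wojtek-t objects to: two pods with identical PriorityName but different resolved priorities.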

@timothysc (Member):

/cc @liggitt

@k8s-ci-robot k8s-ci-robot requested a review from liggitt May 11, 2017 16:35
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from request and limit of pods. Priority is introduced as an
independent concept; meaning that any QoS class may have any valid priority.
When a node is out of resources and pods need to be preempted, we give
@derekwaynecarr (Member), May 11, 2017:

I continue to believe it is an anti-pattern to allow a best effort pod to have higher priority than pods in other QoS classes.

From a scheduling perspective, if a node is reporting memory pressure and no longer accepts BestEffort pods, would the scheduler really preempt pods in the other QoS classes to make room? There is no guarantee room will ever be made available, so what is the point? Best effort pods may not have access to any node resources once scheduled, given how QoS is implemented, and will effectively just be starved anyway.

Member Author (bsalamat):

All pod priorities will be zero when pods don't specify any. Once they do, why should we make it harder for users to understand the meaning of priority by adding QoS to the mix? IMO, if a user wants their BestEffort pods to have very high priority, they should be able to do so.
The concern that BestEffort pods may be scheduled and starved exists even today. A BestEffort pod may land on a machine with no free resources and be starved. This is acceptable behavior of the system when dealing with BestEffort pods.

Member:

The concern is that if a Guaranteed pod can be evicted because of a high priority BestEffort pod, the Guaranteed QoS doesn't mean much.

Member:

Also, it would be unclear how many guaranteed pods should be preempted in order to free sufficient resources for a best effort pod that does not declare a resource request.

Member:

Hm, Guaranteed, Burstable, BestEffort is also a kind of priority; the question is how we handle them together. One option is to make priority only take effect within the same QoS class.

Member Author (bsalamat):

We do not want to limit the range of usable priorities based on QoS class. So, if a user chooses to use a higher priority for their BestEffort pods, they can do so, but you are right that BestEffort pods may be scheduled on a node with no resources (as can happen today). And yes, priority will be considered by the kubelet for out-of-resource evictions.

Member:

The concern is that the scheduler and kubelet will take different actions based on priority & QoS class; I agree to discuss this in the preemption doc in the future :).

Member:

"Guaranteed" is guaranteed resource/performance QoS while scheduled on the node, not with respect to durability.

Member:

How about a mechanism to say that a priority can only be used with an allowed set of QoS classes? I want a way to prevent what I find is an abuse vector (a BestEffort pod using high priorities), as I think it will be a future support case when it causes an unforeseen consequence.

Contributor:

@derekwaynecarr What unforeseen consequences are you thinking of?
If we have a mechanism to control the set of priorities a namespace can use and if we can restrict the QoS classes a namespace can use (using LimitRanges) why would we need an additional priority -> QoS link?


type PodStatus struct {
...
Priority int // Populated by Admission Controller.
Member:

nit: shouldn't this be an int64?

Member:

Never mind, I was informed by @liggitt that int is fine for this use case.

Member:

should be int32 (no unsized ints)

@derekwaynecarr (Member):

I am trying to understand how I will support quota by priority in the future relative to the current proposal.

Today, I am able to support quota by storage-class via the following:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    requests.storage: 5000Gi

In the above example, I have 5000Gi of storage quota, but only 500Gi of it can have gold class. Now that StorageClass is promoted, we could look to model this differently (more akin to option 2 below).

The scope of this proposal appears to be resources that surround a pod (i.e. cpu, memory, etc.) and not storage.

Option 1: priority-based resource names

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-priority
spec:
  hard:
    high.priorityclass.k8s.io/requests.cpu: 10
    low.priorityclass.k8s.io/requests.cpu: 5
    requests.cpu: 20

This would make quota controller PriorityClass name aware.

Option 2: priority constraint

apiVersion: v1
kind: ResourceQuota
metadata:
 name: compute-priority
spec:
 priorityClassName: foo
 priorityValue: 1234
 hard:
   requests.cpu: 10    

The priorityValue could be set on admission similar to your Pod example. The quota would be restricted to the set of resources that are allowed on pods. If we went this route, I would want to augment quota by storage class to use the same pattern, but just restrict quota to values allowed on a PVC.

@bsalamat

@derekwaynecarr What I am writing is not finalized, but what I currently have in mind for quota is like your option 2. Basically, we add priority to the current ResourceQuota. When a resource quota has a priority, its restrictions are applied to the pods with the same priority. With this approach, whatever resources are supported by the current quota mechanism will still be supported after the addition of priority.

@bsalamat bsalamat force-pushed the priority_api branch 3 times, most recently from 08e3a31 to 25afe86 Compare June 15, 2017 00:17
@bsalamat

@davidopp @timothysc @derekwaynecarr
Given that it seems we have reached consensus on what this design doc is trying to achieve, can one of you approve this PR?

@derekwaynecarr

@bsalamat -- see my question re: ability to say this priority can only be used with this QoS.

i also want to discuss another use case that came up with storage-classes. i want to restrict the set of allowed priorities that can be used in a namespace w/o having to create an explicit quota for each enumerated priority in my cluster.

for example, just because the cluster operator added a "super privileged system daemon priority", the operator should not have to update every existing namespace/project to say it has 0 resources at that priority. do you guys have similar concerns? can i have a whitelist of allowed priorities associated with a namespace?

@bsalamat

@derekwaynecarr We have LimitRanges and, in the near future, ResourceQuota to limit what users can create in a namespace and/or at a particular priority. Why should we limit which priorities can be used with which QoS classes? Linking priority to QoS might make the management of priorities hard.

Regarding your second question, once we add priority to ResourceQuota, an admin can create a zero ResourceQuota with no particular priority. Such a quota will prevent users from creating pods at any priority. The admin can then assign quota to users at particular priorities. So, when new priorities are added, admins don't need to explicitly give zero quota at the new priority.

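Sketched as YAML, the scheme described above might look like the following. Note that the `priorityClassName` field on ResourceQuota is hypothetical (borrowed from option 2 earlier in the thread) — quota-by-priority was explicitly deferred to a separate proposal:

```yaml
# Hypothetical sketch; ResourceQuota did not yet support a priority field.
# A catch-all zero quota denies pods at any priority by default...
apiVersion: v1
kind: ResourceQuota
metadata:
  name: deny-all-priorities
  namespace: team-a
spec:
  hard:
    pods: "0"
---
# ...while a second quota explicitly grants capacity at one priority,
# so new cluster-level priorities are denied without per-namespace updates.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: allow-low-priority
  namespace: team-a
spec:
  priorityClassName: low    # hypothetical field, per option 2 above
  hard:
    pods: "20"
    requests.cpu: "10"
```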
@davidopp

davidopp commented Jun 27, 2017

@derekwaynecarr It would be great if we could resolve your comments so we can merge the proposal.

My two cents:

All of the quotas that ResourceQuota controls today should additionally be indexed by priority (I haven't thought about whether it should be PriorityClass or absolute priority). Controlling which priorities can use which QoSes should be done as an extension to LimitRange, as @bsalamat suggested.
Regarding how to have a cluster-level policy override per-namespace policies, I think this comes up in a lot of contexts and we should think about how to handle it systematically. In this case maybe a non-namespaced ResourceQuota, or maybe something like OpenShift's projectRequestTemplate to automatically populate each namespace with a ResourceQuota, etc.

But this doc explicitly lists quota as out of scope, so maybe it's OK to merge as-is and continue the discussion of quota elsewhere?

@davidopp

sorry, previous comment was for @derekwaynecarr, fixed now

@derekwaynecarr

@davidopp - i think i am fine deferring the quota problem to another proposal.

The cluster may have many user defined priority classes for
various use cases. The following list is an example of what the priorities and
their values may look like.
We decided to ship Kubernetes with no predefined priority classes, but any

The section later called "system PriorityClassName" seems to imply that this sentence is not correct?


Fixed it. Thanks!

Priority admission controller ensures that new priority classes will not be
created with those names. They are used for critical system pods that must not
be preempted. We set default policies that deny creation of pods with these
`system` priorities. Cluster admins can authorize users or service accounts to

You seem to be using "system" both to refer to a specific PriorityClassName and to refer to a set of different PriorityClassNames (system being one of them) that require pre-approval to use. You should use different words for these two concepts. Also, earlier you say everything above one billion is reserved for these kinds of priority classes, but you don't mention that here.


We initially wanted to have only one system priority class name, but we later decided to have more than one. I fixed the text.

Priority admission controller ensures that new priority classes will not be
created with those names. They are used for critical system pods that must not
be preempted. We set default policies that deny creation of pods with these
system priorities. Cluster admins can authorize users or service accounts to

Would it be more accurate to say "We set default policies that deny creation of pods with PriorityClassNames corresponding to these priorities"?


Yes. Done.

@davidopp

Just one comment, then LGTM

@davidopp

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2017
@bsalamat bsalamat merged commit 9b7c8fa into kubernetes:master Jul 13, 2017
@bsalamat bsalamat deleted the priority_api branch July 13, 2017 18:04
erinboyd pushed a commit to erinboyd/community that referenced this pull request Oct 23, 2017
* Add a design proposal for adding priority to Kubernetes API
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
* Add a design proposal for adding priority to Kubernetes API