dynamic resource allocation: remove ResourceClaimStatus.Scheduling #13
Conversation
process, but it would work. But this does not work for user-created ResourceClaims.
Do we prevent delayed allocation for those? How?
See also https://github.com/kubernetes/enhancements/pull/3064#discussion_r902133513
<<[/UNRESOLVED]>>
My preferred approach for solving this race is to keep DeallocationRequested
in the ResourceClaimStatus
. It is used during scheduling, but could also serve other purposes.
OK. I spent a few minutes thinking about how I would describe this loop with emphasis on NOT doing allocations until the end.
* ResourceClass "gpu" is created:
ResourceClass "gpu" {
spec: {
driver: foobar.example.com
params: ...
}
}
* ResourceClaim "cfoo" is created
ResourceClaim "cfoo" {
spec: {
class: gpu
params: ...
mode: Delayed
}
status: {
impl: nil
nodeSelector: nil
}
}
* Driver "foobar.example.com" observes Claim creation for itself
(cfoo.spec.driver)
* Pod "pfoo" is created using Claim "cfoo"
* Scheduler sees cfoo.status.impl == nil && cfoo.spec.mode == Delayed
* Scheduler evaluates pfoo for pod constraints and builds a list of
candidate nodes:
- nodeName
- nodeSelector
- requests
- tolerations
- affinity
* Scheduler evaluates pfoo for volume constraints and filters the list of
candidate nodes
* Scheduler does delayed allocation for volumes and filters the list of
candidate nodes
* Scheduler evaluates pfoo for bound ResourceClaims and filters the list of
candidate nodes
* Scheduler evaluates pfoo for in-progress ResourceClaims and, if there are any,
comes back to this pod later.
* Scheduler now has a list of nodes that SHOULD accommodate pfoo except for
unbound ResourceClaims
* Scheduler creates PodScheduling "pfoo"
PodScheduling "pfoo" {
spec: {
candidateNodes: [ up to 32 nodes in priority order ]
parties: [{
resourceClaim: {
name: cfoo
class: gpu
}}
]
}
status: {
viable: map[party] -> list of nodes {}
alternatives: map[party] -> list of nodes or node selector
}
}
* In case of multiple schedulers, one wins this race and the other aborts.
* Scheduler waits for either:
a) all parties to publish a list of nodes (even empty)
b) timeout O(30s)
* Driver observes PodScheduling with reference to a Claim it owns
* Driver considers which of pfoo.spec.candidateNodes are viable
* Driver sets pfoo status, one of:
- viable -> [ all candidate nodes which are viable ]
- alternatives -> [ up to 32 other nodes or a node selector ]
* Scheduler observes PodScheduling mutation
* If there is a node which is viable for all parties, scheduler does something
to tell the driver it is OK to allocate on that node (what exactly?? see the
intersection sketch after this walkthrough)
* If there is no node which is viable for all parties, scheduler sets
candidateNodes to a new set of nodes, considering the alternatives proposed
* Repeat until a node is selected
* Driver sets cfoo.status.Conditions[Allocating] = true
* If 2 pods were racing this far, the driver can fail one of them
* Scheduler sets pfoo.spec.nodeName
* Kubelet waits for Claim cfoo.status.impl to point to a valid object
* If Kubelet can not run pfoo for any reason, kill it and start over
- downside: no opportunity to reallocate the resource?
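To make the negotiation step above concrete, here is a minimal Go sketch (a hypothetical helper, not KEP code) of how a scheduler could intersect the per-party viable node lists and pick the highest-priority candidate node that every party accepts:

```go
// Hypothetical sketch: candidateNodes is the priority-ordered list from the
// PodScheduling spec, viable maps each party to the nodes it reported as
// viable. Return the first candidate acceptable to all parties.
package main

import "fmt"

func pickCommonNode(candidateNodes []string, viable map[string][]string) (string, bool) {
	for _, node := range candidateNodes {
		acceptedByAll := true
		for _, nodes := range viable {
			found := false
			for _, n := range nodes {
				if n == node {
					found = true
					break
				}
			}
			if !found {
				acceptedByAll = false
				break
			}
		}
		if acceptedByAll {
			return node, true
		}
	}
	// No common node: rebuild candidateNodes from the proposed alternatives.
	return "", false
}

func main() {
	candidates := []string{"node-a", "node-b", "node-c"}
	viable := map[string][]string{
		"claim cfoo": {"node-b", "node-c"},
		"claim cbar": {"node-c"},
	}
	fmt.Println(pickCommonNode(candidates, viable)) // node-c true
}
```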
Should this also consider a node selector in the ResourceClass? In one comment you said it would be good as an initial filter, but then you questioned whether it is needed when I added it to the API. FWIW, I still think it's useful because it can trim candidateNodes, which becomes even more important when that is limited to 32 nodes. The chance of hitting the right node when only a single node in the cluster has one particular kind of resource is low when the cluster is big.
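As a rough illustration of that pre-filtering argument, here is a hedged Go sketch (assumed helper, label matching simplified to required labels rather than a full v1.NodeSelector) of trimming candidateNodes with a class-level node selector before drivers ever see them:

```go
// Hypothetical sketch: drop candidate nodes that do not match the labels
// required by the (assumed) ResourceClass node selector.
package main

import "fmt"

func trimCandidates(candidates []string, nodeLabels map[string]map[string]string, classSelector map[string]string) []string {
	var suitable []string
	for _, node := range candidates {
		labels := nodeLabels[node]
		matches := true
		for key, value := range classSelector {
			if labels[key] != value {
				matches = false
				break
			}
		}
		if matches {
			suitable = append(suitable, node)
		}
	}
	return suitable
}

func main() {
	nodeLabels := map[string]map[string]string{
		"node-a": {"example.com/gpu": "true"},
		"node-b": {},
	}
	// Only node-a carries the label required by the hypothetical class selector.
	fmt.Println(trimCandidates([]string{"node-a", "node-b"}, nodeLabels,
		map[string]string{"example.com/gpu": "true"}))
}
```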
"does delayed allocation for volumes" = "trigger volume provisioning"? That has to come after a node has been selected, which isn't the case yet.
Which claims get listed here? Only those which are unallocated or all that are referenced by the Pod (regardless of immediate vs. late allocation and allocation state)? I think it has to be all of them. Everything else is volatile. Even immediate vs. late allocation can change because the ResourceClaim might get deleted and recreated while the pod is pending.
API question: what is "party" here? "parties" above is a slice with no unique string. The index number? Would that be a literal map or a list with an index field as key?
Does that refer to multiple schedulers trying to schedule the same pod? I don't think that this can happen. I don't know exactly how sharding a cluster works, but I was under the impression that each pod gets assigned to exactly one scheduler. Either way, only one scheduler can create the PodScheduling object, so this statement isn't wrong.
The current KEP doesn't necessarily wait for feedback from drivers. For a single party that makes sense because triggering an allocation attempt can be done in parallel and the outcome is either positive (allocation succeeds, pod can get scheduled) or a cheap failure (continue with information provided by driver). I'm fine with dropping this optimization. I was trying to make scheduling as fast as possible because that seemed to be one of the main objections, but IMHO simplicity is more important. We can still tweak it later. @alculquicondor, do you agree?
It could set PodScheduling.Status.SelectedNode.
Is this necessary?
With "fail one of them" you mean "proceed with allocation for one pod and ignore the other", right? The other pod might still have a chance to run (shared resource) so there is no particular action for the "loosing" pod.
... while resource allocations are still going on
Which might not happen anytime soon if the driver is out of resources. In the meantime the pod is pending on the node and its requested resources (RAM, CPU) are set aside for it. This is basically how Akri works today, using the device plugin API.
I don't think kubelet has any means of killing a pod that it currently cannot run and no means to distinguish between temporary and permanent failures. We would have to merge https://github.com/kubernetes-sigs/descheduler into kubelet for this to work or do something similar in kubelet.
Yep. This is my main concern. Everything else above seems doable, but IMHO we have to let the scheduler wait for completion of all resource allocations and deal with the situation that despite all of the careful checks beforehand, resources might have to be reallocated.
Sure - anything we can do to pre-filter and avoid negotiation loops is a win.
Note 32 is picked from thin air. It could be 64 or even 256 - it just can't be defined as "all". If we set it to 256 we'll effectively have "all" for 90% of clusters.
Yeah, I was handwaving a bit because I didn't remember how it worked.
I was thinking of it as only unallocated, delayed allocation ones. We need to wait for immediate allocation first, because it might be 90% done by the time the pod starts. There are some edge-cases that are just unreasonable to expect to work perfectly. If an immediate-allocation resource is deleted while a pod is scheduling against it, and that pod fails to run, I don't really have too much sympathy. In my mind this adds support to the idea that ephemeral claims MUST be delayed allocation. Ephemeral + Immediate is nonsense.
Woops, hit the button too soon, more coming.
Don't take what I wrote as a literal API proposal. What I meant by "party" is some way of identifying an actor and a reason, sufficient to capture all of the individual decisions that need to be made. If you have 2 claims, I imagine it is 2 parties - even if it is the same driver it is 2 decisions and 2 sets of candidates. We could totally explore the idea that the driver could do its own intersection and save steps, though.
Yeah, it's a weird case and should not happen, but it's clear that it's not a problem.
I was operating on the idea that allocations are expensive, and the complexity of allocating and deallocating in a loop seemed (to me) to be the thing to avoid. This really is at the heart of my concerns.
I'm skeptical that we know what is an optimization yet. :)
I left this open because it's not clear to me what the semantic wants to be. If I create a ResourceClaim, I don't know which node(s) that claim will be reachable from. I don't know if it will be reachable from a single node or from all nodes. For a delayed-allocation claim, we need an unambiguous signal which says "it's OK to allocate now" and that has to include "...on at least this specific node". My instinct is that this "feels like" part of the claim API, but it also "feels" dirty: "on this node" is meaningless for a resource available on all nodes. So one option would be to put that signal in the claim. Another would be that drivers need to watch all pods, filter for pods that have claims against classes which use that driver, and then wait for Pod.spec.NodeName to be set. Watching pods is always a scary proposition.
The goal here was to interlock. We need some way to signal that a Delayed claim is being worked on by "someone else" so that all-but-one pods can switch from "actively try to allocate this resource" mode to "wait for this resource to be allocated" mode.
I wrote it this way on the assumption that parallelizing was valuable. We could also wait for all allocations and then re-confirm(?) availability.
That's a fair point.
Sure it does. It can just mark the pod as Failed.
We have to accept that there will be unsolvable conditions. A resource might be pinned to a node that can't run a pod that wants to use that resource. This is little comfort for someone who asked for a resource bound to a specific pod, saw a resource allocated FOR that pod, and then has the pod fail to schedule. That said, it feels like a relatively low-risk scenario to me, which requires a much more complicated API and lifecycle to categorically avoid. I can see extending the ResourceClaim API to include an intermediate state ("Allocated but not initialized") or something like that, during which heroic actions like "deallocate this and try again" might be taken, but not without some evidence that it is necessary.
It was mentioned a few times that we must not have arrays of unlimited size. But doesn't that mean that we also need a size limit for a NodeSelector, which contains arrays? Is that already done anywhere? Or is the difference that "normally", when used as intended, NodeSelectors do not grow beyond a certain size (although the API server and the type would allow it)?
That also works. We then just have to remember that the list might have to be changed over time and some of the listed claims might become allocated. It's basically a snapshot of "what the scheduler wants to know" and then drivers will need to check how they respond.
Agreed. I fully support that once a pod has more than one pending claim or volume, the scheduler should wait for information. It's not just that allocation may be expensive, recovering from a bad choice also is hard. For a single pending claim, it may trigger allocation in parallel to asking for information. But that's an optimization to cut down on pod scheduling latency. I wonder whether it is even worth having it in the KEP and whether it then should be implemented right away or should be delayed until later.
I don't see a problem with it. For a resource driver, it means "allocate all pending claims so that they (probably) will be usable on this node". This cannot be a full guarantee that scheduling will work out and perhaps there are constraints that only get discovered later which prevent using a claim even though it was allocated for a node, but it should be a good first bet. Instead of a single node, we could have multiple, sorted by priority. The semantic would still be "it's okay to allocate now for any of these". Then drivers have some more flexibility if one selected node turns out to be unusable at allocation time because conditions changed.
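A hedged sketch of that driver-side semantic, assuming a hypothetical tryAllocate helper: the driver walks the priority-ordered node list and allocates on the first node that still has capacity.

```go
// Hypothetical driver-side sketch: nodesByPriority is the list of acceptable
// nodes from the scheduler, tryAllocate stands in for the driver's real
// allocation logic. Allocate on the first node that works.
package driver

import "errors"

func allocateForNodes(claim string, nodesByPriority []string, tryAllocate func(claim, node string) error) (string, error) {
	for _, node := range nodesByPriority {
		if err := tryAllocate(claim, node); err == nil {
			// Report the chosen node back via the claim's allocation result.
			return node, nil
		}
	}
	return "", errors.New("no acceptable node has capacity; ask the scheduler for new candidates")
}
```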
There are network-attached devices that will only be available on some nodes; it's not a strict "one node" vs. "all nodes". The advantage of putting this into PodScheduling is that when a driver sees this trigger, it can consider all pending claims at once while working on a PodScheduling object. Instead of allocating one claim and failing the next (think "two GPUs requested, but node only has one"), it can fail both allocations. Doing the same when SelectedNode is in ResourceClaimStatus is harder.
Agreed. Waiting for NodeName also would be too late. At that point the pod cannot be placed anymore on those nodes which have the necessary resources available.
I don't think we need to place other pods on hold through the API. The driver knows which allocations are in progress and can ignore attempts to allocate for another pod. The result of the allocation then determines how to proceed with the other pod.
IMHO it is crucial that we wait for allocation to finish before setting NodeName. It's always possible that something doesn't go as planned.
I was thinking of actually deleting the pod, but you are right, setting it to failed comes close enough. But my point about "distinguish between temporary and permanent failures" still stands. We would need some policy for how long it waits for a claim before giving up.
I'm okay with keeping this out of the implementation for now until we know more about how often it goes wrong. I was less sure whether it should still be in the KEP and just not get implemented, but I came to the conclusion that it is better to make the API and scheduler description match what we intend to implement. "Deallocate" can get moved to "alternatives" in the meantime.
A separate PodScheduling object has several advantages:
- users don't need to see the information exchanged between scheduler and drivers
- a selected node and potential nodes automatically apply to all pending claims
- drivers can make holistic decisions about resource availability, for example when a pod requests two distinct GPUs but only some nodes have more than one or when there are interdependencies with other drivers

Deallocate gets renamed to DeallocationRequested to make it describe the state of the claim, not an imperative. The reason why it needs to remain in ResourceClaimStatus is explained better.

Because the scheduler extender API has no support for Reserve and Unreserve, the previous proposal for replacing usage of PodScheduling with webhook calls is no longer applicable and would have to be extended. This may be feasible, but is more complicated and is left out for now.
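For orientation, here is a rough Go sketch of the PodScheduling shape this describes. The field names are assumptions drawn from this thread, not necessarily the final KEP API; ObjectMeta/TypeMeta are omitted for brevity.

```go
// Hypothetical sketch of the PodScheduling object discussed above.
package api

type PodScheduling struct {
	Spec   PodSchedulingSpec
	Status PodSchedulingStatus
}

type PodSchedulingSpec struct {
	// SelectedNode, if set, tells drivers that it is okay to allocate the
	// pod's pending claims for this node.
	SelectedNode string
	// PotentialNodes lists nodes (bounded in size, e.g. 32) that the
	// scheduler considers viable apart from unallocated claims.
	PotentialNodes []string
}

type PodSchedulingStatus struct {
	// Claims holds driver feedback, one entry per pending claim.
	Claims []ResourceClaimSchedulingStatus
}

type ResourceClaimSchedulingStatus struct {
	// Name of the ResourceClaim this feedback is about.
	Name string
	// UnsuitableNodes lists potential nodes on which this claim cannot be
	// allocated.
	UnsuitableNodes []string
}
```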
I believe we don't need this field. By not having it in the first place we avoid all questions around what should be listed here.
I have pushed an update to this branch with one commit that:
We can still decide to not implement the recovery.
@@ -937,9 +952,6 @@ The entire scheduling section is tentative. Key opens:
- Support arbitrary combinations of user- vs. Kubernetes-managed ResourceClaims
  and immediate vs. late allocation?
  https://github.com/kubernetes/enhancements/pull/3064#discussion_r901948474
- Can and should `SelectedNode`, `SelectedUser`, `Deallocate` be moved to
  `PodScheduling` or be handled differently?
  https://github.com/pohly/enhancements/pull/13/files
SelectedNode can be moved. SelectedUser becomes obsolete (because we are doing pod scheduling, the pod is the intended user by definition). Deallocate must stay where it is to ensure atomic status updates.
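A hedged sketch of the resulting ResourceClaimStatus, with field names taken from this discussion rather than the final API and the node selector simplified:

```go
// Hypothetical sketch: SelectedNode/SelectedUser are gone, while
// DeallocationRequested stays so that the "deallocation in progress, do not
// use this claim" signal and the allocation result live in the same object
// and can be updated atomically.
package api

type ResourceClaimStatus struct {
	// DriverName identifies the driver responsible for (de)allocation.
	DriverName string
	// Allocation is set by the driver once the claim has been allocated.
	Allocation *AllocationResult
	// DeallocationRequested tells the driver that this claim must be
	// deallocated; the claim must not be reserved or used while it is set.
	DeallocationRequested bool
}

type AllocationResult struct {
	// NodeLabels approximates a node selector restricting where the
	// allocated resource is usable; empty means "usable on all nodes".
	NodeLabels map[string]string
}
```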
Naming question: do we need some other name? The plural form is "cute", but I am not sure whether this is what we are going for here 😅

$ kubectl get podschedulings
No resources found in default namespace.
Let's continue in kubernetes#3502.
SelectedNode and Deallocate (now called DeallocationRequested to make it
describe the state of the claim, not an imperative) get moved into
PodScheduling. The advantage is a cleaner API.
Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.
The main disadvantage is that "deallocation in progress, do not use this claim"
can no longer be detected by looking at the ResourceClaimStatus. This leads to
a race condition with no known solution.