dynamic resource allocation: remove ResourceClaimStatus.Scheduling #13

Conversation

@pohly commented Jun 21, 2022

SelectedNode and Deallocate (now called DeallocationRequested to make it
describe the state of the claim, not an imperative) get moved into
PodScheduling. The advantage is a cleaner API.

Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.

The main disadvantage is that "deallocation in progress, do not use this claim"
can no longer be detected by looking at the ResourceClaimStatus. This leads to
a race condition with no known solution.

process, but it would work. But this does not work for user-created ResourceClaims.
Do we prevent delayed allocation for those? How?
See also https://github.com/kubernetes/enhancements/pull/3064#discussion_r902133513
<<[/UNRESOLVED]>>

@pohly (Owner Author) commented:

My preferred approach for solving this race is to keep DeallocationRequested in the ResourceClaimStatus. It is used during scheduling, but could also serve other purposes.
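
For illustration, keeping the field in the claim status could look roughly like this Go sketch (field and type names are assumptions for this discussion, not the proposed API):

```go
package api

import corev1 "k8s.io/api/core/v1"

// Sketch only: illustrates keeping DeallocationRequested in the claim status
// as argued above. Field names and types are assumptions, not the final API.
type ResourceClaimStatus struct {
	// DriverName identifies the resource driver handling this claim.
	DriverName string

	// Allocation is set by the driver once the resource has been allocated;
	// nil means "not allocated yet".
	Allocation *AllocationResult

	// DeallocationRequested describes the state of the claim: the allocation
	// should be released by the driver. While true, the claim must not be
	// used or reserved for new pods, which also covers the
	// "deallocation in progress, do not use this claim" case during scheduling.
	DeallocationRequested bool
}

// AllocationResult is a placeholder for whatever the driver records about a
// successful allocation, e.g. from which nodes the resource is reachable.
type AllocationResult struct {
	AvailableOnNodes *corev1.NodeSelector
}
```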

@thockin left a comment

OK. I spent a few minutes thinking about how I would describe this loop with emphasis on NOT doing allocations until the end.

* ResourceClass "gpu" is created:
	ResourceClass "gpu" {
		spec: {
			driver: foobar.example.com
			params: ...
		}
	}

* ResourceClaim "cfoo" is created
	ResourceClaim "cfoo" {
		spec: {
			class: gpu
			params: ...
			mode: Delayed
		}
		status: {
			impl: nil
			nodeSelector: nil
		}
	}

* Driver "foobar.example.com" observes Claim creation for itself
  (cfoo.spec.driver)

* Pod "pfoo" is created using Claim "cfoo"

* Scheduler sees cfoo.status.impl == nil && cfoo.spec.mode == Delayed

* Scheduler evaluates pfoo for pod constraints and builds a list of
  candidate nodes:
  - nodeName
  - nodeSelector
  - requests
  - tolerations
  - affinity

* Scheduler evaluates pfoo for volume constraints and filters the list of
  candidate nodes

* Scheduler does delayed allocation for volumes and filters the list of
  candidate nodes

* Scheduler evaluates pfoo for bound ResourceClaims and filters the list of
  candidate nodes

* Scheduler evaluates pfoo for in-progress ResourceClaims and, if there are
  any, comes back to this pod later.

* Scheduler now has a list of nodes that SHOULD accommodate pfoo except for
  unbound ResourceClaims

* Scheduler creates PodScheduling "pfoo"
	PodScheduling "pfoo" {
		spec: {
			candidateNodes: [ up to 32 nodes in priority order ]
			parties: [{
				resourceClaim: {
					name: pfoo
					class: gpu
				}}
			]
		}
		status: {
			viable: map[party] -> list of nodes {}
			alternatives: map[party] -> list of nodes or node selector
		}
	}

* In case of multiple schedulers, one wins this race and the other aborts.

* Scheduler waits for either:
  a) all parties to publish a list of nodes (even empty)
  b) timeout O(30s)

* Driver observes PodScheduling with reference to a Claim it owns

* Driver considers which of pfoo.spec.candidateNodes are viable

* Driver sets pfoo status, one of:
  - viable -> [ all candidate nodes which are viable ]
  - alternatives -> [ up to 32 other nodes or a node selector ]

* Scheduler observes PodScheduling mutation

* If there is a node which is viable for all parties, scheduler does something
  to tell the driver it is OK to allocate on that node (what exactly??)

* If there is no node which is viable for all parties, scheduler sets
  candidateNodes to a new set of nodes, considering the alternatives proposed

* Repeat until a node is selected

* Driver sets cfoo.status.Conditions[Allocating] = true

* If 2 pods were racing this far, the driver can fail one of them

* Scheduler sets pfoo.spec.nodeName

* Kubelet waits for Claim cfoo.status.impl to point to a valid object

* If Kubelet can not run pfoo for any reason, kill it and start over
  - downside: no opportunity to reallocate the resource?

@pohly commented Jun 22, 2022

  • Scheduler evaluates pfoo for pod constraints and builds a list of
    candidate nodes:

Should this also consider a node selector in the ResourceClass? In one comment you said it would be good as an initial filter, but then you questioned whether it is needed when I added it to the API. FWIW, I still think it's useful because it can trim candidateNodes, which becomes even more important when that is limited to 32 nodes. The chance of hitting the right node when only a single node in the cluster has one particular kind of resource is low when the cluster is big.
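
For illustration, a ResourceClass carrying such a node selector might look roughly like this (a Go sketch; the SuitableNodes field name is an assumption, not settled API):

```go
package api

import corev1 "k8s.io/api/core/v1"

// Sketch only: a ResourceClass with an optional node selector that the
// scheduler can use as a pre-filter. The SuitableNodes field name is an
// assumption for illustration, not settled API.
type ResourceClass struct {
	Name string

	// DriverName is the resource driver handling claims of this class,
	// e.g. "foobar.example.com".
	DriverName string

	// SuitableNodes, if set, tells the scheduler that only matching nodes can
	// possibly provide this resource. The scheduler can then trim the pod's
	// candidate node list before asking the driver, which matters most when
	// candidateNodes is capped (32, 64, ...) and only a few nodes in a large
	// cluster have the resource.
	SuitableNodes *corev1.NodeSelector
}
```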

  • Scheduler does delayed allocation for volumes and filters the list of
    candidate nodes

"does delayed allocation for volumes" = "trigger volume provisioning"? That has to come after a node has been selected, which isn't the case yet.

  	parties: [{
  		resourceClaim: {

Which claims get listed here? Only those which are unallocated or all that are referenced by the Pod (regardless of immediate vs. late allocation and allocation state)?

I think it has to be all of them. Everything else is volatile. Even immediate vs. late allocation can change because the ResourceClaim might get deleted and recreated while the pod is pending.

  	viable: map[party] -> list of nodes {}
  	alternatives: map[party] -> list of nodes or node selector

API question: what is "party" here? "parties" above is a slice with no unique string. The index number?

Would that be a literal map or a list with an index field as key?

  • In case of multiple schedulers, one wins this race and the other aborts.

Does that refer to multiple schedulers trying to schedule the same pod? I don't think that this can happen. I don't know exactly how sharding a cluster works, but I was under the impression that each pod gets assigned to exactly one scheduler. Either way, only one scheduler can create the PodScheduling object, so this statement isn't wrong.

  • Scheduler waits for either:
    a) all parties to publish a list of nodes (even empty)
    b) timeout O(30s)

The current KEP doesn't necessarily wait for feedback from drivers. For a single party that makes sense because triggering an allocation attempt can be done in parallel and the outcome is either positive (allocation succeeds, pod can get scheduled) or a cheap failure (continue with information provided by driver).

I'm fine with dropping this optimization. I was trying to make scheduling as fast as possible because that seemed to be one of the main objections, but IMHO simplicity is more important. We can still tweak it later. @alculquicondor, do you agree?

  • If there is a node which is viable for all parties, scheduler does something
    to tell the driver it is OK to allocate on that node (what exactly??)

It could set PodScheduling.Status.SelectedNode.
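
Roughly, the object shape implied here could be the following (a Go sketch; field names and their spec-vs-status placement are assumptions, not the final API):

```go
package api

// Sketch only: a possible shape of the PodScheduling object discussed in this
// thread. Field names and their placement in spec vs. status are assumptions.
type PodScheduling struct {
	// Same name and namespace as the pod it belongs to.
	Name string

	Spec   PodSchedulingSpec
	Status PodSchedulingStatus
}

type PodSchedulingSpec struct {
	// CandidateNodes is the scheduler's current shortlist for the pod, in
	// priority order, capped at some fixed size (32 is a placeholder number).
	CandidateNodes []string
}

type PodSchedulingStatus struct {
	// SelectedNode, once set by the scheduler, tells the drivers: it is OK
	// to allocate all pending claims of this pod now, such that they are
	// usable from this node.
	SelectedNode string

	// Claims collects per-claim feedback from the drivers, e.g. which of the
	// candidate nodes are viable (or, inverted, unsuitable) for each claim.
	Claims []ResourceClaimSchedulingStatus
}

type ResourceClaimSchedulingStatus struct {
	Name            string
	UnsuitableNodes []string
}
```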

  • Driver sets cfoo.status.Conditions[Allocating] = true

Is this necessary?

  • If 2 pods were racing this far, the driver can fail one of them

With "fail one of them" you mean "proceed with allocation for one pod and ignore the other", right? The other pod might still have a chance to run (shared resource) so there is no particular action for the "loosing" pod.

  • Scheduler sets pfoo.spec.nodeName

... while resource allocations are still going on

  • Kubelet waits for Claim cfoo.status.impl to point to a valid object

Which might not happen anytime soon if the driver is out of resources. In the meantime the pod is pending on the node and its requested resources (RAM, CPU) are set aside for it. This is basically how Akri works today, using the device plugin API.

  • If Kubelet can not run pfoo for any reason, kill it and start over

I don't think kubelet has any means of killing a pod that it currently cannot run and no means to distinguish between temporary and permanent failures. We would have to merge https://github.com/kubernetes-sigs/descheduler into kubelet for this to work or do something similar in kubelet.

  • downside: no opportunity to reallocate the resource?

Yep. This is my main concern. Everything else above seems doable, but IMHO we have to let the scheduler wait for completion of all resource allocations and deal with the situation that despite all of the careful checks beforehand, resources might have to be reallocated.

@thockin commented Jun 22, 2022

Should this also consider a node selector in the ResourceClass?

Sure - anything we can do to pre-filter and avoid negotiation loops is a win.

I still think it's useful because it can trim candidateNodes, which becomes even more important when that is limited to 32 nodes.

Note 32 is picked from thin air. It could be 64 or even 256 - it just can't be defined as "all". If we set it to 256 we'll effectively have "all" for 90% of clusters.

"does delayed allocation for volumes" = "trigger volume provisioning"? That has to come after a node has been selected, which isn't the case yet.

Yeah, I was handwaving a bit because I didn't remember how it worked.

Which claims get listed here? Only those which are unallocated or all that are referenced by the Pod (regardless of immediate vs. late allocation and allocation state)?

I think it has to be all of them. Everything else is volatile. Even immediate vs. late allocation can change because the ResourceClaim might get deleted and recreated while the pod is pending.

I was thinking of it as only unallocated, delayed allocation ones. We need to wait for immediate allocation first, because it might be 90% done by the time the pod starts. There are some edge-cases that are just unreasonable to expect to work perfectly. If an immediate-allocation resource is deleted while a pod is scheduling against it, and that pod fails to run, I don't really have too much sympathy.

In my mind this adds support to the idea that ephemeral claims MUST be delayed allocation. Ephemeral + Immediate is nonsense

@thockin commented Jun 22, 2022

woops, hit the button too soon, more coming

@thockin commented Jun 22, 2022

API question: what is "party" here? "parties" above is a slice with no unique string. The index number?

Would that be a literal map or a list with an index field as key?

Don't take what I wrote as a literal API proposal. What I meant by "party" is some way of identifying an actor and a reason, sufficient to capture all of the individual decisions that need to be made. If you have 2 claims, I imagine it is 2 parties - even if it is the same driver it is 2 decisions and 2 sets of candidates. We could totally explore the idea that the driver could do its own intersection and save steps, though.

Does that refer to multiple schedulers trying to schedule the same pod?

Yeah, it's a weird case and should not happen, but it's clear that it's not a problem.

The current KEP doesn't necessarily wait for feedback from drivers. For a single party that makes sense because triggering an allocation attempt can be done in parallel and the outcome is either positive (allocation succeeds, pod can get scheduled) or a cheap failure (continue with information provided by driver).

I was operating on the idea that allocations are expensive, and the complexity of allocating and deallocating in a loop seemed (to me) to be the thing to avoid. This really is at the heart of my concerns.

I'm fine with dropping this optimization. I was trying to make scheduling as fast as possible

I'm skeptical that we know what is an optimization yet. :)

If there is a node which is viable for all parties, scheduler does something
to tell the driver it is OK to allocate on that node (what exactly??)

It could set PodScheduling.Status.SelectedNode.

I left this open because it's not clear to me what the semantic wants to be. If I create a ResourceClaim, I don't know which node(s) that claim will be reachable from. I don't know if it will be reachable from a single node or from all nodes. For a delayed-allocation claim, we need an unambiguous signal which says "it's OK to allocate now" and that has to include "...on at least this specific node".

My instinct is that this "feels like" part of the claim API, but it also "feels" dirty - "on this node" is meaningless for a resource available on all nodes.

So one option would be to set Claim.status.allocationMustIncludeNode (bad name, making a point).

Another would be that drivers need to watch all pods, filter for pods that have claims against classes which use that driver, and then wait for Pod.spec.NodeName to be set. Watching pods is always a scary proposition.

Driver sets cfoo.status.Conditions[Allocating] = true

Is this necessary?

If 2 pods were racing this far, the driver can fail one of them
With "fail one of them" you mean "proceed with allocation for one pod and ignore the other", right? The other pod might still have a chance to run (shared resource) so there is no particular action for the "loosing" pod.

The goal here was to interlock. We need some way to signal that a Delayed claim is being worked on by "someone else" so that all-but-one pods can switch from "actively try to allocate this resource" mode to "wait for this resource to be allocated" mode.

Scheduler sets pfoo.spec.nodeName

... while resource allocations are still going on

I wrote it this way on the assumption that parallelizing was valuable. We could also wait for all allocations and then re-confirm(?) availability.

Kubelet waits for Claim cfoo.status.impl to point to a valid object

Which might not happen anytime soon if the driver is out of resources. In the meantime the pod is pending on the node and its requested resources (RAM, CPU) are set aside for it. This is basically how Akri works today, using the device plugin API.

That's a fair point.

If Kubelet can not run pfoo for any reason, kill it and start over

I don't think kubelet has any means of killing a pod that it currently cannot run and no means to distinguish between temporary and permanent failures.

Sure it does. It can just mark the pod as Failed.

downside: no opportunity to reallocate the resource?

Yep. This is my main concern. Everything else above seems doable, but IMHO we have to let the scheduler wait for completion of all resource allocations and deal with the situation that despite all of the careful checks beforehand, resources might have to be reallocated.

We have to accept that there will be unsolvable conditions. A resource might be pinned to a node that can't run a pod that wants to use that resource. This is little comfort for someone who asked for a resource bound to a specific pod, saw a resource allocated FOR that pod, and then has the pod fail to schedule. That said, it feels like a relatively low-risk scenario to me, which requires a much more complicated API and lifecycle to categorically avoid.

I can see extending the ResourceClaim API to include an intermediate state ("Allocated but not initialized") or something like that, during which heroic actions like "deallocate this and try again" might be taken, but not without some evidence that it is necessary.

@pohly pohly force-pushed the dynamic-resource-allocation-upstream branch 4 times, most recently from ea13213 to 08c0828 on June 22, 2022 at 20:54
@pohly commented Jun 27, 2022

I still think it's useful because it can trim candidateNodes, which becomes even more important when that is limited to 32 nodes.

Note 32 is picked from thin air. It could be 64 or even 256 - it just can't be defined as "all". If we set it to 256 we'll effectively have "all" for 90% of clusters.

It was mentioned a few times that we must not have arrays of unlimited size. But doesn't that mean that we also need a size limit for a NodeSelector, which contains arrays? Is that already done anywhere?

Or is the difference that "normally", when used as intended, NodeSelectors do not grow beyond a certain size (although the API server and the type would allow it)?

Which claims get listed here? Only those which are unallocated or all that are referenced by the Pod (regardless of immediate vs. late allocation and allocation state)?
I think it has to be all of them. Everything else is volatile. Even immediate vs. late allocation can change because the ResourceClaim might get deleted and recreated while the pod is pending.

I was thinking of it as only unallocated, delayed allocation ones.

That also works. We then just have to remember that the list might have to be changed over time and some of the listed claims might become allocated. It's basically a snapshot of "what the scheduler wants to know" and then drivers will need to check how they respond.

The current KEP doesn't necessarily wait for feedback from drivers. For a single party that makes sense because triggering an allocation attempt can be done in parallel and the outcome is either positive (allocation succeeds, pod can get scheduled) or a cheap failure (continue with information provided by driver).

I was operating on the idea that allocations are expensive, and the complexity of allocating and deallocating in a loop seemed (to me) to be the thing to avoid. This really is at the heart of my concerns.

Agreed. I fully support that once a pod has more than one pending claim or volume, the scheduler should wait for information. It's not just that allocation may be expensive; recovering from a bad choice is also hard.

For a single pending claim, it may trigger allocation in parallel to asking for information. But that's an optimization to cut down on pod scheduling latency. I wonder whether it is even worth having in the KEP and, if so, whether it should be implemented right away or delayed until later.
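
As an illustrative sketch of that policy (made-up function name, not the actual scheduler plugin):

```go
package scheduler

// Sketch only: the policy described above as illustrative pseudocode, not the
// actual scheduler plugin. The function name and parameters are made up.
func mayTriggerAllocationInParallel(pendingClaims, pendingVolumes int) bool {
	// With exactly one pending claim and nothing else pending, the scheduler
	// may pick a node and trigger allocation right away, in parallel to
	// publishing candidate nodes: either the allocation succeeds and the pod
	// can be scheduled, or it fails cheaply and the scheduler retries with
	// the driver's feedback.
	//
	// With more than one pending claim or volume, the scheduler should wait
	// for information first, because recovering from a bad combined choice
	// is expensive.
	return pendingClaims == 1 && pendingVolumes == 0
}
```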

If there is a node which is viable for all parties, scheduler does something
to tell the driver it is OK to allocate on that node (what exactly??)

It could set PodScheduling.Status.SelectedNode.

I left this open because it's not clear to me what the semantic wants to be.

I don't see a problem with it. For a resource driver, it means "allocate all pending claims so that they (probably) will be usable on this node". This cannot be a full guarantee that scheduling will work out and perhaps there are constraints that only get discovered later which prevent using a claim even though it was allocated for a node, but it should be a good first bet.

Instead of a single node, we could have multiple, sorted by priority. The semantic would still be "it's okay to allocate now for any of these". Then drivers have some more flexibility if one selected node turns out to be unusable at allocation time because conditions changed.

My instinct is that this "feels like" part of the claim API, but it also "feels" dirty - "on this node" is meaningless for a resource available on all nodes.

There are network-attached devices that will only be available on some nodes, it's not a strict "one node" vs. "all nodes".

The advantage of putting this into PodScheduling is that when a driver sees this trigger, it can consider all pending claims at once while working on a PodScheduling object. Instead of allocating one claim and failing the next (think "two GPUs requested, but node only has one"), it can fail both allocations. Doing the same when SelectedNode is in the ResourceClaimStatus is harder.
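
A hedged sketch of what that driver-side handling could look like (all helper names are made up; claims are identified by name for simplicity):

```go
package driver

// Sketch only: how a driver could react to the trigger on a PodScheduling
// object, considering all of the pod's pending claims at once. Function and
// helper names are made up; claims are identified by name for simplicity.

func handleTrigger(selectedNode string, pendingClaims []string) {
	if selectedNode == "" {
		// No trigger yet: only publish per-claim feedback on the candidate nodes.
		return
	}

	// Look at all pending claims together. If the selected node cannot satisfy
	// all of them ("two GPUs requested, node only has one"), report it as
	// unsuitable for every claim instead of allocating one claim and then
	// failing the next.
	if !canAllocateAllOn(selectedNode, pendingClaims) {
		for _, claim := range pendingClaims {
			reportUnsuitable(selectedNode, claim)
		}
		return
	}
	for _, claim := range pendingClaims {
		allocateOn(selectedNode, claim)
	}
}

// Placeholder helpers standing in for driver-specific logic.
func canAllocateAllOn(node string, claims []string) bool { return true }
func reportUnsuitable(node, claim string)                {}
func allocateOn(node, claim string)                      {}
```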

So one option would be to set Claim.status.allocationMustIncludeNode (bad name, making a point).

Another would be that drivers need to watch all pods, filter for pods that have claims against classes which use that driver, and then wait for Pod.spec.NodeName to be set. Watching pods is always a scary proposition.

Agreed. Waiting for NodeName would also be too late. At that point the pod can no longer be placed on those nodes which have the necessary resources available.

Driver sets cfoo.status.Conditions[Allocating] = true

Is this necessary?

The goal here was to interlock. We need some way to signal that a Delayed claim is being worked on by "someone else" so that all-but-one pods can switch from "actively try to allocate this resource" mode to "wait for this resource to be allocated" mode.

I don't think we need to place other pods on hold through the API. The driver knows which allocations are in progress and can ignore attempts to allocate for another pod. The result of the allocation then determines how to proceed with the other pod.
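
A minimal sketch of such driver-side bookkeeping, assuming purely in-memory tracking (all names are made up):

```go
package driver

import "sync"

// Sketch only: the driver remembers in memory which claims it is already
// allocating and for which pod, so it can ignore concurrent attempts without
// any extra signal in the API. All names are made up.
type allocationTracker struct {
	mu         sync.Mutex
	inProgress map[string]string // claim UID -> pod the allocation is for
}

func newAllocationTracker() *allocationTracker {
	return &allocationTracker{inProgress: map[string]string{}}
}

// tryStart reports whether the caller may begin allocating the claim for the
// given pod. It returns false while an allocation for another pod is already
// in progress; the outcome of that allocation then determines how to proceed
// with the other pod.
func (t *allocationTracker) tryStart(claimUID, pod string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if owner, busy := t.inProgress[claimUID]; busy && owner != pod {
		return false
	}
	t.inProgress[claimUID] = pod
	return true
}

func (t *allocationTracker) finish(claimUID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inProgress, claimUID)
}
```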

Scheduler sets pfoo.spec.nodeName

... while resource allocations are still going on

I wrote it this way on the assumption that parallelizing was valuable. We could also wait for all allocations and then re-confirm(?) availability.

IMHO it is crucial that we wait for allocation to finish before setting NodeName. It's always possible that something doesn't go as planned.

If Kubelet can not run pfoo for any reason, kill it and start over
I don't think kubelet has any means of killing a pod that it currently cannot run and no means to distinguish between temporary and permanent failures.

Sure it does. It can just mark the pod as Failed.

I was thinking of actually deleting the pod, but you are right, setting it to Failed comes close enough. But my point about "distinguish between temporary and permanent failures" still stands. We would need some policy for how long it waits for a claim before giving up.

I can see extending the ResourceClaim API to include an intermediate state ("Allocated but not initialized") or something like that, during which heroic actions like "deallocate this and try again" might be taken, but not without some evidence that it is necessary.

I'm okay with keeping this out of the implementation for now until we know more how often it goes wrong.

I was less sure whether it should still be in the KEP and just not get implemented, but I came to the conclusion that it is better to make the API and scheduler description match what we intend to implement. "Deallocate" can get moved to "alternatives" in the meantime.

A separate PodScheduling object has several advantages:
- users don't need to see the information exchanged
  between scheduler and drivers
- a selected node and potential nodes automatically apply to
  all pending claims
- drivers can make holistic decisions about
  resource availability, for example when a pod
  requests two distinct GPUs but only some nodes have more
  than one or when there are interdependencies with
  other drivers

Deallocate gets renamed to DeallocationRequested to make it describe the state
of the claim, not an imperative. The reason why it needs to remain in
ResourceClaimStatus is explained better.

Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.
@pohly commented Jun 30, 2022

parties: [{
resourceClaim: {
name: pfoo
class: gpu
}}
]

I believe we don't need this field. By not having it in the first place we avoid all questions around what should be listed here.
Instead, resource drivers themselves must check the pod and its referenced claims to decide what they should work on.
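
A hedged sketch of that driver-side check, assuming claim references similar to the KEP's pod.spec.resourceClaims (types and helper parameters are illustrative, not the actual API):

```go
package driver

// Sketch only: without a "parties" list, the driver walks the pod's claim
// references itself and keeps the pending ones whose class it handles. The
// types and function parameters are illustrative assumptions, not the
// actual API.

type Pod struct {
	ResourceClaimNames []string // claims referenced from pod.spec.resourceClaims
}

type ResourceClaim struct {
	Name      string
	ClassName string
	Allocated bool
}

func claimsForDriver(
	pod *Pod,
	driverName string,
	getClaim func(name string) *ResourceClaim,
	driverOfClass func(className string) string,
) []*ResourceClaim {
	var mine []*ResourceClaim
	for _, name := range pod.ResourceClaimNames {
		claim := getClaim(name)
		if claim == nil || claim.Allocated {
			continue // missing or already allocated: nothing for us to do
		}
		if driverOfClass(claim.ClassName) == driverName {
			mine = append(mine, claim)
		}
	}
	return mine
}
```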

@pohly commented Jun 30, 2022

I have pushed an update to this branch with one commit that:

  • adds SelectedNode to PodScheduling (see rationale in the commit message)
  • keeps the deallocation recovery mechanism as originally described, just with a slightly different field name (because we already discussed how it would work in the scheduler, I have tried it in my prototype, and it still seems relevant)
  • for a single claim, the scheduler can go ahead with selecting a node in parallel to asking for information
  • for more claims, it needs to wait for information

We can still decide to not implement the recovery.

@pohly pohly force-pushed the dynamic-resource-allocation-pod-scheduling branch from 36c6d23 to 4a61e12 on June 30, 2022 at 13:54
@@ -937,9 +952,6 @@ The entire scheduling section is tentative. Key opens:
 - Support arbitrary combinations of user- vs. Kubernetes-managed ResourceClaims
   and immediate vs. late allocation?
   https://github.com/kubernetes/enhancements/pull/3064#discussion_r901948474
-- Can and should `SelectedNode`, `SelectedUser`, `Deallocate` be moved to
-  `PodScheduling` or be handled differently?
-  https://github.com/pohly/enhancements/pull/13/files

@pohly (Owner Author) commented:

SelectedNode can be moved. SelectedUser becomes obsolete (because we are doing pod scheduling, the pod is the intended user by definition). Deallocate must stay where it is to ensure atomic status updates.

@pohly commented Jul 7, 2022

Scheduler creates PodScheduling "pfoo"

Naming question: do we need some other name?

The plural form is "cute", but I am not sure whether this is what we are going for here 😅

$ kubectl get podschedulings
No resources found in default namespace.

@pohly commented Sep 13, 2022

Let's continue in kubernetes#3502.

@pohly pohly closed this Sep 13, 2022
pohly pushed a commit that referenced this pull request Mar 27, 2024
Add section about discovery to compatibility version KEP