
KEP-3063: dynamic resource allocation updates for 1.26 #3502

Merged

Conversation

pohly
Contributor

@pohly pohly commented Sep 12, 2022

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 12, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 12, 2022
@alculquicondor
Member

/cc

Having the scheduler and drivers exchange availability information through the
PodScheduling object has several advantages:
- users don't need to see the information
- a selected node and potential nodes automatically apply to
  all pending claims
- drivers can make holistic decisions about
  resource availability, for example when a pod
  requests two distinct GPUs but only some nodes have more
  than one or when there are interdependencies with
  other drivers
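
For orientation, here is a rough Go sketch of the PodScheduling shape this implies, pieced together from the fields referenced later in this thread (PodSchedulingSpec.SelectedNode, spec.potentialNodes, status claims with unsuitableNodes). Names and types are illustrative; the KEP text itself is authoritative.

```go
// Sketch only: standard object metadata (TypeMeta/ObjectMeta) omitted.

// PodScheduling carries the negotiation state between the scheduler and the
// resource drivers for one pending Pod and is owned by that Pod.
type PodScheduling struct {
	Spec   PodSchedulingSpec
	Status PodSchedulingStatus
}

type PodSchedulingSpec struct {
	// SelectedNode, when set by the scheduler, asks the drivers to attempt
	// allocation of all pending delayed-allocation claims for this node.
	SelectedNode string

	// PotentialNodes lists nodes that fit the Pod's other requirements.
	// Drivers respond with per-claim unsuitable nodes. The list size is
	// capped by a constant.
	PotentialNodes []string
}

type PodSchedulingStatus struct {
	// Claims has one entry per delayed-allocation claim used by the Pod.
	Claims []ResourceClaimSchedulingStatus
}

type ResourceClaimSchedulingStatus struct {
	// Name of the claim as used in the Pod.
	Name string
	// UnsuitableNodes lists nodes on which the driver cannot allocate
	// this claim.
	UnsuitableNodes []string
}
```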

Deallocate gets renamed to DeallocationRequested to make it describe the state
of the claim, not an imperative. The reason why it needs to remain in
ResourceClaimStatus is explained better.

Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.
The API shouldn't dictate how drivers receive ResourceClass parameters.  It
might make sense to use namespaced objects, perhaps because then cleaning up a
deployment is easier, or an existing object can be reused (like a ConfigMap for
the test driver).
@pohly pohly force-pushed the dynamic-resource-allocation-upstream branch from a60476e to e69ccae Compare September 22, 2022 16:55
…er references

Previously, empty string and nil both meant the same thing (core API group). We
don't need two different ways of expressing that.
Using APIVersion in the reference leads to ambiguities when the versions
change. Using just APIGroup is better. The examples accidentally used
"apiVersion".

The naming convention (see
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#naming-of-the-reference-field)
is to have Ref as the field suffix. In our case it also leaves a path towards
(perhaps) inlining the parameters at some point, because the inlined field could
then be called just "Parameters".
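
As a minimal sketch of what such a reference could look like after this change (the type and field names here are illustrative, not the final API):

```go
// ResourceClassParametersReference points to an arbitrary, possibly
// namespaced object that holds driver-specific class parameters. Only the
// API group is recorded, not the version, to avoid ambiguities when
// versions change; the field using it carries the "Ref" suffix per the
// API conventions linked above (e.g. ParametersRef).
type ResourceClassParametersReference struct {
	// APIGroup of the referenced object. Empty for the core API group.
	APIGroup string
	// Kind of the referenced object, e.g. "ConfigMap" for the test driver.
	Kind string
	// Name of the referenced object.
	Name string
	// Namespace, if the referenced object is namespaced.
	Namespace string
}
```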
A separate staging repo will be cleaner than reusing component-helpers.
@@ -985,7 +997,8 @@ a certain resource class, a node selector can be specified in that class. That
selector is static and typically will use labels that determine which nodes may
have resources available.

-To gather information about the current state of resource availability, the
+To gather information about the current state of resource availability and to
+trigger allocation of a claim, the
scheduler creates a PodScheduling object. That object is owned by the pod and
Member

clarify if it's one per pod or one per resource claim that the pod has?

Contributor Author

"one PodScheduling object for the pod" - will add.

Contributor Author

Done.

* **resource driver** sets `podScheduling.claims[name=name of claim in pod].unsuitableNodes`
* **resource driver** clears `claim.status.selectedNode` -> next attempt by scheduler has more information and is more likely to succeed
* **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
* if *exactly one claim is pending* or *all drivers have provided information*:
Member

why exactly one claim?

Contributor Author

Because in that special case it is safe to trigger the allocation: if the node is suitable, the allocation will succeed and the pod can get scheduled without further delays. If the node is not suitable, allocation fails and the next attempt can do better because it has more information.

The same should not be done when there are multiple claims because allocation might succeed for some, but not all of them, which would force the scheduler to recover by asking for deallocation. It's better to wait for information in this case.

Contributor Author

Added to the KEP.
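
A purely illustrative sketch of that decision rule as the claim plugin might implement it (the helper name and its inputs are hypothetical):

```go
// shouldTriggerAllocation reports whether it is safe to set
// PodSchedulingSpec.SelectedNode and thereby trigger allocation.
func shouldTriggerAllocation(pendingClaims []string, driverHasAnswered map[string]bool) bool {
	// Safe case 1: exactly one claim is pending. If the node turns out to be
	// unsuitable, only that single allocation fails and the next scheduling
	// attempt has more information.
	if len(pendingClaims) == 1 {
		return true
	}
	// Safe case 2: all drivers have provided their UnsuitableNodes
	// information, so allocating all claims for the selected node is
	// unlikely to succeed only partially.
	for _, claim := range pendingClaims {
		if !driverHasAnswered[claim] {
			return false
		}
	}
	return true
}
```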

@@ -1635,7 +1641,7 @@ conditions apply:
Filter

One of the ResourceClaims satisfying these criteria is picked randomly and deallocation
-is requested by setting the Deallocate field. The scheduler then needs to wait
+is requested by setting the ResourceClaimStatus.DeallocationRequested field. The scheduler then needs to wait
Member

I still think it's better if the driver decides for itself when to deallocate.

For example, if it's a local resource, it might never deallocate from a node. Otherwise, it could start deallocating from a node when reservedFor is empty, perhaps even with a grace period.

Now, if we add a grace period, kube-scheduler might want to give a signal that it "thinks" that a deallocation could help. This signal could work this way:

  • the scheduler clears the selectedNode field.
  • the driver observes:
    • there is a Pod with an empty PodScheduling.spec.selectedNode that requires this allocated claim. Note that this is different from a Pod that was never attempted for scheduling because the PodScheduling object wouldn't exist at this time.
    • the claim is not reserved for any other pod

Then the driver could choose to trigger deallocation.

Two advantages:

  • The scheduler only updates one object.
  • The driver could be more pro-active.

Contributor Author

I still think it's better if the driver decides for itself when to deallocate.

For example, if it's a local resource, it might never deallocate from a node. Otherwise, it could start deallocating from a node when reservedFor is empty, perhaps even with a grace period.

A driver should never deallocate a resource that has a non-empty reservedFor. It cannot be sure that the resource is not in use or about to be used.

To make this work, the scheduler would first have to clear the reservedFor to indicate that deallocating the claim is okay. That's doable.

But that alone puts the claim into "allocated, available" state. That's not a state where the driver pro-actively frees the claim because it doesn't know whether the allocation is still needed. For example, claims with immediate allocation enter that state before a pod gets created which uses them.

So we need some additional indicator.

Now, if we add a grace period, kube-scheduler might want to give a signal that it "thinks" that a deallocation could help.
This signal could work this way:

* the scheduler clears the `selectedNode` field.

* the driver observes:
  
  * there is a Pod with an empty PodScheduling.spec.selectedNode that requires this allocated claim. Note that this is different from a Pod that was never attempted for scheduling because the PodScheduling object wouldn't exist at this time.
  * the claim is not reserved for any other pod

Then the driver could choose to trigger deallocation.

That would work, but there's a race condition:

  • The driver observes the above state and starts deallocation.
  • Another pod gets created which can use the same claim.
  • The pod gets added to reservedFor and gets scheduled to a node.
  • The driver finishes the deallocation and the resource is not available for the pod anymore.

This race can be fixed by adding a ClaimStatus.Deallocating boolean that the driver would have to set before starting the deallocation.

I think that would work. Setting reservedFor together with allocation becomes more important, but that's already done. Does it sound better to you?

Two advantages:

* The scheduler only updates one object.

Agreed.

* The driver could be more pro-active.

No, I think it still has to be reactive. But even that is an improvement.

Member

I can't tell if there was some compromise found here? How do we end up in this situation? Is it something like:

  • pod with 2 claims
  • driver-1 allocates and only available on nodes (a, b, c)
  • driver-2 allocates and only available on nodes (x, y, z)?

That shouldn't happen, right?

I guess that could play out as:

  • pod with 2 claims
  • scheduler suggests (a, b, c)
  • both drivers ACK
  • scheduler chooses a
  • driver-1 allocates and only available on nodes (a, b, c)
  • driver-2 allocates and only available on nodes (a, b, c)
  • Pod tries to run and resource 2 is no longer available on a?

Help me understand how we end up here?

Contributor Author

This discussion has been superseded by #3502 (comment) below. After thinking about this some more, I realized that @alculquicondor's proposal depended on not releasing a reservation (because doing so would trigger deallocation, according to the definition above), but not releasing a reservation after a failed pod scheduling attempt leads to deadlocks or poor resource utilization. I gave examples for both below.

To answer your specific example: you are right, both drivers should allocate so that the claims are usable on the same node. The problem is not with where claims are usable, but rather which pod is allowed to use a claim exclusively when claims cannot be shared.

Contributor Author

I think @alculquicondor accepted my line of argumentation because he added his LGTM after we discussed this.

Member

The problem is not with where claims are usable, but rather which pod is allowed to use a claim exclusively when claims cannot be shared

Not needed for this review, but can you illuminate an example time-sequence that leads to that?

@@ -1635,7 +1641,7 @@ conditions apply:
Filter

One of the ResourceClaims satisfying these criteria is picked randomly and deallocation
-is requested by setting the Deallocate field. The scheduler then needs to wait
+is requested by setting the ResourceClaimStatus.DeallocationRequested field. The scheduler then needs to wait
for the resource driver to react to that change and deallocate the resource.

This may make it possible to run the Pod
Member

What will be the order of this plugin compared to preemption?

Contributor Author

@pohly pohly Sep 30, 2022

This operation here is triggered by observing that it was some allocated claim which prevented pod scheduling. I think this deallocate+wait then should happen before preemption is attempted.

Would preemption be triggered at all in this case? As far as the scheduler knows, all nodes were rejected by the dynamic resource allocation plugin. Why should it evict pods from any of them when those pods were not the reason that the new pod cannot run?

Member

Why should it evict pods from any of them when those pods were not the reason that the new pod cannot run?

What if one of these pods is using the ResourceClaim that the new pod needs?

Contributor Author

@pohly pohly Oct 6, 2022

I missed this comment earlier. The answer is that taking away a claim from a pod is not supported yet, even if a more important pod is blocked by a less important pod. Support for this will have to be planned and added later.

-// A change of the PotentialNodes field in the PodScheduling object
-// triggers a check in the driver
+// A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
+// allocation attempt trigger a check in the driver
Member

Add that UnsuitableNodes should be built from the union of PotentialNodes and the old list of UnsuitableNodes.

If we only take into account PotentialNodes, there could be some back-and-forth between the scheduler and the driver where the scheduler removes UnsuitableNodes from the next PotentialNodes, then in the next attempt the scheduler adds those nodes back because it thinks they are no longer unsuitable.

Contributor Author

Agreed. I had the same thought while working on the implementation.

Contributor Author

Updated. I also included an explicit constant for the maximum size of these lists.
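
A hedged sketch of how a driver could build that union while respecting the maximum list size (the constant name and its value are placeholders; the KEP defines the real ones):

```go
// PodSchedulingNodeListMaxSize caps PotentialNodes and UnsuitableNodes.
// Placeholder value; the KEP/API defines the actual constant.
const PodSchedulingNodeListMaxSize = 128

// mergeUnsuitableNodes returns the union of the previously reported
// unsuitable nodes and the ones detected for the current PotentialNodes,
// truncated to the maximum list size. Keeping the old entries avoids the
// scheduler/driver back-and-forth described above.
func mergeUnsuitableNodes(oldUnsuitable, newlyUnsuitable []string) []string {
	seen := make(map[string]bool, len(oldUnsuitable)+len(newlyUnsuitable))
	merged := make([]string, 0, len(oldUnsuitable)+len(newlyUnsuitable))
	for _, list := range [][]string{oldUnsuitable, newlyUnsuitable} {
		for _, node := range list {
			if !seen[node] {
				seen[node] = true
				merged = append(merged, node)
			}
		}
	}
	if len(merged) > PodSchedulingNodeListMaxSize {
		merged = merged[:PodSchedulingNodeListMaxSize]
	}
	return merged
}
```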

#### Reserve

A node has been chosen for the Pod.

If using delayed allocation and the resource has not been allocated yet,
-the SelectedNode field of the ResourceClaim
+the PodSchedulingSpec.SelectedNode field
Member

You could do a single API call with the data for PotentialNodes and SelectedNode.

You could store the necessary information in memory in PreScore.

Contributor Author

I think the implementation already does that. Let me double-check and update the KEP accordingly.

Contributor Author

Done.


#### Unreserve

-The scheduler removes the Pod from the ReservedFor field because it cannot be scheduled after
+The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field because it cannot be scheduled after
Member

Maybe remove the SelectedNode instead?

Contributor Author

Doing only that prevents other pods from using the claim. If one pod is potentially stuck while waiting for some other resource, it shouldn't hold onto claims that might be used by some other pod in the meantime.

This leads to a bigger question of how long those claims should be kept in the allocated state. The underlying resources might be needed for some other claim. Let's be optimistic and assume that most of the time, allocated claims will be put to good use quickly, okay?

We can still tune this once people have some real-world experience with this.

Contributor Author

Hmm, after re-reading what we said above about triggering deallocating by clearing the reservedFor field it becomes clear that the scheduler must keep the claim reserved here. Otherwise we'll have a lot of thrashing for a pod with several claims:

  • several drivers start allocating for the selectedNode
  • most allocations succeed, one doesn't
  • reservedFor gets cleared, selectedNode unset
  • all drivers start deallocating
  • once that is done, allocation is attempted anew for all claims

I think the same argument about being optimistic also applies to reservedFor: let's keep it set as you suggested and hope that the situation gets resolved quickly or (even better) doesn't occur often.

Contributor Author

After sleeping over this I realized that this approach with keeping claims reserved has a drawback.

Suppose there are two separate claims and two pods both referencing them. Both pods get scheduled concurrently. One pod manages to allocate and reserve one claim; the other pod does the same for the other claim. Now both pods are stuck because each only has one of the two claims it needs.

This can happen both for claims with delayed allocation and with immediate allocation.

To solve this, we would need additional logic somewhere that looks at multiple different pods to detect such a deadlock and then resolve it. I don't know where to put that.

Perhaps the original proposal with giving up reservations and deallocation triggered explicitly is better after all? It relies on chance and repeated attempts to get one of the two pods scheduled, i.e. there's no coordination either, but it's not needed because eventually one pod should succeed.

Contributor Author

I've added this to the KEP as explanation why Unreserve must remove the pod.

Contributor Author

It's not just the risk of deadlocks. Consider claims X and Y, pod A referencing X and Y, pod B referencing X, and pod C referencing Y. If A reserves X and C reserves Y, then only pod C can run. A waits for C to finish and release Y, and B waits for A.

If A releases X right after the failed scheduling attempt, then both B and C can run.

pohly and others added 2 commits September 30, 2022 19:39
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Member

@alculquicondor alculquicondor left a comment

I think we are missing a step where the pod says "I want to reserve this resource for me" (probably somewhere in a spec). Then the driver responds by setting .status.reservedFor (instead of the scheduler doing it).

If there is such a mechanism, I think the race condition you describe wouldn't exist: the driver wouldn't deallocate and reserve for a pod at the same time.

because it cannot be scheduled after all.

This is necessary to prevent a deadlock: suppose there are two stand-alone
claims that only can be used by one pod at a time and two pods which both
Member

what if this resource can be shared by two pods (but not three)? What is validating that only two are reserving it?

Contributor Author

We no longer have a usage count in the API (it was there in an earlier draft), now a claim is either shared (= unlimited number of pods) or single-use (= one pod). But it doesn't matter, in all of these cases the apiserver checks this when validating a ResourceClaimStatus update.

Contributor Author

@pohly pohly Oct 3, 2022

Perhaps it's worth calling out a key design aspect of this KEP: all relevant information that is needed to ensure that a claim doesn't enter some invalid state is stored in the claim object itself. That is why the apiserver can validate it and why state updates are atomic.

Additional objects like PodScheduling can trigger operations like allocating a claim, but even then it is the claim object that records "allocation in progress" (through the driver's finalizer) and not that other object or some internal state of the driver.

"deallocation pending" is another such state that needs to be visible in the claim object. The obvious one is "claim has DeletionTimestamp", but for claims that need to go from "allocated" to "not allocated" instead of "removed" we need something else.

We cannot just reduce it to one case (= delete claim) because the scheduler does not control the lifecycle of the claim.
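
To make those states concrete, here is an illustrative subset of the claim status fields discussed in this thread; the placeholder types at the end only keep the sketch self-contained, and the KEP has the real definitions.

```go
// Illustrative subset of ResourceClaimStatus as discussed in this thread.
type ResourceClaimStatus struct {
	// DriverName of the resource driver handling this claim.
	DriverName string

	// Allocation is set by the driver once the resource is allocated.
	// "Allocation in progress" is not a field; it is implied by the
	// driver's finalizer on the claim object.
	Allocation *AllocationResult

	// ReservedFor lists the entities (typically Pods) that may use the
	// claim. The apiserver validates updates, e.g. that a claim which is
	// not shareable never has more than one entry. The list has a
	// constant maximum size.
	ReservedFor []ResourceClaimConsumerReference

	// DeallocationRequested tells the driver to deallocate the claim.
	// It describes a state of the claim rather than an imperative,
	// hence the rename from "Deallocate".
	DeallocationRequested bool
}

// Placeholders so this sketch is self-contained; see the KEP (and the
// commit message further down) for the real definitions.
type AllocationResult struct{}
type ResourceClaimConsumerReference struct{}
```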

@pohly
Contributor Author

pohly commented Oct 3, 2022

I think we are missing a step where the pod says "I want to reserve this resource for me" (probably somewhere in a spec). Then the driver responds by setting .status.reservedFor (instead of the scheduler doing it).

The driver only gets to see that a pod is interested in a claim when doing delayed allocation. In that case it sees the PodScheduling object. We already decided that it then can and should set the reservedFor together with doing the allocation.

But for an already allocated claim the driver is not involved anymore. Serializing all operations on a claim through the driver controller seems more complicated (we'll need additional APIs for it) and risky from a performance point of view to me.

@pohly
Contributor Author

pohly commented Oct 3, 2022

I'd also like to point out that conceptually, other entities can also be users of a claim. The main intention is for pods, but changing the design so that it only works for pods (for example, by relying on PodScheduling as the API for getting a pod added to reservedFor) might turn out to be a missed opportunity later on.

The implementation already moved to a separate type. OwnerReference had
several fields that were not needed. The maximum size needs to be a constant
because various consumers of the API will need to check it.
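
A sketch of what the slimmed-down reference type and its size constant might look like, based on this commit message; the names and the value 32 are assumptions, not the merged API:

```go
// ResourceClaimReservedForMaxSize caps ResourceClaimStatus.ReservedFor.
// It has to be a constant because the scheduler, the drivers, and apiserver
// validation all need to check it. The value here is a placeholder.
const ResourceClaimReservedForMaxSize = 32

// ResourceClaimConsumerReference identifies one user of a claim. Compared
// to metav1.OwnerReference it drops fields that are not needed here
// (APIVersion, Controller, BlockOwnerDeletion).
type ResourceClaimConsumerReference struct {
	// APIGroup of the consumer; empty for the core API group (e.g. Pods).
	APIGroup string
	// Resource is the lowercase plural resource of the consumer, e.g. "pods".
	Resource string
	// Name of the consumer.
	Name string
	// UID of the consumer, to guard against reuse of the same name by a
	// different object.
	UID string
}
```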
@pohly pohly force-pushed the dynamic-resource-allocation-upstream branch from 6a0937a to 2bfcb9e Compare October 4, 2022 09:23
It makes sense to start with scalability testing early, so for beta we should
already have a test that seems realistic. For GA we can then extend that based
on user feedback.
Member

@alculquicondor alculquicondor left a comment

I'm not fully convinced of the API fields used for scheduling. In particular, I'm not keen on the idea of kube-scheduler modifying the ResourceClaim status, competing with the driver. I believe that kube-scheduler should only be writing to the PodScheduling object.

However, since this KEP is relying on new alpha APIs, rather than modifying the stable PodSpec, we can still backtrack in a follow up release.

So, lgtm from sig-scheduling for alpha.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Needs Approver in SIG Node PR Triage Oct 5, 2022
@derekwaynecarr
Member

@thockin @dchen1107 do you want to take a pass at this too? At a high level it's fine. The original draft asserts changes may be needed for resource quota, but resource quota can do object count quota on generic CRDs, so I don't see that we need a specific `resourceclaims` token for use in quota.

Member

@thockin thockin left a comment

I took another read through this and flagged a few questions. Mostly, it remains the most anxiety-inducing KEP of all time, and I can't help but throw myself at the idea that maybe there's some fundamental invariant we can assert which will make this dramatically simpler. I have not found it yet.

@@ -1030,19 +1041,17 @@ else changes in the system, like for example deleting objects.
* **scheduler** filters nodes
Member

"...based on built-in resources" ?

Contributor Author

Yes, and other plugins (like the volume binder for storage capacity). I'll add that.

gets set here and the scheduling attempt gets stopped for now. It will be
retried when the ResourceClaim status changes.
If using delayed allocation and one or more claims have not been allocated yet,
the plugin now needs to decide whether it wants to trigger allocation by
Member

I thought only scheduler sets SelectedNode, but this says Plugin?

Contributor Author

The scheduler plugin. Elsewhere I have only said "the scheduler", but that meant the same thing because the core scheduler doesn't know anything about claims - everything related to those happens inside the scheduler plugin.

I'll make this more consistent and stick with "the scheduler".

Contributor Author

After looking at some other sections I found that "the plugin does xyz" was already used, so I changed my mind: in this chapter, let's use "the scheduler" for the core scheduler logic and "the claim plugin" for anything related to claims.

// needed.
// PodScheduling objects get created by a scheduler when it handles
// a pod which uses one or more unallocated ResourceClaims with delayed
// allocation.
type PodScheduling struct {
Member

Please name this more specifically, e.g. from the name I expected it to be an explanation of the scheduler's decision or something.

Contributor Author

Any suggestions? I know that the current usage is just for claims, but calling it PodClaimScheduling would become inappropriate as soon as we find other usages for this object, which is a possibility.

On the other hand this is only alpha. We can name it PodClaimScheduling now and still rename it later.

@thockin : you suggested that name. What do you think?

Member

My issue with the name and the current api type is that it looks hard to expand this into other plausible use cases for a pod scheduling object.

Contributor Author

Can we keep this as-is for now and make a final decision during the API review of the implementation?

I don't want to miss the KEP deadline because of a naming discussion.

Member

Yes, APIs are never final until then.

Is there an explanation in this KEP why the scheduler has to do this reach-out to drivers? I am reading from the top and haven't figured it out yet.

Contributor Author

@pohly pohly Oct 6, 2022

Because the scheduler and the generic claim plugin inside it have no idea what the claim parameters mean and where a claim might be allocated. That entire logic is provided by the drivers.

The "Coordinating resource allocation through the scheduler" section is meant to explain this, but I probably lack this particular detail where it says "a node is selected tentatively by the scheduler".

I've updated this section to:

For delayed allocation, a node is selected tentatively by the scheduler
in an iterative process where the scheduler suggests some potential nodes
that fit the other resource requirements of a Pod and resource drivers
respond with information about whether they can allocate claims for those
nodes. This exchange of information happens through the `PodScheduling`
object for a Pod. The scheduler has to involve the drivers because it
doesn't know what claim parameters mean and where suitable resources are
currently available.

Once the scheduler is confident that it has enough information to select
a node that will probably work for all claims, it asks the driver(s) to
allocate their resources for that node. If that
succeeds, the Pod can get scheduled. If it fails, the scheduler must
determine whether some other node fits the requirements and if so,
request allocation again. If no node fits because some resources were
already allocated for a node and are only usable there, then those
resources must be released and then get allocated elsewhere.
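
As an illustrative sketch of the node-selection step described in this updated text (the function and its inputs are hypothetical; the real scheduler plugin logic is more involved):

```go
// pickNode returns a node from potentialNodes that no driver has marked as
// unsuitable for any of the pod's claims, or "" if there is none yet.
// unsuitableByClaim maps claim name -> the unsuitableNodes reported in the
// PodScheduling status.
func pickNode(potentialNodes []string, unsuitableByClaim map[string][]string) string {
	unsuitable := make(map[string]bool)
	for _, nodes := range unsuitableByClaim {
		for _, node := range nodes {
			unsuitable[node] = true
		}
	}
	for _, node := range potentialNodes {
		if !unsuitable[node] {
			return node
		}
	}
	// No candidate left: an already allocated claim may be pinning the pod
	// to nodes that do not work for the other claims; in that case the
	// scheduler has to request deallocation (DeallocationRequested).
	return ""
}
```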

"The scheduler" is now the core scheduler logic and "the claim plugin" is the
new scheduler plugin which handles ResourceClaims.
The "Coordinating resource allocation through the scheduler" section needs to
explain some of the basic design decisions better.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022
@pohly
Contributor Author

pohly commented Oct 6, 2022

@lavalamp, @thockin: I have pushed an update which may address some of the points you raised. Please take another look.

@thockin
Member

thockin commented Oct 6, 2022

I am LGTMing based on @alculquicondor's previous LGTM

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022
@pohly
Contributor Author

pohly commented Oct 6, 2022

/hold cancel

Because Tim had a look.

@derekwaynecarr: I think you can approve now. My understanding is that @lavalamp is okay with not settling all API-related questions before the KEP merge.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2022
@lavalamp
Member

lavalamp commented Oct 6, 2022

My understanding is that lavalamp is ok with not settling all API-related questions before the KEP merge.

Yeah, that's what API reviews are for.

(As for the KEP, I don't understand it well enough to either approve or block it)

@thockin
Member

thockin commented Oct 6, 2022

Also, this PR is a refinement on an already merged KEP. Not that we can't go back over it with the finest of fine-toothed combs, but I don't think that should block this refinement.

@thockin
Member

thockin commented Oct 6, 2022

oh, it needs a sig node owner.

@dchen1107
Member

/lgtm
/approve

Even though there are still some open questions, we can continue those discussions during the alpha implementation. I approved it to move forward. Thanks.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022
@k8s-ci-robot k8s-ci-robot merged commit af902dc into kubernetes:master Oct 7, 2022
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 7, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 7, 2022
ahmedtd pushed a commit to ahmedtd/enhancements that referenced this pull request Feb 2, 2023