scheduler: move all preCheck to QueueingHint #110175

Open
Tracked by #122597
sanposhiho opened this issue May 23, 2022 · 23 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sanposhiho
Member

sanposhiho commented May 23, 2022

What happened?

In our custom plugin, we want to move Pods that were rejected by the plugin to activeQ/backoffQ when a new Pod is created.
So, we define EventsToRegister like this:

func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{{
		Resource:   framework.Node,
		ActionType: framework.All,
	}, {
		Resource:   framework.Pod,
		ActionType: framework.All,
	}}
}

But even when a new Pod is created, the Pod that was rejected by our custom plugin isn't moved to backoffQ/activeQ.

What did you expect to happen?

When a new Pod is created, the Pod rejected by our custom plugin should be moved to backoffQ/activeQ.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a custom plugin that registers Pod events in EventsToRegister (a minimal sketch of such a plugin follows this list).
  2. Create a new Pod that the custom plugin will reject.
  3. Create another new Pod that makes the Pod created in (2) schedulable.
  4. Observe that the Pod created in (2) isn't scheduled immediately (it should've been scheduled right after (3), because (3) should move the Pod created in (2) to activeQ/backoffQ).
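For illustration, a minimal sketch of such a custom plugin (the plugin name, the rejection rule, and the siblingsReady helper are hypothetical; the PreFilter signature matches the v1.23 framework and may differ in newer releases):

package minpodcount

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Plugin is a hypothetical custom plugin that rejects Pods in PreFilter until
// some external condition (e.g. enough sibling Pods existing) is met.
type Plugin struct{}

var _ framework.PreFilterPlugin = &Plugin{}

func (p *Plugin) Name() string { return "MinPodCount" }

// PreFilter marks the Pod unschedulable until siblingsReady reports true, so the
// Pod lands in unschedulableQ and has to be re-queued by a registered event.
func (p *Plugin) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {
	if !siblingsReady(pod) {
		return framework.NewStatus(framework.Unschedulable, "waiting for sibling Pods")
	}
	return nil
}

func (p *Plugin) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// EventsToRegister is the same registration as in the issue description.
func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{
		{Resource: framework.Node, ActionType: framework.All},
		{Resource: framework.Pod, ActionType: framework.All},
	}
}

// siblingsReady is a placeholder for the plugin-specific condition.
func siblingsReady(pod *v1.Pod) bool { return false }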

Anything else we need to know?

If I understand correctly and this isn't expected behavior, this bug has a non-trivial impact, especially in v1.24, because we changed PodMaxUnschedulableQDuration to 5 min.
#108761

Kubernetes version

v1.23.4, but I think the latest master has the same issue.

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2022
@sanposhiho
Member Author

/assign
/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2022
@sanposhiho
Member Author

Maybe it only happens for newly created, unscheduled Pods.
We already move Pods correctly for events related to assigned Pods.
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/internal/queue/scheduling_queue.go#L597

@sanposhiho
Member Author

sanposhiho commented May 23, 2022

And Node delete events have the same issue...? I guess we need to run sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue when a Node is deleted (see the sketch after the current implementation below).

func (sched *Scheduler) deleteNodeFromCache(obj interface{}) {
	var node *v1.Node
	switch t := obj.(type) {
	case *v1.Node:
		node = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		node, ok = t.Obj.(*v1.Node)
		if !ok {
			klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t.Obj)
			return
		}
	default:
		klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t)
		return
	}
	klog.V(3).InfoS("Delete event for node", "node", klog.KObj(node))
	if err := sched.Cache.RemoveNode(node); err != nil {
		klog.ErrorS(err, "Scheduler cache RemoveNode failed")
	}
}
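A sketch of the suggested change: mirror the function above but also notify the scheduling queue on Node deletion. (This is only an illustration; the NodeDeleted ClusterEvent constructed inline here is an assumption, not an existing constant.)

func (sched *Scheduler) deleteNodeFromCache(obj interface{}) {
	var node *v1.Node
	switch t := obj.(type) {
	case *v1.Node:
		node = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		node, ok = t.Obj.(*v1.Node)
		if !ok {
			klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t.Obj)
			return
		}
	default:
		klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t)
		return
	}
	klog.V(3).InfoS("Delete event for node", "node", klog.KObj(node))
	if err := sched.Cache.RemoveNode(node); err != nil {
		klog.ErrorS(err, "Scheduler cache RemoveNode failed")
	}
	// Added: let unschedulable Pods that registered Node delete events be retried.
	// The event value here is assumed for illustration.
	nodeDeleted := framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Delete, Label: "NodeDeleted"}
	sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(nodeDeleted, nil)
}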

Among in-tree plugins, only Pod Topology Spread uses the Node delete event, and I guess this bug affects its performance. (Even when Nodes are deleted, the Pods that were rejected by Pod Topology Spread aren't moved to activeQ/backoffQ by those delete events.)

{Resource: framework.Node, ActionType: framework.Add | framework.Delete | framework.UpdateNodeLabel},

@sanposhiho
Member Author

Hm, it may be a problem that events related to Node and Pod don't always trigger a move when they occur. (Even if it's intended behavior, at least we need to document this in the comment of EventsToRegister.)

/retitle scheduler: events in EventsToRegister related to Pod and Node don't always work

@k8s-ci-robot k8s-ci-robot changed the title scheduler: Any events in EventsToRegister related to Pod doesn't work scheduler: events in EventsToRegister related to Pod and Node don't always work May 24, 2022
@sanposhiho
Member Author

sanposhiho commented May 29, 2022

To summarize: the problem is that Pod and Node events don't always move Pods from unschedulableQ to activeQ/backoffQ, and custom-plugin developers may not expect that because this behavior isn't even documented.

I think we should move all unschedulable Pods on all Pod and Node events. If we're concerned about performance degradation, we could extend EventsToRegister or EnqueueExtensions, for example, to accept a function like PreEnqueueCheck.

This problem has a non-trivial impact, especially in v1.24+, because we changed PodMaxUnschedulableQDuration to 5 min.
#108761

Let's see the details... ↓

Assigned Pod's add/update events

We move the unscheduled Pods (rejected by a plugin that registers Pod add/update events in EventsToRegister) in AssignedPodAdded or AssignedPodUpdated:

// AssignedPodAdded is called when a bound pod is added. Creation of this pod
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
	p.lock.Lock()
	p.movePodsToActiveOrBackoffQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodAdd)
	p.lock.Unlock()
}

// AssignedPodUpdated is called when a bound pod is updated. Change of labels
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodUpdated(pod *v1.Pod) {
	p.lock.Lock()
	p.movePodsToActiveOrBackoffQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodUpdate)
	p.lock.Unlock()
}

But the target Pods of this move are only the unschedulable Pods that have an affinity term matching "pod".

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}

So users cannot rely on all assignedPod events to move Pods from unschedulableQ to activeQ/backoffQ?
For example, Pod Topology Spread registers the Pod All event in EventsToRegister, and the creation of a new assigned Pod may make some Pods that were rejected by Pod Topology Spread schedulable, regardless of whether those unschedulable Pods have any affinity term matching the newly created assigned Pod.
Of course, this is not just a problem for Pod Topology Spread; it could also be a problem for users' custom plugins.

I guess we should pass all unscheduled Pods to movePodsToActiveOrBackoffQueue (sketched below), or at least document this behavior.
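For illustration, "pass all unscheduled Pods" would look roughly like the sketch below (against the types shown above; not a proposed patch, and it obviously trades away the existing optimization):

// Sketch: move every Pod currently in unschedulableQ on an assigned-Pod add event,
// instead of only the Pods with a matching required affinity term.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
	p.lock.Lock()
	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		podsToMove = append(podsToMove, pInfo)
	}
	p.movePodsToActiveOrBackoffQueue(podsToMove, AssignedPodAdd)
	p.lock.Unlock()
}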

Assigned Pod's delete events

No issue on this.

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.AssignedPodDelete, nil)

Non-scheduled Pod's add/update/delete events

We don't move the unschedulable Pods on those events at all. That's a problem... Even when a plugin registers Pod add/update/delete events in EventsToRegister and wants its rejected Pods to be moved on non-scheduled Pods' add/update/delete events, the scheduling queue doesn't move the unschedulable Pods on those events.

(In our company, we have a custom plugin for gang scheduling, and that plugin wants the scheduler queue to move the Pods it rejected from unschedulableQ to activeQ/backoffQ when a new non-scheduled Pod is created.)

Node's add/update events

We move only unscheduled Pods that pass preCheckForNode.

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.NodeAdd, preCheckForNode(nodeInfo))

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(*event, preCheckForNode(nodeInfo))

return func(pod *v1.Pod) bool {
	admissionResults := AdmissionCheck(pod, nodeInfo, false)
	if len(admissionResults) != 0 {
		return false
	}
	_, isUntolerated := corev1helpers.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
		return t.Effect == v1.TaintEffectNoSchedule
	})
	return !isUntolerated
}

preCheckForNode checks whether the Node can run an unscheduled Pod from the point of view of some in-tree filter plugins (noderesources, nodeports, nodeaffinity, nodename).

The problem that arises here is similar to the one for assigned Pods' add/update events.
For example, Pod Topology Spread registers the Node Add event in EventsToRegister, and the creation of a new Node may make some Pods that were rejected by Pod Topology Spread schedulable, regardless of the result of preCheckForNode against that Node. (In #107009, we changed the skew calculation of Pod Topology Spread to exclude Nodes that don't match the node affinity. But after NodeInclusionPolicy is implemented, the calculation can be configured to include those Nodes.)

Node's delete events

We don't move the unschedulable Pods on Node delete events at all... That's a problem. Even when a plugin registers the Node delete event in EventsToRegister, the scheduling queue doesn't move the unschedulable Pods on any Node delete event.

@sanposhiho
Member Author

/cc @Huang-Wei @alculquicondor @ahg-g

Could you please take a look?

@alculquicondor
Member

These are all intended optimizations. These basic checks prevent the scheduler from retrying unschedulable pods over and over.

So far we haven't had the need to make these checks configurable, as they check well-established scheduling features.

+1 on documenting the behavior, but I'm not sure about changing it or making it configurable (that adds complexity that might not be justified).

What is it that you are trying to do?

@sanposhiho
Member Author

sanposhiho commented May 30, 2022

+1 on documenting the behavior, but I'm not sure about changing it or making it configurable (that adds complexity that might not be justified).
What is it that you are trying to do?

We are trying to remove flushing from the queue in #87850, right?

Without an improvement like the one I suggest, some Pods in unschedulableQ could never move back to activeQ in some cases if we remove flushing.
(Actually, our company's custom plugin needs to wait for the flush because non-assigned Pods' events don't make the scheduling queue move the unschedulable Pods rejected by that plugin.)
So, I think we should simply inform the queue of all events and let it move unschedulable Pods on those events.

This can happen with some custom plugins like our company's, and even with in-tree plugins.
For example, Pod Topology Spread (after NodeInclusionPolicy is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

@alculquicondor
Member

cc @kerthcet, who implemented NodeInclusionPolicy.

For example, Pod Topology Spread (after kubernetes/enhancements#3094 is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

Can you provide a concrete example? The policy just refers to the topology spread calculations. It doesn't mean the node can accept the pod without matching the nodeAffinity.

What is it that you are trying to do?

I'm asking what is your custom plugin trying to do. Depending on whether it makes sense, we need to evaluate if it's worth the complexity.

@kerthcet
Member

kerthcet commented May 31, 2022

So users cannot rely on all assignedPod events to move Pods from unschedulableQ to activeQ/backoffQ?
For example, Pod Topology Spread registers the Pod All event in EventsToRegister, and the creation of a new assigned Pod may make some Pods that were rejected by Pod Topology Spread schedulable

This is actually a problem; 5 minutes would be too long to wait for the background flush. But that doesn't mean we should simply shorten the interval.

This can happen with some custom plugins like our company's, and even with in-tree plugins.
For example, Pod Topology Spread (after kubernetes/enhancements#3094 is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

The skew of a newly created node is always 0, so the Pod is still unfit, I think.

@sanposhiho
Member Author

sanposhiho commented May 31, 2022

@alculquicondor

I'm asking what is your custom plugin trying to do. Depending on whether it makes sense, we need to evaluate if it's worth the complexity.

Our company's custom plugin is like the coscheduling plugin.

And like coscheduling's PreFilter, our plugin rejects Pods in PreFilter when the number of Pods is less than the defined number.
https://github.com/kubernetes-sigs/scheduler-plugins/blob/57499a6332c124314020a368c862b8b97470cc11/pkg/coscheduling/coscheduling.go#L128

Yes, so maybe the coscheduling plugin has the same issue as our plugin. The coscheduling plugin also registers the Pod Add event, and I guess that's for non-scheduled Pods' creation. (I don't know much about the coscheduling plugin though.)
https://github.com/kubernetes-sigs/scheduler-plugins/blob/57499a6332c124314020a368c862b8b97470cc11/pkg/coscheduling/coscheduling.go#L98

@kerthcet

The skew of a newly created node is always 0, so the Pod is still unfit, I think.

Oh, that's true 😓 Please forget about that example.
But how about a scenario like this:

  1. A user uses NodeTaintsPolicy: Respect and WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user adds an untolerated taint to NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue. (The skew arithmetic is sketched after this list.)
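The skew arithmetic for that example, as a small self-contained sketch (just illustrative arithmetic with the numbers from the steps above, not scheduler code; it assumes NodeC itself cannot take the Pod, e.g. because it is full):

package main

import "fmt"

func main() {
	maxSkew := 1

	// Step (2): NodeA=10, NodeB=10, NodeC=8. Placing the Pod on NodeA or NodeB
	// would give skew = (10+1) - 8 = 3, which violates maxSkew, so the Pod is
	// rejected (assuming NodeC can't host it for other reasons).
	fmt.Println("before the taint:", 10+1-8, "> maxSkew", maxSkew)

	// Steps (3)-(4): NodeC gets an untolerated taint and, with
	// NodeTaintsPolicy: Respect, drops out of the skew domains. The minimum
	// matching count becomes 10, so placing the Pod on NodeA or NodeB gives
	// skew = (10+1) - 10 = 1 <= maxSkew; the Pod is now schedulable, but nothing
	// re-queues it from unschedulableQ without the periodic flush.
	fmt.Println("after the taint:", 10+1-10, "<= maxSkew", maxSkew)
}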

@kerthcet
Member

Maybe we can find some pointcuts via the UnschedulablePlugins field in QueuedPodInfo, e.g. if a Pod was rejected by PodTopologySpread, we'd call movePodsToActiveOrBackoffQueue when assigned Pods' add/update events arrive.

For custom plugins, we may have to do some refactoring to register the filter func together with the event.

@alculquicondor
Member

So now NodeC is excluded from calculations. Hence the pod fits in either NodeA or NodeB, which were unchanged.

Tricky... I'll have to think about it. We really need those quick checks to pass for performance in most scenarios.

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 31, 2022
@Huang-Wei
Member

Huang-Wei commented May 31, 2022

Thanks @sanposhiho for bringing this up. Some comments below:

Re: Assigned Pod's add/update events

You're right. Pods failed by PodTopologySpread may not get re-queued immediately.
However, an assigned Pod's add/update event would only impact unschedulable Pods with cross-node scheduling directives. As of today, it's InterPodAffinity and PodTopologySpread.

So instead of "pass all unscheduled Pods to movePodsToActiveOrBackoffQueue", I'd like to come up with a solution like:

p.movePodsToActiveOrBackoffQueue(
    p.getUnschedulablePodsWithMatchingAffinityTerm(pod),
    p.getUnschedulablePodsWithPodTopologySpreadConstraints(pod), /* Not Implemented Yet */
    AssignedPodAdd,
) 
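(For illustration, that not-yet-implemented helper might look like the sketch below; matching purely on the constraints' label selectors is an assumption, not a design decision.)

// Sketch only: return unschedulable Pods that declare topology spread constraints
// whose label selector matches the incoming assigned Pod.
func (p *PriorityQueue) getUnschedulablePodsWithPodTopologySpreadConstraints(pod *v1.Pod) []*framework.QueuedPodInfo {
	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, c := range pInfo.Pod.Spec.TopologySpreadConstraints {
			selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
			if err != nil {
				continue
			}
			if selector.Matches(labels.Set(pod.Labels)) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}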

Re: Non-scheduled Pod's add/update/delete events

It's pretty rare to move pods depending on Unschedulable Pods' events.

In your particular case, maybe a proactive Activate call would behave more efficiently.
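(For reference, a sketch of what that could look like from inside a plugin, using the PodsToActivate entry in CycleState the way the coscheduling plugin does; which sibling Pods to activate is left up to the caller:)

// Sketch: proactively ask the scheduler to activate related unschedulable Pods by
// writing them into the PodsToActivate entry of CycleState (e.g. from Permit or
// PostBind of a gang-scheduling-style plugin).
func activateSiblings(state *framework.CycleState, siblings []*v1.Pod) {
	c, err := state.Read(framework.PodsToActivateKey)
	if err != nil {
		return
	}
	if s, ok := c.(*framework.PodsToActivate); ok {
		s.Lock()
		for _, p := range siblings {
			s.Map[p.Namespace+"/"+p.Name] = p
		}
		s.Unlock()
	}
}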

Re: Node's add/update events

It doesn't sound correlated, does it? The preCheckForNode check just pre-verifies basic checks which are orthogonal to PodTopologySpread. In other words, a Pod that failed preCheckForNode can't eventually become schedulable.

Updated: The example in #110175 (comment) is a valid point...

Re: Node's delete events

Can you share some use-cases for moving Pods upon a Node's deletion? Previously I don't think we had any, until the introduction of the minDomains feature in the last release (as in that case deleting a node would impact the number of domains).

@sanposhiho
Member Author

sanposhiho commented Jun 1, 2022

So instead of "pass all unscheduled Pods to movePodsToActiveOrBackoffQueue", I'd like to come up with a solution like:

If we do so, don't we also need to provide a way for out-of-tree plugins to pass functions that decide which Pods should be passed to movePodsToActiveOrBackoffQueue (like getUnschedulablePodsWithPodTopologySpreadConstraints)?

In your particular case, maybe a proactive kubernetes-sigs/scheduler-plugins#299 call would behave more efficiently.

Ah~, that's a good idea :)

Can you share some use-cases for moving Pods upon a Node's deletion?

Maybe like this (similar to #110175 (comment))

  1. A user uses WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user deletes NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue.

I think we should move all unschedulable Pods on all Pod and Node events. If we're concerned about performance degradation, we could extend EventsToRegister or EnqueueExtensions, for example, to accept a function like PreEnqueueCheck.

So ignoring some events for performance on the scheduler side (without providing a way to change that) could be inconvenient for some plugins. What I had in mind in my previous comment (#110175 (comment)) is to define "which kind of resource's events should be ignored" on the plugin side.
(Similar to @kerthcet's idea? #110175 (comment))

We can add pre-filter functions to ClusterEvent like this:

// obj: the created, updated, or deleted object
// rejectedPod: one of the unschedulable Pods in unschedulableQ
// return value: whether this unschedulable Pod should be moved to activeQ/backoffQ or not.
type PreFilterFn func(obj interface{}, rejectedPod *v1.Pod) bool

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	Label      string
	// PreFilterFn is the function to check which Pods should be moved from unschedulableQ to activeQ/backoffQ when this event happens.
	PreFilterFn PreFilterFn
}

(For reference, the current ClusterEvent is:)

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	Label      string
}

Then each plugin can set its own pre-filter functions.
Also, we can define some default behaviors for PreFilterFn and use them when ClusterEvent.PreFilterFn is empty.
So, for example, if most plugins are expected to prefer ignoring events for Nodes that don't match their nodeAffinity/nodeSelector, we can define that as the default behavior. (A usage sketch follows below.)
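For example, with this proposed (not existing) API, a gang-scheduling-style plugin could register something like the sketch below; the pod-group label is hypothetical:

// Usage sketch of the proposed PreFilterFn field: only re-queue a rejected Pod
// when the newly created Pod belongs to the same (hypothetical) group.
func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{{
		Resource:   framework.Pod,
		ActionType: framework.Add,
		PreFilterFn: func(obj interface{}, rejectedPod *v1.Pod) bool {
			newPod, ok := obj.(*v1.Pod)
			if !ok {
				return false
			}
			return newPod.Labels["pod-group"] == rejectedPod.Labels["pod-group"]
		},
	}}
}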

What do you think about this idea?

@Huang-Wei
Member

Huang-Wei commented Jun 2, 2022

If we do so, don't we also need to provide a way for out-of-tree plugins to pass functions that decide which Pods should be passed to movePodsToActiveOrBackoffQueue (like getUnschedulablePodsWithPodTopologySpreadConstraints)?

Nope, that is something the scheduler core needs to implement.

Maybe like this (similar to #110175 (comment))

  1. A user uses WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user deletes NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue.

Thanks. It makes sense to me.

What I had in mind in my previous comment (#110175 (comment)) is to define "which kind of resource's events should be ignored" on the plugin side.

A customizable PreCheck approach can certainly satisfy more dynamic cases, but it's inappropriate to associate it with ClusterEvent because (1) ClusterEvent is basically an immutable constant, and (2) we use it to map to plugins. If we go this way, we'd need to wrap ClusterEvent in another data structure, which is kind of complicated.


To summarize, the following cases don't properly re-queue relevant pods:

  1. Assigned Pod's add/update events
  2. Non-scheduled Pod's add/update/delete events
  3. Node's add/update/delete events

For 1 & 2, I've commented above with solutions. (1 needs core changes similar to podAffinity; 2 may leverage the Activate hook.)

For 3, with @sanposhiho's examples, I'm convinced this is an area we need to improve. One simple and straightforward solution is to plumb a hook to let a plugin claim "alright, upon Node events, move the pod failed by me anyways". Then, once this knob is registered, we move Pods failed by that plugin back unconditionally instead of proceeding with preCheckForNode(). This hook should be available to in-tree and out-of-tree plugins. In terms of in-tree plugins, it's just PodTopologySpread (we discussed this in today's SIG meeting; even InterPodAffinity doesn't fall into this case). A rough sketch of such a hook follows.
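(Shape-wise, such a knob could be as small as an optional interface the queue checks; purely a sketch of the idea, nothing like this exists today:)

// Sketch of a hypothetical opt-in hook: a plugin implementing this interface asks
// the queue to skip preCheckForNode for the Pods it rejected and move them back
// unconditionally on its registered Node events.
type MoveOnAnyNodeEventPlugin interface {
	framework.Plugin
	MoveOnAnyNodeEvent() bool
}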

So, as a first step, let's ensure we're on the same page and agree on the ^^ summary. Next, we can better evaluate trade-offs and come up with an optimal solution.

@sanposhiho
Member Author

sanposhiho commented Jun 5, 2022

Thanks for discussing this topic in the sig meeting. 🙏
But I still believe that all logic related to a plugin should be controllable from the plugin side, and we should not create something that can be achieved by in-tree plugins but not by out-of-tree plugins.

So... the benefit of the "customizable PreCheck approach" (#110175 (comment)) isn't only that it solves this issue. We can use the feature for other optimizations, like improving the scheduler's overall performance by moving only Pods that are more likely to be schedulable, by defining more detailed events in EventsToRegister with a customizable PreCheck.
Many plugins have a particular scenario that makes their rejected Pods schedulable, and the current EventsToRegister isn't expressive enough for this. (So, today many plugins need to say "register this hoge event on foo resource; this event may make rejected Pods schedulable" in comments on EventsToRegister.)

If we go this way, we'd need to wrap ClusterEvent in another data structure, which is kind of complicated.

Yes, I watched the SIG meeting recording as well, and the opinion that the "customizable PreCheck approach" adds implementation complexity surely makes sense. But isn't that added complexity acceptable when the above is taken into account?


EDIT: It's a bad habit of mine to always write long sentences even though I'm not very good at English. 😓
Summarizing the above:

  • I think it's not a good idea to create something that can be achieved by in-tree plugins but not by out-of-tree plugins.
  • If we need a solid advantage to justify extending things like this, it's the ability to optimize the queue per plugin, as described above.

@sanposhiho
Member Author

I didn't jump into the thread since it's actively under discussion.
But the issue in https://kubernetes.slack.com/archives/C09TP78DV/p1659383737933799 is a similar one.
(That time, we can solve the issue by changing the scheduling queue logic since it's a k/k feature, though.)

@Huang-Wei
Member

That time, we can solve the issue by changing the scheduling queue logic since it's a k/k feature, though.

@sanposhiho I'm not sure I fully understand your intention. The slack thread was discussing the assigned-Pod update case, so it falls into case 1 in #110175 (comment), which we already agreed to improve upstream.

@kerthcet
Member

Hi @sanposhiho, are you working on this? I think this is quite important, and if you're not working on it, I can follow up.

@sanposhiho
Member Author

sanposhiho commented Oct 17, 2022

Sorry for leaving this for a while. I wanted to wait for KEP-3521: Pod Scheduling Readiness (and then forgot to come back here 😓).
That proposal is quite similar to what I want to see as the conclusion of this issue: every plugin can judge whether each Pod in unschedQ can be moved to schedQ/backoffQ.

@Huang-Wei Do you think it makes sense to add "by which event the Pod will be moved back to schedQ/backoffQ" to PreEnqueue's arguments?

type EnqueuePlugin interface {
    Plugin
    // event: the event by which the Pod will be moved back to schedQ/backoffQ.
    // If the Pod will be moved back to schedQ/backoffQ by flushing, it will be nil. (Flushing is planned to be removed in https://github.com/kubernetes/kubernetes/issues/87850 though.)
    // involvedObj: the object involved in that event.
    // For example, if the given event is "Node deleted", the involvedObj will be that deleted Node.
    PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *Status
}

I know event and involvedObj aren't needed in the DefaultEnqueue plugin for the schedulingGates API, but that's the last missing puzzle piece 🧩: if PreEnqueue has event, an enqueue plugin can judge whether the event makes it possible to schedule the Pod.

For example, I believe we can move ↓ this logic to Pod Affinity plugin by implementing PreEnqueue in Pod Affinity plugin.

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}

The Pod Affinity plugin would remember the Pods it rejected in Filter. Then in PreEnqueue, if the given Pod is one rejected by Pod Affinity, the event is an assigned-Pod Create/Update, and the involvedObj is a Pod that has affinity with the rejected Pod, it moves the Pod back to activeQ/backoffQ; otherwise it rejects the move. (A rough sketch follows.)
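(Roughly, with the proposed signature above, that could look like the sketch below; everything here is hypothetical, the requiredAffinityTerms helper is assumed, and namespace-label handling is omitted for brevity.)

// Hypothetical PreEnqueue for the InterPodAffinity plugin, following the proposed
// interface above: only let a Pod it rejected move back when the event's involved
// Pod matches one of the rejected Pod's required affinity terms.
func (pl *InterPodAffinity) PreEnqueue(ctx context.Context, state *framework.CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *framework.Status {
	if event.Resource != framework.Pod {
		return framework.NewStatus(framework.Unschedulable, "not a Pod event")
	}
	assignedPod, ok := involvedObj.(*v1.Pod)
	if !ok || assignedPod.Spec.NodeName == "" {
		return framework.NewStatus(framework.Unschedulable, "not an assigned Pod")
	}
	for _, term := range requiredAffinityTerms(p) { // assumed helper
		if term.Matches(assignedPod, nil) {
			return nil // allow the move to activeQ/backoffQ
		}
	}
	return framework.NewStatus(framework.Unschedulable, "this event cannot make the Pod schedulable")
}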

We can resolve (3) by implementing PreEnqueue as well (by exactly the way you said ↓):

"alright, upon Node events, move the pod failed by me anyways"
#110175 (comment)

My original concern in this issue is that having logic for specific plugins in the scheduler core may limit out-of-tree plugins' extensibility. Ideally the scheduler itself should stay pure and shouldn't do anything special for any particular plugin implementation. If we can move all such logic into each plugin, that'd be perfect.

@sanposhiho
Member Author

^ is more like a new feature than just a solution to this problem. I created another issue and described my proposal in detail. Let's have the discussion there.
#114297

@sanposhiho
Member Author

This can be resolved by moving all of preCheck into each plugin's QueueingHint (a rough sketch of that mechanism follows below). Let me retitle this issue so that the current status is easier to understand.

/retitle scheduler: move all preCheck to QueueingHint
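(For context, the QueueingHint mechanism this retitle refers to lets each plugin return a per-event hint function from EventsToRegister. In recent releases it looks roughly like the sketch below, though the exact signatures differ between versions; SomePlugin is a placeholder name:)

// Rough sketch of a QueueingHint registration; details vary by Kubernetes version.
func (pl *SomePlugin) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{{
		Event:          framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Delete},
		QueueingHintFn: pl.isSchedulableAfterNodeChange,
	}}
}

// isSchedulableAfterNodeChange plays the role that preCheck used to play, but per
// plugin: return framework.Queue only when this event can make the Pod schedulable,
// otherwise framework.QueueSkip.
func (pl *SomePlugin) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	return framework.Queue, nil
}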
