scheduler: move all preCheck to QueueingHint #110175

Open
Tracked by #122597
sanposhiho opened this issue May 23, 2022 · 23 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sanposhiho
Member

sanposhiho commented May 23, 2022

What happened?

In our custom plugin, we want to move Pods that were rejected by the plugin to activeQ/backoffQ when a new Pod is created.
So, we define EventsToRegister like this:

func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{{
		Resource:   framework.Node,
		ActionType: framework.All,
	}, {
		Resource:   framework.Pod,
		ActionType: framework.All,
	}}
}

But even when a new Pod is created, the Pod that was rejected by our custom plugin isn't moved to backoffQ/activeQ.

What did you expect to happen?

When a new Pod is created, the Pod rejected by our custom plugin should be moved to backoffQ/activeQ.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a custom plugin that registers Pod events in EventsToRegister (a minimal sketch of such a plugin follows this list).
  2. Create a new Pod that the custom plugin will reject.
  3. Create another new Pod that makes the Pod created in (2) schedulable.
  4. Observe that the Pod created in (2) isn't scheduled immediately (it should've been scheduled right after (3), because (3) should move the Pod created in (2) to activeQ/backoffQ).
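For illustration, a minimal sketch of such a custom plugin (the plugin name, the rejection rule, and the siblingsReady helper are hypothetical; the PreFilter signature matches the v1.23 framework and may differ in newer releases):

package minpodcount

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Plugin is a hypothetical custom plugin that rejects Pods in PreFilter until
// some external condition (e.g. enough sibling Pods existing) is met.
type Plugin struct{}

var _ framework.PreFilterPlugin = &Plugin{}

func (p *Plugin) Name() string { return "MinPodCount" }

// PreFilter marks the Pod unschedulable until siblingsReady reports true, so the
// Pod lands in unschedulableQ and has to be re-queued by a registered event.
func (p *Plugin) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {
	if !siblingsReady(pod) {
		return framework.NewStatus(framework.Unschedulable, "waiting for sibling Pods")
	}
	return nil
}

func (p *Plugin) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// EventsToRegister is the same registration as in the issue description.
func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{
		{Resource: framework.Node, ActionType: framework.All},
		{Resource: framework.Pod, ActionType: framework.All},
	}
}

// siblingsReady is a placeholder for the plugin-specific condition.
func siblingsReady(pod *v1.Pod) bool { return false }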

Anything else we need to know?

If I understand correctly and this isn't expected behavior, this bug has a non-trivial impact, especially in v1.24, because we changed PodMaxUnschedulableQDuration to 5 min.
#108761

Kubernetes version

v1.23.4, but I think the latest master has the same issue.

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2022
@sanposhiho
Member Author

/assign
/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2022
@sanposhiho
Member Author

Maybe it only happens for newly created, unscheduled Pods.
We already move Pods correctly for events related to assigned Pods.
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/internal/queue/scheduling_queue.go#L597

@sanposhiho
Member Author

sanposhiho commented May 23, 2022

And Node delete events have the same issue...? I guess we need to run sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue when a Node is deleted (see the sketch after the current implementation below).

func (sched *Scheduler) deleteNodeFromCache(obj interface{}) {
	var node *v1.Node
	switch t := obj.(type) {
	case *v1.Node:
		node = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		node, ok = t.Obj.(*v1.Node)
		if !ok {
			klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t.Obj)
			return
		}
	default:
		klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t)
		return
	}
	klog.V(3).InfoS("Delete event for node", "node", klog.KObj(node))
	if err := sched.Cache.RemoveNode(node); err != nil {
		klog.ErrorS(err, "Scheduler cache RemoveNode failed")
	}
}
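A sketch of the suggested change: mirror the function above but also notify the scheduling queue on Node deletion. (This is only an illustration; the NodeDeleted ClusterEvent constructed inline here is an assumption, not an existing constant.)

func (sched *Scheduler) deleteNodeFromCache(obj interface{}) {
	var node *v1.Node
	switch t := obj.(type) {
	case *v1.Node:
		node = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		node, ok = t.Obj.(*v1.Node)
		if !ok {
			klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t.Obj)
			return
		}
	default:
		klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", t)
		return
	}
	klog.V(3).InfoS("Delete event for node", "node", klog.KObj(node))
	if err := sched.Cache.RemoveNode(node); err != nil {
		klog.ErrorS(err, "Scheduler cache RemoveNode failed")
	}
	// Added: let unschedulable Pods that registered Node delete events be retried.
	// The event value here is assumed for illustration.
	nodeDeleted := framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Delete, Label: "NodeDeleted"}
	sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(nodeDeleted, nil)
}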

Among in-tree plugins, only Pod Topology Spread uses the Node delete event, and I guess this bug affects its performance. (Even when Nodes are deleted, the Pods that were rejected by Pod Topology Spread aren't moved to activeQ/backoffQ by those delete events.)

{Resource: framework.Node, ActionType: framework.Add | framework.Delete | framework.UpdateNodeLabel},

@sanposhiho
Member Author

Hm, it may be a problem that events related to Node and Pod don't always trigger a move when they occur. (Even if it's intended behavior, at least we need to document this in the comment of EventsToRegister.)

/retitle scheduler: events in EventsToRegister related to Pod and Node don't always work

@k8s-ci-robot k8s-ci-robot changed the title scheduler: Any events in EventsToRegister related to Pod doesn't work scheduler: events in EventsToRegister related to Pod and Node don't always work May 24, 2022
@sanposhiho
Member Author

sanposhiho commented May 29, 2022

To summarize: the problem is that Pod and Node events don't always move Pods from unschedulableQ to activeQ/backoffQ, and custom-plugin developers may not expect that because this behavior isn't even documented.

I think we should move all unschedulable Pods on all Pod and Node events. If we're concerned about performance degradation, we could extend EventsToRegister or EnqueueExtensions, for example, to accept a function like PreEnqueueCheck.

This problem has a non-trivial impact, especially in v1.24+, because we changed PodMaxUnschedulableQDuration to 5 min.
#108761

Let's see the details... ↓

Assigned Pod's add/update events

We move the unscheduled Pods (rejected by a plugin that registers Pod add/update events in EventsToRegister) in AssignedPodAdded or AssignedPodUpdated:

// AssignedPodAdded is called when a bound pod is added. Creation of this pod
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
	p.lock.Lock()
	p.movePodsToActiveOrBackoffQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodAdd)
	p.lock.Unlock()
}

// AssignedPodUpdated is called when a bound pod is updated. Change of labels
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodUpdated(pod *v1.Pod) {
	p.lock.Lock()
	p.movePodsToActiveOrBackoffQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodUpdate)
	p.lock.Unlock()
}

But the target Pods of this move are only the unschedulable Pods that have an affinity term matching "pod".

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}

So users cannot rely on all assignedPod events to move Pods from unschedulableQ to activeQ/backoffQ?
For example, Pod Topology Spread registers the Pod All event in EventsToRegister, and the creation of a new assigned Pod may make some Pods that were rejected by Pod Topology Spread schedulable, regardless of whether those unschedulable Pods have any affinity term matching the newly created assigned Pod.
Of course, this is not just a problem for Pod Topology Spread; it could also be a problem for users' custom plugins.

I guess we should pass all unscheduled Pods to movePodsToActiveOrBackoffQueue (sketched below), or at least document this behavior.
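For illustration, "pass all unscheduled Pods" would look roughly like the sketch below (against the types shown above; not a proposed patch, and it obviously trades away the existing optimization):

// Sketch: move every Pod currently in unschedulableQ on an assigned-Pod add event,
// instead of only the Pods with a matching required affinity term.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
	p.lock.Lock()
	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		podsToMove = append(podsToMove, pInfo)
	}
	p.movePodsToActiveOrBackoffQueue(podsToMove, AssignedPodAdd)
	p.lock.Unlock()
}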

Assigned Pod's delete events

No issue on this.

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.AssignedPodDelete, nil)

Non-scheduled Pod's add/update/delete events

We don't move the unschedulable Pods on those events at all. That's a problem... Even when a plugin registers Pod add/update/delete events in EventsToRegister and wants its rejected Pods to be moved on non-scheduled Pods' add/update/delete events, the scheduling queue doesn't move the unschedulable Pods on those events.

(In our company, we have a custom plugin for gang scheduling, and that plugin wants the scheduler queue to move the Pods it rejected from unschedulableQ to activeQ/backoffQ when a new non-scheduled Pod is created.)

Node's add/update events

We move only unscheduled Pods that pass preCheckForNode.

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.NodeAdd, preCheckForNode(nodeInfo))

sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(*event, preCheckForNode(nodeInfo))

return func(pod *v1.Pod) bool {
	admissionResults := AdmissionCheck(pod, nodeInfo, false)
	if len(admissionResults) != 0 {
		return false
	}
	_, isUntolerated := corev1helpers.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
		return t.Effect == v1.TaintEffectNoSchedule
	})
	return !isUntolerated
}

preCheckForNode checks whether the Node can run an unscheduled Pod from the point of view of some in-tree filter plugins (noderesources, nodeports, nodeaffinity, nodename).

The problem that arises here is similar to the one for assigned Pods' add/update events.
For example, Pod Topology Spread registers the Node Add event in EventsToRegister, and the creation of a new Node may make some Pods that were rejected by Pod Topology Spread schedulable, regardless of the result of preCheckForNode against that Node. (In #107009, we changed the skew calculation of Pod Topology Spread to exclude Nodes that don't match the node affinity. But after NodeInclusionPolicy is implemented, the calculation can be configured to include those Nodes.)

Node's delete events

We don't move the unschedulable Pods on Node delete events at all... That's a problem. Even when a plugin registers the Node delete event in EventsToRegister, the scheduling queue doesn't move the unschedulable Pods on any Node delete event.

@sanposhiho
Member Author

/cc @Huang-Wei @alculquicondor @ahg-g

Could you please take a look?

@alculquicondor
Member

These are all intended optimizations. These basic checks prevent the scheduler from retrying unschedulable pods over and over.

So far we haven't had the need to make these checks configurable, as they check well-established scheduling features.

+1 on documenting the behavior, but I'm not sure about changing it or making it configurable (that adds complexity that might not be justified).

What is it that you are trying to do?

@sanposhiho
Member Author

sanposhiho commented May 30, 2022

+1 on documenting the behavior, but I'm not sure about changing it or making it configurable (that adds complexity that might not be justified).
What is it that you are trying to do?

We are trying to remove flushing from the queue in #87850, right?

Without an improvement like the one I suggest, some Pods in unschedulableQ could never move back to activeQ in some cases if we remove flushing.
(Actually, our company's custom plugin needs to wait for the flush because non-assigned Pods' events don't make the scheduling queue move the unschedulable Pods rejected by that plugin.)
So, I think we should simply inform the queue of all events and let it move unschedulable Pods on those events.

This can happen with some custom plugins like our company's, and even with in-tree plugins.
For example, Pod Topology Spread (after NodeInclusionPolicy is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

@alculquicondor
Member

cc @kerthcet, who implemented NodeInclusionPolicy.

For example, Pod Topology Spread (after kubernetes/enhancements#3094 is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

Can you provide a concrete example? The policy just refers to the topology spread calculations. It doesn't mean the node can accept the pod without matching the nodeAffinity.

What is it that you are trying to do?

I'm asking what is your custom plugin trying to do. Depending on whether it makes sense, we need to evaluate if it's worth the complexity.

@kerthcet
Member

kerthcet commented May 31, 2022

So users cannot rely on all assignedPod events to move Pods from unschedulableQ to activeQ/backoffQ?
For example, Pod Topology Spread registers the Pod All event in EventsToRegister, and the creation of a new assigned Pod may make some Pods that were rejected by Pod Topology Spread schedulable

This is actually a problem; 5 minutes would be too long to wait for the background flush. But that doesn't mean we should simply shorten the interval.

This can happen with some custom plugins like our company's, and even with in-tree plugins.
For example, Pod Topology Spread (after kubernetes/enhancements#3094 is implemented): when a user uses NodeAffinityPolicy: Ignore and WhenUnsatisfiable: DoNotSchedule, the creation of a Node that doesn't match nodeAffinity/nodeSelector may change the scheduling result.

The skew of a newly created node is always 0, so the Pod is still unfit, I think.

@sanposhiho
Member Author

sanposhiho commented May 31, 2022

@alculquicondor

I'm asking what is your custom plugin trying to do. Depending on whether it makes sense, we need to evaluate if it's worth the complexity.

Our company's custom plugin is like the coscheduling plugin.

And like coscheduling's PreFilter, our plugin rejects Pods in PreFilter when the number of Pods is less than the defined number.
https://github.com/kubernetes-sigs/scheduler-plugins/blob/57499a6332c124314020a368c862b8b97470cc11/pkg/coscheduling/coscheduling.go#L128

Yes, so maybe the coscheduling plugin has the same issue as our plugin. The coscheduling plugin also registers the Pod Add event, and I guess that's for non-scheduled Pods' creation. (I don't know much about the coscheduling plugin though.)
https://github.com/kubernetes-sigs/scheduler-plugins/blob/57499a6332c124314020a368c862b8b97470cc11/pkg/coscheduling/coscheduling.go#L98

@kerthcet

The skew of a newly created node is always 0, so the Pod is still unfit, I think.

Oh, that's true 😓 Please forget about that example.
But how about a scenario like this:

  1. A user uses NodeTaintsPolicy: Respect and WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user adds an untolerated taint to NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue. (The skew arithmetic is sketched after this list.)
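The skew arithmetic for that example, as a small self-contained sketch (just illustrative arithmetic with the numbers from the steps above, not scheduler code; it assumes NodeC itself cannot take the Pod, e.g. because it is full):

package main

import "fmt"

func main() {
	maxSkew := 1

	// Step (2): NodeA=10, NodeB=10, NodeC=8. Placing the Pod on NodeA or NodeB
	// would give skew = (10+1) - 8 = 3, which violates maxSkew, so the Pod is
	// rejected (assuming NodeC can't host it for other reasons).
	fmt.Println("before the taint:", 10+1-8, "> maxSkew", maxSkew)

	// Steps (3)-(4): NodeC gets an untolerated taint and, with
	// NodeTaintsPolicy: Respect, drops out of the skew domains. The minimum
	// matching count becomes 10, so placing the Pod on NodeA or NodeB gives
	// skew = (10+1) - 10 = 1 <= maxSkew; the Pod is now schedulable, but nothing
	// re-queues it from unschedulableQ without the periodic flush.
	fmt.Println("after the taint:", 10+1-10, "<= maxSkew", maxSkew)
}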

@kerthcet
Member

Maybe we can find some pointcuts via the UnschedulablePlugins field in QueuedPodInfo, e.g. if a Pod was rejected by PodTopologySpread, we'd call movePodsToActiveOrBackoffQueue when assigned Pods' add/update events arrive.

For custom plugins, we may have to do some refactoring to register the filter func together with the event.

@alculquicondor
Member

So now NodeC is excluded from calculations. Hence the pod fits in either NodeA or NodeB, which were unchanged.

Tricky... I'll have to think about it. We really need those quick checks to pass for performance in most scenarios.

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 31, 2022
@Huang-Wei
Member

Huang-Wei commented May 31, 2022

Thanks @sanposhiho for bringing this up. Some comments below:

Re: Assigned Pod's add/update events

You're right. Pods failed by PodTopologySpread may not get re-queued immediately.
However, an assigned Pod's add/update event would only impact unschedulable Pods with cross-node scheduling directives. As of today, it's InterPodAffinity and PodTopologySpread.

So instead of "pass all unscheduled Pods to movePodsToActiveOrBackoffQueue", I'd like to come up with a solution like:

p.movePodsToActiveOrBackoffQueue(
    p.getUnschedulablePodsWithMatchingAffinityTerm(pod),
    p.getUnschedulablePodsWithPodTopologySpreadConstraints(pod), /* Not Implemented Yet */
    AssignedPodAdd,
) 
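(For illustration, that not-yet-implemented helper might look like the sketch below; matching purely on the constraints' label selectors is an assumption, not a design decision.)

// Sketch only: return unschedulable Pods that declare topology spread constraints
// whose label selector matches the incoming assigned Pod.
func (p *PriorityQueue) getUnschedulablePodsWithPodTopologySpreadConstraints(pod *v1.Pod) []*framework.QueuedPodInfo {
	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, c := range pInfo.Pod.Spec.TopologySpreadConstraints {
			selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
			if err != nil {
				continue
			}
			if selector.Matches(labels.Set(pod.Labels)) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}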

Re: Non-scheduled Pod's add/update/delete events

It's pretty rare to move pods depending on Unschedulable Pods' events.

In your particular case, maybe a proactive Activate call would behave more efficiently.
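(For reference, a sketch of what that could look like from inside a plugin, using the PodsToActivate entry in CycleState the way the coscheduling plugin does; which sibling Pods to activate is left up to the caller:)

// Sketch: proactively ask the scheduler to activate related unschedulable Pods by
// writing them into the PodsToActivate entry of CycleState (e.g. from Permit or
// PostBind of a gang-scheduling-style plugin).
func activateSiblings(state *framework.CycleState, siblings []*v1.Pod) {
	c, err := state.Read(framework.PodsToActivateKey)
	if err != nil {
		return
	}
	if s, ok := c.(*framework.PodsToActivate); ok {
		s.Lock()
		for _, p := range siblings {
			s.Map[p.Namespace+"/"+p.Name] = p
		}
		s.Unlock()
	}
}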

Re: Node's add/update events

It doesn't sound correlated, does it? The preCheckForNode check just pre-verifies basic checks which are orthogonal to PodTopologySpread. In other words, a Pod that failed preCheckForNode can't eventually become schedulable.

Updated: The example in #110175 (comment) is a valid point...

Re: Node's delete events

Can you share some use-cases for moving Pods upon a Node's deletion? Previously I don't think we had any, until the introduction of the minDomains feature in the last release (as in that case deleting a node would impact the number of domains).

@sanposhiho
Member Author

sanposhiho commented Jun 1, 2022

So instead of "pass all unscheduled Pods to movePodsToActiveOrBackoffQueue", I'd like to come up with a solution like:

If we do so, don't we also need to provide a way for out-of-tree plugins to pass functions that decide which Pods should be passed to movePodsToActiveOrBackoffQueue (like getUnschedulablePodsWithPodTopologySpreadConstraints)?

In your particular case, maybe a proactive kubernetes-sigs/scheduler-plugins#299 call would behave more efficiently.

Ah~, that's a good idea :)

Can you share some use-cases for moving Pods upon a Node's deletion?

Maybe like this (similar to #110175 (comment))

  1. A user uses WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user deletes NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue.

I think we should move all unschedulable Pods on all Pod and Node events. If we're concerned about performance degradation, we could extend EventsToRegister or EnqueueExtensions, for example, to accept a function like PreEnqueueCheck.

So ignoring some events for performance on the scheduler side (without providing a way to change that) could be inconvenient for some plugins. What I had in mind in my previous comment (#110175 (comment)) is to define "which kind of resource's events should be ignored" on the plugin side.
(Similar to @kerthcet's idea? #110175 (comment))

We can add pre-filter functions to ClusterEvent like this:

// obj: the created, updated, or deleted object
// rejectedPod: one of the unschedulable Pods in unschedulableQ
// return value: whether this unschedulable Pod should be moved to activeQ/backoffQ or not.
type PreFilterFn func(obj interface{}, rejectedPod *v1.Pod) bool

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	Label      string
	// PreFilterFn is the function to check which Pods should be moved from unschedulableQ to activeQ/backoffQ when this event happens.
	PreFilterFn PreFilterFn
}

(For reference, the current ClusterEvent is:)

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	Label      string
}

Then each plugin can set its own pre-filter functions.
Also, we can define some default behaviors for PreFilterFn and use them when ClusterEvent.PreFilterFn is empty.
So, for example, if most plugins are expected to prefer ignoring events for Nodes that don't match their nodeAffinity/nodeSelector, we can define that as the default behavior. (A usage sketch follows below.)
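For example, with this proposed (not existing) API, a gang-scheduling-style plugin could register something like the sketch below; the pod-group label is hypothetical:

// Usage sketch of the proposed PreFilterFn field: only re-queue a rejected Pod
// when the newly created Pod belongs to the same (hypothetical) group.
func (p *Plugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{{
		Resource:   framework.Pod,
		ActionType: framework.Add,
		PreFilterFn: func(obj interface{}, rejectedPod *v1.Pod) bool {
			newPod, ok := obj.(*v1.Pod)
			if !ok {
				return false
			}
			return newPod.Labels["pod-group"] == rejectedPod.Labels["pod-group"]
		},
	}}
}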

What do you think about this idea?

@Huang-Wei
Member

Huang-Wei commented Jun 2, 2022

If we do so, don't we also need to provide a way for out-of-tree plugins to pass functions that decide which Pods should be passed to movePodsToActiveOrBackoffQueue (like getUnschedulablePodsWithPodTopologySpreadConstraints)?

Nope, that is something the scheduler core needs to implement.

Maybe like this (similar to #110175 (comment))

  1. A user uses WhenUnsatisfiable: DoNotSchedule.
  2. A Pod is rejected by Pod Topology Spread because of the skew.
    e.g. the user uses maxSkew: 1 with TopologyKey: hostname; NodeA/NodeB have 10 Pods each, and NodeC has 8 Pods.
  3. The user deletes NodeC.
  4. The skew calculation result should change, and the Pod rejected in (2) would now be allowed by Pod Topology Spread. But the Pod isn't moved from unschedulableQ without flushing, because of this issue.

Thanks. It makes sense to me.

What I had in mind in my previous comment (#110175 (comment)) is to define "which kind of resource's events should be ignored" on the plugin side.

A customizable PreCheck approach can certainly satisfy more dynamic cases, but it's inappropriate to associate it with ClusterEvent because (1) ClusterEvent is basically an immutable constant, and (2) we use it to map to plugins. If we go this way, we'd need to wrap ClusterEvent in another data structure, which is kind of complicated.


To summarize, the following cases don't properly re-queue relevant pods:

  1. Assigned Pod's add/update events
  2. Non-scheduled Pod's add/update/delete events
  3. Node's add/update/delete events

For 1 & 2, I've commented above with solutions. (1 needs core changes similar to podAffinity; 2 may leverage the Activate hook.)

For 3, with @sanposhiho's examples, I'm convinced this is an area we need to improve. One simple and straightforward solution is to plumb a hook to let a plugin claim "alright, upon Node events, move the pod failed by me anyways". Then, once this knob is registered, we move Pods failed by that plugin back unconditionally instead of proceeding with preCheckForNode(). This hook should be available to in-tree and out-of-tree plugins. In terms of in-tree plugins, it's just PodTopologySpread (we discussed this in today's SIG meeting; even InterPodAffinity doesn't fall into this case). A rough sketch of such a hook follows.
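(Shape-wise, such a knob could be as small as an optional interface the queue checks; purely a sketch of the idea, nothing like this exists today:)

// Sketch of a hypothetical opt-in hook: a plugin implementing this interface asks
// the queue to skip preCheckForNode for the Pods it rejected and move them back
// unconditionally on its registered Node events.
type MoveOnAnyNodeEventPlugin interface {
	framework.Plugin
	MoveOnAnyNodeEvent() bool
}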

So, as a first step, let's ensure we're on the same page and agree on the ^^ summary. Next, we can better evaluate trade-offs and come up with an optimal solution.

@sanposhiho
Member Author

sanposhiho commented Jun 5, 2022

Thanks for discussing this topic in the sig meeting. 🙏
But I still believe that all logic related to a plugin should be controllable from the plugin side, and we should not create something that can be achieved by in-tree plugins but not by out-of-tree plugins.

So... the benefit of the "customizable PreCheck approach" (#110175 (comment)) isn't only that it solves this issue. We can use the feature for other optimizations, like improving the scheduler's overall performance by moving only Pods that are more likely to be schedulable, by defining more detailed events in EventsToRegister with a customizable PreCheck.
Many plugins have a particular scenario that makes their rejected Pods schedulable, and the current EventsToRegister isn't expressive enough for this. (So, today many plugins need to say "register this hoge event on foo resource; this event may make rejected Pods schedulable" in comments on EventsToRegister.)

If we go this way, we'd need to wrap ClusterEvent in another data structure, which is kind of complicated.

Yes, I watched the SIG meeting recording as well, and the opinion that the "customizable PreCheck approach" adds implementation complexity surely makes sense. But isn't that added complexity acceptable when the above is taken into account?


EDIT: It's a bad habit of mine to always write long sentences even though I'm not very good at English. 😓
Summarizing the above:

  • I think it's not a good idea to create something that can be achieved by in-tree plugins but not by out-of-tree plugins.
  • If we need a solid advantage to justify extending things like this, it's the ability to optimize the queue per plugin, as described above.

@sanposhiho
Member Author

I didn't jump into the thread since it's actively under discussion.
But the issue in https://kubernetes.slack.com/archives/C09TP78DV/p1659383737933799 is a similar one.
(That time, we can solve the issue by changing the scheduling queue logic since it's a k/k feature, though.)

@Huang-Wei
Member

That time, we can solve the issue by changing the scheduling queue logic since it's a k/k feature, though.

@sanposhiho I'm not sure I fully understand your intention. The slack thread was discussing the assigned-Pod update case, so it falls into case 1 in #110175 (comment), which we already agreed to improve upstream.

@kerthcet
Member

Hi @sanposhiho, are you working on this? I think this is quite important, and if you're not working on it, I can follow up.

@sanposhiho
Member Author

sanposhiho commented Oct 17, 2022

Sorry for leaving this for a while. I wanted to wait for KEP-3521: Pod Scheduling Readiness (and then forgot to come back here 😓).
That proposal is quite similar to what I want to see as the conclusion of this issue: every plugin can judge whether each Pod in unschedQ can be moved to schedQ/backoffQ.

@Huang-Wei Do you think it makes sense to add "by which event the Pod will be moved back to schedQ/backoffQ" to PreEnqueue's arguments?

type EnqueuePlugin interface {
    Plugin
    // event: the event by which the Pod will be moved back to schedQ/backoffQ.
    // If the Pod will be moved back to schedQ/backoffQ by flushing, it will be nil. (Flushing is planned to be removed in https://github.com/kubernetes/kubernetes/issues/87850 though.)
    // involvedObj: the object involved in that event.
    // For example, if the given event is "Node deleted", the involvedObj will be that deleted Node.
    PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *Status
}

I know event and involvedObj aren't needed in the DefaultEnqueue plugin for the schedulingGates API, but that's the last missing puzzle piece 🧩: if PreEnqueue has event, an enqueue plugin can judge whether the event makes it possible to schedule the Pod.

For example, I believe we can move ↓ this logic to Pod Affinity plugin by implementing PreEnqueue in Pod Affinity plugin.

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}

The Pod Affinity plugin would remember the Pods it rejected in Filter. Then in PreEnqueue, if the given Pod is one rejected by Pod Affinity, the event is an assigned-Pod Create/Update, and the involvedObj is a Pod that has affinity with the rejected Pod, it moves the Pod back to activeQ/backoffQ; otherwise it rejects the move. (A rough sketch follows.)
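(Roughly, with the proposed signature above, that could look like the sketch below; everything here is hypothetical, the requiredAffinityTerms helper is assumed, and namespace-label handling is omitted for brevity.)

// Hypothetical PreEnqueue for the InterPodAffinity plugin, following the proposed
// interface above: only let a Pod it rejected move back when the event's involved
// Pod matches one of the rejected Pod's required affinity terms.
func (pl *InterPodAffinity) PreEnqueue(ctx context.Context, state *framework.CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *framework.Status {
	if event.Resource != framework.Pod {
		return framework.NewStatus(framework.Unschedulable, "not a Pod event")
	}
	assignedPod, ok := involvedObj.(*v1.Pod)
	if !ok || assignedPod.Spec.NodeName == "" {
		return framework.NewStatus(framework.Unschedulable, "not an assigned Pod")
	}
	for _, term := range requiredAffinityTerms(p) { // assumed helper
		if term.Matches(assignedPod, nil) {
			return nil // allow the move to activeQ/backoffQ
		}
	}
	return framework.NewStatus(framework.Unschedulable, "this event cannot make the Pod schedulable")
}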

We can resolve (3) by implementing PreEnqueue as well (by exactly the way you said ↓):

"alright, upon Node events, move the pod failed by me anyways"
#110175 (comment)

My original concern in this issue is that having logic for specific plugins in the scheduler core may limit out-of-tree plugins' extensibility. Ideally the scheduler itself should stay pure and shouldn't do anything special for any particular plugin implementation. If we can move all such logic into each plugin, that'd be perfect.

@sanposhiho
Member Author

^ is more like a new feature than just a solution to this problem. I created another issue and described my proposal in detail. Let's have the discussion there.
#114297

@sanposhiho
Member Author

This can be resolved by moving all of preCheck into each plugin's QueueingHint (a rough sketch of that mechanism follows below). Let me retitle this issue so that the current status is easier to understand.

/retitle scheduler: move all preCheck to QueueingHint
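(For context, the QueueingHint mechanism this retitle refers to lets each plugin return a per-event hint function from EventsToRegister. In recent releases it looks roughly like the sketch below, though the exact signatures differ between versions; SomePlugin is a placeholder name:)

// Rough sketch of a QueueingHint registration; details vary by Kubernetes version.
func (pl *SomePlugin) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{{
		Event:          framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Delete},
		QueueingHintFn: pl.isSchedulableAfterNodeChange,
	}}
}

// isSchedulableAfterNodeChange plays the role that preCheck used to play, but per
// plugin: return framework.Queue only when this event can make the Pod schedulable,
// otherwise framework.QueueSkip.
func (pl *SomePlugin) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	return framework.Queue, nil
}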
