
take PodTopologySpread into consideration when requeueing Pods based on Pod related events #122627

Open
wants to merge 1 commit into base: master

Conversation

sanposhiho
Member

@sanposhiho sanposhiho commented Jan 7, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When assigned Pods are created or updated, the cluster events (Pod/Add or Pod/Update) should be delivered to all plugins that subscribe to those events. In reality, however, getUnschedulablePodsWithMatchingAffinityTerm pre-filters which Pods receive those events based on the Pods' affinity and pod.Status.Resize, meaning only PodAffinity and NodeResourceFit actually receive them.

The problem here is similar to #110175: among in-tree plugins, PodAffinity, PodTopologySpread and NodeResourceFit subscribe to Pod add and/or update events, yet Pods rejected by PodTopologySpread are never requeued to activeQ by Pod-related events. The same happens for custom plugins that register Pod add and/or update events.


This PR addresses this problem in two ways (a sketch follows this list):

  • Take PodTopologySpread into consideration when requeueing Pods based on Pod-related events.
    • This fix is mostly for people who don't have QueueingHints (QHint) enabled.
  • Use MoveAllToActiveOrBackoffQueue only, when the SchedulerQueueingHints feature gate is enabled.
    • This is the general fix that solves the same problem for custom plugins. We have to guard it with the feature flag, otherwise it would have a negative impact on scheduling throughput (similar to how we can't remove preCheck until the QHint work is done).
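
For reference, here is a minimal, hedged sketch of the second point. The wrapper function name requeueOnAssignedPodUpdate is hypothetical; the queue methods, the event constant, and the feature gate are the names that appear in this PR's diff.

// Hedged sketch, not the exact merged code: how the Pod/Update event handler can branch
// on the SchedulerQueueingHints feature gate. requeueOnAssignedPodUpdate is a made-up name;
// MoveAllToActiveOrBackoffQueue, AssignedPodUpdated, and queue.AssignedPodUpdate are the
// identifiers used in this PR's diff.
package scheduler

import (
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/klog/v2"

	"k8s.io/kubernetes/pkg/features"
	"k8s.io/kubernetes/pkg/scheduler/internal/queue"
)

func (sched *Scheduler) requeueOnAssignedPodUpdate(logger klog.Logger, oldPod, newPod *v1.Pod) {
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
		// With QueueingHints, each plugin's hint decides whether this event could make
		// a Pod schedulable, so the event can go to all plugins without pre-filtering.
		sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodUpdate, oldPod, newPod, nil)
		return
	}
	// Without QueueingHints, keep the cheaper pre-filtered path (PodAffinity,
	// PodTopologySpread, NodeResourceFit) to avoid hurting scheduling throughput.
	sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
}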

Which issue(s) this PR fixes:

Fixes #122626
Fixes #123480

Special notes for your reviewer:

@kubernetes/sig-scheduling-leads
It might be a candidate for cherry-pick. I'm not sure and don't have a strong opinion.

Does this PR introduce a user-facing change?

Fix a bug where Pods rejected by PodTopologySpread could be stuck in the Pending state for 5 minutes in a worst-case scenario.
The same problem could happen with custom plugins that have Pod/Add or Pod/Update in EventsToRegister,
which is also solved by this PR, but only when the feature flag SchedulerQueueingHints is enabled.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 7, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 7, 2024
@sanposhiho
Member Author

/cc @alculquicondor

@sanposhiho sanposhiho marked this pull request as ready for review January 7, 2024 03:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2024
@sanposhiho
Member Author

/hold to go through an approver's review.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 7, 2024
@sanposhiho sanposhiho marked this pull request as draft January 7, 2024 03:44
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2024
@sanposhiho sanposhiho changed the title from "remove(scheduling queue): remove AssignedPodAdded and AssignedPodUpdated" to "take PodTopologySpread into consideration when requeueing Pods based on Pod related events" Jan 7, 2024
@sanposhiho sanposhiho marked this pull request as ready for review January 7, 2024 04:06
@sanposhiho
Member Author

/retest

Comment on lines 1160 to 1163
// Note that this function leaks goroutines in the case of timeout; even after this function returns after timeout,
// the goroutine made by this function keep waiting to pop a pod from the queue.
Member

can you fix this in timeout()?

Member Author
@sanposhiho sanposhiho Jan 9, 2024

If we want to fix it, I guess we have to modify schedulingQueue.Pop() to somehow abort waiting (e.g., add ctx to Pop() and abort waiting if ctx is canceled).
I don't have any other idea to solve this leak.

Member

Pop() would be able to quit in another way I guess - we can use defer schedulingQueue.Close() in each test to signal the queue to quit gracefully, right?

Member Author
@sanposhiho sanposhiho Jan 14, 2024

Actually, we use a different scheduler in every test case, so this leaked goroutine won't be carried over to the next test case.

What I meant here is that NextPod is called multiple times within a single test. What confused me when writing tests was:

  1. The first call of NextPod times out.
  2. Something happens and Pod-A is moved to activeQ.
  3. I want to confirm that (2) moved Pod-A, so I call NextPod again to make sure Pod-A is popped. But the goroutine created in (1) is still alive and pops Pod-A out, so the second call of NextPod also times out (a simplified sketch of this follows).
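
For clarity, a simplified, hypothetical sketch of why such a timeout-based helper leaks a goroutine; the podQueue interface and nextPodWithTimeout are made-up stand-ins, not the real scheduler queue API.

// Illustration only: a minimal model of the removed NextPod helper and its goroutine leak.
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// podQueue is a simplified stand-in for the scheduler's scheduling queue.
type podQueue interface {
	// Pop blocks until a Pod is available or the queue is closed.
	Pop() (*v1.Pod, error)
}

// nextPodWithTimeout returns the next Pod, or nil after the timeout.
func nextPodWithTimeout(q podQueue, timeout time.Duration) *v1.Pod {
	ch := make(chan *v1.Pod, 1)
	go func() {
		p, err := q.Pop() // may keep blocking long after the caller has returned
		if err != nil {
			return
		}
		ch <- p // buffered, so this send never blocks even if nobody reads it
	}()
	select {
	case p := <-ch:
		return p
	case <-time.After(timeout):
		// The goroutine above is leaked until Pop returns; if a Pod arrives later,
		// that goroutine pops it, which is exactly the confusion in step (3) above.
		return nil
	}
}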

@sanposhiho
Member Author

@Huang-Wei addressed your suggestion

@sanposhiho
Member Author

@kubernetes/sig-scheduling-approvers Can anyone take a look?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2024
@sanposhiho sanposhiho force-pushed the remove-AssignedPodUpdated branch 2 times, most recently from 01fc67c to d7c4b64, on April 7, 2024 14:39
Member
@kerthcet kerthcet left a comment

Several comments.

@@ -204,7 +204,19 @@ func (sched *Scheduler) addPodToCache(obj interface{}) {
logger.Error(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
}

sched.SchedulingQueue.AssignedPodAdded(logger, pod)
// SchedulingQueue.AssignedPodAdded internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
Member

Can we leave the logic centralized at AssignedPodAdded? Make the function more pure.

Member Author

I prefer the current implementation because, after the SchedulerQueueingHints feature flag goes GA, AssignedPodAdded will no longer be needed and we'll use MoveAllToActiveOrBackoffQueue only. What do you think?

if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodUpdate, oldPod, newPod, nil)
} else {
sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
Member

same here

@@ -1085,9 +1089,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
p.lock.Lock()
if isPodResourcesResizedDown(newPod) {
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossNodeTerm
Member

Please remove this; let's only add comments for special cases.

Member Author

This actually is a special case; without this comment, people wouldn't have a clue why we don't use getUnschedulablePodsWithCrossNodeTerm when isPodResourcesResizedDown is true.

pkg/scheduler/internal/queue/scheduling_queue.go (outdated review thread, resolved)
pkg/scheduler/internal/queue/scheduling_queue.go (outdated review thread, resolved)
nsLabels := interpodaffinity.GetNamespaceLabelsSnapshot(logger, pod.Namespace, p.nsLister)

var podsToMove []*framework.QueuedPodInfo
for _, pInfo := range p.unschedulablePods.podInfoMap {
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
Member

So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right? I'm OK with the current implementation, since no feedback/issue has been filed about this. I don't want to expand the scope.

Member

We think we can still check the podAffinity here: if it doesn't match, don't append. Agree?

Member Author

So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right?

Yes, that's right. People have to enable QHint if they want to avoid the same problem in custom scheduler plugins.

Member Author
@sanposhiho sanposhiho Apr 22, 2024

We think we can still check the podAffinity here: if it doesn't match, don't append. Agree?

I took some time to think deeply about this, and concluded that we should go with the current implementation rather than the suggestion from @kerthcet.

In short: suppose a Pod has both topology spread and pod affinity, and is rejected only by topology spread. In this case, even if the assigned pod here doesn't match the pod affinity, it could still make the pending Pod schedulable.

A more in-depth explanation. Suppose:

  • a Pod has both topology spread and pod affinity; the topology spread's maxSkew is 1 and the topology key is the node name.
  • node-a has 1 Pod matching the topology spread, while node-b has 3 Pods.
  • both nodes already have a scheduled Pod matching the pod affinity.

In this case, an assigned Pod that lands on node-a and matches the topology spread can make the pending Pod schedulable, regardless of whether that assigned Pod matches the pod affinity, because node-a already has another Pod matching the pod affinity.
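
To make this concrete, here is a hedged sketch using the test builders that appear later in this PR (st.MakePod, SpreadConstraint, PodAffinityExists). The pod names, the nil spread selector, and the Node(...) call are illustrative assumptions, not code from this PR.

// Sketch for a test body, not a complete test.
// st is assumed to be k8s.io/kubernetes/pkg/scheduler/testing, as in the PR's tests.
func exampleTopologySpreadCanUnblockAffinityPod() {
	// The pending Pod from the example: topology spread (maxSkew=1, node topology key)
	// plus a required pod affinity term.
	pendingPod := st.MakePod().Name("p").Namespace("ns1").UID("p").
		SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).
		PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).
		Obj()

	// A newly assigned Pod on node-a. If it counts toward the spread constraint there
	// (node-a: 1 -> 2 vs node-b: 3), the skew shrinks and pendingPod can become
	// schedulable even when this Pod does not match pendingPod's pod affinity,
	// because node-a already hosts another Pod that satisfies the affinity term.
	assignedPod := st.MakePod().Name("assigned").Namespace("ns1").UID("assigned").Node("node-a").Obj()

	_ = pendingPod
	_ = assignedPod
}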

@@ -225,7 +237,19 @@ func (sched *Scheduler) updatePodInCache(oldObj, newObj interface{}) {
logger.Error(err, "Scheduler cache UpdatePod failed", "pod", klog.KObj(oldPod))
}

sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
// SchedulingQueue.AssignedPodUpdated internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
// Consequently, if custom plugins that subscribes Pod/Update events reject Pods,
Member

If we only want to handle the PodTopologySpread problem here, let's not mention custom plugins; otherwise it will be confusing and read as if we solved the problem for all plugins, unless that is your intent.

Member

Can we make the comment more concise? It's a little too long.

Member Author

So, this PR solves two things:

  • Take PodTopologySpread into consideration when requeueing Pods based on Pod-related events.
    • This fix is mostly for people who don't have QHint enabled.
  • Use MoveAllToActiveOrBackoffQueue only, when the feature gate is enabled.
    • This is the general fix that solves the same problem for custom plugins. We have to guard it with the feature flag, otherwise it would have a negative impact on scheduling throughput (similar to how we can't remove preCheck until the QHint work is done).

Maybe you want me to create a separate PR for the changes in eventhandlers.go?

Member

Since there are no plans for cherry-pick, I'm ok fixing both problems in this PR.
Just update the title and description accordingly.

Member Author

Updated the description.

@@ -1157,21 +1157,3 @@ func NextPodOrDie(t *testing.T, testCtx *TestContext) *schedulerframework.Queued
}
return podInfo
}

// NextPod returns the next Pod in the scheduler queue, with a 5 seconds timeout.
Member

Why remove this? Unused?

Member Author
@sanposhiho sanposhiho Apr 15, 2024

Yes, it's unused, and it has the goroutine-leak problem mentioned in the comment, so this is just a cleanup in this PR (I can create another PR if you want).

Member

Yes, please

Member Author

Moved this cleanup to #125433

Member Author
@sanposhiho sanposhiho left a comment

@kerthcet Sorry for the delay, I just replied to the comments. (Actual implementation changes will come later.)

@sanposhiho
Member Author

@kerthcet Fixed / replied to your comments🙏

@alculquicondor
Member

/retest

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2024
@sanposhiho
Member Author

@alculquicondor @kerthcet Sorry for the long wait on this. I've addressed the comments.

@sanposhiho
Member Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2024
@sanposhiho
Member Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2024
@sanposhiho
Member Author

rebased.

Comment on lines +1111 to +1112
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events maybe make Pods that rejected by NodeResourceFit schedulable.
Member

Suggested change
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events maybe make Pods that rejected by NodeResourceFit schedulable.
// In this case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events may make Pods that were rejected by NodeResourceFit schedulable.

@@ -1104,9 +1108,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
Member

Does this include changes to the Pod status, or only the Pod spec?

Comment on lines +1191 to +1272
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
// This Pod may be schedulable now by this Pod event.
podsToMove = append(podsToMove, pInfo)
continue
}

Member

did you consider checking for the namespace?
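
For illustration only, not part of this PR: PodTopologySpread only counts Pods in the same namespace as the unschedulable Pod, so a namespace check on top of the loop body quoted above could look roughly like this.

// Hypothetical variant of the quoted loop body; an assigned Pod in a different
// namespace cannot change the skew seen by the unschedulable Pod.
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) && pInfo.Pod.Namespace == pod.Namespace {
	// This Pod may be schedulable now by this Pod event.
	podsToMove = append(podsToMove, pInfo)
	continue
}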

@@ -1853,6 +1854,7 @@ func TestPriorityQueue_AssignedPodAdded(t *testing.T) {
defer cancel()

affinityPod := st.MakePod().Name("afp").Namespace("ns1").UID("afp").Annotation("annot2", "val2").Priority(mediumPriority).NominatedNodeName("node1").PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).Obj()
spreadPod := st.MakePod().Name("tsp").Namespace("ns1").UID("tsp").SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).Obj()
Member

Please instead create a new test and rename the existing one to be specific about pod affinity (the comment already talks about pod affinity specifically).

Or maybe it can be converted into a test table.
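
A rough, hypothetical sketch of what that table could look like; the struct fields and the runner body are placeholders, and the real setup and assertions would come from the existing TestPriorityQueue_AssignedPodAdded.

// Hypothetical table-driven shape; imports mirror the existing test file
// (testing, v1 "k8s.io/api/core/v1", st "k8s.io/kubernetes/pkg/scheduler/testing").
func TestPriorityQueue_AssignedPodAdded_Table(t *testing.T) {
	tests := []struct {
		name       string
		unschedPod *v1.Pod // Pod waiting in the unschedulable pool
		wantMoved  bool    // should the assigned-Pod Add event requeue it?
	}{
		{
			name: "Pod rejected by PodAffinity is requeued",
			unschedPod: st.MakePod().Name("afp").Namespace("ns1").UID("afp").
				PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).Obj(),
			wantMoved: true,
		},
		{
			name: "Pod rejected by PodTopologySpread is requeued",
			unschedPod: st.MakePod().Name("tsp").Namespace("ns1").UID("tsp").
				SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).Obj(),
			wantMoved: true,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Placeholder: construct the queue, add tt.unschedPod as unschedulable,
			// deliver an assigned-Pod Add event, then assert whether tt.unschedPod
			// ended up in activeQ, mirroring the existing test body.
			_ = tt.unschedPod
			_ = tt.wantMoved
		})
	}
}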

Successfully merging this pull request may close these issues.

  • util.NextPod leaks goroutine
  • Assigned Pod add/update events don't requeue Pods without PodAffinity
5 participants