change sort function of scheduling queue to avoid starvation when a lot of unschedulable pods are in the queue #72619

Merged
5 changes: 4 additions & 1 deletion pkg/scheduler/internal/queue/scheduling_queue.go
@@ -230,7 +230,10 @@ func podTimestamp(pod *v1.Pod) *metav1.Time {
 	if condition == nil {
 		return &pod.CreationTimestamp
 	}
-	return &condition.LastTransitionTime
+	if condition.LastProbeTime.IsZero() {
+		return &condition.LastTransitionTime
+	}
+	return &condition.LastProbeTime
 }

// activeQComp is the function used by the activeQ heap algorithm to sort pods.
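The body of activeQComp is collapsed in this diff. For context, a rough sketch of the comparator as it is assumed to work: priority first, then the podTimestamp value as a tie-breaker (an approximation for discussion, not the verbatim source; util here is assumed to be the scheduler's util package):

```go
// Sketch only: same package as podTimestamp above; GetPodPriority is assumed
// to come from the scheduler util package. Higher priority wins; on a tie, the
// pod with the older timestamp (creation time, or LastProbeTime once it has
// been tried) is popped first.
func activeQComp(pod1, pod2 interface{}) bool {
	p1 := pod1.(*v1.Pod)
	p2 := pod2.(*v1.Pod)
	prio1 := util.GetPodPriority(p1)
	prio2 := util.GetPodPriority(p2)
	return (prio1 > prio2) ||
		(prio1 == prio2 && podTimestamp(p1).Before(podTimestamp(p2)))
}
```

With the podTimestamp change above, a pod that was just retried and failed again carries a fresh LastProbeTime, so at equal priority it sorts behind pods that were tried less recently or never tried at all.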
101 changes: 97 additions & 4 deletions pkg/scheduler/internal/queue/scheduling_queue_test.go
@@ -572,10 +572,11 @@ func TestRecentlyTriedPodsGoBack(t *testing.T) {
 	}
 	// Update pod condition to unschedulable.
 	podutil.UpdatePodCondition(&p1.Status, &v1.PodCondition{
-		Type:    v1.PodScheduled,
-		Status:  v1.ConditionFalse,
-		Reason:  v1.PodReasonUnschedulable,
-		Message: "fake scheduling failure",
+		Type:          v1.PodScheduled,
+		Status:        v1.ConditionFalse,
+		Reason:        v1.PodReasonUnschedulable,
+		Message:       "fake scheduling failure",
+		LastProbeTime: metav1.Now(),
 	})
 	// Put in the unschedulable queue.
 	q.AddUnschedulableIfNotPresent(p1)
@@ -594,6 +595,98 @@ func TestRecentlyTriedPodsGoBack(t *testing.T) {
}
}

// TestPodFailedSchedulingMultipleTimesDoesNotBlockNewerPod tests
// that a pod determined as unschedulable multiple times doesn't block any newer pod.
// This behavior ensures that an unschedulable pod does not block the head of the queue when there
// are frequent events that move pods to the active queue.
func TestPodFailedSchedulingMultipleTimesDoesNotBlockNewerPod(t *testing.T) {
q := NewPriorityQueue(nil)

// Add an unschedulable pod to a priority queue.
// This simulates a situation where the pod has already been tried
// and determined to be unschedulable.
unschedulablePod := v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "test-pod-unscheduled",
Namespace: "ns1",
UID: "tp001",
},
Spec: v1.PodSpec{
Priority: &highPriority,
},
Status: v1.PodStatus{
NominatedNodeName: "node1",
},
}
// Update pod condition to unschedulable.
podutil.UpdatePodCondition(&unschedulablePod.Status, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: v1.PodReasonUnschedulable,
Message: "fake scheduling failure",
LastProbeTime: metav1.Now(),
})
// Put in the unschedulable queue
q.AddUnschedulableIfNotPresent(&unschedulablePod)
// Clear its backoff to simulate its backoff expiration
q.clearPodBackoff(&unschedulablePod)
// Move all unschedulable pods to the active queue.
q.MoveAllToActiveQueue()

// Simulate a pod being popped by the scheduler.
// At this time, the unschedulable pod should be popped.
p1, err := q.Pop()
if err != nil {
t.Errorf("Error while popping the head of the queue: %v", err)
}
if p1 != &unschedulablePod {
t.Errorf("Expected that test-pod-unscheduled was popped, got %v", p1.Name)
}

// Assume a newer pod was added just after the unschedulable pod
// was popped and before it was pushed back to the queue.
newerPod := v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "test-newer-pod",
Namespace: "ns1",
UID: "tp002",
CreationTimestamp: metav1.Now(),
},
Spec: v1.PodSpec{
Priority: &highPriority,
},
Status: v1.PodStatus{
NominatedNodeName: "node1",
},
}
q.Add(&newerPod)

// And then unschedulablePod was determined as unschedulable AGAIN.
podutil.UpdatePodCondition(&unschedulablePod.Status, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: v1.PodReasonUnschedulable,
Message: "fake scheduling failure",
LastProbeTime: metav1.Now(),
})
// And then, put the unschedulable pod back into the unschedulable queue
q.AddUnschedulableIfNotPresent(&unschedulablePod)
// Clear its backoff to simulate its backoff expiration
q.clearPodBackoff(&unschedulablePod)
// Move all unschedulable pods to the active queue.
q.MoveAllToActiveQueue()

// At this time, newerPod should be popped, because its creation timestamp
// is older than the unschedulable pod's refreshed LastProbeTime.
p2, err2 := q.Pop()
if err2 != nil {
t.Errorf("Error while popping the head of the queue: %v", err2)
}
if p2 != &newerPod {
t.Errorf("Expected that test-newer-pod was popped, got %v", p2.Name)
}
}

// TestHighPriorityBackoff tests that a high priority pod does not block
// other pods if it is unschedulable
func TestHighProirotyBackoff(t *testing.T) {
9 changes: 5 additions & 4 deletions pkg/scheduler/scheduler.go
@@ -282,10 +282,11 @@ func (sched *Scheduler) recordSchedulingFailure(pod *v1.Pod, err error, reason s
 	sched.config.Error(pod, err)
 	sched.config.Recorder.Event(pod, v1.EventTypeWarning, "FailedScheduling", message)
 	sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
-		Type:    v1.PodScheduled,
-		Status:  v1.ConditionFalse,
-		Reason:  reason,
-		Message: err.Error(),
+		Type:          v1.PodScheduled,
+		Status:        v1.ConditionFalse,
+		LastProbeTime: metav1.Now(),

@krmayankk commented:

Since the active queue takes lastProbeTime into account, and a pod that keeps being found unschedulable will always have the latest lastProbeTime, it will be farthest behind in the queue and hence least likely to be considered for scheduling? Is that the general idea?

After being marked as unschedulable, how often is it tried again?

If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still be updated for this condition, so it will still be the last in line to get scheduled compared to other pods. I am wondering if that will cause starvation?

@everpeace (Contributor, Author) commented on Jan 7, 2019:

> Is that the general idea?

Thank you for your clear explanation 🙇 Yes, it is. That's the idea.

> After being marked as unschedulable, how often is it tried again?

I think it depends on cluster status. MoveAllToActiveQueue() moves unschedulable pods from the unschedulable queue to the active queue, with per-pod backoff. The method is generally called when the scheduler detects that pod/node status has changed.

> If an unschedulable pod eventually becomes schedulable, the LastTransitionTime will still update for this condition

LastTransitionTime should be updated only when the condition status changes, by definition. This means that once the pod is marked as unschedulable, which creates a PodScheduled condition with Status=False on the pod status, LastTransitionTime shouldn't be updated until the condition status becomes True. That is why I added code that updates LastProbeTime when schedule() fails.

I think kubelet is responsible for setting the PodScheduled condition to Status=True under the current implementation.
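A rough sketch of the part of podutil.UpdatePodCondition that this reasoning relies on: when the condition status is unchanged, the old LastTransitionTime is carried over, so repeated scheduling failures only move LastProbeTime forward (a sketch of the general shape of that helper, not the verbatim source; GetPodCondition is assumed to be the sibling lookup helper in the same package):

```go
// Sketch of podutil.UpdatePodCondition's handling of LastTransitionTime.
func UpdatePodCondition(status *v1.PodStatus, condition *v1.PodCondition) bool {
	condition.LastTransitionTime = metav1.Now()
	// Look up any existing condition of the same type.
	conditionIndex, oldCondition := GetPodCondition(status, condition.Type)
	if oldCondition == nil {
		// First time this condition is set on the pod.
		status.Conditions = append(status.Conditions, *condition)
		return true
	}
	if condition.Status == oldCondition.Status {
		// Status did not change: keep the original transition time, so only
		// LastProbeTime advances on repeated unschedulable results.
		condition.LastTransitionTime = oldCondition.LastTransitionTime
	}
	status.Conditions[conditionIndex] = *condition
	// The real helper also reports whether anything actually changed.
	return true
}
```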

@krmayankk commented:

What I was trying to say is that this change optimizes for the case where there are a lot of unschedulable pods, and favors the scheduling of other pods in that case. Does it make the recovery of a pod that has been unschedulable for a while and just became schedulable slower than before, @everpeace @bsalamat, because it keeps getting pushed to the end of the queue by the constant updating of lastProbeTime?

@bsalamat (Member) commented:

@krmayankk I don't think so. If an unschedulable pod has higher priority, it will still get to the head of the queue even after this change. When it has the same priority as other pods, it is fair to put it behind other pods with the same priority after the scheduler has tried it and determined that it is unschedulable.

@krmayankk commented:

@bsalamat I was talking about the case where no priority is involved: all pods have the same or default priority. In that case the change is trying to avoid starvation of regular pods when a lot of unschedulable pods are present. How does it affect the recovery of the unschedulable pods once they finally become schedulable? Does this behavior change compared to without this change?
Note: Just trying to understand; the answer may be "no change". It depends on how the active queue is implemented.

@bsalamat (Member) commented:

With this change (and somewhat similarly after #71488), a pod that is determined unschedulable goes behind other similar-priority pods in the scheduling queue. Once pods become schedulable, they are processed in their order in the scheduling queue. So, depending on their location in the queue, they may get scheduled before or after same-priority pods.
In short, we don't expect further delays in scheduling unschedulable pods after this change.

+		Reason:        reason,
+		Message:       err.Error(),
 	})
 }
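To make the ordering effect discussed above concrete, here is a small, hypothetical standalone illustration (not part of this PR): it copies the patched podTimestamp logic locally, since the real function is unexported, and assumes the k8s.io/api and k8s.io/apimachinery modules are on the module path:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Local copy of the patched podTimestamp logic, for illustration only.
func podTimestamp(pod *v1.Pod) *metav1.Time {
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == v1.PodScheduled {
			c := &pod.Status.Conditions[i]
			if c.LastProbeTime.IsZero() {
				return &c.LastTransitionTime
			}
			return &c.LastProbeTime
		}
	}
	return &pod.CreationTimestamp
}

func main() {
	// A pod that has never been tried: its creation time is its timestamp.
	fresh := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:              "never-tried",
		CreationTimestamp: metav1.NewTime(time.Now().Add(-time.Minute)),
	}}
	// A pod that was just retried and failed again: LastProbeTime is newer.
	retried := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "retried",
			CreationTimestamp: metav1.NewTime(time.Now().Add(-time.Hour)),
		},
		Status: v1.PodStatus{Conditions: []v1.PodCondition{{
			Type:          v1.PodScheduled,
			Status:        v1.ConditionFalse,
			Reason:        v1.PodReasonUnschedulable,
			LastProbeTime: metav1.Now(),
		}}},
	}
	// With equal priority, the older timestamp wins, so "never-tried" would be
	// popped before the repeatedly failing pod.
	fmt.Println(podTimestamp(fresh).Before(podTimestamp(retried))) // true
}
```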
