KEP for co-scheduling plugin based on scheduler framework #1463

Closed (wants to merge 25 commits)
## Commits (25)

- 8e2ffff kep for coscheduling-plugin (denkensk, Jan 16, 2020)
- 29bfcc7 fix express (denkensk, Jan 19, 2020)
- 4f5b50b fix express (denkensk, Jan 19, 2020)
- 7644eeb fix express (denkensk, Jan 19, 2020)
- 380b7ba Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- 5873d2c Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- be06e52 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- 622b108 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- 9b48b36 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- 63f27a4 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- b4928e1 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Jan 21, 2020)
- 3e91600 update kep for coscheduling-plugin (denkensk, Jan 21, 2020)
- 922eff4 update kep for coscheduling-plugin (denkensk, Jan 21, 2020)
- 24b7fce Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Feb 7, 2020)
- bb7d968 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Feb 10, 2020)
- 696d49a coscheduling-plugin (denkensk, Feb 12, 2020)
- d887152 coscheduling-plugin (denkensk, Feb 18, 2020)
- 2e9b7c4 coscheduling-plugin (denkensk, Feb 26, 2020)
- 0c91efa coscheduling-plugin (denkensk, Feb 26, 2020)
- fa9b1ff Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Mar 11, 2020)
- f3321d8 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Mar 11, 2020)
- 95835c9 Update keps/sig-scheduling/20200116-coscheduling-plugin-based-on-sche… (denkensk, Mar 12, 2020)
- 56126ca coscheduling-plugin (denkensk, Mar 13, 2020)
- 3380577 coscheduling-plugin (denkensk, Mar 15, 2020)
- d0f33a9 coscheduling-plugin (denkensk, Mar 17, 2020)

Changes from 23 commits:
@@ -0,0 +1,165 @@
---
title: Coscheduling plugin based on scheduler framework
authors:
- "@denkensk"
owning-sig: sig-scheduling
reviewers:
- "@Huang-Wei"
- "@ahg-g"
- "@alculquicondor"
- "@k82cn"
- "@resouer"
- "@hex108"
> **Contributor:** Could you kindly add me to reviewers?

> **Member Author:** Done, thanks ~~

- "@everpeace"
approvers:
- "@Huang-Wei"
creation-date: 2020-01-16
last-updated: 2020-01-16
status: provisional
---

# Coscheduling plugin based on scheduler framework

## Table of Contents

<!-- toc -->
- [Coscheduling plugin based on scheduler framework](#coscheduling-plugin-based-on-scheduler-framework)
- [Table of Contents](#table-of-contents)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Use Cases](#use-cases)
- [Terms](#terms)
- [Proposal](#proposal)
- [Design Details](#design-details)
- [PodGroup](#podgroup)
- [Coscheduling](#coscheduling)
- [Extension points](#extension-points)
- [QueueSort](#queuesort)
- [Pre-Filter](#pre-filter)
- [Permit](#permit)
- [UnReserve](#unreserve)
- [Alternatives considered](#alternatives-considered)
- [Graduation Criteria](#graduation-criteria)
- [Testing Plan](#testing-plan)
- [Implementation History](#implementation-history)
- [References](#references)
<!-- /toc -->

## Motivation
Kubernetes has become a popular solution for orchestrating containerized workloads. Due to limitations of the Kubernetes scheduler, some offline workloads (ML/DL) are managed by a separate controller. To improve cluster utilization and operational efficiency, we'd like to treat Kubernetes as a unified management platform. But ML jobs are all-or-nothing: they require all tasks of a job to be scheduled at the same time. If a job starts only some of its tasks, those tasks will wait for the remaining ones before doing any work; in the worst case, all jobs end up pending, leading to a deadlock. To solve this problem, the scheduler needs co-scheduling, and the new scheduler framework makes this goal possible.

## Goals
1. Implement coscheduling with a scheduler plugin, which is the most Kubernetes-native way.
2. Provide a lightweight implementation of coscheduling that does not require a `PodGroup` CRD.

## Non-Goals
Using a `PodGroup` CRD - this design proposes a lightweight approach that doesn't impose CRD management.

## Use Cases
When running a TensorFlow/MPI job, all tasks must start before they can do any work. This becomes a bigger problem when several jobs compete to get all of their tasks started: each job starts only some of its tasks and waits for the others, so in the worst case all jobs are pending, leading to a deadlock.

## Terms

- **pgPod**: a Pod that belongs to some `PodGroup`.
- **regularPod**: a regular `Pod` (one that doesn't have the `PodGroupName` label set).

## Proposal

In order to implement coscheduling, we developed plugins at different extension points. In `QueueSort` we ensure that Pods belonging to the same PodGroup are queued back-to-back. For example, suppose PodGroup A owns Pod-A1, Pod-A2, Pod-A3, while PodGroup B owns Pod-B1, Pod-B2. The Pods of the two PodGroups should not interleave: the order should always be <PodGroup-A, PodGroup-B> or the other way around, but never <Pod-A1, Pod-B1, Pod-A2, ...>. In the `Permit` phase, we put a Pod whose PodGroup doesn't yet meet `minAvailable` into the WaitingMap and reserve its resources until `minAvailable` is met or the timeout is triggered. In the `Unreserve` phase, we clean up the Pods that timed out.
> **Member:** nit: Too much detail for QueueSort in this summary.
>
> Also, focus on the general idea. IIRC, only one pod group is processed at a time?

![image](./20200116-coscheduling-plugin-based-on-scheduler-framework-extensions.png)

## Design Details

### PodGroup

We use a special label named `pod-group.scheduling.sigs.k8s.io/name` to define a PodGroup. Pods that carry this label with the same value belong to the same PodGroup. This is a short-term solution; in the future, if the definition of `PodGroup` is accepted by the community, we will define it directly through a `PodGroup` CRD. That is not the focus of this proposal.

```yaml
labels:
  pod-group.scheduling.sigs.k8s.io/name: nginx
  pod-group.scheduling.sigs.k8s.io/min-available: "2"
```
`Pods` in the same `PodGroup` with different priorities might lead to unintended behavior, so we need to ensure that `Pods` in the same `PodGroup` have the same priority.

### Coscheduling
```go
// Coscheduling is a plugin that implements the mechanism of gang scheduling.
type Coscheduling struct {
    FrameworkHandle framework.FrameworkHandle
    PodLister       corelisters.PodLister
    // Key is the name of the PodGroup.
    PodGroupInfos map[string]PodGroupInfo
}

type PodGroupInfo struct {
    // InitialTimestamp stores the timestamp of the initialization time of the PodGroup.
    InitialTimestamp time.Time
    UID              types.UID
    MinAvailable     int
    Name             string
}
```

> **Comment:** So why not use Gang as the plugin name?

> **Member:** Coscheduling seems appropriate: https://en.wikipedia.org/wiki/Coscheduling

1. `PodGroupInfo` is initialized the first time a Pod belonging to the PodGroup is encountered (see the lazy-initialization sketch after this list).
> **Member:** Again, what is "fail to schedule"?

> **Member Author:** "Fails to schedule" means that the PodGroup fails in 2 places:
>
> 1. Waiting time timeout in the WaitingMap
> 2. The previous PodGroup is rejected by the second PodGroup in PreFilter
>
> I just wanted to define a timestamp to compare with the timestamp of PodInfo in QueueSort. But I found that if I update the LastFailureTimestamp, it may lead to undefined behavior in the heap. So I changed it to InitialTimestamp (InitialTimestamp stores the timestamp of the initialization time of the PodGroup) and compare it with the InitialAttemptTimestamp of PodInfo.

> **Member:** sg, please add these notes to the KEP.

> **Member Author:** Thanks ~~ I updated the KEP. @alculquicondor

> **Member:** This line seems out of date.

> **Member:** Please remove "and LastFailureTimestamp is updated every time the PodGroup fails to schedule."

> **Member Author:** Done

2. `InitialTimestamp` stores the timestamp of the initialization time of the PodGroup.
3. `UID` is the unique identification value used to distinguish different PodGroups.
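
A minimal sketch of the lazy initialization described in point 1, building on the `Coscheduling` and `PodGroupInfo` structs above. The helper name `getOrCreatePodGroupInfo` and the label parsing are illustrative assumptions, not part of the proposal.

```go
package coscheduling

import (
    "strconv"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/uuid"
)

const (
    podGroupNameLabel         = "pod-group.scheduling.sigs.k8s.io/name"
    podGroupMinAvailableLabel = "pod-group.scheduling.sigs.k8s.io/min-available"
)

// getOrCreatePodGroupInfo returns the PodGroupInfo for the Pod's PodGroup,
// creating it the first time a Pod of that PodGroup is encountered.
func (cs *Coscheduling) getOrCreatePodGroupInfo(pod *v1.Pod) PodGroupInfo {
    name := pod.Labels[podGroupNameLabel]
    if pgi, ok := cs.PodGroupInfos[name]; ok {
        return pgi
    }
    minAvailable, _ := strconv.Atoi(pod.Labels[podGroupMinAvailableLabel])
    pgi := PodGroupInfo{
        Name:             name,
        UID:              uuid.NewUUID(), // distinguishes PodGroups with identical timestamps
        MinAvailable:     minAvailable,
        InitialTimestamp: time.Now(), // first time this PodGroup is seen by the plugin
    }
    cs.PodGroupInfos[name] = pgi
    return pgi
}
```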


### Extension points

#### QueueSort
> **Contributor (@everpeace, Feb 7, 2020):** I'd like to discuss implementing the QueueSort extension point in the Coscheduling plugin, because only one QueueSort plugin is allowed in the scheduling framework.
>
> What if some user already has a plugin that implements the QueueSort extension point? I think we had better document how we support that case after discussion (how users can compose their plugin with the Coscheduling plugin).
>
> Alternative idea: if we don't implement a QueueSort plugin, we can make the timeout configurable so users can set longer timeout values for large pod groups. What do you think?

> **Member:** I think composing Less is something we can explore, but it's overkill for getting a prototype of Coscheduling.
>
> I prefer that we minimize our dependency on timeouts so that we can preserve good throughput.

> **Member (@k82cn, Feb 10, 2020):** Yes, we should not depend on timeouts, which impact the performance of the scheduler.

In order to maximize the chance that Pods belonging to the same `PodGroup` are scheduled consecutively, we need to implement a customized `QueueSort` plugin to sort the Pods properly.

**Limitation**: `QueueSort` is the core part of our design, and only one `QueueSort` plugin is allowed in the scheduling framework. So our design only supports the case where no other plugin implements the `QueueSort` extension point.

```go
func Less(podA *PodInfo, podB *PodInfo) bool
```

Firstly, we inherit the default in-tree PrioritySort plugin so as to honor `.spec.priority`, ensuring high-priority Pods are always sorted ahead of low-priority ones.

Secondly, if two Pods hold the same priority, the sorting precedence is as follows:

- If they are both regularPods (i.e., without the PodGroup label), compare their `InitialAttemptTimestamp` field: the Pod with the earlier `InitialAttemptTimestamp` is positioned ahead of the other.

- If one is a regularPod and the other is a pgPod, compare the regularPod's `InitialAttemptTimestamp` with the pgPod's `InitialTimestamp`: the Pod with the earlier timestamp is positioned ahead of the other.

- If they are both pgPods:
  - Compare the `InitialTimestamp` of their PodGroups: the Pod whose PodGroup has the earlier timestamp is positioned ahead of the other.
> **Member:** Have you thought of starvation? If two pod groups have the same priority, the first one to enter the queue will always have higher precedence than the other, or even than individual pods.
>
> This is different from the original behavior, where the scheduler tries to give the same scheduling opportunities to all pods. But, on the other hand, this might be something that we want for co-scheduling.
>
> Ref kubernetes/kubernetes#83834 (comment)
> cc @yqwang-ms

> **Comment:** Thanks for the cc.
>
> BTW, it seems co-scheduling by itself just needs "Gang Queuing", i.e.:
>
> > Pods belonging to the same PodGroup are queued back-to-back
>
> which does not necessarily have to be FIFO. (And FIFO might NOT be something that we want for co-scheduling; it is more that FIFO is a helper for implementing co-scheduling, so we coupled them.)
>
> Current K8s has the priority starvation issues mentioned in kubernetes/kubernetes#86373. E.g. an early large-scale pod group that keeps waiting will also cause a later small pod group to starve, even if the small pod group can fit into the free resources; from the user's view they have the same priority and only slightly different creation times.
>
> So, we would better decouple FIFO with co-scheduling, such as, for sometime, we move the whole early group behind later group, and for other times, we move the whole back.

> **Member:** > If two pod groups have the same priority, the first one to enter the queue will always have higher precedence than the other, or even individual pods.
>
> Technically, due to the limitation of the current global one-queue mechanism, it's possible with a large pod group, as it's pretty challenging to accommodate all precedence in one queue.
>
> > But, on the other hand, this might be something that we want for co-scheduling.
>
> Compared to randomly sorting PodGroup Pods (where wait() is more unpredictable), this approach seems to have more chances to avoid starvation. However, as I mentioned above, technically it still has chances.

> **Member:** > So, we would better decouple FIFO with co-scheduling, such as, for sometime, we move the whole early group behind later group, and for other times, we move the whole back.
>
> Could you rephrase this? Did you mean couple or decouple? What would be "other times"?

> **Comment:** I mean we would better decouple them given the current situation. For example, take a large early pod group A and a smaller later pod group B. For queuing order, could we give A a 70% chance of being ahead of B (so B still has a 30% chance of being ahead of A)? In this way we can at least mitigate or avoid starvation for pods of the same priority. Otherwise, same-priority pods also suffer under this proposal, right?

> **Comment:** I mean, can we simply queue the pods in a group as a gang/unit, instead of leveraging carefully tuned compare rules to make sure they are back-to-back (which has the side effect of being FIFO)?

> **Member Author:** > I mean, can we simply queue the pods in a group as a gang/unit, instead of leveraging carefully tuned compare rules to make sure they are back-to-back (which has the side effect of being FIFO)?
>
> I mentioned it in Alternatives considered. I think this requires major refactoring of the scheduler, which only supports Pods as the scheduling unit today. @yqwang-ms

  - If the `InitialTimestamp` of their PodGroups is identical, order by the UID of their PodGroup: the Pod with the lexicographically greater UID is scheduled ahead of the other. (The purpose is to tease apart different PodGroups with the same `InitialTimestamp`, while keeping Pods belonging to the same PodGroup back-to-back. A sketch of the resulting `Less` function follows the review notes below.)

**Note 1**: Different PodGroups have different `InitialTimestamp`s (and even if they were identical, their UIDs would differ), so when Pods enter the queue, the Pods that belong to the same PodGroup stay together.
> **Member:** This note seems unnecessary.

> **Member Author:** Done
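
To make the ordering above concrete, here is a minimal sketch of `Less`, assuming the `framework.PodInfo`-based signature shown earlier and continuing the illustrative `coscheduling` package from the previous sketch (it reuses `podGroupNameLabel` and `getOrCreatePodGroupInfo`; additional imports would be `time`, `k8s.io/apimachinery/pkg/types`, and the scheduler framework package). The helper names are assumptions for illustration, not the proposal's API.

```go
// Less orders Pods by priority first (as the default PrioritySort plugin does),
// then by timestamp (the PodGroup's InitialTimestamp for pgPods, the Pod's own
// InitialAttemptTimestamp for regularPods), and finally by UID so that Pods of
// the same PodGroup stay back-to-back in the queue.
func (cs *Coscheduling) Less(podA, podB *framework.PodInfo) bool {
    if pa, pb := podPriority(podA.Pod), podPriority(podB.Pod); pa != pb {
        return pa > pb
    }
    tsA, tsB := cs.groupTimestamp(podA), cs.groupTimestamp(podB)
    if !tsA.Equal(tsB) {
        return tsA.Before(tsB)
    }
    // Identical timestamps: compare UIDs to tease different PodGroups apart
    // while keeping Pods of the same PodGroup together.
    return cs.groupUID(podA) > cs.groupUID(podB)
}

// podPriority returns .spec.priority, defaulting to 0 when it is unset.
func podPriority(p *v1.Pod) int32 {
    if p.Spec.Priority != nil {
        return *p.Spec.Priority
    }
    return 0
}

// groupTimestamp returns the PodGroup's InitialTimestamp for a pgPod,
// or the Pod's own InitialAttemptTimestamp for a regularPod.
func (cs *Coscheduling) groupTimestamp(pi *framework.PodInfo) time.Time {
    if name := pi.Pod.Labels[podGroupNameLabel]; name != "" {
        return cs.getOrCreatePodGroupInfo(pi.Pod).InitialTimestamp
    }
    return pi.InitialAttemptTimestamp
}

// groupUID returns the PodGroup's UID for a pgPod, or the Pod's own UID otherwise.
func (cs *Coscheduling) groupUID(pi *framework.PodInfo) types.UID {
    if name := pi.Pod.Labels[podGroupNameLabel]; name != "" {
        return cs.getOrCreatePodGroupInfo(pi.Pod).UID
    }
    return pi.Pod.UID
}
```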


#### Pre-Filter
> **Member:** Instead of jumping straight into what this phase does, I'd suggest putting a paragraph explaining why we need this and the rough workflow. I suppose you're trying to say:
>
> > When a pgGroup Pod is being scheduled for the first time, we have 2 options to deal with it: either start the full scheduling cycle regardless of whether its sibling Pods are present in the scheduling queue, or fail its scheduling cycle quickly because the number of its sibling Pods doesn't meet minAvailable. The former is more efficient, but may cause some Pods to hold system resources until a timeout. The latter may result in extra scheduling cycles (even if we fail them fast), but reduces the overall scheduling time for the whole group, as we wait for all of them to be present in the queue before starting the full scheduling cycle for each.
> >
> > Here we're adopting the latter approach, blabla...

When a pgPod is being scheduled for the first time, we have 2 options to deal with it: either start the full scheduling cycle regardless of whether its sibling Pods are present in the scheduling queue, or fail its scheduling cycle quickly because the number of its sibling Pods doesn't meet `minAvailable`. The former is more efficient, but may cause some Pods to hold system resources until a timeout. The latter may result in extra scheduling cycles (even though we fail them fast), but reduces the overall scheduling time for the whole group, as we wait for all of them to be present in the queue before starting the full scheduling cycle for each.
> **Comment:** s/pgGroup Pod/pgPod ?

> **Member Author:** done


Here we're adopting the latter approach: `PreFilter` validates whether the total number of Pods belonging to the same `PodGroup` is less than `minAvailable`. If so, the scheduling cycle is interrupted directly. (A sketch follows the review notes below.)
> **Member:** Are we still removing the PreFilter?

> **Member Author:** I removed the mechanism to reject the previous PodGroup, and just kept the part where `PreFilter` validates whether the total number of Pods belonging to the same `PodGroup` is less than `minAvailable`. If so, the scheduling process is interrupted directly.

> **Member:** The total number of pods where? i.e. where does this counter get increased?

> **Member Author (@denkensk, Mar 17, 2020):** > the total number of pods where? i.e. where does this counter get increased?
>
> I get the total number of Pods belonging to the same PodGroup through `PodLister.List(selector)`, because they all have the same label `pod-group.scheduling.sigs.k8s.io/name`. @alculquicondor

> **Member:** That would bypass all the caches, which may lead to race conditions. Also, it moves pods to the "unschedulable" queue if the number is lower than min. Then there is no way to bring them back unless some condition in the cluster changes that makes the pod "more schedulable". I know there is the flush every one minute, which I want to get rid of. But you probably don't want the plugin to have to wait 1 min anyway.

> **Member:** > That lister doesn't include unscheduled pods, which these are.
>
> That's true.
>
> > That's not the case, because when a pod gets deleted, we bring back pods from the unschedulable queue to the active queue.
>
> But in both implementations, no (waiting) Pods get deleted - they just get moved from activeQ to unschedulableQ/backoffQ, or the other way around.

> **Member:** You said "Preempt plugin". I thought you were just referring to preemption. Are you referring to Reserve? In that case, pods move to the plugin's cache. They are neither in the activeQ nor the unschedulableQ.

> **Member:** Oops... my bad, typo, I meant "Permit" plugins.

> **Member:** No worries. So my comment still applies. We are not moving pods to the unschedulable queue when we tell pods to "wait".

> **Member:** True. My point was that, based on the current extension points, this seems to be a common limitation among all coscheduling plugins.
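
A minimal sketch of the PreFilter check described above, continuing the illustrative package from the earlier sketches (reusing `podGroupNameLabel` and `getOrCreatePodGroupInfo`; additional imports would be `context` and `k8s.io/apimachinery/pkg/labels`). Counting Pods via `PodLister.List` follows the author's description in the thread above; the exact signature shown is an assumption based on the v1alpha1 framework API.

```go
// PreFilter interrupts the scheduling cycle early when the total number of
// Pods belonging to the Pod's PodGroup is still below MinAvailable.
func (cs *Coscheduling) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {
    pgName := pod.Labels[podGroupNameLabel]
    if pgName == "" {
        return framework.NewStatus(framework.Success, "") // regularPod: nothing to check
    }
    pgi := cs.getOrCreatePodGroupInfo(pod)

    // Count all Pods carrying the same pod-group name label.
    selector := labels.Set{podGroupNameLabel: pgName}.AsSelector()
    pods, err := cs.PodLister.List(selector)
    if err != nil {
        return framework.NewStatus(framework.Error, err.Error())
    }
    if len(pods) < pgi.MinAvailable {
        return framework.NewStatus(framework.Unschedulable,
            "the total number of Pods in the PodGroup is less than MinAvailable")
    }
    return framework.NewStatus(framework.Success, "")
}
```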


#### Permit
In the `Permit` phase, we put a Pod whose PodGroup doesn't yet meet `MinAvailable` into the WaitingMap and reserve resources until `MinAvailable` is met or the timeout is triggered.
1. Get the number of Running Pods that belong to the same PodGroup.
2. Get the number of WaitingPods (used to record Pods in waiting status) that belong to the same PodGroup.
3. If Running + WaitingPods + 1 >= `MinAvailable` (the 1 is the Pod itself), approve the waiting Pods that belong to the same PodGroup. Otherwise, put the Pod into the WaitingPods and set the timeout (e.g. the timeout can be a dynamic value that depends on the size of the `PodGroup`). A sketch follows the review notes below.
> **Comment:** s/min-available/MinAvailable; the former is easily misunderstood as "min minus available".

> **Member Author:** done
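
To make the three steps above concrete, here is a minimal sketch of `Permit`, continuing the illustrative package from the earlier sketches. Counting Running Pods via the lister, the literal plugin name passed to `Allow`, and the group-size-based timeout are assumptions for illustration; `IterateOverWaitingPods`, `Allow`, and the `Wait` status code come from the scheduler framework API as we understand it.

```go
// Permit approves the whole group once Running + Waiting + 1 >= MinAvailable;
// otherwise the Pod is kept in the WaitingMap until the timeout.
func (cs *Coscheduling) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
    pgName := pod.Labels[podGroupNameLabel]
    if pgName == "" {
        return framework.NewStatus(framework.Success, ""), 0
    }
    pgi := cs.getOrCreatePodGroupInfo(pod)

    // 1. Count Running Pods of the same PodGroup.
    selector := labels.Set{podGroupNameLabel: pgName}.AsSelector()
    pods, err := cs.PodLister.List(selector)
    if err != nil {
        return framework.NewStatus(framework.Error, err.Error()), 0
    }
    running := 0
    for _, p := range pods {
        if p.Status.Phase == v1.PodRunning {
            running++
        }
    }

    // 2. Count Pods of the same PodGroup currently in the WaitingMap.
    waiting := 0
    cs.FrameworkHandle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
        if wp.GetPod().Labels[podGroupNameLabel] == pgName {
            waiting++
        }
    })

    // 3. Approve the whole group, or wait with a group-size-dependent timeout.
    if running+waiting+1 >= pgi.MinAvailable {
        cs.FrameworkHandle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
            if wp.GetPod().Labels[podGroupNameLabel] == pgName {
                wp.Allow("Coscheduling") // the plugin's registered name (assumed)
            }
        })
        return framework.NewStatus(framework.Success, ""), 0
    }
    // Illustrative dynamic timeout: scale a base duration by the group size.
    timeout := time.Duration(pgi.MinAvailable) * 10 * time.Second
    return framework.NewStatus(framework.Wait, ""), timeout
}
```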


#### UnReserve
After a Pod that belongs to a PodGroup times out in the `Permit` phase, `Unreserve` rejects all Pods that belong to the same PodGroup to avoid a long-term invalid reservation of resources. A minimal sketch follows.
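
The sketch below continues the illustrative package from the earlier sketches; `Reject` is the framework call that fails a waiting Pod, and the rejection message is an assumption.

```go
// Unreserve rejects every waiting Pod of the same PodGroup once one of its
// Pods has timed out in Permit, releasing the reserved resources.
func (cs *Coscheduling) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
    pgName := pod.Labels[podGroupNameLabel]
    if pgName == "" {
        return
    }
    cs.FrameworkHandle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
        if wp.GetPod().Labels[podGroupNameLabel] == pgName {
            wp.Reject("coscheduling: PodGroup timed out in Permit")
        }
    })
}
```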

## Alternatives considered
1. Using `PodGroup` as the scheduling unit. This requires major refactoring of the scheduler, which only supports Pods as the scheduling unit today.


> **Contributor:** Risks and Mitigations are also needed. For example, the current scheduler's throughput is around 100 pods per second; if PodGroup is enabled, some measures are needed to ease the pressure on the scheduler.

## Graduation Criteria

## Testing Plan
1. Add detailed unit and integration tests for workloads.
2. Add basic e2e tests, to ensure all components are working together.

> **Contributor:** (As someone who mainly contributes to SIG Docs) I'd like to see at least a sketch of the documentation requirements for feature graduation. Would that be OK?

## Implementation History
## References
- [Coscheduling in Kubernetes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
- [Schedule a group of pods all at once](https://github.com/kubernetes/kubernetes/issues/16845)
- [kubeflow/tf-operator: Prevent scheduling deadlocks](https://github.com/kubeflow/tf-operator/issues/165)
- [Added PodGroup Phase in Status](https://github.com/kubernetes-sigs/kube-batch/pull/533)