Add KEP for coscheduling based on CRD #42

Merged
merged 1 commit on Sep 11, 2020
130 changes: 130 additions & 0 deletions kep/42-podgroup-coscheduling/README.md
# Coscheduling based on PodGroup CRD

## Table of Contents

<!-- toc -->
- [Coscheduling based on PodGroup CRD](#coscheduling-based-on-podgroup-crd)
  - [Table of Contents](#table-of-contents)
  - [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
  - [Use Cases](#use-cases)
  - [Design Details](#design-details)
    - [PodGroup](#podgroup)
    - [Controller](#controller)
    - [Extension points](#extension-points)
      - [QueueSort](#queuesort)
      - [PreFilter](#prefilter)
      - [Permit](#permit)
      - [PostBind](#postbind)
    - [Known Limitations](#known-limitations)
<!-- /toc -->

## Motivation
Currently, the default Kubernetes scheduler cannot ensure that a group of pods is scheduled all together. In some scenarios this wastes resources, because the whole application cannot work when only part of its Pods are running, as with Spark jobs, TensorFlow jobs, and so on. This proposal aims to solve the issue by introducing a PodGroup CRD that does the heavy lifting of wiring a group of Pods together.
## Goals
1. Based on the scheduling framework, implement the co-scheduling feature.
2. Define a CRD named PodGroup to help schedule pods as a group.

## Non-Goals
Sorting jobs when they are submitted to a cluster. Currently, we can only sort based on pods.

## Use Cases
Batch workloads such as Spark jobs and TensorFlow jobs, whose pods have to run all together.

## Design Details

### PodGroup

We define a CRD named PodGroup to help with scheduling; its definition is as follows:
```go
// PodGroupSpec represents the template of a pod group.
type PodGroupSpec struct {
    // MinMember defines the minimal number of members/tasks needed to run the pod group;
    // if there are not enough resources to start all tasks, the scheduler
    // will not start any of them.
    MinMember uint32 `json:"minMember"`

    // MinResources defines the minimal resources needed by the members/tasks of the pod group;
    // if there are not enough resources to start all tasks, the scheduler
    // will not start any of them.
    MinResources *v1.ResourceList `json:"minResources,omitempty"`
```

> **Member:** How about if the pod group is made up of different types of pods, like ps and trainer in TensorFlow? How do we set `MinResources`? @cwdsuzhou
>
> **Member Author:** This is the v1alpha1 version. I could add that if needed and extend this CRD later, e.g.:
>
>     type PodGroup struct {
>         ...
>         Groups [map]PodGroup
>         ...
>     }
>
> **Contributor:** I think one PodGroup should describe a particular type of workload. For a TensorFlow application, a good model is to have PS and Worker reference their corresponding (sub-)PodGroups, and then have a (parent-)PodGroup wire those two (sub-)PodGroups together. (Not quite sure it would work; this needs a PoC.)
>
> **Reviewer:** I thought per #42 (comment) `MinResources` was going to be removed?

```go
    // MaxScheduleTime defines the maximal time for the members/tasks of the pod group to wait before being scheduled.
    MaxScheduleTime *metav1.Duration `json:"maxScheduleTime,omitempty"`
}

// PodGroupStatus represents the current state of a pod group.
type PodGroupStatus struct {
    // Current phase of PodGroup.
    Phase PodGroupPhase `json:"phase"`

    // OccupiedBy marks the UID of the workload (e.g., Deployment, StatefulSet) that occupies the pod group.
    // It is empty if not initialized.
    OccupiedBy string `json:"occupiedBy,omitempty"`
```

> **Reviewer:** As in, owned, like `OwnerReference`?
>
> **Member Author:** Yes.
>
> **Reviewer:** I think it would be more intuitive to use the more common terminology?
>
> **Contributor:** The canonical `meta.OwnerReference` may be used to exercise the concept of hierarchical PodGroups, i.e., a child PodGroup "ownedBy" a parent PodGroup, so using the term "OccupiedBy" in status distinguishes it from those semantics IMO; it represents the PodGroup <-> workload mapping. @cwdsuzhou may comment further.
>
> **Member Author:** Yep, this field is used to record the related object, e.g., a workload or CRD, but it is not exactly the same as an owner reference.

```go
    // The number of pods that have been scheduled.
    // +optional
    Scheduled uint32 `json:"scheduled"`

    // The number of actively running pods.
    // +optional
    Running uint32 `json:"running"`

    // The number of pods which reached phase Succeeded.
    // +optional
    Succeeded uint32 `json:"succeeded"`

    // The number of pods which reached phase Failed.
    // +optional
    Failed uint32 `json:"failed"`

    // ScheduleStartTime is the time when scheduling of the pod group started.
    ScheduleStartTime metav1.Time `json:"scheduleStartTime"`
}
```

### Controller

We define a controller to reconcile the PodGroup status, so the job status can be queried by describing the PodGroup. Once a pod in a group fails, the group status is marked Failed. The controller also helps recover from abnormal cases, e.g. when batch scheduling is interrupted by a cluster upgrade.
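
To make the reconciliation step concrete, here is a minimal sketch of how a controller could derive the PodGroup phase from the pod counts it observes. The phase names and the `derivePhase` helper are assumptions made for illustration only; the actual controller and `PodGroupPhase` values may differ.

```go
package main

import "fmt"

// PodGroupPhase values below are hypothetical, simplified names for illustration.
type PodGroupPhase string

const (
    PodGroupPending   PodGroupPhase = "Pending"
    PodGroupScheduled PodGroupPhase = "Scheduled"
    PodGroupRunning   PodGroupPhase = "Running"
    PodGroupFinished  PodGroupPhase = "Finished"
    PodGroupFailed    PodGroupPhase = "Failed"
)

// derivePhase shows one possible way to derive the PodGroup phase from
// observed pod counts; minMember comes from PodGroupSpec.MinMember.
func derivePhase(minMember, scheduled, running, succeeded, failed uint32) PodGroupPhase {
    switch {
    case failed > 0:
        // Any failed pod marks the whole group Failed, as described above.
        return PodGroupFailed
    case succeeded >= minMember:
        return PodGroupFinished
    case running >= minMember:
        return PodGroupRunning
    case scheduled >= minMember:
        return PodGroupScheduled
    default:
        return PodGroupPending
    }
}

func main() {
    fmt.Println(derivePhase(3, 3, 2, 0, 0)) // Scheduled
}
```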

### Extension points

#### QueueSort

To make sure the pods of a group can be scheduled as soon as possible, we implement this extension point. Pods in the scheduling queue are compared by the following criteria, in order:

1. Pod priority
2. Whether the pod belongs to a PodGroup
3. PodGroup creation time
4. Pod creation time
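
The ordering above can be sketched roughly as follows. The `podInfo`/`groupInfo` types, the `keyTime` helper, and the `less` function are simplified stand-ins invented for illustration; the real plugin implements the scheduling framework's QueueSort interface against actual Pod and PodGroup objects.

```go
package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins for the real Pod/PodGroup objects, for illustration only.
type podInfo struct {
    name      string
    priority  int32
    createdAt time.Time
    group     *groupInfo // nil if the pod does not belong to any PodGroup
}

type groupInfo struct {
    name      string
    createdAt time.Time
}

// keyTime returns the timestamp used for ordering: the PodGroup's creation time
// when the pod belongs to a group, otherwise the pod's own creation time.
func keyTime(p podInfo) time.Time {
    if p.group != nil {
        return p.group.createdAt
    }
    return p.createdAt
}

// less sketches the QueueSort comparison described above:
// higher priority first, then earlier key time, then earlier pod creation time.
func less(a, b podInfo) bool {
    if a.priority != b.priority {
        return a.priority > b.priority
    }
    ta, tb := keyTime(a), keyTime(b)
    if !ta.Equal(tb) {
        return ta.Before(tb)
    }
    return a.createdAt.Before(b.createdAt)
}

func main() {
    g := groupInfo{name: "pg-spark", createdAt: time.Now().Add(-time.Minute)}
    p1 := podInfo{name: "spark-exec-1", priority: 0, createdAt: time.Now(), group: &g}
    p2 := podInfo{name: "standalone", priority: 0, createdAt: time.Now().Add(-30 * time.Second)}
    fmt.Println(less(p1, p2)) // true: the group's earlier creation time wins
}
```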

#### PreFilter

This extension pre-filters pods to save scheduling cycles, which is especially helpful when there are not enough resources in the cluster. The overall flow is as follows:

1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.
> **Reviewer:** How is this "pod belonging to a pod group" relationship expressed?
>
> **Contributor:**
> - Pod to PodGroup: the pod gets a PodGroup label or annotation.
> - PodGroup to pod: either implicitly by dynamic computation, or via the PodGroup's `OccupiedBy`, which refers to the controller of the group of pods (i.e., a Deployment or StatefulSet).
>
> @cwdsuzhou can confirm.


2. If there are no other pending pod groups (say, all other pod groups have already been scheduled), we allow the pod when its resource requirements are satisfied.
3. If there are other pod groups that are not fully scheduled, we check whether the pod group the current pod belongs to is the closest to completion, i.e., we measure its progress as `len(waiting pods of the pod group) / the pod group's minMember`.
   - If it is, this pod is allowed.
   - If it is not, we check whether allowing it would still let the most-progressed pod group (or, possibly, all in-progress pod groups) be scheduled:
     - if so, this pod is allowed;
     - if not, we check whether the current pod has a higher priority than the most-progressed pod group (for convenience, this likely still requires a `PriorityClass` defined in the PodGroup). If it does, this pod is allowed and the most-progressed pod group is rejected.

For any pod that gets rejected, its pod group is added to a backoff list and retried until a TTL expires.
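
To illustrate only the "closest to completion" check in step 3, here is a rough sketch with a hypothetical `groupProgress` view and a hypothetical `shouldAllow` helper; the priority comparison and backoff handling described above are intentionally left out.

```go
package main

import "fmt"

// groupProgress is a hypothetical, simplified view of a pending pod group.
type groupProgress struct {
    name        string
    waitingPods int // pods of this group already waiting in Permit
    minMember   int // PodGroupSpec.MinMember
}

// progress measures how close a pod group is to completion,
// i.e. len(waiting pods of the pod group) / minMember.
func progress(g groupProgress) float64 {
    return float64(g.waitingPods) / float64(g.minMember)
}

// shouldAllow returns true when the incoming pod's group is the most-progressed
// of all pending groups, which is the first branch of step 3 above.
func shouldAllow(podGroup groupProgress, pending []groupProgress) bool {
    p := progress(podGroup)
    for _, g := range pending {
        if g.name == podGroup.name {
            continue
        }
        if progress(g) > p {
            return false // another group is closer to completion
        }
    }
    return true
}

func main() {
    mine := groupProgress{name: "pg-a", waitingPods: 2, minMember: 3}
    others := []groupProgress{{name: "pg-b", waitingPods: 1, minMember: 4}}
    fmt.Println(shouldAllow(mine, others)) // true: pg-a (2/3) is ahead of pg-b (1/4)
}
```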

#### Permit

1. When the number of waiting pods in a PodGroup is less than `minMember` (defined in the PodGroup), the status `Wait` is returned and the pods are added to the waiting cache with a TTL equal to `MaxScheduleTime`.
2. When the number is equal to or greater than `minMember`, a signal is sent to permit all the waiting pods of the group.

We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole group is rejected.
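
A simplified sketch of this decision, assuming the caller already knows how many pods of the group are waiting (including the current one); the `permitDecision` type and `permit` helper are illustrative stand-ins for the framework's Wait/Success statuses and for the signal that releases the waiting pods.

```go
package main

import (
    "fmt"
    "time"
)

// permitDecision is a simplified stand-in for the framework's Permit status.
type permitDecision struct {
    allow   bool          // true: permit the pod (and release its group mates)
    wait    bool          // true: keep the pod waiting
    timeout time.Duration // TTL for the waiting pod, equal to MaxScheduleTime
}

// permit sketches the two Permit rules above: wait while the group has fewer
// waiting pods than minMember, allow once minMember is reached.
func permit(waitingInGroup, minMember int, maxScheduleTime time.Duration) permitDecision {
    if waitingInGroup < minMember {
        return permitDecision{wait: true, timeout: maxScheduleTime}
    }
    return permitDecision{allow: true}
}

func main() {
    // Two of three required pods are waiting so far: keep waiting.
    fmt.Printf("%+v\n", permit(2, 3, 30*time.Second))
    // The third pod arrives: permit the whole group.
    fmt.Printf("%+v\n", permit(3, 3, 30*time.Second))
}
```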

#### PostBind

This extension is primarily used to record the PodGroup status. When a pod is bound successfully, we update the status of its affiliated PodGroup.
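
Purely as an illustration of that bookkeeping, the sketch below bumps an in-memory `Scheduled` counter per PodGroup; the `statusBook` type is an assumption, and a real plugin would instead patch the PodGroup object's status through the API server.

```go
package main

import (
    "fmt"
    "sync"
)

// statusBook is a hypothetical in-memory stand-in for PodGroupStatus updates.
type statusBook struct {
    mu        sync.Mutex
    scheduled map[string]uint32 // PodGroup name -> Scheduled count
}

// postBind records that one more pod of the given PodGroup has been bound.
func (b *statusBook) postBind(podGroup string) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.scheduled[podGroup]++
}

func main() {
    b := &statusBook{scheduled: map[string]uint32{}}
    b.postBind("pg-spark")
    b.postBind("pg-spark")
    fmt.Println(b.scheduled["pg-spark"]) // 2
}
```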

### Known Limitations

1. We cannot support group preemption well yet. Although we have tried to implement the relevant extensions, they still cannot meet production requirements.
13 changes: 13 additions & 0 deletions kep/42-podgroup-coscheduling/kep.yaml
title: Coscheduling based on PodGroup CRD
kep-number: 42
authors:
- "@cwdsuzhou"
owning-sig: sig-scheduling
reviewers:
- "@Huang-Wei"
- "@denkensk"
approvers:
- "@Huang-Wei"
creation-date: 2020-08-24
last-updated: 2020-08-24
status: implementable