From d75f27cf5d02a9d7634431f71629137942e7c399 Mon Sep 17 00:00:00 2001
From: wadecai
Date: Wed, 2 Sep 2020 10:15:31 +0800
Subject: [PATCH] Change README according to round2 suggestions

Fix readme.md accroding to suggestions
---
 kep/42-podgroup-coscheduling/README.md | 31 ++++++++++++++------------
 kep/42-podgroup-coscheduling/kep.yaml  |  7 +-----
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/kep/42-podgroup-coscheduling/README.md b/kep/42-podgroup-coscheduling/README.md
index a7519e269..0304f13a8 100644
--- a/kep/42-podgroup-coscheduling/README.md
+++ b/kep/42-podgroup-coscheduling/README.md
@@ -37,7 +37,7 @@ Batch workloads such as Spark jobs, TensorFlow jobs that have to run altogether.
 ### PodGroup
 We define a CRD name PodGroup to help schedule, its definition is as follows:
 
-```build
+```go
 // PodGroupSpec represents the template of a pod group.
 type PodGroupSpec struct {
     // MinMember defines the minimal number of members/tasks to run the pod group;
@@ -45,6 +45,11 @@ type PodGroupSpec struct {
     // will not start anyone.
     MinMember uint32 `json:"minMember"`
 
+    // MinResources defines the minimal resource of members/tasks to run the pod group;
+    // if there's not enough resources to start all tasks, the scheduler
+    // will not start anyone.
+    MinResources *v1.ResourceList `json:"minResources,omitempty"`
+
     // MaxScheduleTime defines the maximal time of members/tasks to wait before run the pod group;
     MaxScheduleTime *metav1.Duration `json:"maxScheduleTime,omitempty"`
 }
@@ -54,8 +59,8 @@ type PodGroupStatus struct {
     // Current phase of PodGroup.
     Phase PodGroupPhase `json:"phase"`
 
-    // OccupiedBy marks the podgroup occupied by which group.
-    // Owner reference would be used to filled it, if not initialize, it is empty
+    // OccupiedBy marks the workload (e.g., deployment, statefulset) UID that occupy the podgroup.
+    // It is empty if not initialized.
     OccupiedBy string `json:"occupiedBy,omitempty"`
 
     // The number of actively running pods.
@@ -81,7 +86,7 @@ type PodGroupStatus struct {
 
 ### Controller
 
-We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Onece a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to
+We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Once a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to
 cluster upgrade.
 
 ### Extension points
@@ -97,28 +102,26 @@ To make sure a group of pods can be scheduled as soon as possible. We implemente
 
 #### PreFilter
 
-This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:
+This extension pre-filters pods to save scheduling cycles. This is especially helpful when there are not enough resources in a cluster. The overall flow works as below:
 
-1. Allow the pods that do not belong to any group.
-2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
+1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.
+2. If there are no other pending pod groups - say all other pod groups have already been scheduled, we allow the pod when its resource requirement is satisfied.
 3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
 4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
 5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
 
-Any pod rejected to run, their group would be added to a denied list with a ttl.
+For any pod that gets rejected, their pod group would be added to a backoff list and get retried until a TTL is met.
 
 #### Permit
 
-1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
-2. When number meet that, we would send a signal to permit the pods waiting.
+1. When the number of waiting pods in a PodGroup is less than `minMember` (defined in the PodGroup), the status `Wait` is returned. They will be added to cache with TLL (equal to MaxScheduleTime).
+2. When the number is equal or greater than `minMember`, send a signal to permit the waiting pods.
 
-We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
+We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole group would be rejected.
 
 #### PostBind
 
-This extension is mainly used for helping record the PodGroup Status. When pod binds successfully, we would update the scheduling status of a PodGroup.
-
-We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
+This extension is primarily used to record the PodGroup Status. When a pod is bound successfully, we would update the status of its affiliated PodGroup.
 
 ### Known Limitations
 
diff --git a/kep/42-podgroup-coscheduling/kep.yaml b/kep/42-podgroup-coscheduling/kep.yaml
index 799fb6f6d..295495a4d 100644
--- a/kep/42-podgroup-coscheduling/kep.yaml
+++ b/kep/42-podgroup-coscheduling/kep.yaml
@@ -5,12 +5,7 @@ authors:
 owning-sig: sig-scheduling
 reviewers:
   - "@Huang-Wei"
-  - "@ahg-g"
-  - "@alculquicondor"
-  - "k82cn"
-  - "@resouer"
-  - "@hex108"
-  - "@everpeace"
+  - "@denkensk"
 approvers:
   - "@Huang-Wei"
 creation-date: 2020-08-24