Add kep for coscheduling base on CRD #42
Conversation
Welcome @cwdsuzhou!

Hi @cwdsuzhou. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected by the label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks @cwdsuzhou for contributing Tencent's coscheduling plugins. I believe this will help grow the
ecosystem of Scheduler Framework a lot.
First comment is on KEP organization, let's make the KEP part a separate PR, and we will discuss the design details there.
kep/podgroup-coscheduling/README.md (outdated):

> 1. Allow the pods do not belong to any group.
> 2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
> 3. If there are groups running, we check if the current pod belong the max finished group, if it is, we allow it.
What's the "max finished group"?
@cwdsuzhou
The group who has the max progress.
Please explain `max` in the document.
@cwdsuzhou Thanks for your contribution. The definition of PodGroup is very important for coscheduling. After reviewing the KEP, I find its key points are similar to the previous one (https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/2-lightweight-coscheduling), like queueSort and Permit. Can we consider supporting PodGroup based on the original coscheduling implementation instead of reimplementing a plugin? Looking forward to your reply.
Yes, that is similar. But we consider more about resources, and we also add another controller to reconcile the PodGroup status. Actually, our coscheduling finished development in Nov. 2019. Though the current coscheduling has a similar implementation, it may not cover some scenes or may have bad performance under some scenes. We have run the same integration test cases. The result is as follows:
I think it's OK to reconcile the PodGroup status in a controller and pre-check or reserve the resource like Volcano. These features are not in conflict with the current co-scheduling. @cwdsuzhou Can you share the scenes not covered by the current co-scheduling?
There are 300 resources in a cluster, and a job requires 3 pods, with 150 resources requested per pod. 2 pods would be scheduled, but the job cannot actually run. If another job arrives, consisting of 3 pods with 100 resources required per pod, this job would not run immediately either, because resources have been occupied by the first job.
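The deadlock described here can be sketched with a toy single-resource model. All names and the model itself are hypothetical, for illustration only, not code from the plugin:

```go
package main

import "fmt"

// canRunGroup reports whether a cluster with the given free capacity can host
// all minMember pods of a group at once, under a single-resource model.
func canRunGroup(capacity, perPodRequest, minMember int) bool {
	return capacity/perPodRequest >= minMember
}

func main() {
	// Job A: 3 pods x 150 units in a 300-unit cluster: only 2 pods fit,
	// so the group can never fully run, yet its 2 placed pods hold 300 units.
	fmt.Println(canRunGroup(300, 150, 3)) // false

	// Job B: 3 pods x 100 units would fit in an empty cluster, but is
	// blocked because Job A's partial placement already consumed everything.
	fmt.Println(canRunGroup(300, 100, 3)) // true
}
```

A resource pre-check at PreFilter time would reject Job A's pods up front, leaving the cluster free for Job B.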
This is a cluster where resources are highly utilized. The resources occupied by the first job will be released after the timeout. And we can also add more policies, like a resource pre-check in PreFilter. I don't think there is any problem that can't be solved without reimplementing the coscheduling plugins.
kep/podgroup-coscheduling/README.md (outdated):

> // If specified, indicates the PodGroup's priority. "system-node-critical" and
> // "system-cluster-critical" are two special keywords which indicate the
> // highest priorities with the former being the highest priority. Any other
> // name must be defined by creating a PriorityClass object with that name.
Does PriorityClass mean another CRD or something else? Can you introduce more about this type?
It's the name of a PriorityClass, the same as the default in K8s. But it is not used now; I am considering whether I should remove it.
kep/podgroup-coscheduling/README.md (outdated):

> // MinResources defines the minimal resource of members/tasks to run the pod group;
> // if there's not enough resources to start all tasks, the scheduler
> // will not start anyone.
> MinResources *v1.ResourceList `json:"minResources,omitempty"`
One pod's minimal resource or one PodGroup's minimal resource?
pod
To be honest, I don't quite get the point of `MinResources`. The resource request is just one dimension of a PodGroup's constraints; how can you ensure all Pods of a PodGroup can be scheduled successfully by only looking at its resource requests?
> pod

But the comments said it's for all tasks.
Yes, this seems to be unnecessary. Actually, we did not set it in our current implementation; the value is derived from the Pod. I would remove it.
First round of review.
kep/podgroup-coscheduling/README.md (outdated):

> @@ -0,0 +1,140 @@
> # Coscheduling based on PodGroup CRD

Let's rename this folder to include the issue number, i.e., `kep/42-podgroup-coscheduling`.
kep/podgroup-coscheduling/README.md (outdated):

> ## Motivation
> Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods scheduled at the same time . Under some scene, it would waste resources since some pods need work together, like spark, tensorflow and so on . So, podgroup-coscheduling is aimed at solving the issue.

Suggested change:

- Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods scheduled at the same time . Under some scene, it would waste resources since some pods need work together, like spark, tensorflow and so on . So, podgroup-coscheduling is aimed at solving the issue.
+ Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods can be scheduled altogether. Under some scenes, it would waste resources since the whole application cannot work with only partial Pods' running, like Spark jobs, TensorFlow jobs, and so on. This proposal is aimed at solving the issue, by introducing a PodGroup CRD to do the heavy lifting on wiring a group of Pods.
kep/podgroup-coscheduling/README.md (outdated):

> ## Use Cases
> Spark jobs, tensorflow jobs and other pods have to run together.

Suggested change:

- Spark jobs, tensorflow jobs and other pods have to run together.
+ Batch workloads such as Spark jobs, TensorFlow jobs that have to run altogether.
kep/podgroup-coscheduling/README.md (outdated):

> 2. Define a CRD name PodGroup to help pod scheduling.
>
> ## Non-Goals
> Sort the job when we submit jobs to a cluster. Currently, we can only do this base on pods.

Is "job" here the official k8s `Job`, or a general term for a batch job? And we didn't mention "sort" in previous paragraphs, so "sort the job" sounds a bit abrupt.
a general term of batch job
kep/podgroup-coscheduling/README.md (outdated):

> ## Goals
> 1. Base on scheduling framework, implementing the gang scheduling.

Suggested change:

- 1. Base on scheduling framework, implementing the gang scheduling.
+ 1. Base on the scheduling framework, implement the co-scheduling feature.
kep/podgroup-coscheduling/README.md (outdated):

> // OccupiedBy marks the podgroup occupied by which group.
> // Owner reference would be used to filled it, it not initialize, it is empty
> OccupiedBy string `json:"occupiedBy,omitempty"`

Are you trying to describe a hierarchical relation here? For example, a PodGroup is owned/referred by another PodGroup? If so, you can inject the relation into `ObjectMeta.OwnerReferences`, so you don't need to define a new API field.
No, in our current implementation this is just the UID of a Deployment, StatefulSet or other CR, e.g. a TF job or an MPI job.
kep/podgroup-coscheduling/README.md (outdated):

> ### Controller
>
> We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to

- ... through describing ...
- ... Once a pod in a group failed ... is marked as Failed...
- ... Controller would also help to recover from ...
- interpreted -> interrupted?
kep/podgroup-coscheduling/README.md (outdated):

> #### PreFilter
>
> This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:

Suggested change:

- This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:
+ This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:
kep/podgroup-coscheduling/README.md (outdated):

> This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:
>
> 1. Allow the pods do not belong to any group.

Suggested change:

- 1. Allow the pods do not belong to any group.
+ 1. Allow the pods that do not belong to any group.
kep/podgroup-coscheduling/README.md (outdated):

> 1. Allow the pods do not belong to any group.
> 2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
> 3. If there are groups running, we check if the current pod belong the max finished group, if it is, we allow it.

Please explain `max` in the document.

Thanks for the review, I will update this KEP according to your kind suggestions.
More comments.
> @@ -70,7 +55,7 @@ type PodGroupStatus struct {
> 	Phase PodGroupPhase `json:"phase"`
>
> 	// OccupiedBy marks the podgroup occupied by which group.
> -	// Owner reference would be used to filled it, it not initialize, it is empty
> +	// Owner reference would be used to filled it, if not initialize, it is empty

Suggested change:

- // Owner reference would be used to filled it, if not initialize, it is empty
+ // Owner reference would be used to fill it. It's empty if not initialized.
> // MinResources defines the minimal resource of members/tasks to run the pod group;
> // if there's not enough resources to start all tasks, the scheduler
> // will not start anyone.
> MinResources *v1.ResourceList `json:"minResources,omitempty"`

If MinResources is needed for the implementation, I think it can be kept. IIRC this serves as a preFilter optimization, right?
In our original version, we fill this field from the pod spec. Keeping it is also OK.
If this is defined, we would use it for pre-filtering. If not, we compute the resource requirements from the Pod spec.
I'm confused about how to set `MinResources`. Is it the total minimal resource of the PodGroup, or of a single pod?
Yes
> @@ -96,7 +81,7 @@ type PodGroupStatus struct {
>
> ### Controller
>
> -We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to
> +We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Onece a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to

Onece -> Once
This comment is still outstanding.
> This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:
>
> 1. Allow the pods that do not belong to any group.

Suggested change:

- 1. Allow the pods that do not belong to any group.
+ 1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.
> #### Permit
>
> 1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
> 2. When number meet that, we would send a signal to permit the pods waiting.

Suggested change:

- 2. When number meet that, we would send a signal to permit the pods waiting.
+ 2. When the number is equal or greater than `minMember`, send a signal to permit the waiting pods.
> 1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
> 2. When number meet that, we would send a signal to permit the pods waiting.
>
> We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
FYI: it's now defined in the latest codebase, and will be available in the `master` branch once #45 gets reviewed.
I think we should keep this. Our original version supports setting it from both args and the CRD. If the CRD does not set the time, the value defined in args will take effect.

The scene is as below: a job needing 10 pods may want to time out after 5s, but a job needing 10000 pods would like it to be 5min.
I think it is better not to let the user set the `MaxScheduleTime`. Users will prefer to set the `MaxScheduleTime` greater than needed in order to reserve the resource. This will lead to the resource being reserved for a long time.

> A job needing 10 pods may want to time out after 5s, but a job needing 10000 pods would like it to be 5min.

We talked about it before. We can use a factor to compute the `WaitingTime`. The factor can be related to the number of pods. I added a TODO in the code before.
But for different clusters, the factor would also be different, e.g. 100 nodes vs. 2000 nodes.
Yes, the size of the cluster is also a factor to be considered along with the number of pods. This can be discussed.

And I am still worried about it. @Huang-Wei @cwdsuzhou

> I think it is better not to let the user set the MaxScheduleTime. Users will prefer to set the MaxScheduleTime greater than needed to reserve the resource. This will lead the resource to be reserved for a long time.
> 1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
> 2. When number meet that, we would send a signal to permit the pods waiting.
>
> We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.

Suggested change:

- We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
+ We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole pod group would be rejected.
> This extension is mainly used for helping record the PodGroup Status. When pod binds successfully, we would update the scheduling status of a PodGroup.
>
> We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.

Suggested change:

- We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
+ We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole group would be rejected.
> - "@Huang-Wei"
> - "@ahg-g"
> - "@alculquicondor"
> - "k82cn"
> - "@resouer"
> - "@hex108"
> - "@everpeace"

Remove names who are not reviewing.
/ok-to-test

@Huang-Wei The PR has been addressed according to your latest suggestions, PTAL, thanks.
/cc @denkensk |
> #### PreFilter
>
> This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:

This comment is still outstanding. "helps ... useful ..." is a bit verbose and doesn't add extra info to make the words more understandable. Can you rephrase it?
> This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:
>
> 1. Allow the pods that do not belong to any group.
> 2. If there are no groups scheduling, we check resource, if enough, we allow the pod.

This comment is still outstanding.
> @@ -96,7 +81,7 @@ type PodGroupStatus struct {
>
> ### Controller
>
> -We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to
> +We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Onece a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to
This comment is still outstanding.
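The controller rule quoted above ("once a pod in a group failed, the Group Status is marked Failed") can be sketched as a pure helper. The types are minimal stand-ins for the k8s API types, not the controller's actual reconcile loop:

```go
package main

import "fmt"

// PodPhase is a minimal stand-in for v1.PodPhase.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// groupPhase applies the rule from the quoted paragraph: once any pod in the
// group has failed, the whole PodGroup is marked Failed.
func groupPhase(pods []PodPhase) string {
	for _, p := range pods {
		if p == PodFailed {
			return "Failed"
		}
	}
	return "Running"
}

func main() {
	fmt.Println(groupPhase([]PodPhase{PodRunning, PodFailed})) // Failed
}
```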
> 3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
> 4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
> 5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
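The "max progress" metric in step 3 above (num(Pods)/minMember) can be sketched as below; the group names and the selection helper are hypothetical:

```go
package main

import "fmt"

// progress is the metric named in step 3: the ratio of already-scheduled
// pods to the group's minMember.
func progress(scheduled, minMember int) float64 {
	return float64(scheduled) / float64(minMember)
}

// maxProgressGroup picks the group with the highest progress; its pods are
// the ones PreFilter lets through first.
func maxProgressGroup(scheduled, minMember map[string]int) string {
	best, bestP := "", -1.0
	for name, n := range scheduled {
		if p := progress(n, minMember[name]); p > bestP {
			best, bestP = name, p
		}
	}
	return best
}

func main() {
	scheduled := map[string]int{"groupA": 2, "groupB": 5}
	minMember := map[string]int{"groupA": 3, "groupB": 10}
	fmt.Println(maxProgressGroup(scheduled, minMember)) // groupA (2/3 > 5/10)
}
```

Note the reviewer's point below: a ratio alone can mislead, so the absolute number of pods still missing may also be worth considering.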
@cwdsuzhou does the rewording comment https://github.com/kubernetes-sigs/scheduler-plugins/pull/42/files#r481413128 make sense to you?
> 3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
> 4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
> 5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
Regarding @denkensk 's concern, I'd leave it for @cwdsuzhou to comment. Probably it's not a bad idea to also look at the absolute number to reach completion.
> 4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
> 5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
>
> Any pod rejected to run, their group would be added to a denied list with a ttl.

This comment is still outstanding.
> #### Permit
>
> 1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).

This comment is still outstanding.
> #### PostBind
>
> This extension is mainly used for helping record the PodGroup Status. When pod binds successfully, we would update the scheduling status of a PodGroup.
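The PostBind bookkeeping quoted above can be sketched as follows. The field and phase names are assumptions for illustration, standing in for the CRD's status sub-resource:

```go
package main

import "fmt"

// PodGroupStatus is a minimal stand-in for the CRD's status sub-resource.
type PodGroupStatus struct {
	Scheduled int32
	Phase     string
}

// onPostBind bumps the scheduled counter after a successful bind and, once
// minMember is met, records the whole group as scheduled.
func onPostBind(s PodGroupStatus, minMember int32) PodGroupStatus {
	s.Scheduled++
	if s.Scheduled >= minMember {
		s.Phase = "Scheduled"
	}
	return s
}

func main() {
	s := PodGroupStatus{Scheduled: 2, Phase: "Scheduling"}
	fmt.Println(onPostBind(s, 3).Phase) // Scheduled
}
```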
This comment is still outstanding.
Some final comments. We're close to getting it merged.
> @@ -0,0 +1,13 @@
> title: Coscheduling based on PodGroup CRD
> kep-number: 3

Suggested change:

- kep-number: 3
+ kep-number: 42

As we're using the issue number as the KEP number.
> 3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
> 4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
> 5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
Please see if the wording in https://github.com/kubernetes-sigs/scheduler-plugins/pull/42/files#r481413128 makes more sense to you.
Sure, I missed this last time.
> // MinResources defines the minimal resource of members/tasks to run the pod group;
> // if there's not enough resources to start all tasks, the scheduler
> // will not start anyone.
> MinResources *v1.ResourceList `json:"minResources,omitempty"`

What if the podgroup is made up of different types of pods, like ps and trainer in TensorFlow? How do we set the MinResources then? @cwdsuzhou
This is the v1alpha1 version. I could add that if needed and extend this CRD later, e.g.

```go
type PodGroup struct {
	...
	Groups map[string]PodGroup
	...
}
```
I think one PodGroup should describe a particular type of workload. In terms of tensorflow application, a good model is to have PS and Worker referencing their corresponding (Sub-)PodGroup, and then there can be a (Parent-)PodGroup to wire those two (Sub-)PodGroup(s).
(not quite sure it would work, need a PoC on this)
I thought per #42 (comment) `MinResources` was going to be removed?
/approve Thanks @cwdsuzhou! Please squash the commits, then I will /lgtm.
[APPROVAL NOTIFIER] This PR is APPROVED. This pull-request has been approved by: cwdsuzhou, Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing

Done, PTAL, thanks! @Huang-Wei

/lgtm
> This extension pre-filters pods to save scheduling cycles. This is especially helpful when there are not enough resources in a cluster. The overall flow works as below:
>
> 1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.

How is this "pod belonging to a pod group" relationship expressed?
- pod to PodGroup: the pod gets a PodGroup label or annotation applied.
- PodGroup to pod: either implicitly by dynamic computing, or via the PodGroup's `OccupiedBy`, which refers to the group of pods' controller (i.e., Deployment or StatefulSet).

@cwdsuzhou can confirm.
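The pod-to-PodGroup direction could be resolved like this. The exact label key is an assumption borrowed from the lightweight-coscheduling convention; the annotation variant works the same way:

```go
package main

import "fmt"

// groupNameOf resolves a pod's PodGroup via a well-known label on the pod.
// The label key below is assumed, not confirmed by this thread.
func groupNameOf(podLabels map[string]string) (string, bool) {
	name, ok := podLabels["pod-group.scheduling.sigs.k8s.io/name"]
	return name, ok
}

func main() {
	labels := map[string]string{"pod-group.scheduling.sigs.k8s.io/name": "tf-job-1"}
	if name, ok := groupNameOf(labels); ok {
		fmt.Println(name) // tf-job-1
	}
}
```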
> // OccupiedBy marks the workload (e.g., deployment, statefulset) UID that occupy the podgroup.
> // It is empty if not initialized.
> OccupiedBy string `json:"occupiedBy,omitempty"`

As in, owned, like `OwnerReference`?
Yes
I think it would be more intuitive to use the more common terminology?
The canonical `meta.OwnerReference` may be used to exercise the concept of hierarchical PodGroups, i.e., a child PodGroup "ownedBy" another parent PodGroup, so using the term "OccupiedBy" in `status` can distinguish from that semantics IMO - to represent the PodGroup <-> Workload mapping. @cwdsuzhou may comment further.
> The canonical `meta.OwnerReference` may be used to exercise the concept of hierarchical PodGroups, i.e., a child PodGroup "ownedBy" another parent PodGroup, so using the term "OccupiedBy" in `status` can distinguish from that semantics IMO - to represent the PodGroup <-> Workload mapping. @cwdsuzhou may comment further.

Yep, this field is used for recording the related object, e.g. a workload or CRD. But it is not exactly the same as the owner reference.
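Filling `OccupiedBy` from the pods' controller, as discussed above, could look like this. `OwnerRef` is a minimal stand-in for `metav1.OwnerReference`, and the set of recognized kinds is a simplifying assumption:

```go
package main

import "fmt"

// OwnerRef is a minimal stand-in for metav1.OwnerReference.
type OwnerRef struct {
	Kind string
	UID  string
}

// occupiedBy picks the workload UID to record in PodGroupStatus.OccupiedBy:
// the pods' controller, e.g. a Deployment, StatefulSet, or another CR.
func occupiedBy(owners []OwnerRef) string {
	for _, o := range owners {
		switch o.Kind {
		case "Deployment", "StatefulSet", "TFJob", "MPIJob":
			return o.UID
		}
	}
	return "" // empty if not initialized
}

func main() {
	fmt.Println(occupiedBy([]OwnerRef{{Kind: "Deployment", UID: "uid-123"}})) // uid-123
}
```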
This PR adds a plugin named batch. Our original repo is here. In Tencent, this plugin has been running stably for more than half a year.
/cc @Huang-Wei