
Add kep for coscheduling base on CRD #42

Merged · 1 commit · Sep 11, 2020

Conversation

cwdsuzhou (Member):

This PR adds a plugin named batch. Our original repo is here.

In Tencent, this plugin has been stably running for more than half a year.

/cc @Huang-Wei

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 24, 2020
k8s-ci-robot (Contributor):

Welcome @cwdsuzhou!

It looks like this is your first PR to kubernetes-sigs/scheduler-plugins 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/scheduler-plugins has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot (Contributor):

Hi @cwdsuzhou. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 24, 2020
Huang-Wei (Contributor) left a comment:

Thanks @cwdsuzhou for contributing Tencent's coscheduling plugins. I believe this will help grow the ecosystem of the Scheduler Framework a lot.

First comment is on KEP organization, let's make the KEP part a separate PR, and we will discuss the design details there.

kep/batch/README.md (outdated)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 25, 2020
@cwdsuzhou cwdsuzhou changed the title Add batch plugin Add kep for coscheduling base on CRD Aug 25, 2020

1. Allow the pods do not belong to any group.
2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
3. If there are groups running, we check if the current pod belong the max finished group, if it is, we allow it.
Member:

What's the max finished group?
@cwdsuzhou

cwdsuzhou (Member Author):

The group that has made the most progress.

Contributor:

Please explain max in the document.

denkensk (Member):

@cwdsuzhou Thanks for your contribution. The definition of PodGroup is very important for coscheduling.

After reviewing the KEP, I find that its key points are similar to the previous one (https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/2-lightweight-coscheduling), like queueSort and Permit.

Can we consider supporting PodGroup based on the original coscheduling implementation instead of reimplementing the plugin? Looking forward to your reply.

cwdsuzhou (Member Author):

@cwdsuzhou Thanks for your contribution. The definition of PodGroup is very important for coscheduling.

After reviewing the KEP, I find that its key points are similar to the previous one (https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/2-lightweight-coscheduling), like queueSort and Permit.

Can we consider supporting PodGroup based on the original coscheduling implementation instead of reimplementing the plugin? Looking forward to your reply.

Yes, that is similar. But we consider resources more, and we also add another controller to reconcile the PodGroup status. Actually, our coscheduling implementation has been finished since Nov. 2019. Though the current coscheduling has a similar implementation, it may not cover some scenarios or may perform badly in some cases. We have run the same integration test cases. The results are as follows:

[screenshots of the integration test results]

denkensk (Member) commented Aug 25, 2020:

we consider resources more, and we also add another controller to reconcile the PodGroup status.


I think it's OK to reconcile the PodGroup status in a controller and to pre-check or reserve the resources like Volcano does. These features are not in conflict with the current co-scheduling. @cwdsuzhou

it may not cover some scenarios or may perform badly in some cases.


Can you share the scenarios not covered by the current co-scheduling?

cwdsuzhou (Member Author):

we consider resources more, and we also add another controller to reconcile the PodGroup status.

I think it's OK to reconcile the PodGroup status in a controller and to pre-check or reserve the resources like Volcano does. These features are not in conflict with the current co-scheduling. @cwdsuzhou

it may not cover some scenarios or may perform badly in some cases.

Can you share the scenarios not covered by the current co-scheduling?

Suppose there are 300 units of a resource in a cluster, and a job requires 3 pods with 150 units requested per pod. 2 pods would be scheduled, but the job actually cannot run. If another job consisting of 3 pods with 100 units requested per pod then arrives, it would not run immediately either, because the resources have been occupied by the first job.
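To make the arithmetic of this scenario explicit, here is a minimal sketch (names and units are illustrative, not from the KEP) of why a group-level feasibility check avoids stranding resources:

package main

import "fmt"

// fitsAsGroup reports whether an entire group of pods can run at once,
// which is what gang scheduling checks before admitting any member pod.
func fitsAsGroup(freeUnits, unitsPerPod, numPods int) bool {
	return freeUnits >= unitsPerPod*numPods
}

func main() {
	free := 300 // total schedulable units in the cluster

	// Job A: 3 pods x 150 units. Per-pod scheduling admits 2 pods,
	// but the job as a whole can never run: 3*150 = 450 > 300.
	fmt.Println(fitsAsGroup(free, 150, 3)) // false

	// Job B: 3 pods x 100 units would fit exactly (300 >= 300), but only
	// if job A's two stranded pods were not already holding 2*150 = 300.
	fmt.Println(fitsAsGroup(free, 100, 3)) // true
}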

denkensk (Member):

Suppose there are 300 units of a resource in a cluster, and a job requires 3 pods with 150 units requested per pod. 2 pods would be scheduled, but the job actually cannot run. If another job consisting of 3 pods with 100 units requested per pod then arrives, it would not run immediately either, because the resources have been occupied by the first job.


This is a cluster whose resources are highly utilized. The resources occupied by the first job will be released after the timeout. We can also add more policies, like a resource pre-check in PreFilter.

I don't think there is any problem that can't be solved without reimplementing the coscheduling plugin, is there?
@cwdsuzhou

// If specified, indicates the PodGroup's priority. "system-node-critical" and
// "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
Contributor:

Does PriorityClass mean another CRD or something else? Can you introduce more about this type?

cwdsuzhou (Member Author), Aug 26, 2020:

The name of a PriorityClass, same as the default one in K8s. But it is not used now; I am considering whether I should remove it.

// MinResources defines the minimal resource of members/tasks to run the pod group;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinResources *v1.ResourceList `json:"minResources,omitempty"`
Contributor:

One pod's minimal resource or one PodGroup's minimal resource?

cwdsuzhou (Member Author):

pod

Contributor:

To be honest, I don't quite get the point of MinResources.

The resource request is just one dimension of a PodGroup's constraints, how can you ensure all Pods of a PodGroup can be scheduled successfully by only looking at its resource requests?

pod

but the comments said it's for all tasks.

cwdsuzhou (Member Author):

Yes, this seems unnecessary. Actually, we did not set it in our current implementation; the value is derived from the Pod. I would remove it.

Huang-Wei (Contributor) left a comment:

First round of review.

@@ -0,0 +1,140 @@
# Coscheduling based on PodGroup CRD
Contributor:

Let's rename this folder to include the issue number, i.e., kep/42-podgroup-coscheduling.

<!-- /toc -->

## Motivation
Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods scheduled at the same time . Under some scene, it would waste resources since some pods need work together, like spark, tensorflow and so on . So, podgroup-coscheduling is aimed at solving the issue.
Contributor:

Suggested change
Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods scheduled at the same time . Under some scene, it would waste resources since some pods need work together, like spark, tensorflow and so on . So, podgroup-coscheduling is aimed at solving the issue.
Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods can be scheduled altogether. Under some scenes, it would waste resources since the whole application cannot work with only partial Pods' running, like Spark jobs, TensorFlow jobs, and so on. This proposal is aimed at solving the issue, by introducing a PodGroup CRD to do the heavy lifting on wiring a group of Pods.

Sort the job when we submit jobs to a cluster. Currently, we can only do this base on pods.

## Use Cases
Spark jobs, tensorflow jobs and other pods have to run together.
Contributor:

Suggested change
Spark jobs, tensorflow jobs and other pods have to run together.
Batch workloads such as Spark jobs, TensorFlow jobs that have to run altogether.

2. Define a CRD name PodGroup to help pod scheduling.

## Non-Goals
Sort the job when we submit jobs to a cluster. Currently, we can only do this base on pods.
Contributor:

Is Job here the official k8s Job, or a general term for a batch job? Also, we didn't mention "sort" in previous paragraphs, so "sort the job" sounds a bit abrupt.

cwdsuzhou (Member Author):

A general term for a batch job.

Currently, through the default scheduler of Kubernetes, we cannot ensure a group of pods scheduled at the same time . Under some scene, it would waste resources since some pods need work together, like spark, tensorflow and so on . So, podgroup-coscheduling is aimed at solving the issue.

## Goals
1. Base on scheduling framework, implementing the gang scheduling.
Contributor:

Suggested change
1. Base on scheduling framework, implementing the gang scheduling.
1. Base on the scheduling framework, implement the co-scheduling feature.


// OccupiedBy marks the podgroup occupied by which group.
// Owner reference would be used to filled it, it not initialize, it is empty
OccupiedBy string `json:"occupiedBy,omitempty"`
Contributor:

Are you trying to describe a hierarchical relation here? For example, a PodGroup is owned/referred to by another PodGroup? If so, you can inject the relation into ObjectMeta.OwnerReferences, so you don't need to define a new API field.

cwdsuzhou (Member Author):

No, in our current implementation this is just the UID of a Deployment, StatefulSet or other CR, e.g. a TF job or an MPI job.


### Controller

We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to
Contributor:

  • ... through describing ...
  • ... Once a pod in a group failed ... is marked as Failed...
  • ... Controller would also help to recover from ...
  • interpreted -> interrupted?
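Separately from these wording fixes, here is a minimal sketch of the status logic the quoted paragraph describes, with simplified stand-in types (the real controller and its phase names live in the plugin's repo):

// podPhase is a simplified stand-in for v1.PodPhase.
type podPhase string

const (
	podFailed    podPhase = "Failed"
	podSucceeded podPhase = "Succeeded"
	podRunning   podPhase = "Running"
)

// groupPhase derives a PodGroup phase from its member pods: any failed
// pod marks the whole group Failed, as the quoted text describes.
func groupPhase(members []podPhase, minMember int) string {
	running, succeeded := 0, 0
	for _, p := range members {
		switch p {
		case podFailed:
			return "Failed"
		case podSucceeded:
			succeeded++
		case podRunning:
			running++
		}
	}
	if succeeded >= minMember {
		return "Finished" // illustrative phase name
	}
	if running >= minMember {
		return "Running"
	}
	return "Pending"
}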


#### PreFilter

This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:
Contributor:

Suggested change
This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:
This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:


This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The main proggress is as follows:

1. Allow the pods do not belong to any group.
Contributor:

Suggested change
1. Allow the pods do not belong to any group.
1. Allow the pods that do not belong to any group.


1. Allow the pods do not belong to any group.
2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
3. If there are groups running, we check if the current pod belong the max finished group, if it is, we allow it.
Contributor:

Please explain max in the document.

cwdsuzhou (Member Author):

First round of review.

Thanks for the review, I will update this KEP according to your kind suggestions.

Huang-Wei (Contributor) left a comment:

More comments.

kep/42-podgroup-coscheduling/README.md (resolved)
@@ -70,7 +55,7 @@ type PodGroupStatus struct {
Phase PodGroupPhase `json:"phase"`

// OccupiedBy marks the podgroup occupied by which group.
// Owner reference would be used to filled it, it not initialize, it is empty
// Owner reference would be used to filled it, if not initialize, it is empty
Contributor:

Suggested change
// Owner reference would be used to filled it, if not initialize, it is empty
// Owner reference would be used to fill it. It's empty if not initialized.

// MinResources defines the minimal resource of members/tasks to run the pod group;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinResources *v1.ResourceList `json:"minResources,omitempty"`
Contributor:

If MinResources is needed for the implementation, I think it can be kept. IIRC this serves as a preFilter optimization, right?

cwdsuzhou (Member Author):

In our original version, we fill this field from the PodSpec. Keeping this is also OK.

cwdsuzhou (Member Author):

If this is defined, we would use it for pre-filtering. If not, we compute the resource requirements from the Pod spec.

Member:

I'm confused about how to set MinResources. Is it the total minimal resource of the PodGroup, or of one pod?

cwdsuzhou (Member Author):

Yes
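A sketch of the fallback described a few comments above - use the CRD's MinResources when set, otherwise sum the member pods' requests. The helper name is hypothetical; only the v1.ResourceList and resource.Quantity APIs are real:

import (
	v1 "k8s.io/api/core/v1"
)

// groupMinResources returns the resources to pre-check for a PodGroup:
// the explicit spec.MinResources if set, otherwise the sum of the member
// pods' container requests. A sketch; init containers and overhead omitted.
func groupMinResources(min *v1.ResourceList, pods []*v1.Pod) v1.ResourceList {
	if min != nil {
		return *min
	}
	total := v1.ResourceList{}
	for _, pod := range pods {
		for _, c := range pod.Spec.Containers {
			for name, q := range c.Resources.Requests {
				sum := total[name]
				sum.Add(q)
				total[name] = sum
			}
		}
	}
	return total
}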

@@ -96,7 +81,7 @@ type PodGroupStatus struct {

### Controller

We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to
We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Onece a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to
Contributor:

Onece -> Once

Contributor:

This comment is still outstanding.


This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:

1. Allow the pods that do not belong to any group.
Contributor:

Suggested change
1. Allow the pods that do not belong to any group.
1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.

#### Permit

1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
2. When number meet that, we would send a signal to permit the pods waiting.
Contributor:

Suggested change
2. When number meet that, we would send a signal to permit the pods waiting.
2. When the number is equal or greater than `minMember`, send a signal to permit the waiting pods.
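A sketch of the Permit decision just described, with the framework's Status codes replaced by strings for brevity (this is not the plugin's actual code):

import "time"

// permitDecision mirrors the two steps above: until minMember pods of the
// group have reached the Permit stage, each pod waits with a TTL equal to
// MaxScheduleTime; once the threshold is met, the waiting pods are allowed.
func permitDecision(arrivedPods, minMember int, maxScheduleTime time.Duration) (verdict string, timeout time.Duration) {
	if arrivedPods < minMember {
		return "Wait", maxScheduleTime
	}
	// The real plugin would iterate the framework's waiting pods here and
	// signal Allow for every pod that belongs to the same group.
	return "Allow", 0
}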

1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
2. When number meet that, we would send a signal to permit the pods waiting.

We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
Contributor:

FYI: it's now defined in the latest codebase, and will be available in the master branch once #45 gets reviewed.

cwdsuzhou (Member Author):

I think we should keep this. Our original version supports setting it from both the args and the CRD. If the CRD does not set the time, the value defined in the args takes effect.

The scenario is as below:
A job that needs 10 pods may time out after 5s, but a job that needs 10000 pods would like it to be 5min.

denkensk (Member), Sep 2, 2020:

I think it is better not to let the user set the MaxScheduleTime. Users will prefer to set MaxScheduleTime greater than needed in order to reserve the resources, which will lead to resources being reserved for a long time.

A job that needs 10 pods may time out after 5s, but a job that needs 10000 pods would like it to be 5min.


We talked about this before. We can use a factor to compute the WaitingTime; the factor can be related to the number of pods. I added a TODO in the code earlier.

cwdsuzhou (Member Author):

But for different clusters the factor would also be different, e.g. 100 nodes vs. 2000 nodes.

Member:

Yes, the size of the cluster is also a factor to be considered along with the number of pods. This can be discussed. And I am still worried about it. @Huang-Wei @cwdsuzhou

I think it is better not to let the user set the MaxScheduleTime. Users will prefer to set MaxScheduleTime greater than needed in order to reserve the resources, which will lead to resources being reserved for a long time.
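For illustration only, here is a hypothetical version of the factor idea being debated above - nothing like this was agreed on in the thread, and the scaling rule is made up:

import "time"

// waitTimeout sketches a factor-based WaitingTime: scale a base timeout by
// group size, tempered by cluster size, so the user never sets it directly.
func waitTimeout(base time.Duration, numPods, numNodes int) time.Duration {
	if numNodes < 1 {
		numNodes = 1
	}
	factor := 1 + numPods/(10*numNodes) // illustrative scaling only
	return base * time.Duration(factor)
}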

1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
2. When number meet that, we would send a signal to permit the pods waiting.

We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
Contributor:

Suggested change
We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole pod group would be rejected.


This extension is mainly used for helping record the PodGroup Status. When pod binds successfully, we would update the scheduling status of a PodGroup.

We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
Contributor:

Suggested change
We can define `MaxScheduleTime` for a PodGroup. If anyone of the pods times out, the whole group would be rejected.
We can define `MaxScheduleTime` for a PodGroup. If any pod times out, the whole group would be rejected.

Comment on lines 7 to 13
- "@Huang-Wei"
- "@ahg-g"
- "@alculquicondor"
- "k82cn"
- "@resouer"
- "@hex108"
- "@everpeace"
Contributor:

Remove the names of those who are not reviewing.

Huang-Wei (Contributor):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 1, 2020
cwdsuzhou (Member Author):

@Huang-Wei The PR has been updated according to your latest suggestions, PTAL, thanks.

denkensk (Member) commented Sep 2, 2020:

/cc @denkensk

kep/42-podgroup-coscheduling/README.md (outdated)

#### PreFilter

This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:
Contributor:

This comment is still outstanding.

"helps ..." and "... useful ..." are a bit verbose and don't add extra info to make the words more understandable. Can you rephrase it?

This extension helps pre-filter pods. It is useful, especially when there are not enough resources in a cluster. The overall flow works as below:

1. Allow the pods that do not belong to any group.
2. If there are no groups scheduling, we check resource, if enough, we allow the pod.
Contributor:

This comment is still outstanding.

@@ -96,7 +81,7 @@ type PodGroupStatus struct {

### Controller

We define a controller to reconcile PodGroup status, and we can query the job status through describe the PodGroup. Any pod in a group failed, the Group Status is marked Failed. Controller would also help whem recovering from abnormal cases, e.g. batch scheduling is interpreted due to
We define a controller to reconcile PodGroup status, and we can query the job status through describing the PodGroup. Onece a pod in a group failed, the Group Status is marked Failed. Controller would also help recover from abnormal cases, e.g. batch scheduling is interrupted due to
Contributor:

This comment is still outstanding.


Comment on lines 104 to 106
3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
Contributor:

Regarding @denkensk's concern, I'd leave it for @cwdsuzhou to comment. Probably it's not a bad idea to also look at the absolute number to reach completion.

4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.

Any pod rejected to run, their group would be added to a denied list with a ttl.
Contributor:

This comment is still outstanding.
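Pulling the quoted steps together, here is a sketch of the whole PreFilter decision path, including the denied list with a TTL. All types and helpers are simplified stand-ins, not the plugin's API:

import "time"

// groupInfo is a simplified stand-in for the plugin's bookkeeping.
type groupInfo struct {
	name     string
	progress float64 // num(scheduled pods) / minMember
	priority int32
}

// denyCache models the denied list: group name -> expiry time.
type denyCache map[string]time.Time

func (d denyCache) denied(group string, now time.Time) bool {
	expiry, ok := d[group]
	return ok && now.Before(expiry)
}

// preFilterAllow walks the five steps quoted above.
func preFilterAllow(group *groupInfo, podPriority int32, enoughResources bool,
	leader *groupInfo, deny denyCache, now time.Time) bool {
	if group == nil {
		return true // 1. pods without a group are always allowed
	}
	if deny.denied(group.name, now) {
		return false // recently rejected group: back off until the TTL expires
	}
	if leader == nil {
		return enoughResources // 2. no group in flight: a plain resource check decides
	}
	if group.name == leader.name {
		return true // 3. the pod belongs to the group with the max progress
	}
	if enoughResources {
		return true // 4. the leading group can still finish even with this pod admitted
	}
	return podPriority > leader.priority // 5. higher priority may displace the leader
}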


#### Permit

1. When number of pods cannot meet the `minMember` defines in the PodGroup, `Wait` is returned. They will be added to cache with TLL(equal to MaxScheduleTime).
Contributor:

This comment is still outstanding.


#### PostBind

This extension is mainly used for helping record the PodGroup Status. When pod binds successfully, we would update the scheduling status of a PodGroup.
Contributor:

This comment is still outstanding.
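A sketch of the bookkeeping described in the quoted PostBind text, using stand-in types (the phase name is illustrative, not from the KEP):

// groupStatus is a simplified stand-in for PodGroupStatus.
type groupStatus struct {
	Scheduled int32
	MinMember int32
	Phase     string
}

// postBind records a successful bind: the scheduled count goes up, and
// once it reaches MinMember the group is marked as scheduled.
func postBind(s *groupStatus) {
	s.Scheduled++
	if s.Scheduled >= s.MinMember {
		s.Phase = "Scheduled" // illustrative phase name
	}
}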

Huang-Wei (Contributor) left a comment:

Some final comments. We're close to getting it merged.

@@ -0,0 +1,13 @@
title: Coscheduling based on PodGroup CRD
kep-number: 3
Contributor:

Suggested change
kep-number: 3
kep-number: 42

As we're using the issue number as the KEP number.

Comment on lines 109 to 111
3. If there are groups running, we check if the current pod belong the group having the max progress(num(Pods)/minMember), if it is, we allow it.
4. Otherwise, we check if the max finished group can still run when allow this pod. If we can, allow it.
5. Otherwise, we check if the pod has higher priority compared with the max finished one. If yes, we reject the pod belongs to the group and allow the current one.
Contributor:

Please see if the wording in https://github.com/kubernetes-sigs/scheduler-plugins/pull/42/files#r481413128 makes more sense to you.

cwdsuzhou (Member Author):

Sure, I missed this last time.

// MinResources defines the minimal resource of members/tasks to run the pod group;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinResources *v1.ResourceList `json:"minResources,omitempty"`
Member:

What if the PodGroup is made up of different types of pods, like ps and trainer in TensorFlow? How should MinResources be set then? @cwdsuzhou

cwdsuzhou (Member Author):

This is the v1alpha1 version. I could add that if needed and extend this CRD later, e.g.:

type PodGroup struct {
	...
	Groups map[string]PodGroup
	...
}

Contributor:

I think one PodGroup should describe a particular type of workload. In terms of tensorflow application, a good model is to have PS and Worker referencing their corresponding (Sub-)PodGroup, and then there can be a (Parent-)PodGroup to wire those two (Sub-)PodGroup(s).

(not quite sure it would work, need a PoC on this)
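As a rough illustration of that idea - explicitly unproven in this thread - the shape could look like:

// Hypothetical shape for hierarchical PodGroups: each role (PS, Worker)
// gets its own sub-group, wired together by a parent group.
type SubGroupRef struct {
	Name      string
	MinMember int32
}

type ParentPodGroup struct {
	SubGroups []SubGroupRef // e.g. {"ps", 2} and {"worker", 8}
}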

Comment:

I thought per #42 (comment) MinResources was going to be removed?

Huang-Wei (Contributor):

/approve

Thanks @cwdsuzhou ! Please squash the commits, then I will /lgtm.

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cwdsuzhou, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2020
cwdsuzhou (Member Author):

/approve

Thanks @cwdsuzhou ! Please squash the commits, then I will /lgtm.

Done, PTAL, thanks! @Huang-Wei

Huang-Wei (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6303fc5 into kubernetes-sigs:master Sep 11, 2020

This extension pre-filters pods to save scheduling cycles. This is especially helpful when there are not enough resources in a cluster. The overall flow works as below:

1. If the pod doesn't belong to a pod group, allow it; otherwise, go to the following steps.
Comment:

How is this "pod belonging to a pod group" relationship expressed?

Contributor:

  • pod to PodGroup: the pod gets a PodGroup label or annotation applied
  • PodGroup to pod: either implicitly by dynamic computing, or via PodGroup's OccupiedBy - which refers to the group of pods' controller (i.e., Deployment or StatefulSet).

@cwdsuzhou can confirm.
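For illustration, the pod -> PodGroup direction could be resolved from a label, as sketched below. The label key here is an assumption; the authoritative key is defined by the plugin, not this comment:

import (
	v1 "k8s.io/api/core/v1"
)

// podGroupLabel is an assumed label key for illustration; consult the
// plugin's documentation for the real constant.
const podGroupLabel = "pod-group.scheduling.sigs.k8s.io/name"

// podGroupOf returns the name of the pod's PodGroup; an empty string
// means the pod does not belong to any group.
func podGroupOf(pod *v1.Pod) string {
	return pod.Labels[podGroupLabel]
}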


// OccupiedBy marks the workload (e.g., deployment, statefulset) UID that occupies the podgroup.
// It is empty if not initialized.
OccupiedBy string `json:"occupiedBy,omitempty"`
Comment:

As in, owned, like OwnerReference?

cwdsuzhou (Member Author):

Yes

Comment:

I think it would be more intuitive to use the more common terminology?

Contributor:

The canonical meta.OwnerReference may be used to exercise the concept of hierarchical PodGroups, i.e., a child PodGroup "ownedBy" the other parent PodGroup, so using the term "OccupiedBy" in status can distinguish from that semantics IMO - to represent the semantics of PodGroup <-> Workload mapping.

@cwdsuzhou may comment further.

cwdsuzhou (Member Author):

The canonical meta.OwnerReference may be used to exercise the concept of hierarchical PodGroups, i.e., a child PodGroup "ownedBy" the other parent PodGroup, so using the term "OccupiedBy" in status can distinguish from that semantics IMO - to represent the semantics of PodGroup <-> Workload mapping.

@cwdsuzhou may comment further.

Yep, this field is used for recording the related object, e.g. a workload or a CRD. But it is not exactly the same as the owner reference.

swatisehgal pushed a commit to swatisehgal/scheduler-plugins that referenced this pull request Jun 29, 2022