Use pod group instead of PDB for gang scheduling #916

Open
zionwu opened this issue Jan 23, 2019 · 30 comments

@zionwu (Contributor) commented Jan 23, 2019

The latest version of kube-batch uses pod groups instead of PDBs.
With a pod group, the user can specify the queue for the job, even when jobs are in different namespaces. With a PDB, we can't specify the queue; it defaults to the job's namespace.

PDB is still supported in kube-batch for backward compatibility; however, it will be removed when v1alpha1 is finalized. I think tf-operator should support pod groups for more powerful scheduling.
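
To make the PDB-vs-PodGroup difference concrete, here is a rough Go sketch (not the actual tf-operator code) of the two objects an operator could create for gang scheduling. The PodGroup field names follow the kube-batch v1alpha1 types referenced later in this thread; the label-selector key and import aliases are illustrative assumptions.

```go
package gang

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"

	kbv1alpha1 "github.com/kubernetes-sigs/kube-batch/pkg/apis/scheduling/v1alpha1"
)

// pdbForJob is roughly what gang scheduling relies on today: a PDB whose
// minAvailable equals the total replica count of the TFJob. There is no way
// to express a queue here, so the job implicitly belongs to its namespace.
func pdbForJob(name, namespace string, replicas int32) *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(int(replicas))
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			// Label key is illustrative; the operator would use its own job label.
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"tf-job-name": name}},
		},
	}
}

// podGroupForJob is the PodGroup equivalent: MinMember plays the role of
// minAvailable, and Queue lets the job target a specific kube-batch queue
// instead of defaulting to its namespace.
func podGroupForJob(name, namespace, queue string, replicas int32) *kbv1alpha1.PodGroup {
	return &kbv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: kbv1alpha1.PodGroupSpec{
			MinMember: replicas,
			Queue:     queue, // assumption: Queue field as in the kube-batch PodGroupSpec
		},
	}
}
```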

@k82cn (Collaborator) commented Jan 23, 2019

kubernetes/enhancements#639 was merged, so I think it's time to replace PDB with PodGroup. In addition, it will also support phase/condition/status of the PodGroup, e.g. unschedulable, which is helpful for lifecycle management of TFJobs.

@k82cn (Collaborator) commented Jan 24, 2019

If there are no objections, I'd like to create a PR for that :)

@zionwu (Contributor, Author) commented Jan 25, 2019

@k82cn I know kube-batch uses the namespace as the queue by default, but it also supports creating a Queue CRD and specifying the queue in the pod group. It would be great if I could specify the queue for my TFJob; to do that, we could add a field called queue to the TFJobSpec. What do you think?

@k82cn (Collaborator) commented Jan 25, 2019

add a field called queue to the TFJobSpec

I'm OK with that; for TFJobSpec changes, it's better to get @jlewi's feedback :)

@k82cn (Collaborator) commented Jan 25, 2019

xref kubernetes-sigs/kube-batch#465

@adam-marek and I talked about Queue support in tf-operator; it needs the community's feedback on TFJob API changes.

Or maybe we can introduce an annotation as an alpha feature. @jlewi, WDYT?

@johnugeorge (Member) commented Jan 28, 2019

@zionwu Isn't the queue name part of the PodGroupSpec? https://github.com/kubernetes-sigs/kube-batch/blob/master/pkg/apis/scheduling/v1alpha1/types.go#L116

In that case, why do we need a separate queue field once we set the PodGroup for the TFJob?

@richardsliu (Contributor) commented Jan 29, 2019

Agree with @johnugeorge on the queue name.

I would like to keep k8s details out of the TFJob spec if possible. Ideally, TFJob should contain just TF-specific fields.

@zionwu (Contributor, Author) commented Jan 29, 2019

@johnugeorge @richardsliu, just like the PDB, the PodGroup will be created by the TF-Operator. If users can only specify the queue name in the PodGroup, the process to specify the queue name is:

  1. The user submits a TFJob.
  2. The TF-Operator creates a PodGroup with the default queue name.
  3. The user has to find the PodGroup and modify the queue name.
     However, this may not work, as the job may already be scheduled before step 3.

I think the queue name should be decided when the PodGroup is created. Since the PodGroup is created by the TF-Operator, the only way users can specify the queue name is by setting it in the TFJob spec; the TF-Operator can then set the queue name on the PodGroup accordingly.

@gaocegege (Member) commented Jan 29, 2019

Is pod group a standard resource in K8s? I am not sure what will happen if the user does not register the resource.

@k82cn (Collaborator) commented Jan 29, 2019

Is pod group a standard resource in K8s?

Not for now; the CRD should be deployed together with kube-batch. If creating the PodGroup fails, just log an error message.
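
A minimal sketch of that behaviour, assuming a client-gen style clientset for the kube-batch CRDs (the import path and method chain follow the usual code-generator layout and are assumptions, not verified against the kube-batch repo):

```go
package gang

import (
	"log"

	kbv1alpha1 "github.com/kubernetes-sigs/kube-batch/pkg/apis/scheduling/v1alpha1"
	kbclient "github.com/kubernetes-sigs/kube-batch/pkg/client/clientset/versioned"
)

// createPodGroupBestEffort tries to create the PodGroup and only logs on
// failure, e.g. when the PodGroup CRD is not installed because kube-batch
// is not deployed; the TFJob itself keeps running without gang scheduling.
func createPodGroupBestEffort(client kbclient.Interface, pg *kbv1alpha1.PodGroup) {
	if _, err := client.SchedulingV1alpha1().PodGroups(pg.Namespace).Create(pg); err != nil {
		log.Printf("failed to create PodGroup %s/%s: %v", pg.Namespace, pg.Name, err)
	}
}
```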

@k82cn (Collaborator) commented Jan 29, 2019

I would like to keep k8s details out of the TFJob spec if possible. Ideally, TFJob should contain just TF-specific fields.

If the tf-operator needs resource fair-sharing from the scheduler, Queue is one option; Queue is a common concept in batch systems, e.g. Job, Queue.
If we have concerns about the API changes, how about annotations as an alpha feature?
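
A minimal sketch of the annotation-based alternative: the operator reads a queue annotation from the TFJob's metadata and copies it into the PodGroup it creates. The annotation key below is a hypothetical placeholder, not an agreed-upon name.

```go
package gang

// queueAnnotation is a hypothetical key; the actual name would need to be agreed on.
const queueAnnotation = "scheduling.kubeflow.org/queue"

// queueFromAnnotations returns the queue requested on the TFJob, or the
// kube-batch default queue when the annotation is absent or empty.
func queueFromAnnotations(annotations map[string]string) string {
	if q, ok := annotations[queueAnnotation]; ok && q != "" {
		return q
	}
	return "default"
}
```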

@johnugeorge (Member) commented Jan 29, 2019

@zionwu I am a little skeptical about adding non-TF fields to the TFJob API. We would then be forced to change the TFJob API whenever the upstream implementation changes.

As @k82cn suggested, how about considering annotations for now? @gaocegege @richardsliu

@richardsliu (Contributor) commented Jan 29, 2019

Would it be possible to introduce a field in https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/common/v1beta1/common_types.go? For example, something like "SchedulingPolicy". That way all operators could share the same implementation.
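
One possible shape of such a field, sketched in Go; none of these names exist in common_types.go today, they only illustrate the idea of keeping scheduler knobs in the shared API instead of each operator's spec.

```go
package v1beta1

// SchedulingPolicy groups scheduler-related settings that every operator
// (TFJob, PyTorchJob, ...) could share, keeping framework-specific specs
// free of Kubernetes scheduling details. Hypothetical sketch only.
type SchedulingPolicy struct {
	// MinAvailable is the minimum number of pods that must be scheduled
	// together; it would map to PodGroup.Spec.MinMember for kube-batch.
	MinAvailable *int32 `json:"minAvailable,omitempty"`
	// Queue names the kube-batch queue the job should be placed in; it
	// would map to PodGroup.Spec.Queue.
	Queue string `json:"queue,omitempty"`
}
```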

@k82cn (Collaborator) commented Jan 30, 2019

For example something like "SchedulingPolicy".

+1; in a previous version of the gang-scheduling design doc, I proposed introducing SchedulingSpec/PodSchedulingGroup as a bridge between the operator and the scheduler, and providing a default implementation of QueueJob. Currently, PodGroup plays a similar role; it's up to the community to wrap it as a new API and use PodGroup to communicate with the scheduler.

@johnugeorge referenced this issue Feb 6, 2019: TF operator v1beta2 API #935 (closed)
@johnugeorge (Member) commented Feb 6, 2019

/area 0.5.0

@ChanYiLin (Member) commented Feb 7, 2019

@k82cn @richardsliu @gaocegege
I remember we discussed whether we should add a common schedulerName to the TFJob spec or not.
In v1alpha2 we decided not to, and users need to add schedulerName to each replica's (pod) spec.

In v1beta2, should we also ask users to add schedulerName to each replica's (pod) spec, or should we change to using an annotation to indicate the scheduler?
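
For reference, this is roughly what the current per-replica approach looks like when written out in Go types (in practice users put the equivalent YAML in every replica template of the TFJob manifest); the scheduler name and image are examples only.

```go
package gang

import corev1 "k8s.io/api/core/v1"

// workerTemplate shows schedulerName being set on one replica's pod template;
// today this has to be repeated for every replica type (Chief, PS, Worker, ...).
func workerTemplate() corev1.PodTemplateSpec {
	return corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			SchedulerName: "kube-batch",
			Containers: []corev1.Container{
				{Name: "tensorflow", Image: "tensorflow/tensorflow:1.12.0"},
			},
		},
	}
}
```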

@gaocegege (Member) commented Feb 11, 2019

@ChanYiLin Ref #920

We decided to simplify the process.

@johnugeorge (Member) commented Feb 13, 2019

@k82cn Are you working on this?

@k82cn (Collaborator) commented Feb 18, 2019

Yes, I'm working on this one :)

@richardsliu (Contributor) commented Feb 25, 2019

@k82cn Any updates on this?

@richardsliu (Contributor) commented Mar 4, 2019

@k82cn Is this still on track to be finished in 0.5.0? We are trying to reach code complete by 3/15.

@k82cn (Collaborator) commented Mar 5, 2019

@richardsliu, sorry for the delayed response. Let me try to submit a PR this week.

@richardsliu (Contributor) commented Mar 19, 2019

@thandayuthapani @k82cn Is there any more pending work for this item? This was automatically closed by the k8s bot.

@richardsliu reopened this Mar 19, 2019

@kunmingg referenced this issue Mar 19, 2019: Ship v0.5.0 release #2716 (open)
@thandayuthapani (Contributor) commented Mar 20, 2019

@richardsliu Yeah, on the v1beta2 side, it uses PodGroup instead of PDB for gang scheduling.

@johnugeorge (Member) commented Mar 20, 2019

@thandayuthapani Are you working on SchedulingPolicy? (From #916 (comment))

@k82cn (Collaborator) commented Mar 20, 2019

Any more pending work for this item?

@richardsliu, no more items for the 0.5 release.

Are you working on SchedulingPolicy?

@johnugeorge, we'll do some investigation after the 0.5 release.

@richardsliu added area/0.6.0 and removed area/0.5.0 labels Mar 27, 2019

@richardsliu referenced this issue Mar 27, 2019: TFJob 1.0 #968 (open)

@richardsliu added this to To Do in TFJob-PyTorch 1.0 Apr 8, 2019

@richardsliu moved this from To Do to In Progress in TFJob-PyTorch 1.0 Apr 22, 2019

@jlewi (Collaborator) commented May 13, 2019

@johnugeorge @k82cn @richardsliu Is there more work to be done before we close out this issue? If there is more work, can we clarify what is needed and whether anyone is planning to take it on?

@richardsliu (Contributor) commented May 13, 2019

@k82cn Is there any more we want to do for 0.6?

Note that if the scheduling policy changes need to go into the common API repo, they can wait until after 0.7.

@jlewi moved this from In Progress to To Do in TFJob-PyTorch 1.0 May 13, 2019

@richardsliu removed this from To Do in TFJob-PyTorch 1.0 May 20, 2019
