Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add a new field SchedulerName to the controller config file format #398

Closed
wants to merge 1 commit into from

Conversation

mitake
Copy link
Contributor

@mitake mitake commented Feb 22, 2018

This commit adds a new field SchedulerName to the controller config
file format. The purpose of the field is specifying the scheduler name
of the pods created by tf-operator and let the scheduler (which
wouldn't be the default scheduler) handle them. It would be convenient
for letting kube-batchd (a component of kube-arbitrator) handle the
pods.

/cc @ScorpioCPH how do you think about this? This PR would conflict with your ongoing one #344 so I'll wait merging of your PR.


This change is Reviewable

@k8s-ci-robot
Copy link

Hi @mitake. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mitake
Copy link
Contributor Author

mitake commented Feb 22, 2018

Related PR of kube-arbitrator side: kubernetes-retired/kube-batch#170

@coveralls
Copy link

coveralls commented Feb 22, 2018

Pull Request Test Coverage Report for Build 865

  • 2 of 2 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 31.821%

Totals Coverage Status
Change from base Build 861: 0.07%
Covered Lines: 582
Relevant Lines: 1829

💛 - Coveralls

@jlewi
Copy link
Contributor

jlewi commented Feb 22, 2018

What are the pros/cons of making the scheduler a parameter of the operator as opposed to individual jobs?

@mitake
Copy link
Contributor Author

mitake commented Feb 23, 2018

Pros is making definitions of TFJob a little bit shorter: we don't need to write the SchedulerName field in the pod template of the job object. The example (https://github.com/kubeflow/tf-operator/blob/master/examples/tf_job.yaml) has three replica specs so we need to write three SchedulerName fields in the yaml file. It is usual for distributed TF learning tasks.

Also, I found that pods of tensorboard can be handled by kube-batchd if tf-operator does this (this PR isn't including this for now but it is easy to extend). TFJob doesn't have a template of the tensorboard. It would be hard to specify the SchedulerName for the pod.

Current version of this PR has a drawback that the SchedulerName field is overwritten unconditionally even if a user specify it. It can be avoided by a simple fix.

@mitake
Copy link
Contributor Author

mitake commented Feb 23, 2018

Probably extending the TensorBoardSpec (https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L107) would be enough for specifying SchedulerName field of tensorboard pods. But I think basically every pod of tensorflow and tensorboard should share the same SchedulerName. So specifying it as a controller option rather than the parameters of the individual pods would be easier?

This commit adds a new field SchedulerName to the controller config
file format. The purpose of the field is specifying the scheduler name
of the pods created by tf-operator and let the scheduler (which
wouldn't be the default scheduler) handle them. It would be convenient
for letting kube-batchd (a component of kube-arbitrator) handle the
pods.
@mitake
Copy link
Contributor Author

mitake commented Feb 23, 2018

Updated for:

  • not overwriting SchedulerName fields of TFJob's pods if the field is specified explicitly
  • overwriting SchedulerName fields of tensorboard's pod

@mitake
Copy link
Contributor Author

mitake commented Feb 23, 2018

Oops I found this issue now... #347

@jlewi
Copy link
Contributor

jlewi commented Feb 23, 2018

Why would the scheduler be a property of individual replicas as opposed to the job?

If we created a Scheduler field a property of the TFJob, then the TFJob operator could look at that field and figure out how to schedule the pods appropriately.

This would make it really easy for users to start experimenting with different schedulers since they wouldn't need to redeploy TFJob operator and jobs using different schedulers could live side by side.

@mitake
Copy link
Contributor Author

mitake commented Feb 23, 2018

If we created a Scheduler field a property of the TFJob, then the TFJob operator could look at that field and figure out how to schedule the pods appropriately.

This idea is much better than the current version of this PR. I'd like to implement it later. Thanks a lot for the feedback!

@jlewi
Copy link
Contributor

jlewi commented Feb 24, 2018

@mitake Do you still want to submit this PR? I'd prefer to just wait until we have a TFJob field.

@mitake
Copy link
Contributor Author

mitake commented Feb 26, 2018

@jlewi ok, I'll close this and create a new one for the TFJob field update.

@mitake mitake closed this Feb 26, 2018
@mitake mitake deleted the sched-name branch March 8, 2018 02:34
@mitake mitake restored the sched-name branch March 8, 2018 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants