-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Add a new field SchedulerName to the controller config file format #398
Conversation
Hi @mitake. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Related PR of kube-arbitrator side: kubernetes-retired/kube-batch#170 |
Pull Request Test Coverage Report for Build 865
💛 - Coveralls |
What are the pros/cons of making the scheduler a parameter of the operator as opposed to individual jobs? |
Pros is making definitions of TFJob a little bit shorter: we don't need to write the SchedulerName field in the pod template of the job object. The example (https://github.com/kubeflow/tf-operator/blob/master/examples/tf_job.yaml) has three replica specs so we need to write three SchedulerName fields in the yaml file. It is usual for distributed TF learning tasks. Also, I found that pods of tensorboard can be handled by kube-batchd if tf-operator does this (this PR isn't including this for now but it is easy to extend). TFJob doesn't have a template of the tensorboard. It would be hard to specify the SchedulerName for the pod. Current version of this PR has a drawback that the SchedulerName field is overwritten unconditionally even if a user specify it. It can be avoided by a simple fix. |
Probably extending the TensorBoardSpec (https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L107) would be enough for specifying SchedulerName field of tensorboard pods. But I think basically every pod of tensorflow and tensorboard should share the same SchedulerName. So specifying it as a controller option rather than the parameters of the individual pods would be easier? |
This commit adds a new field SchedulerName to the controller config file format. The purpose of the field is specifying the scheduler name of the pods created by tf-operator and let the scheduler (which wouldn't be the default scheduler) handle them. It would be convenient for letting kube-batchd (a component of kube-arbitrator) handle the pods.
Updated for:
|
Oops I found this issue now... #347 |
Why would the scheduler be a property of individual replicas as opposed to the job? If we created a Scheduler field a property of the TFJob, then the TFJob operator could look at that field and figure out how to schedule the pods appropriately. This would make it really easy for users to start experimenting with different schedulers since they wouldn't need to redeploy TFJob operator and jobs using different schedulers could live side by side. |
This idea is much better than the current version of this PR. I'd like to implement it later. Thanks a lot for the feedback! |
@mitake Do you still want to submit this PR? I'd prefer to just wait until we have a TFJob field. |
@jlewi ok, I'll close this and create a new one for the TFJob field update. |
This commit adds a new field SchedulerName to the controller config
file format. The purpose of the field is specifying the scheduler name
of the pods created by tf-operator and let the scheduler (which
wouldn't be the default scheduler) handle them. It would be convenient
for letting kube-batchd (a component of kube-arbitrator) handle the
pods.
/cc @ScorpioCPH how do you think about this? This PR would conflict with your ongoing one #344 so I'll wait merging of your PR.
This change is