Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a field SchedulerName to TFJob for specifying a scheduler #408

Merged
merged 1 commit into from Mar 7, 2018

Conversation

mitake
Copy link
Contributor

@mitake mitake commented Feb 26, 2018

This commit adds a new field SchedulerName to the definition of TFJob.
The purpose of the field is specifying the scheduler name of the pods
created by tf-operator and let the scheduler (which wouldn't be the
default scheduler) handle them. It would be convenient for letting
kube-batchd (a component of kube-arbitrator) handle the pods.

/cc @jlewi this is a newer version of #398


This change is Reviewable

@coveralls
Copy link

coveralls commented Feb 26, 2018

Coverage Status

Coverage increased (+0.2%) to 28.606% when pulling 445f01b on mitake:sched-name-tfjob into 997c583 on kubeflow:master.

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to define a default scheduler for TFJob?

@mitake
Copy link
Contributor Author

mitake commented Feb 26, 2018

@gaocegege probably we don't need to define. If the field is empty, apiserver will supply the name of the default scheduler (default-scheduler) to a pod.

@gaocegege
Copy link
Member

/ok-to-test

@jlewi
Copy link
Contributor

jlewi commented Feb 26, 2018

Why is scheduler name a property of the replica vs. a property of the job?

I was expecting this PR to enable testing of the use of kube-arbitrartor. I was expecting that to be a job level setting as opposed to a replica level setting.

@gaocegege
Copy link
Member

I think all the TFJob's replicas will be scheduled by one scheduler, which may be kube-arbitrator. Then Job level SGTM.

@mitake
Copy link
Contributor Author

mitake commented Feb 27, 2018

I was expecting this PR to enable testing of the use of kube-arbitrartor. I was expecting that to be a job level setting as opposed to a replica level setting.

This PR provides a job level setting. The job level setting will be propagated to the replica level setting during the creation process. If I'm misunderstanding something, sorry for that and please point out.

@jlewi
Copy link
Contributor

jlewi commented Feb 27, 2018

Thanks. Looks good; must have misread it before.

Should we add a unittest to verify pod scheduler is set properly?

@mitake
Copy link
Contributor Author

mitake commented Feb 27, 2018

Yes, the unittest would be valuable. I'll add it tomorrow.

@@ -178,6 +178,8 @@ func (s *TBReplicaSet) getDeploymentSpecTemplate(image string) v1.PodTemplateSpe

ps.Volumes = append(ps.Volumes, s.Spec.Volumes...)

ps.SchedulerName = s.Job.SchedulerName()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no TensorBoard now so we could remove the changes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll remove it in the next update. Thanks.

@mitake
Copy link
Contributor Author

mitake commented Feb 28, 2018

@jlewi @gaocegege updated for

  • unit test
  • removing the change related to tensorboard

Could you take a look?

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@gaocegege
Copy link
Member

/lgtm

@wackxu
Copy link
Contributor

wackxu commented Mar 1, 2018

You should run hack/update-codegen.sh to generate code and we should add hack/verify-codegen.sh into our test @gaocegege @jlewi

@gaocegege
Copy link
Member

@wackxu

Thanks and I agree that. I will file an issue for it.

@mitake
Copy link
Contributor Author

mitake commented Mar 1, 2018

Probably the CI is failing because of invalid format of python files. I created another PR for the issue: #429

@jlewi
Copy link
Contributor

jlewi commented Mar 1, 2018

@ScorpioCPH @mitake This change likely conflicts with #344. Which PR should be submitted first

/hold

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitake Thanks for your PR, LGTM for adding this field.
But i think we don't need a special Getter for SchedulerName.

@@ -405,3 +405,7 @@ func (j *TrainingJob) name() string {
func (j *TrainingJob) fullname() string {
return j.job.ObjectMeta.GetNamespace() + ":" + j.job.ObjectMeta.GetName()
}

func (j *TrainingJob) SchedulerName() string {
return j.job.Spec.SchedulerName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a special Getter for SchedulerName?
We can just get it directly form j.job.Spec.SchedulerName.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because TrainingJob.job is a private member so TrainingJob should provide the accessor method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current structure is a little complex:

  • We have TFJob for API object Spec/Status
  • Then create TrainingJob and TFReplicaSet which keep many cache of TFJob

And PodSpec already have a field SchedulerName, can we reuse it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PodSpec.SchedulerName isn't suitable for the purpose. See the discussion start with this message #398 (comment)

@gaocegege
Copy link
Member

@mitake Thanks for your contribution! Could you rebase the master and I think we could merge it ASAP.

This commit adds a new field SchedulerName to the definition of TFJob.
The purpose of the field is specifying the scheduler name of the pods
created by tf-operator and let the scheduler (which wouldn't be the
default scheduler) handle them. It would be convenient for letting
kube-batchd (a component of kube-arbitrator) handle the pods.
@mitake
Copy link
Contributor Author

mitake commented Mar 7, 2018

@gaocegege rebased on the latest master, PTAL

@gaocegege
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Copy link
Member

@ScorpioCPH Do you want to have another review?

@gaocegege
Copy link
Member

I will merge it soon since there is no more review.

@gaocegege gaocegege merged commit a4b8031 into kubeflow:master Mar 7, 2018
@mitake mitake deleted the sched-name-tfjob branch March 7, 2018 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants