
Create Pod instead of Job #344

Merged
jlewi merged 1 commit into master from ScorpioCPH:replace-job-with-pod on Mar 5, 2018

Conversation

@ScorpioCPH (Member) commented Jan 24, 2018

This PR is a part of #325:

  • rename jobName() to genName()
  • create Pod instead of Job

TODOs (in another PR):

  • use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
  • listen to Pod CRUD and update TFJob status as described in #314

@jlewi @gaocegege



@gaocegege (Member) commented Jan 24, 2018

Do we keep the trainer and just use pod instead of job in this PR?

@jlewi (Contributor) commented Jan 29, 2018

Can you explain how you are going to deal with pod terminations? For example, suppose a pod gets terminated because the machine it's running on becomes unavailable. How will this situation be handled?

@gaocegege (Member) commented

@jlewi

The pod event callback handler should be notified by the apiserver, and we could handle the failure of the pod there, I think.
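
A minimal sketch of that idea (not this PR's code; it assumes client-go shared informers and the pre-context-aware client-go API used in this repo):

package example

import (
	"log"
	"time"

	"k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPods registers callbacks that the apiserver-driven informer invokes when
// pods change, giving the operator a place to react to pod failures.
func watchPods(clientset kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*v1.Pod)
			if pod.Status.Phase == v1.PodFailed {
				log.Printf("pod %s/%s failed; trigger a reconcile here", pod.Namespace, pod.Name)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// A deleted pod (e.g. its node went away) also needs a reconcile.
		},
	})
	factory.Start(stopCh)
}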

@gaocegege (Member) commented

Hi, @ScorpioCPH

Any progress here?

@ScorpioCPH (Member, Author) commented

This is dependent on the discussion in #333.

@jlewi (Contributor) commented Feb 10, 2018

Why is this blocked on #333 ?

The discussion in #333 is to tackle TFJobStatus as part of the next version of the TFJob API. That will probably take some time.

Can we do the migration to direct management of pods using the current API?

@ScorpioCPH (Member, Author) commented

If we just want to replace Job with Pod, I think this PR is enough now.

Maybe we can add the event-driven logic in another PR.

@jlewi (Contributor) commented Feb 10, 2018

When the PR is ready please remove WIP from the title.

@coveralls commented Feb 10, 2018

Coverage Status

Coverage decreased (-0.4%) to 31.333% when pulling be20406 on ScorpioCPH:replace-job-with-pod into bde716e on tensorflow:master.

@ScorpioCPH changed the title from "[WIP] Create Pod instead of Job" to "Create Pod instead of Job" on Feb 10, 2018
@gaocegege (Member) left a comment


Some comments, and one question:

Do we need to update https://github.com/tensorflow/k8s/blob/master/examples/gke/TF%20on%20GKE.ipynb?

/cc @jlewi

}
} else {
-	s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created job: %v", createdJob.Name)
+	s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created Pod: %v", createdPod.Name)
}
}
return nil
@gaocegege (Member) commented on the diff

https://github.com/tensorflow/k8s/pull/344/files#diff-d2bc8c1807fa25d2b911d2d781f48a07R236

err = s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).DeleteCollection(&meta_v1.DeleteOptions{}, options)

I think we should update here, too.
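
For example, the Pod version would presumably look roughly like this (a sketch, assuming the same DeleteOptions and label-selector options as the Jobs call):

// Delete this replica's pods by label selector instead of its batch Jobs.
err = s.ClientSet.CoreV1().Pods(s.Job.job.ObjectMeta.Namespace).DeleteCollection(&meta_v1.DeleteOptions{}, options)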

@ScorpioCPH (Member, Author) replied

Yes, thanks, done.

@@ -359,7 +341,7 @@ func replicaStatusFromPodList(l v1.PodList, name tfv1alpha1.ContainerName) tfv1a
}

func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState {
-	j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.jobName(index), meta_v1.GetOptions{})
+	j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.genName(index), meta_v1.GetOptions{})
@gaocegege (Member) commented on the diff

I am wondering if we should replace BatchV1().Jobs() with CoreV1().Pods() here.
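
For example, something along these lines (a sketch mirroring the diff above):

// Look up the replica's pod rather than its batch Job.
p, err := s.ClientSet.CoreV1().Pods(s.Job.job.ObjectMeta.Namespace).Get(s.genName(index), meta_v1.GetOptions{})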

@jlewi (Contributor) commented Feb 10, 2018

We can probably just delete https://github.com/tensorflow/k8s/blob/master/examples/gke/TF%20on%20GKE.ipynb if it's broken.

@jlewi (Contributor) commented Feb 12, 2018

What are the failure/retry semantics?

Before we relied on the Job controller to create new pods as necessary. What happens now?

It looks like genName always creates a pod with the same name for replica index i. But what happens if a pod is terminated (e.g. a VM restarts)? We would either need to recreate the pod with a new name or delete the current pod before recreating it.

@jlewi (Contributor) commented Feb 12, 2018

/test all

@jlewi (Contributor) commented Feb 12, 2018

/test all

@gaocegege (Member) commented

@jlewi We are celebrating Chinese New Year these days, and we apologize that we may not be able to reply promptly.

What are the failure/retry semantics?

I think it should be handled by the pod restartPolicy.
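
For reference, restartPolicy is set on the pod spec, e.g. (a hypothetical sketch; the names and image are illustrative, and note the caveat in the next comment):

pod := &v1.Pod{
	ObjectMeta: meta_v1.ObjectMeta{Name: "tfjob-worker-0"},
	Spec: v1.PodSpec{
		// Restarts failed containers in place on the same node; it does not
		// reschedule the pod if the node itself becomes unavailable.
		RestartPolicy: v1.RestartPolicyOnFailure,
		Containers: []v1.Container{
			{Name: "tensorflow", Image: "tensorflow/tensorflow:1.5.0"},
		},
	},
}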

@jlewi (Contributor) commented Feb 13, 2018

Restart policy only applies to the containers within the pod (see the docs).

So I don't think it helps in the case where the pod itself is terminated; e.g. because the VM becomes unhealthy.

@gaocegege (Member) commented

@jlewi

Sorry for the misunderstanding. I think we have to handle it in the operator: we need to register a callback for pods, and if a pod's status changes we should do work similar to what the job controller in Kubernetes does.

@ScorpioCPH (Member, Author) commented Feb 14, 2018

But what happens if a pod is terminated (e.g. a VM restarts)?
failure/retry semantics

I think this is another topic (event-driven): we should register some EventHandlerFuncs to watch for changes in the Pods' status and take the appropriate action.

@jlewi (Contributor) commented Feb 14, 2018

Right now because we create Job controllers, we are resilient to unwanted pod terminations. If a pod is terminated (e.g. because a VM goes away) then the job controller will create a new pod to replace it.

So don't we need to provide the same level of reliability? Otherwise it would be a regression?

Why can't we handle this in the reconcile function?
https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/training.go#L343

Reconcile should be called periodically; so in the reconcile function why can't we check that all the requisite pods are running and if not create new ones?

This seems like a good idea even if we have event handlers. If we rely on event handlers and an event gets dropped, the job would be stuck. But as long as we have logic in Reconcile to get a job back to a healthy state, we can always recover.
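
A minimal sketch of that idea (not this PR's code), assuming each replica's pods carry labels we can select on:

package example

import (
	"k8s.io/api/core/v1"
	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcileReplica lists the replica's pods by label and creates a replacement
// when none is alive, so recovery does not depend on any single event arriving.
func reconcileReplica(client kubernetes.Interface, namespace, selector string, newPod func() *v1.Pod) error {
	pods, err := client.CoreV1().Pods(namespace).List(meta_v1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Status.Phase == v1.PodPending || p.Status.Phase == v1.PodRunning || p.Status.Phase == v1.PodSucceeded {
			return nil // this replica already has a live (or completed) pod
		}
	}
	// No usable pod found: create a fresh one (newPod should return a uniquely named pod).
	_, err = client.CoreV1().Pods(namespace).Create(newPod())
	return err
}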

@jlewi (Contributor) commented Feb 14, 2018

See #309 for more info about how the informer periodically sends Update events that we can use to trigger reconcile.

@ScorpioCPH (Member, Author) commented

@jlewi In fact, we currently call the Reconcile() function in the event handler; the code is here.

I can't see any difference between Reconcile() and the EventHandler.

@jlewi (Contributor) commented Feb 22, 2018

I can't see any difference between Reconcile() and the EventHandler.

I'm not following.

The lines of code you linked to (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller/controller.go#L123-L125) show that the informer will generate an Update event periodically, which will cause Reconcile to be called periodically.

So if the Reconcile creates any missing pods then the code should be resilient to failures.

But your Reconcile function always uses the same name for every pod it creates; i.e. the function TFReplicaSet.genName is a deterministic function of replica type, job type, index, and runtime id.

So master-0 would always correspond to some pod name "master-abcd-0".

So suppose we create pod "master-abcd-0" and then this pod is terminated because the VM restarts.
At this point the Reconcile function needs to create a new pod otherwise the job will never make progress. But the Reconcile function right now just calls create using the same name "master-abcd-0" which will return an already exists error.

This worked before when we were creating Job Controllers because the job controller would automatically create a new pod if a pod terminated and wasn't successful.

Now that we are creating the pods ourselves we need to do that.
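
One way to handle that, assuming we kept deterministic names (a sketch, not this PR's code), would be to detect the collision and clean up the dead pod:

package example

import (
	"k8s.io/api/core/v1"
	k8s_errors "k8s.io/apimachinery/pkg/api/errors"
	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOrReplacePod creates the pod; if a pod with the same name already exists
// but has failed, it deletes the stale pod so the next Reconcile can recreate it.
func createOrReplacePod(client kubernetes.Interface, namespace string, pod *v1.Pod) error {
	_, err := client.CoreV1().Pods(namespace).Create(pod)
	if !k8s_errors.IsAlreadyExists(err) {
		return err // nil on success, or a real error
	}
	existing, err := client.CoreV1().Pods(namespace).Get(pod.Name, meta_v1.GetOptions{})
	if err != nil {
		return err
	}
	if existing.Status.Phase == v1.PodFailed {
		// Remove the dead pod; a later Reconcile pass will create a fresh one.
		return client.CoreV1().Pods(namespace).Delete(pod.Name, &meta_v1.DeleteOptions{})
	}
	return nil
}

The PR ultimately takes the other route and appends a random suffix to the pod name instead (see the later comments).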

@ScorpioCPH (Member, Author) commented

@jlewi Addressed your comments (creating pods with a random UUID & reuse SyncPods/SyncServices), PTAL :)

@jlewi (Contributor) commented Mar 4, 2018

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/trainer/replicas.go, line 420 at r4 (raw file):

		// Label to get all pods of this TFReplicaType + index
		labels := s.Labels()

Why not just call GetSingleReplicaStatus and then decide based on the result whether or not to create a new pod?



@jlewi (Contributor) commented Mar 4, 2018

Review status: 0 of 3 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/trainer/replicas.go, line 510 at r4 (raw file):

func (s *TFReplicaSet) genPodName(index int32) string {
	// Generate a new pod name with random string
	return s.genName(index) + util.RandString(8)

Nit: please add a "-" between the name and the random string. Also, 5 characters of randomness is what K8s does, and that's probably sufficient.
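
In other words, roughly (a sketch of the suggested tweak, reusing util.RandString from the snippet above):

func (s *TFReplicaSet) genPodName(index int32) string {
	// Unique pod name: deterministic base, a separator, and 5 random characters
	// (the same amount of randomness Kubernetes uses for generated names).
	return s.genName(index) + "-" + util.RandString(5)
}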



@jlewi (Contributor) commented Mar 4, 2018

A couple of minor comments. The biggest issue is that Travis is still failing. It would be nice to reuse GetSingleReplicaStatus if it makes sense, but I'm fine with the current code if you think it's better.

So fixing Travis is the only blocking change at this point.

@ScorpioCPH force-pushed the replace-job-with-pod branch 2 times, most recently from 61309e8 to 3e60600 on March 5, 2018 02:35
@k8s-ci-robot commented Mar 5, 2018

@ScorpioCPH: The following test failed, say /retest to rerun them all:

Test name: tf-k8s-presubmit | Commit: 6e1c2b6 | Details: link | Rerun command: /test tf-k8s-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ScorpioCPH (Member, Author) commented

/retest

@ScorpioCPH (Member, Author) commented

@jlewi Thanks, I will rethink GetSingleReplicaStatus in v1alpha2 :)

And I think CI is OK now.

@jlewi (Contributor) commented Mar 5, 2018

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


pkg/trainer/replicas.go, line 420 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why not just call GetSingleReplicaStatus and then decide based on the result whether or not to create a new pod?

Per discussion elsewhere, will reconsider this as part of v1alpha2.



@jlewi (Contributor) commented Mar 5, 2018

Reviewed 1 of 2 files at r5.
Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.



@jlewi (Contributor) commented Mar 5, 2018

/lgtm
/approve

@k8s-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi (Contributor) commented Mar 5, 2018

@ScorpioCPH Thank you.

@gaocegege (Member) commented

Reviewed 1 of 2 files at r3, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions.



@gaocegege (Member) commented

Could we merge it now? @jlewi

@jlewi (Contributor) commented Mar 5, 2018

@gaocegege Yeah. I was hoping auto merge would work but I think it's being blocked by the reviewable failed status check. I'll merge it manually.

@jlewi merged commit 6706903 into kubeflow:master on Mar 5, 2018
jimexist pushed a commit to jimexist/tf-operator that referenced this pull request Mar 7, 2018
This PR is a part of kubeflow#325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
listen to Pod CRUD and update TFJob status as described in kubeflow#314
jlewi added a commit to jlewi/k8s that referenced this pull request Mar 23, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Fix kubeflow#500
jlewi added a commit to jlewi/k8s that referenced this pull request Mar 25, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.
k8s-ci-robot pushed a commit that referenced this pull request Mar 26, 2018
* Fix bug with jobs not being marked as completed.

* A bug was introduced with getting the replica status in #344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end up blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and replica.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.

* Checking the deletion timestamp doesn't appear to be sufficient.

Use the Phase to determine whether we should create resources.

* Run gofmt.

* * Reset the rate limiter after every successful sync.
* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.

* Run goimports to fix lint issues.

* * Reconcile needs to update the TFJob stored in TrainingJob. This ensures
  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propagated.

* * TrainingJob.update should log the value of the job not the pointer.

* Add more comments to the code.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018