Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the TfJob controller more event driven #314

Closed
jlewi opened this issue Jan 15, 2018 · 3 comments
Closed

Make the TfJob controller more event driven #314

jlewi opened this issue Jan 15, 2018 · 3 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jan 15, 2018

Right now the controller relies on the TrainingJob.reconcile being called frequently to check the state of the job and take any needed action.

In #308 it was suggested we adopt a more event driven design.

Here's the comment from @ScorpioCPH

  • Listen on Pods which created by TFJob controller.
    • This needs a little change to use Pod instead of Job.
    • And set OwnerReferences of each Pods to TFJob controller.
  • On Pod created/updated/deleted, get TFJob by parse OwnerReferences, set the TFJob.Status.TFClusterStatus map as mentioned here.
  • Listen on TFJobs changed (we update the status in previous).
  • Set the whole status of this TFJob due to the TFClusterStatus map (like what we do in Reconcile).
    • And update the TFJob.Status.Condition.
  • Terminated/Delete the TFJob is every Pod is completed.

I think its more complicated than that since we create other resources (e.g. services, config maps, etc...).

Its also not clear to me why the queue would get filled up since the number of items in the queue would be the same as number of jobs in the cluster.

@jlewi
Copy link
Contributor Author

jlewi commented Jan 25, 2018

Should we try to get this done by Kubecon?

@gaocegege
Copy link
Member

I think so, and we should refactor the test in trainer package to controller level. But I am not sure if we could finish it before kubecon

jlewi pushed a commit that referenced this issue Mar 5, 2018
This PR is a part of #325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
Listen Pod CRUD and update TFJob status which descried in #314
jimexist pushed a commit to jimexist/tf-operator that referenced this issue Mar 7, 2018
This PR is a part of kubeflow#325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
Listen Pod CRUD and update TFJob status which descried in kubeflow#314
@gaocegege
Copy link
Member

Closed by #492

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants