New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the TfJob controller more event driven #314

Closed
jlewi opened this Issue Jan 15, 2018 · 3 comments

Comments

Projects
None yet
2 participants
@jlewi
Copy link
Collaborator

jlewi commented Jan 15, 2018

Right now the controller relies on the TrainingJob.reconcile being called frequently to check the state of the job and take any needed action.

In #308 it was suggested we adopt a more event driven design.

Here's the comment from @ScorpioCPH

  • Listen on Pods which created by TFJob controller.
    • This needs a little change to use Pod instead of Job.
    • And set OwnerReferences of each Pods to TFJob controller.
  • On Pod created/updated/deleted, get TFJob by parse OwnerReferences, set the TFJob.Status.TFClusterStatus map as mentioned here.
  • Listen on TFJobs changed (we update the status in previous).
  • Set the whole status of this TFJob due to the TFClusterStatus map (like what we do in Reconcile).
    • And update the TFJob.Status.Condition.
  • Terminated/Delete the TFJob is every Pod is completed.

I think its more complicated than that since we create other resources (e.g. services, config maps, etc...).

Its also not clear to me why the queue would get filled up since the number of items in the queue would be the same as number of jobs in the cluster.

@jlewi

This comment has been minimized.

Copy link
Collaborator

jlewi commented Jan 25, 2018

Should we try to get this done by Kubecon?

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented Feb 11, 2018

I think so, and we should refactor the test in trainer package to controller level. But I am not sure if we could finish it before kubecon

jlewi added a commit that referenced this issue Mar 5, 2018

Create pod instead of job (#344)
This PR is a part of #325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
Listen Pod CRUD and update TFJob status which descried in #314

Jimexist pushed a commit to Jimexist/tf-operator that referenced this issue Mar 7, 2018

Create pod instead of job (kubeflow#344)
This PR is a part of kubeflow#325:

rename jobName() to genName()
create Pod instead of Job

TODOs (in another PR):

use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod
Listen Pod CRUD and update TFJob status which descried in kubeflow#314

@jlewi jlewi added the priority/p2 label Mar 7, 2018

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented Apr 22, 2018

Closed by #492

@gaocegege gaocegege closed this Apr 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment