Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trainning-operator may need to monitor PodGroup #1574

Closed
qiankunli opened this issue Apr 16, 2022 · 5 comments
Closed

trainning-operator may need to monitor PodGroup #1574

qiankunli opened this issue Apr 16, 2022 · 5 comments

Comments

@qiankunli
Copy link
Contributor

I create a pytorchjob, then pytorch-controller create a PodGroup. The PodGroup was pending state at the beginning, so no pod was created for the pytorchjob. After the PodGroup became the inqueue state, It seems that the controller is not listening to the change of Podgroup, so that the reconcile logic could not be triggered, and the job had no pod.

@cheimu
Copy link
Member

cheimu commented Apr 16, 2022

Hi @qiankunli . I don't know if we are on the same page, but I think the base job controller has already tried to reconcile podGroup https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256

@qiankunli
Copy link
Contributor Author

qiankunli commented Apr 17, 2022

@cheimu I mean, if the state of podgroup changed from pending to inqueue, the reconcile should be triggered. when podgroup is pending, https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256 break the reconcile. when podgroup is inqueue, the reconcile can not be triggered, so the ReconcilePods and ReconcileServices can not be reached.

@zw0610
Copy link
Member

zw0610 commented Apr 17, 2022

It is true that PodGroup is not watched to trigger reconciliation but the controller does resync after certain period. You may change to a short time for the resynchronization.

@gaocegege
Copy link
Member

I do not think the resync works here since podgroup is per job and should be watched in the controller.

@johnugeorge
Copy link
Member

Closing this issue as feature is implemented as part of #1724 #1666
/close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants