
Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org #1492

Closed
jazka opened this issue Nov 29, 2021 · 16 comments


@jazka

jazka commented Nov 29, 2021

After creating a PyTorchJob, the status of the job pods stays Pending, and the training-operator controller throws the error below:

Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org "xxx-pytorchjob": the object has been modified; please apply your changes to the latest version and try again

Gang scheduling is enabled with Volcano.
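
For reference, one way to see where such a job is stuck is to inspect the PyTorchJob status and, since gang scheduling is used, the Volcano PodGroup created for it. The job name and namespace below are placeholders, not taken from this report:

```sh
# Placeholder job name and namespace; adjust to your setup.
kubectl -n <namespace> get pytorchjob xxx-pytorchjob -o yaml     # check .status.conditions
kubectl -n <namespace> get podgroups.scheduling.volcano.sh       # PodGroups used for gang scheduling
kubectl -n <namespace> describe podgroups.scheduling.volcano.sh  # look for unschedulable conditions/events
```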

@gaocegege
Member

It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?
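
For completeness, the full controller log can be pulled roughly as below; the deployment name and namespace assume a default training-operator install and may differ in your cluster:

```sh
# Assumes the default install (deployment "training-operator" in namespace "kubeflow").
kubectl -n kubeflow logs deployment/training-operator --tail=500
# The "the object has been modified" message is the standard Kubernetes
# optimistic-concurrency conflict; on its own it is usually harmless because
# the controller requeues the job and retries against the latest resourceVersion.
```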

@jazka
Author

jazka commented Nov 29, 2021

> It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?

Yes, it keeps trying to reconcile, but the job pods hang in Pending status, and there are no logs about this job on the volcano-scheduler pod.

@gaocegege
Member

May I ask which version you are using?

@jazka
Author

jazka commented Nov 29, 2021

The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure how to check the version from that, or whether there is another way to get the version.

@gaocegege
Member

/cc @Jeffwan @PatrickXYS Do you know about it?

Are you using it in AWS?

@jazka
Author

jazka commented Nov 29, 2021

If it is not the official version, I can switch to the official version and try again.

@jazka
Author

jazka commented Nov 29, 2021

> It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?

Attached the error logs:

[screenshot: 1638175051459.png]

@gaocegege
Member

Is there any pod created?

@jazka
Author

jazka commented Nov 29, 2021

Yes, the pods are created but hang in Pending status.

[screenshot: 1638177266263.png]

[screenshot: 1638177263745.png]

@gaocegege
Member

gaocegege commented Nov 29, 2021

Can you please show kubectl describe pods? I need to know why it hangs.
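
For example (pod names are placeholders following the <job-name>-<replica-type>-<index> pattern):

```sh
# Placeholder pod names and namespace.
kubectl -n <namespace> describe pod xxx-pytorchjob-master-0
kubectl -n <namespace> describe pod xxx-pytorchjob-worker-0
# Scheduling-related events can also be listed directly:
kubectl -n <namespace> get events --field-selector involvedObject.name=xxx-pytorchjob-master-0
```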

@jazka
Author

jazka commented Nov 29, 2021

I could not find any clue from the pod describe output. It is difficult for me to attach large images directly, so please open them via the link URLs.

master-0 pod:
https://l4x826wg3c.feishu.cn/file/boxcnvhEHBXXtnjE1s4qcMWSOme
https://l4x826wg3c.feishu.cn/file/boxcnMa1sYsODn7Weu3QRrT1Bqd

worker-0 pod:
https://l4x826wg3c.feishu.cn/file/boxcnDBuVOyg7YPUTA2ERxxXHNe
https://l4x826wg3c.feishu.cn/file/boxcn5pLXEE9kbuywU1P7yReRt4

@gaocegege
Member

I do not understand why there are no events on the pods. It's weird.

But I do not think it is related to the operator, since the pods are already created.

@jazka
Author

jazka commented Nov 29, 2021

Yes, it is weird. Some things I noticed that may be helpful:

  • The pending PyTorchJob fails after a long time (for the last job, about 60 min). One info log is "Ignoring inactive pod xxxx-pytorchjob-worker-0 in state Failed, deletion time "
  • It stays Pending if the training-operator pod is restarted.
  • It starts running if the volcano-scheduler pod is restarted, but there is no log about a pod scheduling error for this PyTorch job.

@gaocegege
Member

Personally, I think it may be related to Volcano, since it works after restarting the Volcano scheduler.
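
Before restarting, one hedged way to check what Volcano makes of the job is sketched below; the volcano-system namespace and deployment name assume a default Volcano install and may differ:

```sh
# Assumes a default Volcano install (deployment "volcano-scheduler" in namespace "volcano-system").
kubectl -n volcano-system logs deployment/volcano-scheduler --tail=500
# The PodGroup status/events show whether gang scheduling considers the job schedulable:
kubectl -n <namespace> describe podgroups.scheduling.volcano.sh
```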

@Jeffwan
Member

Jeffwan commented Nov 29, 2021

> The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure how to check the version from that, or whether there is another way to get the version.

760ac1171dd30039a7363ffa03c77454bd714da5 is the commit ID; you can search for it in the git log.
We can probably switch to version tags later for easier debugging.
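
For example, with a local clone of the repository, the commit can be mapped to its surrounding history and to any release tag that contains it:

```sh
# Run inside a clone of https://github.com/kubeflow/training-operator
git log -1 760ac1171dd30039a7363ffa03c77454bd714da5           # show the commit the image was built from
git tag --contains 760ac1171dd30039a7363ffa03c77454bd714da5   # list release tags that include it
```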

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Mar 2, 2022
stale bot closed this as completed Apr 17, 2022