
Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org #1492

Closed
jazka opened this issue Nov 29, 2021 · 16 comments


@jazka

jazka commented Nov 29, 2021

After creating a PyTorchJob, the status of the job pods stays Pending, and the training-operator controller throws the error below:

Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org "xxx-pytorchjob": the object has been modified; please apply your changes to the latest version and try again

Gang scheduling is enabled with Volcano.
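
For reference, one way to see where such a job is stuck is to inspect the PyTorchJob status and, since gang scheduling is used, the Volcano PodGroup created for it. The job name and namespace below are placeholders, not taken from this report:

```sh
# Placeholder job name and namespace; adjust to your setup.
kubectl -n <namespace> get pytorchjob xxx-pytorchjob -o yaml     # check .status.conditions
kubectl -n <namespace> get podgroups.scheduling.volcano.sh       # PodGroups used for gang scheduling
kubectl -n <namespace> describe podgroups.scheduling.volcano.sh  # look for unschedulable conditions/events
```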

@gaocegege
Member

It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?
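
For completeness, the full controller log can be pulled roughly as below; the deployment name and namespace assume a default training-operator install and may differ in your cluster:

```sh
# Assumes the default install (deployment "training-operator" in namespace "kubeflow").
kubectl -n kubeflow logs deployment/training-operator --tail=500
# The "the object has been modified" message is the standard Kubernetes
# optimistic-concurrency conflict; on its own it is usually harmless because
# the controller requeues the job and retries against the latest resourceVersion.
```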

@jazka
Author

jazka commented Nov 29, 2021

> It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?

Yes, it keeps trying to reconcile, but the job pods hang in Pending status, and there are no logs about this job on the volcano-scheduler pod.

@gaocegege
Member

May I ask which version you are using?

@jazka
Author

jazka commented Nov 29, 2021

The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure how to check the version from that, or whether there is another way to get the version.

@gaocegege
Member

/cc @Jeffwan @PatrickXYS Do you know about it?

Are you using it in AWS?

@jazka
Author

jazka commented Nov 29, 2021

If it is not the official version, I can switch to the official version and try again.

@jazka
Author

jazka commented Nov 29, 2021

> It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?

Attached the error logs:

[screenshot: 1638175051459.png]

@gaocegege
Member

Is there any pod created?

@jazka
Author

jazka commented Nov 29, 2021

Yes, the pods are created but hang in Pending status.

[screenshot: 1638177266263.png]

[screenshot: 1638177263745.png]

@gaocegege
Member

gaocegege commented Nov 29, 2021

Can you please show kubectl describe pods? I need to know why it hangs.
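
For example (pod names are placeholders following the <job-name>-<replica-type>-<index> pattern):

```sh
# Placeholder pod names and namespace.
kubectl -n <namespace> describe pod xxx-pytorchjob-master-0
kubectl -n <namespace> describe pod xxx-pytorchjob-worker-0
# Scheduling-related events can also be listed directly:
kubectl -n <namespace> get events --field-selector involvedObject.name=xxx-pytorchjob-master-0
```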

@jazka
Author

jazka commented Nov 29, 2021

I could not find any clue from the pod describe output. It is difficult for me to attach large images directly, so please open them via the link URLs.

master-0 pod:
https://l4x826wg3c.feishu.cn/file/boxcnvhEHBXXtnjE1s4qcMWSOme
https://l4x826wg3c.feishu.cn/file/boxcnMa1sYsODn7Weu3QRrT1Bqd

worker-0 pod:
https://l4x826wg3c.feishu.cn/file/boxcnDBuVOyg7YPUTA2ERxxXHNe
https://l4x826wg3c.feishu.cn/file/boxcn5pLXEE9kbuywU1P7yReRt4

@gaocegege
Member

I do not understand why there are no events on the pods. It's weird.

But I do not think it is related to the operator, since the pods are already created.

@jazka
Author

jazka commented Nov 29, 2021

Yes, it is weird. Some things I noticed that may be helpful:

  • The pending PyTorchJob fails after a long time (for the last job, about 60 min). One info log is "Ignoring inactive pod xxxx-pytorchjob-worker-0 in state Failed, deletion time "
  • It stays Pending if the training-operator pod is restarted.
  • It starts running if the volcano-scheduler pod is restarted, but there is no log about a pod scheduling error for this PyTorch job.

@gaocegege
Member

Personally, I think it may be related to Volcano, since it works after restarting the Volcano scheduler.
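
Before restarting, one hedged way to check what Volcano makes of the job is sketched below; the volcano-system namespace and deployment name assume a default Volcano install and may differ:

```sh
# Assumes a default Volcano install (deployment "volcano-scheduler" in namespace "volcano-system").
kubectl -n volcano-system logs deployment/volcano-scheduler --tail=500
# The PodGroup status/events show whether gang scheduling considers the job schedulable:
kubectl -n <namespace> describe podgroups.scheduling.volcano.sh
```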

@Jeffwan
Member

Jeffwan commented Nov 29, 2021

> The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure how to check the version from that, or whether there is another way to get the version.

760ac1171dd30039a7363ffa03c77454bd714da5 is the commit ID; you can search for it in the git log.
We can probably switch to version tags later for easier debugging.
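
For example, with a local clone of the repository, the commit can be mapped to its surrounding history and to any release tag that contains it:

```sh
# Run inside a clone of https://github.com/kubeflow/training-operator
git log -1 760ac1171dd30039a7363ffa03c77454bd714da5           # show the commit the image was built from
git tag --contains 760ac1171dd30039a7363ffa03c77454bd714da5   # list release tags that include it
```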

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Mar 2, 2022
stale bot closed this as completed Apr 17, 2022