unable to fetch TFJob when I use client.go run tfjob #1612

Closed
goodpp opened this issue Jun 13, 2022 · 1 comment
Comments

goodpp commented Jun 13, 2022

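I originally wrote this in Chinese so that I could describe the problem precisely.
When I run a TFJob using client.go, the TFJob completes successfully and then the related TFJob and pod data are cleaned up automatically. At that point the training-operator log shows TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found. How can I find out what is causing "unable to fetch TFJob xxx"?

PS: When I run the same TFJob with kubectl, it finishes normally. The problem only appears when I run the TFJob with client.go.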

k8s version: 1.20
client.go version: 1.21

Relevant logs:
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="Ignoring inactive pod aios/tjob-tf1-ps-demo-10-1-0-0-worker-0 in state Succeeded, deletion time "
time="2022-06-13T06:51:01Z" level=info msg="Pod: aios.tjob-tf1-ps-demo-10-1-0-0-worker-0 exited with code 0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="TFJob=aios/tjob-tf1-ps-demo-10-1-0-0, ReplicaType=PS expected=1, running=1, failed=0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="TFJob=aios/tjob-tf1-ps-demo-10-1-0-0, ReplicaType=Worker expected=1, running=1, failed=0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.777Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"79919"}, "reason": "ExitedWithCode", "message": "Pod: aios.tjob-tf1-ps-demo-10-1-0-0-worker-0 exited with code 0"}
2022-06-13T06:51:01.778Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"79919"}, "reason": "TFJobSucceeded", "message": "TFJob aios/tjob-tf1-ps-demo-10-1-0-0 successfully completed."}
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (10.619824ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting pod aios/tjob-tf1-ps-demo-10-1-0-0-worker-1" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.800Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: tjob-tf1-ps-demo-10-1-0-0-worker-1"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting service aios/tjob-tf1-ps-demo-10-1-0-0-worker-1"
2022-06-13T06:51:01.809Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: tjob-tf1-ps-demo-10-1-0-0-worker-1"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.817Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: tjob-tf1-ps-demo-10-1-0-0-ps-0"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting service aios/tjob-tf1-ps-demo-10-1-0-0-ps-0"
2022-06-13T06:51:01.825Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: tjob-tf1-ps-demo-10-1-0-0-ps-0"}
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (7.262213ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (3.70475ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=warning msg="Reconcile Tensorflow Job error Operation cannot be fulfilled on tfjobs.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0": the object has been modified; please apply your changes to the latest version and try again"
2022-06-13T06:51:01.839Z ERROR controller-runtime.manager.controller.tfjob-controller Reconciler error {"name": "tjob-tf1-ps-demo-10-1-0-0", "namespace": "aios", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:10.096Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:10.102Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:10.105Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:40.214Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}

goodpp commented Jun 13, 2022

I'm sorry, please ignore or close this issue! It was a problem in our own application.

goodpp closed this as completed Jun 13, 2022