
Training-operator pod CrashLoopBackOff in K8s v1.23.6 with Kubeflow 1.6.1 #1693

Closed
NettrixTobin opened this issue Nov 22, 2022 · 8 comments

@NettrixTobin

NettrixTobin commented Nov 22, 2022

```
root@master:~# kubectl logs -f training-operator-5cc8cdfdd6-xz5qq -n kubeflow
I1122 01:52:15.291326       1 request.go:601] Waited for 1.011932954s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/serving.kserve.io/v1alpha1?timeout=32s
1.6690819362796633e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6690819362820742e+09  INFO    setup   starting manager
1.6690819363789077e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6690819363790262e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6690819363792255e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6690819363792121e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.669081936379304e+09   INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.669081936379264e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6690819363793213e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.669081936379366e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6690819363793771e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6690819363793895e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6690819363794024e+09  INFO    Starting Controller     {"controller": "pytorchjob-controller"}
1.6690819363793309e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.669081936379433e+09   INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6690819363794458e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.6690819363794565e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6690819363793914e+09  INFO    Starting Controller     {"controller": "mxjob-controller"}
1.6690819363794715e+09  INFO    Starting Controller     {"controller": "mpijob-controller"}
1.6690819363794422e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.6690819363794968e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.669081936379517e+09   INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.6690819363795297e+09  INFO    Starting Controller     {"controller": "tfjob-controller"}
1.66908193637956e+09    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6690819363796897e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6690819363797452e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6690819363797586e+09  INFO    Starting Controller     {"controller": "xgboostjob-controller"}
I1122 01:52:17.578664       1 trace.go:205] Trace[1095108423]: "DeltaFIFO Pop Process" ID:gpu-operator-resources/default,Depth:59,Reason:slow event handlers blocking the queue (22-Nov-2022 01:52:17.182) (total time: 295ms):
Trace[1095108423]: [295.716397ms] [295.716397ms] END
I1122 01:52:17.679340       1 trace.go:205] Trace[935737529]: "DeltaFIFO Pop Process" ID:kube-system/token-cleaner,Depth:58,Reason:slow event handlers blocking the queue (22-Nov-2022 01:52:17.578) (total time: 100ms):
```
@kuizhiqing
Member

Is there any crash log? Maybe try `kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow`.
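
It may also help to look at the pod's events and last termination state, e.g. (a sketch, using the pod name from above):

```
# show container state, restart count, and last termination reason for the crashing pod
kubectl -n kubeflow describe pod training-operator-5cc8cdfdd6-xz5qq
# list events scoped to that pod (back-off restarts, probe failures, scheduling issues)
kubectl -n kubeflow get events --field-selector involvedObject.name=training-operator-5cc8cdfdd6-xz5qq
```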

@NettrixTobin
Author

NettrixTobin commented Nov 22, 2022

@kuizhiqing, I gave it a try and the output is as follows:


root@master:~# kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow
I1122 03:15:14.029800       1 request.go:601] Waited for 1.049849815s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/messaging.knative.dev/v1?timeout=32s
1.6690869151348069e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6690869152808545e+09  INFO    setup   starting manager
1.6690869152815266e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6690869152815475e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.669086915281704e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6690869152818277e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6690869152818406e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6690869152818534e+09  INFO    Starting Controller     {"controller": "pytorchjob-controller"}
1.6690869152818773e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6690869152819426e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.6690869152819917e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6690869152820137e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6690869152820244e+09  INFO    Starting Controller     {"controller": "xgboostjob-controller"}
1.6690869152819967e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.669086915282056e+09   INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.6690869152820652e+09  INFO    Starting Controller     {"controller": "tfjob-controller"}
1.6690869152821472e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6690869152822428e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.6690869152822747e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6690869152822936e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6690869152823093e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.66908691528229e+09    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6690869152823222e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6690869152823365e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.669086915282339e+09   INFO    Starting Controller     {"controller": "mpijob-controller"}
1.6690869152823498e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6690869152823596e+09  INFO    Starting Controller     {"controller": "mxjob-controller"}
I1122 03:15:16.378733       1 trace.go:205] Trace[535296393]: "DeltaFIFO Pop Process" ID:kubeflow/argo-role,Depth:15,Reason:slow event handlers blocking the queue (22-Nov-2022 03:15:16.081) (total time: 297ms):
Trace[535296393]: [297.332355ms] [297.332355ms] END

The other pods are running fine:

root@master:~# kubectl get po -A |grep kubeflow
kubeflow-user-example-com   ml-pipeline-ui-artifact-76474bc75f-w9qcx                          2/2     Running             4 (12h ago)       17h
kubeflow-user-example-com   ml-pipeline-visualizationserver-85f989dbfc-sbmq6                  2/2     Running             4 (12h ago)       17h
kubeflow                    admission-webhook-deployment-bb7c6b4d6-hkrj7                      1/1     Running             1 (12h ago)       19h
kubeflow                    cache-server-59bf8ff85d-wphp6                                     2/2     Running             7 (12h ago)       19h
kubeflow                    centraldashboard-8dc67db66-c79wv                                  2/2     Running             8 (12h ago)       19h
kubeflow                    jupyter-web-app-deployment-59c6bc85cc-nwk9r                       1/1     Running             2 (12h ago)       19h
kubeflow                    katib-controller-6478fbd64c-hjqhh                                 1/1     Running             3 (12h ago)       19h
kubeflow                    katib-db-manager-78fc8b7895-hhpdf                                 1/1     Running             22 (12h ago)      19h
kubeflow                    katib-mysql-6975d6c6c4-m5rdq                                      1/1     Running             2 (12h ago)       19h
kubeflow                    katib-ui-5cb6cc4d97-82tvk                                         1/1     Running             5 (12h ago)       19h
kubeflow                    kserve-controller-manager-0                                       2/2     Running             7 (12h ago)       19h
kubeflow                    kserve-models-web-app-5454bfdb86-h92kp                            2/2     Running             7 (12h ago)       19h
kubeflow                    kubeflow-pipelines-profile-controller-5b8474b7bc-msfl7            1/1     Running             2 (12h ago)       19h
kubeflow                    metacontroller-0                                                  1/1     Running             1 (12h ago)       19h
kubeflow                    metadata-envoy-deployment-6c6f8c6c59-r7sz5                        1/1     Running             4 (12h ago)       19h
kubeflow                    metadata-grpc-deployment-679b49cc95-hhcjg                         2/2     Running             22 (12h ago)      19h
kubeflow                    metadata-writer-d6567ddf6-8zkq4                                   2/2     Running             15 (12h ago)      19h
kubeflow                    minio-7955cfc9fc-v2vn4                                            2/2     Running             2 (12h ago)       19h
kubeflow                    ml-pipeline-5d6f7c985c-pczs7                                      2/2     Running             22 (12h ago)      19h
kubeflow                    ml-pipeline-persistenceagent-5544dd8bf4-8x5tx                     2/2     Running             8 (12h ago)       19h
kubeflow                    ml-pipeline-scheduledworkflow-7d464d85bf-4cn9q                    2/2     Running             8 (12h ago)       19h
kubeflow                    ml-pipeline-ui-6576d6ddcb-xvng5                                   2/2     Running             8 (12h ago)       19h
kubeflow                    ml-pipeline-viewer-crd-59b9f99f9b-25f9c                           2/2     Running             9 (12h ago)       19h
kubeflow                    ml-pipeline-visualizationserver-7c7464896f-hn7zn                  2/2     Running             8 (12h ago)       19h
kubeflow                    mysql-75f4964b48-x557v                                            2/2     Running             2 (12h ago)       19h
kubeflow                    notebook-controller-deployment-68f88d5479-xhmd6                   2/2     Running             4 (12h ago)       19h
kubeflow                    profiles-deployment-6d754c7bc7-fcbjq                              3/3     Running             5 (12h ago)       19h
kubeflow                    tensorboard-controller-deployment-6d67f8bfff-xhn5g                3/3     Running             7 (12h ago)       19h
kubeflow                    tensorboards-web-app-deployment-8446c8f5b5-4zc4h                  1/1     Running             1 (12h ago)       19h
kubeflow                    training-operator-5cc8cdfdd6-xz5qq                                0/1     CrashLoopBackOff    177 (4m28s ago)   14h
kubeflow                    volumes-web-app-deployment-b579747b4-8mqv2                        1/1     Running             1 (12h ago)       19h
kubeflow                    workflow-controller-555f64865-66tsm                               2/2     Running             14 (12h ago)      19h

@johnugeorge
Member

@NettrixTobin Any other interesting logs? Is it a resource limits issue?

@NettrixTobin
Author

> @NettrixTobin Any other interesting logs? Is it a resource limits issue?

@johnugeorge Could you please give me some ideas or instructions? I did not find errors in any other pod.

@johnugeorge
Member

Is it an out-of-memory issue? I am not seeing any issues in the logs.
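
One way to check (a sketch; the pod name is taken from the listing above) is the container's last termination reason — `OOMKilled` would confirm it:

```
# prints why the previous container instance terminated, e.g. OOMKilled or Error
kubectl -n kubeflow get pod training-operator-5cc8cdfdd6-xz5qq \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```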

@kuizhiqing
Member

@NettrixTobin Maybe you can run the controller locally, i.e. compile and run `cmd/training-operator.v1/main.go` against your local kubeconfig, or just run `make run`. This can rule out RBAC or config version mismatch issues.
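
For reference, a rough sketch of that workflow (the tag is an assumption — check it against the version your Kubeflow 1.6.1 manifests deploy):

```
# assumes Go and a kubeconfig pointing at the cluster are available locally
git clone https://github.com/kubeflow/training-operator.git
cd training-operator
git checkout v1.5.0   # assumed tag; use the release bundled with your Kubeflow manifests
make run              # builds and runs the controller against the current kubeconfig context
```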

@caffeinism

I increased the resource limits and it worked.
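
For anyone hitting the same thing, something along these lines should work (a sketch — the container name and values are assumptions, adjust them to your manifests):

```
# bump the CPU/memory requests and limits on the training-operator deployment (illustrative values)
kubectl -n kubeflow patch deployment training-operator --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"training-operator","resources":{"requests":{"cpu":"500m","memory":"256Mi"},"limits":{"cpu":"1","memory":"512Mi"}}}]}}}}'
```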

@johnugeorge
Member

Closing it as resolved.
