ml-pipeline-persistenceagent restarts forever #741

Closed
nareshganesan opened this issue Jan 27, 2019 · 8 comments

@nareshganesan

Kubeflow v0.4.1
On Prem

ksonnet version: 0.13.1
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

All other components work well, but the ml-pipeline-persistenceagent keeps restarting forever.
Steps:

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

Logs from the ml-pipeline-persistenceagent pod:

W0127 08:20:37.093010       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-01-27T08:23:01Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"
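
The repeated "dial tcp: lookup ...: Try again" fragment in that log line points at name resolution rather than a refused connection. A minimal sketch for pulling the failing hostnames out of such logs follows. The name failed_lookups is a hypothetical helper, not part of Kubeflow or kubectl, and it assumes the log text uses the "lookup <host>:" form seen above.

```shell
# Hypothetical helper (not part of Kubeflow or kubectl):
# read log text on stdin and print each distinct hostname
# that appears after the word "lookup".
failed_lookups() {
  grep -o 'lookup [^ :]*' | awk '{print $2}' | sort -u
}
```

For example, piping the output of "kubectl -n kubeflow logs ml-pipeline-persistenceagent-5669f69cdd-h62lq" into failed_lookups would list each distinct host once.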

It has already restarted a couple of times on a fresh cluster.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-h62lq   0/1       CrashLoopBackOff   6          26m

I manually hit the health check URL from one of my pods, and it works!

$ kubectl -n kubeflow exec -it mysql-xyxsdf /bin/bash
$ curl http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
{"commit_sha":"d9a1313b88d9a0db52792016f8faab91f9cb4bae"}

Please let me know.

@neuromage
Contributor

Looks like ml pipeline API server is failing. Maybe something to do with the fact that it's on-prem? Can you please report back what the API server logs say? Something like:

$ kubectl -n kubeflow logs $(kubectl -n kubeflow get pods -l app=ml-pipeline -o jsonpath='{.items[0].metadata.name}')

@nareshganesan
Author

@neuromage ,

Sorry for the late response. I get empty output when I tail the ml-pipeline API server logs.

$ kubectl -n kubeflow logs $(kubectl -n kubeflow get pods -l app=ml-pipeline -o jsonpath='{.items[0].metadata.name}')

Thanks for helping out.

@paveldournov
Contributor

@nareshganesan is the issue still happening? Can you see the error logs of the persistence agent?

/assign @nareshganesan

@nareshganesan
Author

@paveldournov

Yeah the issue is still happening.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-2kp9x   0/1       CrashLoopBackOff   334        1d

persistenceagent logs

$ kubectl -n kubeflow logs -f pod/ml-pipeline-persistenceagent-5669f69cdd-2kp9x
W0204 05:17:12.550767       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-02-04T05:19:18Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"

Please let me know.

Thanks for helping out!

@neuromage
Contributor

@nareshganesan
That error indicates that the persistence agent is unable to connect to the ML Pipeline API server. But it appears that the API server is up and running, right? Can you run the following to confirm?

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline

@IronPan
Member

IronPan commented Feb 7, 2019

@nareshganesan If you have verified that the API server actually starts and is running, this might be caused by a DNS resolution failure.
Could you verify that the K8s DNS service is running in your cluster?

This link might be helpful for debugging:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#check-for-errors-in-the-dns-pod
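
A rough sketch of the kind of checks that guide walks through, wrapped in two shell functions. The function names, the dns-test pod name, and the busybox:1.28 image are assumptions for illustration. It also assumes the cluster DNS pods and service use the usual k8s-app=kube-dns label and kube-dns service name in kube-system.

```shell
# Sketch only: assumes the DNS addon lives in kube-system with the
# label k8s-app=kube-dns and a service named kube-dns.
check_cluster_dns() {
  # Are the DNS pods present and Running?
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
  # Is the DNS service defined?
  kubectl -n kube-system get svc kube-dns
  # Recent DNS pod logs; multi-container DNS pods may need -c <container>.
  kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
}

# Resolve a name from a throwaway pod in the kubeflow namespace.
# "dns-test" and the busybox image are illustrative choices.
resolve_from_pod() {
  host="${1:-ml-pipeline.kubeflow.svc.cluster.local}"
  kubectl -n kubeflow run dns-test --rm -i --restart=Never \
    --image=busybox:1.28 -- nslookup "$host"
}
```

Running check_cluster_dns first and then resolve_from_pod (optionally with another hostname as its argument) should show whether the lookup that the persistence agent reports as failing also fails from a fresh pod.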

@nareshganesan
Author

@IronPan

Thanks for your inputs.

The DNS service was the issue: it was not able to find my ml-persistentagent pod. Our current cluster was spun up using kubeadm (Kubernetes v1.11.6) without the CoreDNS feature gate flag. To validate,
I spun up another cluster with the CoreDNS feature gate enabled (though kubeadm is supposed to use CoreDNS for Kubernetes versions > 1.11), and this solved the issue. I do not see any restarts for ml-pipeline-persistenceagent.

# solution
kubeadm init --feature-gates CoreDNS=true
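
One way to confirm the fix afterwards is to read the restart counter of the persistence agent pod directly. This is a minimal sketch: restart_count is a hypothetical helper name, and it assumes a single pod with a single container matches the app=ml-pipeline-persistenceagent selector used earlier in this thread.

```shell
# Hypothetical helper: print the restart count of the first container
# in the first pod matching the persistence agent selector.
restart_count() {
  kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent \
    -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
}
```

Calling restart_count a few minutes apart and seeing the same number suggests the pod is no longer crash looping.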

Thanks @neuromage @paveldournov @IronPan 👍 I'll close the issue.

@neuromage
Contributor

Thanks for the update @nareshganesan! That'll be useful for us when debugging issues like this in the future as well.

Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024