ml-pipeline-persistenceagent restarts forever #741

nareshganesan · 2019-01-27T08:45:02Z

Kubeflow v0.4.1
On Prem

ksonnet version: 0.13.1
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

All other components work well, but the ml-pipeline-persistenceagent keeps restarting forever.
Steps:

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

Logs from the ml-pipeline-persistenceagent pod:

W0127 08:20:37.093010       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-01-27T08:23:01Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"

It has already restarted a couple of times on a fresh cluster.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-h62lq   0/1       CrashLoopBackOff   6          26m

I manually hit the health check URL from one of my pods, and it works!

$ kubectl -n kubeflow exec -it mysql-xyxsdf /bin/bash
$ curl http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
{"commit_sha":"d9a1313b88d9a0db52792016f8faab91f9cb4bae"}
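A single successful curl does not rule out intermittent lookup failures, so the manual check above could be repeated a few times. The following is only a sketch: the function name `wait_for_healthz` and its retry defaults are illustrative, not anything the persistence agent uses. It relies on curl's documented exit codes (6 for "could not resolve host", 7 for "failed to connect") to tell DNS failures apart from connection failures.

```shell
# Poll a health endpoint a few times and report why each attempt failed.
# Arguments: URL, number of attempts (default 5), delay in seconds (default 2).
wait_for_healthz() {
    url=$1
    attempts=${2:-5}
    delay=${3:-2}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Capture curl's exit status without tripping "set -e" in callers.
        if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
            rc=0
        else
            rc=$?
        fi
        case $rc in
            0) echo "healthy after $i attempt(s)"; return 0 ;;
            6) echo "attempt $i: could not resolve host (DNS)" ;;
            7) echo "attempt $i: could not connect" ;;
            *) echo "attempt $i: curl exit code $rc" ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Inside a pod that has curl installed, it could be called as `wait_for_healthz http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz 10 3`.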

Please let me know.

neuromage · 2019-01-28T19:48:54Z

Looks like the ML Pipeline API server is failing. Maybe it has something to do with the fact that it's on-prem? Can you please report back what the API server logs say? Something like:

$ kubectl -n kubeflow logs $(kubectl -n kubeflow get pods -l app=ml-pipeline -o jsonpath='{.items[0].metadata.name}')

nareshganesan · 2019-01-30T10:09:10Z

@neuromage ,

Sorry for the late response. I see no output when I tail the 'ml-pipeline' API server logs.

$ kubectl -n kubeflow logs $(kubectl -n kubeflow get pods -l app=ml-pipeline -o jsonpath='{.items[0].metadata.name}')

Thanks for helping out.

paveldournov · 2019-02-04T03:04:02Z

@nareshganesan is the issue still happening? Can you see the error logs of the persistence agent?

/assign @nareshganesan

nareshganesan · 2019-02-04T05:24:42Z

@paveldournov

Yeah the issue is still happening.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-2kp9x   0/1       CrashLoopBackOff   334        1d

Persistence agent logs:

$ kubectl -n kubeflow logs -f pod/ml-pipeline-persistenceagent-5669f69cdd-2kp9x
W0204 05:17:12.550767       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-02-04T05:19:18Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"

Please let me know.

Thanks for helping out!

neuromage · 2019-02-06T17:12:10Z

@nareshganesan
That error indicates that the persistence agent is unable to connect to the ML Pipeline API server. But it appears that the API server is up and running, right? Can you run the following to confirm?

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline
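If the check needs to be scripted rather than read by eye, something like the sketch below might work. It reuses the `app=ml-pipeline` selector from the command above; the function names `api_server_ready_flags` and `all_ready` are made up for illustration.

```shell
# Print the "ready" flag of every container in the ml-pipeline pods,
# e.g. "true" or "true false".
api_server_ready_flags() {
    kubectl -n kubeflow get pods --selector=app=ml-pipeline \
        -o jsonpath='{.items[*].status.containerStatuses[*].ready}'
}

# Succeed only if the flag list is non-empty and contains no "false".
all_ready() {
    flags=$1
    [ -n "$flags" ] || return 1
    case " $flags " in
        *" false "*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Usage: `all_ready "$(api_server_ready_flags)" && echo "API server pods are ready"`.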

IronPan · 2019-02-07T17:58:28Z

@nareshganesan If you have verified that the API server is actually up and running, the failure might be caused by DNS resolution.
Could you verify that the Kubernetes DNS service is running in your cluster?

This link might be helpful for debugging:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#check-for-errors-in-the-dns-pod
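Roughly, the checks from that page could be sketched as below. Assumptions: the cluster's DNS pods carry the `k8s-app=kube-dns` label in `kube-system`, and the pod name `dns-test` and the `busybox:1.28` image are illustrative choices, not something confirmed for this cluster.

```shell
# List the cluster DNS pods (assumes the k8s-app=kube-dns label).
check_dns_pods() {
    kubectl -n kube-system get pods -l k8s-app=kube-dns
}

# Resolve a service name from a throwaway pod in the kubeflow namespace.
# Argument: name to look up (defaults to the ml-pipeline service).
check_service_lookup() {
    name=${1:-ml-pipeline.kubeflow.svc.cluster.local}
    kubectl -n kubeflow run dns-test --rm -i --restart=Never \
        --image=busybox:1.28 -- nslookup "$name"
}
```

Running `check_dns_pods` and then `check_service_lookup` a few times in a row may help show whether lookups fail consistently or only intermittently.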

nareshganesan · 2019-02-08T09:01:52Z

@IronPan

Thanks for your inputs.

The DNS service was the issue: name resolution was failing for my ml-pipeline-persistenceagent pod. Our current cluster was spun up using kubeadm (Kubernetes v1.11.6) without the CoreDNS feature gate flag. To validate, I spun up another cluster with the CoreDNS feature gate enabled (though kubeadm is supposed to use CoreDNS by default for Kubernetes 1.11 and later), and this solved the issue. I no longer see any restarts for ml-pipeline-persistenceagent.

# solution
kubeadm init --feature-gates CoreDNS=true
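To confirm the result on a cluster, a sketch like the following could be used. It assumes kubeadm labels the DNS deployment (kube-dns or CoreDNS) with `k8s-app=kube-dns`; the function names are illustrative.

```shell
# Print the name of the DNS deployment, e.g. "coredns" or "kube-dns"
# (assumes the k8s-app=kube-dns label on the deployment).
dns_deployment_names() {
    kubectl -n kube-system get deployments -l k8s-app=kube-dns \
        -o jsonpath='{.items[*].metadata.name}'
}

# Print the restart count of each persistence agent container.
agent_restart_counts() {
    kubectl -n kubeflow get pods \
        --selector=app=ml-pipeline-persistenceagent \
        -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
}
```

A restart count that stays at 0 over time would match what I see on the new cluster.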

Thanks @neuromage @paveldournov @IronPan 👍 I'll close the issue.

neuromage · 2019-02-08T19:47:43Z

Thanks for the update @nareshganesan! That'll be useful for us when debugging issues like this in the future as well.

nareshganesan mentioned this issue Jan 27, 2019

ml-pipeline-persistenceagent fails a few times. #624

Closed

k8s-ci-robot assigned nareshganesan Feb 4, 2019

paveldournov added the problems/bug label Feb 4, 2019

nareshganesan closed this as completed Feb 8, 2019

ashahba mentioned this issue Dec 5, 2019

"ml-pipeline-persistenceagent" and "ml-pipeline" pods consistently reboot after v0.7.0 Kubeflow install. #2699

Closed
