Presubmit unhealthy? Networking issues #869

Closed
lluunn opened this issue May 24, 2018 · 4 comments

Comments

lluunn (Contributor) commented May 24, 2018

https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-presubmit

[Screenshot of the kubeflow-presubmit testgrid dashboard, taken 2018-05-24 14:19:53]

jlewi changed the title from "Presubmit unhealthy?" to "Presubmit unhealthy? Networking issues" on May 25, 2018
jlewi (Contributor) commented May 25, 2018

#866 is one PR where I observed a number of test flakes.

I'm observing a variety of failure modes:

Failure #1

TFJob client appears to hang trying to contact the K8s API server to get job status (kubeflow/training-operator#606).
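One quick way to separate a hung API server from a hung client (just a sketch; the namespace and job name below are placeholders, not taken from the test setup) is to query the TFJob directly with an explicit client-side timeout:

# Does the API server answer at all within 30s?
kubectl --request-timeout=30s get tfjobs -n <namespace>

# Inspect the status of the specific job the test polls
kubectl --request-timeout=30s describe tfjob <job-name> -n <namespace>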

Failure #2

TFServing fails

INFO|2018-05-25T04:54:27|/mnt/test-data-volume/kubeflow-presubmit-tf-serving-image-866-b0e8431-1664-9878/src/kubeflow/kubeflow/testing/test_tf_serving.py|99| prediction failed: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed"). Retrying...

Both of these are suggestive of some form of networking issue.
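A quick sanity check for in-cluster DNS/connectivity (a generic sketch, not something the test harness runs) is to spin up a throwaway pod and resolve a known service:

# Run a temporary busybox pod and try resolving the API server's service name
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernetes.default

# If resolution fails or times out here, kube-dns is a likely suspect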

jlewi mentioned this issue May 25, 2018
jlewi (Contributor) commented May 25, 2018

It looks like the kube-dns pods might be having some issues:

LAST SEEN   FIRST SEEN   COUNT     NAME                                                KIND      SUBOBJECT                   TYPE      REASON      SOURCE                                                     MESSAGE
1s          23m          2         heapster-v1.5.0-5554f4f6fc-9lf9l.1531ca827f9b15d2   Pod       spec.containers{heapster}   Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-bj6f   Liveness probe failed: Get http://10.36.4.93:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m          21h          16        kube-dns-7fcdddb4c5-j9r5r.153183e33edf80ee          Pod       spec.containers{kubedns}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-x2pr   Liveness probe failed: HTTP probe failed with statuscode: 503
5m          3h           5         kube-dns-7fcdddb4c5-rp22p.1531bede709fd467          Pod       spec.containers{sidecar}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-dzvl   Liveness probe failed: Get http://10.36.6.109:10054/metrics: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
15m         3h           7         kube-dns-7fcdddb4c5-rp22p.1531bef68c257721          Pod       spec.containers{dnsmasq}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-dzvl   Liveness probe failed: HTTP probe failed with statuscode: 503

Although I guess it's possible that this is yet another problem caused by the networking issues.
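For reference, warning events like the ones above, plus the state of the kube-dns pods themselves, can be pulled with something along these lines (the pod name is a placeholder):

# List recent warning events across all namespaces
kubectl get events --all-namespaces | grep -i unhealthy

# Check the kube-dns pods, then tail the kubedns container's logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <kube-dns-pod> -c kubedns --tail=100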

jlewi (Contributor) commented May 25, 2018

I'm going to try deleting all the VMs in the cluster. They should get recreated, and hopefully any transient issues will be resolved once the pods are rescheduled.
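On GKE the nodes belong to a managed instance group, so deleting the underlying VMs should trigger automatic replacements. A rough per-node sketch (node name and zone are placeholders):

# Move pods off the node first
kubectl drain <node-name> --ignore-daemonsets

# Delete the VM; the node pool's instance group recreates it
gcloud compute instances delete <node-name> --zone=<zone>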

lluunn (Contributor, Author) commented May 25, 2018

Thanks Jeremy, I think it's good now.

lluunn closed this as completed May 25, 2018