Presubmit unhealthy? Networking issues #869

Closed
lluunn opened this issue May 24, 2018 · 4 comments

Comments

lluunn (Contributor) commented May 24, 2018

https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-presubmit

[Screenshot of the kubeflow-presubmit testgrid dashboard, taken 2018-05-24 14:19:53]

jlewi changed the title from "Presubmit unhealthy?" to "Presubmit unhealthy? Networking issues" on May 25, 2018
jlewi (Contributor) commented May 25, 2018

#866 is one PR where I observed a number of test flakes.

I'm observing a variety of failure modes:

Failure #1

TFJob client appears to hang trying to contact the K8s API server to get job status (kubeflow/training-operator#606).
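One quick way to separate a hung API server from a hung client (just a sketch; the namespace and job name below are placeholders, not taken from the test setup) is to query the TFJob directly with an explicit client-side timeout:

# Does the API server answer at all within 30s?
kubectl --request-timeout=30s get tfjobs -n <namespace>

# Inspect the status of the specific job the test polls
kubectl --request-timeout=30s describe tfjob <job-name> -n <namespace>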

Failure #2

TFServing fails

INFO|2018-05-25T04:54:27|/mnt/test-data-volume/kubeflow-presubmit-tf-serving-image-866-b0e8431-1664-9878/src/kubeflow/kubeflow/testing/test_tf_serving.py|99| prediction failed: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed"). Retrying...

Both of these are suggestive of some form of networking issue.
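A quick sanity check for in-cluster DNS/connectivity (a generic sketch, not something the test harness runs) is to spin up a throwaway pod and resolve a known service:

# Run a temporary busybox pod and try resolving the API server's service name
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernetes.default

# If resolution fails or times out here, kube-dns is a likely suspect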

jlewi mentioned this issue May 25, 2018
jlewi (Contributor) commented May 25, 2018

It looks like the kube-dns pods might be having some issues:

LAST SEEN   FIRST SEEN   COUNT     NAME                                                KIND      SUBOBJECT                   TYPE      REASON      SOURCE                                                     MESSAGE
1s          23m          2         heapster-v1.5.0-5554f4f6fc-9lf9l.1531ca827f9b15d2   Pod       spec.containers{heapster}   Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-bj6f   Liveness probe failed: Get http://10.36.4.93:8082/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m          21h          16        kube-dns-7fcdddb4c5-j9r5r.153183e33edf80ee          Pod       spec.containers{kubedns}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-x2pr   Liveness probe failed: HTTP probe failed with statuscode: 503
5m          3h           5         kube-dns-7fcdddb4c5-rp22p.1531bede709fd467          Pod       spec.containers{sidecar}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-dzvl   Liveness probe failed: Get http://10.36.6.109:10054/metrics: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
15m         3h           7         kube-dns-7fcdddb4c5-rp22p.1531bef68c257721          Pod       spec.containers{dnsmasq}    Warning   Unhealthy   kubelet, gke-kubeflow-testing-default-pool-90dc0402-dzvl   Liveness probe failed: HTTP probe failed with statuscode: 503

Although I guess it's possible that this is yet another problem caused by the networking issues.
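For reference, warning events like the ones above, plus the state of the kube-dns pods themselves, can be pulled with something along these lines (the pod name is a placeholder):

# List recent warning events across all namespaces
kubectl get events --all-namespaces | grep -i unhealthy

# Check the kube-dns pods, then tail the kubedns container's logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <kube-dns-pod> -c kubedns --tail=100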

jlewi (Contributor) commented May 25, 2018

I'm going to try deleting all the VMs in the cluster. They should get recreated, and hopefully any transient issues will be resolved once the pods are rescheduled.
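On GKE the nodes belong to a managed instance group, so deleting the underlying VMs should trigger automatic replacements. A rough per-node sketch (node name and zone are placeholders):

# Move pods off the node first
kubectl drain <node-name> --ignore-daemonsets

# Delete the VM; the node pool's instance group recreates it
gcloud compute instances delete <node-name> --zone=<zone>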

lluunn (Contributor, Author) commented May 25, 2018

Thanks Jeremy, I think it's good now.

lluunn closed this as completed May 25, 2018