
Flaky e2e test: Services should provide DNS for the cluster - Expected <int>: 3 to equal <int>: 0 #7453

Closed
ghost opened this issue Apr 28, 2015 · 10 comments
Labels: area/test, area/test-infra, priority/important-soon
Milestone: v1.0

Comments

@ghost commented Apr 28, 2015

This test fails intermittently on our CI system, and it's fairly easy to reproduce the error in a local test cluster, which I've done.

The error message output by the e2e test is consistently:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:165
Expected
    <int>: 3
to equal
    <int>: 0
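
(For reference on the output format: this is the shape of a failed Gomega Equal assertion on an integer, which is what the e2e framework uses. Below is a minimal sketch of that kind of check, not the actual test/e2e/service.go code; the failedLookups slice and the spec wording are hypothetical placeholders.)

// Sketch only: a Gomega assertion of this shape produces the
// "Expected <int>: N to equal <int>: 0" output above when any
// lookups fail. failedLookups is a hypothetical placeholder.
package e2e_sketch

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("Services DNS (sketch)", func() {
	It("should provide DNS for the cluster", func() {
		// Names that did not resolve from inside the test pod (hypothetical).
		failedLookups := []string{"kubernetes-ro", "kubernetes-ro.default", "kubernetes-ro.default.kubernetes.local"}
		// Fails with "Expected <int>: 3 to equal <int>: 0".
		Expect(len(failedLookups)).To(Equal(0))
	})
})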

Digging a bit deeper, it looks like our kube2sky bridge is failing to GET the pods for which name resolution is being requested:

Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro.default failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro.default.kubernetes.local failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
lookups using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 failed for: [kubernetes-ro kubernetes-ro.default kubernetes-ro.default.kubernetes.local]

While kube2sky was busy returning these errors, I confirmed that I was able to GET the above pods using kubectl, so the API is doing the right thing. This looks like a bug in kube2sky.

I'll keep digging. Understandably, this appears to break quite a few other e2e tests that rely on cluster DNS. For example, when this test fails, the following tests also fail in concert:

kubectl guestbook should create and stop a working application
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/kubectl.go:113
Frontend service did not start serving content in 600 seconds.

Events should be sent by kubelets and the scheduler about pods scheduling and running
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/events.go:124
kubelet events from running pod
Expected
    <int>: 0
not to be zero-valued

Cluster level logging using Elasticsearch should check that logs from pods on all nodes are ingested into Elasticsearch
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
Failed to find all 200 log lines

Aside: #4852 (health checks for DNS) would probably also help to prevent this failure mode.

@ghost ghost added this to the v1.0 milestone Apr 28, 2015
@ghost ghost self-assigned this Apr 28, 2015
@ghost ghost added the priority/important-soon label Apr 28, 2015
@ghost (Author) commented Apr 28, 2015

Looking through the history of this test, the following commit seems to correlate with the onset of the increased flakiness:

Commit e982ac5 by cjcullen
Change kube2sky to use token-system-dns secret, point at https endpoint (instead of kubernetes-ro service).

I'm going to try rolling that back in a local test cluster, and if that fixes the flakiness, roll it back in upstream/master.

@ghost (Author) commented Apr 28, 2015

It's perhaps worth noting that when a cluster comes up in this "DNS broken" state, it never seems to recover: if I rerun the DNS service e2e tests against the same cluster over and over again, they fail every time. However, looking at our CI runs (for which a new cluster is built every time), clusters come up in this state approximately 20% of the time.

@ghost (Author) commented Apr 28, 2015

The PR for the above commit is #7154.

@lavalamp (Member) commented:

I'm not sure that rolling back that PR will make any difference; the error message refers to the -ro service, but if that PR were actually having an effect, it would be talking about the -rw service.

@cjcullen (Member) commented:

The test checks whether a pod can hit the kubernetes-ro service. DNS now points at the RW service, but for reasons TBD that is flaky, and it is not making any of the other services available to pods.

@cjcullen (Member) commented:

I finally got a local e2e to fail. Looking at the kube2sky container's logs, in the failure case I see:

2015/04/28 20:16:25 Etcd server found: http://127.0.0.1:4001
2015/04/28 20:16:25 Using https://10.0.0.2:443 for kubernetes master
2015/04/28 20:16:25 Using kubernetes API 
2015/04/28 20:16:26 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.146.9:9200
2015/04/28 20:21:26 watchLoop channel closed

A successful DNS startup looks like:

2015/04/28 20:56:18 Etcd server found: http://127.0.0.1:4001
2015/04/28 20:56:18 Using https://10.0.0.2:443 for kubernetes master
2015/04/28 20:56:18 Using kubernetes API 
2015/04/28 20:56:18 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.14.38:9200
2015/04/28 20:56:18 Setting DNS record: kibana-logging.default.kubernetes.local. -> 10.0.2.52:5601
2015/04/28 20:56:18 Setting DNS record: kube-dns.default.kubernetes.local. -> 10.0.0.10:53
2015/04/28 20:56:18 Setting DNS record: kubernetes.default.kubernetes.local. -> 10.0.0.2:443
2015/04/28 20:56:18 Setting DNS record: kubernetes-ro.default.kubernetes.local. -> 10.0.0.1:80
2015/04/28 20:56:18 Setting DNS record: monitoring-grafana.default.kubernetes.local. -> 10.0.208.33:80
2015/04/28 20:56:18 Setting DNS record: monitoring-heapster.default.kubernetes.local. -> 10.0.31.9:80
2015/04/28 20:56:18 Setting DNS record: monitoring-influxdb.default.kubernetes.local. -> 10.0.34.200:80
2015/04/28 20:56:19 Setting DNS record: monitoring-influxdb-ui.default.kubernetes.local. -> 10.0.67.146:80
2015/04/28 21:01:18 watchLoop channel closed
2015/04/28 21:01:18 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.14.38:9200
2015/04/28 21:01:18 Setting DNS record: kibana-logging.default.kubernetes.local. -> 10.0.2.52:5601
2015/04/28 21:01:18 Setting DNS record: kube-dns.default.kubernetes.local. -> 10.0.0.10:53
2015/04/28 21:01:18 Setting DNS record: kubernetes.default.kubernetes.local. -> 10.0.0.2:443
2015/04/28 21:01:18 Setting DNS record: kubernetes-ro.default.kubernetes.local. -> 10.0.0.1:80
2015/04/28 21:01:18 Setting DNS record: monitoring-grafana.default.kubernetes.local. -> 10.0.208.33:80
2015/04/28 21:01:19 Setting DNS record: monitoring-heapster.default.kubernetes.local. -> 10.0.31.9:80
2015/04/28 21:01:19 Setting DNS record: monitoring-influxdb.default.kubernetes.local. -> 10.0.34.200:80
2015/04/28 21:01:19 Setting DNS record: monitoring-influxdb-ui.default.kubernetes.local. -> 10.0.67.146:80
...

In both cases, I can kubectl get services, curl "http://10.0.0.1/api/v1beta3/services", and curl "https://10.0.0.2/api/v1beta3/services" and get the full list of services, so I think this points to the problem being somewhere in the depths of the watch API, or in kube2sky's use of the watch API.
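
For illustration, here is a minimal sketch (not kube2sky's actual implementation) of the "list, then watch; on channel close, re-list and re-watch" loop the logs above imply. watchServices and listAndSetRecords are hypothetical placeholders for the real API calls and the etcd/SkyDNS record writes; the healthy log re-lists after "watchLoop channel closed", while the broken cluster appears to stop at that point.

// Hypothetical sketch (not kube2sky's actual code) of a resilient
// list-then-watch loop for keeping DNS records in sync with services.
package main

import (
	"log"
	"time"
)

// Event is a stand-in for a service add/update/delete notification.
type Event struct {
	Name string
	IP   string
	Port int
}

func listAndSetRecords() error {
	// Placeholder: GET all services and write one DNS record per service.
	log.Println("re-listing services and setting DNS records")
	return nil
}

func watchServices() (<-chan Event, error) {
	// Placeholder: open a watch on the services endpoint. The channel
	// closes when the server (or an intermediary) drops the connection.
	ch := make(chan Event)
	go func() {
		time.Sleep(2 * time.Second) // simulate the server dropping the watch
		close(ch)
	}()
	return ch, nil
}

func main() {
	for {
		// Re-list on every (re)start of the watch so no updates are missed.
		if err := listAndSetRecords(); err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		ch, err := watchServices()
		if err != nil {
			log.Printf("watch failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		for ev := range ch {
			log.Printf("Setting DNS record: %s -> %s:%d", ev.Name, ev.IP, ev.Port)
		}
		// The healthy log above re-lists here after "watchLoop channel
		// closed"; the broken cluster never gets past this point.
		log.Println("watchLoop channel closed")
	}
}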

@cjcullen (Member) commented:

But my vote would be to revert this before we cut 0.16.0.

@cjcullen (Member) commented:

#7461 rolls back to a more stable kube2sky until this can be debugged further.

@ghost (Author) commented Apr 29, 2015

Still somewhat in the realm of speculation, but #7353 (Increase maxIdleConnection limit when creating etcd client in apiserver) appears to have fixed this problem.

Since it was merged today, we have seen zero failures of the DNS e2e test.

It makes quite a bit of sense that it would, given the explanation in #7160 and above.
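
For context on why a connection-pool limit would matter here: Go's net/http Transport keeps only MaxIdleConnsPerHost idle connections per host (the default is 2), so a busy client talking to a single etcd endpoint churns connections instead of reusing them. Below is an illustrative Go sketch of the knob involved; it is not the actual #7353 diff, and the value is made up.

package main

import "net/http"

// newEtcdTransport returns an http.Transport with a raised idle-connection
// limit. Illustrative only: #7353 raises the equivalent limit on the
// apiserver's etcd client; 25 here is an arbitrary example value.
func newEtcdTransport() *http.Transport {
	return &http.Transport{
		// With the default of 2 idle connections per host, connections to
		// the single etcd endpoint are closed as soon as they go idle under
		// load, forcing constant reconnects.
		MaxIdleConnsPerHost: 25,
	}
}

func main() {
	client := &http.Client{Transport: newEtcdTransport()}
	_ = client // the apiserver's etcd client would be built on top of this
}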

@ghost (Author) commented Apr 29, 2015

Closed by #7353

@ghost ghost closed this as completed Apr 29, 2015
@ghost ghost removed their assignment Aug 12, 2015