
Flaky e2e test: Services should provide DNS for the cluster - Expected <int>: 3 to equal <int>: 0 #7453

Closed
ghost opened this issue Apr 28, 2015 · 10 comments
Labels: area/test, area/test-infra, priority/important-soon
Milestone: v1.0

Comments

@ghost commented Apr 28, 2015

This test fails intermittently on our CI system, and it's fairly easy to reproduce the error in a local test cluster, which I've done.

The error message output by the e2e test is consistently:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:165
Expected
    <int>: 3
to equal
    <int>: 0
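
(For reference on the output format: this is the shape of a failed Gomega Equal assertion on an integer, which is what the e2e framework uses. Below is a minimal sketch of that kind of check, not the actual test/e2e/service.go code; the failedLookups slice and the spec wording are hypothetical placeholders.)

// Sketch only: a Gomega assertion of this shape produces the
// "Expected <int>: N to equal <int>: 0" output above when any
// lookups fail. failedLookups is a hypothetical placeholder.
package e2e_sketch

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("Services DNS (sketch)", func() {
	It("should provide DNS for the cluster", func() {
		// Names that did not resolve from inside the test pod (hypothetical).
		failedLookups := []string{"kubernetes-ro", "kubernetes-ro.default", "kubernetes-ro.default.kubernetes.local"}
		// Fails with "Expected <int>: 3 to equal <int>: 0".
		Expect(len(failedLookups)).To(Equal(0))
	})
})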

Digging a bit deeper, it looks like our kube2sky bridge is failing to GET the pods for which name resolution is being requested:

Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro.default failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
Lookup using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 for kubernetes-ro.default.kubernetes.local failed: the server could not find the requested resource (get pods dns-test-3c04e84a-eda2-11e4-ab38-42010af01555)
lookups using dns-test-3c04e84a-eda2-11e4-ab38-42010af01555 failed for: [kubernetes-ro kubernetes-ro.default kubernetes-ro.default.kubernetes.local]

While kube2sky was busy returning these errors, I confirmed that I was able to GET the above pods using kubectl, so the API is doing the right thing. This looks like a bug in kube2sky.

I'll keep digging. Understandably, this appears to break quite a few other e2e tests that rely on cluster DNS. For example, when this test fails, the following tests also fail in concert:

kubectl guestbook should create and stop a working application
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/kubectl.go:113
Frontend service did not start serving content in 600 seconds.

Events should be sent by kubelets and the scheduler about pods scheduling and running
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/events.go:124
kubelet events from running pod
Expected
    <int>: 0
not to be zero-valued

Cluster level logging using Elasticsearch should check that logs from pods on all nodes are ingested into Elasticsearch
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
Failed to find all 200 log lines

Aside: #4852 (health checks for DNS) would probably also help to prevent this failure mode.

@ghost ghost added this to the v1.0 milestone Apr 28, 2015
@ghost ghost self-assigned this Apr 28, 2015
@ghost ghost added the priority/important-soon label Apr 28, 2015
@ghost (Author) commented Apr 28, 2015

Looking through the history of this test, the following commit seems to correlate with the onset of the increased flakiness:

Commit e982ac5 by cjcullen
Change kube2sky to use token-system-dns secret, point at https endpoint (instead of kubernetes-ro service).

I'm going to try rolling that back in a local test cluster, and if that fixes the flakiness, roll it back in upstream/master.

@ghost (Author) commented Apr 28, 2015

It's perhaps worth noting that when a cluster comes up in this "DNS broken" state, it never seems to recover: if I rerun the DNS service e2e tests against the same cluster over and over again, they fail every time. However, looking at our CI runs (for which a new cluster is built every time), clusters come up in this state approximately 20% of the time.

@ghost (Author) commented Apr 28, 2015

The PR for the above commit is #7154.

@lavalamp (Member) commented:

I'm not sure that rolling back that PR will make any difference; the error message refers to the -ro service, but if that PR were actually having an effect, it would be talking about the -rw service.

@cjcullen (Member) commented:

The test checks whether a pod can hit the kubernetes-ro service. DNS now points at the RW service, but for reasons TBD that is flaky, and it is not making any of the other services available to pods.

@cjcullen (Member) commented:

I finally got a local e2e to fail. Looking at the kube2sky container's logs, in the failure case I see:

2015/04/28 20:16:25 Etcd server found: http://127.0.0.1:4001
2015/04/28 20:16:25 Using https://10.0.0.2:443 for kubernetes master
2015/04/28 20:16:25 Using kubernetes API 
2015/04/28 20:16:26 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.146.9:9200
2015/04/28 20:21:26 watchLoop channel closed

A successful DNS startup looks like:

2015/04/28 20:56:18 Etcd server found: http://127.0.0.1:4001
2015/04/28 20:56:18 Using https://10.0.0.2:443 for kubernetes master
2015/04/28 20:56:18 Using kubernetes API 
2015/04/28 20:56:18 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.14.38:9200
2015/04/28 20:56:18 Setting DNS record: kibana-logging.default.kubernetes.local. -> 10.0.2.52:5601
2015/04/28 20:56:18 Setting DNS record: kube-dns.default.kubernetes.local. -> 10.0.0.10:53
2015/04/28 20:56:18 Setting DNS record: kubernetes.default.kubernetes.local. -> 10.0.0.2:443
2015/04/28 20:56:18 Setting DNS record: kubernetes-ro.default.kubernetes.local. -> 10.0.0.1:80
2015/04/28 20:56:18 Setting DNS record: monitoring-grafana.default.kubernetes.local. -> 10.0.208.33:80
2015/04/28 20:56:18 Setting DNS record: monitoring-heapster.default.kubernetes.local. -> 10.0.31.9:80
2015/04/28 20:56:18 Setting DNS record: monitoring-influxdb.default.kubernetes.local. -> 10.0.34.200:80
2015/04/28 20:56:19 Setting DNS record: monitoring-influxdb-ui.default.kubernetes.local. -> 10.0.67.146:80
2015/04/28 21:01:18 watchLoop channel closed
2015/04/28 21:01:18 Setting DNS record: elasticsearch-logging.default.kubernetes.local. -> 10.0.14.38:9200
2015/04/28 21:01:18 Setting DNS record: kibana-logging.default.kubernetes.local. -> 10.0.2.52:5601
2015/04/28 21:01:18 Setting DNS record: kube-dns.default.kubernetes.local. -> 10.0.0.10:53
2015/04/28 21:01:18 Setting DNS record: kubernetes.default.kubernetes.local. -> 10.0.0.2:443
2015/04/28 21:01:18 Setting DNS record: kubernetes-ro.default.kubernetes.local. -> 10.0.0.1:80
2015/04/28 21:01:18 Setting DNS record: monitoring-grafana.default.kubernetes.local. -> 10.0.208.33:80
2015/04/28 21:01:19 Setting DNS record: monitoring-heapster.default.kubernetes.local. -> 10.0.31.9:80
2015/04/28 21:01:19 Setting DNS record: monitoring-influxdb.default.kubernetes.local. -> 10.0.34.200:80
2015/04/28 21:01:19 Setting DNS record: monitoring-influxdb-ui.default.kubernetes.local. -> 10.0.67.146:80
...

In both cases, I can kubectl get services, curl "http://10.0.0.1/api/v1beta3/services", and curl "https://10.0.0.2/api/v1beta3/services" and get the full list of services, so I think this points to the problem being somewhere in the depths of the watch API, or in kube2sky's use of the watch API.
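
For illustration, here is a minimal sketch (not kube2sky's actual implementation) of the "list, then watch; on channel close, re-list and re-watch" loop the logs above imply. watchServices and listAndSetRecords are hypothetical placeholders for the real API calls and the etcd/SkyDNS record writes; the healthy log re-lists after "watchLoop channel closed", while the broken cluster appears to stop at that point.

// Hypothetical sketch (not kube2sky's actual code) of a resilient
// list-then-watch loop for keeping DNS records in sync with services.
package main

import (
	"log"
	"time"
)

// Event is a stand-in for a service add/update/delete notification.
type Event struct {
	Name string
	IP   string
	Port int
}

func listAndSetRecords() error {
	// Placeholder: GET all services and write one DNS record per service.
	log.Println("re-listing services and setting DNS records")
	return nil
}

func watchServices() (<-chan Event, error) {
	// Placeholder: open a watch on the services endpoint. The channel
	// closes when the server (or an intermediary) drops the connection.
	ch := make(chan Event)
	go func() {
		time.Sleep(2 * time.Second) // simulate the server dropping the watch
		close(ch)
	}()
	return ch, nil
}

func main() {
	for {
		// Re-list on every (re)start of the watch so no updates are missed.
		if err := listAndSetRecords(); err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		ch, err := watchServices()
		if err != nil {
			log.Printf("watch failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		for ev := range ch {
			log.Printf("Setting DNS record: %s -> %s:%d", ev.Name, ev.IP, ev.Port)
		}
		// The healthy log above re-lists here after "watchLoop channel
		// closed"; the broken cluster never gets past this point.
		log.Println("watchLoop channel closed")
	}
}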

@cjcullen (Member) commented:

But my vote would be to revert this before we cut 0.16.0.

@cjcullen (Member) commented:

#7461 rolls back to a more stable kube2sky until this can be debugged further.

@ghost (Author) commented Apr 29, 2015

Still somewhat in the realm of speculation, but #7353 (Increase maxIdleConnection limit when creating etcd client in apiserver) appears to have fixed this problem.

Since it was merged today, we have seen zero failures of the DNS e2e test.

It makes quite a bit of sense that it would, given the explanation in #7160 and above.
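
For context on why a connection-pool limit would matter here: Go's net/http Transport keeps only MaxIdleConnsPerHost idle connections per host (the default is 2), so a busy client talking to a single etcd endpoint churns connections instead of reusing them. Below is an illustrative Go sketch of the knob involved; it is not the actual #7353 diff, and the value is made up.

package main

import "net/http"

// newEtcdTransport returns an http.Transport with a raised idle-connection
// limit. Illustrative only: #7353 raises the equivalent limit on the
// apiserver's etcd client; 25 here is an arbitrary example value.
func newEtcdTransport() *http.Transport {
	return &http.Transport{
		// With the default of 2 idle connections per host, connections to
		// the single etcd endpoint are closed as soon as they go idle under
		// load, forcing constant reconnects.
		MaxIdleConnsPerHost: 25,
	}
}

func main() {
	client := &http.Client{Transport: newEtcdTransport()}
	_ = client // the apiserver's etcd client would be built on top of this
}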

@ghost (Author) commented Apr 29, 2015

Closed by #7353

@ghost ghost closed this as completed Apr 29, 2015
@ghost ghost removed their assignment Aug 12, 2015