Flaky e2e test: Services should provide DNS for the cluster - Expected <int>: 3 to equal <int>: 0 #7453
Looking through the history of this test, the following commit seems to correlate with the beginning of the increased flakiness: commit e982ac5 by cjcullen. I'm going to try rolling that back in a local test cluster, and if that fixes the flakiness, roll it back in upstream/master.
It's perhaps worth noting that when a cluster comes up in this "DNS broken" state, it never seems to recover: if I rerun the DNS service e2e tests against the same cluster, over and over again, they fail every time. However, looking at our CI runs (for which a new cluster is built every time), clusters come up in this state approximately 20% of the time.
The PR for the above commit is #7154.
I'm not sure that rolling back that PR will make any difference; the error message refers to the -ro service, but if that PR were actually having an effect, it would be talking about the -rw service.
The test checks whether a pod can hit the kubernetes-ro service. DNS now points at the RW service, but for reasons TBD that is flaky, and it is not making any of the other services available to pods.
I finally got a local e2e to fail. Looking at the kube2sky container's logs, in the failure case I see:
A successful DNS startup looks like:
In both cases, I can run kubectl get services, curl "http://10.0.0.1/api/v1beta3/services", and curl "https://10.0.0.2/api/v1beta3/services" and get the full list of services, so I think this points to the problem being somewhere in the depths of the watch API, or in kube2sky's use of the watch API.
But my vote would be to revert this before we cut 0.16.0.
#7461 rolls back to a more stable kube2sky until this can be debugged further. |
Still somewhat in the realm of speculation, but #7353 (Increase maxIdleConnection limit when creating etcd client in apiserver) appears to have fixed this problem. Since it was merged today, we have seen zero failures of the DNS e2e test. It makes quite a bit of sense that it would, given the explanation in #7160 and above. |
Closed by #7353 |
This test fails intermittently on our CI system, and it's fairly easy to reproduce the error in a local test cluster, which I've done.
The error message output by the e2e test is consistently the one in the title: Expected <int>: 3 to equal <int>: 0.
Digging a bit deeper, it looks like our kube2sky bridge is failing to GET the pods for which name resolution is being requested:
While kube2sky was busy returning these errors, I confirmed that I was able to GET the above pods using kubectl, so the API is doing the right thing. This looks like a bug in kube2sky.
I'll keep digging. This is seemingly breaking quite a few other e2e tests which rely on cluster DNS (understandably). For example, when this test fails, the following other tests also seem to fail in concert:
Aside: #4852 (health checks for DNS) would probably also help to prevent this failure mode.