Flaky e2e: Proxy version v1 should proxy logs on node (Failed 4 times in the last 30 runs. Stability: 86%) #10792
Comments
It's totally nuts that we have to change every test to check for nodes being ready. :/
@lavalamp You're right (as usual). Better suggestions welcome. A few I can think of off the top of my head:
Others? It's not totally clear which of these is the best approach yet. Let me give it some more thought.
@lavalamp: That's why I was proposing a general fixture that could check for this type of thing on the way out, then we could catch bad actors.
Yeah, it should be pretty trivial to check this when a framework is shut down.
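As a concrete illustration of the general fixture being proposed above, here is a minimal, hypothetical sketch (not the actual e2e framework code) of a teardown check that polls the v1 nodes API and reports any node that is not Ready. The helper name, the unauthenticated apiserver URL, and the poll interval are all assumptions made for illustration.

```go
package e2e

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// nodeList is the minimal slice of the v1 NodeList needed for a readiness check.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []struct {
				Type   string `json:"type"`
				Status string `json:"status"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

// allNodesReady polls the apiserver (apiURL is assumed to be reachable without
// auth, e.g. via a local proxy) until every node reports Ready or the timeout
// expires. It returns the names of any nodes that are still not Ready.
func allNodesReady(apiURL string, timeout time.Duration) ([]string, error) {
	deadline := time.Now().Add(timeout)
	var notReady []string
	for {
		notReady = notReady[:0]
		resp, err := http.Get(apiURL + "/api/v1/nodes")
		if err != nil {
			return nil, err
		}
		var list nodeList
		err = json.NewDecoder(resp.Body).Decode(&list)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		for _, n := range list.Items {
			ready := false
			for _, c := range n.Status.Conditions {
				if c.Type == "Ready" && c.Status == "True" {
					ready = true
				}
			}
			if !ready {
				notReady = append(notReady, n.Metadata.Name)
			}
		}
		if len(notReady) == 0 || time.Now().After(deadline) {
			break
		}
		time.Sleep(5 * time.Second)
	}
	if len(notReady) > 0 {
		return notReady, fmt.Errorf("nodes still not Ready after %v: %v", timeout, notReady)
	}
	return nil, nil
}
```

Hooked into the framework's per-test teardown, a check like this would attribute node breakage to the spec that caused it, rather than to whichever later test happens to hit the unhealthy node.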
…te-proxy-e2e Demote e2e test as per #10792
Regarding the original test - it seems that the problem was different in the last failure. Basically, the first 3 failures were failing with error:
The last failure (once the previous one was fixed) failed with:
Also - I'm not able to reproduce this failure.
I was able to reproduce the connection refused failures pretty consistently (~75%) prior to @wojtek-t's 2 fixes by running:
I've now run it 10 times without seeing the "connection refused" error, but I did see the "connection reset" failure once on the "proxy logs on node" test.
It still seems to be failing occasionally in Jenkins, even after #10820 was merged, e.g. job/kubernetes-e2e-gce/7461/. It's quite possible that the failures are for other reasons - I've not looked into it deeply.
I'm pretty sure the reason is different here. I will try to look into it deeper today.
I took a deeper look, and it seems that after adding #10820, all of the Proxy test failures are caused by some NOT-ready node at the end of the test. Other failures are caused by it as well. @dchen1107: FYI
My current hypothesis is that a lot of the different flakes we are observing (e.g. Proxy flakes #10792 or EmptyDir flakes #10657) might in fact be caused by not-ready nodes at random points in time. @dchen1107 @lavalamp is there any way to get information about why Kubelet was restarted? Can Kubelet be restarted by Monit more frequently than once per 5 minutes?
Kubelet restart because of a /healthz failure?
cc/ @saad-ali Saad, could you please take a look at this one to figure out why kubelet restarts so frequently.
monit checks every two minutes.
monit checks the existence of the pid file and /healthz
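For illustration only, here is a rough Go approximation of what that monit check amounts to: the kubelet counts as healthy only if its pid file exists and /healthz answers 200 within a short timeout. The pid-file path, healthz port, and timeout are assumptions about the GCE node configuration of the time, not values taken from this issue.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// kubeletHealthy mimics the two monit checks described above:
// 1. the pid file exists, and
// 2. /healthz responds with HTTP 200 before the timeout.
// Path and port are illustrative assumptions.
func kubeletHealthy() bool {
	if _, err := os.Stat("/var/run/kubelet.pid"); err != nil {
		return false
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("http://127.0.0.1:10248/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	if !kubeletHealthy() {
		fmt.Println("kubelet looks unhealthy; monit would restart it at this point")
	}
}
```

When a check like this fails, monit restarts the kubelet, which is the restart mechanism being investigated in the surrounding comments.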
Looks like the first monit
Looking at the thread numbers, it looks like Kubelet was restarted more than twice:
where 2928 was the shortest-lived thread:
Nothing indicates why it was killed.
There is a race condition between kubelet startup and monit. If monit comes up before salt starts the kubelet, it will notice that the kubelet is not running and will start it. And with #8931, if salt then starts the kubelet, the second startup call will kill the previous instance of kubelet. But that should result in only 1 "restart". We see 4 starts (i.e. 3 restarts). It's possible that the healthz check happens during the restarts, triggering the same cycle. Regardless, this seems to be very common. Checking random otherwise successful GCE E2E runs, I see at least 2 restarts:
@saad-ali - I think that the situation from my test is a bit more dangerous because it took more than 3 minutes (although I agree it seems to be a problem in general). --Update--
I agree it might be risky to change the monit version now. But the problem @saad-ali and I described is not a bootstrap problem. Although there are some restarts at the beginning (within 30 seconds after restart), I don't think that is serious. However, we ARE also observing restarts in the middle of running tests, e.g. 10 minutes after creating the cluster. I don't think we can call those moments "boot sequences".
Closing in favor of #10899 to track the remaining issue.
@quinton-hoole I think you closed the wrong one here. I am reopening it; re-close it if you disagree with me. :-)
@dchen1107 as I understand it, we still need to track down the reason for the seemingly unnecessary kubelet restarts. Is #10899 not the canonical tracking issue for that?
I looked into those failures and both are exactly the same. To me this looks like some problem with the network. Basically, in the apiserver there is a log showing that it is correctly sending a request to the kubelet:
However - there are no logs at all around that time in the Kubelet:
So it seems like the http request was lost somewhere in the network. It doesn't seem to be a problem with Kubernetes - what we can do is simply retry the request in the test when such errors occur. What do you think?
Yeah, retrying seems like a good idea; it can't hurt and maybe it will fix the problem as you say.
@davidopp - ok - I can prepare a PR for it.
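For reference, here is a minimal sketch of the kind of retry being discussed (this is not the actual PR that was prepared): re-issue the proxied request a few times before declaring the test failed, so a single dropped connection does not fail the whole spec. The attempt count and backoff are placeholders.

```go
package e2e

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// getWithRetry performs an HTTP GET against url, retrying transient failures
// such as "connection refused" or "connection reset" up to attempts times,
// sleeping backoff between tries. It returns the body of the first successful
// 200 response, or the last error seen.
func getWithRetry(url string, attempts int, backoff time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				return body, nil
			}
			lastErr = fmt.Errorf("status %d, read error: %v", resp.StatusCode, readErr)
		} else {
			lastErr = err
		}
		time.Sleep(backoff)
	}
	return nil, fmt.Errorf("GET %s failed after %d attempts: %v", url, attempts, lastErr)
}
```

Note that, as pointed out further down in the thread, the test already issues the request N times and expects every attempt to succeed, so wrapping each attempt in a retry does loosen that guarantee slightly.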
Having a retry in the test seems fine, but it would be good to get to the bottom of why the node network is getting borked. We've seen that problem elsewhere also.
@quinton-hoole - I don't think this is a Kubernetes problem. Basically, this particular request is not going through kube-proxy or anything like that - we just send an http request directly to the Kubelet, and it seems the Kubelet doesn't even receive it. I'm not sure how/if we can debug it...
I'm concerned about our iptables reconfiguration on the nodes, e2e tests that down the node network interface, and the kubelet process that is being restarted.
I don't know about retrying; the test already tries N times and expects them all to succeed.
This test is failing repeatedly on kubernetes-e2e-gce-parallel. Did we merge something today that might have broken this?
Did we merge something between 1:30 and 2:00 today that may have broken this?
False alarm, these failures are all failed performance expectations. We should investigate why things got slower, but it's not a correctness bug.
Actually this is frequently failing with a more worrisome error, e.g.
I'm reluctant to increase the timeout to fix the timeout failures ("took xxx > yyy") until we understand what is going on with this other failure.
The "an error on the server has prevented the request from succeeding" failures seem to mostly be happening in the PR builder Jenkins (http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/), whereas all the ones in kubernetes-e2e-gce-parallel (http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-parallel/3572/) seem to be of the timeout flavor.
These failures appear to have been due to another issue that occurred at about the same time. Per-PR, regular, and parallel Jenkins runs appear to be back to normal. I'll check again later this evening to make sure. Thanks @lavalamp and @dchen1107 for noticing the connection between this and the other issue.
I've confirmed that this test is again 100% stable. Closing.
#9312 and #10739 provide further details. #10739, the intended fix, does not seem to have done the job.