Failing performance tests on Jenkins #7561
Comments
Feel free to move it outside v1.0 if you can explain it - if not, we should definitely try to understand it.
Also note that this is a clear regression - I didn't observe any such failure before. That's why I'm marking it (at least for now) as a v1.0 issue.
Might end up being a duplicate of #7548, but let's investigate both in parallel until we confirm.
I think those are unrelated. Basically, the GCE tests are green now, whereas the performance tests haven't passed even once since then. My feeling is that it might be related to #6866. @lavalamp @bprashanth - any thoughts about it?
Tl;dr: logs at V(4) would help. This can certainly happen if the watch takes O(minutes) to deliver notifications (which I've seen happen when we, say, delete 1000 pods at once). If that's the problem, it essentially boils down to: the watch is slow and things time out because they don't know whether a packet was lost or is still in flight. There are 2 solutions: make the watch faster, or increase the timeout. The good news is that the rc will balance itself out over time as the appropriate notifications get there. So if you really want to track the exact number of replicas at scale, minute by minute, it's going to be hard. Previously we didn't have this problem because the rcs would poll every 5s, which has various other implications. Currently that 5s poll is replaced by 4 things:
So it feels like we're hitting the 4th case (I can't be sure without logs), and that if we didn't care about the replica count being exact minute to minute, we would be OK after a bit. FYI, I'm also adding some clarity to the density tests because currently they're close to impossible to debug.
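A rough illustration of the trade-off described above - this is not the real replication controller code; the event type and the list/watch helpers are hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"time"
)

type replicaEvent struct{ count int }

// watchReplicas and listReplicas are placeholders for whatever actually
// delivers notifications and lists current state.
func watchReplicas() <-chan replicaEvent { return make(chan replicaEvent) }
func listReplicas() int                  { return 0 }

func trackReplicas(timeout time.Duration) {
	events := watchReplicas()
	for {
		select {
		case ev := <-events:
			// Happy path: the watch delivered the notification in time.
			fmt.Println("observed replicas:", ev.count)
		case <-time.After(timeout):
			// The watch is slow or an event was lost; we can't tell which,
			// so fall back to a full re-list. A longer timeout means fewer
			// spurious re-lists, but a longer window in which the observed
			// count can be stale.
			fmt.Println("timed out, re-listed replicas:", listReplicas())
		}
	}
}

func main() { trackReplicas(30 * time.Second) }
```

Either making the watch faster or raising the timeout shifts where on that curve the controller sits; neither gives an exact minute-by-minute replica count at scale.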
There are at least 2 issues that can lead to a wrong number of replicas at this stage of the test:
I'm still digging into 2; I've seen it happen in at least 2/5 runs. In both cases the system will correct itself given time.
Thanks for investigating it @bprashanth. Since the flakes are currently very rare, I'm decreasing the priority of this bug.
There's a third case, which is currently the most frequent failure mode. It also occasionally crops up in the normal e2es (#7548 (comment)). Investigating.
This should fix the unexpected terminations, or reveal another race condition that needs handling: #7749
@bprashanth thanks for the heads-up.
The mentioned PR has merged, and now I'm seeing high-latency requests of 2 types:
To be clear, I'm talking about the failure message:
which inevitably end up being:
Correct - this is something we're currently looking at.
The basic event listing e2e test is intermittently failing in the same way as bprashanth@ mentions above, so its failure is currently somewhat independent of scale. I'd be very happy to temporarily disable that part of this scalability test until the basic event listing test has been deflaked. To be clear, I'm talking about this flaky basic event listing e2e test:
==== snip snip from our continuous integration tests =============
Identified problems
Events should be sent by kubelets and the scheduler about pods scheduling and running
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/events.go:124
@bprashanth I'll send a PR later today to ignore events-related metrics. I was also investigating slow listing of pods. The main reason currently is that GOMAXPROCS for etcd is set to 1. Increasing it to NumCores() improves things a lot, but metrics are still above the 1.0 targets. I believe the bottleneck is now somewhere in the apiserver itself.
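To spell out what that GOMAXPROCS change amounts to (illustrative only - with the Go releases of that era, GOMAXPROCS defaulted to 1, so a CPU-bound server like etcd ran its goroutines on a single core unless the limit was raised, e.g. via the GOMAXPROCS environment variable or a call like this):

```go
package main

import "runtime"

func main() {
	// Old default was 1; use one OS thread per available core so the
	// process can actually use the whole machine.
	runtime.GOMAXPROCS(runtime.NumCPU())
	// ... start serving ...
}
```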
Thanks for the heads up, will keep you posted.
In all the cases I've seen O(10) - those nodes weren't even registered.
Currently, the limit is 5 minutes, but I'm increasing it to 10 minutes now.
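For context, a sketch of the kind of wait being extended here - the helper and the node count are made-up stand-ins, not code from the actual test suite:

```go
package main

import (
	"fmt"
	"time"
)

// countRegisteredNodes is a placeholder for however the test asks the
// apiserver how many nodes have registered.
func countRegisteredNodes() int { return 0 }

// waitForNodes polls until the expected number of nodes has registered, or
// gives up after the limit (raised from 5 to 10 minutes in this case).
func waitForNodes(expected int, limit time.Duration) error {
	deadline := time.Now().Add(limit)
	for time.Now().Before(deadline) {
		if countRegisteredNodes() >= expected {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("fewer than %d nodes registered within %v", expected, limit)
}

func main() {
	if err := waitForNodes(100, 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```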
I was able to catch the above situation on Jenkins and did some more debugging on it. What happened:
I guess the good news is that this shouldn't happen on release tars... This sounds like it can sometimes take a little while for GCS ACLs to propagate (after this call is made). I wonder if there is a way to tell GCS that anything you put into bucket X in the future should be automatically marked as world-readable.
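GCS does support default object ACLs on a bucket, which sounds like what's being asked for here. A hedged sketch with the current Go client (which postdates this thread; the bucket name is a placeholder, and this assumes the bucket isn't using uniform bucket-level access, where object ACLs are disabled):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Make future objects written to this bucket readable by anyone by
	// default; existing objects are not affected.
	bucket := client.Bucket("my-release-artifacts") // placeholder name
	if err := bucket.DefaultObjectACL().Set(ctx, storage.AllUsers, storage.RoleReader); err != nil {
		log.Fatal(err)
	}
}
```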
@wojtek-t is on vacation for the next couple of weeks, so we should either reassign or close if we think #8998 is sufficient. @roberthbailey what do you think?
When the GCE issues are fixed, feel free to re-assign to me; I can track the progress of the test on Jenkins and follow up on this bug.
As of the weekend this wasn't fixed - we had a lot of red runs during the weekend. It seems to be in better shape now. However, there seems to have been a huge drop in the performance metrics a few hours ago - can someone please take a look at it and roll back the PR that caused the regression?
The majority of e2e scalability Jenkins runs are green now. In the last 14 runs we had 12 green ones and 2 red ones (one failed firewall creation and one due to a node not being ready). Let's close this issue and open separate issues if we decide to fix the remaining problems.
Thanks @fgrzadkowski - SGTM
I'm going to reopen this since there are apparently still problems with the e2e scalability test (the load test had to be disabled today, #9201). Probably I should open a separate bug, but this is close enough.
@fgrzadkowski any update on this, or are you still looking into it?
@davidopp Yes, I'm looking into this. I need to refactor the load test to make it more robust.
This is related to #4521.
The first failure happened at ~9:20 PST 29/04/2015. The failure looks like this:
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:158
Expected error:
<*errors.errorString | 0xc20b84daa0>: {
s: "Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000",
}
Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000
not to have occurred
It seems that for some reason we are starting many more pods within a replication controller than expected. Another related failure:
Expected error:
<*errors.errorString | 0xc2083b3050>: {
s: "Number of reported pods changed: 3222 vs 3000",
}
Number of reported pods changed: 3222 vs 3000
not to have occurred
cc @quinton-hoole @lavalamp @bprashanth @fgrzadkowski