Failing performance tests on Jenkins #7561

Closed
wojtek-t opened this issue Apr 30, 2015 · 62 comments
Assignees
Labels
area/reliability priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@wojtek-t
Member

The first failure happened at ~9:20 PST on 29/04/2015. The failure looks like this:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:158
Expected error:
<*errors.errorString | 0xc20b84daa0>: {
s: "Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000",
}
Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000
not to have occurred

It seems that for some reason we are starting many more pods within a replication controller than expected. Another related failure:

Expected error:
<*errors.errorString | 0xc2083b3050>: {
s: "Number of reported pods changed: 3222 vs 3000",
}
Number of reported pods changed: 3222 vs 3000
not to have occurred

cc @quinton-hoole @lavalamp @bprashanth @fgrzadkowski

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/reliability team/master area/test-infra labels Apr 30, 2015
@wojtek-t wojtek-t added this to the v1.0 milestone Apr 30, 2015
@wojtek-t
Member Author

Feel free to move it outside v1.0 if you can explain it - if not, we should definitely try to understand it.

@wojtek-t
Member Author

Also note that this is a clear regression - I didn't observe any such failure before. That's why I'm marking it (at least for now) as a v1.0 issue.

@ghost

ghost commented Apr 30, 2015

Might end up being a duplicate of #7548, but let's investigate both in parallel until we confirm.

@wojtek-t
Member Author

Might end up being a duplicate of #7548, but let's investigate both in parallel until we confirm.

I think those are unrelated. Basically, the GCE tests are green now, whereas the performance tests haven't passed even once since then.

My feeling is that it might be related to #6866. @lavalamp @bprashanth - any thoughts?

@bprashanth
Contributor

Tl;dr: logs at V(4) would help. That can certainly happen if the watch takes O(minutes) to deliver notifications (which I've seen happen when we, say, delete 1000 pods at once).

If that's the problem, it essentially boils down to: the watch is slow and things time out because they don't know whether a packet was lost or is still in flight. There are 2 solutions: make the watch faster, or increase the timeout.

The good news is that the rc will balance itself out over time as the appropriate notifications get there. So if you really want to track the exact number of replicas at scale minute by minute, it's going to be hard. Previously we didn't have this problem because the rcs would poll every 5s, which has various other implications. Currently that 5s poll is replaced by 4 things:

  1. watchers on pods and controllers
  2. a periodic relist of the rcs every 30s
  3. a periodic relist of pods every 5m
  4. a TTL of 2m on what the rc expects to see via watch (i.e. if it created 3 pods it expects to see 3 watch notifications)

So it feels like we're hitting the 4th case (I can't be sure without logs), and if we didn't care about the replica count being exact minute to minute, we would be OK after a bit. FYI, I'm also adding some clarity to the density tests because currently they're close to impossible to debug.
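For reference, a minimal sketch of the idea behind item 4 above - an expectations record that is only trusted for a bounded time. The names (podExpectations, expectationTTL) are illustrative, not the actual replication controller code:

// Hypothetical sketch of "expectations with a TTL": the controller records how many
// watch notifications it still expects, and stops trusting that record once it is
// older than expectationTTL (the 2m mentioned above), at which point it falls back
// to relisting.
package main

import (
    "fmt"
    "time"
)

const expectationTTL = 2 * time.Minute

type podExpectations struct {
    pendingAdds int       // watch ADD notifications the rc still expects to see
    timestamp   time.Time // when the expectations were recorded
}

// fulfilled reports whether the rc may act on its cached view: either every expected
// notification has arrived, or the record is older than the TTL and is no longer trusted.
func (e *podExpectations) fulfilled(now time.Time) bool {
    return e.pendingAdds <= 0 || now.Sub(e.timestamp) > expectationTTL
}

// observeAdd would be called from the pod watch handler for each ADD notification.
func (e *podExpectations) observeAdd() { e.pendingAdds-- }

func main() {
    e := &podExpectations{pendingAdds: 3, timestamp: time.Now()}
    e.observeAdd()
    fmt.Println("can sync now:", e.fulfilled(time.Now()))                          // false: 2 notifications still missing
    fmt.Println("can sync after TTL:", e.fulfilled(time.Now().Add(3*time.Minute))) // true: expectations expired
}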

@davidopp davidopp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 30, 2015
@bprashanth
Contributor

There are at least 2 issues that can lead to a wrong number of replicas at this stage of the test:

  1. Slow create->watch cycle (by slow I mean slower than the 2-node, 50-pod cluster case). See the comments in Update expiration timeout based on observed latencies #7628 for more details.
  2. Scheduling pods on dead nodes (Pods being scheduled on shut down machines #7222) or nodes dying after pods have been scheduled on them. This will cause the nodecontroller to delete pods after the rc has observed them to be running, and the test will panic because of the mismatch.

I'm still digging into 2; I've seen it happen in at least 2 out of 5 runs. In both cases the system will correct itself given time.

@wojtek-t
Member Author

wojtek-t commented May 4, 2015

Thanks for investigating it @bprashanth.
It seems that #7628 fixed the huge majority of flakes. Thanks for the fix!
And it seems that the remaining flakes are exactly due to #7222, which you described.

Since the flakes are currently very rare, I'm decreasing the priority of this bug.
Also, I think we can close this issue as a duplicate of #7222 (feel free to do it).

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 4, 2015
@bprashanth
Contributor

There's a third case which is currently the most frequent failure mode:
<*errors.errorString | 0xc21c4c0580>: {
s: "Error: Pod my-hostname-density3000-8da6844d-f23a-11e4-bd75-42010af0159ffb5: Container my-hostname-density3000-8da6844d-f23a-11e4-bd75-42010af01555 was found to have terminated 1 times",
}

It also occasionally crops up in the normal e2es (#7548 (comment)). Investigating.

@bprashanth
Contributor

This should fix the unexpected terminations, or reveal another race condition that needs handling: #7749

@wojtek-t
Member Author

wojtek-t commented May 5, 2015

@bprashanth thanks for the heads-up.

@bprashanth
Contributor

The mentioned PR has merged, and now I'm seeing high-latency requests of 2 types:

  1. Listing events: this doesn't sound like a reason to fail the test - can we ignore them? @wojtek-t @rrati
  2. Listing pods: happens less often; investigating.

To be clear, I'm talking about the failure message:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:164
Expected
    <int>: 2
not to be >
    <int>: 0

which inevitably turns out to be:

INFO: WARNING - requests with too high latency: {verb:LIST resource:events quantile:0.9 latency:11844619000}
INFO: WARNING - requests with too high latency: {verb:LIST resource:events quantile:0.99 latency:14242416000}
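For context, that assertion boils down to counting API calls whose measured latency quantile exceeds a threshold and expecting the count to be 0. A rough sketch of the idea (the threshold and types here are illustrative, not the actual density.go code):

// Illustrative sketch of the "too high latency" check: count API calls whose measured
// latency exceeds a threshold; the test then expects that count to be 0.
package main

import (
    "fmt"
    "time"
)

type apiCallLatency struct {
    Verb     string
    Resource string
    Quantile float64
    Latency  time.Duration
}

// latencyThreshold is a made-up value; the real test derives its targets differently.
const latencyThreshold = 1 * time.Second

func countTooSlow(calls []apiCallLatency) int {
    bad := 0
    for _, c := range calls {
        if c.Latency > latencyThreshold {
            fmt.Printf("WARNING - request with too high latency: %+v\n", c)
            bad++
        }
    }
    return bad
}

func main() {
    calls := []apiCallLatency{
        {Verb: "LIST", Resource: "events", Quantile: 0.9, Latency: 11844619000 * time.Nanosecond},
        {Verb: "LIST", Resource: "events", Quantile: 0.99, Latency: 14242416000 * time.Nanosecond},
    }
    fmt.Println("requests over threshold:", countTooSlow(calls)) // must be 0 for the test to pass
}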

@wojtek-t
Member Author

wojtek-t commented May 6, 2015

Correct - this is something we're currently looking at.
cc @fgrzadkowski

@ghost

ghost commented May 6, 2015

The basic event listing e2e test is intermittently failing in the same way as bprashanth@ mentions above, so its failure is currently somewhat independent of scale. I'd be very happy to temporarily disable that part of this scalability test until the basic event listing test has been deflaked.

To be clear, I'm talking about this flaky basic event listing e2e test:

==== snip snip from our continuous integration tests =============

Identified problems

Events should be sent by kubelets and the scheduler about pods scheduling and running

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/events.go:124
kubelet events from running pod
Expected
: 0
not to be zero-valued

@fgrzadkowski
Contributor

@bprashanth I'll send a PR later today to ignore events-related metrics.

I was also investigating slow listing of pods. The main reason currently is that GOMAXPROCS for etcd is set to 1. Increasing it to the number of cores improves things a lot, but the metrics are still above the 1.0 targets. I believe the bottleneck is now somewhere in the apiserver itself.
Please let me know if you find something, so that we don't both investigate the same thing :)
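For illustration only - the actual change is in how the etcd process is launched (its GOMAXPROCS environment), not in Kubernetes code - this is the knob being discussed:

// Minimal illustration of the GOMAXPROCS setting mentioned above: with GOMAXPROCS=1 a
// Go process such as etcd schedules all goroutines on a single core; raising it to the
// number of cores lets list-heavy workloads use all of them.
package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("before:", runtime.GOMAXPROCS(0)) // passing 0 only queries the current value
    runtime.GOMAXPROCS(runtime.NumCPU())          // comparable to exporting GOMAXPROCS=<num cores> for the process
    fmt.Println("after:", runtime.GOMAXPROCS(0))
}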

@bprashanth
Contributor

Thanks for the heads up, will keep you posted

@wojtek-t
Member Author

In all the cases I've seen O(10) - those nodes weren't even registered.

@wojtek-t
Member Author

Currently, the limit is 5 minutes, but I'm increasing it to 10 minutes now.

@wojtek-t
Member Author

I was able to catch the above situation on Jenkins and did some more debugging on it. What happened:

  • Detected 99 ready nodes, found 99 nodes out of expected 100. Your cluster may not be working.

  • I found the node that wasn't registered, and after logging into the corresponding VM I saw:

    Broken (or in progress) GCE Kubernetes node setup! Suggested first step:
       tail /var/log/startupscript.log
    
  • Looking at the logs from the startup script, it basically wasn't able to unpack the salt tree:

    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: Unpacking Salt tree
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: 
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: gzip: stdin: not in gzip format
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: tar: Child returned status 1
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: tar: Error is not recoverable: exiting now
    
  • Looking at the exact file it was trying to unpack, we have:

    wojtekt@e2e-scalability-minion-i5x0:/var/cache/kubernetes-install$ cat kubernetes-salt.tar.gz 
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>wojtekt@e2e-scalability-minion-i5x0:/var/cache/kubernetes-install$ 
    
  • So this seems like some kind of race - I looked at a few other machines and they started later.

    I'm going to add a mechanism to retry downloading the files if we're not able to unpack them (sketched below).
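The real retry belongs in the GCE node startup script, but the retry-and-verify idea looks roughly like the following sketch (the URL, destination path and retry count are made up):

// Sketch of "retry the download until the file is actually a gzip archive": if GCS ACLs
// haven't propagated yet, the fetch returns an XML AccessDenied body instead of a tarball,
// which is exactly the failure shown above. All names here are illustrative.
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

var gzipMagic = []byte{0x1f, 0x8b} // first two bytes of any gzip stream

func downloadValidGzip(url, dest string, attempts int) error {
    for i := 0; i < attempts; i++ {
        resp, err := http.Get(url)
        if err == nil {
            body, readErr := io.ReadAll(resp.Body)
            resp.Body.Close()
            if readErr == nil && resp.StatusCode == http.StatusOK && bytes.HasPrefix(body, gzipMagic) {
                return os.WriteFile(dest, body, 0644)
            }
        }
        time.Sleep(5 * time.Second) // give the ACLs time to propagate, then try again
    }
    return fmt.Errorf("giving up on %s after %d attempts", url, attempts)
}

func main() {
    err := downloadValidGzip("https://storage.googleapis.com/BUCKET/kubernetes-salt.tar.gz",
        "/var/cache/kubernetes-install/kubernetes-salt.tar.gz", 6)
    if err != nil {
        fmt.Println(err)
    }
}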

@roberthbailey
Contributor

I guess the good news is that this shouldn't happen on release tars....

It sounds like it can sometimes take a little while for GCS ACLs to propagate (after this call is made). I wonder if there is a way to tell GCS that anything you put into bucket X in the future should be automatically marked as world readable.
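GCS buckets do have a default object ACL that newly uploaded objects inherit, which is one way to get that behavior. A hedged sketch using the Go client library rather than whatever the cluster scripts actually use (the bucket name is a placeholder):

// Illustrative only: set a bucket's default object ACL so that future uploads are
// publicly readable, instead of ACL-ing each object after it is uploaded.
package main

import (
    "context"
    "log"

    "cloud.google.com/go/storage"
)

func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // "my-kube-staging-bucket" is a placeholder, not the real bucket.
    acl := client.Bucket("my-kube-staging-bucket").DefaultObjectACL()
    if err := acl.Set(ctx, storage.AllUsers, storage.RoleReader); err != nil {
        log.Fatal(err)
    }
}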

@davidopp
Member

davidopp commented Jun 1, 2015

@wojtek-t is on vacation for the next couple of weeks, so we should either reassign or close if we think #8998 is sufficient. @roberthbailey what do you think?

@bprashanth
Contributor

When the GCE issues are fixed, feel free to reassign this to me; I can track the progress of the test on Jenkins and follow up on this bug.

@wojtek-t
Member Author

wojtek-t commented Jun 2, 2015

As of the weekend this wasn't fixed - we had a lot of red runs over the weekend. It seems to be in better shape now.

However, there seems to have been a huge drop in the performance metrics a few hours ago - can someone please take a look at it and roll back the PR that caused the regression?

cc @fgrzadkowski

@fgrzadkowski
Contributor

@wojtek-t I'm debugging the performance regression. I also submitted #9115 today, which should fix the remaining red runs.

@fgrzadkowski
Contributor

The majority of e2e scalability Jenkins runs are green now. In the last 14 runs we had 12 green and 2 red (one failed firewall creation and one due to a node not being ready). Let's close this issue and open separate issues if we decide to fix the remaining problems.

@wojtek-t
Member Author

wojtek-t commented Jun 3, 2015

Thanks @fgrzadkowski - SGTM

@davidopp
Member

davidopp commented Jun 4, 2015

I'm going to reopen this since there are apparently still problems with the e2e scalability test (the load test had to be disabled today, #9201). I should probably open a separate bug, but this is close enough.

@davidopp
Member

davidopp commented Jun 8, 2015

@fgrzadkowski any update on this, or are you still looking into it?

@davidopp davidopp added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 8, 2015
@fgrzadkowski
Contributor

@davidopp Yes, I'm looking into this. I need to refactor the load test to make it more robust.

@davidopp
Member

davidopp commented Jun 9, 2015

This is related to #4521.

@fgrzadkowski
Contributor

Many of the last runs didn't even start due to failed firewall creation; #9658 should fix this. In the remaining ones, most problems were caused by a too-short timeout for RC creation. Hopefully #9571 will fix that.

I'll watch Jenkins closely overnight.
