Failing performance tests on Jenkins #7561

Closed
wojtek-t opened this issue Apr 30, 2015 · 62 comments
Assignees
Labels
area/reliability priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@wojtek-t
Member

The first failure happened at ~9:20 PST on 29/04/2015. The failure looks like this:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:158
Expected error:
<*errors.errorString | 0xc20b84daa0>: {
s: "Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000",
}
Controller my-hostname-density3000-75cb549b-ee8e-11e4-ae78-42010af01555: Only found 3052 replicas out of 3000
not to have occurred

It seems that for some reason we are starting many more pods within a replication controller than expected. Another related failure:

Expected error:
<*errors.errorString | 0xc2083b3050>: {
s: "Number of reported pods changed: 3222 vs 3000",
}
Number of reported pods changed: 3222 vs 3000
not to have occurred

cc @quinton-hoole @lavalamp @bprashanth @fgrzadkowski

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/reliability team/master area/test-infra labels Apr 30, 2015
@wojtek-t wojtek-t added this to the v1.0 milestone Apr 30, 2015
@wojtek-t
Member Author

Feel free to move it outside v1.0 if you can explain it - if not, we should definitely try to understand it.

@wojtek-t
Member Author

Also note that this is a clear regression - I didn't observe any such failure before. That's why I'm marking it (at least for now) as a v1.0 issue.

@ghost

ghost commented Apr 30, 2015

Might end up being a duplicate of #7548, but let's investigate both in parallel until we confirm.

@wojtek-t
Member Author

Might end up being a duplicate of #7548, but let's investigate both in parallel until we confirm.

I think those are unrelated. Basically, the GCE tests are green now, whereas the performance tests haven't passed even once since then.

My feeling is that it might be related to #6866. @lavalamp @bprashanth - any thoughts?

@bprashanth
Contributor

Tl;dr: logs at V(4) would help. That can certainly happen if the watch takes O(minutes) to deliver notifications (which I've seen happen when we, say, delete 1000 pods at once).

If that's the problem, it essentially boils down to: the watch is slow and things time out because they don't know whether a packet was lost or is still in flight. There are 2 solutions: make the watch faster, or increase the timeout.

The good news is that the rc will balance itself out over time as the appropriate notifications get there. So if you really want to track the exact number of replicas at scale minute by minute, it's going to be hard. Previously we didn't have this problem because the rcs would poll every 5s, which has various other implications. Currently that 5s poll is replaced by 4 things:

  1. watchers on pods and controllers
  2. a periodic relist of the rcs every 30s
  3. a periodic relist of pods every 5m
  4. a TTL of 2m on what the rc expects to see via watch (i.e. if it created 3 pods it expects to see 3 watch notifications)

So it feels like we're hitting the 4th case (I can't be sure without logs), and if we didn't care about the replica count being exact minute to minute, we would be OK after a bit. FYI, I'm also adding some clarity to the density tests because currently they're close to impossible to debug.
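For reference, a minimal sketch of the idea behind item 4 above - an expectations record that is only trusted for a bounded time. The names (podExpectations, expectationTTL) are illustrative, not the actual replication controller code:

// Hypothetical sketch of "expectations with a TTL": the controller records how many
// watch notifications it still expects, and stops trusting that record once it is
// older than expectationTTL (the 2m mentioned above), at which point it falls back
// to relisting.
package main

import (
    "fmt"
    "time"
)

const expectationTTL = 2 * time.Minute

type podExpectations struct {
    pendingAdds int       // watch ADD notifications the rc still expects to see
    timestamp   time.Time // when the expectations were recorded
}

// fulfilled reports whether the rc may act on its cached view: either every expected
// notification has arrived, or the record is older than the TTL and is no longer trusted.
func (e *podExpectations) fulfilled(now time.Time) bool {
    return e.pendingAdds <= 0 || now.Sub(e.timestamp) > expectationTTL
}

// observeAdd would be called from the pod watch handler for each ADD notification.
func (e *podExpectations) observeAdd() { e.pendingAdds-- }

func main() {
    e := &podExpectations{pendingAdds: 3, timestamp: time.Now()}
    e.observeAdd()
    fmt.Println("can sync now:", e.fulfilled(time.Now()))                          // false: 2 notifications still missing
    fmt.Println("can sync after TTL:", e.fulfilled(time.Now().Add(3*time.Minute))) // true: expectations expired
}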

@davidopp davidopp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 30, 2015
@bprashanth
Contributor

There are at least 2 issues that can lead to a wrong number of replicas at this stage of the test:

  1. Slow create->watch cycle (by slow I mean slower than the 2-node, 50-pod cluster case). See the comments in Update expiration timeout based on observed latencies #7628 for more details.
  2. Scheduling pods on dead nodes (Pods being scheduled on shut down machines #7222) or nodes dying after pods have been scheduled on them. This will cause the nodecontroller to delete pods after the rc has observed them to be running, and the test will panic because of the mismatch.

I'm still digging into 2; I've seen it happen in at least 2 out of 5 runs. In both cases the system will correct itself given time.

@wojtek-t
Member Author

wojtek-t commented May 4, 2015

Thanks for investigating it @bprashanth.
It seems that #7628 fixed the huge majority of flakes. Thanks for the fix!
And it seems that the remaining flakes are exactly due to #7222, which you described.

Since the flakes are currently very rare, I'm decreasing the priority of this bug.
Also, I think we can close this issue as a duplicate of #7222 (feel free to do it).

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 4, 2015
@bprashanth
Contributor

There's a third case which is currently the most frequent failure mode:
<*errors.errorString | 0xc21c4c0580>: {
s: "Error: Pod my-hostname-density3000-8da6844d-f23a-11e4-bd75-42010af0159ffb5: Container my-hostname-density3000-8da6844d-f23a-11e4-bd75-42010af01555 was found to have terminated 1 times",
}

It also occasionally crops up in the normal e2es (#7548 (comment)). Investigating.

@bprashanth
Contributor

This should fix the unexpected terminations, or reveal another race condition that needs handling: #7749

@wojtek-t
Member Author

wojtek-t commented May 5, 2015

@bprashanth thanks for the heads-up.

@bprashanth
Contributor

The mentioned PR has merged, and now I'm seeing high-latency requests of 2 types:

  1. Listing events: this doesn't sound like a reason to fail the test - can we ignore them? @wojtek-t @rrati
  2. Listing pods: happens less often; investigating.

To be clear, I'm talking about the failure message:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:164
Expected
    <int>: 2
not to be >
    <int>: 0

which inevitably turns out to be:

INFO: WARNING - requests with too high latency: {verb:LIST resource:events quantile:0.9 latency:11844619000}
INFO: WARNING - requests with too high latency: {verb:LIST resource:events quantile:0.99 latency:14242416000}
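For context, that assertion boils down to counting API calls whose measured latency quantile exceeds a threshold and expecting the count to be 0. A rough sketch of the idea (the threshold and types here are illustrative, not the actual density.go code):

// Illustrative sketch of the "too high latency" check: count API calls whose measured
// latency exceeds a threshold; the test then expects that count to be 0.
package main

import (
    "fmt"
    "time"
)

type apiCallLatency struct {
    Verb     string
    Resource string
    Quantile float64
    Latency  time.Duration
}

// latencyThreshold is a made-up value; the real test derives its targets differently.
const latencyThreshold = 1 * time.Second

func countTooSlow(calls []apiCallLatency) int {
    bad := 0
    for _, c := range calls {
        if c.Latency > latencyThreshold {
            fmt.Printf("WARNING - request with too high latency: %+v\n", c)
            bad++
        }
    }
    return bad
}

func main() {
    calls := []apiCallLatency{
        {Verb: "LIST", Resource: "events", Quantile: 0.9, Latency: 11844619000 * time.Nanosecond},
        {Verb: "LIST", Resource: "events", Quantile: 0.99, Latency: 14242416000 * time.Nanosecond},
    }
    fmt.Println("requests over threshold:", countTooSlow(calls)) // must be 0 for the test to pass
}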

@wojtek-t
Member Author

wojtek-t commented May 6, 2015

Correct - this is something we're currently looking at.
cc @fgrzadkowski

@ghost

ghost commented May 6, 2015

The basic event listing e2e test is intermittently failing in the same way as bprashanth@ mentions above, so its failure is currently somewhat independent of scale. I'd be very happy to temporarily disable that part of this scalability test until the basic event listing test has been deflaked.

To be clear, I'm talking about this flaky basic event listing e2e test:

==== snip snip from our continuous integration tests =============

Identified problems

Events should be sent by kubelets and the scheduler about pods scheduling and running

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/events.go:124
kubelet events from running pod
Expected
: 0
not to be zero-valued

@fgrzadkowski
Contributor

@bprashanth I'll send a PR later today to ignore events-related metrics.

I was also investigating slow listing of pods. The main reason currently is that GOMAXPROCS for etcd is set to 1. Increasing it to the number of cores improves things a lot, but the metrics are still above the 1.0 targets. I believe the bottleneck is now somewhere in the apiserver itself.
Please let me know if you find something, so that we don't both investigate the same thing :)
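For illustration only - the actual change is in how the etcd process is launched (its GOMAXPROCS environment), not in Kubernetes code - this is the knob being discussed:

// Minimal illustration of the GOMAXPROCS setting mentioned above: with GOMAXPROCS=1 a
// Go process such as etcd schedules all goroutines on a single core; raising it to the
// number of cores lets list-heavy workloads use all of them.
package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("before:", runtime.GOMAXPROCS(0)) // passing 0 only queries the current value
    runtime.GOMAXPROCS(runtime.NumCPU())          // comparable to exporting GOMAXPROCS=<num cores> for the process
    fmt.Println("after:", runtime.GOMAXPROCS(0))
}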

@bprashanth
Contributor

Thanks for the heads up, will keep you posted

@wojtek-t
Member Author

In all the cases I've seen O(10) - those nodes weren't even registered.

@wojtek-t
Member Author

Currently, the limit is 5 minutes, but I'm increasing it to 10 minutes now.

@wojtek-t
Member Author

I was able to catch the above situation on Jenkins and did some more debugging on it. What happened:

  • Detected 99 ready nodes, found 99 nodes out of expected 100. Your cluster may not be working.

  • I found the node that wasn't registered, and after logging into the corresponding VM I saw:

    Broken (or in progress) GCE Kubernetes node setup! Suggested first step:
       tail /var/log/startupscript.log
    
  • Looking at the logs from the startup script, it basically wasn't able to unpack the salt tree:

    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: Unpacking Salt tree
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: 
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: gzip: stdin: not in gzip format
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: tar: Child returned status 1
    May 29 08:38:01 e2e-scalability-minion-i5x0 startupscript: tar: Error is not recoverable: exiting now
    
  • Looking at the exact file it was trying to unpack, we have:

    wojtekt@e2e-scalability-minion-i5x0:/var/cache/kubernetes-install$ cat kubernetes-salt.tar.gz 
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>wojtekt@e2e-scalability-minion-i5x0:/var/cache/kubernetes-install$ 
    
  • So this seems like some kind of race - I looked at a few other machines and they started later.

    I'm going to add a mechanism to retry downloading the files if we're not able to unpack them (sketched below).
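The real retry belongs in the GCE node startup script, but the retry-and-verify idea looks roughly like the following sketch (the URL, destination path and retry count are made up):

// Sketch of "retry the download until the file is actually a gzip archive": if GCS ACLs
// haven't propagated yet, the fetch returns an XML AccessDenied body instead of a tarball,
// which is exactly the failure shown above. All names here are illustrative.
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

var gzipMagic = []byte{0x1f, 0x8b} // first two bytes of any gzip stream

func downloadValidGzip(url, dest string, attempts int) error {
    for i := 0; i < attempts; i++ {
        resp, err := http.Get(url)
        if err == nil {
            body, readErr := io.ReadAll(resp.Body)
            resp.Body.Close()
            if readErr == nil && resp.StatusCode == http.StatusOK && bytes.HasPrefix(body, gzipMagic) {
                return os.WriteFile(dest, body, 0644)
            }
        }
        time.Sleep(5 * time.Second) // give the ACLs time to propagate, then try again
    }
    return fmt.Errorf("giving up on %s after %d attempts", url, attempts)
}

func main() {
    err := downloadValidGzip("https://storage.googleapis.com/BUCKET/kubernetes-salt.tar.gz",
        "/var/cache/kubernetes-install/kubernetes-salt.tar.gz", 6)
    if err != nil {
        fmt.Println(err)
    }
}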

@roberthbailey
Contributor

I guess the good news is that this shouldn't happen on release tars....

It sounds like it can sometimes take a little while for GCS ACLs to propagate (after this call is made). I wonder if there is a way to tell GCS that anything you put into bucket X in the future should be automatically marked as world readable.
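GCS buckets do have a default object ACL that newly uploaded objects inherit, which is one way to get that behavior. A hedged sketch using the Go client library rather than whatever the cluster scripts actually use (the bucket name is a placeholder):

// Illustrative only: set a bucket's default object ACL so that future uploads are
// publicly readable, instead of ACL-ing each object after it is uploaded.
package main

import (
    "context"
    "log"

    "cloud.google.com/go/storage"
)

func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // "my-kube-staging-bucket" is a placeholder, not the real bucket.
    acl := client.Bucket("my-kube-staging-bucket").DefaultObjectACL()
    if err := acl.Set(ctx, storage.AllUsers, storage.RoleReader); err != nil {
        log.Fatal(err)
    }
}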

@davidopp
Member

davidopp commented Jun 1, 2015

@wojtek-t is on vacation for the next couple of weeks, so we should either reassign or close if we think #8998 is sufficient. @roberthbailey what do you think?

@bprashanth
Contributor

When the GCE issues are fixed, feel free to reassign this to me; I can track the progress of the test on Jenkins and follow up on this bug.

@wojtek-t
Member Author

wojtek-t commented Jun 2, 2015

As of the weekend this wasn't fixed - we had a lot of red runs over the weekend. It seems to be in better shape now.

However, there seems to have been a huge drop in the performance metrics a few hours ago - can someone please take a look at it and roll back the PR that caused the regression?

cc @fgrzadkowski

@fgrzadkowski
Contributor

@wojtek-t I'm debugging the performance regression. I also submitted #9115 today, which should fix the remaining red runs.

@fgrzadkowski
Contributor

The majority of e2e scalability Jenkins runs are green now. In the last 14 runs we had 12 green and 2 red (one failed firewall creation and one due to a node not being ready). Let's close this issue and open separate issues if we decide to fix the remaining problems.

@wojtek-t
Member Author

wojtek-t commented Jun 3, 2015

Thanks @fgrzadkowski - SGTM

@davidopp
Member

davidopp commented Jun 4, 2015

I'm going to reopen this since there are apparently still problems with the e2e scalability test (the load test had to be disabled today, #9201). I should probably open a separate bug, but this is close enough.

@davidopp
Member

davidopp commented Jun 8, 2015

@fgrzadkowski any update on this, or are you still looking into it?

@davidopp davidopp added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 8, 2015
@fgrzadkowski
Contributor

@davidopp Yes, I'm looking into this. I need to refactor the load test to make it more robust.

@davidopp
Member

davidopp commented Jun 9, 2015

This is related to #4521.

@fgrzadkowski
Contributor

Many of the last runs didn't even start due to failed firewall creation; #9658 should fix this. In the remaining ones, most problems were caused by a too-short timeout for RC creation. Hopefully #9571 will fix that.

I'll watch Jenkins closely overnight.
