
e2e flake: Density [Skipped] [Performance] should allow starting 30 pods per node #19036

Closed
mikedanese opened this issue Dec 22, 2015 · 26 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@mikedanese
Member

Density [Skipped] [Performance] should allow starting 30 pods per node

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:149
There should be no high-latency requests
Expected: 1 not to be >: 0

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3584/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3581/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3580/

@wojtek-t

@mikedanese mikedanese added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 22, 2015
@wojtek-t
Member

@gmarek - can you please increase the limit for kibana (one of the failures above)?

@gmarek
Contributor

gmarek commented Dec 23, 2015

I'm not sure that's the correct course of action - currently the limit is 0.05 CPU and it passes nearly all the time. This failure is at nearly 0.5 CPU, a 10x difference. Increasing the limit that much would make the test pretty much useless and we would miss real regressions.

Sadly, I think we need to learn to live with sporadic failures of this kind - we depend on 3rd-party code, which may have bugs and could e.g. enter an infinite loop or something similar, causing a massive CPU usage spike.

If you still think increasing the limit to 0.5 is the way to go, I'll do it.
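
For illustration only, a minimal sketch of the shape of the check being discussed - a per-component CPU constraint that fails the test when observed usage exceeds it. The component name, limit map, and function names here are hypothetical, not the actual e2e resource-usage code:

package main

import "fmt"

// cpuConstraints maps a monitored component to its allowed CPU usage, in cores.
// The 0.05 value mirrors the kibana limit mentioned above; real names and
// values in the suite may differ.
var cpuConstraints = map[string]float64{
	"kibana-logging": 0.05,
}

func checkCPU(component string, observedCores float64) error {
	limit, ok := cpuConstraints[component]
	if !ok {
		return nil // component is not constrained
	}
	if observedCores > limit {
		return fmt.Errorf("%s used %.2f cores, limit is %.2f", component, observedCores, limit)
	}
	return nil
}

func main() {
	// A ~0.5-core spike, as in the failure above, is 10x the 0.05 limit.
	if err := checkCPU("kibana-logging", 0.5); err != nil {
		fmt.Println("FAIL:", err)
	}
}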

@wojtek-t
Member

Ohh - I didn't look at the exact value. I agree that increasing it to 0.5 doesn't make much sense. I'll think about whether we can do something else to prevent such failures in the future...

@gmarek gmarek changed the title Density [Skipped] [Performance] should allow starting 30 pods per node failed 3 times today e2e flake: Density [Skipped] [Performance] should allow starting 30 pods per node Dec 23, 2015
@gmarek
Contributor

gmarek commented Dec 23, 2015

Changed the name to match most other flake-related issues.

@gmarek
Contributor

gmarek commented Dec 23, 2015

To sum it up:

@davidopp
Member

@gmarek You've done a huge amount of work on the flaky tests already, but would you mind looking into these a bit more since @wojtek-t is out?

@gmarek
Contributor

gmarek commented Jan 11, 2016

Yes - we were looking into it and added some more logging. I'm waiting for the next failure to happen.

@gmarek
Contributor

gmarek commented Jan 22, 2016

We now have a lot of data, and a tool to analyze it is under review. It's gradually appearing in https://github.com/kubernetes/contrib/tree/master/compare.

@gmarek
Contributor

gmarek commented Jan 22, 2016

I'm not certain it's worth keeping this tagged as flaky - these tests are deliberately sensitive so that they help us catch even small regressions. The downside is that we see false positives here as well, and the suite takes quite a long time (still, it's the shortest of the blockers apart from the parallel suite and unit tests). The performance suite currently consists of two tests: density (which is really quick) and load (which by design runs for quite some time).

I think we should remove the load test from the blockers, but keep density and accept the rare false positives. A suite running only the density test would probably take around 15 minutes per run - maybe we can live with that?

We certainly want to run the load test somewhere as well, but maybe we don't need to block merges on it. @wojtek-t @davidopp @ihmccreery @alex-mohr

@wojtek-t
Member

I'm not sure that's true - starting & stopping a 100-node cluster takes quite a long time, so even a 5-minute density test adds up: a single run will last ~30 minutes (with building, starting & stopping the cluster).

I think the question is how frequently it is failing now - since I'm on paternity leave, can you write how often it has been failing recently?

@gmarek
Contributor

gmarek commented Jan 22, 2016

We had a couple of regressions recently, so it's hard to tell...

@gmarek
Contributor

gmarek commented Jan 22, 2016

I mean - I can give you some numbers, but they will be high and meaningless.

@wojtek-t
Member

Can we wait for better numbers, and get an exact figure for how many more PRs we would be able to merge with that change (assuming all other suites are green)?

@davidopp
Member

I'm OK with leaving this open for now, but we're trying extremely hard to get the number of flaky tests down to zero, so it would be better if someone could look at this. (That said, I realize that everyone who is in the best position to look at this is OOO at least all of next week.)

@gmarek When will better numbers be available?

@davidopp
Member

Since @gmarek is out, reassigning to @fgrzadkowski, who I believe is back from leave today. We're trying to close all the flake issues, so we need someone to work on this ASAP.

@davidopp davidopp assigned fgrzadkowski and unassigned gmarek Jan 25, 2016
@fgrzadkowski
Contributor

I checked the last 7 failures, which span ~4 days:

  • 5 failed due to long pod POST
  • 1 failed due to long pod POST and long nodes LIST
  • 1 failed due to long nodes LIST

I'll focus on understanding long pod POSTs.

@fgrzadkowski
Contributor

I carefully checked the logs of the 3 most recent failures. It seems that all long POSTs occurred at exactly the same time. Looking at the trace information, it seems that writing to the database is slow:

I0124 21:49:07.287768       6 trace.go:57] Trace "Create /api/v1/namespaces/e2e-tests-density-zedo8/pods" (started 2016-01-24 21:49:06.728761009 +0000 UTC):
[33.882µs] [33.882µs] About to convert to expected version
[219.655µs] [185.773µs] Conversion done
[293.985µs] [74.33µs] About to store object in database
[558.851537ms] [558.557552ms] Object stored in database
[558.864064ms] [12.527µs] Self-link added
[558.96554ms] [101.476µs] END

However, the metrics for etcd request latency are normal:

13:58:34       {
13:58:34         "metric": {
13:58:34           "__name__": "etcd_request_latencies_summary",
13:58:34           "operation": "create",
13:58:34           "quantile": "0.99",
13:58:34           "type": "*api.Pod"
13:58:34         },
13:58:34         "value": [
13:58:34           0,
13:58:34           "39583"
13:58:34         ]
13:58:34       },

I'm adding more tracing information that might shed more light.
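
For context, a minimal, self-contained imitation of the step-trace pattern that produces output like the log above (the real apiserver uses its own trace utility in trace.go; this sketch is purely illustrative):

package main

import (
	"fmt"
	"log"
	"time"
)

// stepTrace records named steps with cumulative and per-step durations,
// mimicking the "[total] [since last step] message" lines above.
type stepTrace struct {
	name     string
	start    time.Time
	lastStep time.Time
	steps    []string
}

func newStepTrace(name string) *stepTrace {
	now := time.Now()
	return &stepTrace{name: name, start: now, lastStep: now}
}

func (t *stepTrace) step(msg string) {
	now := time.Now()
	t.steps = append(t.steps,
		fmt.Sprintf("[%v] [%v] %s", now.Sub(t.start), now.Sub(t.lastStep), msg))
	t.lastStep = now
}

// logIfLong prints the trace only when the whole operation exceeded the
// threshold, which is why only slow requests show up in the logs.
func (t *stepTrace) logIfLong(threshold time.Duration) {
	if time.Since(t.start) < threshold {
		return
	}
	log.Printf("Trace %q (started %v):", t.name, t.start)
	for _, s := range t.steps {
		log.Println(s)
	}
}

func main() {
	tr := newStepTrace("Create /api/v1/namespaces/example/pods")
	tr.step("About to convert to expected version")
	tr.step("Conversion done")
	tr.step("About to store object in database")
	time.Sleep(550 * time.Millisecond) // simulate the slow write seen above
	tr.step("Object stored in database")
	tr.logIfLong(500 * time.Millisecond)
}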

@fgrzadkowski
Contributor

Adding more logs didn't help. It seems it's not related to etcd itself; something is slowing us down at the generic registry level (registry/generic/etcd). I'm trying to trace this locally now. Will update later today.

@fgrzadkowski
Contributor

OK. After more debugging I believe I understand what the problem is. I narrowed it down to generating new UIDs here. There are two steps which might be slow under heavy load:

  • starting a goroutine may be blocked if there are no free threads
  • coming back from sleep might be delayed

Each of those can add even 100ms under heavy load. From my observations, the second one is more problematic. As a result, I observed creations delayed by hundreds of milliseconds.

According to the comment in the code, the library we use cannot be called more often than every 100ns. I think that in such a case we should just actively loop if we see that the UID is the same as the previous one. It will be much faster than what we have now.

I'll send a PR.
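
A minimal sketch of the "actively loop instead of sleeping" idea described above, under the assumption of a time-based generator with ~100ns resolution; newTimeBasedID, NewUID, and the locking scheme are illustrative stand-ins, not the actual fix:

package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	uidLock sync.Mutex
	lastUID string
)

// newTimeBasedID stands in for the 3rd-party time-based UUID generator,
// which can return the same value when called more often than its clock
// resolution allows.
func newTimeBasedID() string {
	return fmt.Sprintf("%x", time.Now().UnixNano()/100) // ~100ns resolution
}

// NewUID actively retries instead of sleeping, so callers are never parked
// by the scheduler and then woken up hundreds of milliseconds later under
// heavy load.
func NewUID() string {
	uidLock.Lock()
	defer uidLock.Unlock()
	id := newTimeBasedID()
	for id == lastUID {
		id = newTimeBasedID()
	}
	lastUID = id
	return id
}

func main() {
	a, b := NewUID(), NewUID()
	fmt.Println(a, b, "distinct:", a != b)
}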

fgrzadkowski added a commit to fgrzadkowski/kubernetes that referenced this issue Jan 27, 2016
@davidopp
Member

Nice debugging! Thanks!

@saad-ali
Member

saad-ali commented May 13, 2016

Reactivating issue. This test ([k8s.io] Density [Feature:Performance] should allow starting 30 pods per node) looks like it is still flaking with the same error (/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:150: There should be no high-latency requests - Expected: 1 not to be >: 0) and is blocking the submit queue.

See:

@saad-ali saad-ali reopened this May 13, 2016
@j3ffml
Contributor

j3ffml commented May 13, 2016

Ping. This is grinding the submit queue to a halt.

@j3ffml
Contributor

j3ffml commented May 13, 2016

The two most recent occurrences on kubernetes-e2e-gce-scalability:

kubernetes-e2e-gce-scalability/7404 (internal jenkins)
kubernetes-e2e-gce-scalability/7402 (internal jenkins)

from build-log:

10:39:54 • Failure in Spec Teardown (AfterEach) [431.493 seconds]
10:39:54 [k8s.io] Density
10:39:54 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:508
10:39:54   [Feature:Performance] should allow starting 30 pods per node [AfterEach]
10:39:54   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:563
10:39:54 
10:39:54   There should be no high-latency requests
10:39:54   Expected
10:39:54       <int>: 1
10:39:54   not to be >
10:39:54       <int>: 0
10:39:54 
10:39:54   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:144

@davidopp
Member

Can someone try to binary search which PR caused this to start flaking?

@wojtek-t wojtek-t assigned wojtek-t and unassigned fgrzadkowski May 16, 2016
@wojtek-t
Member

I took a quick look into those failures and all of them so far were on "Listing nodes".
Looking into the traces, all long calls have traces like this one:

I0516 01:13:55.053026       5 trace.go:61] Trace "List /api/v1/nodes" (started 2016-05-16 01:13:54.4026014 +0000 UTC):
[6.736µs] [6.736µs] About to List from storage
[515.902567ms] [515.895831ms] Listing from storage done
[517.720369ms] [1.817802ms] Self-linking done
[650.384693ms] [132.664324ms] Writing http response done (101 items)
[650.385785ms] [1.092µs] END

The corresponding log from handlers is this one:

I0516 01:13:55.053113       5 handlers.go:161] GET /api/v1/nodes: (650.928582ms) 200 [[pod_nanny/v0.0.0 (linux/amd64) kubernetes/$Format] 10.180.57.3:58384]

So basically, this is a LIST request without the ResourceVersion option specified. Thus, this request is served directly from etcd and not from the cacher.

There isn't any complicated logic in the apiserver, so this is most probably etcd being slow for some reason. But I will slightly tune the traces in the lower layers of the apiserver to confirm whether it's really etcd.
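
To illustrate the cacher-vs-etcd distinction, a rough sketch assuming today's client-go API (which differs from the clients in use at the time): a list with ResourceVersion "0" may be served from the apiserver's watch cache, while a list without a ResourceVersion is read directly from etcd, which is the slow path in the trace above.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ResourceVersion "0": the apiserver may answer from its watch cache.
	cached, err := cs.CoreV1().Nodes().List(context.TODO(),
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	// No ResourceVersion: the request bypasses the cacher and is answered
	// directly from etcd - the slow path seen in the pod_nanny LIST above.
	direct, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println(len(cached.Items), "nodes (cached),", len(direct.Items), "nodes (direct)")
}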

@wojtek-t
Member

OK - so after adding some traces, we can confirm that there are two issues here.

  1. etcd is slow on some requests (a trace to confirm it):

    I0516 12:59:10.087352       6 trace.go:61] Trace "List *api.NodeList" (started 2016-05-16 12:59:09.582932886 +0000 UTC):
    [6.408µs] [6.408µs] About to list etcd node
    [487.402243ms] [487.395835ms] Etcd node listed
    [504.359283ms] [16.95704ms] Node list decoded
    [504.3678ms] [8.517µs] END
    
  2. There is a problem when pod_nanny is listing all nodes and it takes 0.5s+ to actually send the result over the wire:

    I0516 13:03:32.866816       6 trace.go:61] Trace "List /api/v1/nodes" (started 2016-05-16 13:03:32.303054205 +0000 UTC):
    [7.76µs] [7.76µs] About to List from storage
    [10.633748ms] [10.625988ms] Listing from storage done
    [60.482619ms] [49.848871ms] Self-linking done
    [563.714287ms] [503.231668ms] Writing http response done (101 items)
    [563.716009ms] [1.722µs] END
    

Neither of those is a real system issue.
So I'm afraid the only reasonable fix is to increase the timeout for listing nodes.
I'm going to send out a PR for it.
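
For illustration, a sketch of the kind of change described - keep a default latency threshold for API calls but allow a higher one for specific calls such as LIST nodes. The names, thresholds, and structure are hypothetical, not the actual e2e framework code:

package main

import (
	"fmt"
	"time"
)

type apiCall struct {
	Verb, Resource string
	Latency        time.Duration // e.g. the observed 99th percentile for this call
}

const defaultThreshold = 500 * time.Millisecond

// overrides relaxes the limit for calls known to be slow for benign reasons
// (slow etcd reads, large responses written over the wire).
var overrides = map[string]time.Duration{
	"LIST nodes": 1 * time.Second,
}

func highLatencyCalls(calls []apiCall) []apiCall {
	var bad []apiCall
	for _, c := range calls {
		threshold := defaultThreshold
		if t, ok := overrides[c.Verb+" "+c.Resource]; ok {
			threshold = t
		}
		if c.Latency > threshold {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	calls := []apiCall{
		{"LIST", "nodes", 650 * time.Millisecond}, // tolerated with the override
		{"POST", "pods", 558 * time.Millisecond},  // still flagged
	}
	fmt.Println(len(highLatencyCalls(calls)), "high-latency request(s)")
}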
