
e2e flake: Density [Skipped] [Performance] should allow starting 30 pods per node #19036

Closed
mikedanese opened this issue Dec 22, 2015 · 26 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@mikedanese
Member

Density [Skipped] [Performance] should allow starting 30 pods per node

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:149
There should be no high-latency requests
Expected: 1 not to be >: 0

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3584/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3581/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3580/

@wojtek-t

@mikedanese mikedanese added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 22, 2015
@wojtek-t
Member

@gmarek - can you please increase the limit for kibana (one of the failures above)?

@gmarek
Contributor

gmarek commented Dec 23, 2015

I'm not sure that's the correct course of action - currently the limit is 0.05 CPU and it passes nearly all the time. This failure is at nearly 0.5 CPU, a 10x difference. Increasing the limit that much would make the test pretty much useless and we would miss real regressions.

Sadly, I think we need to learn to live with sporadic failures of this kind - we depend on 3rd-party code, which may have bugs and could e.g. enter an infinite loop or something similar, causing a massive CPU usage spike.

If you still think increasing the limit to 0.5 is the way to go, I'll do it.
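
For illustration only, a minimal sketch of the shape of the check being discussed - a per-component CPU constraint that fails the test when observed usage exceeds it. The component name, limit map, and function names here are hypothetical, not the actual e2e resource-usage code:

package main

import "fmt"

// cpuConstraints maps a monitored component to its allowed CPU usage, in cores.
// The 0.05 value mirrors the kibana limit mentioned above; real names and
// values in the suite may differ.
var cpuConstraints = map[string]float64{
	"kibana-logging": 0.05,
}

func checkCPU(component string, observedCores float64) error {
	limit, ok := cpuConstraints[component]
	if !ok {
		return nil // component is not constrained
	}
	if observedCores > limit {
		return fmt.Errorf("%s used %.2f cores, limit is %.2f", component, observedCores, limit)
	}
	return nil
}

func main() {
	// A ~0.5-core spike, as in the failure above, is 10x the 0.05 limit.
	if err := checkCPU("kibana-logging", 0.5); err != nil {
		fmt.Println("FAIL:", err)
	}
}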

@wojtek-t
Member

Ohh - I didn't look at the exact value. I agree that increasing it to 0.5 doesn't make much sense. I'll think about whether we can do something else to prevent such failures in the future...

@gmarek gmarek changed the title Density [Skipped] [Performance] should allow starting 30 pods per node failed 3 times today e2e flake: Density [Skipped] [Performance] should allow starting 30 pods per node Dec 23, 2015
@gmarek
Contributor

gmarek commented Dec 23, 2015

Changed the name to match most other flake-related issues.

@gmarek
Contributor

gmarek commented Dec 23, 2015

To sum it up:

@davidopp
Member

@gmarek You've done a huge amount of work on the flaky tests already, but would you mind looking into these a bit more since @wojtek-t is out?

@gmarek
Contributor

gmarek commented Jan 11, 2016

Yes - we were looking into it and added some more logging. I'm waiting for the next failure to happen.

@gmarek
Contributor

gmarek commented Jan 22, 2016

We now have a lot of data, and a tool to analyze it is under review. It's gradually appearing in https://github.com/kubernetes/contrib/tree/master/compare.

@gmarek
Contributor

gmarek commented Jan 22, 2016

I'm not certain it's worth keeping this tagged as flaky - these tests are deliberately sensitive so that they help us catch even small regressions. The downside is that we see false positives here as well, and the suite takes quite a long time (still, it's the shortest of the blockers apart from the parallel suite and unit tests). The performance suite currently consists of two tests: density (which is really quick) and load (which by design runs for quite some time).

I think we should remove the load test from the blockers, but keep density and accept the rare false positives. A suite running only the density test would probably take around 15 minutes per run - maybe we can live with that?

We certainly want to run the load test somewhere as well, but maybe we don't need to block merges on it. @wojtek-t @davidopp @ihmccreery @alex-mohr

@wojtek-t
Member

I'm not sure that's true - starting & stopping a 100-node cluster takes quite a long time, so even a 5-minute density test adds up: a single run will last ~30 minutes (with building, starting & stopping the cluster).

I think the question is how frequently it is failing now - since I'm on paternity leave, can you write how often it has been failing recently?

@gmarek
Contributor

gmarek commented Jan 22, 2016

We had a couple of regressions recently, so it's hard to tell...

@gmarek
Contributor

gmarek commented Jan 22, 2016

I mean - I can give you some numbers, but they will be high and meaningless.

@wojtek-t
Member

Can we wait for better numbers, and get an exact figure for how many more PRs we would be able to merge with that change (assuming all other suites are green)?

@davidopp
Member

I'm OK with leaving this open for now, but we're trying extremely hard to get the number of flaky tests down to zero, so it would be better if someone could look at this. (That said, I realize that everyone who is in the best position to look at this is OOO at least all of next week.)

@gmarek When will better numbers be available?

@davidopp
Member

Since @gmarek is out, reassigning to @fgrzadkowski, who I believe is back from leave today. We're trying to close all the flake issues, so we need someone to work on this ASAP.

@davidopp davidopp assigned fgrzadkowski and unassigned gmarek Jan 25, 2016
@fgrzadkowski
Contributor

I checked the last 7 failures, which span ~4 days:

  • 5 failed due to long pod POST
  • 1 failed due to long pod POST and long nodes LIST
  • 1 failed due to long nodes LIST

I'll focus on understanding long pod POSTs.

@fgrzadkowski
Contributor

I carefully checked the logs of the 3 most recent failures. It seems that all long POSTs occurred at exactly the same time. Looking at the trace information, it seems that writing to the database is slow:

I0124 21:49:07.287768       6 trace.go:57] Trace "Create /api/v1/namespaces/e2e-tests-density-zedo8/pods" (started 2016-01-24 21:49:06.728761009 +0000 UTC):
[33.882µs] [33.882µs] About to convert to expected version
[219.655µs] [185.773µs] Conversion done
[293.985µs] [74.33µs] About to store object in database
[558.851537ms] [558.557552ms] Object stored in database
[558.864064ms] [12.527µs] Self-link added
[558.96554ms] [101.476µs] END

However, the metrics for etcd request latency are normal:

13:58:34       {
13:58:34         "metric": {
13:58:34           "__name__": "etcd_request_latencies_summary",
13:58:34           "operation": "create",
13:58:34           "quantile": "0.99",
13:58:34           "type": "*api.Pod"
13:58:34         },
13:58:34         "value": [
13:58:34           0,
13:58:34           "39583"
13:58:34         ]
13:58:34       },

I'm adding more tracing information that might shed more light.
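
For context, a minimal, self-contained imitation of the step-trace pattern that produces output like the log above (the real apiserver uses its own trace utility in trace.go; this sketch is purely illustrative):

package main

import (
	"fmt"
	"log"
	"time"
)

// stepTrace records named steps with cumulative and per-step durations,
// mimicking the "[total] [since last step] message" lines above.
type stepTrace struct {
	name     string
	start    time.Time
	lastStep time.Time
	steps    []string
}

func newStepTrace(name string) *stepTrace {
	now := time.Now()
	return &stepTrace{name: name, start: now, lastStep: now}
}

func (t *stepTrace) step(msg string) {
	now := time.Now()
	t.steps = append(t.steps,
		fmt.Sprintf("[%v] [%v] %s", now.Sub(t.start), now.Sub(t.lastStep), msg))
	t.lastStep = now
}

// logIfLong prints the trace only when the whole operation exceeded the
// threshold, which is why only slow requests show up in the logs.
func (t *stepTrace) logIfLong(threshold time.Duration) {
	if time.Since(t.start) < threshold {
		return
	}
	log.Printf("Trace %q (started %v):", t.name, t.start)
	for _, s := range t.steps {
		log.Println(s)
	}
}

func main() {
	tr := newStepTrace("Create /api/v1/namespaces/example/pods")
	tr.step("About to convert to expected version")
	tr.step("Conversion done")
	tr.step("About to store object in database")
	time.Sleep(550 * time.Millisecond) // simulate the slow write seen above
	tr.step("Object stored in database")
	tr.logIfLong(500 * time.Millisecond)
}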

@fgrzadkowski
Contributor

Adding more logs didn't help. It seems it's not related to etcd itself; something is slowing us down at the generic registry level (registry/generic/etcd). I'm trying to trace this locally now. Will update later today.

@fgrzadkowski
Contributor

OK. After more debugging I believe I understand what the problem is. I narrowed it down to generating new UIDs here. There are two steps which might be slow under heavy load:

  • starting a goroutine may be blocked if there are no free threads
  • coming back from sleep might be delayed

Each of those can add even 100ms under heavy load. From my observations, the second one is more problematic. As a result, I observed creations delayed by hundreds of milliseconds.

According to the comment in the code, the library we use cannot be called more often than every 100ns. I think that in such a case we should just actively loop if we see that the UID is the same as the previous one. It will be much faster than what we have now.

I'll send a PR.
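
A minimal sketch of the "actively loop instead of sleeping" idea described above, under the assumption of a time-based generator with ~100ns resolution; newTimeBasedID, NewUID, and the locking scheme are illustrative stand-ins, not the actual fix:

package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	uidLock sync.Mutex
	lastUID string
)

// newTimeBasedID stands in for the 3rd-party time-based UUID generator,
// which can return the same value when called more often than its clock
// resolution allows.
func newTimeBasedID() string {
	return fmt.Sprintf("%x", time.Now().UnixNano()/100) // ~100ns resolution
}

// NewUID actively retries instead of sleeping, so callers are never parked
// by the scheduler and then woken up hundreds of milliseconds later under
// heavy load.
func NewUID() string {
	uidLock.Lock()
	defer uidLock.Unlock()
	id := newTimeBasedID()
	for id == lastUID {
		id = newTimeBasedID()
	}
	lastUID = id
	return id
}

func main() {
	a, b := NewUID(), NewUID()
	fmt.Println(a, b, "distinct:", a != b)
}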

fgrzadkowski added a commit to fgrzadkowski/kubernetes that referenced this issue Jan 27, 2016
@davidopp
Member

Nice debugging! Thanks!

@saad-ali
Member

saad-ali commented May 13, 2016

Reactivating issue. This test ([k8s.io] Density [Feature:Performance] should allow starting 30 pods per node) looks like it is still flaking with the same error (/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:150: There should be no high-latency requests - Expected: 1 not to be >: 0) and is blocking the submit queue.

See:

@saad-ali saad-ali reopened this May 13, 2016
@j3ffml
Contributor

j3ffml commented May 13, 2016

Ping. This is grinding the submit queue to a halt.

@j3ffml
Contributor

j3ffml commented May 13, 2016

The two most recent occurrences on kubernetes-e2e-gce-scalability:

kubernetes-e2e-gce-scalability/7404 (internal jenkins)
kubernetes-e2e-gce-scalability/7402 (internal jenkins)

from build-log:

10:39:54 • Failure in Spec Teardown (AfterEach) [431.493 seconds]
10:39:54 [k8s.io] Density
10:39:54 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:508
10:39:54   [Feature:Performance] should allow starting 30 pods per node [AfterEach]
10:39:54   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:563
10:39:54 
10:39:54   There should be no high-latency requests
10:39:54   Expected
10:39:54       <int>: 1
10:39:54   not to be >
10:39:54       <int>: 0
10:39:54 
10:39:54   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:144

@davidopp
Member

Can someone try to binary search which PR caused this to start flaking?

@wojtek-t wojtek-t assigned wojtek-t and unassigned fgrzadkowski May 16, 2016
@wojtek-t
Member

I took a quick look into those failures and all of them so far were on "Listing nodes".
Looking into the traces, all long calls have traces like this one:

I0516 01:13:55.053026       5 trace.go:61] Trace "List /api/v1/nodes" (started 2016-05-16 01:13:54.4026014 +0000 UTC):
[6.736µs] [6.736µs] About to List from storage
[515.902567ms] [515.895831ms] Listing from storage done
[517.720369ms] [1.817802ms] Self-linking done
[650.384693ms] [132.664324ms] Writing http response done (101 items)
[650.385785ms] [1.092µs] END

The corresponding log from handlers is this one:

I0516 01:13:55.053113       5 handlers.go:161] GET /api/v1/nodes: (650.928582ms) 200 [[pod_nanny/v0.0.0 (linux/amd64) kubernetes/$Format] 10.180.57.3:58384]

So basically, this is a LIST request without the ResourceVersion option specified. Thus, this request is served directly from etcd and not from the cacher.

There isn't any complicated logic in the apiserver, so this is most probably etcd being slow for some reason. But I will slightly tune the traces in the lower layers of the apiserver to confirm whether it's really etcd.
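
To illustrate the cacher-vs-etcd distinction, a rough sketch assuming today's client-go API (which differs from the clients in use at the time): a list with ResourceVersion "0" may be served from the apiserver's watch cache, while a list without a ResourceVersion is read directly from etcd, which is the slow path in the trace above.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ResourceVersion "0": the apiserver may answer from its watch cache.
	cached, err := cs.CoreV1().Nodes().List(context.TODO(),
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	// No ResourceVersion: the request bypasses the cacher and is answered
	// directly from etcd - the slow path seen in the pod_nanny LIST above.
	direct, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println(len(cached.Items), "nodes (cached),", len(direct.Items), "nodes (direct)")
}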

@wojtek-t
Member

OK - so after adding some traces, we can confirm that there are two issues here.

  1. etcd is slow on some requests (a trace to confirm it):

    I0516 12:59:10.087352       6 trace.go:61] Trace "List *api.NodeList" (started 2016-05-16 12:59:09.582932886 +0000 UTC):
    [6.408µs] [6.408µs] About to list etcd node
    [487.402243ms] [487.395835ms] Etcd node listed
    [504.359283ms] [16.95704ms] Node list decoded
    [504.3678ms] [8.517µs] END
    
  2. There is a problem when pod_nanny is listing all nodes and it takes 0.5s+ to actually send the result over the wire:

    I0516 13:03:32.866816       6 trace.go:61] Trace "List /api/v1/nodes" (started 2016-05-16 13:03:32.303054205 +0000 UTC):
    [7.76µs] [7.76µs] About to List from storage
    [10.633748ms] [10.625988ms] Listing from storage done
    [60.482619ms] [49.848871ms] Self-linking done
    [563.714287ms] [503.231668ms] Writing http response done (101 items)
    [563.716009ms] [1.722µs] END
    

Neither of those is a real system issue.
So I'm afraid the only reasonable fix is to increase the timeout for listing nodes.
I'm going to send out a PR for it.
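
For illustration, a sketch of the kind of change described - keep a default latency threshold for API calls but allow a higher one for specific calls such as LIST nodes. The names, thresholds, and structure are hypothetical, not the actual e2e framework code:

package main

import (
	"fmt"
	"time"
)

type apiCall struct {
	Verb, Resource string
	Latency        time.Duration // e.g. the observed 99th percentile for this call
}

const defaultThreshold = 500 * time.Millisecond

// overrides relaxes the limit for calls known to be slow for benign reasons
// (slow etcd reads, large responses written over the wire).
var overrides = map[string]time.Duration{
	"LIST nodes": 1 * time.Second,
}

func highLatencyCalls(calls []apiCall) []apiCall {
	var bad []apiCall
	for _, c := range calls {
		threshold := defaultThreshold
		if t, ok := overrides[c.Verb+" "+c.Resource]; ok {
			threshold = t
		}
		if c.Latency > threshold {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	calls := []apiCall{
		{"LIST", "nodes", 650 * time.Millisecond}, // tolerated with the override
		{"POST", "pods", 558 * time.Millisecond},  // still flagged
	}
	fmt.Println(len(highLatencyCalls(calls)), "high-latency request(s)")
}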
