e2e flake: Density [Skipped] [Performance] should allow starting 30 pods per node #19036
@gmarek - can you please increase the limit for kibana (one of the failures above)?
I'm not sure that's the correct course of action - currently the limit is 0.05 CPU, and the test passes nearly all the time. This failure was at nearly 0.5 CPU, a 10x difference, so increasing the limit that much would make the test pretty much useless and we'd miss real regressions. Sadly, I think we need to learn to live with sporadic failures of this kind - we depend on third-party code, which may have bugs (e.g. enter an infinite loop or something) that cause a massive CPU usage spike. If you still think increasing the limit to 0.5 is the way to go, I'll do it.
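For illustration, a hypothetical sketch of how such a per-container CPU bound might be expressed in the test (the names, values, and structure here are assumptions, not taken from this thread or the actual e2e code):

```go
package e2e

// Hypothetical sketch: the resource-usage check compares observed
// per-container CPU usage against a map of expected upper bounds.
var expectedCPUUsage = map[string]float64{
	// kibana is capped at 0.05 cores; the failing run showed ~0.5
	// cores, so raising the cap 10x would mask real regressions.
	"kibana-logging": 0.05,
}

func violatesCPULimit(container string, observedCores float64) bool {
	limit, ok := expectedCPUUsage[container]
	return ok && observedCores > limit
}
```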
Ohh - I didn't look at the exact value. I agree that increasing it to 0.5 doesn't make much sense. I will think about whether we can do something else to prevent such failures in the future...
Changed the name to match most other flake-related issues.
To sum it up:
Yes - we were looking into it and added some more logging. I'm waiting for the next failure to happen.
We now have a lot of data, and a tool to analyze it is under review. It's gradually appearing in https://github.com/kubernetes/contrib/tree/master/compare. |
I'm not certain it's worth keeping this tagged as flaky - those tests are deliberately sensitive, to help us catch even small regressions. The downside is that we see false positives here as well, and this suite takes quite a long time (though it's still the shortest of the blockers, except for the parallel suite and unit tests). The performance suite currently consists of two tests: density (which is really quick) and load (which by design runs for quite some time). I think we should remove the load test from the blockers but keep density and accept the rare false positives. A suite running only the density test will probably take around 15 minutes per run - maybe we can live with that? We certainly want to run the load test somewhere as well, but maybe we don't need to block merges on it. @wojtek-t @davidopp @ihmccreery @alex-mohr
I'm not sure that's true - actually starting and stopping a 100-node cluster takes quite a long time, so even a 5-minute density test adds up: a single run will last ~30 minutes (with building, starting, and stopping the cluster). I think the real question is how frequently it fails now - since I'm on paternity leave, can you write how frequently it is failing?
We had a couple of regressions recently, so it's hard to tell... |
I mean - I can give you some numbers, but they will be high and meaningless. |
Can we wait for some better numbers, and get an exact figure for how many more PRs we'd be able to merge with that change (assuming all other suites are green)?
I'm OK with leaving this open for now, but we're trying extremely hard to get the number of flaky tests down to zero, so it would be better if someone could look at this. (That said, I realize that everyone who is in the best position to look at this is OOO at least all of next week.) @gmarek When will better numbers be available?
Since @gmarek is out, reassigning to @fgrzadkowski, who I believe is back from leave today. We're trying to close all the flake issues, so we need someone to work on this ASAP.
I checked the last 7 failures, which span ~4 days:
I'll focus on understanding long pod POSTs. |
I carefully checked the logs of the 3 most recent failures. It seems that all long POSTs occurred at exactly the same time. Looking at the trace information, it seems that writing to the database is slow:
However, the etcd request latency metrics are normal:
I'm adding more tracing information that might shed more light. |
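For reference, a minimal sketch of this kind of tracing, using the apiserver trace utility (the step names and threshold here are assumptions, not the actual instrumentation; the import path varies by release):

```go
package registry

import (
	"time"

	utiltrace "k8s.io/apiserver/pkg/util/trace"
)

// createWithTrace sketches how trace steps bracket the storage write
// so that slow operations dump a per-step timing breakdown.
func createWithTrace(write func() error) error {
	trace := utiltrace.New("Create /api/v1/pods")
	// Log all recorded steps if the whole operation exceeds the threshold.
	defer trace.LogIfLong(250 * time.Millisecond)

	trace.Step("About to write object to database")
	err := write()
	trace.Step("Object written to database")
	return err
}
```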
Adding more logs didn't help. It doesn't seem to be related to etcd itself; something slows us down at the generic registry level (registry/generic/etcd). I'm trying to trace this locally now. Will update later today.
Ok. After more debugging I believe I understand the problem. I narrowed it down to generating new UIDs here. There are two steps which might be slow under heavy load:
Each of those can add as much as 100ms under heavy load. From my observations, the second one is more problematic. As a result I observed creations delayed by hundreds of milliseconds. According to the comment in the code, the library we use cannot be called more often than every 100ns. I think that in this case we should just actively loop if we see that the UID is the same as the previous one. It will be much faster than what we have now. I'll send a PR.
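A minimal sketch of the active-loop idea, assuming the pborman/uuid time-based generator (an illustration of the approach, not the actual PR):

```go
package uid

import (
	"sync"

	"github.com/pborman/uuid"
)

var (
	uuidLock sync.Mutex
	lastUUID uuid.UUID
)

// NewUUID spins until the generator produces a fresh value instead of
// sleeping for a fixed interval. The time-based generator has 100ns
// resolution, so actively retrying is far cheaper under heavy load.
func NewUUID() uuid.UUID {
	uuidLock.Lock()
	defer uuidLock.Unlock()
	result := uuid.NewUUID()
	for uuid.Equal(lastUUID, result) {
		result = uuid.NewUUID()
	}
	lastUUID = result
	return result
}
```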
Nice debugging! Thanks! |
Reactivating this issue - the test is failing again. See:
Ping. This is grinding the submit queue to a halt.
The two most recent occurrences on kubernetes-e2e-gce-scalability: kubernetes-e2e-gce-scalability/7404 (internal Jenkins); from the build log:
|
Can someone try to binary search which PR caused this to start flaking? |
I took a quick look into those failures, and all failures so far were on "Listing nodes".
The corresponding log from handlers is this one:
So basically, this is a LIST request without the ResourceVersion option specified. Thus, the request is handled directly by etcd and not by the cacher. There isn't any complicated logic in the apiserver here, so this is most probably slow etcd (for some reason). But I will slightly tune the traces in the lower layers of the apiserver to confirm it's really etcd.
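For illustration, a hedged client-side sketch of the two paths (the function name is hypothetical, and client-go signatures vary by release - older ones take no context argument):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listNodesBothWays shows how the ResourceVersion field decides which
// path the apiserver takes for a LIST.
func listNodesBothWays(ctx context.Context, cs kubernetes.Interface) error {
	// ResourceVersion "0": the apiserver may serve this from its watch cache.
	if _, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{ResourceVersion: "0"}); err != nil {
		return err
	}
	// No ResourceVersion: the request goes straight through to etcd,
	// the slow path implicated in these failures.
	_, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	return err
}
```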
OK - so after adding some traces, we can confirm that there are two issues here.
Neither of those is a real system issue. |
Density [Skipped] [Performance] should allow starting 30 pods per node
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:149
There should be no high-latency requests
Expected
    <int>: 1
not to be >
    <int>: 0
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3584/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3581/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-scalability/3580/
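For context, a reconstructed sketch of what the assertion at density.go:149 presumably looks like, based only on the failure message above (variable and function names are hypothetical):

```go
package e2e

import (
	. "github.com/onsi/gomega"
)

// verifyNoHighLatencyRequests fails the test when any request exceeded
// the latency threshold, producing the "Expected 1 not to be > 0"
// message seen in the report above.
func verifyNoHighLatencyRequests(highLatencyRequests []string) {
	Expect(len(highLatencyRequests)).NotTo(BeNumerically(">", 0),
		"There should be no high-latency requests")
}
```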
@wojtek-t