API call latency regression #76579

Closed
wojtek-t opened this issue Apr 15, 2019 · 51 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

There is yet another regression in API-call latencies in the 5k-node test:
https://testgrid.k8s.io/sig-scalability-gce#gce-scale-performance

The last 3 runs of the load test failed there.

I haven't yet had time to look into it more deeply, but the last run in particular seems suspicious, due to this one:

Resource:nodes Subresource: Verb:POST Scope:cluster Latency:{Perc50:4.749ms Perc90:1.807403s Perc99:1.807403s} Count:16}

It seems some nodes were added during the test, which isn't expected.
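For context on why that single line trips the test: the job compares the reported percentiles against a latency threshold (roughly 1s at p99 for non-LIST calls). A minimal sketch of that kind of check, with illustrative types rather than the real test framework's:

package main

import (
	"fmt"
	"time"
)

// apiCall mirrors the summary line format above; these types are
// illustrative only, not the actual test framework's types.
type apiCall struct {
	Resource, Subresource, Verb, Scope string
	Perc50, Perc90, Perc99             time.Duration
	Count                              int
}

func main() {
	// The suspicious sample from the last run: POST nodes, cluster scope.
	c := apiCall{
		Resource: "nodes", Verb: "POST", Scope: "cluster",
		Perc50: 4749 * time.Microsecond,
		Perc90: 1807403 * time.Microsecond,
		Perc99: 1807403 * time.Microsecond,
		Count:  16,
	}

	// The scalability SLO these jobs enforce is roughly "p99 <= 1s" for
	// non-LIST calls, which is why this sample makes the run fail.
	const threshold = time.Second
	if c.Perc99 > threshold {
		fmt.Printf("%s %s (scope=%s): p99=%v > %v, count=%d\n",
			c.Verb, c.Resource, c.Scope, c.Perc99, threshold, c.Count)
	}
}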

@kubernetes/sig-scalability-bugs

@wojtek-t wojtek-t added the kind/bug Categorizes issue or PR as related to a bug. label Apr 15, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Apr 15, 2019
@wojtek-t wojtek-t added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 15, 2019
@wojtek-t
Member Author

/assign @oxddr

@k8s-ci-robot
Contributor

@wojtek-t: GitHub didn't allow me to assign the following users: oxddr.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @oxddr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@oxddr
Contributor

oxddr commented Apr 15, 2019

I'll have a look.

@wojtek-t
Member Author

I took a quick look at this one and went through the PRs merged between the runs when it started failing:
f873d2a...aa74064
But I didn't find anything really suspicious.

So I looked a bit into graphs, and my current hypothesis is that the regression actually happened earlier. This is based on some graphs in perf-dash:

[Screenshots: perf-dash latency graphs, 2019-04-16]

Both tend to suggest that the regression happened one run earlier (and we were just lucky in the first attempt after the regression).
Unfortunately, there was a hole in the runs at this point, so there are 93 PRs merged in that period:
51db0fe...f873d2a
and I haven't yet had time to look into those...

@wojtek-t
Member Author

I took a look at those 93 PRs merged in the period mentioned above. There are exactly 6 whose impact I couldn't rule out immediately (though in most cases I highly doubt they could have caused it):
[I highly doubt] #74877
[in theory probable] #75389
[I doubt, but...] #76065
[I doubt] #75967
[I doubt] #76211
[only if there would be many panics] #75853

@oxddr
Contributor

oxddr commented Apr 18, 2019

The last two runs have also failed density tests.

Interestingly, the 5000-node kubemark tests are passing just fine. I'll have a look at the PRs listed above and try to narrow them down.

@oxddr
Contributor

oxddr commented Apr 18, 2019

At this point we are still not sure what the last good commit is. Given that the regression may have happened before f873d2a and the load test can occasionally pass by luck, the last good run may go back as far as f6c51d6.

Next steps:

  1. Re-run the load test to check whether reverting the commits from #76579 (comment) helps
  2. Check the f6c51d6...51db0fe range for other obvious culprits

@oxddr
Contributor

oxddr commented Apr 19, 2019

The first re-run wasn't successful - the master became unreachable, causing the test to crash. I am going to re-run it.

@oxddr
Contributor

oxddr commented Apr 20, 2019

Reverting the commits from #76579 (comment) didn't help and the load test failed. I'll have a look at earlier commits then.

@oxddr
Contributor

oxddr commented Apr 23, 2019

In the meantime I am re-running tests at 51db0fe (another green run) to verify which was the last good run. I suspect the regression happened earlier and this was just a lucky run.

@oxddr
Contributor

oxddr commented Apr 24, 2019

The load test failed at 51db0fe. My current hypothesis is that the last green run was at f6c51d6. I am going to verify that now.

@oxddr
Contributor

oxddr commented Apr 25, 2019

I had problems with my machine and the results from yesterday's run are lost. I am rerunning the tests now.

I am also having a look at whether there are any traces of the increased latency in smaller tests (ideally a 100-node cluster). Being able to reproduce against a smaller cluster would shorten the feedback loop.

@wojtek-t
Member Author

I'm not convinced this is reflected in 100-node tests.

That said, I tried to look around to see whether I could find something else that regressed around that time.
And here is the graph for:

  • LIST pods (scope=cluster) for kubemark-5000 from exactly that time:

[Screenshot: perf-dash graph, LIST pods (scope=cluster), kubemark-5000, 2019-04-26]

The increase is really visible there.
The good news from that is:

  • it's from the density test, which is much faster
  • it's from kubemark, which requires fewer resources to run
  • it's from the Apr 9th time range, which is exactly when the load test started failing

@wojtek-t
Member Author

BTW, assuming I'm correct above, there are only 15 commits in that range:
2a05e1b...99a5235

@wojtek-t
Member Author

Obviously, that's assuming it's not a test-infra change or something like that...

@wojtek-t
Member Author

I checked test-infra and there weren't any changes to our jobs in this period (we enabled the Prometheus stack in the previous run, but the last good one already had that).

@oxddr
Contributor

oxddr commented Apr 26, 2019

I've checked your suggestion from the comment above and there is indeed a correlation between the increased latency in density tests on kubemark and the failures on GCE 5000.

I had a look at the commit range. There were two minor changes to the scheduler (db1ab23, a93f803) and an update of dependencies (1cca3f9).

@dims
Member

dims commented Apr 26, 2019

@liggitt 1cca3f9 was one of yours ...

@oxddr
Contributor

oxddr commented Apr 26, 2019

FYI: I am going to run kubemark tests now with 1cca3f9 reverted to check whether it's the culprit.

@liggitt
Member

liggitt commented Apr 26, 2019

1cca3f9 is the merge commit of a PR updating several dependencies (each in their own commit), and won't revert cleanly on master. If I had to pick a likely candidate, I would try reverting the golang.org/x/sys dependency to its previous version

@oxddr
Contributor

oxddr commented Apr 26, 2019

@liggitt I wasn't precise in my previous comment: I am running the Kubernetes version as of 99a5235 with 1cca3f9 reverted. It reverted cleanly for me.

@liggitt
Member

liggitt commented Apr 26, 2019

I see. Looking through the list of updated dependencies in that PR, only golang.org/x/sys and golang.org/x/text seem like likely candidates for having any performance implications.

If you wanted to try on master with the previous levels of those dependencies, you could do this:

hack/pin-dependency.sh golang.org/x/sys v0.0.0-20171031081856-95c657629925
hack/pin-dependency.sh golang.org/x/text v0.0.0-20170810154203-b19bf474d317
hack/update-vendor.sh

Note that the level of those dependencies we were previously on was ~18 months old and not in sync with go 1.12, so I'd be concerned about leaving those at those levels indefinitely.

@wojtek-t
Member Author

Thanks Jordan - although other PRs do seem suspicious, we should first try to prove that it's really this PR that caused the regression. Once we know that for sure, we should find the exact commit and then think about what to do with it.
I looked into the description of golang.org/x/text, and it doesn't seem like something that should visibly affect the tests.
What I'm wondering, though, is why it couldn't be golang.org/x/crypto - why did you exclude that one?

@liggitt
Member

liggitt commented Apr 26, 2019

What I'm wondering, though, is why it couldn't be golang.org/x/crypto - why did you exclude that one?

because that bump only changed a single line:

 func parseOpenSSHPrivateKey(key []byte) (crypto.PrivateKey, error) {
-       magic := append([]byte("openssh-key-v1"), 0)
-       if !bytes.Equal(magic, key[0:len(magic)]) {
+       const magic = "openssh-key-v1\x00"
+       if len(key) < len(magic) || string(key[:len(magic)]) != magic {
                return nil, errors.New("ssh: invalid openssh private key format")
        }
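To make that concrete, here is a small standalone sketch (not the vendored code itself) showing that the old and new prefix checks accept the same well-formed keys, so the bump shouldn't have any measurable behavior or performance impact:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	key := []byte("openssh-key-v1\x00rest-of-key-bytes")

	// Old check: build the magic as a byte slice and compare with bytes.Equal.
	oldMagic := append([]byte("openssh-key-v1"), 0)
	oldOK := len(key) >= len(oldMagic) && bytes.Equal(oldMagic, key[0:len(oldMagic)])

	// New check: compare a prefix of the key against a string constant.
	const magic = "openssh-key-v1\x00"
	newOK := len(key) >= len(magic) && string(key[:len(magic)]) == magic

	// Both accept the same well-formed keys; the new version just adds a
	// length guard and avoids an allocation.
	fmt.Println(oldOK, newOK)
}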

@wojtek-t
Member Author

wojtek-t commented Apr 26, 2019

Got it - thanks, that makes a lot of sense now.

Unfortunately, I just realized that the results when the test is run locally (those for kubemark-5000) will not really be comparable, because:

  • there are only 3 "LIST all pods" requests in this test (it's interesting how stable that is, though...)
  • those are issued from the test itself, so they will be affected by network latency

@oxddr So the fact that those will be higher doesn't mean anything on its own - you need to compare the results relative to each other when running the test locally...

@wojtek-t
Member Author

I did a number of experiments over the weekend.
I was experimenting with "kubemark-5000", looking into "LIST pods (scope=cluster)" API call latencies, based on #76579 (comment)

As I also wrote above, I'm not entirely sure it's the right thing to look at, but it was easy to leave the tests running. Here are the results (50th, 90th and 99th percentile, in microseconds):

  • from commit 1cca3f9 (Jordan's dependency update PR):
297409, 542543, 542543   [a bit of an outlier]
165483, 298405, 298405
172608, 297054, 297054
176362, 299967, 299967
  • from commit b8b7ab3 (the one right before Jordan's PR):
3691183, 4021243, 5058662  [this is a strange outlier]
263, 224956, 224956
20565, 96741, 96741
139, 21546, 21546

The above seems to suggest that this really is the offending PR.

So I did a bit more:

  • I reverted 4 commits: golang.org/x/sys, golang.org/x/crypto, golang.org/x/tools, golang.org/x/exp:
181832, 295141, 295141
179151, 293593, 293593
166846, 292647, 292647

[The results look really similar to the bad ones]

  • On top of that, I also reverted: github.com/stretchr/objx, gonum.org/v1/gonum, golang.org/x/text
24333, 219550, 219550
199553, 854049, 854049
179570, 293148, 293148
25676, 181217, 181217
20889, 1850705, 1850705
177521, 293834, 293834

And I have trouble interpreting these results - the lower 50th percentile in 3 of 6 attempts may suggest the culprit is here, but it still looks visibly worse than b8b7ab3.

But I will take the risk, and in the background will try to run tests with all golang.org deps reverted, leaving all the others unreverted (to check whether the change is really in that second batch of reverted commits).

@liggitt
Member

liggitt commented May 10, 2019

Kubemark 5000 looks better once Revert "github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973" got in.

It looks better, but that is likely due to the revert restoring the bug that artificially reports lower latencies at small sample sizes. I expect #77735 (which fixes that reporting bug) will likely shift those reported latencies higher again. Note that it is not regressing actual latency, just making the reported latency more accurate at small sample sizes.
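To illustrate what "artificially lower at small sample sizes" means, here is a rough sketch using the beorn7/perks targeted quantile estimator (the library whose version is being discussed); the exact numbers depend on which perks revision is vendored:

package main

import (
	"fmt"
	"sort"

	"github.com/beorn7/perks/quantile"
)

func main() {
	// A handful of latency samples (in seconds), mimicking a metric with
	// very few observations, e.g. only a few LIST calls in a run.
	samples := []float64{0.12, 0.15, 0.18, 0.25, 1.8}

	// Targeted quantile stream, the same estimator client_golang summaries use.
	q := quantile.NewTargeted(map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001})
	for _, v := range samples {
		q.Insert(v)
	}

	// With only 5 samples, the exact p99 is effectively the maximum observation.
	sort.Float64s(samples)
	exact := samples[len(samples)-1]

	// Depending on which perks revision is vendored, the estimate below can
	// come out lower than the exact value - that is the reporting artifact
	// described above, not a change in real latency.
	fmt.Printf("estimated p99: %.3fs, exact p99: %.3fs\n", q.Query(0.99), exact)
}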

@wojtek-t
Member Author

It looks better, but that is likely due to the revert restoring the bug that artificially reports lower latencies at small sample sizes. I expect #77735 (which fixes that reporting bug) will likely shift those reported latencies higher again. Note that it is not regressing actual latency, just making the reported latency more accurate at small sample sizes.

@liggitt - as you expected, the revert increased it again:
http://perf-dash.k8s.io/#/?jobname=kubemark-5000Nodes&metriccategoryname=APIServer&metricname=DensityResponsiveness&Resource=pods&Scope=cluster&Subresource=&Verb=LIST

But that proves that this is not the cause of the regression - it was just the reporting bug.
So we should probably get rid of the pin to that old version completely.

[Sorry for the delay, I've been OOO since Wednesday]

Why is increased kube-proxy logging causing an OOM in the first place? And why is it continuously failing to restore? If it's -4/lock contention, then there are already many fixes in-place (some kernel-side) to handle that. I'm very curious.

@dcbw - it's probably because we are not setting resource limits for kube-proxy in any setup, so it may eat all the memory on the machine.
And it's NOT "continuously" failing to restore - just some attempts to restore are failing (though I don't have concrete numbers on the percentage).

@liggitt
Member

liggitt commented May 13, 2019

So we should probably get rid of the pin to that old version completely.

Agreed. Opened #77809 to restore the version prometheus expects to use.

mm4tt added a commit to mm4tt/test-infra that referenced this issue May 13, 2019
Ref. kubernetes/kubernetes#76579

As the regression is clearly visible in the density tests, we'll disable the long-running (~10h) load tests and run the density tests more frequently until we make them pass.
@wojtek-t
Member Author

This is probably what we should focus on going forward:
http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness&Resource=pods&Scope=cluster&Subresource=&Verb=LIST

The latency of "LIST pods (scope=cluster)" calls visibly increased. This may be related to the fact that the number of those calls also increased. Understanding where those calls are coming from may help us understand the problem. @mborsz @mm4tt
#77808 (comment) seems related
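One way to see where those cluster-scoped LIST pods calls come from (assuming apiserver audit logging is enabled for that job, which I haven't verified) would be to group the audit events by user agent; a rough sketch, with the log path as a placeholder:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// auditEvent picks out only the audit.k8s.io/v1 fields we need.
type auditEvent struct {
	Verb      string `json:"verb"`
	UserAgent string `json:"userAgent"`
	ObjectRef *struct {
		Resource  string `json:"resource"`
		Namespace string `json:"namespace"`
	} `json:"objectRef"`
}

func main() {
	// Placeholder path - where the audit log actually lives depends on the
	// job's apiserver audit configuration.
	f, err := os.Open("/var/log/kube-apiserver-audit.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long
	for sc.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue
		}
		// Cluster-scoped LIST pods: list verb, pods resource, empty namespace.
		if ev.Verb == "list" && ev.ObjectRef != nil &&
			ev.ObjectRef.Resource == "pods" && ev.ObjectRef.Namespace == "" {
			counts[ev.UserAgent]++
		}
	}
	for ua, n := range counts {
		fmt.Printf("%6d  %s\n", n, ua)
	}
}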

@mm4tt
Contributor

mm4tt commented May 16, 2019

Regarding kubernetes/test-infra#12686 and why we believe it may help with this regression (or rather, one of these regressions):

We've noticed an increase in memory used by coredns between runs 328 and 329
[Graph: coredns memory usage, runs 328-329]

This memory increase causes coredns to OOM (due to its 170MB memory limit), which in turn has a negative impact on API latency (coredns gets restarted and then re-lists all pods/endpoints/etc.).
We've noticed that between runs 328 and 330 the COS version changed on the nodes (nodes only, because we override it on the master) in #75149.

We ran two tests: one on exactly the same configuration as run 330, and another with #75149 reverted.
Below are the graphs, which show that the COS version upgrade is very likely related to the coredns memory increase.

[Graph: run 330]

[Graph: run 330 with #75149 reverted]

mm4tt added a commit to mm4tt/test-infra that referenced this issue May 20, 2019
We've ruled out prometheus as a cause of the regressions a long time ago. Actually, it helps a lot when debugging.

The positive side effect of this change will be a mitigation of the faster MIG startup problem; in the end it will add an O(a few minutes) delay before running the tests.

Ref. kubernetes/kubernetes#76579
mm4tt added a commit to mm4tt/test-infra that referenced this issue May 22, 2019
@wojtek-t
Member Author

I forked the CoreDNS problem to a separate issue: #78562

With that, this one can be closed.

chases2 pushed a commit to chases2/test-infra that referenced this issue Jun 4, 2019
This is essentially reverting kubernetes/kubernetes#77224 in our gce 5000 performance tests.

We believe it may be related to kubernetes/kubernetes#76579
@answer1991
Contributor

/cc
