API call latency regression #76579

Closed
wojtek-t opened this issue Apr 15, 2019 · 51 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

There is yet another regression in API-call latencies in the 5k-node test:
https://testgrid.k8s.io/sig-scalability-gce#gce-scale-performance

The last 3 runs of the load test failed there.

I haven't yet had time to look into it more deeply, but the last run in particular seems suspicious, due to this one:

Resource:nodes Subresource: Verb:POST Scope:cluster Latency:{Perc50:4.749ms Perc90:1.807403s Perc99:1.807403s} Count:16}

It seems some nodes were added during the test, which isn't expected.
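For context on why that single line trips the test: the job compares the reported percentiles against a latency threshold (roughly 1s at p99 for non-LIST calls). A minimal sketch of that kind of check, with illustrative types rather than the real test framework's:

package main

import (
	"fmt"
	"time"
)

// apiCall mirrors the summary line format above; these types are
// illustrative only, not the actual test framework's types.
type apiCall struct {
	Resource, Subresource, Verb, Scope string
	Perc50, Perc90, Perc99             time.Duration
	Count                              int
}

func main() {
	// The suspicious sample from the last run: POST nodes, cluster scope.
	c := apiCall{
		Resource: "nodes", Verb: "POST", Scope: "cluster",
		Perc50: 4749 * time.Microsecond,
		Perc90: 1807403 * time.Microsecond,
		Perc99: 1807403 * time.Microsecond,
		Count:  16,
	}

	// The scalability SLO these jobs enforce is roughly "p99 <= 1s" for
	// non-LIST calls, which is why this sample makes the run fail.
	const threshold = time.Second
	if c.Perc99 > threshold {
		fmt.Printf("%s %s (scope=%s): p99=%v > %v, count=%d\n",
			c.Verb, c.Resource, c.Scope, c.Perc99, threshold, c.Count)
	}
}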

@kubernetes/sig-scalability-bugs

@wojtek-t wojtek-t added the kind/bug Categorizes issue or PR as related to a bug. label Apr 15, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Apr 15, 2019
@wojtek-t wojtek-t added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 15, 2019
@wojtek-t
Member Author

/assign @oxddr

@k8s-ci-robot
Contributor

@wojtek-t: GitHub didn't allow me to assign the following users: oxddr.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @oxddr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@oxddr
Contributor

oxddr commented Apr 15, 2019

I'll have a look.

@wojtek-t
Member Author

I took a quick look at this one and went through the PRs merged between the runs when it started failing:
f873d2a...aa74064
But I didn't find anything really suspicious.

So I looked a bit into graphs, and my current hypothesis is that the regression actually happened earlier. This is based on some graphs in perf-dash:

[Screenshots: perf-dash latency graphs, 2019-04-16]

Both tend to suggest that the regression happened one run earlier (and we were just lucky in the first attempt after the regression).
Unfortunately, there was a hole in the runs at this point, so there are 93 PRs merged in that period:
51db0fe...f873d2a
and I haven't yet had time to look into those...

@wojtek-t
Member Author

I took a look at those 93 PRs merged in the period mentioned above. There are exactly 6 whose impact I couldn't rule out immediately (though in most cases I highly doubt they could have caused it):
[I highly doubt] #74877
[in theory probable] #75389
[I doubt, but...] #76065
[I doubt] #75967
[I doubt] #76211
[only if there would be many panics] #75853

@oxddr
Contributor

oxddr commented Apr 18, 2019

The last two runs have also failed density tests.

Interestingly, the 5000-node kubemark tests are passing just fine. I'll have a look at the PRs listed above and try to narrow them down.

@oxddr
Contributor

oxddr commented Apr 18, 2019

At this point we are still not sure what the last good commit is. Given that the regression may have happened before f873d2a and the load test can occasionally pass by luck, the last good run may go back as far as f6c51d6.

Next steps:

  1. Re-run the load test to check whether reverting the commits from #76579 (comment) helps
  2. Check the f6c51d6...51db0fe range for other obvious culprits

@oxddr
Contributor

oxddr commented Apr 19, 2019

The first re-run wasn't successful - the master became unreachable, causing the test to crash. I am going to re-run it.

@oxddr
Contributor

oxddr commented Apr 20, 2019

Reverting the commits from #76579 (comment) didn't help and the load test failed. I'll have a look at earlier commits then.

@oxddr
Contributor

oxddr commented Apr 23, 2019

In the meantime I am re-running tests at 51db0fe (another green run) to verify which was the last good run. I suspect the regression happened earlier and this was just a lucky run.

@oxddr
Contributor

oxddr commented Apr 24, 2019

The load test failed at 51db0fe. My current hypothesis is that the last green run was at f6c51d6. I am going to verify that now.

@oxddr
Contributor

oxddr commented Apr 25, 2019

I had problems with my machine and the results from yesterday's run are lost. I am rerunning the tests now.

I am also having a look at whether there are any traces of the increased latency in smaller tests (ideally a 100-node cluster). Being able to reproduce against a smaller cluster would shorten the feedback loop.

@wojtek-t
Member Author

I'm not convinced this is reflected in 100-node tests.

That said, I tried to look around to see whether I could find something else that regressed around that time.
And here is the graph for:

  • LIST pods (scope=cluster) for kubemark-5000 from exactly that time:

[Screenshot: perf-dash graph, LIST pods (scope=cluster), kubemark-5000, 2019-04-26]

The increase is really visible there.
The good news from that is:

  • it's from the density test, which is much faster
  • it's from kubemark, which requires fewer resources to run
  • it's from the Apr 9th time range, which is exactly when the load test started failing

@wojtek-t
Member Author

BTW, assuming I'm correct above, there are only 15 commits in that range:
2a05e1b...99a5235

@wojtek-t
Member Author

Obviously, that's assuming it's not a test-infra change or something like that...

@wojtek-t
Member Author

I checked test-infra and there weren't any changes to our jobs in this period (we enabled the Prometheus stack in the previous run, but the last good one already had that).

@oxddr
Contributor

oxddr commented Apr 26, 2019

I've checked your suggestion from the comment above and there is indeed a correlation between the increased latency in density tests on kubemark and the failures on GCE 5000.

I had a look at the commit range. There were two minor changes to the scheduler (db1ab23, a93f803) and an update of dependencies (1cca3f9).

@dims
Member

dims commented Apr 26, 2019

@liggitt 1cca3f9 was one of yours ...

@oxddr
Contributor

oxddr commented Apr 26, 2019

FYI: I am going to run kubemark tests now with 1cca3f9 reverted to check whether it's the culprit.

@liggitt
Member

liggitt commented Apr 26, 2019

1cca3f9 is the merge commit of a PR updating several dependencies (each in their own commit), and won't revert cleanly on master. If I had to pick a likely candidate, I would try reverting the golang.org/x/sys dependency to its previous version

@oxddr
Contributor

oxddr commented Apr 26, 2019

@liggitt I wasn't precise in my previous comment: I am running the Kubernetes version as of 99a5235 with 1cca3f9 reverted. It reverted cleanly for me.

@liggitt
Member

liggitt commented Apr 26, 2019

I see. Looking through the list of updated dependencies in that PR, only golang.org/x/sys and golang.org/x/text seem like likely candidates for having any performance implications.

If you wanted to try on master with the previous levels of those dependencies, you could do this:

hack/pin-dependency.sh golang.org/x/sys v0.0.0-20171031081856-95c657629925
hack/pin-dependency.sh golang.org/x/text v0.0.0-20170810154203-b19bf474d317
hack/update-vendor.sh

Note that the level of those dependencies we were previously on was ~18 months old and not in sync with go 1.12, so I'd be concerned about leaving those at those levels indefinitely.

@wojtek-t
Member Author

Thanks Jordan - although other PRs do seem suspicious, we should first try to prove that it's really this PR that caused the regression. Once we know that for sure, we should find the exact commit and then think about what to do with it.
I looked into the description of golang.org/x/text, and it doesn't seem like something that should visibly affect the tests.
What I'm wondering, though, is why it couldn't be golang.org/x/crypto - why did you exclude that one?

@liggitt
Member

liggitt commented Apr 26, 2019

What I'm wondering, though, is why it couldn't be golang.org/x/crypto - why did you exclude that one?

because that bump only changed a single line:

 func parseOpenSSHPrivateKey(key []byte) (crypto.PrivateKey, error) {
-       magic := append([]byte("openssh-key-v1"), 0)
-       if !bytes.Equal(magic, key[0:len(magic)]) {
+       const magic = "openssh-key-v1\x00"
+       if len(key) < len(magic) || string(key[:len(magic)]) != magic {
                return nil, errors.New("ssh: invalid openssh private key format")
        }
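To make that concrete, here is a small standalone sketch (not the vendored code itself) showing that the old and new prefix checks accept the same well-formed keys, so the bump shouldn't have any measurable behavior or performance impact:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	key := []byte("openssh-key-v1\x00rest-of-key-bytes")

	// Old check: build the magic as a byte slice and compare with bytes.Equal.
	oldMagic := append([]byte("openssh-key-v1"), 0)
	oldOK := len(key) >= len(oldMagic) && bytes.Equal(oldMagic, key[0:len(oldMagic)])

	// New check: compare a prefix of the key against a string constant.
	const magic = "openssh-key-v1\x00"
	newOK := len(key) >= len(magic) && string(key[:len(magic)]) == magic

	// Both accept the same well-formed keys; the new version just adds a
	// length guard and avoids an allocation.
	fmt.Println(oldOK, newOK)
}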

@wojtek-t
Member Author

wojtek-t commented Apr 26, 2019

Got it - thanks, that makes a lot of sense now.

Unfortunately, I just realized that the results when the test is run locally (those for kubemark-5000) will not really be comparable, because:

  • there are only 3 "LIST all pods" requests in this test (it's interesting how stable that is, though...)
  • those are issued from the test itself, so they will be affected by network latency

@oxddr So the fact that those will be higher doesn't mean anything on its own - you need to compare the results relative to each other when running the test locally...

@wojtek-t
Member Author

I did a number of experiments over the weekend.
I was experimenting with "kubemark-5000", looking into "LIST pods (scope=cluster)" API call latencies, based on #76579 (comment)

As I also wrote above, I'm not entirely sure it's the right thing to look at, but it was easy to leave the tests running. Here are the results (50th, 90th and 99th percentile, in microseconds):

  • from commit 1cca3f9 (Jordan's dependency update PR):
297409, 542543, 542543   [a bit of an outlier]
165483, 298405, 298405
172608, 297054, 297054
176362, 299967, 299967
  • from commit b8b7ab3 (the one right before Jordan's PR):
3691183, 4021243, 5058662  [this is a strange outlier]
263, 224956, 224956
20565, 96741, 96741
139, 21546, 21546

The above seems to suggest that this really is the offending PR.

So I did a bit more:

  • I reverted 4 commits: golang.org/x/sys, golang.org/x/crypto, golang.org/x/tools, golang.org/x/exp:
181832, 295141, 295141
179151, 293593, 293593
166846, 292647, 292647

[The results look really similar to the bad ones]

  • On top of that, I also reverted: github.com/stretchr/objx, gonum.org/v1/gonum, golang.org/x/text
24333, 219550, 219550
199553, 854049, 854049
179570, 293148, 293148
25676, 181217, 181217
20889, 1850705, 1850705
177521, 293834, 293834

And I have trouble interpreting these results - the lower 50th percentile in 3 of 6 attempts may suggest the culprit is here, but it still looks visibly worse than b8b7ab3.

But I will take the risk, and in the background will try to run tests with all golang.org deps reverted, leaving all the others unreverted (to check whether the change is really in that second batch of reverted commits).

@liggitt
Member

liggitt commented May 10, 2019

Kubemark 5000 looks better once Revert "github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973" got in.

It looks better, but that is likely due to the revert restoring the bug that artificially reports lower latencies at small sample sizes. I expect #77735 (which fixes that reporting bug) will likely shift those reported latencies higher again. Note that it is not regressing actual latency, just making the reported latency more accurate at small sample sizes.
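To illustrate what "artificially lower at small sample sizes" means, here is a rough sketch using the beorn7/perks targeted quantile estimator (the library whose version is being discussed); the exact numbers depend on which perks revision is vendored:

package main

import (
	"fmt"
	"sort"

	"github.com/beorn7/perks/quantile"
)

func main() {
	// A handful of latency samples (in seconds), mimicking a metric with
	// very few observations, e.g. only a few LIST calls in a run.
	samples := []float64{0.12, 0.15, 0.18, 0.25, 1.8}

	// Targeted quantile stream, the same estimator client_golang summaries use.
	q := quantile.NewTargeted(map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001})
	for _, v := range samples {
		q.Insert(v)
	}

	// With only 5 samples, the exact p99 is effectively the maximum observation.
	sort.Float64s(samples)
	exact := samples[len(samples)-1]

	// Depending on which perks revision is vendored, the estimate below can
	// come out lower than the exact value - that is the reporting artifact
	// described above, not a change in real latency.
	fmt.Printf("estimated p99: %.3fs, exact p99: %.3fs\n", q.Query(0.99), exact)
}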

@wojtek-t
Member Author

It looks better, but that is likely due to the revert restoring the bug that artificially reports lower latencies at small sample sizes. I expect #77735 (which fixes that reporting bug) will likely shift those reported latencies higher again. Note that it is not regressing actual latency, just making the reported latency more accurate at small sample sizes.

@liggitt - as you expected, the revert increased it again:
http://perf-dash.k8s.io/#/?jobname=kubemark-5000Nodes&metriccategoryname=APIServer&metricname=DensityResponsiveness&Resource=pods&Scope=cluster&Subresource=&Verb=LIST

But that proves that this is not the cause of the regression - it was just the reporting bug.
So we should probably get rid of the pin to that old version completely.

[Sorry for the delay, I've been OOO since Wednesday]

Why is increased kube-proxy logging causing an OOM in the first place? And why is it continuously failing to restore? If it's -4/lock contention, then there are already many fixes in-place (some kernel-side) to handle that. I'm very curious.

@dcbw - it's probably because we are not setting resource limits for kube-proxy in any setup, so it may eat all the memory on the machine.
And it's NOT "continuously" failing to restore - just some attempts to restore are failing (though I don't have concrete numbers on the percentage).

@liggitt
Member

liggitt commented May 13, 2019

So we should probably get rid of the pin to that old version completely.

Agreed. Opened #77809 to restore the version prometheus expects to use.

mm4tt added a commit to mm4tt/test-infra that referenced this issue May 13, 2019
Ref. kubernetes/kubernetes#76579

As the regression is clearly visible in the density tests, we'll disable the long-running (~10h) load tests and run the density tests more frequently until we make them pass.
@wojtek-t
Member Author

This is probably what we should focus on going forward:
http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness&Resource=pods&Scope=cluster&Subresource=&Verb=LIST

The latency of "LIST pods (scope=cluster)" calls visibly increased. This may be related to the fact that the number of those calls also increased. Understanding where those calls are coming from may help us understand the problem. @mborsz @mm4tt
#77808 (comment) seems related
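One way to see where those cluster-scoped LIST pods calls come from (assuming apiserver audit logging is enabled for that job, which I haven't verified) would be to group the audit events by user agent; a rough sketch, with the log path as a placeholder:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// auditEvent picks out only the audit.k8s.io/v1 fields we need.
type auditEvent struct {
	Verb      string `json:"verb"`
	UserAgent string `json:"userAgent"`
	ObjectRef *struct {
		Resource  string `json:"resource"`
		Namespace string `json:"namespace"`
	} `json:"objectRef"`
}

func main() {
	// Placeholder path - where the audit log actually lives depends on the
	// job's apiserver audit configuration.
	f, err := os.Open("/var/log/kube-apiserver-audit.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long
	for sc.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue
		}
		// Cluster-scoped LIST pods: list verb, pods resource, empty namespace.
		if ev.Verb == "list" && ev.ObjectRef != nil &&
			ev.ObjectRef.Resource == "pods" && ev.ObjectRef.Namespace == "" {
			counts[ev.UserAgent]++
		}
	}
	for ua, n := range counts {
		fmt.Printf("%6d  %s\n", n, ua)
	}
}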

@mm4tt
Contributor

mm4tt commented May 16, 2019

Regarding kubernetes/test-infra#12686 and why we believe it may help with this regression (or rather, one of these regressions):

We've noticed an increase in memory used by coredns between runs 328 and 329
[Graph: coredns memory usage, runs 328-329]

This memory increase causes coredns to OOM (due to its 170MB memory limit), which in turn has a negative impact on API latency (coredns gets restarted and then re-lists all pods/endpoints/etc.).
We've noticed that between runs 328 and 330 the COS version changed on the nodes (nodes only, because we override it on the master) in #75149.

We ran two tests: one on exactly the same configuration as run 330, and another with #75149 reverted.
Below are the graphs, which show that the COS version upgrade is very likely related to the coredns memory increase.

[Graph: run 330]

[Graph: run 330 with #75149 reverted]

mm4tt added a commit to mm4tt/test-infra that referenced this issue May 20, 2019
We've ruled out prometheus as a cause of the regressions a long time ago. Actually, it helps a lot when debugging.

The positive side effect of this change will be a mitigation of the faster MIG startup problem; in the end it will add an O(a few minutes) delay before running the tests.

Ref. kubernetes/kubernetes#76579
mm4tt added a commit to mm4tt/test-infra that referenced this issue May 22, 2019
@wojtek-t
Member Author

I forked the CoreDNS problem to a separate issue: #78562

With that, this one can be closed.

chases2 pushed a commit to chases2/test-infra that referenced this issue Jun 4, 2019
This is essentially reverting kubernetes/kubernetes#77224 in our gce 5000 performance tests.

We believe it may be related to kubernetes/kubernetes#76579
@answer1991
Contributor

/cc
