
API call latency regression #76579

Open
wojtek-t opened this Issue Apr 15, 2019 · 9 comments

wojtek-t (Member) commented Apr 15, 2019

There is yet another regression in API-call latencies in the 5k-node test:
https://testgrid.k8s.io/sig-scalability-gce#gce-scale-performance

The last 3 runs of the load test failed there.

I haven't had time yet to look into it more deeply, but the last run in particular seems suspicious, due to this one:

Resource:nodes Subresource: Verb:POST Scope:cluster Latency:{Perc50:4.749ms Perc90:1.807403s Perc99:1.807403s} Count:16}

It seems some nodes were added during the test, which isn't expected.
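
For context, the values above are Go-style durations, so the Perc90/Perc99 values of ~1.8s are what stand out next to the 4.7ms Perc50. Below is a minimal sketch of parsing such a summary line and flagging high percentiles; the 1s threshold is only an illustrative assumption, not the exact SLO check performed by the scalability test framework.

```python
import re

# Illustrative threshold only (assumption); the real test framework applies
# its own API-call latency SLO checks.
THRESHOLD_SECONDS = 1.0

LINE = ("Resource:nodes Subresource: Verb:POST Scope:cluster "
        "Latency:{Perc50:4.749ms Perc90:1.807403s Perc99:1.807403s} Count:16")

def to_seconds(value: str) -> float:
    """Convert a Go-style duration such as '4.749ms' or '1.807403s' to seconds."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value.rstrip("s"))

# Pull out the PercNN:duration pairs and compare each against the threshold.
percentiles = dict(re.findall(r"(Perc\d+):([\d.]+m?s)", LINE))
for name, raw in sorted(percentiles.items()):
    seconds = to_seconds(raw)
    status = "OVER threshold" if seconds > THRESHOLD_SECONDS else "ok"
    print(f"{name}: {seconds:.3f}s ({status})")
```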

@kubernetes/sig-scalability-bugs

wojtek-t (Member, Author) commented Apr 15, 2019

/assign @oxddr

k8s-ci-robot (Contributor) commented Apr 15, 2019

@wojtek-t: GitHub didn't allow me to assign the following users: oxddr.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @oxddr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

oxddr (Contributor) commented Apr 15, 2019

I'll have a look.

wojtek-t (Member, Author) commented Apr 16, 2019

I took a quick look into this one and checked the PRs merged between the runs where it started failing:
f873d2a...aa74064
But I didn't find anything really suspicious.

So I looked a bit into the graphs, and my current hypothesis is that the regression actually happened earlier. This is based on the following graphs in perf-dash:

[Two perf-dash screenshots taken 2019-04-16]

Both tend to suggest that the regression happened one run earlier (and we just got lucky in the first attempt after the regression).
Unfortunately, there was a hole in the runs at this point, so there are 93 PRs merged in that period:
51db0fe...f873d2a
and I haven't yet had time to look into those...
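
As a side note, one quick way to enumerate the merges in a range like this is to walk the first-parent history between the two SHAs. The sketch below is illustrative only and assumes a local checkout of kubernetes/kubernetes with both commits fetched; it is not what was actually run for this investigation.

```python
import subprocess

# SHAs taken from the range quoted above.
GOOD, BAD = "51db0fe", "f873d2a"

# --merges with --first-parent on master prints one line per merged PR.
log = subprocess.run(
    ["git", "log", "--merges", "--first-parent", "--oneline", f"{GOOD}..{BAD}"],
    capture_output=True, text=True, check=True,
).stdout

merges = log.splitlines()
print(f"{len(merges)} merge commits between {GOOD} and {BAD}")
for line in merges:
    print(line)
```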

wojtek-t (Member, Author) commented Apr 17, 2019

I took a look at those 93 PRs merged in the period mentioned above. There are exactly 6 whose impact I couldn't exclude immediately (though in most cases I highly doubt they caused it):
[I highly doubt] #74877
[in theory probable] #75389
[I doubt, but...] #76065
[I doubt] #75967
[I doubt] #76211
[only if there would be many panics] #75853

oxddr (Contributor) commented Apr 18, 2019

The last two runs have also failed the density tests.

Interestingly, the 5000-node kubemark tests are passing just fine. I'll have a look at the PRs listed above and try to narrow them down.

oxddr (Contributor) commented Apr 18, 2019

At this point we are still not sure what the last good commit is. Assuming that the regression happened before f873d2a, and that the load test can occasionally get lucky and pass, the last good run may go back as far as f6c51d6.

Next steps:

  1. Re-run the load test to check whether reverting the commits from #76579 (comment) helps (see the sketch after this list)
  2. Check the f6c51d6...51db0fe range for other obvious culprits
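
For step 1, a minimal sketch of the kind of local revert that would precede rebuilding and re-running the test. The merge-commit SHAs are passed in as arguments because the actual SHAs of the suspect PRs are not reproduced here; this is an illustration, not the exact procedure used.

```python
import subprocess
import sys

# Usage: python revert_suspects.py <merge-sha> [<merge-sha> ...]
# Each argument should be the merge commit of one suspect PR.
for sha in sys.argv[1:]:
    # -m 1 reverts a merge commit relative to its first (master) parent.
    subprocess.run(["git", "revert", "--no-edit", "-m", "1", sha], check=True)
```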

oxddr (Contributor) commented Apr 19, 2019

The first re-run wasn't successful: the master became unreachable, causing the test to crash. I am going to run it again.

oxddr (Contributor) commented Apr 20, 2019

Reverting the commits from #76579 (comment) didn't help and the load test failed again. I'll have a look at earlier commits then.
