
gce-scale-performance has failed #74419

Closed
mborsz opened this Issue Feb 22, 2019 · 12 comments

Comments

mborsz (Member) commented Feb 22, 2019

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/314 failed with

[measurement call APIResponsiveness - APIResponsiveness error: top latency metric: there should be no high-latency requests, but: [got: {Resource:secrets Subresource: Verb:GET Scope:namespace Latency:{Perc50:1.927ms Perc90:74.256ms Perc99:1.038866s} Count:116}; expected perc99 <= 1s]]
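
For context on what this check means: the APIResponsiveness measurement aggregates per-(resource, verb, scope) request latencies from apiserver metrics and requires perc99 <= 1s. A minimal sketch of that kind of percentile check (illustrative Go, not the actual clusterloader code; the sample values are made up):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the q-quantile (0 < q <= 1) of a sorted sample using the
// nearest-rank method; with 116 samples, q=0.99 selects the 115th value,
// i.e. the second-slowest request.
func percentile(sorted []time.Duration, q float64) time.Duration {
	idx := int(math.Ceil(float64(len(sorted))*q)) - 1
	return sorted[idx]
}

func main() {
	// Illustrative latencies for GET secrets calls (the real run had 116 samples).
	latencies := []time.Duration{
		2 * time.Millisecond, 3 * time.Millisecond, 74 * time.Millisecond,
		80 * time.Millisecond, 1039 * time.Millisecond,
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })

	if p99 := percentile(latencies, 0.99); p99 > time.Second {
		fmt.Printf("there should be no high-latency requests, but: perc99=%v > 1s\n", p99)
	}
}
```
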
mborsz (Member, Author) commented Feb 22, 2019

/assign
/sig scalability

mborsz (Member, Author) commented Feb 22, 2019

/cc @wojtek-t @mm4tt

I see e.g.

I0221 19:25:56.173035       1 trace.go:81] Trace[984401692]: "Get /api/v1/namespaces/kube-system/secrets/kubernetes-dashboard-key-holder" (started: 2019-02-21 19:25:55.16744962 +0000 UTC m=+40274.455376109) (total time: 1.005565832s):
Trace[984401692]: [1.005225643s] [1.005214761s] About to write a response

I see several "About to write a response" traces around that time: https://gist.github.com/mborsz/6810900c941ea85a0e51f0cd169088a2

wojtek-t (Member) commented Feb 22, 2019

"About to write a response" is misleading - we should change that step name - it usually is the time to get the object from storage. Are those requests served from etcd or from the watch cache?
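
For readers hitting these traces for the first time: the apiserver emits them via k8s.io/utils/trace, and the "About to write a response" step is recorded right after the object comes back from storage. A simplified sketch of the pattern (handler details and the logging threshold are illustrative, not the exact apiserver code):

```go
// Simplified sketch of how a GET handler records trace steps with k8s.io/utils/trace.
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

func handleGet(fetchFromStorage func() (interface{}, error)) {
	trace := utiltrace.New("Get /api/v1/namespaces/kube-system/secrets/kubernetes-dashboard-key-holder")
	// The trace is only printed if total time exceeds the threshold, which is
	// why these entries show up when a request takes ~1s.
	defer trace.LogIfLong(500 * time.Millisecond)

	obj, err := fetchFromStorage() // read from the watch cache or etcd
	_, _ = obj, err

	// The step is recorded *after* the storage read, so almost all of the time
	// attributed to "About to write a response" was spent fetching the object,
	// not writing the response.
	trace.Step("About to write a response")

	// ... serialize and write the response ...
}

func main() {
	handleGet(func() (interface{}, error) {
		time.Sleep(10 * time.Millisecond) // stand-in for the storage read
		return nil, nil
	})
}
```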

mborsz (Member, Author) commented Feb 22, 2019

Sorry, I haven't updated this issue. I already found that etcd.log contains plenty of entries like:

2019-02-21 19:25:56.131868 W | etcdserver: read-only range request "key:\"/registry/pods/test-h3o2gk-37/small-rc-4-55wmf\" " with result "range_response_count:1 size:1511" took too long (231.933091ms) to execute
2019-02-21 19:25:56.131896 W | etcdserver: read-only range request "key:\"/registry/leases/kube-node-lease/gce-scale-cluster-minion-group-4-mjrv\" " with result "range_response_count:1 size:410" took too long (926.689615ms) to execute
2019-02-21 19:25:56.131909 W | etcdserver: read-only range request "key:\"/registry/leases/kube-node-lease/gce-scale-cluster-minion-group-1-jw0q\" " with result "range_response_count:1 size:410" took too long (232.211928ms) to execute
2019-02-21 19:25:56.131961 W | etcdserver: read-only range request "key:\"/registry/pods/test-h3o2gk-30/small-rc-83-d6w7k\" " with result "range_response_count:1 size:1522" took too long (232.053797ms) to execute

around that time.

On Monday I will continue debugging.

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/314/artifacts/gce-scale-cluster-master/etcd.log
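
To quantify how concentrated these slow requests are, one can bucket the etcd warnings by minute; a small sketch, assuming the log format shown in the excerpts above:

```go
// Sketch: count etcd "took too long" warnings per minute from etcd.log.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
	"time"
)

var slowRe = regexp.MustCompile(`^(\S+ \S+) W \| etcdserver: .*took too long \(([^)]+)\) to execute`)

func main() {
	f, err := os.Open("etcd.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // range request lines can be long
	for scanner.Scan() {
		m := slowRe.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		// m[2] is e.g. "231.933091ms" or "926.689615ms"; ParseDuration handles both.
		if d, err := time.ParseDuration(m[2]); err == nil && d > 200*time.Millisecond {
			counts[m[1][:16]]++ // bucket by minute, e.g. "2019-02-21 19:25"
		}
	}

	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s  %d slow range requests\n", k, counts[k])
	}
}
```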

spiffxp (Member) commented Feb 24, 2019

/kind failing-test
/priority important-soon
/milestone v1.14
We're entering burndown and this looks relevant to the 1.14 release. Please /milestone clear if I am incorrect

wojtek-t (Member) commented Feb 25, 2019

It seems these requests were served from etcd, not from the cache.
Also, were there any long LIST requests or something like that happening around that time?

mborsz (Member, Author) commented Feb 25, 2019

I see that this period of slowness starts at 19:25:27 and ends at 19:26:11.
This correlates quite well with the rotation of kube-apiserver-audit.log-20190221-1550777117.gz (1550777117 is 19:25:17 UTC, and the first entry in the new kube-apiserver-audit.log is at 19:26:11).
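
(The numeric suffix of the rotated file name is a Unix timestamp; a quick check, e.g. in Go:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Suffix of kube-apiserver-audit.log-20190221-1550777117.gz, read as a Unix timestamp.
	fmt.Println(time.Unix(1550777117, 0).UTC()) // 2019-02-21 19:25:17 +0000 UTC
}
```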

This graph shows average request latency (skipping watch and cache-based ones) over time (timestamps in CEST tz):
[graph: average request latency over time]

We can clearly see a roughly hourly pattern -- I bet it has been there for a long time, but this time we happened to hit it with a request type whose count is low enough that perc99 goes high (with only 116 GET secrets calls, perc99 is essentially the second-slowest request).

spiffxp added this to New in 1.14 CI Signal Feb 28, 2019

mborsz (Member, Author) commented Mar 1, 2019

I managed to verify that all the spikes in the above graph are caused by IO load generated by logrotate.
In theory etcd and logrotate use separate PDs, so they should be well isolated; in practice we know they can interfere with each other at the OS level.
I also confirmed that scalability run 315 has a similar latency pattern.

As an experiment, I changed the master's COS version to cos-69-10895-138-0 (see kubernetes/test-infra/pull/11540), which includes some IO isolation fixes (mostly enabling the kernel's CONFIG_BLK_WBT_MQ=y, i.e. writeback throttling for multi-queue block devices).
Scalability run 319 used that version; the latency graph is here:
[graph: average request latency over time, run 319]

To me it looks better; at least there is no hourly pattern caused by logrotate. There is still one fairly big spike, but it may not be IO-related.

I'm planning to wait for 1-2 more runs with that version to confirm that this is a deterministic improvement, and then propose using the new COS version globally.
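
To double-check that a given COS image actually carries the fix, one can inspect the kernel config on the node; a small sketch, assuming the node exposes /proc/config.gz (i.e. the kernel was built with CONFIG_IKCONFIG_PROC):

```go
// Sketch: look for CONFIG_BLK_WBT settings by reading /proc/config.gz on the node.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/config.gz")
	if err != nil {
		panic(err) // not all kernels expose this file
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	scanner := bufio.NewScanner(zr)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "CONFIG_BLK_WBT") {
			fmt.Println(line) // expect CONFIG_BLK_WBT_MQ=y on the fixed image
		}
	}
}
```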

alejandrox1 commented Mar 6, 2019

/remove-kind failing-test

alejandrox1 moved this from New to Flakes in 1.14 CI Signal Mar 6, 2019

spiffxp (Member) commented Mar 18, 2019

k8s-ci-robot (Contributor) commented Mar 18, 2019

@spiffxp: Closing this issue.

In response to this:

/close
closing as resolved
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance
[screenshot of the testgrid dashboard]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1.14 CI Signal automation moved this from Flakes to Open PR-wait for >5 successes before "Resolved" Mar 18, 2019

spiffxp moved this from Open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 11) in 1.14 CI Signal Mar 18, 2019

spiffxp moved this from Resolved (week Mar 11) to Resolved (week Mar 18) in 1.14 CI Signal Mar 18, 2019
