
Rare 10-second timeouts from ResourceQuota admission controller #63608

Closed
MikeSpreitzer opened this issue May 9, 2018 · 25 comments · Fixed by #64598
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@MikeSpreitzer
Member

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I wrote a simple driver program (https://github.com/MikeSpreitzer/k8api-scaletest/blob/2f5a8414b5da4059bb2a2b33fcdb88f40ca2e07b/cmdriverclosed/main.go) that creates ConfigMap objects as fast as it can, and ran it like this: cmdriverclosed --kubeconfig $myconfig -n 30103 -threads 10 -conns 10. In the CSV file produced, two of the creates recorded an error that simply says "timeout":

...
2018-05-09 15:22:14.289,2018-05-09 15:22:14.299,create,"0509.1022.0797-30103",""
2018-05-09 15:22:11.278,2018-05-09 15:22:21.291,create,"0509.1022.0797-13420","timeout"
2018-05-09 15:22:12.419,2018-05-09 15:22:22.426,create,"0509.1022.0797-19770","timeout"
2018-05-09 15:22:22.426,2018-05-09 15:22:22.431,update,"0509.1022.0797-00012",""
...
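(For context, the core of that driver is just a loop of timed client-go creates. The sketch below only illustrates the pattern and is not the actual driver code; the kubeconfig path, object names, and loop bounds are made up.)

```go
// Illustrative sketch only: time each ConfigMap create and record the
// result, roughly the shape of what cmdriverclosed does. Uses the
// client-go of that era (Create takes no context argument).
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	for i := 0; i < 100; i++ {
		name := fmt.Sprintf("cm-%06d", i)
		cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: name}}
		start := time.Now()
		_, err := clientset.CoreV1().ConfigMaps("scaletest").Create(cm)
		// One CSV row per operation: pre-op time, post-op time, op, name, error.
		fmt.Printf("%s,%s,create,%q,%v\n",
			start.Format(time.RFC3339Nano), time.Now().Format(time.RFC3339Nano), name, err)
	}
}
```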

Here is an excerpt from the api-server's log:

...
W0508 03:28:10.651586       1 watcher.go:342] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
W0508 03:28:10.651615       1 watcher.go:342] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
W0508 03:28:10.651982       1 watcher.go:235] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow dispatching events to watchers
E0509 15:22:21.291010       1 status.go:64] apiserver received an error that is not an metav1.Status: timeout
I0509 15:22:21.291138       1 trace.go:76] Trace[1365056476]: "Create /api/v1/namespaces/scaletest/configmaps" (started: 2018-05-09 15:22:11.29074348 +0000 UTC m=+137535.870501480) (total time: 10.00038031s):
Trace[1365056476]: [10.00038031s] [10.000241017s] END
E0509 15:22:22.426011       1 status.go:64] apiserver received an error that is not an metav1.Status: timeout
I0509 15:22:22.426144       1 trace.go:76] Trace[1711753668]: "Create /api/v1/namespaces/scaletest/configmaps" (started: 2018-05-09 15:22:12.425778861 +0000 UTC m=+137537.005536868) (total time: 10.000346361s):
Trace[1711753668]: [10.000346361s] [10.000263503s] END
W0509 15:22:29.433234       1 watcher.go:342] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
W0509 15:22:29.434199       1 watcher.go:342] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
...

A few minutes later I checked whether my etcd cluster was healthy; it was:

root@mjs-etcd-dal10-d:/etc/kubernetes/pki/etcd# etcdctl --endpoints=https://10.171.126.244:2379,https://10.171.126.251:2379,https://10.171.126.221:2379 --cacert ca.pem --cert client.pem --key client-key.pem endpoint health
https://10.171.126.221:2379 is healthy: successfully committed proposal: took = 975.185µs
https://10.171.126.251:2379 is healthy: successfully committed proposal: took = 595.302µs
https://10.171.126.244:2379 is healthy: successfully committed proposal: took = 842.223µs

I also looked at systemctl status etcd on each etcd host, and each reported that etcd had been running for over a day.

What you expected to happen:
I expected no such problem, or at least a more informative error message.

How to reproduce it (as minimally and precisely as possible):
That test driver is pretty minimal.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
root@mjs-api-dal10-b:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: SoftLayer bare metal machines, SSD storage on the etcd servers
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS
  • Kernel (e.g. uname -a): Linux mjs-api-dal10-b 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm driven by an Ansible playbook I wrote, based on https://kubernetes.io/docs/setup/independent/high-availability/
  • Others:
k8s-ci-robot added the needs-sig and kind/bug labels May 9, 2018
@MikeSpreitzer
Member Author

@kubernetes/sig-api-machinery-bugs

k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label May 9, 2018
@lavalamp
Member

lavalamp commented May 9, 2018

That's weird; I'm not aware of a 10s timeout anywhere (there is a 60s one). @jpbetz does the etcd client have such a timeout?

@MikeSpreitzer
Member Author

FYI:

root@mjs-etcd-dal10-d:/etc/kubernetes/pki/etcd# etcd --version
etcd Version: 3.2.19
Git SHA: 8a9b3d538
Go Version: go1.8.7
Go OS/Arch: linux/amd64

@dims
Member

dims commented May 10, 2018

@MikeSpreitzer - can you try etcd without SSL/TLS? it could be TLSHandshakeTimeout ( https://k8s-code.appspot.com/?q=TLS.*10%20%5C*%20time.Second&i=nope&files=vendor%2F.*&repos=kubernetes )
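(For reference, Go's net/http DefaultTransport ships with a 10-second TLS handshake timeout, which is the value that code search points at; a one-liner to confirm:)

```go
// Print the TLS handshake timeout baked into Go's default HTTP transport.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	t := http.DefaultTransport.(*http.Transport)
	fmt.Println(t.TLSHandshakeTimeout) // 10s
}
```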

@MikeSpreitzer
Member Author

Yes, I will try that, the next time it is convenient to tear everything down and start over.

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 23, 2018

OK, I tried again with etcd NOT using TLS (neither client-to-server nor server-to-server). I still got a "timeout" error. The driver was at commit f3a4391de25867b2a3e6de184fe7577f600d00c0.

root@mjstest1:~# etcdctl --endpoints http://10.171.126.244:2379 member list
13d980b5a8bfef3d, started, m10_171_126_244, http://10.171.126.244:2380, http://10.171.126.244:2379
1fc5efcc30712f82, started, m10_171_126_251, http://10.171.126.251:2380, http://10.171.126.251:2379
efd0242e221f2882, started, m10_171_126_221, http://10.171.126.221:2380, http://10.171.126.221:2379
root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 5000 --maxpop 108000 --n 108000 --threads 10 --conns 1
RunID is 0523.0849.7752
Wrote parameter file "0523.0849.7752-driver.parms"
DEBUG: Creating 108000 objects
DEBUG: maxpop = 108000
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 10
DEBUG: RATE = 5000/sec
DEBUG: op period = 0.0002 sec
I0523 08:51:34.581682   22461 main.go:174] 108000 object lifecycles in 105.379905731 seconds = 3074.5899586118885 writes/sec, with 1 errors on create, 1 on update, and 1 on delete
root@mjstest1:~# egrep '[^"]"$' 0523.0849.7752-driver.csv 
2018-05-23 13:49:50.232,2018-05-23 13:50:00.234,create,"0523.0849.7752-000053","timeout"
2018-05-23 13:50:13.226,2018-05-23 13:50:13.228,update,"0523.0849.7752-000053","configmaps \"0523.0849.7752-000053\" not found"
2018-05-23 13:50:39.574,2018-05-23 13:50:39.576,delete,"0523.0849.7752-000053","configmaps \"0523.0849.7752-000053\" not found"
root@mjstest1:~# 

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 23, 2018

I am running kube release 1.10.3 now. I set --v=5 on the api-server that the client drives, and ran another trial. Here are the errors logged by the driver (timestamps in UTC):

root@mjstest1:~# egrep '[^"]"$' 0523.1019.0374-driver.csv 
2018-05-23 15:19:09.085,2018-05-23 15:19:19.087,create,"0523.1019.0374-005642","timeout"
2018-05-23 15:19:20.655,2018-05-23 15:19:30.657,create,"0523.1019.0374-055890","timeout"
2018-05-23 15:19:33.863,2018-05-23 15:19:33.865,update,"0523.1019.0374-005642","configmaps \"0523.1019.0374-005642\" not found"
2018-05-23 15:19:58.729,2018-05-23 15:19:58.731,update,"0523.1019.0374-055890","configmaps \"0523.1019.0374-055890\" not found"
2018-05-23 15:20:02.549,2018-05-23 15:20:02.551,delete,"0523.1019.0374-005642","configmaps \"0523.1019.0374-005642\" not found"
2018-05-23 15:20:34.757,2018-05-23 15:20:34.759,delete,"0523.1019.0374-055890","configmaps \"0523.1019.0374-055890\" not found"

I found this in the stderr of the api-server (the timestamp seems to be in UTC):

I0523 15:19:30.656749       1 handler.go:149] kube-apiserver: POST "/api/v1/namespaces/scaletest/configmaps" satisfied by gorestful with webservice /api/v1
E0523 15:19:30.656771       1 status.go:64] apiserver received an error that is not an metav1.Status: timeout
I0523 15:19:30.656967       1 wrap.go:42] PATCH /api/v1/namespaces/scaletest/configmaps/0523.1019.0374-021726: (1.623733ms) 200 [[scaletest driver] 10.171.126.224:48078]
I0523 15:19:30.657117       1 trace.go:76] Trace[1407519722]: "Create /api/v1/namespaces/scaletest/configmaps" (started: 2018-05-23 15:19:20.656435305 +0000 UTC m=+57.791736592) (total time: 10.000652697s):
Trace[1407519722]: [31.809µs] [31.809µs] About to convert to expected version
Trace[1407519722]: [67.361µs] [35.552µs] Conversion done
Trace[1407519722]: [92.563µs] [25.202µs] About to store object in database
Trace[1407519722]: [10.000652697s] [10.000560134s] END
I0523 15:19:30.657141       1 handler.go:159] kube-aggregator: POST "/api/v1/namespaces/scaletest/configmaps" satisfied by nonGoRestful
I0523 15:19:30.657156       1 pathrecorder.go:247] kube-aggregator: "/api/v1/namespaces/scaletest/configmaps" satisfied by prefix /api/
I0523 15:19:30.657164       1 handler.go:149] kube-apiserver: POST "/api/v1/namespaces/scaletest/configmaps" satisfied by gorestful with webservice /api/v1
I0523 15:19:30.657208       1 wrap.go:42] POST /api/v1/namespaces/scaletest/configmaps: (10.001350708s) 500
goroutine 369958 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc426b37b90, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc426b37b90, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc4361b3280, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:188 +0xac
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics.(*ResponseWriterDelegator).WriteHeader(0xc43563ea50, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:281 +0x45
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.SerializeObject(0x3d39e37, 0x10, 0x7ff0fdc89490, 0xc42aa77600, 0x988ae80, 0xc423660858, 0xc42eb79800, 0x1f4, 0x9871c40, 0xc425bbf950)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:95 +0x8d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.WriteObjectNegotiated(0x7ff0fdae7080, 0xc4382d4cf0, 0x988d940, 0xc420da8840, 0x0, 0x0, 0x3d209f1, 0x2, 0x988ae80, 0xc423660858, ...)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:123 +0x32b
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.ErrorNegotiated(0x7ff0fdae7080, 0xc4382d4cf0, 0x9854b40, 0xc426b2a2b0, 0x988d940, 0xc420da8840, 0x0, 0x0, 0x3d209f1, 0x2, ...)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:142 +0x163
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*RequestScope).err(0xc427af5680, 0x9854b40, 0xc426b2a2b0, 0x988ae80, 0xc423660858, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/rest.go:67 +0x10c
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.createHandler.func1(0x988ae80, 0xc423660858, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go:120 +0x1409
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints.restfulCreateResource.func1(0xc43563e930, 0xc431fef4a0)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/installer.go:1027 +0xd5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics.InstrumentRouteFunc.func1(0xc43563e930, 0xc431fef4a0)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:199 +0x208
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc420b341b0, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:277 +0xb18
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).Dispatch(0xc420b341b0, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:199 +0x57
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x3d34530, 0xe, 0xc420b341b0, 0xc4203cc230, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:152 +0x4e0
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*director).ServeHTTP(0xc420b22840, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        <autogenerated>:1 +0x75
k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver.(*proxyHandler).ServeHTTP(0xc42f825c00, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go:93 +0x18a
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*pathHandler).ServeHTTP(0xc42c2bae00, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:248 +0x26d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).ServeHTTP(0xc423722c40, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:234 +0xa1
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x3d372f1, 0xf, 0xc4230555f0, 0xc423722c40, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:160 +0x6ad
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*director).ServeHTTP(0xc4269cd480, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        <autogenerated>:1 +0x75
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:52 +0x37d
net/http.HandlerFunc.ServeHTTP(0xc421d917c0, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:165 +0x42a
net/http.HandlerFunc.ServeHTTP(0xc426d36940, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:49 +0x203a
net/http.HandlerFunc.ServeHTTP(0xc421d91810, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:78 +0x2b1
net/http.HandlerFunc.ServeHTTP(0xc421d91860, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/request.WithRequestContext.func1(0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/request/requestcontext.go:110 +0xcb
net/http.HandlerFunc.ServeHTTP(0xc4269cd4a0, 0x7ff0fd465050, 0xc423660848, 0xc42eb79800)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc4269cd520, 0x988f800, 0xc423660848, 0xc42eb79800, 0xc42366d9e0)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:93 +0x8d
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:92 +0x1ab

logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"timeout\",\"code\":500}\n"
 [[scaletest driver] 10.171.126.224:48078]

@MikeSpreitzer
Member Author

The etcd cluster should have no trouble with that write rate.

root@mjstest1:~# ./benchmark --endpoints=http://10.171.126.244:2379,http://10.171.126.251:2379,http://10.171.126.221:2379 --target-leader --conns=100 --clients=1000 put --key-size=8 --sequential-keys --total=10000 --val-size=256
 10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%0s

Summary:
  Total:	0.2218 secs.
  Slowest:	0.0456 secs.
  Fastest:	0.0010 secs.
  Average:	0.0207 secs.
  Stddev:	0.0068 secs.
  Requests/sec:	45086.4350

Response time histogram:
  0.0010 [1]	|
  0.0055 [38]	|
  0.0099 [127]	|∎
  0.0144 [767]	|∎∎∎∎∎∎∎∎
  0.0189 [3754]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0233 [3360]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0278 [694]	|∎∎∎∎∎∎∎
  0.0322 [177]	|∎
  0.0367 [609]	|∎∎∎∎∎∎
  0.0411 [371]	|∎∎∎
  0.0456 [102]	|∎
...

@MikeSpreitzer
Member Author

Around that time, the etcd server being loaded by the kube api-server logged the following (timestamps in US CDT):

May 23 10:18:13 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:22 mjs-etcd-dal10-d etcd[24164]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
May 23 10:18:23 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:18:33 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:18:43 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:18:53 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:03 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:13 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:23 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:32 mjs-etcd-dal10-d etcd[24164]: start to snapshot (applied: 1800021, lastsnap: 1700020)
May 23 10:19:32 mjs-etcd-dal10-d etcd[24164]: saved snapshot at index 1800021
May 23 10:19:32 mjs-etcd-dal10-d etcd[24164]: compacted raft log at 1795021
May 23 10:19:33 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:37 mjs-etcd-dal10-d etcd[24164]: segmented wal file /var/lib/etcd/member/wal/000000000000000e-00000000001bc388.wal is created
May 23 10:19:43 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:19:47 mjs-etcd-dal10-d etcd[24164]: purged file /var/lib/etcd/member/snap/0000000000000002-000000000013d62e.snap successfully
May 23 10:19:47 mjs-etcd-dal10-d etcd[24164]: purged file /var/lib/etcd/member/wal/0000000000000009-00000000000f852d.wal successfully
May 23 10:19:53 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:00 mjs-etcd-dal10-d etcd[24164]: start to snapshot (applied: 1900022, lastsnap: 1800021)
May 23 10:20:00 mjs-etcd-dal10-d etcd[24164]: saved snapshot at index 1900022
May 23 10:20:00 mjs-etcd-dal10-d etcd[24164]: compacted raft log at 1895022
May 23 10:20:03 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:13 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:17 mjs-etcd-dal10-d etcd[24164]: purged file /var/lib/etcd/member/snap/0000000000000002-0000000000155cd0.snap successfully
May 23 10:20:23 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:33 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:36 mjs-etcd-dal10-d etcd[24164]: start to snapshot (applied: 2000023, lastsnap: 1900022)
May 23 10:20:36 mjs-etcd-dal10-d etcd[24164]: saved snapshot at index 2000023
May 23 10:20:36 mjs-etcd-dal10-d etcd[24164]: compacted raft log at 1995023
May 23 10:20:36 mjs-etcd-dal10-d etcd[24164]: segmented wal file /var/lib/etcd/member/wal/000000000000000f-00000000001e8951.wal is created
May 23 10:20:43 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628
May 23 10:20:47 mjs-etcd-dal10-d etcd[24164]: purged file /var/lib/etcd/member/snap/0000000000000002-000000000016e371.snap successfully
May 23 10:20:47 mjs-etcd-dal10-d etcd[24164]: purged file /var/lib/etcd/member/wal/000000000000000a-000000000011b6e4.wal successfully
May 23 10:20:53 mjs-etcd-dal10-d etcd[24164]: [GET] /metrics remote:10.171.126.216:54628

@dims
Member

dims commented May 23, 2018

@MikeSpreitzer
Member Author

root@mjs-api-dal10s-a:~# kubectl exec -it kube-apiserver-mjs-api-dal10s-a -n kube-system /bin/sh
/ # ulimit -n
1048576

@MikeSpreitzer
Member Author

I added --request-timeout=47s to the loaded api-server's command line, and still got 10s timeouts.

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 5000 --maxpop 108000 --n 108000 --threads 10 --conns 1
RunID is 0523.1224.6596
Wrote parameter file "0523.1224.6596-driver.parms"
DEBUG: Creating 108000 objects
DEBUG: maxpop = 108000
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 10
DEBUG: RATE = 5000/sec
DEBUG: op period = 0.0002 sec
I0523 12:26:03.766916   23527 main.go:174] 108000 object lifecycles in 95.492694472 seconds = 3392.929708303519 writes/sec, with 2 errors on create, 2 on update, and 2 on delete
root@mjstest1:~# 
root@mjstest1:~# 
root@mjstest1:~# egrep '[^"]"$' 0523.1224.6596-driver.csv  
2018-05-23 17:24:30.051,2018-05-23 17:24:40.053,create,"0523.1224.6596-007672","timeout"
2018-05-23 17:24:38.451,2018-05-23 17:24:48.452,create,"0523.1224.6596-045428","timeout"
...

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 23, 2018

The doc string for --min-request-timeout seems to say it is irrelevant for create operations.

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 23, 2018

However, making the client specify a timeout of 17 sec (see https://github.com/MikeSpreitzer/k8api-scaletest/tree/45958532ca0b661a567ed763362078d3722837f8/cmdriverclosed) warded off the problem --- according to my first and only trial so far.

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 5000 --maxpop 108000 --n 108000 --threads 10 --conns 1
RunID is 0523.1358.6776
Wrote parameter file "0523.1358.6776-driver.parms"
DEBUG: Creating 108000 objects
DEBUG: maxpop = 108000
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 10
DEBUG: RATE = 5000/sec
DEBUG: op period = 0.0002 sec
I0523 13:59:29.695988   23929 main.go:176] 108000 object lifecycles in 88.015452218 seconds = 3681.171792397369 writes/sec, with 0 errors on create, 0 on update, and 0 on delete

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 23, 2018

However, my second trial was much more interesting. The api-server logged an error on create --- but the client did not! And the subsequent update and delete operations on the object in question got 404s.

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 5000 --maxpop 108000 --n 216000 --threads 10 --conns 1
RunID is 0523.1359.5442
Wrote parameter file "0523.1359.5442-driver.parms"
DEBUG: Creating 216000 objects
DEBUG: maxpop = 108000
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 10
DEBUG: RATE = 5000/sec
DEBUG: op period = 0.0002 sec
I0523 14:02:45.827714   23994 main.go:176] 216000 object lifecycles in 185.222353026 seconds = 3498.497829303783 writes/sec, with 0 errors on create, 1 on update, and 1 on delete

Here is what the driver logged about that object:

root@mjstest1:~# grep 010561 0523.1359.5442-driver.csv
2018-05-23 18:59:43.082,2018-05-23 18:59:53.084,create,"0523.1359.5442-010561",""
2018-05-23 19:00:08.392,2018-05-23 19:00:08.394,update,"0523.1359.5442-010561","configmaps \"0523.1359.5442-010561\" not found"
2018-05-23 19:00:37.992,2018-05-23 19:00:37.993,delete,"0523.1359.5442-010561","configmaps \"0523.1359.5442-010561\" not found"

Here is what the api-server logged around that time:

I0523 18:59:53.086820       1 wrap.go:42] POST /api/v1/namespaces/scaletest/configmaps?timeoutSeconds=17: (1.615446ms) 201 [[scaletest driver] 10.171.126.224:49180]
E0523 18:59:53.086835       1 status.go:64] apiserver received an error that is not an metav1.Status: timeout
I0523 18:59:53.086842       1 wrap.go:42] POST /api/v1/namespaces/scaletest/configmaps?timeoutSeconds=17: (1.478159ms) 201 [[scaletest driver] 10.171.126.224:49180]
I0523 18:59:53.086822       1 handler.go:149] kube-apiserver: PATCH "/api/v1/namespaces/scaletest/configmaps/0523.1359.5442-000319" satisfied by gorestful with webservice /api/v1
I0523 18:59:53.087067       1 trace.go:76] Trace[2039336565]: "Create /api/v1/namespaces/scaletest/configmaps" (started: 2018-05-23 18:59:43.086505855 +0000 UTC m=+5750.297168971) (total time: 10.000526428s):
Trace[2039336565]: [75.228µs] [75.228µs] About to convert to expected version
Trace[2039336565]: [108.199µs] [32.971µs] Conversion done
Trace[2039336565]: [136.663µs] [28.464µs] About to store object in database
Trace[2039336565]: [10.000526428s] [10.000389765s] END
I0523 18:59:53.087077       1 wrap.go:42] POST /api/v1/namespaces/scaletest/configmaps?timeoutSeconds=17: (1.490848ms) 201 [[scaletest driver] 10.171.126.224:49180]
I0523 18:59:53.087160       1 wrap.go:42] POST /api/v1/namespaces/scaletest/configmaps?timeoutSeconds=17: (10.001134203s) 500
goroutine 5124152 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc4486cf420, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc4486cf420, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc424b08e60, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:188 +0xac
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics.(*ResponseWriterDelegator).WriteHeader(0xc4415e40f0, 0x1f4)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:281 +0x45
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.SerializeObject(0x3d39e37, 0x10, 0x7f435d6d16e0, 0xc43f771f80, 0x988ae80, 0xc4373660d8, 0xc42e1d6400, 0x1f4, 0x9871c40, 0xc4353ea360)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:95 +0x8d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.WriteObjectNegotiated(0x7f435d6d40e0, 0xc424d1e120, 0x988d940, 0xc42082bb60, 0x0, 0x0, 0x3d209f1, 0x2, 0x988ae80, 0xc4373660d8, ...)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:123 +0x32b
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.ErrorNegotiated(0x7f435d6d40e0, 0xc424d1e120, 0x9854b40, 0xc42720b0f0, 0x988d940, 0xc42082bb60, 0x0, 0x0, 0x3d209f1, 0x2, ...)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/writers.go:142 +0x163
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*RequestScope).err(0xc43fa12b40, 0x9854b40, 0xc42720b0f0, 0x988ae80, 0xc4373660d8, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/rest.go:67 +0x10c
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.createHandler.func1(0x988ae80, 0xc4373660d8, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go:120 +0x1409
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints.restfulCreateResource.func1(0xc43c7b7fb0, 0xc42dad28a0)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/installer.go:1027 +0xd5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics.InstrumentRouteFunc.func1(0xc43c7b7fb0, 0xc42dad28a0)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:199 +0x208
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc420bc6240, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:277 +0xb18
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).Dispatch(0xc420bc6240, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:199 +0x57
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x3d34530, 0xe, 0xc420bc6240, 0xc4203d8770, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:152 +0x4e0
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*director).ServeHTTP(0xc4203f0960, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        <autogenerated>:1 +0x75
k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver.(*proxyHandler).ServeHTTP(0xc427025c80, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go:93 +0x18a
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*pathHandler).ServeHTTP(0xc427281b00, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:248 +0x26d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).ServeHTTP(0xc423692000, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:234 +0xa1
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x3d372f1, 0xf, 0xc428f60900, 0xc423692000, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:160 +0x6ad
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*director).ServeHTTP(0xc42809c760, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        <autogenerated>:1 +0x75
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:52 +0x37d
net/http.HandlerFunc.ServeHTTP(0xc4222fe3c0, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:165 +0x42a
net/http.HandlerFunc.ServeHTTP(0xc425dcd400, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:49 +0x203a
net/http.HandlerFunc.ServeHTTP(0xc4222fe410, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:78 +0x2b1
net/http.HandlerFunc.ServeHTTP(0xc4222fe460, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/request.WithRequestContext.func1(0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/request/requestcontext.go:110 +0xcb
net/http.HandlerFunc.ServeHTTP(0xc42809c780, 0x7f435cf64470, 0xc42cdb19c0, 0xc42e1d6400)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc42809c800, 0x988f800, 0xc42cdb19c0, 0xc42e1d6400, 0xc42347f860)
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:93 +0x8d
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
        /workspace/anago-v1.10.3-beta.0.74+2bba0127d85d5a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:92 +0x1ab

logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"timeout\",\"code\":500}\n"
 [[scaletest driver] 10.171.126.224:49180]

The etcd leader logged nothing unusual around that time.

@MikeSpreitzer
Member Author

@hzxuzhonghu has suggested the mysterious timeout might be due to https://github.com/kubernetes/kubernetes/blob/v1.10.3/plugin/pkg/admission/resourcequota/controller.go#L551 . I do indeed have the ResourceQuota admission controller enabled. However, I have no ResourceQuota objects in any namespace.
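(That line is, roughly, a wait of up to 10 seconds for the admission work on the request's namespace to complete, returning a bare "timeout" error when it expires. The sketch below is a self-contained paraphrase of that pattern, not the actual v1.10.3 source, but it matches both the exact 10-second duration and the uninformative message seen in the logs.)

```go
// Paraphrase of the suspected wait in the ResourceQuota admission plugin:
// each request hands its work to a background evaluator and then waits at
// most 10 seconds for a "finished" signal; if no worker ever completes the
// work, the request fails with a bare "timeout" error.
package main

import (
	"errors"
	"fmt"
	"time"
)

func waitForEvaluation(finished <-chan struct{}) error {
	select {
	case <-finished:
		return nil // a worker evaluated quota for this namespace
	case <-time.After(10 * time.Second):
		return errors.New("timeout")
	}
}

func main() {
	finished := make(chan struct{})          // never closed: simulates lost work
	fmt.Println(waitForEvaluation(finished)) // prints "timeout" after 10 seconds
}
```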

@MikeSpreitzer
Member Author

I tried removing ResourceQuota from the admission controller list, and that made the mysterious 10-second timeouts go away.

Why is the ResourceQuota admission controller causing these 10-second timeouts? Remember, they are rare: I can do hundreds of thousands of create operations in rapid fire and get only a single-digit number of timeouts, while the other create operations complete in well under 10 seconds. See https://docs.google.com/document/d/1o1ygFQ2n7uQIMNRT5xmKPDV5gUbWP5pmtZTQA6V3u9I for an example.

MikeSpreitzer changed the title from "Mysterious timeout from apiserver" to "Mysterious rate 10-second timeouts from ResourceQuota admission controller" May 25, 2018
MikeSpreitzer changed the title from "Mysterious rate 10-second timeouts from ResourceQuota admission controller" to "Mysterious rare 10-second timeouts from ResourceQuota admission controller" May 25, 2018
@MikeSpreitzer
Member Author

Here is another example (now with https://github.com/MikeSpreitzer/k8api-scaletest/tree/e3ca6afab243c90ce4e1fee7a09ee9cb6e163205/cmdriverclosed and the etcd cluster using TLS):

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 3000 --maxpop 90800 --n 421000 --threads 30 --conns 1
RunID is 0525.1215.6647
Wrote parameter file "0525.1215.6647-driver.parms"
DEBUG: Creating 421000 objects
DEBUG: maxpop = 90800
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 30
DEBUG: RATE = 3000/sec
DEBUG: op period = 0.0003333333333333333 sec
I0525 12:22:21.322319     882 main.go:176] 421000 object lifecycles in 421.036121005 seconds = 2999.7426277471363 writes/sec, with 2 errors on create, 2 on update, and 2 on delete
root@mjstest1:~# egrep '[^"]"$' 0525.1215.6647-driver.csv 
2018-05-25 17:15:21.436,2018-05-25 17:15:31.438,create,"0525.1215.6647-001440","timeout"
2018-05-25 17:15:36.366,2018-05-25 17:15:36.368,update,"0525.1215.6647-001440","configmaps \"0525.1215.6647-001440\" not found"
2018-05-25 17:16:07.086,2018-05-25 17:16:07.088,delete,"0525.1215.6647-001440","configmaps \"0525.1215.6647-001440\" not found"
2018-05-25 17:20:42.532,2018-05-25 17:20:52.534,create,"0525.1215.6647-367616","timeout"
2018-05-25 17:21:27.894,2018-05-25 17:21:27.896,update,"0525.1215.6647-367616","configmaps \"0525.1215.6647-367616\" not found"
2018-05-25 17:22:00.824,2018-05-25 17:22:00.826,delete,"0525.1215.6647-367616","configmaps \"0525.1215.6647-367616\" not found"

[screenshot: 2018-05-25 1:29:51 PM]

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 25, 2018

So I slowed down from 3000 to 2000 kube writes per second, but I still got a timeout.

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 2000 --maxpop 90800 --n 1100100 --threads 30 --conns 1
RunID is 0525.1234.5116
Wrote parameter file "0525.1234.5116-driver.parms"
DEBUG: Creating 1100100 objects
DEBUG: maxpop = 90800
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 30
DEBUG: RATE = 2000/sec
DEBUG: op period = 0.0005 sec
I0525 13:01:39.824634    1749 main.go:176] 1100100 object lifecycles in 1650.154192699 seconds = 1999.9949184154807 writes/sec, with 1 errors on create, 1 on update, and 1 on delete
root@mjstest1:~# egrep '[^"]"$' 0525.1234.5116-driver.csv 
2018-05-25 17:35:15.834,2018-05-25 17:35:25.870,create,"0525.1234.5116-0088858","timeout"
2018-05-25 17:36:22.944,2018-05-25 17:36:22.946,update,"0525.1234.5116-0088858","configmaps \"0525.1234.5116-0088858\" not found"
2018-05-25 17:37:30.999,2018-05-25 17:37:31.005,delete,"0525.1234.5116-0088858","configmaps \"0525.1234.5116-0088858\" not found"

[screenshot of the latency graph: 2018-05-25 2:13:59 PM]

And no, that is not a really tight cluster of data points at (17:35:25.870, 10.036) --- it really is just a single data point. No other operation took anywhere near that long.

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 25, 2018

Interestingly, there is a roughly 50 ms bubble of latency around the start of the operation that timed out. As you can see from the graph, there are other bubbles of latency, all much smaller than 10 sec.

Here are 36 consecutive lines of driver data from around the timeout, sorted by preop time.

...
2018-05-25 17:35:15.826,2018-05-25 17:35:15.828,create,"0525.1234.5116-0088841",""
2018-05-25 17:35:15.826,2018-05-25 17:35:15.829,create,"0525.1234.5116-0088842",""
2018-05-25 17:35:15.827,2018-05-25 17:35:15.835,create,"0525.1234.5116-0088844",""
2018-05-25 17:35:15.827,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088843",""
2018-05-25 17:35:15.828,2018-05-25 17:35:15.834,create,"0525.1234.5116-0088845",""
2018-05-25 17:35:15.828,2018-05-25 17:35:15.834,create,"0525.1234.5116-0088846",""
2018-05-25 17:35:15.829,2018-05-25 17:35:15.881,create,"0525.1234.5116-0088847",""
2018-05-25 17:35:15.829,2018-05-25 17:35:15.881,create,"0525.1234.5116-0088848",""
2018-05-25 17:35:15.830,2018-05-25 17:35:15.880,create,"0525.1234.5116-0088850",""
2018-05-25 17:35:15.830,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088849",""
2018-05-25 17:35:15.831,2018-05-25 17:35:15.880,create,"0525.1234.5116-0088851",""
2018-05-25 17:35:15.831,2018-05-25 17:35:15.880,create,"0525.1234.5116-0088852",""
2018-05-25 17:35:15.832,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088853",""
2018-05-25 17:35:15.832,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088854",""
2018-05-25 17:35:15.833,2018-05-25 17:35:15.880,create,"0525.1234.5116-0088855",""
2018-05-25 17:35:15.833,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088856",""
2018-05-25 17:35:15.834,2018-05-25 17:35:15.880,create,"0525.1234.5116-0088857",""
2018-05-25 17:35:15.834,2018-05-25 17:35:25.870,create,"0525.1234.5116-0088858","timeout"
2018-05-25 17:35:15.835,2018-05-25 17:35:15.881,create,"0525.1234.5116-0088859",""
2018-05-25 17:35:15.835,2018-05-25 17:35:15.882,create,"0525.1234.5116-0088860",""
2018-05-25 17:35:15.836,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043471",""
2018-05-25 17:35:15.836,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043472",""
2018-05-25 17:35:15.837,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043473",""
2018-05-25 17:35:15.837,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043474",""
2018-05-25 17:35:15.838,2018-05-25 17:35:15.881,update,"0525.1234.5116-0043475",""
2018-05-25 17:35:15.838,2018-05-25 17:35:15.881,update,"0525.1234.5116-0043476",""
2018-05-25 17:35:15.839,2018-05-25 17:35:15.881,update,"0525.1234.5116-0043477",""
2018-05-25 17:35:15.839,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043478",""
2018-05-25 17:35:15.840,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043479",""
2018-05-25 17:35:15.840,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043480",""
2018-05-25 17:35:15.841,2018-05-25 17:35:15.881,update,"0525.1234.5116-0043482",""
2018-05-25 17:35:15.841,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043481",""
2018-05-25 17:35:15.842,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043484",""
2018-05-25 17:35:15.843,2018-05-25 17:35:15.882,update,"0525.1234.5116-0043485",""
2018-05-25 17:35:15.843,2018-05-25 17:35:15.884,update,"0525.1234.5116-0043486",""
2018-05-25 17:35:15.880,2018-05-25 17:35:15.884,update,"0525.1234.5116-0043490",""
...

@MikeSpreitzer
Member Author

Looking in https://github.com/kubernetes/kubernetes/blob/v1.10.3/plugin/pkg/admission/resourcequota/controller.go , I see something odd. In two places --- in getWork and in doWork's workFunc --- there is a special case for an empty list of Attributes objects. Now, it is not easy to imagine how the list could ever be empty when the namespace is in the queue --- but if this ever comes to pass, then something bad happens: e.queue.Done(ns) is called twice (first in getWork and second in completeWork).
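To make the hazard concrete, here is a minimal demo against client-go's workqueue (illustration only, not the controller code): if the key is re-Added between the two Done calls, the second Done re-queues it, so the key ends up in the queue twice and the queue's dirty/processing bookkeeping no longer matches what the workers actually did.

```go
// Demo of the double-Done hazard on client-go's workqueue.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	q.Add("ns1")
	key, _ := q.Get() // a worker takes "ns1"; it is now marked as processing
	q.Done(key)       // first Done (what getWork does on its empty-work path)
	q.Add("ns1")      // another request arrives for the same namespace
	q.Done(key)       // second Done (from the caller's completeWork)

	fmt.Println("queue length:", q.Len()) // prints 2: "ns1" is now queued twice
}
```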

@MikeSpreitzer
Member Author

MikeSpreitzer commented May 26, 2018

So I made https://github.com/MikeSpreitzer/kubernetes/tree/quota-eval-fix to test the hypothesis that the double-call of e.queue.Done(ns) is a critical part of the pathology. I ran a few trials, and got no timeouts. Here is the longest:

root@mjstest1:~# ./cmdriverclosed --kubeconfig admin-d10bm1 --rate 3000 --maxpop 190800 --n 1100100 --threads 30 --conns 1
RunID is 0526.1752.4083
Wrote parameter file "0526.1752.4083-driver.parms"
DEBUG: Creating 1100100 objects
DEBUG: maxpop = 190800
DEBUG: endpoints are [https://10.171.126.235:6443]
DEBUG: CONNS = 1
DEBUG: THREADS = 30
DEBUG: RATE = 3000/sec
DEBUG: op period = 0.0003333333333333333 sec
I0526 18:11:00.419747    7039 main.go:176] 1100100 object lifecycles in 1100.135926647 seconds = 2999.902030341534 writes/sec, with 0 errors on create, 0 on update, and 0 on delete
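(The fix direction here, and in the PR that eventually merged, is to make each Get pair with exactly one Done: drop the early queue.Done from getWork and let completeWork, which the caller always runs, be the only place Done is called. For contrast with the demo above, here is the same interleaving under the correct one-Done-per-Get contract; again an illustration, not the actual patch.)

```go
// Demo of the corrected contract: exactly one Done per Get, even when the
// worker finds nothing to do, keeps the workqueue bookkeeping consistent.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	q.Add("ns1")
	key, _ := q.Get() // a worker takes "ns1"
	q.Add("ns1")      // another request arrives while "ns1" is being processed
	q.Done(key)       // the single Done: "ns1" is re-queued exactly once

	fmt.Println("queue length:", q.Len()) // prints 1
}
```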

@dims
Member

dims commented May 26, 2018

@MikeSpreitzer wanna file a WIP PR so we can see the exact changes?

@MikeSpreitzer
Member Author

yeah, when I get a chance. Weekend now.

MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue May 27, 2018
In plugin/pkg/admission/resourcequota/controller.go, getWork has an
optimization that obviates --- and actually is inconsistent with --- a
subsequent call on completeWork in the case where the returned list is
empty.

Fixes kubernetes#63608
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jun 1, 2018
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

Fixes kubernetes#63608
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jun 1, 2018
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

Fixes kubernetes#63608
MikeSpreitzer changed the title from "Mysterious rare 10-second timeouts from ResourceQuota admission controller" to "Rare 10-second timeouts from ResourceQuota admission controller" Jun 5, 2018
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jun 12, 2018
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

Fixes kubernetes#63608
k8s-github-robot pushed a commit that referenced this issue Jun 22, 2018
Automatic merge from submit-queue (batch tested with PRs 65152, 65199, 65179, 64598, 65216). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Remove optimization from getWork in resourcequota/controller.go

**What this PR does / why we need it**:
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63608 

**Special notes for your reviewer**:
This is a simpler alternative to #64377 

**Release note**:

```release-note
NONE
```
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jun 23, 2018
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

Fixes kubernetes#63608
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jun 29, 2018
This change simplifies the code in
plugin/pkg/admission/resourcequota/controller.go by removing the
optimization in getWork that required the caller to NOT call
completeWork if getWork returns the empty list of work.  BTW, the
caller was not obeying that requirement; now the caller's behavior
(which is unchanged) is right.

Fixes kubernetes#63608
@redbaron
Contributor

redbaron commented Jul 6, 2018

@MikeSpreitzer very nice investigation you've done here!
