ServerTimeout when there are a huge amount of requests on a specific resource #45811

Closed
zhouhaibing089 opened this Issue May 15, 2017 · 23 comments

Comments


zhouhaibing089 commented May 15, 2017

We are running version 1.5.2. When I searched through existing issues, I found one that seems somewhat relevant.

What I did

We designed a load test against the apiserver, which can be summarized as follows:

  1. We create ten configmaps in a given namespace.
  2. We spawn 500 pods, each of which runs curl https://..../api/v1/namespaces/<ns>/configmaps at a high frequency; the target rate is 10 requests per second per pod (a minimal load-generator sketch is shown below).
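
For reference, a rough Go equivalent of the per-pod curl loop might look like the following; the apiserver address, namespace, and token path are placeholders, and the real pods simply run curl -k against the URL above.

package main

import (
	"crypto/tls"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder endpoint and service-account token path; the actual test hits
	// https://<apiserver>/api/v1/namespaces/<ns>/configmaps via curl.
	const url = "https://apiserver.example.com:6443/api/v1/namespaces/loadtest/configmaps"
	token, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Equivalent of curl -k; acceptable for a throwaway load test.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// ~10 requests per second per pod.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		req, _ := http.NewRequest("GET", url, nil)
		req.Header.Set("Authorization", "Bearer "+string(token))
		resp, err := client.Do(req)
		if err != nil {
			log.Println("request error:", err)
			continue
		}
		// Drain and close so the underlying connection can be reused.
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}
}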

What I saw

$ kubectl get configmaps
Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get configmaps)
$ kubectl get svc
NAME            CLUSTER-IP        EXTERNAL-IP     PORT(S)        AGE
kubernetes      192.168.0.1       <none>          443/TCP        362d
nginx           192.168.50.222    10.67.204.61    80:30343/TCP   94d
nginx-2         192.168.235.111   10.67.204.11    80:30076/TCP   87d
nginx-tessnet   192.168.71.26     10.67.204.189   80:32360/TCP   293d

The server only times out for configmap requests, while other resources still respond normally.

Related Information

flags for kube-apiserver:

$ ps aux | grep kube-apiserver
root     31660  223  0.6 805188 748744 ?       Sl   06:54 141:07 /usr/local/bin/kube-apiserver 
  --etcd_servers=https://<etcd server>:4001 
  --cloud_provider=c3 
  --cloud_config=/etc/sysconfig/openstack.rc 
  --runtime_config=<runtime_configs> 
  --admission_control=<admission controls> 
  --service-cluster-ip-range=192.168.0.0/16 
  --insecure-port=8080 
  --client_ca_file=/etc/ssl/kubernetes/ca.crt 
  --max_requests_inflight=4000 
  --tls_cert_file=/etc/ssl/kubernetes/server.crt 
  --tls_private_key_file=/etc/ssl/kubernetes/server.key 
  --service_account_key_file=/etc/ssl/kubernetes/service-account.key 
  --secure-port=6443 
  --token_auth_file=/srv/kubernetes/known_tokens.csv 
  --v=7 --keystone_url=<keystone_url>  
  --authorization-mode=RBAC 
  --etcd-cafile=/etc/ssl/kubernetes/ca.crt 
  --apiserver-count=1 
  --authorization-rbac-super-user=<super user> 
  --keystone_auth_config_file=/etc/sysconfig/keystoneauthorization.json 
  --enable-garbage-collector=True
  --audit-log-maxage=7 
  --audit-log-maxbackup=3 
  --audit-log-maxsize=500 
  --audit-log-path=/var/log/kube-audit/kube-apiserver-audit.log 
  --storage-backend=etcd3 
  --cors-allowed-origins=.* 
  --anonymous-auth=True 
  --allow_privileged=True

When creating configmaps, we see stream terminated by RST_STREAM with error code: 1 (error code 1 is PROTOCOL_ERROR in the HTTP/2 spec):

$ kubectl create configmap test -n <ns> --from-file=<file> --v=10
...
I0515 08:52:29.157950    6133 round_trippers.go:299] curl -k -v -XPOST  -H "Content-Type: application/json" -H "User-Agent: kubectl/v1.5.2 (linux/amd64) kubernetes/df31c96" -H "Accept: application/json, */*" http://localhost:8080/api/v1/namespaces/lnp/configmaps
I0515 08:52:29.385764    6133 round_trippers.go:318] POST http://localhost:8080/api/v1/namespaces/lnp/configmaps 500 Internal Server Error in 227 milliseconds
I0515 08:52:29.385811    6133 round_trippers.go:324] Response Headers:
I0515 08:52:29.385820    6133 round_trippers.go:327]     Content-Length: 171
I0515 08:52:29.385832    6133 round_trippers.go:327]     Content-Type: application/json
I0515 08:52:29.385842    6133 round_trippers.go:327]     Date: Mon, 15 May 2017 08:52:29 GMT
I0515 08:52:29.385942    6133 request.go:904] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 1","code":500}
I0515 08:52:29.385971    6133 request.go:995] Response Body: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 1\",\"code\":500}\n"
I0515 08:52:29.386427    6133 helpers.go:203] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 1",
  "code": 500
}]
F0515 08:52:29.386462    6133 helpers.go:116] Error from server: rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 1

zhouhaibing089 commented May 15, 2017

cc @gmarek @wojtek-t I think you might have some ideas on the performance side. Could you please help out?


gmarek commented May 15, 2017


zhouhaibing089 commented May 15, 2017

It seems that under these circumstances the web service (for configmaps) hangs or crashes?


armstrongli commented May 15, 2017

@zhouhaibing089 does this PR help at all: #7559?


timothysc commented May 15, 2017

@timothysc timothysc added the kind/bug label May 15, 2017

@timothysc timothysc added this to the v1.7 milestone May 15, 2017


wojtek-t commented May 15, 2017

I don't think this is necessarily a bug; it may be working as intended.

If you have 500 pods, each of them generating 10 qps of "get configmaps", that gives us 5000 qps. Depending on the size of the master machine, the number of apiservers, and/or the number of etcd instances, that may simply be too many requests to handle.


zhouhaibing089 commented May 15, 2017

@wojtek-t That is not actually the case. I think I need to emphasize a few points:

  1. Only one resource is affected.
  2. The timeouts did not stop after the test finished (until we restarted the apiserver).
  3. The 5000 qps is the rate we wanted to achieve; it may not be the actual load.

lavalamp commented May 26, 2017

Please look at and/or paste the apiserver logs here.

@lavalamp lavalamp removed this from the v1.7 milestone May 26, 2017


keyingliu commented Jun 27, 2017

We updated the grpc package to the same version that Kubernetes 1.7 uses, and the hang is gone. The newer grpc version has a huge performance improvement, especially when there are many concurrent requests on a single etcd connection (the apiserver creates one connection per resource).
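
For context, a minimal sketch of the pattern described above, with placeholder endpoint and key prefix rather than the apiserver's actual wiring: each resource's storage gets its own clientv3 client, so all requests for that resource are multiplexed onto one gRPC connection, and the older grpc version handled that contention poorly.

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// One client (and therefore one gRPC connection) per resource.
	// Placeholder endpoint; a TLS-secured cluster would also need Config.TLS set.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd.example.com:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Every GET/LIST for the resource becomes a Range RPC on that one connection,
	// so thousands of concurrent requests all contend for streams on it.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/registry/configmaps/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("got %d keys", len(resp.Kvs))
}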


mml commented Aug 21, 2017

/assign


fejta-bot commented Jan 3, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale


zhouhaibing089 commented Feb 5, 2018

/remove-lifecycle stale


zhouhaibing089 commented Feb 5, 2018

Below is the goroutine profile based on k8s 1.6.3.

goroutine profile: total 49531
20276 @ 0x42ce8a 0x43c5d5 0x43b36c 0x294b926 0x293a799 0x25b3f05 0x25b4a12 0x25cd6e6 0x25b42fd 0x255c692 0x2233292 0x222c55f 0x222c3b7 0x2224388 0x221d67e 0x221cf5d 0x221ca62 0x220bf92 0xf7d136 0xf8ac74 0x225a357 0x2259f70 0x223abbc 0xfc4bf0 0xfd3048 0x603bc4 0x6050bf 0x102a205 0x603bc4 0x102c10d 0x603bc4 0x1029792
#	0x294b925	k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.wait+0x445										/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/transport/transport.go:577
#	0x293a798	k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.(*http2Client).NewStream+0x658							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/transport/http2_client.go:319
#	0x25b3f04	k8s.io/kubernetes/vendor/google.golang.org/grpc.sendRequest+0x94										/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/call.go:80
#	0x25b4a11	k8s.io/kubernetes/vendor/google.golang.org/grpc.invoke+0x621											/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/call.go:191
#	0x25cd6e5	k8s.io/kubernetes/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1+0x125			/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/client_metrics.go:84
#	0x25b42fc	k8s.io/kubernetes/vendor/google.golang.org/grpc.Invoke+0xdc											/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/call.go:116
#	0x255c691	k8s.io/kubernetes/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*kVClient).Range+0xd1							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2203
#	0x2233291	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*retryWriteKVClient).Range+0x91							<autogenerated>:174
#	0x222c55e	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*retryKVClient).Range.func1+0x8e							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/retry.go:92
#	0x222c3b6	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*Client).newAuthRetryWrapper.func1+0x46						/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/retry.go:61
#	0x2224387	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*retryKVClient).Range+0x157								/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/retry.go:94
#	0x221d67d	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*kv).do+0x4ed										/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/kv.go:145
#	0x221cf5c	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*kv).Do+0x7c										/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/kv.go:124
#	0x221ca61	k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3.(*kv).Get+0xe1										/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/coreos/etcd/clientv3/kv.go:98
#	0x220bf91	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3.(*store).Get+0x131									/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3/store.go:128
#	0xf7d135	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage.(*Cacher).Get+0xc5									/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:360
#	0xf8ac73	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Get+0x183							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go:517
#	0x225a356	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.GetResource.func1+0x1f6							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/rest.go:162
#	0x2259f6f	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.getResourceHandler.func1+0x19f							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/rest.go:123
#	0x223abbb	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics.InstrumentRouteFunc.func1+0x1eb							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:104
#	0xfc4bef	k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch+0xb9f								/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:272
#	0xfd3047	k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm+0x47	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:120
#	0x603bc3	net/http.HandlerFunc.ServeHTTP+0x43														/usr/local/go/src/net/http/server.go:1726
#	0x6050be	net/http.(*ServeMux).ServeHTTP+0x7e														/usr/local/go/src/net/http/server.go:2022
#	0x102a204	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1+0x364							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:50
#	0x603bc3	net/http.HandlerFunc.ServeHTTP+0x43														/usr/local/go/src/net/http/server.go:1726
#	0x102c10c	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1+0x1e4c							/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:47
#	0x603bc3	net/http.HandlerFunc.ServeHTTP+0x43														/usr/local/go/src/net/http/server.go:1726
#	0x1029791	k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1+0x911								/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:137
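
The stacks above show roughly 20k goroutines parked in grpc transport.wait inside (*http2Client).NewStream, i.e. apparently waiting for a free stream on the etcd client connection before the Range call can even start. For anyone who wants to capture the same data, the dump can be fetched from the apiserver's pprof endpoint while profiling is enabled (the --profiling flag, which defaults to true); a tiny sketch, assuming the insecure port from the flags above is reachable on localhost:

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 prints a full stack trace for every goroutine, matching the dump above.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}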

zhouhaibing089 commented Feb 5, 2018

Looks like exactly the same issue as #57061


fejta-bot commented May 6, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


zhouhaibing089 commented May 6, 2018

/remove-lifecycle stale


fejta-bot commented Aug 4, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


nikhita commented Aug 10, 2018

/remove-lifecycle stale


iahmad-khan commented Aug 14, 2018

Same issue here:

kubectl create -f kubia-liveness.yml 
Error from server (ServerTimeout): error when creating "kubia-liveness.yml": the server cannot complete the requested operation at this time, try again later (post pods)


fejta-bot commented Nov 12, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale


fejta-bot commented Dec 12, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten


fejta-bot commented Jan 11, 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


k8s-ci-robot commented Jan 11, 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
