e2e flake: kubemark-gce-scale #28537

Closed
wojtek-t opened this issue Jul 6, 2016 · 16 comments

wojtek-t commented Jul 6, 2016

2000-node kubemark failed with:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:187
scaling rc load-small-rc-1692 for the first time
Expected error:
    <*errors.errorString | 0xc827b9bd60>: {
        s: "error while scaling RC load-small-rc-1692 to 5 replicas: Get https://104.198.196.81/api/v1/namespaces/e2e-tests-load-30-nodepods-5-oexiu/replicationcontrollers/load-small-rc-1692: http2: no cached connection was available",
    }

http://kubekins.dls.corp.google.com/view/Scalability/job/kubernetes-kubemark-gce-scale/1263/

Seems to be related to http2 - any ideas?

@timothysc @krousey

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/cluster kind/flake Categorizes issue or PR as related to a flaky test. labels Jul 6, 2016
@krousey krousey added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 6, 2016

krousey commented Jul 6, 2016

Seems that way. We create a non-dialing client pool when we configure the transport:

ConnPool: noDialClientConnPool{connPool},
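
For context, a minimal sketch of how a client transport ends up with that pool, assuming the standard http2.ConfigureTransport path (not the exact Kubernetes client wiring):

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"

        "golang.org/x/net/http2"
    )

    func main() {
        // Start from a plain http.Transport.
        t := &http.Transport{TLSClientConfig: &tls.Config{}}

        // ConfigureTransport registers an HTTP/2 RoundTripper on t via
        // TLSNextProto. Internally that RoundTripper is built with the
        // no-dial connection pool quoted above, so it only reuses
        // connections t has already dialed and never opens new ones itself.
        if err := http2.ConfigureTransport(t); err != nil {
            log.Fatal(err)
        }

        client := &http.Client{Transport: t}
        _ = client // issue requests with client as usual
    }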

And that seems to be where the error gets returned - but only because all of the current connections were unavailable:

A connection is unavailable if any of the following holds (the linked check begins with):

return cc.goAway == nil && !cc.closed &&

  • the server told it to GoAway,
  • the connection is closed,
  • the next stream ID has reached 2^31 - 1, or
  • the number of active streams has reached the configured max.
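
As a hedged, self-contained sketch of that check (the struct and field names below are illustrative stand-ins, not the vendored http2 types, and the comparison is simplified):

    package main

    import "fmt"

    // Illustrative stand-in for an http2 ClientConn's state; not the real type.
    type clientConnState struct {
        goAwayReceived       bool // server sent a GOAWAY frame
        closed               bool
        activeStreams        int
        maxConcurrentStreams int // advertised by the server in SETTINGS
        nextStreamID         uint32
    }

    // canTakeNewRequest paraphrases the availability check discussed above:
    // all four conditions must hold, otherwise the no-dial pool ends up
    // reporting "http2: no cached connection was available".
    func (cc clientConnState) canTakeNewRequest() bool {
        return !cc.goAwayReceived && !cc.closed &&
            cc.activeStreams < cc.maxConcurrentStreams &&
            cc.nextStreamID < 1<<31-1
    }

    func main() {
        cc := clientConnState{activeStreams: 250, maxConcurrentStreams: 250, nextStreamID: 501}
        fmt.Println(cc.canTakeNewRequest()) // false - the stream limit is what bites here
    }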

Given that this is a scale test, I'm willing to bet it's the number of active streams.

There's a default value for the max.

This value is also advertised to the client through a SettingsFrame.

The client doesn't seem to have any way to change this, but the server does. We should be changing this on the server side anyway.

https://godoc.org/golang.org/x/net/http2#Server
https://godoc.org/golang.org/x/net/http2#ConfigureServer
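
As a hedged sketch (not the actual apiserver patch), bumping the limit through that ConfigureServer hook would look roughly like this; the 1000 below is an arbitrary illustrative value:

    package main

    import (
        "log"
        "net/http"

        "golang.org/x/net/http2"
    )

    func main() {
        srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}

        // Advertise a higher SETTINGS_MAX_CONCURRENT_STREAMS than the library
        // default so one client connection can carry more in-flight requests
        // before its connection pool runs out of usable streams.
        if err := http2.ConfigureServer(srv, &http2.Server{
            MaxConcurrentStreams: 1000, // illustrative value, not an agreed-on setting
        }); err != nil {
            log.Fatal(err)
        }

        // HTTP/2 requires TLS here; cert.pem/key.pem are placeholder paths.
        log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }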

@timothysc I'm on call this week and out for a bit after that. Would you mind putting together a fix? If you get it done this week, I can review.

krousey commented Jul 6, 2016

Also, that's like... at least 249 TLS handshakes we didn't have to do. So yay?

krousey commented Jul 6, 2016

@wojtek-t It might be worth considering the validity of these scale test results, as they may no longer be stressing actual concurrent connections.

wojtek-t commented Jul 6, 2016

@krousey - thanks for debugging;

However, I think I'm missing something. This was a flake - it's not failing constantly. And this is kubemark-2000, which means we have 2000 fake nodes in kubemark, each with at least one open connection to the apiserver. So the apiserver has at least 2000 open connections all the time. And you are saying the limit is 250?

Regarding our scale tests - we are not trying to stress the connections; we are focusing on the apiserver/controllers from the number-of-requests point of view; we've never gotten to the point of thinking more deeply about connections.

krousey commented Jul 6, 2016

This limit is on HTTP/2 streams within a single connection. My guess is the e2e suite just happened to have 251 concurrent requests over a single transport this one time. Probably not related to the scale.
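
As a toy illustration of that scenario (not the e2e code; the endpoint address is made up, and whether the error actually reproduces depends on timing and on the 2016-era x/net/http2 behavior):

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "sync"

        "golang.org/x/net/http2"
    )

    func main() {
        // Fire 251 requests at once through a single HTTP/2-enabled transport.
        // If the server advertises a 250-stream limit, the extra request cannot
        // be multiplexed onto the one cached connection, and with a no-dial
        // pool that can surface as "http2: no cached connection was available".
        t := &http.Transport{
            // toy only: accept a local self-signed cert
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        }
        if err := http2.ConfigureTransport(t); err != nil {
            log.Fatal(err)
        }
        client := &http.Client{Transport: t}

        const target = "https://127.0.0.1:8443/healthz" // hypothetical slow endpoint

        var wg sync.WaitGroup
        for i := 0; i < 251; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                resp, err := client.Get(target)
                if err != nil {
                    log.Println(err)
                    return
                }
                resp.Body.Close()
            }()
        }
        wg.Wait()
    }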

wojtek-t commented Jul 7, 2016

I see - yeah, that's quite possible.

timothysc commented:

> My guess is the e2e suite just happened to have 251 concurrent requests over a single transport this one time.

251 concurrent requests from the e2e client seems... odd.

wojtek-t commented:

Well - whatever it is, we need to fix it. I've seen at least 5 (I think it was closer to 10) failures because of that in the last few days.

@wojtek-t wojtek-t added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 15, 2016

timothysc commented:

Is it only this test that is hitting this issue? Because the clients can be overridden via an environment variable.

wojtek-t commented:

I've seen it both in real-cluster-related tests and in kubemarks. But so far I've seen it only in large ones so it seems to be scale-related.

timothysc commented:

Curious if #29001 is related here. /cc @gmarek

wojtek-t commented:

@timothysc The panic is not related to this one (I checked for panics in those runs and there weren't any). However, the godep update that you did may help (it was just merged, so we don't have data for it yet - it was flaking roughly once every day or two).

wojtek-t commented:

This has just happened in a 100-node cluster.

wojtek-t commented:

@kubernetes/sig-api-machinery

@wojtek-t wojtek-t added this to the v1.4 milestone Jul 19, 2016

timothysc commented:

k, I'll look into the test itself soon.

k8s-github-robot pushed a commit that referenced this issue Jul 20, 2016
Automatic merge from submit-queue

Revert "Follow on for 1.4 to default HTTP2 on by default"

This reverts commit efe2555  
in order to address: #29001 #28537
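
For reference, the stock way to keep a Go client on HTTP/1.1 - per the net/http documentation's note on disabling the transparent HTTP/2 upgrade - looks like this (shown only to illustrate what turning HTTP/2 off by default means for a client; it is not the reverted Kubernetes change itself):

    package main

    import (
        "crypto/tls"
        "net/http"
    )

    func main() {
        // A non-nil, empty TLSNextProto map disables the transparent HTTP/2
        // upgrade (see the net/http package docs), so this client speaks
        // HTTP/1.1 and never hits a per-connection stream limit.
        t := &http.Transport{
            TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
        }
        client := &http.Client{Transport: t}
        _ = client
    }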

timothysc commented:

closed via #29283
