e2e flake: kubemark-gce-scale #28537

Closed
wojtek-t opened this issue Jul 6, 2016 · 16 comments

wojtek-t commented Jul 6, 2016

2000-node kubemark failed with:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:187
scaling rc load-small-rc-1692 for the first time
Expected error:
    <*errors.errorString | 0xc827b9bd60>: {
        s: "error while scaling RC load-small-rc-1692 to 5 replicas: Get https://104.198.196.81/api/v1/namespaces/e2e-tests-load-30-nodepods-5-oexiu/replicationcontrollers/load-small-rc-1692: http2: no cached connection was available",
    }

http://kubekins.dls.corp.google.com/view/Scalability/job/kubernetes-kubemark-gce-scale/1263/

Seems to be related to http2 - any ideas?

@timothysc @krousey

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/cluster kind/flake Categorizes issue or PR as related to a flaky test. labels Jul 6, 2016
@krousey krousey added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 6, 2016

krousey commented Jul 6, 2016

Seems that way. We create a non-dialing client pool when we configure the transport:

ConnPool: noDialClientConnPool{connPool},
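
For context, a minimal sketch of how a client transport ends up with that pool, assuming the standard http2.ConfigureTransport path (not the exact Kubernetes client wiring):

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"

        "golang.org/x/net/http2"
    )

    func main() {
        // Start from a plain http.Transport.
        t := &http.Transport{TLSClientConfig: &tls.Config{}}

        // ConfigureTransport registers an HTTP/2 RoundTripper on t via
        // TLSNextProto. Internally that RoundTripper is built with the
        // no-dial connection pool quoted above, so it only reuses
        // connections t has already dialed and never opens new ones itself.
        if err := http2.ConfigureTransport(t); err != nil {
            log.Fatal(err)
        }

        client := &http.Client{Transport: t}
        _ = client // issue requests with client as usual
    }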

And that seems to be where the error gets returned - but only because all of the current connections were unavailable:

A connection is unavailable if any of the following holds (the linked check begins with):

return cc.goAway == nil && !cc.closed &&

  • the server told it to GoAway,
  • the connection is closed,
  • the next stream ID has reached 2^31 - 1, or
  • the number of active streams has reached the configured max.
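
As a hedged, self-contained sketch of that check (the struct and field names below are illustrative stand-ins, not the vendored http2 types, and the comparison is simplified):

    package main

    import "fmt"

    // Illustrative stand-in for an http2 ClientConn's state; not the real type.
    type clientConnState struct {
        goAwayReceived       bool // server sent a GOAWAY frame
        closed               bool
        activeStreams        int
        maxConcurrentStreams int // advertised by the server in SETTINGS
        nextStreamID         uint32
    }

    // canTakeNewRequest paraphrases the availability check discussed above:
    // all four conditions must hold, otherwise the no-dial pool ends up
    // reporting "http2: no cached connection was available".
    func (cc clientConnState) canTakeNewRequest() bool {
        return !cc.goAwayReceived && !cc.closed &&
            cc.activeStreams < cc.maxConcurrentStreams &&
            cc.nextStreamID < 1<<31-1
    }

    func main() {
        cc := clientConnState{activeStreams: 250, maxConcurrentStreams: 250, nextStreamID: 501}
        fmt.Println(cc.canTakeNewRequest()) // false - the stream limit is what bites here
    }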

Given that this is a scale test, I'm willing to bet it's the number of active streams.

There's a default value for the max.

This value is also advertised to the client through a SettingsFrame.

The client doesn't seem to have any way to change this, but the server does. We should be changing this on the server side anyway.

https://godoc.org/golang.org/x/net/http2#Server
https://godoc.org/golang.org/x/net/http2#ConfigureServer
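
As a hedged sketch (not the actual apiserver patch), bumping the limit through that ConfigureServer hook would look roughly like this; the 1000 below is an arbitrary illustrative value:

    package main

    import (
        "log"
        "net/http"

        "golang.org/x/net/http2"
    )

    func main() {
        srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}

        // Advertise a higher SETTINGS_MAX_CONCURRENT_STREAMS than the library
        // default so one client connection can carry more in-flight requests
        // before its connection pool runs out of usable streams.
        if err := http2.ConfigureServer(srv, &http2.Server{
            MaxConcurrentStreams: 1000, // illustrative value, not an agreed-on setting
        }); err != nil {
            log.Fatal(err)
        }

        // HTTP/2 requires TLS here; cert.pem/key.pem are placeholder paths.
        log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }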

@timothysc I'm on call this week and out for a bit after that. Would you mind putting together a fix? If you get it done this week, I can review.

krousey commented Jul 6, 2016

Also, that's like... at least 249 TLS handshakes we didn't have to do. So yay?

krousey commented Jul 6, 2016

@wojtek-t It might be worth considering the validity of these scale test results, as they may no longer be stressing actual concurrent connections.

wojtek-t commented Jul 6, 2016

@krousey - thanks for debugging;

However, I think I'm missing something. This was a flake - it's not failing constantly. And this is kubemark-2000, which means we have 2000 fake nodes in kubemark, each with at least one open connection to the apiserver. So the apiserver has at least 2000 open connections all the time. And you are saying the limit is 250?

Regarding our scale tests - we are not trying to stress the connections; we are focusing on the apiserver/controllers from the number-of-requests point of view; we've never gotten to the point of thinking more deeply about connections.

krousey commented Jul 6, 2016

This limit is on HTTP/2 streams within a single connection. My guess is the e2e suite just happened to have 251 concurrent requests over a single transport this one time. Probably not related to the scale.
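
As a toy illustration of that scenario (not the e2e code; the endpoint address is made up, and whether the error actually reproduces depends on timing and on the 2016-era x/net/http2 behavior):

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "sync"

        "golang.org/x/net/http2"
    )

    func main() {
        // Fire 251 requests at once through a single HTTP/2-enabled transport.
        // If the server advertises a 250-stream limit, the extra request cannot
        // be multiplexed onto the one cached connection, and with a no-dial
        // pool that can surface as "http2: no cached connection was available".
        t := &http.Transport{
            // toy only: accept a local self-signed cert
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        }
        if err := http2.ConfigureTransport(t); err != nil {
            log.Fatal(err)
        }
        client := &http.Client{Transport: t}

        const target = "https://127.0.0.1:8443/healthz" // hypothetical slow endpoint

        var wg sync.WaitGroup
        for i := 0; i < 251; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                resp, err := client.Get(target)
                if err != nil {
                    log.Println(err)
                    return
                }
                resp.Body.Close()
            }()
        }
        wg.Wait()
    }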

wojtek-t commented Jul 7, 2016

I see - yeah, that's quite possible.

timothysc commented:

> My guess is the e2e suite just happened to have 251 concurrent requests over a single transport this one time.

251 concurrent requests from the e2e client seems... odd.

wojtek-t commented:

Well - whatever it is, we need to fix it. I've seen at least 5 (I think it was closer to 10) failures because of that in the last few days.

@wojtek-t wojtek-t added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 15, 2016

timothysc commented:

Is it only this test that is hitting this issue? Because the clients can be overridden via an environment variable.

wojtek-t commented:

I've seen it both in real-cluster-related tests and in kubemarks. But so far I've seen it only in large ones so it seems to be scale-related.

timothysc commented:

Curious if #29001 is related here. /cc @gmarek

wojtek-t commented:

@timothysc The panic is not related to this one (I checked for panics in those runs and there weren't any). However, the godep update that you did may help (it was just merged, so we don't have data for it yet - it was flaking roughly once every day or two).

wojtek-t commented:

This has just happened in a 100-node cluster.

wojtek-t commented:

@kubernetes/sig-api-machinery

@wojtek-t wojtek-t added this to the v1.4 milestone Jul 19, 2016

timothysc commented:

k, I'll look into the test itself soon.

k8s-github-robot pushed a commit that referenced this issue Jul 20, 2016
Automatic merge from submit-queue

Revert "Follow on for 1.4 to default HTTP2 on by default"

This reverts commit efe2555  
in order to address: #29001 #28537
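
For reference, the stock way to keep a Go client on HTTP/1.1 - per the net/http documentation's note on disabling the transparent HTTP/2 upgrade - looks like this (shown only to illustrate what turning HTTP/2 off by default means for a client; it is not the reverted Kubernetes change itself):

    package main

    import (
        "crypto/tls"
        "net/http"
    )

    func main() {
        // A non-nil, empty TLSNextProto map disables the transparent HTTP/2
        // upgrade (see the net/http package docs), so this client speaks
        // HTTP/1.1 and never hits a per-connection stream limit.
        t := &http.Transport{
            TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
        }
        client := &http.Client{Transport: t}
        _ = client
    }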

timothysc commented:

closed via #29283
