[k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite} #31589

Closed
k8s-github-robot opened this issue Aug 28, 2016 · 10 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5336/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:197
scaling rc load-medium-rc-7 for the first time
Expected error:
    <*errors.errorString | 0xc837b9b590>: {
        s: "error while scaling RC load-medium-rc-7 to 38 replicas: timed out waiting for \"load-medium-rc-7\" to be synced",
    }
    error while scaling RC load-medium-rc-7 to 38 replicas: timed out waiting for "load-medium-rc-7" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:318

Previous issues for this test: #26544 #26938 #27595 #30146 #30469 #31374 #31427 #31433

@k8s-github-robot k8s-github-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 28, 2016
@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5357/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:197
scaling rc load-small-rc-1490 for the first time
Expected error:
    <*errors.errorString | 0xc83499f5f0>: {
        s: "error while scaling RC load-small-rc-1490 to 2 replicas: timed out waiting for the condition",
    }
    error while scaling RC load-small-rc-1490 to 2 replicas: timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:318

@wojtek-t
Member

I think there are two different issues here - I think I already debugged the second case.
Basically, looking into the logs, it seems that there are pretty big pauses between the test's calls to the apiserver (my very strong hypothesis is that this is because of an overloaded Jenkins machine where the test is running).
These are the logs from the offending RC update:

I0828 13:16:58.801043    3174 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-3-i64c1/replicationcontrollers/load-medium-rc-7: (1.621528ms) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/72fbb51] 104.154.21.165:43674]
I0828 13:17:02.461486    3174 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-3-i64c1/replicationcontrollers/load-medium-rc-7: (16.951018ms) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/72fbb51] 104.154.21.165:43713]
I0828 13:17:06.900362    3174 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-3-i64c1/replicationcontrollers/load-medium-rc-7: (1.07016ms) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/72fbb51] 104.154.21.165:43749]
I0828 13:17:06.902427    3174 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-3-i64c1/replicationcontrollers?fieldSelector=metadata.name%3Dload-medium-rc-7&resourceVersion=258064: (677.848µs) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/72fbb51] 104.154.21.165:43749]

As a result, we end up with a "too old resource version" error coming from the watch (since it was issued ~5 seconds later than the corresponding get).

However, this test is pretty big, so the solution for this problem is to make the sliding window (the apiserver's watch cache for RCs) larger - we can afford it in large clusters. Will send a PR for it.
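
For illustration, here is a minimal sketch of the get-then-watch pattern that hits this error, written against a current client-go clientset rather than the 2016 e2e code; the namespace and RC name are taken from the logs above, and the kubeconfig loading is an assumption:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed setup: build a clientset from the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ns, name := "e2e-tests-load-30-nodepods-3-i64c1", "load-medium-rc-7"

	// Step 1: GET the RC and remember its resourceVersion.
	rc, err := client.CoreV1().ReplicationControllers(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: WATCH from that resourceVersion. If the client stalls for several
	// seconds between steps 1 and 2 (overloaded machine, throttling), the version
	// can fall out of the apiserver's watch window and the apiserver reports
	// "too old resource version" (typically delivered as an Error event on the watch).
	w, err := client.CoreV1().ReplicationControllers(ns).Watch(context.TODO(), metav1.ListOptions{
		FieldSelector:   fields.OneTermEqualSelector("metadata.name", name).String(),
		ResourceVersion: rc.ResourceVersion,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()
	fmt.Println("watch established at resourceVersion", rc.ResourceVersion)
}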

@wojtek-t
Member

@gmarek ^^

k8s-github-robot pushed a commit that referenced this issue Aug 29, 2016
Automatic merge from submit-queue

Increase cache size for RCs

Ref #31589

[This should also help with failures of kubemark-scale.]
@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5384/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:197
scaling rc load-big-rc-10 for the first time
Expected error:
    <*errors.errorString | 0xc83655c6c0>: {
        s: "error while scaling RC load-big-rc-10 to 128 replicas: timed out waiting for the condition",
    }
    error while scaling RC load-big-rc-10 to 128 replicas: timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:318

@k8s-github-robot k8s-github-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Aug 30, 2016
@wojtek-t
Member

We should double-check, but my hypothesis is that it may (similarly to our kubemark-scale failures) be a consequence of an overloaded Jenkins machine.

@wojtek-t
Member

wojtek-t commented Aug 30, 2016

I checked the last failure and it's pretty obvious that the machine where the test is running is overloaded.
We are calling this code:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/scale.go#L164

Get() and Update() are the methods visible in the apiserver logs. Here are the logs (the corresponding get and put operations for one of the RCs):

I0830 02:40:15.818176    3176 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (806.784µs) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:51865]
I0830 02:40:29.498850    3176 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (1.109854ms) 409 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:51981]
...
I0830 02:40:47.138619    3176 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (1.013773ms) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:52145]
I0830 02:41:08.338827    3176 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (1.155109ms) 409 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:52388]
...
I0830 02:41:30.058385    3176 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (750.595µs) 200 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:52605]
I0830 02:41:49.539039    3176 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-1-w11cf/replicationcontrollers/load-big-rc-10: (1.312908ms) 409 [[e2e.test/v1.4.0 (linux/amd64) kubernetes/956501b] 104.154.21.165:52865]

As you can see, there are breaks of up to ~20s between consecutive calls, even though each request itself completes in about a millisecond.
@fejta @ixdy - FYI
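
For context, a simplified sketch (not the exact pkg/kubectl/scale.go code) of the kind of get/update retry loop that produces the GET / PUT 409 pairs in the logs above; it is written against a current client-go clientset, and the interval and timeout values are illustrative:

package scalesketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// scaleRC retries GET -> mutate spec.replicas -> PUT until the update succeeds.
// A 409 Conflict (as seen in the logs) just means the resourceVersion was stale,
// so the loop takes another pass with a fresh GET. If the loop cannot finish
// within the timeout, wait returns "timed out waiting for the condition",
// which is the error shown in the failures above.
func scaleRC(client kubernetes.Interface, ns, name string, replicas int32) error {
	return wait.PollImmediate(100*time.Millisecond, time.Minute, func() (bool, error) {
		rc, err := client.CoreV1().ReplicationControllers(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		rc.Spec.Replicas = &replicas
		_, err = client.CoreV1().ReplicationControllers(ns).Update(context.TODO(), rc, metav1.UpdateOptions{})
		if apierrors.IsConflict(err) {
			return false, nil // stale resourceVersion: retry with a fresh GET
		}
		return err == nil, err
	})
}

With that loop in mind, the failure mode described here is not apiserver latency but the gaps between the client's calls eating up the overall timeout.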

@gmarek
Contributor

gmarek commented Aug 30, 2016

FYI @fejta @ixdy

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5412/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:197
scaling rc load-big-rc-7 for the first time
Expected error:
    <*errors.errorString | 0xc8213e1cd0>: {
        s: "error while scaling RC load-big-rc-7 to 294 replicas: timed out waiting for the condition",
    }
    error while scaling RC load-big-rc-7 to 294 replicas: timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:318

@k8s-github-robot k8s-github-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 31, 2016
@wojtek-t
Member

wojtek-t commented Sep 1, 2016

Hmm - we are now running tests on exclusive machines, and the symptoms are pretty much the same.

The new hypothesis is that maybe the client (in the test) is being throttled?

@wojtek-t
Member

wojtek-t commented Sep 1, 2016

Yeah - by running a large kubemark on my own cluster, I confirmed that the problem is actually throttling in the e2e test client.
I will send out a PR to fix this.
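
The throttling in question is client-side rate limiting in the Kubernetes client. As a minimal sketch, assuming a client-go clientset, this is where those limits live; the numbers are illustrative, not the values from the eventual fix:

package clientsketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newLessThrottledClient raises the client-side rate limits so the test client
// itself does not delay requests. client-go's defaults (QPS 5, Burst 10) are
// easily exhausted by a load test scaling hundreds of RCs, which shows up as
// multi-second gaps between calls in the apiserver logs.
func newLessThrottledClient(kubeconfig string) (*kubernetes.Clientset, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	config.QPS = 100   // illustrative value
	config.Burst = 200 // illustrative value
	return kubernetes.NewForConfig(config)
}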
