[k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite} #33839

Closed
k8s-github-robot opened this issue Sep 30, 2016 · 11 comments
Labels: kind/flake, priority/backlog

Comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1451/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:211
scaling rc load-small-rc-5214 for the first time
Expected error:
    <*errors.errorString | 0xc85f25fe80>: {
        s: "error while scaling RC load-small-rc-5214 to 2 replicas: Scaling the resource failed with: the server has received too many requests and has asked us to try again later (get replicationcontrollers load-small-rc-5214); Current resource version Unknown",
    }
    error while scaling RC load-small-rc-5214 to 2 replicas: Scaling the resource failed with: the server has received too many requests and has asked us to try again later (get replicationcontrollers load-small-rc-5214); Current resource version Unknown
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:389

Previous issues for this test: #26544 #26938 #27595 #30146 #30469 #31374 #31427 #31433 #31589 #31981 #32257 #33711

@k8s-github-robot added kind/flake, priority/backlog, priority/important-soon, priority/critical-urgent and removed priority/backlog, priority/important-soon labels on Sep 30, 2016
@wojtek-t (Member)

We should ignore the failures from enormous-cluster; those were experiments.

As for kubemark-scale, those issues were etcd-related, and we are already working on migrating to etcd3.
So I'm decreasing its priority for now.

@wojtek-t added priority/backlog and removed priority/critical-urgent labels on Oct 10, 2016
@k8s-github-robot (Author)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11517/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:211
scaling rc load-medium-rc-11 for the first time
Expected error:
    <*errors.errorString | 0xc424fe3aa0>: {
        s: "Error while waiting for replication controller load-medium-rc-11 pods to be running: Timeout while waiting for pods with labels \"name=load-medium-rc-11\" to be running",
    }
    Error while waiting for replication controller load-medium-rc-11 pods to be running: Timeout while waiting for pods with labels "name=load-medium-rc-11" to be running
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:389

@k8s-github-robot (Author)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:210
deleting rc load-small-rc-121
Expected error:
    <*errors.errorString | 0xc42034ecf0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:402

@wojtek-t (Member) commented Nov 8, 2016

All failures from kubemark-scale and the gce-enormous cluster are expected, so I'm removing them all from here.

@wojtek-t (Member) commented Nov 8, 2016

So there are 2 failures from gce-scalability that we should take a look at.

@wojtek-t (Member) commented Nov 9, 2016

Regarding the second failure:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/
it's super strange.

I took a look at it and here are some findings:

cat build.log | grep "Deleting RC load-" | wc -l
328
cat build.log | grep "Terminating RC load-" | wc -l
327

This shows that the error happened:

return fmt.Errorf("error while waiting for pods to become inactive %s: %v", name, err)

because there is no other return between the first log:
Logf("Deleting RC %s took: %v", name, deleteRCTime)

and the second one:
Logf("Terminating RC %s pods took: %v", name, terminatePodTime)

That said, I have no idea why the log is:

Nov  2 11:54:14.817: INFO: Unexpected error occurred: timed out waiting for the condition

it should be something like:

error while waiting for pods to become inactive

Any thoughts on it, @gmarek?

The second part is also super suspicious, because wait.PollImmediate can return "timed out waiting for the condition" only when it actually times out (or is manually stopped, which doesn't happen here). And the timeout is 10m, which is much longer than the time that elapsed after the first deletion started in the test:

Nov  2 11:52:39.333: INFO: Starting to delete ReplicationControllers...
..
Nov  2 11:54:14.817: INFO: Unexpected error occurred: timed out waiting for the condition

This failure is extremely strange.
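To illustrate the wait.PollImmediate point above: "timed out waiting for the condition" is the wait package's generic timeout error, so seeing it means the poll genuinely ran out of time. A minimal sketch (the import path shown is today's apimachinery one; in 1.5 the package lived under k8s.io/kubernetes/pkg/util/wait):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    // Poll a condition that never becomes true; once the timeout fires,
    // PollImmediate returns the generic wait timeout error.
    err := wait.PollImmediate(100*time.Millisecond, 500*time.Millisecond, func() (bool, error) {
        return false, nil
    })
    fmt.Println(err) // prints: timed out waiting for the condition
}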

@wojtek-t (Member) commented Nov 9, 2016

Actually, please ignore my message above; I misread the logs.
Basically, the problem is with load-small-rc-121. The error in fact is:

Expected error:
    <*errors.errorString | 0xc42034ecf0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:402

And there is no such log for RC 121:

Deleting RC %s took

That said, what happened here is that we hit this error:
https://github.com/foxish/kubernetes/blob/71941016c1ba13cb1d54e08c4c955562966acd6f/test/e2e/framework/util.go#L2740
which now makes perfect sense.

@wojtek-t (Member) commented Nov 9, 2016

I looked into the apiserver logs, and they confirm that theory.
From the test logs:

Nov  2 11:53:14.170: INFO: Terminating RC load-small-rc-239 pods took: 12.100267602s
STEP: deleting replication controller load-small-rc-170 in namespace e2e-tests-load-30-nodepods-1-h872l, will wait for the garbage collector to delete the pods
STEP: deleting replication controller load-small-rc-121 in namespace e2e-tests-load-30-nodepods-1-h872l, will wait for the garbage collector to delete the pods
Nov  2 11:53:14.878: INFO: Terminating RC load-small-rc-81 pods took: 12.700235381s

This shows that the podStoreForRC should have been created around 11:53:14.

However, from the apiserver logs (11:53:14 in the test log corresponds to 18:53:14 in the apiserver log):

$ cat kube-apiserver.log | grep "small-rc-121"
...
I1102 18:52:11.840755       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (6.273824ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:53:14.908384       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/replicationcontrollers/load-small-rc-121: (20.11952ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:53:18.567953       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (4.991468ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:58:55.010617       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods/load-small-rc-121-mdttj: (1.239178ms) 200 [[kube-controller-manager/v1.5.0 (linux/amd64) kubernetes/df8db65/generic-garbage-collector] 127.0.0.1:51517]
...

What is more:

cat kube-apiserver.log | grep "small-rc-121" | grep watch
...
I1102 18:52:11.834393       7 panics.go:76] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=34594&timeoutSeconds=544: (4.977505757s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 19:04:24.752938       7 panics.go:76] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=38512&timeoutSeconds=556: (9m16.001141546s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
...

Looking deeper into the logs, the watch that finished at 19:04:24 (and thus started at 18:55:08) was the one corresponding to our podStoreRC.
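(Quick sanity check of that subtraction, since the WATCH line only logs its total duration: the start time is the log timestamp minus the duration.)

package main

import (
    "fmt"
    "time"
)

func main() {
    // 19:04:24.752938 minus 9m16.001141546s is roughly 18:55:08.
    end, _ := time.Parse("15:04:05.000000", "19:04:24.752938")
    dur, _ := time.ParseDuration("9m16.001141546s")
    fmt.Println(end.Add(-dur).Format("15:04:05")) // 18:55:08
}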

So the problem is that the podStoreRC wasn't initialized correctly, because the WATCH request was sent 2 minutes after it was supposed to be sent.
The problem is therefore on the test side, but the question is why this happened.

@wojtek-t (Member) commented Nov 9, 2016

For the first test:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11517/
the issue was that one pod didn't start correctly. From the kubelet logs:

I1020 09:45:05.450111    3643 kubelet.go:1592] SyncLoop (ADD, "api"): "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)"
I1020 09:45:05.751174    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:47:05.837740    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
I1020 09:47:17.088461    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:49:17.165065    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
E1020 09:49:17.171547    3643 generic.go:238] PLEG: Ignoring events for pod load-medium-rc-11-4qcqh/e2e-tests-load-30-nodepods-1-t608t: operation timeout: context deadline exceeded
I1020 09:49:30.088745    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:51:30.160984    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
I1020 09:51:45.088798    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:53:45.173738    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
...

I filed #36502 for this problem.

@wojtek-t (Member) commented Nov 9, 2016

So the only remaining question in this issue is why the "WATCH" request from the reflector wasn't issued in this run:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/

[as described in detail in https://github.com//issues/33839#issuecomment-259431136]

@wojtek-t (Member) commented Nov 9, 2016

Actually, it's not about issuing the "WATCH" itself. We know that the LIST request from the reflector finished successfully:

I1102 18:53:18.567953       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (4.991468ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]

at 18:53:18.

From the reflector code:

  1. This List finished successfully.
    https://github.com/caesarxuchao/kubernetes/blob/204492703457eb571b83dfbaa3008c473a6f979d/pkg/client/cache/reflector.go#L249
  2. When we get here:
    https://github.com/caesarxuchao/kubernetes/blob/204492703457eb571b83dfbaa3008c473a6f979d/pkg/client/cache/reflector.go#L265
    The reflector is initialized, so the podStore would also be initialized.

That means that we didn't get to the second place for the next ~1 minute (which is the timeout for initializing the podStore).
Since there aren't any other LIST requests, and there eventually is a WATCH request, there can't have been any errors in the meantime.
Also, there are only some simple in-memory operations between those two places.
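Schematically, the relevant part of the reflector's ListAndWatch looks like this; a simplified sketch, not the actual pkg/client/cache code, showing only the ordering that matters for this argument:

package sketch

// listAndWatchSketch: the LIST, the purely in-memory store sync, then the WATCH.
// list, syncStore and watch are hypothetical stand-ins for the reflector's steps.
func listAndWatchSketch(
    list func() (items []interface{}, resourceVersion string, err error),
    syncStore func(items []interface{}, resourceVersion string) error,
    watch func(resourceVersion string) error,
) error {
    items, rv, err := list() // (1) the LIST visible in the apiserver log at 18:53:18
    if err != nil {
        return err
    }
    // Between (1) and (2) there are only in-memory operations like this one;
    // no further API calls, so no server-side errors can occur here.
    if err := syncStore(items, rv); err != nil {
        return err
    }
    // (2) once the store is synced the podStore counts as initialized, and
    // only then is the WATCH request issued (seen at ~18:55:08 in this run).
    return watch(rv)
}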

So the only reason for that could have been an overloaded machine where the test was running.
And indeed, looking at other RCs, it seems there were similar gaps, e.g. for rc-162 it should have started at 18:53:16, but the LIST was issued 16 seconds later.

I1102 18:53:32.155699       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-162&resourceVersion=0: (9.771823ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46903]

I also looked at whether this could have been caused by qps limits, and it wasn't, because there weren't many requests from the e2e test.

So, as the only possible action item, I will just increase the timeout for waiting for the podStore to be initialized.
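A minimal sketch of what that change amounts to, assuming the pod store exposes some notion of having synced; the helper name, constants, and wait import path here are illustrative, not the actual load.go/util.go identifiers:

package framework

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

const (
    podStorePollInterval = 5 * time.Second
    // The logs above suggest the current wait is on the order of a minute;
    // the fix is simply to make this larger, so a briefly starved test
    // process still gets its reflector started before the deadline.
    podStoreSyncTimeout = 2 * time.Minute
)

// waitForPodStoreSynced polls until hasSynced reports true, or fails with the
// usual "timed out waiting for the condition" error after podStoreSyncTimeout.
func waitForPodStoreSynced(hasSynced func() bool) error {
    return wait.PollImmediate(podStorePollInterval, podStoreSyncTimeout, func() (bool, error) {
        return hasSynced(), nil
    })
}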
