[k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite} #33839

Closed
k8s-github-robot opened this issue Sep 30, 2016 · 11 comments
Labels: kind/flake, priority/backlog

Comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1451/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:211
scaling rc load-small-rc-5214 for the first time
Expected error:
    <*errors.errorString | 0xc85f25fe80>: {
        s: "error while scaling RC load-small-rc-5214 to 2 replicas: Scaling the resource failed with: the server has received too many requests and has asked us to try again later (get replicationcontrollers load-small-rc-5214); Current resource version Unknown",
    }
    error while scaling RC load-small-rc-5214 to 2 replicas: Scaling the resource failed with: the server has received too many requests and has asked us to try again later (get replicationcontrollers load-small-rc-5214); Current resource version Unknown
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:389

Previous issues for this test: #26544 #26938 #27595 #30146 #30469 #31374 #31427 #31433 #31589 #31981 #32257 #33711

@k8s-github-robot added kind/flake, priority/backlog, priority/important-soon, priority/critical-urgent and removed priority/backlog, priority/important-soon labels on Sep 30, 2016
@wojtek-t (Member)

We should ignore the failures from enormous-cluster; those were experiments.

As for kubemark-scale, those issues were etcd-related, and we are already working on migrating to etcd3.
So I'm decreasing its priority for now.

@wojtek-t added priority/backlog and removed priority/critical-urgent labels on Oct 10, 2016
@k8s-github-robot (Author)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11517/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:211
scaling rc load-medium-rc-11 for the first time
Expected error:
    <*errors.errorString | 0xc424fe3aa0>: {
        s: "Error while waiting for replication controller load-medium-rc-11 pods to be running: Timeout while waiting for pods with labels \"name=load-medium-rc-11\" to be running",
    }
    Error while waiting for replication controller load-medium-rc-11 pods to be running: Timeout while waiting for pods with labels "name=load-medium-rc-11" to be running
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:389

@k8s-github-robot (Author)

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:210
deleting rc load-small-rc-121
Expected error:
    <*errors.errorString | 0xc42034ecf0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:402

@wojtek-t (Member) commented Nov 8, 2016

All failures from kubemark-scale and the gce-enormous cluster are expected, so I'm removing them all from here.

@wojtek-t (Member) commented Nov 8, 2016

So there are 2 failures from gce-scalability that we should take a look at.

@wojtek-t (Member) commented Nov 9, 2016

Regarding the second failure:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/
it's super strange.

I took a look at it and here are some findings:

cat build.log | grep "Deleting RC load-" | wc -l
328
cat build.log | grep "Terminating RC load-" | wc -l
327

This shows that the error happened:

return fmt.Errorf("error while waiting for pods to become inactive %s: %v", name, err)

because there is no other return between the first log:
Logf("Deleting RC %s took: %v", name, deleteRCTime)

and the second one:
Logf("Terminating RC %s pods took: %v", name, terminatePodTime)

That said, I have no idea why the log is:

Nov  2 11:54:14.817: INFO: Unexpected error occurred: timed out waiting for the condition

it should be something like:

error while waiting for pods to become inactive

Any thoughts on it, @gmarek?

The second part is also super suspicious, because wait.PollImmediate can return "timed out waiting for the condition" only when it actually times out (or is manually stopped, which doesn't happen here). And the timeout is 10m, which is much longer than the time that elapsed after the first deletion started in the test:

Nov  2 11:52:39.333: INFO: Starting to delete ReplicationControllers...
..
Nov  2 11:54:14.817: INFO: Unexpected error occurred: timed out waiting for the condition

This failure is extremely strange.
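To illustrate the wait.PollImmediate point above: "timed out waiting for the condition" is the wait package's generic timeout error, so seeing it means the poll genuinely ran out of time. A minimal sketch (the import path shown is today's apimachinery one; in 1.5 the package lived under k8s.io/kubernetes/pkg/util/wait):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    // Poll a condition that never becomes true; once the timeout fires,
    // PollImmediate returns the generic wait timeout error.
    err := wait.PollImmediate(100*time.Millisecond, 500*time.Millisecond, func() (bool, error) {
        return false, nil
    })
    fmt.Println(err) // prints: timed out waiting for the condition
}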

@wojtek-t (Member) commented Nov 9, 2016

Actually, please ignore my message above; I misread the logs.
Basically, the problem is with load-small-rc-121. The error in fact is:

Expected error:
    <*errors.errorString | 0xc42034ecf0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:402

And there is no such log for RC 121:

Deleting RC %s took

That said, what happened here is that we hit this error:
https://github.com/foxish/kubernetes/blob/71941016c1ba13cb1d54e08c4c955562966acd6f/test/e2e/framework/util.go#L2740
which now makes perfect sense.

@wojtek-t (Member) commented Nov 9, 2016

I looked into the apiserver logs, and they confirm that theory.
From the test logs:

Nov  2 11:53:14.170: INFO: Terminating RC load-small-rc-239 pods took: 12.100267602s
STEP: deleting replication controller load-small-rc-170 in namespace e2e-tests-load-30-nodepods-1-h872l, will wait for the garbage collector to delete the pods
STEP: deleting replication controller load-small-rc-121 in namespace e2e-tests-load-30-nodepods-1-h872l, will wait for the garbage collector to delete the pods
Nov  2 11:53:14.878: INFO: Terminating RC load-small-rc-81 pods took: 12.700235381s

This shows that the podStoreForRC should have been created around 11:53:14.

However, from the apiserver logs (11:53:14 in the test log corresponds to 18:53:14 in the apiserver log):

$ cat kube-apiserver.log | grep "small-rc-121"
...
I1102 18:52:11.840755       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (6.273824ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:53:14.908384       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/replicationcontrollers/load-small-rc-121: (20.11952ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:53:18.567953       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (4.991468ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 18:58:55.010617       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods/load-small-rc-121-mdttj: (1.239178ms) 200 [[kube-controller-manager/v1.5.0 (linux/amd64) kubernetes/df8db65/generic-garbage-collector] 127.0.0.1:51517]
...

What is more:

cat kube-apiserver.log | grep "small-rc-121" | grep watch
...
I1102 18:52:11.834393       7 panics.go:76] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=34594&timeoutSeconds=544: (4.977505757s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
I1102 19:04:24.752938       7 panics.go:76] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=38512&timeoutSeconds=556: (9m16.001141546s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]
...

Looking deeper into the logs, the watch that finished at 19:04:24 (and thus started at 18:55:08) was the one corresponding to our podStoreRC.
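(Quick sanity check of that subtraction, since the WATCH line only logs its total duration: the start time is the log timestamp minus the duration.)

package main

import (
    "fmt"
    "time"
)

func main() {
    // 19:04:24.752938 minus 9m16.001141546s is roughly 18:55:08.
    end, _ := time.Parse("15:04:05.000000", "19:04:24.752938")
    dur, _ := time.ParseDuration("9m16.001141546s")
    fmt.Println(end.Add(-dur).Format("15:04:05")) // 18:55:08
}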

So the problem is that the podStoreRC wasn't initialized correctly, because the WATCH request was sent 2 minutes after it was supposed to be sent.
The problem is therefore on the test side, but the question is why this happened.

@wojtek-t (Member) commented Nov 9, 2016

For the first test:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11517/
the issue was that one pod didn't start correctly. From the kubelet logs:

I1020 09:45:05.450111    3643 kubelet.go:1592] SyncLoop (ADD, "api"): "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)"
I1020 09:45:05.751174    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:47:05.837740    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
I1020 09:47:17.088461    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:49:17.165065    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
E1020 09:49:17.171547    3643 generic.go:238] PLEG: Ignoring events for pod load-medium-rc-11-4qcqh/e2e-tests-load-30-nodepods-1-t608t: operation timeout: context deadline exceeded
I1020 09:49:30.088745    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:51:30.160984    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
I1020 09:51:45.088798    3643 docker_manager.go:1941] Need to restart pod infra container for "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)" because it is not found
E1020 09:53:45.173738    3643 docker_manager.go:2154] Failed to create pod infra container: RunContainerError; Skipping pod "load-medium-rc-11-4qcqh_e2e-tests-load-30-nodepods-1-t608t(e14812a5-96a9-11e6-b89e-42010af00002)": runContainer: operation timeout: context deadline exceeded
...

I filed #36502 for this problem.

@wojtek-t (Member) commented Nov 9, 2016

So the only remaining question in this issue is why the "WATCH" request from the reflector wasn't issued in this run:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/11771/

[as described in detail in https://github.com//issues/33839#issuecomment-259431136]

@wojtek-t (Member) commented Nov 9, 2016

Actually, it's not about issuing the "WATCH" itself. We know that the LIST request from the reflector finished successfully:

I1102 18:53:18.567953       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-121&resourceVersion=0: (4.991468ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46894]

at 18:53:18.

From the reflector code:

  1. This List finished successfully.
    https://github.com/caesarxuchao/kubernetes/blob/204492703457eb571b83dfbaa3008c473a6f979d/pkg/client/cache/reflector.go#L249
  2. When we get here:
    https://github.com/caesarxuchao/kubernetes/blob/204492703457eb571b83dfbaa3008c473a6f979d/pkg/client/cache/reflector.go#L265
    The reflector is initialized, so the podStore would also be initialized.

That means that we didn't get to the second place for the next ~1 minute (which is the timeout for initializing the podStore).
Since there aren't any other LIST requests, and there eventually is a WATCH request, there can't have been any errors in the meantime.
Also, there are only some simple in-memory operations between those two places.
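Schematically, the relevant part of the reflector's ListAndWatch looks like this; a simplified sketch, not the actual pkg/client/cache code, showing only the ordering that matters for this argument:

package sketch

// listAndWatchSketch: the LIST, the purely in-memory store sync, then the WATCH.
// list, syncStore and watch are hypothetical stand-ins for the reflector's steps.
func listAndWatchSketch(
    list func() (items []interface{}, resourceVersion string, err error),
    syncStore func(items []interface{}, resourceVersion string) error,
    watch func(resourceVersion string) error,
) error {
    items, rv, err := list() // (1) the LIST visible in the apiserver log at 18:53:18
    if err != nil {
        return err
    }
    // Between (1) and (2) there are only in-memory operations like this one;
    // no further API calls, so no server-side errors can occur here.
    if err := syncStore(items, rv); err != nil {
        return err
    }
    // (2) once the store is synced the podStore counts as initialized, and
    // only then is the WATCH request issued (seen at ~18:55:08 in this run).
    return watch(rv)
}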

So the only reason for that could have been an overloaded machine where the test was running.
And indeed, looking at other RCs, it seems there were similar gaps, e.g. for rc-162 it should have started at 18:53:16, but the LIST was issued 16 seconds later.

I1102 18:53:32.155699       7 panics.go:76] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-1-h872l/pods?labelSelector=name%3Dload-small-rc-162&resourceVersion=0: (9.771823ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/df8db65] 130.211.158.233:46903]

I also looked at whether this could have been caused by qps limits, and it wasn't, because there weren't many requests from the e2e test.

So, as the only possible action item, I will just increase the timeout for waiting for the podStore to be initialized.
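A minimal sketch of what that change amounts to, assuming the pod store exposes some notion of having synced; the helper name, constants, and wait import path here are illustrative, not the actual load.go/util.go identifiers:

package framework

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

const (
    podStorePollInterval = 5 * time.Second
    // The logs above suggest the current wait is on the order of a minute;
    // the fix is simply to make this larger, so a briefly starved test
    // process still gets its reflector started before the deadline.
    podStoreSyncTimeout = 2 * time.Minute
)

// waitForPodStoreSynced polls until hasSynced reports true, or fails with the
// usual "timed out waiting for the condition" error after podStoreSyncTimeout.
func waitForPodStoreSynced(hasSynced func() bool) error {
    return wait.PollImmediate(podStorePollInterval, podStoreSyncTimeout, func() (bool, error) {
        return hasSynced(), nil
    })
}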
