
[k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite} #31981

Closed
k8s-github-robot opened this issue Sep 2, 2016 · 21 comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5489/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:198
scaling rc load-big-rc-2 for the first time
Expected error:
    <*errors.errorString | 0xc839362ba0>: {
        s: "error while scaling RC load-big-rc-2 to 219 replicas: timed out waiting for \"load-big-rc-2\" to be synced",
    }
    error while scaling RC load-big-rc-2 to 219 replicas: timed out waiting for "load-big-rc-2" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:319

Previous issues for this test: #26544 #26938 #27595 #30146 #30469 #31374 #31427 #31433 #31589

@k8s-github-robot k8s-github-robot added kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 2, 2016
@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1385/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:72
Expected
    <int>: 1
not to be >
    <int>: 0
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:71

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5522/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:198
scaling rc load-small-rc-79 for the first time
Expected error:
    <*errors.errorString | 0xc836752760>: {
        s: "error while scaling RC load-small-rc-79 to 4 replicas: timed out waiting for \"load-small-rc-79\" to be synced",
    }
    error while scaling RC load-small-rc-79 to 4 replicas: timed out waiting for "load-small-rc-79" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:319

@k8s-github-robot k8s-github-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 3, 2016
@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5529/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:198
scaling rc load-small-rc-476 for the first time
Expected error:
    <*errors.errorString | 0xc8214c0110>: {
        s: "error while scaling RC load-small-rc-476 to 4 replicas: timed out waiting for \"load-small-rc-476\" to be synced",
    }
    error while scaling RC load-small-rc-476 to 4 replicas: timed out waiting for "load-small-rc-476" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:319

@k8s-github-robot k8s-github-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 3, 2016
@k8s-github-robot

[FLAKE-PING] @wojtek-t

This flaky-test issue would love to have more attention.


wojtek-t commented Sep 5, 2016

Hmm - something strange is happening in those kubemark-500 failures.

In the first failure, it seems that the controller-manager wasn't even aware of the decrease in the number of replicas:

STEP: creating replication controller load-big-rc-2 in namespace e2e-tests-load-30-nodepods-3-1z5ut
Sep  2 08:19:15.065: INFO: Created replication controller with name: load-big-rc-2, namespace: e2e-tests-load-30-nodepods-3-1z5ut, replica count: 250
Sep  2 08:19:25.065: INFO: load-big-rc-2 Pods: 250 out of 250 created, 250 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
STEP: Scaling replication controller load-big-rc-2 in namespace e2e-tests-load-30-nodepods-3-1z5ut to 219
E0902 08:27:04.006899   11623 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
Sep  2 08:27:04.007: INFO: Unexpected error occurred: error while scaling RC load-big-rc-2 to 219 replicas: timed out waiting for "load-big-rc-2" to be synced
...
STEP: Scaling replication controller load-big-rc-2 in namespace e2e-tests-load-30-nodepods-3-1z5ut to 146

apiserver:

...
I0902 15:19:21.618507    3165 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2/status: (9.616598ms) 200 [[kube-controller-manager/v1.5.0 (linux/amd64) kubernetes/dfccabe/replication-controller] 127.0.0.1:32980]
I0902 15:19:25.067683    3165 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/pods?labelSelector=name%3Dload-big-rc-2&resourceVersion=224278&timeoutSeconds=322: (9.971036438s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:42286]

15:22:03.988813    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2: (1.114716ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:44582]
I0902 15:22:04.003190    3165 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2: (12.858173ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:44581]
I0902 15:22:04.005725    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2: (952.893µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:44581]

15:27:04.007372    3165 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers?fieldSelector=metadata.name%3Dload-big-rc-2&resourceVersion=266155: (5m0.000319298s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:44574]
I0902 15:27:24.780552    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2: (902.312µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:46504]
I0902 15:27:24.797328    3165 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-3-1z5ut/replicationcontrollers/load-big-rc-2: (15.195073ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/dfccabe] 146.148.96.166:46504]

controller-manager:

...
I0902 15:19:17.690762    3208 event.go:217] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"e2e-tests-load-30-nodepods-3-1z5ut", Name:"load-big-rc-2", UID:"9bf29183-7120-11e6-89ef-42010af00009", APIVersion:"v1", ResourceVersion:"224279", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: load-big-rc-2-wue4v
I0902 15:19:17.757762    3208 trace.go:61] Trace "syncReplicationController: e2e-tests-load-30-nodepods-3-1z5ut/load-big-rc-2" (started 2016-09-02 15:19:15.105936366 +0000 UTC):

I0902 15:27:26.215219    3208 replication_controller.go:555] Too many "e2e-tests-load-30-nodepods-3-1z5ut"/"load-big-rc-2" replicas, need 146, deleting 104
I0902 15:27:26.273704    3208 event.go:217] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"e2e-tests-load-30-nodepods-3-1z5ut", Name:"load-big-rc-2", UID:"9bf29183-7120-11e6-89ef-42010af00009", APIVersion:"v1", ResourceVersion:"297411", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: load-big-rc-2-26txc
...

So it seems that the controller-manager didn't even receive the update from 250 to 219 replicas; it only received the later update to 146 replicas.

On the other hand, in this failure:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5522/
the apiserver didn't even receive the first update to 4 replicas:

test:

...
Sep  3 09:32:14.584: INFO: load-small-rc-79 Pods: 5 out of 5 created, 5 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
STEP: Scaling replication controller load-small-rc-79 in namespace e2e-tests-load-30-nodepods-5-0573d to 4
E0903 09:41:41.953515   10476 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
Sep  3 09:41:41.953: INFO: Unexpected error occurred: error while scaling RC load-small-rc-79 to 4 replicas: timed out waiting for "load-small-rc-79" to be synced

STEP: Scaling replication controller load-small-rc-1078 in namespace e2e-tests-load-30-nodepods-4-wkl7b to 4
Sep  3 09:42:38.162: INFO: Unexpected error occurred: error while scaling RC load-small-rc-79 to 4 replicas: timed out waiting for "load-small-rc-79" to be synced

apiserver:

I0903 16:32:14.585900    3165 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-5-0573d/pods?labelSelector=name%3Dload-small-rc-79&resourceVersion=201637&timeoutSeconds=582: (9.986391511s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:53672]

I0903 16:36:41.635664    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79: (1.205592ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:57583]
I0903 16:36:41.951021    3165 trace.go:61] Trace "Get /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79" (started 2016-09-03 16:36:41.682127757 +0000 UTC):
I0903 16:36:41.951341    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79: (269.617256ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:57591]

I0903 16:41:41.953958    3165 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers?fieldSelector=metadata.name%3Dload-small-rc-79&resourceVersion=277716: (4m59.944559632s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:57590]

I0903 16:42:38.146781    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79: (1.041313ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:59367]
I0903 16:42:38.149825    3165 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79: (1.303532ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:59367]
I0903 16:42:38.160244    3165 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers/load-small-rc-79: (1.005628ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:59370]
I0903 16:42:38.161874    3165 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-5-0573d/replicationcontrollers?fieldSelector=metadata.name%3Dload-small-rc-79&resourceVersion=277716: (557.594µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/af9afbf] 130.211.160.145:59370]

@wojtek-t wojtek-t added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Sep 5, 2016

wojtek-t commented Sep 5, 2016

I've never seen anything like that in the past, so it seems like a recent regression. But I don't see anything suspicious being merged in the last few days.

@kubernetes/sig-api-machinery @lavalamp

@wojtek-t wojtek-t added the kind/bug Categorizes issue or PR as related to a bug. label Sep 5, 2016
gmarek added a commit to gmarek/kubernetes that referenced this issue Sep 5, 2016

wojtek-t commented Sep 5, 2016

#31981 (comment) is addressed by #32081

The other 3 failures are still under investigation.

gmarek added a commit that referenced this issue Sep 5, 2016

wojtek-t commented Sep 5, 2016

Sorry - I missed something in the logs for the second run.

So actually the three remaining runs (the one being fixed by #32081 aside) are instances of exactly the same problem:

  • we are scaling an RC (by either adding or removing replicas)
  • no action happens for it
  • there are no corresponding logs in the replication controller (which means that either the RC ignores the update, or the update is not even delivered to the RC)

What is suspicious to me:

  • it only happens on updates (all creates and deletes work correctly)
  • I didn't see anything like that before (it started happening during the last 3 days)
  • unless I'm missing something, we don't observe anything like that for resources other than RCs (I think we would see it for pods, for example, if it were happening, because there are far more pod updates that are important for the system, e.g. scheduling, updating status to running, etc.)

So my hypothesis is that it's related to RC, but this is currently just a hypothesis.
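
For context on where the "timed out waiting for ... to be synced" error comes from: the test bumps the RC's spec.replicas and then waits, with a long timeout, for the controller to observe the change and for the status to catch up. The snippet below is a simplified sketch of that pattern written against current client-go; it is not the actual e2e/kubectl scaler code, and the function name and the exact "synced" condition are assumptions for illustration.

// scaleRCAndWait is an illustrative sketch (not the real test code): update
// spec.replicas, then poll until the controller has observed the change.
package e2esketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func scaleRCAndWait(ctx context.Context, c kubernetes.Interface, ns, name string, replicas int32) error {
	rc, err := c.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	rc.Spec.Replicas = &replicas
	if _, err := c.CoreV1().ReplicationControllers(ns).Update(ctx, rc, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Wait for the RC to be "synced": the controller has seen the new
	// generation and the status replica count matches the desired count.
	// If the controller never reacts to the update (the symptom above),
	// this is the wait that times out.
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			cur, err := c.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return cur.Status.ObservedGeneration >= cur.Generation &&
				cur.Status.Replicas == *cur.Spec.Replicas, nil
		})
}

In the failing runs, the update itself succeeds (the PUT returns 200 in the apiserver log) but the controller never acts on it, so a wait like the one above runs out after roughly five minutes, matching the ~5m watch/GET entries in the apiserver logs.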

k8s-github-robot pushed a commit that referenced this issue Sep 5, 2016

Automatic merge from submit-queue

NodeController listing nodes from apiserver cache

Ref #31981

This is addressing this particular failure:
#31981 (comment)
@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5600/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:204
scaling rc load-small-rc-1338 for the first time
Expected error:
    <*errors.errorString | 0xc82267aba0>: {
        s: "error while scaling RC load-small-rc-1338 to 3 replicas: timed out waiting for \"load-small-rc-1338\" to be synced",
    }
    error while scaling RC load-small-rc-1338 to 3 replicas: timed out waiting for "load-small-rc-1338" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:325


wojtek-t commented Sep 6, 2016

The last failure is exactly the same as the other three, so we continue debugging the same issue.
I looked into the logs; here are the interesting ones:

  • the second scale of "load-small-rc-1338" failed
STEP: Scaling replication controller load-small-rc-1338 in namespace e2e-tests-load-30-nodepods-4-aopgp to 3
E0905 19:46:07.879759   10997 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
Sep  5 19:46:07.880: INFO: Unexpected error occurred: error while scaling RC load-small-rc-1338 to 3 replicas: timed out waiting for "load-small-rc-1338" to be synced

from apiserver:

I0906 02:41:07.848684    3170 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-4-aopgp/replicationcontrollers/load-small-rc-1338: (842.549µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/2977d03] 104.155.162.195:36962]
I0906 02:41:07.873576    3170 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-4-aopgp/replicationcontrollers/load-small-rc-1338: (16.311062ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/2977d03] 104.155.162.195:36962]
I0906 02:41:07.878515    3170 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-4-aopgp/replicationcontrollers/load-small-rc-1338: (923.081µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/2977d03] 104.155.162.195:36962]
I0906 02:46:07.880683    3170 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-4-aopgp/replicationcontrollers?fieldSelector=metadata.name%3Dload-small-rc-1338&resourceVersion=120794: (4m59.991065451s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/2977d03] 104.155.162.195:36967]

from controller:

wojtekt@wojtekt-work:~/Downloads$ cat controller.txt | grep "load-small-rc-1338"
...
I0906 02:36:34.355485    3209 replication_controller.go:323] Observed updated replication controller load-small-rc-1338. Pod count change: 5->5
I0906 02:41:07.875292    3209 replication_controller.go:323] Observed updated replication controller load-small-rc-1338. Pod count change: 5->5
I0906 02:41:07.875340    3209 replication_controller.go:323] Observed updated replication controller load-small-rc-1338. Pod count change: 5->5
...

So at the time the scale from 5 to 3 happened, the ReplicationController did receive some update. However, the log we have isn't exactly what we need, so there are two possibilities:

  • the update that we received is not correct (i.e. it's not the 5->3 scale)
    or
  • the replication controller is incorrectly processing the update

#32100 is supposed to fix the log so that we can debug it further.
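
For illustration, this is the kind of logging change involved (a hypothetical handler, not the actual code changed by #32100): log the change in the desired replica count from the spec rather than the observed pod count, so that the next failure shows whether the 5->3 update actually reached the controller.

// rcdebug sketches the improved debug log: report the change in *desired*
// replicas (spec.replicas) instead of the observed pod count.
package rcdebug

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// onRCUpdate is a hypothetical informer update handler.
func onRCUpdate(old, cur interface{}) {
	oldRC := old.(*v1.ReplicationController)
	curRC := cur.(*v1.ReplicationController)
	// With this log, a scale from 5 to 3 shows up as "5->3" even if the
	// controller then mishandles it, separating "update never arrived"
	// from "update arrived but was processed incorrectly".
	klog.V(4).Infof("Observed updated replication controller %s. Desired pod count change: %d->%d",
		curRC.Name, *oldRC.Spec.Replicas, *curRC.Spec.Replicas)
}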

k8s-github-robot pushed a commit that referenced this issue Sep 6, 2016
Automatic merge from submit-queue

Fix debugging log in RC

Ref #31981

wojtek-t commented Sep 6, 2016

For now, we are waiting for the next failure (this time with the correct logs) to narrow down the reason for the failure.

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/5610/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:204
scaling rc load-small-rc-616 for the first time
Expected error:
    <*errors.errorString | 0xc820aec820>: {
        s: "error while scaling RC load-small-rc-616 to 4 replicas: timed out waiting for \"load-small-rc-616\" to be synced",
    }
    error while scaling RC load-small-rc-616 to 4 replicas: timed out waiting for "load-small-rc-616" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:325


wojtek-t commented Sep 6, 2016

OK - so we seem to have enough logs to start real debugging now.
So from the last failure:

  • we were scaling from 5 to 4
  • apiserver logs:
I0906 11:24:03.475652    3170 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-2-avmfa/replicationcontrollers/load-small-rc-616: (798.599µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/c79a07f] 104.197.28.108:52638]
I0906 11:24:03.504478    3170 handlers.go:162] PUT /api/v1/namespaces/e2e-tests-load-30-nodepods-2-avmfa/replicationcontrollers/load-small-rc-616: (21.9822ms) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/c79a07f] 104.197.28.108:52638]
I0906 11:24:03.506345    3170 handlers.go:162] GET /api/v1/namespaces/e2e-tests-load-30-nodepods-2-avmfa/replicationcontrollers/load-small-rc-616: (827.917µs) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/c79a07f] 104.197.28.108:52641]
I0906 11:29:03.515409    3170 handlers.go:162] GET /api/v1/watch/namespaces/e2e-tests-load-30-nodepods-2-avmfa/replicationcontrollers?fieldSelector=metadata.name%3Dload-small-rc-616&resourceVersion=120229: (5m0.007658819s) 200 [[e2e.test/v1.5.0 (linux/amd64) kubernetes/c79a07f] 104.197.28.108:52638]
  • controller logs:
I0906 11:24:03.523224    3215 replication_controller.go:323] Observed updated replication controller load-small-rc-616. Desired pod count change: 5->4
I0906 11:24:03.523239    3215 replication_controller.go:323] Observed updated replication controller load-small-rc-616. Desired pod count change: 4->5

There weren't any other PUT requests for our replication controller around that time.

From the above logs, it's pretty clear that we correctly deliver the watch event that changes the size of the RC from 5 to 4, but then some new event comes in that scales the RC back from 4 to 5.

So in my opinion, the problem is either in the RC itself or somewhere in the controller framework.

@lavalamp @bprashanth FYI ^^


wojtek-t commented Sep 6, 2016

Note that the difference between processing those two events is 0.15ms, so they are basically processed one right after the other. My guess is that there are two events on the underlying DeltaFIFO and they are simply processed back to back. The second event might potentially be a resync, but I still don't see how that could produce an "incorrect" sync event.


wojtek-t commented Sep 6, 2016

Looking into the RC logs from around that time, there is a huge number of lines like this:

I0906 11:24:03.523163    3215 replication_controller.go:323] Observed updated replication controller load-small-rc-14. Desired pod count change: 4->4

[where the number of replicas didn't change in the spec]
There are ~1600-1700 such lines (which is roughly the number of RCs in the system).

Thus my very strong hypothesis is that there was a Resync at that time.

Which probably means that we have some race between Resync and Update...

@lavalamp ^^


wojtek-t commented Sep 6, 2016

OK - so I think I understand what is happening here.
Basically, I think what happens is this:

  • an Update event comes from the watch
  • it is queued in DeltaFIFO (however, the underlying store still has the old version of the object)
  • a Resync happens
  • the resync event is also queued in DeltaFIFO (but with the object value taken from the underlying store, which is the old version)
  • then those two events are delivered to the RC one by one
    This is what causes the problem.

I think the solution is not to append a "Sync" event if there is already an event queued in DeltaFIFO for that object.
I will send a PR for it soon (hopefully today).
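
To make the suspected race concrete, here is a minimal, self-contained sketch of a DeltaFIFO-like queue (simplified types and names, not the actual client-go implementation) showing how a Resync built from the stale store can land right behind a fresher Update, and how skipping the Sync when deltas are already queued for that key avoids it:

// Minimal sketch of the suspected DeltaFIFO race and the proposed guard.
package main

import "fmt"

type DeltaType string

const (
	Updated DeltaType = "Updated"
	Sync    DeltaType = "Sync"
)

type Delta struct {
	Type   DeltaType
	Object interface{} // here: the RC's desired replica count
}

type deltaFIFO struct {
	items map[string][]Delta      // queued deltas per object key
	store map[string]interface{}  // "known objects": last version already processed
}

func (f *deltaFIFO) queueUpdate(key string, obj interface{}) {
	f.items[key] = append(f.items[key], Delta{Updated, obj})
}

// resync re-queues every known object as a Sync delta built from the
// (possibly stale) store. The guard skips keys that already have deltas
// pending, so a stale Sync can never be queued behind a fresher Update.
func (f *deltaFIFO) resync(skipIfPending bool) {
	for key, obj := range f.store {
		if skipIfPending && len(f.items[key]) > 0 {
			continue
		}
		f.items[key] = append(f.items[key], Delta{Sync, obj})
	}
}

func main() {
	f := &deltaFIFO{
		items: map[string][]Delta{},
		store: map[string]interface{}{"load-small-rc-616": 5}, // old spec: 5 replicas
	}

	f.queueUpdate("load-small-rc-616", 4) // watch delivers the 5->4 scale
	f.resync(false)                       // periodic resync fires before the update is popped

	// Without the guard the consumer sees 5->4 immediately followed by a stale "4->5" Sync.
	fmt.Println("without guard:", f.items["load-small-rc-616"])

	f.items = map[string][]Delta{}
	f.queueUpdate("load-small-rc-616", 4)
	f.resync(true) // with the guard, the stale Sync is skipped
	fmt.Println("with guard:   ", f.items["load-small-rc-616"])
}

This mirrors the controller log shown earlier: a 5->4 update immediately followed by a 4->5 "update", about 0.15ms apart, during a resync sweep over all RCs.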

@lavalamp - FYI


ncdc commented Sep 6, 2016

cc @deads2k @pmorie


wojtek-t commented Sep 6, 2016

OK - I have reproduced it in a unit test, so I'm pretty sure this is the problem.

@kubernetes/sig-api-machinery - FYI


wojtek-t commented Sep 6, 2016

A PR fixing it is coming soon.


wojtek-t commented Sep 6, 2016

The fix is out for review.

@wojtek-t wojtek-t added this to the v1.4 milestone Sep 6, 2016
@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1393/

Failed: [k8s.io] Load capacity [Feature:Performance] should be able to handle 30 pods per node {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:204
scaling rc load-small-rc-2406 for the first time
Expected error:
    <*errors.errorString | 0xc878220b90>: {
        s: "error while scaling RC load-small-rc-2406 to 3 replicas: timed out waiting for \"load-small-rc-2406\" to be synced",
    }
    error while scaling RC load-small-rc-2406 to 3 replicas: timed out waiting for "load-small-rc-2406" to be synced
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/load.go:325

k8s-github-robot pushed a commit that referenced this issue Sep 6, 2016
Automatic merge from submit-queue

Fix race in DeltaFIFO

Fix #31981

@ncdc @deads2k @pmorie @gmarek
k8s-github-robot pushed a commit that referenced this issue Apr 13, 2017
Automatic merge from submit-queue

Reduce replication_controller log spam

Decrease verbosity and reword 'Observed updated replication controller ...' now that the issue it was added for has been fixed.

This was originally added to debug #31981, and it was fixed back in September 2016.

cc @gmarek @wojtek-t @Kargakis @eparis @smarterclayton