
flaky e2e: DaemonRestart Controller Manager should not create/delete replicas across restart #14693

Closed
goltermann opened this issue Sep 29, 2015 · 8 comments
Labels
area/controller-manager kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@goltermann (Contributor):

DaemonRestart Controller Manager should not create/delete replicas across restart (Failed 7 times in the last 30 runs. Stability: 76 %)

@goltermann goltermann added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 29, 2015
@brendandburns brendandburns added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/controller-manager priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 29, 2015
@brendandburns (Contributor):

@davidopp for triage

@fgrzadkowski (Contributor):

I'll look into this.

@fgrzadkowski (Contributor):

To be honest, I can't reproduce this problem. I've run the test ~20 times locally and it passed every time. I checked on Jenkins and it has been passing since September 23rd (it was failing ~50% of the time before that). I'll run it 100 times overnight; if they all pass, I'll move it to kubernetes-e2e-gce.


@bprashanth (Contributor):

Daemon restart cannot run in parallel (noting that your URL has gce-parallel-flaky in it); the controllers will fight each other. But assuming that isn't always the case, I would expect this to happen if the first list after restart somehow returned an inconsistent resource version.

Not a great theory, but otherwise the RC manager should restart and wait on a non-empty RV:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replication/replication_controller.go#L418
The assumption is that if the RV is non-empty, the store has the latest contents. If that isn't the case, the RC can overshoot.
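That overshoot can be illustrated with a small standalone sketch (illustrative Go, not the actual replication manager code; `replicasToCreate` is a hypothetical helper standing in for the manager's burst calculation):

```go
package main

import "fmt"

// replicasToCreate is a toy stand-in for the RC manager's creation burst:
// it diffs desired replicas against what the (possibly stale) store observed.
func replicasToCreate(desired, observed int) int {
	if d := desired - observed; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Store consistent with reality: nothing to create.
	fmt.Println(replicasToCreate(10, 10)) // → 0
	// First list after restart returned stale/empty contents even though the
	// RV was non-empty: the manager would create 10 duplicate pods.
	fmt.Println(replicasToCreate(10, 0)) // → 10
}
```

The real manager guards against this by waiting on a non-empty RV before syncing; the failure mode only appears if the assumption (non-empty RV implies up-to-date store) is violated.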

@fgrzadkowski (Contributor):

@bprashanth I don't understand what you mean by controllers fighting each other. Isn't it the case that gce-parallel runs tests in parallel, but there is only one controller manager? The only risk I see is that multiple tests will try to restart the controller manager, and as a result it will take longer for it to start working.

Can you also explain why it might return an inconsistent resource version?

@davidopp (Member) commented Oct 1, 2015:

Sorry, I accidentally updated #14695.

I talked to @bprashanth about this and he had a pretty good guess in about two seconds about what the problem is.

The ReplicationController sync method does this:

    if !rm.podStoreSynced() {
        // Sleep so we give the pod reflector goroutine a chance to run.
        time.Sleep(PodStoreSyncedPollPeriod)
        glog.Infof("Waiting for pods controller to sync, requeuing rc %v", rc.Name)
        rm.enqueueController(&rc)
        return nil
    }

but DaemonSet controller doesn't. (I guess when DaemonSet was copy-pasted from ReplicationController, this was accidentally removed.)

The pod reflector and the syncer run in separate goroutines, and if the pod reflector hasn't run yet, the syncer will think there are no pods on the node.
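The guard (and the race it prevents) can be sketched as a small standalone Go program; `controller`, `sync`, and the slice-backed queue here are illustrative stand-ins, not the real kube types:

```go
package main

import "fmt"

// controller mimics the RC manager's guard that the DaemonSet controller
// was missing: skip the sync and requeue while the pod store is unsynced.
type controller struct {
	podStoreSynced func() bool // reports whether the pod reflector has listed once
	queue          []string    // stand-in for the controller's work queue
}

func (c *controller) sync(key string) error {
	if !c.podStoreSynced() {
		// The pod reflector runs in a separate goroutine; until its initial
		// list completes, the store looks empty and syncing now would make
		// the controller conclude there are no pods on the node.
		c.queue = append(c.queue, key)
		return nil
	}
	// ... real reconciliation against the (now trustworthy) store ...
	return nil
}

func main() {
	synced := false
	c := &controller{podStoreSynced: func() bool { return synced }}
	c.sync("ns/daemonrestart10") // store unsynced: requeued, no action taken
	synced = true
	c.sync("ns/daemonrestart10") // store synced: reconciliation proceeds
	fmt.Println(len(c.queue)) // → 1 (only the first attempt was requeued)
}
```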

@davidopp davidopp assigned davidopp and unassigned fgrzadkowski Oct 1, 2015
@bprashanth (Contributor):

Oh, hmm, does Daemon restart actually test the daemon controller? I thought it didn't.

> @fgrzadkowski: @bprashanth I don't understand what you mean by controllers fighting each other. Isn't it the case that gce-parallel runs tests in parallel, but there is only one controller manager? The only risk I see is that multiple tests will try to restart the controller manager, and as a result it will take longer for it to start working.

Multiple RCs, I mean, controlling the same pods. If you look in the log David posted, you'll see multiple daemon-restart controllers active simultaneously:

I0930 00:10:49.668342       6 controller_utils.go:137] Controller e2e-tests-daemonrestart-5tstk/daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555 either never recorded expectations, or the ttl expired.
I0930 00:10:49.668459       6 controller_utils.go:147] Setting expectations &{add:10 del:0 key:e2e-tests-daemonrestart-5tstk/daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555}
I0930 00:10:49.668533       6 replication_controller.go:354] Too few "e2e-tests-daemonrestart-5tstk"/"daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555" replicas, need 10, creating 10
I0930 00:10:49.672833       6 controller_utils.go:137] Controller e2e-tests-daemonrestart-oc9tp/daemonrestart10-ab27a4a6-6707-11e5-8bd8-42010af01555 either never recorded expectations, or the ttl expired.

This is probably one from the kubelet half of the test, another from the scheduler, etc., all selecting across the same pod labels and stopping/starting at the same time (I'm guessing).
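The "fighting" boils down to overlapping label selectors: here is a toy Go sketch (not kube code; `selects` is a hypothetical matcher) of two controllers both claiming the same pod:

```go
package main

import "fmt"

// selects reports whether every key/value in the selector matches the
// pod's labels, which is how a controller decides a pod is "its own".
func selects(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pod := map[string]string{"app": "daemonrestart"}
	// Two test halves (kubelet restart, scheduler restart) each running an RC
	// with the same selector: both claim the pod and fight over replica counts.
	rcA := map[string]string{"app": "daemonrestart"}
	rcB := map[string]string{"app": "daemonrestart"}
	fmt.Println(selects(rcA, pod), selects(rcB, pod)) // → true true
}
```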

> Can you also explain why it might return an inconsistent resource version?

If that happened, this is what I'd expect. I don't have evidence that it did.

a-robinson added a commit that referenced this issue Oct 5, 2015
Fix race condition in DaemonSet controller. Fixes #14693.
davidopp pushed a commit to davidopp/kubernetes that referenced this issue Oct 5, 2015
mkulke pushed a commit to mkulke/kubernetes that referenced this issue Oct 7, 2015
RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016
shouhong pushed a commit to shouhong/kubernetes that referenced this issue Feb 14, 2017