
flaky e2e: DaemonRestart Controller Manager should not create/delete replicas across restart #14693

Closed
goltermann opened this issue Sep 29, 2015 · 8 comments
Labels
area/controller-manager kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@goltermann (Contributor):

DaemonRestart Controller Manager should not create/delete replicas across restart (Failed 7 times in the last 30 runs. Stability: 76 %)

@goltermann goltermann added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 29, 2015
@brendandburns brendandburns added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/controller-manager priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 29, 2015
@brendandburns (Contributor):

@davidopp for triage

@fgrzadkowski (Contributor):

I'll look into this.

@fgrzadkowski (Contributor):

To be honest, I can't reproduce this problem. I've run the test ~20 times locally and it passed every time. I checked on Jenkins and it has been passing since September 23rd (it was failing ~50% of the time before that). I'll run it 100 times overnight; if they all pass, I'll move it to kubernetes-e2e-gce.


@bprashanth (Contributor):

Daemon restart cannot run in parallel (noting that your URL has gce-parallel-flaky in it); the controllers will fight each other. But assuming that isn't always the case, I would expect this to happen if the first list after restart somehow returned an inconsistent resource version.

Not a great theory, but otherwise the RC manager should restart and wait on a non-empty RV:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replication/replication_controller.go#L418
The assumption is that if the RV is non-empty, the store has the latest contents. If that isn't the case, the RC can overshoot.
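That overshoot can be illustrated with a small standalone sketch (illustrative Go, not the actual replication manager code; `replicasToCreate` is a hypothetical helper standing in for the manager's burst calculation):

```go
package main

import "fmt"

// replicasToCreate is a toy stand-in for the RC manager's creation burst:
// it diffs desired replicas against what the (possibly stale) store observed.
func replicasToCreate(desired, observed int) int {
	if d := desired - observed; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Store consistent with reality: nothing to create.
	fmt.Println(replicasToCreate(10, 10)) // → 0
	// First list after restart returned stale/empty contents even though the
	// RV was non-empty: the manager would create 10 duplicate pods.
	fmt.Println(replicasToCreate(10, 0)) // → 10
}
```

The real manager guards against this by waiting on a non-empty RV before syncing; the failure mode only appears if the assumption (non-empty RV implies up-to-date store) is violated.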

@fgrzadkowski (Contributor):

@bprashanth I don't understand what you mean by controllers fighting each other. Isn't it the case that gce-parallel runs tests in parallel, but there is only one controller manager? The only risk I see is that multiple tests will try to restart the controller manager, and as a result it will take longer for it to start working.

Can you also explain why it might return an inconsistent resource version?

@davidopp (Member) commented Oct 1, 2015:

Sorry, I accidentally updated #14695.

I talked to @bprashanth about this and he had a pretty good guess in about two seconds about what the problem is.

The ReplicationController sync method does this:

    if !rm.podStoreSynced() {
        // Sleep so we give the pod reflector goroutine a chance to run.
        time.Sleep(PodStoreSyncedPollPeriod)
        glog.Infof("Waiting for pods controller to sync, requeuing rc %v", rc.Name)
        rm.enqueueController(&rc)
        return nil
    }

but DaemonSet controller doesn't. (I guess when DaemonSet was copy-pasted from ReplicationController, this was accidentally removed.)

The pod reflector and the syncer run in separate goroutines, and if the pod reflector hasn't run yet, the syncer will think there are no pods on the node.
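The guard (and the race it prevents) can be sketched as a small standalone Go program; `controller`, `sync`, and the slice-backed queue here are illustrative stand-ins, not the real kube types:

```go
package main

import "fmt"

// controller mimics the RC manager's guard that the DaemonSet controller
// was missing: skip the sync and requeue while the pod store is unsynced.
type controller struct {
	podStoreSynced func() bool // reports whether the pod reflector has listed once
	queue          []string    // stand-in for the controller's work queue
}

func (c *controller) sync(key string) error {
	if !c.podStoreSynced() {
		// The pod reflector runs in a separate goroutine; until its initial
		// list completes, the store looks empty and syncing now would make
		// the controller conclude there are no pods on the node.
		c.queue = append(c.queue, key)
		return nil
	}
	// ... real reconciliation against the (now trustworthy) store ...
	return nil
}

func main() {
	synced := false
	c := &controller{podStoreSynced: func() bool { return synced }}
	c.sync("ns/daemonrestart10") // store unsynced: requeued, no action taken
	synced = true
	c.sync("ns/daemonrestart10") // store synced: reconciliation proceeds
	fmt.Println(len(c.queue)) // → 1 (only the first attempt was requeued)
}
```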

@davidopp davidopp assigned davidopp and unassigned fgrzadkowski Oct 1, 2015
@bprashanth (Contributor):

Oh, hmm, does Daemon restart actually test the daemon controller? I thought it didn't.

> @fgrzadkowski: @bprashanth I don't understand what you mean by controllers fighting each other. Isn't it the case that gce-parallel runs tests in parallel, but there is only one controller manager? The only risk I see is that multiple tests will try to restart the controller manager, and as a result it will take longer for it to start working.

Multiple RCs, I mean, controlling the same pods. If you look in the log David posted, you'll see multiple daemon-restart controllers active simultaneously:

I0930 00:10:49.668342       6 controller_utils.go:137] Controller e2e-tests-daemonrestart-5tstk/daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555 either never recorded expectations, or the ttl expired.
I0930 00:10:49.668459       6 controller_utils.go:147] Setting expectations &{add:10 del:0 key:e2e-tests-daemonrestart-5tstk/daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555}
I0930 00:10:49.668533       6 replication_controller.go:354] Too few "e2e-tests-daemonrestart-5tstk"/"daemonrestart10-ab27fd87-6707-11e5-add1-42010af01555" replicas, need 10, creating 10
I0930 00:10:49.672833       6 controller_utils.go:137] Controller e2e-tests-daemonrestart-oc9tp/daemonrestart10-ab27a4a6-6707-11e5-8bd8-42010af01555 either never recorded expectations, or the ttl expired.

This is probably one from the kubelet half of the test, another from the scheduler, etc., all selecting across the same pod labels and stopping/starting at the same time (I'm guessing).
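The "fighting" boils down to overlapping label selectors: here is a toy Go sketch (not kube code; `selects` is a hypothetical matcher) of two controllers both claiming the same pod:

```go
package main

import "fmt"

// selects reports whether every key/value in the selector matches the
// pod's labels, which is how a controller decides a pod is "its own".
func selects(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pod := map[string]string{"app": "daemonrestart"}
	// Two test halves (kubelet restart, scheduler restart) each running an RC
	// with the same selector: both claim the pod and fight over replica counts.
	rcA := map[string]string{"app": "daemonrestart"}
	rcB := map[string]string{"app": "daemonrestart"}
	fmt.Println(selects(rcA, pod), selects(rcB, pod)) // → true true
}
```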

> Can you also explain why it might return an inconsistent resource version?

If that happened, this is what I'd expect. I don't have evidence that it did.

a-robinson added a commit that referenced this issue Oct 5, 2015
Fix race condition in DaemonSet controller. Fixes #14693.
davidopp pushed a commit to davidopp/kubernetes that referenced this issue Oct 5, 2015
mkulke pushed a commit to mkulke/kubernetes that referenced this issue Oct 7, 2015
RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016
shouhong pushed a commit to shouhong/kubernetes that referenced this issue Feb 14, 2017