flaky e2e: DaemonRestart Controller Manager should not create/delete replicas across restart #14693
@davidopp for triage
I'll look into this.
To be honest, I can't reproduce this problem. I've run the test ~20 times locally and every run passed. I checked on Jenkins and it has been passing since 23rd September (it was failing ~50% of the time before that). I'll run it 100 times overnight. If they pass I'll move it to
This did fail again here
Daemon restart cannot run in parallel (note that your URL has gce-parallel-flaky in it), because the controllers will fight each other. But assuming that isn't always the case, I would expect this to happen if the first list after restart somehow returned an inconsistent resource version. Not a great theory, but otherwise the rc manager should restart and wait on a non-empty RV.
@bprashanth I don't understand what you mean by controllers fighting each other. Doesn't gce-parallel run tests in parallel while there is still only one controller manager? The only risk I see is that multiple tests will try to restart the controller manager, so it will take longer for it to start working. Can you also explain why it might return an inconsistent resource version?
Sorry, I accidentally updated #14695. I talked to @bprashanth about this and he had a pretty good guess in about two seconds about what the problem is. ReplicationController sync method does this:
but the DaemonSet controller doesn't. (I guess when DaemonSet was copy-pasted from ReplicationController, this was accidentally removed.) The pod reflector and the syncer run in separate goroutines, and if the pod reflector hasn't run yet, the syncer will think there are no pods on the node.
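To illustrate the kind of guard being described, here is a minimal sketch. It is not the actual Kubernetes code: the `controller` type, its fields, the `action` values, and the `guard` flag are all hypothetical names made up for this example. It only shows why a sync that runs before the pod cache is populated would decide to create pods, and how requeuing until the cache is synced avoids that.

```go
package main

import "fmt"

// action is the (hypothetical) decision a sync pass makes for one key.
type action string

const (
	actionRequeue action = "requeue" // cache not ready, try again later
	actionCreate  action = "create"  // fewer pods seen than desired
	actionNone    action = "none"    // nothing to do
)

// controller is a hypothetical, stripped-down stand-in for a controller
// whose pod cache is filled by a separate goroutine (the reflector).
type controller struct {
	podStoreSynced func() bool // reports whether the initial pod list has completed
	cachedPods     int         // number of matching pods currently in the cache
	queue          []string    // keys waiting to be synced again
}

// sync decides what to do for one key. With guard set, it refuses to act
// on an unsynced cache and requeues the key instead.
func (c *controller) sync(key string, desired int, guard bool) action {
	if guard && !c.podStoreSynced() {
		c.queue = append(c.queue, key)
		return actionRequeue
	}
	if c.cachedPods < desired {
		return actionCreate
	}
	return actionNone
}

func main() {
	synced := false
	c := &controller{podStoreSynced: func() bool { return synced }}

	// Right after a restart the cache is empty, so an unguarded sync
	// believes no pods exist and wants to create replacements.
	fmt.Println(c.sync("rc", 3, false)) // create

	// The guarded sync waits for the reflector instead.
	fmt.Println(c.sync("rc", 3, true)) // requeue

	// Once the reflector has listed the 3 existing pods, nothing changes.
	synced = true
	c.cachedPods = 3
	fmt.Println(c.sync("rc", 3, true)) // none
}
```

In this sketch the unguarded path is the one that would show up as replicas being created across a restart.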
Oh, hmm, does Daemon restart actually test the daemon controller? I thought it didn't.
Multiple RCs I mean, controlling the same pods. So if you look in the log David posted you'll see multiple daemon-restart controllers active simultaneously:
This is probably one from the kubelet half of the test, another from the scheduler, etc., all selecting across the same pod labels and stopping/starting at the same time (I'm guessing).
If that happened, this is what I'd expect. I don't have evidence that it did.
Fix race condition in DaemonSet controller. Fixes #14693.
DaemonRestart Controller Manager should not create/delete replicas across restart (Failed 7 times in the last 30 runs. Stability: 76 %)