Unable to revert deployment to previous version #43948
Comments
Is this from local testing? Can you post the full log from the controller manager? Can you also post the output of: kubectl get deploy,rs,pods -l kube-dns_label_here -n kube-system -o yaml
It seems that the controller manager was fast enough to process the kube-dns deployment in less than 0.2 seconds, before the cache got updated with the replica set. This can happen when your queues are empty. We need to verify that rate-limiting actually works and maybe revisit its settings. We probably also want to bump the retries to something higher; the deployment controller is the only one of the workload controllers that drops objects out of its queue after some retry count.
@kubernetes/sig-apps-bugs @deads2k
Rate-limiting works fine; we need to add more retries.
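For readers following along, here is a minimal sketch of the retry pattern being discussed, built on client-go's rate-limited workqueue. This is not the actual deployment controller source; the function name handleSyncError is illustrative, and maxRetries uses the higher budget suggested later in this thread.

```go
package main

import (
	"errors"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 15 // the higher retry budget suggested later in this thread

// handleSyncError requeues failed keys with backoff until a retry budget
// is exhausted, after which the key is dropped from the queue entirely.
func handleSyncError(queue workqueue.RateLimitingInterface, key string, err error) {
	if err == nil {
		queue.Forget(key) // success: reset the per-key retry counter
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		queue.AddRateLimited(key) // requeue with exponential backoff
		return
	}
	// Budget exhausted: the key is dropped. If the failure was a transient
	// cache race (as suspected above), the deployment then stays stuck
	// until something else triggers a resync.
	queue.Forget(key)
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	handleSyncError(q, "kube-system/kube-dns", errors.New("replica set not yet in cache"))
	fmt.Println("requeues so far:", q.NumRequeues("kube-system/kube-dns"))
}
```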
Thanks for looking into it - I was able to reproduce. My suspicion is that it is not a race, but rather that it is caused by newly introduced fields across versions. (I inserted some sections for sanity.)
The error was
I am 95% sure (sadly kubectl edit doesn't appear in scrollback) that the problem is that the optional: true field had been dropped. I then re-added it. My hypothesis (though I don't understand the code) is that the hash (1321724180) is based on the configuration as applied, which might not accurately reflect the values actually set, either after a cluster downgrade & upgrade or when using an older version of kubectl/k8s originally (I am unsure exactly). So when I then set the manifest to re-add the missing field, it collided with the existing hashed value. Is that plausible?
The hash is based on the Deployment PodTemplateSpec. If the Deployment controller can't find a ReplicaSet that has a semantically deep-equal PodTemplateSpec, it will create a new ReplicaSet by hashing the Deployment template (for the new RS name). There is a slight chance you will hit a hash collision with the current algo, but that seems to be a problem only when there are hundreds of old ReplicaSets: #29735
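A minimal sketch of the flow just described, using a toy template type (the real controller hashes the full v1.PodTemplateSpec; all names here are illustrative): the controller first looks for an existing ReplicaSet whose template is semantically deep-equal, and only if none matches does it hash the template to name a new one. Note how dropping a single field changes the hash, which is consistent with the downgrade hypothesis above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"reflect"
)

// Toy stand-in for v1.PodTemplateSpec; the real type is much larger.
type PodTemplate struct {
	Image    string
	Optional string // "" models the field being dropped by an older API version
}

type ReplicaSet struct {
	Name     string
	Template PodTemplate
}

// hashTemplate mimics deriving a new ReplicaSet name from template contents.
func hashTemplate(t PodTemplate) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", t) // deterministic for this value-only struct
	return h.Sum32()
}

// findOrCreate returns an existing RS with a deep-equal template, or a new one.
func findOrCreate(existing []ReplicaSet, deploy string, t PodTemplate) ReplicaSet {
	for _, rs := range existing {
		if reflect.DeepEqual(rs.Template, t) {
			return rs // adopt the semantically identical ReplicaSet
		}
	}
	return ReplicaSet{Name: fmt.Sprintf("%s-%d", deploy, hashTemplate(t)), Template: t}
}

func main() {
	full := PodTemplate{Image: "kube-dns:1.14", Optional: "true"}
	dropped := PodTemplate{Image: "kube-dns:1.14"} // optional field lost on downgrade

	rs1 := findOrCreate(nil, "kube-dns", full)
	rs2 := findOrCreate([]ReplicaSet{rs1}, "kube-dns", dropped)
	fmt.Println(rs1.Name, rs2.Name) // different names: the dropped field changed the hash
}
```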
You re-added the missing optional: true field. Can you try to patch the Deployment controller to use more retries (15 is a good number) and retry the upgrade-downgrade?
Ok, this seems like a collision: https://www.diffchecker.com/E4CxPOdr |
I want a concrete timeline here:
@kubernetes/sig-apps-bugs |
Automatic merge from submit-queue

[1.5] Update deployment retries to a saner count

Safe-guard for failures like #43948
Automatic merge from submit-queue

Switch Deployments to new hashing algo w/ collision avoidance mechanism

Implements kubernetes/community#477

@kubernetes/sig-apps-api-reviews @kubernetes/sig-apps-pr-reviews

Fixes #29735
Fixes #43948

```release-note
Deployments are updated to use (1) a more stable hashing algorithm (fnv) than the previous one (adler) and (2) a hashing collision avoidance mechanism that will ensure new rollouts will not block on hashing collisions anymore.
```
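The collision avoidance mechanism can be sketched as follows (illustrative only, not the exact kubernetes/kubernetes code): a per-Deployment collisionCount is mixed into the hash input, so when a freshly computed name collides with an existing, non-equivalent ReplicaSet, the controller bumps the counter and gets a different name on the next sync instead of retrying the same doomed create forever.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// computeHash mixes an optional collisionCount into the FNV hash of the
// template, loosely mirroring the mechanism added by the PR above.
func computeHash(template string, collisionCount *int32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s", template)
	if collisionCount != nil {
		fmt.Fprintf(h, "%d", *collisionCount) // extra entropy on collision
	}
	return h.Sum32()
}

func main() {
	tpl := "kube-dns pod template"
	var count int32

	base := computeHash(tpl, &count)
	count++ // pretend the first name collided with an old ReplicaSet
	retry := computeHash(tpl, &count)

	// The bumped counter yields a new name, unblocking the rollout.
	fmt.Printf("kube-dns-%d -> kube-dns-%d\n", base, retry)
}
```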
K8s 1.6.0: When I try setting a deployment back to a version that already existed (i.e. A -> B -> A), I get the following error logged in the kube-controller-manager (k-c-m):
(In my case, I was changing the kube-dns config map from optional: true -> optional: false -> optional: true, around a 1.5 -> 1.6 -> 1.5 upgrade / downgrade)
This is particularly problematic because the new deployment is then not configured - the pod retains its old configuration.
I was able to reproduce this repeatedly, but then I deleted the replicaset and the system was then able to recover and I could no longer reproduce it.