Unable to update cluster from 1.21 -> 1.22, kops controller crashlooping in AWS #12249
/cc @johngmyers

If I understand things correctly, it's currently impossible to update old clusters; all masters are failing.

I recreated the cluster, but now it somehow works? There is no .pem file that it tried to look for.

I cannot reproduce this anymore; let's reopen if needed.
This is a valid problem; it does not always happen. I think I found the problem: the DaemonSet says that we should use the 1.22.0-beta.1 image, yet the crashlooping pod on that node is using the OLD image?! I deleted the pod and now it uses the newer image. So it might be that addons are updated after the first master is updated (and the old version of the pod is already started before the addons are updated on the new master)? A rough way to check for this image mismatch is sketched below.
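A quick client-go sketch (my own diagnostic, not part of kops) to compare the image pinned in the kops-controller DaemonSet against the images its pods are actually running. The namespace and the `k8s-app=kops-controller` label selector are assumptions based on the default kops addon layout:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	// Image the DaemonSet wants every node to run.
	ds, err := client.AppsV1().DaemonSets("kube-system").Get(ctx, "kops-controller", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("DaemonSet image:", ds.Spec.Template.Spec.Containers[0].Image)

	// Images actually running per node; label selector is an assumption.
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=kops-controller",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s on node %s runs image %s\n",
			p.Name, p.Spec.NodeName, p.Spec.Containers[0].Image)
	}
}
```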
We have now updated quite a few clusters; this problem is not present in OpenStack. But it does exist in all AWS clusters: the first rolling update always fails with this problem.
I am a bit in "I don't understand how this can happen, but it happens" mode. Seems related to #12299 (comment)
So after quite a lot of digging: this happens because we removed the version from addons. 1.21 channels will check if a version is set; if it is not, it will skip the update! That means a 1.21 channels binary will skip applying the new, version-less addon manifests (a paraphrased sketch of that check follows this comment). A couple of ways to go about this:

- Add the Version field to addons again, but set it to a fixed, high version (e.g. 99.99.99) for all addons. This is most likely the simplest fix.
- Bundle channels into the kops binary and let …

/kind office-hours
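A paraphrased sketch of the version check being described, assuming a simplified stand-in type; this is not the actual kops source (see the channel_version.go link below for that):

```go
package channels

import "github.com/blang/semver/v4"

// ChannelVersion is a simplified stand-in for the struct kops channels
// uses; only the Version field matters for this sketch.
type ChannelVersion struct {
	Version *string
}

// replaces paraphrases the 1.21 behavior described above: an addon
// manifest without a Version is never considered newer than what is
// already installed, so the update is silently skipped.
func replaces(existing, updated *ChannelVersion) bool {
	if updated.Version == nil {
		return false // version removed from the addon -> update skipped
	}
	if existing.Version == nil {
		return true
	}
	return semver.MustParse(*updated.Version).GT(semver.MustParse(*existing.Version))
}
```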
@olemarkus Can we hardcode the version string in the manifest(s) only?
Yeah. That was my idea. Set the version unconditionally before we encode the YAML (sketched below).
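A minimal sketch of that idea, assuming a hypothetical AddonSpec type (not the real kops addon builder): pin every addon to the fixed, high version right before the manifest is encoded, so old channels binaries always see a version and never skip the update:

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// AddonSpec is a hypothetical stand-in for the kops addon spec type.
type AddonSpec struct {
	Name    string `json:"name"`
	Version string `json:"version,omitempty"`
}

func encodeAddon(addon AddonSpec) ([]byte, error) {
	// Set the version unconditionally before encoding, so 1.21 channels
	// always sees a version and applies the manifest.
	addon.Version = "99.99.99"
	return yaml.Marshal(addon)
}

func main() {
	out, _ := encodeAddon(AddonSpec{Name: "kops-controller.addons.k8s.io"})
	fmt.Print(string(out))
}
```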
Meant in the …
What we need to ensure is that the following condition does not happen: https://github.com/kubernetes/kops/blob/release-1.21/channels/pkg/channels/channel_version.go#L110-L113. As far as I know, the manifests themselves are not related to this.
This also breaks dns-controller (a Deployment with 1 replica), which breaks our upgrade tests. We have the fix in kops 1.23, but because of this bug the manifest won't be applied by the old control-plane instance, only by the new one, and a pod that can't be garbage collected breaks the rolling update. KCM logs show PodGC issues; the same lines are repeated every 40 seconds. This causes the ReplicaSet operations by the Deployment controller to fail, preventing the Deployment's rolling update from progressing, which prevents the new dns-controller pod from being scheduled on the new control-plane node. A possible workaround is sketched below.
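A hedged workaround sketch, not an official kops procedure: force-delete the stuck dns-controller pod that PodGC cannot clean up, so the Deployment's rolling update can schedule the replacement on the new control-plane node. The pod name here is a placeholder:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Grace period 0 bypasses the normal termination wait for a pod
	// whose node no longer exists.
	zero := int64(0)
	err = client.CoreV1().Pods("kube-system").Delete(context.Background(),
		"dns-controller-xxxxxxxxxx-yyyyy", // placeholder pod name
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
	if err != nil {
		panic(err)
	}
}
```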
I'm not sure this is the best approach, but my solution was: …
Hi, we ran into this upgrading from 1.21.10 to 1.22.13 with kops version 1.22.6 on AWS.

Overall this is a very jarring experience, and I wish this were handled better by kops. We are only brave enough to use the corresponding kops version for a given upgrade because we got bitten by the …
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.22 beta 1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.22.1
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
I am updating a cluster from Kubernetes 1.21.x (installed using kops 1.21) to 1.22.1 (using kops 1.22 beta 1).
5. What happened after the commands executed?
I see kops-controller crashlooping on the new 1.22.1 master.
6. What did you expect to happen?
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?