Unable to update cluster from 1.21 -> 1.22, kops controller crashlooping in AWS #12249

Closed
zetaab opened this issue Sep 2, 2021 · 15 comments · Fixed by #12416
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/office-hours
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone
v1.22

Comments

@zetaab (Member) commented Sep 2, 2021

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.22 beta 1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.22.1

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

I am updating a cluster from Kubernetes 1.21.x (installed with kops 1.21) to 1.22.1 (using kops 1.22.0-beta.1).

5. What happened after the commands executed?

The kops-controller pod on the new 1.22.1 master is crashlooping:

kops-controller-fzdhz                                                    0/1     CrashLoopBackOff   8 (4m33s ago)   20m   10.124.44.193     ip-10-124-44-193.eu-central-1.compute.internal    <none>           <none>
% kubectl logs kops-controller-fzdhz
E0902 06:26:05.813281       1 deleg.go:144] setup "msg"="unable to start server" "error"="reading \"kubernetes-ca\" certificate: open /etc/kubernetes/kops-controller/pki/kubernetes-ca.pem: no such file or directory"

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2021
@zetaab zetaab added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 2, 2021
@olemarkus (Member)

/cc @johngmyers

If I understand things correctly, kops update should fix the file path for the new master.
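
For context, the usual upgrade flow being discussed here, with cluster name and state-store flags omitted (a typical sequence, not taken from this issue):

kops update cluster --yes          # writes the new addon manifests (including the kops-controller addon) to the state store
kops rolling-update cluster --yes  # then replaces the control-plane instances one by one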

@zetaab zetaab added this to the v1.22 milestone Sep 3, 2021
@zetaab (Member, Author) commented Sep 3, 2021

It's currently impossible to update old clusters; all masters are failing.

@zetaab (Member, Author) commented Sep 5, 2021

I recreated the cluster but now it somehow works?

root@ip-10-124-48-35:/etc/kubernetes/kops-controller# ls -l
total 20
-rw------- 1 kops-controller root  410 Sep  5 13:00 keypair-ids.yaml
-rw-r--r-- 1 kops-controller root 1204 Sep  5 13:00 kops-controller.crt
-rw------- 1 kops-controller root 1679 Sep  5 13:00 kops-controller.key
-rw------- 1 kops-controller root 1082 Sep  5 13:00 kubernetes-ca.crt
-rw------- 1 kops-controller root 1679 Sep  5 13:00 kubernetes-ca.key

There is no .pem file like the one it is trying to read.

@zetaab (Member, Author) commented Sep 5, 2021

I cannot reproduce this anymore; let's reopen if needed.

@zetaab zetaab closed this as completed Sep 5, 2021
@zetaab zetaab reopened this Sep 5, 2021
@zetaab (Member, Author) commented Sep 5, 2021

This is a valid problem; it does not happen every time.

I think I found the problem:

% kubectl get ds kops-controller -o yaml|grep "image:"
        image: k8s.gcr.io/kops/kops-controller:1.22.0-beta.1

So the DaemonSet says that we should use the 1.22.0-beta.1 image.

% kubectl get pods -o wide|grep kops
kops-controller-44tdp                                                     1/1     Running            0          56m   10.124.82.136     ip-10-124-82-136.eu-central-1.compute.internal    <none>           <none>
kops-controller-5fl68                                                     0/1     CrashLoopBackOff   8          21m   10.124.36.190     ip-10-124-36-190.eu-central-1.compute.internal    <none>           <none>
kops-controller-g87jn                                                     1/1     Running            0          55m   10.124.107.120    ip-10-124-107-120.eu-central-1.compute.internal   <none>           <none>

As we can see, the crashlooping pod is on node ip-10-124-36-190.eu-central-1.compute.internal.

% kubectl get node ip-10-124-36-190.eu-central-1.compute.internal
NAME                                             STATUS   ROLES                  AGE   VERSION
ip-10-124-36-190.eu-central-1.compute.internal   Ready    control-plane,master   22m   v1.22.1
% kubectl describe pod kops-controller-5fl68|grep image
  Normal   Pulling    22m                   kubelet            Pulling image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3"
  Normal   Pulled     22m                   kubelet            Successfully pulled image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3" in 27.47047876s
  Normal   Pulled     20m (x4 over 22m)     kubelet            Container image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3" already present on machine

So it's using the OLD image?! I deleted the pod and now it uses the newer image. So it might be that addons are updated after the first master is updated (and the old version of the pod has already started before the addons are updated on the new master)?
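
A minimal sketch of this check and workaround (assuming kops-controller runs in the kube-system namespace, which is the default; pod names are placeholders):

# Image the DaemonSet wants after the kops update
kubectl -n kube-system get ds kops-controller -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# Images the pods are actually running
kubectl -n kube-system get pods -o wide | grep kops-controller
kubectl -n kube-system describe pod <crashlooping-pod> | grep -i image
# Deleting the stale pod makes the DaemonSet (OnDelete update strategy) recreate it with the new image
kubectl -n kube-system delete pod <crashlooping-pod>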

@zetaab (Member, Author) commented Sep 15, 2021

We have now updated quite a few clusters. This problem is not present on OpenStack, but it does exist in all AWS clusters: the first rolling update always fails with this problem.

@olemarkus (Member)

I am a bit in "don't understand how this can happen, but it happens" mode. Seems related to #12299 (comment)

@olemarkus (Member)

So, after quite a lot of digging: this happens because we removed the version field from addons.

The 1.21 channels tool checks whether a version is set; if it is not, it skips the update!

That means that a kops update from 1.22 will not be applied until after the first master rolls. This leads to a number of addons breaking, especially kops-controller and other DaemonSets with the OnDelete update strategy. It also delays fixing the upgrade test, as the dns-controller change needs to be applied before the master is rolled.

A couple of ways to go about this:

Add the Version field to addons again, but set it to a fixed, high version (e.g. 99.99.99) for all addons. This is most likely the simplest fix.

Bundle channels into the kops binary and let kops update also do a channels run (if the API is available). This is a larger change, but also has some benefits: channel updates are applied immediately after kops update, and a lot more information about addons is made more easily available to users. This also resolves the common support question of "how do I trigger a reinstall of an addon".
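
For anyone wanting to confirm this on their own cluster, a quick way to inspect what the addons channel in the state store actually contains (bucket and cluster name are placeholders, and the addons/bootstrap-channel.yaml path is an assumption about the state-store layout, so adjust it to whatever you see under addons/ in your bucket):

# Look at which addon entries still carry a version field after the kops 1.22 update
aws s3 cp s3://<state-store-bucket>/<cluster-name>/addons/bootstrap-channel.yaml - | grep -E 'name:|version:'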

/kind office-hours

@hakman (Member) commented Sep 23, 2021

Add the Version field to addons again, but set it to a fixed, high version (e.g 99.99.99) to all addons. This is most likely the simplest fix.

@olemarkus Can we hardcode the version string in the manifest(s) only?

@olemarkus (Member)

Yeah. That was my idea. Set the version unconditionally before we encode the yaml.

@hakman (Member) commented Sep 23, 2021

I meant in upup/models/cloudup/resources/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml.template, for example, not in code.

@olemarkus (Member)

What we need to ensure is that the following condition does not happen: https://github.com/kubernetes/kops/blob/release-1.21/channels/pkg/channels/channel_version.go#L110-L113

As far as I know, the manifests themselves are not related to this.

@rifelpet (Member) commented Sep 23, 2021

This also breaks dns-controller (a Deployment with 1 replica), which breaks our upgrade tests:

  • dns-controller used to tolerate all taints; this was fixed in "Add specific taints to dns-controller" (#12389), which will be in kops 1.23.
  • During kops rolling-update cluster, when the single control-plane instance is cordoned and drained prior to termination, the dns-controller pod gets rescheduled onto it after draining because it can tolerate the cordon taint. kops then deletes the Node object and terminates the instance.
  • With a non-static, non-DaemonSet pod scheduled on the node being deleted, KCM's PodGC controller on the new instance gets confused and is unable to delete the old dns-controller pod, preventing a replacement dns-controller pod from ever being scheduled. This means the API DNS record never gets updated to point to the new control plane instance, and cluster validation inevitably fails.

We have the fix in kops 1.23, but because of this bug the manifest won't be applied by the old control plane instance, only by the new one, and by then the pod that can't be garbage collected already exists, which breaks the rolling update.

KCM Logs showing PodGC issues. These lines are repeated every 40 seconds:

I0921 19:08:00.574823 1 gc_controller.go:182] Found orphaned Pod kube-system/dns-controller-56b8dc9b5b-sb7wv assigned to the Node ip-172-20-36-55.ap-northeast-1.compute.internal. Deleting.
I0921 19:08:00.574847 1 gc_controller.go:78] PodGC is force deleting Pod: kube-system/dns-controller-56b8dc9b5b-sb7wv
E0921 19:08:00.576118 1 gc_controller.go:184] pods "dns-controller-56b8dc9b5b-sb7wv" not found

This causes the ReplicaSet operations by the deployment controller to fail, preventing the Deployment's rolling update from progressing, which prevents the new dns-controller pod from being scheduled on the new control plane node:

E0921 19:08:44.888767 1 deployment_controller.go:495] Operation cannot be fulfilled on replicasets.apps "dns-controller-56b8dc9b5b": the object has been modified; please apply your changes to the latest version and try again
I0921 19:08:44.888986 1 deployment_controller.go:496] "Dropping deployment out of the queue" deployment="kube-system/dns-controller" err="Operation cannot be fulfilled on replicasets.apps \"dns-controller-56b8dc9b5b\": the object has been modified; please apply your changes to the latest version and try again"
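
A hedged manual workaround for this symptom (not the actual fix, which is #12389) is to force-delete the orphaned pod so the Deployment controller can schedule a replacement on the new control-plane node; the pod name below is the one from the logs above:

kubectl -n kube-system get pods -o wide | grep dns-controller
# Remove the orphaned pod that PodGC keeps failing to clean up
kubectl -n kube-system delete pod dns-controller-56b8dc9b5b-sb7wv --force --grace-period=0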

@naeri-kailash

I'm not sure this is the best approach, but my solution was:

  1. SSH into the new master node with the working kops-controller pod running on it and move into /etc/kubernetes/kops-controller, which will have the following files:
keypair-ids.yaml
kubernetes-ca.key
kubernetes-ca.crt
  2. SSH into both of the old master nodes and navigate to /etc/kubernetes/kops-controller. Copy the files from the master node in step 1 into this folder.
  3. Restart the errored kops-controller pods, and they start working again (see the sketch below).
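
Roughly the same steps as shell commands, run from a machine that can reach the masters (hostnames and the pod name are placeholders; this is only a sketch of the manual copy above, and you may need to fix file ownership afterwards to match the originals, which are owned by the kops-controller user):

mkdir -p ./kops-controller-pki
# Pull the CA material from the working (new) master
scp root@<new-master>:/etc/kubernetes/kops-controller/keypair-ids.yaml \
    root@<new-master>:/etc/kubernetes/kops-controller/kubernetes-ca.crt \
    root@<new-master>:/etc/kubernetes/kops-controller/kubernetes-ca.key \
    ./kops-controller-pki/
# Push it to an old master
scp ./kops-controller-pki/* root@<old-master>:/etc/kubernetes/kops-controller/
# Restart the errored pod so it picks up the copied files
kubectl -n kube-system delete pod <crashlooping-kops-controller-pod>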

@xgt001 commented Sep 22, 2022

Hi, we ran into this upgrading from 1.21.10 to 1.22.13 on AWS, with kops version 1.22.6.
I believe the issue is with the add-on being updated in the S3 store but not being applied.
The sequence of events:

  1. When the new master comes up in a rolling update, it looks for an updated addon config with the signing CA kubernetes-ca in /etc/kubernetes/kops-controller/, only to not find it, and it enters a crash loop.
  2. You need to work around this by downloading the addon file from AWS S3, at a path that should look similar to this:
    aws s3 cp s3://cluster-bucket-name/cluster-name/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml .
  3. Apply the addon so that the new master can pick up the right path for the signing CA (see the sketch after this list).
  4. This is where we got stumped: cluster validation failed because one of the kops-controller pods on an older master started crash looping, because it somehow read this updated 1.22 YAML configuration (!!!).
  5. Luckily we had versioning enabled on our S3 bucket, so we could feed the older kops-controller the addon config for the older version (1.21) from the same bucket.
  6. For the rolling update of the next master, we had to apply the addon from 1.22 again. From this point on, however, we didn't have to revert to 1.21's addons.yaml.
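
A condensed sketch of steps 2 and 3 above (bucket and cluster name are placeholders, exactly as in the path shown in step 2):

# Download the updated addon manifest from the kops state store and apply it to the cluster
aws s3 cp s3://cluster-bucket-name/cluster-name/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml .
kubectl apply -f k8s-1.16.yaml
# If a kops-controller pod on an older master then starts crash looping (step 4), restore the
# previous object version of this file from the versioned S3 bucket (step 5)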

Overall this is a very jarring experience, and I wish this were handled better by kops. We are only brave enough to use the corresponding kops version for a given upgrade, because we got bitten by the /srv/kubernetes path changes going from 1.20 to 1.21 when we tried to use the latest kops version...
