Unable to update cluster from 1.21 -> 1.22, kops controller crashlooping in AWS #12249

Closed
zetaab opened this issue Sep 2, 2021 · 15 comments · Fixed by #12416
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/office-hours
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone
v1.22

Comments

@zetaab (Member) commented Sep 2, 2021

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.22 beta 1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.22.1

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

I am updating a cluster from Kubernetes 1.21.x (installed with kops 1.21) to 1.22.1 (using kops 1.22.0-beta.1).

5. What happened after the commands executed?

The kops-controller pod on the new 1.22.1 master is crashlooping:

kops-controller-fzdhz                                                    0/1     CrashLoopBackOff   8 (4m33s ago)   20m   10.124.44.193     ip-10-124-44-193.eu-central-1.compute.internal    <none>           <none>
% kubectl logs kops-controller-fzdhz
E0902 06:26:05.813281       1 deleg.go:144] setup "msg"="unable to start server" "error"="reading \"kubernetes-ca\" certificate: open /etc/kubernetes/kops-controller/pki/kubernetes-ca.pem: no such file or directory"

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2021
@zetaab zetaab added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 2, 2021
@olemarkus (Member)

/cc @johngmyers

If I understand things correctly, kops update should fix the file path for the new master.
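
For context, the usual upgrade flow being discussed here, with cluster name and state-store flags omitted (a typical sequence, not taken from this issue):

kops update cluster --yes          # writes the new addon manifests (including the kops-controller addon) to the state store
kops rolling-update cluster --yes  # then replaces the control-plane instances one by one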

@zetaab zetaab added this to the v1.22 milestone Sep 3, 2021
@zetaab (Member, Author) commented Sep 3, 2021

It's currently impossible to update old clusters; all masters are failing.

@zetaab (Member, Author) commented Sep 5, 2021

I recreated the cluster but now it somehow works?

root@ip-10-124-48-35:/etc/kubernetes/kops-controller# ls -l
total 20
-rw------- 1 kops-controller root  410 Sep  5 13:00 keypair-ids.yaml
-rw-r--r-- 1 kops-controller root 1204 Sep  5 13:00 kops-controller.crt
-rw------- 1 kops-controller root 1679 Sep  5 13:00 kops-controller.key
-rw------- 1 kops-controller root 1082 Sep  5 13:00 kubernetes-ca.crt
-rw------- 1 kops-controller root 1679 Sep  5 13:00 kubernetes-ca.key

There is no .pem file like the one it is trying to read.

@zetaab (Member, Author) commented Sep 5, 2021

I cannot reproduce this anymore; let's reopen if needed.

@zetaab zetaab closed this as completed Sep 5, 2021
@zetaab zetaab reopened this Sep 5, 2021
@zetaab (Member, Author) commented Sep 5, 2021

This is a valid problem; it does not happen every time.

I think I found the problem:

% kubectl get ds kops-controller -o yaml|grep "image:"
        image: k8s.gcr.io/kops/kops-controller:1.22.0-beta.1

So the DaemonSet says that we should use the 1.22.0-beta.1 image.

% kubectl get pods -o wide|grep kops
kops-controller-44tdp                                                     1/1     Running            0          56m   10.124.82.136     ip-10-124-82-136.eu-central-1.compute.internal    <none>           <none>
kops-controller-5fl68                                                     0/1     CrashLoopBackOff   8          21m   10.124.36.190     ip-10-124-36-190.eu-central-1.compute.internal    <none>           <none>
kops-controller-g87jn                                                     1/1     Running            0          55m   10.124.107.120    ip-10-124-107-120.eu-central-1.compute.internal   <none>           <none>

As we can see, the crashlooping pod is on node ip-10-124-36-190.eu-central-1.compute.internal.

% kubectl get node ip-10-124-36-190.eu-central-1.compute.internal
NAME                                             STATUS   ROLES                  AGE   VERSION
ip-10-124-36-190.eu-central-1.compute.internal   Ready    control-plane,master   22m   v1.22.1
% kubectl describe pod kops-controller-5fl68|grep image
  Normal   Pulling    22m                   kubelet            Pulling image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3"
  Normal   Pulled     22m                   kubelet            Successfully pulled image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3" in 27.47047876s
  Normal   Pulled     20m (x4 over 22m)     kubelet            Container image "k8s.gcr.io/kops/kops-controller:1.21.0-alpha.3" already present on machine

So it's using the OLD image?! I deleted the pod and now it uses the newer image. So it might be that addons are updated after the first master is updated (and the old version of the pod has already started before the addons are updated on the new master)?
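
A minimal sketch of this check and workaround (assuming kops-controller runs in the kube-system namespace, which is the default; pod names are placeholders):

# Image the DaemonSet wants after the kops update
kubectl -n kube-system get ds kops-controller -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# Images the pods are actually running
kubectl -n kube-system get pods -o wide | grep kops-controller
kubectl -n kube-system describe pod <crashlooping-pod> | grep -i image
# Deleting the stale pod makes the DaemonSet (OnDelete update strategy) recreate it with the new image
kubectl -n kube-system delete pod <crashlooping-pod>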

@zetaab (Member, Author) commented Sep 15, 2021

We have now updated quite a few clusters. This problem is not present on OpenStack, but it does exist in all AWS clusters: the first rolling update always fails with this problem.

@olemarkus (Member)

I am a bit in "don't understand how this can happen, but it happens" mode. Seems related to #12299 (comment)

@olemarkus (Member)

So, after quite a lot of digging: this happens because we removed the version field from addons.

The 1.21 channels tool checks whether a version is set; if it is not, it skips the update!

That means that a kops update from 1.22 will not be applied until after the first master rolls. This leads to a number of addons breaking, especially kops-controller and other DaemonSets with the OnDelete update strategy. It also delays fixing the upgrade test, as the dns-controller change needs to be applied before the master is rolled.

A couple of ways to go about this:

Add the Version field to addons again, but set it to a fixed, high version (e.g. 99.99.99) for all addons. This is most likely the simplest fix.

Bundle channels into the kops binary and let kops update also do a channels run (if the API is available). This is a larger change, but also has some benefits: channel updates are applied immediately after kops update, and a lot more information about addons is made more easily available to users. This also resolves the common support question of "how do I trigger a reinstall of an addon".
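
For anyone wanting to confirm this on their own cluster, a quick way to inspect what the addons channel in the state store actually contains (bucket and cluster name are placeholders, and the addons/bootstrap-channel.yaml path is an assumption about the state-store layout, so adjust it to whatever you see under addons/ in your bucket):

# Look at which addon entries still carry a version field after the kops 1.22 update
aws s3 cp s3://<state-store-bucket>/<cluster-name>/addons/bootstrap-channel.yaml - | grep -E 'name:|version:'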

/kind office-hours

@hakman (Member) commented Sep 23, 2021

Add the Version field to addons again, but set it to a fixed, high version (e.g 99.99.99) to all addons. This is most likely the simplest fix.

@olemarkus Can we hardcode the version string in the manifest(s) only?

@olemarkus (Member)

Yeah. That was my idea. Set the version unconditionally before we encode the yaml.

@hakman (Member) commented Sep 23, 2021

I meant in upup/models/cloudup/resources/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml.template, for example, not in code.

@olemarkus (Member)

What we need to ensure is that the following condition does not happen: https://github.com/kubernetes/kops/blob/release-1.21/channels/pkg/channels/channel_version.go#L110-L113

As far as I know, the manifests themselves are not related to this.

@rifelpet (Member) commented Sep 23, 2021

This also breaks dns-controller (a Deployment with 1 replica), which breaks our upgrade tests:

  • dns-controller used to tolerate all taints; this was fixed in "Add specific taints to dns-controller" (#12389), which will be in kops 1.23.
  • During kops rolling-update cluster, when the single control-plane instance is cordoned and drained prior to termination, the dns-controller pod gets rescheduled onto it after draining because it can tolerate the cordon taint. kops then deletes the Node object and terminates the instance.
  • With a non-static, non-DaemonSet pod scheduled on the node being deleted, KCM's PodGC controller on the new instance gets confused and is unable to delete the old dns-controller pod, preventing a replacement dns-controller pod from ever being scheduled. This means the API DNS record never gets updated to point to the new control plane instance, and cluster validation inevitably fails.

We have the fix in kops 1.23, but because of this bug the manifest won't be applied by the old control plane instance, only by the new one, and by then the pod that can't be garbage collected already exists, which breaks the rolling update.

KCM Logs showing PodGC issues. These lines are repeated every 40 seconds:

I0921 19:08:00.574823 1 gc_controller.go:182] Found orphaned Pod kube-system/dns-controller-56b8dc9b5b-sb7wv assigned to the Node ip-172-20-36-55.ap-northeast-1.compute.internal. Deleting.
I0921 19:08:00.574847 1 gc_controller.go:78] PodGC is force deleting Pod: kube-system/dns-controller-56b8dc9b5b-sb7wv
E0921 19:08:00.576118 1 gc_controller.go:184] pods "dns-controller-56b8dc9b5b-sb7wv" not found

This causes the ReplicaSet operations by the deployment controller to fail, preventing the Deployment's rolling update from progressing, which prevents the new dns-controller pod from being scheduled on the new control plane node:

E0921 19:08:44.888767 1 deployment_controller.go:495] Operation cannot be fulfilled on replicasets.apps "dns-controller-56b8dc9b5b": the object has been modified; please apply your changes to the latest version and try again
I0921 19:08:44.888986 1 deployment_controller.go:496] "Dropping deployment out of the queue" deployment="kube-system/dns-controller" err="Operation cannot be fulfilled on replicasets.apps \"dns-controller-56b8dc9b5b\": the object has been modified; please apply your changes to the latest version and try again"
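
A hedged manual workaround for this symptom (not the actual fix, which is #12389) is to force-delete the orphaned pod so the Deployment controller can schedule a replacement on the new control-plane node; the pod name below is the one from the logs above:

kubectl -n kube-system get pods -o wide | grep dns-controller
# Remove the orphaned pod that PodGC keeps failing to clean up
kubectl -n kube-system delete pod dns-controller-56b8dc9b5b-sb7wv --force --grace-period=0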

@naeri-kailash

I'm not sure this is the best approach, but my solution was:

  1. SSH into the new master node with the working kops-controller pod running on it and move into /etc/kubernetes/kops-controller, which will have the following files:
keypair-ids.yaml
kubernetes-ca.key
kubernetes-ca.crt
  2. SSH into both of the old master nodes and navigate to /etc/kubernetes/kops-controller. Copy the files from the master node in step 1 into this folder.
  3. Restart the errored kops-controller pods, and they start working again (see the sketch below).
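
Roughly the same steps as shell commands, run from a machine that can reach the masters (hostnames and the pod name are placeholders; this is only a sketch of the manual copy above, and you may need to fix file ownership afterwards to match the originals, which are owned by the kops-controller user):

mkdir -p ./kops-controller-pki
# Pull the CA material from the working (new) master
scp root@<new-master>:/etc/kubernetes/kops-controller/keypair-ids.yaml \
    root@<new-master>:/etc/kubernetes/kops-controller/kubernetes-ca.crt \
    root@<new-master>:/etc/kubernetes/kops-controller/kubernetes-ca.key \
    ./kops-controller-pki/
# Push it to an old master
scp ./kops-controller-pki/* root@<old-master>:/etc/kubernetes/kops-controller/
# Restart the errored pod so it picks up the copied files
kubectl -n kube-system delete pod <crashlooping-kops-controller-pod>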

@xgt001 commented Sep 22, 2022

Hi, we ran into this upgrading from 1.21.10 to 1.22.13 on AWS, with kops version 1.22.6.
I believe the issue is with the add-on being updated in the S3 store but not being applied.
The sequence of events:

  1. When the new master comes up in a rolling update, it looks for an updated addon config with the signing CA kubernetes-ca in /etc/kubernetes/kops-controller/, only to not find it, and it enters a crash loop.
  2. You need to work around this by downloading the addon file from AWS S3, at a path that should look similar to this:
    aws s3 cp s3://cluster-bucket-name/cluster-name/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml .
  3. Apply the addon so that the new master can pick up the right path for the signing CA (see the sketch after this list).
  4. This is where we got stumped: cluster validation failed because one of the kops-controller pods on an older master started crash looping, because it somehow read this updated 1.22 YAML configuration (!!!).
  5. Luckily we had versioning enabled on our S3 bucket, so we could feed the older kops-controller the addon config for the older version (1.21) from the same bucket.
  6. For the rolling update of the next master, we had to apply the addon from 1.22 again. From this point on, however, we didn't have to revert to 1.21's addons.yaml.
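
A condensed sketch of steps 2 and 3 above (bucket and cluster name are placeholders, exactly as in the path shown in step 2):

# Download the updated addon manifest from the kops state store and apply it to the cluster
aws s3 cp s3://cluster-bucket-name/cluster-name/addons/kops-controller.addons.k8s.io/k8s-1.16.yaml .
kubectl apply -f k8s-1.16.yaml
# If a kops-controller pod on an older master then starts crash looping (step 4), restore the
# previous object version of this file from the versioned S3 bucket (step 5)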

Overall this is a very jarring experience, and I wish this were handled better by kops. We are only brave enough to use the corresponding kops version for a given upgrade, because we got bitten by the /srv/kubernetes path changes going from 1.20 to 1.21 when we tried to use the latest kops version...
