
not possible to rolling update cluster #9953

Closed
zetaab opened this issue Sep 16, 2020 · 9 comments · Fixed by #9998
Labels: blocks-next, kind/bug, priority/important-soon

Comments

@zetaab
Member

zetaab commented Sep 16, 2020

1. What kops version are you running? The command kops version will display
this information.

1.19 alpha 4

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.19.1

3. What cloud provider are you using?

openstack / aws

4. What commands did you run? What is the simplest way to reproduce this issue?

I am updating clusters from 1.17 / 1.18 -> 1.19.1

kops update cluster --yes && kops rolling-update --yes

5. What happened after the commands executed?

The bastion is rotated fine. However, after the bastion is rotated, something executes the new manifests for kops-controller. This leads to the following situation:

kops-controller-68fc4                                        1/1     Running             2          20d
kops-controller-9hq4s                                        1/1     Running             0          15d
kops-controller-b7jrx                                        0/1     ContainerCreating   0          20m
% kubectl describe pod kops-controller-b7jrx

  Warning  FailedMount  14s (x18 over 20m)   kubelet, master-zone-1-1-1-jannem-k8s-local  MountVolume.SetUp failed for volume "kops-controller-pki" : hostPath type check failed: /etc/kubernetes/kops-controller/ is not a directory

So the new kops-controller manifest is NOT backwards compatible, which means the only way to update kops cluster masters currently is to roll ALL masters at once (or skip cluster validation, or modify the kops-controller manifest and be fast). The folder does not exist (yet) on old masters installed by an older kops version; it is only created by the newest kops version. So the manifest should not be updated before the folder actually exists.
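
For reference, the failing mount corresponds roughly to the hostPath volume sketched below; this is reconstructed from the error message above, not copied from the actual kops-controller manifest:

volumes:
- name: kops-controller-pki
  hostPath:
    path: /etc/kubernetes/kops-controller/
    type: Directory   # this type check fails on masters built by older kops, where the path does not exist yet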

6. What did you expect to happen?

I expect the rolling update to work as usual.

@zetaab zetaab added the kind/bug and priority/important-soon labels on Sep 16, 2020
@zetaab
Member Author

zetaab commented Sep 16, 2020

@johngmyers as you authored #9653, which causes this bug, do you have any ideas how we could fix this properly? My fix is a dirty one and will not work in all cases: there might still be one old kops-controller left after a full kops rolling-update cluster.

This happens only if the cluster was created with a kops version before 1.19, e.g. kops 1.18.

@zetaab
Member Author

zetaab commented Sep 16, 2020

OK, we worked around this in the following way:

kubectl apply -f hack.yaml && sleep 20 && kubectl delete ds folderfix -n kube-system && kubectl delete pods -n kube-system -l k8s-app=kops-controller

where hack.yaml is the following:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: folderfix
  namespace: kube-system
  labels:
    app.kubernetes.io/name: folderfix
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: folderfix
  template:
    metadata:
      labels:
        app.kubernetes.io/name: folderfix
    spec:
      containers:
      - name: folderfix
        # one-shot container: create the directory the new kops-controller manifest expects
        image: busybox
        command: [ "sh", "-c", "mkdir -p /etc/kubernetes/kops-controller/" ]
        volumeMounts:
        - name: files
          mountPath: /etc/kubernetes
      volumes:
      - name: files
        hostPath:
          path: /etc/kubernetes
      # run only on control-plane nodes, which is where kops-controller is scheduled
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists

Warning: tested only on OpenStack.

@johngmyers
Member

Setting an updateStrategy of OnDelete seems appropriate. A static manifest would probably be even more appropriate, but there might be other issues with that.

The hack is likely to result in non-working (or only partially-working) kops-controllers on AWS, as it won't provision the keys/certs that the AWS bootstrap server needs. It might be enough to get cluster validation to pass long enough for the control plane to update. A simpler form of that hack would be to make the kops-controller-pki volume type: DirectoryOrCreate.
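
As an illustration, the two suggestions above (an OnDelete update strategy and a DirectoryOrCreate hostPath type) would look roughly like this in the kops-controller DaemonSet spec; this is a sketch, not the actual manifest:

spec:
  updateStrategy:
    type: OnDelete                  # pods are only replaced once the old pod (or its node) is deleted
  template:
    spec:
      volumes:
      - name: kops-controller-pki
        hostPath:
          path: /etc/kubernetes/kops-controller/
          type: DirectoryOrCreate   # kubelet creates the directory if it is missing instead of failing the mount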

I wonder if we could put an additional nodeSelector on the DaemonSet to keep it from scheduling on old nodes.

As a separate issue, we might want to make it so that bastions don't apply addons. I'm a bit concerned that they even have the credentials to be able to do that. Or is it that the old control plane nodes are picking up and applying the new set of addon manifests?

@johngmyers
Member

If we do go with OnDelete we might need to put a hash derived from the manifest in the NodeupConfig of masters in order to make sure rolling update will deploy any changes.

@zetaab
Member Author

zetaab commented Sep 17, 2020

@johngmyers I can confirm that this workaround does not work on AWS; it only works on OpenStack.

@olemarkus
Member

> If we do go with OnDelete we might need to put a hash derived from the manifest in the NodeupConfig of masters in order to make sure rolling update will deploy any changes.

The same "problem" can exist for DaemonSets too, so it is not only about things running on masters. We could add a nodeSelector to the channels API, and when that selector is set, the channels command would set the kops.k8s.io/needs-update annotation on the matching nodes after running its normal kubectl apply, as sketched below.
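
An illustrative sketch of what channels could do after its apply (the node selector here is just an example, and the annotation value is arbitrary):

kubectl annotate node -l node-role.kubernetes.io/master="" kops.k8s.io/needs-update="$(date +%s)" --overwrite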

@alesanmed

alesanmed commented Oct 8, 2020

@johngmyers So sorry to ask again, but I'm having this exact same problem after upgrading from 1.18.1 to 1.19.0... How can I fix this? :(

PS: Going back to v1.18.1 is definitely an option I tried, but I still get the same problem. Thanks

@olemarkus
Member

@alesanmed can you file a new issue about this?

@alesanmed

@olemarkus Sorry for not replying. I managed to roll back the cluster version and, since I've seen that v1.19 is still an alpha, I'll wait for a stable release. I don't want to bother with issues since I'm sure they're going to fix them. Thanks!!
