
calico-kube-controllers stuck not ready during rolling update. #10737

Closed
jetersen opened this issue Feb 5, 2021 · 4 comments


jetersen commented Feb 5, 2021

1. What kops version are you running? The command kops version will display this information.

Version 1.18.3 (git-11ec695516)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:18:51Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

1.18.10, upgrading to 1.18.15.

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update 😅
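Roughly, the full invocation is along these lines (a sketch only, assuming the cluster name from the manifest below):

```sh
# Stage the version change, then roll the instance groups.
kops update cluster --name my-cluster.company.io --yes
kops rolling-update cluster --name my-cluster.company.io --yes
```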

5. What happened after the commands executed?

Over time I have noticed this on every rolling update, all the way from 1.18.0 to 1.18.1, 1.18.2 and through to 1.18.3: calico-kube-controllers gets into a not-ready state and cluster validation fails. I am not sure what the underlying issue is, but a simple kubectl delete pod --namespace kube-system calico-kube-controllers does the trick.
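The workaround, roughly (label assumed to be the default k8s-app=calico-kube-controllers applied by the kops Calico addon; selecting by label avoids having to look up the pod's hash suffix):

```sh
# Check whether the controller pod is stuck in a not-ready state
kubectl get pods --namespace kube-system -l k8s-app=calico-kube-controllers

# Delete it; the Deployment recreates it and readiness recovers
kubectl delete pod --namespace kube-system -l k8s-app=calico-kube-controllers
```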

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 1
  name: my-cluster.company.io
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole",
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole",
          ],
          "Resource": [
            "*"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/euc1: ""
  cloudProvider: aws
  configBase: s3://clusters.company.io/my-cluster.company.io
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - encryptedVolume: true
          instanceGroup: master-eu-central-1a
          name: a
        - encryptedVolume: true
          instanceGroup: master-eu-central-1b
          name: b
        - encryptedVolume: true
          instanceGroup: master-eu-central-1c
          name: c
      memoryRequest: 100Mi
      name: main
      version: 3.2.24
    - cpuRequest: 100m
      etcdMembers:
        - encryptedVolume: true
          instanceGroup: master-eu-central-1a
          name: a
        - encryptedVolume: true
          instanceGroup: master-eu-central-1b
          name: b
        - encryptedVolume: true
          instanceGroup: master-eu-central-1c
          name: c
      memoryRequest: 100Mi
      name: events
      version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    oidcClientID: kubelogin
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.company.io
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
    - NA
  kubernetesVersion: 1.18.15
  masterPublicName: api.my-cluster.company.io
  networkCIDR: 172.31.0.0/16
  networkID: vpc-9b9c0bf3
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/17
  sshAccess:
    - NA
  subnets:
    - cidr: 172.31.0.0/20
      id: subnet-a6b8e9ce
      name: eu-central-1a
      type: Public
      zone: eu-central-1a
    - cidr: 172.31.16.0/20
      id: subnet-704bd40a
      name: eu-central-1b
      type: Public
      zone: eu-central-1b
    - cidr: 172.31.32.0/20
      id: subnet-1ffad755
      name: eu-central-1c
      type: Public
      zone: eu-central-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:05Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a
  role: Master
  subnets:
    - eu-central-1a

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b
  role: Master
  subnets:
    - eu-central-1b

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c
  role: Master
  subnets:
    - eu-central-1c

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: nodes-large
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: m5.large
  maxPrice: "0.30"
  maxSize: 10
  minSize: 0
  mixedInstancesPolicy:
    instances:
      - m5.large
      - m5a.large
      - m5d.large
      - m5n.large
      - m5ad.large
      - m5dn.large
      - r5.large
      - r5a.large
      - r5d.large
      - r5n.large
      - r5ad.large
      - r5dn.large
      - c5.xlarge
      - c5a.xlarge
      - c5d.xlarge
      - t3.large
      - t3a.large
      - i3.large
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-large
  role: Node
  subnets:
    - eu-central-1a
    - eu-central-1b
  suspendProcesses:
    - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag.
NA

9. Anything else we need to know?


jetersen commented Feb 5, 2021

Here are the logs from calico-kube-controllers:

2021-02-05 07:12:25.710 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0205 07:12:25.712422       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021-02-05 07:12:25.713 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2021-02-05 07:12:25.726 [INFO][1] main.go 149: Getting initial config snapshot from datastore
2021-02-05 07:12:25.747 [INFO][1] main.go 152: Got initial config snapshot
2021-02-05 07:12:25.747 [INFO][1] watchersyncer.go 89: Start called
2021-02-05 07:12:25.747 [INFO][1] main.go 169: Starting status report routine
2021-02-05 07:12:25.747 [INFO][1] main.go 402: Starting controller ControllerType="Node"
2021-02-05 07:12:25.748 [INFO][1] node_controller.go 138: Starting Node controller
2021-02-05 07:12:25.748 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2021-02-05 07:12:25.748 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready
2021-02-05 07:12:25.748 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2021-02-05 07:12:25.749 [INFO][1] resources.go 343: Main client watcher loop
2021-02-05 07:12:25.760 [INFO][1] watchercache.go 297: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2021-02-05 07:12:25.760 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2021-02-05 07:12:25.760 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync
2021-02-05 07:12:25.780 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2021-02-05 07:12:25.848 [INFO][1] node_controller.go 151: Node controller is now running
2021-02-05 07:12:25.848 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:12:25.895 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:12:52.239 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:12:52.284 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-172-31-23-181.eu-central-1.compute.internal) with error: nodes "ip-172-31-23-181.eu-central-1.compute.internal" not found
2021-02-05 07:12:52.284 [INFO][1] ipam.go 137: Checking node calicoNode="ip-172-31-23-181.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:12:52.290 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-172-31-23-181.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:12:52.290 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-172-31-23-181.eu-central-1.compute.internal'
2021-02-05 07:12:52.359 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.79.0/26 host="ip-172-31-23-181.eu-central-1.compute.internal"
2021-02-05 07:12:52.360 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.532a4d6e56e8b6a6f1539a79d55dceafb2aa205015cb7a0547ff515bd1dd2eb1'
2021-02-05 07:12:52.453 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.79.0/26 host="ip-172-31-23-181.eu-central-1.compute.internal"
2021-02-05 07:12:52.548 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:21:40.436 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:21:40.465 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-172-31-46-24.eu-central-1.compute.internal) with error: nodes "ip-172-31-46-24.eu-central-1.compute.internal" not found
2021-02-05 07:21:40.465 [INFO][1] ipam.go 137: Checking node calicoNode="ip-172-31-46-24.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:21:40.470 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-172-31-46-24.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:21:40.470 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-172-31-46-24.eu-central-1.compute.internal'
2021-02-05 07:21:40.514 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.93.0/26 host="ip-172-31-46-24.eu-central-1.compute.internal"
2021-02-05 07:21:40.514 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.2dd04949d2d9abb9169da11efea09c50f063bfb9721b76bc4afeee72ed086c68'
2021-02-05 07:21:40.552 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.93.0/26 host="ip-172-31-46-24.eu-central-1.compute.internal"
2021-02-05 07:21:40.660 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:22:06.473 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:06.473 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:38.474 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:58.475 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:58.475 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:30.475 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:50.476 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:50.476 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:22.477 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:42.478 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:42.478 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:14.481 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:34.483 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:34.483 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:06.483 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:26.484 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:26.484 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:58.484 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:18.485 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:18.485 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:50.485 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:10.486 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:10.486 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:42.487 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded

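The errors at the end suggest the controller loses its connection to the apiserver service IP (100.64.0.1) once the masters are rolled and never recovers until the pod is restarted. The failing readiness probe is visible in the pod events (label assumed to be the default k8s-app=calico-kube-controllers):

```sh
# Readiness probe failures show up in the pod events
kubectl describe pods --namespace kube-system -l k8s-app=calico-kube-controllers

# Sanity check that the kubernetes Service still lists healthy apiserver endpoints
kubectl get endpoints kubernetes --namespace default
```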

hakman commented Feb 7, 2021

@jetersen Most likely you are hitting projectcalico/calico#3751, which will only be fixed in Calico 3.18.
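For anyone hitting the same thing, a quick way to check which calico-kube-controllers version a cluster is actually running (deployment name assumed to be the kops default), and so whether it predates the fix:

```sh
# Print the image of the calico-kube-controllers Deployment; releases before
# v3.18 are affected by projectcalico/calico#3751
kubectl get deployment calico-kube-controllers --namespace kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```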


jetersen commented Feb 7, 2021

@hakman Thanks for linking it; my Google/GitHub search-fu was not good enough to find that issue.

jetersen closed this as completed Feb 7, 2021

hakman commented Feb 7, 2021

No worries, it was well hidden :).
