
calico-kube-controllers stuck not ready during rolling update. #10737

Closed
jetersen opened this issue Feb 5, 2021 · 4 comments


jetersen commented Feb 5, 2021

1. What kops version are you running? The command kops version will display this information.

Version 1.18.3 (git-11ec695516)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:18:51Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

1.18.10, upgrading to 1.18.15.

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update 😅
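Roughly, the full invocation is along these lines (a sketch only, assuming the cluster name from the manifest below):

```sh
# Stage the version change, then roll the instance groups.
kops update cluster --name my-cluster.company.io --yes
kops rolling-update cluster --name my-cluster.company.io --yes
```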

5. What happened after the commands executed?

Over time I have noticed this on every rolling update, all the way from 1.18.0 to 1.18.1, 1.18.2 and through to 1.18.3: calico-kube-controllers gets into a not-ready state and cluster validation fails. I am not sure what the underlying issue is, but a simple kubectl delete pod --namespace kube-system calico-kube-controllers does the trick.
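The workaround, roughly (label assumed to be the default k8s-app=calico-kube-controllers applied by the kops Calico addon; selecting by label avoids having to look up the pod's hash suffix):

```sh
# Check whether the controller pod is stuck in a not-ready state
kubectl get pods --namespace kube-system -l k8s-app=calico-kube-controllers

# Delete it; the Deployment recreates it and readiness recovers
kubectl delete pod --namespace kube-system -l k8s-app=calico-kube-controllers
```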

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 1
  name: my-cluster.company.io
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole",
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole",
          ],
          "Resource": [
            "*"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/euc1: ""
  cloudProvider: aws
  configBase: s3://clusters.company.io/my-cluster.company.io
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - encryptedVolume: true
          instanceGroup: master-eu-central-1a
          name: a
        - encryptedVolume: true
          instanceGroup: master-eu-central-1b
          name: b
        - encryptedVolume: true
          instanceGroup: master-eu-central-1c
          name: c
      memoryRequest: 100Mi
      name: main
      version: 3.2.24
    - cpuRequest: 100m
      etcdMembers:
        - encryptedVolume: true
          instanceGroup: master-eu-central-1a
          name: a
        - encryptedVolume: true
          instanceGroup: master-eu-central-1b
          name: b
        - encryptedVolume: true
          instanceGroup: master-eu-central-1c
          name: c
      memoryRequest: 100Mi
      name: events
      version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    oidcClientID: kubelogin
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.company.io
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
    - NA
  kubernetesVersion: 1.18.15
  masterPublicName: api.my-cluster.company.io
  networkCIDR: 172.31.0.0/16
  networkID: vpc-9b9c0bf3
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/17
  sshAccess:
    - NA
  subnets:
    - cidr: 172.31.0.0/20
      id: subnet-a6b8e9ce
      name: eu-central-1a
      type: Public
      zone: eu-central-1a
    - cidr: 172.31.16.0/20
      id: subnet-704bd40a
      name: eu-central-1b
      type: Public
      zone: eu-central-1b
    - cidr: 172.31.32.0/20
      id: subnet-1ffad755
      name: eu-central-1c
      type: Public
      zone: eu-central-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:05Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a
  role: Master
  subnets:
    - eu-central-1a

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b
  role: Master
  subnets:
    - eu-central-1b

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: master-eu-central-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.medium
  maxPrice: "0.20"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
      - t3.medium
      - t3a.medium
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c
  role: Master
  subnets:
    - eu-central-1c

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-22T21:47:06Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-cluster.company.io
  name: nodes-large
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: m5.large
  maxPrice: "0.30"
  maxSize: 10
  minSize: 0
  mixedInstancesPolicy:
    instances:
      - m5.large
      - m5a.large
      - m5d.large
      - m5n.large
      - m5ad.large
      - m5dn.large
      - r5.large
      - r5a.large
      - r5d.large
      - r5n.large
      - r5ad.large
      - r5dn.large
      - c5.xlarge
      - c5a.xlarge
      - c5d.xlarge
      - t3.large
      - t3a.large
      - i3.large
    onDemandAboveBase: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-large
  role: Node
  subnets:
    - eu-central-1a
    - eu-central-1b
  suspendProcesses:
    - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag.
NA

9. Anything else we need to know?


jetersen commented Feb 5, 2021

Here are the logs from calico-kube-controllers:

2021-02-05 07:12:25.710 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0205 07:12:25.712422       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021-02-05 07:12:25.713 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2021-02-05 07:12:25.726 [INFO][1] main.go 149: Getting initial config snapshot from datastore
2021-02-05 07:12:25.747 [INFO][1] main.go 152: Got initial config snapshot
2021-02-05 07:12:25.747 [INFO][1] watchersyncer.go 89: Start called
2021-02-05 07:12:25.747 [INFO][1] main.go 169: Starting status report routine
2021-02-05 07:12:25.747 [INFO][1] main.go 402: Starting controller ControllerType="Node"
2021-02-05 07:12:25.748 [INFO][1] node_controller.go 138: Starting Node controller
2021-02-05 07:12:25.748 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2021-02-05 07:12:25.748 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready
2021-02-05 07:12:25.748 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2021-02-05 07:12:25.749 [INFO][1] resources.go 343: Main client watcher loop
2021-02-05 07:12:25.760 [INFO][1] watchercache.go 297: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2021-02-05 07:12:25.760 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2021-02-05 07:12:25.760 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2021-02-05 07:12:25.760 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync
2021-02-05 07:12:25.780 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2021-02-05 07:12:25.848 [INFO][1] node_controller.go 151: Node controller is now running
2021-02-05 07:12:25.848 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:12:25.895 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:12:52.239 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:12:52.284 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-172-31-23-181.eu-central-1.compute.internal) with error: nodes "ip-172-31-23-181.eu-central-1.compute.internal" not found
2021-02-05 07:12:52.284 [INFO][1] ipam.go 137: Checking node calicoNode="ip-172-31-23-181.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:12:52.290 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-172-31-23-181.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:12:52.290 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-172-31-23-181.eu-central-1.compute.internal'
2021-02-05 07:12:52.359 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.79.0/26 host="ip-172-31-23-181.eu-central-1.compute.internal"
2021-02-05 07:12:52.360 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.532a4d6e56e8b6a6f1539a79d55dceafb2aa205015cb7a0547ff515bd1dd2eb1'
2021-02-05 07:12:52.453 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.79.0/26 host="ip-172-31-23-181.eu-central-1.compute.internal"
2021-02-05 07:12:52.548 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:21:40.436 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-02-05 07:21:40.465 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-172-31-46-24.eu-central-1.compute.internal) with error: nodes "ip-172-31-46-24.eu-central-1.compute.internal" not found
2021-02-05 07:21:40.465 [INFO][1] ipam.go 137: Checking node calicoNode="ip-172-31-46-24.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:21:40.470 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-172-31-46-24.eu-central-1.compute.internal" k8sNode=""
2021-02-05 07:21:40.470 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-172-31-46-24.eu-central-1.compute.internal'
2021-02-05 07:21:40.514 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.93.0/26 host="ip-172-31-46-24.eu-central-1.compute.internal"
2021-02-05 07:21:40.514 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.2dd04949d2d9abb9169da11efea09c50f063bfb9721b76bc4afeee72ed086c68'
2021-02-05 07:21:40.552 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.64.93.0/26 host="ip-172-31-46-24.eu-central-1.compute.internal"
2021-02-05 07:21:40.660 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2021-02-05 07:22:06.473 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:06.473 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:38.474 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:58.475 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:22:58.475 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:30.475 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:50.476 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:23:50.476 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:22.477 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:42.478 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:24:42.478 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:14.481 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:34.483 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:25:34.483 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:06.483 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:26.484 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:26.484 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:26:58.484 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:18.485 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:18.485 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:27:50.485 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:10.486 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:10.486 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-02-05 07:28:42.487 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded

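The errors at the end suggest the controller loses its connection to the apiserver service IP (100.64.0.1) once the masters are rolled and never recovers until the pod is restarted. The failing readiness probe is visible in the pod events (label assumed to be the default k8s-app=calico-kube-controllers):

```sh
# Readiness probe failures show up in the pod events
kubectl describe pods --namespace kube-system -l k8s-app=calico-kube-controllers

# Sanity check that the kubernetes Service still lists healthy apiserver endpoints
kubectl get endpoints kubernetes --namespace default
```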

hakman commented Feb 7, 2021

@jetersen Most likely you are hitting projectcalico/calico#3751, which will only be fixed in Calico 3.18.
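For anyone hitting the same thing, a quick way to check which calico-kube-controllers version a cluster is actually running (deployment name assumed to be the kops default), and so whether it predates the fix:

```sh
# Print the image of the calico-kube-controllers Deployment; releases before
# v3.18 are affected by projectcalico/calico#3751
kubectl get deployment calico-kube-controllers --namespace kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```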


jetersen commented Feb 7, 2021

@hakman Thanks for linking it; my Google/GitHub search-fu was not good enough to find that issue.

jetersen closed this as completed Feb 7, 2021

hakman commented Feb 7, 2021

No worries, it was well hidden :).
