
Rolling-update fails due to calico-node with 1.12.0-beta.2 #6784

Closed
edsonmarquezani opened this issue Apr 16, 2019 · 4 comments

Comments

edsonmarquezani commented Apr 16, 2019

1. What kops version are you running?
Version 1.12.0-beta.2 (git-d1453d22a)

2. What Kubernetes version are you running?
1.12.7

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  • Create a brand new cluster
  • Change any configuration
  • Run a rolling-update
kops rolling-update cluster --yes
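For completeness, the full sequence looks roughly like this (a sketch; it assumes the state store and cluster name are already set via the usual KOPS_STATE_STORE / --name environment and flags):

# edit the cluster spec (any change will do, e.g. the kubeAPIServer block below)
kops edit cluster

# push the changed spec to the state store and cloud resources
kops update cluster --yes

# roll the instances so they pick up the new configuration
kops rolling-update cluster --yes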

5. What happened after the commands executed?
The rolling update fails at the first node (a master instance) because calico-node never becomes ready.

6. What did you expect to happen?
The rolling update to complete without errors.

7. Please provide your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: ************
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeInstanceStatus"],
          "Resource": "*"
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeInstanceStatus"],
          "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://************************
  dnsZone: ************
  etcdClusters:
  - cpuRequest: 200m
    enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-us-east-1a-1
      name: a-1
    - instanceGroup: master-us-east-1c-1
      name: c-1
    - instanceGroup: master-us-east-1b-1
      name: b-1
    memoryRequest: 100Mi
    name: main
    version: 3.2.24
  - cpuRequest: 100m
    enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-us-east-1a-1
      name: a-1
    - instanceGroup: master-us-east-1c-1
      name: c-1
    - instanceGroup: master-us-east-1b-1
      name: b-1
    memoryRequest: 100Mi
    name: events
    version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.************
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
    kubeletCgroups: /systemd/system.slice
    runtimeCgroups: /systemd/system.slice
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.7
  masterKubelet:
    kubeletCgroups: /systemd/system.slice
    runtimeCgroups: /systemd/system.slice
  masterPublicName: api.************
  networkCIDR: 10.21.0.0/16
  networkID: vpc-xxxxxxxxxxxxxxxxx
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.x.x.x/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 10.x.x.x/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  - cidr: 10.x.x.0/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

The issue at a glance:

  • A fresh setup with the 1.12 branch works without problems.
  • After updating the cluster spec with authentication and admission-control parameters (kubeAPIServer), the rolling update fails right on the first master being updated.
[...]
I0416 15:51:31.284819   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
I0416 15:52:00.575052   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
I0416 15:52:30.055031   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
E0416 15:52:58.178054   29901 instancegroups.go:214] Cluster did not validate within 15m0s

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"15m0s\""
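For anyone trying to reproduce, the failing check can also be watched by hand while the rolling update runs (a sketch of generic commands, not output from this run):

# re-run the same validation kops performs between nodes
kops validate cluster

# watch the calico-node pods; the one on the freshly replaced master stays NotReady
kubectl -n kube-system get pods -o wide | grep calico-node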

The new master joins the cluster, but calico-node never becomes ready. The readiness check reports:

Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.x.x.x,10.x.x.x,10.x.x.x

The log doesn't show any ERROR messages, though, only INFO:

2019-04-16 19:00:42.229 [INFO][42] health.go 150: Overall health summary=&health.HealthReport{Live:true, Ready:true}
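The probe message and log line above come from ordinary pod inspection, roughly as follows (the pod name is from this run and will differ):

# shows the failing readiness probe events ("BGP not established with ...")
kubectl -n kube-system describe pod calico-node-5w2lp

# the container log only shows INFO-level health summaries
kubectl -n kube-system logs calico-node-5w2lp -c calico-node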

Deleting the pod and letting it be recreated seems to solve the problem.
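Concretely, the workaround amounts to deleting the stuck pod and letting the DaemonSet controller recreate it, e.g.:

# delete the stuck pod; the calico-node DaemonSet recreates it and BGP comes up
kubectl -n kube-system delete pod calico-node-5w2lp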

Any ideas?

edsonmarquezani (Author)

This seems to be related to projectcalico/calico#2211.

edsonmarquezani (Author)

Someone on Calico's GitHub suggested upgrading to Calico v3.6 to fix this. Would that be possible here?

caiohasouza commented May 28, 2019

I have the same issue on 1.12.1.

edsonmarquezani (Author)

#7249 (release 1.12.3) may have solved it.

I haven't confirmed it myself yet, but I'm closing the issue since there has been no response from anyone on the project in almost 3 months.
