
Rolling-update fails due to calico-node with 1.12.0-beta.2 #6784

Closed
edsonmarquezani opened this issue Apr 16, 2019 · 4 comments

Comments

edsonmarquezani commented Apr 16, 2019

1. What kops version are you running?
Version 1.12.0-beta.2 (git-d1453d22a)

2. What Kubernetes version are you running?
1.12.7

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  • Create a brand new cluster
  • Change any configuration
  • Run a rolling-update
kops rolling-update cluster --yes
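For completeness, the full sequence looks roughly like this (a sketch; it assumes the state store and cluster name are already set via the usual KOPS_STATE_STORE / --name environment and flags):

# edit the cluster spec (any change will do, e.g. the kubeAPIServer block below)
kops edit cluster

# push the changed spec to the state store and cloud resources
kops update cluster --yes

# roll the instances so they pick up the new configuration
kops rolling-update cluster --yes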

5. What happened after the commands executed?
The rolling update fails at the first node (a master instance) because calico-node never becomes ready.

6. What did you expect to happen?
The rolling update to complete without errors.

7. Please provide your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: ************
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeInstanceStatus"],
          "Resource": "*"
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeInstanceStatus"],
          "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://************************
  dnsZone: ************
  etcdClusters:
  - cpuRequest: 200m
    enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-us-east-1a-1
      name: a-1
    - instanceGroup: master-us-east-1c-1
      name: c-1
    - instanceGroup: master-us-east-1b-1
      name: b-1
    memoryRequest: 100Mi
    name: main
    version: 3.2.24
  - cpuRequest: 100m
    enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-us-east-1a-1
      name: a-1
    - instanceGroup: master-us-east-1c-1
      name: c-1
    - instanceGroup: master-us-east-1b-1
      name: b-1
    memoryRequest: 100Mi
    name: events
    version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.************
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
    kubeletCgroups: /systemd/system.slice
    runtimeCgroups: /systemd/system.slice
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.7
  masterKubelet:
    kubeletCgroups: /systemd/system.slice
    runtimeCgroups: /systemd/system.slice
  masterPublicName: api.************
  networkCIDR: 10.21.0.0/16
  networkID: vpc-xxxxxxxxxxxxxxxxx
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 10.x.x.x/21
    id: subnet-xxxxxxxxxxxxxxxxx
    name: node-us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.x.x.x/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 10.x.x.x/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  - cidr: 10.x.x.0/23
    id: subnet-xxxxxxxxxxxxxxxxx
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

The issue at a glance:

  • A fresh setup with the 1.12 branch works without problems.
  • After updating the cluster spec with authentication and admission-control parameters (kubeAPIServer), the rolling update fails right on the first master being updated.
[...]
I0416 15:51:31.284819   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
I0416 15:52:00.575052   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
I0416 15:52:30.055031   29901 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: kube-system pod "calico-node-5w2lp" is not ready (calico-node).
E0416 15:52:58.178054   29901 instancegroups.go:214] Cluster did not validate within 15m0s

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"15m0s\""
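For anyone trying to reproduce, the failing check can also be watched by hand while the rolling update runs (a sketch of generic commands, not output from this run):

# re-run the same validation kops performs between nodes
kops validate cluster

# watch the calico-node pods; the one on the freshly replaced master stays NotReady
kubectl -n kube-system get pods -o wide | grep calico-node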

The new master joins the cluster, but calico-node never becomes ready. The readiness check reports:

Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.x.x.x,10.x.x.x,10.x.x.x

The log doesn't show any ERROR messages, though, only INFO:

2019-04-16 19:00:42.229 [INFO][42] health.go 150: Overall health summary=&health.HealthReport{Live:true, Ready:true}
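The probe message and log line above come from ordinary pod inspection, roughly as follows (the pod name is from this run and will differ):

# shows the failing readiness probe events ("BGP not established with ...")
kubectl -n kube-system describe pod calico-node-5w2lp

# the container log only shows INFO-level health summaries
kubectl -n kube-system logs calico-node-5w2lp -c calico-node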

Deleting the pod and letting it be recreated seems to solve the problem.
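Concretely, the workaround amounts to deleting the stuck pod and letting the DaemonSet controller recreate it, e.g.:

# delete the stuck pod; the calico-node DaemonSet recreates it and BGP comes up
kubectl -n kube-system delete pod calico-node-5w2lp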

Any ideas?

edsonmarquezani (Author)

This seems to be related to projectcalico/calico#2211.

edsonmarquezani (Author)

Someone on Calico's GitHub suggested upgrading to Calico v3.6 to fix this. Would that be possible here?

caiohasouza commented May 28, 2019

I have the same issue on 1.12.1.

edsonmarquezani (Author)

#7249 (release 1.12.3) may have solved it.

I haven't confirmed it myself yet, but I'm closing the issue since there has been no response from anyone on the project in almost 3 months.
