
calico-kube-controllers No etcd endpoints specified in etcdv3 API config #7093

Closed
Nuru opened this issue Jun 4, 2019 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Nuru

Nuru commented Jun 4, 2019

1. What kops version are you running? The command kops version will display this information.

Version 1.12.1 (git-e1c317f9c)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Honestly, I have somewhat lost track. We created the cluster using kops 1.12.0-beta.2 to install Kubernetes 1.12.7 with calico networking. We upgraded the cluster a few times to get to kops 1.12.1 and Kubernetes 1.12.8.

5. What happened after the commands executed?

Somewhere along the line, calico-kube-controllers stopped working, with the error "No etcd endpoints specified in etcdv3 API config".

Note that we started the cluster at Kubernetes 1.12.7 with etcd3, so there was no major upgrade like etcd v2 -> v3.
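
For reference, the errors show up in the controller logs, which can be pulled with something like the following (assuming the usual calico-kube-controllers Deployment and label in kube-system):

  kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
  kubectl -n kube-system logs deployment/calico-kube-controllers --tail=20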

6. What did you expect to happen?
I expected calico-kube-controllers to be properly configured to use etcd3, or, if calico-kube-controllers is no longer needed (as this comment suggests), I expected that to be documented well enough that I can show whoever is concerned that the missing configuration is not an issue that needs to be fixed.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: example.io
spec:
  additionalPolicies:
    bastion: |
      [
        {
           "Effect": "Allow",
            "Action": [
               "ec2:DescribeTags"
             ],
              "Resource": "*"
        }
      ]
    master: |
      [
        {
           "Effect": "Allow",
            "Action": [
               "ec2:DescribeTags"
             ],
              "Resource": "*"
        }
      ]
    node: |
      [
        {
           "Effect": "Allow",
           "Action": [
               "ec2:DescribeTags"
           ],
           "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Cluster: <redacted>
  cloudProvider: aws
  configBase: s3://<redacted>
  dnsZone: <redacted>
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
  hooks:
  - manifest: |
      Type=oneshot
      ExecStart=/bin/sh -c '/sbin/iptables -t nat -A PREROUTING -d 169.254.169.254/32 \
          -i cali+ -p tcp -m tcp --dport 80 -j DNAT \
          --to-destination $(curl -s http://169.254.169.254/latest/meta-data/local-ipv4):8181'
    name: kiam-iptables.service
    roles:
    - Node
  iam:
    legacy: true
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    - Initializers
    - DenyEscalatingExec
    anonymousAuth: false
    authorizationMode: RBAC
    oidcClientID: <redacted>
    oidcGroupsClaim: <redacted>
    oidcGroupsPrefix: 'oidc:'
    oidcIssuerURL: https://<redacted>
    oidcUsernameClaim: <redacted>
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterPublicName: <redacted>.io
  networkCIDR: 10.105.0.0/17
  networkID: vpc-<redacted>
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.105.0.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.105.16.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.105.32.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.105.48.0/20
    id: subnet-<redacted>
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.105.64.0/20
    id: subnet-<redacted>
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.105.80.0/20
    id: subnet-<redacted>
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    bastion:
      bastionPublicName: <redacted>
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-us-west-2a
  - utility-us-west-2b
  - utility-us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2a
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2b
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2c
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: nodes
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 3
  minSize: 3
  role: Node
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c

9. Anything else do we need to know?
What does it mean to run calico "in CRD mode"? I cannot find that anywhere in the Calico documentation.

@rifelpet
Member

rifelpet commented Jun 6, 2019

I haven't used calico but I believe "CRD mode" means that calico stores its state through the Kubernetes API Server using custom resource definitions rather than storing state directly on etcd. I think that might be the DATASTORE_TYPE variable here.

Can you confirm that setting DATASTORE_TYPE=kubernetes fixes the issue? I'll see if we can add some clarity to the Kops documentation.
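
If it helps, here is an untested sketch for checking and changing that variable on a live cluster, assuming the controller is deployed as the usual calico-kube-controllers Deployment in kube-system:

  kubectl -n kube-system get deployment calico-kube-controllers \
    -o jsonpath='{.spec.template.spec.containers[0].env}'
  kubectl -n kube-system set env deployment/calico-kube-controllers DATASTORE_TYPE=kubernetes

Keep in mind that kops manages these manifests, so a change made this way may be reverted on the next kops update.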

@Nuru
Author

Nuru commented Jun 6, 2019

@rifelpet Thanks for your contribution to the discussion.

I don't know how or where to set DATASTORE_TYPE=kubernetes using kops, and since kops is managing everything else to do with Calico, I do not want to configure it via some other mechanism.

Also, even if DATASTORE_TYPE=kubernetes makes the error message go away, it does not answer the question of whether or not calico-kube-controllers needs to run at all (and why or why not).

@opusmagnum

opusmagnum commented Aug 26, 2019

After upgrading to kops 1.12.3 I had the following situation:

  • calico-node version 3.7.4 running with env DATASTORE_TYPE: kubernetes
  • calico-kube-controllers version 1.0.3 producing millions of errors -- Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

It seems that downscaling of the calico-kube-controllers Deployment via the manifest that says

# This manifest scales the Calico Kubernetes controllers down to size 0.

doesn't work properly?

I downscaled this deployment manually and the error messages stopped, without losing any Calico functionality. If I understood correctly, this functionality, previously implemented as a separate set of controllers, is now built into calico-node.
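
(For reference, the manual downscale amounts to something along these lines, assuming the Deployment lives in kube-system under its default name:

  kubectl -n kube-system scale deployment calico-kube-controllers --replicas=0
  kubectl -n kube-system get deployment calico-kube-controllers

The second command should then report 0 desired replicas.)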

@grv231

grv231 commented Oct 22, 2019

@opusmagnum have you encountered any new errors? I have started to see new errors in protokube.service after downscaling the deployment to zero. Just wanted to check whether you have encountered anything like that as well.

@Nuru
Author

Nuru commented Oct 23, 2019

UPDATES:

"CRD mode" means that Calico stores its state information via the Kubernetes API in Kubernetes Custom Resources, rather than, as previously, storing the information directly in etcd via its API.

According to this PR, calico-kube-controllers does still need to be run even in CRD mode. Apparently its job is to remove resources when they are no longer needed.

I suppose this bug was fixed along the way somewhere. Any version that includes that PR (see the list here) should be OK.
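
A quick way to confirm that a cluster is running in CRD mode is to look for the Calico custom resource definitions in the Kubernetes API (the exact set of CRDs varies by Calico version, so treat this only as a rough check):

  kubectl get crd | grep projectcalico.org
  kubectl get ippools.crd.projectcalico.org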

@opusmagnum
Copy link

Thank you for the explanation, @Nuru! If a cluster node is removed, the Calico daemon on that node is removed as well, so no additional cleanup is necessary, is it? @Nuru, which kinds of resources should be removed, and under which circumstances?

@grv231 Since downscaling calico-kube-controllers I haven't noticed any negative effects (apart from the masses of errors, which stopped), but I have to check again in light of the explanation (last comment) from @Nuru, just to be sure that the "remove resources" part, after cluster/node downscaling or similar, is working as well.

@Nuru
Author

Nuru commented Oct 28, 2019

@opusmagnum I'm not sure exactly what resources should be removed and when, but it seems that at least ipamblocks and blockaffinities need to be removed when a node is removed.
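
To check for leftovers, something like the following should list those resources (assuming the Calico CRDs live under the usual crd.projectcalico.org group); entries referring to nodes that no longer show up in kubectl get nodes would be candidates for cleanup:

  kubectl get ipamblocks.crd.projectcalico.org
  kubectl get blockaffinities.crd.projectcalico.org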

@grv231

grv231 commented Oct 28, 2019

@opusmagnum The issue started happening for us at random about 3-4 weeks after migrating the cluster to version 1.12.8. I can confirm that the issue did not come up soon after migrating to this version (I guess because I was moving up from 1.12.0 --> 1.12.8). Somewhere along the line, the changes were not picked up and I had to scale the calico-kube-controllers deployment down because the cluster was not getting validated (which raised other errors in the protokube service after 3-4 weeks).

This weekend I migrated the cluster to 1.13 and this has resolved the issues. However, @Nuru, I see a significant increase in the amount of logging from the etcd-manager pods; the previous etcd-server-events pod logged less (as far as I can see in Kibana). Is this expected behavior? The logs all seem to be non-error messages.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
