
calico-kube-controllers No etcd endpoints specified in etcdv3 API config #7093

Closed
Nuru opened this issue Jun 4, 2019 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Nuru

Nuru commented Jun 4, 2019

1. What kops version are you running? The command kops version will display this information.

Version 1.12.1 (git-e1c317f9c)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Honestly, I have somewhat lost track. We created the cluster using kops 1.12.0-beta.2 to install Kubernetes 1.12.7 with calico networking. We upgraded the cluster a few times to get to kops 1.12.1 and Kubernetes 1.12.8.

5. What happened after the commands executed?

Somewhere along the line, calico-kube-controllers stopped working, with the error "No etcd endpoints specified in etcdv3 API config".

Note that we started the cluster at Kubernetes 1.12.7 with etcd3, so there was no major upgrade like etcd v2 -> v3.
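
For reference, the errors show up in the controller logs, which can be pulled with something like the following (assuming the usual calico-kube-controllers Deployment and label in kube-system):

  kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
  kubectl -n kube-system logs deployment/calico-kube-controllers --tail=20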

6. What did you expect to happen?
I expected calico-kube-controllers to be properly configured to use etcd3, or, if calico-kube-controllers is no longer needed (as this comment suggests), I expected that to be documented well enough that I can show whoever is concerned that the missing configuration is not an issue that needs to be fixed.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: example.io
spec:
  additionalPolicies:
    bastion: |
      [
        {
           "Effect": "Allow",
            "Action": [
               "ec2:DescribeTags"
             ],
              "Resource": "*"
        }
      ]
    master: |
      [
        {
           "Effect": "Allow",
            "Action": [
               "ec2:DescribeTags"
             ],
              "Resource": "*"
        }
      ]
    node: |
      [
        {
           "Effect": "Allow",
           "Action": [
               "ec2:DescribeTags"
           ],
           "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Cluster: <redacted>
  cloudProvider: aws
  configBase: s3://<redacted>
  dnsZone: <redacted>
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
  hooks:
  - manifest: |
      Type=oneshot
      ExecStart=/bin/sh -c '/sbin/iptables -t nat -A PREROUTING -d 169.254.169.254/32 \
          -i cali+ -p tcp -m tcp --dport 80 -j DNAT \
          --to-destination $(curl -s http://169.254.169.254/latest/meta-data/local-ipv4):8181'
    name: kiam-iptables.service
    roles:
    - Node
  iam:
    legacy: true
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    - Initializers
    - DenyEscalatingExec
    anonymousAuth: false
    authorizationMode: RBAC
    oidcClientID: <redacted>
    oidcGroupsClaim: <redacted>
    oidcGroupsPrefix: 'oidc:'
    oidcIssuerURL: https://<redacted>
    oidcUsernameClaim: <redacted>
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterPublicName: <redacted>.io
  networkCIDR: 10.105.0.0/17
  networkID: vpc-<redacted>
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.105.0.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.105.16.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.105.32.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.105.48.0/20
    id: subnet-<redacted>
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.105.64.0/20
    id: subnet-<redacted>
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.105.80.0/20
    id: subnet-<redacted>
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    bastion:
      bastionPublicName: <redacted>
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-us-west-2a
  - utility-us-west-2b
  - utility-us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2a
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2b
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2c
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: nodes
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 3
  minSize: 3
  role: Node
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c

9. Anything else do we need to know?
What does it mean to run calico "in CRD mode"? I cannot find that anywhere in the Calico documentation.

@rifelpet
Member

rifelpet commented Jun 6, 2019

I haven't used calico but I believe "CRD mode" means that calico stores its state through the Kubernetes API Server using custom resource definitions rather than storing state directly on etcd. I think that might be the DATASTORE_TYPE variable here.

Can you confirm that setting DATASTORE_TYPE=kubernetes fixes the issue? I'll see if we can add some clarity to the Kops documentation.
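
If it helps, here is an untested sketch for checking and changing that variable on a live cluster, assuming the controller is deployed as the usual calico-kube-controllers Deployment in kube-system:

  kubectl -n kube-system get deployment calico-kube-controllers \
    -o jsonpath='{.spec.template.spec.containers[0].env}'
  kubectl -n kube-system set env deployment/calico-kube-controllers DATASTORE_TYPE=kubernetes

Keep in mind that kops manages these manifests, so a change made this way may be reverted on the next kops update.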

@Nuru
Author

Nuru commented Jun 6, 2019

@rifelpet Thanks for your contribution to the discussion.

I don't know how or where to set DATASTORE_TYPE=kubernetes using kops, and since kops is managing everything else to do with Calico, I do not want to configure it via some other mechanism.

Also, even if DATASTORE_TYPE=kubernetes makes the error message go away, it does not answer the question of whether or not calico-kube-controllers needs to run at all (and why or why not).

@opusmagnum

opusmagnum commented Aug 26, 2019

After upgrading to kops 1.12.3 I had the following situation:

  • calico-node version 3.7.4 running with env DATASTORE_TYPE: kubernetes
  • calico-kube-controllers version 1.0.3 producing millions of errors -- Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

It seems that downscaling of the calico-kube-controllers Deployment via the manifest that says

# This manifest scales the Calico Kubernetes controllers down to size 0.

doesn't work properly?

I downscaled this deployment manually and the error messages stopped, without losing any Calico functionality. If I understood correctly, this functionality, previously implemented as a separate set of controllers, is now built into calico-node.
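
(For reference, the manual downscale amounts to something along these lines, assuming the Deployment lives in kube-system under its default name:

  kubectl -n kube-system scale deployment calico-kube-controllers --replicas=0
  kubectl -n kube-system get deployment calico-kube-controllers

The second command should then report 0 desired replicas.)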

@grv231

grv231 commented Oct 22, 2019

@opusmagnum have you encountered any new errors? I have started to see new errors in protokube.service after downscaling the deployment to zero. Just wanted to check whether you have encountered anything like that as well.

@Nuru
Author

Nuru commented Oct 23, 2019

UPDATES:

"CRD mode" means that Calico stores its state information via the Kubernetes API in Kubernetes Custom Resources, rather than, as previously, storing the information directly in etcd via its API.

According to this PR, calico-kube-controllers does still need to be run even in CRD mode. Apparently its job is to remove resources when they are no longer needed.

I suppose this bug was fixed along the way somewhere. Any version that includes that PR (see the list here) should be OK.
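
A quick way to confirm that a cluster is running in CRD mode is to look for the Calico custom resource definitions in the Kubernetes API (the exact set of CRDs varies by Calico version, so treat this only as a rough check):

  kubectl get crd | grep projectcalico.org
  kubectl get ippools.crd.projectcalico.org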

@opusmagnum
Copy link

Thank you for the explanation, @Nuru! If a cluster node is removed, the Calico daemon on that node is removed as well, so no additional cleanup is necessary, is it? @Nuru, which kinds of resources should be removed, and under which circumstances?

@grv231 Since downscaling calico-kube-controllers I haven't noticed any negative effects (apart from the masses of errors, which stopped), but I have to check again in light of the explanation (last comment) from @Nuru, just to be sure that the "remove resources" part, after cluster/node downscaling or similar, is working as well.

@Nuru
Author

Nuru commented Oct 28, 2019

@opusmagnum I'm not sure exactly what resources should be removed and when, but it seems that at least ipamblocks and blockaffinities need to be removed when a node is removed.
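
To check for leftovers, something like the following should list those resources (assuming the Calico CRDs live under the usual crd.projectcalico.org group); entries referring to nodes that no longer show up in kubectl get nodes would be candidates for cleanup:

  kubectl get ipamblocks.crd.projectcalico.org
  kubectl get blockaffinities.crd.projectcalico.org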

@grv231

grv231 commented Oct 28, 2019

@opusmagnum The issue started happening for us at random about 3-4 weeks after migrating the cluster to version 1.12.8. I can confirm that the issue did not come up soon after migrating to this version (I guess because I was moving up from 1.12.0 --> 1.12.8). Somewhere along the line, the changes were not picked up and I had to scale the calico-kube-controllers deployment down because the cluster was not getting validated (which raised other errors in the protokube service after 3-4 weeks).

This weekend I migrated the cluster to 1.13 and this has resolved the issues. However, @Nuru, I see a significant increase in the amount of logging from the etcd-manager pods; the previous etcd-server-events pod logged less (as far as I can see in Kibana). Is this expected behavior? The logs all seem to be non-error messages.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
