
Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out #714

Closed
dschaaff opened this issue Sep 3, 2021 · 58 comments
Labels: type/bug Something isn't working

@dschaaff

dschaaff commented Sep 3, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Since upgrading consul-k8s to 0.33 we are seeing frequent failures within our primary cluster. I have not yet observed this behavior in the other clusters we run. The primary difference with this cluster is that it runs close to 100 nodes and has several connect-injected services that frequently scale up and down via the horizontal pod autoscaler.

Observed Symptoms

  • new pod starts
  • the endpoints controller does not register the service with Consul
  • the consul-connect-inject-init container is stuck on Unable to find registered services; retrying
  • the Consul agent on the service pod's node logs errors because the service is not registered:
[ERROR] agent.http: Request error: method=GET url=/v1/agent/services?filter=Meta%5B%22pod-name%22%5D+%3D%3D+%22example-service-6848b97cd7-fr27c%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22aggregates-service%22 from=10.0.36.110:49094 error="ACL not found"

At this point the pod is stuck. The endpoints controller never actually registers the service. After a few minutes our on-call engineers are paged due to the stuck pods. Deleting the pods in this state usually gets things back on track.
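For reference, the recovery is nothing more sophisticated than deleting the stuck pod so the ReplicaSet recreates it, along these lines (names taken from the error above):

```
# deleting the stuck pod lets the ReplicaSet recreate it
kubectl delete pod example-service-6848b97cd7-fr27c -n aggregates-service
```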

  • AWS EKS 1.20 Server Version: v1.20.7-eks-d88609
  • Servers on Consul 1.10.2
  • Agents on 1.10.2

I have confirmed that these pods are present in the service's endpoints.
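For what it's worth, a check along these lines shows that (example names from the redacted error above; adjust for the real service):

```
# the stuck pod's IP should appear in the addresses of the K8s Endpoints object
kubectl get endpoints example-service -n aggregates-service -o yaml
```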

Helm Values

fullnameOverride: consul
# Available parameters and their default values for the Consul chart.

global:
  enabled: false
  domain: consul
  image: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/consul:1.10.2"
  imageK8S: "xxxx.dkr.ecr.us-west-2.amazonaws.com/consul-k8s-control-plane:0.33.0"
  imageEnvoy: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/envoy:v1.16.4"
  datacenter: xxxx
  enablePodSecurityPolicies: false
  gossipEncryption:
    secretName: consul-secrets
    secretKey: gossip-encryption-key
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: false
    caCert:
      secretName: consul-secrets
      secretKey: ca.crt
    caKey:
      secretName: null
      secretKey: null

server:
  enabled: false
externalServers:
  enabled: true
  hosts: [xxxxxxxxxx]
  httpsPort: 443
  tlsServerName: null
  useSystemRoots: true
client:
  enabled: true
  image: null
  join:
    - "provider=aws tag_key=consul-datacenter tag_value=xxxxxx"
  grpc: true
  exposeGossipPorts: false
  resources:
    requests:
      memory: "400Mi"
      cpu: "200m"
    limits:
      cpu: "500m"
      memory: "400Mi"
  extraConfig: |
    {
      "telemetry": {
        "disable_hostname": true,
        "prometheus_retention_time": "6h"
      }
    }
  extraVolumes:
    - type: secret
      name: consul-secrets
      load: false
    - type: secret
      name: consul-acl-config
      load: true 
  tolerations: ""
  nodeSelector: null
  annotations: null
  extraEnvironmentVars:
    CONSUL_HTTP_TOKEN_FILE: /consul/userconfig/consul-secrets/consul.token

dns:
  enabled: true

ui:
  enabled: false

syncCatalog:
  enabled: true
  image: null
  default: true # true will sync by default, otherwise requires annotation
  toConsul: true
  toK8S: false
  k8sPrefix: null
  consulPrefix: null
  k8sTag: k8s-cluster-name
  syncClusterIPServices: true
  nodePortSyncType: ExternalFirst
  aclSyncToken:
    secretName: consul-secrets
    secretKey: consul-k8s-sync.token

connectInject:
  enabled: true
  replicas: 2
  default: false
  resources:
    requests:
      memory: "500Mi"
      cpu: "100m"
    limits:
      cpu: null
      memory: "750Mi"
  overrideAuthMethodName: kubernetes
  aclInjectToken:
    secretName: consul-secrets
    secretKey: connect-inject.token
  centralConfig:
    enabled: true
  sidecarProxy:
    resources:
      requests:
        memory: 150Mi
        cpu: 100m
      limits:
        memory: 150Mi

I could use some guidance on what information would be most useful to help debug this. Thanks for your help!

@dschaaff dschaaff added the type/bug Something isn't working label Sep 3, 2021
@david-yu
Contributor

david-yu commented Sep 3, 2021

Hi @dschaaff thanks for filing. Would you be able to provide a YAML manifest either an example or repro YAML that may introduce this behavior? We have seen some issues around multi-port pods but you may not have the same scenario or architecture.

@dschaaff
Author

dschaaff commented Sep 3, 2021

Here are the deployment and service manifests for one of our services. Our other microservices mirror this setup and differ only by name and image.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "30"
    meta.helm.sh/release-name: example-service
    meta.helm.sh/release-namespace: example-service
  labels:
    app.kubernetes.io/instance: example-service
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: example-service
    app.kubernetes.io/version: "1.0"
    helm.sh/chart: example-service-2.2.1
  name: example-service
  namespace: example-service
spec:
  # note: this is managed by the HPA, which scales based on CPU percentage
  replicas: 19
  selector:
    matchLabels:
      app.kubernetes.io/instance: example-service
      app.kubernetes.io/name: example-service
  strategy:
    rollingUpdate:
      maxSurge: 200%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: example-service
        consul.hashicorp.com/connect-service-port: "80"
        consul.hashicorp.com/service-meta-namespace: example-service
        consul.hashicorp.com/service-tags: eks
      labels:
        app.kubernetes.io/instance: example-service
        app.kubernetes.io/name: example-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - example-service
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - args:
        - agent
        - -config=/etc/vault/vault-agent-config.hcl
        env:
        - name: VAULT_ADDR
          value: https://redacted
        - name: VAULT_CAPATH
          value: /etc/ssl/certs
        image: vault:1.6.1
        imagePullPolicy: IfNotPresent
        name: vault-agent-auth
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 5m
            memory: 150Mi
        volumeMounts:
        - mountPath: /etc/vault
          name: vault-agent-config
          readOnly: true
        - mountPath: /etc/ssl/certs/vault-ca.pem
          name: vault-cert
          readOnly: true
          subPath: vault-ca.pem
        - mountPath: /home/vault
          name: vault-token
      - env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: VAULT_ADDR
          value: http://127.0.0.1:8200
        - name: VAULT_CAPATH
          value: /etc/ssl/certs
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: example-service
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sleep
              - "30"
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 80
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: example-service
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readinezz
            port: 80
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          limits:
            cpu: "2"
            memory: 700Mi
          requests:
            cpu: "1"
            memory: 700Mi
        volumeMounts:
        - mountPath: /home/vault
          name: vault-token
        - mountPath: /etc/vault
          name: vault-agent-config
          readOnly: true
      dnsConfig:
        options:
        - name: ndots
          value: "1"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: example-service
      serviceAccountName: example-service
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: example-service-vault-agent-config
        name: vault-agent-config
      - configMap:
          defaultMode: 420
          items:
          - key: vault-ca.pem
            path: vault-ca.pem
          name: example-service-vault-agent-config
        name: vault-cert
      - emptyDir:
          medium: Memory
        name: vault-token
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    consul.hashicorp.com/service-tags: example-service
    meta.helm.sh/release-name: example-service
    meta.helm.sh/release-namespace: example-service
  labels:
    app.kubernetes.io/instance: example-service
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: example-service
    app.kubernetes.io/version: "1.0"
    helm.sh/chart: example-service-2.2.1
  name: example-service
  namespace: example-service
spec:
  clusterIP: 172.20.239.195
  clusterIPs:
  - 172.20.239.195
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/instance: example-service
    app.kubernetes.io/name: example-service
  sessionAffinity: None
  type: ClusterIP

@dschaaff
Author

dschaaff commented Sep 3, 2021

I'm in the process of updating the consul client pods to disable the streaming backend. If the behavior persists after that, I'm going to downgrade consul-k8s back to the sidecar architecture. Unfortunately, this is very disruptive and has led to the on-call engineer getting paged quite frequently.

I'm hoping to grab some more info in the meantime.

@david-yu
Copy link
Contributor

david-yu commented Sep 3, 2021

Thanks @dschaaff. We have seen some issues when folks declare the service and service port via annotations. You shouldn't need to do this any longer, since we are able to grab this information from the Service object itself. We'd need to investigate this further, but I don't think we'll have answers for you in the immediate short term. I do wonder if removing those annotations changes anything:

        consul.hashicorp.com/connect-service: example-service
        consul.hashicorp.com/connect-service-port: "80"

@dschaaff
Author

dschaaff commented Sep 3, 2021

I had to go ahead and downgrade our prod clusters. Next week I'll try and reproduce in a test cluster.

lawliet89 pushed a commit to lawliet89/consul-k8s that referenced this issue Sep 13, 2021
@dschaaff
Author

Just to check in, I haven't had time to work on reproducing the issue. We have only experienced it in large EKS clusters (50-100 nodes, thousands of pods) thus far. I had to downgrade our last larger cluster due to the issues. I hope to find time to reproduce it outside of a high-traffic environment soon.

@dschaaff
Author

dschaaff commented Oct 13, 2021

I am now seeing this behavior in our staging cluster. Can you please let me know what information I should grab to help debug this?

I have a pod that has been stuck crash-looping for 10 minutes. The consul-connect-inject-init container keeps logging:

2021-10-13T19:02:14.381Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:15.382Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:16.383Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:17.385Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:18.387Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:19.388Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:20.390Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:21.391Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:22.392Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:23.394Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:23.394Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2021-10-13T19:02:24.395Z [INFO]  Unable to find registered services; retrying
2021-10-13T19:02:25.396Z [INFO]  Unable to find registered services; retrying

The Consul agent on the node is up and healthy. One thing caught my eye in the injector logs:

2021-10-13T18:47:44.888Z	INFO	handler.connect	received pod	{"name": "", "ns": "my-namespace"}

Is it expected for the pod name to be empty like that?

It was in this state for 15 minutes. Afterwards the injector finally registered the service and the pod was able to start:

2021-10-13T19:02:25.673Z	INFO	controller.endpoints	retrieved	{"name": "my-app", "ns": "my-namespace"}
2021-10-13T19:02:25.673Z	INFO	controller.endpoints	registering service with Consul	{"name": "my-app", "id": "my-app-7d8b97fd89-tnf95-track", "agentIP": "10.20.59.1"}
2021-10-13T19:02:25.685Z	INFO	controller.endpoints	registering proxy service with Consul	{"name": "my-app-sidecar-proxy"}

@dschaaff
Author

I attempted to upgrade one of our production clusters to 0.39.0 yesterday. Unfortunately we ran into the same issue described here. Several pods were stuck at the init phase for more than 10 minutes because the endpoints controller had not recognized the new pods and registered them with Consul. Restarting the controller pods fixed the issue.
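For anyone else hitting this, restarting the controller pods amounts to a rollout restart of the injector deployment, roughly as follows (namespace and deployment name are assumed to match the chart defaults here):

```
kubectl -n consul rollout restart deployment/consul-connect-injector-webhook-deployment
```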

@ishustava
Contributor

Hi @dschaaff sorry for not getting back to you.

Do you see the endpoints controller not receiving the event for that pod (I'm talking specifically about the controller.endpoints logs) for 10 minutes? I'm wondering if there's an issue where, when there are a lot of events to process, it gets stuck working through that event queue.

IIRC this line handler.connect received pod {"name": "", "ns": "my-namespace"} is normal, as pods that are part of a deployment may not have a name yet when they are being injected, since injection happens before scheduling.

@ndhanushkodi
Contributor

When reproducing the issue, could you grab and provide the full endpoints controller logs (all the logs from both pods of consul-connect-injector-webhook-deployment), and grab the events while the pods are being applied as well:

kubectl get events -o custom-columns=OBJECT:.involvedObject.kind,NAME:.involvedObject.name,CREATETIMESTAMP:.metadata.creationTimestamp,REASON:.reason,MESSAGE:.message -w > k8sevents.txt

We can try to correlate K8s events and the endpoint controller logs that way.

Also, could you provide the CPU and memory usage of the Consul and injector pods while this happens? I was able to use kubectl top pods in the namespace where Consul is installed after installing metrics-server via these instructions.
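For example, something along these lines should do it (assuming the chart is installed in a namespace named consul):

```
# requires metrics-server; --containers breaks usage down per container
kubectl top pods -n consul --containers
```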

@dschaaff
Author

dschaaff commented Jan 24, 2022

I captured some logs from an issue this morning. I'm currently running consul-k8s 0.39.0 in our staging clusters. I'm leaving production on 0.25.0 until I can sort out why we get so many injection errors on the newer version. We had alerts go off for a pod in our staging environment today due to Envoy not being available.

The pod looks like it was rescheduled due to cluster autoscaling activity. Below are the Kubernetes events from around that time, followed by the logs the consul-connect-inject-init container put out when the pod started:

``` LAST SEEN TYPE REASON OBJECT MESSAGE 44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container envoy-sidecar 44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container logrotate 44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container vault-agent-auth 44m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: client-replication-1-6ff8bcb699-mt6bx 43m Warning FailedScheduling pod/client-replication-1-6ff8bcb699-mt6bx 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 7 Insufficient memory. 44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container client-replication 44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container laravel-log 43m Normal TriggeredScaleUp pod/client-replication-1-6ff8bcb699-mt6bx pod triggered scale-up: [{eks-stg-arm-spot-m-2021101817251900310000000c-fabe4a37-d6a9-2c06-075f-e07215ca1dd3 1->2 (max: 15)}] 42m Warning FailedScheduling pod/client-replication-1-6ff8bcb699-mt6bx 0/22 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 7 Insufficient memory. 42m Normal Scheduled pod/client-replication-1-6ff8bcb699-mt6bx Successfully assigned client-replication/client-replication-1-6ff8bcb699-mt6bx to ip-10-20-2-136.us-west-2.compute.internal 42m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg" 41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 23.294743018s 41m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container config-template-copy 41m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container config-template-copy 41m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/vault:1.7.7" 41m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container vault-agent-init 41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/vault:1.7.7" in 4.260832414s 41m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container vault-agent-init 41m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/cordial/consul-template:main-c2c65811" 41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 4.907906483s 40m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container consul-template-init 40m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container consul-template-init 40m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Container image "registry/cordial/consul-template:main-c2c65811" already present on machine 40m Warning BackOff pod/client-replication-1-6ff8bcb699-mt6bx Back-off restarting failed container 40m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Container image "registry/consul:1.10.6" already present on machine 37m Normal Pulling pod/client-replication-1-6ff8bcb699-drctt Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg" 37m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: 
client-replication-1-6ff8bcb699-drctt 37m Normal Scheduled pod/client-replication-1-6ff8bcb699-drctt Successfully assigned client-replication/client-replication-1-6ff8bcb699-drctt to ip-10-20-11-129.us-west-2.compute.internal 36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 23.167028987s 36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container config-template-copy 36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container config-template-copy 36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container vault-agent-init 36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container vault-agent-init 36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/vault:1.7.7" already present on machine 36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container consul-template-init 36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/cordial/consul-template:main-c2c65811" already present on machine 36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container consul-template-init 36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container copy-consul-bin 36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/consul:1.10.6" already present on machine 36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container copy-consul-bin 36m Normal Pulling pod/client-replication-1-6ff8bcb699-drctt Pulling image "registry/consul-k8s-control-plane:0.39.0" 36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Successfully pulled image "registry/consul-k8s-control-plane:0.39.0" in 1.860779427s 34m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container consul-connect-inject-init 34m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container consul-connect-inject-init 34m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/consul-k8s-control-plane:0.39.0" already present on machine 33m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: client-replication-1-6ff8bcb699-cw2cq 33m Normal Scheduled pod/client-replication-1-6ff8bcb699-cw2cq Successfully assigned client-replication/client-replication-1-6ff8bcb699-cw2cq to ip-10-20-24-127.us-west-2.compute.internal 33m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg" 32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 21.855575333s 32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container config-template-copy 32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container config-template-copy 32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/vault:1.7.7" 32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/vault:1.7.7" in 4.673750726s 32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container vault-agent-init 32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container vault-agent-init 32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/cordial/consul-template:main-c2c65811" 32m Normal Started 
pod/client-replication-1-6ff8bcb699-cw2cq Started container consul-template-init 32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container consul-template-init 32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 4.893254952s 32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/consul:1.10.6" already present on machine 32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container copy-consul-bin 32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container copy-consul-bin 32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/consul-k8s-control-plane:0.39.0" 32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/consul-k8s-control-plane:0.39.0" in 1.947241264s 28m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container consul-connect-inject-init 28m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container consul-connect-inject-init 28m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/consul-k8s-control-plane:0.39.0" already present on machine 28m Warning BackOff pod/client-replication-1-6ff8bcb699-cw2cq Back-off restarting failed container 19m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/cordial/consul-template:main-c2c65811" already present on machine 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container vault-agent-auth 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container dedicated-client-replication 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container laravel-log 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container consul-template 5m51s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 7 Insufficient memory. 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container envoy-sidecar 5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container logrotate 5m52s Normal SuccessfulCreate replicaset/dedicated-client-replication-1-757549d787 Created pod: dedicated-client-replication-1-757549d787-j4wl4 5m49s Warning FailedPreStopHook pod/dedicated-client-replication-1-757549d787-zftjz Exec lifecycle hook ([bin/bash -c /usr/local/bin/consul logout -http-addr="${HOST_IP}:8500" -token-file /consul-template/acl-token]) for Container "consul-template" in Pod "dedicated-client-replication-1-757549d787-zftjz_client-replication(8b437f7f-93a6-4528-b57b-ecae7b39105b)" failed - error: command 'bin/bash -c /usr/local/bin/consul logout -http-addr="${HOST_IP}:8500" -token-file /consul-template/acl-token' exited with 1: Error destroying token: Unexpected response code: 403 (rpc error making call: rpc error making call: ACL not found)... 5m36s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 8 Insufficient memory. 
5m46s Normal TriggeredScaleUp pod/dedicated-client-replication-1-757549d787-j4wl4 pod triggered scale-up: [{eks-stg-spot-c-20211222190131236200000001-9cbef1c2-8e1d-848c-700a-2ab387a46cca 3->4 (max: 15)}] 5m16s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/21 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 8 Insufficient memory. 5m6s Normal Scheduled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully assigned client-replication/dedicated-client-replication-1-757549d787-j4wl4 to ip-10-20-9-22.us-west-2.compute.internal 5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "vault-agent-config" : failed to sync configmap cache: timed out waiting for the condition 5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "vault-cert" : failed to sync configmap cache: timed out waiting for the condition 5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "logrotate-conf" : failed to sync configmap cache: timed out waiting for the condition 4m47s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg" 4m25s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 22.414868737s 4m15s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container config-template-copy 4m15s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container config-template-copy 4m14s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/vault:1.7.7" 4m12s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container vault-agent-init 4m12s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container vault-agent-init 4m12s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/vault:1.7.7" in 2.386491168s 4m9s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/cordial/consul-template:main-c2c65811" 4m6s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 3.911779387s 3m50s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container consul-template-init 3m50s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container consul-template-init 3m25s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Container image "registry/cordial/consul-template:main-c2c65811" already present on machine 3m37s Warning BackOff pod/dedicated-client-replication-1-757549d787-j4wl4 Back-off restarting failed container ```
2022-01-24T18:24:36.265Z [INFO]  Consul login complete
2022-01-24T18:24:36.265Z [INFO]  Checking that the ACL token exists when reading it in the stale consistency mode
2022-01-24T18:24:36.266Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-01-24T18:24:36.367Z [INFO]  Successfully read ACL token from the server
2022-01-24T18:24:36.370Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:37.371Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:38.372Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:39.373Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:40.374Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:41.375Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:42.376Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:43.378Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:44.379Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:45.380Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:45.380Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-01-24T18:24:46.381Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:47.382Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:48.383Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:49.385Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:50.386Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:51.388Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:52.389Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:53.390Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:54.391Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:55.392Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:55.392Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-01-24T18:24:56.393Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:57.394Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:58.395Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:24:59.397Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:00.399Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:01.400Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:02.401Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:03.402Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:04.403Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:05.404Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:05.404Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-01-24T18:25:06.407Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:07.408Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:08.409Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:09.410Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:10.411Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:11.412Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:12.413Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:13.414Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:14.415Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:15.417Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:15.417Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-01-24T18:25:16.419Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:17.420Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:18.421Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:19.422Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:20.423Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:21.424Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:22.425Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:23.426Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:24.427Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:25.428Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:25.428Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-01-24T18:25:26.430Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:27.431Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:28.432Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:29.433Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:30.434Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:31.435Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:32.437Z [INFO]  Unable to find registered services; retrying
2022-01-24T18:25:33.438Z [INFO]  Registered service has been detected: service=client-replication-1
2022-01-24T18:25:33.438Z [INFO]  Registered service has been detected: service=client-replication-1-sidecar-proxy
2022-01-24T18:25:33.438Z [INFO]  Connect initialization completed

I'm attaching the endpoints controller logs since there are a lot of them. What jumps out at me is:

sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-01-24T18:24:35.035Z	ERROR	controller.endpoints	failed to deregister endpoints on all agents	{"name": "client-replication-1", "ns": "client-replication", "error": "Get \"https://10.20.8.125:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22client-replication-1%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22client-replication%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.20.14.172:44074->10.20.8.125:8501: read: no route to host"}

10.20.8.125 belonged to a node that no longer exists in the cluster.
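That can be confirmed with checks along these lines (IP taken from the error above):

```
# check whether any Kubernetes node or live Consul member still has that address
kubectl get nodes -o wide | grep 10.20.8.125
# run from any Consul agent, e.g. by exec-ing into a consul client pod
consul members | grep 10.20.8.125
```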

consul-out2.zip
consul-out.zip

@dschaaff
Author

I have just discovered the advertise_reconnect_timeout setting that was added to the agent. I have set it to 15m for the k8s-based agents so that dead nodes get cleaned up from the member list faster. I do not know if this is related to the errors described in this ticket, but I want to note that I made that change after collecting the logs in my previous comment.
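For anyone else looking for it, this is roughly how that looks via the chart's client.extraConfig (the 15m value is just what we picked):

```
client:
  # merge with any JSON already present in extraConfig
  extraConfig: |
    {
      "advertise_reconnect_timeout": "15m"
    }
```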

@dschaaff
Author

I believe this issue may be caused by #779. Our environment autoscales quite frequently, and there are often scenarios where a node that has gone away still shows up in the Consul member list.

I deployed 0.40 with #991 into a staging cluster last week and things are looking better thus far. I have expanded the test by deploying 0.40 to a lower-traffic production cluster. I'll update the issue after observing the behavior further.

@david-yu
Contributor

Thank you @dschaaff for the update!

@dschaaff
Author

dschaaff commented Feb 1, 2022

Unfortunately, this behavior continues. We just had an alert fire for 2 pods that were not healthy. Both had been stuck in the consul-connect-inject-init phase for 10 minutes.

2022-02-01T01:19:11.786Z [INFO]  Consul login complete
2022-02-01T01:19:11.786Z [INFO]  Checking that the ACL token exists when reading it in the stale consistency mode
2022-02-01T01:19:11.788Z [INFO]  Successfully read ACL token from the server
2022-02-01T01:19:11.790Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:12.791Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:13.792Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:14.793Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:15.795Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:16.797Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:17.798Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:18.799Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:19.801Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:20.802Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:20.802Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-02-01T01:19:21.803Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:22.805Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:23.806Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:24.807Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:25.808Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:26.810Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:27.811Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:28.812Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:29.813Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:30.815Z [INFO]  Unable to find registered services; retrying
2022-02-01T01:19:30.815Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-02-01T01:19:31.816Z [INFO]  Unable to find registered services; retrying

I checked the endpoints controller logs, and the pod IDs for the pods that alerted do not appear at all. I'm going to attempt to leave this cluster on the latest version a bit longer, but this issue does not appear to be resolved by #991. All of our other production clusters remain on 0.25.0 of this project, which has been completely reliable.

Reminder: I attached the previously requested logs in #714 (comment).

@ishustava
Contributor

Hey @dschaaff thanks so much for the update and for trying it out!

Could you describe roughly what is happening in the cluster when you're seeing these errors? I'm asking mostly about how many pods are being created by the HPA at about the same time. Do you know roughly the range of pods per service you could have at any given time?

I'm thinking this may be related to scalability issues with Endpoints objects, and I'm trying to find out whether you may be running into that. We have looked at supporting EndpointSlices but haven't gotten to it yet.

@dschaaff
Author

dschaaff commented Feb 1, 2022

The registration failure today happened while we were deploying 3 different services. Between these 3 services, there were a total of 14 pods at the time. This is in a secondary cluster that does not receive as much traffic as our main cluster. The HPA for the service that errored has a min/max of 2/10. I'm including a screenshot of every HPA in that cluster. The pod count for that cluster hovers around 400.

[screenshot: every HPA in the cluster]

For context, our main cluster is about 6 times the size of this one and the services scale up and down a lot more often. Hopefully, that is helpful. I'm happy to dig into any more info that would be useful in troubleshooting the issues. Thanks!

@dschaaff
Author

dschaaff commented Feb 1, 2022

I may have to downgrade the production cluster I was testing this upgrade in. The on-call engineer was paged again due to pods stuck in an init state.

NAME                                          READY   STATUS     RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
service-86dd585d4-q4785           0/3     Init:1/2   0          20s   10.0.77.65    ip-10-0-94-66.us-west-1.compute.internal    <none>           0/1
service-86dd585d4-xb5ng           0/3     Init:1/2   4          11m   10.0.86.135   ip-10-0-64-120.us-west-1.compute.internal   <none>           0/1
kubectl logs service-86dd585d4-9x25x -c consul-connect-inject-init    
2022-02-01T17:48:10.339Z [INFO]  Consul login complete
2022-02-01T17:48:10.339Z [INFO]  Checking that the ACL token exists when reading it in the stale consistency mode
2022-02-01T17:48:10.340Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-02-01T17:48:10.441Z [INFO]  Successfully read ACL token from the server
2022-02-01T17:48:10.443Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:11.444Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:12.445Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:13.447Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:14.448Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:15.450Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:16.451Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:17.452Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:18.453Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:19.455Z [INFO]  Unable to find registered services; retrying

The pod IDs service-86dd585d4-q4785 and service-86dd585d4-xb5ng do not appear anywhere in the inject controller logs. The Consul agent is healthy on both of those nodes. I do see the following in the Consul agent logs on those nodes:

2022-02-01T17:50:28.013Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.77.65:33428 error="ACL not found"
2022-02-01T17:49:16.015Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.86.135:52496 error="ACL not found"
2022-02-01T17:52:38.993Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.86.135:45138 error="ACL not found"
$ kubectl get events | grep service-86dd585d4-xb5ng                     
18m         Normal    Scheduled                pod/service-86dd585d4-xb5ng               Successfully assigned service/service-86dd585d4-xb5ng to ip-10-0-64-120.us-west-1.compute.internal
18m         Normal    Pulled                   pod/service-86dd585d4-xb5ng               Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/consul:1.10.7" already present on machine
18m         Normal    Created                  pod/service-86dd585d4-xb5ng               Created container copy-consul-bin
18m         Normal    Started                  pod/service-86dd585d4-xb5ng               Started container copy-consul-bin
9m12s       Normal    Pulled                   pod/service-86dd585d4-xb5ng               Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/consul-k8s-control-plane:0.40.0" already present on machine
9m12s       Normal    Created                  pod/service-86dd585d4-xb5ng               Created container consul-connect-inject-init
9m12s       Normal    Started                  pod/service-86dd585d4-xb5ng               Started container consul-connect-inject-init
6m42s       Warning   BackOff                  pod/service-86dd585d4-xb5ng               Back-off restarting failed container
18m         Normal    SuccessfulCreate         replicaset/service-86dd585d4              Created pod: service-86dd585d4-xb5ng

Eventually pod service-86dd585d4-xb5ng started, but the delay between the pod starting and it becoming healthy is way too long.

The only thing jumping out at me in the injector logs is this:

2022-02-01T17:38:15.443Z	INFO	controller.endpoints	Consul client agent is not ready, skipping deregistration	{"consul-agent": "consul-client-rrfww", "svc": "service"}
2022-02-01T17:38:15.459Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "service", "namespace": "service", "error": "1 error occurred:\n\t* Put \"https://10.0.66.169:8501/v1/agent/service/register\": dial tcp 10.0.66.169:8501: connect: connection refused\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-02-01T17:38:19.157Z	INFO	controller.endpoints	retrieved	{"name": "fluent-bit", "ns": "fluentd"}
2022-02-01T17:38:19.157Z	INFO	controller.endpoints	ignoring because endpoints pods have not been injected	{"name": "fluent-bit", "ns": "fluentd"}
2022-02-01T17:38:35.940Z	INFO	controller.endpoints	retrieved	{"name": "service", "ns": "service"}
2022-02-01T17:38:35.940Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-dxwnr-service", "agentIP": "10.0.94.60"}
2022-02-01T17:38:35.942Z	INFO	controller.endpoints	registering proxy service with Consul	{"name": "service-sidecar-proxy"}
2022-02-01T17:38:35.945Z	INFO	controller.endpoints	updating health check status for service	{"name": "service", "reason": "Kubernetes health checks passing", "status": "passing"}
2022-02-01T17:38:35.946Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-r6vdl-service", "agentIP": "10.0.72.103"}
2022-02-01T17:38:35.949Z	INFO	controller.endpoints	registering proxy service with Consul	{"name": "service-sidecar-proxy"}
2022-02-01T17:38:35.951Z	INFO	controller.endpoints	updating health check status for service	{"name": "service", "reason": "Kubernetes health checks passing", "status": "passing"}
2022-02-01T17:38:35.952Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-9x25x-service", "agentIP": "10.0.66.169"}
2022-02-01T17:38:35.955Z	ERROR	controller.endpoints	failed to register service	{"name": "service", "error": "Unexpected response code: 403 (Permission denied: Missing service:write on service)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-02-01T17:38:35.955Z	ERROR	controller.endpoints	failed to register services or health check	{"name": "service", "ns": "service", "error": "Unexpected response code: 403 (Permission denied: Missing service:write on service)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

This error:

2022-02-01T17:38:15.459Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "service", "namespace": "service", "error": "1 error occurred:\n\t* Put \"https://10.0.66.169:8501/v1/agent/service/register\": dial tcp 10.0.66.169:8501: connect: connection refused\n\n"}

is for a node that no longer exists:

ip-10-0-66-169.us-west-1.compute.internal 10.0.86.68:8301 left client 1.10.7 2 datacenter default <default>

@dschaaff
Author

dschaaff commented Mar 17, 2022

Just a note that I once again tried to upgrade our primary production cluster today (to 0.41). I'm now going to roll it back. We've been flooded with alerts, off and on, for pods that are stuck in the init phase logging:

2022-02-01T17:48:10.441Z [INFO]  Successfully read ACL token from the server
2022-02-01T17:48:10.443Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:11.444Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:12.445Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:13.447Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:14.448Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:15.450Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:16.451Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:17.452Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:18.453Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:19.455Z [INFO]  Unable to find registered services; retrying

Our alerts only fire after a pod has been stuck like this for 8 minutes. I'd really like to be able to use the up-to-date injector, but it's definitely really buggy.

@raphlcx

raphlcx commented Mar 31, 2022

Hello,

I'm also facing the same issue on our deployment. On one of our services, the init container logs from consul-connect-inject-init look pretty similar:

2022-03-31T09:58:02.760Z [INFO]  Unable to find registered services; retrying
2022-03-31T09:58:03.761Z [INFO]  Unable to find registered services; retrying
2022-03-31T09:58:04.763Z [INFO]  Unable to find registered services; retrying
2022-03-31T09:58:04.763Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
2022-03-31T09:58:05.766Z [INFO]  Unable to find registered services; retrying
2022-03-31T09:58:05.766Z [ERROR] Timed out waiting for service registration: error="did not find correct number of services, found: 0, services: map[]"

When the init container fails to start, the entire pod gets stuck. If we persistently restart the deployment multiple times, the init container eventually starts up successfully.

We connected our Kubernetes cluster to HCP Consul. The cluster has about 50 worker nodes, which are constantly spun up and torn down.

In our Consul Helm deployment, we noticed the consul-connect-injector pods crash rather frequently. This is the view from our Argo CD; note that the injector pods have restarted 27 and 33 times respectively (number on the far right):

Screenshot 2022-03-31 at 17 30 39

A summary of one of the pods:

Screenshot 2022-03-31 at 18 47 56

On its events:

Screenshot 2022-03-31 at 18 48 01

The injector pods' logs did not contain any exceptions, but they did flag that one of our services, which is annotated with consul.hashicorp.com/connect-inject: true, had not been injected yet:

2022-03-31T09:54:00.908Z	INFO	controller.endpoints	retrieved	{"name": "xxxxx", "ns": "staging"}
2022-03-31T09:54:00.908Z	INFO	controller.endpoints	ignoring because endpoints pods have not been injected	{"name": "xxxxx", "ns": "staging"}

Hope this helps in the investigation.

Components version:

  • HCP Consul: v1.11.4
  • Consul Helm: 0.41.1
  • AWS EKS: v1.21

@lkysow
Member

lkysow commented Apr 5, 2022

@raphlcx the one issue with "ignore pods because..." is here: #1143. I have a fix for that.

Regarding the container failing its liveness probe that's unexpected. Is there anything in the events?

@raphlcx
Copy link

raphlcx commented Apr 5, 2022

@raphlcx the one issue with "ignore pods because..." is here: #1143. I have a fix for that.

@lkysow Thanks, looking forward to adopting the fix!

Regarding the container failing its liveness probe that's unexpected. Is there anything in the events?

We have managed to resolve the root cause of the connect-injector pod crashing. It was due to insufficient memory being allocated to the pod. The default memory allocation for each pod is 50Mi, but the pod was actually using up to 70Mi, which led to the frequent crashes.

After bumping the memory allocation to 100Mi, the crashes stopped and we no longer see liveness probe errors in the events.

@leowmjw
Copy link

leowmjw commented Apr 6, 2022

@lkysow does this mean there is some sort of memory leak, perhaps caused by many pods coming and going due to HPA? How can we monitor this so that it is caught before it occurs?

@raphlcx Is the error the same as what @dschaaff observed in production? @dschaaff, is your production setup also using the default memory allocation of 50Mi, or did you encounter the issue even with a higher memory allocation?

@lkysow
Copy link
Member

lkysow commented Apr 6, 2022

does this mean there is some sort of memory leak, perhaps caused by many pods coming and going due to HPA? How can we monitor this so that it is caught before it occurs?

If @raphlcx is seeing stability at 100Mi then I don't think there's a memory leak; probably at a certain environment size you just need more memory. If they're seeing memory creep up, then yes, there may be a leak.

@dschaaff
Copy link
Author

dschaaff commented Apr 6, 2022

In production I run the endpoints controller with

resources:
    requests:
      memory: "1000Mi"
      cpu: "200m"
    limits:
      cpu: null # the null removes the default cpu limit present in the helm chart.
      memory: "1500Mi"

I haven't had any issues with the controller pods crashing, and they have never come close to using the amount of memory I've allocated for them. That sounds like a different issue than what I've been seeing.

@raphlcx
Copy link

raphlcx commented Apr 7, 2022

We'll have to continue observing the memory usage. So far it hovers around 50-70Mi, but at one point it spiked to 118Mi, which was above the limit we had set (100Mi).

Initially, we weren't expecting the memory usage of the connect-injector pod to rise past the default limit. Are there documented use cases, tutorials, or similar past incidents related to rising memory usage from the injector pod?

@raphlcx
Copy link

raphlcx commented Apr 7, 2022

@leowmjw I think we share the same issue as @dschaaff with init containers getting stuck, but our setup did not produce errors or exceptions from the connect-injector pods, i.e. controller.endpoints Reconciler error ...

@lkysow
Copy link
Member

lkysow commented Apr 7, 2022

@raphlcx can you share your values.yaml and your environment details (kube version, hosting provider, etc.)?

@david-yu
Copy link
Contributor

Hi everyone, we are definitely monitoring this issue closely. It sounds like this is occurring on K8s clusters that have the following properties:

  1. On K8s 1.20 and above, but from what I can tell both @dschaaff and @raphlcx are running on EKS
  2. Over 50 worker nodes

A couple of questions as well.

  1. How many services and pods on the mesh are running in the Consul K8s cluster?
  2. Do the nodes and/or pods scale up and down periodically?

We plan to try to reproduce this issue, but we are currently in the midst of preparing for a major release; any additional pointers on how to reproduce this would be appreciated.

@dschaaff
Copy link
Author

dschaaff commented Apr 11, 2022

  1. This is a rough estimate given it is the pod count of connect services at this specific time: 278.

We have 53 connect services total on the mesh. 8 of those run directly on EC2 outside of Kubernetes.

  2. We autoscale both the underlying nodes and the pods on top of them quite frequently.


We never experienced behavior like this with consul-k8s 0.25. I have tried every version of the consul-k8s project since 0.33 and have run into these problems in production every time. I have almost never seen this behavior in our pre-prod environments, which have been using the latest consul-k8s from 0.33 up to 0.42 now. The only difference between them is scale: the production workloads get a lot more traffic and autoscale to handle spikes.

Our consul cluster is usually around 144 nodes. The raft commit time is between 10 and 25ms at all times. Consul usually has 10-40 catalog operations a second. We run 5 consul servers on c6g.xlarge instances. CPU utilization has peaked at 10% on the leader.

I'm happy to provide any other helpful info. I'd love to not rely on restarting the inject controller every 3 minutes to keep things working :)

@lkysow
Copy link
Member

lkysow commented Apr 12, 2022

Just to give some info on why this is happening with 0.25.0+: we moved from the pods registering themselves to being registered via the endpoints controller. This was needed to support transparent proxying, i.e. being able to use kube DNS URLs (among other reasons, such as being able to deregister pods that are force-killed and re-register pods efficiently when consul clients restart).

@dschaaff
Copy link
Author

dschaaff commented Apr 12, 2022

Yep. The issue I've had is that the endpoints controller has not been reliably registering services with the local agent. Because of that, pods randomly get stuck in the init phase, sometimes for as long as 30 minutes. Restarting the endpoints controller regularly has made the alert noise manageable, but we still have issues during large deployments.

@lkysow
Copy link
Member

lkysow commented Apr 12, 2022

Yeah for sure; this is a high priority for us to fix now.

@dschaaff
Copy link
Author

dschaaff commented Apr 14, 2022

While the restart cron has greatly reduced the number of errors we see, the issue still crops up. We just hit this, for example:

my-app-768b7ccbf9-558g9                    0/7     Init:CrashLoopBackOff   4 (21s ago)     13m
my-app-768b7ccbf9-5ghsp                    0/7     Init:CrashLoopBackOff   4 (41s ago)     13m
my-app-768b7ccbf9-6j4zw                    0/7     Init:CrashLoopBackOff   4 (43s ago)     13m
my-app-768b7ccbf9-7m5pn                    0/7     Init:CrashLoopBackOff   4 (36s ago)     13m
my-app-768b7ccbf9-8b64b                    0/7     Init:4/5                4 (2m26s ago)   13m
my-app-768b7ccbf9-8ntqz                    0/7     Init:4/5                4 (2m28s ago)   13m
my-app-768b7ccbf9-97klt                    0/7     Init:CrashLoopBackOff   4 (29s ago)     13m
my-app-768b7ccbf9-bdlxv                    0/7     Init:4/5                4 (2m26s ago)   13m
my-app-768b7ccbf9-c5f2j                    0/7     Init:4/5                4               13m
my-app-768b7ccbf9-c9r68                    0/7     Init:4/5                4 (2m24s ago)   13m
my-app-768b7ccbf9-cbhn4                    0/7     Init:4/5                4 (2m22s ago)   13m
my-app-768b7ccbf9-cdvgr                    0/7     Init:CrashLoopBackOff   4 (48s ago)     13m
my-app-768b7ccbf9-ch6dl                    0/7     Init:CrashLoopBackOff   4 (25s ago)     13m
my-app-768b7ccbf9-cp9t6                    0/7     Init:CrashLoopBackOff   4 (40s ago)     13m
my-app-768b7ccbf9-cw6jf                    0/7     Init:CrashLoopBackOff   4 (41s ago)     13m
my-app-768b7ccbf9-ddst2                    0/7     Init:CrashLoopBackOff   4 (30s ago)     13m
my-app-768b7ccbf9-fqxp9                    0/7     Init:CrashLoopBackOff   4 (42s ago)     13m
my-app-768b7ccbf9-fz76q                    0/7     Init:4/5                4 (2m25s ago)   13m
my-app-768b7ccbf9-g82wx                    0/7     Init:CrashLoopBackOff   4 (40s ago)     13m
my-app-768b7ccbf9-jn5r8                    0/7     Init:4/5                4 (2m27s ago)   13m
my-app-768b7ccbf9-jqhv4                    0/7     Init:CrashLoopBackOff   4 (79s ago)     13m
my-app-768b7ccbf9-jtfn8                    0/7     Init:CrashLoopBackOff   4 (40s ago)     13m
my-app-768b7ccbf9-jwkh8                    0/7     Init:4/5                4 (2m27s ago)   13m
my-app-768b7ccbf9-k555j                    0/7     Init:4/5                4 (2m23s ago)   13m
my-app-768b7ccbf9-kjdt8                    0/7     Init:4/5                4 (2m27s ago)   13m
my-app-768b7ccbf9-kk8tx                    0/7     Init:4/5                4 (2m27s ago)   13m
my-app-768b7ccbf9-kpnbd                    0/7     Init:4/5                4 (2m24s ago)   13m
my-app-768b7ccbf9-lgcb4                    0/7     Init:CrashLoopBackOff   4 (36s ago)     13m
my-app-768b7ccbf9-lxn9j                    0/7     Init:CrashLoopBackOff   4 (40s ago)     13m
my-app-768b7ccbf9-m646q                    0/7     Init:CrashLoopBackOff   4 (31s ago)     13m
my-app-768b7ccbf9-mvgnd                    0/7     Init:4/5                4 (2m35s ago)   13m
my-app-768b7ccbf9-nnk6f                    0/7     Init:CrashLoopBackOff   4 (26s ago)     13m
my-app-768b7ccbf9-q74zg                    0/7     Init:4/5                4 (2m36s ago)   13m
my-app-768b7ccbf9-qvmvk                    0/7     Init:4/5                4 (2m26s ago)   13m
my-app-768b7ccbf9-rbbpt                    0/7     Init:4/5                4 (2m31s ago)   13m
my-app-768b7ccbf9-rkgg4                    0/7     Init:4/5                4 (2m28s ago)   13m
my-app-768b7ccbf9-rmwnc                    0/7     Init:4/5                4 (2m26s ago)   13m
my-app-768b7ccbf9-sdqkz                    0/7     Init:CrashLoopBackOff   4 (43s ago)     13m
my-app-768b7ccbf9-tpqs2                    0/7     Init:CrashLoopBackOff   4 (28s ago)     13m
my-app-768b7ccbf9-vpcx9                    0/7     Init:CrashLoopBackOff   4 (28s ago)     13m
my-app-768b7ccbf9-vrcpx                    0/7     Init:CrashLoopBackOff   4 (35s ago)     13m
my-app-768b7ccbf9-x89b7                    0/7     Init:CrashLoopBackOff   4 (30s ago)     13m
my-app-768b7ccbf9-xg8mc                    0/7     Init:4/5                4 (2m20s ago)   13m
my-app-768b7ccbf9-zzfht                    0/7     Init:4/5                4 (2m22s ago)   13m
my-app-768b7ccbf9-zzw82                    0/7     Init:CrashLoopBackOff   4 (28s ago)     13m

That's 43 pods that have been stuck in the init phase for 13m+ because the service was not registered.

2022-04-14T20:52:44.051Z [INFO]  Consul login complete
2022-04-14T20:52:44.051Z [INFO]  Checking that the ACL token exists when reading it in the stale consistency mode
2022-04-14T20:52:44.052Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-04-14T20:52:44.153Z [INFO]  Successfully read ACL token from the server
2022-04-14T20:52:44.155Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:45.156Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:46.157Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:47.159Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:48.160Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:49.161Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:50.162Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:51.163Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:52.164Z [INFO]  Unable to find registered services; retrying
2022-04-14T20:52:53.165Z [INFO]  Unable to find registered services; retrying

After 15m they finally got registered by the controller and were able to start. This is brutal. During this time the consul agent on the node was logging the below errors repeatedly until the service registration finally happened.

2022-04-14T20:44:54.762Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.129.65:51228 error="ACL not found"
2022-04-14T20:44:55.759Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.129.64:37266 error="ACL not found"
2022-04-14T20:47:22.860Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.129.65:52708 error="ACL not found"
2022-04-14T20:47:23.760Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.129.64:38746 error="ACL not found"

@dschaaff
Copy link
Author

dschaaff commented Apr 16, 2022

I disabled my restart cron temporarily, hoping to get some more debug data. We hit this issue again afterwards. I've followed the chain of events in the logs as best I can.

The pod in question was removed during a node scale-down.

FirstSeen              LastSeen               Count   From                 Type     Reason      Message
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       cluster-autoscaler   Normal   ScaleDown   deleting pod for node scale down
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       kubelet              Normal   Killing     Stopping container vault-agent-auth
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       kubelet              Normal   Killing     Stopping container envoy-sidecar
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       kubelet              Normal   Killing     Stopping container my-app
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       kubelet              Normal   Killing     Stopping container consul-template

The consul client on that node shows an event for the node shutdown as well.

FirstSeen              LastSeen               Count   From                 Type     Reason      Message
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       cluster-autoscaler   Normal   ScaleDown   deleting pod for node scale down
2022-04-16T01:34:08Z   2022-04-16T01:34:08Z   1       kubelet              Normal   Killing     Stopping container consul

I then see the inject controller deleting the ACL token and trying to deregister the service:

consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072847.4007716,"logger":"controller.endpoints","msg":"retrieved","name":"my-app-2","ns":"my-app"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072847.4053361,"logger":"controller.endpoints","msg":"Consul client agent is not ready, skipping deregistration","consul-agent":"consul-client-db45h","svc":"my-app-2"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"debug","ts":1650072847.45133,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/mutate","UID":"c403a733-a85f-4f16-863c-dca1b9502c3b","kind":"/v1, Kind=Pod","resource":{"group":"","version":"v1","resource":"pods"}}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072847.4523847,"logger":"handler.connect","msg":"received pod","name":"","ns":"my-app"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"debug","ts":1650072847.4648118,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/mutate","code":200,"reason":"valid Pod request","UID":"c403a733-a85f-4f16-863c-dca1b9502c3b","allowed":true}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072847.4652326,"logger":"controller.endpoints","msg":"deregistering service from consul","svc":"my-app-2-6d697fcc78-tbx2z-my-app-2"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072847.4778728,"logger":"controller.endpoints","msg":"reconciling ACL tokens for service","svc":"my-app-2"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072848.3142743,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072848.3151798,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"debug","ts":1650072849.7905092,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/mutate","UID":"2f231545-6b7f-4121-b628-887b2660e345","kind":"/v1, Kind=Pod","resource":{"group":"","version":"v1","resource":"pods"}}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"debug","ts":1650072849.7909575,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/mutate","code":200,"reason":"Pod  does not require injection","UID":"2f231545-6b7f-4121-b628-887b2660e345","allowed":true}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.0938203,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"my-app-2-6d697fcc78-tbx2z"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.10548,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"my-app-2-6d697fcc78-tbx2z"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.1123652,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"my-app-2-6d697fcc78-tbx2z"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.1145678,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"my-app-2-6d697fcc78-tbx2z"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.1218631,"logger":"controller.endpoints","msg":"deregistering service from consul","svc":"my-app-2-6d697fcc78-tbx2z-my-app-2-sidecar-proxy"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072850.128808,"logger":"controller.endpoints","msg":"reconciling ACL tokens for service","svc":"my-app-2-sidecar-proxy"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072852.6961482,"logger":"controller.endpoints","msg":"Consul client agent is not ready, skipping deregistration","consul-agent":"consul-client-dffp7","svc":"my-app-2"}

Next it logs some messages referring to the consul client on the node that was just killed.

consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072859.96141,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072859.96151,"logger":"controller.endpoints","msg":"ignoring Consul client pod because it's not ready","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072859.96152,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072859.9615448,"logger":"controller.endpoints","msg":"ignoring Consul client pod because it's not ready","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072860.1644208,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072860.1644871,"logger":"controller.endpoints","msg":"ignoring Consul client pod because it's not ready","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072860.1644964,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072860.1645176,"logger":"controller.endpoints","msg":"ignoring Consul client pod because it's not ready","name":"consul-client-7cwvx"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650072860.1754212,"logger":"controller.endpoints","msg":"received update for Consul client pod","name":"consul-client-7cwvx"}

The new pod my-app-2-6d697fcc78-tc447 has already been bound to a node at this point and is attempting to start. It gets stuck in the init phase because the service is not registered.

FirstSeen              LastSeen               Count   From                 Type      Reason      Message
2022-04-16T01:34:07Z   2022-04-16T01:34:07Z   1       default-scheduler    Normal    Scheduled   Successfully assigned client-replication/my-app-2-6d697fcc78-tc447 to ip-10-0-148-34.us-west-1.compute.internal
2022-04-16T01:34:08Z   2022-04-16T01:34:08Z   1       kubelet              Normal    Pulling     Pulling image "my-app"
2022-04-16T01:34:24Z   2022-04-16T01:34:24Z   1       kubelet              Normal    Pulled      Successfully pulled image "063772531253.dkr.ecr.us-west-1.amazonaws.com/cordial/core:apache-r-20220415.us-west-1.2338850040" in 15.471640969s
2022-04-16T01:34:34Z   2022-04-16T01:34:34Z   1       kubelet              Normal    Created     Created container config-template-copy
2022-04-16T01:34:34Z   2022-04-16T01:34:34Z   1       kubelet              Normal    Started     Started container config-template-copy
2022-04-16T01:34:34Z   2022-04-16T01:34:34Z   1       kubelet              Normal    Pulled      Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/vault:1.8.8" already present on machine
2022-04-16T01:34:34Z   2022-04-16T01:34:34Z   1       kubelet              Normal    Created     Created container vault-agent-init
2022-04-16T01:34:34Z   2022-04-16T01:34:34Z   1       kubelet              Normal    Started     Started container vault-agent-init
2022-04-16T01:34:37Z   2022-04-16T01:34:37Z   1       kubelet              Normal    Pulled      Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/cordial/consul-template:main-c2c65811" already present on machine
2022-04-16T01:34:37Z   2022-04-16T01:34:37Z   1       kubelet              Normal    Created     Created container consul-template-init
2022-04-16T01:34:37Z   2022-04-16T01:34:37Z   1       kubelet              Normal    Started     Started container consul-template-init
2022-04-16T01:34:46Z   2022-04-16T01:34:46Z   1       kubelet              Normal    Pulled      Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/ecr-public/hashicorp/consul:1.10.9" already present on machine
2022-04-16T01:34:46Z   2022-04-16T01:34:46Z   1       kubelet              Normal    Created     Created container copy-consul-bin
2022-04-16T01:34:46Z   2022-04-16T01:34:46Z   1       kubelet              Normal    Started     Started container copy-consul-bin
2022-04-16T01:34:47Z   2022-04-16T01:44:11Z   5       kubelet              Normal    Pulled      Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/cordial/consul-k8s-control-plane:0.42.0" already present on machine
2022-04-16T01:34:47Z   2022-04-16T01:39:02Z   3       kubelet              Normal    Created     Created container consul-connect-inject-init
2022-04-16T01:34:48Z   2022-04-16T01:39:02Z   3       kubelet              Normal    Started     Started container consul-connect-inject-init
2022-04-16T01:38:49Z   2022-04-16T01:49:38Z   14      kubelet              Warning   BackOff     Back-off restarting failed container

It remained in this state for about 16 minutes. I also confirmed the new pod was listed in the endpoints object during this period.

subsets:
- addresses:
  - ip: 10.0.47.177
    nodeName: ip-10-0-55-14.us-west-1.compute.internal
    targetRef:
      kind: Pod
      name: my-app-1-84694468b4-dlvhc
      namespace: my-app
      resourceVersion: "805617845"
      uid: 9f461291-c909-4bc5-8d9e-e07ceeeed2fb

Then I see the controller log the following connection error while trying to talk to the consul agent on the node that no longer exists:

consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"error","ts":1650073807.9558406,"logger":"controller.endpoints","msg":"failed to get service instances","name":"my-app-2","error":"Get \"https://10.0.43.66:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22my-app-2%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22my-app%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.0.33.54:35738->10.0.43.66:8501: read: no route to host","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"error","ts":1650073807.9559186,"logger":"controller.endpoints","msg":"failed to deregister endpoints on all agents","name":"my-app-2","ns":"my-app","error":"Get \"https://10.0.43.66:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22my-app-2%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22my-app%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.0.33.54:35738->10.0.43.66:8501: read: no route to host","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"error","ts":1650073807.9559655,"logger":"controller.endpoints","msg":"Reconciler error","reconciler group":"","reconciler kind":"Endpoints","name":"my-app-2","namespace":"my-app","error":"1 error occurred:\n\t* Get \"https://10.0.43.66:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22my-app-2%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22my-app%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.0.33.54:35738->10.0.43.66:8501: read: no route to host\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227"}

Then roughly 1 minute later the new pod is finally registered.

consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650073863.0606906,"logger":"controller.endpoints","msg":"retrieved","name":"my-app-2","ns":"my-app"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650073863.0609226,"logger":"controller.endpoints","msg":"registering service with Consul","name":"my-app-2","id":"my-app-2-6d697fcc78-tc447-my-app-2","agentIP":"10.0.148.34"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650073863.0728903,"logger":"controller.endpoints","msg":"registering proxy service with Consul","name":"my-app-2-sidecar-proxy"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650073863.0892863,"logger":"controller.endpoints","msg":"updating health check status for service","name":"my-app-2","reason":"Pod \"my-app/my-app-2-6d697fcc78-tc447\" is not ready","status":"critical"}
consul-connect-injector-746c4d5944-427fn sidecar-injector {"level":"info","ts":1650073863.107011,"logger":"controller.endpoints","msg":"updating health check","id":"my-app/my-app-2-6d697fcc78-tc447-my-app-2/kubernetes-health-check"}

It's a pain to convert the unix timestamps in the logs, but if you do you'll see that quite a lot of time passed during this period. Hopefully this is helpful in tracking this down.
It definitely seems like node scale up/down is playing a role.
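
If it helps anyone else reading these logs, here is a tiny throwaway Go helper (my own, not part of consul-k8s) for converting the fractional unix timestamps in the injector's JSON output into UTC times:

package main

import (
	"fmt"
	"time"
)

// tsToUTC converts the fractional unix timestamps in the injector's JSON
// logs (e.g. 1650072847.4007716) into readable UTC times.
func tsToUTC(ts float64) time.Time {
	sec := int64(ts)
	nsec := int64((ts - float64(sec)) * 1e9)
	return time.Unix(sec, nsec).UTC()
}

func main() {
	fmt.Println(tsToUTC(1650072847.4007716)) // 2022-04-16 01:34:07 UTC, the deregistration above
	fmt.Println(tsToUTC(1650073863.0606906)) // about 17 minutes later, when the new pod was finally registered
}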

@dschaaff
Copy link
Author

Quick note that we hit these injection errors in our staging cluster early this morning as well. The timing coincides with several spot terminations forcing nodes to be recycled. There was again a long period where services were not being registered, and again we see these logs in the injector:

2022-04-18T14:55:24.978Z	ERROR	controller.endpoints	failed to get service instances	{"name": "core-internal", "error": "Get \"https://10.20.12.144:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22core-internal%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22core-internal%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.20.12.82:60044->10.20.12.144:8501: read: no route to host"}

@jmurret
Copy link
Contributor

jmurret commented Apr 19, 2022

Hey @dschaaff, thank you for all of the detailed logs and timelines. Please correct me if you think differently, but it sounds like:

  • endpoints controller is unable to connect to agents on client pods that have been terminated (and probably should not be trying to connect to these agents)
  • this gets into a retry loop that causes queueing of new service registrations
  • new service pods are not able to come online for some time until the max retries for contacting the agent on the terminated pod have been hit
  • we'd like to either prevent endpoints controller from contacting agents on terminated pods, or add a check for termination that short-circuits the existing retry loop

I spent a few hours trying to recreate this yesterday. I think there is some element of what you are running into that I am not reproducing, or that maybe only shows up at scale. I tried to recreate it by:

  • deploying a 3 node cluster with my service with 3 replicas
  • making sure that one service and client were deployed to each node
  • killing the node that had my service but did not have the endpoints controller pod so that endpoints controller is not restarted

I tried a few different variations of this, and every time the scenario was handled gracefully; the termination was processed similarly to TestConnectInject_CleanupKilledPods in connect_inject_test.go. I am unsure what the sequence of events or the state of the client, service, and node needs to be (or, better said, how to recreate this) for this retry loop to occur. If you have any thoughts, please let me know.

@dschaaff
Copy link
Author

I ran into this condition today just by doing a deployment of a service. Unfortunately I'm fighting with our ELK stack right now to get the connect-injector logs, but here is the kubectl events output for the pod in question.

FirstSeen              LastSeen               Count   From                Type      Reason      Message
2022-04-19T16:32:10Z   2022-04-19T16:32:10Z   1       default-scheduler   Normal    Scheduled   Successfully assigned my-app/my-app-dc6f6f7f6-qs5q6 to ip-10-0-154-100.us-west-1.compute.internal
2022-04-19T16:32:11Z   2022-04-19T16:32:11Z   1       kubelet             Normal    Pulled      Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/cordial/core:cli-r-20220419.us-west-1.2347772239" already present on machine
2022-04-19T16:32:11Z   2022-04-19T16:32:11Z   1       kubelet             Normal    Created     Created container config-template-copy
2022-04-19T16:32:11Z   2022-04-19T16:32:11Z   1       kubelet             Normal    Started     Started container config-template-copy
2022-04-19T16:32:12Z   2022-04-19T16:32:12Z   1       kubelet             Normal    Pulled      Container image "vault:1.8.8" already present on machine
2022-04-19T16:32:12Z   2022-04-19T16:32:12Z   1       kubelet             Normal    Created     Created container vault-agent-init
2022-04-19T16:32:12Z   2022-04-19T16:32:12Z   1       kubelet             Normal    Started     Started container vault-agent-init
2022-04-19T16:32:15Z   2022-04-19T16:32:15Z   1       kubelet             Normal    Pulled      Container image "0consul-template:main-c2c65811" already present on machine
2022-04-19T16:32:15Z   2022-04-19T16:32:15Z   1       kubelet             Normal    Created     Created container consul-template-init
2022-04-19T16:32:15Z   2022-04-19T16:32:15Z   1       kubelet             Normal    Started     Started container consul-template-init
2022-04-19T16:32:25Z   2022-04-19T16:32:25Z   1       kubelet             Normal    Pulled      Container image "/consul:1.10.9" already present on machine
2022-04-19T16:32:25Z   2022-04-19T16:32:25Z   1       kubelet             Normal    Created     Created container copy-consul-bin
2022-04-19T16:32:25Z   2022-04-19T16:32:25Z   1       kubelet             Normal    Started     Started container copy-consul-bin
2022-04-19T16:32:26Z   2022-04-19T16:39:07Z   4       kubelet             Normal    Pulled      Container image "consul-k8s-control-plane:0.42.0" already present on machine
2022-04-19T16:32:26Z   2022-04-19T16:39:07Z   4       kubelet             Normal    Created     Created container consul-connect-inject-init
2022-04-19T16:32:26Z   2022-04-19T16:36:41Z   3       kubelet             Normal    Started     Started container consul-connect-inject-init
2022-04-19T16:36:29Z   2022-04-19T16:38:54Z   3       kubelet             Warning   BackOff     Back-off restarting failed container
2022-04-19T16:42:41Z   2022-04-19T16:42:41Z   1       kubelet             Normal    Pulled      Container image "vault:1.8.8" already present on machine

I'm disabling our restart cron for the day to try and recreate the issue and grab another set of detailed logs.


endpoints controller is unable to connect to agents on client pods that have been terminated (and probably should not be trying to connect to these agents)

This does seem to be happening quite regularly.

Get "https://10.0.138.3:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22my-appa%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%my-app%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22": dial tcp 10.0.138.3:8501: i/o timeout
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

this gets into a retry loop that causes queueing of new service registrations

Is there a way for me to verify this? Any way to introspect the queue?

new service pods are not able to come online for some time until the max retries for contacting the agent on the terminated pod have been hit

What is the max retry count for this?


Scale definitely seems to be a factor at play. Our staging cluster has only hit this error a handful of times. In production we hit it quite regularly, especially without the cronjob to restart the endpoints controller.

@dschaaff
Copy link
Author

We just hit the error again in our staging cluster. A new pod was created and stuck in the init phase
for about 10 minutes.

FirstSeen              LastSeen               Count   From                Type      Reason      Message
2022-04-19T18:24:29Z   2022-04-19T18:24:29Z   1       default-scheduler   Normal    Scheduled   Successfully assigned my-app/my-app-da-7bbf4c45fd-phdfw to ip-10-20-26-109.us-west-2.compute.internal
2022-04-19T18:24:29Z   2022-04-19T18:24:29Z   1       kubelet             Normal    Pulling     Pulling image "my-app"
2022-04-19T18:24:53Z   2022-04-19T18:24:53Z   1       kubelet             Normal    Pulled      Successfully pulled image "my-app" in 23.83703243s
2022-04-19T18:24:57Z   2022-04-19T18:24:57Z   1       kubelet             Normal    Created     Created container config-template-copy
2022-04-19T18:24:58Z   2022-04-19T18:24:58Z   1       kubelet             Normal    Started     Started container config-template-copy
2022-04-19T18:24:58Z   2022-04-19T18:24:58Z   1       kubelet             Normal    Pulling     Pulling image "vault:1.8.8"
2022-04-19T18:25:01Z   2022-04-19T18:25:01Z   1       kubelet             Normal    Pulled      Successfully pulled image "vault:1.8.8" in 3.478759515s
2022-04-19T18:25:02Z   2022-04-19T18:25:02Z   1       kubelet             Normal    Created     Created container vault-agent-init
2022-04-19T18:25:02Z   2022-04-19T18:25:02Z   1       kubelet             Normal    Started     Started container vault-agent-init
2022-04-19T18:25:05Z   2022-04-19T18:25:05Z   1       kubelet             Normal    Pulling     Pulling image "cordial/consul-template:main-c2c65811"
2022-04-19T18:25:09Z   2022-04-19T18:25:09Z   1       kubelet             Normal    Pulled      Successfully pulled image "cordial/consul-template:main-c2c65811" in 4.001146143s
2022-04-19T18:25:09Z   2022-04-19T18:25:09Z   1       kubelet             Normal    Created     Created container consul-template-init
2022-04-19T18:25:09Z   2022-04-19T18:25:09Z   1       kubelet             Normal    Started     Started container consul-template-init
2022-04-19T18:25:17Z   2022-04-19T18:25:17Z   1       kubelet             Normal    Pulled      Container image "ecr-public/hashicorp/consul:1.10.9" already present on machine
2022-04-19T18:25:17Z   2022-04-19T18:25:17Z   1       kubelet             Normal    Created     Created container copy-consul-bin
2022-04-19T18:25:17Z   2022-04-19T18:25:17Z   1       kubelet             Normal    Started     Started container copy-consul-bin
2022-04-19T18:25:18Z   2022-04-19T18:25:18Z   1       kubelet             Normal    Pulling     Pulling image "cordial/consul-k8s-control-plane:0.42.0"
2022-04-19T18:25:20Z   2022-04-19T18:25:20Z   1       kubelet             Normal    Pulled      Successfully pulled image "cordial/consul-k8s-control-plane:0.42.0" in 2.140288814s
2022-04-19T18:25:20Z   2022-04-19T18:29:35Z   3       kubelet             Normal    Created     Created container consul-connect-inject-init
2022-04-19T18:25:21Z   2022-04-19T18:29:35Z   3       kubelet             Normal    Started     Started container consul-connect-inject-init
2022-04-19T18:27:21Z   2022-04-19T18:29:35Z   2       kubelet             Normal    Pulled      Container image "cordial/consul-k8s-control-plane:0.42.0" already present on machine
2022-04-19T18:29:23Z   2022-04-19T18:34:40Z   7       kubelet             Warning   BackOff     Back-off restarting failed container

The endpoints controller did not register the service until after that long period had elapsed.

consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.948Z	INFO	controller.endpoints	registering service with Consul	{"name": "my-app-da", "id": "my-app-da-7bbf4c45fd-phdfw-my-app-da", "agentIP": "10.20.26.109"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.950Z	INFO	controller.endpoints	registering proxy service with Consul{"name": "my-app-da-sidecar-proxy"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.953Z	INFO	controller.endpoints	updating health check status for service	{"name": "my-app-da", "reason": "Kubernetes health checks passing", "status": "passing"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.954Z	INFO	controller.endpoints	updating health check	{"id": "my-app/my-app-da-7bbf4c45fd-phdfw-my-app-da/kubernetes-health-check"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.961Z	INFO	controller.endpoints	registering service with Consul	{"name": "my-app-da", "id": "my-app-da-7bbf4c45fd-9g66w-my-app-da", "agentIP": "10.20.19.59"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.962Z	INFO	controller.endpoints	registering proxy service with Consul{"name": "my-app-da-sidecar-proxy"}
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:28.966Z	INFO	controller.endpoints	updating health check status for service	{"name": "my-app-da", "reason": "Kubernetes health checks passing", "status": "passing"}

Prior to it registering the service, I again see errors from it attempting to connect to a non-existent host. I'm also surprised by this particular error because the external-dns pods are not annotated
with consul.hashicorp.com/connect-inject: true.

consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:20.762Z	ERROR	controller.endpoints	failed to get service instances	{"name": "external-dns", "error": "Get \"https://10.20.14.50:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.20.9.234:35676->10.20.14.50:8501: read: no route to host"}
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:20.762Z	ERROR	controller.endpoints	failed to deregister endpoints on all agents	{"name": "external-dns", "ns": "external-dns", "error": "Get \"https://10.20.14.50:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.20.9.234:35676->10.20.14.50:8501: read: no route to host"}
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
consul-connect-injector-766757665d-v87zw sidecar-injector 2022-04-19T18:39:20.762Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "external-dns", "namespace": "external-dns", "error": "1 error occurred:\n\t* Get \"https://10.20.14.50:8501/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22external-dns%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": read tcp 10.20.9.234:35676->10.20.14.50:8501: read: no route to host\n\n"}
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
consul-connect-injector-766757665d-v87zw sidecar-injector sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
consul-connect-injector-766757665d-v87zw sidecar-injector 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

The host client pod 10.20.14.50 was running on an instance that was removed by the cluster autoscaler:

Fast evaluation: node ip-10-20-31-124.us-west-2.compute.internal may be removed
Successfully added DeletionCandidateTaint on node ip-10-20-31-124.us-west-2.compute.internal

Other agents in the cluster logged:

2022-04-19T18:23:07.602Z [INFO] agent.client.serf.lan: serf: EventMemberFailed: ip-10-20-31-124.us-west-2.compute.internal 10.20.14.50

It was eventually reaped:

2022-04-19T18:38:08.126Z [INFO] agent.client.serf.lan: serf: EventMemberReap: ip-10-20-31-124.us-west-2.compute.internal

advertise_reconnect_timeout is set to 15m in this cluster.

This follows the same pattern, where the service only gets registered after I see errors connecting to nodes that no longer exist. Every occurrence so far seems to follow it.
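
For context on what those failed GETs correspond to, here is a rough standalone sketch (my own illustration using the hashicorp/consul/api package, with a made-up agent address and token) of the per-agent query the controller performs during deregistration; when the target node has already been removed, this is the call that sits waiting on the dial:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical values standing in for one of the client pod IPs and
	// an ACL token from the logs above.
	cfg := api.DefaultConfig()
	cfg.Address = "10.20.14.50:8501"
	cfg.Scheme = "https"
	cfg.Token = "<acl-token>"
	cfg.TLSConfig = api.TLSConfig{InsecureSkipVerify: true} // ad-hoc check only

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Same style of filter the endpoints controller uses when it lists a
	// service's instances on each agent.
	filter := `Meta["k8s-service-name"] == "external-dns" and ` +
		`Meta["k8s-namespace"] == "external-dns" and ` +
		`Meta["managed-by"] == "consul-k8s-endpoints-controller"`

	// With no HTTP timeout configured, this request can block for a long
	// time when the agent's node no longer exists.
	services, err := client.Agent().ServicesWithFilter(filter)
	if err != nil {
		log.Fatal(err)
	}
	for id := range services {
		fmt.Println("registered:", id)
	}
}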

@jmurret
Copy link
Contributor

jmurret commented Apr 19, 2022

Hi @dschaaff, we've been talking internally. It seems the HTTP client that endpoints controller uses to connect to the clients is the default Go client with no timeout configured. We believe this, rather than retries, is the culprit, because when endpoints controller gets an endpoints entry it looks up the clients that are currently available, so there should not be a retry scenario that does not self-correct the next time that endpoints record is processed.
I am pulling together some changes to set a timeout of 2 seconds and can provide this in a docker image that can be specified in your helm values. We are having a tough time recreating this here. Would you have an environment where you could pilot this and help us verify?

(This would be a temporary image to verify and we'd have to release the code changes in a following patch release.)
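
To make the "default client with no timeout" point concrete, here is a small self-contained sketch (illustrative only, reusing one of the dead-node addresses from the logs above) of the difference a request timeout makes when the target host has been removed:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func probe(c *http.Client, label, url string) {
	start := time.Now()
	resp, err := c.Get(url)
	if err == nil {
		resp.Body.Close()
	}
	fmt.Printf("%s: err=%v after %s\n", label, err, time.Since(start))
}

func main() {
	// An agent address on a node that no longer exists. Depending on how the
	// network handles the dead IP, the dial either fails fast ("no route to
	// host") or hangs until the OS gives up ("i/o timeout"), which can take
	// minutes when no client timeout is set.
	url := "https://10.0.43.66:8501/v1/agent/services"

	probe(http.DefaultClient, "default client (Timeout=0)", url)
	probe(&http.Client{Timeout: 2 * time.Second}, "bounded client", url)
}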

@dschaaff
Copy link
Author

Yes, I'd be happy to pilot it in our environments.

@dschaaff
Copy link
Author

Just for my own benefit, is the http client here the one in question https://github.com/hashicorp/consul/blob/main/api/api.go#L739 ?

@jmurret
Copy link
Contributor

jmurret commented Apr 20, 2022

Yes, we use the function above it, which calls the one you linked. It allows us to pass in an httpClient in the config: it will use that if it is there, otherwise it will create a new one, setting only the transport. We'll have to do some slight refactoring in Consul to accommodate the change in Consul on Kubernetes. I'll link draft PRs here for your reference when I send over instructions with the docker image to use in the values file.
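
For reference, a minimal sketch of what passing a bounded client through the api config can look like (my own approximation with a hypothetical agent address, not the actual consul-k8s patch):

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/hashicorp/consul/api"
)

// newBoundedClient builds a Consul API client whose requests are capped at
// two seconds. Because api.Config accepts an HttpClient, the api package
// uses it as-is instead of constructing its own untimed client.
func newBoundedClient(addr, token string) (*api.Client, error) {
	cfg := api.DefaultConfig()
	cfg.Address = addr
	cfg.Token = token
	// Note: a client-wide timeout like this would also bound blocking
	// queries, so in real code the timeout belongs on the paths that talk
	// to individual agents, not on long-running watches.
	cfg.HttpClient = &http.Client{Timeout: 2 * time.Second}
	return api.NewClient(cfg)
}

func main() {
	client, err := newBoundedClient("10.0.148.34:8500", "<acl-token>")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := client.Agent().Services(); err != nil {
		log.Fatal(err)
	}
}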

@jmurret
Copy link
Contributor

jmurret commented Apr 20, 2022

@dschaaff This PR has the changes, the link to the changes in Consul, and also a note on how to set the image to test this out. Please reach out with questions, issues, etc. Thank you for trying this out and providing such helpful information to get this far... and also for your patience :).

@dschaaff
Copy link
Author

@jmurret Thank you so much for the help on this. I'm deploying the image now and will report back.

@dschaaff
Copy link
Author

My initial feedback on this change is positive. I've deployed the container in 2 different clusters.

In the stg cluster I did a rolling replace of the kubernetes nodes. I did not encounter any errors during this operation.

After running it in the stg cluster for a few hours, I went ahead and deployed it to the production cluster. I also disabled the cronjob that was restarting the controller every 3 minutes. Since deploying the updated controller in that environment, the cluster autoscaler has removed 43 nodes in scale-down operations. We have not experienced any service registration delays thus far. During this period of scale-down I saw the timeout being hit and logged by the controller quite frequently, for example:

{"level":"error","ts":1650488943.5005586,"logger":"controller.endpoints","msg":"failed to deregister endpoints on all agents","name":"my-app","ns":"my-app","error":"failed to get a list of tokens from Consul: context deadline exceeded (Client.Timeout or context cancellation while reading body)","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227"}

I'm going to leave this setup in production and will report back if the service registration delay comes up at all. So far I am optimistic that this fixes our issue.

@jmurret jmurret self-assigned this Apr 20, 2022
@jmurret
Copy link
Contributor

jmurret commented Apr 20, 2022

That's great to hear. Please keep us posted.

@dschaaff
Copy link
Author

It's been almost a full 24 hours since I deployed the change, and we haven't had a single issue. At this point I'd say this has addressed the problem. Thank you so much for your effort looking into this and fixing it. We will continue to run the custom build until the official release.

@jmurret jmurret changed the title Endpoints Controller Not Registering All Pods Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out May 11, 2022
jmurret added a commit that referenced this issue May 11, 2022
jmurret added a commit that referenced this issue May 16, 2022
* Adding GH-714 as bug fix to change log.

* Update CHANGELOG.md

Co-authored-by: Luke Kysow <1034429+lkysow@users.noreply.github.com>
@david-yu
Copy link
Contributor

Hi @dschaaff I'll go ahead and close the issue, as 0.44.0 now provides the fix that @jmurret made in #1178. If the issue is not resolved, let us know. Thank you for your patience on this one.

@dschaaff
Copy link
Author

Thanks again for all the work on this!
