Kubernetes PreStop hook failing to deregister ACLs and service instances #540

Closed
pedrohdz opened this issue Jun 18, 2021 · 8 comments · Fixed by #571
Labels
area/connect: Related to Connect service mesh, e.g. injection
type/bug: Something isn't working

Comments

@pedrohdz

Overview of the Issue

ACLs created by Consul Connect Inject are not being deleted after a Kubernetes pod is deleted. The issue seems to be that the envoy-sidecar preStop hook is failing. As far as I can tell, neither consul services deregister nor consul logout is executed.

The service instance in Consul does eventually get deleted: the consul-connect-injector-webhook pod notices the missing pod and deregisters the instance, but the ACL token remains.

Looking at the Kubernetes events with kubectl get events --all-namespaces -w shows:

pedher-server-delete   0s          Warning   FailedPreStopHook   pod/static-server-7ffbbfb96f-jr7jf    Exec lifecycle hook ([/bin/sh -ec /consul/connect-inject/consul services deregister \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"]) for Container "envoy-sidecar" in Pod "static-server-7ffbbfb96f-jr7jf_pedher-server-delete(adf7afbf-7358-498f-88c1-c875351f010e)" failed - error: command '/bin/sh -ec /consul/connect-inject/consul services deregister \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"' exited with 1: Error registering service "static-server": Put "https://10.31.236.67:8501/v1/agent/service/deregister/static-server-7ffbbfb96f-jr7jf-static-server": dial tcp 10.31.236.67:8501: i/o timeout
, message: "Error registering service \"static-server\": Put \"https://10.31.236.67:8501/v1/agent/service/deregister/static-server-7ffbbfb96f-jr7jf-static-server\": dial tcp 10.31.236.67:8501: i/o timeout\n"
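
The dial tcp 10.31.236.67:8501: i/o timeout says the hook could not reach the client agent over the host IP. As a sanity check from a still-running pod, any agent-bound CLI call exercises the same HTTPS endpoint the hook times out on (a hedged example; it relies on the Consul HTTP address environment the injector already sets in the sidecar, and <pod-name> is a placeholder):

$ kubectl --namespace=pedher-server-delete exec -ti <pod-name> -c envoy-sidecar -- \
    /consul/connect-inject/consul members -token-file="/consul/connect-inject/acl-token"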

This is a bit of a side note... if terminationGracePeriodSeconds on the pod is not long enough, the hook is killed and exits with code 137.

K8s event error if terminationGracePeriodSeconds not set
pedher-server-delete   0s          Warning   FailedPreStopHook   pod/static-server-855444b4d7-c5bgp    Exec lifecycle hook ([/bin/sh -ec /consul/connect-inject/consul services deregister \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"]) for Container "envoy-sidecar" in Pod "static-server-855444b4d7-c5bgp_pedher-server-delete(7d0cda43-8dc3-43b6-8da8-8f04f7a8a4b7)" failed - error: command '/bin/sh -ec /consul/connect-inject/consul services deregister
 \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"' exited with 137: , message: ""

Exec-ing into the envoy-sidecar container and running the preStop commands manually works just fine; both the service instance and the ACL are deleted. You do then start getting a bunch of other errors about agent.client: RPC failed to server: method=Catalog.Deregister server=10.31.236.71:8300 error="rpc error making call: ACL not found", which I won't get into here.

$ kubectl exec -ti static-server-855444b4d7-fdnsj -c envoy-sidecar -- /consul/connect-inject/consul services deregister -token-file="/consul/connect-inject/acl-token" /consul/connect-inject/service.hcl
Deregistered service: static-server-855444b4d7-fdnsj-static-server
Deregistered service: static-server-855444b4d7-fdnsj-static-server-sidecar-proxy

$ kubectl exec -ti static-server-855444b4d7-fdnsj -c envoy-sidecar -- /consul/connect-inject/consul logout -token-file="/consul/connect-inject/acl-token"

The preStop hook in question is:

    name: envoy-sidecar
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -ec
          - |-
            /consul/connect-inject/consul services deregister \
              -token-file="/consul/connect-inject/acl-token" \
              /consul/connect-inject/service.hcl
            /consul/connect-inject/consul logout \
              -token-file="/consul/connect-inject/acl-token"

Reproduction Steps

I created a test cluster based on the Secure Consul and Registered Services on Kubernetes tutorial and reproduced the issue there to keep things as simple as possible.

Steps to reproduce this issue:

  1. Create your gossip key with kubectl create secret generic consul-gossip-encryption-key --from-literal=key=$(consul keygen).

  2. Create a cluster via Helm with helm install -f helm-consul-h0-values.yaml demo-0 hashicorp/consul --wait:

    ---
    global:
      name: consul
      image: hashicorp/consul:1.9.6
      domain: consul.i.avinode.com
      datacenter: demo-h0
    
      tls:
        enabled: true
        enableAutoEncrypt: true
        verify: true
    
      gossipEncryption:
        secretName: consul-gossip-encryption-key
        secretKey: key
    
      acls:
        manageSystemACLs: true
        # createReplicationToken: true
    
    ui:
      service:
        annotations: |
          service.beta.kubernetes.io/azure-load-balancer-internal: 'true'
        type: LoadBalancer
        additionalSpec: |
          loadBalancerIP: 10.31.237.250
    
    connectInject:
      enabled: true
    controller:
      enabled: true
    ...
    
  3. Create the namespace: kubectl create namespace pedher-server-delete

  4. Deploy the test service with kubectl --namespace=pedher-server-delete apply -f static-server.yaml:

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: static-server
      namespace: pedher-server-delete
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: static-server
      namespace: pedher-server-delete
    spec:
      selector:
        app: static-server
      ports:
        - port: 1234
          targetPort: 8080
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: static-server
      name: static-server
      namespace: pedher-server-delete
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: static-server
      template:
        metadata:
          namespace: pedher-server-delete
          annotations:
            consul.hashicorp.com/connect-inject: 'true'
          labels:
            app: static-server
        spec:
          terminationGracePeriodSeconds: 120
          containers:
            - name: static-server
              image: hashicorp/http-echo:latest
              args:
                - -text="hello world"
                - -listen=:8080
              ports:
                - containerPort: 8080
                  name: http
          serviceAccountName: static-server
    ...
    
  5. Watch the events on the K8s cluster in a separate terminal: kubectl get events --all-namespaces -w

  6. Delete the pod with kubectl --namespace=pedher-server-delete delete pods -l 'app=static-server' and wait for the error to pop up in the K8s events. It might take a minute. The leaked token can then be confirmed directly, as shown below.
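
Once the FailedPreStopHook event fires, the leaked token can be confirmed from a server pod (hedged: the bootstrap token secret name below follows from global.name: consul in the Helm values above and may differ in other setups):

$ kubectl exec -ti consul-server-0 -- consul acl token list \
    -token="$(kubectl get secret consul-bootstrap-acl-token -o jsonpath='{.data.token}' | base64 -d)"

Tokens created via the Kubernetes auth method for the deleted static-server pod should still show up in the list.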

Consul info for both Client and Server

This output is suspect...

Client info
$ kubectl exec -ti static-server-855444b4d7-fdnsj -c envoy-sidecar -- /consul/connect-inject/consul info  -token-file="/consul/connect-inject/acl-token"
Error querying agent: Unexpected response code: 403 (Permission denied)
command terminated with exit code 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = bbcbb733
	version = 1.9.6
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.31.236.71:8300
	server = true
raft:
	applied_index = 9991
	commit_index = 9991
	fsm_pending = 0
	last_contact = 1.973439ms
	last_log_index = 9991
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:cae1293d-2e1a-5f15-a3a4-f34240d9e86d Address:10.31.236.18:8300} {Suffrage:Voter ID:5afe13d6-adf4-9b02-a11e-28d96581f5f8 Address:10.31.236.59:8300} {Suffrage:Voter ID:ec88fd17-b382-36c8-b135-c1fd28a06f24 Address:10.31.236.71:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 110
	max_procs = 2
	os = linux
	version = go1.15.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

Running on Azure AKS.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"6b3f9b283463c1d5a2455df301182805e65c7145", GitTreeState:"clean", BuildDate:"2021-05-19T22:28:47Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
$ helm --namespace=pedher-consul-server-1 list
NAME    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART           APP VERSION
demo-0  pedher-consul-server-1  1               2021-06-17 16:34:10.162353 +0200 CEST   deployed        consul-0.31.1   1.9.4

$ helm version
version.BuildInfo{Version:"v3.5.4", GitCommit:"1b5edb69df3d3a08df77c9902dc17af864ff05d1", GitTreeState:"clean", GoVersion:"go1.15.11"}

Log Fragments

Connect injector logs via kubectl logs -l 'app=consul,component=connect-injector' -f:

2021-06-18T08:31:31.960Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-06-18T08:33:44.389Z [INFO]  cleanupResource: found service instance from terminated pod still registered: pod=static-server-7ffbbfb96f-pvkjk id=static-server-7ffbbfb96f-pvkjk-static-server ns=
2021-06-18T08:33:44.414Z [INFO]  cleanupResource: service instance deregistered: id=static-server-7ffbbfb96f-pvkjk-static-server ns=
2021-06-18T08:33:44.416Z [INFO]  cleanupResource: found service instance from terminated pod still registered: pod=static-server-7ffbbfb96f-pvkjk id=static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy ns=
2021-06-18T08:33:44.455Z [INFO]  cleanupResource: service instance deregistered: id=static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy ns=

Consul server logs via kubectl logs -l 'app=consul,component=server' -f: Nothing in the logs

Consul client logs via kubectl logs -l 'app=consul,component=client' -f:

2021-06-18T08:31:36.420Z [INFO]  agent: Synced service: service=static-server-7ffbbfb96f-6zxhz-static-server
2021-06-18T08:31:36.589Z [INFO]  agent: Synced service: service=static-server-7ffbbfb96f-6zxhz-static-server-sidecar-proxy
2021-06-18T08:31:42.611Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-6zxhz-static-server-sidecar-proxy:2
2021-06-18T08:31:42.634Z [INFO]  agent: Synced check: check=pedher-server-delete/static-server-7ffbbfb96f-6zxhz-static-server/kubernetes-health-check
2021-06-18T08:31:42.648Z [INFO]  agent: Synced check: check=pedher-server-delete/static-server-7ffbbfb96f-6zxhz-static-server/kubernetes-health-check
2021-06-18T08:31:42.665Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-6zxhz-static-server-sidecar-proxy:2
2021-06-18T08:31:43.722Z [INFO]  agent: Synced check: check=pedher-server-delete/static-server-7ffbbfb96f-6zxhz-static-server/kubernetes-health-check
2021-06-18T08:31:43.738Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-6zxhz-static-server-sidecar-proxy:2
2021-06-18T08:31:45.075Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-6zxhz-static-server-sidecar-proxy:1
2021-06-18T08:31:47.463Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:31:47.463Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:31:47.480Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:32:07.464Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:32:07.464Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:32:27.464Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:32:27.464Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:32:47.465Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:32:47.465Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:33:07.465Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:33:07.465Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:33:27.465Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:33:27.466Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1
2021-06-18T08:33:44.413Z [INFO]  agent: Deregistered service: service=static-server-7ffbbfb96f-pvkjk-static-server
2021-06-18T08:33:44.413Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service 'static-server-7ffbbfb96f-pvkjk-static-server' not in agent state" index=1
2021-06-18T08:33:44.442Z [INFO]  agent: Synced check: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:2
2021-06-18T08:33:44.442Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service 'static-server-7ffbbfb96f-pvkjk-static-server' not in agent state" index=1
2021-06-18T08:33:44.442Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service 'static-server-7ffbbfb96f-pvkjk-static-server' not in agent state" index=1
2021-06-18T08:33:44.443Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service 'static-server-7ffbbfb96f-pvkjk-static-server' not in agent state" index=1
2021-06-18T08:33:44.454Z [INFO]  agent: Deregistered service: service=static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy
2021-06-18T08:33:47.466Z [WARN]  agent: Check socket connection failed: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1 error="dial tcp 10.31.236.77:20000: i/o timeout"
2021-06-18T08:33:47.466Z [WARN]  agent: Check is now critical: check=service:static-server-7ffbbfb96f-pvkjk-static-server-sidecar-proxy:1

Kubernetes events kubectl get events --all-namespaces -w:

pedher-server-delete     0s          Warning   FailedPreStopHook        pod/static-server-7ffbbfb96f-pvkjk                                  Exec lifecycle hook ([/bin/sh -ec /consul/connect-inject/consul services deregister \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"]) for Container "envoy-sidecar" in Pod "static-server-7ffbbfb96f-pvkjk_pedher-server-delete(62d042c9-6171-4f0c-8d91-2a9e6534c2df)" failed - error: command '/bin/sh -ec /consul/connect-inject/consul services deregister \
  -token-file="/consul/connect-inject/acl-token" \
  /consul/connect-inject/service.hcl
/consul/connect-inject/consul logout \
  -token-file="/consul/connect-inject/acl-token"' exited with 1: Error registering service "static-server": Put "https://10.31.236.67:8501/v1/agent/service/deregister/static-server-7ffbbfb96f-pvkjk-static-server": dial tcp 10.31.236.67:8501: i/o timeout
, message: "Error registering service \"static-server\": Put \"https://10.31.236.67:8501/v1/agent/service/deregister/static-server-7ffbbfb96f-pvkjk-static-server\": dial tcp 10.31.236.67:8501: i/o timeout\n"
@billyshambrook

This also happens in an AWS EKS cluster.

@kschoche
Contributor

Hi! Are either of you able to confirm whether the issue still persists using the latest release (Consul 1.10, consul-k8s 0.26, consul-helm 0.32)? We've made some pretty major changes to how services are managed. If you're able to try, please have a look at the changelog first!

@david-yu david-yu added the waiting-reply Waiting on the issue creator for a response before taking further action label Jun 25, 2021
@lkysow
Member

lkysow commented Jun 25, 2021

Upon further investigation, we think the service de-registration will be fixed in consul-helm 0.32.0 with Consul 1.10 but we are still leaking the ACL tokens.

@lkysow lkysow added area/connect Related to Connect service mesh, e.g. injection and removed waiting-reply Waiting on the issue creator for a response before taking further action labels Jun 25, 2021
@pedrohdz
Author

@kschoche,
I upgraded to the 0.32.0 Helm chart with Consul 1.10.0 and was able to confirm what @lkysow mentioned. The service instance is de-registering, but the token remains. :-(

@pedrohdz
Author

So I did a little more checking, and it seems that the connect-injector is "deregistering service from consul", but I do not see anything in the logs about deleting the token.

2021-06-29T13:03:57.107Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "pedher-server-delete"}
2021-06-29T13:03:57.123Z        INFO    controller.endpoints    deregistering service from consul       {"svc": "static-server-7ffbbfb96f-wchnk-static-server"}
2021-06-29T13:03:57.150Z        INFO    controller.endpoints    deregistering service from consul       {"svc": "static-server-7ffbbfb96f-wchnk-static-server-sidecar-proxy"}
2021-06-29T13:03:57.299Z        INFO    handler.connect received pod    {"name": "", "ns": ""}
2021-06-29T13:03:58.923Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "pedher-server-delete"}
2021-06-29T13:03:58.930Z        INFO    controller.endpoints    registering service with Consul {"name": "static-server", "id": "static-server-7ffbbfb96f-84hb9-static-server", "agentIP": "10.31.236.98"}
2021-06-29T13:03:58.962Z        INFO    controller.endpoints    registering proxy service with Consul   {"name": "static-server-sidecar-proxy"}
2021-06-29T13:03:59.010Z        INFO    controller.endpoints    updating health check status for service        {"name": "static-server", "reason": "Pod \"pedher-server-delete/static-server-7ffbbfb96f-84hb9\" is not ready", "status": "critical"}
2021-06-29T13:03:59.088Z        INFO    controller.endpoints    updating health check   {"id": "pedher-server-delete/static-server-7ffbbfb96f-84hb9-static-server/kubernetes-health-check"}
2021-06-29T13:04:40.004Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "pedher-server-delete"}
2021-06-29T13:04:40.008Z        INFO    controller.endpoints    registering service with Consul {"name": "static-server", "id": "static-server-7ffbbfb96f-84hb9-static-server", "agentIP": "10.31.236.98"}
2021-06-29T13:04:40.019Z        INFO    controller.endpoints    registering proxy service with Consul   {"name": "static-server-sidecar-proxy"}
2021-06-29T13:04:40.046Z        INFO    controller.endpoints    updating health check status for service        {"name": "static-server", "reason": "Kubernetes health checks passing", "status": "passing"}
2021-06-29T13:04:40.048Z        INFO    controller.endpoints    updating health check   {"id": "pedher-server-delete/static-server-7ffbbfb96f-84hb9-static-server/kubernetes-health-check"}

@thisisnotashwin
Contributor

@pedrohdz you're right. We do have something set up in our backlog to get the connect injector to also delete the token once the service has been de-registered. Thanks so much for bringing this up.

@pedrohdz
Author

pedrohdz commented Jul 1, 2021

@thisisnotashwin
Any rough estimate of when it might be fixed? Having many dead tokens left around when there is high pod turnover is not ideal. 😅

@thisisnotashwin
Contributor

@pedrohdz I hear ya, bud. I don't want to make specific time estimates, but we will try to get this prioritized for our next release.

ishustava added a commit that referenced this issue Jul 29, 2021
…istered (#571)

Fixes #540

* Modify endpoints controller to delete ACL tokens for each service instance that it deregisters
* Remove TLS+ACLs table tests from the endpoints controller tests. These tests verified that the endpoints controller works with a client configured for TLS and ACLs. I felt they were unnecessary because no code in the controller behaves differently when the Consul client is configured with either of those, so there was no way for the tests to fail. Tests for the new ACL logic are there, but they only exercise the logic that was added, and they configure the test agent to accommodate that.
* Create a test package under helper and move the GenerateServerCerts function there from subcommand/common, because it is used outside of subcommand.
* Create a helper test function to set up auth methods and refactor existing connect-init command tests to use that function.
* Minor editing fixes of comments etc.
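
For anyone on a version without this fix, a leaked token can be cleaned up by hand: find the AccessorID of the token whose description references the dead pod, then delete it. A hypothetical sketch (it assumes a management token in CONSUL_HTTP_TOKEN, and the exact description format of login tokens may vary):

$ consul acl token list | grep -B 1 'static-server-7ffbbfb96f-jr7jf'
$ consul acl token delete -id <accessor-id-from-above>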