Ingress controller did not remove targets from target group even when pods were deleted #1131

Open · a6kme opened this issue Jan 15, 2020 · 13 comments

@a6kme commented Jan 15, 2020

Hey,

Today during a deployment, all of our pods were upgraded to newer pods, but the ingress controller did not remove the old targets and add the new targets to the target group, leading to an outage on our platform. I had to kill the existing alb-ingress-controller pod; the new pod then performed bulk add/remove operations, bringing the system back up.

My current deployment spec

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: "2019-10-18T07:06:04Z"
  generation: 4
  labels:
    app.kubernetes.io/name: alb-ingress-controller
  name: alb-ingress-controller
  namespace: kube-system
  resourceVersion: "24312289"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/alb-ingress-controller
  uid: bfd406a4-f175-11e9-8ab4-0a8f42fc94d4
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: alb-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: alb-ingress-controller
    spec:
      containers:
      - args:
        - --watch-namespace=pushowl-backend
        - --ingress-class=alb
        - --cluster-name=pushowl-prod
        image: docker.io/amazon/aws-alb-ingress-controller:v1.1.4
        imagePullPolicy: IfNotPresent
        name: alb-ingress-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alb-ingress-controller
      serviceAccountName: alb-ingress-controller
      terminationGracePeriodSeconds: 30

Tail of the logs from the running pod, which did not trigger any target add/remove operations:

I0115 04:38:38.675563       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 04:38:38.675576       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 2 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 04:38:38.700499       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 2 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 04:38:38.717394       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/85a58dd3cc19e932
I0115 04:38:38.740799       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
W0115 04:45:29.750439       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24297318 (24297906)
W0115 04:58:32.755131       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24300146 (24300476)
W0115 05:11:23.759743       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24302702 (24303025)
I0115 05:25:02.062652       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:25:02.103377       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 05:25:02.103395       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 2 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:25:02.137018       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 2 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 05:25:02.164946       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/85a58dd3cc19e932
I0115 05:25:02.201251       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]

Logs from the new pod, which performed a bulk removal/addition of targets:

-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.4
  Build:      git-0db46039
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0115 05:31:38.016418       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0115 05:31:38.051846       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0115 05:31:38.051981       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0115 05:31:38.052016       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0115 05:31:38.052113       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0115 05:31:38.052132       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0115 05:31:38.052201       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0115 05:31:38.052365       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0115 05:31:38.052520       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
I0115 05:31:54.475501       1 leaderelection.go:214] successfully acquired lease kube-system/ingress-controller-leader-alb
I0115 05:31:54.575721       1 controller.go:134] kubebuilder/controller "level"=0 "msg"="Starting Controller"  "controller"="alb-ingress-controller"
I0115 05:31:54.675851       1 controller.go:154] kubebuilder/controller "level"=0 "msg"="Starting workers"  "controller"="alb-ingress-controller" "worker count"=1
I0115 05:31:55.229237       1 targets.go:80] pushowl-backend/pushowl-backend-ingress-alb: Adding targets to arn:aws:elasticloadbalancing:us-east-1:956017394124:targetgroup/d8bc12a0-bee0406332ff9abb91a/2f780fbba9190638: 10.10.104.192:8000, 10.10.117.146:8000, 10.10.118.85:8000, 10.10.119.203:8000, 10.10.120.247:8000, 10.10.125.149:8000, 10.10.138.102:8000, 10.10.146.86:8000, 10.10.151.153:8000, 10.10.156.47:8000, 10.10.96.246:8000
I0115 05:31:55.652305       1 targets.go:95] pushowl-backend/pushowl-backend-ingress-alb: Removing targets from arn:aws:elasticloadbalancing:us-east-1:956017394124:targetgroup/d8bc12a0-bee0406332ff9abb91a/2f780fbba9190638: 10.10.97.165:8000, 10.10.130.28:8000, 10.10.126.17:8000, 10.10.96.112:8000, 10.10.135.8:8000, 10.10.110.98:8000, 10.10.125.182:8000, 10.10.150.107:8000, 10.10.117.135:8000
I0115 05:31:55.726399       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:31:55.760397       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]

I have upgraded the ingress controller to v1.1.5, but looking at the release notes, I don't see any change that would mitigate this issue. Can you please point me in the right direction as to what my options are?

@M00nF1sh (Collaborator) commented Jan 15, 2020

Hi,
When did the deployment happen? Was it between '05:25:02.062652' and '05:31:55.229237'?
BTW, what's your ingress spec? Did you use WAF on it? (v1.1.5 introduced a cache around WAF calls.)

@a6kme (Author) commented Jan 16, 2020

@M00nF1sh - Yes, the deployment happened at around 05:25. I killed the old ingress controller pod, which triggered a new one, after I received an alarm for a spike in 5XX errors at around 05:30.

My ingress and service spec for your reference:

alb-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: {{ .Release.Name -}}-ingress-alb
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.global.acmCertificate }}
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    {{ if eq .Values.global.env "production" }}
    alb.ingress.kubernetes.io/load-balancer-attributes: 'access_logs.s3.enabled=false,access_logs.s3.bucket=pushowl-alb-production-logs,deletion_protection.enabled=true,routing.http2.enabled=true'
    {{ end }}
  labels:
    env: {{ .Values.global.env }}
    release: {{ .Release.Name }}
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: ssl-redirect
              servicePort: use-annotation
          - path: /*
            backend:
              serviceName: "gunicorn"
              servicePort: 80

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: gunicorn
  annotations:
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '3'
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck/
    alb.ingress.kubernetes.io/healthcheck-port: {{ .Values.containerPort | quote}}
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '10'
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '30'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  selector:
    app: {{ .Chart.Name }}
    release: {{ .Release.Name }}
  ports:
    - name: http
      port: 80
      targetPort: {{ .Values.containerPort }}
  type: ClusterIP

I have not enabled WAF on my ingress. Also, I use a preStop hook in my deployment to sleep for 25 seconds, so that my application gets SIGTERM only after 25 seconds and the ingress controller has sufficient time to remove the target from the target group.
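
For reference, the relevant part of that setup looks roughly like this (a sketch; the container name and grace period are illustrative, not my exact values):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60        # must exceed the preStop sleep
      containers:
        - name: gunicorn                       # illustrative container name
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 25"]   # hold off SIGTERM so the target can deregister first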

@jorihardman commented Jan 16, 2020

This is probably the same as #1112, #1124, #1065, and #1064.

Occasionally, target removal/addition doesn't keep up with the Kubernetes deployment, causing targets to go unhealthy. It's unclear whether the delay is in the ingress controller or the AWS ALB API.

I see variations of this daily, unfortunately. Sometimes everything goes well and no targets go unhealthy. Sometimes a pod or two will not deregister in time and 502s pop up. Sometimes all the pods go unhealthy and there is downtime. It's a critical issue for us, but unfortunately I don't know of any better load-balancing alternatives at the moment.

@a6kme (Author) commented Jan 17, 2020

@jorihardman - Totally understood. I am thinking of shifting back to the classic load balancer, i.e. setting type: LoadBalancer in my service declaration, until #955 is merged.

However, I still feel that merging #955 will not solve the issue, because there will still potentially be a delay between Kubernetes terminating a pod and the ingress controller removing the target from the ALB. That time delta can lead to 502s or 504s in the service. Unless we get to an RCA of this lag, I don't see the issue getting fixed. Do let me know if I can help in any way.
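For context, that switch would just mean changing the gunicorn Service above to something like this (a sketch; values are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: gunicorn
spec:
  type: LoadBalancer        # provisions a classic ELB on AWS by default
  selector:
    app: gunicorn           # illustrative selector
  ports:
    - name: http
      port: 80
      targetPort: 8000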

@jorihardman commented Jan 17, 2020

@a6kme thanks for chiming back in. As far as RCA goes, #1064 is my ticket and suggests that the delay is on the AWS side. My deploy logs indicated that ALB was routing to pods after the ingress-controller sent the deregistration request. I notice that most of my deploy failures are during peak hours, and it might be the case that AWS has queue delays during the day that increase the likelihood of lagging target updates.

Totally agree with your analysis of #955 - it solves half the problem, but at least it prevents a full outage.

Just wanted to mention one caveat of type: LoadBalancer. If you use cluster-autoscaler and scaling happens frequently, you may occasionally see 502s/504s due to in-flight requests being lost by kube-proxy when a node scales down. If you're using EKS node groups, pods are drained when a node is terminating; however, daemon pods like kube-proxy are not drained (as of the initial release of node groups, anyway). This means that when a node scales down, in-flight requests can sometimes be lost. See kubernetes/autoscaler#1907.

If you don't have a lot of churn in your nodes, the classic ELB definitely seems to be the more stable solution for now. For us, the choice was "random 502s/504s on a regular basis" or "maybe 502s, just during deploys". Catch-22 :P

@a6kme (Author) commented Jan 18, 2020

@jorihardman Thanks a lot for pointing that caveat out. I did not know that before.

However, I would like to add/clarify one more thing regarding #1064 that you mentioned. In my case, the ingress controller did NOT remove the targets during the deployment at all. My pods were all recycled from the old replica set (from the previous deployment) to the new replica set, but nothing was happening in the ingress controller. You can see that in the logs I posted in my first comment. Do you think that is also related to the AWS APIs?

I have also noticed the ingress controller deregistering a target while the ALB still sends it traffic in the draining state.

@jorihardman commented Jan 20, 2020

@a6kme Rereading your logs, I can see that the bad ingress-controller pod didn't seem to trigger removal at all. You're right - the root cause of this issue does seem unrelated. In mine, I had the "Removing targets" log message, but then latency in actually removing the targets in AWS. Can't say I've personally encountered this one... 😬

@M00nF1sh (Collaborator) commented Jan 22, 2020

Just a quick update: I think I found the cause and will try to fix it in the next release.
In the meantime, upgrading to v1.1.5 should ease this problem (unless your ingress spec is incorrect and has kept failing).
The old versions make WAF calls on every target change even when the WAF annotation is not used; those calls are throttled at only 1 TPS, so the controller backs off and retries, which triggers this race condition.

@a6kme (Author) commented Jan 23, 2020

@M00nF1sh - Hey, you are actually right. After upgrading, I have not seen any instance of the controller failing to register target updates on the load balancer.

Though I still see the ALB sending requests to draining targets, that's a different issue. My 5XXs have reduced a lot since I upgraded to v1.1.5 and started using the preStop hook.

@keperry commented Jan 23, 2020

@M00nF1sh - We experienced a very similar issue. It appeared that the alb-ingress-controller did not keep up with the changes, causing a spike in 502/504 errors. We were using v1.1.3; I have since updated to v1.1.5. Any ETA on a release/fix?

@a6kme closed this Feb 3, 2020
@endeepak commented Feb 7, 2020

We are facing a similar issue intermittently with v1.1.5. We have deployed 15-20 times since we upgraded to v1.1.5 last week and have hit this issue twice so far. The new targets were not registered and the old targets were not removed for 5-10 minutes, leading to 502/504 errors.

Logs from the ingress controller confirm this delay: the lines below appear 5-10 minutes after the old pods are replaced with new pods.

Removing targets from arn:aws:elasticloadbalancing:[REDACTED]
Adding targets to arn:aws:elasticloadbalancing:[REDACTED]

@a6kme do you mind keeping this issue open until there is confirmation of the fix?

@a6kme reopened this Feb 7, 2020
@cw-sakamoto commented Feb 8, 2020

@M00nF1sh

Thanks for the great product maintenance!

#1131 (comment)

i think i found the cause and will try to fix it in next release.

What do you think the cause is, and when will the fixed version be released?

This problem still occurs with v1.1.5...

@M00nF1sh (Collaborator) commented Feb 19, 2020

@a6kme You can further reduce 502s by disabling HTTP keep-alive, or by keeping keep-alive enabled but sending a Connection: close header once SIGTERM is received. See the example app for zero-downtime deployment.
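
Roughly, the Connection: close approach looks like this (a minimal Go sketch, not the linked example app; the port and drain timings are placeholders):

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			// Tell the ALB to close this keep-alive connection after the
			// response instead of reusing it against a deregistering target.
			w.Header().Set("Connection", "close")
		}
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8000", Handler: mux} // placeholder port
	go srv.ListenAndServe()

	// On SIGTERM, keep serving (now with Connection: close) while the target
	// drains, then shut down gracefully.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs
	draining.Store(true)
	time.Sleep(25 * time.Second) // roughly match the deregistration delay

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}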

@cw-sakamoto @endeepak While I wasn't able to reproduce this issue with around 200 continuous deployments, the symptom is most likely caused by exponential backoff (5ms up to 16 minutes) due to errors such as a wrong rule or throttling.
The next version will reduce calls to EC2.DescribeInstanceHealth, and I'm also considering changing the backoff to (100ms up to 5 minutes).
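
For illustration, the 5ms-to-16-minute range above appears to correspond to client-go's default per-item exponential failure rate limiter; a small sketch of how those requeue delays grow (illustrative only, not the controller's actual code):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: 5ms base, capped at 1000s (~16.7 minutes),
	// matching the range described above (assumed parameters).
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// Each failed reconcile of the same ingress doubles its requeue delay.
	for i := 0; i < 20; i++ {
		fmt.Printf("failure %2d -> requeue after %v\n", i+1, limiter.When("my-ingress"))
	}

	// A successful reconcile calls Forget, which resets the backoff for that item.
	limiter.Forget("my-ingress")
}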

It would be helpful if you could send me some logs when the delay happens:
* the ALB ingress controller's logs (if throttling happens, there should be error logs from the controller)
* the timestamp when the event happens
* the cluster ARN
