
Ingress controller did not remove targets from target group even when pods were deleted #1131

Closed
a6kme opened this issue Jan 15, 2020 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@a6kme

a6kme commented Jan 15, 2020

Hey,

Today during a deployment, all of our pods were replaced with new pods, but the ingress controller did not remove the old targets or add the new targets to the target group, leading to an outage on our platform. I had to kill the existing alb-ingress-controller pod; the new pod then performed bulk add/remove operations, bringing the systems back up.

My current deployment spec

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: "2019-10-18T07:06:04Z"
  generation: 4
  labels:
    app.kubernetes.io/name: alb-ingress-controller
  name: alb-ingress-controller
  namespace: kube-system
  resourceVersion: "24312289"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/alb-ingress-controller
  uid: bfd406a4-f175-11e9-8ab4-0a8f42fc94d4
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: alb-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: alb-ingress-controller
    spec:
      containers:
      - args:
        - --watch-namespace=pushowl-backend
        - --ingress-class=alb
        - --cluster-name=pushowl-prod
        image: docker.io/amazon/aws-alb-ingress-controller:v1.1.4
        imagePullPolicy: IfNotPresent
        name: alb-ingress-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alb-ingress-controller
      serviceAccountName: alb-ingress-controller
      terminationGracePeriodSeconds: 30

Tail of the logs from the running pod that did not trigger the target add/remove operations:

I0115 04:38:38.675563       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 04:38:38.675576       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 2 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 04:38:38.700499       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 2 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 04:38:38.717394       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/85a58dd3cc19e932
I0115 04:38:38.740799       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
W0115 04:45:29.750439       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24297318 (24297906)
W0115 04:58:32.755131       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24300146 (24300476)
W0115 05:11:23.759743       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 24302702 (24303025)
I0115 05:25:02.062652       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:25:02.103377       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 05:25:02.103395       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 2 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:25:02.137018       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 2 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]
I0115 05:25:02.164946       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/85a58dd3cc19e932
I0115 05:25:02.201251       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]

Logs of the new pod, which did a bulk removal/addition of targets:

-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.4
  Build:      git-0db46039
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0115 05:31:38.016418       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0115 05:31:38.051846       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0115 05:31:38.051981       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0115 05:31:38.052016       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0115 05:31:38.052113       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0115 05:31:38.052132       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0115 05:31:38.052201       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0115 05:31:38.052365       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0115 05:31:38.052520       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
I0115 05:31:54.475501       1 leaderelection.go:214] successfully acquired lease kube-system/ingress-controller-leader-alb
I0115 05:31:54.575721       1 controller.go:134] kubebuilder/controller "level"=0 "msg"="Starting Controller"  "controller"="alb-ingress-controller"
I0115 05:31:54.675851       1 controller.go:154] kubebuilder/controller "level"=0 "msg"="Starting workers"  "controller"="alb-ingress-controller" "worker count"=1
I0115 05:31:55.229237       1 targets.go:80] pushowl-backend/pushowl-backend-ingress-alb: Adding targets to arn:aws:elasticloadbalancing:us-east-1:956017394124:targetgroup/d8bc12a0-bee0406332ff9abb91a/2f780fbba9190638: 10.10.104.192:8000, 10.10.117.146:8000, 10.10.118.85:8000, 10.10.119.203:8000, 10.10.120.247:8000, 10.10.125.149:8000, 10.10.138.102:8000, 10.10.146.86:8000, 10.10.151.153:8000, 10.10.156.47:8000, 10.10.96.246:8000
I0115 05:31:55.652305       1 targets.go:95] pushowl-backend/pushowl-backend-ingress-alb: Removing targets from arn:aws:elasticloadbalancing:us-east-1:956017394124:targetgroup/d8bc12a0-bee0406332ff9abb91a/2f780fbba9190638: 10.10.97.165:8000, 10.10.130.28:8000, 10.10.126.17:8000, 10.10.96.112:8000, 10.10.135.8:8000, 10.10.110.98:8000, 10.10.125.182:8000, 10.10.150.107:8000, 10.10.117.135:8000
I0115 05:31:55.726399       1 rules.go:82] pushowl-backend/pushowl-backend-ingress-alb: modifying rule 1 on arn:aws:elasticloadbalancing:us-east-1:956017394124:listener/app/d8bc12a0-pushowlbackend-pu-0f8c/1d46bbc5e6085be6/2b39ba9f40de7b66
I0115 05:31:55.760397       1 rules.go:98] pushowl-backend/pushowl-backend-ingress-alb: rule 1 modified with conditions [{    Field: "path-pattern",    Values: ["/*"]  }]

I have upgraded the ingress controller to v1.1.5, but looking at the release notes, I don't see any change that would mitigate this issue. Can you please point me in the right direction as to what my options are?

@M00nF1sh
Collaborator

Hi,
When did the deployment happen? Was it between '05:25:02.062652' and '05:31:55.229237'?
BTW, what's your ingress spec? Did you use WAF on it? (v1.1.5 introduced a cache around WAF calls.)

@a6kme
Author

a6kme commented Jan 16, 2020

@M00nF1sh - Yes, the deployment happened at around 05:25. I killed the old ingress controller pod, which triggered a new one, after I received an alarm about a spike in 5XX errors at around 05:30.

My ingress and service spec for your reference:

alb-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: {{ .Release.Name -}}-ingress-alb
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.global.acmCertificate }}
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    {{ if eq .Values.global.env "production" }}
    alb.ingress.kubernetes.io/load-balancer-attributes: 'access_logs.s3.enabled=false,access_logs.s3.bucket=pushowl-alb-production-logs,deletion_protection.enabled=true,routing.http2.enabled=true'
    {{ end }}
  labels:
    env: {{ .Values.global.env }}
    release: {{ .Release.Name }}
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: ssl-redirect
              servicePort: use-annotation
          - path: /*
            backend:
              serviceName: "gunicorn"
              servicePort: 80

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: gunicorn
  annotations:
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '3'
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck/
    alb.ingress.kubernetes.io/healthcheck-port: {{ .Values.containerPort | quote}}
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '10'
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '30'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  selector:
    app: {{ .Chart.Name }}
    release: {{ .Release.Name }}
  ports:
    - name: http
      port: 80
      targetPort: {{ .Values.containerPort }}
  type: ClusterIP

I have not enabled WAF on my ingress. Also, I use a preStop hook in my deployment to sleep for 25 seconds, so that my application gets SIGTERM only after 25 seconds and the ingress controller has sufficient time to remove the target from the target group.
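
For reference, the hook looks roughly like this (a minimal sketch, not my exact spec; the container name, image, and grace period below are placeholders):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60 # placeholder; should exceed the preStop sleep plus app shutdown time
      containers:
      - name: gunicorn # hypothetical container name
        image: example/gunicorn-app:latest # placeholder image
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 25"] # delay SIGTERM so the controller has time to deregister the target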

@jorihardman

jorihardman commented Jan 16, 2020

This is probably the same as #1112 #1124 #1065 and #1064.

Occasionally, removal/addition of targets doesn't keep up with the Kubernetes deployment, causing targets to go unhealthy. It's unclear whether the delay is in the ingress controller or the AWS ALB API.

I see variations of this daily, unfortunately. Sometimes everything goes well and no targets go unhealthy. Sometimes a pod or two will not deregister in time and 502s pop up. Sometimes all the pods go unhealthy and there is downtime. It's a critical issue for us, but unfortunately I don't know of any better alternatives for load balancing at the moment.

@a6kme
Author

a6kme commented Jan 17, 2020

@jorihardman - Totally understood. I am thinking of shifting back to the classic load balancer, i.e. setting type: LoadBalancer in my service declaration, until #955 is merged.

However, I still feel that merging #955 will not solve the issue, because there will still potentially be a delay between Kubernetes terminating a pod and the ingress controller removing the target from the ALB. That gap can lead to 502s or 504s in the service. Unless we get to an RCA of this lag, I don't see the issue getting fixed. Do let me know if I can help in any way.
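
For reference, this is roughly the change I mean (a minimal sketch assuming the same gunicorn service; the selector and the draining annotations are illustrative, not my actual chart):

apiVersion: v1
kind: Service
metadata:
  name: gunicorn
  annotations:
    # illustrative: enable connection draining on the classic ELB
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "30"
spec:
  type: LoadBalancer # provisions a classic ELB instead of routing through the ALB ingress
  selector:
    app: gunicorn # placeholder selector
  ports:
    - name: http
      port: 80
      targetPort: 8000 # matches the container port seen in the target group logs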

@jorihardman

@a6kme thanks for chiming back in. As far as RCA goes, #1064 is my ticket and suggests that the delay is on the AWS side. My deploy logs indicated that ALB was routing to pods after the ingress-controller sent the deregistration request. I notice that most of my deploy failures are during peak hours, and it might be the case that AWS has queue delays during the day that increase the likelihood of lagging target updates.

Totally agree with your analysis of #955 - it solves half the problem, but at least it prevents a full outage.

Just wanted to mention one caveat of type: LoadBalancer. If you use cluster-autoscaler and scaling happens frequently, you may occasionally see 502s/504s due to in-flight requests being lost by kube-proxy when a node scales down. If you're using EKS node groups, pods are drained when a node is terminating; however, daemon pods like kube-proxy are not drained (as of the initial release of node groups anyway), so in-flight requests can sometimes be lost. See kubernetes/autoscaler#1907.

If you don't have a lot of churn in your nodes, the classic ELB definitely seems to be the more stable solution for now. For us, the choice was "random 502s/504s on a regular basis" or "maybe 502s, just during deploys". Catch-22 :P

@a6kme
Author

a6kme commented Jan 18, 2020

@jorihardman Thanks a lot for pointing that caveat out. I did not know that before.

However, I would like to add/clarify one more thing regarding #1064. What I faced was the ingress controller NOT removing the targets at all during the deployment. My pods were all recycled from the old ReplicaSet to the new one, but nothing was happening on the ingress controller side. You can see that in the logs I posted in my first comment. Do you think this also has something to do with the AWS APIs?

I have also noticed the ingress controller deregistering a target while the ALB still sends it traffic in the draining state.

@jorihardman

@a6kme Rereading your logs, I can see that the bad ingress-controller pod didn't seem to trigger removal at all. You're right - the root cause of this issue does seem unrelated. In mine, I had the Removing targets log message, but then latency in actually removing the targets in AWS. Can't say I've personally encountered this one... 😬

@M00nF1sh
Collaborator

M00nF1sh commented Jan 22, 2020

Just a quick update: I think I found the cause and will try to fix it in the next release.
In the meantime, upgrading to v1.1.5 should ease this problem (unless your ingress spec is incorrect and has been failing repeatedly).
Older versions make WAF calls on every target change even when the WAF annotation is not used; those calls are throttled at only 1 TPS, so the controller backs off and retries, which triggers this race condition.

@a6kme
Author

a6kme commented Jan 23, 2020

@M00nF1sh - Hey, you are right actually. Since I upgraded, I have not seen any instance of the controller failing to register target updates on the load balancers.

Though I still see the ALB sending requests to draining targets, that's a different issue. My 5XXs have reduced a lot since I upgraded to v1.1.5 and started using the preStop hook.

@keperry
Contributor

keperry commented Jan 23, 2020

@M00nF1sh - we experienced a very similar issue. It appeared that the alb-ingress-controller did not keep up with the changes, causing a spike in 502/504s. We were using v1.1.3 and have since updated to v1.1.5. Any ETA on a release/fix?

@a6kme a6kme closed this as completed Feb 3, 2020
@endeepak

endeepak commented Feb 7, 2020

We are facing a similar issue intermittently with v1.1.5. We have deployed 15-20 times since we upgraded to v1.1.5 last week and have hit this issue twice so far. The new targets were not registered and the old targets were not removed for 5-10 minutes, leading to 502/504 errors.

Logs from the ingress controller confirm this delay, as the lines below were logged 5-10 minutes after the old pods had been replaced with new ones:

Removing targets from arn:aws:elasticloadbalancing:[REDACTED]
Adding targets to arn:aws:elasticloadbalancing:[REDACTED]

@a6kme do you mind keeping this issue open until there is confirmation of the fix?

@a6kme a6kme reopened this Feb 7, 2020
@cw-sakamoto
Contributor

cw-sakamoto commented Feb 8, 2020

@M00nF1sh

Thanks for the great product maintenance!

#1131 (comment)

i think i found the cause and will try to fix it in next release.

What do you think the cause is, and when will the fixed version be released?

This problem still occurs with 1.1.5...

@M00nF1sh
Collaborator

M00nF1sh commented Feb 19, 2020

@a6kme You can reduce 502s further by disabling HTTP keep-alive, or by keeping it enabled but sending a Connection: Close header once the application receives SIGTERM. See the example app for zero-downtime deployment.

@cw-sakamoto @endeepak While I couldn't reproduce this issue (across around 200 consecutive deployments), the symptom is most likely caused by exponential backoff (5ms up to 16 minutes) due to errors such as a wrong rule or throttling.
The next version will reduce calls to EC2.DescribeInstanceHealth, and I'm also considering changing the backoff to (100ms up to 5 minutes).

It would be helpful if you could send me some logs when the delay happens:
* the ALB ingress controller's logs (if throttling happens, there should be error logs from the controller)
* the timestamp when the event happened
* the cluster ARN

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 18, 2020
@fillbit

fillbit commented Jul 4, 2020

@M00nF1sh What's the status of this issue? I'm seeing it pretty frequently, with throttling errors in the ALB controller logs.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@khushboo121

khushboo121 commented Jun 2, 2021

We are still facing 502 errors whenever a new deployment happens and pods start to deregister and register, with ALB ingress controller version 1.1.9.
EKS version: 1.17
