
502/503 During deploys and/or pod termination #814

Closed
justinwalz opened this issue Jan 15, 2019 · 28 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@justinwalz

Hi! First of all, I appreciate the community and all their work on this project, it is very helpful and a good solution to route directly to pods from an ALB.

However, during testing, I've noticed intermittent 502/503s during deploys of our statefulset. My current hypothesis is that during a deploy, the statefulset controller kills a pod in need of updates, and there is latency between this happening and the alb ingress controller updating the alb target to draining. During this delay, requests are sent to the terminating pod and return 502 (our nginx sidecar) and/or 503 (aws alb).

Has anyone else seen this problem, and potentially have a solution for it? Ideally we'd remove the pod from the alb target group before killing the pod, if this is in fact what is happening.

I have the following Service and Ingress:

---
kind: Service
apiVersion: v1
metadata:
  name: svc-headless
  namespace: dev
spec:
  clusterIP: None
  selector:
    app: svc
  ports:
  - name: http
    port: 9000

Ingress

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: svc-external
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxxx,sg-yyyyyyyyyyy
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3'
    alb.ingress.kubernetes.io/success-codes: 200,201,401
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:XXXXXXXXXX:certificate/uuid
    alb.ingress.kubernetes.io/subnets: subnet-aaaaa,subnet-bbbbb,subnet-cccc
  labels:
    app: svc
spec:
  rules:
    - http:
        paths:
         - path: /*
           backend:
             serviceName: ssl-redirect
             servicePort: use-annotation
         - path: /*
           backend:
             serviceName: svc-headless
             servicePort: 9000
@justinwalz justinwalz changed the title 502/503 During deploy 502/503 During deploys and/or pod termination Jan 15, 2019
@M00nF1sh
Collaborator

M00nF1sh commented Jan 15, 2019

Hi,
This is indeed what happened.

  1. There is a gap between when Kubernetes updates the Endpoints objects and when our controller gets notified.
  2. New pods require several seconds (for the initial health checks by the ALB) before they start accepting traffic. (Your problem is most likely caused by this.)

The best way to work around this for now is to use a NodePort service with instance mode for the ingress. (You can create a separate NodePort service alongside your headless service; see the sketch below.)

A more robust way might be to support this with readiness gates and dynamic admission controllers. I haven't thought this through deeply yet, but I will do some prototyping to see whether it works 😄
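
For reference, a minimal sketch of the companion NodePort service mentioned above, assuming the same selector and port as the example in the issue description (the name svc-nodeport is hypothetical). The Ingress would then use alb.ingress.kubernetes.io/target-type: instance and point its backend at this service:

---
kind: Service
apiVersion: v1
metadata:
  name: svc-nodeport   # hypothetical name, kept alongside svc-headless
  namespace: dev
spec:
  type: NodePort       # instance-mode targets register nodes, not pod IPs
  selector:
    app: svc
  ports:
  - name: http
    port: 9000
    targetPort: 9000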

@justinwalz
Author

Hi @M00nF1sh, thanks for the response.

That would work; however, it gets us back to the exact problem I'm trying to solve. We have a large number of instances across various node groups, which quickly balloons the number of instances attached to the target group. The pods we'd like to direct traffic to belong to a small instance group, so this would work if we could select those EC2 instances (k8s nodes) directly. Is there a way to filter or limit which cluster nodes get attached (via Kubernetes node label, EC2 tag, or otherwise)?

@M00nF1sh
Collaborator

M00nF1sh commented Jan 21, 2019

@justinwalz
It's not supported for now. I can make a change to support the alpha.service-controller.kubernetes.io/exclude-balancer label, but that will require you to label every node you don't want attached; would that be acceptable?
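
For illustration, excluding a node would then be a matter of putting that label on it (the node name below is hypothetical; in practice you would add it with kubectl label node). A sketch of how it would appear on the Node object, assuming the controller only checks for the label's presence:

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.ec2.internal   # hypothetical node name
  labels:
    # nodes carrying this label would be skipped when attaching targets
    alpha.service-controller.kubernetes.io/exclude-balancer: "true"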

@justinwalz
Author

@M00nF1sh That would work, we can add a node label to exclude a fleet of instances for specific service ALBs.

Would it be possible to also have the inverse, maybe alpha.service-controller.kubernetes.io/include-balancer and do a union of the matching nodes between the whitelist and blacklist?

@M00nF1sh
Collaborator

M00nF1sh commented Jan 22, 2019

@justinwalz It's possible (and makes sense to me) to have the inverse, but I'd rather not add it since it's not in k8s core. Supporting only exclude-balancer keeps us more compatible with k8s core.

@justinwalz
Author

Got it - no problem. Thanks for the help on this!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2019
@shnhrrsn

shnhrrsn commented Jun 5, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 5, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 3, 2019
@delilahlah

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 4, 2019
@oovs

oovs commented Oct 7, 2019

I've faced a similar issue running aws-alb-controller:v1.1.3 with ip mode, but I found it hard to switch to the instance mode + tagged instances approach due to limitations of our current setup. Please advise: is there any easy way to deal with this lag between a pod being killed and being deregistered from the load balancer?

@prcongithub

I am facing a similar issue. My Kubernetes services scale up when the number of requests per second reaches a certain value, but I get random 502 errors sometimes during peak times.

apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 2
  minReadySeconds: 50
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    spec:
      containers:
      - resources:
          requests:
            cpu: 1900m
            memory: 2500Mi
          limits:
            cpu: 1900m
            memory: 2500Mi
        envFrom:
        - secretRef:
            name: kube-auth-api
        readinessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
        livenessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
      imagePullSecrets:
      - name: awsecr-cred

I get random 502 errors even when all the containers are healthy and are not even restarting.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2020
@jebeaudet

/remove-lifecycle stale

@douglaz

douglaz commented Feb 24, 2020

We're getting this every single deploy. What are the workarounds available?
We have a service like:

apiVersion: v1
kind: Service
metadata:
  name: fortio
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  type: NodePort
  selector:
    app: fortio

@jorihardman

@douglaz See this thread which covers the same issue with a couple of solutions: #1064

tldr:

  • Add a preStop sleep to your pod so that the container's shutdown is delayed. This keeps the container alive and serving while the load balancer updates its targets (see the sketch below). You might need to increase terminationGracePeriodSeconds to allow for a graceful shutdown after the sleep.
  • Add --feature-gates=waf=false to the alb-ingress-controller container args. Right now the controller makes WAF requests on every deploy, and AWS throttling those requests can delay target updates. If you're not using WAF, skipping it entirely prevents these delays.
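
A minimal sketch of the preStop approach (container name, image, and sleep duration are illustrative placeholders; the sleep just needs to cover the time the ALB takes to deregister and drain the target):

spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop sleep plus normal shutdown time
  containers:
  - name: app                         # hypothetical container name
    image: example/app:latest         # hypothetical image; must provide a `sleep` binary
    lifecycle:
      preStop:
        exec:
          # keep the pod serving while the controller deregisters it from the target group
          command: ["sleep", "30"]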

@alfredkrohmer
Contributor

@jorihardman Could you check whether the pod readiness gates feature I added solves the problem? You would need to build a custom Docker image from master since it's not released yet:
https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/master/docs/guide/ingress/pod-conditions.md
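
For context, the linked doc describes adding a readiness gate to the pod spec that the controller marks True once the corresponding target is healthy in the ALB target group. A rough sketch, assuming the conditionType format from that doc and reusing the ingress/service names from earlier in this thread:

spec:
  readinessGates:
  # assumed format: target-health.alb.ingress.k8s.aws/<ingress name>_<service name>_<service port>
  - conditionType: target-health.alb.ingress.k8s.aws/svc-external_svc-headless_9000
  containers:
  - name: app                  # hypothetical container
    image: example/app:latest  # hypothetical image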

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 11, 2020
@delilahlah

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 13, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shyr

shyr commented Jun 22, 2021

/reopen

@k8s-ci-robot
Contributor

@shyr: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@M00nF1sh
Collaborator

@shyr
Have you tried our podReadinessGate?
With the podReadinessGate and a proper sleep time in the preStop hook, there should be no 502/503s.
