
502/503 During deploys and/or pod termination #814

Closed
justinwalz opened this issue Jan 15, 2019 · 28 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@justinwalz

Hi! First of all, I appreciate the community and all their work on this project, it is very helpful and a good solution to route directly to pods from an ALB.

However, during testing, I've noticed intermittent 502/503s during deploys of our statefulset. My current hypothesis is that during a deploy, the statefulset controller kills a pod in need of updates, and there is latency between this happening and the alb ingress controller updating the alb target to draining. During this delay, requests are sent to the terminating pod and return 502 (our nginx sidecar) and/or 503 (aws alb).

Has anyone else seen this problem, and potentially have a solution for it? Ideally we'd remove the pod from the alb target group before killing the pod, if this is in fact what is happening.

I have the following Service and Ingress:

---
kind: Service
apiVersion: v1
metadata:
  name: svc-headless
  namespace: dev
spec:
  clusterIP: None
  selector:
    app: svc
  ports:
  - name: http
    port: 9000

Ingress

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: svc-external
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxxx,sg-yyyyyyyyyyy
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3'
    alb.ingress.kubernetes.io/success-codes: 200,201,401
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:XXXXXXXXXX:certificate/uuid
    alb.ingress.kubernetes.io/subnets: subnet-aaaaa,subnet-bbbbb,subnet-cccc
  labels:
    app: svc
spec:
  rules:
    - http:
        paths:
         - path: /*
           backend:
             serviceName: ssl-redirect
             servicePort: use-annotation
         - path: /*
           backend:
             serviceName: svc-headless
             servicePort: 9000
@justinwalz justinwalz changed the title 502/503 During deploy 502/503 During deploys and/or pod termination Jan 15, 2019
@M00nF1sh
Collaborator

M00nF1sh commented Jan 15, 2019

Hi,
This is indeed what happened.

  1. There is a gap between when Kubernetes updates the Endpoints objects and when our controller gets notified.
  2. New pods require several seconds (for the initial health checks by the ALB) before they start accepting traffic. (Your problem is most likely caused by this.)

The best way to work around this for now is to use a NodePort service with instance mode for the ingress. (You can create a separate NodePort service alongside your headless service; see the sketch below.)

A more robust way might be to support this with readiness gates and dynamic admission controllers. I haven't thought this through deeply yet, but I will do some prototyping to see whether it works 😄
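
For reference, a minimal sketch of the companion NodePort service mentioned above, assuming the same selector and port as the example in the issue description (the name svc-nodeport is hypothetical). The Ingress would then use alb.ingress.kubernetes.io/target-type: instance and point its backend at this service:

---
kind: Service
apiVersion: v1
metadata:
  name: svc-nodeport   # hypothetical name, kept alongside svc-headless
  namespace: dev
spec:
  type: NodePort       # instance-mode targets register nodes, not pod IPs
  selector:
    app: svc
  ports:
  - name: http
    port: 9000
    targetPort: 9000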

@justinwalz
Author

Hi @M00nF1sh, thanks for the response.

That would work; however, it gets us back to the exact problem I'm trying to solve. We have a large number of instances across various node groups, which quickly balloons the number of instances attached to the target group. The pods we'd like to direct traffic to belong to a small instance group, so this would work if we could select those EC2 instances (k8s nodes) directly. Is there a way to filter or limit which cluster nodes get attached (via Kubernetes node label, EC2 tag, or otherwise)?

@M00nF1sh
Collaborator

M00nF1sh commented Jan 21, 2019

@justinwalz
It's not supported for now. I can make a change to support the alpha.service-controller.kubernetes.io/exclude-balancer label, but that will require you to label every node you don't want attached; would that be acceptable?
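
For illustration, excluding a node would then be a matter of putting that label on it (the node name below is hypothetical; in practice you would add it with kubectl label node). A sketch of how it would appear on the Node object, assuming the controller only checks for the label's presence:

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.ec2.internal   # hypothetical node name
  labels:
    # nodes carrying this label would be skipped when attaching targets
    alpha.service-controller.kubernetes.io/exclude-balancer: "true"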

@justinwalz
Author

@M00nF1sh That would work, we can add a node label to exclude a fleet of instances for specific service ALBs.

Would it be possible to also have the inverse, maybe alpha.service-controller.kubernetes.io/include-balancer and do a union of the matching nodes between the whitelist and blacklist?

@M00nF1sh
Collaborator

M00nF1sh commented Jan 22, 2019

@justinwalz It's possible (and makes sense to me) to have the inverse, but I'd rather not add it since it's not in k8s core. Supporting only exclude-balancer keeps us more compatible with k8s core.

@justinwalz
Author

Got it - no problem. Thanks for the help on this!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2019
@shnhrrsn

shnhrrsn commented Jun 5, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 5, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 3, 2019
@delilahlah

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 4, 2019
@oovs

oovs commented Oct 7, 2019

I've faced a similar issue running aws-alb-controller:v1.1.3 with ip mode, but I found it hard to switch to the instance mode + tagged instances approach due to limitations of our current setup. Please advise: is there any easy way to deal with this lag between a pod being killed and being deregistered from the load balancer?

@prcongithub

I am facing a similar issue. My Kubernetes services scale up when the number of requests per second reaches a certain value, but I get random 502 errors sometimes during peak times.

apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 2
  minReadySeconds: 50
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    spec:
      containers:
      - resources:
          requests:
            cpu: 1900m
            memory: 2500Mi
          limits:
            cpu: 1900m
            memory: 2500Mi
        envFrom:
        - secretRef:
            name: kube-auth-api
        readinessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
        livenessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
      imagePullSecrets:
      - name: awsecr-cred

I get random 502 errors even when all the containers are healthy and are not even restarting.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2020
@jebeaudet

/remove-lifecycle stale

@douglaz

douglaz commented Feb 24, 2020

We're getting this every single deploy. What are the workarounds available?
We have a service like:

apiVersion: v1
kind: Service
metadata:
  name: fortio
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  type: NodePort
  selector:
    app: fortio

@jorihardman

@douglaz See this thread which covers the same issue with a couple of solutions: #1064

tldr:

  • Add a preStop sleep to your pod so that the container's shutdown is delayed. This keeps the container alive and serving while the load balancer updates its targets (see the sketch below). You might need to increase terminationGracePeriodSeconds to allow for a graceful shutdown after the sleep.
  • Add --feature-gates=waf=false to the alb-ingress-controller container args. Right now the controller makes WAF requests on every deploy, and AWS throttling those requests can delay target updates. If you're not using WAF, skipping it entirely prevents these delays.
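
A minimal sketch of the preStop approach (container name, image, and sleep duration are illustrative placeholders; the sleep just needs to cover the time the ALB takes to deregister and drain the target):

spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop sleep plus normal shutdown time
  containers:
  - name: app                         # hypothetical container name
    image: example/app:latest         # hypothetical image; must provide a `sleep` binary
    lifecycle:
      preStop:
        exec:
          # keep the pod serving while the controller deregisters it from the target group
          command: ["sleep", "30"]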

@alfredkrohmer
Contributor

@jorihardman Could you check whether the pod readiness gates feature I added solves the problem? You would need to build a custom Docker image from master since it's not released yet:
https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/master/docs/guide/ingress/pod-conditions.md
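
For context, the linked doc describes adding a readiness gate to the pod spec that the controller marks True once the corresponding target is healthy in the ALB target group. A rough sketch, assuming the conditionType format from that doc and reusing the ingress/service names from earlier in this thread:

spec:
  readinessGates:
  # assumed format: target-health.alb.ingress.k8s.aws/<ingress name>_<service name>_<service port>
  - conditionType: target-health.alb.ingress.k8s.aws/svc-external_svc-headless_9000
  containers:
  - name: app                  # hypothetical container
    image: example/app:latest  # hypothetical image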

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 11, 2020
@delilahlah

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 13, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shyr

shyr commented Jun 22, 2021

/reopen

@k8s-ci-robot
Contributor

@shyr: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@M00nF1sh
Collaborator

@shyr
Have you tried our podReadinessGate?
With the podReadinessGate and a proper sleep time in the preStop hook, there should be no 502/503s.
