
ALB sending requests to pods after ingress controller deregisters them leading to 504s #1064

Closed
jorihardman opened this issue Nov 6, 2019 · 42 comments

Comments

@jorihardman

jorihardman commented Nov 6, 2019

I have the following Ingress defined. The relevant bits are that it uses target-type ip with a 30-second deregistration_delay.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  namespace: "default"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: REDACTED
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
    alb.ingress.kubernetes.io/healthcheck-path: /up
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=30,access_logs.s3.enabled=true,access_logs.s3.bucket=REDACTED,access_logs.s3.prefix=REDACTED
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
spec:
  rules:
    - host: REDACTED
      http:
        paths:
          - path: /*
            backend:
              serviceName: ssl-redirect
              servicePort: use-annotation
          - path: /
            backend:
              serviceName: graphql-api
              servicePort: 80

When I delete a pod, either manually or as part of a rolling deploy, I see 504s returned from the ALB. A 504 is returned when the ALB cannot establish a connection to its target within 10s. Here is one such message from the ALB logs:

https 2019-11-06T01:35:39.438256Z app/bd528925-default-stagingin-1b90/1a99d4560435cd18 54.248.220.14:24888 172.18.158.197:8080 -1 -1 -1 504 - 229 303 "GET REDACTED:443/up HTTP/1.1" "Amazon-Route53-Health-Check-Service (ref af4d4b0b-e40c-4f18-a18f-12f472889080; report http://amzn.to/1vsZADi)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587 "Root=1-5dc22361-b0d66bf23f845202c2c390ec" "REDACTED" "arn:aws:acm:us-east-1:REDACTED:certificate/REDACTED" 3 2019-11-06T01:35:29.436000Z "forward" "-" "-" "172.18.158.197:8080" "-"

There's a lot going on there, but the important part is that the request was received at 2019-11-06T01:35:29.436000Z and the error was emitted 10s later, at 2019-11-06T01:35:39.438256Z.

I investigated the ingress controller logs and can see that the pod in question, 172.18.158.197:8080, was deregistered at 2019-11-06T01:35:25.891546Z, about 4 seconds before the above request was received.

I1106 01:35:25.891546       1 targets.go:95] default/staging-ingress: Removing targets from arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587: 172.18.158.197:8080

My understanding is that once a target is set to "deregistering", the ALB will not forward it any more requests. It's unclear to me how this request seems to break that rule. Any thoughts?

@jorihardman jorihardman changed the title ALB sending requests to pod after ingress controller deregisters them leading to 504s ALB sending requests to pods after ingress controller deregisters them leading to 504s Nov 6, 2019
@jorihardman
Author

I should add that I've implemented a preStop delay on my pod to allow deregistration to occur prior to pod shutdown. This seems to be functioning correctly: I can see the deregistration in the controller logs, and several seconds later I see a SIGTERM on my pod. The issue seems to be between the ALB and the ingress controller.

@rparsonsbb

I'm definitely experiencing the same thing. PowerShell preStop commands don't seem to like sleeping first either, so I had to really rig that up.

Is there any insight as to why terminated pods still get sent requests? It happens whether the target type is ip or instance.

@nicholasgcoles

@jorihardman is this happening every single time you update a deployment, or only sometimes?

@jorihardman
Author

@nicholasgcoles To be honest when I was investigating this issue, my sample size was only about 5 deployments, each rolling 4 pods. I saw at least 1 504 for each of these. I didn't have constant traffic for each deploy, so honestly, it wasn't a perfect experimental setup. Still, the logs seemed to me to be a smoking gun for what I would expect to be an impossible situation, so I figured it was worth raising the question. Happy to investigate further with some guidance.

@mailjunze

This looks similar to #905

@krunalnsoni

We are able to reproduce this 100% of the time with target-type ip. We added a 30-second sleep in preStop, which resolved the 502s. However, 504s happen even after the target has been removed from the ALB. We have verified that the 504s are caused by the ALB forwarding requests to a target that was already removed.

@jorihardman
Author

Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround and achieve 100% availability through deployments.

Couple things I changed from my initial config:

# Extend the pods shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60

# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]

Extending the sleep time allows the ALB to send a few requests even after deregistration without the pod rejecting them. My pod needs at most 30s to gracefully answer all in-flight requests and terminate, so the 25s sleep + 30s = 55s total. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without the pod being killed.

I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.

@jorihardman
Author

I'm going to close this issue, as it seems to be an AWS problem and not related to the controller. The workaround is to add a sufficient preStop sleep to allow the ALB to complete deregistration. I used 25s, which might be overkill, but it seemed to mitigate the problem.

@jorihardman
Author

Previous testing was in our staging environment. After I shipped the changes to production, I still saw 504s from the ALB sending requests to terminating pods.

@rparsonsbb

Agreed. I'm running Windows containers, but the theory is all the same. I'm actually using a 60s sleep at this point and still getting 504s from the ALB during rollout.

            preStop:
              exec:
                command:
                - powershell.exe
                - -c
                - echo "Stopping container"; Start-Sleep 60; Stop-Website "Main"

@jorihardman
Author

Strangely, I haven't gotten a 504 during deploys for the last 48 hours, and I've done 3 or 4 deploys at peak traffic times. I did tweak the readiness checks on my pods to make them ready quicker (lowered initialDelaySeconds). Perhaps that has allowed them to become healthy on the LB faster.

@jorihardman
Author

Naturally, as soon as I signal success, I see another spurt of 502s. There's definitely still a deregistration race condition here. This time I got 502s due to a pod terminating prior to being deregistered. It seems pretty non-deterministic at this point: sometimes alb-ingress-controller keeps up with changing targets, and sometimes it doesn't.

@mohitanchlia

Any ETA on this?

@KIVagant

KIVagant commented Jan 29, 2020

For those who use the setup:

  • ALB
  • AWS CNI
  • ALB Ingress Controller
  • Linkerd v2

...here's an example of how to avoid the problem:

linkerd/linkerd2#3747 (comment)

I am not an expert in ALB internals, but I feel it happens because of the ALB's cross-region (and, maybe, multi-tenant) nature. It takes time to propagate changes to all instances, so the ALB will keep sending some traffic for a while even after a target has been requested to deregister. The overall "sleep" configuration for the preStop hook should therefore be calculated based on:

  • the time for the ALB Ingress Controller to send the deregistration API call, especially if AWS WAF is in play
  • the deregistration delay configured for the ALB
  • some additional time for the ALB to propagate all the changes (from my tests it takes up to 5 seconds, but it can be really different in your setup)
  • and, after the main container has exited, some additional time (yet another preStop hook) specifically for sidecar containers such as Istio, Linkerd, or any other, to keep the internal network up until the very end.
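The budget above can be sketched as simple arithmetic. All the numbers below are illustrative placeholders, not measurements from this thread; you would substitute values observed in your own setup:

```python
# All values are illustrative placeholders; measure them in your own setup.
controller_api_delay = 10   # time for the controller's deregistration API call (longer with WAF retries)
deregistration_delay = 30   # deregistration_delay.timeout_seconds on the target group
alb_propagation_slack = 5   # extra time for the ALB to propagate the change to all its nodes

# The preStop sleep must cover everything up to the point where the ALB
# has fully stopped sending new requests to the pod.
pre_stop_sleep = controller_api_delay + deregistration_delay + alb_propagation_slack

# terminationGracePeriodSeconds must additionally cover sidecar shutdown
# and the draining of requests that are already in flight.
sidecar_grace = 5
in_flight_drain = 10
termination_grace = pre_stop_sleep + sidecar_grace + in_flight_drain

print(pre_stop_sleep, termination_grace)  # 45 60
```

With these placeholder numbers you would end up with roughly `sleep 45` in the preStop hook and `terminationGracePeriodSeconds: 60` in the pod spec.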

@aditya-pr

Have you guys tried --feature-gates=waf=false? This fixed the delay issue for us. You can see in the following graph that most of the retries are for WAF [1], which seems to back up the rate-limit queue. You can collect queue-related metrics by adding MetricsBindAddress: ":8080" here. The longest-running request reaches upwards of 300 seconds.

Screenshot from 2020-01-21 16-01-26

After disabling WAF, the longest-running request was down to a max of 5 seconds (a sleep of ~10 seconds in preStop is still needed).

I have verified that the issue exists in v1.1.5. Let me know if you need any further info or help fixing this.
[1] Ingress AWS API metrics collected with prometheus.io/port: '10254'

@KIVagant

KIVagant commented Feb 3, 2020

@aditya-pr, yes, and it is not enough to get a smooth rolling update.

@jorihardman
Author

@aditya-pr Amazing dashboard you've got there. I wish I had that when I was digging into this issue. To clarify: are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

@aditya-pr

are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

Yes, there is lag even after disabling WAF. There are other AWS API calls which are still being retried, and adding all endpoints of an Ingress to the queue whenever any endpoint updates must be making this worse. A deregistration lag of even 5 seconds means the ELB is still sending requests to a terminated pod, i.e., 504 Gateway Timeout errors.
Screenshot from 2020-02-05 14-13-14

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 5, 2020
@erks

erks commented May 5, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 5, 2020
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 8, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Elyahou

Elyahou commented Jan 26, 2021

Having the same issue here. I have a preStop hook with a 60s sleep, but I still get 504s from the ALB after deleting a pod manually or scaling down the deployment. The timed-out requests are always received 8s after deregistration starts and time out 10s after that, so 18s after deregistration starts. At that time, the pod still works and replies correctly to requests that I send to it directly...

Has anyone here found a solution to this?

@rparsonsbb

@Elyahou I don't feel too bad, because this has been an ongoing issue for over a year at this point, but I moved on to kong-ingress-controller and have been very happy with it. The things that the ALB handled nicely (for me, primarily the AWS certs) have now all been handled by adding a couple more pieces, namely cert-manager and external-dns, so I no longer worry about ALB lifecycle conflicts.

ALB ingress seemed like an awesome idea and was dead simple to use, but the inability to sync up the lifecycles between the pod and ALB delivery just became too much.

@blhagadorn

Any update on this? This should be reopened given the comments here, IMO.

@blang

blang commented Oct 27, 2021

Is it possible that the LB controller removes the security group entries while the target is still deregistering?
I'll investigate this further over the next couple of weeks.

@kevin-lindsay-1

I think I might be currently encountering this issue. I'll have to do further tests to verify, but it seems like this issue should be reopened.

@123BLiN

123BLiN commented Feb 8, 2023

Still an issue, btw. For those who come here after me: the solution AWS suggests is to use health checks and not rely on target deregistration: https://aws.amazon.com/blogs/containers/how-to-rapidly-scale-your-application-with-alb-on-eks-without-losing-traffic/
The pod should simply start failing health checks on SIGTERM while still serving traffic for some time. In that case a preStop hook is not required, which in theory should also improve rolling-update time a bit.
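This "keep serving traffic, but start failing the health endpoint once SIGTERM arrives" pattern can be sketched roughly as follows. This is a minimal illustration, not the code from the AWS post; the port, the /up path, and all handler names are assumptions:

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Flag shared between the SIGTERM handler and the health endpoint.
healthy = threading.Event()
healthy.set()

def handle_sigterm(signum, frame):
    # Do NOT exit here: keep serving real traffic, just start failing
    # the health check so the ALB marks the target unhealthy and drains it.
    healthy.clear()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/up":
            # Health check: 200 while healthy, 503 once SIGTERM has arrived.
            code = 200 if healthy.is_set() else 503
        else:
            # Regular traffic is still answered normally during the drain.
            code = 200
        self.send_response(code)
        self.end_headers()
        self.wfile.write(b"ok" if code == 200 else b"draining")

    def log_message(self, fmt, *args):  # keep the sketch quiet
        pass

def serve():
    signal.signal(signal.SIGTERM, handle_sigterm)
    HTTPServer(("", 8080), Handler).serve_forever()
```

The process finally exits on its own (or on SIGKILL) after terminationGracePeriodSeconds, once the ALB health checks have failed long enough for the target to be drained.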

@davidroth

davidroth commented Nov 13, 2023

@123BLiN According to the sample from AWS, a preStop hook is still required.

I am curious whether the "health check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.

@armenr

armenr commented Nov 13, 2023

This is an issue for us too. A BIG one.

...It's incredible to me how this has been allowed to go this long without concrete fix or resolution 😂 . Talk about "Bias for Action" and "Ownership" 🎃

AWS philosophizes and writes literature on the 6 Well-Architected Pillars.

Also releases a Load Balancer Ingress Controller that regularly throws 50xs, because it doesn't work as advertised/as expected, even though the customer using their tool is doing everything in exactly the way they're being told to.

...And then lets the issue just kinda linger and languish for years...and then tells the customer that the "fix" is just a bunch of workarounds that impact the customer's kubernetes cluster in other ways...like delayed pod termination, which then impacts the scale-speed and cost characteristics of the cluster. 👍🏼

@dusansusic

@123BLiN According to the sample from AWS, a preStop hook is still required.

I am curious whether the "health check + sleep xx" hook has any benefit over a simple "sleep xx" hook. According to my tests, there is no difference.

I guess this isn't a health check but rather flipping the "healthy" response to unhealthy so that the health check starts to fail:

sed -i 's/health/nothealthy/g'

I believe the remaining 50x errors are caused by this, as he wrote in the sentence before the Conclusion chapter.

@armenr

armenr commented Nov 26, 2023

We found a useful pattern for this, although it's fairly bespoke (it uses s6-overlay). I'm happy to provide a sample implementation once I've tested it a little while longer.

Is anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.

@debu99

debu99 commented Oct 16, 2024

We found a useful pattern for this, although it's fairly bespoke (it uses s6-overlay). I'm happy to provide a sample implementation once I've tested it a little while longer.

Is anyone on this thread using s6-overlay, or at least willing to? :) +1 so I know whether I should waste my breath or not.

could you share more details?

@leonardocaylent

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 5, 2024
@leonardocaylent

/reopen

@k8s-ci-robot
Contributor

@leonardocaylent: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@leonardocaylent

leonardocaylent commented Nov 5, 2024

Why was this issue marked as closed if it is still happening after 5 years?

@armenr

armenr commented Nov 6, 2024

It definitely continues to happen.

The posted blog/article on how to do this uses an example where you run sed against running code in order to "kill" a server inside the pod prior to deregistration, so health checks start to fail... and it's honestly not a good solution.

We worked around it a different way, but it required non-trivial effort and a change to the way we fundamentally configure and manage pods and the application(s) inside them.

@leonardocaylent

leonardocaylent commented Nov 11, 2024

It definitely continues to happen.

The posted blog/article on how to do this uses an example where you run sed against running code in order to "kill" a server inside the pod prior to deregistration, so health checks start to fail... and it's honestly not a good solution.

We worked around it a different way, but it required non-trivial effort and a change to the way we fundamentally configure and manage pods and the application(s) inside them.

Hi @armenr , Is the article you're talking about this one?

My concern is that this is an AWS ALB issue that should be in "Open" status not closed. I saw some workarounds are moving to NGINX ingress class just because of this issue which doesn't seem like a real solution, it's just a change of controller.

On the article that I shared they use two different changes applied together:

1) Use of a preStop lifecycle hook on the pods (the duration of the sleep command needs to be adjusted for your app):

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

2) Usage of pod readinessGate injection:

kubectl label namespace your_namespace elbv2.k8s.aws/pod-readiness-gate-inject=enabled

You mentioned that you managed it differently from the application side. Can you share some details if possible? (e.g., some NestJS apps need to call app.enableShutdownHooks() in the code; others change the sleep 15 to sleep 60)
Before applying your final fix, how did you test for this bug consistently? (e.g., forcing a new deployment of the pods, rolling out a "new" image in the middle of a load test against the application ingress with a fixed number of virtual users in Postman)

@armenr

armenr commented Nov 12, 2024

What we do is the following:

  1. We use s6-overlay to manage an additional "heartbeat" process inside the pod. It's just a Python script that listens for health checks on the designated health-check port and returns a 200 OK. It also polls the actual service being load-balanced, so it fails if the real service we're keeping an eye on is unhealthy.
  2. We use preStop execution hooks to gracefully stop/kill the heartbeat process ahead of the pod being decommissioned, in addition to various other "graceful pre-stop" steps or signals we want to issue before the traffic-facing load-balanced service goes away.

For us, resource budgeting and management (limits on mem/cpu) is imperative, so shoveling in additional crap like sidecars is just a headache to have to worry about. As such, a "multi-process" container with s6-overlay was (contrary to our skepticism and resistance toward it) a really lightweight and useful solution.
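As a rough illustration of the heartbeat idea described above (this is not armenr's actual implementation; the ports, the health URL, and the way the preStop hook signals shutdown are all assumptions):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed layout: the real app answers its own health check on
# localhost:8080/up, and this heartbeat process serves the ALB
# health check on port 9090.
APP_HEALTH_URL = "http://127.0.0.1:8080/up"

# Set by the preStop hook, however you choose to deliver that
# (a signal, a sentinel file, an admin endpoint, ...).
stopping = threading.Event()

def app_is_healthy(url=APP_HEALTH_URL):
    """Return True only if the real service answers its own health check."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, etc. all count as unhealthy.
        return False

class Heartbeat(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fail fast once shutdown has begun; otherwise mirror the app's health.
        ok = not stopping.is_set() and app_is_healthy()
        self.send_response(200 if ok else 503)
        self.end_headers()

    def log_message(self, fmt, *args):  # keep the sketch quiet
        pass

def serve():
    HTTPServer(("", 9090), Heartbeat).serve_forever()
```

Because the heartbeat is a separate process under s6-overlay rather than a sidecar container, it shares the pod's existing resource budget instead of needing its own requests/limits.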
