Ingress controller did not remove targets from target group even when pods were deleted #1131
Comments
Hi,
@M00nF1sh - Yes, the deployment happened at around 05:25. I killed the old ingress controller pod, which triggered a new pod, after I received an alarm about a spike in [...]. My ingress and service spec for your reference: alb-ingress.yaml
service.yaml
I have not enabled WAF on my ingress. Also, I use [...]
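Since the attached alb-ingress.yaml and service.yaml are not reproduced here, below is a minimal sketch of what an Ingress/Service pair for the aws-alb-ingress-controller v1.x typically looks like; all names, ports, and annotation values are illustrative assumptions, not the author's actual manifests.

```yaml
# Illustrative sketch only -- not the alb-ingress.yaml / service.yaml
# attached to this issue. Names, ports, and values are assumptions.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app                      # hypothetical name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: my-app
              servicePort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical name
spec:
  type: NodePort                    # instance target-type routes via node ports
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

With target-type: ip, the controller registers pod IPs directly in the target group instead of node ports, which is the mode most directly affected by pod add/remove timing discussed in this thread.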
This is probably the same as #1112, #1124, #1065, and #1064. Occasionally, removal/addition of targets doesn't keep up with the Kubernetes deployment, causing them to go unhealthy. It's unclear whether the delay is in the ingress controller or the AWS ALB API. I see variations of this daily, unfortunately. Sometimes everything goes well and no targets go unhealthy. Sometimes a pod or two will not deregister in time and 502s pop up. Sometimes all the pods go unhealthy and there is downtime. It's a critical issue for us, but unfortunately I don't know of any better alternatives for load balancing at the moment.
@jorihardman - Totally understood. I am thinking of shifting back to a classic load balancer, i.e. setting [...]. However, I still feel that merging #955 will not solve the issue, because there will still potentially be a delay between Kubernetes terminating a pod and the ingress controller removing the target from the ALB. That delta in time can potentially lead to [...]
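To make that termination/deregistration gap concrete, one common mitigation (not something confirmed as a fix in this thread) is to hold the pod open with a preStop sleep so it keeps serving while the controller and the ALB catch up. A minimal sketch, where the sleep and grace-period values are pure assumptions that would need tuning:

```yaml
# Sketch of an application Deployment fragment that delays shutdown so a
# terminating pod keeps serving while it is deregistered from the ALB.
# The 30s sleep and 60s grace period are assumptions, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: my-app:latest      # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 30"]
```

This narrows the window in which the ALB still routes to a pod that Kubernetes has already removed, but it does not help in the failure mode described above, where the controller never issues the deregistration call at all.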
@a6kme thanks for chiming back in. As far as RCA goes, #1064 is my ticket, and it suggests that the delay is on the AWS side. My deploy logs indicated that the ALB was routing to pods after the ingress controller sent the deregistration request. I notice that most of my deploy failures are during peak hours, and it may be that AWS has queue delays during the day that increase the likelihood of lagging target updates. Totally agree with your analysis of #955 - it solves half the problem, but at least it prevents a full outage. Just wanted to mention one caveat of [...]: if you don't have a lot of churn in your nodes, the classic ELB definitely seems to be the more stable solution for now. For us, the choice was "random 502s/504s on a regular basis" or "maybe 502s, just during deploys". Catch-22 :P
@jorihardman Thanks a lot for pointing that caveat out; I did not know that before. However, I would like to add/clarify one more thing regarding #1064 that you mentioned. I faced the Ingress Controller NOT removing the targets during deployment. My pods were all recycled from the old ReplicaSet of the previous deployment to the new ReplicaSet, but nothing was happening at the Ingress Controller. You can correlate that with the logs that I posted in my first post. Do you think it also has something to do with the AWS APIs? I have also noticed the Ingress Controller removing the pod and the ALB still sending traffic while in [...]
@a6kme Rereading your logs, I can see that the bad ingress-controller pod didn't seem to trigger removal at all. You're right - the root cause of this issue does seem unrelated. In mine, I had the [...]
Just a quick update: I think I found the cause and will try to fix it in the next release.
@M00nF1sh - Hey, you are right, actually. After I upgraded, I have not seen any instance of the controller failing to push target updates to the load balancers. Though I still see the ALB sending requests to draining targets, that's a different issue. My [...]
@M00nF1sh - we experienced a very similar issue to this. It appeared that the alb-ingress-controller did not keep up with the changes, causing a spike in 502/504. We were using v1.1.3 and I have since updated to v1.1.5. Any ETA on a release/fix?
We are facing a similar issue intermittently with v1.1.5. We have deployed 15-20 times since we upgraded to v1.1.5 last week and have faced this issue twice so far. The new targets were not registered and the old targets were not removed for 5-10 minutes, leading to 502/504 errors. Logs from the ingress controller confirm this delay, as the logs below were displayed 5-10 minutes after the old pods were replaced with new pods.
@a6kme do you mind keeping this issue open until there is confirmation of the fix?
Thanks for the great product maintenance!
What do you think the cause is, and when will the fixed version be released? This problem still occurs with 1.1.5...
@a6kme You can reduce 502s further by [...]
@cw-sakamoto @endeepak While I didn't reproduce this issue in around 200 continuous deployments, the symptom is most likely caused by exponential backoff (5ms up to 16 minutes) due to errors like a wrong rule or throttling. It would be helpful if you could send me some logs when the delay happens.
@M00nF1sh What's the status of this issue? I'm seeing it pretty frequently, with throttling errors in the ALB controller logs.
@fejta-bot: Closing this issue.
We are still facing the 502 errors whenever a new deployment happens and pods start to deregister and register, with ALB ingress controller version 1.1.9.
Hey,
Today during deployment, all of our pods were upgraded to newer pods, but the ingress controller did not remove the old targets and add the new targets to the target group, leading to an outage on our platform. I had to kill the existing alb-ingress-controller pod, and the new pod then did a bulk add/remove operation, bringing the systems back live.
My current deployment spec:
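The author's actual spec was attached to the issue and is not reproduced here. For orientation only, a typical aws-alb-ingress-controller v1.x Deployment looks roughly like the sketch below; the cluster name, namespace, and image tag are assumptions.

```yaml
# Not the spec attached to this issue -- an illustrative sketch of a typical
# aws-alb-ingress-controller v1.x Deployment. Cluster name, namespace, and
# image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alb-ingress-controller
  namespace: kube-system
  labels:
    app.kubernetes.io/name: alb-ingress-controller
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: alb-ingress-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: alb-ingress-controller
    spec:
      serviceAccountName: alb-ingress-controller
      containers:
        - name: alb-ingress-controller
          image: docker.io/amazon/aws-alb-ingress-controller:v1.1.5
          args:
            - --ingress-class=alb
            - --cluster-name=my-cluster   # assumption
```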
Tail of some logs from my running pod, which did not trigger removal/addition of targets:
Logs of the new pod, which did a bulk removal/addition of targets:
I have upgraded the ingress controller to 1.1.5, but looking at the release notes, I don't see any change that mitigates this issue. Can you please point me in the right direction as to what my options are?