Target does not get registered if using readinessGate #1205
I'm not sure if it's the same issue, but it at least looks like I'm experiencing similar behaviour. I also started testing the new 1.1.6 version in order to use the pod readiness gate, and for me it's happening the other way around: when a new pod is launched using a readiness gate, it takes around 5-10 minutes for the pod's readiness gate status to be updated and for the new pod to be registered in the target group. I've also found cases where the pod never gets updated at all, forcing me to delete it so a new one is created. When a new pod is launched I get exactly the same status reported in this issue:
Also, during the period while the readiness gate for the new pod is still pending, I've been checking the target group: the new pod is not registered there and no activity is shown in the controller logs. Only after waiting several minutes does the controller start to generate these logs, successfully update the readiness gate status for the pod, and register it in the target group:
As you can see in this sample, it took exactly 8 minutes to register the new pod. If it helps, here are the ingress and service definitions:
@vedatappsamurai @ivanmp91 could you attach the logs from the controller pod during the deployment?
I have the following deployment strategy on my deployment:
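(For readers following along, a typical rolling-update strategy block in a deployment looks like the sketch below; the name and exact values here are illustrative, not necessarily the reporter's actual settings:)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical name
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring up one extra pod before terminating old ones
      maxUnavailable: 0   # keep all old pods until replacements are Ready
```

With `maxUnavailable: 0`, old pods are only removed once new pods pass readiness, which is exactly why a stuck readiness gate stalls the rollout.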
And ALB Ingress Controller left the following logs when I started the deployment:
Since I have
And it shows the following status on describe as we mentioned before:
At the time of writing, I was waiting for the readiness gate to see if it would pass, and it did, but it took 10 minutes. Notice the timestamps in the log:
It removed the old target at
In my case, I've just tested increasing the number of pod replicas by 1. Here you can see the status of the pods: the new pod is ready, up and running, but it still doesn't have its readiness gate status updated:
And this is the status condition of the new pod launched:
After waiting for 23 minutes the readiness gate status was updated:
These are the full controller logs for the whole period, from when I increased the deployment's replica count by 1 until the readiness gate status was updated:
I can confirm we have the same problem. New pods come up as part of the deployment, but the controller doesn't seem to add them to the target group until some (seemingly) random period of time later (on the order of minutes). This only applies to deployments using the readiness gate. Our unchanged deployments (that use the various health check/lifecycle hacks) are working as usual.
Hi! I have the same problem, and after some tests I found that removing the readiness probe from my deployment fixes it. Is there some restriction on using readiness gates and a readiness probe in the same deployment with the ALB Ingress Controller?
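For reference, readiness probes and pod readiness gates are independent Kubernetes mechanisms and are designed to coexist: a pod's Ready condition becomes True only when all containers pass their probes and every readiness gate condition is True. A minimal sketch of a pod template using both (the condition type follows the v1 controller's documented pattern; the ingress, service, port, image, and path names are hypothetical):

```yaml
spec:
  readinessGates:
    # Pattern: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<port>
    - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
  containers:
    - name: app
      image: my-app:latest       # hypothetical image
      readinessProbe:
        httpGet:
          path: /healthz         # hypothetical health endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
```

So removing the probe shouldn't be required; the symptom described here (gate stuck unless the probe is removed) points at the controller bug discussed later in this thread rather than a real incompatibility.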
I can confirm the same issue. I created a new deployment (2 replicas total) and the first pod to come up in the replicaset got stuck in the "ReadinessGatesNotReady" state. I tried removing my readiness probes per @cupello, but that did not allow my deployment to progress. I rebooted the ingress controller, which allowed the stuck pod to progress to ready. HOWEVER, the next pod to be booted in the replicaset then got stuck in the same state. It seems the new pod is not being registered as a target on the ALB. Here are my controller logs after the first reboot:
I noticed the "Stopping reconciliation of pod condition status for target group" at the end there - could that be related?
I have the same problem. I overcame it by restarting the controller. Attaching some logs from before and after the restart of the ALB Ingress Controller, which might give some hint. There is an existing running pod and a newly created pod of the deployment; the new pod is not ready even after 6 minutes.
Pre-restart logs
Post restart logs
We are experiencing the same thing. Registration sometimes doesn't happen, and deregistration also sometimes doesn't happen, even when the pods have no running containers. Restarting the ALB ingress controller pod seems to fix the issue, which is tough to manage during HPA scaling.
I discovered that this is basically always true. When new targets are added to a target group, this is also somehow 0.
@nirnanaaa and I found the underlying issue. When all containers in a pod have started, the pod IP appears in the endpoint subset's NotReadyAddresses, and no further endpoints event fires afterwards. The solution to this would be to add a pod reconciler which triggers endpoints reconciliations for any associated endpoints.
@bpineau your PR works very well. Testing this right now at scale.
We're relying on endpoints events to re-trigger reconciliations during rollouts, and we're considering pods' container statuses (e.g. are all of a pod's containers ready?) to either act upon or swallow those events. The two (pod changes, ep changes) can be out of sync.

For instance: a starting pod whose containers aren't all ready might have its addresses registered in an ep subset's NotReadyAddresses, kicking a reconcile event which won't propagate to the TargetGroups (+ condition updates) since the pod is evaluated as not ready. Further pod changes won't kick an endpoint change (due to readiness gates, the pod's address will stay in NotReadyAddresses until we do something). As probably seen in kubernetes-sigs#1205.

In order to react reliably to pod changes, we have to hook in a pod watch. Doing so is slightly expensive, as we have to map pod -> [service ->] endpoint -> ingress on pod events, though we limit the search to the pod's ns (services can only reference pods from their own ns, and ingresses services from their ns).
This should be fixed on master now.
Can we get a point release for this bugfix? It affects us as well.
@casret - FWIW, I built my own image from |
@M00nF1sh please add this to this week's release. It is maybe the most important fix.
Seems like this went into
Fixed via #1214. Release notes: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/releases/
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Bumping this up.
Experiencing the same with alb-ingress-controller v1.1.8 and EKS 1.17 |
Pretty sure I'm still seeing this on alb-ingress-controller v1.1.9 and EKS 1.16. |
Well, maybe I found something? It turns out that if I use the named port as described in the docs, then the target never gets added. However, if I flip my annotation back to using the port number, it all works as expected. @M00nF1sh I don't think anything has changed regarding the port naming requirement, has it? Is this expected behavior?
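To illustrate the named-port vs. numeric-port difference being described (a sketch only, not the commenter's actual manifests; all names are hypothetical, and the `extensions/v1beta1` API is the one in use in the v1.1.x controller era): a Kubernetes ingress backend may reference a service port either by its name or by its number.

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-ingress                  # hypothetical
  annotations:
    kubernetes.io/ingress.class: alb
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: my-service
              servicePort: http     # named port: reported here as failing to register
              # servicePort: 80     # numeric port: reported as working as expected
```

Both forms are valid Kubernetes, so a difference in behavior between them would indeed suggest a controller-side lookup issue.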
Also having the same issue with version aws-alb-ingress-controller:v1.1.8 & EKS 1.17.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen |
@wildersachin: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
For people that might still have the issue with the readiness gate: it seems we have to label the namespace before we can use readiness gates. It's documented here, but this documentation is not visible under the main branch.
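The namespace labeling referred to here applies to the v2 AWS Load Balancer Controller: namespaces carrying the `elbv2.k8s.aws/pod-readiness-gate-inject: enabled` label have the readiness gate injected into new pods automatically by the controller's webhook. A minimal sketch (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace                              # hypothetical namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```

Note the injection only happens at pod creation time, so existing pods need to be recreated after the label is added.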
@alamsyahho, pod readiness gate documentation is available in the live docs as well - https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/pod_readiness_gate/ |
@kishorj Thanks for pointing it out for me 👍 |
I've upgraded to v1.1.6 to make use of the pod readiness gate feature, to reduce 502/504s during HPA scale events. I then proceeded to update my deployment per this document.
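The deployment change in question adds a `readinessGates` entry to the pod template, following the v1 controller's documented condition-type pattern (the ingress, service, and port names below are hypothetical placeholders):

```yaml
spec:
  template:
    spec:
      readinessGates:
        # Pattern: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<port>
        - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
```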
After I updated my deployment to have the readiness gate, the first pod that had the readinessGate spec got registered by the controller just fine, with the following status:
The next pod of the same deployment had its status stuck like this:
I waited a good 5 minutes for that pod to get registered into the target group, but it was not. I could reproduce this error by performing multiple rollouts of the same deployment. However, if I delete the controller pod and let it restart, it recognizes these pods and registers them into the target group. After those registrations, it stops registering pods again.
Is there anything wrong with the approach I took? Or is this an issue on the controller's end?