Target does not get registered if using readinessGate #1205

Closed
vedatappsamurai opened this issue Mar 25, 2020 · 34 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vedatappsamurai

I've upgraded to v1.1.6 to make use of the pod readiness gate feature to reduce 502/504s during HPA scaling. I then proceeded to update my deployment per this document.
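For reference, the readiness gate update amounts to adding a readinessGates entry to the deployment's pod template spec. A minimal sketch (deployment, image, ingress, and service names here are placeholders; the condition type has to follow the controller's <ingress>_<service>_<service-port> pattern shown in the status below):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                   # placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      readinessGates:
      # conditionType must match <ingress>_<service>_<service-port>
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
      containers:
      - name: app
        image: my-app:latest     # placeholder
        ports:
        - containerPort: 80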

After I updated my deployment to have the readiness gate, the first pod that had the readinessGate spec got registered by the controller just fine, with the following status:

status:
  conditions:
  - lastProbeTime: "2020-03-25T07:17:09Z"
    lastTransitionTime: "2020-03-25T07:13:48Z"
    status: "True"
    type: target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>

The next pod of the same deployment had its status stuck like this:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T07:10:29Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

I waited a good 5 minutes for that pod to get registered into the target group, but it never was. I could reproduce this by performing multiple rollouts of the same deployment. However, if I delete the controller pod and let it restart, it recognizes these pods and registers them into the target group. After those registrations, it stops registering new pods again.

Is there anything wrong with the approach I took, or is this an issue on the controller's end?

@ivanmp91

ivanmp91 commented Mar 25, 2020

I'm not sure if it's the same issue, but it at least looks like I'm experiencing similar behaviour. I also started testing the new 1.1.6 version in order to use the pod readiness gate, and for me it happens the other way around: when a new pod is launched using the readiness gate, it takes around 5-10 minutes to update the pod readiness gate status and register the new pod in the target group. I've also found cases where the pod never gets updated and I have to delete it so a new one is created. When a new pod is launched, I get exactly the same status reported in this issue:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T11:13:10Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-03-25T11:13:10Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/link-handling-puma_link-handling-puma_80"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

During the time before the readiness gate is updated for the new pod, I've been checking the target group: the new pod is not registered there and no activity is shown in the controller logs. Only after waiting several minutes does the controller generate these logs, successfully update the readiness gate status for the pod, and register it in the target group:

I0325 11:21:09.174585       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}
I0325 11:21:10.580955       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}
I0325 11:21:51.091404       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"web-qa-ckapp-web"}
I0325 11:21:51.591457       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"admin-web-qa-ckapp-web"}
I0325 11:21:52.159385       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"admin-web-staging-ckapp-web"}
I0325 11:21:52.651093       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"web-staging-ckapp-web"}
I0325 11:21:52.855920       1 targethealth.go:149] link-handling/link-handling-puma: Starting reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-2:XXXXXX:targetgroup/dc11cbf1-537c2cee1b9ff018039/6a88d59c627c58b1
I0325 11:21:52.855952       1 targets.go:93] link-handling/link-handling-puma: Adding targets to arn:aws:elasticloadbalancing:us-east-2:XXXXXX:targetgroup/XXXXXX-XXXXXXXXXXXX: 10.11.93.9:80
I0325 11:21:53.305202       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}

I0325 11:22:13.285831       1 targethealth.go:174] link-handling/link-handling-puma: Stopping reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-2:XXXXXX:targetgroup/XXXXXX-XXXXXX/XXXXXX
I0325 11:22:13.768052       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}

As you can see in this sample, it took exactly 8 minutes to register the new pod. If it helps, here are the ingress and service definitions:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-2:XXXXXX:certificate/XXXXXX-XXXXXX,arn:aws:acm:us-east-2:XXXXXX:certificate/XXXXXX-XXXXXX,arn:aws:acm:us-east-2:XXXXXX:certificate/XXXXXX-XXXXXX,arn:aws:acm:us-east-2:XXXXXX:certificate/XXXXXX-XXXXXX
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
    alb.ingress.kubernetes.io/healthcheck-port: "80"
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/inbound-cidrs: 0.0.0.0/0
    alb.ingress.kubernetes.io/ip-address-type: ipv4
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: deletion_protection.enabled=true,routing.http2.enabled=true,idle_timeout.timeout_seconds=60
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/success-codes: "200"
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30,load_balancing.algorithm.type=round_robin
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
    kubernetes.io/ingress.class: alb
  labels:
  name: link-handling-puma
  namespace: link-handling
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: link-handling-puma
          servicePort: 80
        path: /*

apiVersion: v1
kind: Service
metadata:
  labels:
  name: link-handling-puma
  namespace: link-handling
spec:
  clusterIP: 172.20.13.140
  ports:
  - name: puma
    port: 80
    protocol: TCP
    targetPort: nginx
  selector:
    app.kubernetes.io/component: puma
    app.kubernetes.io/instance: link-handling
    app.kubernetes.io/name: link-handling
  sessionAffinity: None
  type: ClusterIP

@M00nF1sh
Collaborator

M00nF1sh commented Mar 25, 2020

@vedatappsamurai @ivanmp91 could you attach the logs from the controller pod during the deployment?
It seems like a controller bug if pods belonging to the same deployment get different behavior.

@vedatappsamurai
Author

I have the following deployment strategy on my deployment:

strategy:
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 1
  type: RollingUpdate

And ALB Ingress Controller left the following logs when I started the deployment:

I0326 06:48:30.713813       1 targets.go:108] default/<ingress-name>: Removing targets from arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<target-group-id>: 10.2.42.0:80
I0326 06:48:38.065220       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}
I0326 06:48:49.118161       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}

Since I have maxUnavailable: 1, it deregisters the one pod being terminated by the deployment, but it never registers the new pod. I continuously run kubectl get pods -o wide to monitor the pod, and the result looks like this:

NAME                                             READY   STATUS    RESTARTS   AGE     IP            NODE                                      NOMINATED NODE   READINESS GATES
<pod-id>                                         2/2     Running   0          4m16s   10.2.4.113    <node-id>                                 <none>           0/1

And describe shows the following status, as mentioned before:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-26T06:48:30Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/<ingress>_<service>_<service-port>"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

At the time of writing I was waiting for the readiness gate to see if it would pass; it eventually did, but it took 10 minutes. Notice the timestamps in the log:

I0326 06:48:30.713813       1 targets.go:108] default/<ingress-name>: Removing targets from arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<target-group-id>: 10.2.42.0:80
I0326 06:48:38.065220       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}
I0326 06:48:49.118161       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}
I0326 06:58:23.650176       1 targethealth.go:149] default/<ingress-name>: Starting reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<target-group-id>
I0326 06:58:23.650203       1 targets.go:93] default/<ingress-name>: Adding targets to arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<target-group-id>: 10.2.4.113:80
I0326 06:58:31.211503       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}
I0326 06:58:43.981059       1 targethealth.go:174] default/<ingress-name>: Stopping reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<target-group-id>
I0326 06:58:49.713082       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"<ingress-name>"}

It removed the old target at 06:48:30.713813 but only added the new target at 06:58:23.650203.

@ivanmp91

In my case, I've just tested increasing the number of pod replicas by 1. Here you can see the status of the pods: the new pod is up and running, but its readiness gate status still hasn't been updated:

➜ kubectl get pods -o wide
NAME                                     READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
link-handling-puma-fd676dc5c-72kwt       2/2     Running   0          23h     10.11.86.200   ip-10-11-88-116.us-east-2.compute.internal   <none>           1/1
link-handling-puma-fd676dc5c-hl857       2/2     Running   0          2m47s   10.11.28.255   ip-10-11-29-240.us-east-2.compute.internal   <none>           0/1

And this is the status condition of the new pod launched:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-26T10:49:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-03-26T10:49:47Z"
    message: corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/link-handling-puma_link-handling-puma_80"
      does not exist.
    reason: ReadinessGatesNotReady
    status: "False"
    type: Ready

After waiting for 23 minutes the readiness gate status was updated:

NAME                                     READY   STATUS    RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
link-handling-puma-fd676dc5c-72kwt       2/2     Running   0          23h   10.11.86.200   ip-10-11-88-116.us-east-2.compute.internal   <none>           1/1
link-handling-puma-fd676dc5c-hl857       2/2     Running   0          23m   10.11.28.255   ip-10-11-29-240.us-east-2.compute.internal   <none>           1/1 

These are the full controller logs over that whole period, from when I increased the deployment's replica count by 1 until the readiness gate status was updated:

I0326 10:49:40.580117       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}
I0326 10:49:41.161758       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"admin-web-staging-ckapp-web"}
I0326 10:49:41.701416       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"web-staging-ckapp-web"}
I0326 10:49:42.205008       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"web-qa-ckapp-web"}
I0326 10:49:42.732805       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"admin-web-qa-ckapp-web"}
I0326 10:50:07.826808       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}
I0326 11:11:43.865086       1 targethealth.go:149] link-handling/link-handling-puma: Starting reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-2:XXXXXXXXXX:targetgroup/XXXXXXXXXX-XXXXXXXXXX/XXXXXXXXXX
I0326 11:11:43.865113       1 targets.go:93] link-handling/link-handling-puma: Adding targets to arn:aws:elasticloadbalancing:us-east-2:XXXXXXXXXX:targetgroup/XXXXXXXXXX-XXXXXXXXXX/XXXXXXXXXX: 10.11.28.255:80
I0326 11:11:44.563524       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}
I0326 11:11:45.144717       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"admin-web-staging-ckapp-web"}
I0326 11:11:45.650006       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-web","Name":"web-staging-ckapp-web"}
I0326 11:11:46.201310       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"admin-web-qa-ckapp-web"}
I0326 11:11:46.705419       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"ckapp-qa-web","Name":"web-qa-ckapp-web"}
I0326 11:12:04.215478       1 targethealth.go:174] link-handling/link-handling-puma: Stopping reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-2:XXXXXXXXXX:targetgroup/XXXXXXXXXX-XXXXXXXXXX/XXXXXXXXXX
I0326 11:12:04.492053       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"link-handling","Name":"link-handling-puma"}

@jtnz

jtnz commented Mar 27, 2020

I can confirm we have the same problem. New pods come up as part of the deployment, but the controller doesn't seem to add them to the target group until some (seemingly) random period of time later (on the order of minutes).

This only applies to deployments using the readinessGate feature of v1.1.6.

Our unchanged deployments (that use the various health check/lifecycle hacks) are working as per usual.

@cupello

cupello commented Mar 30, 2020

Hi! I have the same problem, and after some tests I found that removing the readiness probe from my deployment fixes it. Is there some restriction on using readiness gates and a readiness probe in the same deployment with the ALB Ingress Controller?
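For context, the combination being asked about looks roughly like this (a minimal sketch with placeholder names; the probe path and timings are assumptions):

spec:
  template:
    spec:
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
      containers:
      - name: app
        image: my-app:latest          # placeholder
        readinessProbe:               # keeps the container not-ready until the check passes
          httpGet:
            path: /healthcheck        # assumed path
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10

Nothing in the readiness gate design requires removing the probe; as the root-cause analysis further down in this thread explains, the probe only matters because it keeps the containers not-ready at the moment the Endpoints object is first updated, which is exactly the event the controller ends up ignoring.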

@jorihardman

I can confirm the same issue. I created a new deployment (2 replicas total) and the first pod to come up in the ReplicaSet got stuck in the "ReadinessGatesNotReady" state. I tried removing my readiness probes per @cupello's suggestion, but that did not allow my deployment to progress. I rebooted the ingress controller, which allowed the stuck pod to progress to ready. HOWEVER, the next pod to boot in the ReplicaSet then got stuck in the same state. It seems the new pod is not being registered as a target on the ALB. Here are my controller logs after the first reboot:

-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.6
  Build:      git-95ee2ac8
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------
W0331 15:23:04.900241       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0331 15:23:08.061157       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0331 15:23:08.061353       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0331 15:23:08.061410       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0331 15:23:08.061535       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0331 15:23:08.061566       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0331 15:23:08.061671       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0331 15:23:08.061924       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0331 15:23:08.062127       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
I0331 15:23:25.129355       1 leaderelection.go:214] successfully acquired lease kube-system/ingress-controller-leader-alb
I0331 15:23:25.129510       1 recorder.go:53] kubebuilder/manager/events "level"=1 "msg"="Normal"  "message"="alb-ingress-controller-6ff568d65c-ttg64_84750dc1-7363-11ea-a14e-66f84451d774 became leader" "object"={"kind":"ConfigMap","namespace":"kube-system","name":"ingress-controller-leader-alb","uid":"573e0c1e-1dbe-11ea-8bfb-0e1d9427aea1","apiVersion":"v1","resourceVersion":"118836213"} "reason"="LeaderElection"
I0331 15:23:25.231649       1 controller.go:134] kubebuilder/controller "level"=0 "msg"="Starting Controller"  "controller"="alb-ingress-controller"
I0331 15:23:25.331774       1 controller.go:154] kubebuilder/controller "level"=0 "msg"="Starting workers"  "controller"="alb-ingress-controller" "worker count"=1
I0331 15:23:29.442755       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"staging-graphql-api-ingress"}
I0331 15:23:29.746935       1 targethealth.go:149] default/staging-rails-server-web-ingress: Starting reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-1:XXXX:targetgroup/bd528925-9af27d2c227753e1a41/152a9fe94eb2ac9a
I0331 15:23:29.746964       1 targets.go:93] default/staging-rails-server-web-ingress: Adding targets to arn:aws:elasticloadbalancing:us-east-1:XXXX:targetgroup/bd528925-9af27d2c227753e1a41/152a9fe94eb2ac9a: 172.18.242.141:8080
I0331 15:23:30.415652       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"staging-rails-server-web-ingress"}
I0331 15:23:45.176710       1 targets.go:108] default/staging-rails-server-web-ingress: Removing targets from arn:aws:elasticloadbalancing:us-east-1:XXXX:targetgroup/bd528925-9af27d2c227753e1a41/152a9fe94eb2ac9a: 172.18.166.195:8080
I0331 15:23:45.176787       1 targethealth.go:174] default/staging-rails-server-web-ingress: Stopping reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-1:XXXX:targetgroup/bd528925-9af27d2c227753e1a41/152a9fe94eb2ac9a
I0331 15:23:45.625641       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"staging-rails-server-web-ingress"}
I0331 15:23:46.321509       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"staging-rails-server-web-ingress"}
I0331 15:23:47.055137       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"staging-rails-server-web-ingress"}

I noticed the "Stopping reconciliation of pod condition status for target group" at the end there - could that be related?

@a6kme

a6kme commented Apr 2, 2020

I have the same problem. I work around it by restarting the alb-ingress-controller pod whenever I notice rollouts getting stuck. Linking the earlier issue #1131 that I raised some time ago.

Attaching some logs from before and after restarting the ALB Ingress Controller, which might give a hint.

Existing running pod and newly created pod of the deployment; the new pod is not ready even after 6 minutes:

events-api-events-api-6f6945bcfd-ggm49                           1/1     Running     0          47h     10.10.92.192   ip-10-10-79-119.ec2.internal   <none>           1/1
events-api-events-api-75ddbc8bc7-2xq7d                           1/1     Running     0          6m37s   10.10.92.33    ip-10-10-79-119.ec2.internal   <none>           0/1

Pre-restart logs

I0402 13:55:05.722808       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"pushowl-backend","Name":"pushowl-backend-ingress-alb"}
I0402 13:55:06.377859       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"pushowl-backend","Name":"pushowl-backend-ingress-alb"}
W0402 13:58:43.365978       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 5461814 (5462767)
W0402 14:13:21.370567       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 5464961 (5465449)
W0402 14:26:36.375179       1 reflector.go:270] pkg/mod/k8s.io/client-go@v0.0.0-20181213151034-8d9ed539ba31/tools/cache/reflector.go:95: watch of *v1.Endpoints ended with: too old resource version: 5467439 (5467792)
I0402 14:34:42.674948       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"pushowl-backend","Name":"pushowl-backend-ingress-alb"}

Post-restart logs

-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.6
  Build:      git-95ee2ac8
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0402 14:43:19.783472       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0402 14:43:23.006003       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0402 14:43:23.006262       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0402 14:43:23.006318       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0402 14:43:23.006445       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0402 14:43:23.006480       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0402 14:43:23.006595       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0402 14:43:23.006818       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0402 14:43:23.007038       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
I0402 14:43:39.178794       1 leaderelection.go:214] successfully acquired lease kube-system/ingress-controller-leader-alb
I0402 14:43:39.178929       1 recorder.go:53] kubebuilder/manager/events "level"=1 "msg"="Normal"  "message"="alb-ingress-controller-7b76cf75b4-9xzng_4ba51478-74f0-11ea-9828-6a688ffcef0a became leader" "object"={"kind":"ConfigMap","namespace":"kube-system","name":"ingress-controller-leader-alb","uid":"39944e91-6457-11ea-92c8-12bcfda3de27","apiVersion":"v1","resourceVersion":"5473020"} "reason"="LeaderElection"
I0402 14:43:39.279089       1 controller.go:134] kubebuilder/controller "level"=0 "msg"="Starting Controller"  "controller"="alb-ingress-controller"
I0402 14:43:39.379248       1 controller.go:154] kubebuilder/controller "level"=0 "msg"="Starting workers"  "controller"="alb-ingress-controller" "worker count"=1
I0402 14:43:43.983815       1 targethealth.go:149] pushowl-backend/pushowl-backend-ingress-alb: Starting reconciliation of pod condition status for target group: arn:aws:elasticloadbalancing:us-east-1:480681643791:targetgroup/60a14aa0-ff40d97fa59e9381bb7/635fcfe05b0b0107
I0402 14:43:43.983842       1 targets.go:93] pushowl-backend/pushowl-backend-ingress-alb: Adding targets to arn:aws:elasticloadbalancing:us-east-1:480681643791:targetgroup/60a14aa0-ff40d97fa59e9381bb7/635fcfe05b0b0107: 10.10.92.33:8000
I0402 14:43:44.785796       1 controller.go:236] kubebuilder/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="alb-ingress-controller" "request"={"Namespace":"pushowl-backend","Name":"pushowl-backend-ingress-alb"}

@metacyclic

metacyclic commented Apr 3, 2020

We are experiencing the same thing. Registration sometimes doesn't happen, and deregistration also sometimes doesn't happen, even when the pods have no running containers.

Restarting the ALB ingress controller pod seems to fix the issue, which is tough to manage during HPA scaling.

@nirnanaaa

I discovered that this is basically always true.

When new targets are added to a target group, this is also somehow 0.

@alfredkrohmer
Contributor

@nirnanaaa and I found the underlying issue.

When all containers in a pod have started, the pod IP appears in the NotReadyAddresses field of the corresponding Endpoints object. This update of the Endpoints object triggers a reconciliation loop; however, we ignore it if not all containers in the pod are ready at that point in time (which can be the case if there is a readinessProbe defined). Then, once all containers in the pod become ready, the pod still doesn't count as ready because the readiness gate is not yet fulfilled: its IP is not moved from NotReadyAddresses to Addresses, the Endpoints object doesn't get updated, thus no reconcile is triggered again for this Endpoints object, and new pods might just sit there until another update happens on the Endpoints object (for whatever reason).
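To illustrate, this is roughly what the corresponding Endpoints object looks like while the new pod is held back (a reconstructed sketch reusing the pod names and IPs from the kubectl output earlier in this thread, not actual captured output): the stuck pod's IP sits in notReadyAddresses, and because the readiness gate keeps it there, no further Endpoints update, and therefore no reconcile, is triggered.

apiVersion: v1
kind: Endpoints
metadata:
  name: link-handling-puma          # same name as the backing Service
  namespace: link-handling
subsets:
- addresses:                        # pods that are fully Ready
  - ip: 10.11.86.200
    targetRef:
      kind: Pod
      name: link-handling-puma-fd676dc5c-72kwt
  notReadyAddresses:                # new pod stuck here until the readiness gate is fulfilled
  - ip: 10.11.28.255
    targetRef:
      kind: Pod
      name: link-handling-puma-fd676dc5c-hl857
  ports:
  - name: puma
    port: 80
    protocol: TCP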

The solution to this would be to add a pod reconciler which triggers endpoints reconciliations for any associated Endpoints objects. However, this will likely have other implications for the controller (more CPU & memory requirements because it will keep all the pod objects in memory). @M00nF1sh any thoughts?

@bpineau

bpineau commented Apr 6, 2020

@devkid We came to the same conclusion (as seen in #1214). I confirm we're back to fast rollouts with that change.

AFAIK we already implicitly have informers hooked on Pod objects (see for instance the failure messages in #1209).

@nirnanaaa

@bpineau your PR works very well. Testing it right now at scale.

bpineau added a commit to DataDog/aws-alb-ingress-controller that referenced this issue Apr 7, 2020
We're relying on endpoints events to re-trigger reconciliations during
rollouts, and we're considering pod's containers status (eg. are all pod's
containers ContainerReady?) to either act upon, or swallow those events.

Both (pods changes, ep changes) can be out of sync. For instance: a starting
pod whose containers aren't all ContainerReady might have its addresses
registered to an ep subset's NotReadyAddresses, kicking a reconcile event
which won't propagate to the TargetGroups (+ conditions updates) since the pod
is evaluated as not ready. Further pod changes won't kick an endpoint change
(due to readiness gates, the pod's address will stay in NotReadyAddresses
until we do something). As probably seen in  kubernetes-sigs#1205 .

In order to react reliably on pods changes, we have to hook in a pod watch.
Doing so is slightly expensive as we have to map pod -> [service ->]
endpoint -> ingress on pods events, though we limit the search to pod's ns
(services can only reference pods from their own ns, and ingress services from
their ns).
@alfredkrohmer
Contributor

This should be fixed on master now.

@casret

casret commented Apr 13, 2020

Can we get a point release for this bugfix? It affects us as well.

@a6kme

a6kme commented Apr 14, 2020

@casret - FWIW, I built my own image from the master branch by following the build instructions, and it's working like a charm. Feel free to use it if you like. It is hosted at pushowl/aws-alb-ingress-controller:v1.1.6.

@M00nF1sh
Collaborator

@casret we plan to do a release this week. We will try to include #1211, and will release without it if we cannot merge it before Friday.

@bgbraga

bgbraga commented Apr 15, 2020

@M00nF1sh please add this to this week's release. It may be the most important fix.

@metacyclic

Seems like this went into 1.1.7?

@mailjunze

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2020
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 19, 2020
@MaximF

MaximF commented Aug 26, 2020

Bumping this up.
I'm seeing the same issue with alb-ingress-controller 1.1.9 and EKS 1.17.
These are basically the latest versions available at the time of writing.

@larrywax

larrywax commented Sep 1, 2020

Experiencing the same with alb-ingress-controller v1.1.8 and EKS 1.17

@clayvan

clayvan commented Sep 2, 2020

Pretty sure I'm still seeing this on alb-ingress-controller v1.1.9 and EKS 1.16.

@clayvan

clayvan commented Sep 2, 2020

Well maybe I found something?

It turns out that if I use the named port as described in the docs, the target never gets added. However, if I flip my annotation back to using the port number, everything works as expected.
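The comment doesn't show which annotation or field was flipped, so purely as a hypothetical illustration of the named-port vs numeric-port difference, following the Ingress/Service shape posted earlier in this thread (the http port name and service name are assumptions):

# Service exposing a named port
apiVersion: v1
kind: Service
metadata:
  name: my-service                  # placeholder
spec:
  ports:
  - name: http
    port: 80
    targetPort: http

# Ingress backend referencing the port by name (the form that reportedly never got registered):
# - backend:
#     serviceName: my-service
#     servicePort: http
#
# Ingress backend referencing the port by number (the form that reportedly worked):
# - backend:
#     serviceName: my-service
#     servicePort: 80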

@M00nF1sh I don't think anything has changed regarding the port naming requirement, has it? Is this expected behavior?

@johnniee

Also having the same issue (corresponding condition of pod readiness gate "target-health.alb.ingress.k8s.aws/" does not exist) with aws-alb-ingress-controller:v1.1.8 & EKS 1.17.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wildersachin

/reopen

@k8s-ci-robot
Contributor

@wildersachin: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alamsyahho

For people who might still have this issue with the readiness gate: it seems we have to label the namespace before we can use the readiness gate. It's documented here, but this documentation is not visible under the main branch.
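For the v2 controller (aws-load-balancer-controller), the namespace labeling mentioned above looks roughly like this; a sketch assuming the elbv2.k8s.aws/pod-readiness-gate-inject label from the linked documentation and a placeholder namespace name:

apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace                # placeholder
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

Equivalently: kubectl label namespace my-namespace elbv2.k8s.aws/pod-readiness-gate-inject=enabled. Since the gate is injected by a mutating webhook at pod creation time, existing pods need to be recreated to pick it up.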

@kishorj
Collaborator

kishorj commented Aug 26, 2021

@alamsyahho, pod readiness gate documentation is available in the live docs as well - https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/pod_readiness_gate/
From the live docs page https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/, click on the Deployment tab in the top left section, then click on the Pod Readiness Gate item in the left menu. If you do not see the tab on the live docs, you might have to click on the three dashes in the top left corner.

@alamsyahho

@kishorj Thanks for pointing it out for me 👍

alebedev87 pushed a commit to alebedev87/aws-load-balancer-controller that referenced this issue Oct 26, 2023
We're relying on endpoints events to re-trigger reconciliations during
rollouts, and we're considering pod's containers status (eg. are all pod's
containers ContainerReady?) to either act upon, or swallow those events.

Both (pods changes, ep changes) can be out of sync. For instance: a starting
pod whose containers aren't all ContainerReady might have its addresses
registered to an ep subset's NotReadyAddresses, kicking a reconcile event
which won't propagate to the TargetGroups (+ conditions updates) since the pod
is evaluated as not ready. Further pod changes won't kick an endpoint change
(due to readiness gates, the pod's address will stay in NotReadyAddresses
until we do something). As probably seen in  kubernetes-sigs#1205 .

In order to react reliably on pods changes, we have to hook in a pod watch.
Doing so is slightly expensive as we have to map pod -> [service ->]
endpoint -> ingress on pods events, though we limit the search to pod's ns
(services can only reference pods from their own ns, and ingress services from
their ns).