
Too long a delay between the controller UPDATE operation and the ModifyRule action (causing downtime) #3588

Open
nessa829 opened this issue Feb 22, 2024 · 7 comments

Comments

@nessa829

Describe the bug

  • We are using Argo Rollouts with a canary strategy and an ALB-backed Ingress in an AWS EKS environment.
  • When I trigger a new image sync before the canary rollout is complete, downtime occurs. (This is not a typical deployment process, but it sometimes happens.)
  • Tracing the Argo Rollouts code and the actual rollout logs, the sequence is as follows:
  1. As soon as the sync happens, the rollout detects the new canary (2024-02-22T01:32:14Z UTC)
  2. The service selector is switched from the old canary to the new canary (2024-02-22T01:32:14Z UTC)
  3. The canary hash in the rollout status is changed to the new canary (2024-02-22T01:32:14Z UTC)
  4. The ingress weight is changed from 50:50 to 100:0, because the new canary is not ready yet (2024-02-22T01:32:14Z UTC)
  5. The old canary is deleted (2024-02-22T01:32:14Z UTC)
  • I was puzzled that 5xx errors occurred even though the ingress weight had been changed to 100:0.
  • So I analyzed the lb-controller logs as well and found that the UPDATE operation request was received at almost the same time:
{"level":"debug","ts":"**2024-02-22T01:32:14Z**","logger":"validating_handler","msg":"validating webhook request","request":{"uid":"1b140c2d-b347-46aa-9f2b-5efe2528a3da","kind":{"group":"networking.k8s.io","version":"v1","kind":"Ingress"},"resource":{"group":"networking.k8s.io","version":"v1","resource":"ingresses"},"requestKind":{"group":"networking.k8s.io","version":"v1","kind":"Ingress"},"requestResource":{"group":"networking.k8s.io","version":"v1","resource":"ingresses"},"name":"internal-alb-ingress-init-ingress-canary","namespace":"service-sre-test","operation":"UPDATE","userInfo":{"username":"system:serviceaccount:argo-rollouts:argo-rollouts-redated","uid":"722a51d6-484b-4fd6-9d78-b0df5aeca6d7","groups":["system:serviceaccounts","system:serviceaccounts:argo-rollouts","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["argo-rollouts-redated"],"authentication.kubernetes.io/pod-uid":["0be3d23e-8bd7-4698-b510-845b48a85cf4"]}},"object":{"kind":"Ingress","apiVersion":"networking.k8s.io/v1","metadata":{"name":"internal-alb-ingress-init-ingress-canary","namespace":"service-sre-test","uid":"cbfbaa01-768a-45f7-9cda-3e9bff7f571d","resourceVersion":"278711068","generation":4,"creationTimestamp":"2023-10-19T07:07:54Z","labels":{"app.kubernetes.io/name":"internal-alb-ingress-init-ingress-canary","argocd.argoproj.io/instance":"alpha-sre-test"},"annotations":{"alb.ingress.kubernetes.io/actions.sre-test-root":"{\"Type\":\"forward\",\"ForwardConfig\":{\"TargetGroups\":[{\"ServiceName\":\"sre-test-canary\",\"ServicePort\":\"80\",\"Weight\":0},{\"ServiceName\":\"sre-test-stable\",\"ServicePort\":\"80\",\"Weight\":100}]}}","alb.ingress.kubernetes.io/conditions.sre-test-root":"[{\"field\":\"host-header\",\"hostHeaderConfig\":{\"values\":[redated]}}]\n","alb.ingress.kubernetes.io/group.name":"internal-service","alb.ingress.kubernetes.io/healthcheck-path":"/hello","alb.ingress.kubernetes.io/listen-ports":"[{\"HTTP\": 8080}]","alb.ingress.kubernetes.io/load-balancer-name":"ingress-internal-alb","alb.ingress.kubernetes.io/scheme":"internal","alb.ingress.kubernetes.io/subnets":"redated","alb.ingress.kubernetes.io/target-group-attributes":"deregistration_delay.timeout_seconds=40","alb.ingress.kubernetes.io/target-type":"ip","kubectl.kubernetes.io/last-applied-configuration":"...


{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"validating_handler","msg":"validating webhook response","response":{"Patches":null,"uid":"","allowed":true,"status":{"metadata":{},"code":200}}}


{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validate-networking-v1-ingress","code":200,"reason":"","UID":"1b140c2d-b347-46aa-9f2b-5efe2528a3da","allowed":true}

{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/mutate-v1-pod","UID":"9c41e46c-b6b9-40aa-b20f-de3e2c802d68","kind":"/v1, Kind=Pod","resource":{"group":"","version":"v1","resource":"pods"}}
  • However, the actual ModifyRule action that changes the weight was triggered only about 1 minute and 10 seconds later (from CloudTrail logs: February 22, 2024, 1:33:23 UTC).
  • After that, once the new canary is ready, the weight changes back to 50:50.

Ultimately, the problem seems to be the time gap between the lb-controller receiving the UPDATE request for the weight change and the actual ELB ModifyRule API call that modifies the listener rule.

I don't see any other errors in the lb-controller logs, nor any limit-related issues.

Steps to reproduce
As described above.

Expected outcome

No downtime when the weight changes.

Environment

  • LB controller version: aws-load-balancer-controller:v2.6.2
  • ALB annotations / Ingress settings:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/conditions.sre-test-root: >
      [{"field":"host-header","hostHeaderConfig":{"values":["domain-info"]}}]
    alb.ingress.kubernetes.io/group.name: internal-service
    alb.ingress.kubernetes.io/healthcheck-path: /hello
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 8080}]'
    alb.ingress.kubernetes.io/load-balancer-name: ingress-internal-alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: 'subnet-info'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=40
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/name: internal-alb-ingress-init-ingress-canary
    argocd.argoproj.io/instance: alpha-sre-test
  name: internal-alb-ingress-init-ingress-canary
  namespace: service-sre-test
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - backend:
              service:
                name: sre-test-root
                port:
                  name: use-annotation
            path: /
            pathType: Prefix

  • AWS Load Balancer controller version
    aws-load-balancer-controller:v2.6.2
  • Kubernetes version
    1.27
  • Using EKS (yes/no), if so version?
    Yes, 1.27

Additional Context:
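For reference, the Rollout's traffic-routing configuration looks roughly like the sketch below. The service and ingress names are taken from the annotations above; the rest (steps, weights) is illustrative rather than the exact manifest.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: sre-test
  namespace: service-sre-test
spec:
  strategy:
    canary:
      canaryService: sre-test-canary   # matches the forward-action annotation above
      stableService: sre-test-stable
      trafficRouting:
        alb:
          ingress: internal-alb-ingress-init-ingress-canary
          rootService: sre-test-root   # the use-annotation backend in the Ingress
          servicePort: 80
      steps:                           # illustrative steps, not the exact ones used
        - setWeight: 50
        - pause: {}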

@oliviassss
Collaborator

@nessa829, hi, I'm wondering how many resources (ingresses/services/TGs) there are in your VPC? Since the controller uses a queue to handle the CRUD events, it may see latency when there is a large number of resources/events. In our latest version, v2.7.1, we improved the controller's performance by adding an ELB cache. You can also consider enabling the Resource Groups Tagging API via the flag --feature-gates=EnableRGTAPI=true to improve performance.
Please check the release notes for details: https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.5.2
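If it helps, a minimal sketch of setting that flag on the controller Deployment (only the feature-gate flag is the suggestion here; the cluster name and other fields are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=my-cluster            # placeholder
            - --feature-gates=EnableRGTAPI=true    # use the Resource Groups Tagging API for resource lookups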

@nessa829
Author

@oliviassss hi, thank you for the reply.
There are several hundred ingresses / services / TGs (< 300 of each).
I had already enabled --feature-gates=EnableRGTAPI=true, and I have updated my lb-controller to v2.7.1 (Helm chart 1.7.1) as you suggested, but I still experience periodic rate-exceeded errors.

I also tried the Argo Rollouts --aws-verify-target-group option and observed a slight delay in old-canary termination, but there is still downtime. :(
Additionally, there are some failed-verification entries in the rollout pod log:

time="2024-02-29T03:36:31Z" level=warning msg="Failed to verify weight: operation error Elastic Load Balancing v2: DescribeLoadBalancers, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 060f9c79-964e-4c34-8c0e-500cd25f4e53, api error Throttling: Rate exceeded" event_reason=WeightVerifyError namespace=service-sre-test rollout=sre-test

As I wrote above, the rollout set the weight to 100:0 on time, and the lb-controller also received the request on time.
However, the actual ModifyRule API call happens late.
I suspect it may be a problem on the AWS ALB side, due to heavy API load. :(

I only have < 30 LBs, < 300 TGs, and < 100 listeners per LB. Can that be too much for the ELB APIs?

Thank you.

@oliviassss
Collaborator

@nessa829, thanks, it should be fine at your scale, especially with all the mitigations in place. Would you be able to provide the controller logs from the time window where you saw the issue, so we can take a further look? It would also help if you could provide the load balancer ARN. You can send them via email to k8s-alb-controller-triage AT amazon.com, or reach out to me on Kubernetes Slack (oliviassss). Thanks

@nessa829
Author

nessa829 commented Mar 4, 2024

@oliviassss Thank you for your interest.
I have sent an email to k8s-alb-controller-triage at amazon.com with the subject 'lb-controller github issue #3588'.

Thank you!

@Mufaddal5253110

Mufaddal5253110 commented Apr 3, 2024

@oliviassss It's the identical problem I am having. When I dug deeper, I discovered that there is a traffic-shift delay on the load balancer side. Because dynamicStableScale is enabled, the older ReplicaSet was scaled down, so traffic that was still routed to those pods during that delay window caused errors. This is what I saw during the most recent shift of traffic from 80% to 100% in my canary steps.

@nessa829 I worked around this for now by disabling dynamicStableScale and increasing scaleDownDelaySeconds to 60 seconds (the default is 30 seconds). However, as a result we can no longer scale down the older ReplicaSet during deployment, which roughly doubles CPU, memory, and other resource usage (see the sketch below).
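A minimal sketch of that change, assuming the standard Argo Rollouts canary fields (the rest of the spec is omitted):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      dynamicStableScale: false    # keep the stable ReplicaSet fully scaled during the rollout
      scaleDownDelaySeconds: 60    # delay scale-down of old pods; the default is 30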

@deathsurgeon1

Hi @Mufaddal5253110 @oliviassss @nessa829, I am also stuck with a similar problem: as soon as the canary deployment completes and the last set of pods in the stable ReplicaSet is marked for deletion, we see a lot of 503s. In our case we need dynamicStableScale = true, because we have to scale down the old replicas in the stable ReplicaSet due to resource constraints. Can you folks guide me toward any possible solution to this problem?
Thanks!!

@Mufaddal5253110

@deathsurgeon1 By adjusting the canary steps you can reduce it, but not eliminate it until this issue is fixed on the ALB controller side (see the sketch below). Here is the blog I wrote explaining the entire scenario we faced and the solution we tried.
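Roughly, the idea is to hold each weight for a while before the next shift, so the ALB listener-rule change has time to apply before old pods go away (weights and pause durations here are illustrative, not from our setup):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 50
        - pause: {duration: 120s}
        - setWeight: 80
        - pause: {duration: 120s}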
