
Too long a delay between the controller UPDATE operation and the ModifyRule action (causing downtime) #3588

Open
nessa829 opened this issue Feb 22, 2024 · 7 comments

Comments

@nessa829

Describe the bug

  • We are using Argo Rollouts with a canary strategy and an ALB-backed Ingress in an AWS EKS environment.
  • When I trigger a new image sync before the canary rollout is complete, downtime occurs. (This is not a typical deployment process, but it sometimes happens.)
  • Tracing the Argo Rollouts code and the actual rollout logs, the sequence is as follows:
  1. As soon as the sync happens, the rollout detects the new canary (2024-02-22T01:32:14Z UTC)
  2. The service selector is switched from the old canary to the new canary (2024-02-22T01:32:14Z UTC)
  3. The canary hash in the rollout status is changed to the new canary (2024-02-22T01:32:14Z UTC)
  4. The ingress weight is changed from 50:50 to 100:0, because the new canary is not ready yet (2024-02-22T01:32:14Z UTC)
  5. The old canary is deleted (2024-02-22T01:32:14Z UTC)
  • I was puzzled that 5xx errors occurred even though the ingress weight had been changed to 100:0.
  • So I analyzed the lb-controller logs as well and found that the UPDATE operation request was received at almost the same time:
{"level":"debug","ts":"**2024-02-22T01:32:14Z**","logger":"validating_handler","msg":"validating webhook request","request":{"uid":"1b140c2d-b347-46aa-9f2b-5efe2528a3da","kind":{"group":"networking.k8s.io","version":"v1","kind":"Ingress"},"resource":{"group":"networking.k8s.io","version":"v1","resource":"ingresses"},"requestKind":{"group":"networking.k8s.io","version":"v1","kind":"Ingress"},"requestResource":{"group":"networking.k8s.io","version":"v1","resource":"ingresses"},"name":"internal-alb-ingress-init-ingress-canary","namespace":"service-sre-test","operation":"UPDATE","userInfo":{"username":"system:serviceaccount:argo-rollouts:argo-rollouts-redated","uid":"722a51d6-484b-4fd6-9d78-b0df5aeca6d7","groups":["system:serviceaccounts","system:serviceaccounts:argo-rollouts","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["argo-rollouts-redated"],"authentication.kubernetes.io/pod-uid":["0be3d23e-8bd7-4698-b510-845b48a85cf4"]}},"object":{"kind":"Ingress","apiVersion":"networking.k8s.io/v1","metadata":{"name":"internal-alb-ingress-init-ingress-canary","namespace":"service-sre-test","uid":"cbfbaa01-768a-45f7-9cda-3e9bff7f571d","resourceVersion":"278711068","generation":4,"creationTimestamp":"2023-10-19T07:07:54Z","labels":{"app.kubernetes.io/name":"internal-alb-ingress-init-ingress-canary","argocd.argoproj.io/instance":"alpha-sre-test"},"annotations":{"alb.ingress.kubernetes.io/actions.sre-test-root":"{\"Type\":\"forward\",\"ForwardConfig\":{\"TargetGroups\":[{\"ServiceName\":\"sre-test-canary\",\"ServicePort\":\"80\",\"Weight\":0},{\"ServiceName\":\"sre-test-stable\",\"ServicePort\":\"80\",\"Weight\":100}]}}","alb.ingress.kubernetes.io/conditions.sre-test-root":"[{\"field\":\"host-header\",\"hostHeaderConfig\":{\"values\":[redated]}}]\n","alb.ingress.kubernetes.io/group.name":"internal-service","alb.ingress.kubernetes.io/healthcheck-path":"/hello","alb.ingress.kubernetes.io/listen-ports":"[{\"HTTP\": 8080}]","alb.ingress.kubernetes.io/load-balancer-name":"ingress-internal-alb","alb.ingress.kubernetes.io/scheme":"internal","alb.ingress.kubernetes.io/subnets":"redated","alb.ingress.kubernetes.io/target-group-attributes":"deregistration_delay.timeout_seconds=40","alb.ingress.kubernetes.io/target-type":"ip","kubectl.kubernetes.io/last-applied-configuration":"...


{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"validating_handler","msg":"validating webhook response","response":{"Patches":null,"uid":"","allowed":true,"status":{"metadata":{},"code":200}}}


{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validate-networking-v1-ingress","code":200,"reason":"","UID":"1b140c2d-b347-46aa-9f2b-5efe2528a3da","allowed":true}

{"level":"debug","ts":"2024-02-22T01:32:14Z","logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/mutate-v1-pod","UID":"9c41e46c-b6b9-40aa-b20f-de3e2c802d68","kind":"/v1, Kind=Pod","resource":{"group":"","version":"v1","resource":"pods"}}
  • However, the actual ModifyRule action that changes the weight was triggered only about 1 minute and 10 seconds later (from CloudTrail logs: February 22, 2024, 1:33:23 UTC).
  • After that, once the new canary is ready, the weight changes back to 50:50.

Ultimately, the problem seems to be the time gap between the lb-controller receiving the UPDATE request for the weight change and the actual ELB ModifyRule API call that modifies the listener rule.

I don't see any other errors in the lb-controller logs, nor any limit-related issues.

Steps to reproduce
As described above.

Expected outcome

No downtime when the weight changes.

Environment

  • LB controller version: aws-load-balancer-controller:v2.6.2
  • ALB annotations / Ingress settings:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/conditions.sre-test-root: >
      [{"field":"host-header","hostHeaderConfig":{"values":["domain-info"]}}]
    alb.ingress.kubernetes.io/group.name: internal-service
    alb.ingress.kubernetes.io/healthcheck-path: /hello
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 8080}]'
    alb.ingress.kubernetes.io/load-balancer-name: ingress-internal-alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: 'subnet-info'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=40
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/name: internal-alb-ingress-init-ingress-canary
    argocd.argoproj.io/instance: alpha-sre-test
  name: internal-alb-ingress-init-ingress-canary
  namespace: service-sre-test
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - backend:
              service:
                name: sre-test-root
                port:
                  name: use-annotation
            path: /
            pathType: Prefix

  • AWS Load Balancer controller version
    aws-load-balancer-controller:v2.6.2
  • Kubernetes version
    1.27
  • Using EKS (yes/no), if so version?
    Yes, 1.27

Additional Context:
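For reference, the Rollout's traffic-routing configuration looks roughly like the sketch below. The service and ingress names are taken from the annotations above; the rest (steps, weights) is illustrative rather than the exact manifest.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: sre-test
  namespace: service-sre-test
spec:
  strategy:
    canary:
      canaryService: sre-test-canary   # matches the forward-action annotation above
      stableService: sre-test-stable
      trafficRouting:
        alb:
          ingress: internal-alb-ingress-init-ingress-canary
          rootService: sre-test-root   # the use-annotation backend in the Ingress
          servicePort: 80
      steps:                           # illustrative steps, not the exact ones used
        - setWeight: 50
        - pause: {}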

@oliviassss
Collaborator

@nessa829, hi, I'm wondering how many resources (ingresses/services/TGs) there are in your VPC? Since the controller uses a queue to handle the CRUD events, it may see latency when there is a large number of resources/events. In our latest version, v2.7.1, we improved the controller's performance by adding an ELB cache. You can also consider enabling the Resource Groups Tagging API via the flag --feature-gates=EnableRGTAPI=true to improve performance.
Please check the release notes for details: https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.5.2
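If it helps, a minimal sketch of setting that flag on the controller Deployment (only the feature-gate flag is the suggestion here; the cluster name and other fields are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=my-cluster            # placeholder
            - --feature-gates=EnableRGTAPI=true    # use the Resource Groups Tagging API for resource lookups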

@nessa829
Author

@oliviassss hi, thank you for the reply.
There are several hundred ingresses / services / TGs (< 300 of each).
I had already enabled --feature-gates=EnableRGTAPI=true, and I have updated my lb-controller to v2.7.1 (Helm chart 1.7.1) as you suggested, but I still experience periodic rate-exceeded errors.

I also tried the Argo Rollouts --aws-verify-target-group option and observed a slight delay in old-canary termination, but there is still downtime. :(
Additionally, there are some failed-verification entries in the rollout pod log:

time="2024-02-29T03:36:31Z" level=warning msg="Failed to verify weight: operation error Elastic Load Balancing v2: DescribeLoadBalancers, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 060f9c79-964e-4c34-8c0e-500cd25f4e53, api error Throttling: Rate exceeded" event_reason=WeightVerifyError namespace=service-sre-test rollout=sre-test

As I wrote above, the rollout set the weight to 100:0 on time, and the lb-controller also received the request on time.
However, the actual ModifyRule API call happens late.
I suspect it may be a problem on the AWS ALB side, due to heavy API load. :(

I only have < 30 LBs, < 300 TGs, and < 100 listeners per LB. Can that be too much for the ELB APIs?

Thank you.

@oliviassss
Collaborator

@nessa829, thanks, it should be fine at your scale, especially with all the mitigations in place. Would you be able to provide the controller logs from the time window where you saw the issue, so we can take a further look? It would also help if you could provide the load balancer ARN. You can send them via email to k8s-alb-controller-triage AT amazon.com, or reach out to me on Kubernetes Slack (oliviassss). Thanks

@nessa829
Author

nessa829 commented Mar 4, 2024

@oliviassss Thank you for your interest.
I have sent an email to k8s-alb-controller-triage at amazon.com with the subject 'lb-controller github issue #3588'.

Thank you!

@Mufaddal5253110

Mufaddal5253110 commented Apr 3, 2024

@oliviassss It's the identical problem I am having. When I dug deeper, I discovered that there is a traffic-shift delay on the load balancer side. Because dynamicStableScale is enabled, the older ReplicaSet was scaled down, so traffic that was still routed to those pods during that delay window caused errors. This is what I saw during the most recent shift of traffic from 80% to 100% in my canary steps.

@nessa829 I worked around this for now by disabling dynamicStableScale and increasing scaleDownDelaySeconds to 60 seconds (the default is 30 seconds). However, as a result we can no longer scale down the older ReplicaSet during deployment, which roughly doubles CPU, memory, and other resource usage (see the sketch below).
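A minimal sketch of that change, assuming the standard Argo Rollouts canary fields (the rest of the spec is omitted):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      dynamicStableScale: false    # keep the stable ReplicaSet fully scaled during the rollout
      scaleDownDelaySeconds: 60    # delay scale-down of old pods; the default is 30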

@deathsurgeon1

Hi @Mufaddal5253110 @oliviassss @nessa829, I am also stuck with a similar problem: as soon as the canary deployment completes and the last set of pods in the stable ReplicaSet is marked for deletion, we see a lot of 503s. In our case we need dynamicStableScale = true, because we have to scale down the old replicas in the stable ReplicaSet due to resource constraints. Can you folks guide me toward any possible solution to this problem?
Thanks!!

@Mufaddal5253110

@deathsurgeon1 By adjusting the canary steps you can reduce it, but not eliminate it until this issue is fixed on the ALB controller side (see the sketch below). Here is the blog I wrote explaining the entire scenario we faced and the solution we tried.
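Roughly, the idea is to hold each weight for a while before the next shift, so the ALB listener-rule change has time to apply before old pods go away (weights and pause durations here are illustrative, not from our setup):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 50
        - pause: {duration: 120s}
        - setWeight: 80
        - pause: {duration: 120s}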
