Rate limit issues cause deadlock issues, which cause outages. #1048

@tecnobrat

Description

Version: 1.1.2

We encountered a deadlock in the ALB ingress controller caused by AWS API rate limits (with the backoff likely compounding the rate limit problem), which caused an outage.

The deadlock was caused by this:

E1015 14:45:11.579790       1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="error getting web acl for load balancer arn:aws:elasticloadbalancing:us-east-1:062151437226:loadbalancer/app/fffb5690-default-broadcast-bf66/a84fdc267491f064: ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: 9f03f8c3-f087-4c5c-88e3-545fb3ae5c47"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"broadcaster-job-ui"}

This blocked the reconciler for 10 minutes.
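For context on why a single throttled resource can stall reconciliation for minutes, here is a minimal sketch (not the controller's actual code, and the limiter parameters are assumptions for illustration) of how a per-item exponential failure backoff in the workqueue grows as each throttled reconcile re-queues the same key:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Assumed parameters for illustration: 1s base delay, capped at 1000s.
	// The controller's real rate limiter configuration may differ.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 1000*time.Second)

	key := "default/broadcaster-job-ui"
	for i := 1; i <= 10; i++ {
		// Each failed reconcile (e.g. a throttled "error getting web acl" call)
		// re-adds the key and doubles its backoff: 1s, 2s, 4s, ...
		delay := limiter.When(key)
		fmt.Printf("failure %2d -> next reconcile in %v\n", i, delay)
	}
}
```

With a 1 second base delay, the tenth consecutive failure already backs the key off by roughly 8.5 minutes, which would be consistent with a stall on the order of the 10 minutes observed above.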

During a rolling deploy, new pods are normally added to the ALB and old ones removed one by one. Because of the deadlock, all of the pods instead became unhealthy until the controller updated them all at the same time.

I1015 14:45:16.861165       1 targets.go:80] default/accounts-rest: Adding targets to arn:aws:elasticloadbalancing:us-east-1:062151437226:targetgroup/fffb5690-77f48c3689d480d89a0/39213728bb743224: 10.128.0.90:3000, 10.128.15.161:3000, 10.128.9.102:3000
I1015 14:45:17.148664       1 targets.go:95] default/accounts-rest: Removing targets from arn:aws:elasticloadbalancing:us-east-1:062151437226:targetgroup/fffb5690-77f48c3689d480d89a0/39213728bb743224: 10.128.10.211:3000, 10.128.1.169:3000, 10.128.0.191:3000

Normally you would see it add a single target, then remove a single target, repeating until all three are done.
