Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry manager to reduce RateLimitExceeded errors #2010

Merged

Conversation

AndrewSirenko
Copy link
Contributor

@AndrewSirenko AndrewSirenko commented Apr 15, 2024

Is this a bug fix or adding new feature?
Feature

What is this PR about? / Why do we need it?
If the driver calls an EC2 API repeatedly enough within a few seconds to exceed its request token bucket’s refill rate, EC2 will start returning the RequestLimitExceeded client errors to each CSI RPC.

Today, the plugin as a whole is not aware if it is being rate-limited by an EC2 API. This leads to each RPC to continuously retry EC2 API calls, even if they are likely to fail due to concurrent RPCs also failing. This leads to surges of RequestLimitExceeded errors.

By introducing client-side throttling for each EC2 Mutating API action via the AWS SDK's Adaptive Retry Mode, we can prevent barrages of RateLimitExceeded errors by coordinating CSI RPCs to make their EC2 calls at a slower rate until the request-token buckets refill.

What testing is done?

Amount of RateLimitExceeded client errors during scalability tests with default account limits.

# Pods Deployed Pre-PR Adaptive Retry
1000 9.16k (in 12 min) 410 (in 12 min)
2500 1.2k
5000 67.5k (in 62 min) 2.4k (in 55 min)

Note: Even with Adaptive Retry we still will see RateLimitExceeded errors because default AWS account only restores 5 mutating API tokens per second. With Adaptive Retry we see ~0.6 RateLimitExceeded errors per second, which seems optimal (too low of a number means extra latency in volume creation time due to client-side throttling).

In design document we deemed 1-2 errors per second acceptable so this is a success.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 15, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 15, 2024
@AndrewSirenko
Copy link
Contributor Author

AndrewSirenko commented Apr 15, 2024

/hold

Until:

  • 5k, 10k scalability tests completed
  • PR Description updated
  • 1k pod scalability test chart posted

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2024
Copy link

github-actions bot commented Apr 15, 2024

Code Coverage Diff

File Old Coverage New Coverage Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/cloud.go 85.4% 84.1% -1.3
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/retry_manager.go Does not exist 100.0%

pkg/cloud/retry_manager.go Outdated Show resolved Hide resolved
pkg/cloud/cloud_test.go Outdated Show resolved Hide resolved
pkg/cloud/retry_manager.go Outdated Show resolved Hide resolved
pkg/cloud/retry_manager.go Outdated Show resolved Hide resolved
Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code largely looks good, currently manually testing changes. Will report back with lgtm if no issues surface from manual tests

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not run into any issues manually testing. Will apply lgtm label after Connor's feedback is addressed. I've got one non-blocking idea and question.


Idea: worth considering using a helper function to apply the retryer instead of passing the retryer function inline for each EC2 API call, ie:

func withRetryer(fn func(ctx context.Context, input interface{}, optFns ...func(*ec2.Options)) (output interface{}, err error), retryer *retry.AdaptiveMode) func(ctx context.Context, input interface{}, optFns ...func(*ec2.Options)) (output interface{}, err error) {
    return func(ctx context.Context, input interface{}, optFns ...func(*ec2.Options)) (output interface{}, err error) {
        optFns = append(optFns, func(o *ec2.Options) {
            o.Retryer = retryer
        })
        return fn(ctx, input, optFns...)
    }
}

and then we can use it like this:

response, err := withRetryer(c.ec2.CreateVolume, c.rm.createVolumeRetryer)(ctx, requestInput)

This way, retryer logic is centralized.


Question:

Currently, each EC2 API call has its own retryer field in the retryManager struct - is there a requirement for doing so, or is that level of granularity necessary? (given a subset of API calls may have similar retry requirements)

@AndrewSirenko
Copy link
Contributor Author

Currently, each EC2 API call has its own retryer field in the retryManager struct - is there a requirement for doing so, or is that level of granularity necessary? (given a subset of API calls may have similar retry requirements)

We need each Mutating EC2 API call to have their own retryer field because the sdk throttles on a retryer object level, not by API name. While default AWS accounts share request tokens between each EC2 Mutating API, if a customer raises limits for a particular Mutating API (ie CreateVolume) then the token bucket is no longer shared.

E.g. If we have them share the same retryer, then if a CreateVolume hits RateLimitExceeded, AttachVolume calls will also be throttled by client-side adaptive retryer, even if the two APIs no longer share the same AWS request token bucket (due to raised limits).

Hope that is clearer.

@AndrewSirenko
Copy link
Contributor Author

AndrewSirenko commented Apr 18, 2024

Idea: worth considering using a helper function to apply the retryer instead of passing the retryer function inline for each EC2 API call, ie:

Thank you for the suggestion. I personally think that a call-site option is clearer because it matches the example approach of overriding AWS SDK service operations and avoids a function pointer. Also we may want to add more call-site options in the future.

@torredil
Copy link
Member

@AndrewSirenko Perfect, that is as clear as can be. Can we add a comment in the retryManager to help future maintainers understand the reasoning behind having separate retryer fields for each mutating API?

@AndrewSirenko AndrewSirenko force-pushed the cst-adaptive-retry-manager branch 2 times, most recently from 4f2bef7 to 727c4c4 Compare April 18, 2024 16:56
pkg/cloud/retry_manager.go Outdated Show resolved Hide resolved
pkg/cloud/retry_manager.go Outdated Show resolved Hide resolved
@torredil
Copy link
Member

/retest

@ConnorJC3
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@AndrewSirenko
Copy link
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2024
@k8s-ci-robot k8s-ci-robot merged commit 275e68c into kubernetes-sigs:master Apr 19, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants