fix: use adaptive retry for AWS API calls to handle rate limiting#757
dangrondahl merged 3 commits into main
Switch from the default AWS SDK retry strategy (3 attempts, basic exponential backoff) to adaptive retry mode with up to 10 attempts. Adaptive mode maintains a client-side token bucket that tracks throttling responses over time and proactively slows the request rate. Standard mode just does backoff-and-retry; adaptive mode avoids hammering a rate-limited API by draining tokens on 429s.
Claude finished @dangrondahl's task in 1m 52s
Review: PR #757, Adaptive retry for AWS API calls
Static Analysis
All three checks pass cleanly.
Code Review
The implementation is correct and well-scoped. Key observations:
This reverts the parallelism reduction from #747, as requested by @mbevc1. The three targets (
Minor Note
No issues found. One optional thought for the future (not blocking):
Summary
Clean, well-targeted change. The adaptive retry mode is the right tool for the concurrent-goroutine throttling problem, the implementation uses the SDK idiomatically, and the comment explains the reasoning clearly. The #747 revert is included as requested. No issues found. LGTM.
Thanks, we should also revert #747 as part of this change.
Reverts #747 which reduced parallelism from 8 to 6 as a workaround for AWS rate limiting. The adaptive retry config now handles this properly, so we can restore the original parallelism.
Summary
Switch the AWS SDK retry strategy from the default (3 attempts, basic exponential backoff) to adaptive retry mode with up to 10 attempts.
Why this matters: Commands like kosli snapshot lambda fetch all Lambda functions concurrently, one goroutine per function (see GetLambdaPackageData). These goroutines all share the same AWS client. When many requests fire at once, AWS responds with "slow down" (HTTP 429). With the default retry config (3 attempts, exponential backoff), each goroutine retries independently and gives up quickly, and often all of them fail. This has been one of the top contributors to flaky test / Test jobs in the Main workflow (example failure).
What changed: We switched to the AWS SDK's "adaptive" retry mode with up to 10 attempts. As before, retries use exponential backoff. The key difference is that adaptive mode maintains an in-memory token bucket, shared across all goroutines using the same client. When AWS starts throttling, the token bucket drains and proactively slows down the entire batch of requests, rather than each goroutine blindly retrying on its own. This makes the CLI much more resilient to transient AWS rate limits, both in CI and for customers with many Lambda functions or ECS services.
Scope: The change is in NewAWSConfigFromEnvOrFlags(), which is the shared config constructor for all AWS clients (Lambda, ECS, S3). All AWS API calls made by the CLI benefit from this.
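To illustrate the shared-bucket idea described under "What changed", here is a minimal, stdlib-only sketch. It is not the AWS SDK's implementation and not the code in this PR: the names throttleBucket, OnThrottle, OnSuccess, Allow and simulateThrottledBurst are hypothetical, and the real adaptive mode also adjusts a send rate over time, which this sketch leaves out. It only shows how one mutex-guarded counter, shared by many goroutines, lets throttling seen by some workers hold back all of them.

```go
package main

import "sync"

// throttleBucket is a hypothetical, simplified stand-in for a client-side
// token bucket shared by every goroutine that uses the same client.
type throttleBucket struct {
	mu       sync.Mutex
	tokens   int
	capacity int
}

// newThrottleBucket returns a bucket that starts full.
func newThrottleBucket(capacity int) *throttleBucket {
	return &throttleBucket{tokens: capacity, capacity: capacity}
}

// Allow reports whether a caller may send a request right now.
func (b *throttleBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tokens > 0
}

// OnThrottle drains cost tokens after a throttling response (for example
// HTTP 429), never going below zero.
func (b *throttleBucket) OnThrottle(cost int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens -= cost
	if b.tokens < 0 {
		b.tokens = 0
	}
}

// OnSuccess refunds one token after a successful call, up to capacity.
func (b *throttleBucket) OnSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < b.capacity {
		b.tokens++
	}
}

// Tokens returns the current token count.
func (b *throttleBucket) Tokens() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tokens
}

// simulateThrottledBurst starts one goroutine per worker; each one reports a
// throttling response to the same shared bucket, then the function waits for
// all of them to finish.
func simulateThrottledBurst(b *throttleBucket, workers, cost int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.OnThrottle(cost)
		}()
	}
	wg.Wait()
}
```

Calling simulateThrottledBurst(newThrottleBucket(10), 8, 5) from a main function leaves the bucket empty, so Allow returns false for every goroutine until successful calls refund tokens. With per-goroutine retry state, each worker would instead keep retrying without knowing the others were being throttled too.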