fix: use adaptive retry for AWS API calls to handle rate limiting #757

Merged
dangrondahl merged 3 commits into main from fix_ci on Apr 1, 2026

Conversation

@dangrondahl
Contributor

@dangrondahl dangrondahl commented Apr 1, 2026

Summary

Switch the AWS SDK retry strategy from the default (3 attempts, basic exponential backoff) to adaptive retry mode with up to 10 attempts.

Why this matters: Commands like kosli snapshot lambda fetch all Lambda functions concurrently, one goroutine per function (see GetLambdaPackageData), and all of these goroutines share the same AWS client. When many requests fire at once, AWS responds with "slow down" (HTTP 429). With the default retry config (3 attempts, exponential backoff), each goroutine independently retries and gives up quickly, so often all of them fail. This has been one of the top contributors to flaky Test jobs in the Main workflow (example failure).

What changed: We switched to the AWS SDK's "adaptive" retry mode with up to 10 attempts. Like before, retries use exponential backoff. The key difference is that adaptive mode maintains an in-memory token bucket, shared across all goroutines using the same client. When AWS starts throttling, the token bucket drains and proactively slows down the entire batch of requests — rather than each goroutine blindly retrying on its own. This makes the CLI much more resilient to transient AWS rate limits, both in CI and for customers with many Lambda functions or ECS services.

Scope: The change is in NewAWSConfigFromEnvOrFlags(), which is the shared config constructor for all AWS clients (Lambda, ECS, S3). All AWS API calls made by the CLI benefit from this.
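A sketch of what the retry configuration could look like, assuming the aws-sdk-go-v2 `retry` and `config` packages; the actual wiring inside NewAWSConfigFromEnvOrFlags() may differ:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
)

// newAWSConfig is a hypothetical stand-in for the shared config
// constructor. Adaptive mode layers a client-side token bucket on top
// of the standard exponential backoff, so concurrent goroutines using
// one client collectively slow down when AWS returns 429s.
func newAWSConfig(ctx context.Context) (aws.Config, error) {
	return config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewAdaptiveMode(func(o *retry.AdaptiveModeOptions) {
				o.StandardOptions = append(o.StandardOptions,
					func(so *retry.StandardOptions) {
						so.MaxAttempts = 10 // SDK default is 3
					})
			})
		}),
	)
}
```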

Switch from the default AWS SDK retry strategy (3 attempts, basic
exponential backoff) to adaptive retry mode with up to 10 attempts.

Adaptive mode maintains a client-side token bucket that tracks
throttling responses over time and proactively slows the request
rate. Standard mode just does backoff-and-retry; adaptive mode
avoids hammering a rate-limited API by draining tokens on 429s.
@claude
Contributor

claude bot commented Apr 1, 2026

Claude finished @dangrondahl's task in 1m 52s


Review: PR #757 — Adaptive retry for AWS API calls

  • Phase 1: Static analysis (build, vet, mod tidy)
  • Phase 2: Code review
  • Phase 3: Post review feedback

Static Analysis

All three checks pass cleanly:

  • go build ./... — compiles without errors
  • go vet ./... — no issues
  • go mod tidy — no diff (the retry package is part of the existing aws-sdk-go-v2/aws module)

Code Review

internal/aws/aws.go — Adaptive retry config

The implementation is correct and well-scoped. Key observations:

  1. Correct API usage. retry.NewAdaptiveMode with StandardOptions to set MaxAttempts is the idiomatic way to configure this in the AWS SDK v2. The config.WithRetryer closure correctly returns a new retryer per client, which is the expected pattern.

  2. Shared token bucket works as intended. Since NewAWSConfigFromEnvOrFlags() is called once per client constructor (NewLambdaClient, NewECSClient, NewS3Client), all goroutines sharing that client (e.g., the concurrent GetLambdaPackageData goroutines at line 211) will share the same adaptive token bucket. This is exactly the behavior described in the PR summary.

  3. MaxAttempts = 10 is reasonable. The default is 3; 10 gives adaptive mode enough headroom to work through bursts of 429s without being excessive. The exponential backoff + token bucket prevents this from turning into an aggressive retry storm.

  4. Good comment. The doc comment on NewAWSConfigFromEnvOrFlags() clearly explains why adaptive mode is used and links to the SDK docs. This will save future readers from wondering about the non-default config.

  5. No user-facing config knob. The MaxAttempts is hardcoded to 10. This is fine for now — adding a flag would be premature unless customers ask for it. If it ever needs to be configurable, GlobalOpts.MaxAPIRetries already exists for the Kosli API retries, so there's a pattern to follow.

Makefile — Revert -p=6 back to -p=8

This reverts the parallelism reduction from #747, as requested by @mbevc1. The three targets (test_integration, test_integration_full, test_integration_restart_server) are all updated consistently. With adaptive retry in place, the root cause of the flaky CI (AWS rate limiting) should be addressed, making the parallelism reduction unnecessary.
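For reference, the reverted flag would look roughly like this in each of the three targets. This is a hypothetical sketch: only the `-p` value and the target names come from the review, the rest of the recipe is assumed:

```makefile
test_integration:
	go test -p=8 -tags integration ./...
```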

Minor Note

No issues found. One optional thought for the future (not blocking):

  • If the CLI ever creates multiple AWS clients for different services in the same command, each client gets its own token bucket (since NewAWSConfigFromEnvOrFlags is called per client). For Lambda and ECS snapshot commands this is fine since they use a single client. Just something to be aware of if future commands compose multiple AWS services concurrently.

Summary

Clean, well-targeted change. The adaptive retry mode is the right tool for the concurrent-goroutine throttling problem, the implementation uses the SDK idiomatically, and the comment explains the reasoning clearly. The #747 revert is included as requested. No issues found — LGTM.

@mbevc1
Contributor

mbevc1 commented Apr 1, 2026

Thanks, we should also revert #747 as part of this change

@dangrondahl
Contributor (Author)

> Thanks, we should also revert #747 as part of this change

@mbevc1 should I do it in this PR, to stress test it?

@mbevc1
Contributor

mbevc1 commented Apr 1, 2026

> > Thanks, we should also revert #747 as part of this change
>
> @mbevc1 should I do it in this PR, to stress test it?

Yeah, I think that's a good approach here

Reverts #747 which reduced parallelism from 8 to 6 as a workaround
for AWS rate limiting. The adaptive retry config now handles this
properly, so we can restore the original parallelism.
@dangrondahl dangrondahl merged commit 58b0ea9 into main Apr 1, 2026
24 of 25 checks passed
@dangrondahl dangrondahl deleted the fix_ci branch April 1, 2026 13:33