
reduce cache retry load #23025

Merged — 2 commits merged into master on Mar 21, 2023
Conversation

@fspmarshall (Contributor) commented Mar 14, 2023

Introduces a new exponential backoff type and applies it to non-control-plane caches. The intent of this change is to help us better reduce thundering herd effects in large clusters (10k+ agents).

The new exponential backoff starts with delays comparable to the old retry, but escalates much more quickly on recurring errors. We also increase the maximum retry delay for non-control-plane caches from 90s to 256s, though most users will effectively see an increase from 60s to 256s, since the switch from 60s to 90s was very recent.

The old standard retry behavior produced an approximate delay sequence of 0s, 12s, 24s, 36s, 48s, 60s for all instances. With this change, control-plane elements follow an approximate sequence of 0s, 5s, 10s, 20s, 40s, 80s, 90s, and peripheral agents follow an approximate sequence of 0s, 16s, 32s, 64s, 128s, 256s.


In addition to the primary behavioral changes described above, this PR makes two smaller changes. First, caches now wait up to their full max backoff for the init event to arrive, rather than a fixed one minute (slow init-event propagation has been observed as a source of excessive cache re-init errors in high-load clusters). Second, the maximum backoff used by the cache is now configurable like so:

teleport:
  cache:
    enabled: yes
    max_backoff: 12m

While the default value of 256s is probably sufficiently large for most clusters, very large clusters might benefit from bumping this even higher.

@fspmarshall force-pushed the fspmarshall/reduce-cache-retry-load branch from 70b9d2c to c1266b8 on March 14, 2023 04:55
@fspmarshall marked this pull request as ready for review March 14, 2023 05:02
Review threads (all resolved):
api/utils/retryutils/retry_test.go
api/utils/retryutils/retryv2.go
lib/defaults/defaults.go
@zmb3 (Collaborator) commented Mar 15, 2023

Wouldn't hurt to rerun the GHA tests a few extra times on this, as these types of changes have caused flakiness in the past.

@fspmarshall force-pushed the fspmarshall/reduce-cache-retry-load branch from c1266b8 to 09c8df4 on March 20, 2023 17:56
@fspmarshall force-pushed the fspmarshall/reduce-cache-retry-load branch from 09c8df4 to 6162d76 on March 20, 2023 18:25
Review thread on api/utils/retryutils/retryv2.go (resolved)
@fspmarshall force-pushed the fspmarshall/reduce-cache-retry-load branch from 6162d76 to 8fe0141 on March 20, 2023 19:55
@fspmarshall added this pull request to the merge queue Mar 21, 2023
@fspmarshall merged commit 924dcdd into master Mar 21, 2023
19 checks passed
@fspmarshall deleted the fspmarshall/reduce-cache-retry-load branch March 21, 2023 05:18
justinas pushed a commit that referenced this pull request Apr 18, 2023
* add exponential backoff

* improve cache backoff
justinas added a commit that referenced this pull request Apr 18, 2023
* add exponential backoff

* improve cache backoff

Co-authored-by: Forrest <30576607+fspmarshall@users.noreply.github.com>
4 participants