Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proxycfg: rate-limit delivery of config snapshots #14960

Merged
merged 3 commits into from
Oct 14, 2022

Conversation

boxofrad
Copy link
Contributor

Description

Adds a user-configurable rate limiter to proxycfg snapshot delivery, with a default limit of 250 updates per second.

This addresses a problem observed in our load testing of Consul Dataplane where updating a "global" resource such as a wildcard intention or the proxy-defaults config entry could starve the Raft or memberlist goroutines of CPU time, causing general cluster instability.

Testing & Reproduction steps

Tested using the Consul Dataplane load test harness. Observed that creating a wildcard intention with 5k connected streams no longer triggers missed memberlist pings (and suspicions from other peers).

@boxofrad boxofrad requested a review from a team as a code owner October 12, 2022 14:37
@github-actions github-actions bot added theme/config Relating to Consul Agent configuration, including reloading type/docs Documentation needs to be created/updated/clarified labels Oct 12, 2022
Copy link
Contributor

@trujillo-adam trujillo-adam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I made a few suggestions related to style guide consistency.

website/content/docs/agent/config/config-files.mdx Outdated Show resolved Hide resolved
// This runs in another goroutine so we can't just do the send
// directly here as access to snap is racy. Instead, signal the main
// loop above.
sendCh <- struct{}{}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ There's a (pre-existing) goroutine leak here, where if ctx is canceled when there's a pending update this goroutine will never halt.

The fix is simple, but for posterity, I'll add it in a follow-up PR once this one is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally try to avoid time.AfterFunc and instead just use regular timers which can be selected on in the main loop.

Copy link
Contributor Author

@boxofrad boxofrad Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. AfterFunc definitely hides the goroutine here. We had a similar leak related to memdb's WatchCh method: #14916.

boxofrad and others added 2 commits October 14, 2022 10:33
Adds a user-configurable rate limiter to proxycfg snapshot delivery,
with a default limit of 250 updates per second.

This addresses a problem observed in our load testing of Consul
Dataplane where updating a "global" resource such as a wildcard
intention or the proxy-defaults config entry could starve the Raft or
Memberlist goroutines of CPU time, causing general cluster instability.
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
@boxofrad boxofrad requested review from a team and mkeeler and removed request for a team October 14, 2022 09:35
scheduleUpdate := func() {
r := s.rateLimiter.Reserve()

coalesceTimer = time.AfterFunc(coalesceTimeout+r.Delay(), func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this instead wait for max(coalesceTimeout, r.Delay())?

coalesce timeout will ensure this states snapshot doesn't get emitted too frequently
r.Delay ensures that we don't emit snapshots too frequently globally.

This might be sort of a premature optimization but figured I would note it anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I think either is fine honestly, but I guess using the max could save 200ms of unnecessary waiting - will change it 🙇🏻‍♂️

Saves 200ms of unnecessary waiting in cases where the snapshot delivery
is rate-limited.
@boxofrad boxofrad merged commit 328e3ff into main Oct 14, 2022
@boxofrad boxofrad deleted the boxofrad/proxycfg-rate-limit branch October 14, 2022 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/config Relating to Consul Agent configuration, including reloading type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants