
streaming: split event buffer by key #12080

Merged
17 commits merged into main from boxofrad/streaming-contention-split-buffer on Jan 28, 2022

Conversation


@boxofrad (Contributor) commented on Jan 14, 2022

Summary

This is an improvement to Consul's streaming backend to address performance problems observed in cases of high parallelism - i.e. many subscribers, high service churn, and servers with a large number of CPU cores.

It reduces lock-contention in the kernel's futex subsystem (more on that below) by only waking subscribers when there is an update that is relevant to them.

Background

Events are published to subscribers by appending them to a data structure called the eventBuffer. It is a single-writer, multiple-reader linked list in which each element contains a channel that is used to notify subscribers when the subsequent element is available (similar to a monitor/condition variable). Subscribers block on a receive from this channel until it is closed by the publisher — this pattern will be familiar if you've used the context package (particularly the Done method).
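
To make the pattern concrete, here's a minimal Go sketch of a buffer built this way. The names (`Event`, `bufferItem`, `eventBuffer`) are illustrative only; the real implementation lives in the agent/consul/stream package and differs in detail.

```go
// Illustrative sketch only; not Consul's actual eventBuffer.
package stream

// Event is a placeholder for a published state-change event.
type Event struct {
	Key     string // e.g. the service name the event relates to
	Payload string
}

// bufferItem is one element of the single-writer, multiple-reader linked list.
// nextCh is closed by the publisher once the following element has been linked
// in, which wakes every subscriber blocked in Next(); this is the same pattern
// as context.Done().
type bufferItem struct {
	events []Event
	next   *bufferItem
	nextCh chan struct{}
}

// Next blocks the subscriber until the publisher has linked the following
// element, then returns it so the subscriber can read its events.
func (i *bufferItem) Next() *bufferItem {
	<-i.nextCh
	return i.next
}

// eventBuffer is the single-writer side: only the publisher goroutine appends.
type eventBuffer struct {
	tail *bufferItem
}

func newEventBuffer() *eventBuffer {
	return &eventBuffer{tail: &bufferItem{nextCh: make(chan struct{})}}
}

// Append links a new element carrying the events onto the tail, then closes
// the old tail's channel so every blocked subscriber wakes at once.
func (b *eventBuffer) Append(events []Event) {
	item := &bufferItem{events: events, nextCh: make(chan struct{})}
	old := b.tail
	b.tail = item
	old.next = item
	close(old.nextCh)
}
```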

Currently, there is a single eventBuffer for the entire topic. In other words, subscribers will be woken whenever any health event occurs, regardless of whether or not it pertains to a service they're watching, and it's up to them to ignore events that they're not interested in.
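
Continuing the sketch above, a subscriber loop under this single-buffer-per-topic design looks roughly like the following: it is woken for every event on the topic and has to discard anything that isn't for its service. `watchService` and its parameters are hypothetical, not the real subscription API.

```go
// Same illustrative package as the sketch above.
func watchService(start *bufferItem, service string, out chan<- Event) {
	item := start
	for {
		item = item.Next() // blocks until the publisher appends the next element
		for _, e := range item.events {
			if e.Key != service {
				continue // woken, but nothing relevant to do
			}
			out <- e
		}
	}
}
```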

On a busy cluster with many services and open streams, when an event is published, the majority of scheduled goroutines have no real work to do, and will immediately attempt to receive/block on the next element’s channel.

Beyond the general overhead of scheduling these goroutines and spending a few CPU cycles checking for relevant events, there's a much larger cost incurred when they all attempt to become receivers on the new channel at the same time.

Internally, channels maintain a lock that is held at various points, including when adding a blocked receiver; therefore, the many subscriber goroutines will compete to acquire the lock, resulting in contention.

There is some additional nuance here too, though. It turns out that reducing the contention of this particular lock - e.g. by sharding the same large number of subscribers across many channels so they're not all competing for the same lock - doesn't actually solve the problem. 🤔

On Linux, the Go runtime's internal locks are backed by futexes. The kernel stores futexes in a hashmap in which each bucket is guarded by a spinlock. The contention we're observing doesn't go away when we shard the channel because it isn't contention on the individual channel's lock, but rather the kernel's own internal (semi-global) lock! This explains why we have observed such a large amount of CPU time spent in native_queued_spin_lock_slowpath, and why the graph below shows such a high proportion of CPU time spent in kernel-space.
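
The wake-up pattern itself is easy to reproduce outside of Consul. The standalone sketch below is not code from this PR, and the subscriber and round counts are arbitrary; it only exists to show the "wake everyone, everyone re-parks immediately" behaviour described above. Profiling something like this on a many-core Linux machine shows the same kernel-side futex cost.

```go
// Standalone repro sketch; not code from this PR.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const subscribers = 10000
	const rounds = 100

	var mu sync.Mutex
	current := make(chan struct{}) // the channel every subscriber is parked on

	latest := func() chan struct{} {
		mu.Lock()
		defer mu.Unlock()
		return current
	}

	var wakeups int64
	var wg sync.WaitGroup
	done := make(chan struct{})

	for i := 0; i < subscribers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-latest():
					// Woken, but with no real work to do: immediately go back
					// to being a receiver on the next channel.
					atomic.AddInt64(&wakeups, 1)
				case <-done:
					return
				}
			}
		}()
	}

	start := time.Now()
	for r := 0; r < rounds; r++ {
		mu.Lock()
		old := current
		current = make(chan struct{}) // the "next element's" channel
		mu.Unlock()
		close(old) // wake every subscriber at once
	}
	close(done)
	wg.Wait()
	fmt.Printf("%d rounds caused %d wake-ups in %s\n",
		rounds, atomic.LoadInt64(&wakeups), time.Since(start))
}
```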

The downstream impact of this contention is that event publishing gets slower, and with it so does applying changes to the Raft state machine, which has a massive effect on the general performance of a server. As an example, sustained FSM Apply latency of just 20ms, when Raft's internal queues/buffers are full, can result in several minutes of RPC latency.

By splitting the topicBuffer by key (e.g. service name) we can limit the number of goroutines that are woken by an event to just those with "real" work to do — which likely also adds an element of jitter as the goroutine will need to wait on sending gRPC stream frames etc.

Naturally, this doesn't improve the pathological case where all subscribers are watching a single service. But in testing, I've found that this case doesn't exhibit the same issue, because the subscribers always have "real" work to do, and so won't immediately contend to re-acquire the channel lock.
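
For illustration, a simplified version of the key-split approach looks like the sketch below, reusing the Event/eventBuffer types from the earlier sketch (plus the standard sync package). The `bufferKey` and `EventPublisher` shapes here are illustrative; the actual changes are in agent/consul/stream/event_publisher.go and related files and differ in detail.

```go
// Illustrative continuation of the earlier sketch (same package, plus "sync").

// Topic identifies a stream topic, e.g. service health.
type Topic string

// bufferKey identifies one buffer: the topic plus the event key (for health
// events, the service name).
type bufferKey struct {
	Topic Topic
	Key   string
}

// EventPublisher keeps one eventBuffer per (topic, key) rather than one per topic.
type EventPublisher struct {
	mu      sync.Mutex
	buffers map[bufferKey]*eventBuffer
}

// bufferFor returns the buffer for a (topic, key) pair, creating it on first use.
func (p *EventPublisher) bufferFor(k bufferKey) *eventBuffer {
	p.mu.Lock()
	defer p.mu.Unlock()
	if buf, ok := p.buffers[k]; ok {
		return buf
	}
	buf := newEventBuffer()
	p.buffers[k] = buf
	return buf
}

// publish groups events by key and appends each group only to its own buffer,
// so an update to service "web" never wakes the subscribers watching "db".
func (p *EventPublisher) publish(topic Topic, events []Event) {
	byKey := make(map[bufferKey][]Event)
	for _, e := range events {
		k := bufferKey{Topic: topic, Key: e.Key}
		byKey[k] = append(byKey[k], e)
	}
	for k, evs := range byKey {
		p.bufferFor(k).Append(evs)
	}
}
```

In this sketch, a subscriber for service "web" would start its watch loop at the tail of `p.bufferFor(bufferKey{Topic: someHealthTopic, Key: "web"})` (topic name hypothetical) and is only woken when that specific buffer is appended to.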

Huge kudos to @banks, @mkeeler, @dnephin et al. for all of the incredible research work on this issue 👏🏻 ⭐

Benchmark

To test the efficacy of this change, I reused the test harness from when we last compared the performance of the streaming backend to that of blocking queries.

I deployed a single server node on an m6i.32xlarge EC2 instance (128 vCPUs). I then ran the benchmark tool on 50 worker nodes, each creating 14 gRPC connections to the server and consuming 100 streams per connection (for a total of 70k concurrent subscriptions). The tool then generated write load by updating random tags on the watched services at a rate of 50Hz.

Results

The results below show a significant reduction in "system" CPU time, lower FSM Apply time, and, crucially, less time spent waiting in Raft's internal queues/buffers (which is what leads to pathologically bad RPC latency).

Before and after graphs

This is a prerequisite for splitting the internal topic buffer by key.

While we technically could support subscribing to all of a topic's events by passing an empty key, it would require publishing events on multiple buffers (a key-specific and a global buffer), and as there isn't a compelling real-world use case for this, it doesn't seem worth the extra complexity.
@boxofrad requested review from banks, dnephin, and a team on January 14, 2022 at 11:41
@banks (Member) left a comment


Amazing work @boxofrad 👏

I left a few inline comments, mostly notes to myself or random thoughts that don't need changes. There are one or two small things I think it would be nice to change though, including fixing the racy map accesses in tests that presumably fail the race detector?

The only other one that seems important to consider is whether we can simplify the code (and the reasoning about its correctness, although I think it is correct) and have the bookkeeping function happen atomically with the unsubscribe state update.

Inline review comments on: agent/consul/stream/event.go, agent/consul/stream/event_publisher.go, agent/consul/stream/event_publisher_test.go, agent/consul/stream/topic_key.go, proto/pbsubscribe/subscribe.proto
@dnephin (Contributor) left a comment


Great work, and excellent detail in the description!

I've only done a quick review so far. Mostly responding to the questions and commenting on stylistic things like TopicKey. I'll hopefully have a chance to read over more of the functional changes in the last two commits next week.

Inline review comments on: agent/consul/stream/event_publisher.go, agent/consul/stream/event.go
@boxofrad (Author) commented:

Thanks for the reviews folks! 🙇🏻‍♂️

@banks, @dnephin: I think I've addressed (or at least responded to) your comments, but do let me know if I've missed anything.

@banks (Member) left a comment


I think all my comments were addressed here, so LGTM! I think Daniel said he'd not reviewed in depth yet, so I'd wait for an approval from him too!

@dnephin (Contributor) left a comment


Looks great! A couple minor questions and suggestions, but nothing blocking a merge.

Inline review comments on: agent/rpc/subscribe/subscribe_test.go, agent/consul/stream/subscription.go, agent/consul/stream/subscription_test.go, agent/consul/stream/event_publisher_test.go
@boxofrad force-pushed the boxofrad/streaming-contention-split-buffer branch from 86efe0a to 4a78287 on January 28, 2022 at 12:02
@boxofrad merged commit fdfe079 into main on Jan 28, 2022
@boxofrad deleted the boxofrad/streaming-contention-split-buffer branch on January 28, 2022 at 12:27
@hc-github-team-consul-core (Collaborator) commented:

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/569544.

@hc-github-team-consul-core (Collaborator) commented:

🍒✅ Cherry pick of commit fdfe079 onto release/1.11.x succeeded!
