
NetworkTransport make pipelining configurable and default to max 2 in flight. #541

Merged
merged 6 commits into main Apr 21, 2023

Conversation


@banks banks commented Feb 16, 2023

This PR changes the default behaviour of the "pipelining" optimization in our NetworkTransport, which is used by all of HashiCorp's products.

This feature has been found to cause major performance problems in practice, even though it seems like it should be an optimization.

TL;DR Summary

  • Pipeline max in-flight matters a lot when raft throughput is high enough to keep the disk busy.
  • For LAN (low latency networks) no or extremely limited pipelining performs much better than a longer pipeline regardless of disk speed.
    • This is roughly because, when replicating, the leader batches entries from the follower's last acknowledged index up to its most recent one: the fewer pipelined requests you allow, the more opportunity you leave for the leader to batch. There is an optimal balance between in-flight requests and batching, and the current max of 128 pipelined requests is never it!
  • Since HashiCorp products all specify low-latency networks and require synchronous disk writes, the new default for the library will be MaxInFlight = 2.
  • Old behaviour can be recovered by setting MaxInFlight = 130 in the NetworkTransport config (see the sketch just after this list).
  • Larger values probably only help if you have long network RTTs and will require careful tuning.
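To make the last two bullets concrete, here is a rough sketch of how you might wire that up when constructing the transport. The MaxRPCsInFlight field name is an assumption based on this PR's direction (the description above calls the setting MaxInFlight), so check the released NetworkTransportConfig for the exact identifier:

package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// newTransport builds a NetworkTransport with the pipeline depth restored to
// roughly the old behaviour instead of the new default of 2.
func newTransport(stream raft.StreamLayer) *raft.NetworkTransport {
	return raft.NewNetworkTransportWithConfig(&raft.NetworkTransportConfig{
		Stream:          stream,
		MaxPool:         3,
		Timeout:         10 * time.Second,
		MaxRPCsInFlight: 130, // assumed field name; restores the old 128-deep pipeline (+2 implicit)
	})
}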

This issue was so unintuitive to us that we thought there must be some mistake. We have gone to significant lengths to explain to ourselves what is happening, as well as to test different settings in actual Consul clusters in a lab setting. Read on for more details.

Replication Basics

The problem with pipelining is that it interacts unintuitively with our batching optimization. Specifically, the batching that occurs during follower replication in Raft.pipelineReplicate:

raft/replication.go

Lines 480 to 482 in 60ddd51

case <-s.triggerCh:
	lastLogIdx, _ := r.getLastLog()
	shouldStop = r.pipelineSend(s, pipeline, &nextIndex, lastLogIdx)

(There are other sources of batching in our raft library, such as the "group commit" a leader does before writing to its own log, which is not the same thing although the two can end up correlating.)

Each call to r.pipelineSend() will send all messages since the last one the follower has been sent, up to the most recent one on the leader's disk, in a single batch. It will also block if the pipeline has hit its maxInFlightRPCs limit, until the next response comes back from the follower and so is no longer "in-flight". When it does block, batch sizes increase, which catches the follower up faster on the next send.

Until now NetworkTransport has always used a constant of 128 as the size of the buffered channel used to control the number of in-flight RPCs.

raft/net_transport.go

Lines 28 to 30 in 60ddd51

// rpcMaxPipeline controls the maximum number of outstanding
// AppendEntries RPC calls.
rpcMaxPipeline = 128
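The mechanism here is essentially a counting semaphore built from a buffered channel. A minimal illustrative sketch of the pattern (not the library's actual code; the names are made up):

package example

// inflightLimiter caps the number of outstanding RPCs: acquire blocks once the
// buffered channel is full, and release frees a slot when a response arrives.
type inflightLimiter struct {
	tokens chan struct{}
}

func newInflightLimiter(max int) *inflightLimiter {
	return &inflightLimiter{tokens: make(chan struct{}, max)}
}

// acquire is called by the sender around each AppendEntries it puts on the wire.
func (l *inflightLimiter) acquire() { l.tokens <- struct{}{} }

// release is called by the goroutine that reads responses back off the wire.
func (l *inflightLimiter) release() { <-l.tokens }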

Meanwhile, the follower handles each AppendEntries RPC in strict serial order. There is a single goroutine running handleConn per inbound TCP connection from the leader, and it calls handleCommand serially inline. handleCommand reads the request from the wire and decides what to do; in the case of an AppendEntries RPC with actual data, it calls processLogs synchronously (which waits for the write to disk to complete) before writing a response to the wire and returning to accept the next request.
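Schematically, the follower side described above behaves like the loop below. This is a simplification with hypothetical stand-in functions (decode, apply, encode), not the transport's real signatures:

package example

import "net"

// handleConn sketches the serial handling: one goroutine per inbound leader
// connection, one command fully processed (including the fsync inside apply)
// before the next is even read.
func handleConn(conn net.Conn,
	decode func(net.Conn) ([]byte, error),
	apply func([]byte) []byte,
	encode func(net.Conn, []byte) error) {
	for {
		req, err := decode(conn) // read the next AppendEntries off the wire
		if err != nil {
			return
		}
		resp := apply(req) // synchronous disk write happens in here
		if err := encode(conn, resp); err != nil {
			return
		}
		// Only now can the next request be read and handled.
	}
}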

In a happy case where writes are far enough apart that they can be written to disk individually, all is well:

[Diagram: low write rate; each log is written to disk and acknowledged before the next one arrives]

This diagram models a case where the follower takes 2 time units (we'll say milliseconds here as it's close to realistic but the numbers are arbitrary) to write each log to disk, and the round trip time between the leader and follower is 0.5ms. This is approximately what we saw when testing for real in AWS on an io2 Volume across AZs in the same region.

The Problem

The problem occurs when the write rate increases. As soon as there are enough requests per second that the next one comes in while the last is still being written on the follower, the pipeline starts to fill up. The scenario below has a max in-flight of 5, not 128 as in the actual code, to make it more readable! These diagrams are generated using LaTeX and a simple Go program.

[Diagram: higher write rate with max in-flight of 5; the pipeline fills up and per-request latency grows]

The same thing happens just a little more slowly if the rate is only just quicker than the disk can commit, but we show it as about double here to make the pattern easier to see. Pretty soon the pipeline channel is full and it will stay full for as long as the write rate is maintained.

That means even though batching is kicking in to ensure the follower does keep up with replication, the latency of each request is now on average about 5x longer.

That's with a max in-flight of only 5. With 128, as this library currently has, you very quickly get into a case where the average latency is about 130x the mean time to commit to disk, when it should be only just over 1x!

We observed this in real-world performance testing in Consul: mean commit times were around 5ms right up to the point where the inbound rate crossed the threshold, and then suddenly every request was taking about 300ms. All this while the actual time it took for a follower to write to disk stayed constant: around 2ms. So why were commit times hundreds of times larger? Because of pipelining.
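A rough steady-state model (our own back-of-envelope, not a measured formula) matches that magnitude: once the pipeline is full, a newly accepted request sits behind roughly one full pipeline's worth of batches, each taking about one disk commit to clear, so

latency ≈ max in-flight × per-batch commit time ≈ 130 × 2ms ≈ 260ms

which is right in line with the ~300ms we saw.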

The Solution

Perhaps surprisingly, the solution is (almost) just not to use pipelining. This is what the exact same scenario as above looks like without pipelining at all:

[Diagram: the same high write rate with pipelining disabled; batching keeps the follower caught up and worst-case latency stays low]

It's a bit slower than before on average because of batching; the diagrams above show the worst-case latency, not the mean. Without batching, though, the follower would be unable to keep up at all at this throughput and the cluster would already be inoperable and failing most writes.

So should we just rip out pipelining altogether then? Surely there must be some cases where it is actually a win.

Where Pipelining Helps

Pipelining helps when the network round-trip time is significant compared to the processing time of each request. For example, if you have a much bigger RTT - say 5ms - and super fast disks that can commit in 1ms, the diagrams look like this:

[Diagrams: 5ms RTT with 1ms disk commits, with and without pipelining]

Now clearly pipelining is much better as we can keep that follower busy and don't have to wait for nearly 2 RTTs in the worst case shown.

Here's the thing though: all HashiCorp products that use raft only support low-latency local-area networks. The follower's work involves a synchronous disk write. Even super fast NVMe disks are not that much quicker than a LAN round trip on average. In my testing on EC2, committing a small batch of logs on NVMe took about 200-300us for Consul's workload, and the mean network RTT was about 500us. That implies that any pipelining above 2 or 3 in-flight is not going to help, and having a much larger limit will eventually still hurt latency. This is hard to see with more realistic numbers for NVMe, but you can see that eventually, if we add enough throughput, we start to hit the same pattern of a full buffer and added latency:

[Diagram: NVMe-speed disk at high enough throughput that the pipeline still fills and adds latency]
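A rough rule of thumb falls out of these diagrams (a heuristic we're stating here, not something measured directly): the in-flight count needed to keep the follower's disk continuously busy is roughly

useful in-flight ≈ 1 + RTT / disk commit time

With a 0.5ms RTT and 0.25ms NVMe commits that's about 3; with 2ms io2 commits it's barely above 1. Anything beyond that number only adds queueing delay in front of the follower.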

Real World

So far the diagrams are using a simplified model. They intentionally leave out a bunch of complications that matter in the real world. For example, they don't model the time taken to encode and decode messages on the leader, which is overlapped with pipelining but synchronous without it.

But the following is taken from real-world testing. First we test on an io2 disk that takes about 2ms to commit each batch, and the latency problem outlined here shows up clearly.

[Graphs: client-observed write latency percentiles on io2 by max in-flight setting; right-hand graph zoomed in on the Y axis]

The legend indicates the number allowed in-flight, where 1 indicates that pipelining is entirely disabled and every request is synchronous. The right-hand graph is the same but with the Y axis zoomed to show the detail for the lower pipeline sizes.

We see that with 128 in flight (current behaviour), the mean request took over 300ms even though the disk was committing consistently in < 10ms. Once the pipeline size is brought right down, or pipelining is disabled, the same hardware could service the same workload with only around 10ms mean latency and only about 30ms at the 99.99th percentile!

We also wanted to validate that a much smaller pipeline wasn't significantly detrimental when the disk is somewhat faster than the network. These results show the same workload and hardware but using an NVMe disk. The throughput is doubled.

[Graphs: left: client-observed latency percentiles on NVMe at double the throughput; right: windowed per-follower AppendEntries RPC latency percentiles]

In this case the right-hand graph is different and shows windowed latency percentiles on individual followers (taken from Consul's consul.raft.replication.appendEntries.rpc metric, using the max of the 99th percentile values for each 10s window).

First, though, the left graph: in this case we are not pushing the disk hard enough (despite 10k writes per second) to consistently fill up the pipeline and cause orders-of-magnitude worse latency. In fact latencies are pretty amazing across the board here. Even with a 128-deep pipeline, we are still getting a 99.99th percentile around 50ms and a p50 of less than 2ms!

The tail latencies are also noisy enough that, although lower pipeline numbers are generally better, it's not totally consistent. We'll come back to those in a second.

The right-hand graph tells an interesting story (which was observed more clearly on time series graphs). Although the overall commit times are low, the worst follower is still occasionally taking 45ms to commit a single log batch with the 128 in-flight limit. It's not the cause of the P99.99 results on the left, since those are similar for all settings; we'll get to that. Commit in Raft only requires 1 follower to ack in a 3 node cluster like this, so the worst follower hitting spikes of 45ms is not enough to make client-observed latencies spike. The time series graphs showed both followers having spikes like this every so often throughout the test but recovering quickly enough that commit times weren't impacted. Before we move on, the P50 on the right-hand graph is important: it's showing us that pipelining disabled, 1, and 2 in-flight are all roughly the same.

We need some more detail to see what is going on:

[Graphs: left: max consul.raft.commitTime P50 and P99 per 10s window; right: the same for consul.raft.rpc.appendEntries]

The left chart shows the max consul.raft.commitTime P50 and P99 (calculated over each 10s window) observed in the same test run. The right one is the same for consul.raft.rpc.appendEntries, which measures how long the follower actually took to process each AppendEntries RPC.

Here we can see that even in the 128 in-flight case, the P99 commit latency was never more than 4.5ms - that's an order of magnitude smaller than the 45ms the client observed. The reason in this case is that the NVMe disk is so fast that we are hitting other bottlenecks in Consul where we apply the committed change to the in-memory store. Because we can't batch that update right now, and it involves quite a few allocations due to the way MemDB works, it seems to be the thing that's causing those occasional spikes of client-observed write latency at 10k req/s.

Also interesting is that service times on the right never have a P99 above 0.6ms, even when the leader is observing the same RPC taking 45.7ms. This is evidence of the pipeline issue we saw above causing trouble again - even though it's not consistent enough to cause commit times to suffer, it's still present and could still tip over the edge into impacting client requests if the workload increased further or the hardware had a latency spike (e.g. caused by a large snapshot being taken).

The commit latency graph is the most interesting one. It seems to indicate that, for this setup, there is some benefit (mostly in the P99) to keeping pipelining with a short buffer. This is likely because the NVMe disk is fast enough that the encoding and decoding time for the RPC is just about significant enough that overlapping it with the network call allows things to get through a little quicker.

Since io2 was only marginally better with no pipelining and NVMe was slightly better with it, we've chosen a default of MaxInFlight = 2, which means we still pipeline and overlap encoding/decoding on the leader, but without letting a backlog of requests build up in the network.

Conclusions

  • Pipeline max in-flight matters a lot when raft throughput is high enough to keep the disk busy.
  • For LAN (low latency networks) no or extremely limited pipelining performs much better than a longer pipeline regardless of disk speed.
  • Since HashiCorp products all specify low-latency networks and require synchronous disk writes, the new default for the library will be MaxInFlight = 2.
  • Old behaviour can be recovered by setting MaxInFlight = 130 in the NetworkTransport config.
  • Larger values probably only help if you have long network RTTs and will require careful tuning.

@loshz loshz self-requested a review March 7, 2023 18:49
Contributor

@jmurret jmurret left a comment


Great find and an amazing description, with visualizations that make the issue easier to follow. I left one minor nit suggestion, but overall approving... we should get others with more experience with this package to also approve.

net_transport.go Outdated
if maxInFlight < 1 {
	// Default (zero or nonsense negative value given) to 2 (which translates to
	// a zero-length channel buffer).
	maxInFlight = 2
}
Contributor


[NIT] For readability, can we use a constant for 2 that describes it as the "default" and use it here and in AppendEntriesPipeline() and newNetPipeline()?

Member Author

@banks banks Mar 9, 2023


🤔 I was just doing this, and I think we should make this case a Constant like the other defaults.

But I realised that all the 2s in newNetPipeline and AppendEntriesPipeline probably should not be changed. The rationale is: they are 2 because 2 is the minimum value that is possible in our current implementation - they are guarding against panics or against using code that can't work with a value less than 2.

For example, if we ever decided that the best default was actually 1 (disabling pipelining) or maybe a bit more than two, every usage except this one should still be 2, since that's a property of the actual code implementation, not just the default value.

Do you think it's worth making two separate constants something like DefaultMaxInFlight = 2 and minValidMaxInFlight = 2 so that we can show that we are using the 2 to protect the implementation for doing something invalid in those other cases not necessarily because it's the default?

Member Author


I just pushed a commit with the approach above - two separate constants to express the semantics of the subtly-different 2 values!
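For reference, the shape of that change looks roughly like the following, using the names proposed above (the merged code may use slightly different identifiers):

const (
	// DefaultMaxInFlight is used when the config leaves the value unset or
	// supplies something nonsensical; it maps to a zero-length channel buffer.
	DefaultMaxInFlight = 2

	// minValidMaxInFlight is a property of the current implementation rather
	// than a default: the pipelining code path cannot work with a value below
	// 2, so anything smaller disables pipelining entirely.
	minValidMaxInFlight = 2
)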

Contributor

@dhiaayachi dhiaayachi left a comment


Thank you for putting this together @banks ! I have an inline question but nothing blocking.

I'm also wondering about the implication of this for a gRPC transport. The PR that @mkeeler put together to add a gRPC transport (#538) leverages streaming to achieve pipelining of append entries; I'm wondering if that ends up reproducing the same behaviour we are trying to avoid with this PR 🤔?

net_transport.go Outdated
Comment on lines 416 to 424
if n.maxInFlight < 2 {
	// Pipelining is disabled since no more than one request can be outstanding
	// at once. Skip the whole code path and use synchronous requests.
	return nil, ErrPipelineReplicationNotSupported
}

Contributor


What is the implication of making a pipeline of size 1 instead of returning an error in that case? With this, if we decide to thread this config up to Consul, would we need to add extra logic to call the right API?

Member Author

@banks banks Mar 9, 2023


A pipeline of size one doesn't make any sense with our current implementation and will panic.

This is because we only block on the buffer after sending the next request, and because the first request sent is immediately consumed by the receiving goroutine. That means even with a buffer size of zero you still allow 2 requests to be in flight: one because the receiving goroutine immediately consumes and unblocks the buffer while it waits for a response, and a second because we send it before pushing to the inflight channel, where we block waiting for the first response.
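In simplified pseudo-Go, the ordering being described looks roughly like this (names and shapes are illustrative, not the library's exact code):

package example

import "net"

// pipeline is a toy stand-in for the transport's pipeline object.
type pipeline struct {
	conn     net.Conn
	inflight chan []byte // capacity corresponds to MaxInFlight - 2 in this model
}

// send writes the request to the wire *before* blocking on the inflight
// buffer, so one outstanding request never occupies a buffer slot.
func (p *pipeline) send(req []byte) error {
	if _, err := p.conn.Write(req); err != nil {
		return err
	}
	p.inflight <- req // only blocks once the buffer is full
	return nil
}

// decodeResponses runs in its own goroutine: it takes the oldest request off
// the buffer and then waits for its response, so a second request is "free"
// even with a zero-length buffer.
func (p *pipeline) decodeResponses(readResponse func(net.Conn) error) {
	for range p.inflight {
		_ = readResponse(p.conn)
	}
}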

We could change the behaviour of the pipeline to push to the channel first before sending, so that a zero-buffered chan would correspond to MaxInFlight = 1... But I don't see much benefit to that - it would just make it ambiguous whether MaxInFlight = 1 means "no pipeline at all" or "use the pipelining code path for parallel decoding/encoding, but don't allow more than one outstanding request". In theory that still has some potential performance benefit, but I was not able to measure one, so I don't think it's worth adding the extra complexity/risk by changing how pipelining works and making there be even more subtle config options.

Does that make sense?

The flip side to this is that it might in some ways be simpler to make the config hide the complexity less by exposing separate DisablePipelining and PipelineBufferSize options, but then operators have to work out which thing they want and we still have to document all the unintuitive stuff like "a buffer size of 0 actually allows 2 requests to be in flight". I don't have a strong opinion but chose to lean towards making the config map more directly to the behaviour it enables rather than to obscure implementation details.

Contributor


The point I was trying to make is mostly that, from a UX point of view, I expect a pipeline size of 1 or 0 (depending on how we design it) to work as if I called AppendEntries() without any pipelining. I guess we already missed that opportunity when we made the choice to have separate APIs for the pipelining option, way before this PR. The user, in this case raft itself, needs to deal with making the right choice, which is already in place.

I agree that hard failing, by panicking, is a reasonable choice to avoid misconfiguration here.

@banks banks mentioned this pull request Mar 9, 2023
Member

@schmichael schmichael left a comment


It's been a looong time since I've been over this code, so thanks to you and @mkeeler for walking me through it offline.

I suspect we'll be referring back to this excellent PR for years to come, so it might be worth adding something like the following to the tl;dr or Conclusions sections. They currently seem to focus on the parameters and effects while not mentioning the shift in mechanism that adjusting the parameters (ideally) produces.

Since the leader batches entries from the follower's last acknowledged index, the fewer pipelined requests you allow the more opportunity you leave for the leader to batch. There is an optimal balance between inflight requests and batching, and the current max of 128 pipelined requests is not it!

...or something to that effect.

@banks banks requested a review from a team as a code owner March 30, 2023 11:32
Member

@loshz loshz left a comment


Excellent PR, Paul! As a newcomer to this codebase, I very much appreciate the time taken to outline the specifics on this change and the comprehensive real-world examples.

}()

select {
case <-errCh:
Member Author


There is a bug here: we aren't capturing the error from the channel to use in the assertion below...
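A sketch of the fix being described (assuming a testify-style assertion; the timeout and helper shape here are illustrative, not the actual test code):

package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// assertNoAsyncError captures the error from the channel so the assertion
// actually checks it, rather than discarding the received value.
func assertNoAsyncError(t *testing.T, errCh <-chan error) {
	select {
	case err := <-errCh:
		require.NoError(t, err)
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting on error channel")
	}
}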


banks commented Apr 21, 2023

Thanks for all the feedback, folks. Finally got around to the last few bits here: all minor doc comments/constant replacements, plus a minor change to just ignore negative values (i.e. treat them like "no pipelining" rather than like the "default amount of pipelining").

@banks banks merged commit 8fdc4ce into main Apr 21, 2023
6 checks passed
@banks banks deleted the fix-pipeline branch April 21, 2023 12:14
@banks banks mentioned this pull request Apr 21, 2023
otoolep added a commit to rqlite/rqlite that referenced this pull request Apr 25, 2023
This includes upgrading to Hashicorp Raft 1.5, which brings in some
performance improvements. See hashicorp/raft#541
for more details.

otoolep commented Apr 25, 2023

Thanks for your hard work. All my rqlite testing passes with this change in place, so I'm looking forward to seeing if I can see any performance improvements with this release.

wip-sync pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Jun 11, 2023
BREAKING CHANGES:

* extensions: The Lua extension now targets local proxy listeners for the
  configured service's upstreams, rather than remote downstream listeners for
  the configured service, when ListenerType is set to outbound in extension
  configuration. See CVE-2023-2816 changelog entry for more details.

SECURITY:

* Update to UBI base image to 9.2.
* Upgrade golang.org/x/net to address CVE-2022-41723
* Upgrade to use Go 1.20.4. This resolves vulnerabilities
  CVE-2023-24537(go/scanner), CVE-2023-24538(html/template),
  CVE-2023-24534(net/textproto) and CVE-2023-24536(mime/multipart). Also,
  golang.org/x/net has been updated to v0.7.0 to resolve CVEs
  CVE-2022-41721, CVE-2022-27664 and CVE-2022-41723
* extensions: Disable remote downstream proxy patching by Envoy Extensions
  other than AWS Lambda. Previously, an operator with service:write ACL
  permissions for an upstream service could modify Envoy proxy config for
  downstream services without equivalent permissions for those services. This
  issue only impacts the Lua extension. [CVE-2023-2816]

FEATURES:

* hcp: Add new metrics sink to collect, aggregate and export server metrics
  to HCP in OTEL format.

IMPROVEMENTS:

* Fixes a performance issue in Raft where commit latency can increase by
  100x or more when under heavy load. For more details see
  hashicorp/raft#541.
* agent: add a configurable maximum age (default: 7 days) to prevent
  servers re-joining a cluster with stale data
* agent: add new metrics to track cpu, disk and memory usage for server
  hosts (defaults to: enabled)
* connect: update supported envoy versions to 1.22.11, 1.23.8, 1.24.6,
  1.25.4
* envoy: add MaxEjectionPercent and BaseEjectionTime to passive health
  check configs.
* hcp: Add support for linking existing Consul clusters to HCP management
  plane.
* logging: change snapshot log header from agent.server.snapshot to
  agent.server.raft.snapshot
* peering: allow re-establishing terminated peering from new token without
  deleting existing peering first.
* peering: gRPC queries for TrustBundleList, TrustBundleRead, PeeringList,
  and PeeringRead now support blocking semantics, reducing network and CPU
  demand. The HTTP APIs for Peering List and Read have been updated to
  support blocking.
* raft: Remove expensive reflection from raft/mesh hot path
* xds: rename envoy_hcp_metrics_bind_socket_dir to
  envoy_telemetry_collector_bind_socket_dir to remove HCP naming references.

BUG FIXES:

* Fix a bug where decoding some Config structs with unset pointer fields
  could fail with reflect: call of reflect.Value.Type on zero Value.
* acl: (Enterprise only) Check permissions in correct partition/namespace
  when resolving service in non-default partition/namespace
* acl: Fix an issue where the anonymous token was synthesized in
  non-primary datacenters which could cause permission errors when federating
  clusters with ACL replication enabled.
* acls: Fix ACL bug that can result in sidecar proxies having incorrect
  endpoints.
* connect: Fix multiple inefficient behaviors when querying service health.
* gateways: Fix a bug where targeting a virtual service defined by a
  service-resolver was broken for HTTPRoutes.
* grpc: ensure grpc resolver correctly uses lan/wan addresses on servers
* namespaces: adjusts the return type from HTTP list API to return the api
  module representation of a namespace. This fixes an error with the consul
  namespace list command when a namespace has a deferred deletion timestamp.
* peering: Fix issue where modifying the list of exported services did not
  correctly replicate changes for services that exist in a non-default
  namespace.
* peering: Fix issue where peer streams could incorrectly deregister
  services in various scenarios.
* peering: ensure that merged central configs of peered upstreams for
  partitioned downstreams work
* xds: Fix possible panic that can occur when generating clusters before the root
  certificates have been fetched.