
feat(gossipsub): More lenient flood publishing #3666

Closed
wants to merge 9 commits

Conversation

AgeManning
Contributor

Description

In gossipsub, there can be latency issues when performing flood publishing (which is on by default). Although the default mesh size is around 10 peers, users can have many more connections (in the Ethereum use case, exceeding 100). It can be the case that all of these extra peers also subscribe to a given topic.

When publishing, a node then attempts to send a message to all peers that are subscribed to a given topic. When the peer count is large, this can create a large delay in message propagation due to bandwidth limitations and a lack of prioritization of peers. This is particularly true for larger message sizes.

I don't know of an easy way to manage back pressure when burst sending large amounts of data in rust-libp2p. I think forcing sequential sends deprives us of the benefits of parallelism in sending between many peers which would saturate our bandwidth (a good thing).

My current solution (this PR) is to offload some of these decisions to the user (or at the least give them the ability to modify the behaviour). I've added a new option to flood publish and made it the default. The options for flood publishing are enumerated by this new enum:

pub enum FloodPublish {
    /// Flood publishing is disabled.
    Disabled,
    /// Flood publishing is enabled. Messages are sent immediately to all peers in the mesh and
    /// flood messages are queued to be published in the next heartbeat up to a maximum number of
    /// peers. If 0 is set, this publishes to all known peers in the next heartbeat.
    Heartbeat(usize),
    /// Flood publishing is enabled and messages are immediately published to all peers on the topic.
    Rapid,
}

Disabled removes flood publishing altogether (this option existed before). Rapid is what the default used to be, and Heartbeat(usize) is the new option added in this PR.

The heartbeat option allows the user to specify a maximum number of peers to flood publish to, which are selected at random. Thus if the connection limit is 100 peers, a user may opt to flood publish to only half of them. Instead of publishing the message immediately to all peers, this option will publish a message immediately to its mesh peers (and explicit + fanout) and then wait until the next heartbeat to attempt to publish to the rest. The idea is that we stagger the sending of large data so that the first few messages can be sent before we start sending the rest in parallel.
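
For illustration, a minimal sketch of how publishing might branch on the three variants is below. All type, field, and function names here (Publisher, mesh_peers, topic_peers, pending_flood, send_now) are hypothetical stand-ins, not the PR's actual code:

use std::collections::HashSet;

enum FloodPublish {
    Disabled,
    Heartbeat(usize),
    Rapid,
}

struct Publisher {
    mode: FloodPublish,
    mesh_peers: HashSet<u64>,  // mesh (plus explicit and fanout) peers for the topic
    topic_peers: HashSet<u64>, // every peer subscribed to the topic
    pending_flood: Vec<u64>,   // drained on the next heartbeat
}

impl Publisher {
    fn publish(&mut self, msg: &[u8]) {
        // Mesh (and explicit/fanout) peers always receive the message immediately.
        for peer in &self.mesh_peers {
            send_now(*peer, msg);
        }
        let extras = self.topic_peers.difference(&self.mesh_peers).copied();
        match self.mode {
            // Rapid: the old default, send to all subscribed peers right away.
            FloodPublish::Rapid => extras.for_each(|peer| send_now(peer, msg)),
            // Heartbeat(max): defer the extra peers to the next heartbeat,
            // capped at `max` (0 = all known peers). The PR selects them at
            // random; this sketch just takes the first `max`.
            FloodPublish::Heartbeat(0) => self.pending_flood.extend(extras),
            FloodPublish::Heartbeat(max) => self.pending_flood.extend(extras.take(max)),
            FloodPublish::Disabled => {}
        }
    }
}

fn send_now(_peer: u64, _msg: &[u8]) { /* stub for an immediate network send */ }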

Notes & open questions

I've made some small optimizations whilst changing some logic here. I removed double protobuf encoding of some messages in the publish function.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@mxinden
Member

mxinden commented Apr 4, 2023

I don't know of an easy way to manage back pressure when burst sending large amounts of data in rust-libp2p.

I would suggest a separate channel from the Gossipsub NetworkBehaviour to each ConnectionHandler, same as described in #3710. The ConnectionHandler only reads from the channel in case its outbound stream is in WaitingOutput state. The channel has some reasonable capacity, say 20. The NetworkBehaviour, when publishing, only sends the gossip message to those ConnectionHandlers that have capacity in their channels.

That way we can get rid of the potentially unbounded send_queue buffer, and we make best use of the underlying network resources by sending to those peers that can actually receive the message in a low-latency fashion (i.e. have capacity in their NetworkBehaviour -> ConnectionHandler channel), rather than having messages wait in a potentially large send_queue.
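
A rough sketch of this idea, assuming futures::channel::mpsc for the per-connection channel (the GossipMessage type and handler list are made up for illustration):

use futures::channel::mpsc;

#[derive(Clone)]
struct GossipMessage(Vec<u8>);

// One bounded channel per connection, created alongside the ConnectionHandler:
// let (tx, rx) = mpsc::channel::<GossipMessage>(20);

fn publish_to_ready(handlers: &mut [mpsc::Sender<GossipMessage>], msg: &GossipMessage) {
    for tx in handlers.iter_mut() {
        // try_send fails immediately when the channel is full, so only handlers
        // with spare capacity (fast peers) receive the message; slow peers are
        // skipped instead of growing an unbounded send_queue.
        let _ = tx.try_send(msg.clone());
    }
}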

I think forcing sequential sends deprives us of the benefits of parallelism in sending between many peers which would saturate our bandwidth (a good thing).

Paraphrasing to make sure we are on the same page. I agree that we should send the same gossipsub message to multiple peers in parallel. I don't think we should send a single message to peers in sequence.

My current solution (this PR) is to offload some of these decisions to the user (or at the least give them the ability to modify the behaviour).

I would like to not introduce yet another tunable magic number to the already large Gossipsub configuration surface. Based on intuition I would argue that choosing the right value for FloodPublish::Heartbeat is impossible. With my above suggestion, no magic number (apart from the channel capacity) is needed.

The idea is that we stagger the sending of large data so that the first few messages can be sent before we start sending the rest in parallel.

With my above suggestion we would make sure to first send the message to fast peers, i.e. peers with capacity in the NetworkBehaviour -> ConnectionHandler channel. That would enable those fast peers to then help propagate the message to the slow peers in the network.

I've made some small optimizations whilst changing some logic here. I removed double protobuf encoding of some messages in the publish function.

To ease review and to make sure these optimizations for sure make it into master, would you mind moving them to separate pull requests?


On the meta level, very much appreciate the continued work you are putting into GossipSub!

@AgeManning
Contributor Author

I would suggest a separate channel from the Gossipsub NetworkBehaviour to each ConnectionHandler, same as described in #3710. The ConnectionHandler only reads from the channel in case its outbound stream is in WaitingOutput state. The channel has some reasonable capacity, say 20. The NetworkBehaviour, when publishing, only sends the gossip message to those ConnectionHandlers that have capacity in their channels.

That way we can get rid of the potentially unbounded send_queue buffer, and we make best use of the underlying network resources by sending to those peers that can actually receive the message in a low-latency fashion (i.e. have capacity in their NetworkBehaviour -> ConnectionHandler channel), rather than having messages wait in a potentially large send_queue.

I may need to check my understanding of this. This solution would solve the problem of a potentially unbounded send_queue and resolve latency if these queues are currently large. It would then allow the behaviour to drop messages if we have a very large send queue. If we are forwarding or publishing a message and the queue is full for all our peers, I guess we then drop the message?

I'm not sure if this will help the particular problem I'm trying to solve, however (bursts of large messages). The problem we have is large message sizes. We publish these large messages (blocks) periodically, every 12 seconds. I imagine that our send queue for each peer would be more or less empty. We then need to publish the large message such that it reaches the network as fast as it can. I think we're seeing delays when we send the message to all the connection handlers at once for all our peers (80 or so, due to flood publishing), which as I understand would still happen with the channel you have suggested. Then when the OS tries to send 80x our large message, it takes quite a while to get the first one out to the network. The idea here would be to stagger the send (assuming the send_queue is 0 for all peers) and send just to mesh peers, then optionally some additional ones later.

I would like to not introduce yet another tunable magic number to the already large Gossipsub configuration surface. Based on intuition I would argue that choosing the right value for FloodPublish::Heartbeat is impossible. With my above suggestion, no magic number (apart from the channel capacity) is needed.

I agree. I was trying to keep the config simple, in that you can just use the default, and the extra configuration is not used at all by most users. The extra number I had imagined would be based on peer count. I had intended to use it to send to at most 20 or 30 extra peers even if my peer count was 80-100. In Lighthouse the user is able to adjust their peer count; with the extra config parameter I can decouple the user-configured Lighthouse peer count from the flood publishing in gossipsub for these large messages.

I guess it's difficult, in that there are many ways to tune a gossipsub network, and I wanted to make it as generic as we can for all users. We could simplify the config by fixing some of the parameters, this case included.

With my above suggestion we would make sure to first send the message to fast peers, i.e. peers with capacity in the NetworkBehaviour -> ConnectionHandler channel. That would enable those fast peers to then help propagate the message to the slow peers in the network.

If we could do this, I think this would be ideal. Perhaps I've misunderstood the solution. As I understand it, we classify fast from slow based on the size of the channel. I expect most of the channels to be empty in my particular circumstances. We also have a variety of messages. Some are very small and some are large. Therefore the queue size is not trivially indicative of fast/slow peers.

If we can reasonably identify which peers are fast/slow and which have immediate capacity, then this would be a great approach. But we also need to ensure that when we burst send, we don't immediately saturate our bandwidth and make all our peers slow.

To ease review and to make sure these optimizations for sure make it into master, would you mind moving them to separate pull requests?

Yeah, I'll try to do this more in the future. I just get in there and notice these things and fix them while I'm there. There is another fairly important change I should get in, but it's lurking in an episub branch somewhere.

@mergify

mergify bot commented Apr 14, 2023 (and again on May 1 and May 2, 2023)

This pull request has merge conflicts. Could you please resolve them @AgeManning? 🙏

@mxinden
Member

mxinden commented May 29, 2023

I would suggest a separate channel from the Gossipsub NetworkBehaviour to each ConnectionHandler, same as described in #3710. The ConnectionHandler only reads from the channel in case its outbound stream is in WaitingOutput state. The channel has some reasonable capacity, say 20. The NetworkBehaviour, when publishing, only sends the gossip message to those ConnectionHandlers that have capacity in their channels.
That way we can get rid of the potentially unbounded send_queue buffer, and we make best use of the underlying network resources by sending to those peers that can actually receive the message in a low-latency fashion (i.e. have capacity in their NetworkBehaviour -> ConnectionHandler channel), rather than having messages wait in a potentially large send_queue.

I may need to check my understanding of this. This solution would solve the problem of a potentially unbounded send_queue and resolve latency if these queues are currently large. It would then allow the behaviour to drop messages if we have a very large send queue. If we are forwarding or publishing a message and the queue is full for all our peers, I guess we then drop the message?

In the case of forwarding, a message would be dropped, along the lines of "don't accept work you can't handle".

In the case of publishing a message, one would backpressure to the user, i.e. libp2p_gossipsub::Behaviour::publish would return Poll::Pending.
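
A tiny sketch of that distinction, again assuming a bounded futures::channel::mpsc channel per handler (the signatures are illustrative):

use futures::channel::mpsc;
use std::task::Poll;

// Forwarding: don't accept work you can't handle; a full channel drops the message.
fn forward(tx: &mut mpsc::Sender<Vec<u8>>, msg: Vec<u8>) {
    let _ = tx.try_send(msg);
}

// Publishing: surface backpressure to the caller instead of dropping.
fn poll_publish(tx: &mut mpsc::Sender<Vec<u8>>, msg: Vec<u8>) -> Poll<()> {
    match tx.try_send(msg) {
        Ok(()) => Poll::Ready(()),
        Err(_) => Poll::Pending, // caller retries once capacity frees up
    }
}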

I think we're seeing delays when we send the message to all the connection handlers at once for all our peers (80 or so, due to flood publishing), which as I understand would still happen with the channel you have suggested.

Your understanding of my suggestion is correct. Thanks for expanding. You are right, my suggestion won't solve your immediate problem, namely where the bottleneck is the direct uplink of the local machine.

To help me form a better understanding, can you share some data? E.g. resource graphs of a machine running into this issue (bandwidth, CPU, ...), or the latency of delivery of one message across remote nodes? What order of magnitude are we talking about here? Is this on the order of 100ms?

@AgeManning
Contributor Author

AgeManning commented May 29, 2023

@ackintosh has done some simulations in testground. I've not run them myself but he reports a 30% reduction in latency for given message sizes. (You can see and run them here: sigp/gossipsub-testground#15)

This issue is highly correlated with message size. On Ethereum mainnet, we are publishing 100-200kb block messages. These are seeing delays mostly around 1s, with a tail up to around 4s.
The reason this (along with the work towards episub) is important to us is that there are planned upgrades where we may need to increase the message size. As message sizes increase, these delays become untenable, and we're now actively looking for ways to reduce them and improve the efficiency of message propagation, particularly with larger messages.

This PR seemed like a good low-hanging-fruit with decent gains to quickly throw in before introducing more severe modifications (episub).

@mxinden
Member

mxinden commented May 31, 2023

Thank you for expanding on the eth2 perspective. This is helpful.

This PR seemed like a good low-hanging-fruit with decent gains to quickly throw in before introducing more severe modifications (episub).

Agreed that based on sigp/gossipsub-testground#15 this provides "decent gains". I don't think it is a low-hanging fruit. In my eyes, relative to its impact it adds a lot of complexity to an already very complex machinery.

With my above suggestion we would make sure to first send the message to fast peers, i.e. peers with capacity in the NetworkBehaviour -> ConnectionHandler channel. That would enable those fast peers to then help propagate the message to the slow peers in the network.

If we could do this, I think this would be ideal.

We seem to be in agreement that we should add proper backpressure to libp2p-gossipsub, detached from the problem described here.

I expect most of the channels to be empty in my particular circumstances.

Do you have data on this? In other words, do you know EnabledHandler::send_queue size characteristics in production environments?

Backpressure may well have an impact on the problem this pull request is trying to solve, namely keeping buffers small and thus sending latency low. Beyond always forwarding a new message to all mesh peers, one could e.g. forward to non-mesh peers on a best-effort basis, depending on whether there is space in the channel to their ConnectionHandler, which signals whether the connection/peer is fast or slow. In my eyes that is a simplification over the current delayed-sending approach.
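
A sketch of that best-effort split (illustrative only; the mesh/non-mesh grouping and message type are assumptions, not gossipsub's actual API):

use futures::channel::mpsc;

fn forward_message(
    mesh: &mut [mpsc::Sender<Vec<u8>>],
    non_mesh: &mut [mpsc::Sender<Vec<u8>>],
    msg: &[u8],
) {
    // Mesh peers must always get the message; a real implementation would
    // queue or await capacity here rather than drop on a full channel.
    for tx in mesh.iter_mut() {
        let _ = tx.try_send(msg.to_vec());
    }
    // Non-mesh peers are best effort: a full channel signals a slow peer,
    // so we simply skip it.
    for tx in non_mesh.iter_mut() {
        let _ = tx.try_send(msg.to_vec());
    }
}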

Beyond backpressure, I wonder whether we should solve this problem at the libp2p-gossipsub level or at the Transport level. Do you have metrics on whether rust-libp2p is able to exhaust the uplink of the local node when sending a message? Is the bottleneck in the Transport?

In sigp/gossipsub-testground#15 (comment) I am suggesting to experiment with QUIC. I know that moving eth2 to QUIC is a larger effort. But this will at least give us a baseline of what would be possible, potentially revealing improvements in our TCP stack.

In case you think a synchronous discussion is helpful, I am currently based in Japan, thus our timezones align nicely for a call. Also happy to continue here.

@AgeManning
Contributor Author

I've been meaning to add QUIC support for Lighthouse. There are some complications around doing it, however.

Anyway, yeah it sounds like a call might be useful on this to get your thinking about it.

bors bot pushed a commit to sigp/lighthouse that referenced this pull request Oct 19, 2023
## Issue Addressed

Following the conversation on libp2p/rust-libp2p#3666, the changes introduced in this PR will allow us to gain more insight into whether the bandwidth limitations happen at the transport level, namely whether QUIC helps vs yamux and its [window size limitation](libp2p/rust-yamux#162), or whether the bottleneck is at the gossipsub level.
## Proposed Changes

Introduce new QUIC and TCP bandwidth metric gauges.

cc @mxinden (turned out to be easier, Thomas gave me a hint)

@mergify

mergify bot commented Nov 27, 2023

This pull request has merge conflicts. Could you please resolve them @AgeManning? 🙏

@AgeManning
Contributor Author

Closing this for now. There are potentially other avenues of improving the send times of messages.

@AgeManning closed this Nov 29, 2023
@diegomrsantos
Contributor

Closing this for now. There are potentially other avenues of improving the send times of messages.

@AgeManning could you please share the other avenues being considered?

@AgeManning
Contributor Author

We are experimenting on a fork.
There are some details of some improvements here: sigp/lighthouse#4918 (comment)

Also there was a yamux improvement which should help saturate the bandwidth of the node: #4970

We are also experimenting without flood publishing.

Our goal is to have a very high peer count, and we may make a new version of this PR with partial flood-publishing, where it publishes only to a small subset of peers. So perhaps this issue can be revived later.

@diegomrsantos
Contributor

We are experimenting on a fork.
There are some details of some improvements here: sigp/lighthouse#4918 (comment)

Thanks, I'll take a look.

Also there was a yamux improvement which should help saturate the bandwidth of the node: #4970

This test ackintosh/gossipsub-testground#3 claims there is only a 5% improvement when using QUIC. Do you think it could be different with Yamux?

@diegomrsantos
Contributor

@AgeManning have you considered adding the msgs related to flood publishing to the non-priority queue?

@AgeManning
Contributor Author

@AgeManning have you considered adding the msgs related to flood publishing to the non-priority queue?

This is an interesting idea.

We suspected that our issue was with sending a large message to many peers in parallel, all at once. The idea of staggering was that we get a few messages out into the network before trying to send more to other peers.

I think adding it to the non-priority queue would help (and is a good idea), but because the queues are per-peer, we'd still run into the problem of trying to send all the messages at once.

Our current strategy is just to remove flood-publishing. We are also going to build a v1.2 of gossipsub with an IDONTWANT message to reduce bandwidth, and if further improvements are necessary, we may revisit staggered sending.
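
For illustration, the per-peer split discussed here might look like the following (names are hypothetical, not gossipsub's actual internals):

use std::collections::VecDeque;

struct PeerQueues {
    priority: VecDeque<Vec<u8>>,     // mesh sends and control messages
    non_priority: VecDeque<Vec<u8>>, // flood-publish copies
}

impl PeerQueues {
    // Drain the non-priority queue only when the priority queue is empty, so
    // flood-publish traffic never delays mesh/control traffic. As noted above,
    // the queues are per-peer, so the burst of parallel sends remains.
    fn next_to_send(&mut self) -> Option<Vec<u8>> {
        self.priority
            .pop_front()
            .or_else(|| self.non_priority.pop_front())
    }
}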
