Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Published messages can be dropped silently #217

Open
1 of 5 tasks
raulk opened this issue Oct 18, 2019 · 10 comments
Open
1 of 5 tasks

Published messages can be dropped silently #217

raulk opened this issue Oct 18, 2019 · 10 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@raulk
Copy link
Member

raulk commented Oct 18, 2019

Currently if the outbound message queues for all peers in a mesh/fanout are full, gossipsub drops the message silently, which has already bitten us various times (in stress testing, and during actual usage).

This happens here: https://github.com/libp2p/go-libp2p-pubsub/blob/master/gossipsub.go#L359

Proposal

  • Extend the signature of Publish() so it can take functional options to configure the reliability characteristics we want.

    • WaitUntilQueued(One | All | float64) => blocks the publish until the message has been queued to as many peers as the arg indicates. Can be implemented easily.
    • WaitUntilSent(One | All | float64) => blocks the publish until the message has actually been dispatched to as many peers as the arg indicates. Requires deeper changes, but reliability is higher.
  • Aggregate RPCs waiting in the queue. Messages are associative and can be merged, in theory.

  • Make queue size configurable. It's currently hardcoded to 32, and cannot be changed.

Affected users

Filecoin, ETH2 gossipsub testing.

@raulk
Copy link
Member Author

raulk commented Oct 18, 2019

cc @whyrusleeping

@aschmahmann
Copy link
Contributor

Extend the signature of Publish() so it can take functional options to configure the reliability characteristics we want.

Note this is already available in #184

@hsanjuan
Copy link
Contributor

Make queue size configurable. It's currently hardcoded to 32, and cannot be changed.

woa I noticed this yesterday in a completely unrelated thing. What I saw is that hammering gossipsub results in messages seemingly not sent around (or dropped on receive).

@aschmahmann
Copy link
Contributor

TLDR: This the proposal does not fully solve this problem, but it's not clear it has to. If we want to more fully solve this problem there are more things we can talk through.

@hsanjuan, yep there are some things we can do here but there are some protocol level problems to tackle if we actually wanted to "solve" this problem (see: #197 for more about the problem).

For example, even if WaitUntilSent was implemented such that it waited until the other party confirmed receipt of the data a couple things could still go wrong.

  1. The receiver could crash, leaving the message effectively unsent
  2. The receiver could not have WaitUntilSent activated (or not with the same parameters) meaning that while your peers may receive the data their peers might not.
    • We sometimes have conventions where people associate the same options with the same topics, but it's possible we may want to formalize this in some way. (e.g. topicID = fn(topicName + options))

As was mentioned in libp2p/go-libp2p#721 the current answer to dropped messages is to say "pubsub does not guarantee message delivery, if you want that layer something on top of pubsub", but that if we have enough users asking for guaranteed message delivery we should probably come up with something for them.

I don't think this response is super unreasonable and we already have some solutions (e.g. go-libp2p-pubsub-router) that handle the case of guaranteed message delivery. However, there are some protocol extensions we could look into like sending an ACK to the peer that sent us a message when once we've sent the message out or we failed/are out of peers.

@raulk
Copy link
Member Author

raulk commented Oct 19, 2019

Pubsub indeed does not guarantee message deliverability in terms of distributed behaviour, but in terms of local behaviour, it should absolutely guarantee that a message you intended to publish has actually left your box.

@hsanjuan
Copy link
Contributor

A lot of this problem can be solved by not hardcoding some channels' length to things like 32 (setting way higher, configurable defaults to give some space for bursts), and by complaining loudly when a message is dropped because channels are full. Also, more documentation about the implementation.

@aarshkshah1992
Copy link
Contributor

@raulk @aschmahmann I completely understand what we want to accomplish here. Please can I start working on the WaitUntilQueued & WaitUntilSent options ? Will share some notes/code soon.

@aschmahmann
Copy link
Contributor

@aarshkshah1992 fine by me, not sure if @raulk has any thoughts/concerns here.

@aarshkshah1992
Copy link
Contributor

aarshkshah1992 commented Nov 18, 2019

Adding a thought about the WaitUntilQueuedALL behaviour.

In the current PR, the WaitUntilQueuedALL fails if we drop even a single message in the event loop. I think this is the ideal behaviour we want because in this case, it is hard for the publisher to set a expected count & stop when that target is achieved because the state of the mesh/fanout could change when we are actually writing to the outbound queues in the event loop.

However, this is very tricky to achieve if we don't want the writer/event loop to block or spin up a go-routine to do a blocking write because we could end up a dropping a "failure event" & the publish would wrongly conclude that everything is okay.

So, we might have to trade off this strong behaviour for something lighter if we want to avoid all forms of blocking.

@aschmahmann might have some opinions on this.

@aschmahmann
Copy link
Contributor

aschmahmann commented Dec 2, 2019

@raulk I'm running into some problems with this in go-libp2p-pubsub-router which builds a persistent value store on top of pubsub.

I'd like to solve the "messages can be dropped silently" by having pubsub emit an event that we dropped a message.

Aggregate RPCs waiting in the queue. Messages are associative and can be merged, in theory.

Sounds good, although it'd be really nice if the merge function was configurable instead of just concatenating the messages.

@vyzo this seem like a reasonable plan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

No branches or pull requests

5 participants