Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Designing PubSub #9

Open
1 of 24 tasks
nicola opened this issue Aug 15, 2016 · 7 comments
Open
1 of 24 tasks

Designing PubSub #9

nicola opened this issue Aug 15, 2016 · 7 comments

Comments

@nicola
Copy link
Contributor

nicola commented Aug 15, 2016

Note: This issue summarizes the work done on PubSub by the IPFS community (#1 (8 Aug), Q3 team week, Q3 workshop ipfs-inactive/2016-IPFS-Workshop-Lisbon#17) and a brain dump from some of my personal notes that may not be representative of everyone's view and understanding of the problem space.

Background

  • Developers want to be able to have dynamic content on their applications
  • Peers should be able to subscribe to particular content and when other peers publish the matching query, the subscribers should receive an update
  • ipfs/notes#pubsub Extensive list of pubsub conversations

Literature review

Designing PubSub v1

The aim is to collect use cases and requirements for a minimum viable pubsub v1.

Use cases

  • Build a list of use cases from community and implementors
  • Find a way to collaboratively extract values from use-cases
  • Give deadline to this

Requirements

  • Small networks (thousand of nodes)
  • Survive massive amount of churn in browsers
  • Minimize flood
  • Do not prevent future improvement (hard!)
  • define what are the other properties of this system:
    • Reliable? (every peer receives all the messages and in order)
    • Authenticated? (should peers sign messages)

PubSub layers

Note: The following is my personal view of the problem space (from our conversations and use cases), please chip in/argue by commenting, I will update accordingly

Terminology

  • Publishers: they are the authors of the new content that need to be published
  • Brokers: they can be subscribers, they receive messages from publishers and send them to subscribers
  • Subscribers: simply subscribe to messages
  • Glossary #5 - Add new terms to ipfs/glossary

Problem space

  • What type of content can users subscribe to?
    • Permissioned vs Permissionless pubsub
      • tl;dr Users must subscribe to the content where the pubkeys of the publishers are known. If not, they are basically blindly subscribing to spam.
      • Permissioned:
        • (in theory) subscribers subscribe to a known set of publishers (each publisher is known by a public key)
        • (in practice) subscribers subscribe either to an IPRS or an IPNS (known set of pubkeys), the publishers of the content publishes to the network that route event to the subscribers.
        • (verdict) a lot of work has been done in this direction!
        • note: this does not mean that the numbers of subscribers must be fixed, new users can join/leave a permissioned network
      • Permissionless:
        • (in theory) subscribers subscribe to an identifier (which can also be a query), however, in order to avoid complete spam (since we don't know the pubkeys everyone can publish at that identifiers) we need to add some way to reduce spam, e.g. proof-of-work type of approach
        • (in practice) subscribers could subscribe to a genesis block of a blockchain (say bitcoin) and receive updates from the majority
        • (verdict) this is mostly an exciting gray area, anti-spam is hard and this would just turn pubsub into consensus/atomic broadcast problem

Layers of abstraction

(layer 3) Developers API .pub, .sub and events message, error

This layer is the final interface that should be used by application developers (a-la orbit)

  • Strawman application interface (create an issue on this)
    • Define the properties of this final layer (retrying subscription, ensure order, keeping a state of messages received & so on)
    • Define subscribe signature
      • .sub(id, [set of keys in the network])
      • find if the ID itself could be a pointer to an IPLD that has a list of Pubkeys and a query
      • find if the ID, pubkeys and query can be separate params
    • Define publish signature
      • .pub(id, content)
      • find if the content must be encrypted externally, could/must be signed externally
      • find a way to point to previous updates (authenticated streams) via the content itself or another param
    • Define a .on('message', (content) =>)
      • should the check for authentication of content be done behind the scenes?
    • Define the set of errors that could happen in this
    • Define the need for snapshot - say that one peer has missed an entire year of updates, they can either download each individual update or get a final snapshot (maybe too complex but could be done with IPLD transforms)

(layer 2) Data model

This is layer is the data model that describes a sequence of events

(layer 1 and 0) Pubsub Routing (better name needed)

This consists of peer discovery and event propagation (same as content-distribution).

This is the layer that routes an event from publisher (that has only a partial view of the subscribers - otherwise this would just this broadcast!) to the subscribers. There are different strategies for distributing events. If this layer change, there should be no change in the Data Model or Developers API.

Structured vs Unstructured networks

There are different strategies to do this:

  • structured networks: in this case the peers of the network will self-organize in a very specific structure (this is great for reliable networks where nodes are not continuously disconnecting so that the structure is not constantly altered), the great thing about this is that we can reuse the DHT (see Pastry, PastryStrings)
  • unstructured networks: in this case peers will distribute messages without following a specific structure (these are mostly gossip algorithms). Messages are often received multiple time.

Questions to answer:

  • Define the network requirements
    • Is there a "one-fits all" routing strategy?
    • What are the type of networks we plan to have pubsub?
  • Should the pub/sub strategy be adaptive? So multiple peers talk multiple pubsub (we once referred to this as multipubsub)?

The role of the peers

The author of the content is a publisher, however they have a partial view of the subscribers, so some of those subscribers must be distributing the content themselves (becoming brokers themselves - since the content is authenticated, they could be considered publisher themselves too, they are just publishing someelse content).

  • Agree on terminology of who's who (publisher, broker, subscriber)
  • Are all the subcribers brokers? (meaning every peer participate to the distribution? please keep in mind cheap devices)

Pubsub router interface

This should define a low level interface where subscribers and publishers talk individually to other peers, for example exchanging peers, exchanging interests, basically a generic pubsub protocol. This can be helpful in case we are implementing different pubsub routers.

  • Find if this abstraction is needed
  • Nail down an interface

Testing the network

  • Design and build the harness for pubsub tests #10 - find a way to spin up a simulation with thousand of nodes that subscribe and publish content considering any type of fault (byzantine, network partition, fail-stop, network delay)
  • benchmarks: build tools to measure the efficiency of a pubsub network (if we are going to try different strategies)

Note

  1. I haven't covered many things that we left pending or that I don't remember right now, please chip in - nothing of the above is cast in stone :)
  2. We have to create and link the issues related to most of the TODO items, so that we can have separate conversations outside of this issue

cc @jbenet, @diasdavid, @haadcode, @dignifiedquire

@haadcode
Copy link

Excellent work putting all this together @nicola! Will take time to read this through more carefully.

I might be jumping the gun here, and this might not be the right place to discuss particular details, but couple of things that caught my attention. And I've seen references to this in prior discussion as well as in the use cases: pubsub vs. message queues. I think the destinction needs to be made clear.

Define the need for snapshot - say that one peer has missed an entire year of updates, they can either download each individual update or get a final snapshot (maybe too complex but could be done with IPLD transforms)

That sounds a lot like a log, is that the case? Does the publisher need to know about your "history" to decide which messages to send to you? Does the subscriber need to know about their previous history with a topic?

Define the properties of this final layer (retrying subscription, ensure order, keeping a state of messages received & so on)

ensure order and keeping a state of messages received sound like message queue behaviour. Or am I talking about the wrong layer here? Do we see order and history to be desired properties of the pubsub design?

@nicola what are your thoughts on this? can you clarify and get me up to date?

@jbenet
Copy link

jbenet commented Aug 15, 2016

Do we see order and history to be desired properties of the pubsub design?

They are part of the desired "authenticated streams" pub/sub design. Not the bare-bones byte sequence broadcast. So "maybe", depending on how these layers shake out.

The point that many brought up during the workshop and around is that establishing the authentication invariant makes several pub/sub mechanisms much easier to think about and implement.

nicola added a commit that referenced this issue Aug 16, 2016
@mitar
Copy link

mitar commented Aug 21, 2016

(I am not sure if this is the best place to comment or some other issue, but I looked around and it seems this is the most design-oriented issue.)

I have two things I would like to add here. One is from my experience with Meteor pub/sub, their migration to Apollo stack, and my experience with various data-flow systems I have worked on in the past. I think in general the question with pub/sub is: do you want to send over it only information that something changed and then the other subscriber can fetch it if they want, or are you sending the whole stuff immediately. Lately, I think the first approach is better because it allows you also to ignore some updates or batch them together, while on the second case you cannot really do that because that is already at your place. In some way, one question to ask is about rate limiting control, what if there are too many changes you (or peer on your path) cannot handle. What is backpressure story here?

The other is a use case we are developing. It is blockchain on top of IPFS and current approach is:

  • we have a constant content which all peers know, so they ask IPFS for addresses of peers who host that content (and everyone can host it because content is known, like program name + version or something), and this is how we do discovery of other peers of our blockchain service on top of IPFS
  • we randomly pick a random subset of those peers and instruct swarm to directly connect to them
  • then we currently subscribe to IPNS name of each of those connected-to peers and peers update their entry when they want to publish something, like /ipns/<id>/transactions for list of pending transactions as seen from peer with ID id; we currently simulate subscription by regularly polling the IPNS endpoint for each peer

This last point I think is something where it would be great to have a pub/sub. So that we could say: please inform us of any change to that IPNS entry as it is happening. For us it would be OK that for pub/sub to work we would have to directly connect to that peer who owns the IPNS entry, because we care only about peers who are online at that moment. But it would be great if maybe that would be done automatically by IPFS. So that if you subscribe to some IPNS entry, IPFS automatically tries to optimize that with a direct connection.

@gritzko
Copy link

gritzko commented Aug 23, 2016

Speaking of stack layers. I believe that static data, Merkle structures and pub/sub can be all unified as a stream of IPLD frames.

By the way. From what I understand, IPLD does not have header+content frames. It is either-or. Am I right? Are JSON IPLD frames supposed to reference blob IPLD frames? For example, git objects have both (a header and content).

In case our primitive is a header+content frame, a partially ordered stream of frames is the waist of the stack. Pub/sub is just one use case.

Then, the core protocol is 3 messages (send/sub/unsub). No matter, whether we fetch a static file or subscribe to a topic. An intermediary node may not care at all whether it relays a video file or a real-time document editing session. Hash checks may work slightly differently of course, but the general action sequence is exactly the same:

  1. subscribe
  2. receive
  3. verify
  4. cache?
  5. relay?
  6. unsubscribe.

(where 1-4 repeat in a loop)

A position in a stream can be expressed by one-or-many hashes (cause of partial order).
Alternatively, by an offset. (That gets complicated with partially ordered streams, but it is well-defined anyway.)

Obviously, implementations may differ. DHT, HTTP, anything.

Stream snapshotting is definitely a different layer of the stack, cause it depends on data semantics.

@mitra42
Copy link

mitra42 commented May 29, 2017

Stream snapshotting may be a different layer of the stack, but retrieving past state seems like a crucial part in many applications.
I'm thinking for example of the "Facebook like" example. When a page that implements this opens, it wants to retrieve the current (usually short) list, and then subscribe to receive notifications. of updates.

Note this also brings up another level of scaling to think about - not just 1000's of browsers watching one stream but also very large numbers of small streams.

@MichaelMure
Copy link

MichaelMure commented Jun 20, 2017

Edit: moved to it's own issue: #14

@ghost
Copy link

ghost commented Jun 28, 2018

The latest pubsub work is in libp2p/specs#73 -- gossipsub, the successor to floodsub

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants