Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rendezvous protocol #44

Closed
wants to merge 17 commits into from
Closed

rendezvous protocol #44

wants to merge 17 commits into from

Conversation

vyzo
Copy link
Contributor

@vyzo vyzo commented Mar 12, 2018

Scope:

  • Generalized Peer Discovery
  • Bootstrap
  • Real-time applications that require rendezvous
  • Content routing

@ghost ghost assigned vyzo Mar 12, 2018
@ghost ghost added the in progress label Mar 12, 2018
@vyzo
Copy link
Contributor Author

vyzo commented Mar 12, 2018

summoning @whyrusleeping @lgierth @diasdavid @mkg20001
cc @dryajov

@Stebalien
Copy link
Member

Thanks for writing this up!

The rendezvous protocol provides facilities for real-time peer discovery within application specific namespaces.

This is more a global discovery or something like that. Rendezvous means to meet at an agreed upon time at an agreed upon place. websocket-star-randezvous is a rendezvous protocol because peers connect to each other through the server. As far as I can tell, we'll be using p2p-circuit for that here. While this protocol could be used to facilitate rendezvousing, it's more general than that.

Peers connect to the rendezvous service and register their presence in one or more namespaces.

That sounds awfully centralized. Can we not just use the DHT (plus pubsub)? Really, this service seems like a generalization of content discovery. If possible we should try to merge the two services.


Before going any further into protocol design, we should briefly sketch out some concrete motivating use-cases. Suggestions:

  • A chat room.
  • Content discovery.

@whyrusleeping
Copy link
Contributor

Really, this service seems like a generalization of content discovery.

Right, this is pretty much a generalization of just discovery.

That sounds awfully centralized. Can we not just use the DHT (plus pubsub)?

Ideally, this can be either centralized or decentralized depending on what the specific application wants (thus, decentralized). You could phrase storing something on the DHT as "connect to the DHT and PUT X", I could say that the DHT is awfully centralized because its the only way of doing content routing right now ;)

In any case, I like where this is going. Once this proposal is complete, we could reframe the DHTs content routing in terms of this interface: provide -> register(cid).


Client peer `A` connects to the rendezvous service `R` and registers for namespace
`my-app` with a `REGISTER` message. It subsequently enters rendezvous with
a `RENDEZVOUS` message and waits for `REGISTER`/`UNREGISTER` announcements from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sold on having an explicit Rendezvous thingy. Isnt it technically just the same as polling the Discovery protocol?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a more efficient version of that, as it listens for new peers on top of the initial lookup:

  • DISCOVER returns a list of current peers and stops
  • RENDEZVOUS returns the current peers and then sends deltas (new REGISTER/UNREGISTER).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's pull -vs- push.

@whyrusleeping
Copy link
Contributor

Really, this service seems like a generalization of content discovery.

Right, this is pretty much a generalization of just discovery.

That sounds awfully centralized. Can we not just use the DHT (plus pubsub)?

Ideally, this can be either centralized or decentralized depending on what the specific application wants (thus, decentralized). You could phrase storing something on the DHT as "connect to the DHT and PUT X", I could say that the DHT is awfully centralized because its the only way of doing content routing right now ;)

In any case, I like where this is going. Once this proposal is complete, we could reframe the DHTs content routing in terms of this interface: provide -> register(cid).

@whyrusleeping
Copy link
Contributor

cc @lgierth for review as well

@whyrusleeping
Copy link
Contributor

also @mkg20001 and @diasdavid

@vyzo
Copy link
Contributor Author

vyzo commented Mar 13, 2018

I think I want to add a limit field to DISCOVER so that we can get a delimited view for bootstrap.

@vyzo
Copy link
Contributor Author

vyzo commented Mar 14, 2018

We may want to consider droppping the RENDEZVOUS broadcast functionality, as it complicates the implementation and has daemon scalability implications -- cf @whyrusleeping's concerns.

Copy link
Member

@mkg20001 mkg20001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not completly sold on the server requirement. I think something like this could be done using a special pubsub for discovery without the need of a server.
But I think it's enough for a ws-star replacement.


```protobuf
message Message {
enum MessageType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to merge register and unregister requests? Or dropping the message type completly?
The client could simply check the length of the array for all types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want unregister so that we can depart from a namespace without having to disconnect.
The message type is needed to disambiguate the protocol.

}

message Peer {
optional string id = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use the smaller non-encoded version of the peer-id instead of the b58 string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that's just the data type -- we use string for unencoded addresses in go.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 11, 2018

Revision 2, after discussion with @whyrusleeping and consensus with @dryajov and @mkg20001 for ws-star replacement.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 11, 2018

Summoning @whyrusleeping @Stebalien @mkg20001 @dryajov for a second round of review.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 11, 2018

To summarize the changes in rev2:

  • push discovery has been dropped, as it is complicated to implement and requires potentially expensive state tracking.
  • discovery is a proper request-response interaction, with a DiscoverResponse.
  • namespace is optional in the query, implying discovery of all registered peers.
  • added motivating paragraph and use cases section.

The rendezvous protocol provides facilities for real-time peer
discovery within application specific namespaces. Peers connect to the
rendezvous point and register their presence in one or more
namespaces. Registrations persist until the peer disconnects or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requiring that nodes be connected to the rendezvous in order for their data to be served by it is strange. Maybe don't specify that bit? Maybe values could have a timeout or something similar?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe value lifetime deserves its own section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it has upsides for real time discovery -- just keep the node visible while it's still connected.
But you are right, there are use cases where we just want to register with some TTL and not keep a connection open.

I think we should just add an optional TTL in the REGISTER message -- if it's omitted, keep the registration until the node disconnects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an extra paragraph about the lifetime of the registration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A section is better actually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done; we have an optional TTL in the REGISTER message now.

Another client `D` can discover peers in `my-app` by sending a `DISCOVER` message; the
rendezvous point responds with the list of current peer reigstrations.
```
D -> R: DISCOVER{my-app}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a way of providing a limit of peers and a timestamp of when they connected.

DISCOVER:{[my-app], [LIMIT, SINCE]}, where:

  • LIMIT (integer) - optional, the max peer we're willing to receive (should prevent various attacks as well as allow peers/apps adjusting this depending on env/conditions). If none provided return all peers.
  • SINCE (timestamp) - optional, return peers connected from this time onward, allows returning fresh peers. If none provided, return all peers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of SINCE, as it is useful indeed. But we will have to correlate this with a rendezvous point timestamp, so we should probably include one in the DiscoverResponse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we should! Great idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


message Discover {
optional string ns = 1;
optional int limit = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be theLIMIT I'm proposing in the comment above...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's already there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error: Could not resolve int: JS still complains

@dryajov
Copy link
Member

dryajov commented Apr 12, 2018

I thought it won't hurt to create a companion issue - #47.

}

message PeerInfo {
optional string id = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no crypto-challenge for the id the only way to prove ownership is the SECIO challenge, which is only done for the current peer-id. Therefore this field seems useless to me...
Why not just drop the PeerInfo msg and use repeated addrs on register?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registration is forwarded in the DiscoverResponse (modulo ttl) and it would thus be ambiguous without the peer id.

You also are missing the potential use case of rendezvous points sharing registrations (say in federation of daemons, using pubsub, etc).
More generally, having a peer id allows registrations to be passed around and reconstruct full PeerInfo objects from them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also are missing the potential use case of rendezvous points sharing registrations (say in federation of daemons, using pubsub, etc).

Then the data must be somehow authenticated, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can authenticate the sharing peer instead of the data.
Authenticating the registration itself would require a signature in them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can authenticate the sharing peer instead of the data.

If the data isn't authenticated, wouldn't that allow for replay attacks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but you'd get that too if you were authenticating with a nonce.

I don't think that replay attacks are a big concern in this context, as the sharing peers can establish a trust model with each other.

Note that signatures along are not sufficient to prevent replay attacks in a shared setting either, and trying to add a signature timestamp gets complicated quick.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@mkg20001
Copy link
Member

mkg20001 commented Apr 15, 2018

Shouldn't namespace naming be restricted somehow? Like only [a-zA-z0-9_.-]+ ?
Edit: And maybe also restricted in length?

@vyzo
Copy link
Contributor Author

vyzo commented Apr 15, 2018

Updated to replace timestamps with cookies.

@mkg20001
Copy link
Member

Updated to replace timestamps with cookies.

Done: libp2p/js-libp2p-rendezvous@42dd587

Now I'll start looking into how to build a cluster with that.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 19, 2018

I think we want to remove the "keep registration until disconnect" behaviour when the TTL is missing.
It complicates the daemon and will also make a mess with federation, as we'd have to handle notifications for peer disconnections and have them trigger db/pubsub updates.

How about we use a default TTL when it's omitted?
Also, the daemon should probably have an upper bound for the TTL.

So I propose a 2hr default TTL with an upper bound of 1day.

@mkg20001
Copy link
Member

So I propose a 2hr default TTL with an upper bound of 1day.

I'd go for 1h-24h with default 2h. (I think smaller TTLs do not really make sense)

@whyrusleeping
Copy link
Contributor

👍 on default TTL of a couple hours, and not doing the 'remove on disconnect' thing.

Upper bound is interesting, but i'd be tempted to increase it to 48 or 72 hours. We can always drop it back down if that turns out to be too much, but it may end up enabling some usecases we didnt hadnt already thought about

@vyzo
Copy link
Contributor Author

vyzo commented Apr 20, 2018

Agreed on having a longer upper bound, let's do 72hrs.
I also think we should have an error response for overly long TTLs, instead of silently truncating and blacking out a node that refreshes close to TTL expiration.

Also specify a minimum upper bound of 72hrs and an E_INVALID_TTL status
code.
@vyzo
Copy link
Contributor Author

vyzo commented Apr 20, 2018

Updated for default of 2hrs and minimum upper bound of 72hrs.
I also added an E_INVALID_TTL status code for rejecting overly long registrations.

@Kubuxu
Copy link
Member

Kubuxu commented Apr 20, 2018

We could allow the upper limit of TTL to drop if the server has too many registered clients already but I don't know how I feel about it.

The RegisterResponse might also need to return upper TTL. Otherwise, it is a guessing game.

I would also be for defining a programming interface so in future we can create a drop-in fully decentralized replacement.

@whyrusleeping
Copy link
Contributor

I would also be for defining a programming interface so in future we can create a drop-in fully decentralized replacement

Definitely, I really see this doc as a generic interface to any service that provides this functionality. Whether centralized or decentralized. Some usecases call for a high performance centralized or federated solution, where other usecases call for a resilient decentralized one. The calling conventions should be the same either way.

@Kubuxu
Copy link
Member

Kubuxu commented Apr 20, 2018

Note on the API topic, I think of something similar we as we did for libp2p consensus protocol.

Copy link

@phritz phritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I work with @whyrusleeping on filecoin and since we'll likely be using this protocol he invited me to add comments. Apologies for swooping in, feel free to take my comments or leave them as I am lacking a lot of context for this protocol and libp2p conventions.


```protobuf
message Message {
enum MessageType {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For messages with enums it is easy to to create messages that accidentally say something they don't intend to. The way I've seen this happen is that if an enum field has a default value that carries some semantics (name of a message, or a status) then the reader can't distinguish between a programmer explicitly setting that value versus not setting it at all. Or at least the API I was using a few years ago didn't distinguish between the two by default. Anyway, in order to avoid accidentally having semantics for enum fields the pattern was to have the default enum value always be something like UNSET or NOT_SET so that it was clear when someone explicitly set the value versus not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly see some value in defensive programming, but I am not sure I like polluting the protocol with something that protects against PEBKAC-style programming errors.

It would also be rather inconsistent as I don't think we do this anywhere else.

Copy link
Contributor Author

@vyzo vyzo Apr 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the casual programmer, who is more likely to suffer from such PEBKAC, should never have to touch these protobufs directly.
These should be well abstracted by implementation libraries, and bugs of this type should be quickly weeded out there.

message Register {
optional string ns = 1;
optional PeerInfo peer = 2;
optional int64 ttl = 3; // in seconds
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this were named ttlSec it would be unambiguous and the comment would be unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it (it's ugly -- what is ttlSec? It's ttl in seconds. Is there a ttl not in seconds?), but I also don't feel strongly about it -- so if people would prefer ttlSec as the field, we can change it.

optional bytes cookie = 3;
}

message DiscoverResponse {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might you want a way to signal errors? For example namespace too long or internal server error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.
We briefly discussed this with @mkg20001 and it seemed of limited utility for namespace errors, but internal server errors is probably something we want reported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a ResponseStatus for reporting errors.

}

message RegisterResponse {
optional RegisterStatus status = 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always found an optional error string to be polite. For example if E_INVALID_NAMESPACE you could say what the max length is or what the problem is otherwise. Would save the caller from having to guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's a nicety we can add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as statusText.

c2}
```

If `D` wants to progressively poll for real time discovery, it can use
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me the intended use so perhaps this comment is moot, but: might the progressive poller also want information about unregisters? If they don't get that information then it seems like they'll incrementally build a registration set of non-decreasing size and decreasing usefulness because the list the caller keeps will start to include many stale peers who have unregistered over time. Depending on churn rate the number of stale registrations the caller has could potentially dwarf the number of actually registered peers. There are obvious solutions on the caller side but unclear to a reader like me what's intended here. Might be worth a comment at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reporting the unregistrations in a discovery response is problematic for a couple of reasons:

  • requires the rendezvous point to keep track of unregisters modulo the poller's state, which is expensive and couples too tightly.
  • complicates the processing of the response in the client, as it assumes a state machine that may simply not exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that explicit unregisters are expected to be rare events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another points is that the progressive discovery mechanism is not intended to build a consistent replica of the rendezvou's point internal state, but rather discover new peers since the last discovery.
A rendezvous federation protocol (TBD) would certainly forward unregistrations though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that explicit unregisters are expected to be rare events

I think that explicit unregisters might be useful when a node does a clean shutdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course!

The question is whether it is worth the trouble propagating this in discover queries, or simply let the TTLs expire stale entries in clients.
I really don't like having to track unregistrations in the database though; Unregister should simply signal deletion and no further propagation.

Copy link
Member

@victorb victorb May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could leave it up to the clients to decide when they want to disconnect. If they want to decrease the amount of registrations, they can purge it locally (checking if ttl expired and remove that registration would be one way, bad ping could be another). Seems to simplify by not having the unregister.

new registrations.

So here we consider a new client `E` registering after the first query,
and a subsequent query that discovers just that peer by including the cookie:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like might not be a good idea to require registries to keep cookies forever (and probably not even across restarts) so the language here should probably indicate that what is returned when you pass the cookie is only likely an increment set; it could be the full set if the cookie is too old, unknown, the peer reset, whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registries shouldn't keep state for cookies at all, that's why they are client-side cookies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that a stale cookie could elicit an arbitrary set in response.
I think we should simply return an error if the cookie has become stale, which is something the registry can easily detect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an E_INVALID_COOKIE status code that can fail a progressive discovery and signal a restart.

of query responses, so that large numbers of peer registrations can be
managed.

### Registration Lifetime
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there are several comments referencing "namespace ownership" but that concept does not appear to be present in this document. If it's a part of the protocol or caller contract it's probably worth explicitly mentioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace ownership is beyond the scope of the protocol I think -- it quickly gets into ACLs and that's a rathole I don't want to go down :)

@phritz
Copy link

phritz commented Apr 20, 2018

I guess one more question for my edification: there seems to be concern about load (limiting peers to 200 or 1,000; 20k causing a huge load). Can you help me understand that? Coming into this whole thing super naively I assume the "load" referred to is bandwidth? Why the concern (20k * 100 bytes say is only 16Mbits)?

Also if bandwidth is really a concern I wonder if you might consider having the payload be bytes with an encoding tag, that way you could ship peerids compressed. (The server could pre-compress batches for example). Or compression at a lower layer could be very effective as well.

It's also possible I'm widely missing the point of concern :)

@vyzo
Copy link
Contributor Author

vyzo commented Apr 21, 2018

Why the concern (20k * 100 bytes say is only 16Mbits)?

You'd have to multiply this by 20k to get to the full load -- note that it's not bandwidth so much, but rather memory buffers to fill these responses.

that way you could ship peerids compressed.

They are cryptographic hashes already, can't compress those.

It's also possible I'm widely missing the point of concern :)

The main concern is a node just connecting and doing a query that requires a 16MB response -- and the memory buffer to fill and then read this response. Note that it's ok to have those 16MB transmitted as the result of several threaded queries, as this does not commit any significant resources for the interaction.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 21, 2018

Updates:

  • Generalized response status enum (not just register responses)
  • Added response status to DiscoverResponse so that we can signal errors in discover queries
  • Added statusText to convey error messages in responses.
  • Added E_INVALID_COOKIE and E_INTERNAL_ERROR error status codes.

@vyzo
Copy link
Contributor Author

vyzo commented Apr 24, 2018

Go implementation in libp2p/go-libp2p-rendezvous#1

@victorb
Copy link
Member

victorb commented May 21, 2018

Closing this PR and will open a new one from the same branch. GitHub is having a hard time loading this PR (probably due to the amount of comments?)

@victorb victorb closed this May 21, 2018
@ghost ghost removed the in progress label May 21, 2018
@victorb victorb mentioned this pull request May 21, 2018
@victorb
Copy link
Member

victorb commented May 21, 2018

New PR: #56

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants