# Network Layer API (Core Protocol)

| Status        | Proposed                                            |
|:--------------|:----------------------------------------------------|
| **FLIP #**    | [1306](https://github.com/onflow/flow-go/pull/1306) |
| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com)                |
| **Sponsor**   | Simon Zhu (simon.zhu@dapperlabs.com)                |
| **Updated**   | 9/16/2021                                           |

## Objective

Refactor the networking layer to split it into separate APIs for the public and private networks, allowing us to enforce a strict separation in the code between these two networks.

Enable registering a custom message ID function for the gossip layer.

## Current Implementation

When the network layer receives a message, it passes the message to the [`Engine`](https://github.com/onflow/flow-go/blob/7763000ba5724bb03f522380e513b784b4597d46/network/engine.go) registered on the corresponding channel by [calling the engine's `Process` method](https://github.com/onflow/flow-go/blob/d31fd63eb651ed9faf0f677e9934baef6c4d9792/network/p2p/network.go#L406), passing it the Flow ID of the message sender.
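
For reference, the existing interface looks roughly like this (a paraphrase of the linked `network/engine.go`, showing only the two methods this proposal changes):

```golang
import "github.com/onflow/flow-go/model/flow"

// Rough sketch of the current Engine API: the sender is identified by a
// Flow ID, and there is no Context parameter. Channel is the network
// channel type from the network package.
type Engine interface {
    // Submit queues an event from the given origin for asynchronous processing.
    Submit(channel Channel, originID flow.Identifier, event interface{})

    // Process handles an event from the given origin synchronously.
    Process(channel Channel, originID flow.Identifier, event interface{}) error
}
```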

[`Multicast`](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/conduit.go#L82) is implemented by including a [`TargetIDs`](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/message/message.proto#L12) field inside the message, which is published to a specific topic on the underlying gossip network. Upon receiving a new message on the gossip network, nodes must first [validate](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/validator/targetValiator.go) that they are one of the intended recipients of the message before processing it.
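
A rough sketch of what that validation amounts to (paraphrasing the linked `targetValiator.go`; the exact types and the helper name are assumptions of this sketch):

```golang
import (
    "bytes"

    "github.com/onflow/flow-go/model/flow"
    "github.com/onflow/flow-go/network/message"
)

// isIntendedRecipient reports whether this node is among the message's
// targets; a node drops any gossiped multicast message for which this
// returns false.
func isIntendedRecipient(self flow.Identifier, msg *message.Message) bool {
    for _, target := range msg.TargetIDs {
        if bytes.Equal(target, self[:]) {
            return true
        }
    }
    return false
}
```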

### Potential problems

The current network layer API was designed with the assumption that all messages sent and received either target or originate from staked Flow nodes. This is why an engine's [`Process`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L28) method accepts a Flow ID identifying the message sender, and outgoing messages [must specify Flow ID(s)](https://github.com/onflow/flow-go/blob/master/network/conduit.go#L62) as targets.

This assumption no longer holds today. The access node, for example, may communicate with multiple (unstaked) consensus followers, and it's conceivable that in the future there will be even more cases of communication with unstaked parties (for example, execution nodes talking to the DPS).

Currently, a [`Message`](https://github.com/onflow/flow-go/blob/698c77460bc33d1a8ee8a154f7fe4877bc518a02/network/message/message.proto) sent over the network contains many fields that are unnecessary because the receiver can deduce their values. The only exceptions are the `Payload` field (which contains the actual message data) and the `TargetIDs` field (which is used by `Multicast`).

However, all of the existing calls to `Multicast` target only a very small number of recipients (3, to be exact), which means there is a lot of noise on the network: nodes waste CPU cycles processing messages only to ignore them once they realize they are not among the intended recipients.

## Proposal

We should split the existing network layer API into two distinct APIs / packages for the public and private networks, and the `Engine` API should be modified so that the [`Process`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L28) and [`Submit`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L20) methods receive a `Context` as the first argument:
* Private network

  ```golang
  type Engine interface {
      Submit(ctx context.Context, channel Channel, originID flow.Identifier, event interface{})
      Process(ctx context.Context, channel Channel, originID flow.Identifier, event interface{}) error
  }

  type Conduit interface {
      Publish(event interface{}, targetIDs ...flow.Identifier) error
      Unicast(event interface{}, targetID flow.Identifier) error
      Multicast(event interface{}, num uint, targetIDs ...flow.Identifier) error
  }
  ```
* Public network

  ```golang
  type Engine interface {
      Submit(ctx context.Context, channel Channel, senderPeerID peer.ID, event interface{})
      Process(ctx context.Context, channel Channel, senderPeerID peer.ID, event interface{}) error
  }

  type Conduit interface {
      Publish(event interface{}, targetIDs ...peer.ID) error
      Unicast(event interface{}, targetID peer.ID) error
      Multicast(event interface{}, num uint, targetIDs ...peer.ID) error
  }
  ```
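
As a purely illustrative example, an engine written against the proposed private-network API might look as follows (the engine itself and its behavior are invented for the example; `Channel` and `Conduit` are the types from the snippets above):

```golang
import (
    "context"

    "github.com/onflow/flow-go/model/flow"
)

// echoEngine is a hypothetical engine illustrating the proposed API.
type echoEngine struct {
    conduit Conduit // the private-network Conduit from the snippet above
}

// Submit handles an event asynchronously.
func (e *echoEngine) Submit(ctx context.Context, channel Channel, originID flow.Identifier, event interface{}) {
    go func() {
        if err := e.Process(ctx, channel, originID, event); err != nil {
            // log the error
        }
    }()
}

// Process checks for cancellation by the network layer before doing any
// work, then echoes the event back to its sender.
func (e *echoEngine) Process(ctx context.Context, channel Channel, originID flow.Identifier, event interface{}) error {
    if err := ctx.Err(); err != nil {
        return err // processing was cancelled before we reached this message
    }
    return e.conduit.Unicast(event, originID)
}
```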

Various types of request-scoped data may be included in the `Context` as [values](https://pkg.go.dev/context#WithValue). For example, if a message sent on the public network originates from a staked node, that node's Flow ID may be included as a value. Once engine-side message queues are standardized as described in [FLIP 343](https://github.com/onflow/flow/pull/343), the given `Context` can be placed in the message queue along with the message itself in a wrapper struct:

```golang
type Message struct {
    ctx   context.Context
    event interface{}
}
```
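
For illustration, a staked sender's Flow ID might be attached to and read from the `Context` along these lines (the key type and helper names here are hypothetical):

```golang
import (
    "context"

    "github.com/onflow/flow-go/model/flow"
)

// originIDKey is a private key type for Context values, so that other
// packages cannot collide with it.
type originIDKey struct{}

// WithOriginID returns a child Context carrying a staked sender's Flow ID.
func WithOriginID(ctx context.Context, id flow.Identifier) context.Context {
    return context.WithValue(ctx, originIDKey{}, id)
}

// OriginID extracts the sender's Flow ID from the Context, if one was attached.
func OriginID(ctx context.Context) (flow.Identifier, bool) {
    id, ok := ctx.Value(originIDKey{}).(flow.Identifier)
    return id, ok
}
```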

> While this may seem to break the general rule of not storing `Context`s in structs, storing a `Context` in a struct which is being passed around like a parameter is one of the exceptions to this rule. See [this](https://github.com/golang/go/issues/22602#:~:text=While%20we%27ve%20told,documentation%20and%20examples.) and [this](https://medium.com/@cep21/how-to-correctly-use-context-context-in-go-1-7-8f2c0fafdf39#:~:text=The%20one%20exception%20to%20not%20storing%20a%20context%20is%20when%20you%20need%20to%20put%20it%20in%20a%20struct%20that%20is%20used%20purely%20as%20a%20message%20that%20is%20passed%20across%20a%20channel.%20This%20is%20shown%20in%20the%20example%20below.). The idea is that `Context`s should not be **stored** but should **flow** through the program, which is what they do in this use case.

When a message is dequeued, the engine should check its `Context` to see whether the message has become obsolete before processing it. At this point, we will have two distinct `Context`s in scope:

* The message `Context`
* The `Context` of the goroutine which is dequeuing / processing the message

These can be combined into a [single context](https://github.com/teivah/onecontext) which is then used by the message-processing business logic, so that processing can be cancelled either by the network or by the engine. This will allow us to deprecate [`engine.Unit`](https://github.com/onflow/flow-go/blob/master/engine/unit.go), which uses a single `Context` for the entire engine.
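
A minimal sketch of this merging using the linked `onecontext` library (the `Message` wrapper is the one from above; the engine's own `ctx` field and `process` method are assumptions of this example):

```golang
import (
    "context"

    "github.com/teivah/onecontext"
)

// processMessage merges the message's Context with the engine's Context so
// that cancellation from either the network or the engine aborts processing.
func (e *engine) processMessage(msg Message) error {
    ctx, cancel := onecontext.Merge(msg.ctx, e.ctx)
    defer cancel()

    // Drop messages that are already obsolete.
    if err := ctx.Err(); err != nil {
        return err
    }

    return e.process(ctx, msg.event) // hypothetical business-logic entry point
}
```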

There are certain types of messages (e.g. block proposals) which may transit between the private and public networks via relay nodes (e.g. Access Nodes). Libp2p's [default message ID function](https://github.com/libp2p/go-libp2p-pubsub/blob/0c7092d1f50091ae88407ba93103ac5868da3d0a/pubsub.go#L1040-L1043) will treat a message originating from one network and relayed to the other network by `n` distinct relay nodes as `n` distinct messages, causing unacceptable message duplication / traffic amplification. In order to prevent this, we will need to define a [custom message ID function](https://pkg.go.dev/github.com/libp2p/go-libp2p-pubsub#WithMessageIdFn) which returns the hash of the message [`Payload`](https://github.com/onflow/flow-go/blob/698c77460bc33d1a8ee8a154f7fe4877bc518a02/network/message/message.proto#L13).

To avoid making the message ID function deserialize the `Message` in order to access the `Payload`, we need to remove all other fields from the `Message` protobuf, so that the message ID function can simply take the hash of the pubsub [`Data`](https://github.com/libp2p/go-libp2p-pubsub/blob/0c7092d1f50091ae88407ba93103ac5868da3d0a/pb/rpc.pb.go#L145) field without doing any deserialization.
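
A sketch of what such a message ID function might look like, hashing the raw pubsub `Data` field directly (the choice of SHA-256 here is illustrative; the proposal only specifies a hash of the payload):

```golang
import (
    "crypto/sha256"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    pb "github.com/libp2p/go-libp2p-pubsub/pb"
)

// flowMessageID hashes only the message data, so a message relayed onto the
// other network by several relay nodes still maps to a single pubsub ID.
func flowMessageID(msg *pb.Message) string {
    h := sha256.Sum256(msg.Data)
    return string(h[:])
}

// Registration when constructing the gossip layer, e.g.:
//   ps, err := pubsub.NewGossipSub(ctx, host, pubsub.WithMessageIdFn(flowMessageID))
```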

The `Multicast` implementation will need to be changed to make direct connections to the target peers instead of sending messages with a `TargetIDs` field via gossip.
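
One possible shape for this, sketched under the assumption that `Multicast` becomes a fan-out of `Unicast` calls to a random sample of the targets (`sample` is a hypothetical helper):

```golang
import (
    "fmt"

    "github.com/onflow/flow-go/model/flow"
)

// Multicast unicasts the event to num randomly sampled targets over direct
// connections, replacing the gossip-based TargetIDs mechanism.
func (c *conduit) Multicast(event interface{}, num uint, targetIDs ...flow.Identifier) error {
    for _, id := range sample(num, targetIDs...) { // hypothetical sampling helper
        if err := c.Unicast(event, id); err != nil {
            return fmt.Errorf("could not unicast to %v: %w", id, err)
        }
    }
    return nil
}
```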

### Motivations

- Having a strict separation between the public and private networks provides better safety by preventing unintended passage of messages between the two networks, and makes it easier to implement mechanisms for message prioritization / rate-limiting on staked nodes which participate in both.
- Passing `Context`s gives the network layer the ability to cancel the processing of a network message. This can be leveraged to implement [timeouts](https://pkg.go.dev/context#WithTimeout), but may also be useful in other situations. For example, if the network layer becomes aware that a certain peer has become unreachable, it can cancel the processing of any sync requests from that peer.
- Since existing calls to `Multicast` only target 3 peers, changing the implementation to use direct connections instead of gossip will reduce traffic on the network and make it more efficient (see the review discussion below).
- While `engine.Unit` provides some useful functionality, it also uses the anti-pattern of [storing a `Context` inside a struct](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/unit.go#L117), something which is [specifically advised against](https://pkg.go.dev/context#:~:text=Do%20not%20store%20Contexts%20inside%20a%20struct%20type%3B%20instead%2C%20pass%20a%20Context%20explicitly%20to%20each%20function%20that%20needs%20it.%20The%20Context%20should%20be%20the%20first%20parameter%2C%20typically%20named%20ctx%3A) by [the developers of Go](https://go.dev/blog/context-and-structs#TOC_2.). Here is an [example](https://go.dev/blog/context-and-structs#:~:text=Storing%20context%20in%20structs%20leads%20to%20confusion) illustrating some of the problems with this approach.

**Review discussion (on the `Multicast` change):**

> **@yhassanzadeh13:** Based on benchmarking we have done, using pubsub instead of multicast not only does not degrade performance, but also reduces the goroutine count and memory usage of nodes by at least 15%.

> **@smnzhu:** Not sure what size of network the benchmarking was done with, but the message overhead from using pubsub vs. direct connections scales with the number of nodes on the network, so the larger the network, the more of an issue this will be. Imagine a network with 10000 nodes. Under the current implementation, if one of those nodes sends a multicast targeting three other nodes, all 10000 nodes will need to receive and process that message, and 9997 of them will drop it after it reaches the application-level message validator (note we cannot use libp2p topic validators for this), while only 3 of them actually process it. This seems very unnecessary.
>
> This will need to be dealt with anyway, whether or not we choose to change the multicast implementation. As long as there are any valid scenarios where […]

> **@huitseeker:** @yhassanzadeh13 Are there any artefacts from this benchmark? Where could we read about the data you collected?

> **@yhassanzadeh13:** @huitseeker yes, please find the report and data here; the links in the report will lead you to Grafana dashboards containing the experimentation data.
>
> Your argument is valid @smnzhu, and by the way, the experimentation was in the order of ~150 nodes. When both scalability and security are a concern, maybe we need to refactor the solution with the following problem definition: […]

> **@huitseeker:** @jwinkler2083233 @simonhf The gist of the benchmarks above is that this change would cost us in terms of resources (connections, memory, and goroutines). What do you think of our ability to: […]
>
> More generally, do you have alternative suggestions? @JeffreyDoyle I know you were suggesting QUIC in https://github.com/dapperlabs/flow-go/issues/5708, where QUIC has real multicast delivery. Any direction we should explore there?

> **Reply:** @huitseeker thanks for the notification! I agree with this comment in general for a naive implementation; however, if the implementation focuses on the following aspects, then I believe it is possible to reap the benefits and make it the least expensive option: […]
>
> In this scenario, the number of goroutines stays the same whether there are 1000, 10000, or 1M TCP connections. There is no overhead for establishing or closing a TCP connection during node operation (though there will be stragglers due to network issues and bugs), because all TCP connections are pre-established at startup or before use. If all lower-level data structures (e.g. the input and output buffers that packets are read into) are pre-allocated per TCP connection, then only a fixed bump in RAM usage will be noticed upon node start, with no extra GC necessary and/or dependent upon the number of TCP connections. In a previous job (around 2008) I worked on a TCP socket server daemon which used these techniques, with data structures for up to 1M TCP connections pre-created upon daemon startup, using the Linux kernel's […]
>
> But how to implement such a model in Golang and/or libp2p? This would require more research... There seems to be no technical reason why Golang cannot achieve the same result. However, existing packages may not have been designed that way -- or, if more flexible, used that way. Although Golang is capable of having 1M goroutines, just because it's capable doesn't mean it's a good idea... :-)

## Implementation (TODO)

**Review discussion (on removing fields from the `Message` protobuf):**

> **@yhassanzadeh13:** Does this mean excluding the `OriginID` from the message? If that is the case, I would have a concern about breaking the self-certifying feature of messages.

> **Reply:** Please see my response to that comment.

> **Reply:** A couple of things on top of that comment, besides the fact that I fail to understand any additional guarantee that protocol-level whole-message signing would bring us: […]

> **@huitseeker:** Summary of my discussion with @yhassanzadeh13: `From`, for a valid message, is always equal (modulo format conversion) to `OriginID`. Hence the `OriginID` information is redundant.

> **@yhassanzadeh13:** Thanks for summarizing the discussions @huitseeker; it perfectly addresses the concern I initially explained.