[FLIP] Component API #1167

Merged
merged 26 commits into master from smnzhu/runnable-flip on Oct 9, 2021

Conversation

Contributor

@synzhu synzhu commented Aug 19, 2021

FLIP 1167

TODO

  • Add implementation steps

@synzhu
Contributor Author

synzhu commented Aug 19, 2021

This touches on two somewhat distinct issues, but they are related at a high level so I've included them together for now to give the bigger picture.

They can be separated later on into two FLIPs.

@codecov-commenter

codecov-commenter commented Aug 19, 2021

Codecov Report

Merging #1167 (e70f781) into master (08b9420) will decrease coverage by 0.04%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1167      +/-   ##
==========================================
- Coverage   55.32%   55.27%   -0.05%     
==========================================
  Files         516      516              
  Lines       32181    32181              
==========================================
- Hits        17803    17789      -14     
- Misses      11988    12004      +16     
+ Partials     2390     2388       -2     
Flag Coverage Δ
unittests 55.27% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-7.70%) ⬇️
...ngine/common/synchronization/finalized_snapshot.go 68.75% <0.00%> (-4.17%) ⬇️
admin/command_runner.go 77.58% <0.00%> (-1.73%) ⬇️
engine/collection/synchronization/engine.go 62.90% <0.00%> (-1.08%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


### Potential problems
#### Non-idempotency of `ReadyDoneAware`
The current `ReadyDoneAware` interface is misleading, as by the name one might expect that it is only used to check the state of a component. However, in almost all current implementations the `Ready` method is used to both start the component *and* check when it has started up, and similarly for the `Done` method.
Contributor

method is used to both start the component and check when it has started up

I'm not sure I agree. It's used to start the component and returns a way to await the startup finishing; it's not intended as a check. The check is done by reading the returned channel.
I do agree that the Ready name can be a bit misleading though. Start?

Contributor Author

@synzhu synzhu Aug 19, 2021

I think this poses a few difficulties.

First, if multiple components want to wait for the startup of another to complete (i.e., a common dependency), then with this approach we would need to pass the channel returned from Ready to all of those components. This also means that we cannot construct those components until we've started the dependency, because we cannot get the ready channel until we call Ready to start it. Ideally, we don't want this restriction, because it forces certain steps to run sequentially when they could potentially run in parallel.

Example: A needs to wait for B to be ready before it is ready, but it also needs to start its own subcomponent C and wait for that to be ready as well. With the approach in the above paragraph, we need to wait for B to be started before we can even construct A (which then constructs and starts C). Ideally, we should be able to construct A as soon as B is constructed, so that it can construct and start C. This means the startup code of B and C can run in parallel.

Now, what if these components also want to wait for the dependency's shutdown to complete later on? Then we have no way of doing this, because we only get a done channel when we call Done to shutdown the dependency. But these components have already been created by then, meaning they certainly don't have this done channel, and any mechanism we come up with of injecting it afterwards is probably gonna be ugly.

If instead we have a ReadyDoneAware interface which is idempotent, we can pass this to each of the components and they can use it to wait for the dependency to be started up or fully shut down.
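
For illustration, here is a minimal sketch (not taken from the FLIP; the `A`/`NewA` names are hypothetical) of how an idempotent `ReadyDoneAware` handle for a dependency B could be handed to a dependent at construction time, before B has even been started:

```golang
// ReadyDoneAware only reports state; it never starts or stops anything.
type ReadyDoneAware interface {
	Ready() <-chan struct{} // closed once startup has completed
	Done() <-chan struct{}  // closed once shutdown has completed
}

// A is constructed with B's state handle, so B's startup and the
// construction/startup of A's own subcomponents can proceed in parallel.
type A struct {
	b     ReadyDoneAware
	ready chan struct{}
}

func NewA(b ReadyDoneAware) *A {
	return &A{b: b, ready: make(chan struct{})}
}

func (a *A) start() {
	go func() {
		// ... construct and start A's own subcomponent C here ...
		<-a.b.Ready() // wait for the dependency; this does not (re)start it
		close(a.ready)
	}()
}
```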

Member

I thought about another solution, that might be easier to implement.

Since the goal is to make ReadyDoneAware idempotent, we could achieve this by making unit.Ready and unit.Done idempotent, and each engine could wrap its Ready work with unit.Ready.

How to make unit.Ready idempotent?
We could cache the ready channel. If Ready is called more than once, then the cached ready channel will be returned. This way the second call will return the same ready channel and won't trigger additional initialization work.

In fact, it doesn't make sense to me that a module would initialize more than once, and I don't think such a case exists now.

So what I'm proposing is to just adjust unit.Ready:

func (u *Unit) Ready(checks ...func()) <-chan struct{} {
	// Concurrent calls block on the lock, so only the first caller spawns the
	// goroutine that runs the `checks`; later callers get the same cached
	// channel, which is closed once all checks have finished.
	u.Lock()
	defer u.Unlock()

	if u.ready != nil {
		return u.ready
	}

	ready := make(chan struct{})
	go func() {
		for _, check := range checks {
			check()
		}
		close(ready)
	}()

	u.ready = ready
	return ready
}

In your example, A could be constructed with a B whose Ready has already been called by its owner, and A can call B's Ready again, which will not trigger another initialization but will wait on the previous initialization to finish.

Contributor Author

I think this is one valid solution, and I agree it may be easier to implement given the current state of the codebase. However, I would argue that if I were to design the API from scratch today, I would design it the way described in this FLIP, for all the reasons described in the "Motivations" section. Using contexts to start / stop components is so much more idiomatic in Go, and makes the APIs much more self-explanatory. It also makes the separation of concerns more explicit.

#### Non-idempotency of `ReadyDoneAware`
The current `ReadyDoneAware` interface is misleading, as by the name one might expect that it is only used to check the state of a component. However, in almost all current implementations the `Ready` method is used to both start the component *and* check when it has started up, and similarly for the `Done` method.

This introduces issues of concurrency safety, as most implementations do not properly handle the case where the `Ready` or `Done` methods are called more than once. See [this example](https://github.com/onflow/flow-go/pull/1026).
Contributor

Defensive programming isn't the standard Go approach, and I would argue it's not needed here.
A component should have a clearly defined lifecycle, and the owner should manage that lifecycle.
If there is a need for synchronization, this can be provided externally.

Contributor Author

My point here is more about the semantics of ReadyDoneAware.

You are correct that the owner of a component should manage the lifecycle. The problem here is that we have not separated lifecycle management from state observation, because ReadyDoneAware currently implements both.

While only one owner should be able to start a component, multiple other components may want to wait for it to be ready. This is where it becomes a problem if calling Ready also starts the component each time.

```golang
type Runnable interface {
	// Run will start the component
	Run(context.Context)
}
```
Contributor

This might be a JVM talking, but Runnable and Run seem unrelated to starting a component. Wouldn't simply Start be better? Or Initialise? Also, does this function return immediately or after the component has finished initialisation?
I see no added benefit over using Ready() (with this or another name). The returned channel gives great flexibility, including options to build notification or multiple-wait systems.

Contributor Author

@synzhu synzhu Aug 19, 2021

Sure, I'm not tied to the name Runnable haha 😛.

Also, does this function return immediately or after the component has finished initialisation?

Let me add a better comment for this. It should say "Run will start the component and returns immediately".

I see no added benefit over using Ready() (with this or another name). The returned channel gives great flexibility, including options to build notification or multiple-wait systems.

The benefits are listed below. We are not getting rid of a returned channel! The channel will just not be returned from this particular method, but from the Ready method (below) 🙂

This approach gives us strictly more flexibility than having a single Ready method that is responsible for both starting the component and waiting for startup to complete.
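
As an illustration of this separation, here is a minimal sketch (assumed names; not the FLIP's final API) where the owner holds the full component and is the only caller of Run, while dependents receive only the state-observation view:

```golang
import "context"

type Runnable interface {
	Run(context.Context) // starts the component and returns immediately
}

type ReadyDoneAware interface {
	Ready() <-chan struct{} // closed once startup has completed
	Done() <-chan struct{}  // closed once shutdown has completed
}

type Component interface {
	Runnable
	ReadyDoneAware
}

// The owner starts the component and can wait for startup to complete.
func owner(ctx context.Context, c Component) {
	c.Run(ctx)
	<-c.Ready()
}

// Dependents only get the ReadyDoneAware view: they can wait, but not start or stop.
func dependent(dep ReadyDoneAware) {
	<-dep.Ready()
	// ... proceed once the dependency is up ...
}
```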

Contributor

I can see that one of the reasons this FLIP was proposed was to handle the non-idempotency of ReadyDoneAware. However, it seems the problem is bleeding into a separate domain by introducing Runnable. I don't think the solution fully addresses the problem defined by the FLIP. In my opinion, nothing prevents a Runnable component from being Run several times. By accepting distinct contexts, Run does not exhibit idempotent behavior.

Hypothetically, even introducing the capability of binding the same Run function to several distinct contexts looks less safe than the existing ReadyDoneAware non-idempotency vulnerability, as we still fully delegate to the developer the responsibility to treat Run as non-idempotent. This can cause more serious safety vulnerabilities.

So, while checking the readiness of a component is idempotent in this solution, the way the Run method is defined extends the vulnerability surface.

- If context propagation is done properly, there is no need to worry about any cleanup code in the `Done` method.
- This allows us to separate the capability to check a component's state from the capability to start / stop it. We may want to give multiple other components the capability to check its state, without giving them the capability to start or stop it. Here is an [example](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/common/splitter/network/network.go#L112) of where this would be useful.
- This provides a clearer way of defining ownership of components, and hence may potentially eliminate the need to deal with concurrency-safety altogether. Whoever creates a component should be responsible for starting it, and therefore they should be the only one with access to its `Runnable` interface. If each component only has a single parent that is capable of starting it, then we should never run into concurrency issues.
- This enables us to define components which start up upon creation, and don't need to be explicitly `Run()` at a later point in time. Such components would only implement the `ReadyDoneAware` interface.
Contributor

I'm probably missing something, but if a component starts up immediately and has no startup/shutdown code, why even bother with ReadyDoneAware? It can just be a simple Go object without a special lifecycle, right?
We use .Module() in NodeBuilder, which is also badly named, as it is just a callback; nothing modular there.

Contributor Author

@synzhu synzhu Aug 19, 2021

So this would refer to a component which does have startup / shutdown code, but the startup code isn't explicitly triggered by a call to Run (or Start, which is probably a better name as you suggested).

In other words, it could look like this:

ctx, cancel := context.WithCancel(context.TODO())

// create and start component
component := NewComponent(ctx)

// wait for it to be ready
<-component.Ready()

// stop component
cancel()

// wait for it to be done
<-component.Done()

#### Motivations
- `Context`s are the standard way of doing goroutine lifecycle management in Go, and adhering to standards helps eliminate confusion and ambiguity for anyone interacting with the `flow-go` codebase. This is especially true now that we are beginning to provide APIs and interfaces for third parties to interact with the codebase (e.g. DPS).
- Even to someone unfamiliar with our codebase (but familiar with Go idioms), it is clear how a method signature like `Run(context.Context)` will behave. A method signature like `Ready()` is not so clear.
- If context propagation is done properly, there is no need to worry about any cleanup code in the `Done` method.
Contributor

I understand the idea, but ReadyDoneAware was plugged in because cancelling a context doesn't provide a way for a component to signal that cleanup has completed.
It's especially useful where components rely on being shut down in order, for example Ledger, WAL, and Checkpointers.

Contributor Author

@synzhu synzhu Aug 19, 2021

I understand the idea, but ReadyDoneAware was plugged in because cancelling a context doesn't provide a way for a component to signal that cleanup has completed.

Yes, and this is precisely why I'm not suggesting we get rid of ReadyDoneAware. We should keep it, as a way for components to signal that they have finished startup, or finished shutdown / cleanup. However, it should not also be responsible for actually starting or stopping the component.

@m4ksio
Contributor

m4ksio commented Aug 19, 2021

I have only reviewed the proposed changes regarding ReadyDoneAware. While I agree there are certain drawbacks, I feel that the proposed rework would increase implied complexity by pushing component readiness as a status all components need to be aware of.
There are places that would benefit from the proposed approach, but I think this can be fixed by adding extra checks/propagation on top of the existing system and only applying them where needed.

@synzhu
Contributor Author

synzhu commented Aug 19, 2021

I have only reviewed the proposed changes regarding ReadyDoneAware. While I agree there are certain drawbacks, I feel that the proposed rework would increase implied complexity by pushing component readiness as a status all components need to be aware of.
There are places that would benefit from the proposed approach, but I think this can be fixed by adding extra checks/propagation on top of the existing system and only applying them where needed.

@m4ksio Thanks for the review :)

I've answered all of your comments above and I hope I've addressed them clearly, but happy to discuss more.

I feel that the proposed rework would increase implied complexity by pushing component readiness as a status all components need to be aware of.

This sentence is a little hard to understand for me, can you explain?

@synzhu synzhu marked this pull request as ready for review August 27, 2021 03:46
Contributor

@yhassanzadeh13 yhassanzadeh13 left a comment

Thanks for the write-up. Added a few comments, would be happy to discuss in detail.

```golang
type Runnable interface {
	// Run will start the component
	Run(context.Context)
}
```
Contributor

I can see that one of the reasons this FLIP was proposed was to handle the non-idempotency of ReadyDoneAware. However, it seems the problem is bleeding into a separate domain by introducing Runnable. I don't think the solution fully addresses the problem defined by the FLIP. In my opinion, nothing prevents a Runnable component from being Run several times. By accepting distinct contexts, Run does not exhibit idempotent behavior.

Hypothetically, even introducing the capability of binding the same Run function to several distinct contexts looks less safe than the existing ReadyDoneAware non-idempotency vulnerability, as we still fully delegate to the developer the responsibility to treat Run as non-idempotent. This can cause more serious safety vulnerabilities.

So, while checking the readiness of a component is idempotent in this solution, the way the Run method is defined extends the vulnerability surface.

case <-ctx.Done():
return
default:
// do work...
Contributor

Not sure if this is a reliable way of life-cycle management for Flow. I acknowledge the Golang best practices around contexts that you also elaborated. The problem is that with this approach we keep delegating the life-cycle management of sub-components to themselves, which is in contrast to having the life-cycle managed by the parent component.

For example, assume the // do work part here is a long-running task. Canceling the context wouldn't kill that task until it returns and the subsequent iteration of for starts. However, the parent component called ctx.Cancel() assumes all sub-components have been terminated immediately. Right?

The problem is that a faulty component that hits a deadlock may never get terminated in this approach, and there is no way for the parent component to be aware of it.

Contributor Author

For example, assume the // do work part here is a long-running task. Canceling the context wouldn't kill that task until it returns and the subsequent iteration of for starts. However, the parent component called ctx.Cancel() assumes all sub-components have been terminated immediately. Right?

No, it doesn't assume that all sub-components have been terminated immediately. That's what the Done method is for. You use the channel returned from Done to actually wait for shutdown to complete; cancelling the context just triggers the shutdown.
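
A minimal sketch of this pattern (hypothetical names, not flow-go code): cancellation only requests shutdown, while the returned channel reports when the worker loop has actually exited:

```golang
import "context"

// runWorker starts a worker loop and returns a channel that is closed only
// once the loop has fully exited.
func runWorker(ctx context.Context) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			select {
			case <-ctx.Done():
				return
			default:
				// do work...
			}
		}
	}()
	return done
}

// Parent usage:
//   ctx, cancel := context.WithCancel(context.Background())
//   done := runWorker(ctx)
//   cancel() // triggers shutdown
//   <-done   // waits for shutdown to actually complete
```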

Finally, we should either deprecate or modify the behavior of [`engine.Unit`](https://github.com/onflow/flow-go/blob/master/engine/unit.go).

#### Motivations
- The `OriginID` field really provides no use. The value of this field is set by the sender of the message, so we certainly cannot rely on its correctness. We can validate it by [checking it against the peer ID of the sender](https://github.com/onflow/flow-go/pull/1163), but this is actually unnecessary since the peer ID is enough to tell us who the sender is. We can extract the sender's network key from their peer ID, and if a downstream engine needs to know the Flow ID corresponding to this network key, we can determine this from the protocol state. If we cannot find a corresponding Flow ID, then the sender is not a staked node.
Contributor

Wouldn't it be easier to keep the OriginID as part of the payload, require every message to be signed by its OriginID, and then, on the recipient side, verify the signature on the payload against the verification key of the OriginID?

The advantage of this signature checking is that it reduces the heavy lifting of the proposed cross-referencing between peer ID and flow ID on the recipient side to just a signature verification (which we literally need to do). It also lets the application layer run under the assumption that incoming messages have already been checked for authenticity and integrity, and reduces the redundancy that decoupling authenticity checking (at the network layer) from integrity checking (at the application layer) introduces in the proposed solution.

Moreover, we keep an incoming message a self-contained data structure by including its OriginID. To account for unstaked senders, we can make OriginID a *flow.Identifier, and let unstaked senders emit a message with a nil value for OriginID, which is a compliant representation as we won't assume a flow.Identifier for unstaked nodes.

P.S.: I think most of our message types are literally meant to have a valid signature. So the only networking-level compromise we make is to send an extra 32 bytes of OriginID, while in return we get more coherent authenticity and integrity checking for our messages.

Contributor Author

The advantage of this signature checking is that it reduces the heavy lifting of the proposed cross-referencing between peer ID and flow ID on the recipient side to just a signature verification (which we literally need to do). It also lets the application layer run under the assumption that incoming messages have already been checked for authenticity and integrity, and reduces the redundancy that decoupling authenticity checking (at the network layer) from integrity checking (at the application layer) introduces in the proposed solution.

I don't think the cross-referencing between peer and flow ID is actually a lot of work; in fact, with a hashmap it would just be an (approximately) constant-time lookup. LibP2P already does signature verification, so I'm not sure why we would want to do it again at the application level, as this seems redundant. I don't think the proposed approach introduces more redundancy, because authenticity and integrity checking are completely separate, so the two stages are doing two separate but necessary things. We would be introducing redundancy by checking signatures at the application layer, as this combines authenticity and integrity checking into a single step, but authentication is already done by libp2p.

Contributor Author

Moreover, we keep an incoming message a self-contained data structure by including its OriginID. To account for unstaked senders, we can make OriginID a *flow.Identifier, and let unstaked senders emit a message with a nil value for OriginID, which is a compliant representation as we won't assume a flow.Identifier for unstaked nodes.

We can do this, but how will a receiver of a message send a response to an unstaked node with only a nil origin ID?

Contributor

LibP2P already does signature verification, so I'm not sure why we would want to do it again at the application level, as this seems redundant

This application-layer signature verification is not redundant:

When a peer sends message M, it signs it with its networking key, and this signature is verified at any intermediary pubsub node as well as at the receiver. This signature verification is done at the libp2p level against the networking key of the sender. The aim of the signature verification provided by libp2p is solely to ensure integrity of the end-to-end delivery of the message.

The reason we can't rely only on the libp2p signature verification is that it does not account for the authenticity of the message: Alice receiving a message from Bob can ensure the integrity of the message sent by Bob, but not its authenticity, i.e., whether it was actually created by Bob. Moving down the BFT road and relying only on signature verification at the networking layer, messages fail to be attributable to their creator and lack non-repudiation. Going back to our example, when Alice receives message M from Bob, she cannot verify whether Bob created that message, or Eve, or anyone else. In other words, the origin of the message remains unclear from the cryptographic point of view.

That is why we also have a protocol-level key (i.e., the staking key), and at the protocol level messages are supposed to be signed by the staking key of their creator. In this way, no matter who Alice receives message M from, she can always check it against the public staking key of the creator to ensure it originated from that creator. This later on enables Alice to submit message M as evidence of a malicious behavior originating from Eve (i.e., slashing), and anyone can verify the correctness of Alice's claim by (1) checking whether the content of M represents a misbehavior, and (2) checking whether M includes a signature on its content that is verifiable against Eve's public staking key. This is just an example of a vital protocol feature of Flow that relying solely on signature verification at the libp2p level fails to provide. Another example is the general provider scenario, where we aim for nodes to be able to cache some entities they receive (e.g., ChunkDataPack) and provide them to other nodes requesting them on the same topic. In this case, a verification node receiving a ChunkDataPack from another verification node needs to make sure that this ChunkDataPack has been created by a staked execution node, and this assurance is provided by checking the content of the ChunkDataPack for a signature on its payload that is valid against the public key of its origin ID, i.e., the execution node that created that ChunkDataPack.

Now, the reason including the OriginID in message M is crucial is that it enables any node receiving M to look up the public staking key of M's origin from the protocol state and check the signature of the message against the OriginID's public key.
Including the OriginID together with the signature in the messages makes them self-certifying, i.e., the message contains all the forensic information needed for any party to verify it later on. Note that the libp2p signature and protocol-level signatures are different. The latter is part of the protocol-level message.

I don't think the cross-referencing between peer and flow ID is actually a lot of work; in fact, with a hashmap it would just be an (approximately) constant-time lookup.

Looking up from a hash table is a constant-time operation with respect to the number of entries in the hash table. However, it is not necessarily cheap, as it includes a hash operation which can be computation-heavy. I admit that we have hundreds of places where we do hashmap lookups, and I am not against adding such a lookup at the networking layer. But as we go BFT, and especially on the unstaked network, we need to be careful about adding extra operations at the networking layer. Such operations can be exploited by adversarial nodes to perform DDoS attacks, i.e., floods of messages that cause additional hashmap lookups. To the best of my knowledge, our security approach at the networking layer is more proactive than reactive, especially when it comes to availability-based attacks such as flooding. So the more we can spare the networking layer from extra bookkeeping, the more secure it is against such DDoS attacks. Of course, DDoS prevention and rate limiting are possible, but they have their own overheads as well.

Contributor Author

The reason we can't rely only on the libp2p signature verification is that it does not account for the authenticity of the message: Alice receiving a message from Bob can ensure the integrity of the message sent by Bob, but not its authenticity, i.e., whether it was actually created by Bob. Moving down the BFT road and relying only on signature verification at the networking layer, messages fail to be attributable to their creator and lack non-repudiation. Going back to our example, when Alice receives message M from Bob, she cannot verify whether Bob created that message, or Eve, or anyone else. In other words, the origin of the message remains unclear from the cryptographic point of view.

Actually, the cool thing is that libp2p verifies both :)

ReceivedFrom is for integrity, and From is for authenticity.

In your example, ReceivedFrom would be Bob, and From would be the origin of the message. The entire message envelope is signed by the origin and the signature is included along with the message, and so libp2p is able to verify both of these :)
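
For concreteness, a minimal sketch (not flow-go code; the import paths and validator wiring are assumptions) of where the two fields show up in a go-libp2p-pubsub message:

```golang
import (
	"context"

	"github.com/libp2p/go-libp2p-core/peer"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// validate illustrates the distinction discussed above: ReceivedFrom is the peer
// that delivered the message on this hop, while GetFrom() is the original
// publisher whose envelope signature libp2p has already verified.
func validate(ctx context.Context, from peer.ID, msg *pubsub.Message) bool {
	bearer := msg.ReceivedFrom // e.g. Bob, who relayed the message
	origin := msg.GetFrom()    // e.g. the node that created and signed it
	// ... check that both bearer and origin map to authorized Flow identities ...
	_, _ = bearer, origin
	return true
}
```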

Contributor

"Verifies both" is a bit of a simplification. For the nitty-gritty:

  • libp2p as configured in the Flow codebase today requires and checks a message signature from the source of a message (the From) at every carrier of the message,
  • libp2p also has transport encryption and authentication on every direct p2p link, using a Noise key exchange with an XX pattern, which guarantees that the ReceivedFrom field is the PeerID of the actor on the other end of the direct link (i.e. the bearer of the message upon delivery).
  • Checking that this ReceivedFrom is the Flow node we expect on the other end of the line, or a fortiori a valid Flow node, is left as an exercise to the out-of-libp2p implementer. There are places where we do this upon link establishment (e.g. subscription validator, readConnection & its callback).
  • this last part (bearer is as-expected) could be done more systematically (and cheaply) by allowing libp2p to be configured with different key exchange patterns (a KK comes to mind)

But overall this is correct:

  • libp2p as configured today enforces non-repudiation for all messages, because it forces whole-message signing at the source
  • as long as the correspondence between staking key and networking key is one-to-one and non-repudiable (i.e. enforced by the protocol), there is no doubt as to the origin of a message, and that network signature can (and in the short-to-middle term future, will) be exhibited as slashable evidence.

@synzhu
Contributor Author

synzhu commented Aug 30, 2021

I can see that one of the reasons this FLIP was proposed was to handle the non-idempotency of ReadyDoneAware. However, it seems the problem is bleeding into a separate domain by introducing Runnable. I don't think the solution fully addresses the problem defined by the FLIP. In my opinion, nothing prevents a Runnable component from being Run several times. By accepting distinct contexts, Run does not exhibit idempotent behavior.

Hypothetically, even introducing the capability of binding the same Run function to several distinct contexts looks less safe than the existing ReadyDoneAware non-idempotency vulnerability, as we still fully delegate to the developer the responsibility to treat Run as non-idempotent. This can cause more serious safety vulnerabilities.

So, while checking the readiness of a component is idempotent in this solution, the way the Run method is defined extends the vulnerability surface.

@yhassanzadeh13 If we really want to make Run idempotent, we can easily do so using something like sync.Once. Then both Run and readiness checking will be idempotent.

However, as @m4ksio and @zhangchiqing have both noted, defensive programming isn't the standard Go approach and is not actually needed in most cases.

The reason why non-idempotency was an issue in the first place is that there was no separation between the readiness-checking capability and the starting capability for a component. Readiness checking is the one that actually needs to be idempotent, because this capability must be available to multiple other components and readiness checking may be done multiple times. However, starting a component doesn't necessarily need to be idempotent, because a component only needs to be started once. Because these two currently aren't separated, we have to make them either idempotent or non-idempotent together. By separating them, we can make the necessary one idempotent (readiness checking), while avoiding the complexity of introducing idempotency for the one where it's not necessary (starting the component).
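
For reference, a minimal sketch (hypothetical type, not the FLIP's final design) of making the start operation idempotent with sync.Once, as mentioned above:

```golang
import (
	"context"
	"sync"
)

type component struct {
	startOnce sync.Once
	ready     chan struct{}
}

func newComponent() *component {
	return &component{ready: make(chan struct{})}
}

// Start is idempotent: only the first call runs the startup logic.
func (c *component) Start(ctx context.Context) {
	c.startOnce.Do(func() {
		go func() {
			// ... startup work, wired to ctx for cancellation ...
			close(c.ready)
		}()
	})
}

// Ready is idempotent by construction: it only observes state.
func (c *component) Ready() <-chan struct{} {
	return c.ready
}
```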

@AlexHentschel
Member

Summary of discussion

Alex & Simon; Sept 10, 2021

  • Networking layer must provide strict admission control for staked nodes

    • when some node attempts to open a connection, the networking layer must check
      that this is an authorized connection and refuse it otherwise
    • this must happen at the point of the handshake
    • only messages from authorized peers should be decoded (!)
      Otherwise this is a potential attack vector
  • For gossip network: messages should be de-duplicated based on payload alone

    • message origin should not factor into whether a message is considered a duplicate or not
    • for example, consider a block proposal from Alice. If Bob re-broadcasts the proposal, we need to consider it a duplicate. Otherwise, 1/3 of the nodes being byzantine (about 40 consensus nodes) could increase message processing overhead 40x by just re-broadcasting all blocks. (A sketch of payload-based de-duplication follows at the end of this summary.)
    • this also allows us to properly separate networks between staked nodes and unstaked nodes, without having to worry about repetition of the same message (e.g block) broadcasted by multiple access nodes
  • There should be a strong separation between network of staked nodes and network for unstaked nodes

    • Minimum separation on an engine level: Engines should deal with messages from either staked or unstaked nodes, but not both. Processing both in the same engine is an invitation for liveness attacks, because messages from the unstaked nodes can then starve liveness-critical messages from staked nodes of processing resources.
    • better: complete process separation (different OS processes for processing messages from staked vs unstaked nodes)
    • ideally (eventual goal): hardware separation

    In all cases, we only require one identifier:
    flowID (for staked nodes) or network-derived peerID (unstaked nodes)

  • For messages from staked nodes, we would like an Inspector object:

    • the networking layer for the staked network must guarantee message integrity and authenticity
    • if the recipient finds that something is wrong with the message, we can always slash the message origin
    • proof that the origin sent this message would be generated on demand
    • could be an extension of context:
      • provide proof that the peer really sent the message
      • we can also include the peer.ID of the sender in this inspector object
      • it makes it possible for requests from staked nodes to be canceled

For API safety, it would probably make sense to have different APIs for the staked and unstaked network. Thereby, the API reflects that messages from both networks should not be mixed.
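
As referenced above, a minimal sketch (hypothetical, not flow-go code) of de-duplicating gossip messages based on payload alone, so that a re-broadcast by a different sender is still recognized as a duplicate:

```golang
import (
	"crypto/sha256"
	"sync"
)

type deduplicator struct {
	mu   sync.Mutex
	seen map[[32]byte]struct{}
}

func newDeduplicator() *deduplicator {
	return &deduplicator{seen: make(map[[32]byte]struct{})}
}

// isDuplicate keys only on the payload, so the origin of the message does not
// factor into whether it is considered a duplicate.
func (d *deduplicator) isDuplicate(payload []byte) bool {
	key := sha256.Sum256(payload)
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[key]; ok {
		return true
	}
	d.seen[key] = struct{}{}
	return false
}
```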

@synzhu
Contributor Author

synzhu commented Sep 13, 2021

Once we have message queues implemented on all engines according to onflow/flow#343, we should be able to combine the context passed into engine.Process with the context passed into any threads processing the queue, using something like https://github.com/teivah/onecontext
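
For illustration, a hand-rolled sketch of the underlying idea (this is not the onecontext API): the merged context is cancelled as soon as either parent context is cancelled.

```golang
import "context"

// mergeCancel returns a context that is cancelled when either a or b is cancelled.
// Illustrative only; a library like onecontext also merges values and deadlines.
func mergeCancel(a, b context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(a) // a's cancellation propagates automatically
	go func() {
		select {
		case <-b.Done():
			cancel() // propagate b's cancellation
		case <-ctx.Done():
			// merged context finished; stop watching to avoid a goroutine leak
		}
	}()
	return ctx, cancel
}
```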

Contributor

@huitseeker huitseeker left a comment

Requesting changes to keep track of this as needing updates from #1275 (and possibly #1355 too).

@synzhu synzhu requested a review from huitseeker October 4, 2021 05:40
Contributor

@huitseeker huitseeker left a comment

Thanks a bunch, this is definitely going in the right direction.

@synzhu synzhu requested a review from huitseeker October 6, 2021 19:15
Contributor

@huitseeker huitseeker left a comment

Excellent! Thanks a lot!

s.errChan <- err
close(s.errChan)
} else {
// Another thread, possibly from the same component, has already thrown
Contributor

Nit: indentation

Contributor

@peterargue peterargue left a comment

thanks @smnzhu this looks great

@synzhu
Contributor Author

synzhu commented Oct 8, 2021

bors merge

bors bot added a commit that referenced this pull request Oct 8, 2021
1167: [FLIP] Component API r=smnzhu a=smnzhu

[FLIP 1167](https://github.com/onflow/flow-go/blob/smnzhu/runnable-flip/flips/component-interface.md)

### TODO
- [ ] Add implementation steps

Co-authored-by: Simon Zhu <simon.zsiyan@gmail.com>
@bors
Contributor

bors bot commented Oct 8, 2021

Build failed:

@synzhu
Contributor Author

synzhu commented Oct 9, 2021

bors retry

bors bot added a commit that referenced this pull request Oct 9, 2021
1167: [FLIP] Component API r=smnzhu a=smnzhu

[FLIP 1167](https://github.com/onflow/flow-go/blob/smnzhu/runnable-flip/flips/component-interface.md)

### TODO
- [ ] Add implementation steps

Co-authored-by: Simon Zhu <simon.zsiyan@gmail.com>
@bors
Contributor

bors bot commented Oct 9, 2021

Canceled.

@synzhu
Contributor Author

synzhu commented Oct 9, 2021

bors merge

@bors
Contributor

bors bot commented Oct 9, 2021

@bors bors bot merged commit cdd3ee6 into master Oct 9, 2021
@bors bors bot deleted the smnzhu/runnable-flip branch October 9, 2021 02:30
bors bot added a commit that referenced this pull request Oct 12, 2021
1355: [Network] Middleware component r=smnzhu a=smnzhu

* Refactor middleware to implement the `Component` interface.
* Introduces new `ComponentManager` struct to help implement `Component` interface
* Various refactoring in network layer and scaffold to enable the changes above.

### TODO

- [x] As mentioned in #1167 (comment), we should probably explicitly throw an error when `Start` is called multiple times, instead of simply ignoring subsequent calls
- [x] Update the godoc for Startable to reflect this
- [x] Add tests for ComponentManager

Co-authored-by: Simon Zhu <simon.zsiyan@gmail.com>
bors bot added a commit that referenced this pull request Oct 12, 2021
1355: [Network] Middleware component r=smnzhu a=smnzhu

* Refactor middleware to implement the `Component` interface.
* Introduces new `ComponentManager` struct to help implement `Component` interface
* Various refactoring in network layer and scaffold to enable the changes above.

### TODO

- [x] As mentioned in #1167 (comment), we should probably explicitly throw an error when `Start` is called multiple times, instead of simply ignoring subsequent calls
- [x] Update the godoc for Startable to reflect this
- [x] Add tests for ComponentManager

Co-authored-by: Simon Zhu <simon.zsiyan@gmail.com>